
A system administrator's job is always caught between two fires -or, more likely, more than two-; one is the existing infrastructure, and the others are every other stakeholder in that infrastructure. The worst stakeholders are users, especially those who will burn you down without knowing what all is involved, and without any respect for, or perspective on, the position you are in as a sysadmin, juggling more requirements on the infrastructure than just their personal needs. In the end though, ironically, they are also the reason why you are there in the first place. Noted, most of my users are sysadmins themselves, which makes any kind of disposition all the more awkward, and means any kind of decision gets evaluated a lot differently by many more people. In that regard, being a sysadmin in a company full of sysadmins is similar to participating in the Free Software world, if you will.
It came to that point. That point where a decision needed to be made on whether to scale up, flesh out and divide in order to conquer, or accept the downtime as a consequence of an incidental shortage of resources. That time will come in most environments, and you can only prepare for it. Hence, I'm going to list a few tips on things to do, and things not to do, in order for you to -hopefully- be a little better prepared when you're in a similar position ;-)
First, I'll sketch an overview of the environment in this case.
There's this web-server and it doesn't consume a lot of resources most of the time, and you feel confident it's going to make it through this week unharmed. It does Zarafa Webaccess, and ActiveSync through Z-Push. Both of these are heavily used by many users -think in terms of hundreds. It's a mission-critical environment to many, since groupware includes their email and calendaring, and however pitiful, this apparently still is the primary way for my colleagues to communicate.
Then, on Tuesday, during office hours, there is a peak in resource usage, and many users start complaining about dramatically decreased performance. Fair enough, it seems.
This web-server just so happens to be on the back-end Zarafa groupware server. Infrastructural architecture aside, it was juggling its available resources between the back-end and the front-end server. As a result, not only Zarafa Webaccess and ActiveSync (Z-Push) users were impacted, but our IMAP and Outlook users were as well -and it's the Outlook users (Account Managers, Senior Consultants, ...) that we think are truly mission-critical.
This is an incident, and it is, or so it seems, isolated. Incidents though, if I remember correctly, become problems when they occur two or more times. The situation I'm describing happened to us on Tuesday, and had happened to us just once before. That previous occurrence was months ago, and we've been running this environment for about 3 or 4 years. It just so happened to be on the day of the release of the iPhone 3GS -a "funny" coincidence, probably. Either way, incidents, I think, have some retention period to them. It's not like it's a let i++ kinda for loop that goes on and on and on, if you know what I mean.
Either way, I fixed it right on the spot, once more, by freeing up some resources. The incident did cause me to think about a different, more robust architecture though, but I wasn't in any hurry to implement such a different architecture. I did some thinking, some research, I made some preparations, but I wasn't going to implement them quite yet.
Wednesday, everything is fine. Thursday, everything is fine. Friday, the very same situation happens again. Now, to me, that's a second strike, if not a third, and I'm going to take on this problem as a special project and make sure it is dealt with accordingly, my perspective being long-term sustainability and all that.
So, after a little study as to the exact cause, the proposed changes are:
A lot of thought, again, went into long-term sustainability. For one, I want to put front-end, public-facing processing on a node positioned in the perimeter network, as much as I can. I'd rather have my perimeter network node go down than ever have the back-end server go down, if you will, by whatever means.
And so I did make those changes, as part of an emergency change, by the time the second occurrence of this incident was about to bring the entire groupware environment to a grinding halt, that very same Friday afternoon -the worst time to implement any kind of change.
It took me all of about 5 seconds to merge and push it through our configuration management -with Puppet. Since no DNS entries had changed yet, and no firewalls were forwarding to anything different yet, this change was non-intrusive. So far, the changes I had made were basically an extra VirtualHost for an existing reverse proxy, and a new web-server.
The intrusive part of this emergency change, the part that makes me do this blog post, is still to come. Either way though, the fact that it was a non-intrusive change so far enabled me to test the implementation without a staging environment. Note that in the environment I work with right now, we only turn to a staging environment for god-awful intrusive changes (Zarafa upgrades, LDAP migrations, Disaster Recovery tests, Restores from Backup, etc.), since we have the top Senior experts on any technology on our payroll.
Anyway, as a result, I now had everything in place to flip the proverbial switches, without anyone noticing anything. Everything was tested (/etc/hosts FTW!), and so I switched the internal DNS entry for our Webaccess and ActiveSync over to the new web-server.
I shouldn't have.
I made a big, big mistake.
I assumed -which in itself very accurately describes the fsckup- that a host name of webmail.domain.tld was used for *webmail* only, and not, say, FTP sites or even just Outlook clients.
The new, internal web-server had no business reverse proxying Outlook clients to the back-end mail-server, yet more and more Outlook clients configured to use server webmail.domain.tld started attempting to connect to the new web-server.
The reverse proxy for external clients did have business reverse proxying such connections, but that is not a back-end server in any way. The back-end web-server for webmail.domain.tld wasn't configured to do any kind of reverse proxying for Outlook clients, and justifiably so. It had been configured to perform the tasks that involve *webmail*, and those alone.
Luckily, I had the configuration in place on that reverse proxy I mentioned. Copy, paste, commit, push, pull, apply, run, done. No matter what the IP address for webmail.domain.tld was internally, one would either end up with the back-end mail-server, or the new (back-end) web-server that, for the time being, also reverse proxied the Outlook connection to the back-end mail-server. That too was solved pretty quickly, but for the users to understand why or how exactly they had no email or calendaring for 5 seconds... different story.
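One way to check an assumption like "this host name is used for webmail only" before switching DNS would be to look at what actually connects to it. The following is a minimal sketch, not what was done at the time: it assumes an access log in Apache's combined format for the host name in question, and the function name tally_clients is made up for illustration. It counts requests per first path segment and user agent, so that traffic that is not a browser hitting the webmail pages (such as clients talking to /zarafa) stands out.

```python
import re
from collections import Counter

# Apache "combined" log format:
# %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i"
COMBINED = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"$'
)


def tally_clients(lines):
    """Count requests per (first path segment, user agent).

    Hypothetical helper: lines that do not look like combined-format
    entries are silently skipped.
    """
    counts = Counter()
    for line in lines:
        match = COMBINED.match(line.strip())
        if match is None:
            continue
        # Drop the query string, then keep only the first path segment:
        # "/zarafa/foo?x=1" -> "/zarafa", "/" -> "/"
        path = match.group("path").split("?", 1)[0]
        segment = "/" + path.lstrip("/").split("/", 1)[0]
        counts[(segment, match.group("agent"))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python tally_clients.py < access_log
    for (segment, agent), hits in tally_clients(sys.stdin).most_common(20):
        print(f"{hits:8d}  {segment:20s}  {agent}")
```

Anything in the top of that list that is not a web browser is a client the new web-server would have to serve, or reverse proxy, as well.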
Users will argue "it used to work", while also claiming not to understand why "it doesn't work right now". Well, you know, sometimes changes are necessary in order for anything to continue to work. Sometimes the implementation of such changes impacts you as a user, but only serves other users. Sometimes, hopefully even less frequently, those other users justify the change being an emergency change, hitting you right in the face during production hours. I'm sorry, you were saying?
Then, and this is especially the case with me, I require a feedback cycle. I myself am neither an Outlook nor an ActiveSync user, so I need someone else to tell me what their needs and expectations are. Not in terms of "it doesn't work anymore", but accompanied by the details that allow me to pin-point the exact cause and then solve it.
Some of these causes can be found in arbitrary ActiveSync software not accepting a wildcard certificate as valid, even though it is in fact valid for the rest of the world. This kind of software checks whether the certificate CN is exactly equal to the fully qualified domain name of whatever you're hitting, but won't expand the wildcard character. That level of detail is beyond many sysadmins, let alone users. While neither party is at fault per se, users do look in the direction of a sysadmin "to solve the problem" because "it doesn't work".
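To make the difference concrete, here is a simplified sketch of the two behaviours. It is an illustration only: the function names are made up, real certificate validation involves far more than this name comparison, and I'm not claiming this is how any particular ActiveSync client is implemented.

```python
def strict_cn_match(cert_name: str, hostname: str) -> bool:
    """The picky behaviour: the certificate name must literally equal
    the host name (case-insensitively), so a '*' never matches anything."""
    return cert_name.lower() == hostname.lower()


def wildcard_match(cert_name: str, hostname: str) -> bool:
    """Simplified wildcard behaviour: a '*' as the leftmost label
    stands in for exactly one label of the host name."""
    cert_labels = cert_name.lower().split(".")
    host_labels = hostname.lower().split(".")
    if len(cert_labels) != len(host_labels):
        return False
    if cert_labels[0] == "*":
        # Compare everything to the right of the wildcard label.
        return cert_labels[1:] == host_labels[1:]
    return cert_labels == host_labels


if __name__ == "__main__":
    for check in (strict_cn_match, wildcard_match):
        print(check.__name__, check("*.domain.tld", "webmail.domain.tld"))
```

With a certificate for *.domain.tld and a client connecting to webmail.domain.tld, the first check fails where the second succeeds, which is why a dedicated, non-wildcard certificate keeps such clients happy.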
"Right. Thanks. Can you check a couple of things for me, like, view the certificate you are getting?" I had my suspicions, but I'm not familiar with whether such software will even allow you to view the certificate it's rejecting.
"I don't understand how this works and I have no time to Google all day as I have a job to do... Can't you just solve it?" is a commonly heard answer.
"Right. So, well, no, I can't just go around and attempt to fix arbitrary things in arbitrary ways. I didn't ask you to Google all day and solve the problem by yourself, I asked you to provide that feedback in order for me to be able to confirm my suspicions as to the cause of your problems."
I've always understood reverse proxying Outlook Anywhere is a challenge, given the sheer volume and the way it sets up a connection and expects that connection to be around for a relatively long time. Hence, I'm going to share some of my configuration, with comments in-line.
First, the reverse proxy:
# Need a specific IP address other than the one used for all the other reverse proxied
# websites, unless you use the very same certificates for all sites (no nss here yet).
<VirtualHost 10.0.0.19:443 10.0.0.98:443>
    ServerAdmin kc-ux@ogd.nl
    # Some Outlook clients may still have been configured with 'webmail.ogd.nl'
    ServerName webmail.ogd.nl
    # The new configuration for Outlook clients is to use the following DNS names
    ServerAlias outlook-anywhere.ogd.nl outlookanywhere.ogd.nl
    DocumentRoot /var/www/html/
    ErrorLog logs/webmail.ogd.nl-error_log
    CustomLog logs/webmail.ogd.nl-access_log combined
    SSLEngine on
    SSLProxyEngine on
    SSLProtocol all -SSLv2
    SSLCipherSuite ALL:!ADH:!EXPORT:!SSLv2:RC4+RSA:+HIGH:+MEDIUM:+LOW
    # Note that this can be a wildcard certificate as far as Outlook clients are
    # concerned, but not for ActiveSync clients.
    SSLCertificateFile /etc/pki/tls/certs/webmail.ogd.nl.cert
    SSLCertificateKeyFile /etc/pki/tls/private/webmail.ogd.nl.key
    SSLCACertificateFile /etc/pki/tls/certs/webmail.ogd.nl.ca.cert
    KeepAlive On
    # Crank up the KeepAliveTimeout; Outlook expects its connection to be around
    # for a relatively long time.
    KeepAliveTimeout 300
    # Prevent the connection from ever being reset unexpectedly
    # (0 means an unlimited number of requests per connection).
    MaxKeepAliveRequests 0
    # Prevent the logs from filling up, by disabling these mod_security rules
    # for this virtual host.
    SecRuleRemoveById 960010
    SecRuleRemoveById 960012
    SecRuleRemoveById 960013
    SecRuleRemoveById 960015
    SecRuleRemoveById 960032
    SecRuleRemoveById 960902
    SecRuleRemoveById 970902
    SecRuleRemoveById 970903
    ProxyRequests Off
    # This is where we actually create the reverse proxy. Note that the parameters
    # to ProxyPass make it a keepalive enabled connection, and with retry=0 a
    # back-end in error state is retried right away, instead of throwing the error
    # back at the client until a retry timeout expires.
    # One for the Zarafa back-end mail-server
    ProxyPass /zarafa https://backend-mailserver.ogd.nl:237/zarafa keepalive=on retry=0
    ProxyPassReverse /zarafa https://backend-mailserver.ogd.nl:237/zarafa
    # And one for the webmail back-end web-server
    ProxyPass / https://webmail.ogd.nl/ keepalive=on retry=0
    ProxyPassReverse / https://webmail.ogd.nl/
</VirtualHost>
With these changes in place, we now have the following situation: