sysadmin

jmeeuwen's picture

Using the noop I/O Scheduler for KVM Virtualization through Puppet and Augeas

For a virtualization environment, it often makes sense to use a kernel I/O scheduler that does not try to account for hardware seek-time penalties on the underlying disks. In my case the hypervisors use a storage device over iSCSI and the guests use logical volumes, so I want to set the noop scheduler on the hypervisors and on all guests running on them. Neither the hypervisors nor the guests will experience a seek-time penalty (or so I thought), so their I/O does not need to be reordered to avoid one. The noop scheduler fits: it is a simple first-in, first-out queue that does no seek-based reordering.

On a side-note: Luckily, all guests run Linux ;-)

Using Puppet and Augeas, it's particularly easy to just manage the kernel cmdline options. In the Puppet manifest:

    # If the system is virtualized, just use the noop I/O scheduler
    # for all block devices
    if ( $is_virtual ) {
        augeas { "kernel_elevator_noop":
            context => "/files/etc/grub.conf",
            changes => "setm title kernel/elevator noop",
            onlyif => "get title/kernel/elevator != noop"
        }
    }

To change the I/O scheduler during runtime, just use:

# echo noop > /sys/block/<device>/queue/scheduler
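
To check which scheduler a device is currently using, you can read the same sysfs file back; the kernel lists the available schedulers and marks the active one with square brackets. The following is a minimal Python sketch of that check. The function names are my own and not part of any tool mentioned here, and the device name you pass in is whatever block device you are interested in.

```python
import re
from pathlib import Path


def active_scheduler(contents: str) -> str:
    """Return the scheduler name that the kernel marks with [brackets]."""
    match = re.search(r"\[([^\]]+)\]", contents)
    if match is None:
        raise ValueError(f"no active scheduler marked in: {contents!r}")
    return match.group(1)


def read_active_scheduler(device: str, sys_block: str = "/sys/block") -> str:
    """Read /sys/block/<device>/queue/scheduler and return the active entry."""
    path = Path(sys_block) / device / "queue" / "scheduler"
    return active_scheduler(path.read_text())
```

For example, `active_scheduler("noop anticipatory deadline [cfq]")` returns `"cfq"`, which would tell you the change has not been applied to that device yet.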

For a full, more verbose description of what to do (including loading the necessary kernel module, etc.), check out this awesome, short walk-through.


10 things to do in Tempe, Arizona

You are somewhere in a Sun Belt state in the U.S., and it's nearing the end of January. For some, this sounds like a hopeless situation. For others, it is the opportunity of a lifetime!

Here are 10 things to do in Tempe, Arizona, nearing the end of January, 2011, while FUDCon is taking place.

  1. Bacula Backup and Recovery; Get some insight on the subject and engage a Bacula Systems Certified Trainer on the subject. I'd love to insert a "Tuesdays at noon" type of thing, but we're just not sure when exactly such sessions would be taking place. It's not important either, there's at least 9 other things to do!
  2. Koji Build Systems. Do you use it? Do you use it because you need something happening downstream? Are you a third-party repository enthusiast? Let's get to it, and have at least one session on the build system suite.
  3. Long Term Support (LTS), Extended Lifecycle Support (ELS), or whatever you may want to call an extension of the regularly supported time period of updates coming to a stable Fedora release. I like the latter terminology better. If you're interested in stopping it though, please stay away?
  4. Spins. Custom spins, localized spins, remixes or bluntly put, dwarf forks. How can we make sure all this momentum is harnessed and nurtured up until the point little projects become proud masters of the universe with our help?
  5. Packaging Guidelines and the Package Review Process. I've had my share, how about you?
  6. Cyrus IMAP; first-hand experience of a) How Fedora Rules the Universe, b) What Upstreams Expect From Distributors and c) How To NOT Behave As A Distributor.
  7. The Ever So Awesome Fedora Cloud. I'm not sure where it's at, but I do think I know a little something about what cloud is all about. It's time to get that first concept implementation for Metadata DNS going, while showing off a complete Open Source stack for Any Company(TM) in a demo-environment with public access of sorts.
  8. Fedora Hosted Groupware. I know Fedora Infrastructure to have looked at Zarafa, not to say I have shamelessly plugged it back in the day. In the interest of full disclosure, I now work for Kolab Systems, another Groupware ISV, this time fully open source and free software.
  9. Puppet. Modules. Modules. Modules. Cross-platform consistency. Approaches. Packaging. Writing modules (why and how)
  10. Beer, meat (that's me), friends, laughs (that's me too!). I can think of very little more I need in life.

Subscribing Your Company To A Mailing List

When you run Kolab, or Postfix and Cyrus IMAP, you may have any number of your users subscribed to a mailing list.

As such, many mailing list posts may be delivered to a large number of users separately, increasing the data footprint on disk and the processing load, while only including some of your users in the loop on what may be interesting content for everyone.

You may want to have one shared folder to deliver new mailing list messages to, to which individual users can then subscribe. Here's how (in a nutshell):

  • Create a Cyrus IMAP user that is called 'shared',
  • Set the postuser to 'shared' in /etc/imapd.conf
  • Set the sharedprefix to 'shared' in /etc/imapd.conf
  • Create a shared/lists mailbox (cm shared/lists@kolabsys.com in my case),
  • Create some subfolders you want mailing list posts to be delivered to, like shared/lists/kolab.org/devel@kolabsys.com,
  • Make sure that at the very least 'anyone' has the 'lrsip' permissions on the mailbox,
  • Give yourself 'all' privileges on the mailbox,
  • Make sure the recipient_delimiter is set to '+' in /etc/postfix/main.cf,
  • Make sure shared@kolabsys.com is listed in /etc/postfix/transport with transport lmtp:unix:/var/lib/imap/socket/lmtp
  • Reload what you need to reload,
  • Send a test message to shared+shared/lists/kolab.org/devel@kolabsys.com,
  • It should end up in the correct folder.
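
The address used in the test message above follows a simple pattern: the post user, the recipient delimiter, the folder path, and the domain. As a small illustration, here is a Python sketch that builds such an address. The function name and its defaults are my own choices for this example; the only values taken from the steps above are the 'shared' user, the '+' delimiter and the example folder.

```python
def shared_folder_address(folder: str, domain: str,
                          post_user: str = "shared",
                          delimiter: str = "+") -> str:
    """Build the plus-address that delivers straight into a shared folder.

    folder is the mailbox path without the domain part,
    e.g. "shared/lists/kolab.org/devel".
    """
    folder = folder.strip("/")
    if not folder:
        raise ValueError("folder must not be empty")
    return f"{post_user}{delimiter}{folder}@{domain}"
```

Calling `shared_folder_address("shared/lists/kolab.org/devel", "kolabsys.com")` gives back the test address from the list, which is also the address you would subscribe to the mailing list itself.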

Note, for those of you not using virtual domain support in Cyrus, skip the @kolabsys.com at the end of folder names to be created and you should be good to go.

Also note that for each individual user that wants to post messages to the mailing list, they still need to be subscribed to the mailing list themselves; they can then disable delivery through the mailing list preferences pages if available, reading the list through the shared folder.


Piwik Web Analytics FTW!

Paul Adams asked me to install Web Analytics for Kolab Systems, his preference being Piwik, which he had used before.

Now, when some request like that hits me, I'm usually like "Oh yeah? Web analytics is it? What is it you actually want? Number of hits?" just because I would like to go from the functional requirements as opposed to being the monkey to implement whatever and then having to maintain/modify/develop whatever because it isn't quite all what was expected.

In this case, though, and those kudos go to Piwik, I looked at it, tried it once and then decided to put it in production. Normally I deploy something to testing first, then have someone make up their mind and make absolutely sure it's good and ready.

With Piwik, the installation procedure was like next-next-finish, it got me what I needed right away, and it runs exactly the way I expect it to run. That's unique compared to how I needed to run in circles for many, many other web-based applications.

Either way, long story short, if you need Web Analytics do take a look at Piwik. It's awesome, even though it's not as extensive in admin features such as delegating a per-profile per-user set-of-privileges (to view or admin such profile). You would most commonly deploy Piwik within one organisation or multiple organisations that trust one another to some or the other extent, specifically for a limited set of websites anyway (e.g. not the multi-multi user deployment that is Google Analytics), and there's very valid reasons to not want to use Google Analytics.


OTRS 2.4.7 SendmailEncodingForce

If you ever run into a situation where OTRS 2.4.7 on your CentOS system (for which packages live here BTW) bails out retrieving new email from a mailbox, it might actually fail sending out notifications to agents:

[otrs@app01 ~]$ perl -d /var/www/otrs/bin/PostMasterMailbox.pl -d 9 -f 1
Loading DB routines from perl5db.pl version 1.28
Editor support available.

Enter h or `h h' for help, or `man perldebug' for more help.

main::(/var/www/otrs/bin/PostMasterMailbox.pl:34):
34: $VERSION = qw($Revision: 1.10 $) [1];
DB<1> step

DB<2> [...System/MailAccount/IMAPS.pm line 82 in sub new] looking for greeting
[...System/MailAccount/IMAPS.pm line 82 in sub new] got a greeting: * OK [CAPABILITY IMAP4 (...snip...)
[...rl/5.8.8/Net/IMAP/Simple.pm line 931 in sub _send_cmd] 0 LOGIN (...snip...)
[...rl/5.8.8/Net/IMAP/Simple.pm line 196 in sub _process_cmd] 0 OK [CAPA(...snip...)
(...snip...)
[...rl/5.8.8/Net/IMAP/Simple.pm line 509 in sub _process_cmd] )\r\n
[...rl/5.8.8/Net/IMAP/Simple.pm line 942 in sub _cmd_ok] )\r\n
[...rl/5.8.8/Net/IMAP/Simple.pm line 903 in sub _seterrstr] warning unknown return string (id=3): )\r\n
[...rl/5.8.8/Net/IMAP/Simple.pm line 509 in sub _process_cmd] 3 OK Completed (0.000 sec)\r\n
[...rl/5.8.8/Net/IMAP/Simple.pm line 942 in sub _cmd_ok] 3 OK Completed (0.000 sec)\r\n
Wide character in subroutine entry at (eval 72)[/usr/lib/perl5/vendor_perl/5.8.8 (...snip...)
at (eval 72)[/usr/lib/perl5/vendor_perl/5.8.8/MIME/Decoder/QuotedPrint.pm:78] line 1
MIME::Decoder::QuotedPrint::encode_qp_threearg('> > > ��Buil(...snip...)
', 'undef', '') called at /usr/lib/perl5/vendor_perl/5.8.8/MIME/Decoder/QuotedPrint.pm line 95
MIME::Decoder::QuotedPrint::encode_qp_really('> > > ��Build OS:���(...snip...)
', 1) called at /usr/lib/perl5/vendor_perl/5.8.8/MIME/Decoder/QuotedPrint.pm line 154
MIME::Decoder::QuotedPrint::encode_it('MIME::Decoder::QuotedPrint=HASH(0x7811330)', (...snip...)
MIME::Decoder::encode('MIME::Decoder::QuotedPrint=HASH(0x7811330)', 'IO::ScalarArray (...snip...)
MIME::Entity::print_bodyhandle('MIME::Entity=HASH(0x78c07a0)', 'IO::ScalarArray=GLOB(0 (...snip...)
MIME::Entity::print_body('MIME::Entity=HASH(0x78c07a0)', 'IO::ScalarArray=GLOB(0x78c09 (...snip...)
MIME::Entity::print('MIME::Entity=HASH(0x78c07a0)', 'IO::ScalarArray=GLOB(0x78c0980)') called (...snip...)
MIME::Entity::print_body('MIME::Entity=HASH(0x782b520)', 'IO::ScalarArray=GLOB(0x78c0980)') (...snip...)
MIME::Entity::stringify_body('MIME::Entity=HASH(0x782b520)') called at /usr/lib/perl5/vendor_perl/ (...snip...)
MIME::Entity::body_as_string('MIME::Entity=HASH(0x782b520)') called at /var/www/otrs/Kernel/Sys (...snip...)
Kernel::System::Email::Send('Kernel::System::Email=HASH(0x76199f0)', 'From', (...snip...)
Kernel::System::Ticket::Article::SendAgentNotification('Kernel::System::Ticket=(...snip...)
Kernel::System::Ticket::Article::ArticleCreate('Kernel::System::Ticket=HASH(0x(...snip...)
Kernel::System::PostMaster::FollowUp::Run('Kernel::System::PostMaster::Foll(...snip...)
Kernel::System::PostMaster::Run('Kernel::System::PostMaster=HASH(0x7232(...snip...)
Kernel::System::MailAccount::IMAPS::_Fetch('Kernel::System::MailAccount::I(...snip...)
Kernel::System::MailAccount::IMAPS::Fetch('Kernel::System::MailAccount::IM(...snip...)
Kernel::System::MailAccount::MailAccountFetch('Kernel::System::MailAccount(...snip...)
main::Fetch('EncodeObject', 'Kernel::System::Encode=HASH(0x6bf86e0)', 'C(...snip...)
[otrs@app01 ~]$

Note that the error in the web interface or from a normal regular run may look like:

Wide character in subroutine entry at (...something...) at 1

This is actually caused by the default OTRS setting for SendmailEncodingForce which is set to 'base64'. Hence, in /var/www/otrs/Kernel/Config.pm, add the following line:

$Self->{'SendmailEncodingForce'} = '7bit';

Thanks to Paul Adams for some ingenious Googling ;-)


Where I Have Been (the past two weeks)

It's been a very busy past two weeks and I haven't had much time to show my face on IRC or respond to email. I've participated in or executed some interesting activities though, so please allow me to share those with you.

On Monday, almost two weeks ago, I visited Bacula Systems headquarters in Switzerland to get 1-on-1 training from Kern Sibbald, the original author of the Bacula backup & recovery software, in order to become a certified Bacula Systems Trainer. A quick walk through the slides along with some insight on what Kern thought were the most important highlights to mention and/or explain during the training, a review of the exercises and one day later, you have sufficient additional information apart from the course material itself.

The next day, Eric Schirardin introduced me to how the Bacula Systems course classroom setup was to be configured, so that all the exercises could be performed without the course attendees having to go through the motions that in fact would have had little to do with Bacula itself.

Since my flight back to the Netherlands was leaving early that evening, and I had an important Kolab Systems conference call in between, I had to hurry and catch a train to the airport, find myself a nice, quiet corner somewhere, a wifi connection and call in. I made it, but that morning I had no idea how smoothly we would go through the motions to get me to be able to set up the classroom and at what time we'd be done exactly. Let's just say I had too little time to get any lunch before I caught a train just in time for me to arrive at the airport and not have any of the ambient train noises while sitting in a conference call. Then again, I don't usually eat during lunch anyways ;-)

The next day (Thursday last week), I was going to travel to Osnabrueck for a visit to a valued partner on behalf of Kolab Systems; a three-hour train-ride from Utrecht. In order to arrive at a reasonable time, I had to get up at around 6 in the morning, not something I'm very much used to, nor a fan of ;-) Either way though, the visit was planned for Thursday as well as Friday, and it was definitely worth the travel! I was staying with Christoph Wickert -you know him from his contributions to the Fedora Project and various upstream projects- in Munster throughout my stay in Germany.

On Saturday, I was on my way back home again. Since the Bacula Systems Administration Course was to start on Tuesday, and Monday was allocated to preparing the classroom (and testing the setup), most of my weekend was spent at going through the slides and exercises and preparing my story. Luckily, I was still very familiar with the course as I had recently participated in the course myself, so I could afford to visit a friend's birthday party as well.

On Monday, it turned out that VirtualBox (used to create a second backup client for the Bacula Systems Administration Course), installed on a 64-bit laptop without hardware virtualization acceleration capabilities cannot run a 64-bit guest, so I quickly installed a 32-bit guest and configured it, which actually took the majority of the single day that I had available.

Tuesday, Wednesday and Thursday, at Amaziq Source in Amsterdam, I ran the Bacula Systems Administration I Course (there's a part II coming up as well, with more advanced topics) with 6 participants from 4 countries, and I'd like to think that it was very successful. At least Bacula Systems (through their senior engineer Arno Lehmann) thought so, because I passed the test running the course according to their quality standards, and so now I'm a proud Bacula Systems Certified Trainer ;-)


Trunking Bonded Etherchannels for Virtualization and Storage on Enterprise Linux

I was setting up a network infrastructure the other day, and needed trunking over bonded (Linux) network interfaces connected to a Cisco switch for a virtualization <-> storage network (Network #1), and "the rest" (production network, management network, etc.). Here's just some quick notes:

  • My rule of thumb was: The higher the VLAN number or IP space network number, the more private the network is.
  • I decided the Management network was to be VLAN 666 (how appropriate) with IP space 10.66.6.0/24. The storage network inherently became 10.66.7.0/24 (VLAN 667) because it was even more private than the management network.
  • I decided that any public servers would be in VLAN 2 (very public, very close to the Internet, etc.).

Let's suppose these were all the networks I needed.

On the Hypervisor, I configured eth0 and eth1 as slave interfaces for bond0. bond0 itself though was not supposed to have any IP address configuration.

# cat ifcfg-eth0
DEVICE=eth0
BOOTPROTO=none
HWADDR=<some-mac-address>
ONBOOT=yes
MASTER=bond0
SLAVE=yes
# cat ifcfg-eth1
DEVICE=eth1
BOOTPROTO=none
HWADDR=<some-mac-address>
ONBOOT=yes
MASTER=bond0
SLAVE=yes
# cat ifcfg-bond0
DEVICE=bond0
BOOTPROTO=none
ONBOOT=yes

For a bonded interface, you need to choose the exact mode of operation, and because the interfaces were bonded to increase the throughput, I chose 802.3ad. For it to actually happen, you need to configure the bonding kernel module through /etc/modprobe.conf:

alias eth0 bnx2
alias eth1 bnx2
alias bond0 bonding
options bond0 mode=4 miimon=100

NOTE: Listing the physical interfaces first is mandatory.
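
Once the bond is up, the bonding driver reports its state in /proc/net/bonding/bond0, which is a handy place to confirm that mode 4 really came up as 802.3ad and that both slaves joined. Below is a rough Python sketch that pulls the mode and the slave interfaces out of that file. The function name is hypothetical, and the sample text in the usage note is abbreviated and illustrative rather than captured from the system described here.

```python
def bonding_summary(proc_text: str) -> dict:
    """Extract the bonding mode and slave names from /proc/net/bonding/<bond>."""
    mode = None
    slaves = []
    for line in proc_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank lines and anything without "key: value"
        key, value = key.strip(), value.strip()
        if key == "Bonding Mode":
            mode = value
        elif key == "Slave Interface":
            slaves.append(value)
    return {"mode": mode, "slaves": slaves}


def is_lacp(summary: dict) -> bool:
    """True when the reported mode mentions 802.3ad."""
    return summary["mode"] is not None and "802.3ad" in summary["mode"]
```

Fed with text such as "Bonding Mode: IEEE 802.3ad Dynamic link aggregation" followed by "Slave Interface: eth0" and "Slave Interface: eth1" lines, the summary should list both interfaces and `is_lacp()` should be true. If only one slave shows up, check the MASTER/SLAVE settings in the ifcfg files first.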

Still, we have no network configuration. The only network configuration I needed was a semi-physical interface in the storage network, and a set of 802.1q encapsulated interfaces for the rest of the network communications. Ergo, I created the following interfaces:

  • bond0.2 (public)
  • bond0.666 (management)
  • bond0.667 (storage)

The only interface out of the three that would actually get an IP address though was the storage network interface bond0.667. The configuration would look as follows:

DEVICE=bond0.667
BOOTPROTO=static
ONBOOT=yes
VLAN=yes
TYPE=Ethernet
IPADDR=10.66.7.1
NETMASK=255.255.255.0
NOZEROCONF=yes

The other two interfaces (bond0.2 for the internet and bond0.666 for the management) are a little more tricky. They needed to be bridged interfaces, in order to allow virtualized guest nodes to be positioned in either one of those two networks. The configuration for bond0.2 therefore looked as follows:

DEVICE=bond0.2
BOOTPROTO=static
ONBOOT=yes
VLAN=yes
TYPE=Ethernet
BRIDGE=br2

Bridge interface br2 was to be used to connect the virtualized guest nodes to. Its configuration looks like:

DEVICE=br2
TYPE=Bridge
BOOTPROTO=none
ONBOOT=yes
VLAN=yes
STP=yes
DELAY=5

Note that the bridge interface does not have its own IP address, or we would be connecting the Hypervisor directly to the Internet (and we don't want to, FWIW).

The management network interface though, which also needed to be bridged, does have its own IP address (in the management network, of course):

# cat ifcfg-bond0.666
DEVICE=bond0.666
BOOTPROTO=none
ONBOOT=yes
VLAN=yes
TYPE=Ethernet
BRIDGE=br666
# cat ifcfg-br666
DEVICE=br666
TYPE=Bridge
BOOTPROTO=static
ONBOOT=yes
VLAN=yes
IPADDR=10.66.6.1
NETMASK=255.255.255.0
GATEWAY=10.66.6.254
NOZEROCONF=yes
STP=yes
DELAY=5

Now, we're done with the Linux Hypervisor part of the infrastructure. Let's get to the Cisco side of things!

All that a Cisco Catalyst 3560G really requires is that you:

  • Create the VLANs you want to use. This used to be through enable mode's "vlan database" command, which is still a valid way of configuring the available VLANs, but will be (may already have been on newer equipment) deprecated for a series of commands in configure mode.
  • Create the "channel-group" with the appropriate interfaces.
  • Enable trunking on the channel-group interface.

Ergo, here we go (from enable mode):

conf t
int range gi0/1-2
no shut
speed 1000
duplex full
switchport trunk encapsulation dot1q
switchport mode trunk
switchport trunk allowed vlan 2,666,667
no switchport trunk native vlan
description **some etherchannel**
channel-group 1 mode active

You should now get an interface called Po1, with the following configuration:

show running-config interface Po1
Building configuration...

Current configuration : 142 bytes
!
interface Port-channel1
switchport trunk encapsulation dot1q
switchport trunk allowed vlan 2,666,667
switchport mode trunk
end

You should be good to go by now.


MySQL: Graphing the Queries per Second Average Trend (Part 2)

Continuing on my previous blog post on trending the number of Average Queries per Second on a MySQL server;

It didn't really cover the scenario in which the MySQL server was rebooted, and so the average queries per second would restart at 0 (but not really).

The longer the period over which a peak in the number of queries per second is averaged, the less that peak influences the average; it's a good thing your average MySQL server does not have a 99.999% uptime ;-)
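
To put some numbers on that, here is a small Python sketch. It assumes the average is simply the total number of queries since startup divided by the uptime in seconds; the baseline, peak and durations are made-up figures for illustration only, not measurements from my server.

```python
def average_qps(total_queries: float, uptime_seconds: float) -> float:
    """Average queries per second since the server started."""
    if uptime_seconds <= 0:
        raise ValueError("uptime must be positive")
    return total_queries / uptime_seconds


def average_with_peak(baseline_qps: float, peak_qps: float,
                      peak_seconds: float, uptime_seconds: float) -> float:
    """Average for a server that runs at baseline_qps except for one peak."""
    if peak_seconds > uptime_seconds:
        raise ValueError("the peak cannot last longer than the uptime")
    total = (baseline_qps * (uptime_seconds - peak_seconds)
             + peak_qps * peak_seconds)
    return average_qps(total, uptime_seconds)
```

With a baseline of 10 queries per second and a one-minute peak of 1000, `average_with_peak(10, 1000, 60, 3600)` comes out at 26.5 after one hour of uptime, while the same peak after 30 days of uptime (2592000 seconds) barely moves the average above 10. Right after a reboot, in other words, the graph is far more jumpy than it is weeks later.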

So, here's how the graph continues after a reboot of the server (we had a power interruption):

And so it seems we're back to where we were again, after one week of running in production.


kernel: nfs server <server> not responding, still trying

Where <server> is my fileserver with a bunch of terabyte drives, my NFS clients get this message every once in a while (like every minute or so):

kernel: nfs server <server> not responding, still trying
kernel: nfs server <server> OK

Sometimes, in between these two messages, there's a couple of seconds. That can't be good ;-)

Most of the time, it's just the server being overloaded. Since most of my boxes at home are pimped yet slow desktop form-factor machines, this can happen. Google will show you the exact right hits on this phenomenon.

However, I found out there's another cause as well. Note how /etc/sysconfig/nfs has a setting, commented out by default, that says:

# Number of nfs server processes to be started.
# The default is 8.
#RPCNFSDCOUNT=8

I uncommented this, changed it to 24, restarted NFS et voilà! Yet another teeny weeny setting to tweak ;-)
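
If you manage more than a couple of NFS servers, it can be useful to check what a given /etc/sysconfig/nfs actually sets. Here is a minimal Python sketch that reads the effective value from the file's contents, falling back to the default of 8 mentioned in the file's own comment when the setting is still commented out. The function name is my own, and the parsing is deliberately simplistic (plain KEY=value lines only).

```python
def nfsd_count(sysconfig_text: str, default: int = 8) -> int:
    """Return the effective RPCNFSDCOUNT from the text of /etc/sysconfig/nfs."""
    count = default
    for line in sysconfig_text.splitlines():
        line = line.strip()
        # Commented-out settings and non-assignments do not count.
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == "RPCNFSDCOUNT":
            count = int(value.strip().strip("\"'"))
    return count
```

For the stock file shown above this returns 8; after the change it should return 24. On a running server, /proc/fs/nfsd/threads should show the number of threads that were actually started.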


The Position Of A Sysadmin

A system administrator's job is always in between two (or most likely more than two) fires; one is the existing infrastructure, and the others consist of every other stakeholder in that infrastructure. The worst stakeholders are users, especially those that will burn you down without knowing what all is involved, without any respect or perspective as to the position you are in as a sysadmin, juggling more requirements on the infrastructure than just their personal needs. In the end though, ironically, they are also the reason why. Noted, most of my users are sysadmins themselves, which makes any kind of disposition all the more awkward, and means any kind of decision is evaluated a lot differently by many more people. In that regard, being a sysadmin in a company full of sysadmins is similar to participating in the Free Software world, if you will.

It came to that point. That point where a decision needed to be made on whether to scale up, flesh out and divide in order to conquer, or accept the downtime as a consequence of incidental shortage of resources. That time will come in most environments, and you can only prepare for it. Hence, I'm going to list a few tips on things to do, and things not to do, in order for you to -hopefully- be a little better prepared when you're in a similar position ;-)

First, I'll draw you up an overview picture of the environment in this case.

There's this web-server and it doesn't consume a lot of resources most of the time, and you feel confident it's going to make it through this week unharmed. It does Zarafa Webaccess, and ActiveSync through Z-Push. Both of these are heavily used by many users -think in terms of hundreds. It's a mission-critical environment to many, since groupware includes their email and calendaring, and however pitiful, this apparently still is the primary way for my colleagues to communicate.

Then, on Tuesday, during office hours, a peak in resource usage, and many users start complaining about dramatically decreased performance. Fair enough, it seems.

This web-server just so happens to be on the back-end Zarafa groupware server. Infrastructural architecture aside, it was juggling its available resources between the back-end and the front-end server. As a result, not only Zarafa Webaccess and ActiveSync (z-push) users were impacted, but our IMAP and Outlook users were as well, and it's the Outlook users (Account Managers, Senior Consultants, ...) that we think are truly mission-critical.

This is an incident, and it's, or so it seems, isolated. Incidents though, if I remember correctly, become problems when they occur two or more times. This situation I'm describing happened to us on Tuesday, had happened to us just one time before. The last occurrence was months ago, and we've been running this environment for about 3 or 4 years. This last incident months ago just so happened to be on the day of the release of the iPhone 3GS -a "funny" coincidence, probably. Either way, incidents, I think, have some retention to it. It's not like it's a let i++ kinda for loop that goes on and on and on, if you know what I mean.

Either way, I fixed it right on the spot, once more, by freeing up some resources. The incident did cause me to think about a different, more robust architecture though, but I wasn't in any hurry to implement such different environment/infrastructure architecture. I did some thinking, some research, I made some preparations, but I wasn't going to implement them quite yet.

Wednesday, everything is fine. Thursday, everything is fine. Friday, the very same situation happens again. Now, to me, that's a second strike, if not a third, and I'm going to take on this problem as a special project and make sure it is dealt with accordingly, my perspective being long-term sustainability and all that.

So, after a little study as to the exact cause, the proposed changes are:

  • move Webaccess and ActiveSync to a different, dedicated web-server (virtualization makes that easy enough)
  • position a reverse proxy in between external Outlook clients and the back-end mail-server (heavy stuff)
  • position a reverse proxy in between external Zarafa Webaccess and ActiveSync (z-push) users and the new web-server (heavy stuff once more)

A lot of thought, again, went into long term sustainability. For one, I want to put front-end, public facing processing on a node positioned in the perimeter network, as much as I can. I want my perimeter network node to go down but never ever the backend server, if you will, by whatever means.

And so I did make those changes by the time the second occurrence of this incident was about to get the entire groupware environment to a grinding halt, as part of an emergency change, that very same Friday afternoon -the worst time to implement any kind of change.

It took me all about 5 seconds to merge and push it through our configuration management -with Puppet. Since no DNS entries had changed yet, no firewalls were forwarding to anything different yet, this change was non-intrusive. So far, the changes I made are basically an extra VirtualHost for an existing reverse proxy, and I created a new web-server.

The intrusive part of this emergency change, the part that makes me do this blog post, is still to come. Either way though, the fact that it was a non-intrusive change so far enabled me to test the implementation without a staging environment. Note that in the environment I work with right now, we only turn to a staging environment for god-awful intrusive changes (Zarafa upgrades, LDAP migrations, Disaster Recovery tests, Restores from Backup, etc.), since we have the top Senior experts on any technology on our payroll.

Anyway, as a result, I now had everything in place to flip the proverbial switches, without anyone noticing anything. Everything was tested (/etc/hosts FTW!), and so I switched the internal DNS entry for our Webaccess and ActiveSync over to the new web-server.

I shouldn't have.

I made a big, big mistake.

I assumed (which in itself very accurately describes the fsckup) that a host name of webmail.domain.tld was used for *webmail* only, and not, say, FTP sites or even just Outlook clients.

The new, internal web-server had no business reverse proxying Outlook clients to the back-end mail-server, yet more and more Outlook clients configured to use server webmail.domain.tld started attempting to connect to the new web-server.

The reverse proxy for external clients did have business reverse proxying such connections, but that is no back-end server in any way. The back-end web-server for webmail.domain.tld wasn't configured to do any kind of reverse proxying for Outlook clients, and justifiably so. It had been configured to perform any of the tasks that involve *webmail*, and that alone.

Luckily, I had the configuration in place on that reverse proxy I mentioned. Copy, paste, commit, push, pull, apply, run, done. No matter what the IP address for webmail.domain.tld was internally, one would either end up with the back-end mail-server, or the new (back-end) web-server that, for the time being, also reverse proxied the Outlook connection to the back-end mail-server. That too was solved pretty quickly, but for the users to understand why or how exactly they had no email or calendaring for 5 seconds... different story.

The Users

Users will argue "it used to work", when also arguing to not understand why "it doesn't work right now". Well, you know, sometimes changes are necessary in order for anything to continue to work. Sometimes the implementation of such changes impacts you as a user, but only serves other users. Sometimes, hopefully even less frequently, those other users justify the change to be an emergency change, hitting you right in the face during production hours. I'm sorry, you were saying?

Then, and this is especially the case with me, I require a feedback cycle. I myself am not an Outlook nor ActiveSync user, so I need someone else to tell me what their needs and expectations are. Not in terms of "it doesn't work anymore", but accompanied with the details that allow me to pin-point the exact cause and then solve it.

Some of these causes can be found in arbitrary ActiveSync software not accepting a wildcard-certificate as valid, even though it is in fact valid for the rest of the world. This kind of software would check whether the certificate CN is exactly equal to the fully qualified domain name of whatever you're hitting, but won't expand matching characters. That level of detail is beyond many sysadmins, let alone users. While neither party is at fault per se, users do look in the direction of a sysadmin "to solve the problem" because "it doesn't work".
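
The difference between the two behaviours is easy to show in a few lines of Python. This is a simplified sketch of my own, not the logic of any particular ActiveSync client: the first function does the exact string comparison described above, the second accepts a wildcard in the left-most label only, which is roughly what most other clients do. The host names are examples.

```python
def strict_cn_match(cn: str, hostname: str) -> bool:
    """What the picky clients appear to do: exact comparison, no wildcards."""
    return cn.lower() == hostname.lower()


def wildcard_cn_match(cn: str, hostname: str) -> bool:
    """Accept '*' as the left-most label, matching exactly one label."""
    cn_labels = cn.lower().split(".")
    host_labels = hostname.lower().split(".")
    if len(cn_labels) != len(host_labels):
        return False
    if cn_labels[0] == "*":
        return cn_labels[1:] == host_labels[1:]
    return cn_labels == host_labels
```

With a certificate CN of `*.domain.tld` and a client connecting to `webmail.domain.tld`, `wildcard_cn_match()` is true but `strict_cn_match()` is false, and the user only sees "it doesn't work".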

"Right. Thanks. Can you check a couple of things for me, like, view the certificate you are getting?" I had my suspicions, but I'm not familiar with whether such software will even allow you to view the certificate it's rejecting.

"I don't understand how this works and I have no time to Google all day as I have a job to do... Can't you just solve it?" is a commonly heard answer.

"Right. So, well, no, I can't just go around and attempt to fix arbitrary things in arbitrary ways. I didn't ask you to Google all day and solve the problem by yourself, I asked you to provide that feedback in order for me to be able to confirm my suspicions as to the cause of your problems."

The Settings

I've always understood reverse proxying Outlook Anywhere is a challenge, given the sheer volume and the way it sets up a connection and expects that connection to be around for a relatively long time. Hence, I'm going to share some of my configuration, with comments in-line.

First, the reverse proxy:

# Need a specific IP address other than the one used for all the other reverse proxied
# websites, unless you use the very same certificates for all sites (no nss here yet).
<VirtualHost 10.0.0.19:443 10.0.0.98:443>
ServerAdmin kc-ux@ogd.nl
# Some Outlook clients may still have been configured with 'webmail.ogd.nl'
ServerName webmail.ogd.nl
# The new configuration for Outlook clients is to use the following DNS names
ServerAlias outlook-anywhere.ogd.nl outlookanywhere.ogd.nl
DocumentRoot /var/www/html/

ErrorLog logs/webmail.ogd.nl-error_log
CustomLog logs/webmail.ogd.nl-access_log combined

SSLEngine on
SSLProxyEngine on
SSLProtocol all -SSLv2
SSLCipherSuite ALL:!ADH:!EXPORT:!SSLv2:RC4+RSA:+HIGH:+MEDIUM:+LOW
# Note that this can be a wildcard certificate as far as Outlook clients are concerned, but
# not for ActiveSync clients.
SSLCertificateFile /etc/pki/tls/certs/webmail.ogd.nl.cert
SSLCertificateKeyFile /etc/pki/tls/private/webmail.ogd.nl.key
SSLCACertificateFile /etc/pki/tls/certs/webmail.ogd.nl.ca.cert

KeepAlive On
# Crank up the KeepAliveTimeout
KeepAliveTimeout 300
# Prevent the connection from ever being reset unexpectedly.
MaxKeepAliveRequests 0

# Prevent the logs from filling up.
SecRuleRemoveById 960010
SecRuleRemoveById 960012
SecRuleRemoveById 960013
SecRuleRemoveById 960015
SecRuleRemoveById 960032
SecRuleRemoveById 960902
SecRuleRemoveById 970902
SecRuleRemoveById 970903

ProxyRequests Off

# This is where we actually create the reverse proxy. Note that the parameters
# to ProxyPass make it a keepalive enabled connection, and retry=0 makes Apache
# retry a failed back-end immediately, instead of throwing errors back at the
# client while waiting out the retry timeout.

# One for the Zarafa back-end mail-server
ProxyPass /zarafa https://backend-mailserver.ogd.nl:237/zarafa keepalive=on retry=0
ProxyPassReverse /zarafa https://backend-mailserver.ogd.nl:237/zarafa

# And one for the webmail back-end web-server
ProxyPass / https://webmail.ogd.nl/ keepalive=on retry=0
ProxyPassReverse / https://webmail.ogd.nl/

</VirtualHost>

The Impact/Implications

With these changes in place, we now have the following situation:

  • If anything is hammered (from the Internet) to the point where it is going down, it'll be a simple reverse proxy,
  • If anything from the Internet is invalid in any way, we can stop it as close to the source as possible, so it'll never end up on our back-end infrastructure,
  • No webmail or ActiveSync client is going to cause the web-server to need so many resources that it gets in the way of the resources required by the back-end Zarafa mail-server, making the overall environment just that little bit more robust,
  • We finally got our Outlook clients to no longer use "webmail.domain.tld" in their configuration,
  • Problems with webmail and ActiveSync (timeouts and connections being reset) have now been resolved merely through the feedback cycle triggered by the sometimes disruptive implementation of these changes.

The Tips

  • Assume that your predecessors have not had the same perspective you have now, and as a result may very well have caused Outlook clients to connect to the Zarafa server using a hostname of www.yourcompany.com.
  • Profile your environment's resource usage as much as you can. The ability to anticipate a shortage of resources works magically well. For example, if you know one incident with 1,000 queries per second made your MySQL database server go haywire, monitor the trend of Average Queries per Second, which in healthy environments should only go up and up over longer periods of time, allowing you to anticipate whether and when any limitations are going to be met.
  • Use Configuration Management through a Source Code Management system. Not only does it allow you to quickly revert changes applied, it is also your audit trail, a large part of your documentation, and the means to verify overall system consistency. Two years from now, you may not work for your company any longer. Somebody else is going to have to figure out why the "keepalive=on retry=0" is in the configuration for webmail.domain.tld, and why it is so darn essential. This also goes for using Consistent Changesets (no changeset breaks the overall configuration and all changesets can be reverted by themselves), as well as proper Commit messages (the documentation of the change is in the change itself).
  • Try to relieve as much stress as possible and don't pick up the phone when you're expecting a user to be on the other end. Make the communication with users somebody else's priority number 1, so that you can focus on what this arbitrary ActiveSync client might actually be doing totally unexpectedly, as opposed to getting even more stressed out over "it doesn't work" types of other people's frustrations.
  • Move forward. Move backwards a little in order to be able to move forward at all, if necessary. Nobody (not you, not the other sysadmins, not your manager, nor the users) is helped in any way by fast hot-fixing or dirty patch-work if a situation can happen again and again. Rather, consider whether there's a (possibly) more sustainable, true way forward, even if that includes taking a small step backwards for a little while, and then implement that instead. That is, of course, unless you can hot-fix the situation temporarily, and hold off on the actual implementation of the permanent change until after production hours.
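
To make the tip about profiling a bit more tangible, here is a minimal sketch of how the Average Queries per Second trend could be turned into a rough "when do we hit the wall" estimate. Everything in it is an assumption for illustration: the `Sample` class and both function names are made up, the counters are modelled on MySQL's `Questions` and `Uptime` status values, and the extrapolation is plain linear, which real traffic rarely is.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    uptime_seconds: int  # assumed to come from something like MySQL's Uptime counter
    questions: int       # assumed to come from something like MySQL's Questions counter


def average_qps(earlier: Sample, later: Sample) -> float:
    """Average queries per second between two samples of the same server run."""
    elapsed = later.uptime_seconds - earlier.uptime_seconds
    if elapsed <= 0:
        raise ValueError("samples must be chronological, without a restart in between")
    return (later.questions - earlier.questions) / elapsed


def days_until_limit(qps_then: float, qps_now: float,
                     days_between: float, limit: float) -> Optional[float]:
    """Linear extrapolation; None when the trend is flat or falling."""
    growth_per_day = (qps_now - qps_then) / days_between
    if growth_per_day <= 0:
        return None
    return max(0.0, (limit - qps_now) / growth_per_day)


if __name__ == "__main__":
    qps = average_qps(Sample(0, 0), Sample(3600, 1_440_000))
    print("average qps over the last hour:", qps)
    print("days until 1,000 qps:", days_until_limit(400.0, 500.0, 25.0, 1000.0))
```

The known breaking point (the 1,000 queries per second from the incident) goes in as `limit`; the rest is just two measurements taken some days apart. Treat the outcome as an early warning, not as a prediction.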