Thoughts on Kolab and (3rd Party) Application Caching

Applications that we integrate with Kolab Groupware have a genuine need for caching. But what is it exactly that causes this need for caching?

Why Does One Cache?

Let's first think about why one caches in the first place. Caching is usually implemented to eliminate a bottleneck and boost performance. Data can then be obtained from the relatively quick cache (it is "close by", and it usually understands and is optimized for some form of querying) as opposed to the relatively slow original source of the data.

With Kolab, using Cyrus IMAP as the backend storage for all groupware related information like Email, Events and Contacts, it can hardly be argued the IMAP server does not perform up to specifications. Cyrus IMAP is extremely fast and scalable; it is, arguably, just as lean and mean as a cache would be.

So, Where's the Bottleneck?

Yet the following topology both introduces a bottleneck, seriously impacting performance, and works around it, improving performance again. I'm off to explore what exactly the bottleneck is, and how to work around it.

We'll use Z-Push (a Free Software ActiveSync implementation) in a regular, current workflow scenario as an example. The following happens, justifying caching, when a mobile device requests synchronization:

  1. The 3rd party application connects to Cyrus IMAP to retrieve "the information". Since not all folders may need syncing, and some folders contain Email while others contain Events or Contacts, a list of IMAP folders needs to be obtained.
  2. Cyrus IMAP, its efficiency in this matter aside, interacts with the mailbox storage to retrieve certain information from its mailbox database, its annotations database, the message files, etc.

Using the IMAP protocol, this means:

  • Listing the folders the user is authorized for.
  • Iterating over the list of folders and retrieving the following annotations for each folder:
    • "Should this folder be synchronized with the mobile device at all?"
    • "What type of groupware items does the folder contain?"
  • Then, excuse my paraphrasing, for each IMAP folder, the changes that may or may not have been applied on the mobile device need to be compared with the changes that may or may not have been applied in the IMAP folder, and vice-versa. This involves retrieving changes and messages, parsing and comparing them, and applying the changes on either end, all while tracking which changes have already been communicated to one or the other end of the synchronization exercise.
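
To make the cost of the first two bullets a bit more tangible, here is a minimal Python sketch that only counts round trips. It does not talk to a real server: `FakeImapServer`, `list_folders`, `get_annotation` and `folders_to_sync` are hypothetical names made up for this illustration, the folder data is invented, and `/vendor/kolab/folder-type` is assumed here to be the annotation that carries the folder type. It models the flow as described above (one command for the listing, then one command per folder and annotation), not what Z-Push actually does internally.

```python
from dataclasses import dataclass


@dataclass
class FakeImapServer:
    """Hypothetical in-memory stand-in for an IMAP server; it only counts round trips."""
    annotations: dict  # folder name -> {annotation key: value}
    round_trips: int = 0

    def list_folders(self):
        # Models a single folder listing command.
        self.round_trips += 1
        return sorted(self.annotations)

    def get_annotation(self, folder, key):
        # Models one annotation lookup for one folder and one key.
        self.round_trips += 1
        return self.annotations[folder].get(key)


def folders_to_sync(server):
    """Return {folder: folder type} for every folder flagged for synchronization."""
    result = {}
    for folder in server.list_folders():
        # "Should this folder be synchronized with the mobile device at all?"
        if server.get_annotation(folder, "/vendor/kolab/activesync") != "true":
            continue
        # "What type of groupware items does the folder contain?"
        result[folder] = server.get_annotation(folder, "/vendor/kolab/folder-type")
    return result


# Invented example data: three folders, two of them flagged for synchronization.
server = FakeImapServer({
    "INBOX": {"/vendor/kolab/activesync": "true", "/vendor/kolab/folder-type": "mail"},
    "Calendar": {"/vendor/kolab/activesync": "true", "/vendor/kolab/folder-type": "event"},
    "Archive": {"/vendor/kolab/activesync": "false", "/vendor/kolab/folder-type": "mail"},
})
selected = folders_to_sync(server)
print(selected, server.round_trips)
```

In this toy model the number of round trips grows with the number of folders before a single message has been compared, which is the part a cache is meant to take away.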

Naturally, the last step is what Z-Push is in charge of in this example, and it does have certain characteristics and interactions with Cyrus IMAP as well as with its own caches to optimize performance, scalability and user experience.

The former notwithstanding, this is just one application integrated with Kolab that is responsible for maintaining its own cache. In current generations of Kolab, Horde Webmail does the exact same but using a different cache, and future generations of Kolab will include RoundCube, which again maintains its own cache.

One Cache per Application?

Maintaining a cache per 3rd party application integrated with Kolab isn't necessarily the most sustainable route to go. Feasible? Yes. Sustainable? Perhaps not. Let's take one step back and look at the bigger picture again:

Presumably, the interaction between Cyrus IMAP and its storage cannot be optimized further (which the dotted double arrow is supposed to indicate), at least not without intrusive changes. Admittedly, I'm unaware of our options to further increase performance in this part of the flow of information. If you have ideas and the necessary experience, let me know and I can get you hooked up.

It is, perhaps, the IMAP protocol used in between the Cyrus IMAP server and the 3rd party application that is the bottleneck. For example, Z-Push cannot do the following over IMAP, which would eliminate a number of iterations and sequences of IMAP commands:

SELECT folder FROM folders
INNER JOIN annotations ON folders.id = annotations.folder_id
WHERE annotations.key = '/vendor/kolab/activesync' AND annotations.value = 'true';

Hey, this does somewhat represent what it does against the cache it maintains: having obtained the information over IMAP once (slow), it uses its cache to obtain the information (fast) in a number of subsequent synchronizations (limited to an expiry interval, after which the cache expires and is updated, of course). This gives us the following picture, where IMAP is "slow" for the task at hand, and SQL is fast:
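
As a rough illustration of that "populate once over IMAP, then query the cache" pattern, here is a small Python sketch using an in-memory SQLite database. It is not how Z-Push implements its cache: the table layout simply mirrors the query above, and `open_cache`, `populate`, `is_stale`, `syncable_folders`, the `fetched_at` column and the 300 second expiry interval are all assumptions made up for the example. The snapshot passed to `populate` stands in for whatever was retrieved over IMAP on the slow path.

```python
import sqlite3

CACHE_TTL_SECONDS = 300  # hypothetical expiry interval


def open_cache():
    """Create an empty in-memory cache with the two tables the query above refers to."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE folders (id INTEGER PRIMARY KEY, folder TEXT UNIQUE, fetched_at REAL);
        CREATE TABLE annotations (folder_id INTEGER REFERENCES folders(id), key TEXT, value TEXT);
    """)
    return db


def populate(db, snapshot, now):
    """Replace the cache content with a snapshot obtained over IMAP (the slow path)."""
    with db:  # commit on success, roll back on error
        db.execute("DELETE FROM annotations")
        db.execute("DELETE FROM folders")
        for name, annots in snapshot.items():
            cur = db.execute(
                "INSERT INTO folders (folder, fetched_at) VALUES (?, ?)", (name, now))
            db.executemany(
                "INSERT INTO annotations (folder_id, key, value) VALUES (?, ?, ?)",
                [(cur.lastrowid, k, v) for k, v in annots.items()])


def is_stale(db, now):
    """True when the cache is empty or older than the expiry interval."""
    oldest = db.execute("SELECT MIN(fetched_at) FROM folders").fetchone()[0]
    return oldest is None or now - oldest > CACHE_TTL_SECONDS


def syncable_folders(db):
    """The fast path: one query instead of one IMAP command per folder."""
    rows = db.execute("""
        SELECT folder FROM folders
        INNER JOIN annotations ON folders.id = annotations.folder_id
        WHERE annotations.key = '/vendor/kolab/activesync' AND annotations.value = 'true'
        ORDER BY folder
    """)
    return [row[0] for row in rows]


# Invented example data; timestamps are passed in explicitly to keep the sketch deterministic.
db = open_cache()
if is_stale(db, now=1000.0):
    populate(db, {
        "INBOX": {"/vendor/kolab/activesync": "true"},
        "Calendar": {"/vendor/kolab/activesync": "true"},
        "Archive": {"/vendor/kolab/activesync": "false"},
    }, now=1000.0)
print(syncable_folders(db))
```

A real application would pass the current time (for instance `time.time()`) instead of fixed numbers, and would refresh the cache from IMAP whenever `is_stale` reports that the expiry interval has passed.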

Ignoring the interaction that Cyrus IMAP requires with the filesystem as a negligible performance penalty, and focussing on how the 3rd party application wishes to optimize performance, apparently it would rather perform caching (cheap) than interact over the IMAP protocol (expensive).

Suggestion #1: One Cache To Rule Them ALL

It has been suggested that, since most if not all of the 3rd party applications integrated with Kolab would require some form of caching, we create "one cache to rule them all":

Although this is probably entirely feasible, the following constraints on such an architecture come to mind:

  • Session reliability and personal information security, complex to implement but even more complex to audit, and implemented up to specifications with IMAP ACLs already,
  • Duplication of ((a significant) part of) the data,
  • Abstraction from caching required in all 3rd party applications, each of which has its own already (i.e. significant development effort right from the start, and continuous development effort for more 3rd party applications to integrate with Kolab Groupware), and one uniform caching specification across all of the 3rd party applications (i.e. significant design complexity).

We (within the Kolab community) regularly refer to the "one cache to rule them all" as "server-side akonadi", after Akonadi, which currently provides the very efficient client-side (offline) caching in our primary smart client, Kontact.

Suggestion #2: Maintain Cyrus IMAP Databases in Networked SQL

It has also been suggested (by me, in fact) to have Cyrus IMAP use database formats that other applications within a Kolab Groupware deployment could read from. By having Cyrus IMAP maintain its mailbox, annotations and perhaps even mail folder indexes and caches in a database format like SQL (instead of Berkeley DB or skiplist), these would become available to the 3rd party applications without them having to populate the cache first, and the cache would be updated "automagically". As a result, the level of interaction with Cyrus IMAP over the "inefficient" IMAP protocol would be further reduced ("inefficient" for the task at hand, that is).

However, this would greatly impact the scalability of Cyrus IMAP. It would, in fact, greatly impact the overall performance of Cyrus IMAP as an IMAP server. It's a valid option, but not considered a feasible target because of the projected performance penalties.

Suggestion #3: Use Cyrus IMAP

If you agree it's fair to label having to use the IMAP protocol to get to the required data as the bottleneck, here's the suggestion I have in mind: add a thin, lean and mean, network-enabled, read-only C application to interface between the 3rd party applications and the Cyrus IMAP databases (on the filesystem), thus enabling the 3rd party applications to use a different protocol or query language to obtain the data in a more efficient manner. Perhaps this would look as follows:

Benefits would include that many requirements have already been implemented: locking, networking, database maintenance, threading, thread safety, TLS/SSL, access control through IMAP ACLs and its handling, and more of that stuff. The new application could, presumably, also maintain its own caching capabilities to be even quicker.
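
To give an idea of the kind of question such a read-only interface could answer in a single request, here is a tiny Python sketch of the query logic only. It is purely illustrative: the suggestion above is a C application reading the Cyrus IMAP databases, whereas this sketch works on invented in-memory data, and `Mailbox` and `query_folders` are hypothetical names. The one thing borrowed from IMAP itself is the `l` (lookup) right from IMAP ACLs, used here to decide whether a folder is visible to the user at all.

```python
from dataclasses import dataclass


@dataclass
class Mailbox:
    """Hypothetical, simplified view of one mailbox record."""
    name: str
    acl: dict          # identifier -> rights string, e.g. {"alice": "lrs"}
    annotations: dict  # annotation key -> value


def query_folders(mailboxes, user, key, value):
    """Answer "which folders visible to `user` have annotation `key` set to `value`?" in one call."""
    return sorted(
        m.name for m in mailboxes
        if "l" in m.acl.get(user, "")         # lookup right: the folder is visible to the user
        and m.annotations.get(key) == value   # read-only filter on the annotation
    )


# Invented example data.
mailboxes = [
    Mailbox("user/alice", {"alice": "lrswipkxtecda"}, {"/vendor/kolab/activesync": "true"}),
    Mailbox("user/alice/Calendar", {"alice": "lrswipkxtecda"}, {"/vendor/kolab/activesync": "true"}),
    Mailbox("user/alice/Archive", {"alice": "lrswipkxtecda"}, {"/vendor/kolab/activesync": "false"}),
    Mailbox("user/bob/Calendar", {"bob": "lrswipkxtecda"}, {"/vendor/kolab/activesync": "true"}),
]
print(query_folders(mailboxes, "alice", "/vendor/kolab/activesync", "true"))
```

The point of the sketch is the shape of the interface: the 3rd party application asks one question and gets one answer, with the access check applied on the server side rather than re-implemented in every cache.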

Just some early Saturday morning thoughts... let's see what the rest of the weekend brings.