Applications that we integrate with Kolab Groupware have a genuine need for caching. But what is it exactly that causes this need for caching?
Let's first think about why one caches in the first place. Caching is usually implemented to eliminate a bottleneck and boost performance. Data can then be obtained from the relatively quick cache -it is "close by", and it usually understands and is optimized for some form of querying- as opposed from the relatively slow original source of the data.
With Kolab, using Cyrus IMAP as the backend storage for all groupware related information like Email, Events and Contacts, it can hardly be argued the IMAP server does not perform up to specifications. Cyrus IMAP is extremely fast and scalable; it is, arguably, just as lean and mean as a cache would be.
Yet, the following topology does introduce and work around a bottleneck, seriously impacting and later improving performance. I'm off to explore what exactly is the bottleneck, and how to work around it.
We'll use Z-Push (Free Software ActiveSync implementation) in a regular, current scenario workflow as an example. The following happens, justifying caching, when a mobile device requests synchronization;
Using the IMAP protocol, this means;
Naturally, the last step is what Z-Push is in charge of, in this example. and It does have certain characteristics and interactions with Cyrus IMAP as well as with its own caches to optimize the performance, scalability and user experience.
The former notwithstanding, this is just one application integrated with Kolab responsible of maintaining its own cache. In current generations of Kolab, Horde Webmail does the exact same but using a different cache, and future generations of Kolab will include RoundCube, which again also maintains its own cache.
Maintaining a cache per 3rd party application integrated with Kolab isn't necessarily the most sustainable route to go. Feasible? Yes. Sustainable? Perhaps not. Let's take one step back and look at the bigger picture again;
Presumably, the interaction between Cyrus IMAP and its storage can not be optimized further (which the dotted double arrow is supposed to indicate). Not without intrusive changes at the very least, that is to say, while admittedly I'm unaware of our options to further increase performance in this part of the flow of information. If you have ideas and the necessary experience, let me know and I can get you hooked up.
It is, perhaps, the IMAP protocol used in between the Cyrus IMAP server and the 3rd party application that is the bottleneck. For example, Z-Push cannot do the following over IMAP, eliminating a number of iterations and sequences issuing IMAP commands;
SELECT folder FROM folders
INNER JOIN annotations ON folders.id = annotations.folder_id
WHERE annotations.key = '/vendor/kolab/activesync' AND annotations.value = 'true';
Hey, this does somewhat represent what it does against the cache it maintains, having obtained the information over IMAP once (slow) it uses its cache to obtain the information (fast) in a number of subsequent synchronizations -limited to an expiry interval, expiring and updating cache, of course. This builds us the following picture, where IMAP is "slow" for the task at hand, and SQL is fast;
Ignoring the interaction that Cyrus IMAP requires with the filesystem as being a negligible performance penalty, and focussing on how the 3rd party application wishes to optimize performance, apparently it would rather perform caching (cheap), then it would want to interact over the IMAP protocol (expensive).
It has been suggested, since most if not all of the 3rd party applications integrated with Kolab would require some form of caching, we create "one cache to rule them all";
Although probably this is in fact entirely feasible, the following constraints to such architecture come to mind;
We (within the Kolab community) regularly refer to the "one cache to rule them all" as "server-side akonadi" - currently the very efficient client-side (offline) caching in our primary smart client, Kontact.
It has also been suggested (by me, in fact), to have Cyrus IMAP use database formats other applications within a Kolab Groupware deployment could read from. By having Cyrus IMAP maintain its mailbox, annotations and perhaps even mail folder indexes and caches in a database format like SQL (instead of Berkely or skiplist), these would become available to the 3rd party applications without them having to populate the cache first, and the cache would be updated "automagically"; as a result, the level of interactions with Cyrus IMAP over the "inefficient" IMAP protocol would be further reduced -"inefficient" for the task at hand, that is.
However, this would greatly impact the scalability of Cyrus IMAP. It would, in fact, greatly impact the overall performance of Cyrus IMAP as an IMAP server. It's a valid option, but not considered feasible targetting for because of the projected performance penalties.
If you agree it's fair to label having to use the IMAP protocol to get to the data required as the bottleneck, here's the suggestion I have in mind; Add a thin, lean and mean, network-enabled, read-only C application to interface between the 3rd party applications and the Cyrus IMAP databases (on the filesystem), thus enabling the 3rd party applications to use a different protocol or querying language to obtain the data in a more efficient manner. Perhaps this would look as follows:
Benefits would include many requirements have already been implemented; Locking, networking, database maintenance, threading, thread safety, TLS/SSL, access control though IMAP ACLs and its handling and more of that stuff. The new application could, presumably, also maintain its own caching capabilities to be even quicker.
Just some early Saturday morning thoughts... let's see what the rest of the weekend brings.