Get on the scene
29 Nov 2006 – 11:40 by Michael Daum
Today most sites use content engines that produce highly dynamic content by assembling different objects into the HTML that is finally sent to your browser. This makes caching on the server side a must. As wikis grow from simple one-page services into the territory of content management systems, caching becomes increasingly attractive here too. Last but not least, a wiki has many more lurkers and page hoppers than contributors, so there's no need to recompute the same pages again and again.
Client-side caches, as opposed to server-side caches, try to cache as close to the browser as possible in order to reduce bandwidth congestion early. Such a cache gets a URL and returns a full HTML page. The only additional knowledge it has is the time at which the page is supposed to expire. There are also server-side caches that work along the same lines, called reverse caches.
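
A cache of this kind can be sketched in a few lines of Python. This is only an illustration of the idea "URL in, HTML out, plus an expiry time"; the names ExpiringPageCache and fetch, and the injectable clock, are assumptions made for the example and do not come from any particular product.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class ExpiringPageCache:
    """Maps a URL to (html, expiry time); it knows nothing else about a page."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock  # injectable so that expiry can be tested
        self._store: Dict[str, Tuple[str, float]] = {}

    def put(self, url: str, html: str, ttl_seconds: float) -> None:
        self._store[url] = (html, self._clock() + ttl_seconds)

    def get(self, url: str) -> Optional[str]:
        entry = self._store.get(url)
        if entry is None:
            return None
        html, expires_at = entry
        if self._clock() >= expires_at:
            # Expired: drop the entry and report a miss.
            del self._store[url]
            return None
        return html


def fetch(cache: ExpiringPageCache, url: str,
          render: Callable[[str], str], ttl_seconds: float = 60.0) -> str:
    """Return the cached page, rendering and storing it on a miss."""
    html = cache.get(url)
    if html is None:
        html = render(url)
        cache.put(url, html, ttl_seconds)
    return html
```

Note that nothing in this sketch can tell whether the page changed before its expiry time; until then the cache simply keeps returning the stored copy.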

The central task of a cache is to maintain its integrity, that is, never to return outdated information. In practice caches tend to sacrifice a bit of integrity to get any caching effect at all. The problem is that the cache and the content source don't talk to each other. If the content source has an update, it should notify the cache so that the cache can actively invalidate part of its store. To my knowledge there's no third-party cache that implements such an API. Nevertheless, there are third-party caching solutions with very good APIs for integrating them into your own software. Two of them are Memcached and Varnish. While Memcached has bindings for several (scripting) languages, Varnish comes with its own scripting language. Memcached is more of a multi-purpose cache for any kind of storable object; Varnish is a highly optimized server-side cache restricted to URL-based caching.
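
To make the idea of a content source that notifies its cache concrete, here is a minimal Python sketch. Both classes are hypothetical; they are not the API of Memcached, Varnish or any wiki engine, and the "rendering" is a stand-in for real page generation.

```python
from typing import Callable, Dict, List


class NotifyingStore:
    """Hypothetical content source that announces every update to listeners."""

    def __init__(self) -> None:
        self._topics: Dict[str, str] = {}
        self._listeners: List[Callable[[str], None]] = []

    def subscribe(self, listener: Callable[[str], None]) -> None:
        self._listeners.append(listener)

    def save(self, name: str, text: str) -> None:
        self._topics[name] = text
        for listener in self._listeners:
            listener(name)  # tell every cache which topic changed

    def read(self, name: str) -> str:
        return self._topics[name]


class InvalidatingCache:
    """Page cache that drops an entry as soon as the store reports a change."""

    def __init__(self, store: NotifyingStore) -> None:
        self._store = store
        self._pages: Dict[str, str] = {}
        self.renders = 0  # counts how often a page had to be recomputed
        store.subscribe(self.invalidate)

    def invalidate(self, name: str) -> None:
        self._pages.pop(name, None)

    def page(self, name: str) -> str:
        if name not in self._pages:
            self.renders += 1
            self._pages[name] = "<p>%s</p>" % self._store.read(name)
        return self._pages[name]
```

Here a page is recomputed only after the store has announced a change, so no expiry time is needed and no outdated page is served. The sketch invalidates only the topic that was saved, which is too little as soon as pages depend on one another.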

Frankly, notifying the cache so that it performs an invalidation is no big conceptual problem. The hard bit is deciding when to invalidate and which parts of the cache to invalidate. The task of “dependency tracking” is to record which objects were needed to compute which other objects. Based on this information, dependencies are “fired” to recursively invalidate cached information. As you can imagine, one page can depend on a lot of other information spread all over the place. For example, the same page may look different for each user due to user preferences, URL parameters or session values. A wiki page may depend on a WikiWord not being defined yet, as the link to it is rendered differently depending on whether the target exists. While these cases can be detected automatically, there are also cases that the engine can't detect. Imagine a RandomPlugin that generates random placeholder text. A page using such a feature is basically not cacheable, as content materializes out of nowhere. More realistic examples are plugins that integrate external data sources. Here the knowledge about the content source is part of an external system, so that system is responsible for establishing dependencies to the objects that are constructed on its basis.
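
Dependency tracking with recursive firing can be sketched as follows. This is a simplified Python illustration under assumed names (DependencyCache, put, fire, and key prefixes such as "text:" and "exists:"); it is not the implementation of any existing engine.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, Iterable, Optional, Set


class DependencyCache:
    """Cache that records what each entry was computed from."""

    def __init__(self) -> None:
        self._values: Dict[str, str] = {}
        # source key -> keys of the cached objects computed from it
        self._dependents: DefaultDict[str, Set[str]] = defaultdict(set)

    def put(self, key: str, value: str, depends_on: Iterable[str] = ()) -> None:
        self._values[key] = value
        for source in depends_on:
            self._dependents[source].add(key)

    def get(self, key: str) -> Optional[str]:
        return self._values.get(key)

    def fire(self, key: str) -> Set[str]:
        """Invalidate key and, transitively, everything computed from it."""
        invalidated: Set[str] = set()
        pending = [key]
        while pending:
            current = pending.pop()
            if current in invalidated:
                continue  # already handled; also guards against cycles
            invalidated.add(current)
            self._values.pop(current, None)
            pending.extend(self._dependents.pop(current, ()))
        return invalidated
```

A page that links to a not yet existing WikiWord would, for instance, be stored with a dependency such as "exists:NewTopic", and creating that topic would fire this key. The sketch never removes stale edges recorded under other sources, so it errs on the side of invalidating too much rather than too little.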

Dependency tracking is nearly impossible in cases where there's no a priori knowledge at hand, for example a page showing query results. The page is assembled from its search results and thus depends on the items found. It can become invalid either because an item no longer matches the query, or because a new item comes into existence that now matches it. This is bad news, as search queries are the most expensive and most valuable operation in any content engine. However, things are not that bad for caching: data does not change that frequently, and we can present the same results as long as we know that no data has changed. Think of it as one 1:n dependency instead of a couple of 1:1 dependencies between different objects.
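
One way to read the 1:n dependency is that a cached query result depends on the data set as a whole rather than on individual items. The Python sketch below assumes a hypothetical store with a version counter that is bumped on every write; a cached result is reused only while that counter is unchanged.

```python
from typing import Dict, List, Tuple


class VersionedIndex:
    """Hypothetical content store whose version changes on every write."""

    def __init__(self) -> None:
        self.version = 0
        self._items: Dict[str, str] = {}

    def save(self, name: str, text: str) -> None:
        self._items[name] = text
        self.version += 1

    def delete(self, name: str) -> None:
        if self._items.pop(name, None) is not None:
            self.version += 1

    def search(self, term: str) -> List[str]:
        # Stand-in for an expensive query: names of items containing the term.
        return sorted(n for n, t in self._items.items() if term in t)


class QueryCache:
    """Caches search results together with the data version they were built on."""

    def __init__(self, index: VersionedIndex) -> None:
        self._index = index
        self._results: Dict[str, Tuple[int, List[str]]] = {}
        self.searches = 0  # counts real (uncached) searches

    def search(self, term: str) -> List[str]:
        cached = self._results.get(term)
        if cached is not None and cached[0] == self._index.version:
            return cached[1]  # no data has changed since this was computed
        self.searches += 1
        hits = self._index.search(term)
        self._results[term] = (self._index.version, hits)
        return hits
```

Any write invalidates every cached query, including those it could not have affected. That is coarse, but it needs no knowledge about which items a query might match in the future.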

Yet another type of dependency is content that changes over time without any external input, e.g. a clock. You really don't want to cache anything here, the same way as it is pointless to cache random numbers on a page using your shiny new RandomPlugin. From the angle of a cache this type of content is pure “dirt”. Still, while the rest of the page stays quite static for a while, some areas in it may have to be filled with up-to-date information on every request. A way to deal with this is so-called “dirty areas”, which is nothing offensive but just a way to prevent the cache from being trashed with information it can't cache at all. In a way, cached pages that have dirty areas in them are rather similar to very specific templates.
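
Treating a cached page as a template with holes could look roughly like this in Python. The marker syntax <dirty:name/> and the function fill_dirty_areas are invented for this sketch; a real engine would use its own markup.

```python
import re
from typing import Callable, Dict

# Hypothetical marker left in the cached page where a dirty area belongs.
DIRTY_AREA = re.compile(r"<dirty:(\w+)/>")


def fill_dirty_areas(cached_page: str,
                     producers: Dict[str, Callable[[], str]]) -> str:
    """Fill every dirty area of a cached page with freshly computed content."""

    def replace(match: "re.Match[str]") -> str:
        # Called once per marker, on every request; the result is never cached.
        return producers[match.group(1)]()

    return DIRTY_AREA.sub(replace, cached_page)
```

The static part of the page is computed once and stored with the markers still in place; only the small producer functions run on each request.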

Alright, so far I have mused about caching in a rather general way, motivating the typical problems and outlining what can be done about them. In one of my next postings I will introduce you to the TWiki::Cache that has been implemented recently.


r8 – 27 Feb 2007 – 23:47:09 – Main.MichaelDaum
Copyright © 1999-2008 WikiRing Partnership – Contact us