Squid and Mediawiki
Last Modified: 31-July-2009; 20:48 WST; adrian
This document outlines the Mediawiki extensions for supporting Squid and some of the issues which appear.
Overview
Mediawiki has some basic extensions to support a network of Squid reverse proxies to distribute content and dramatically reduce the wiki server load.
Unfortunately, the default support is not well documented, and it depends on some extensions to the default Squid codebase. An unpatched Squid will seem to work fine but will cache far less content than it should.
Configuration Overview
A typical Mediawiki + Squid install uses one backend Mediawiki+Database server with one or more Squid front-end servers providing access to clients on the internet.
All client traffic goes via the Squid server(s). This includes logged in user traffic, page reads and page writes. Since even a page read of the current version can result in a large amount of work (database access and PHP processing) done behind the scenes, a properly configured reverse proxy will dramatically reduce the server load.
To retain control over content - and because various browsers handle content caching incorrectly - Wikipedia tends to configure its client-facing reverse proxies to issue non-cachable replies, regardless of whether the reverse proxy itself is caching the content.
Object Revalidation
Typically, an HTTP service designed with caching in mind will make it very cheap for downstream clients and intermediaries to revalidate content. Revalidation is where a client or intermediary issues a conditional HTTP request to check whether its local copy is still fresh enough to use, or whether it needs to be replaced with a more up to date version.
If the site owner wishes to retain control over the content - specifically so content can be modified almost at will without stale copies being served to clients - they will configure a low expiry time (on the order of a minute or two) and then rely on cheap object revalidation to save on server resources. Each revalidated object thus saves both bandwidth and local server resources. Everyone wins in this instance - clients and intermediaries benefit from the content being cached where possible, and the site administrators have to deploy fewer servers and need much less bandwidth.
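The revalidation exchange described above can be sketched in Python as follows. This is only an illustration: the `StoredPage` type and the `revalidate` function are hypothetical names, and the sketch assumes the origin identifies each revision with an ETag. The point is that a matching validator lets the origin answer with a bodyless 304 instead of regenerating the page.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class StoredPage:
    """Hypothetical origin-side record for the current revision of a page."""
    etag: str
    body: bytes


def revalidate(page: StoredPage, request_headers: Dict[str, str]) -> Tuple[int, bytes]:
    """Answer a conditional GET: 304 with no body if the caller's copy is current."""
    client_etag: Optional[str] = request_headers.get("If-None-Match")
    if client_etag is not None and client_etag == page.etag:
        return 304, b""      # cheap path: nothing is regenerated or resent
    return 200, page.body    # the caller's copy is stale or missing
```

A cache holding revision "rev-42" would send `If-None-Match: "rev-42"` and receive the 304 as long as the page has not been edited.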
Unfortunately Mediawiki does not currently make revalidating objects very cheap. Consequently, Mediawiki returns content with a large expiry time to the intermediary proxy server - and the proxy thus has to be explicitly told when to remove objects that have been cached.
This object invalidation is typically done using either HTTP PURGE requests or HTCP purge (CLR) messages. Objects are invalidated whenever a user edits a page or uploads a new version of a file.
Mediawiki can call an external script to issue the purge request or it can issue PURGE directly to each of the configured proxy servers. The installation size will dictate the best way to distribute PURGE requests to the reverse proxies.
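As a rough sketch of the "external script" approach, the following Python sends an HTTP PURGE for one URL to each reverse proxy. The function names, the proxy list and the example hostnames are all hypothetical, and this is not the code Mediawiki itself uses. It also assumes each Squid has been configured to accept the PURGE method from the sending host.

```python
import http.client
from typing import List, Tuple
from urllib.parse import urlsplit


def plan_purges(page_url: str, proxies: List[Tuple[str, int]]) -> List[Tuple[str, int, str, dict]]:
    """Return one (host, port, path, headers) entry per proxy for a PURGE of page_url."""
    parts = urlsplit(page_url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    # The Host header names the public site so the proxy can find the cached object.
    headers = {"Host": parts.netloc}
    return [(host, port, path, dict(headers)) for host, port in proxies]


def send_purges(page_url: str, proxies: List[Tuple[str, int]], timeout: float = 2.0) -> List[int]:
    """Send the PURGE requests and return the HTTP status code from each proxy."""
    statuses = []
    for host, port, path, headers in plan_purges(page_url, proxies):
        conn = http.client.HTTPConnection(host, port, timeout=timeout)
        try:
            conn.request("PURGE", path, headers=headers)
            statuses.append(conn.getresponse().status)
        finally:
            conn.close()
    return statuses
```

Sending the requests one after another is fine for a handful of proxies; a larger installation would likely want to fan them out in parallel or use a multicast mechanism instead.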
Logged in Users
Another thing to keep in mind is how Mediawiki handles logged-in users. Since the entire page is being cached by Mediawiki, this will include the "username" section for logged-in users and any notifications which may apply to them.
This means that individual versions of objects must be cached per username or, if possible (and it is not currently clear whether this is doable), content for logged-in users must be marked as not cachable. Since a typical Mediawiki install serves the majority of its content to anonymous users, this scales well. It does mean, however, that a Mediawiki install will not benefit much from caching if the majority of users are logged in. A good example of this is a site which requires Mediawiki authentication to access the majority of its content.
Serving Compressed Objects
A normal Mediawiki install includes configuring the web server to also compress response entities. Clients are able to specify which kind of content encoding they support by listing said types in the Accept-Encoding header in the HTTP request. The web server may then optionally re-encode the content (typically compression) and reply with the relevant encoding in the Content-Encoding header.
As with logged-in users, this results in multiple versions of a page being cached under the same URL, varying on the encoding.
Shortcomings with Squid and HTTP
There are a few shortcomings with Squid and HTTP when it comes to caching and serving content from Mediawiki.
Correctly Caching Vary Content
Vary and User-Agents
In the past, caches would simply not cache content with the Vary header set. This resulted in many compression modules simply not setting the Vary header on compressed responses, hoping that user-agents would happily accept whichever version happened to be cached.
Since the majority of early Vary use was for compression, users for the most part never noticed. But one particular browser (Microsoft IE 5.5 for Macintosh) would not accept a compressed response to a request that had not explicitly advertised compression support. Users would thus occasionally get garbled responses and then invariably blame the web proxy cache involved.
Today, Squid handles caching content with the Vary header set. It will record the headers mentioned in the Vary: response header and further index items based on the contents of those request headers. It is important to note that content is identified as a variant based on the reply, but the request headers sent by the client are used for the cache lookup.
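The lookup described above can be illustrated with a small Python sketch. It is not Squid's actual key format; `variant_key` is a hypothetical helper that shows the idea of combining the URL with the raw values of whichever request headers the reply's Vary header names.

```python
from typing import Dict, Tuple


def variant_key(url: str, vary_header: str,
                request_headers: Dict[str, str]) -> Tuple[str, Tuple[Tuple[str, str], ...]]:
    """Build a cache key from the URL plus the request headers named in Vary."""
    # Header names are case-insensitive, so compare them in lower case.
    lowered = {name.lower(): value for name, value in request_headers.items()}
    names = sorted(n.strip().lower() for n in vary_header.split(",") if n.strip())
    # Header values are used verbatim: any difference yields a different key.
    return url, tuple((name, lowered.get(name, "")) for name in names)
```

Because the values are compared verbatim, "gzip, deflate" and "gzip,deflate" produce different keys in this sketch even though they mean the same thing to the server.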
Shortcomings with Vary
Compressed Content
Since Vary content is cached based on both the request URL and some request header content, Squid relies on the content of said request headers to exactly match.
For example, if the server reply indicates that a response varies on the content of the Accept-Encoding request header, Squid will only serve the cached response to client requests whose Accept-Encoding header matches exactly.
Unfortunately, browsers are not consistent with either the order of allowed encoding or their names. This means that by default Squid may end up caching many copies of the same content. Other commercial caches will shortcut Vary header content and only cache compressed/uncompressed variants - and specifically mark any other variant headers as uncachable.
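The compressed/uncompressed shortcut mentioned above can be sketched in Python. The `encoding_class` function is a hypothetical illustration, not code taken from Squid or any commercial cache, and it deliberately recognises only the "gzip" token: it reduces any Accept-Encoding value to one of two classes so that differently ordered or differently spelled headers share a cached variant.

```python
def encoding_class(accept_encoding: str) -> str:
    """Collapse an Accept-Encoding value to either 'gzip' or 'identity'."""
    for item in accept_encoding.split(","):
        token, _, params = item.strip().partition(";")
        if token.strip().lower() != "gzip":
            continue
        # An explicit q=0 means the client refuses gzip.
        q = 1.0
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # unparseable weight: treat as refused in this sketch
        if q > 0:
            return "gzip"
    return "identity"
```

With this reduction, "gzip, deflate" and "deflate,GZIP;q=0.8" both map to the gzip variant, so at most two copies of each page need to be stored.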
Authenticated Mediawiki Content
To make matters worse, Mediawiki also caches content based on the logged in user details. Since it uses cookies to track user info, it also marks content as varying on "Cookie". This will end up marking a large part of the content as uncachable - not only may the cookie details differ based on username, but there may be other cookies sent with the request that do not at all have anything to do with Mediawiki authentication!
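One way to picture the problem is to compare varying on the whole Cookie header with varying only on the cookies that matter. The Python sketch below is an illustration only; `cookie_variant` is a hypothetical helper and the cookie names in the usage note are made up, not the names Mediawiki actually sets.

```python
from typing import Iterable, Tuple


def cookie_variant(cookie_header: str, relevant_names: Iterable[str]) -> Tuple[Tuple[str, str], ...]:
    """Keep only the cookies that affect the page, sorted so their order is irrelevant."""
    wanted = set(relevant_names)
    kept = []
    for pair in cookie_header.split(";"):
        name, sep, value = pair.strip().partition("=")
        if sep and name in wanted:
            kept.append((name, value))
    return tuple(sorted(kept))
```

With relevant names such as "wikiUserName" and "wiki_session" (hypothetical), a request carrying only unrelated tracking cookies reduces to the same empty key as a request with no cookies at all, so both can share the anonymous copy of the page.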
Purging Objects
Squid has supported purging objects from the cache for a number of years. Until recently however, this has only supported purging a specific object - it did not attempt to handle purging all the Vary objects that may occur for the given URL.
Again, since Mediawiki will be sending at least a compressed and uncompressed variant of any given object to the user, purging one must result in purging of all other variants in order for users to have a consistent view of the site content.
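The requirement can be illustrated with a toy cache in Python. This is not how Squid stores objects; `VariantCache` is a hypothetical structure that simply groups every variant under its URL so that one purge removes them all.

```python
from collections import defaultdict
from typing import Dict, Hashable, Optional


class VariantCache:
    """Toy cache that groups every variant of a URL under that URL."""

    def __init__(self) -> None:
        self._store: Dict[str, Dict[Hashable, bytes]] = defaultdict(dict)

    def put(self, url: str, variant: Hashable, body: bytes) -> None:
        self._store[url][variant] = body

    def get(self, url: str, variant: Hashable) -> Optional[bytes]:
        # dict.get does not create an entry, so lookups never grow the store.
        return self._store.get(url, {}).get(variant)

    def purge(self, url: str) -> int:
        """Drop all variants of url and return how many were removed."""
        return len(self._store.pop(url, {}))
```

If the compressed and uncompressed copies were purged independently, a client could still be handed the old revision in one encoding after an edit; purging by URL avoids that.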
Response header manipulation
A common Mediawiki configuration includes attempting to make content as uncachable by proxies and users as possible - primarily to ensure that page modifications are quickly visible to users.
There is a draft extension to HTTP - Surrogate extensions - which allows an origin server to mark responses with separate caching information for the reverse proxies and for clients. For example, the surrogate control headers could indicate to the reverse proxies that content is cachable, but indicate to clients that the content is immediately stale and requires revalidation.
At the time of deployment, the stable release of Squid (Squid-2) did not support these surrogate headers. Thus, response headers need to be explicitly modified as they are returned to the client.
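The kind of header rewriting this implies can be sketched in Python. The `headers_for_client` function and the particular Cache-Control value are assumptions chosen for illustration; in a real deployment the rewrite would be done in the proxy's configuration rather than in application code.

```python
from typing import Dict


def headers_for_client(origin_headers: Dict[str, str]) -> Dict[str, str]:
    """Copy the origin headers, replacing caching headers with must-revalidate values."""
    drop = {"expires", "cache-control"}
    out = {k: v for k, v in origin_headers.items() if k.lower() not in drop}
    # Assumed policy: clients may keep the page but must revalidate on every use.
    out["Cache-Control"] = "private, max-age=0, must-revalidate"
    return out
```

The proxy keeps using the origin's original headers for its own cached copy; only the copy of the headers sent downstream is changed.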
Mediawiki Specific Squid Configuration
The shortcomings in general HTTP and Squid have been addressed by the Mediawiki group. These patches are for now specific to Mediawiki and should not be used in a general-purpose proxy/cache.
Caching Vary Objects
The Mediawiki group provides a patch which allows Squid to use only specific parts of the Vary headers when determining the Vary key.
The patch introduces a new header, X-Vary-Options, which Mediawiki will use to implement caching variants based on the encoding type (Compressed or Uncompressed) and the particular username which is logged in.
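Since the patch is not documented here, the following Python is only a guess at the general idea and should not be read as a description of what the patch does. It assumes the header value is a comma-separated list of request header names, each followed by semicolon-separated options of the form "list-contains=token" or "string-contains=text", and it assumes a particular way of turning matches into a key. The cookie names in the usage note are hypothetical.

```python
from typing import Dict, List, Tuple

Options = Dict[str, List[Tuple[str, str]]]


def parse_xvo(value: str) -> Options:
    """Parse 'Header;op=arg;op=arg,Header2;...' into {header: [(op, arg), ...]}."""
    parsed: Options = {}
    for entry in value.split(","):
        parts = [p.strip() for p in entry.split(";") if p.strip()]
        if not parts:
            continue
        rules = []
        for part in parts[1:]:
            op, _, arg = part.partition("=")
            rules.append((op.strip(), arg.strip()))
        parsed[parts[0].lower()] = rules
    return parsed


def xvo_key(value: str, request_headers: Dict[str, str]) -> Tuple[Tuple[str, str], ...]:
    """Reduce each named request header to the part the options care about."""
    lowered = {k.lower(): v for k, v in request_headers.items()}
    key = []
    for header, rules in sorted(parse_xvo(value).items()):
        raw = lowered.get(header, "")
        tokens = {t.strip().partition(";")[0].strip().lower() for t in raw.split(",")}
        component = raw if not rules else ""  # no options: behave like plain Vary
        for op, arg in rules:
            if op == "list-contains" and arg.lower() in tokens:
                component = arg.lower()  # all clients listing the token share a variant
            elif op == "string-contains" and arg in raw:
                component = raw          # assumed: a match makes the variant user-specific
        key.append((header, component))
    return tuple(key)
```

Under these assumptions, a value such as "Accept-Encoding;list-contains=gzip,Cookie;string-contains=wikiToken" would give every anonymous gzip-capable client the same key regardless of header ordering or unrelated cookies, while a request carrying a wikiToken cookie would get its own variant.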
TODO: document exactly what the patch implements as it really isn't documented anywhere!
The current patch is available here. Please note it introduces a new configuration option - --enable-vary-options - which is not on by default. This option enables the X-Vary-Options parsing. Finally, it only patches configure.in and not configure - thus bootstrap.sh will need to be run to rebuild the configure script.
PURGE and Vary Objects
The development version of Squid-2 does have some support for PURGE and variant objects but this is not yet available for the stable Squid-2.7 release. Instead, there is a Mediawiki specific patch which implements enough functionality for the wiki framework.
The current patch is available here.
Default validation and expiry information
By default, a lot of the non-HTML content generated by Mediawiki (images, CSS, Javascript) is cachable but requires constant revalidation. It may prove useful to override the default revalidation information for CSS template images, CSS and Javascript files so the Mediawiki backend is not constantly loaded with conditional/revalidation HTTP requests.
Verification Checklist
The following is a brief overview to verify that Squid is properly caching Mediawiki content.
The easiest way to follow the access.log content is to filter on mime type. For example, to watch only HTML pages from Mediawiki, try "tail -f access.log | grep 'text/html'".
- Browse normal anonymous wiki content. The access.log entries should start out as TCP_MISS but soon change to TCP_HIT, TCP_REFRESH_HIT or TCP_IMS_HIT.
- Try browsing the same anonymous wiki content from a different browser. Be sure that these accesses are HITs straight away - Squid should serve up the already cached content regardless of what the browser is sending in the Accept-Encoding or Cookie headers. If the new browser instead sees a MISS first and HITs only on subsequent accesses, then please double-check that you have correctly enabled the Squid integration in Mediawiki and that you have correctly compiled in the X-Vary-Options support.
- TODO: add entry for verifying PURGE functionality.
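For a less manual check than watching the log scroll past, the result codes can be tallied with a short Python script. This sketch assumes Squid's native access.log format, in which the fourth whitespace-separated field is the result code and HTTP status (for example TCP_HIT/200) and the tenth is the mime type; `count_results` is a hypothetical helper name.

```python
from collections import Counter
from typing import Iterable


def count_results(lines: Iterable[str], mime_type: str = "text/html") -> Counter:
    """Count Squid result codes (TCP_HIT, TCP_MISS, ...) for one mime type."""
    counts: Counter = Counter()
    for line in lines:
        fields = line.split()
        # Skip short or malformed lines and anything of another mime type.
        if len(fields) < 10 or fields[9] != mime_type:
            continue
        counts[fields[3].split("/")[0]] += 1
    return counts
```

It can be run over a log with something like `count_results(open("access.log"))`; after some anonymous browsing, the hit codes should clearly outnumber TCP_MISS.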
