Web server cache warming by crawling sitemaps

Background
Hosting a website on a home server, especially one with low traffic, can result in long response times when retrieving pages. Intuitively, it seems like high traffic should slow a server's response, and that a server with no CPU load should return web pages faster. But high-traffic servers have a cache advantage: when a user visits a page that has recently been served, that page is still fresh in memory. This is a definite win over having the web server retrieve the page from a hard drive, which can take many orders of magnitude longer in IO time [1].

The low-traffic home server is at a disadvantage because it serves little content. The CPU is mostly idle and pages aren't frequently accessed, which decreases the probability that a page will already be in memory when a user requests it. The result is cache misses, forcing data to be retrieved from disk; the cache is said to be “cold”. As an example of how bad this can be, this site takes 1-2 seconds longer to load right after a reboot, when all web data must be read from disk.

There are many solutions to this. Notable methods include: allocating the entire web directory to a ramdisk [2]; enabling in-memory caching in Apache [3]; opcode caching for server-side languages like PHP [4]; and setting expiration dates (Expires headers) to allow caching in the client's browser [5].
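As a rough sketch of the first option, a tmpfs ramdisk can be mounted over the web root; the paths and size below are placeholders, and the content has to be copied back to persistent storage before shutdown or it is lost:

# mount a 256 MB tmpfs over the web root (size and paths are examples)
sudo mount -t tmpfs -o size=256m tmpfs /var/www/html
# repopulate it from a persistent copy kept on disk
sudo rsync -a /var/www/html-persistent/ /var/www/html/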

I use opcode caching and Expires headers to speed up page loads; both are very easy to implement. But they don't ensure the web data itself is in memory, or at least that there is a decent probability it is. For my ancient server, retrieving data from disk is the most time-consuming part of a page load. Further, I don't want caching to claim memory the way a dedicated process would; the operating system's disk cache can be leveraged to provide a cache that is effectively optional. That is, if I run a bunch of memory-intensive code in the background, the operating system will automatically prioritize it over keeping web data in memory, since cached data is regarded as free. This is demonstrated below by vmstat (run with megabyte units, e.g. vmstat -S M), where the cache column indicates that 101 MB of disk data is held in memory yet still counts as free memory (see [6]).

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0     43    573    141    101    0    0     5    20   25   18  1  1 90  7
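The practical impact of this cache is easy to see with a quick experiment (the file path below is just an example): reading the same file twice shows the second read being served from memory rather than disk.

# first read comes from disk and is slow on old hardware
time cat /var/www/html/index.html > /dev/null
# second read is served from the operating system's page cache
time cat /var/www/html/index.html > /dev/null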

Implementation
One way to leverage the operating system cache as described above is to crawl the website. Crawling the site forces the server to load every resource from disk. If there is sufficient free RAM and the site is small, like most blogs, the entire set of web data will sit in the operating system cache. In the context of serving web content this is called cache warming, which is analogous to user-guided anticipatory paging [7] or pre-fetching.

Crawling a site is easy if you have a sitemap. Several programs can generate one automatically, including plugins specific to blogging platforms like WordPress [8]. Mine is at erikriffs.com/sitemap.xml, where the XML content is visible when viewing the page source. Since sitemaps have a standardized format, the following command can be used to grab the URL of every page to crawl:

wget -O - erikriffs.com/sitemap.xml | grep -E -o '<loc>.*</loc>' | sed -e 's/<loc>//g' -e 's/<\/loc>//g'

Output:

[email protected] ~ $ wget -O - erikriffs.com/sitemap.xml | grep -E -o '<loc>.*</loc>' | sed -e 's/<loc>//g' -e 's/<\/loc>//g'
--2012-04-02 14:29:38--  http://erikriffs.com/sitemap.xml
Resolving erikriffs.com... 98.234.52.166
Connecting to erikriffs.com|98.234.52.166|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1641 (1.6K) [application/xml]
Saving to: `STDOUT'

100%[=======================================>] 1,641       --.-K/s   in 0s

2012-04-02 14:29:39 (93.0 MB/s) - written to stdout [1641/1641]

http://erikriffs.com/
http://erikriffs.com/latex-ieee-bibliography-error/
http://erikriffs.com/index/
http://erikriffs.com/matlab-parfor-performance/

The resulting list of URLs can then be piped back into wget to download every page along with the resources it requires, such as CSS, JS, and images.

wget -O - erikriffs.com/sitemap.xml | grep -E -o '<loc>.*</loc>' | sed -e 's/<loc>//g' -e 's/<\/loc>//g' | wget -i - -p -r --delete-after
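For readability, here is the same pipeline written out as a small script with each stage commented; the script name is arbitrary, and the added -q flags just silence wget's progress output:

#!/bin/sh
# warm-cache.sh: crawl the sitemap so page data lands in the OS disk cache.
# 1. dump the sitemap to stdout
# 2. keep only the <loc>...</loc> entries
# 3. strip the tags, leaving one bare URL per line
# 4. fetch every URL plus its page requisites (-p), recursively (-r),
#    deleting the downloaded files once fetched (--delete-after)
wget -q -O - erikriffs.com/sitemap.xml \
  | grep -E -o '<loc>.*</loc>' \
  | sed -e 's/<loc>//g' -e 's/<\/loc>//g' \
  | wget -q -i - -p -r --delete-after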

This one-liner keeps web data fresh in the cache, but only temporarily: as other activity occurs on the system, the amount of site content held in memory will slowly decrease. Adding the crawling one-liner to a crontab (via crontab -e) provides hourly cache warming. It is also a far simpler alternative to a PHP program, and unlike this Drupal example, it will crawl images and other dependent resources (that author is missing the -p flag in wget).
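For example, a crontab entry along these lines (added via crontab -e) warms the cache at the top of every hour; the schedule and the choice to discard output are just one reasonable setup:

# m h dom mon dow  command
0 * * * * wget -q -O - erikriffs.com/sitemap.xml | grep -E -o '<loc>.*</loc>' | sed -e 's/<loc>//g' -e 's/<\/loc>//g' | wget -q -i - -p -r --delete-after > /dev/null 2>&1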

Notes:

  1. http://en.wikipedia.org/wiki/Memory_hierarchy#Application_of_the_concept
  2. http://www.cyberciti.biz/faq/howto-create-linux-ram-disk-filesystem/
  3. http://httpd.apache.org/docs/2.2/caching.html
  4. http://blog.digitalstruct.com/2008/02/27/php-performance-series-caching-techniques/
  5. http://httpd.apache.org/docs/2.2/mod/mod_expires.html
  6. http://stackoverflow.com/questions/6345020/linux-memory-buffer-vs-cache
  7. http://en.wikipedia.org/wiki/Paging#Anticipatory_paging
  8. http://wordpress.org/extend/plugins/google-sitemap-generator/

8 Responses to “Web server cache warming by crawling sitemaps”

  1. brad

    I have a drupal site with xmlsitemap installed. I tried to run your one-liner and got some errors. I was wondering if you could look at the output and let me know why it is doing that?

    [[email protected] ~]# wget -O - http://bytesofweb.com/sitemap.xml | grep loc | sed -e 's/<loc>//g' -e 's/<\/loc>//g' | wget -i - -p --delete-after
    --2012-07-27 11:36:38--  http://bytesofweb.com/sitemap.xml
    Resolving bytesofweb.com... 108.59.249.94
    Connecting to bytesofweb.com|108.59.249.94|:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 548 [text/xml]
    Saving to: `STDOUT'

    100%[========================================================================================================================================================================>] 548  --.-K/s   in 0s

    2012-07-27 11:36:41 (55.0 MB/s) - `-' saved [548/548]

    -: Invalid URL http://bytesofweb.com/daily1.0: Unsupported scheme
    -: Invalid URL http://bytesofweb.com/node/22012-07-25T18:54Zhourly1.0: Unsupported scheme
    -: Invalid URL http://bytesofweb.com/node/52012-07-25T18:55Zhourly1.0: Unsupported scheme
    No URLs found in -.
    You have new mail in /var/spool/mail/root

  2. erik

    Oops, didn’t see this earlier. Your comment was flagged because it has multiple links :)

    If you haven’t cracked it already, it’s because my sitemap has different line-break placement, like so:

    <loc>http://erikriffs.com/ </loc>
    <lastmod>2012-04-03T02:56:16+00:00</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
    

    compared to yours:

    <loc>http://bytesofweb.com/ </loc><changefreq>daily</changefreq><priority>1.0</priority>

    Modifying grep to

    grep -E -o '<loc>.*</loc>'

    does the trick and should work for the majority of sitemaps. So all together, this should work for you:

    wget -O - http://bytesofweb.com/sitemap.xml | grep -E -o '<loc>.*</loc>' | sed -e 's/<loc>//g' -e 's/<\/loc>//g' | wget -i - -p --delete-after

    I updated the original post to use this because it’s more robust; nice catch.

  3. Simon Brown

    I’m getting a strange error which I’m not technical enough to understand:
    sed: -e expression #2, char 0: no previous regular expression

    Do you know what this means?

  4. erik

    Ah that’s my fault–sorry! (due to a plugin update on my end, the <loc> tags weren’t being escaped).

    I have updated the post to properly display the commands–you shouldn’t get the sed error anymore.

  5. Mike A

    I have a Magento eCommerce site. I set up a cron job using the command as shown, and I get a 414 error (Request-URI Too Long). Is there any way to fix this?

  6. erik

    I assume you get the 414 during the 2nd wget (the one that does the crawling/fetching)? You might have some code that keeps appending parameters to the URLs, resulting in a 414, so maybe try removing the "-r" flag from the 2nd wget or adding "--level=2".
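    For example (with a placeholder domain), the non-recursive variant would be:

    wget -O - http://example.com/sitemap.xml | grep -E -o '<loc>.*</loc>' | sed -e 's/<loc>//g' -e 's/<\/loc>//g' | wget -i - -p --delete-after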

  7. Bas

    Hello!

    Saw your great article!
    I’m using the command:

    wget -O - http://bytesofweb.com/sitemap.xml | grep -E -o '<loc>.*</loc>' | sed -e 's/<loc>//g' -e 's/<\/loc>//g' | wget -i - -p --delete-after

    For my own site, but when it runs, folders are also created in the root of the website.

    How can I delete them? They are not needed, only used for crawling, right?
