Hosting a website on a home server, especially one with low traffic, can result in long response times when retrieving pages. Intuitively, it seems that high traffic should slow a server down, and that a server with no CPU load should be able to return web pages faster. But high-traffic servers have a cache advantage: when a user visits a page that was recently served, that page is still fresh in memory. This is a definite bonus over having the web server retrieve the page from a hard drive, which can take several orders of magnitude longer in I/O time [1].
The low-traffic home server is at a disadvantage because it serves little content. The CPU is mostly idle and pages aren’t accessed frequently, which lowers the probability that a page is already in memory when a user requests it. The result is a cache miss, and the data has to be retrieved from disk. In this state the cache is said to be “cold”. As an example of how bad this can be, this site takes 1-2 seconds longer to load right after a reboot, when all web data has to be read from disk.
There are many solutions to this. Notable methods are: placing the entire web directory on a RAM disk [2]; enabling memory caching in Apache [3]; object code (or opcode) caching for server-side languages like PHP [4]; and setting expiration dates (expire headers) to allow caching in the client’s browser [5].
I use opcode caching and expire headers to speed up page loads, and both are very easy to implement. But neither ensures that web data is in memory, or at least that there is a decent probability that pages are in memory. For my ancient server, retrieving data from disk is the most time-consuming part of a page load. Further, I don’t want caching to hold on to memory the way a process does; the operating system’s disk cache can be leveraged to provide a cache that is optional. That is, if I’m running a bunch of memory-intensive code in the background, the operating system will automatically prioritize it over keeping web data in memory, since cached data is regarded as free. This is demonstrated below by the output of vmstat -S M (values in megabytes), where the cache column indicates that 101 MB of disk data is in memory but is still treated as free memory (see [6]).
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0     43    573    141    101    0    0     5    20   25   18  1  1 90  7
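Roughly the same figure can be read from /proc/meminfo, which is easier to use in scripts than vmstat's table. The following is a minimal sketch, assuming a Linux system; the function name cached_mb and its optional file argument are hypothetical, added only so the function can be pointed at a saved copy of the file.

```shell
# Print the size of the page cache in whole megabytes.
# Reads the "Cached:" line of /proc/meminfo (reported in kB) by default;
# a different file in the same format may be passed as the first argument.
cached_mb() {
    awk '/^Cached:/ { printf "%d\n", $2 / 1024 }' "${1:-/proc/meminfo}"
}

# Usage:
#   cached_mb
```

Running it before and after a crawl gives a quick indication of how much data the crawl pulled into the cache.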
One way to leverage the operating system cache as described above is to crawl the website. Crawling the site forces the server to load all resources from the disk. If there is sufficient free RAM and the site is small, like most blogs, the entirety of the web data will sit in the operating system cache. In the domain of serving web content, this can be called cache warming, which is analogous to user-guided anticipatory paging [7] or pre-fetching.
Crawling a site is easy if you have a sitemap. Several programs can generate one automatically, including blog-specific generators such as the one for WordPress [8]. Mine is at erikriffs.com/sitemap.xml, where the XML content is visible when viewing the page source. Since sitemaps have a standardized format, the following command can be used to grab the URL of every page to crawl:
wget -O - erikriffs.com/sitemap.xml | grep -E -o '<loc>.*</loc>' | sed -e 's/<loc>//g' -e 's/<\/loc>//g'
~ $ wget -O - erikriffs.com/sitemap.xml | grep -E -o '<loc>.*</loc>' | sed -e 's/<loc>//g' -e 's/<\/loc>//g'
--2012-04-02 14:29:38--  http://erikriffs.com/sitemap.xml
Resolving erikriffs.com... 22.214.171.124
Connecting to erikriffs.com|126.96.36.199|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1641 (1.6K) [application/xml]
Saving to: `STDOUT'

100%[=======================================>] 1,641       --.-K/s   in 0s

2012-04-02 14:29:39 (93.0 MB/s) - written to stdout [1641/1641]

http://erikriffs.com/
http://erikriffs.com/latex-ieee-bibliography-error/
http://erikriffs.com/index/
http://erikriffs.com/matlab-parfor-performance/
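One caveat with the pattern above: `.*` is greedy, so if a generator writes the whole sitemap on a single line, grep returns one match spanning every `<loc>` element and the URLs run together. A slightly more defensive sketch, wrapped in a hypothetical helper named extract_locs, matches only up to the next `<`. It still assumes plain `<loc>` elements with no whitespace or CDATA inside them.

```shell
# Read sitemap XML on stdin and print one URL per line.
# [^<]* stops at the next tag, so several <loc> elements on the same
# line are returned as separate matches.
extract_locs() {
    grep -o '<loc>[^<]*</loc>' | sed -e 's/<loc>//' -e 's/<\/loc>//'
}

# Usage (same site as above):
#   wget -q -O - erikriffs.com/sitemap.xml | extract_locs
```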
Once the list of URLs has been obtained, it can be piped back into wget to download every page along with the resources it requires, such as CSS, JavaScript, and images.
wget -O - erikriffs.com/sitemap.xml | grep -E -o '<loc>.*</loc>' | sed -e 's/<loc>//g' -e 's/<\/loc>//g' | wget -i - -p -r --delete-after
The above one-liner keeps web data fresh in the cache. Temporarily. As other activity occurs on the operating system, the amount of site content in memory will slowly decrease. Scheduling the crawling one-liner with crontab -e allows for hourly cache warming. It is also a far simpler alternative to a PHP program and, unlike this Drupal example, it will crawl images and other dependent resources (the author is missing the -p flag in wget).
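For the cron job, it may be convenient to put the one-liner in a small script so the crontab entry stays short. The sketch below is one way to do that, not the exact setup used here: the script name and path, the SITEMAP_URL variable, the warm_cache function, and the "run" argument are all assumptions, and -q is added only to keep cron from mailing wget's progress output.

```shell
#!/bin/sh
# warm-cache.sh (hypothetical name): crawl every URL in the sitemap so the
# pages and their dependencies are read from disk into the OS cache.

# Assumption: sitemap location, overridable through the environment.
SITEMAP_URL="${SITEMAP_URL:-http://erikriffs.com/sitemap.xml}"

warm_cache() {
    wget -q -O - "$SITEMAP_URL" |
        grep -E -o '<loc>.*</loc>' |
        sed -e 's/<loc>//g' -e 's/<\/loc>//g' |
        wget -q -i - -p -r --delete-after
}

# Crawl only when invoked with the argument "run", so the file can also be
# sourced without touching the network.
if [ "${1:-}" = "run" ]; then
    warm_cache
fi

# Example crontab entry (added with `crontab -e`), minute 0 of every hour:
#   0 * * * * /home/user/bin/warm-cache.sh run
```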
[1] http://en.wikipedia.org/wiki/Memory_hierarchy#Application_of_the_concept
[2] http://www.cyberciti.biz/faq/howto-create-linux-ram-disk-filesystem/
[3] http://httpd.apache.org/docs/2.2/caching.html
[4] http://blog.digitalstruct.com/2008/02/27/php-performance-series-caching-techniques/
[5] http://httpd.apache.org/docs/2.2/mod/mod_expires.html
[6] http://stackoverflow.com/questions/6345020/linux-memory-buffer-vs-cache
[7] http://en.wikipedia.org/wiki/Paging#Anticipatory_paging
[8] http://wordpress.org/extend/plugins/google-sitemap-generator/