How to crawl a quarter billion webpages in 40 hours
More precisely, I crawled 250,113,669 pages for just under 580 dollars in 39 hours and 25 minutes, using 20 Amazon EC2 machine instances.
I carried out this project because (among several other reasons) I wanted to understand what resources are required to crawl a small but non-trivial fraction of the web. In this post I describe some details of what I did. Of course, there’s nothing especially new: I wrote a vanilla (distributed) crawler, mostly to teach myself something about crawling and distributed computing. Still, I learned some lessons that may be of interest to a few others, and so in this post I describe what I did. The post also mixes in some personal working notes, for my own future reference.
What does it mean to crawl a non-trivial fraction of the web? In fact, the notion of a “non-trivial fraction of the web” isn’t well defined. Many websites generate pages dynamically, in response to user input – for example, Google’s search results pages are dynamically generated in response to the user’s search query. Because of this it doesn’t make much sense to say there are so-and-so many billion or trillion pages on the web. This, in turn, makes it difficult to say precisely what is meant by “a non-trivial fraction of the web”. However, as a reasonable proxy for the size of the web we can use the number of webpages indexed by large search engines. According to this presentation by Googler Jeff Dean, as of November 2010 Google was indexing “tens of billions of pages”. (Note that the number of urls is in the trillions, apparently because of duplicated page content, and multiple urls pointing to the same content.) The now-defunct search engine Cuil claimed to index 120 billion pages. By comparison, a quarter billion is, obviously, very small. Still, it seemed to me like an encouraging start.
Code: Originally I intended to make the crawler code available under an open source license at GitHub. However, as I better understood the cost that crawlers impose on websites, I began to have reservations. My crawler is designed to be polite and impose relatively little burden on any single website, but could (like many crawlers) easily be modified by thoughtless or malicious people to impose a heavy burden on sites. Because of this I’ve decided to postpone (possibly indefinitely) releasing the code.
There’s a more general issue here, which is this: who gets to crawl the web? Relatively few sites exclude crawlers from companies such as Google and Microsoft. But there are a lot of crawlers out there, many of them without much respect for the needs of individual siteowners. Quite reasonably, many siteowners take an aggressive approach to shutting down activity from less well-known crawlers. A possible side effect is that if this becomes too common at some point in the future, then it may impede the development of useful new services, which need to crawl the web. A possible long-term solution may be services like Common Crawl, which provide access to a common corpus of crawl data.
I’d be interested to hear other people’s thoughts on this issue.
Architecture: Here’s the basic architecture:
The domain whitelist was partitioned across the 20 EC2 machine instances in the crawler. This was done by numbering the instances and then allocating the domain domain to instance number hash(domain) % 20, where hash is the standard Python hash function.
Deployment and management of the cluster was handled using Fabric, a well-documented and nicely designed Python library which streamlines the use of ssh over clusters of machines. I managed the connection to Amazon EC2 using a set of Python scripts I wrote, which wrap the boto library.
I used 20 Amazon EC2 extra large instances, running Ubuntu 11.04 (Natty Narwhal) under the ami-68ad5201 Amazon machine image provided by Canonical. I used the extra large instance after testing on several instance types; the extra large instances provided (marginally) more pages downloaded per dollar spent. I used the US East (North Virginia) region, because it’s the least expensive of Amazon’s regions (along with the US West, Oregon region).
Single instance architecture: Each instance further partitioned its domain whitelist into 141 separate blocks of domains, and launched 141 Python threads, with each thread responsible for crawling the domains in one block. Here’s how it worked (details below):
The reason for using threads is that the Python standard library uses blocking I/O to handle http network connections. This means that a single-threaded crawler would spend most of its time idling, usually waiting on the network connection of the remote machine being crawled. It’s much better to use a multi-threaded crawler, which can make fuller use of the resources available on an EC2 instance. I chose the number of crawler threads (141) empirically: I kept increasing the number of threads until the speed of the crawler started to saturate. With this number of threads the crawler was using a considerable fraction of the CPU capacity available on the EC2 instance. My informal testing suggested that it was CPU which was the limiting factor, but that I was not so far away from the network and disk speed becoming bottlenecks; in this sense, the EC2 extra large instance was a good compromise. Memory useage was never an issue. It’s possible that for this reason EC2′s high-CPU extra large instance type would have been a better choice; I only experimented with this instance type with early versions of the crawler, which were more memory-limited.
How domains were allocated across threads: The threads were numbered , and domains were allocated on the basis of the Python hash function, to thread number hash(domain) % 141 (similar to the allocation across machines in the cluster). Once the whitelisted domains / seed urls were allocated to threads, the crawl was done in a simple breadth-first fashion, i.e., for each seed url we download the corresponding web page, extract the linked urls, and check each url to see: (a) whether the extracted url is a fresh url which has not already been seen and added to the url frontier; and (b) whether the extracted url is in the same seed domain as the page which has just been crawled. If both these conditions are met, the url is added to the url frontier for the current thread, otherwise the url is discarded. With this architecture we are essentially carrying out a very large number of independent crawls of the whitelisted domains obtained from Alexa.
Note that this architecture also ensures that if, for example, we are crawling a page from TechCrunch, and extract from that page a link to the Huffington Post, then the latter link will be discarded, even though the Huffington Post is in our domain whitelist. The only links added to the url frontier will be those that point back to TechCrunch itself. The reason we avoid adding dealing with (whitelisted) external links is because: (a) it may require communication between different EC2 instances, which would substantially complicate the crawler; and, more importantly, (b) in practice, most sites have lots of internal links, and so it’s unlikely that this policy means the crawler is missing much.
One advantage of allocating all urls from the same domain to the same crawler thread is that it makes it much easier to crawl politely, since no more than one connection to a site will be open at any given time. In particular, this ensures that we won’t be hammering any given domain with many simultaneous connections from different threads (or different machines).
Problems for the author
- For some very large and rapidly changing websites it may be necessary to open multiple simultaneous connections in order for the crawl to keep up with the changes on the site. How can we decide when that is appropriate?
How the url frontiers work: A separate url frontier file was maintained for each domain. This was simply a text file, with each line containing a single url to be crawled; initially, the file contains just a single line, with the seed url for the domain. I spoke above of the url frontier for a thread; that frontier can be thought of as the combination of all the url frontier files for domains being crawled by that thread.
Each thread maintained a connection to a redis server. For each domain being crawled by the thread a redis key-value pair was used to keep track of the current position in the url frontier file for that domain. I used redis (and the Python bindings) to store this information in a fashion that was both persistent and fast to look up. The persistence was important because it meant that the crawler could be stopped and started at will, without losing track of where it was in the url frontier.
Each thread also maintained a dictionary whose keys were the (hashed) domains for that thread. The corresponding values were the next time it would be polite to crawl that domain. This value was set to be 70 seconds after the last time the domain was crawled, to ensure that domains weren’t getting hit too often. The crawler thread simply iterated over the keys in this dictionary, looking for the next domain it was polite to crawl. Once it found such a domain it then extracted the next url from the url frontier for that domain, and went about downloading that page. If the url frontier was exhausted (some domains run out of pages to crawl) then the domain key was removed from the dictionary. One limitation of this design was that when restarting the crawler each thread had to identify again which domains had already been exhausted and should be deleted from the dictionary. This slowed down the restart a little, and is something I’d modify if I were to do further work with the crawler.
Use of a Bloom filter: I used a Bloom filter to keep track of which urls had already been seen and added to the url frontier. This enabled a very fast check of whether or not a new candidate url should be added to the url frontier, with only a low probability of erroneously adding a url that had already been added. This was done using Mike Axiak‘s very nice C-based pybloomfiltermmap.
Anticipated versus unanticipated errors: Because the crawler ingests input from external sources, it needs to deal with many potential errors. By design, there are two broad classes of error: anticipated errors and unanticipated errors.
Anticipated errors are things like a page failing to download, or timing out, or containing unparseable input, or a robots.txt file disallowing crawling of a page. When anticipated errors arise, the crawler writes the error to a (per-thread) informational log (the “info log” in the diagram above), and continues in whatever way is appropriate. For example, if the robots.txt file disallows crawling then we simply continue to the next url in the url frontier.
Unanticipated errors are errors which haven’t been anticipated and designed for. Rather than the crawler falling over, the crawler simply logs the error (to the “critical log” in the diagram above), and moves on to the next url in the url frontier. At the same time, the crawler tracks how many unanticipated errors have occurred in close succession. If many unanticipated errors occur in close succession it usually indicates that some key piece of infrastructure has failed. Because of this, if there are too many unanticipated errors in close succession, the crawler shuts down entirely.
As I was developing and testing the crawler, I closely followed the unanticipated errors logged in the critical log. This enabled me to understand many of the problems faced by the crawler. For example, early on in development I found that sometimes the html for a page would be so badly formed that the html parser would have little choice but to raise an exception. As I came to understand such errors I would rewrite the crawler code so such errors become anticipated errors that were handled as gracefully as possible. Thus, the natural tendency during development was for unanticipated errors to become anticipated errors.
Domain and subdomain handling: As mentioned above, the crawler works by doing lots of parallel intra-domain crawls. This works well, but a problem arises because of the widespread use of subdomains. For example, if we start at the seed url http://barclays.com and crawl only urls within the barclays.com domain, then we quickly run out of urls to crawl. The reason is that most of the internal links on the barclays.com site are actually to group.barclays.com, not barclays.com. Our crawler should also add urls from the latter domain to the url frontier for barclays.com.
We resolve this by stripping out all subdomains, and working with the stripped domains when deciding whether to add a url to the url frontier. Removing subdomains turns out to be a surprisingly hard problem, because of variations in the way domain names are formed. Fortunately, the problem seems to be well solved using John Kurkowski’s tldextract library.
On the representation of the url frontier: I noted above that a separate url frontier file was maintained for each domain. In an early version of the code, each crawler thread had a url frontier maintained as a single flat text file. As a crawler thread read out lines in the file, it would crawl those urls, and append any new urls found to the end of the file.
This approach seemed natural to me, but organizing the url frontier files on a per-thread (rather than per-domain) basis caused a surprising number of problems. As the crawler thread moved through the file to find the next url to crawl, the crawler thread would encounter urls belonging to domains that were not yet polite to crawl because they’d been crawled too recently. My initial strategy was simply to append such urls to the end of the file, so they would be found again later. Unfortunately, there were often a lot of such urls in a row – consecutive urls often came from the same domain (since they’d been extracted from the same page). And so this strategy caused the file for the url frontier to grow very rapidly, eventually consuming most disk space.
Exacerbating this problem, this approach to the url frontier caused an unforseen “domain clumping problem”. To understand this problem, imagine that the crawler thread encountered (say) 20 consecutive urls from a single domain. It might crawl the first of these, extracting (say) 20 extra urls to append to the end of the url frontier. But the next 19 urls would all be skipped over, since it wouldn’t yet be polite to crawl them, and they’d also be appended to the end of the url frontier. Now we have 39 urls from the same domain at the end of the url frontier. But when the crawler thread gets to those, we may well have the same process repeat – leading to a clump of 58 urls from the same domain at the end of the file. And so on, leading to very long runs of urls from the same domain. This consumes lots of disk space, and also slows down the crawler, since the crawler thread may need to examine a large number of urls before it finds a new url it’s okay to crawl.
These problems could have been solved in various ways; moving to the per-domain url frontier file was how I chose to address the problems, and it seemed to work well.
Choice of number of threads: I mentioned above that the number of crawler threads (141) was chosen empirically. However, there is an important constraint on that number, and in particular its relationship to the number (20) of EC2 instances being used. Suppose that instead of 141 threads I’d used (say) 60 threads. This would create a problem. To see why, note that any domain allocated to instance number 7 (say) would necessarily satisfy hash(domain) % 20 = 7. This would imply that hash(domain) % 60 = 7 or 27 or 47, and as a consequence all the domains would be allocated to just one of three crawler threads (thread numbers 7, 27 and 47), while the other 57 crawler threads would lie idle, defeating the purpose of using multiple threads.
One way to solve this problem would be to use two independent hash functions to allocate domains to EC2 instances and crawler threads. However, an even simpler way of solving the problem is to choose the number of crawler threads to be coprime to the number of EC2 instances. This coprimality ensures that domains will be allocated reasonably evenly across both instance and threads. (I won’t prove this here, but it can be proved with a little effort). It is easily checked that 141 and 20 are coprime.
Note, incidentally, that Python’s hash is not a true hash function, in the sense that it doesn’t guarantee that the domains will be spread evenly across EC2 instances. It turns out that Python’s hash takes similar key strings to similar hash values. I talk more about this point (with examples) in the fifth paragraph of this post. However, I found empirically that hash seems to spread domains evenly enough across instances, and so I didn’t worry about using a better (but slower) hash function, like those available through Python’s hashlib library.
Use of Python: All my code was written in Python. Initially, I wondered if Python might be too slow, and create bottlenecks in the crawling. However, profiling the crawler showed that most time was spent either (a) managing network connections and downloading data; or (b) parsing the resulting webpages. The parsing of the webpages was being done using lxml, a Python binding to fast underlying C libraries. It didn’t seem likely to be easy to speed that up, and so I concluded that Python was likely not a particular bottleneck in the crawling.
Politeness: The crawler used Python’s robotparser library in order to observe the robots exclusion protocol. As noted above, I also imposed an absolute 70-second minimum time interval between accesses to any given domain. In practice, the mean time between accesses was more like 3-4 minutes.
In initial test runs of the crawler I got occasional emails from webmasters asking for an explanation of why I was crawling their site. Because of this, in the crawler’s User-agent I included a link to a webpage explaining the purpose of my crawler, how to exclude it from a site, and what steps I was taking to crawl politely. This was (I presume) both helpful to webmasters and also helpful to me, for it reduced the number of inquiries. A handful of people asked me to exclude their sites from the crawl, and I complied quickly.
Problems for the author
- Because my crawl didn’t take too long, the robots.txt file was downloaded just once for each domain, at the beginning of the crawl. In a longer crawl, how should we decide how long to wait between downloads of robots.txt?
Truncation: The crawler truncates large webpages rather than downloading the full page. It does this in part because it’s necessary – it really wouldn’t surprise me if someone has a terabyte html file sitting on a server somewhere – and in part because for many applications it will be of more interest to focus on earlier parts of the page.
What’s a reasonable threshold for truncation? According to this report from Google, as of May 2010 the average network size of a webpage from a top site is 312.04 kb. However, that includes images, scripts and stylesheets, which the crawler ignores. If you ignore the images and so on, then the average network size drops to just 33.66 kb.
However, that number of 33.66 kb is for content which may be served compressed over the network. Our truncation will be based on the uncompressed size. Unfortunately, the Google report doesn’t tell us what the average size of the uncompressed content is. However, we can get an estimate of this, since Google reports that the average uncompressed size of the total page (including images and so on) is 477.26 kb, while the average network size is 312.04 kb.
Assuming that this compression ratio is typical, we estimate that the average uncompressed size of the content the crawler downloads is 51 kb. In the event, I experimented with several truncation settings, and found that a truncation threshold of 200 kilobytes enabled me to download the great majority of webpages in their entirety, while addressing the problem of very large html files mentioned above. (Unfortunately, I didn’t think to check what the actual average uncompressed size was, my mistake.)
Storage: I stored all the data using EC2′s built-in instance storage – 1.69 Terabytes for the extra-large instances I was using. This storage is non-persistent, and so any data stored on an instance will vanish when that instance is terminated. Now, for many kinds of streaming or short-term analysis of data this would be adequate – indeed, it might not even be necessary to store the data at all. But, of course, for many applications of a crawl this approach is not appropriate, and the instance storage should be supplemented with something more permanent, such as S3. For my purposes using the instance storage seemed fine.
Price: The price broke down into two components: (1) 512 dollars for the use of the 20 extra-large EC2 instances for 40 hours; and (2) about 65 dollars for a little over 500 gigabytes of outgoing bandwidth, used to make http requests. Note that Amazon does not charge for incoming bandwidth (a good thing, too!) It would be interesting to compare these costs to the (appropriately amortized) costs of using other cloud providers, or self-hosting.
Something I didn’t experiment with is the use of Amazon’s spot instances, where you can bid to use Amazon’s unused EC2 capacity. I didn’t think of doing this until just as I was about to launch the crawl. When I went to look at the spot instance pricing history, I discovered to my surprise that the spot instance prices are often a factor of 10 or so lower than the prices for on-demand instances! Factoring in the charges for outgoing bandwidth, this means it may be possible to use spot instances to do a similar crawl for 120 dollars or so, a factor of five savings. I considered switching, but ultimately decided against it, thinking that it might take 2 or 3 days work to properly understand the implications of switching, and to get things working exactly as I wanted. Admittedly, it’s possible that it would have taken much less time, in which case I missed an opportunity to trade some money for just a little extra time.
Improvements to the crawler architecture: Let me finish by noting a few ways it’d be interesting to improve the current crawler:
- For many long-running applications the crawler would need a smart crawl policy so that it knows when and how to re-crawl a page. According to a presentation from Jeff Dean, Google’s mean time to index a new page is now just minutes. I don’t know how that works, but imagine that notification protocols such as pubsubhubbub play an important role. It’d be good to change the crawler so that it’s pubsubhubbub aware.
- The crawler currently uses a threaded architecture. Another quite different approach is to use an evented architecture. What are the pros and cons of a multi-threaded versus an evented architecture?
- The instances in the cluster are configured using fabric and shell scripts to install programs such as redis, pybloomfilter, and so on. This is slow and not completely reliable. Is there a better way of doing this? Creating my own EC2 AMI? Configuration management software such as Chef and Puppet? I considered using one of the latter, but deferred it because of the upfront cost of learning the systems.
- Logging is currently done using Python’s logging module. Unfortunately, I’m finding this is not well-adapted to Python’s threading. Is there a better solution?
- The crawler was initially designed for crawling in a batch environment, where it is run and then terminates. I’ve since modified it so that it can be stopped, modifications made, and restarted. It’d be good to add instrumentation so it can be modified more dynamically, in real time.
- Many interesting research papers have been published about crawling. I read or skimmed quite a few while writing my crawler, but ultimately used only a few of the ideas; just getting the basics right proved challenging enough. In future iterations it’d be useful to look at this work again and to incorporate the best ideas. Good starting points include a chapter in the book by Manning, Raghavan and Sch\”utze, and a survey paper by Olston and Najork. Existing open source crawlers such as Heritrix and Nutch would also be interesting to look at in more depth.