Distributed Java web crawler for small research projects

Some of our CS84 students required a distributed web crawler for their research projects. This guide walks through using Selenium (HtmlUnit) to automatically download pages from a website for data analysis. For time-sensitive projects, it can also be useful to distribute the crawler across multiple servers. This is especially true for large websites, where HtmlUnit may consume a fair amount of memory parsing the DOM.

For this tutorial, the project dependencies are managed with Maven. The following dependencies must be declared in your pom.xml:
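The original dependency listing did not survive; a minimal sketch of what that fragment might look like, assuming the HtmlUnit driver for Selenium and the Spark micro-framework (suggested by the get("/query", ...) route used later), is:

```xml
<dependencies>
  <!-- Selenium's headless HtmlUnit driver, used to fetch and parse pages -->
  <dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>htmlunit-driver</artifactId>
    <version>2.36.0</version> <!-- example version; use a current release -->
  </dependency>
  <!-- Spark micro-framework for the /query HTTP route (assumed from the handler style) -->
  <dependency>
    <groupId>com.sparkjava</groupId>
    <artifactId>spark-core</artifactId>
    <version>2.9.3</version> <!-- example version; use a current release -->
  </dependency>
</dependencies>
```

The version numbers are illustrative; check Maven Central for the releases current at the time you build.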

Additionally, it is useful to run the individual crawling services as jar files on separate servers. Add the following build section to the pom.xml so the project can be packaged as a runnable jar:
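The original build section is also missing; one common way to produce a single runnable jar with all dependencies bundled is the maven-shade-plugin. A sketch, with a hypothetical main class name, might be:

```xml
<build>
  <plugins>
    <!-- Bundles the service and all its dependencies into one runnable jar -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>3.2.4</version>
      <executions>
        <execution>
          <phase>package</phase>
          <goals><goal>shade</goal></goals>
          <configuration>
            <transformers>
              <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                <!-- WebFetchService is a placeholder; point this at your entry class -->
                <mainClass>WebFetchService</mainClass>
              </transformer>
            </transformers>
          </configuration>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```

After `mvn package`, each server can then run the service with `java -jar target/<artifact>.jar`.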

The first component is the WebFetchService, a Java application that listens for HTTP requests containing URLs to crawl. Upon receiving a request on the /query route, it creates an HtmlUnitDriver object, loads the page, and returns the page source. The meat of that logic is in the get("/query", ...) handler in the code below:
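The original listing is not reproduced here. As a self-contained stand-in, the sketch below uses the JDK's built-in HttpServer instead of Spark's get("/query", ...) route, and makes the page-fetch step a pluggable function: in the real service that function would create an HtmlUnitDriver, call driver.get(url), and return driver.getPageSource().

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

// Sketch of the WebFetchService. Responds to /query?url=<encoded-url>
// with the fetched page source. The fetchPage function stands in for
// HtmlUnitDriver: driver.get(url); return driver.getPageSource();
public class WebFetchService {
    public static HttpServer start(int port, Function<String, String> fetchPage)
            throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/query", exchange -> {
            // Expect a query string like url=<percent-encoded URL>
            String rawQuery = exchange.getRequestURI().getRawQuery();
            String url = URLDecoder.decode(
                    rawQuery.substring(rawQuery.indexOf('=') + 1),
                    StandardCharsets.UTF_8);
            byte[] body = fetchPage.apply(url).getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }
}
```

Running one of these per server gives the crawler a pool of fetch endpoints to round-robin over.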

With that service running, your project needs to build a queue of URLs and query a WebFetchService:

  1. Create a start URL (or a set of start URLs) and add it to a queue called ToFetch
  2. Loop as long as ToFetch is not empty
    1. Remove one item from ToFetch and call it NextPageToFetch
    2. Retry the following until a successful HTML document is returned:
      1. Pick the URL of one WebFetchService and pass it NextPageToFetch via the /query route
    3. Use regular expressions or document parsing on the returned HTML to extract whatever content is needed
      1. Collect any further links by parsing anchor tags (<a href="">), e.g. with a regular expression, and add them to ToFetch
    4. Record this URL as fetched (add it to a list FinishedFetch)
      1. Write the HTML to a file on disk with a filename derived from the URL. One way to convert URLs to filenames is with URLEncoder
  3. Optionally (if not done in step 2.3.1), loop through all saved files, parse any additional content that was not previously recorded, and write it to a database / CSV
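The loop above can be sketched as follows. The fetch function stands in for a call to a WebFetchService /query route (so the sketch stays self-contained), the retry logic is omitted, and URLEncoder handles the URL-to-filename conversion from step 2.4.1:

```java
import java.io.IOException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CrawlLoop {
    // Naive anchor-tag extraction; a real crawler would also resolve
    // relative links against the page URL before enqueueing them.
    private static final Pattern ANCHOR =
            Pattern.compile("<a[^>]+href=\"([^\"]+)\"");

    public static Set<String> crawl(List<String> startUrls,
                                    Function<String, String> fetch,
                                    Path outputDir) throws IOException {
        Queue<String> toFetch = new ArrayDeque<>(startUrls);    // step 1
        Set<String> finishedFetch = new HashSet<>();            // step 2.4
        while (!toFetch.isEmpty()) {                            // step 2
            String nextPageToFetch = toFetch.remove();          // step 2.1
            if (!finishedFetch.add(nextPageToFetch)) continue;  // skip repeats
            String html = fetch.apply(nextPageToFetch);         // step 2.2
            Matcher m = ANCHOR.matcher(html);                   // step 2.3.1
            while (m.find()) {
                String link = m.group(1);
                if (!finishedFetch.contains(link)) toFetch.add(link);
            }
            // step 2.4.1: URL-encode the URL so it is a safe filename
            String filename =
                    URLEncoder.encode(nextPageToFetch, StandardCharsets.UTF_8);
            Files.writeString(outputDir.resolve(filename), html);
        }
        return finishedFetch;
    }
}
```

In the distributed setup, `fetch` would issue an HTTP GET to one of the WebFetchService instances (rotating through them) and return the response body.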


