Some of our CS84 students required a distributed web crawler for their research projects. This guide walks through using Selenium's HtmlUnit driver to automatically download many pages from a website for data analysis. For time-sensitive projects, it may also be useful to distribute the crawler over multiple servers. This is especially true for large websites, where HtmlUnit can consume a fair amount of memory parsing each page into a DOM.
For this tutorial, the project dependencies are managed with Maven. The following dependencies must be in your pom.xml:
<dependencies>
  <!-- https://mvnrepository.com/artifact/org.seleniumhq.selenium/selenium-java -->
  <dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>2.52.0</version>
  </dependency>
  <dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-remote-driver</artifactId>
    <version>2.52.0</version>
  </dependency>
  <dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-htmlunit-driver</artifactId>
    <version>2.52.0</version>
  </dependency>
  <dependency>
    <groupId>com.google.code.gson</groupId>
    <artifactId>gson</artifactId>
    <version>2.7</version>
  </dependency>
  <!-- https://mvnrepository.com/artifact/com.sparkjava/spark-core -->
  <dependency>
    <groupId>com.sparkjava</groupId>
    <artifactId>spark-core</artifactId>
    <version>2.3</version>
  </dependency>
  <dependency>
    <groupId>org.eclipse.jetty.websocket</groupId>
    <artifactId>websocket-api</artifactId>
    <version>9.2.0.M0</version>
  </dependency>
  <!-- To run websockets in embedded server -->
  <dependency>
    <groupId>org.eclipse.jetty.websocket</groupId>
    <artifactId>websocket-server</artifactId>
    <version>9.2.0.M0</version>
  </dependency>
  <!-- To run websockets client -->
  <dependency>
    <groupId>org.eclipse.jetty.websocket</groupId>
    <artifactId>websocket-client</artifactId>
    <version>9.2.0.M0</version>
  </dependency>
  <dependency>
    <groupId>org.eclipse.jetty</groupId>
    <artifactId>jetty-server</artifactId>
    <version>9.2.0.M0</version>
  </dependency>
  <dependency>
    <groupId>org.eclipse.jetty</groupId>
    <artifactId>jetty-webapp</artifactId>
    <version>9.2.0.M0</version>
  </dependency>
  <dependency>
    <groupId>org.eclipse.jetty</groupId>
    <artifactId>jetty-servlets</artifactId>
    <version>9.2.0.M0</version>
  </dependency>
</dependencies>
Additionally, it is useful to run the individual crawling services as JAR files on separate servers. The following build section should also be added to the pom.xml so the project can be packaged as a runnable JAR:
<build>
  <plugins>
    <plugin>
      <artifactId>maven-assembly-plugin</artifactId>
      <configuration>
        <archive>
          <manifest>
            <mainClass>com.ktbyte.webauto.WebFetchService</mainClass>
          </manifest>
        </archive>
        <descriptorRefs>
          <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
      </configuration>
    </plugin>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <version>3.1</version>
      <configuration>
        <source>1.8</source>
        <target>1.8</target>
      </configuration>
    </plugin>
  </plugins>
</build>
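With this build configuration in place, you can package the service into a single runnable JAR and start it on each server. The exact JAR name depends on your project's artifactId and version (shown as placeholders below), and the trailing 4 is an optional argument that sets the maximum number of drivers (see the WebFetchService code below):
mvn clean package assembly:single
java -jar target/<artifactId>-<version>-jar-with-dependencies.jar 4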
The first component is the
WebFetchService, a Java application that listens for HTTP requests containing web URLs to crawl. Upon receiving a request on the /query route, it takes an HtmlUnitDriver from a small pool, loads the page, and returns the page source. The meat of that logic is in the get("/query", ...) section of the code below:
package com.ktbyte.webauto;

import static spark.Spark.get;

import java.util.Collections;
import java.util.Date;
import java.util.LinkedList;
import java.util.List;

import org.openqa.selenium.htmlunit.HtmlUnitDriver;

import com.gargoylesoftware.htmlunit.BrowserVersion;

import spark.Spark;

public class WebFetchService {
    static {
        // Silence HtmlUnit's verbose warnings about CSS/JS issues on real-world pages
        java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit").setLevel(java.util.logging.Level.SEVERE);
    }

    // A simple pool of drivers so concurrent requests don't each spin up a new headless browser
    static List<HtmlUnitDriver> drivers = Collections.synchronizedList(new LinkedList<>());
    static int maxDrivers = 4;
    static int numDrivers = 0;

    public static synchronized HtmlUnitDriver getDriver() {
        if (drivers.size() == 0) {
            if (numDrivers >= maxDrivers) {
                // The pool is exhausted; the caller must handle the null
                System.err.println("Ran out of drivers");
                return null;
            }
            numDrivers++;
            System.err.println("Num drivers: " + numDrivers);
            HtmlUnitDriver driver = new HtmlUnitDriver(BrowserVersion.FIREFOX_38); // emulate a specific browser version
            driver.setJavascriptEnabled(false); // skip JavaScript to keep memory and CPU usage down
            return driver;
        } else {
            return drivers.remove(0);
        }
    }

    public static synchronized void returnDriver(HtmlUnitDriver driver) {
        if (maxDrivers == 1) {
            driver.close(); // discard it and just count down numDrivers
            numDrivers--;
        } else {
            drivers.add(driver);
        }
    }

    public static void main(String[] args) {
        if (args.length > 0) maxDrivers = Integer.parseInt(args[0]);
        Spark.port(65234);
        get("/query", (req, res) -> {
            HtmlUnitDriver driver = getDriver();
            if (driver == null) {
                System.err.println("uhoh.. out of threads, max: " + maxDrivers);
                return null;
            }
            String url = req.queryParams("url");
            System.err.println(numDrivers + " - " + System.currentTimeMillis() + " " + (new Date()) + " " + "Received Request for page: " + url);
            driver.get(url);
            String html = driver.getPageSource();
            returnDriver(driver);
            return html;
        });
    }
}
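Once the service is running, you can sanity-check it from a browser or curl by requesting the /query route directly, e.g. http://localhost:65234/query?url=https%3A%2F%2Fexample.com%2F (the port comes from Spark.port above, the target URL must be query-string encoded, and example.com is only a placeholder). The response body is the raw HTML source of the fetched page.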
With that service running, your project needs to maintain a queue of URLs and query a WebFetchService instance for each one. The overall loop looks like this (a minimal Java sketch appears after the list):
- Create a start URL (or a set of start URLs) and add it to a queue called ToFetch.
- Loop as long as ToFetch is not empty:
- Remove one item from ToFetch and call it NextPageToFetch.
- Retry the following until an HTML document is successfully fetched and parsed:
- Pick the URL of one WebFetchService and pass NextPageToFetch to its /query route.
- Take the returned HTML and use regular expressions or document parsing to extract whatever content is needed.
- Extract any further links by parsing anchor tags (<a href="">), e.g. with a regular expression, and add them to ToFetch.
- Record this URL as parsed (add it to a FinishedFetch list).
- Write the HTML to a file on disk with a filename derived from the URL. One way to convert URLs to filenames is with URLEncoder:
// requires: java.io.IOException, java.net.URLEncoder, java.nio.file.Path, java.nio.file.Paths
public static Path getPathFromURL(String url) throws IOException {
    String filename = URLEncoder.encode(url, "UTF-8");
    return Paths.get("data/" + filename);
}
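As a quick usage note (assuming url and html hold the fetched URL and the page source returned by the service, and that java.nio.file.Files and java.nio.charset.StandardCharsets are imported), the page can then be written out like so:
Files.createDirectories(Paths.get("data")); // make sure the data/ directory exists
Files.write(getPathFromURL(url), html.getBytes(StandardCharsets.UTF_8)); // filename is the URL-encoded URL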
- Optionally (if this was not done in the parsing step above), loop through all saved files, parse any additional content that was not previously recorded, and write it to a database or CSV.
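Below is a minimal sketch of that loop in Java. It is not a production crawler: the CrawlClient class name, the fetchServers list, the link-extraction regex, and the example.com start URL are all assumptions you would adapt to your project, and getPathFromURL from above is duplicated here so the sketch is self-contained. It fetches each page through a WebFetchService instance on port 65234, extracts href values with a regular expression, and saves the HTML to disk.
package com.ktbyte.webauto;

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Scanner;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CrawlClient {
    // Hypothetical list of servers running WebFetchService; adjust to your deployment
    static List<String> fetchServers = Arrays.asList("http://localhost:65234");
    // Simple anchor-tag regex; real sites may need a proper HTML parser
    static Pattern anchorPattern = Pattern.compile("<a[^>]+href=[\"']([^\"'#]+)[\"']");

    public static void main(String[] args) throws IOException {
        Deque<String> toFetch = new ArrayDeque<>();
        Set<String> finishedFetch = new HashSet<>();
        toFetch.add("https://example.com/"); // start URL (placeholder)

        while (!toFetch.isEmpty()) {
            String nextPageToFetch = toFetch.poll();
            if (!finishedFetch.add(nextPageToFetch)) continue; // already crawled

            String html = null;
            for (String server : fetchServers) { // try servers until one succeeds
                try {
                    html = fetchViaService(server, nextPageToFetch);
                    if (html != null) break;
                } catch (IOException e) {
                    System.err.println("Fetch failed on " + server + ": " + e.getMessage());
                }
            }
            if (html == null) continue;

            // Extract links from anchor tags and enqueue unseen absolute URLs
            Matcher m = anchorPattern.matcher(html);
            while (m.find()) {
                String link = m.group(1);
                if (link.startsWith("http") && !finishedFetch.contains(link)) {
                    toFetch.add(link);
                }
            }

            // Save the raw HTML to disk for later parsing
            Files.createDirectories(Paths.get("data"));
            Files.write(getPathFromURL(nextPageToFetch), html.getBytes(StandardCharsets.UTF_8));
        }
    }

    // Ask one WebFetchService instance to fetch the page and return its HTML
    static String fetchViaService(String server, String pageUrl) throws IOException {
        URL query = new URL(server + "/query?url=" + URLEncoder.encode(pageUrl, "UTF-8"));
        try (InputStream in = query.openStream();
             Scanner sc = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
            return sc.hasNext() ? sc.next() : null;
        }
    }

    public static Path getPathFromURL(String url) throws IOException {
        String filename = URLEncoder.encode(url, "UTF-8");
        return Paths.get("data/" + filename);
    }
}
In a real deployment you would rotate or randomly pick among several fetchServers entries, add politeness delays, and restrict enqueued links to the site being studied.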