27 Jan 2014

Web Scraping with node.js and cheerio

Use node.js and the cheerio node module to scrape, or extract, data from a web page. We use node's core HTTP module to fetch the page, then hand the response to cheerio, which parses the HTML into a DOM-like structure that we can query to extract data from specific HTML elements.
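As a quick illustration of the idea, here is a minimal sketch (using a hard-coded HTML snippet rather than a real HTTP response) of how cheerio turns a string of HTML into something you can query with jQuery-style selectors:

var cheerio = require("cheerio");

// Load an HTML string (just a made-up example) into cheerio
var $ = cheerio.load("<ul><li class='city'>London</li><li class='city'>Paris</li></ul>");

// Query it with a jQuery-style selector and read the text content
$("li.city").each(function() {
  console.log($(this).text()); // "London", then "Paris"
});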

This Wikipedia page about London is going to be our test case. We'll extract some of the fields from the infobox on the right-hand side, which contains political and geographical data.

[Screenshot: the Wikipedia London article, showing the infobox]

Install the cheerio module using npm:

$ npm install cheerio

Edit your main node file, server.js:

var http = require("http");
var cheerio = require("cheerio");

var server = http.createServer(function(req, res) {

  var req_opts = {
    host:"en.wikipedia.org",
    path:"/wiki/London"
  };
  var response_text = "";

  // 1. Perform an HTTP request to Wikipedia
  var request = http.request(req_opts, function(resp) {
    if(resp.statusCode !== 200) {
      throw "Error: " + resp.statusCode;
    }
    resp.setEncoding("utf8");
    resp.on("data", function (chunk) {
      response_text += chunk;
    });
    resp.on("end", function() {

      // 2. Parse response using cheerio
      var $ = cheerio.load(response_text);

      // Begin writing our output HTML
      res.writeHead(200, {"Content-Type": "text/html"});
      res.write("<html><head><meta charset='UTF-8' />");
      res.write("</head><body><table>");

      // Iterate over TR elements in the Wikipedia infobox
      $("table.geography tr").each(function(tr_index, tr) {
        var th_text = $(this).find("th").text();
        var prop_name
          = th_text.trim().toLowerCase().replace(/[^a-z]/g,"");

        // We're only interested in these 3 fields
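        // (the object literal acts as a simple set: the lookup is truthy
        // only when prop_name matches one of its keys)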
        if({"country":1,"mayor":1,"elevation":1}[prop_name])
        {
          // 3. Write out our tabulated data
          res.write("<tr><th>" + prop_name + "</th><td>");
          res.write($(this).find("td").text());
          res.write("</td></tr>");
        }
      });

      // And... we're done
      res.end("</table></body></html>");
    });
  });

  request.on("error", function(e) {
    throw "Error: " + e.message;
  });

  request.end();

}).listen(8080);

There are three stages: first we perform an HTTP request to the specified Wikipedia page. Next we use cheerio to parse the received Wikipedia HTML, converting the response string into a DOM object. Finally we use the table.geography tr selector to retrieve the HTML content we want, iterating over it to extract the specific data fields.
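If you want to experiment with the selector before wiring it into the server, a standalone script along these lines works too (the file name and structure here are just an example, not part of the server above); it prints every infobox row heading to the console:

// selector_test.js -- a rough standalone sketch, no server involved
var http = require("http");
var cheerio = require("cheerio");

http.get({ host: "en.wikipedia.org", path: "/wiki/London" }, function(resp) {
  var body = "";
  resp.setEncoding("utf8");
  resp.on("data", function(chunk) { body += chunk; });
  resp.on("end", function() {
    var $ = cheerio.load(body);
    // Print each TH from the infobox rows, e.g. "Country", "Mayor", ...
    $("table.geography tr").each(function() {
      console.log($(this).find("th").text().trim());
    });
  });
}).on("error", function(e) {
  console.error("Request failed: " + e.message);
});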

Start the node instance:

$ node server.js

Invoke using your browser:

http://localhost:8080
country    England
mayor      Boris Johnson
elevation  24 m (79 ft)

For a good alternative solution, check out PhantomJS for node.