31 Jul 2014

Web Scraping with node.js and PhantomJS

Use node.js with PhantomJS to scrape (or extract) data from a web page. We use the Phantom bridge and jQuery to fetch, open and parse a page from Wikipedia.

In this example the phantom module, a bridge between node.js and PhantomJS, is used to scrape data from a web page.

As with the Cheerio example, the Wikipedia page about London is our test case. We'll extract three fields from the info box on the right-hand side.

Figure: screenshot of the Wikipedia London article

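Install the phantom module using npm. The module drives a separate PhantomJS binary, which is assumed to be installed already and available on your PATH: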

$ npm install phantom

Edit your main node file, server.js:

var http = require("http");
var phantom = require("phantom");

var url = "http://en.wikipedia.org/wiki/London";

var server = http.createServer(function(req, res) {

phantom.create(function (ph) {
ph.createPage(function (page) {
page.open(url, function (status) {

// We use jQuery to parse the document
page.includeJs(
  "http://ajax.googleapis.com/ajax/libs/jquery/1.6.1/jquery.min.js",
  function() {
    page.evaluate(function() {

      var data = {};

      $("table.geography tr").each(function(tr_index, tr) {
        var th_text = $(this).find("th").text();
        var prop_name
          = th_text.trim().toLowerCase().replace(/[^a-z]/g,"");

        // We're only interested in these 3 fields
        if({"country":1,"mayor":1,"elevation":1}[prop_name]) {
          data[prop_name] = $(this).find("td").text();
        }
      });

      return data;

    }, function(data) {

      ph.exit();

      // Begin writing our output HTML
      res.writeHead(200, {"Content-Type": "text/html"});
      res.write("<html><head><meta charset='UTF-8' />");
      res.write("</head><body><table>");

      for(var prop in data) {
        res.write("<tr><th>" + prop + "</th><td>");
        res.write(data[prop]);
        res.write("</td></tr>");
      }

      res.end("</table></body></html>");

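      // Stop the node process after handling the first request;
      // remove this call to keep the server running for further requests
      process.exit(0);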
    });
  }
);

});
});
});

}).listen(8080);

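jQuery is injected into the loaded page with page.includeJs and used to query the retrieved HTML document. The function passed as the first argument of page.evaluate runs inside the PhantomJS page rather than in node, so it can use $ but cannot see node variables such as res. In its body the required properties are selected and assigned to an object, which is returned. The bridge hands that object to the second callback, where a simple HTML document is written as the HTTP response.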
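The field selection inside page.evaluate can be tried on its own, without PhantomJS or jQuery. The following is a minimal sketch, not part of server.js: the helper names toPropName and pickFields and the sample rows are made up for illustration, and hasOwnProperty is used in place of the plain object lookup so that a header such as "Constructor" cannot match an inherited property.

```javascript
// Hypothetical helpers mirroring the loop inside page.evaluate.
var WANTED = { country: 1, mayor: 1, elevation: 1 };

// Turn header text into a lower-case key containing only the letters a-z.
function toPropName(thText) {
  return thText.trim().toLowerCase().replace(/[^a-z]/g, "");
}

// rows: array of { th: headerText, td: cellText } objects (illustrative shape).
function pickFields(rows) {
  var data = {};
  rows.forEach(function (row) {
    var propName = toPropName(row.th);
    // Keep only the three fields we are interested in.
    if (Object.prototype.hasOwnProperty.call(WANTED, propName)) {
      data[propName] = row.td;
    }
  });
  return data;
}

// Sample rows, assumed for illustration only.
var sample = [
  { th: " Country\n", td: "England" },
  { th: "Mayor", td: "Boris Johnson" },
  { th: "Area", td: "ignored" },
  { th: "Elevation", td: "24 m (79 ft)" }
];

console.log(pickFields(sample));
```

Keeping this logic in plain functions makes it easy to check how header text maps to property names before running it against the live page.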

Start the node instance:

$ node server.js

Open the server's address in your browser:

http://localhost:8080
country England
mayor Boris Johnson
elevation 24 m (79 ft)
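The table above is produced by the loop over data in server.js, which writes the scraped text into the response as is. As a possible variation, the sketch below builds the same document as one string and escapes the values first, assuming you want characters such as < and & in scraped text to display literally. The function names escapeHtml and renderTable are hypothetical and do not come from the phantom module.

```javascript
// Escape the characters that have special meaning in HTML.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build the same document that server.js writes, one table row per property.
function renderTable(data) {
  var html = "<html><head><meta charset='UTF-8' /></head><body><table>";
  for (var prop in data) {
    if (Object.prototype.hasOwnProperty.call(data, prop)) {
      html += "<tr><th>" + escapeHtml(prop) + "</th><td>" +
        escapeHtml(data[prop]) + "</td></tr>";
    }
  }
  return html + "</table></body></html>";
}

console.log(renderTable({ country: "England", mayor: "Boris Johnson" }));
```

In server.js the result could then be sent with a single res.end(renderTable(data)) call after res.writeHead.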