How to Crawl Web Pages in a Chrome Extension

Christoph Herzog
  • google-chrome
  • chrome
  • extension
  • crawl
  • crawling
  • parsing
  • fetch
  • promise
  • random-tip
  • develop

There's a website I've found myself using on a regular basis, which shall remain unnamed.
It is overloaded with ads, has an ugly design, and is cumbersome to use. So what is a coder to do, facing this plight? Well, write a Google Chrome extension that fetches the required data and presents it in a React app, of course!

Hint: If you just want a reusable, working function for crawling pages in a Chrome background page, jump straight to the Reusable Function Returning a Promise section below.

I might cover my extension in more detail in a later blog post, but the first step was to find a good way to fetch and parse web pages.

A background page seems to be the right approach.

My first approach was trying to set window.location.href in the background page.
Sadly, this just does not work. I haven't found an explanation, but presumably Chrome ignores location changes for background pages.

Then I tried simply fetching the whole page and inserting it into the body.
This has a few issues, though:

  • You need to parse out the <body> tag first to exclude everything in <head>, which might prevent you from seeing content generated by JavaScript.
  • Inserting large amounts of content by setting .innerHTML is painfully slow (in Chrome, at least).
  • After setting .innerHTML, the whole page becomes really slow, including simple DOM queries.
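
The first of those issues can be sketched like so. extractBody is a hypothetical helper of mine, not from any API, and a regex like this is fragile for real-world HTML, but it illustrates the extra parsing step the naive approach forces on you:

```javascript
// Hypothetical helper (not part of the post's final solution): pull the
// inner HTML of the <body> tag out of a raw HTML string, so nothing
// from <head> ends up inserted into the current page.
function extractBody(html) {
  const match = /<body[^>]*>([\s\S]*)<\/body>/i.exec(html);
  // Fall back to the full string if no <body> tag was found.
  return match ? match[1] : html;
}

console.log(extractBody("<html><head><title>t</title></head><body><p>Hi</p></body></html>"));
// → "<p>Hi</p>"
```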

A workaround is to use document.write, which replaces the whole document.
Subsequent DOM access remains fast.

So, what works well is this:

  • Load the required page with the new Fetch API
  • Insert the loaded content with document.write
  • Poll with a timeout until document.readyState === "complete".
  • Query the document however you want to.

#Example

Note: I'm using Chrome 52, which now fully supports the Fetch API, arrow functions and let / const variable declarations.

So, here is a working example that extracts all URLs from link tags on a given page:

const url = "https://www.wikipedia.org/";

fetch(url).then(r => {
  return r.text();
}).then(html => {
  // Replace the background page's document with the fetched page.
  document.open("text/html");
  document.write(html);
  document.close();

  // Poll until the document has finished loading.
  const handler = function() {
    if (document.readyState !== "complete") {
      setTimeout(handler, 50);
    }
    else {
      const links = document.body.querySelectorAll("a");
      const urls = [].map.call(links, l => l.href);
      // Do something with the URLs.
      console.log(urls);
    }
  };
  setTimeout(handler, 50);
});


#Reusable Function Returning a Promise

So, wouldn't it be nice to have a reusable function that returns a Promise?

Here you go:

function fetchPage(url) {
  return fetch(url).then(r => {
    return r.text();
  }).then(html => {
    // Replace the background page's document with the fetched page.
    document.open("text/html");
    document.write(html);
    document.close();

    // Resolve once the document has finished loading.
    return new Promise(resolve => {
      const handler = function() {
        if (document.readyState !== "complete") {
          setTimeout(handler, 50);
        }
        else {
          resolve();
        }
      };
      handler();
    });
  });
}

// Usage:

fetchPage("https://some-page.com").then(function() {
  const links = document.body.querySelectorAll("a");
  const urls = [].map.call(links, l => l.href);
  // Do something with the URLs.
  console.log(urls);
});
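
One caveat: since each fetchPage() call replaces the background page's document, you can't crawl several pages in parallel. Chaining the calls with reduce works, though. crawlSequentially below is a hypothetical helper of mine, not part of any API, sketched under the assumption that fetchPage from above is in scope:

```javascript
// Hypothetical helper: crawl several URLs strictly one after another,
// since each fetchPage() call replaces the background page's document.
// "extract" receives the freshly loaded document and returns whatever
// data you want to keep from that page.
function crawlSequentially(urls, extract) {
  return urls.reduce((chain, url) => {
    return chain.then(results => fetchPage(url).then(() => {
      results.push(extract(document));
      return results;
    }));
  }, Promise.resolve([]));
}

// Usage (in the background page):
// crawlSequentially(
//   ["https://some-page.com/a", "https://some-page.com/b"],
//   doc => [].map.call(doc.querySelectorAll("a"), l => l.href)
// ).then(allLinks => console.log(allLinks));
```

The reduce builds a Promise chain so each page is fetched only after the previous one has been fully processed.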


theduke.at | © Christoph Herzog (theduke), 2016 | Vienna, Austria