Thursday, November 18, 2010

Page scraping: Enlive and CSS selectors

So here I am again, thinking about page scraping (as I did earlier): the part of a Web robot that sits between retrieving a given page and producing the data representation we're really after.

Unsurprisingly, there are a great number of solutions to this problem, some better, some worse. The one I want to talk about right now is Enlive, a Clojure library that looks pretty darned interesting. I want to do something similar in dperl. There's an Enlive tutorial that's quite well written.

Now Enlive uses something very CSS-selector-like to extract interesting information from HTML trees. This is the part that struck my fancy. So here is a CSS selector tutorial as well.
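To make the idea concrete, here is a minimal sketch of selector-driven extraction. It is not Enlive's API and not Clojure. It is a Python illustration using only the standard library, and it assumes reasonably well-formed HTML. The names Node, TreeBuilder, parse, select and PAGE are all made up for this example, and only tag, .class, #id and the descendant combinator (a space) are handled.

```python
import re
from html.parser import HTMLParser

# Elements that never have a closing tag (a partial list, enough for a sketch).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    """One element of the parsed HTML tree (hypothetical helper type)."""

    def __init__(self, tag, attrs, parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a Node tree; assumes tags are properly nested."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root", [])
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def matches(node, compound):
    """Test one compound selector such as 'a', '.item', '#title' or 'li.item'."""
    classes = (node.attrs.get("class") or "").split()
    for kind, name in re.findall(r"([#.]?)([\w-]+)", compound):
        if kind == "" and node.tag != name:
            return False
        if kind == "." and name not in classes:
            return False
        if kind == "#" and node.attrs.get("id") != name:
            return False
    return True


def has_ancestors(node, parts):
    """Check that the ancestors of node match parts, read right to left."""
    remaining = list(parts)
    ancestor = node.parent
    while remaining and ancestor is not None:
        if matches(ancestor, remaining[-1]):
            remaining.pop()
        ancestor = ancestor.parent
    return not remaining


def select(root, selector):
    """Return all nodes matching a space-separated descendant selector."""
    parts = selector.split()
    found = []

    def walk(node):
        for child in node.children:
            if matches(child, parts[-1]) and has_ancestors(child, parts[:-1]):
                found.append(child)
            walk(child)

    walk(root)
    return found


# Made-up sample page, only for demonstration.
PAGE = """
<html><body>
  <h1 id="title">Links</h1>
  <ul class="links">
    <li class="item"><a href="/enlive">Enlive tutorial</a></li>
    <li class="item"><a href="/css">CSS selector tutorial</a></li>
  </ul>
  <p><a href="/other">Elsewhere</a></p>
</body></html>
"""

if __name__ == "__main__":
    for link in select(parse(PAGE), "ul.links a"):
        print(link.attrs["href"], link.text)
```

The selector "ul.links a" picks out only the two links inside the list and skips the one in the paragraph, which is exactly the kind of declarative "what I want, not how to walk the tree" extraction that makes the selector approach attractive for scraping.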

Very promising direction.
