Monday, November 22, 2010

More scraping

Ooh. The "Readability" JavaScript tool munges a page to pull out its "content" (that poorly defined part of the HTML that humans actually read) and puts it into a separate area for actual, well, reading, minus all the ads and links and sidebars and so on.

That algorithm has been ported to Perl as HTML::ExtractMain, so it's going into WWW::Declarative.
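
For a feel of what "find the part humans read" can mean in practice, here is a toy sketch in Python using only the standard library. To be clear about the assumptions: this is not Readability's actual algorithm and not what HTML::ExtractMain does internally. The scoring rule (text length minus a penalty for text inside links), the `BLOCK_TAGS` set, and the names `BlockScorer` and `extract_main_text` are all made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical choice of container elements to score; real tools consider more.
BLOCK_TAGS = {"div", "article", "section", "td"}


class BlockScorer(HTMLParser):
    """Collect text per container element and count how much of it is link text."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open containers, innermost last
        self.blocks = []   # containers that have been closed
        self.in_link = 0   # nesting depth of <a> elements

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.stack.append({"text": [], "link_chars": 0})
        elif tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.stack:
            self.blocks.append(self.stack.pop())
        elif tag == "a" and self.in_link:
            self.in_link -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.stack:
            return
        # Credit the text to the innermost open container only.
        block = self.stack[-1]
        block["text"].append(text)
        if self.in_link:
            block["link_chars"] += len(text)


def extract_main_text(html):
    """Return the text of the container with the most non-link text, or ''."""
    parser = BlockScorer()
    parser.feed(html)
    parser.close()

    def score(block):
        total = sum(len(t) for t in block["text"])
        # Assumed heuristic: link-heavy blocks (nav bars, sidebars) score low.
        return total - 2 * block["link_chars"]

    best = max(parser.blocks, key=score, default=None)
    return " ".join(best["text"]) if best else ""
```

Feed it a page with a navigation `div`, a post `div`, and a sidebar `div`, and the post should win because most of its text sits outside links. Real pages are messier than that, which is presumably why the real thing is worth borrowing rather than reinventing.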

In other news on this front, O'Reilly has a sale on a bunch of relevant books. Sigh.
