Sunday, December 9, 2012

boilerpipe

When scraping, removing boilerplate is job #1.  Boilerpipe is one of several libraries that do this: it strips the navigation, footers, and other page furniture around the main text.  In effect it provides a separate "de-boilerplating" step.  (And treating it as its own step is probably a really good way of looking at things.)
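To make the idea of a "de-boilerplating" step concrete, here is a toy sketch in Python. It is not Boilerpipe's API or algorithm; it only assumes that blocks with few words or a high share of link text are likely boilerplate. The names `BlockCollector` and `strip_boilerplate`, and the thresholds `min_words` and `max_link_density`, are made up for illustration.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries (an assumption; real pages need more care).
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3",
              "article", "section", "nav", "footer", "header"}
# Tags whose text content is never shown to the reader.
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Split a page into text blocks, counting words and linked words."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # list of (text, word_count, link_word_count)
        self._words = []
        self._link_words = 0
        self._in_link = 0
        self._skip = 0

    def flush(self):
        # Close the current block, if it holds any text.
        if self._words:
            self.blocks.append(
                (" ".join(self._words), len(self._words), self._link_words))
        self._words = []
        self._link_words = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)


def strip_boilerplate(html, min_words=10, max_link_density=0.3):
    """Keep blocks that are long enough and not dominated by link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()
    kept = []
    for text, n_words, n_link_words in parser.blocks:
        if n_words >= min_words and n_link_words / n_words <= max_link_density:
            kept.append(text)
    return "\n\n".join(kept)
```

Calling `strip_boilerplate(page_html)` on a fetched page returns the surviving blocks joined by blank lines. The thresholds are guesses and would need tuning per site; a real library uses far more robust features than this.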

On the same topic, there is a really fantastic overview of web scraping here.
