When scraping, removing boilerplate is job #1. Boilerpipe is one of several libraries that do this: it provides a sort of "de-boilerplating" step that strips navigation, ads, and other page furniture, leaving the main text. (Treating boilerplate removal as its own step is probably a really good way of looking at things.)
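To make the idea of a "de-boilerplating" step concrete, here is a minimal sketch in Python using only the standard library. It is not Boilerpipe's actual algorithm or API. It is a simplified heuristic that assumes boilerplate blocks tend to be short and link-heavy. The names `BlockCollector` and `strip_boilerplate`, and the thresholds `min_words` and `max_link_density`, are all made up for illustration.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries (an assumption, not an exhaustive list).
BLOCK_TAGS = {"p", "div", "li", "ul", "ol", "td", "tr", "table", "br",
              "h1", "h2", "h3", "section", "article", "nav", "footer", "header"}
# Tags whose contents are never visible text.
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Split a page into text blocks and count the link text in each."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # list of (text, link_chars) tuples
        self._text = []         # text fragments of the current block
        self._link_chars = 0    # characters of the current block inside <a>
        self._in_link = 0
        self._skip = 0

    def _flush(self):
        # Close the current block, normalizing whitespace.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())

    def close(self):
        super().close()
        self._flush()


def strip_boilerplate(html, min_words=10, max_link_density=0.3):
    """Keep blocks that are long enough and not dominated by link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    kept = []
    for text, link_chars in parser.blocks:
        link_density = link_chars / len(text)
        if len(text.split()) >= min_words and link_density <= max_link_density:
            kept.append(text)
    return "\n".join(kept)


if __name__ == "__main__":
    page = (
        '<ul><li><a href="/">Home</a></li><li><a href="/about">About</a></li></ul>'
        "<p>When scraping a page, the first task is to separate the article text "
        "from the navigation, ads, and footers that surround it.</p>"
        '<div>Subscribe to: <a href="/feed">Post Comments (Atom)</a></div>'
    )
    print(strip_boilerplate(page))
```

The two thresholds are guesses and would need tuning per site. A real library such as Boilerpipe is the better choice for anything beyond a quick experiment.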
On the same topic, there is a really fantastic overview of web scraping here.
Sunday, December 9, 2012