Saturday, February 5, 2011

Scraping

I have a spider app in mind. I was DuckDuckGoing on Toonbots, for whatever reason, and ran across a couple of links to "Toonbots forum, being able to avoid spammers and trolls and whatnot." I vaguely recall that conversation.

Well, it turns out that that sentence was incorporated into a whole series of spammy landing pages inserted all over the web, pointing back to e-loan.expert.com via Javascript redirect. That was a couple of weeks ago, so many of these are getting rolled back and fixed, but ... it would be absolutely fascinating to make it a statistical project.

I know, I know, I'm a sucker for Web spidering, too. Sigh.

This is how it would work:
  • Seed it with one or more of the target sentences.
  • Google a sentence.
  • Try to find its actual origin.
  • Store all the other URLs and text; break the text up into sentences.
  • Spread from there.
  • Index all the Javascript, in case techniques varied.
  • Try to categorize the type of host (different break-in techniques were probably in use).
  • Track down everybody making this possible, and fix each and every one of them. Oy.
Wouldn't that be cool?
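
Here is a rough sketch of the "break into sentences and spread from there" loop, just to see how small the core would be. Everything in it is an assumption on my part: `search` and `fetch_text` are hypothetical callables you would have to supply (a search-engine lookup and a page fetcher), and the sentence splitter is deliberately naive. Finding the actual origin and categorizing hosts are left out entirely.

```python
import re
from collections import deque

# Naive rule: a sentence ends at ., ! or ? followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Break page text into sentences (crude; abbreviations will fool it)."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def spread(seeds, search, fetch_text, max_sentences=100):
    """Breadth-first spread outward from the seed sentences.

    search(sentence) -> iterable of URLs   (hypothetical, caller-supplied)
    fetch_text(url)  -> visible page text  (hypothetical, caller-supplied)
    Returns {url: [sentences found on that page]}.
    """
    queue = deque(seeds)
    seen_sentences = set(seeds)
    pages = {}
    processed = 0
    while queue and processed < max_sentences:
        sentence = queue.popleft()
        processed += 1
        for url in search(sentence):
            if url in pages:
                continue  # already stored this page
            sentences = split_sentences(fetch_text(url))
            pages[url] = sentences
            # Every new sentence becomes a future search query.
            for s in sentences:
                if s not in seen_sentences:
                    seen_sentences.add(s)
                    queue.append(s)
    return pages
```

The `max_sentences` cap is there because every page yields more sentences to search for, so without a limit this would happily try to crawl the whole web.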

Seed sentences:
  • Toonbots forum, being able to avoid spammers and trolls and whatnot.
  • The world is awash in fast money, he said, and it is changing the structure of capital markets.
  • When you are pre-approved by spruce mortgage, you will have access to hundreds of loan programs.
  • Many sellers would rather have a monthly check than a lump sum settlement when they sell.
Example page: http://cas.ncat.edu/Departments/dance/js/dojo/pbs/ohioautotax.html - hasn't been cleaned up yet! Note that it's inside a Javascript directory for something. This is the kind of thing it would be cool to track down.
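
For the "index all the Javascript" step, something like the following could pull the scripts out of a fetched page and flag the ones that look like redirects. This is only a sketch under my own assumptions: I haven't catalogued what the redirect code on these pages actually looks like, so the `REDIRECT_HINT` pattern is a guess at the common `location = ...` and `location.replace(...)` forms, and the names `ScriptCollector` and `index_scripts` are made up for illustration.

```python
import re
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect inline script bodies and external script URLs from a page."""

    def __init__(self):
        super().__init__()
        self.inline = []     # text of inline <script> blocks
        self.external = []   # values of <script src="...">
        self._buffer = None  # list of text chunks while inside a <script>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._buffer = []
            src = dict(attrs).get("src")
            if src:
                self.external.append(src)

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            body = "".join(self._buffer).strip()
            if body:
                self.inline.append(body)
            self._buffer = None


# Guessed pattern: assignment to location / location.href, or location.replace(...).
REDIRECT_HINT = re.compile(
    r"location(?:\.href)?\s*=(?!=)|location\.replace\s*\("
)


def index_scripts(html):
    """Return the page's scripts plus the inline ones that look like redirects."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    return {
        "inline": collector.inline,
        "external": collector.external,
        "redirects": [s for s in collector.inline if REDIRECT_HINT.search(s)],
    }
```

A regex like this will miss obfuscated redirects (and those seem likely on spam pages), which is exactly why storing every script, not just the flagged ones, would matter for comparing techniques later.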

Update 11/19/11: That particular link has been cleaned up, but the network as a whole is still there and still forwarding to the same ... actually, just a very similar site. Of course, there may have been multiple sites all along. So this project is still waiting to be done.
