The ClueWeb09 dataset is a collection of about a billion Web pages in 10 languages, crawled in 2009 for research in natural language processing. You can host it yourself if you buy a 2 TB hard drive for $600, or you can use their APIs to do your research.
The license essentially requires you not to republish the content, and to delete anything that anybody asks them to delete from the repository, both of which make eminent sense.