Saturday, November 6, 2010

NLP link dump

So there's an NLP challenge: find "semantically related terms" over a large vocabulary (> 1 million words) given a large corpus, in a reasonable amount of computer time and with little or no human input. Following its links on how the corpus was prepared, I ran across splitta (a sentence-boundary detector) that will be very useful in the Xlat project, a Penn Treebank tokenizer sed script, and the Topia term extractor.
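To see why a dedicated tool like splitta earns its keep, here's a naive regex baseline for sentence splitting (a sketch of my own, not splitta's approach; the function name is made up). splitta uses trained models precisely because this kind of rule falls over on abbreviations.

```python
import re

def naive_split(text):
    """Naive sentence splitter: break after ., !, or ? when followed by
    whitespace and an uppercase letter. Fails on abbreviations like
    "Dr. Smith" or "e.g. this" -- which is exactly what splitta's
    supervised models are for."""
    return re.split(r'(?<=[.!?])\s+(?=[A-Z])', text.strip())

print(naive_split("It works. Mostly! Does it? Yes."))
```

Good enough for eyeballing clean prose, hopeless on real-world text with initials and abbreviations.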

All of these will be essential parts of the eventual front end to OpenLogos, for example. I urgently need a way to find significant terms in a source text in order to facilitate early glossary work (ideally, doing the glossary work before starting the translation). This is pretty critical.
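The frequency backbone of that kind of term finding fits in a few lines (a rough sketch of my own; the stopword list and function name are invented for illustration). A real extractor like Topia's also uses part-of-speech patterns to keep multi-word noun phrases together, which raw counts can't do.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much larger.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "for", "on", "that"}

def significant_terms(text, n=5):
    """Rank candidate glossary terms by raw frequency, dropping stopwords
    and very short tokens. Only the crudest baseline for glossary work."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return counts.most_common(n)
```

Running this over a source document before translation at least surfaces the repeated domain vocabulary that a glossary has to pin down.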

I might actually tackle the NLP challenge as well, but honestly, all this stuff is mind-meltingly fascinating to me. I need to get serious.
