- A list of R packages for NLP, with an intriguing link to Weka, a set of Java implementations of data mining algorithms.
- StackOverflow reference to NLTK and n-gram extraction.
- A note on "PMI" (pointwise mutual information), cited in the SO link.
- Lucene is Apache's text search and indexing library; there is a PyLucene as well. But honestly I think I'm going to have to get my hands dirty in Java, because Java seems inordinately popular in the NLP field.
- JCC is a code generator developed for use in PyLucene.
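The PMI measure mentioned in that SO thread is simple to state: PMI(x, y) = log2(P(x, y) / (P(x) P(y))), i.e. how much more often two words co-occur than chance would predict. A minimal Python sketch (the toy corpus and counts here are mine, not from the link):

```python
import math
from collections import Counter

def pmi(bigram, unigram_counts, bigram_counts, total):
    """Pointwise mutual information of a word pair:
    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) )."""
    x, y = bigram
    p_xy = bigram_counts[bigram] / total
    p_x = unigram_counts[x] / total
    p_y = unigram_counts[y] / total
    return math.log2(p_xy / (p_x * p_y))

words = "new york is a new city in new york state".split()
unigrams = Counter(words)
bigrams = Counter(zip(words, words[1:]))
n = len(words)
print(pmi(("new", "york"), unigrams, bigrams, n))
```

The usual caveat applies: raw PMI overrates rare pairs (a bigram seen once between two words seen once gets a huge score), which is why collocation extractors typically filter by a minimum frequency first.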
OK, so I got Text::Aspell working on Windows with MinGW32, which was no trivial task: in the end it just required some library setup and a small change to the module's installation script, but it took a full day to figure that out. With that done, I got down to the business of building word lists and checking them. This works reasonably well, except that it immediately became obvious that I need a better tokenizer.
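The word-list-and-check loop itself is simple; here is the shape of it as a Python sketch rather than Perl with Text::Aspell (the tiny hardcoded dictionary is a stand-in for a real Aspell session):

```python
import re
from collections import Counter

def build_wordlist(text):
    """Crude word list: lowercase tokens with frequencies.
    (This naive regex split is exactly the kind of tokenizer
    that turns out to be inadequate in practice.)"""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def unknown_words(wordlist, dictionary):
    """Words the dictionary doesn't recognize, most frequent first --
    candidates for review or correction."""
    return sorted((w for w in wordlist if w not in dictionary),
                  key=lambda w: -wordlist[w])

dictionary = {"the", "cat", "sat", "on", "mat"}   # stand-in for Aspell
text = "The cat sat on teh mat. The mat, teh cat."
wl = build_wordlist(text)
print(unknown_words(wl, dictionary))   # ["teh"]
```

Sorting the unknowns by frequency matters: in a real document, the most frequent "misspellings" are usually domain terms that belong in a custom word list, not errors.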
While prowling the Net for English tokenizers in Perl (note: there aren't any good ones - yet), I found:
- Europarl sentence splitter
- Europarl tokenizer
- A post on sentence splitting options (2007)
- Lingua::EN::Sentence, Text::Sentence
- A tech report on tokenizing for biomedical text indexing
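Most of the sentence splitters above boil down to the same heuristic: break on sentence-final punctuation followed by a capitalized word, unless the period belongs to an abbreviation. A naive Python sketch of that idea (the abbreviation list is illustrative, nowhere near complete):

```python
import re

# A few common abbreviations that shouldn't end a sentence (not exhaustive).
ABBREVS = {"dr", "mr", "mrs", "ms", "prof", "etc", "e.g", "i.e", "vs"}

def split_sentences(text):
    """Naive splitter: break after . ! ? followed by whitespace and a
    capital letter, unless the period belongs to a known abbreviation."""
    sentences, start = [], 0
    for m in re.finditer(r"[.!?]+\s+(?=[A-Z])", text):
        # the word immediately before the candidate break
        token = (text[start:m.start()].rsplit(None, 1) or [""])[-1]
        if token.rstrip(".").lower() in ABBREVS:
            continue
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    if start < len(text):
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Dr. Smith arrived. He was late. See p. 5 for details."))
```

This already shows why the tech report on biomedical text is relevant: "p. 5", "Fig. 3", and chemical names with internal periods all defeat the simple rule, and the abbreviation list becomes the real maintenance burden.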
In the end, I resolved to write a proper tokenizer for English. It would recognize certain entities embedded in the text (URLs, numbers, numbers with units, some abbreviations, possibly chemical names, and IDs made of capital letters, digits, dashes, and the like), mark punctuation as such, and attempt to deal with quotes in a sane way, including distinguishing between quotes and apostrophes. The result would be an arrayref of words and arrayrefs, the same structure the other Decl tokenizers produce, and it should work fine not only for English but for the other European languages as well. I don't yet know enough about non-European languages to say how well I can tokenize them, beyond knowing that it's non-trivial to tokenize Chinese and Japanese, and presumably Korean.
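A rough sketch of that entity-aware approach, in Python rather than the eventual Perl: an ordered set of named patterns where the first match wins, so URLs and IDs are claimed before the generic word rule can chop them up. The pattern classes here are illustrative, not the final design:

```python
import re

# Ordered token patterns: first match wins, so specific entities
# (URLs, IDs, numbers with units) are claimed before generic words.
TOKEN_SPEC = [
    ("URL",    r"https?://[\w./-]+"),
    ("ID",     r"[A-Z0-9]+(?:-[A-Z0-9]+)+"),             # e.g. AB-12
    ("NUMBER", r"\d+(?:\.\d+)?(?:\s?(?:km|kg|mm|ms))?"),  # optional unit
    ("WORD",   r"[A-Za-z]+(?:'[A-Za-z]+)?"),   # keeps the apostrophe in don't
    ("QUOTE",  r"[\"“”]"),
    ("PUNCT",  r"[^\w\s]"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(text):
    """Return a list of (type, text) pairs -- the moral equivalent of the
    arrayref-of-words-and-arrayrefs structure described above."""
    return [(m.lastgroup, m.group()) for m in MASTER.finditer(text)]

print(tokenize('She said "it\'s 5 km to http://example.com" twice.'))
```

Note how the WORD pattern keeps "it's" intact while the QUOTE class catches the surrounding double quotes; that is the quote-versus-apostrophe distinction in its simplest form.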
Finally, searching CPAN for NLP, I found an interface to the Stanford parser (along with a link on how to use it in Java as part of this book).
It's looking like I'll end up with classes in Xlat for basic translation tool functionality (Xlat::Wordlist and Xlat::Speller), along with probably NLP::Tokenizer if I end up making that a HOP-based pure Perl endeavor.
Update: as always, note that using the deadline of a paying job (in this case an editing job) to create urgency for basic research is generally going to lead to sleep deprivation and depression. I've switched to doing this job manually like a schlub, but hopefully the remembered urgency will carry me through the next development cycle on this. I gained some good insights.