Sunday, March 18, 2012

NLP class assignment 1

The first assignment for the Coursera/Stanford NLP class was a regexp approach to finding email addresses and phone numbers in a set of faculty pages from the Stanford CS department.

I ended up doing a first pass with a list of regexps, then some post-processing afterwards. As is always the case, hitting all their test cases took a lot of fiddly special-casing. The most irritating case was Jurafsky's own email address, which appears in the page only as a JavaScript call. As I had no particular intention of embedding SpiderMonkey in the Python code, this had to be entirely special-cased, although a sane system design would have put this into the page retrieval spider, not in a regexp-based recognizer.
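A minimal sketch of that two-stage structure might look like the following. The patterns, the function name find_contacts, and the output format are illustrative assumptions, not the code actually submitted, and they cover only the simplest cases.

```python
import re

# Hypothetical first-pass patterns; each email pattern captures (user, host).
EMAIL_PATTERNS = [
    # Plain form: user@host.edu
    re.compile(r"([\w.-]+)\s*@\s*([\w-]+(?:\.[\w-]+)*)\.edu", re.I),
    # Spelled-out form: user at host dot edu
    re.compile(r"([\w.-]+)\s+at\s+([\w-]+(?:\s+dot\s+[\w-]+)*)\s+dot\s+edu", re.I),
]

# Hypothetical phone pattern: (650) 555-0100, 650-555-0100, 650.555.0100
PHONE_PATTERN = re.compile(r"\(?(\d{3})\)?[\s.-]*(\d{3})[\s.-]+(\d{4})")


def find_contacts(text):
    """Return a list of (kind, value) pairs found in text."""
    found = []
    for pattern in EMAIL_PATTERNS:
        for user, host in pattern.findall(text):
            # Post-processing step: rewrite " dot " spellings as real dots.
            host = re.sub(r"\s+dot\s+", ".", host, flags=re.I)
            found.append(("email", f"{user}@{host}.edu".lower()))
    for area, prefix, line in PHONE_PATTERN.findall(text):
        # Post-processing step: emit one canonical phone format.
        found.append(("phone", f"{area}-{prefix}-{line}"))
    return found
```

Calling find_contacts("jdoe at cs dot stanford dot edu") yields a single normalized email entry. Every new obfuscation style needs either another pattern or another post-processing rule, which is where the special-casing piles up.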

But that's neither here nor there (just low-level irritation) - the real point of the assignment was of course both to provide a little exposure to regular expressions and to make it clear that natural language is utterly rife with special cases, and it succeeded on both counts.

It did lead me to note that splitting the logic between regular expressions and a separate post-processing step hurts code understandability and maintainability. It would probably be better to have a more powerful text processing language, perhaps explicitly based on a parser. And it should work from a real tokenizer to defeat most HTML obfuscation - for example, the simple trick of having DIVs break up your email address would have made this code useless. (Which is why it irritated me to have the JavaScript instance in there.)
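As a rough illustration of the tokenizer idea, here is a sketch that uses the standard library's html.parser to reduce a page to its text before any pattern matching. The class name TextExtractor and the function visible_text are hypothetical, and joining text chunks with no separator is a deliberate simplification: it rejoins an address split by tags, but it would also glue together unrelated words from adjacent elements.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, ignoring tags and script/style bodies."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &#64; into "@".
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is not inside a script or style element.
        if not self._skip_depth:
            self.chunks.append(data)


def visible_text(html):
    """Return the text of an HTML fragment with all markup removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)
```

With this in front of the regexps, an address written as jdoe<div></div>@cs.<span>stanford</span>.edu comes out as ordinary text. It does nothing for addresses assembled by JavaScript, which would still need handling at the page retrieval stage.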

Anyway, it was fun.
