Monday, December 5, 2011

N-grams

So I finally wrote some code into NLP::Tokenizer to pull out n-grams, and then ran across this article about using bilingual n-grams in translation. (It's actually a set of slides, not an article, but whatever.) It blew my mind for a simple reason: it assumes that n-gram alignment between languages even makes sense. Fine, if you're restricted to English and French, as the slides are, then you might be OK. But German? Hungarian? These guys aren't translators.

So anyway, I ran the n-gram extractor on a rather large German corpus pulled from some HTML files and, honestly, I couldn't see much of a way to use the results. I'm thinking that something more like a Markov network, followed by identifying frames (or whatever you want to call them), would be more useful.
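
For the curious, here is roughly what the two ideas look like side by side. This is a minimal Python sketch, not the NLP::Tokenizer code: the function names (tokenize, ngrams, transition_probs), the regex tokenizer, and the toy German sentence are all made up for illustration. The second half is a plain first-order Markov chain over words, which is a simpler thing than a full Markov network, but it shows the shift from counting fixed windows to asking what tends to follow what.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Naive tokenizer (an assumption): lowercase runs of word characters.
    # \w is Unicode-aware in Python 3, so umlauts and ß are kept.
    return re.findall(r"\w+", text.lower())


def ngrams(tokens, n):
    # Slide a window of width n across the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def transition_probs(tokens):
    # First-order Markov chain: estimate P(next word | current word)
    # from adjacent-pair counts.
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    probs = {}
    for prev, followers in counts.items():
        total = sum(followers.values())
        probs[prev] = {word: c / total for word, c in followers.items()}
    return probs


# Toy sentence, purely illustrative (not from the corpus in the post).
text = "der Hund sieht die Katze und die Katze sieht den Hund"
tokens = tokenize(text)

bigram_counts = Counter(ngrams(tokens, 2))
probs = transition_probs(tokens)

print(bigram_counts.most_common(3))
print(probs["sieht"])
```

The bigram counts are just a flat frequency table, which is about where I got stuck. The transition table at least gives you something to walk through, and that seems like a more natural starting point for finding recurring patterns.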

Not sure yet, but later this month I want to spend some time finding out.
