So I wrote some code (finally!) into NLP::Tokenizer to pull out n-grams, and then ran across this article about using bilingual n-grams in translation. It blew my mind, for a simple reason: it assumes that n-gram alignment between languages even makes sense. Fine, I guess if you're restricted to English and French, like the article (actually a set of slides, not an article, but whatever), then you might be OK. But German? Hungarian? These guys aren't translators.
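
For anyone curious, the extraction step itself is nothing fancy; here's a minimal standalone sketch, with the tokenizer faked by a whitespace split rather than the actual NLP::Tokenizer interface, and a toy English sentence standing in for real input:

    use strict;
    use warnings;

    # Sliding-window n-gram extraction over a token list.
    sub ngrams {
        my ($n, @tokens) = @_;
        my @grams;
        for my $i (0 .. $#tokens - $n + 1) {
            push @grams, join ' ', @tokens[$i .. $i + $n - 1];
        }
        return @grams;
    }

    # Stand-in for the real tokenizer: a plain whitespace split.
    my @tokens = split /\s+/, "the quick brown fox jumps over the lazy dog";

    # Count trigram frequencies and print the most common first.
    my %count;
    $count{$_}++ for ngrams(3, @tokens);

    printf "%3d  %s\n", $count{$_}, $_
        for sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;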
So anyway, I ran the n-gram extractor on a rather large German corpus extracted from some HTML files, and ... honestly, I couldn't see much of a way to use the results. My thinking now is that something more like a Markov network, with subsequent identification of ... frames, or whatever you want to call them, would be more useful. Not sure yet, but later this month I want to spend some time finding out.
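
To make the Markov idea a bit more concrete, the obvious first step is just conditional transition counts over adjacent tokens. A sketch of that step (the token list here is a made-up toy sentence, not the German corpus):

    use strict;
    use warnings;

    # Bigram transitions: for each token, how often each successor follows it.
    # Toy token list; in practice this would be the tokenized corpus.
    my @tokens = qw(der hund sieht den mann der mann sieht den hund);

    my %next;    # $next{$w}{$v} = number of times $v follows $w
    for my $i (0 .. $#tokens - 1) {
        $next{ $tokens[$i] }{ $tokens[$i + 1] }++;
    }

    # Normalize the counts into conditional probabilities P(v | w).
    for my $w (sort keys %next) {
        my $total = 0;
        $total += $_ for values %{ $next{$w} };
        printf "P(%s | %s) = %.2f\n", $_, $w, $next{$w}{$_} / $total
            for sort keys %{ $next{$w} };
    }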