Sunday, November 20, 2011

(Natural) language recognition

Recognizing the probable language of a text is tricky, of course, but there are a number of different ways to get a "pretty good" estimate that don't come down to just spell checking the words. (After all, until you know the language, it's not possible to get all the word boundaries right, although you're still going to see most of them. Except in Chinese and Japanese.)

Wikipedia has a really nice and thorough language recognition chart. It would be nice to put that into a Perl module. The Wikipedia page also lists a couple of additional leads that are kind of neat:
  • Translated online guesser - uses a vector space model
  • Huh. The other two links are dead. That's a shame - but it may be worth following up on them at a later date.
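To make the chart idea concrete, here is a minimal sketch of a chart-style guesser: each language gets a few distinctive characters and very common words, and the text is scored by how many of them it contains. It's in Python rather than Perl purely for illustration. The `HINTS` table and the `guess_language` function are my own hypothetical names, and the table is a tiny hand-picked sample, not the contents of the Wikipedia chart.

```python
import re
from collections import Counter

# Hypothetical, hand-picked hints for illustration only.
# This is NOT the Wikipedia chart; a real table would be far larger.
HINTS = {
    "English": {"chars": "", "words": {"the", "and", "of", "is"}},
    "German": {"chars": "äöüß", "words": {"der", "die", "und", "ist"}},
    "Spanish": {"chars": "ñ¿¡", "words": {"el", "los", "y", "es"}},
    "French": {"chars": "çèêœ", "words": {"le", "les", "et", "est"}},
}


def guess_language(text):
    """Return the best-scoring language name, or None if nothing matches."""
    lowered = text.lower()
    words = re.findall(r"\w+", lowered)
    scores = Counter()
    for language, hints in HINTS.items():
        # One point per occurrence of a common word.
        scores[language] += sum(1 for word in words if word in hints["words"])
        # One point per occurrence of a distinctive character.
        scores[language] += sum(1 for char in lowered if char in hints["chars"])
    best, score = max(scores.items(), key=lambda item: item[1])
    return best if score > 0 else None


if __name__ == "__main__":
    print(guess_language("Der Hund ist groß und die Katze ist klein"))
```

Equal weights for words and characters are an arbitrary choice here; a real module would probably want to weight rare, highly distinctive characters more heavily than common short words.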
Perl already has Lingua::Identify (0.30 in 2011), but I don't know how accurate it is or how broad its coverage is. It's definitely worth looking at, though. There's also a statistical approach in Lingua::Ident (1.7 in 2010).
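
For a feel of what a statistical, vector-space style guesser might look like, here is a rough sketch using character trigram counts and cosine similarity. This is a generic technique, and I'm not claiming it is what Lingua::Ident, Lingua::Identify, or the online guesser actually do. The `SAMPLES` text, `trigram_vector`, `cosine`, and `rank_languages` are all made up for the example, and the training samples are far too small for real use.

```python
import math
from collections import Counter

# Tiny made-up training samples (an assumption for this sketch).
# Real language profiles would be built from much more text.
SAMPLES = {
    "English": "the quick brown fox jumps over the lazy dog and runs into the woods",
    "German": "der schnelle braune fuchs springt über den faulen hund und läuft in den wald",
    "Spanish": "el rápido zorro marrón salta sobre el perro perezoso y corre hacia el bosque",
}


def trigram_vector(text):
    """Count character trigrams, padding with spaces so word edges count."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


PROFILES = {language: trigram_vector(sample) for language, sample in SAMPLES.items()}


def rank_languages(text):
    """Return (language, similarity) pairs, most similar first."""
    vector = trigram_vector(text)
    scores = {language: cosine(vector, profile) for language, profile in PROFILES.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for language, score in rank_languages("the dog runs over the fox"):
        print(f"{language}: {score:.3f}")
```

Returning the whole ranking instead of a single answer makes it easy to see when two languages score almost the same, which is exactly when the guess shouldn't be trusted.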
