Tuesday, April 3, 2012

Frustration with statistical methods

The "custom language model" section of HW2 is impossible. I implemented a hacked-together Kneser-Ney smoothing (perhaps the corpus is too small, but the symptom was lots of zeroes), a trigram-to-bigram backoff, and linear interpolation, and all of them failed. Linear interpolation of trigrams, bigrams, and unigrams failed most miserably of all.
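For reference, the linear-interpolation scheme I tried looks roughly like this sketch. The lambda weights and the count-table names are placeholders for illustration, not my actual values:

```python
from collections import defaultdict

def interp_prob(w3, w2, w1, tri, bi, uni, total, lambdas=(0.6, 0.3, 0.1)):
    """Linearly interpolated trigram probability:
    P(w3 | w1, w2) = l1*P_tri + l2*P_bi + l3*P_uni.
    tri/bi/uni are defaultdict(int) count tables, so unseen
    n-grams simply contribute zero to their term."""
    l1, l2, l3 = lambdas
    p_tri = tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    p_bi = bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0
    p_uni = uni[w3] / total
    return l1 * p_tri + l2 * p_bi + l3 * p_uni
```

With the lambdas summing to one, the result stays a valid probability; the trouble is that tuning those three weights by hand is exactly the kind of coefficient-twiddling complained about below.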

I dunno. Clearly people are making these things work, but I get the impression that it's mostly throwing a bucket of tacks at the problem and hoping something will stick.

Update: some perusing of the forum led me to the realization that I wasn't testing trigrams correctly. Looking at only the first trigram in a sentence gave me "[s] One two", so that any spelling error beyond "One" would be lost. Once that was fixed, trigram double-backoff worked as well as bigram backoff. A little twiddling with the backoff coefficients got me slightly better performance than my original bigram backoff with a coefficient of 0.4.
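The fix amounts to sliding the trigram window across the whole sentence rather than scoring only the first window. A minimal sketch of that double-backoff scoring, assuming "stupid backoff"-style penalties and an add-one unigram floor (both my assumptions, not the exact homework recipe):

```python
import math
from collections import defaultdict

def sentence_logscore(words, tri, bi, uni, total, alpha=0.4):
    """Score a whole sentence by sliding a trigram window across it,
    backing off trigram -> bigram -> unigram with penalty alpha at
    each step. (The bug was scoring only the first window, which
    misses errors later in the sentence.)"""
    pad = ["<s>", "<s>"] + words + ["</s>"]
    logp = 0.0
    for w1, w2, w3 in zip(pad, pad[1:], pad[2:]):
        if tri[(w1, w2, w3)] > 0:
            s = tri[(w1, w2, w3)] / bi[(w1, w2)]
        elif bi[(w2, w3)] > 0:
            s = alpha * bi[(w2, w3)] / uni[w2]
        else:
            # add-one floor so unseen words never score zero
            s = alpha * alpha * (uni[w3] + 1) / (total + len(uni))
        logp += math.log(s)
    return logp
```

A sentence containing a misspelling falls through to the penalized unigram floor at that window, so the corrupted candidate scores lower than the clean one, which is what the spelling-correction task needs.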

Moral of the story: the choice of backoff coefficient makes a difference. Which is why I hate statistical approaches.
