Saturday, January 5, 2013

Norvig

So Peter Norvig has a blog, and uses it very interestingly indeed.  Here he is opining on Chomsky (not that he's wrong, but I cannot believe the words coming out of my mouth: I agree with Chomsky [my disbelief is not due to Chomsky's politics, which I largely agree with, but to his linguistics, which I largely don't]).

Where I think Chomsky is right is that right now, statistical techniques dominate the field.  It's misguided; we're describing the how and not the why of human language.  Great in terms of engineering, no doubt, but ... empty.  Devoid of semantics.  Norvig reacts with no little acerbity, which I find misplaced.  But it had to sting.

That said, the post is a good one, and well worth reading.

And then he tops it off today with a re-do of Mark Mayzner's work from the '60s - in response to a letter from Mayzner, no less!  Mayzner did some frequency counting using Hollerith cards and an IBM sorter, working with a randomly selected 20,000-word corpus, and wondered if a larger corpus, such as, oh, Google Books, might show different results.
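
For a sense of what that kind of letter counting amounts to without the cards and the sorter, here is a minimal Python sketch.  It is not Mayzner's or Norvig's actual procedure; the function name `letter_frequencies` and the sample sentence are my own placeholders, and it only looks at the letters A-Z.

```python
from collections import Counter

def letter_frequencies(text):
    """Return {letter: relative frequency} over the A-Z letters in text."""
    # Fold to upper case and ignore everything that is not A-Z.
    letters = [ch for ch in text.upper() if "A" <= ch <= "Z"]
    counts = Counter(letters)
    total = len(letters)
    if total == 0:
        return {}
    return {letter: n / total for letter, n in counts.items()}

# Placeholder sample text (an assumption, not either author's corpus).
sample = "the quick brown fox jumps over the lazy dog"
freqs = letter_frequencies(sample)

# Print the five most frequent letters in the sample.
for letter, freq in sorted(freqs.items(), key=lambda kv: -kv[1])[:5]:
    print(letter, round(freq, 3))
```

With a 20,000-word sample this runs instantly; the interesting part is only how the proportions shift as the corpus grows.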

Interestingly, it does!  Norvig's corpus, 37,000,000 times the size of Mayzner's, has slightly different letter frequencies, which I find pretty fascinating.  If punched on Hollerith cards, it would fill up NASA's Vehicle Assembly Building to the 2/3 point, and a single pass through an IBM card sorter of the model Mayzner used would take only 700 years.

It would be interesting to build a lorem-ipsum generator that used a tuned Markov chain to return a sample text that exactly matched Norvig's statistics.  That would be neat.
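
A rough sketch of the idea in Python, under some loud assumptions: it is a first-order, letter-level chain trained on a tiny placeholder string rather than on Norvig's tables, and the names `train_bigram_model` and `generate` are made up for illustration.  A chain like this only approaches the source statistics on average over long outputs, so "exactly" would take more tuning than is shown here.

```python
import random
from collections import Counter, defaultdict

def train_bigram_model(text):
    """Count, for each character, which characters follow it in text."""
    # Keep only letters and spaces, folded to lower case.
    chars = [ch for ch in text.lower() if ch.isalpha() or ch == " "]
    model = defaultdict(Counter)
    for current, following in zip(chars, chars[1:]):
        model[current][following] += 1
    return model

def generate(model, length, rng):
    """Walk the chain for `length` characters (length >= 1)."""
    state = rng.choice(sorted(model))
    out = [state]
    while len(out) < length:
        followers = model.get(state)
        if not followers:
            # Dead end (character never seen with a successor): restart.
            state = rng.choice(sorted(model))
        else:
            # Pick the next character in proportion to observed counts.
            state = rng.choices(list(followers),
                                weights=list(followers.values()), k=1)[0]
        out.append(state)
    return "".join(out)

# Placeholder training text (an assumption, standing in for real counts).
training = "statistical techniques dominate the field of language right now"
model = train_bigram_model(training)
print(generate(model, 60, random.Random(0)))
```

To aim at Norvig's numbers, the counts in `model` would be loaded from his published bigram tables instead of being trained on a sample sentence, and a higher-order chain would get closer still.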
