So Peter Norvig has a blog, and uses it very interestingly indeed. Here he is opining on Chomsky. Not that he's wrong, but I cannot believe the words coming out of my mouth: I agree with Chomsky. (My disbelief isn't due to Chomsky's politics, which I largely agree with, but to his linguistics, which I don't.)
Where I think Chomsky is right is that right now, statistical techniques dominate the field. It's misguided; we're describing the how and not the why of human language. Great in terms of engineering, no doubt, but ... empty. Devoid of semantics. Norvig reacts with no little acerbity, which I find misplaced. But it had to sting.
That said, the post is a good one, and well worth reading.
And then he tops it off today with a re-do of Mark Mayzner's work from the 1960s - in response to a letter from Mayzner, no less! Mayzner did some frequency counting using Hollerith cards and an IBM sorter, working with a randomly selected 20,000-word corpus, and wondered if a larger corpus, such as, oh, Google Books, might show different results.
Interestingly, it does! Norvig's corpus, 37,000,000 times the size of Mayzner's, has slightly different letter frequencies, which I find pretty fascinating. If punched on Hollerith cards, it would fill up NASA's Vehicle Assembly Building to the 2/3 point, and a single pass through an IBM card sorter of the model Mayzner used would take only 700 years.
It would be interesting to build a lorem-ipsum generator that used a tuned Markov chain to return a sample text that exactly matched Norvig's statistics. That would be neat.
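For the curious, here is a rough sketch of what the core of such a generator might look like in Python. It is only a first-order, letter-level Markov chain trained on whatever sample text you hand it, not on Norvig's actual tables, and the function names (`build_bigram_model`, `generate`) are my own inventions for illustration. A chain like this only approximates the source statistics over a long enough output; it won't match them exactly without further tuning.

```python
import random
from collections import Counter, defaultdict


def build_bigram_model(text):
    """Count, for each character, how often each character follows it.

    Only letters and spaces are kept, lowercased (an assumption of this
    sketch; real letter-frequency tables may treat case and spaces differently).
    """
    chars = [c for c in text.lower() if c.isalpha() or c == " "]
    model = defaultdict(Counter)
    for current, following in zip(chars, chars[1:]):
        model[current][following] += 1
    return model


def generate(model, length, seed=None):
    """Emit `length` characters by walking the chain, weighted by the counts."""
    if length <= 0 or not model:
        return ""
    rng = random.Random(seed)
    states = sorted(model)          # sorted so a given seed is reproducible
    state = rng.choice(states)
    out = [state]
    for _ in range(length - 1):
        followers = model.get(state)
        if not followers:
            # Dead end (character only seen at the very end): restart anywhere.
            state = rng.choice(states)
        else:
            options, weights = zip(*sorted(followers.items()))
            state = rng.choices(options, weights=weights)[0]
        out.append(state)
    return "".join(out)


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog"
    print(generate(build_bigram_model(sample), 60, seed=1))
```

To get closer to the idea above, the counts in the model would be replaced with published bigram (or longer n-gram) frequencies rather than counts from a toy sentence, and a higher-order chain would give more word-like output.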