Wednesday, October 6, 2010

Text analysis - identification of sources in news articles

So I'm taking this online class about journalism, and one of the exercises is to identify the sources in a news article. By hand, of course, this is easy. Wouldn't it be nice to automate it (even partially)?

Of course, nothing is easy where natural language is concerned. I see two clear parts to this. The first is taking a page and extracting the news item. Frankly, I don't see any better way to do this than to keep a set of definitions, one per news service, identifying the CSS classes each service uses to mark its payload text. This is exactly the kind of task that a pattern-matching language would be dandy for.
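To make the extraction idea concrete, here is a minimal sketch in Python using only the standard library's html.parser. Everything named here is an assumption for illustration: the hostnames, the CSS class names in PAYLOAD_CLASSES, and the helper extract_payload are all made up, and real pages with unclosed tags would need a more forgiving parser.

```python
from html.parser import HTMLParser

# Hypothetical per-service definitions: hostname -> CSS class marking the
# article body. These names are invented, not taken from real news sites.
PAYLOAD_CLASSES = {
    "news.example.com": "story-body",
    "wire.example.org": "article-text",
}

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PayloadExtractor(HTMLParser):
    """Collects the text found inside the first element carrying css_class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0    # greater than 0 while inside the payload element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        else:
            classes = (dict(attrs).get("class") or "").split()
            if self.css_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_payload(host, html):
    """Return the whitespace-normalized payload text for a known host."""
    parser = PayloadExtractor(PAYLOAD_CLASSES[host])
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())
```

Supporting a new service then means adding one line to the table rather than writing new code, which is the appeal of the definitions approach.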

That leaves us with the text and its analysis, which is hard. I can think of a couple of ways to get some sources out of a given text: "'...' said x" is one obvious pattern. Without language-savvy tools, it would be a series of hacks, but maybe worth the effort. (With language-savvy tools, a lot of this stuff starts to look more amenable to solution, though, doesn't it?)
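As a first hack at the "'...' said x" pattern, something like the following Python sketch might do. The assumptions are mine: a source is taken to be a run of capitalized words, only a few attribution verbs are recognized, and the function name find_sources is invented for this example. It will miss pronouns ("he said"), titles with periods ("Dr."), and indirect speech.

```python
import re

# A quotation in straight or curly double quotes.
QUOTE = r'["\u201c]([^"\u201d]+)["\u201d]'
# Assumption: a source is one or more capitalized words, e.g. "Jane Doe".
NAME = r"((?:[A-Z][\w'-]*)(?:\s+[A-Z][\w'-]*)*)"
# A small, deliberately incomplete list of attribution verbs.
VERBS = r"(?:said|says|added)"

# "..." said Jane Doe
QUOTE_THEN_VERB = re.compile(QUOTE + r",?\s+" + VERBS + r"\s+" + NAME)
# "..." Jane Doe said
QUOTE_THEN_NAME = re.compile(QUOTE + r",?\s+" + NAME + r"\s+" + VERBS)


def find_sources(text):
    """Return (source, quote) pairs in the order they appear in text."""
    hits = []
    for pattern in (QUOTE_THEN_VERB, QUOTE_THEN_NAME):
        for m in pattern.finditer(text):
            # Group 1 is the quote and group 2 the name in both patterns.
            hits.append((m.start(), m.group(2), m.group(1)))
    return [(name, quote.rstrip(",")) for _, name, quote in sorted(hits)]


sample = ('The plan drew criticism. "It will not work," said Jane Doe. '
          '"We are confident," Mayor John Smith said on Tuesday.')
for source, quote in find_sources(sample):
    print(source, "->", quote)
```

Each new phrasing needs its own pattern, which is exactly the "series of hacks" problem: the list grows without ever covering the language.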
