Sunday, September 28, 2014

Scratching the surface of German NLP, from ParZu down

Back in June, looking for parsers for the German language, I ran across ParZu, which is from the University of Zurich. Test sentences thrown against its online demo were parsed handily, and all in all it's a convincing parser, so I'm going to be working with it for a while to get a handle on things. It is written in Prolog.

For the past three days, I've gone down the rabbit hole of NLP tools for German, starting from ParZu. There is (of course) a vast amount of previous work, and it's really difficult to get a comprehensive grasp, but this post should at least link to some of it, with initial thoughts, and I can go from there later. I had considered writing an article, but honestly none of this is sufficiently coherent for an article. There's kind of a threshold of effort I expect from articles on the Vivtek site, and that's not there. Yet.

OK. So ParZu can work with any tool that delivers text in a tab-delimited format (token-tab-tag) using the STTS tagset (Stuttgart-Tübingen TagSet, if you were wondering). My Lex::DE can already be converted to generate some of these tags, so my best bet at the moment would simply be to continue work on Lex::DE and feed its output directly into ParZu. Even better, of course, would be to do this online by talking directly to Prolog, probably ideally through HTTP to avoid 32/64-bit process boundaries. More on this notion later. The cheap way to do this is just to kick out tagged text and go on.
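As a rough sketch of what "kicking out tagged text" might look like, here is a few lines of Python that write token-tab-tag lines. The function name, the hand-tagged sentence, and the blank line between sentences are all my assumptions for illustration; check what ParZu actually expects before relying on any of it.

```python
def to_tagged_lines(sentences):
    """Render sentences as token<TAB>tag lines, one token per line.

    Assumption: a blank line separates sentences.
    """
    lines = []
    for sentence in sentences:
        for token, tag in sentence:
            lines.append(f"{token}\t{tag}")
        lines.append("")  # assumed sentence boundary marker
    return "\n".join(lines)


# Hand-tagged example using STTS tags (not the output of any real tagger).
sentence = [
    ("Der", "ART"),
    ("Hund", "NN"),
    ("schläft", "VVFIN"),
    (".", "$."),
]

tagged = to_tagged_lines([sentence])
print(tagged)
```

Anything that can produce pairs like these could stand in for the tagger, which is the whole appeal of a format this simple.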

The output from ParZu uses the CoNLL format, which seems pretty straightforward.
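To show how straightforward, here is a minimal reader sketch. It assumes the ten-column, tab-separated CoNLL-X layout with blank lines between sentences; the sample rows and their dependency labels are hand-written for illustration and are not actual ParZu output.

```python
from collections import namedtuple

# Column names follow the CoNLL-X layout (an assumption about which
# CoNLL variant is in play).
CONLL_FIELDS = ["id", "form", "lemma", "cpostag", "postag",
                "feats", "head", "deprel", "phead", "pdeprel"]
ConllToken = namedtuple("ConllToken", CONLL_FIELDS)


def read_conll(text):
    """Split CoNLL text into sentences, each a list of ConllToken rows."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split("\t")
        # Pad short rows with "_" and drop any extra columns.
        cols = (cols + ["_"] * len(CONLL_FIELDS))[:len(CONLL_FIELDS)]
        current.append(ConllToken(*cols))
    if current:
        sentences.append(current)
    return sentences


# Hand-written sample, not real parser output.
sample = "\n".join([
    "1\tDer\tder\tART\tART\t_\t2\tdet\t_\t_",
    "2\tHund\tHund\tN\tNN\t_\t3\tsubj\t_\t_",
    "3\tschläft\tschlafen\tV\tVVFIN\t_\t0\troot\t_\t_",
    "",
])

sentences = read_conll(sample)
for tok in sentences[0]:
    # head is a 1-based index into the sentence; 0 means the root.
    head = "ROOT" if tok.head == "0" else sentences[0][int(tok.head) - 1].form
    print(f"{tok.form} --{tok.deprel}--> {head}")
```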

Which is all very nice and self-contained, but how do the Zürchers do their tagging? I'm glad you asked! The main tagger is clevertagger, which works on the output of Zmorge. Zmorge is the Zurich variant of SMOR, which is the Stuttgart morphological analyzer, although active development seems to have moved to Munich.

clevertagger has a statistical component that uses CRF (Conditional Random Field) training to judge, based on the Zmorge lemmatization output, which POS is most likely for each word given your corpus. You can use either Wapiti or CRF++. The point of doing this is to eliminate POS ambiguity (or to quantify it? but no, I think it's a disambiguation step), which is what I hope to use Marpa to do directly - instead of providing unambiguous parts of speech, with Marpa I'll be able to provide alternatives for a given word, and disambiguate after parsing. Well, that's the idea, anyway - but that's going to take some effort.
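To make the "alternatives first, disambiguate later" idea concrete, here is a toy sketch in Python. It has nothing to do with Marpa's actual interface: the tag alternatives are hand-written, and `is_plausible` is a hypothetical stand-in for a real grammar. A real parser would not enumerate readings by brute force like this; the point is only that ambiguity is kept until structure can decide.

```python
from itertools import product

# Hypothetical per-token STTS alternatives, hand-written for illustration.
lattice = [
    ("die", {"ART", "PRELS", "PDS"}),
    ("Frau", {"NN"}),
    ("lacht", {"VVFIN", "VVIMP"}),
]


def readings(lattice):
    """Yield every tag sequence the per-token alternatives allow."""
    tokens = [tok for tok, _ in lattice]
    for tags in product(*(sorted(alts) for _, alts in lattice)):
        yield list(zip(tokens, tags))


def is_plausible(reading):
    """Toy stand-in for a parser: accept only article, noun, finite verb."""
    return [tag for _, tag in reading] == ["ART", "NN", "VVFIN"]


all_readings = list(readings(lattice))
surviving = [r for r in all_readings if is_plausible(r)]

print(len(all_readings), "candidate readings")
print(surviving)
```

The tagger-first pipeline would have committed to one tag per word before the grammar ever saw the sentence; here the grammar gets to throw out the readings that don't parse.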

(Note, by the way, that since ParZu is coded in Prolog, I can probably cannibalize it relatively smoothly to convert to a Marpa grammar, so none of this effort will be lost even if I do switch to Marpa later.)

Anyway, the CRF thing leaves me relatively unexcited. It would be nice to take an aside and figure out just what the heck it's doing, but that's pretty low priority.

Zmorge is based (somehow) on a crawl of the Wiktionary lexicon for German, and uses a variant of SMOR, SMORlemma, for the meat of the processing. I'm unclear on exactly how this step is done, but I do know that SMOR has a lexicon that is read into the FST on a more-or-less one-to-one basis, so I presume that Zmorge is putting the Wiktionary data into that lexicon, and then using updated rules for the rest of the morphological analysis. It would take a little exegesis to confirm that supposition. Maybe later.

SMOR and SMORlemma are both written in SFST, a language specifically for defining FSTs, and just one example of the general kind. It's roughly a tool for writing very, very extensive regular expressions (well, that's nearly tautological, in a sense). There are other FST languages and toolkits originating in different lineages, including OpenFST (developed by Google Research and NYU), AFST (an SFST fork developed in Helsinki - notice that a lot of the original FST work in NLP was done in Helsinki), and the umbrella library that sort of combines all of the above and some other stuff as well, HFST (Helsinki Finite State Technology). Overall, there's been a lot of work in finite-state transducers for the processing of natural language.
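For anyone who hasn't met a transducer before, here is a deliberately tiny one in Python: it reads a surface form character by character and emits a lemma plus tags. The dictionary layout, the tag notation, and the three analyses of "Hund"/"Hunde" are all made up for illustration; none of it is SFST syntax or taken from SMOR.

```python
# A toy transducer: arcs map (state, input char) to a list of
# (output string, next state) pairs. Final states carry a final output.
toy_fst = {
    "start": 0,
    "arcs": {
        (0, "H"): [("H", 1)],
        (1, "u"): [("u", 2)],
        (2, "n"): [("n", 3)],
        (3, "d"): [("d", 4)],
        # Two arcs on the same input: the machine is non-deterministic,
        # which is how one surface form gets several analyses.
        (4, "e"): [("", 5), ("", 6)],
    },
    "finals": {
        4: "<NN><Nom><Sg>",
        5: "<NN><Nom><Pl>",
        6: "<NN><Dat><Sg>",
    },
}


def transduce(fst, word):
    """Return every output string for which the FST accepts `word`."""
    results = []
    stack = [(fst["start"], 0, "")]  # (state, input position, output so far)
    while stack:
        state, pos, out = stack.pop()
        if pos == len(word):
            if state in fst["finals"]:
                results.append(out + fst["finals"][state])
            continue
        for output, nxt in fst["arcs"].get((state, word[pos]), []):
            stack.append((nxt, pos + 1, out + output))
    return sorted(results)


print(transduce(toy_fst, "Hund"))
print(transduce(toy_fst, "Hunde"))
print(transduce(toy_fst, "Hunden"))  # no path reaches a final state
```

Real morphologies get their size from composing and compiling rules like this by the thousand, which is what the FST languages are for.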

There are some tasty-looking links proceeding from the OpenFST project, by the way.

From my point of view, what I'd like to do might consist of a couple of different threads. First, it would be nice to look at each of these toolsets and produce Perl modules to work with them. Maybe. That, or possibly some kind of exegetical approach that could approximate some kind of general semantics of FSTs and allow implementation of the ideas in any appropriate library or something. I'm not even sure.

But second, it would be ideal to take some of the morphological information already contained in the various open-source morphologies here (note: OMor at Helsinki, which aims to do something along these lines, and of course our old friend FreeLing) and build that knowledge into Lex::DE where it can do me some good. How that would specifically work is still up in the air, but to get good parses from ParZu (and later from Marpa), it's clear that solid morphological analysis is going to be crucial.

Third, I still want to look at compilation of FSTs and friends into fast C select structures as a speed optimization. I'm not sure what work has already been done here, but the various FST tools above all seem to compile to some binary structure that calls into complex code. I'm not sure how necessary that is - until I examine those libraries, anyway. Also, I'd really like to get something out of lemmatization that isn't a string. Those structures bug the hell out of me, because I still need to parse them again next time I do something. I want something in memory that I can use directly. (Although truth be told I have no idea whether that's premature optimization or not - until I try it out.)
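Here is the kind of thing I mean by "something that isn't a string", sketched in Python. The analysis format (a lemma followed by angle-bracket tags, with the part of speech marked by a leading `+`) is an assumption loosely modeled on SMOR-style output, and the `Analysis` class and `parse_analysis` function are hypothetical names. Real analyses, compounds especially, are messier than this handles.

```python
import re
from dataclasses import dataclass

TAG = re.compile(r"<([^<>]+)>")


@dataclass(frozen=True)
class Analysis:
    lemma: str
    pos: str
    features: tuple


def parse_analysis(text):
    """Turn an assumed 'Lemma<+POS><Feat>...' string into a structure."""
    tags = TAG.findall(text)
    lemma = TAG.sub("", text)  # whatever is left once the tags are removed
    # Assumption: the part of speech is the tag with a leading '+'.
    pos = next((t.lstrip("+") for t in tags if t.startswith("+")), "")
    features = tuple(t for t in tags if not t.startswith("+"))
    return Analysis(lemma, pos, features)


analysis = parse_analysis("Haus<+NN><Neut><Nom><Sg>")
print(analysis.lemma, analysis.pos, analysis.features)
```

Parse once, then pass the structure around, instead of picking the string apart again at every step.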

Fourth, there are other POS systems as well. One that naturally caught my eye is hunpos.

So that's the state of the German parsing effort as of today. Lots of things to try, not much actually tried yet.

Update 2014-09-30: A closer look at the underlying technology of ParZu, the Pro3gres parser originally written for English, as described in a technical report by the author, has me somewhat dismayed. I'm simply not convinced that a probabilistic approach is ideal - sure, I might be wrong about this, but first I want to try the Marpa route. Yesterday I sat down to try parsing something with ParZu, and found myself writing an initial Marpa parser for German, working from my own tokenizer (which, granted, has absolutely horrible lemmatization and POS assignment). I think I'm going to continue down that path for now.

That said, SFST is a fascinating system and the German morphologies written in it are really going to come in handy - so I might end up using that before even considering the parser level.
