Sunday, July 10, 2011

The case for segregation of natural language

So this is something I thought of yesterday in connection with my as-yet-nascent natural language production library. In much the same way that layout is kept separate from program-generated content, "mechanical" language (text produced by lookup or other traditional methods) should really be kept separate from fluent language.

That's not really very clear, is it?

Here's what I'm proposing, essentially. The starting point is internationalization, which is hard if you want to do it right. It's hard because natural languages handle number and gender idiosyncratically, and that's just for a start. So simple fill-in-the-blank methods for generating phrases break down once you start worrying about Japanese and Russian and Hebrew.
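To make the number problem concrete, here is a minimal sketch in Python. The template table and the function names are made up for illustration and aren't from any real library. It shows a fill-in-the-blank template going wrong in English, and the extra logic Russian needs to choose among three forms of one noun.

```python
# Naive approach: one template per language, with the number dropped into a blank.
# (Hypothetical example data; not taken from any real library.)
TEMPLATES = {
    "en": "{n} files",
    "ru": "{n} файлов",
}


def naive(lang: str, n: int) -> str:
    """Fill in the blank and hope for the best."""
    return TEMPLATES[lang].format(n=n)


def russian_plural_category(n: int) -> str:
    """Pick the Russian plural category for a non-negative integer."""
    if n % 10 == 1 and n % 100 != 11:
        return "one"   # 1, 21, 31, ... but not 11
    if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
        return "few"   # 2-4, 22-24, ... but not 12-14
    return "many"      # everything else, including 0 and 11-14


RU_FILE = {"one": "файл", "few": "файла", "many": "файлов"}


def count_files_ru(n: int) -> str:
    """Russian noun phrase for n files, with the noun form chosen by rule."""
    return f"{n} {RU_FILE[russian_plural_category(n)]}"


if __name__ == "__main__":
    print(naive("en", 1))  # the template has no way to say "1 file"
    for n in (1, 3, 5, 11, 21):
        print(naive("ru", n), "|", count_files_ru(n))
```

The English template is only wrong for 1, which is easy to patch. The Russian one is wrong for a scattered set of numbers, so the choice has to come from a rule rather than from a blank in a string.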

What I'm proposing instead is some kind of semantic structure that represents the thought to be expressed. A language-specific step would then turn that structure into text, and where necessary it would require additional information, like the gender to be used for the addressee. (I presume that Hebrew normally assumes masculine gender if the addressee is unknown.)
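A rough sketch of what that might look like, again in Python. Everything here (the `Greeting` class, the `render_*` functions, the `RENDERERS` table) is a hypothetical name invented for the example, and the masculine fallback in the Hebrew renderer is just my presumption from above written down as code, not an established rule.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class Greeting:
    """Language-neutral description of the thought "welcome the addressee"."""
    addressee_gender: Optional[str] = None  # "m", "f", or None if unknown


def render_en(msg: Greeting) -> str:
    # English doesn't need to know the addressee's gender here.
    return "Welcome"


def render_he(msg: Greeting) -> str:
    # Hebrew inflects the greeting for the addressee's gender.
    # Assumption: fall back to masculine when the gender is unknown.
    gender = msg.addressee_gender or "m"
    forms = {"m": "ברוך הבא", "f": "ברוכה הבאה"}
    return forms[gender]


# One renderer per language, all consuming the same structure.
RENDERERS: Dict[str, Callable[[Greeting], str]] = {
    "en": render_en,
    "he": render_he,
}


def render(lang: str, msg: Greeting) -> str:
    """Express the same thought in the requested language."""
    return RENDERERS[lang](msg)


if __name__ == "__main__":
    for lang in ("en", "he"):
        print(lang, render(lang, Greeting()), render(lang, Greeting("f")))
```

The calling code only ever says "greet this person". Whether gender matters, and what to do when it's missing, is decided inside each language's renderer.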

But that means that we've got several layers of layout. First is the language-independent layout, into which we could insert lorem ipsum. Next are the set pieces: paragraphs of explanatory "brochure" or instructional text that can be translated en masse and inserted wherever needed. And finally there are the interactive segments (error messages, responses to requests, and the like) that need more careful handling for fluency.
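The three layers could be sketched like this, in Python. The layout string, the set-piece table, and the status functions are all invented for illustration; only `string.Template` is a real standard-library class.

```python
from string import Template

# Layer 1: language-independent layout. Any text fits in the slots,
# lorem ipsum included.
LAYOUT = Template("== $title ==\n$body\n[$status]")

# Layer 2: set pieces, translated as whole blocks and looked up by key.
SET_PIECES = {
    ("en", "title"): "About",
    ("en", "about"): "This tool removes duplicate files from a folder.",
    ("fr", "title"): "À propos",
    ("fr", "about"): "Cet outil supprime les fichiers en double d'un dossier.",
}


# Layer 3: interactive text, generated per language from live data.
def status_en(removed: int) -> str:
    noun = "file" if removed == 1 else "files"
    return f"{removed} {noun} removed"


def status_fr(removed: int) -> str:
    # French treats 0 like 1: singular below two.
    plural = removed >= 2
    noun = "fichiers" if plural else "fichier"
    participle = "supprimés" if plural else "supprimé"
    return f"{removed} {noun} {participle}"


STATUS = {"en": status_en, "fr": status_fr}


def page(lang: str, removed: int) -> str:
    """Assemble one page from the three layers."""
    return LAYOUT.substitute(
        title=SET_PIECES[(lang, "title")],
        body=SET_PIECES[(lang, "about")],
        status=STATUS[lang](removed),
    )


if __name__ == "__main__":
    print(LAYOUT.substitute(title="Lorem", body="ipsum dolor", status="sit amet"))
    print(page("en", 1))
    print(page("fr", 0))
```

Only the third layer needs per-language code. The first never changes, and the second is plain lookup, which is exactly why it seems worth keeping them apart.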

It's kind of inchoate, and less revolutionary than it seemed while I was walking the dog. But I don't want to forget it.
