Semantic programming: November 2009

Monday, November 30, 2009

Informal logic

There are multiple layers of knowledge and meaning, corresponding more or less to the "zoom level" at which we are looking at something. This means that the definition of a unit may have multiple contradictory elements of knowledge.

For example, we read data from a file. But wait - the file has to be open first! So we read data from an open file, and we write data to an open file, right? Well, but to write to a file, we have to have opened it in write mode.

See what I mean? There are multiple layers of rough approximation that humans effortlessly sort through in order to gloss over unimportant detail during the early stages of a solution, then to retrieve it only when it becomes necessary.

This can sometimes have bad effects - garden-path solutions where we find that our solution is no solution because of some irritating detail we forgot until just now. I wonder if one characteristic of genius is to retrieve those details more quickly than others. At any rate, even a genius would quickly become bogged down in extraneous detail without this zooming-in mechanism.

I think of this as "informal logic". In formal logic, if I state "P", then P holds. Period. In informal logic, I may state "P", then later say, "Well, except for in this case." The same is true of open files - it's true that I can write to a file, but there are caveats. The question is, at what stage do those caveats get retrieved in order to contribute to a solution?

I suspect the answer lies in the syntactic expression mechanism, at least in part. When I sit down to write real code to write something to a file, that's when I remember that I have to signal the write mode during the file-open command, and that's because I have that knowledge connected to the actual syntax (along with some background knowledge that I have about how file I/O works).
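A minimal sketch of that layering in Python, just to make it concrete. The names `Claim` and `at_zoom` are made up for illustration, and the idea that caveats nest neatly by "zoom level" is an assumption, not a worked-out theory:

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    """A rough statement plus the caveats that refine it at finer zoom levels."""
    statement: str
    caveats: list["Claim"] = field(default_factory=list)

    def at_zoom(self, level: int) -> list[str]:
        """Return the statement and any caveats visible down to `level`."""
        lines = [self.statement]
        if level > 0:
            for caveat in self.caveats:
                lines.extend(caveat.at_zoom(level - 1))
        return lines


# The file example from above: each caveat only shows up when we zoom in.
write = Claim(
    "we write data to a file",
    [Claim("the file has to be open first",
           [Claim("it has to have been opened in write mode")])],
)

for level in range(3):
    print(level, write.at_zoom(level))
```

At zoom level 0 only the rough statement comes back; the write-mode caveat appears at level 2, which is roughly the point where real syntax is being written.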

Sunday, November 29, 2009

Modes

When I open a file in reading mode, I can't write to it.

How would you define the concept of "mode"? Because it's going to have to be done.

Core meaning

There are two (or maybe more) levels of meaning of a concept. The first is its "core meaning" - a file is a set of data, plus some other stuff. But there is another, outer level derived from the use or applications of files elsewhere, like the fact that we can attach files to mail, or that files are the input for compilers. I hesitate to document those as the "meaning" of the concept, but the sum total of all these known facts about files certainly flavors the concept.

It's really a documentational question: where do we talk about these things? Clearly, the fully indexed lexical unit will include all the links, but the Wiki definition should probably just make reference to other domains in which a unit figures. The rest is something like "background knowledge".

Files

So here is the first unit I'm trying to define, the "file":


term "file" (n):
 
- CONTAINER for data
- HAS name
- either (data IS text) or (data IS binary)

- IN filesystem
- IN directory
- IN path

- ACTOR reads file
- ACTOR writes file

- ACTOR opens file -> file IS open
- ACTOR reads data FROM open file
- ACTOR writes data TO open file

I foresee an indexing script that can read this stuff and tell me what terms are used but not yet defined. Capital words are intended to be non-domain-specific. It's going to be a convention, not a requirement.
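Something like the following Python could be a first cut at that indexing script. It is only a sketch under a few assumptions: a definition starts with a `term "..."` line, facts start with `-`, all-caps words are the domain-neutral ones, and the small `CONNECTIVES` set (my guess) holds lowercase words that belong to the notation rather than to the domain.

```python
import re

DEFINITIONS = '''
term "file" (n):
- CONTAINER for data
- HAS name
- either (data IS text) or (data IS binary)
- IN filesystem
- ACTOR opens file -> file IS open
'''

# Assumption: these lowercase words are connectives of the notation, not terms.
CONNECTIVES = {"either", "or", "for"}


def index_terms(text):
    """Return (defined, used) term sets for definitions written as above."""
    defined, used = set(), set()
    for line in text.splitlines():
        header = re.match(r'\s*term\s+"([^"]+)"', line)
        if header:
            defined.add(header.group(1))
        elif line.lstrip().startswith("-"):
            # Lowercase words count as domain terms; ALL-CAPS words are skipped.
            used.update(re.findall(r"\b[a-z][a-z_]*\b", line))
    return defined, used - CONNECTIVES


defined, used = index_terms(DEFINITIONS)
undefined = sorted(used - defined)
print("used but not yet defined:", undefined)
```

Run on the excerpt above, this reports `binary`, `data`, `filesystem`, `name`, `open`, `opens` and `text` as used but not yet defined. Note that it treats `open` and `opens` as separate terms; folding verb forms together is left out of the sketch.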

The current definition is here, but I want to capture intermediate versions here on the blog. My approach here is really to write the pseudocode, then write a system that understands the pseudocode directly - thus converting the pseudocode into code in a very different way from that usually taken.

Wednesday, November 25, 2009

Online data mining

I just now set up a Wiki for data mining. If I succeed in this notion of using a Wiki format for semantic domain repositories, this is where data mining will end up.

Tuesday, November 24, 2009

Wikis and the Lexicon

The Lexicon is the semantic unit database for a given system. Its keys are the unit names (more or less words), and its records are the sloppy, messy definitions of those concepts.

A Wiki is also a way of mapping from terms onto definition-like text. So is writing a Wiki sort of cognate with defining a domain for semantic programming?

Kind of. I'll bet a Wiki structure would be a fantastic way of annotating a Lexicon; this is the approach I'm going to attempt at the quant-semantics Wiki. A Lexicon entry is going to be much, much more fine-grained than a sensible Wiki structure would comfortably support, but if we think of a Wiki page as a microdomain where concepts are grouped together for ease of presentation, this might be quite useful.

Then we would just scrape the Wiki to compile the program. How cool is that? For this to work, the Wiki is going to have to support code (which Wikispaces does), and the compiler will have to know how to tell smoo code from syntactic structures.

I think this might work! Then the semantic programming Wiki could end up being the general library for inclusion.

What a pretty concept!

Sunday, November 22, 2009

Interesting domains

Well, they're all interesting, but honestly - investment seems rife with possibility. After all, it has a large and eclectic set of concepts that are hard to keep track of, it uses mathematical tools that are interesting in and of themselves, the coding of models is an ongoing process at all times, it can easily involve the integration of external data in arbitrary formats - it basically has everything you'd want in a coding domain, plus the added cachet of being respectable, reputably arcane, and possibly even lucrative.

But a safer and smaller domain (sort of a Hofstadterian microworld) might simply be data mining from online resources.

Saturday, November 21, 2009

Forms of semantic reasoning

So what does semantic programming get us?

The normal process of programming is to specify high-level syntactic structures that are interpreted and ramified, either into simpler or more detailed representations (e.g. macro systems or compilers) or into actual actions (e.g. interpreters or of course the CPU running machine code). A semantic structure can be used in the same way, of course: if I state that we'll be using a "file", then the Lexicon can be consulted for the definition of "file", and details can be filled in. This mode is more or less a very flexible macro system. I'll call this lookup, and it's pretty basic.

A macro system also includes the second mode, expression: the lookup of syntactic structures (symbolic units) and their use to express semantic units. Clearly, this is already part of any macro system to the point that it might not even be seen as a separate mode of operation, but we make a distinction in semantic programming between the semantic and syntactic domains.

But it's the third mode that gets interesting. In recognition, the system searches through semantic configurations to find a matching unit. As a simple example, let's say we need to store some data in a permanent way. Part of the definition of a file is that it stores data permanently - recognition would find "file" as at least a partial solution for our problem.

So as far as I can tell, the different modes of semantic reasoning are:

1. Interpretation or lookup

2. Expression

3. Recognition

As time goes on, this notion will probably get a little firmer.
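To make the three modes a bit more tangible, here is a toy Python sketch. The `LEXICON` contents, the property strings, and the idea of scoring recognition by counting shared properties are all assumptions made for illustration; a real Lexicon entry would be far messier.

```python
# Hypothetical toy Lexicon: unit name -> properties and a syntactic template.
LEXICON = {
    "file": {
        "properties": {"stores data", "permanent", "has name"},
        "syntax": "open({name!r})",
    },
    "variable": {
        "properties": {"stores data", "has name"},
        "syntax": "{name} = None",
    },
}


def lookup(unit):
    """Mode 1: fill in details for a named unit."""
    return LEXICON[unit]["properties"]


def express(unit, **slots):
    """Mode 2: render a unit through its syntactic template."""
    return LEXICON[unit]["syntax"].format(**slots)


def recognize(needed):
    """Mode 3: rank units by how many of the needed properties they cover."""
    scored = [(len(needed & entry["properties"]), name)
              for name, entry in LEXICON.items()]
    return [name for score, name in sorted(scored, reverse=True) if score > 0]


print(lookup("file"))
print(express("file", name="data.txt"))
print(recognize({"stores data", "permanent"}))
```

Here recognition ranks "file" ahead of "variable" for the need "store data permanently", because "file" covers both properties and "variable" covers only one: a partial solution found by searching, not by name.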

Friday, November 20, 2009

The semantic and syntactic poles

A symbolic unit encodes a symbolic relationship between a concept (loosely) and its syntactic expression. I'd argue, actually, that a symbolic unit easily generalizes to any symbolic relationship, but the point in our particular case is that the symbolic unit mediates between meaning and expression.

A loop expression

for i=0 to 9 {
    ...
}

when encountered by an interpreter will cause repeated execution of the loop block and incrementing of the counter. We know that. That, to a human programmer, is the meaning of the loop construct (loosely). The computer, however, can't be said to "know" this except in the narrow sense that programmers use when speaking of software. The computer simply does the right thing. (It is tempting to think of the computer as understanding code because code is expressed in something like the language we use to talk about it - but the computer executing a loop no more understands it than a gearbox understands gear ratios.)

What semantic programming brings to the table is simply that: explicitly associating meaning with program constructs. In this view, the syntactic expression for a loop is the syntactic pole of a symbolic unit whose semantic pole points to the concept of "loop" - or, more precisely, whatever kind of loop it is. And the meaning of that unit involves the more abstract "loop".

This can be applied in either direction. When programming, the system would express the loop in terms of that (or another) symbolic unit. However, the system should be able to parse and understand (let's call it "comprehend" to set it slightly apart from human understanding) existing syntactic structures - we programmers do that all the time, after all. Given code, the system should be able to recreate some of the thinking of the programmer who wrote it.
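A rough Python sketch of a symbolic unit that works in both directions. The class name `SymbolicUnit`, its fields, and the use of a format template plus a regular expression as the "syntactic pole" are assumptions for illustration only; real comprehension would need far more than a regex.

```python
import re
from dataclasses import dataclass


@dataclass
class SymbolicUnit:
    concept: str          # semantic pole: the concept this unit points to
    template: str         # syntactic pole, used for expression
    pattern: re.Pattern   # syntactic pole, used for comprehension

    def express(self, **roles):
        """Go from meaning to syntax by filling in the template."""
        return self.template.format(**roles)

    def comprehend(self, code):
        """Go from syntax to meaning; None if the code doesn't match."""
        match = self.pattern.search(code)
        if match is None:
            return None
        return {"concept": self.concept, **match.groupdict()}


counted_loop = SymbolicUnit(
    concept="counted loop",
    template="for {var}={first} to {last} {{",
    pattern=re.compile(
        r"for\s+(?P<var>\w+)\s*=\s*(?P<first>\d+)\s+to\s+(?P<last>\d+)\s*\{"
    ),
)

print(counted_loop.express(var="i", first=0, last=9))
print(counted_loop.comprehend("for i=0 to 9 {"))
```

Expression yields the loop header from the example above; comprehension of that same header returns the concept name together with the counter and its bounds (as strings, since nothing here interprets them).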

In natural language, words and phrases in the lexicon work in this same way. We'll construct a syntactic lexicon for each programming language the system will work with, along with specialty project-specific lexicons for natural-language domains (remember - comments are grist for the semantic mill, if the system can manage to comprehend them). (And it should most certainly write comments, when it comes to that.)

Sunday, November 8, 2009

Concept: "file"

So let's examine a possible domain and consider the semantic information that is already part of any programmer's understanding of that domain. Let's take an "easy" one: the file. Not even the file system, just the file.

1. It has a name, a size, modification and creation dates, a type, and contents. The name is a string, with a main part and an extension, the size is a number, the dates are dates, the type is something we can think of as a string or a selection from a database, and the contents are the main event.

2. Its contents can be text or binary. Text is a series of human-readable characters; binary is usually a bunch of packed structures.

3. It has specific sets of commands in different programming languages, like "open", "read", "close"; these are mnemonics for various actions we know we can take with files.

4. There are certain patterns used for file processing in different languages (while (<IN>) { ... }).

5. There are command-line invocations we can use for doing things with files from the outside.

6. Files can be mailed, or attached.

7. Files can be documents. Documents can be managed.

8. A file is an analogy to something kept by businesses in manila envelopes.

9. It's not a very good analogy.

So here are the "neighboring concepts" to "file": name, size, modification, creation, date, contents (container), string, number, parts, main, extension, type, table or database, "focus" (because the contents of a file are sort of the focus of the concept), text, binary, human-readable, human, reading, maybe even parsing, binary, packed structure, structure, series or sequence, command, programming language, language, programming, code, open, read, modify, write, close, delete, action, pattern, "while (<IN>) {...}", command line, command, mail, mailing, attachments, document, document management, analogy, the business meaning of "file".

And that's just what I can think of off the top of my head. Each of the key words I used in that description is just as complex in its own right - the whole point of a semantic approach is that concepts don't break down into simpler concepts; they derive their meaning from the network of relationships with other concepts, each of which can be seen as a world in itself.
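One way to picture that network is as a plain graph of neighbors. The Python below is a sketch only: the particular links in `NEIGHBORS` are a hand-picked assumption drawn from the list above, and `within` is a hypothetical helper that walks outward a fixed number of hops.

```python
from collections import deque

# A hand-picked fragment of the neighborhood listed above (links are assumed).
NEIGHBORS = {
    "file": ["name", "contents", "open"],
    "name": ["string", "extension"],
    "contents": ["text", "binary"],
    "binary": ["packed structure"],
}


def within(concept, hops):
    """Concepts reachable from `concept` in at most `hops` links."""
    seen = {concept: 0}
    queue = deque([concept])
    while queue:
        current = queue.popleft()
        if seen[current] == hops:
            continue  # don't expand past the requested distance
        for nxt in NEIGHBORS.get(current, []):
            if nxt not in seen:
                seen[nxt] = seen[current] + 1
                queue.append(nxt)
    return sorted(c for c in seen if c != concept)


print(within("file", 1))
print(within("file", 2))
```

Even in this tiny fragment, two hops out from "file" already pulls in seven other concepts, and none of them is any simpler than "file" itself.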

That means it's going to be difficult or even impossible to draw a line around a hypothetical semantic programming system, saying, "This is standard knowledge." Leaving anything out will lead to increased levels of Martian logic. So we'll be walking a tightrope.

The thing to do at this point is to consider the above set of semantic information, and to imagine (1) how it might be expressed in a specification language and (2) what categories of information might be expected to appear in a given unit's definition.

It's interesting to note that some of the concepts above are more or less specific to programming or computer systems ("packed structure"), while others are more general in nature ("focus") - that's going to be key, I think.

Saturday, November 7, 2009

Design patterns

Design patterns encode certain approaches or "semantic chunks", if you will, of a finished system. Sort of a high-level macro. Semantic programming will probably end up using a ramified and very detailed set of something like design patterns to build everything, so again: good start.

Probably I should set up some kind of Wiki backing for this blog.

UML

UML (Unified Modeling Language) provides a set of tools to specify the context and basic concepts of a software system. Again: good start.

Methodologies

Methodologies also approach what I'm getting at, without going the whole way. A methodology expresses the processes used by a team to achieve project goals, not necessarily specific to software. As such, the methodology talks about how to find out domain knowledge (well, this is one way of looking at it) and provides some hints or clues about what knowledge must be found.

Boy, this is a shallow post.

Friday, November 6, 2009

Literate programming

I spent a few years working with literate programming tools, even writing my own XML-based markup language for the purpose. If you're not familiar with the notion of literate programming, Donald Knuth invented it while writing TeX; his system formats source code (using TeX) as a readable book. The TeX source code is, in fact, a pretty readable and entertaining book.

Knuth's insight was that the order of presentation that makes understandable reading is not the same as the order of presentation that makes parsable code - he was using Pascal, so all the variable declarations had to be at the outset of a given function, but he wanted to introduce them where they were used, for instance.

His system also generated an index of identifiers and did some other nice presentation-oriented stuff.
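The reordering step (usually called "tangling") can be sketched in a few lines of Python. This is not Knuth's WEB syntax and not my own XML markup; the `<<chunk name>>` references are borrowed from noweb-style tools, and the chunk names and contents are invented for the example.

```python
import re

# Chunks in the order a reader might want them: the main idea first,
# the declarations last. Names and contents are made up for illustration.
CHUNKS = {
    "main": "<<declarations>>\n<<compute total>>\nprint(total)",
    "compute total": "for value in values:\n    total += value",
    "declarations": "values = [1, 2, 3]\ntotal = 0",
}


def tangle(name, chunks):
    """Expand <<chunk>> references recursively into the order the parser needs."""
    def expand(match):
        return tangle(match.group(1), chunks)
    return re.sub(r"<<([^<>]+)>>", expand, chunks[name])


program = tangle("main", CHUNKS)
print(program)
```

The tangled program puts the declarations first even though they were written last. The sketch does no cycle detection and does not re-indent nested chunks, both of which a real tool would have to handle.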

So I wrote my own literate programming system and used it for several years, and it really did help me organize my code. But it doesn't go far enough; there's still no semantic information that's machine-usable.

Thursday, November 5, 2009

The semantic unit

The basic unit of semantics is the unit. Since "unit" is a really overloaded word, I'll call it a semantic unit, or a su or something. A unit is just a box. It can be thought of (loosely) as the meaning behind a single word.

A unit is made up of other units, which all stand in relationship within it. It also stands in relationship with other units within a domain. If you surmise that a unit is itself a domain for its inner units, you are quick. Domains and units are the same kind of thing, sort of.

A unit is associated with other units by dint of its inclusion with them in domains, and by dint of the nature of the structure they all participate in. Semantic nets that draw lines between nodes and assign weights to those lines are explicitly representing those associations; we can think of this as a sort of precalculated index of association values that could be derived from study of the semantic structures of a given domain. But those linear values are dangerous - they change according to context. They're best thought of in the abstract as an approximation.

A unit is also associated with syntax. I'll speak of a unit's "semantic pole" and its "syntactic pole", but I'm not sure how valid those labels are for what I'm doing here. (In case you're wondering, a lot of this is based on the work of Ronald Langacker, whose talk at Indiana University in, um, about 1994 really knocked my socks off. I've been imprinted on his theory of cognitive linguistics ever since.)

Since syntax is ultimately how semantic structures interact with the world - and in the programming context, syntactic structures are what actually does all the work - they're going to be quite important. A "cognitive linguistics parser" will have to be part of any toolset we come up with during this venture.

Introduction

Since starting the big-old-house blog to chronicle my work on my house, I've come to like the blogging medium as a way to organize periodic thoughts.

I used to keep blogging stuff on my site on my server, but frankly - writing and fixing your own tools every time you just want to jot down a little thought gets old. Now that I'm on Blogspot every day anyway for the house, I might as well start new blogs for new thinking threads.

So that's what this is. A new thread of thinking. If you're reading this, well, that's weird. It's more just for me. But hey, thanks for reading - drop me a comment if you find something neat. I hope you will, with time.

So ... what's semantic programming? That's kind of what I want to answer over the next arbitrary period of time; the notion is something I've been mulling over for about fifteen years, maybe more, and it just keeps coming back.

The key insight (well, in addition to hexapodia) is that programming to date has focussed on syntax, whereas it's maybe time to start being more explicit about semantics.

What do I mean by that? Let's consider something sort of like a real-world example; say, a script to harvest some information from a website and put it into a database. (This example will probably crop up again and again over the course of this blog.) Since we're human, we approach this by taking a browser and going to the site in question and reading it, and forming some kind of mental model of the information there. Then we map that onto a structure of tables, and we sit down and write some Perl (say) to walk along the site and load the data into the tables.

Along the way we write some SQL to define the tables, and more to insert the data, and so on. The table columns have names that mean something to us in English or Thai or something.

At the end of the process, we test the program and see that it is good, and on the seventh day we rest, while the program runs merrily along and populates our tables for us, doing what software does best.

OK. That's the baseline. We've used our human domain knowledge and our smart-monkey proclivity to build semantic structures, then we've expressed some syntax to define data and procedure. The syntax can be tested or maybe even proved correct (well, probably not, but you get my meaning) but - this is the important part - nothing in that system can be said to understand the domain. It expresses the domain, in a way; another programmer can come and read the code (even without comments) and will probably be able to come up with something like the original domain we learned about.

What semantic programming should do is to encode some of that domain knowledge and semantics into the overall system. Ultimately, it should encode all of it.

Why? I'm glad you asked.

The more semantically sophisticated the system is that we build, the more easily we can adapt it. Instead of the script being the system, the script is now simply an expression of the system; the system itself more closely approximates the set of knowledge and concepts the programmer built in her brain during the programming process.

By changing the concepts at a higher level, and allowing those changes to propagate into the expression accurately (i.e. by metaprogramming instead of programming), we eliminate a lot of error. Moreover, if the system understands what it should be doing, then - potentially - it should be able to run spot checks on its results - sanity checks. It should be able to generate its own unit tests based on the semantic knowledge of how its own components should work.
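As a very small illustration of self-generated sanity checks, here is a Python sketch. The `COLUMN_MEANING` table and its keys (`kind`, `min`, `non_empty`) are hypothetical stand-ins for whatever semantic knowledge the system would really hold about the scraped data; the point is only that the checks are derived from that knowledge instead of being written by hand.

```python
# Hypothetical semantic notes about the columns a scraper fills in.
COLUMN_MEANING = {
    "price": {"kind": "money", "min": 0},
    "title": {"kind": "text", "non_empty": True},
}


def make_checks(meaning):
    """Derive one labeled sanity check per declared constraint."""
    checks = []
    for column, facts in meaning.items():
        if "min" in facts:
            checks.append((f"{column} >= {facts['min']}",
                           lambda row, c=column, m=facts["min"]: row[c] >= m))
        if facts.get("non_empty"):
            checks.append((f"{column} is not empty",
                           lambda row, c=column: bool(row[c].strip())))
    return checks


def failures(row, checks):
    """Return the labels of every check the row fails."""
    return [label for label, check in checks if not check(row)]


checks = make_checks(COLUMN_MEANING)
print(failures({"price": 12.5, "title": "Widget"}, checks))
print(failures({"price": -5, "title": "  "}, checks))
```

Change the meaning of a column (say, allow negative prices for refunds) and the checks change with it, with no test code to hunt down and edit.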

In short, the smarter the program, the less it is a program.

Yeah, yeah, I know. This is impossible. That's why this blog isn't going to be publicized. It's just my notes. If you're reading it, and it's any time earlier than about 2011, and if you want to pull the alpha geek? Don't. I'll delete negative comments as soon as I see them. Positive comments are welcome.

It's like when I bought my house. I hadn't even seen it when we closed, of course (entirely remote) but when I told my dad we were moving back into the area and where the house was, I told him, "No matter what you see there, no matter what condition it's actually in, when you call me next, I only want to hear validation." He laughed; I think that might have been the most honest moment we've ever shared.

Same applies here. No matter how stupid this is, don't tell me. I'm just using you as a focus to verbalize my thoughts anyway.

Update, November 6, 2011: a brief retrospective.

Semantic programming