Wednesday, November 30, 2011
Data mining of recipes
Now this is way cool: analysis of comments on recipe sites to determine, well, all kinds of things, but especially what ingredients can be substituted for others. Just mouthwatering research - both figuratively and literally!
I want to do this.
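It would be fun to prototype even the crudest version. Something like this, in Python, just to think it through - the phrase patterns are my own guesses, not anything from the actual research:

```python
import re

# Crude sketch: mine substitution pairs from recipe comments by matching
# phrases like "substituted X for Y" and "replaced X with Y".
# Single-word ingredients only; a real system would need NLP, not regexes.
SUBSTITUTED = re.compile(r"substituted (\w+) for (\w+)", re.I)
REPLACED = re.compile(r"replaced (\w+) with (\w+)", re.I)

def find_substitutions(comments):
    """Return (new_ingredient, original_ingredient) pairs found in comments."""
    pairs = []
    for text in comments:
        for m in SUBSTITUTED.finditer(text):
            pairs.append((m.group(1).lower(), m.group(2).lower()))
        for m in REPLACED.finditer(text):  # "replaced X with Y": Y is the stand-in
            pairs.append((m.group(2).lower(), m.group(1).lower()))
    return pairs
```

Aggregating those pairs over thousands of comments would already give a decent substitution graph.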
Javascript and HTML5 Canvas roundup
A foray into Google on the topic of HTML5 animation yielded (as always) some linkworthy material:
- Basic tunnel animation with moving random spheres.
- From the same site, a really nice set of presentation slides. The language is Volapük, if you're wondering.
- Some of the new HTML5 semantic tags. Wow!
- Alex the Alligator, a platform gamer ported to HTML5, based on the engine melonJS.
- Basic drawing with HTML5 Canvas.
Also there was doodle-js - a port-ish thing of ActionScript into HTML5 Canvas - but it seems to have vanished. Or at least its example pages did; the code is still on Github.
False claim checker
Fact-checking articles against the PolitiFact database: a moderately breathless article [truth goggles?] gets listed on HNN, of course, but then turns up at Language Log as well, leading to a reference to the failed commercial venture Spinoculars (nothing to do with this research, just a general similarity in goals).
There's just a boatload of things you could do in this arena.
Consensus seems to be that it's still a neat idea, and still probably impossible to do without being rather brittle. The research project appears to be using keywords to reference particular claims, which is really not a bad start.
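The keyword idea is easy to sketch in miniature. Here it is in Python - the claims list and threshold are invented for illustration, and real matching would need far more than keyword overlap:

```python
# Minimal sketch of keyword-based claim lookup: score a sentence against a
# (hypothetical) database of fact-checked claims by keyword overlap, and
# return the best match above a threshold.
STOPWORDS = {"the", "a", "an", "of", "in", "is", "are", "to", "and"}

def keywords(text):
    return {w.strip(".,!?").lower() for w in text.split()} - STOPWORDS

def best_claim(sentence, claims, threshold=0.3):
    """claims: list of (claim_text, verdict); returns (claim, verdict) or None."""
    sent_kw = keywords(sentence)
    best, best_score = None, threshold
    for claim, verdict in claims:
        kw = keywords(claim)
        union = sent_kw | kw
        if not union:
            continue
        score = len(sent_kw & kw) / len(union)  # Jaccard similarity
        if score > best_score:
            best, best_score = (claim, verdict), score
    return best
```

Brittle, exactly as predicted - but a start is a start.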
Sunday, November 27, 2011
SuperCollider
SuperCollider is kind of a Processing for music, I guess: a dedicated music-oriented programming/performance system. It would be neat to look into it.
SourceMap: finding where the JavaScript comes from
Apparently SourceMap is a project that, given a JavaScript error, can work out which of your many source files in PHP and CoffeeScript and what-have-you is actually responsible for the code that produced the error. Nice!
Functional programming
Nice intro to functional programming for OO-trained programmers. I found it useful.
Tuesday, November 22, 2011
Monday, November 21, 2011
Sentiment analysis
Greed and fear index, an apparently free online sentiment index of current financial news. They want to open an API.
Update 2013-03-13: dead. Maybe I should revive it.
Spanish morphology in Haskell
This is a cool paper about modelling the morphology of Spanish with a functional language.
Sunday, November 20, 2011
(Natural) language recognition
Recognizing the probable language of a text is problematic, of course, but there are a number of different ways to get a "pretty good" estimate that stop well short of spell-checking every word. (After all, until you know the language, it's not possible to get all the word boundaries right - although you're still going to see most of them, of course. Except in Chinese and Japanese.)
Wikipedia has a really nice and thorough language recognition chart. It would be nice to put that into a Perl module. The Wikipedia page also lists a couple of additional leads that are kind of neat:
- Translated online guesser - uses a vector space model
- Huh. The other two links are dead. That's a shame - but it may be worth following up on them at a later date.
Perl already has Lingua::Identify (0.30 in 2011), but I don't know how accurate it is or its coverage. It's definitely worth looking at, though. There's also a statistical approach in Lingua::Ident (1.7 in 2010).
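The trigram approach behind the statistical modules is simple enough to sketch. A toy version in Python - nothing like as careful as Lingua::Identify, just the rank-profile idea in miniature:

```python
from collections import Counter

# Profile each language by its most common character trigrams, then rank an
# unknown text by how many profile trigrams it shares with each language.
def trigrams(text):
    text = " " + text.lower() + " "
    return Counter(text[i:i+3] for i in range(len(text) - 2))

def make_profile(text, size=200):
    return [t for t, _ in trigrams(text).most_common(size)]

def guess_language(text, profiles):
    """profiles: dict mapping language name to a trigram profile list."""
    sample = set(make_profile(text))
    return max(profiles, key=lambda lang: len(sample & set(profiles[lang])))
```

With real training corpora (and proper out-of-place scoring instead of raw overlap) this gets surprisingly accurate even on short texts.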
Saturday, November 19, 2011
The original code katas
Here.
These are nice thought-provoking exercises to do to study programming - both your toolset and your mental structures.
Thursday, November 17, 2011
State machines in Perl
I just did a quick CPAN search and turned up a number of interesting packages:
- FSA::Rules is the package I initially started building a wrapper for. It's actually pretty nice, and has a couple of constructs that my last post is probably missing.
- Parse::FSM builds a parser based on an FSM constructed laboriously by function call.
- State::ML provides a utility for converting XML-encoded state machines into other things or even code. I like the code generation aspect!
- Win32::CtrlGUI::State is a slick little state-machine controller for Win32 GUIs.
- Basset::Machine builds a state machine class in much the same way Term::Shell builds a command line shell.
All pretty cool aspects of the state-machine paradigm. If you really wanted to start getting into the semantic programming approach writ large, you'd think of ways to produce code generators for any of these starting from a Decl state machine description, and ways to organize that kind of code generator family into a semantic domain.
State machine redux redux
My random link strategy is working - it brought me to my state machine post of last year, whereupon I realized I'd kind of forgotten about state machines entirely. And yet a clear state machine presentation is such an improvement over code!
Incidentally, the Wikipedia page for finite-state machines is pretty nice.
Here's my tentative DSL for state machines, then:
- The overall tag is "statemachine" and has a name. This name will resolve as a function outside the state machine.
- The state machine tag is an iffy executor; that is, if it's the last tag in a program, it will be in control.
- Within the state machine node, the children are named. Children named "prepare", "output", or "input" that occur before "start" are special.
- The "prepare" child is code executed on each input to prepare it. The local variable $input contains the prepared input (the return from the "prepare" code) and $raw contains the raw input should it be required (this is @_ in the "prepare" code).
- The "output" child is code executed on each out-of-band output in the state machine (see below). The default output is the same as anywhere in Decl; it's to pass output to the parent, where eventually it just gets printed to stdout if you don't redirect it.
- The "input" child is code executed to obtain the next input token, if the state machine is in control. If there is no input, then the state machine can't be in control; it must be called from other code for each input token.
- The "start" child is the first state - every child after "start" is another state, so you can still call a state "prepare" or "input" if you need to.
- Within a state, we still have special parse rules, but in general, execution goes down the list of the state's children.
- A string followed by "->" consumes an input token if it matches, and changes the state.
- A line introduced by "->" just changes the state.
- Either of those may have code attached; if so, this code executes before the state transition. But with or without code, both of those act like a "do".
- A line that doesn't consist of string and -> or just -> is parsed as normal Decl code and does whatever it's supposed to.
- The code morpher will be updated to understand "->" at the start of a line as a state transition if you're inside a state machine. (That's probably a trickle-up thing as well, actually.)
It should be possible to compile a state machine to Perl or C or something; run it in interpreted mode for testing, then spin out C for performance later. Something like that. I'm only vaguely starting to apprehend this aspect of Decl - that eventually it should be entirely master of its own fate and understand how to generate high-performance code from its own programs when necessary.
I still think this would be a pretty powerful feature; I nearly always run aground when coding things based on state machines because it's just so hard to keep track of the darn things, but this lets me pseudocode my way right through it.
Here's a possible rendering of a recognizer for "nice", each letter being an input token:
statemachine nice
   start
      n -> n_found
      -> error
   n_found
      i -> i_found
      -> error
   i_found
      c -> c_found
      -> error
   c_found
      e -> success
      -> error
   success (accept)
      >> Yay!
      -> start
   error (fail)
      >> bad!
      -> start
That seems to do what I want; it doesn't show any code, but it does show the basic pseudocode I want to use.
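For comparison, here's roughly what that machine looks like hand-coded - a plain transition table, sketched in Python and glossing over the consume/no-consume distinction. This bookkeeping is exactly what the Decl version hides:

```python
# Hand-coded equivalent of the "nice" recognizer: a transition table keyed by
# (state, token), with "error" as the default target and both terminal states
# looping back to start.
TRANSITIONS = {
    ("start", "n"): "n_found",
    ("n_found", "i"): "i_found",
    ("i_found", "c"): "c_found",
    ("c_found", "e"): "success",
}

def run(tokens):
    """Feed tokens one at a time; collect True on accept, False on fail."""
    state, results = "start", []
    for tok in tokens:
        state = TRANSITIONS.get((state, tok), "error")
        if state in ("success", "error"):
            results.append(state == "success")
            state = "start"
    return results
```

Even this tiny example shows why the declarative form wins: the table says nothing about *why* those transitions exist, while the Decl layout reads as documentation.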
Wednesday, November 16, 2011
Another open-sourced Ruby app
Another app open-sourced in frustration - grist for the code analysis mill. Ruby.
PR diving
A fun article by Paul Graham about the PR industry, incidentally proposing "PR diving" - finding articles in news sources that were sent in by PR people. I'll bet that could be tracked semiautomatically. I love that kind of idea.
Static typing
As a Perl devotee, you can imagine I'm no fan of static typing - but I acknowledge it has its place when trying to reason about software correctness.
So I wonder how doable it would be to introduce type assertions into Decl in such a way that they aren't mandatory? Perhaps assertions about duck typing - instead of declaring something an Animal, just assert that it's an Animal if it meets certain "can" criteria?
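The check itself is trivial to express; here's the idea sketched in Python, with illustrative names (nothing here is a real Decl API):

```python
# Optional duck-typing assertion: instead of declaring something an Animal,
# assert that it behaves like one by checking the methods it responds to.
def assert_duck(obj, *methods):
    missing = [m for m in methods if not callable(getattr(obj, m, None))]
    if missing:
        raise TypeError(f"{type(obj).__name__} lacks: {', '.join(missing)}")
    return obj

class Dog:
    def speak(self): return "woof"
    def move(self): return "runs"

assert_duck(Dog(), "speak", "move")  # passes: Dog quacks like an Animal
```

The key property is that the assertion is opt-in: leave it out and nothing changes, add it and you get a checked interface without a type hierarchy.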
Tuesday, November 15, 2011
Pure: template language
Here's an interesting template language: Pure. In Pure, the HTML itself is the template, since fill-in is done based on the class attribute of the container. Data is provided as JSON. Slick!
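The principle is easy to demonstrate. A toy Python version of the class-driven fill-in - Pure's actual directive syntax is much richer than this, and the regex is strictly for illustration:

```python
import re

# The Pure idea in miniature: inject values from a JSON-ish dict into any
# element whose class attribute matches a key. Naive: assumes flat elements
# with no nested markup inside them.
def render(html, data):
    def fill(match):
        tag, cls, old = match.group(1), match.group(2), match.group(3)
        return tag + str(data.get(cls, old)) + "</"
    return re.sub(r'(<\w+[^>]*class="(\w+)"[^>]*>)([^<]*)</', fill, html)

html = '<span class="name">placeholder</span> is <span class="age">0</span>'
print(render(html, {"name": "Alice", "age": 30}))
# -> <span class="name">Alice</span> is <span class="age">30</span>
```

The appealing part is that the template round-trips: designers can open it in a browser as-is, because it's just HTML.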
Wx task
There's a Wx CPAN shell app that would probably be a great Decl target when I resume Wx development.
Random posts
I found and installed a random posts link in the sidebar. This way I have a finite probability of not forgetting any given topic I've found in the past two years. I think this will be pretty interesting going forward.
Codefixbot
I just had the neatest idea. One of the nicest things about Perl/CPAN is the CPAN Testers Network - if you package your module with tests and put it on CPAN, hundreds of automated testing systems running different versions of Perl on different machines under different operating systems test it for you and email you the results.
I can't overstate how outstanding that is for code quality.
So. Code quality. I posted a couple of days ago about code quality. Here's the first iteration of my idea: a generic code quality tester that would crawl open-source repositories (aside: since the demise of Google Code, there is no universal code index, and that should change), identify problems, and if on e.g. Github, automatically create and submit a pull request to fix common errors. But otherwise attempt to format an easy patch and get in touch with the authors.
That's cool enough - a sort of universal code quality assurance system that would just ... fix everything it finds. But then I came to the second iteration of my Good Idea, which is something even more interesting: a code generator bot. Let's say I have some kind of idea and a language to express it in - the bot could come by and generate, say, C++ or Java code to implement my idea.
OK, granted, that's vague. But surely there's some kind of continuum there from the easy to imagine (code quality automation on the loose) to the Singularity (write a blog post about something you want, and the Internet implements it for you and links to it from a comment). And honestly - what a philanthropic opportunity!
So now I know what I want to do this year. Just gotta jam it onto the priority list.
Monday, November 14, 2011
Meta-learning
Meta-learning is learning which algorithms to use for machine learning, automatically. Sounds perilously close to metaprogramming, actually, which is what I ultimately want to be doing.
Update: I've registered with Springer-Verlag as a book reviewer and I'll be reviewing this book. That means I get free online access to the text for six months. This should dovetail nicely with the machine-learning-in-Perl tutorial site idea, actually.
Error handling in Decl
As in, there is none. Except for Perl's native system, which is mostly inappropriate.
Anyway, LISP has "conditions" instead of exceptions, and some of this might be applicable to the Decl situation.
Free programming books
OK, so what prompted that last post was this StackOverflow list of free books. Here are some promising ones:
I need more time in my life.
Time for a link dump: more ML/NLP
My browser tab is getting full, this time with free programming and NLP books. So here we go:
- Introduction to Information Retrieval
- Foundations of Statistical Natural Language Processing (sadly not free online, but deemed valuable)
- Mining of Massive Datasets
- Two books on Computational Semantics (Blackburn & Bos)
- Elements of Statistical Learning
- Data-Intensive Text Processing with MapReduce
That's the books. Now the other stuff:
- CMU's machine learning course. I might work through it after Ng's course is done.
- Apache Mahout [hnn]
- The PET parser [online demo] [article], part of DELPH-IN, which has a truly painfully formatted home page but looks promising
- Natural Language Engineering journal
- StackExchange discussion of NL parsers and starting points for NLP
- A list of what's in the Ubuntu NLP stack
- The Porter stemmer
- Apache OpenNLP - probably a good place to help out
- ANTLR
And that's it, for NLP. For now. You know, I could spend a year or two internalizing this stuff and be the best natural language programmer on the planet.
Wednesday, November 9, 2011
Superdesk: journalism tool by and for journalists
I haven't read all this, but it looks neat. Workflow for journalism, I believe.
Local files in JavaScript
Well, this is new. Under HTML5, JavaScript has access to the local filesystem. Wasn't the lack of local access a specific security feature, though?
NLP
I actually started working with natural language this week - by trying my hand at some translation tools at last. First, some links I ran across:
- A list of R packages for NLP, with an intriguing link to Weka, a set of Java implementations of data mining algorithms.
- StackOverflow reference to NLTK and n-gram extraction.
- Note "PMI", point-wise mutual information, cited in the SO link.
- Lucene is Apache's text indexing and search library; there is a PyLucene as well. But honestly I think I'm going to have to get my hands dirty in Java, because Java seems inordinately popular in the NLP field.
- JCC is a code generator developed for use in PyLucene.
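PMI is worth writing down while I'm at it: PMI(x, y) = log2(p(x, y) / (p(x) p(y))), estimated from corpus counts. In Python:

```python
import math
from collections import Counter

# Point-wise mutual information for adjacent word pairs, estimated from raw
# counts: high PMI means the pair co-occurs more than chance predicts.
def pmi_scores(tokens):
    n = len(tokens)
    words = Counter(tokens)
    pairs = Counter(zip(tokens, tokens[1:]))
    total_pairs = n - 1
    return {
        (x, y): math.log2((c / total_pairs) / ((words[x] / n) * (words[y] / n)))
        for (x, y), c in pairs.items()
    }
```

On a real corpus this picks out collocations like "New York" nicely; on tiny samples it's noisy, since rare pairs get inflated scores.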
OK, so I got Text::Aspell working on Windows with MinGW32, which was no trivial task (in the end it just required some library setup and a small change to the installation script for the module, but it took a full day to figure that out), and got down to the business of building word lists and checking them. This works well enough, except that it immediately became obvious that I need a better tokenizer.
While prowling the Net for English tokenizers in Perl (note: there aren't any good ones - yet), I found:
- Europarl sentence splitter
- Europarl tokenizer
- A post on sentence splitting options (2007)
- Lingua::EN::Sentence, Text::Sentence
- A tech report on tokenizing for biomedical text indexing
In the end, I resolved to write a proper tokenizer for English. It would recognize certain entities embedded in the text (URLs, numbers, numbers with units, some abbreviations, possibly chemical names, and IDs with capital letters and digits and dashes and the like), mark punctuation as such, and attempt to deal with quotes in a sane way, including distinguishing between quotes and apostrophes. The result would be an arrayref of words and arrayrefs - the same as the other Decl tokenizers - and should work fine not only for English but for the other European languages as well. I don't yet know enough about non-European languages to know how well I can tokenize them, except that I know it's non-trivial to tokenize Chinese and Japanese, and presumably Korean.
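A first cut at the entity-recognizing pass could be an ordered alternation, sketched here in Python rather than Perl. The tags and patterns are placeholders, and the output is a flat token list rather than the nested arrayref structure described above:

```python
import re

# Ordered alternation: try the most specific entity patterns first (URL,
# number, dashed ID), then fall back to words and single punctuation marks.
TOKEN_RE = re.compile(r"""
    (?P<URL>https?://[^\s,]+)           # URLs (crudely stopping at commas)
  | (?P<NUM>\d+(?:\.\d+)?)              # integers and decimals
  | (?P<ID>[A-Z0-9]+(?:-[A-Z0-9]+)+)    # IDs like RFC-2616 or AB-12-CD
  | (?P<WORD>\w+(?:'\w+)?)              # words, keeping internal apostrophes
  | (?P<PUNCT>[^\w\s])                  # anything else, one mark at a time
""", re.VERBOSE)

def tokenize(text):
    """Return (tag, token) pairs; the tag is the name of the matching rule."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]
```

The ordering of the alternatives does all the work: "RFC-2616" must be tried as an ID before the word rule can split it at the dash.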
Finally, searching CPAN for NLP, I found an interface to the Stanford parser (a link on how to use it in Java as part of this book).
It's looking like I'll end up with classes in Xlat for basic translation tool functionality (Xlat::Wordlist and Xlat::Speller), along with probably NLP::Tokenizer if I end up making that a HOP-based pure Perl endeavor.
Update: But as always, note that using a deadline for a job (in this case an editing job) to induce urgency for basic research is generally going to lead to sleep deprivation and depression. I've switched to manually doing this job like a schlub, but hopefully the remembered urgency will get me through the next development cycle on this. I gained some good insights.
Monday, November 7, 2011
Cinder: C++ for creative work
Cinder: Kind of like Processing, but C++. The Medusae project is breathtaking.
A quick note on dates and times
The true depths of date and time calculations are frightening; see Perl's DateTime. So it might be nice for Decl to incorporate that from the get-go, as part of the kitchen-sink philosophy.
Sunday, November 6, 2011
XDoclet: "attribute-oriented" programming in Java
Interesting approach to boilerplate generation in Java (of which there is a great deal!) - XDoclet apparently allows you to specify some semantic attributes in comments and then generates all the cruft for you. That's kind of neat.
The philosophy of artificial intelligence
The history of AI (to about the mid-80's) from the point of view of a philosopher. I need to reread this a couple of times.
Decl hits CPAN
For reals this time! I still don't have rights to the Decl namespace (which is why CPAN took its own sweet time indexing the module) so it appears with a honking bold red "** UNAUTHORIZED RELEASE **", but it's up there under my name, which makes me pretty damned happy.
Also nice: except for one test result that I've already fixed for next release, it passes its smoke tests on every system. I love CPAN.
So what's coming up for Decl?
- Traversal: this is hierarchical structure walking (e.g. directory walk) and mapping (e.g. something like XSLT)
- Boilerplate and macros in modules, then release of declarative CSS and HTML modules
- Rewrite Word using some macros (the "select" tag usage is changing) and rerelease it
- Look again at Wx now that macros work, maybe release Wx 0.01
- Look at macros in the PDF context, probably release PDF 0.01
- Database management and access, then release Decl 0.12 with that
- An error management system, finally, which will probably be Decl 0.13
- Literate programming and PHP katas and examples, then release Publisher
- Probably look at Inline next and integrate with Python; I want access to the NLTK.
- Declarative logic somewhere in here, based on AI::Prolog.
After that, I'm not sure. But it will surely be obvious by then - and I'm equally sure this list won't survive contact with the enemy, either. For example, maybe I'll start thinking more about the Lexicon for real by the time I'm halfway through that list.
Two years in
I started this blog on November 5, 2009, with every intention of investigating a specifically semantic framework for programming that might have borne fruit by 2011. It's November 6, 2011, so where do we stand?
I started work on Decl in February of 2010, according to my notes (the first SourceForge checkin was on February 15, but I'd posted on Wx::DefinedUI on the 10th, and honestly I think a Markov-chained snippet from my earlier writing may have triggered the concept in January), and it quickly grew to take over my every waking thought. Essentially, all my progress with semantic programming has been in the implementation of Decl. As I noted on February 10th, my earlier effort in late 2009 foundered on the shoals of syntax. At least that's no longer a problem.
The idea of Decl is to define semantic domains and tags that declare various types of programming construct, then to build programs of those. Eventually, the semantic domains should have enough macro machinery involved that the programs will largely self-construct, but I'm nowhere near that level of detail yet. I just finished the v1.0 macro system last month, after all, and it's by no means clear how to get from point A to point B.
But that's where things stand. I have done some musing about shoehorning my old Hofstadter microdomain work into Decl - not that that would require much shoehorning at all, which is the raison d'être of Decl in the first place - but haven't really made a serious move in that direction yet.
I'll leave you with this notion: the Decl tag is an instance of a concept. As such, it's a token from a Lexicon. I haven't implemented the actual Lexicon yet - but at least Decl will be a language capable of expressing it right from the start. And that's why Decl is important.
Javascript pitfall: missing var
A heartrending account of mistaken globality. Killer comment from HNN: jshint, stupid.
So.... Code quality tools in general. I want to build a framework. Gauntlet thrown.
Statistical comparison of programming languages
This is a pretty fascinating project - they have lots of different implementations of various algorithms in lots of different programming languages. The article does some comparison between them in terms of expressiveness and speed.
Overview of numerical analysis software
So looking at alternatives to Octave, it turns out - to what should not have been my surprise - that there are a boatload of alternatives:
- Wikipedia has a nice table
- The Octave Wiki recommends Inline::Octave, which I find a little questionable, but hey.
- PDL is probably the best Perl alternative; has direct support for sparse matrices, interestingly.
- The Monks look at some comparisons between R/S, Octave, and PDL.
Again: I'd essentially like to distill the semantics out of this and have a system that knows how to code for a set of alternatives.
Good maxims for consulting programming
Five things to do for programming on a deadline:
- Set up continuous deployment before you start
- Write tests first
- Be transparent
- Maintain daily todo lists
- Do the right thing
Not bad.
Thursday, November 3, 2011
Automated freaking writing in the news again
This makes me so envious I could explode. I know it's always the same story. Still.
Wednesday, November 2, 2011
An aside on machine learning, and open-source contribution
So having forced my brain to code a vectorized cost function in Octave starting from the equation - a task that truly taxed skills that had lain dusty for decades, and involved a brief discussion with my private theoretical physicist - I've started to think maybe I might be capable of learning a new trick or two. This Stanford class just barely scratches the surface, of course, and my mathematical background is essentially nil, so I've got a steep hill to climb.
But. There are open-source machine learning projects out there. Perhaps it might be best to start contributing. So on that note: the mloss.org project database. 334 projects and counting.
And one of the things that caught my eye this week on the software development front is PVS-Studio, a static C/C++ code analyzer that finds common coding errors. There was briefly an article on it listing 91 such errors, but it was deleted. Of course, it would be a hell of a lot more interesting to have an open-source equivalent. If there isn't one, I intend to damn well start one, with a curated set of flags (this may be why the article disappeared, of course...).
Update after reading this: OK, so I'm an idiot. Sometimes it's easy to forget the last twenty years and the Internet and all. [also]
Anyway, the whole concept of static code analysis fits well with my vague idea of a "code understander" set loose on open-source code.
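Even a single check shows the flavor. Here's a toy Python version of one classic C diagnostic - assignment inside a condition - though real analyzers parse the code, and a regex only gestures at the idea:

```python
import re

# Flag the classic C mistake of assignment inside a condition:
# if (x = 5) instead of if (x == 5). The negative lookahead excludes ==.
ASSIGN_IN_COND = re.compile(r"\bif\s*\(\s*\w+\s*=(?![=])")

def check_source(source):
    """Return 1-based line numbers that look like assignment-in-condition."""
    return [i for i, line in enumerate(source.splitlines(), 1)
            if ASSIGN_IN_COND.search(line)]
```

A curated set of a few dozen checks like this, run over public repositories, would be the seed of the framework.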