Semantic programming: October 2011

Monday, October 31, 2011

B2Brev

B2B tool reviews by startups. Nice idea! Goldmine of component information.

And now I'll violate that by writing about it. I had a partial payment from a customer today with whom I have about 10 open invoices, and there was no telling which invoices the payment was intended to cover. After dithering, I decided that really I should be able to write a brute-force knapsack search to find if there was any combination that matched. (Turns out there wasn't, but there was a combination that came within 2 Euros, so at least I can flag something as paid in the database, close enough.)

It took me about an hour, which is pretty damned pathetic, really. About two minutes of that was realizing it was a recursive problem, and 58 minutes was spent debugging it. I'm pretty sure that I could come up with a programming system that would allow me to express this algorithm without being quite so brittle in its implementation. I mean, essentially that's what I want to do in the first place here.

To that end, I'm wondering if I shouldn't start looking for this kind of algorithmic problem and implement more of them. At least I'd stop dithering about whether it was a good use of time or not. I'll bet I could shave most of those 58 minutes off, anyway.
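For the record, here is a minimal sketch of that brute-force search in Python. It is not the code from that hour of debugging. The function name `closest_subset` and the invoice figures are made up for illustration, and amounts are kept in integer cents so the comparisons stay exact. Each invoice is either covered by the payment or not, which is where the recursion comes from.

```python
from typing import List, Tuple


def closest_subset(amounts: List[int], target: int) -> Tuple[int, List[int]]:
    """Return (gap, indices) for the subset of amounts whose sum is nearest target.

    Brute force: tries all 2**n subsets, which is fine for about 10 invoices.
    """
    best_gap = abs(target)      # the empty subset is the starting candidate
    best_pick: List[int] = []

    def search(i: int, total: int, picked: List[int]) -> None:
        nonlocal best_gap, best_pick
        if i == len(amounts):
            gap = abs(target - total)
            if gap < best_gap:
                best_gap, best_pick = gap, list(picked)
            return
        # Branch 1: invoice i is not covered by the payment.
        search(i + 1, total, picked)
        # Branch 2: invoice i is covered by the payment.
        picked.append(i)
        search(i + 1, total + amounts[i], picked)
        picked.pop()

    search(0, 0, [])
    return best_gap, best_pick


# Hypothetical open invoices and a partial payment, all in cents.
invoices = [12050, 8999, 43000, 15075, 6020]
payment = 27000
gap, chosen = closest_subset(invoices, payment)
print("off by", gap, "cents using", [invoices[i] for i in chosen])
```

A gap of zero means an exact match; anything else is the "close enough" case, and the caller decides how many cents of slack to accept.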

Sunday, October 30, 2011

IPEDS

More public data online: the National Center for Education Statistics has published databases of colleges in the US.

Saturday, October 29, 2011

Blitzsummary of linguistics for NLP

Good summary of How Linguistics Works.

Unbounce: component and target

Unbounce is a quick way to build landing pages for pre-startups or other purposes.

Fast test for startup ideas

Some of the pieces of this procedure could be automated. Maybe all of it, essentially...

DARPA Shredder Challenge

Solve all five puzzles first, win $50K. Not bad pay for November.

Marketing one product under two names

Another little technique for company construction.

Shakespeare, the programming language

So there's a cute little language called Shakespeare that I've run across before. It's a little silly, of course (well, that's the point!) but it came up on HNN, as does everything eventually, and one of the posts there suggests a "sort of obfuscator" that would, you know, write Shakespeare for you.

Which I think is a pretty dandy idea, in conjunction with a Shakespeare interpreter. It might use a Markov chain to generate the fluff, select characters at random, and so on and so forth - and everything should compile/interpret correctly.

Underlying the Shakespeare, of course, is a kind of bytecode language. That would be the interpreted language, and that would have a more normal expression as well that could be Shakespearified.

This could showcase parsers in Decl, and might not be a bad way to start doing some fun NLP-type stuff, too. Think about it!
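The Markov-chain part of that idea can be sketched in a few lines of Python. This only covers generating the fluff; mapping the output onto valid Shakespeare constructs is a separate problem. The seed text and the names `build_chain` and `generate` are invented for illustration.

```python
import random
from collections import defaultdict
from typing import Dict, List


def build_chain(words: List[str]) -> Dict[str, List[str]]:
    """Map each word to the list of words that follow it in the seed text."""
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)


def generate(chain: Dict[str, List[str]], start: str, length: int,
             rng: random.Random) -> List[str]:
    """Random walk over the chain, stopping early at a word with no follower."""
    out = [start]
    while len(out) < length:
        followers = chain.get(out[-1])
        if not followers:
            break
        out.append(rng.choice(followers))
    return out


# Made-up seed text; a real generator would be trained on far more.
corpus = "thou art as sweet as the sum of a rose and a summer day".split()
chain = build_chain(corpus)
print(" ".join(generate(chain, "thou", 8, random.Random(1))))
```

Passing in a seeded `random.Random` keeps a given run reproducible, which matters if the generated text has to round-trip through an interpreter.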

Thursday, October 27, 2011

Perl tutorials suck

They have a point.

Dissection of a viral launch

Interesting data-packed post about viral marketing.

Hello from a libc-free world

A fascinating look at melding assembly with C.

Target application: listofthingsforsale.com

This is very cool.

GUI vs. CLI

A post by Vivek Haldar musing on the differences between a GUI and a CLI, boiling down to the fact that a GUI is an operating interface while a CLI is an expressional or definitional interface. Honestly, though, I don't like CLIs; they require too much memory capacity for me. I much prefer a model of GUI plus scripting API.

The not-so-secret capitalist cabal that owns us all

Here's kind of an interesting analysis: turns out just 147 companies, mostly financial institutions, control about 40% of the wealth in the network of transnational corporations. Article in New Scientist, publication on arXiv.org. Seems like really a very straightforward query of the Orbis 2007 database of corporate information and some nifty visualization.

Wednesday, October 26, 2011

Language: Elephant

I don't believe it was ever actually implemented, but John McCarthy proposed a language he called Elephant based on speech acts, largely commitments, questions, and the like.

It's an intriguing concept, but he loses me pretty quickly on the implementation details. But the notion bears further thought.

Math: sympy

A symbolic manipulation library for Python.

Here's what I want to do:

Handwriting recognition on a tablet PC to be translated into OpenMath and thence TeX.
Selection of portions of a large mathematical formula and specification of specific operations to be carried out (e.g. "solve for this" or "call this theta" or what have you, said operations to be discovered by observation of my private theoretical physicist)
Maintenance of a log of the trajectory through formula space
n-fold productivity increases for theoretical physicists
Public perception of my private theoretical physicist as highly productive physics genius
Live on p.t.p.'s CERN salary while enjoying Geneva

It's freaking brilliant!

Tuesday, October 25, 2011

Nice interactive graphic

Data journalism! Here's a neat interactive graphic for exploring gas prices over time in all 50 states.

Analysis of Steve Jobs tribute messages

Nice application of NLTK to get some interesting statistics about the tribute messages to Steve Jobs on Apple's site.

Rocketcharts

Nice library for statistical/financial charting in HTML5: Rocketcharts.

JavaScript roundup

Speaking of JavaScript, I've got a couple of links - one of which is itself a JavaScript roundup, so this is really more of a meta-roundup.

So you want to write JavaScript for a living. Interesting list of some of the things one should know about JS.
Badass JavaScript, a blog.

Tangle: a JS library for reactive documents

OK, now this is cool: Tangle is a JavaScript library for spinning reactive documents [e.g.]. The idea is to write text that can be explored - or, as further down the page on the second link, you can build some graphics as well!

Monday, October 24, 2011

Some more open source projects

I keep running across open-source projects it would be nice to contribute to. Weird.

Qt has officially been spun off by Nokia. Along the same lines would of course be Tk and Wx, and I suppose native W32 by direct DLL access. All these share a lot of concepts that should be organized in parallel, and ultimately a feature in one should always migrate into the others so we're all working with the same set of concepts. They do eventually anyway, so it's kind of an obvious step to formalize that path.
MediaWiki is, of course, in PHP, and always has bugs outstanding. Hone the semantic understanding tools on that. Same goes for Drupal and WordPress, of course.
Which brings us to open-access science. This guy, a chemist at Cambridge, appears to be doing some actual data mining of open-access journals. I need to look a little closer at that. And remember: closed source kills.
And then there's WikiData.

Sunday, October 23, 2011

Decl striving mightily to hit CPAN

Turns out the Decl namespace was reserved by mistake, sort of namespace shrapnel from a project a couple of years ago. They're working on resolving it.

I'm looking at a "walk" tag/action that would justify (in my mind) adding another version number for the next upload. I might finish it up tomorrow; it's essentially just a built-in that permits easy traversal of directories or other directory-like hierarchical data.
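To make the idea concrete, here is a rough Python sketch of what a generic walk over directory-like data could look like. It is not the Decl tag itself, whose syntax isn't shown here. The nested dictionary stands in for a directory tree, and the name `walk` and the sample data are assumptions for illustration.

```python
from typing import Any, Iterator, Tuple

Path = Tuple[str, ...]


def walk(node: Any, path: Path = ()) -> Iterator[Tuple[Path, Any]]:
    """Yield (path, leaf) pairs, depth first, for a nested dict."""
    if isinstance(node, dict):
        # Sorted so the traversal order is predictable.
        for name in sorted(node):
            yield from walk(node[name], path + (name,))
    else:
        yield path, node


# Made-up hierarchy: inner dicts are "directories", integers are file sizes.
tree = {"lib": {"Decl.pm": 1200, "Decl": {"Node.pm": 800}}, "README": 90}
for path, size in walk(tree):
    print("/".join(path), size)
```

The same generator works for any hierarchy that can be presented as nested mappings, which is the "directory-like" part of the idea.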

Thursday, October 20, 2011

Decl doesn't actually hit CPAN

CPAN users FGLOCK and AVAR have registered the top-level "Decl" namespace for reasons as yet unknown. I've asked. We'll see.

Decl hits CPAN

I told myself I was going to make the switch from the old Class::Declarative to Decl on CPAN when I got templates and macros working. That happened today. (I'm still kind of giddy.)

I had built the CPAN module on Windows, of course (my desktop runs Windows because I'm a translator and it's kind of de rigueur, but also because even though I have a beard, I first got serious about programming on Windows, not Unix) - and I'd forgotten why I'd never done that. Windows makes tar.gz files with world-writeable files and directories, and PAUSE really doesn't like that.

Well, this time I Googled it. Duh. Memo to self - always Google problems. Somebody else has probably fixed it or at least can help you make sense of it.

Google AI challenge

This year's Google AI Challenge is a robot-army kind of tournament. Looks fun! But is that AI?

Graphics by Kevin Karsch

Kevin Karsch has done some pretty cool graphics work (see the first two links on this page... um, the first two as of this writing).

Tuesday, October 18, 2011

Ioke

New language roundup: Ioke.

The point of Ioke seems to be homoiconic macros, meaning it's probably something I should study: its goal is to maximize expressiveness and write code to write code. Aaand that's where I am.

Oh, what a tangled web we weave

... when first we practice to hashtie variables and then try to refer to those hashes in the setvalue for a node at a different level and forget that they're hashtied when building our freaking debugging print statements and get into a loop for that reason alone.

Sigh. Talk about your heisenbugs.

Monday, October 17, 2011

YouTube insult generator

Slick little Webscraping application.

Sunday, October 16, 2011

NLP

I spent a few hours last night avoiding work with yet another feverish trajectory through ongoing NLP research and books available.

NLTK has a book. It might be a reasonable place to start, just working through that. And there are online courses available.
I actually got a lot of useful information from Wikipedia, starting with UIMA, the Unstructured Information Management Architecture.
GATE comes up a lot. It's Java-based.
Apache OpenNLP is out there. Java.
Book: Handbook of Natural Language Processing
Oh, and Amazon recommendations come up with Syntax-Based Collocation Extraction
Looking for the individual chapters of HNLP seems fruitful: Bing Liu has a whole page on opinion mining and sentiment analysis and even links to a PDF of his chapter of the book (I wonder if the entire book couldn't be reassembled in that manner)
Liu has his own book on Web data mining.

All this just makes me drool. I swear, it's a mid-life crisis.

Hyde

A static website generator in Python (a sorta-port of Ruby's Jekyll). My question: what's the tradeoff between using Jekyll or Hyde as opposed to rolling my own in Decl, now that I have a template engine already? The community would help, sure, but ... how much would I actually use this? And wouldn't my debugging time be better used on my own dog food?

A possible approach

My ultimate goal with Decl is, of course, not only to provide a quick way to bootstrap data-structure-heavy Perl scripts into being, but also to describe software systems at a high level and provide a framework to implement them.

So in that second sense, a semantic domain of "natural language processing" would describe tasks at a high level, describe the algorithms and approaches to take in performing those tasks, and would be amenable to at least some degree of automation in coding the tasks in other languages using various toolkits already available. In other words, the semantic domain encodes at least some of the professional knowledge about that domain that a seasoned programmer would be expected to have; a programmer-in-a-box solution.

To that end, and maybe in NLP to an extent that's a little unusual in comparison with other domains I've wanted to get into, there has to be a means of describing a given toolkit and its basic approach - a theoretical framework if you will - that allows a given task/algorithm to be expressed using it.

Not sure how that's going to happen yet; I just want to throw the gauntlet down here.

It looks like a lot of NLP toolkits are in Java, for whatever reason, with NLTK in Python being a strong contender. Nothing, really, in Perl. Which is why God moved Ingy to create Inline, of course, and Decl will be incorporating Inline very soon.

Windows PE format in painstaking detail

Here's a Google Code repository with the most complete, wonderful definition of the Windows PE executable file format I've ever seen. It's a wonder to behold.

Saturday, October 15, 2011

Stanford's NLP class

I'm a little frustrated with Stanford's free access to their NLP class, because it's not really terribly self-contained. I've managed to find what I believe is their support code on Google Code, but I don't have access to their readings.

I'm starting to think I need to examine this more closely and maybe come up with a course/book of my own, offer it on my site. My site's badly in need of new content anyway. I seem to put everything here on Blogspot nowadays.

Data journalism

I ran across an article, "5 Tips for Getting Started in Data Journalism", to wit:

Be mercenary: do what works. But do it.
Shave yaks as needed: take the time to learn details when you need them.
Develop sources
Become the resident expert
Be the data project you want to see on the Web

I particularly liked "develop sources", because the author points to some data journalism blogs, including the Chicago Tribune and the New York Times.

Data journalism to me is like flame to a moth - I'm not too interested in the journalism itself, but the data work, oh, it's lovely.

Friday, October 14, 2011

Target application: web automation

So I found this linked from PC World. Nice feature list. English, maybe a little rocky... My guess is it's based on OLE automation of IE, which is almost interesting. It's cognate to any other way of automating a Web bot. (Note to self: think harder about cognate tasks.)

Description of Djuggler Enterprise

Data Juggler automates repetitive Web & data tasks without programming code. Use it to create sophisticated scripts for collecting data from the Web, filling Web forms, transforming text files, XML, CSV and database data. The easy-to-use drag-and-drop interface creates scripts that can be deployed as stand-alone Windows executables. Typical application examples:

Extract competitor's price list from Web pages regularly.
Extract people data from a Web pages.
Download Web images op a regular basis.
Get search results from multiple search engines.
Automated Web testing and load testing.
Export data to Web based applications using fill Web forms.
Automate web based workflow processes like timesheets.
Search & replace actions to clean data.
Transform data from one format to another.
Convert data from legacy applications to industry standards.
Automate database migration with Business Intelligence.
Comparing data and create reports.
Send emails with personalized attachments.
Server monitoring and reporting.
Synchronize folders, databases, etc.
Automate file management & data backup.

Automate IT operations by deploying stand-alone Djuggler scripts. The powerful script designer has many actions and functions like loops, 'if then else' conditions, get text between from html, get html table, get pictures, strip HTML, web macro's, read and save Excel, support for popular databases and many more. Demo's are included in the setup. Visit www.djuggler.com for the script repository and script service. A Djuggler Personal edition is available as freeware.

Keywords: Web data collection, Application Integration, Data Aggregation, Data Transformation, Report Generation, Batch Processing, Business Intelligence, System Monitoring, Form Filling, Web Scripting, Data Extraction, Web Testing.

Postmark spam filter has an API - Despammed should, too

Darn it, why don't I think of these things? (Their API.)

So I'm moving towards a plan to revitalize Despammed.com. Maybe the time has come? I'm not sure yet, but I want to do it.

Imagine, if you will, a spam filtration service that offers:

SpamAssassin
Procmail
Green and redlighting of known-good, known-bad actors on a per-account basis
CRM114
Bayesian training
Tracking of spamvertised URLs

as well as

Both forwarding and Webmail access
Arbitrary forwarding (including taking Web API action or Twilio phone action) based on rules, including rules that can be expressed in arbitrary JavaScript
Spam discussion with specific examples and other community action
Blogging about spam topics, including botnet identification and such

as well as

Uniform treatment of both email and Web spam
and yeah, an API...

Wouldn't that be cool? Maybe cool enough to pay for, even? Maybe, at this point, a manageable thing to put together because it needn't all be from scratch?

CSS game engine

Neat JS-and-CSS in-browser game engine.

CSS tricks

Here's a nice CSS-and-dynamic-code trick from Airbnb to display rating stars. Nice solution!

Tuesday, October 11, 2011

CRM114

Jesus H. Tap-Dancing Christ, I have seen the light. And it is CRM114.

A little background here. Back in 1999, a friend of mine said, "Hey, there's no free spam filtering forwarder in the world. You should write one, and then we'll figure out how to make money with it." So I did. Despammed.com was born. For the next five years or so, I learned a whole lot about how not to administer the firehose that is email. Due to lack of attention, the server crashed and burned and much of the code was lost (not the filter!) and the service, although its zombie is still on the Net, never really recovered. Because I didn't really care any more (we never had figured out how to make money with it).

In early 2007, Web spam was getting to be a hassle on a forum I had for my then-Web comic (by that time, it was already essentially in permanent hiatus, but a few friends still hung out on the forum). Xrumer had been released in November of 2006. I wrote a despammer for my forum, which seems to have been released on February 4, 2007. It worked for a while, but eventually I was forced to close the forum entirely.

In May of the same year, I refined some of the techniques I'd been working with, and the result was the modbot in Perl, a modular framework for applying various tests to determine the spam nature of a given post. It worked kinda well, but I couldn't drum up much interest in the wider world, and it died mostly a-borning.

OK. So that's my history in despamming. Now along comes something I never heard of: CRM114, which is a programming language invented specifically for expressing spam filters. And I'm looking at it, and I love it so much. Seriously. Also, here's a paper about it. Here are some publications by its perpetrator, William S. Yerazunis, who now works for Mitsubishi Electric and comes across to me as a really funny guy who would doubtless be a hoot to have a good supper with. Also, my envy for him burns with the heat of a thousand suns, because he basically seems to do all the stuff I really wish I had time to do.

So anyway. I am going to learn the shit out of CRM114.

Sunday, October 9, 2011

Music prototyping

An AskHN with interesting responses.

CmdrTaco: not dead - scaling

Rob "CmdrTaco" Malda writes about building a basically free-of-charge scalable cloud website.

Startup tools

A fantastic curated list of startup tools.

Puppet vs. Chef

As deployment solutions (at least in the Ruby world) Puppet and Chef are turning out to be pretty popular. Neither jumps out at me as a really beautiful syntax, but deployment (i.e. system configuration) strikes me as a sensible thing to start analyzing, starting with Puppet and Chef. What are their commonalities? What are their differences? How interchangeable are they?

Saturday, October 8, 2011

Concurrent Constraint Programming in Oz for Natural Language Processing

A book! Oz is a ... neat language. Its standard interface is, sigh, Emacs. You can imagine how I like that. But hey, I really, really need to get my head into NLP, so this would be another good place to start.

XSB Prolog

XSB is an open-source, tabled (i.e. memoized) Prolog. It has a Perl binding. It would be interesting to pursue. Very interesting, actually.

HNN: what data structure does the brain use?

I didn't expect much from this thread, but it ended up chock full of interesting things to follow up.

Test-driven Django tutorial

Does what it says on the tin.

Survey of debugging techniques

Or rather, "bug-avoidance techniques", perhaps. Good article.

TermL: another specification for expressing symbolic trees

No further comment, except to note that Decl support for this would be convenient.

OMeta: pattern-matching language

I'm a tad surprised I hadn't already blogged this, but OMeta is a language for expressing pattern matches. It can be embedded in Python as PyMeta. Interestingly, PyMeta includes a parser for TermL (about which see next post).

Pattern-matching a la OMeta/XSLT/what have you is definitely going to be one of the modes supported by Decl, but I still don't really grok it. So ... OMeta. For study and illumination.

One-liner music

So there's been a Thing about one-line algorithms fed into /dev/audio to create music (some pleasing, some not) [js in-browser equivalent].

It would be cool to do some kind of social evolutionary variant of the JS one. If only to provide a convenient way to tag your favorites, you know?
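For anyone who hasn't seen the trick: the "program" is a single integer expression in a sample counter `t`, and the low 8 bits of the result are the audio sample. Here's a minimal Python sketch; the formula is just an illustrative one of this kind, and the 8000 samples per second rate is an assumption, matching the usual unsigned 8-bit convention for these experiments.

```python
def one_liner(t: int) -> int:
    """One illustrative formula; only the low 8 bits become the sample."""
    return (t * ((t >> 12 | t >> 8) & 63 & t >> 4)) & 0xFF


def render(seconds: int, rate: int = 8000) -> bytes:
    """Evaluate the formula once per sample and pack the results as raw bytes."""
    return bytes(one_liner(t) for t in range(seconds * rate))


samples = render(2)
print(len(samples), "bytes of unsigned 8-bit audio")
```

Swapping in a different body for `one_liner` is the whole game, which is also why a tag-your-favorites variant seems feasible: a "song" is nothing but a short expression string.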

Linear regression and linear algebra

OK, OK, I shouldn't be so excited about this, but my machine learning class hasn't even started and I'm already grooving on the preparation parts. Including linear regression and linear algebra.

Linear regression in financial analysis [investopedia] - this is magic to a lot of people.
Numerical linear algebra code is nearly universally built on BLAS: the Basic Linear Algebra Subprograms, whose reference implementation is written in Fortran.
Here's a textbook on elementary linear algebra.
ATLAS is an automatically tuned implementation of the BLAS interface (plus part of LAPACK).
And in general this all leads into numerical linear algebra.

So, yeah, that's all a valuable domain. I could particularly see a code generator writing literate programs to solve linear algebra problems, then running them in a separate process. This is the kind of thing I want to get into.
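As a reminder of how little magic is involved in the one-variable case, here is a small Python sketch of ordinary least squares for a straight line, using the closed-form solution and no libraries. The function name `fit_line` and the sample points are made up for illustration; a real workload would hand this to a BLAS-backed library instead.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points lying exactly on y = 2x + 1.
print(fit_line([0, 1, 2, 3], [1, 3, 5, 7]))
```

With more than one input variable the same idea becomes a matrix problem, which is where the linear algebra libraries come in.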

Monday, October 3, 2011

E-discovery

This is exactly what I want to do: mechanical discovery of facts and structure from large collections of documents. The New York Times has an article.

That article mentions the Enron corpus, the set of emails gathered - and then published - by the Federal Energy Regulatory Commission during its investigation. There are various versions here and there, including one from the EDRM organization (Electronic Discovery Reference Model). That organization deserves a closer look.

Saturday, October 1, 2011

OpenMath

OK, I have to admit, OpenMath is really cool. Here's the list of software and tools that work with it - all pretty thin, actually, but their heart's in the right place. This is exactly what I was looking for. It's always such a relief to find somebody else has done the work already!

Note from the software-and-tools page: there's an OpenMath-to-LaTeX translator (apparently written in Perl, no less!) that ... well, it does what I was discussing earlier today. So very cool. (Update: it was written in 2000 and is therefore not at all OO, but it's unencumbered and built on a rather slick modular architecture, so I've asked the author if I could polish it up [rewrite it] and put it on CPAN. Very, very slick.)

So here's the plan, more or less:

XML, binary, and Declarative versions of representation
LaTeX output
Octave output and manipulation and parsing back in
Some kind of overarching systems description a la "semantic Excel"
Some kind of graphical presentation as active areas a la Equation Editor (but better)

I'm this close to being able to put together that stylus-to-LaTeX math manipulation tool I was thinking about in the 90's, just by using off-the-shelf components. I need a tablet. I badly need a tablet.

Visual Modeling and Programming with Graph Transformations

Dorothea Blostein at Queen's University is really into some very cool stuff. (Ran across her at the link from the previous post - she's working in knowledge representation.)

Graph transformation languages look really neat. She's written a book. It's in pieces of PDF on her site, so I should download them - but I don't really have an effective way to organize downloaded PDFs and papers yet, so instead I've just linked to her page, above.

Math

Ah, math, my old nemesis.

Necessarily, a machine learning class uses math (which is one of the reasons I'm taking it) and so I'm thinking about How People Think About Math. This would be a good thing to work on anyway - someday I really hope to get back to that Hofstadterian AI research track - and so here I am, thinking.

Here, by the way, are some neat JavaScript tools for learning and working with math. One spinoff of all this is that I'd like to do something that generates things like this - kind of like a big JavaScript Excel generator. That's something I've wanted to do for a long time, actually. So we'll see how well I do on that subgoal.

But the larger goal is this: when working with mathematical functions, we typically have a boatload of different representations floating around. Typesetting is done in TeX, of course, but there also has to be a more semantically-oriented form that's useful for tossing to Mathematica/Maple/Octave/whatever the heck you're using (and that includes expressing it as Python or C or Perl).

But the key is this: underlying all that, there is a semantic structure that is the actual equation or expression. That is what I want to approach. And in fact it's an area of active research (of course) - most of which is behind paywalls. Thanks, Springer-Verlag! But searching on names still turns up fascinating links [OMDoc]. If I only had all the time in the world, I could start reading arbitrary numbers of interesting papers. (I'm actually more interested in building a research tool to support the reading of arbitrary numbers of interesting papers in a more efficient manner. But that's a story for another day.)

As far as I can tell in half an hour's search, the state of the art for representing mathematical semantic structures appears to be MathML or something more or less like it. Yeah, XML as serialization, which makes my eyelid twitch, but hey, there you go.
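The point about one underlying structure with several surface forms can be sketched in Python. This is a toy, not OpenMath or MathML: the node classes `Num`, `Var`, and `BinOp` and the two renderers are invented for illustration, and only `+`, `*`, and `/` are handled.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Num:
    value: int


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class BinOp:
    op: str          # one of "+", "*", "/"
    left: "Expr"
    right: "Expr"


Expr = Union[Num, Var, BinOp]


def to_latex(e: Expr) -> str:
    """Render the tree for typesetting."""
    if isinstance(e, Num):
        return str(e.value)
    if isinstance(e, Var):
        return e.name
    left, right = to_latex(e.left), to_latex(e.right)
    if e.op == "/":
        return r"\frac{%s}{%s}" % (left, right)
    if e.op == "*":
        return left + r" \cdot " + right
    if e.op == "+":
        return "(" + left + " + " + right + ")"
    raise ValueError("unsupported operator: " + e.op)


def to_python(e: Expr) -> str:
    """Render the same tree as an expression a program can evaluate."""
    if isinstance(e, Num):
        return str(e.value)
    if isinstance(e, Var):
        return e.name
    if e.op not in ("+", "*", "/"):
        raise ValueError("unsupported operator: " + e.op)
    return "(" + to_python(e.left) + " " + e.op + " " + to_python(e.right) + ")"


# (a + 1) / (2 * b), built once and rendered twice.
expr = BinOp("/", BinOp("+", Var("a"), Num(1)), BinOp("*", Num(2), Var("b")))
print(to_latex(expr))
print(to_python(expr))
```

The tree is the equation; TeX and code are just two views of it, and a Mathematica or Octave view would be one more renderer over the same nodes.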

I'll get further into this as the class progresses, of that I'm sure.

Update: OpenMath is the thing I'm looking for.

Gamification

A long post (and another) by Tim Rogers on the evil brain-sucking parasite that is Sims Social and other games. Here's what would be cool:

Economic analyses of popular games
Simulations of popular games
Genetic algorithm to devise new ones. Hee.

Or: how to take over the world without actually working.

Stripe

A new payment gateway that looks quite promising.

Notificon

A JavaScript tool to permit a page's favicon to include two characters of indication. Very neat!

What would be neater: a tagging system that was semantic in some way, to permit the functionality-based indexing of this kind of component.

Spambot combat

Here's an article with some very nice techniques for building more spamproof submission forms. Tl;dr:

Timestamp: don't allow a long period between reading and posting. (I had mixed success with this way back when.)
Hash: check a hash of the IP, timestamp, and post # - prevents replay attacks.
Randomized field names.
Honeypot fields: invisible (not hidden) fields that, if filled in, are a spam indicator.

The author of the post uses these and only these to block spam - no content-based filters at all. That's cool.
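Three of those four checks fit in a short server-side sketch in Python (randomized field names are left out). This is not the article's code: the field names, the secret, the one-hour limit, and the choice of HMAC-SHA256 as the "hash" are all assumptions for illustration.

```python
import hashlib
import hmac
import time
from typing import Dict, Optional

SECRET = b"change-me"        # hypothetical server-side secret
MAX_AGE_SECONDS = 3600       # hypothetical limit between page load and post
HONEYPOT_FIELD = "website"   # hypothetical name of the invisible field


def make_token(ip: str, post_id: str, timestamp: int) -> str:
    """Token embedded in the form when the page is served."""
    message = f"{ip}|{post_id}|{timestamp}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def looks_like_spam(form: Dict[str, str], ip: str, post_id: str,
                    now: Optional[int] = None) -> bool:
    """Apply the honeypot, timestamp, and hash checks to a submitted form."""
    now = int(time.time()) if now is None else now
    # Honeypot: humans never see this field, so any value is a spam signal.
    if form.get(HONEYPOT_FIELD):
        return True
    # Timestamp: reject missing, malformed, future, or stale values.
    try:
        timestamp = int(form.get("timestamp", ""))
    except ValueError:
        return True
    if not 0 <= now - timestamp <= MAX_AGE_SECONDS:
        return True
    # Hash: the token must match this IP, post, and timestamp (no replays).
    expected = make_token(ip, post_id, timestamp)
    supplied = form.get("token", "")
    return not hmac.compare_digest(expected.encode(), supplied.encode())


# A legitimate submission one minute after the form was served.
form = {"timestamp": "1000", "token": make_token("1.2.3.4", "42", 1000), "website": ""}
print(looks_like_spam(form, "1.2.3.4", "42", now=1060))
```

Because the token binds the IP, post, and timestamp together, a captured form can't simply be replayed from another address or against another post.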

As you know, Bob, I have long wanted to produce a workflow system of sorts that would include spam content filters; form generation is something I hadn't even considered - but it's a great idea. So ... keep this in mind.

Learning algorithms

Here's a nice presentation about (1) learning to program, (2) why algorithms matter, (3) a lot of maze algorithms, and (4) how a general algorithmic approach can often generate better solutions.

Nice stuff. Also, the final slide generates mazes using different algorithms. Neat!