Tuesday, November 30, 2010

World moving faster than me

As per usual: there are some awesome data extraction tools coming online, making my WWW::Declarative nearly obsolete before it's really even started.

Winning at coin-tossing games

A neat Mathematica presentation on Wolfram's blog.

Sunday, November 28, 2010

A more useful way to include individual declarative classes

It would be nice to be able to say "use Win32::Word::Declarative;" in a conventional Perl program and have it go ahead and set up the Class::Declarative environment. Makes testing setups easier, too.
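Something like this might do it - a minimal sketch in which the wrapper module's import() simply delegates to Class::Declarative, naming itself as the semantic domain to activate. The delegation convention shown here is an assumption for illustration, not Class::Declarative's documented interface:

package Win32::Word::Declarative;
# Sketch only: make "use Win32::Word::Declarative;" stand up the whole
# Class::Declarative environment with this module as the semantic domain.
use strict;
use warnings;
use Class::Declarative;

sub import {
    my $class = shift;
    # Hand off to Class::Declarative on the caller's behalf, passing
    # ourselves as the semantics to load (hypothetical convention).
    Class::Declarative->import($class, @_);
}

1;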

Hmm...

Getting Real: book on webapp construction

Free book! Probably worth perusing.

Win32::Word::Declarative

So for some time I've needed a way to slap together scripts for Word that don't rely on Word's own scripting. I used to do this stuff in Python, but now I've gotten Win32::Word::Declarative to the point where it can generate an attractive document as, say, an invoice.

This is an important milestone, but it doesn't get me all the way to an actual invoicing system. I need to have a few more components first:
  • Something like Document::Declarative to manage the actual files, repositories, and templates - I go back and forth as to whether that should be inherent in the language or split out into a document-management semantics
  • Mapping to apply templates and create abstract documents and expressed documents
  • Database retrieval to obtain customer information and so on, not to mention to determine what goes into invoicing in the first place
  • The hierarchical configuration system (this would apply to templates)
  • Probably something like a general semantics system; this is kind of intuitive, but I get the feeling that this is how you'll be able to say "I want an invoice and this is what an invoice is".
So. Getting there, but not yet there. A whole lot closer than last week, though. I think there may be enough working in Win32::Word::Declarative that I can put v0.01 on CPAN.

More HNN quant bloviation

Lots of keyword-rich text there. No time.

Saturday, November 27, 2010

Target application: RTLKlub spider

My wife wants to watch Hungarian TV clips, and RTL has some stuff online - but it takes 5.5 minutes to get a 2-minute clip out through the Hungarian pipe. Obviously, I need to cache things, and so: a spider. A task!

I know how to cache movies served through Flash players, but some of that content is in Silverlight, and I don't know whether the same approach will work there. But at least I could get the Flash movies.
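The caching half, at least, is easy. Here's a minimal sketch of the fetch-and-cache step using LWP::UserAgent's mirror(); the URL and directory are placeholders, not RTL's actual layout:

#!/usr/bin/perl
# Cache a clip locally, re-fetching only when the remote copy changes.
use strict;
use warnings;
use LWP::UserAgent;
use File::Basename qw(basename);
use File::Path qw(make_path);

my $ua = LWP::UserAgent->new(timeout => 30);

sub cache_clip {
    my ($url, $dir) = @_;
    make_path($dir);
    my $file = "$dir/" . basename($url);
    # mirror() sends If-Modified-Since, so an unchanged clip costs one
    # cheap 304 instead of another trip through the slow pipe.
    my $res = $ua->mirror($url, $file);
    print "$url -> $file (", $res->status_line, ")\n";
    return $file;
}

cache_clip('http://example.com/clips/clip123.flv', './cache');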

Getopt::Lucid and Term::Shell

Perl brass tacks: I'm basing my command-line handling on Getopt::Lucid. And in the end, I think I'll just end up writing my own Term::Shell replacement that works along the same lines. The events in an event context can be seen as commands, so invoking a "shell" conversation against an arbitrary event context would be pretty cool - but Term::Shell is just a little too hardwired to make that entirely useful. So we'll see. Either way, both modules have been open tabs in my browser for a couple of weeks now, and it's time to clean up and leave a little nonvolatile state.

Update after looking more closely at the Term::Shell source - it would be folly not to subclass Term::Shell. There's a lot of really cool stuff in there. I just have to have an automagical way of setting up the commands - which is easy - and I'm good to go. This will be very cool!
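For the automagical part, a sketch: install run_* handlers into a Term::Shell subclass from a plain name-to-coderef table. The %commands hash here is a stand-in for the events of an event context, not anything Term::Shell itself provides:

package Event::Shell;
use strict;
use warnings;
use base 'Term::Shell';

# Stand-in for an event context: command names mapped to handlers.
my %commands = (
    status => sub { print "everything nominal\n" },
    fire   => sub { print "firing event: @_\n" },
);

# Term::Shell dispatches "foo" to run_foo(), so build those methods
# (plus one-line summaries for the help display) at load time.
for my $name (keys %commands) {
    no strict 'refs';
    *{"run_$name"}  = sub { my ($self, @args) = @_; $commands{$name}->(@args) };
    *{"smry_$name"} = sub { "auto-generated command '$name'" };
}

package main;
Event::Shell->new->cmdloop;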

Judging from the code, Term::Shell can run under Tk somehow, but Google isn't being very forthcoming about that. It would be nice to be able to use Term::Shell in a Wx context. Very nice, actually.

More GA and machine learning

Link-dumping continues. First, "Using GA to find Starcraft 2 build orders". Second, a useful overview of machine-learning techniques that I just haven't had time to finish reading.

Gamification

An interesting new term for the patterns of game play ported into superficially non-game venues. There's a Wiki for that. Also, an entertaining article on Cracked about intentional addictivity in games. My evil-future scenario: use addictivity plus a Mechanical-Turk interface to analyze email for the NSA. The seed of an interesting Stross story.

Friday, November 26, 2010

Sweet-expressions

A readable syntax for Lisp, with HNN commentary. I just eat this stuff up. A Plisp would probably have to include this.

Target domain: 3D modeling

Cool library of 3D modeling functionality, OpenSceneGraph. Problem: it has a Python binding but no Perl binding. Now, clearly we ought to be able to generate Python just as easily as Perl from a declarative structure. Would this be the context for doing that?

Expressive Programming Systems

Steps towards Expressive Programming Systems, 2010 report. Chock-full of interesting ideas that deserve further exploration on some day when I don't have 42,000 words hanging over my head.

Monday, November 22, 2010

Design without designers

An interesting post on what design means in a world where it's being nibbled away by A/B-testing data approaches and genetic algorithms.

More scraping

Ooh. The "Readability" Javascript tool munges a page to put its "content" - that poorly defined part of the HTML that represents the parts the humans actually read - into a separate area for actual, well, reading, minus all the ads and links and sidebars and so on.

That algorithm has been ported to Perl as HTML::ExtractMain. So it's going into WWW::Declarative.
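Usage looks to be about as simple as it gets; a quick sketch with a placeholder URL:

use strict;
use warnings;
use LWP::Simple qw(get);
use HTML::ExtractMain qw(extract_main_html);

my $url  = 'http://example.com/some-article';    # placeholder
my $html = get($url) or die "couldn't fetch $url\n";

# Readability-style extraction: returns the main content subtree as
# HTML, or undef if nothing plausible is found.
my $main = extract_main_html($html);
print defined $main ? $main : "no main content found\n";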

In other news on this front, O'Reilly has a sale on a bunch of relevant books. Sigh.

NLP (again)

I just keep finding cool things.

Python's NLTK. HNN post on "Natural Language Processing for the Working Programmer", a book-in-the-offing based on Haskell. I'm contemplating porting something like this into Perl.

Thursday, November 18, 2010

Page scraping: Enlive and CSS selectors

So here I am again, thinking about page scraping (earlier) - that part of a Web robot that comes between retrieval of a given page and some data representation that is what we're really after.

Unsurprisingly, there are a great number of solutions to this problem, some better, some worse. The one I want to talk about right now is Enlive, which is based on Clojure and looks pretty darned interesting. I want to do that in dperl. There's an Enlive tutorial that's quite well-written.

Now Enlive uses something very CSS-selector-like to extract interesting information from HTML trees. This is the part that struck my fancy. So here is a CSS selector tutorial as well.
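Perl already has machinery in this direction, for what it's worth: Web::Scraper drives extraction with CSS selectors, translating them to XPath internally. A quick sketch of the Enlive-ish flavor, with a made-up selector and URL:

use strict;
use warnings;
use URI;
use Web::Scraper;

# Pull headline text and links out of a page with a CSS selector.
my $headlines = scraper {
    process 'div.story h2 a',
        'titles[]' => 'TEXT',
        'links[]'  => '@href';
};

my $res    = $headlines->scrape(URI->new('http://example.com/news'));
my @titles = @{ $res->{titles} || [] };
my @links  = @{ $res->{links}  || [] };
print "$titles[$_]\t$links[$_]\n" for 0 .. $#titles;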

Very promising direction.

Natural-language programming

Dammit, Wolfram is at it again, stealing my ideas. "Natural-language programming is closer than you think" - and he's got the key insight (no, not hexapodia this time): natural-language programming is a dialog. You'll still end up with code, but it will be code that is semantically wrapped in a set of concepts you've defined interactively with the machine. Lack of clarity will elicit questions.

Those last two sentences are my own plan. I didn't read all of Wolfram's article because I find it hard to read about people developing things I really want to do first.

Target domain: computer forensics

Computer forensics is basically the analysis of data files, intrusion logs, etc. As such, it's related to the kind of AI I want to do, so it's fair game - and there are open-source tools to use, too. Moo ha ha.

I'm particularly struck by the nascent definition of standard operating procedures (none actually yet defined). As you know, Bob, SOPs are workflow. Workflow is ... another target domain, actually. So stick that in your pipe and smoke it.

Monday, November 15, 2010

Target application: text generation

Oh, this is just an idea I can't kick - text generation. You can get paid for it. (That's an essay writing company; I Googled after reading this account of a freelance essay writer.) It's the Eschaton, immanentizing before your very eyes.

I keep thinking of the tic-tac-toe analyzer that was the subject of a thin book I read - and presented at an AI class given by Doug Hofstadter - lo! these many years ago. No notes survive that I can find, but the quality of text generated as justifications for the program's strategies was truly amazing.

I want to do that.

Sunday, November 14, 2010

Pattern matching

I still haven't quite got my head around pattern matching, but I think I'm getting closer. The Wikipedia article on pattern matching largely addresses Haskell and Mathematica (both of which provide pattern matching as part of the language). There's a Sub::PatternMatching in Perl, which ... almost does what I'm looking for, and there's Data::Match, which I remember finding earlier this year. And of course XSLT is based on pattern matching as well.

Essentially, a pattern is a structure with holes. The holes may be named, and we can also make assertions about them - say, that hole A and hole B must have the same content - but that's the basic upshot. When applied to one or more targets, the pattern acts as an iterator; it can return more than one match.

Chained together, these matches are AI's "unification", a powerful technique that can find multiple solutions to a given question posed in terms of predicate assertions over a universe of data. Pattern matching is unification, quite literally (although the converse is not true; unification includes things that aren't pattern matching).

Oh, wait. I said that the pattern matcher is an iterator - that's true, it can be used in a "data mode", but we can also associate actions with patterns to move the pattern matcher into an "action mode", and I think in the case of Class::Declarative, this mode is going to be easier to conceptualize. This is the mode used in functional languages such as Haskell (I'm basing this statement on the introduction to Sub::PatternMatching), and now that I think about it, XSLT as well.

When applying a series of patterns to a given data structure in this model, we can think of the series as a kind of "case" selector. Each match runs a bit of code, and the named bindings in the match are passed to the code as its call parameters. We could also disassociate the matches and the code by defining events that would fire, invoking code defined elsewhere (making more general coding easier).

All that remains is to start thinking of some use cases and coding up some likely-looking expressions of patterns. WWW::Mechanize and HTML::TreeBuilder are obvious candidates here; we're essentially doing what parsley does in this case.
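To make that concrete, here's a toy matcher - not any existing module's API, just an illustration of named holes, the repeated-hole constraint, and the action-mode case selector. A real matcher would also iterate over multiple matches; this one stops at the first:

use strict;
use warnings;

# A pattern is a nested structure whose holes are strings like '?name';
# match() returns a hashref of bindings on success, undef on failure.
sub match {
    my ($pat, $data, $binds) = @_;
    $binds ||= {};
    if (!ref $pat && $pat =~ /^\?(\w+)$/) {    # a named hole
        my $name = $1;
        return exists $binds->{$name}          # a repeated hole must bind
            ? ($binds->{$name} eq $data ? $binds : undef)  # the same value
            : do { $binds->{$name} = $data; $binds };
    }
    if (ref $pat eq 'HASH' && ref $data eq 'HASH') {
        for my $k (keys %$pat) {
            return undef unless exists $data->{$k};
            $binds = match($pat->{$k}, $data->{$k}, $binds) or return undef;
        }
        return $binds;
    }
    return (!ref $pat && !ref $data && $pat eq $data) ? $binds : undef;
}

# Action mode: pair patterns with code; first match wins, and the named
# bindings are passed to the code as its parameters.
my @cases = (
    [ { type => 'invoice', customer => '?who' } => sub { print "bill $_[0]{who}\n" } ],
    [ { type => '?other' }                      => sub { print "skip $_[0]{other}\n" } ],
);
my $doc = { type => 'invoice', customer => 'ACME' };
for my $case (@cases) {
    my ($pat, $code) = @$case;
    if (my $b = match($pat, $doc)) { $code->($b); last }
}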

Comparison of Python Web frameworks

"I am so starving" - a series of implementations of the same service in a dozen different Python Web frameworks. Interesting!

Saturday, November 13, 2010

Thoughts on rich text

There are lots of ways to encode formatting and semantic information concisely in text. (The "concisely" is what I'm going to address here.) They include things like Markdown, ReStructured Text, Text::Multi (a more generic Markdown-like framework), Markdent (another interesting parsing framework), and of course markup like HTML/SGML/XML, RTF, and TeX.

The end purpose of all these text formatting languages is to provide a way to type regular text and have it formatted or typeset. I've been using something like Markdown in my pseudocoding so far, but I need to think things through in a principled manner, and essentially, what I've come up with is this.

First, the target. The target is formatted text, but what does that mean? Clearly, it means a tree of nodes specifying formatting in a generic manner:
container
  paragraph
    text
      This is an
    italic
      text
        example
    text
      text.
  paragraph
    text
      It consists of two paragraphs, one of which contains italics.
Obviously, I'd rather type this:
text
  This is an {\i example} text.

  It consists of two paragraphs.
Now note that in this example I've used an RTF-like curly-bracket-and-backslash style, and I've used a convention that a blank line represents a paragraph break, the latter being more or less universal in Wiki and Markdown settings these days.

So what I want to do for the text node - and this will end up being used throughout Class::Declarative, mind you - is to provide a basic framework for rich text that will allow the user to specify parsers to turn any textual formatting language into nodal structure, then to include a few simple parsers (e.g. Markdown and RTF-like) that can be selected in some way.
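As a proof of concept for the parser side, here's a toy that handles exactly the conventions above - blank lines split paragraphs, {\i ...} marks italics - and emits the nodal structure as nested arrayrefs. Illustration only; this is not Class::Declarative's actual node API:

use strict;
use warnings;

# Parse blank-line-separated paragraphs with {\i ...} italics into
# nodes of the form [ tag => @children ].
sub parse_rich {
    my ($src) = @_;
    my @paras = split /\n\s*\n/, $src;
    return [ container => map { [ paragraph => parse_spans($_) ] } @paras ];
}

sub parse_spans {
    my ($s) = @_;
    my @nodes;
    while ($s =~ /\G( \{\\i\s+([^}]*)\} | [^{]+ )/gx) {
        push @nodes, defined $2 ? [ italic => [ text => $2 ] ]
                                : [ text => $1 ];
    }
    return @nodes;
}

sub dump_tree {    # indent-per-level printer, echoing the sketch above
    my ($node, $depth) = @_;
    $depth ||= 0;
    my ($tag, @kids) = @$node;
    if ($tag eq 'text') { print '  ' x $depth, "text: $kids[0]\n"; return }
    print '  ' x $depth, "$tag\n";
    dump_tree($_, $depth + 1) for @kids;
}

dump_tree(parse_rich("This is an {\\i example} text.\n\nIt consists of two paragraphs."));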

Target application: poker

I swear, this blog is starting to be less about techniques and more about me rediscovering how many fun things there are to program in the world.

This time, as so very often, the trigger was an HNN post about poker bots, leading me to investigate the poker sites still legal in the US and eventually find the software download page for one of them, Full Tilt.

Thing is, the poker sites don't want bots for one simple reason: if people think they're losing because machines are cheating them, the poker site loses users and therefore money. However, to be perfectly honest, a poker bot would be a good way to make a little money every day while learning a whole lot about AI techniques.

I'm sorely tempted. Really and truly, especially given my heritage as the grandson of a professional poker player.

Update: Full Tilt has fallen on hard times lately; as of November 16, 2011, the link goes to a "US-only" page discussing disbursement of remaining funds held by US residents. But hey - in half a year I'll be living in Europe! So poker can wait until then; Europe loves poker.

Kaggle.com: hosting data mining/machine learning challenges

Now this is cool. Kaggle.com hosts challenges for data mining. That is, every few weeks they start a new challenge for statistical machine learning algorithms, with cash and prizes. I believe it's time for me to start learning.

Saturday, November 6, 2010

NLP link dump

So there's an NLP challenge: find "semantically related terms" over a large vocabulary (> 1 million words) given a large corpus, in a reasonable amount of computer time and with little or no human input. From its links on how the corpus was prepared, I ran across splitta (a sentence-boundary finder that will be very useful in the Xlat project), a Penn treebank tokenizer sed script, and the Topia term extractor.

All of these will be essential parts of the eventual front end to OpenLogos, for example. I urgently need a way to find significant terms in the source text in order to facilitate early glossary work (i.e. ideally do the glossary work before starting the translation). This is pretty critical.
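Until those are wired in, even a naive frequency ranking would help triage glossary candidates. A baseline sketch - no lemmatization, no multi-word terms, just a stoplist and counts over text on stdin:

use strict;
use warnings;

# Rank words by raw frequency after dropping short words and a tiny
# stoplist; print the top 25. Run as: perl terms.pl < source.txt
my %stop = map { $_ => 1 } qw(the a an and or of to in is it for on with that);

my %freq;
while (my $line = <STDIN>) {
    for my $w (grep { length > 3 && !$stop{$_} }
               map { lc } $line =~ /([[:alpha:]]+)/g) {
        $freq{$w}++;
    }
}
for my $term ((sort { $freq{$b} <=> $freq{$a} } keys %freq)[0 .. 24]) {
    printf "%6d  %s\n", $freq{$term}, $term if defined $term;
}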

I might actually tackle the NLP challenge as well, but honestly - all this stuff is mind-meltingly fascinating to me. I need to get serious.