Monday, March 19, 2012

Devel::Declare

Here's a little module Dave Rolsky turned me on to - I'm going to have to read the documentation a couple of times, but essentially it takes control of the Perl parser by arcane means and lets you define your own syntax.

I love the idea of getting away from rickety code generation and into rickety on-the-fly code munging, but the real value would be letting the Perl parser do what it does best; trying to fake it with line-by-line regexps is just asking for trouble.

Sunday, March 18, 2012

Porter stemming algorithm

Just for completeness, here's a link to the Porter stemming algorithm for English. It's another hash of regexps and special cases, but it's standard and does a fair-to-middling job.
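To give a flavor of what "a hash of regexps and special cases" looks like, here is a minimal Python sketch of just the first step of the algorithm (step 1a, the plural suffixes). The function and table names are made up for illustration, and the full stemmer has several more steps with conditions on the stem that are not shown here.

```python
import re

# Step 1a of the Porter stemmer: plural suffixes.
# Rules are tried in order and the first one that matches wins.
STEP_1A = [
    (re.compile(r"sses$"), "ss"),  # caresses -> caress
    (re.compile(r"ies$"), "i"),    # ponies   -> poni
    (re.compile(r"ss$"), "ss"),    # caress   -> caress
    (re.compile(r"s$"), ""),       # cats     -> cat
]


def porter_step_1a(word):
    """Apply only step 1a; the remaining steps are deliberately omitted."""
    for pattern, replacement in STEP_1A:
        if pattern.search(word):
            return pattern.sub(replacement, word)
    return word
```

Note that "ponies" comes out as "poni": the stem only has to be consistent, not a dictionary word.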

NLP class assignment 1

The first assignment for the Coursera/Stanford NLP class was a regexp approach to finding email addresses and phone numbers in a set of faculty pages from the Stanford CS department.

I ended up doing a first pass with a list of regexps, then doing some post-processing afterwards. As is always the case, to hit all their test cases, a lot of fiddly special-casing had to be done, the most irritating being Jurafsky's own email address, which consists of a JavaScript call. As I had no particular intention to embed SpiderMonkey in the Python code, this had to be entirely special-cased, although a sane system design would have put this into the page retrieval spider, not in a regexp-based recognizer.
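The two-stage shape (a list of regexps, then a clean-up pass) can be sketched roughly like this. It is not the code from the assignment: the two patterns, the function name, and the example addresses are all assumptions for illustration, and the real thing needed many more special cases.

```python
import re

# Hypothetical patterns, tried in order.
EMAIL_PATTERNS = [
    # plain form: ada@cs.example.edu
    re.compile(r"([\w.-]+)@([\w-]+(?:\.[\w-]+)+)"),
    # spelled-out form: bob at cs dot example dot edu
    re.compile(r"([\w.-]+)\s+at\s+([\w-]+(?:\s+dot\s+[\w-]+)+)", re.IGNORECASE),
]


def find_emails(text):
    """First pass: run every regexp. Second pass: normalize and de-duplicate."""
    found = []
    for pattern in EMAIL_PATTERNS:
        for user, host in pattern.findall(text):
            # Post-processing: turn " dot " back into "." and lowercase the result.
            host = re.sub(r"\s+dot\s+", ".", host, flags=re.IGNORECASE)
            address = f"{user}@{host}".lower()
            if address not in found:
                found.append(address)
    return found
```

Even in this toy version the logic is already split between the patterns and the normalization step, which is the maintainability problem in miniature.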

But that's neither here nor there (just low-level irritation) - the real point of the assignment was of course both to provide a little exposure to regular expressions and to make it clear that natural language is utterly rife with special cases, and it succeeded on both counts.

It led me to note that splitting the logic between regular expressions and a separate post-processing step was irritating from the point of view of code understandability and maintainability. Probably it would be better to have a more powerful text processing language, perhaps explicitly based on a parser. And it should work from a real tokenizer to eliminate all possible HTML obfuscation - for example, the simple trick of having DIVs break up your email address would have made this code useless. (Which is why it irritated me to have the JavaScript instance in there.)

Anyway, it was fun.

Saturday, March 17, 2012

A little Word scripting

As you may know, I'm a technical translator by trade these days, so I work under Windows with Word documents a lot. Word has a scripting mechanism that I've often considered to be Microsoft's way of making it technically possible to program whatever you need in Word, without actually making it sensibly easy (making hard things possible and making easy things hard, thus ensuring grist for a robust consulting mill), but back in a previous life I spent a lot of paid time writing Word macros to do text processing. Profiting from that grist, I suppose.

Anyway, this week I have a huge pile of process definitions to be placed into Word, each consisting of a table on each page with a list of steps in a process to be carried out manually (many with barcodes for acknowledgement and so on). Each step has a unique identifier - by which I mean that each type of step has a unique identifier and there can be any number of tokens of each type of step.

And it's all in German and my job is to translate it to English. And the originals are all scanned PDFs, so I can't actually just use a regular translation tool to find repetitive text, or even cut and paste into Word; it all has to be typed. I do have properly formatted blank tables to start with for each document.

What I've ended up doing is putting each type of text into a table in a dictionary document, then writing a macro that looks up the unique ID I've typed into the given target document's first column and, if it's found, copies the already-translated row into the target document. If that type hasn't been encountered yet, then I translate it, and copy the row into the dictionary by hand.
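Stripped of Word's object model, the macro's logic is a keyed lookup. The following is a Python sketch of that logic only, not the macro itself; the row layout (a list of cells with the ID in the first cell), the function name, and the sample IDs are assumptions for illustration.

```python
def fill_from_dictionary(target_rows, dictionary):
    """Copy already-translated rows into the target table by step ID.

    target_rows: list of rows, each a list of cells; cell 0 holds the typed ID.
    dictionary:  maps step ID -> already-translated row (a list of cells).
    Returns the IDs that were not found and still need translating by hand.
    """
    missing = []
    for i, row in enumerate(target_rows):
        step_id = row[0].strip()
        if step_id in dictionary:
            # Found: replace the row with a copy of the translated one.
            target_rows[i] = list(dictionary[step_id])
        elif step_id and step_id not in missing:
            missing.append(step_id)
    return missing
```

For example, with `dictionary = {"A10": ["A10", "Scan barcode"]}` and a target table containing rows for `A10` and `B20`, the `A10` row is filled in and `["B20"]` comes back as the list still to translate.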

I'm writing about it here, because at a semantic level that Word dictionary document is a database (a NoSQL database, if you want to be all cool-kids about it). If this macro were expressed in Decl, it would be much easier to get a lot of implications of that construal and work them into new ramifications.

In other words, construing one thing as another (a semantic mapping) is a form of recognition that could explicitly underlie the programming process in a semantic language. Also, I think I like the word "construal" as a technical term for this kind of mapping.

Friday, March 16, 2012

Caltech ML

So I didn't manage to keep up with Stanford's ML class in November/December - but Caltech will be giving me another shot in April. Maybe I'll participate! (Although if I have to decide between NLP and ML, you know NLP will win.)

Incidentally, the lectures from Ng's full CS229 treatment of machine learning, with all the math left in, are all online as well. I wish his homework material were also online.

Plucene

I didn't know this, but there's a pretty well-developed Lucene port on CPAN, Plucene. Sadly, development seems to have run aground in 2006; there are bugs reported (even including posted patches!) but nobody seems to be minding the store. I don't know to what extent it would be useful to revitalize it.

NLP class

The NLP class started today on Coursera. So far, it's easy (but so far, I already essentially know what they're telling me). It's still been valuable to go over it in a coherent manner. This whole "take derivatives of Ivy League classes for free" fad is fantastic. I hope people keep doing it.

Monday, March 12, 2012

Neat MakeMaker feature

Wow! I had no idea this was possible, but MakeMaker has a PL_FILES feature that lets you register scripts to run at build time, generating or configuring files before the module is installed. That is just so cool - for my Image::Magick::Wand module, I can write a Configure.PL script that finds the Wand DLLs during installation of the CPAN module, then the module can simply pass them off to Inline::C with no further ado!

It's exactly what I wanted - the Perl ecosystem never ceases to amaze! There's always yet another way to do it.

Saturday, March 10, 2012

Task: write a new Perl interface to ImageMagick

PerlMagick sucks - for two reasons. First and foremost is that in all the years I've messed occasionally with ImageMagick, not once have I ever been able to get PerlMagick to install correctly, and that's just ridiculous. But worse than that, PerlMagick is bad Perl. It handles errors like C (i.e. you have to check them yourself; no croaking or anything) and its object model is weird.

Answer: wrap the new MagickWand and/or MagickCore APIs in Perl, as a bog-standard CPAN module. It can't be that hard.

Update: I've actually started this one. GitHub link. I'm basing it on Inline::C, because I've always had a love affair with Inline, back from its early days.

Friday, March 9, 2012

CPAN is big

30,000 packages...

If about a thousand of those are HTTP-related, well - what do the rest do? That would actually be a kind of neat thing to figure out. I keep talking about code understanding; maybe CPAN is a reasonable target? Just attempting a global survey would be edifying.

Sort of a "Things people do with Perl."

(Ever notice you can tell from this blog how little sleep I've had the night before?)

Thursday, March 8, 2012

Archive::Tar

Quick bookmark here: Archive::Tar looks like the best approach to opening a tgz'd CPAN module.

OK, so CPAN HTTP client survey first

This one will be easier. It's surprising, but there are a lot of "primary" HTTP clients on CPAN, by which I mean HTTP clients that don't depend on other modules to do their HTTP. LWP is seen as pretty large and cruft-ridden by many authors, and there are a lot of "lightweight" HTTP clients, multiplexing clients, clients for particular frameworks like POE, and so on. There are also wrappers around primary HTTP clients to make them quicker to work with.

This is what I need to do: CPAN search returns 857 modules for "HTTP client"; obviously many of those aren't HTTP clients. I need to look at that list and see what depends on what. That dependency network will probably allow me to see which are the primary HTTP clients, and I can discard a lot of them.

Really, it seems I have the following hierarchy: (1) Primary HTTP client [mandatory], (2) HTTP client wrapper [optional, may be multiple], (3) REST service module [optional, may be multiple], (4) REST API implementation [mandatory]. I really want to see a full network including all these stages, and since the main tool I have is dependencies, I have to start at the primary HTTP client end. And of the primary HTTP clients, it would be interesting to classify them by how many other modules depend on them. The vast majority are probably going to be singletons, not actually used for any published REST API, so I can prune them.
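Classifying the primary clients by how many other modules depend on them, and pruning the ones nobody uses, might look something like the sketch below. It assumes the dependency data has already been extracted into a plain mapping from module name to the set of modules it uses; the function names and any module names fed to it are hypothetical.

```python
from collections import Counter


def count_dependents(dependencies, primary_clients):
    """Count direct dependents of each primary HTTP client.

    dependencies maps module name -> set of module names it depends on.
    """
    counts = Counter({client: 0 for client in primary_clients})
    for module, deps in dependencies.items():
        for dep in deps:
            if dep in counts and dep != module:
                counts[dep] += 1
    return counts


def prune_singletons(counts):
    """Keep only the primary clients that at least one other module uses."""
    return {client: n for client, n in counts.items() if n > 0}
```

This only looks at stage (1) of the hierarchy; walking up through wrappers and REST service modules needs the indirect dependencies as well.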

I suppose I have two possible tools: a CPAN search for likely descriptive phrases, and a dependency link checker.

Would it be possible to search on function? I think that would require more sophisticated processing than I can muster. That sounds like code understanding. I'm not even sure how I'd approach it as a human, let alone automate it.

Tuesday, March 6, 2012

First part of my CPAN Web API client survey article

A Survey of Web API client code on CPAN

Why a survey? And how do you start?

For the past few years, I've organized most of my thinking on Blogger – I first got into it while keeping various friends posted on my efforts with house renovation, and it just kind of stuck. Now I tend to start a new blog for every project I undertake. At some point (actually, on December 17, 2011) I had the bright idea that I should be able to do my task management right in Blogger as well, perhaps by the simple expedient of typing a title like "Task: do XXX" right into a blog post.

Earlier that day, I had realized that Blogger has an API, and suddenly, it was obvious how to proceed with this plan. I needed to write a Web API client to build my task indexer.

But like nearly everything I do, I was beset by the sudden fear that I might do it wrong. Maybe I'd be making assumptions I'd regret. Maybe other people were doing it better. (Note to self: this is why you never get anything done.)

I've got very little time to work on side projects – two teenagers, a full-time freelance translation business, and the aforementioned house renovation project make sure of that – so essentially everything technical is on the back burner, and this one stayed there as well, while I chewed on my fear. Occasionally in an off-moment I'd hit CPAN and look for modules that implemented other API clients, and I'd wonder what sorts of functionality might be nice in a more general Web API client support module. Finally, I just started scanning down the list of modules a search returned for "RESTful API", with the vague idea of doing a more or less comprehensive survey. Then I saw the WebService namespace and realized it contains over thirteen hundred modules. Good God. Not something I could actually survey in any meaningful way.

Clearly I needed to search CPAN in a more specifically useful manner. And just as clearly, I needed to do that locally. Which led me to CPAN::Mini. Randal Schwartz wrote the original mini-CPAN script in 2002, when a colleague asked him for a CD with CPAN burned on it and he realized that "the CPAN" (when did we drop the "the"? Or is it just me?) was far too large, but that a "mini-CPAN" with just the latest version of each module would be 200 MB and easily fit on a CD.

As of this writing, of course, even a mini-CPAN won't fit on a CD, being 1.84 GB in over 30,000 files. But I downloaded it anyway. I have a CPAN.

What I'm going to do first is just to find all the dependencies on LWP, WWW::Curl, Net::Curl, HTTP::Client, HTTP::Client::Parallel, HTTP::Tiny, and HTTP::Lite. If I run across any other basic HTTP clients, I'll include them in the seed list as well.

No, wait, I guess what I'm going to do first is to try to come up with a more or less complete list of HTTP clients on CPAN, while whistling past the infinite-regress graveyard. (Note: this is a TODO in the article.)

Anyway, the modules we find that way will break down into three categories: (1) modules that implement an API client, (2) support modules that provide an API client framework, and (3) modules that just retrieve HTTP for other purposes, which we'll ignore. Then I'll repeat the step for the modules found in (2) to find indirect dependencies. Obviously, the tool I want is something that can take an input module name and return a list of all modules that depend on it, so I'll do that in the next section.
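As a rough picture of that tool, here is a Python sketch that inverts a dependency map and then answers "who depends on X?", either directly or by following the chain through support modules. It assumes the per-module dependencies have already been collected into a dictionary; the function names and the module names in the usage note are illustrative, not taken from real survey data.

```python
from collections import defaultdict, deque


def build_reverse_index(dependencies):
    """Invert module -> {dependencies} into dependency -> {modules that use it}."""
    reverse = defaultdict(set)
    for module, deps in dependencies.items():
        for dep in deps:
            reverse[dep].add(module)
    return reverse


def dependents_of(module, reverse, indirect=False):
    """Return a sorted list of modules that depend on `module`.

    With indirect=True, also follow dependents of dependents (breadth-first).
    """
    if not indirect:
        return sorted(reverse.get(module, set()))
    seen = set()
    queue = deque([module])
    while queue:
        current = queue.popleft()
        for user in reverse.get(current, ()):
            if user not in seen and user != module:
                seen.add(user)
                queue.append(user)
    return sorted(seen)
```

If a hypothetical `REST::Wrapper` uses `LWP::UserAgent` and a hypothetical `WebService::Bar` uses `REST::Wrapper`, then the direct query on `LWP::UserAgent` returns only the wrapper, while the indirect query returns both.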

It might be instructive to get a list of all the URLs used in these APIs. But my ultimate goal here is to see how people are doing things, and see how many of these implementations might be useful in coming up with best Perl practices for writing a Web API client.

Monday, March 5, 2012

New project: Toonchecker.com

I still don't know how to monetize it, but I need it - and if I do, so does somebody else. Here's the plan:
  • Perl walker to scan a list of Web comic sites for each user. (Obviously the sites are shared.) This spider checks for updates on, say, an hourly basis. If the site has a feed, I'll use that. If the site pushes an email notification, I'll use that. One way or another, though, I'll figure out what changes and when.
  • For each list of toons, then, we can present a list of updates since the user last checked in and read. That list will show ads, but only that list will show ads. My ads will never appear on the screen at the same time as any comic. That's pretty thin monetization, but it will have to do.
  • The reader consists of a very thin frame at the top with forward and back buttons and a title. No ads on the frame. No ads on the frame. No ads on the frame. The bottom frame is then the entire target URL, with the cartoonist's own ads.
  • A comic counts as read when you've gone to the next page (in case you get called away, lose your connection, whatever). So we have a bookmark for each and every comic we read.
  • With multiple users, we'll be able to start forming a similarity metric for recommendations.
Perl for the spider, PHP for the site.
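For sites with neither a feed nor a notification, the crudest fallback for "figure out what changes and when" is to fingerprint each page and compare against the previous check. The plan calls for Perl; the sketch below uses Python purely for illustration, and the function names and the in-memory `last_seen` store are assumptions.

```python
import hashlib


def content_fingerprint(page_bytes):
    """Hash the fetched page so only a short digest has to be stored."""
    return hashlib.sha256(page_bytes).hexdigest()


def check_for_update(site, page_bytes, last_seen):
    """Return True if the page differs from the previous check.

    last_seen maps site URL -> fingerprint from the last check and is
    updated in place. A site never seen before counts as changed.
    """
    fingerprint = content_fingerprint(page_bytes)
    changed = last_seen.get(site) != fingerprint
    last_seen[site] = fingerprint
    return changed
```

Hashing the whole page will report false updates on any site whose markup changes between fetches (rotating ads, timestamps), so a real spider would probably have to hash only the part of the page that holds the comic.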

WebService:: namespace

Good Lord - a CPAN search on WebService modules returns 1,347 hits... No, doing this by hand is stupid.

Instead, we turn to CPAN::Mini to make a local CPAN mirror of just the latest versions of all known packages. Then we'll steal Randal Schwartz's code from here to walk the package list, and see what we can find there. I suppose what I'll want to do is open the gzipped file, which contains a tar archive I suppose, then get the META.yml for explicit dependencies, then scan all *.pm files and read their use declarations for actual dependencies.
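The per-distribution step could be sketched as follows, here in Python with the standard tarfile module rather than in Perl. The regexp, the pragma list, and the function name are assumptions for illustration, and the scan is deliberately naive: it will also pick up use lines that appear inside POD or strings, and it returns META.yml as raw text without parsing it.

```python
import re
import tarfile

# Matches "use Foo::Bar" or "require Foo::Bar" at the start of a line.
# Version-only lines such as "use 5.010;" do not match.
USE_LINE = re.compile(
    r"^\s*(?:use|require)\s+([A-Za-z_]\w*(?:::\w+)*)", re.MULTILINE
)
# A few common pragmas to ignore; this list is not exhaustive.
PRAGMAS = {"strict", "warnings", "vars", "base", "parent", "lib", "constant", "utf8"}


def scan_distribution(fileobj):
    """Scan one .tar.gz distribution opened as a binary file object.

    Returns (META.yml text or None, set of module names used in *.pm files).
    """
    meta_text = None
    used = set()
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue
            name = member.name
            if name == "META.yml" or name.endswith("/META.yml"):
                meta_text = archive.extractfile(member).read().decode("utf-8", "replace")
            elif name.endswith(".pm"):
                source = archive.extractfile(member).read().decode("utf-8", "replace")
                for module in USE_LINE.findall(source):
                    if module not in PRAGMAS:
                        used.add(module)
    return meta_text, used
```

Called on each archive in the local mirror (for example via `open(path, "rb")`), this yields the raw material for a dependency map keyed by distribution.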

So - away!

API modules

And here's a list of modules that implement APIs. I'm not going to pretty them up (I may do something automatic later) - just slap links down.
That was from searches on "RESTful API". "API client" lists 890 modules, but they're not all Web-based. Here are the ones I can find:
You know, this isn't the most efficient way to do this. What I ought to do is download everything from CPAN first, then grep for "LWP::UserAgent" and go from there. I'm just going to get messy data by searching and sifting through this by hand.

API support modules

OK, here's a list of API support modules on CPAN.

Searching on "RESTful API":
I'm going to automate this (this post was open while the next post was also open).

More on the CPAN API survey

OK, so I've found WWW::RottenTomatoes and WWW::Google::PageSpeedOnline by the simple expedient of searching on RESTful API and looking for everything that seems to be a client.

Interesting modules will break down into:
  • Individual specific APIs and
  • API support modules.
What I actually want to do is the following:
  • Find (what I believe to be) a complete list of all web API modules on CPAN, with authors and place in the nomenclature. List any support modules they use.
  • Find any support modules that seem likely that aren't in use by existing APIs on CPAN.
  • Provide an initial statistical analysis of some sort.
  • Compare code and techniques between all these modules.
  • Derive a descriptive language for the client side of an API and a mapping between this language and the modules in existence. Or something. Mostly I just want to do the comparison.
Common documentation of the APIs in question will be another useful thing associated with (but not part of) the survey. Here's Google Webmaster Tools as another example. I want to take that and encode it in a specific API definition/documentation language.

More best practices for API design

API is UI for programmers. Very nice list of things to consider when defining an API.

Design (or whatever) is not a STEM discipline

Here is a very well-written, well-considered proposal for considering software development something other than a STEM discipline - not science, not technology, not mathematics and not engineering. I like this article. It casts the entire concept of programming language research into a different light.