Wednesday, March 28, 2012

That's what she really said

Geek chick Jessamyn of MeFi fame wrote a Python bot to respond to "That's what SHE said" on her work IRC server, given that somebody had already written a TWSS bot to say that whenever it was even vaguely apropos. She did a nice writeup, too.

This is nice, by the way - it's a great example of social engineering through symbolic computing. We'll see more and more of this kind of thing as we get better at handling NLP.

Thursday, March 22, 2012

Domain: networking

So lately I've been delving into programming my router. I lend bandwidth to a neighbor, but they tend to watch Netflix when I want to watch Netflix on the Wii, and the Wii is not a happy camper when it comes to sharing bandwidth.

I run DD-WRT on my router, one of several open-source router firmware products out there these days. That means my router is a Unix box. That means that - given that it runs iptables - technically, I could make it do nearly anything, from logging bandwidth to choking it by specific MAC or even by MAC-and-port combination. Or at least I could do that if I truly understood iptables and networking, which, when you get down to brass tacks, I don't.

For example, there is a dandy little embryonic shell script wrtbwmon.sh that munges your iptables to log bandwidth by MAC address. (Here is a forum post by its author as he was trying to get it working, and here's his announcement on the DD-WRT forum.) It's slick, but as I have a much more capable Unix machine sitting right next to my router, what I'd really like to do is have that machine grab stats on a regular basis, more or less keeping the router running as a ganglion.
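
To make the "other machine grabs the stats" idea concrete, here's a minimal sketch of the parsing half. It assumes the text of `iptables -L <chain> -vnx` has already been fetched from the router somehow, that every rule in the chain has a target, and that hosts sit in 192.168.1.x. The chain name BWMON and the sample counters are made up, and it keys on IP address rather than MAC to keep things short. This isn't how wrtbwmon does it, just an illustration.

```python
from collections import defaultdict

# Made-up sample of `iptables -L BWMON -vnx` output (BWMON is a hypothetical chain).
SAMPLE = """\
Chain BWMON (1 references)
    pkts      bytes target     prot opt in     out     source               destination
     120    98304 RETURN     all  --  *      *       192.168.1.10         0.0.0.0/0
      80    40960 RETURN     all  --  *      *       0.0.0.0/0            192.168.1.10
      10     2048 RETURN     all  --  *      *       192.168.1.22         0.0.0.0/0
"""

def bytes_per_host(listing, lan_prefix="192.168.1."):
    """Sum the byte counters of all rules that mention each LAN address."""
    totals = defaultdict(int)
    for line in listing.splitlines():
        fields = line.split()
        # Rule lines start with two integer counters; header lines don't.
        if len(fields) < 9 or not fields[0].isdigit():
            continue
        byte_count = int(fields[1])
        source, destination = fields[7], fields[8]
        for address in (source, destination):
            if address.startswith(lan_prefix):
                totals[address] += byte_count
    return dict(totals)

print(bytes_per_host(SAMPLE))
```

Polling this on a schedule and storing the differences between successive readings would give per-host usage over time.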

OK, so link dump out of the way, here's why I find this relevant to semantics. Note that I said up there that if I understood networking I could make iptables do some cool stuff. What does it really mean to understand networking? Clearly, it means to have a mental model of the network and the router and all the utilities used to work with the network, how packets get handled by iptables, and so on. I have a vague idea of all these, and given patience I can usually manage to get networks running or fix them, but I can hardly call that understanding (although, when you think about it, the notion of a "vague idea" is already pretty intriguing from the standpoint of modeling - after all, if you could give a computer system a "vague idea" you'd already be 90% of the way to intelligence; arguably, vagueness of ideas is the most human ability there is...)

So the domain of networking would have these concepts in some kind of Lexicon (borrowing from my Hofstadter days), along with utility calls for diagnostics, and a library of C code snippets for writing your own special-case utilities. That sort of thing. I really do have some notion of where I'm going; I wish I could characterize it better.

Tuesday, March 20, 2012

Inline and pexports

So I have Inline actually compiling my C wrapper for ImageMagick - but since pexports 0.43 crashes under Windows 7 with the 64-bit ImageMagick DLLs, I was stymied for a while when it came time to link. Turns out it's been fixed in 0.44, but finding the fixed version took me a bit of research.

As long as I'm posting on Image::Magick::Wand, I should mention two links about dealing with hashes from inside C. [1] [perlguts]

Monday, March 19, 2012

Devel::Declare

Here's a little module Dave Rolsky turned me onto - I'm going to have to read it a couple of times, but essentially it takes control of the Perl parser by arcane means and lets you define your own syntax.

I love the idea of getting away from rickety code generation and into rickety on-the-fly code munging, but the real value would be letting the Perl parser do what it does best; trying to fake it with line-by-line regexps is just asking for trouble.

Sunday, March 18, 2012

Porter stemming algorithm

Just for completeness, here's a link to the Porter stemming algorithm for English. It's another hash of regexps and special cases, but it's standard and does a fair-to-middling job.
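
To give the flavor of that "hash of regexps", here's just step 1a of the algorithm (the plural endings) as a minimal Python sketch. The function and table names are mine; the full algorithm has several more steps with conditions on the stem that this doesn't attempt.

```python
import re

# Step 1a of the Porter stemmer: rules are tried in order and the first match wins.
STEP_1A = [
    (re.compile(r"sses$"), "ss"),  # caresses -> caress
    (re.compile(r"ies$"), "i"),    # ponies   -> poni
    (re.compile(r"ss$"), "ss"),    # caress   -> caress
    (re.compile(r"s$"), ""),       # cats     -> cat
]

def porter_step_1a(word):
    """Apply only the first matching plural rule; other words pass through."""
    for pattern, replacement in STEP_1A:
        if pattern.search(word):
            return pattern.sub(replacement, word)
    return word

for word in ("caresses", "ponies", "caress", "cats", "tree"):
    print(word, "->", porter_step_1a(word))
```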

NLP class assignment 1

The first assignment for the Coursera/Stanford NLP class was a regexp approach to finding email addresses and phone numbers in a set of faculty pages from the Stanford CS department.

I ended up doing a first pass with a list of regexps, then doing some post-processing afterwards. As is always the case, to hit all their test cases, a lot of fiddly special-casing had to be done, the most irritating being Jurafsky's own email address, which consists of a JavaScript call. As I had no particular intention to embed Spidermonkey in the Python code, this had to be entirely special-cased, although a sane system design would have put this into the page retrieval spider, not in a regexp-based recognizer.
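
For the record, the shape of that approach looks roughly like the sketch below: a list of patterns for the first pass, then a normalizing step afterwards. These are not the patterns I actually submitted, just two representative email forms and one phone form, and the names are invented for the illustration.

```python
import re

EMAIL_PATTERNS = [
    # Plain form: user@cs.stanford.edu
    re.compile(r"([\w.-]+)@([\w.-]+\.edu)", re.I),
    # Spelled-out form: user at cs dot stanford dot edu
    re.compile(r"([\w.-]+)\s+at\s+((?:\w+\s+dot\s+)+edu)", re.I),
]
# (650) 555-0100, 650-555-0100, 650.555.0100
PHONE_PATTERN = re.compile(r"\(?(\d{3})\)?[\s.-]*(\d{3})[\s.-](\d{4})")

def find_contacts(text):
    """First pass with the regexps, then post-process into one canonical form."""
    found = set()
    for pattern in EMAIL_PATTERNS:
        for user, domain in pattern.findall(text):
            # Post-processing: turn " dot " back into "." and lowercase it all.
            domain = re.sub(r"\s+dot\s+", ".", domain, flags=re.I)
            found.add(("email", f"{user}@{domain}".lower()))
    for area, exchange, number in PHONE_PATTERN.findall(text):
        found.add(("phone", f"{area}-{exchange}-{number}"))
    return found

sample = "Contact jane@cs.stanford.edu or bob at cs dot stanford dot edu, phone (650) 555-0100."
print(sorted(find_contacts(sample)))
```

Every new obfuscation style means another pattern plus another bit of cleanup, which is exactly where the fiddliness comes from.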

But that's neither here nor there (just low-level irritation) - the real point of the assignment was of course both to provide a little exposure to regular expressions and to make it clear that natural language is utterly rife with special cases, and it succeeded on both counts.

It led me to note that splitting the logic between regular expressions and a separate post-processing step was irritating from the point of view of code understandability and maintainability. Probably it would be better to have a more powerful text processing language, perhaps explicitly based on a parser. And it should work from a real tokenizer to eliminate all possible HTML obfuscation - for example, the simple trick of having DIVs break up your email address would have made this code useless. (Which is why it irritated me to have the JavaScript instance in there.)
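
Here's a minimal sketch of the "work from a real tokenizer" idea, using the HTML parser in Python's standard library to reduce a page to its text before any regexp sees it. The class and function names are made up, and it ignores real-world wrinkles such as script contents or deciding where whitespace belongs between block elements.

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect only the character data, dropping every tag."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def visible_text(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()  # flush any text still buffered at end of input
    return "".join(parser.chunks)

# An address split across DIVs comes back out in one piece.
print(visible_text("<div>jane</div><div>@cs.</div><div>stanford.edu</div>"))
```

Joining the chunks with nothing in between is what reassembles the split address; it would also glue together words that really were separate, so a serious version would need to know which elements imply a break.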

Anyway, it was fun.

Saturday, March 17, 2012

A little Word scripting

As you may know, I'm a technical translator by trade these days, so I work under Windows with Word documents a lot. Word has a scripting mechanism that I've often considered to be Microsoft's way of making it technically possible to program whatever you need in Word, without actually making it sensibly easy (making hard things possible and making easy things hard, thus ensuring grist for a robust consulting mill), but back in a previous life I spent a lot of paid time writing Word macros to do text processing. Profiting from that grist, I suppose.

Anyway, this week I have a huge pile of process definitions to be placed into Word, each consisting of a table on each page with a list of steps in a process to be carried out manually (many with barcodes for acknowledgement and so on). Each step has a unique identifier - by which I mean that each type of step has a unique identifier and there can be any number of tokens of each type of step.

And it's all in German and my job is to translate it to English. And the originals are all scanned PDFs, so I can't actually just use a regular translation tool to find repetitive text, or even cut and paste into Word; it all has to be typed. I do have properly formatted blank tables to start with for each document.

What I've ended up doing is putting each type of text into a table in a dictionary document, then writing a macro that looks up the unique ID I've typed into the given target document's first column and, if it's found, copies the already-translated row into the target document. If that type hasn't been encountered yet, then I translate it, and copy the row into the dictionary by hand.
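
Stripped of Word's object model, the logic of that macro is tiny. Here's a sketch in Python, with the dictionary document as a dict keyed on step ID and each table row as a list of cell strings. All names and the sample IDs are invented for the illustration; the real thing is a Word macro.

```python
def fill_from_dictionary(target_rows, dictionary):
    """Copy already-translated cells into rows whose ID is known.

    target_rows: list of rows, each a list of cell strings with the step ID first.
    dictionary:  maps step ID -> list of translated cells (everything after the ID).
    Returns the IDs that still need translating by hand, in order of appearance.
    """
    missing = []
    for row in target_rows:
        step_id = row[0]
        translated = dictionary.get(step_id)
        if translated is None:
            if step_id not in missing:
                missing.append(step_id)
        else:
            row[1:] = translated  # slice assignment copies the cells
    return missing

dictionary = {"S-010": ["Scan barcode", "Operator"]}
rows = [["S-010", "", ""], ["S-020", "", ""], ["S-010", "", ""]]
print(fill_from_dictionary(rows, dictionary))
print(rows)
```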

I'm writing about it here, because at a semantic level that Word dictionary document is a database (a NoSQL database, if you want to be all cool-kids about it). If this macro were expressed in Decl, it would be much easier to get a lot of implications of that construal and work them into new ramifications.

In other words, construing one thing as another (a semantic mapping) is a form of recognition that could explicitly underlie the programming process in a semantic language. Also, I think I like the word "construal" as a technical term for this kind of mapping.

Friday, March 16, 2012

Caltech ML

So I didn't manage to keep up with Stanford's ML class in November/December - but Caltech will be giving me another shot in April. Maybe I'll participate! (Although if I have to decide between NLP and ML, you know NLP will win.)

Incidentally, the lectures from Ng's full CS229 treatment of machine learning, with all the math left in, are all online as well. I wish his homework material were also online.

Plucene

I didn't know this, but there's a pretty well-developed Lucene port on CPAN, Plucene. Sadly, development seems to have run aground in 2006; there are bugs reported (even including posted patches!) but nobody seems to be minding the store. I don't know to what extent it would be useful to revitalize it.

NLP class

The NLP class started today on Coursera. So far, it's easy (but so far, I already essentially know what they're telling me). It's still been valuable to go over it in a coherent manner. This whole "take derivatives of Ivy league classes for free" fad is fantastic. I hope people keep doing it.

Monday, March 12, 2012

Neat MakeMaker feature

Wow! I had no idea this was possible, but MakeMaker has a PL_FILES feature that allows you to register scripts to run at install time to configure modules before they're installed. That is just so cool - for my Image::Magick::Wand module, I can write a Configure.PL script that finds the Wand DLLs during installation of the CPAN module, then the module can simply pass them off to Inline::C with no further ado!

It's exactly what I wanted - the Perl ecosystem never ceases to amaze! There's always yet another way to do it.

Saturday, March 10, 2012

Task: write a new Perl interface to ImageMagick

PerlMagick sucks - for two reasons. First and foremost is that in all the years I've messed occasionally with ImageMagick, not once have I ever been able to get PerlMagick to install correctly, and that's just ridiculous. But worse than that, PerlMagick is bad Perl. It handles errors like C (i.e. you have to check them yourself; no croaking or anything) and its object model is weird.

Answer: wrap the new MagickWand and/or MagickCore APIs in Perl, as a bog-standard CPAN module. It can't be that hard.

Update: I've actually started this one. Github link. I'm basing it on Inline::C, because I've always had a love affair with Inline, back from its early days.

Friday, March 9, 2012

CPAN is big

30,000 packages...

If about a thousand of those are HTTP-related, well - what do the rest do? That would actually be a kind of neat thing to figure out. I keep talking about code understanding; maybe CPAN is a reasonable target? Just attempting a global survey would be edifying.

Sort of a "Things people do with Perl."

(Ever notice you can tell from this blog how little sleep I've had the night before?)

Thursday, March 8, 2012

Archive::Tar

Quick bookmark here: Archive::Tar looks like the best approach to opening a tgz'd CPAN module.

OK, so CPAN HTTP client survey first

This one will be easier. It's surprising, but there are a lot of "primary" HTTP clients on CPAN, by which I mean HTTP clients that don't depend on other modules to do their HTTP. LWP is seen as pretty large and cruft-ridden by many authors, and there are a lot of "lightweight" HTTP clients, multiplexing clients, clients for particular frameworks like POE, and so on. There are also wrappers around primary HTTP clients to make them quicker to work with.

This is what I need to do: CPAN search returns 857 modules for "HTTP client"; obviously many of those aren't HTTP clients. I need to look at that list and see what depends on what. That dependency network will probably allow me to see which are the primary HTTP clients, and I can discard a lot of them.

Really, it seems I have the following hierarchy: (1) Primary HTTP client [mandatory], (2) HTTP client wrapper [optional, may be multiple], (3) REST service module [optional, may be multiple], (4) REST API implementation [mandatory]. I really want to see a full network including all these stages, and since the main tool I have is dependencies, I have to start at the primary HTTP client end. And of the primary HTTP clients, it would be interesting to classify them by how many other modules depend on them. The vast majority are probably going to be singletons, not actually used for any published REST API, so I can prune them.

I suppose I have two possible tools: a CPAN search for likely descriptive phrases, and a dependency link checker.

Would it be possible to search on function? I think that would require more sophisticated processing than I can muster. That sounds like code understanding. I'm not even sure how I'd approach it as a human, let alone automate it.

Tuesday, March 6, 2012

First part of my CPAN Web API client survey article

A Survey of Web API client code on CPAN

Why a survey? And how do you start?

For the past few years, I've organized most of my thinking on Blogger – I first got into it while keeping various friends posted on my efforts with house renovation, and it just kind of stuck. Now I tend to start a new blog for every project I undertake. At some point (actually, on December 17, 2011) I had the bright idea that I should be able to do my task management right in Blogger as well, perhaps by the simple expedient of typing a title like "Task: do XXX" right into a blog post.

Earlier that day, I had realized that Blogger has an API, and suddenly, it was obvious how to proceed with this plan. I needed to write a Web API client to build my task indexer.

But like nearly everything I do, I was beset by the sudden fear that I might do it wrong. Maybe I'd be making assumptions I'd regret. Maybe other people were doing it better. (Note to self: this is why you never get anything done.)

I've got very little time to work on side projects – two teenagers, a full-time freelance translation business, and the aforementioned house renovation project make sure of that – so essentially everything technical is on the back burner, and so this one stayed as well, while I chewed on my fear. Occasionally in an off-moment I'd hit CPAN and look for modules that implemented other API clients, and I'd wonder what sorts of functionality might be nice in a more general Web API client support module. Finally, I just started scanning down the list of modules a search returned for "RESTful API", with the vague idea of doing a more or less comprehensive survey. Then I saw the WebService namespace and realized it contains over thirteen hundred modules. Good God. Not something I could actually survey in any meaningful way.

Clearly I needed to search CPAN in a more specifically useful manner. And just as clearly, I needed to do that locally. Which led me to CPAN::Mini. Randal Schwartz wrote the original mini-CPAN script in 2002 when a colleague asked him for a CD with CPAN burned on it and he realized that "the CPAN" (when did we drop the "the"? Or is it just me?) was far too large, but a "mini-CPAN" with just the latest version of each module would be 200 MB and easily fit on a CD.

As of this writing, of course, even a mini-CPAN won't fit on a CD, being 1.84 GB in over 30,000 files. But I downloaded it anyway. I have a CPAN.

What I'm going to do first is just to find all the dependencies on LWP, WWW::Curl, Net::Curl, HTTP::Client, HTTP::Client::Parallel, HTTP::Tiny, and HTTP::Lite. If I run across any other basic HTTP clients, I'll include them in the seed list as well.

No, wait, I guess what I'm going to do first is to try to come up with a more or less complete list of HTTP clients on CPAN, while whistling past the infinite-regress graveyard. (Note: this is a TODO in the article.)

Anyway, the modules we find that way will break down into three categories: (1) modules that implement an API client, (2) support modules that provide an API client framework, and (3) modules that just retrieve HTTP for other purposes, which we'll ignore. Then I'll repeat the step for the modules found in (2) to find indirect dependencies. Obviously, the tool I want is something that can take an input module name and return a list of all modules that depend on it, so I'll do that in the next section.
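
As a placeholder until then, here's the core of that tool as a minimal sketch, in Python rather than Perl for brevity. It assumes we already have a mapping from each module to the modules it depends on; it inverts that mapping and walks it breadth-first, which also covers the "repeat for indirect dependencies" step. Apart from LWP::UserAgent and HTTP::Tiny, the module names are placeholders.

```python
from collections import defaultdict, deque

def reverse_index(depends_on):
    """Invert {module: [dependencies]} into {dependency: {modules that use it}}."""
    dependents = defaultdict(set)
    for module, dependencies in depends_on.items():
        for dependency in dependencies:
            dependents[dependency].add(module)
    return dependents

def all_dependents(seed, depends_on):
    """Every module that depends on `seed`, directly or through other modules."""
    dependents = reverse_index(depends_on)
    seen = set()
    queue = deque([seed])
    while queue:
        current = queue.popleft()
        for module in dependents.get(current, ()):
            if module not in seen:
                seen.add(module)
                queue.append(module)
    seen.discard(seed)  # only relevant if the data contains a dependency cycle
    return seen

# Placeholder data: REST::Helper and the rest are hypothetical module names.
depends_on = {
    "REST::Helper": ["LWP::UserAgent"],
    "WebService::Example": ["REST::Helper"],
    "Text::Thing": ["HTTP::Tiny"],
}
print(sorted(all_dependents("LWP::UserAgent", depends_on)))
```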

It might be instructive to get a list of all the URLs used in these APIs. But my ultimate goal here is to see how people are doing things, and see how many of these implementations might be useful in coming up with best Perl practices for writing a Web API client.

Monday, March 5, 2012

New project: Toonchecker.com

I still don't know how to monetize it, but I need it - and if I do, so does somebody else. Here's the plan:
  • Perl walker to scan a list of Web comic sites for each user. (Obviously the sites are shared.) This spider checks for updates on, say, an hourly basis. If the site has a feed, I'll use that. If the site pushes an email notification, I'll use that. One way or another, though, I'll figure out what changes and when.
  • For each list of toons, then, we can present a list of updates since the user last checked in and read. That list will show ads, but only that list will show ads. My ads will never appear on the screen at the same time as any comic. That's pretty thin monetization, but it will have to do.
  • The reader consists of a very thin frame at the top with forward and back buttons and a title. No ads on the frame. No ads on the frame. No ads on the frame. The bottom frame is then the entire target URL, with the cartoonist's own ads.
  • A comic counts as read when you've gone to the next page (in case you get called away, lose your connection, whatever). So we have a bookmark for each and every comic we read.
  • With multiple users, we'll be able to start forming a similarity metric for recommendations.
Perl for the spider, PHP for the site.
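
For the similarity metric in the last bullet, one simple candidate is Jaccard similarity between users' reading lists. The sketch below is in Python purely for illustration and the names are invented. It's one possible metric, not a decision: other users' comics are weighted by how much their lists overlap with mine.

```python
def jaccard(a, b):
    """Overlap of two sets: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(user, reading_lists):
    """Rank comics the user doesn't read by the similarity of the users who do."""
    mine = reading_lists[user]
    scores = {}
    for other, theirs in reading_lists.items():
        if other == user:
            continue
        weight = jaccard(mine, theirs)
        if weight == 0:
            continue  # nothing in common, so no basis for a recommendation
        for comic in theirs - mine:
            scores[comic] = scores.get(comic, 0.0) + weight
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda comic: (-scores[comic], comic))

reading_lists = {
    "ann": {"comic-a", "comic-b"},
    "bob": {"comic-a", "comic-b", "comic-c"},
    "cat": {"comic-d"},
}
print(recommend("ann", reading_lists))
```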

WebService:: namespace

Good Lord - a CPAN search on WebService modules returns 1,347 hits... No, doing this by hand is stupid.

Instead, we turn to CPAN::Mini to make a local CPAN mirror of just the latest versions of all known packages. Then we'll steal Randal Schwartz's code from here to walk the package list, and see what we can find there. I suppose what I'll want to do is open the gzipped file, which contains a tar archive I suppose, then get the META.yml for explicit dependencies, then scan all *.pm files and read their use declarations for actual dependencies.
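
Sketched out, that per-package step looks something like the following. It's in Python with the standard tarfile module rather than Perl, purely to pin the idea down. The regexp for use/require lines is deliberately crude (it will also match inside POD or strings), the pragma filter is a heuristic, and the demo distribution is made up.

```python
import io
import re
import tarfile

# Rough match for "use Foo::Bar" / "require Foo::Bar" at the start of a line.
USE_LINE = re.compile(r"^\s*(?:use|require)\s+([A-Za-z_]\w*(?:::\w+)*)", re.M)

def scan_distribution(fileobj):
    """Return (META.yml text or None, set of module names named in *.pm files)."""
    meta = None
    used = set()
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue
            if member.name.endswith("/META.yml"):
                meta = archive.extractfile(member).read().decode("utf-8", "replace")
            elif member.name.endswith(".pm"):
                source = archive.extractfile(member).read().decode("utf-8", "replace")
                used.update(USE_LINE.findall(source))
    # Heuristic: all-lowercase names without "::" are mostly pragmas (strict, warnings).
    return meta, {name for name in used if "::" in name or not name.islower()}

def make_demo_tarball():
    """Build a tiny made-up distribution in memory to try the scanner on."""
    files = {
        "Foo-0.01/META.yml": "name: Foo\n",
        "Foo-0.01/lib/Foo.pm": "package Foo;\nuse strict;\nuse LWP::UserAgent;\nrequire JSON;\n1;\n",
    }
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, text in files.items():
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer

meta, used = scan_distribution(make_demo_tarball())
print(meta.strip(), sorted(used))
```

For a real mirror, the same function could be handed `open(path, "rb")` for each .tar.gz file under the mirror directory.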

So - away!

API modules

And here's a list of modules that implement APIs. I'm not going to pretty them up (I may do something automatic later) - just slap links down.
That was from searches on "RESTful API". "API client" lists 890 modules, but they're not all Web-based. Here are the ones I can find:
You know, this isn't the most efficient way to do this. What I ought to do is download everything from CPAN first, then grep for "LWP::UserAgent" and go from there. I'm just going to get messy data by searching and sifting through this by hand.

API support modules

OK, here's a list of API support modules on CPAN.

Searching on "RESTful API":
I'm going to automate this (this post was open while the next post was also open).

More on the CPAN API survey

OK, so I've found WWW::RottenTomatoes and WWW::Google::PageSpeedOnline by the simple expedient of searching on RESTful API and looking for everything that seems to be a client.

Interesting modules will break down into:
  • Individual specific APIs and
  • API support modules.
What I actually want to do is the following:
  • Find (what I believe to be) a complete list of all web API modules on CPAN, with authors and place in the nomenclature. List any support modules they use.
  • Find any support modules that seem likely that aren't in use by existing APIs on CPAN.
  • Provide an initial statistical analysis of some sort.
  • Compare code and techniques between all these modules.
  • Derive a descriptive language for the client side of an API and a mapping between this language and the modules in existence. Or something. Mostly I just want to do the comparison.
Common documentation of the APIs in question will be another useful thing associated with (but not part of) the survey. Here's Google Webmaster Tools as another example. I want to take that and encode it in a specific API definition/documentation language.

More best practices for API design

API is UI for programmers. Very nice list of things to consider when defining an API.

Design (or whatever) is not a STEM discipline

Here is a very well-written, well-considered proposal for considering software development something other than a STEM discipline - not science, not technology, not mathematics and not engineering. I like this article. It casts the entire concept of programming language research into a different light.

Best practices for unsubscription

Here's a nice list of best practices for email update subscription cancellations.

Finding things to do: TODO searches

Here's a good way to find things that need doing in an open-source project - but that haven't made it to the issue list: search for TODO in the actual code. Reason: TODOs are left by programmers and pertain to issues they anticipate, while issues are raised by users and therefore may not overlap at all.
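
A minimal sketch of such a search in Python, in case grep isn't handy. The function names and the default list of file extensions are arbitrary choices for the illustration.

```python
import os
import re

TODO = re.compile(r"\bTODO\b[:\s]*(.*)")

def todos_in_text(text):
    """Return (line number, note) for every line that mentions TODO."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = TODO.search(line)
        if match:
            hits.append((lineno, match.group(1).strip()))
    return hits

def find_todos(root, extensions=(".pl", ".pm", ".py", ".c", ".h")):
    """Walk a source tree and collect (path, line number, note) triples."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            if not filename.endswith(extensions):
                continue
            path = os.path.join(dirpath, filename)
            with open(path, encoding="utf-8", errors="replace") as handle:
                for lineno, note in todos_in_text(handle.read()):
                    hits.append((path, lineno, note))
    return hits

print(todos_in_text("x = 1\n# TODO: handle redirects\ny = 2  # TODO\n"))
```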

Personal strategy

I've decided to go cold turkey on Internet consumption until I get some more real things done. The old standby of modifying the hosts file works great under Windows 7 because you have to start your editor in admin mode to edit the file. Changing it back takes a specific effort.

I blocked HNN, Google+, Facebook, and Reddit. That ought to do for now.

NoSQL data modeling techniques

A really nice, detailed article on NoSQL data modeling. Very well done!

Target application: receipts


I have this goal to record each and every expense in the household and categorize them for budgeting. Unfortunately, for the past four years I've failed to meet that goal. The problem is it's so difficult to keep up with entry of the paper receipts - this involves a great deal of context switching between paper and screen to find where the date, amount, and destination of each expense is.

So I just don't do it. Instead, I pile up receipts in small boxes scattered around my office.

But now I have this nifty little photo scanner, the PanDigital Photolink. It's great for small stuff; it just sucks things through and stores the scan file onto an SD card for your viewing pleasure, at about 300 dpi. Scanning the receipts is easy because there are no context switches (at least this will make it possible to free my desk of the many small boxes), and then I want to do the following:
  • Delete mis-scans (if the receipt doesn't quite engage, sometimes there's a little blurb that isn't actually anything). This I can do manually after each scanning session.
  • Shrink the files - I don't actually need 300 dpi quality for these, and at about 400 kB a pop, my 80's self is offended by the size of the data.
  • Merge any two-scan receipts - the scanner gives up after about eight inches, knowing it's not actually a plausible length and assuming your photo has jammed. For long receipts like grocery shopping at Meijer's, I'll scan receipts in two sections. Using physical scissors. Then I want to group them as a single receipt.
  • Ideally, straighten the scan up. The receipts are too narrow for the scanner to detect them if they're against the guide rail of the bed, so I scan down the middle of the bed - the result is that they're all slightly slanted. Some move a little during the scan, so they're also bent. Not much to do about that.
  • Ideally, OCR them.
  • Using a combination of OCR and a viewer application (this would be a simple GUI with a viewer for the graphic and a record entry for the data), verify any OCR'd data or enter the data if OCR can't get it.
  • Index everything into a SQLite database, along with non-receipt expenses such as checks or online payments. Categorize and report using something analogous to the Access database I built in the 90's.
That's pretty simple. It should essentially be nearly as simple to write this in Decl as it was to explain it just now.
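
Until Decl can do it, here's a minimal sketch of the last step (the SQLite index) in Python with the standard sqlite3 module. The table and column names are my own guesses at a reasonable layout, not a settled design: one row per expense, plus one row per scan file so that a long receipt scanned in two sections still counts as one expense.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS expense (
    id        INTEGER PRIMARY KEY,
    spent_on  TEXT NOT NULL,      -- ISO date, e.g. 2012-03-05
    payee     TEXT NOT NULL,
    amount    INTEGER NOT NULL,   -- cents, to avoid float rounding
    category  TEXT,
    source    TEXT NOT NULL       -- 'receipt', 'check', 'online'
);
CREATE TABLE IF NOT EXISTS scan (
    id          INTEGER PRIMARY KEY,
    expense_id  INTEGER NOT NULL REFERENCES expense(id),
    page        INTEGER NOT NULL, -- 1, 2, ... for receipts scanned in sections
    path        TEXT NOT NULL
);
"""

def open_db(path=":memory:"):
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db

def add_receipt(db, spent_on, payee, cents, category, scan_paths):
    """Store one receipt and its scan files, in page order."""
    cursor = db.execute(
        "INSERT INTO expense (spent_on, payee, amount, category, source) "
        "VALUES (?, ?, ?, ?, 'receipt')",
        (spent_on, payee, cents, category),
    )
    expense_id = cursor.lastrowid
    db.executemany(
        "INSERT INTO scan (expense_id, page, path) VALUES (?, ?, ?)",
        [(expense_id, page, path) for page, path in enumerate(scan_paths, start=1)],
    )
    db.commit()
    return expense_id

def totals_by_category(db):
    return db.execute(
        "SELECT category, SUM(amount) FROM expense GROUP BY category ORDER BY category"
    ).fetchall()

db = open_db()
add_receipt(db, "2012-03-05", "Grocer", 8412, "groceries", ["scan1.jpg", "scan2.jpg"])
add_receipt(db, "2012-03-06", "Hardware store", 1999, "house", ["scan3.jpg"])
print(totals_by_category(db))
```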

Server performance tips

Here are a couple of good overviews of best practices for fast Websites.
Time for another survey.

Thursday, March 1, 2012

Declaration of constants

In a graphical coding environment (or something like a literate programming environment), you would have a reference section with the details of constants or starting values. For example, if I have a simple script that works on a list of things, I could put the list of things in a separate file, or define it as a constant list (a local table, in Decl). But if it were a constant list, then it would normally be hidden; you'd just want to be able to click it to manage the data.

Or alternatively, you could specify viewing parameters right on the object or in a separate viewing preferences object in the script file. The more I think about writing my own code editor for Decl, the more I like it - even though I'm reinventing the wheel, as usual.