Wednesday, March 28, 2012
That's what she really said
Thursday, March 22, 2012
Domain: networking
Tuesday, March 20, 2012
Inline and pexports
- The R people figured this bit out for me
- Turns out PerlMagick should work as well, now that I've got pexports fixed, but now I'm having too much fun to stop. Inline rocks.
- Wasn't Inline on track to be a standard module? It isn't.
Monday, March 19, 2012
Devel::Declare
Sunday, March 18, 2012
Porter stemming algorithm
NLP class assignment 1
Saturday, March 17, 2012
A little Word scripting
Friday, March 16, 2012
Caltech ML
Plucene
NLP class
Monday, March 12, 2012
Neat MakeMaker feature
Saturday, March 10, 2012
Task: write a new Perl interface to ImageMagick
Friday, March 9, 2012
CPAN is big
Thursday, March 8, 2012
Archive::Tar
OK, so CPAN HTTP client survey first
Tuesday, March 6, 2012
First part of my CPAN Web API client survey article
A Survey of Web API client code on CPAN
Why a survey? And how do you start?
For the past few years, I've organized most of my thinking on Blogger – I first got into it while keeping various friends posted on my efforts with house renovation, and it just kind of stuck. Now I tend to start a new blog for every project I undertake. At some point (actually, on December 17, 2011) I had the bright idea that I should be able to do my task management right in Blogger as well, perhaps by the simple expedient of typing a title like "Task: do XXX" right into a blog post.
Earlier that day, I had realized that Blogger has an API, and suddenly, it was obvious how to proceed with this plan. I needed to write a Web API client to build my task indexer.
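As a rough illustration of the "Task: do XXX" convention, here is a minimal sketch in Python of the title-matching half of such an indexer. It assumes the post titles have already been fetched somehow; the function name `extract_tasks` and the exact pattern are my own assumptions, and nothing here touches the Blogger API itself.

```python
import re

# Assumed convention from the text: a post whose title starts with "Task:"
# describes a task. The pattern itself is a guess at what would be practical.
TASK_TITLE = re.compile(r"^\s*Task:\s*(?P<task>\S.*?)\s*$", re.IGNORECASE)


def extract_tasks(titles):
    """Return the task descriptions found in an iterable of post titles."""
    tasks = []
    for title in titles:
        match = TASK_TITLE.match(title)
        if match:
            tasks.append(match.group("task"))
    return tasks


if __name__ == "__main__":
    sample_titles = [
        "Task: write a new Perl interface to ImageMagick",
        "CPAN is big",
        "Archive::Tar",
    ]
    for task in extract_tasks(sample_titles):
        print(task)
```

A real indexer would also need to track which tasks are done, which this sketch does not attempt.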
But like nearly everything I do, I was beset by the sudden fear that I might do it wrong. Maybe I'd be making assumptions I'd regret. Maybe other people were doing it better. (Note to self: this is why you never get anything done.)
I've got very little time to work on side projects – two teenagers, a full-time freelance translation business, and the aforementioned house renovation project make sure of that – so essentially everything technical is on the back burner, and this one stayed there as well while I chewed on my fear. Occasionally in an off-moment I'd hit CPAN and look for modules that implemented other API clients, and I'd wonder what sorts of functionality might be nice in a more general Web API client support module. Finally, I just started scanning down the list of modules a search returned for "RESTful API", with the vague idea of doing a more or less comprehensive survey. Then I saw the WebService namespace and realized it contains over thirteen hundred modules. Good God. Not something I could actually survey in any meaningful way.
Clearly I needed to search CPAN in a more specifically useful manner. And just as clearly, I needed to do that locally. Which led me to CPAN::Mini. Randal Schwartz wrote this in 2002, when a colleague asked him for a CD with CPAN burned on it and he realized that "the CPAN" (when did we drop the "the"? Or is it just me?) was far too large for that, but that a "mini-CPAN" with just the latest version of each module would come to 200 MB and easily fit on a CD.
As of this writing, of course, even a mini-CPAN won't fit on a CD, being 1.84 GB in over 30,000 files. But I downloaded it anyway. I have a CPAN.
What I'm going to do first is just to find all the dependencies on LWP, WWW::Curl, Net::Curl, HTTP::Client, HTTP::Client::Parallel, HTTP::Tiny, and HTTP::Lite. If I run across any other basic HTTP clients, I'll include them in the seed list as well.
No, wait, I guess what I'm going to do first is to try to come up with a more or less complete list of HTTP clients on CPAN, while whistling past the infinite-regress graveyard. (Note: this is a TODO in the article.)
Anyway, the modules we find that way will break down into three categories: (1) modules that implement an API client, (2) support modules that provide an API client framework, and (3) modules that just retrieve HTTP for other purposes, which we'll ignore. Then I'll repeat the step for the modules found in (2) to find indirect dependencies. Obviously, the tool I want is something that can take an input module name and return a list of all modules that depend on it, so I'll do that in the next section.
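To make the shape of that tool concrete, here is a minimal sketch in Python rather than Perl. It assumes the forward dependencies have already been extracted from the mirror into a plain mapping of module name to required modules; how to get that mapping out of a mini-CPAN is not covered here. The function names and the `Example::` modules in the demo are hypothetical.

```python
from collections import defaultdict, deque

# Seed list of basic HTTP clients named in the text.
HTTP_CLIENTS = [
    "LWP", "WWW::Curl", "Net::Curl", "HTTP::Client",
    "HTTP::Client::Parallel", "HTTP::Tiny", "HTTP::Lite",
]


def build_reverse_index(requires):
    """Invert {module: [dependencies]} into {dependency: {dependents}}."""
    dependents = defaultdict(set)
    for module, deps in requires.items():
        for dep in deps:
            dependents[dep].add(module)
    return dependents


def direct_dependents(dependents, module):
    """Modules that list `module` as a dependency, sorted by name."""
    return sorted(dependents.get(module, ()))


def all_dependents(dependents, seeds):
    """Breadth-first walk: every module that depends on a seed,
    directly or through other modules."""
    seen = set()
    queue = deque(seeds)
    while queue:
        current = queue.popleft()
        for module in dependents.get(current, ()):
            if module not in seen:
                seen.add(module)
                queue.append(module)
    return sorted(seen)


if __name__ == "__main__":
    # Hypothetical dependency data, for illustration only.
    requires = {
        "Example::RestFramework": ["LWP"],
        "Example::Blogger": ["Example::RestFramework"],
        "Example::Scraper": ["HTTP::Tiny"],
    }
    index = build_reverse_index(requires)
    print(direct_dependents(index, "LWP"))
    print(all_dependents(index, HTTP_CLIENTS))
```

Note that `all_dependents` follows every dependent, whereas the plan above only repeats the step for category (2) framework modules. For that, `direct_dependents` can be called by hand on each framework module once the first pass has been sorted into categories.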
It might be instructive to get a list of all the URLs used in these APIs. But my ultimate goal here is to see how people are doing things, and see how many of these implementations might be useful in coming up with best Perl practices for writing a Web API client.
Monday, March 5, 2012
New project: Toonchecker.com
- Perl walker to scan a list of Web comic sites for each user. (Obviously the sites are shared.) This spider checks for updates on, say, an hourly basis. If the site has a feed, I'll use that. If the site pushes an email notification, I'll use that. One way or another, though, I'll figure out what changes and when.
- For each list of toons, then, we can present a list of updates since the user last checked in and read. That list will show ads, but only that list will show ads. My ads will never appear on the screen at the same time as any comic. That's pretty thin monetization, but it will have to do.
- The reader consists of a very thin frame at the top with forward and back buttons and a title. No ads on the frame. No ads on the frame. No ads on the frame. The bottom frame is then the entire target URL, with the cartoonist's own ads.
- A comic counts as read when you've gone to the next page (in case you get called away, lose your connection, whatever). So we have a bookmark for each and every comic we read.
- With multiple users, we'll be able to start forming a similarity metric for recommendations.
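For sites with neither a feed nor email notification, one fallback would be to fingerprint the fetched page and compare it with the previous check. The sketch below (Python, standard library only) assumes the page bytes have already been downloaded; the function names and the data shapes are hypothetical, and nothing here is the planned Perl implementation.

```python
import hashlib


def fingerprint(page_bytes):
    """Hash of the raw page content, used to notice that something changed."""
    return hashlib.sha256(page_bytes).hexdigest()


def check_for_update(last_fingerprints, url, page_bytes):
    """Record the new fingerprint for `url` and return True if it differs
    from the one stored at the previous check. A URL seen for the first
    time counts as changed."""
    new = fingerprint(page_bytes)
    changed = last_fingerprints.get(url) != new
    last_fingerprints[url] = new
    return changed


def unread_updates(updates, last_read):
    """Given (url, timestamp) update records and a per-user mapping of
    url -> timestamp of the last page read, return the updates the user
    has not seen yet."""
    return [(url, ts) for url, ts in updates if ts > last_read.get(url, 0)]


if __name__ == "__main__":
    state = {}
    url = "http://example.com/comic"  # placeholder URL
    print(check_for_update(state, url, b"<html>strip 1</html>"))
    print(check_for_update(state, url, b"<html>strip 1</html>"))
    print(check_for_update(state, url, b"<html>strip 2</html>"))
```

Hashing the whole page is crude: a rotating ad or a visitor counter would register as an update, so in practice the comparison would probably have to be narrowed to the part of the page that holds the comic.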
WebService:: namespace
API modules
- http://search.cpan.org/~mpgutta/WebService-Soundcloud/
- http://search.cpan.org/~cvicente/Netdot-Client-REST-1.02/lib/Netdot/Client/REST.pm
- http://search.cpan.org/~sschneid/REST-Google-Apps-Provisioning-1.1.9/lib/REST/Google/Apps/Provisioning.pod
- http://search.cpan.org/~sschneid/REST-Google-Apps-EmailSettings-1.1.6/lib/REST/Google/Apps/EmailSettings.pod
- http://search.cpan.org/~tokuhirom/Cache-KyotoTycoon-REST-0.03/lib/Cache/KyotoTycoon/REST.pm
- http://search.cpan.org/~drtech/ElasticSearch-0.51/lib/ElasticSearch.pm
- http://search.cpan.org/~imalpass/WebService-Etsy-0.7/lib/WebService/Etsy.pm
- http://search.cpan.org/~manwar/Filter-DisposableEmail-0.02/lib/Filter/DisposableEmail.pm
- http://search.cpan.org/~bklaas/Blitz-0.01/lib/Blitz/API.pm
- http://search.cpan.org/~cvega/WWW-MediaTemple-0.02/lib/WWW/MediaTemple.pm
- http://search.cpan.org/~cvega/WWW-RottenTomatoes-0.03/lib/WWW/RottenTomatoes.pm
- http://search.cpan.org/~manwar/IP-Info-0.05/lib/IP/Info.pm
- http://search.cpan.org/~manwar/WWW-MovieReviews-NYT-0.04/lib/WWW/MovieReviews/NYT.pm
- http://search.cpan.org/~gbudd/IPsonar-0.23/lib/IPsonar.pm
- http://search.cpan.org/~mndrix/RDF-Sesame-0.17/lib/RDF/Sesame.pm
- http://search.cpan.org/~bricas/SRU-0.99/lib/SRU.pm
- http://search.cpan.org/~bkaney/Bio-Cellucidate-0.03/lib/Bio/Cellucidate.pm
- http://search.cpan.org/~doggy/Net-UpYun-0.001/lib/Net/UpYun.pm
- http://search.cpan.org/~lyokato/Net-OpenSocial-Client-0.01_05/lib/Net/OpenSocial/Client.pm
- http://search.cpan.org/~jwied/BZ-Client-1.04/lib/BZ/Client.pm (very complex)
- http://search.cpan.org/~shiriru/WebService-GData-0.0501/lib/WebService/GData/YouTube/Doc/GeneralOverview.pod
- http://search.cpan.org/~nheinric/WebService-MyGengo-0.012/lib/WebService/MyGengo/Client.pm
- http://search.cpan.org/~symkat/WebService-CloudFlare-Host-000100/lib/WebService/CloudFlare/Host.pm
- http://search.cpan.org/~rplatel/Net-OpenSRS-OMA-0.02/lib/Net/OpenSRS/OMA.pm
- http://search.cpan.org/~lkundrak/WWW-GoodData-1.6/lib/WWW/GoodData.pm
- http://search.cpan.org/~oalders/Net-FreshBooks-API-0.23/lib/Net/FreshBooks/API/Client.pm
- http://search.cpan.org/~franckc/Net-Backtype-0.03/lib/Net/Backtype.pm
- http://search.cpan.org/~cjm/WebService-NFSN-1.02/lib/WebService/NFSN.pm
- http://search.cpan.org/~pjobson/WWW-TheMovieDB-Search-0.03/lib/WWW/TheMovieDB/Search.pm
- http://search.cpan.org/~mramberg/WebService-PutIo-0.3/lib/WebService/PutIo.pm
- http://search.cpan.org/~lukec/Net-Stripe-0.06/lib/Net/Stripe.pm
- http://search.cpan.org/~miyagawa/XML-Atom-0.41/lib/XML/Atom/Client.pm
- http://search.cpan.org/~franckc/Net-Backtype-0.03/lib/Net/Backtweet.pm
- http://search.cpan.org/~symkat/WebService-VaultPress-Partner-0.05/lib/WebService/VaultPress/Partner.pm
- http://search.cpan.org/~dpmeyer/WWW-Instapaper-Client-0.901/lib/WWW/Instapaper/Client.pm
API support modules
More on the CPAN API survey
- Individual specific APIs and
- API support modules.
- Find (what I believe to be) a complete list of all web API modules on CPAN, with authors and place in the nomenclature. List any support modules they use.
- Find any support modules that seem likely that aren't in use by existing APIs on CPAN.
- Provide an initial statistical analysis of some sort.
- Compare code and techniques between all these modules.
- Derive a descriptive language for the client side of an API and a mapping between this language and the modules in existence. Or something. Mostly I just want to do the comparison.
More best practices for API design
Design (or whatever) is not a STEM discipline
Best practices for unsubscription
Finding things to do: TODO searches
Personal strategy
NoSQL data modeling techniques
Target application: receipts
I have this goal to record each and every expense in the household and categorize them for budgeting. Unfortunately, for the past four years I've failed to meet that goal. The problem is that it's so difficult to keep up with entry of the paper receipts - this involves a great deal of context switching between paper and screen to find where the date, amount, and destination of each expense are.
- Delete mis-scans (if the receipt doesn't quite engage, sometimes there's a little blurb that isn't actually anything). This I can do manually after each scanning session.
- Shrink the files - I don't actually need 300 dpi quality for these, and at about 400 kB a pop, my 80's self is offended by the size of the data.
- Merge any two-scan receipts - the scanner gives up after about eight inches, knowing it's not actually a plausible length and assuming your photo has jammed. For long receipts like grocery shopping at Meijer's, I'll scan receipts in two sections. Using physical scissors. Then I want to group them as a single receipt.
- Ideally, straighten the scan up. The receipts are too narrow for the scanner to detect them if they're against the guide rail of the bed, so I scan down the middle of the bed - the result is that they're all slightly slanted. Some move a little during the scan, so they're also bent. Not much to do about that.
- Ideally, OCR them.
- Using a combination of OCR and a viewer application (this would be a simple GUI with a viewer for the graphic and a record entry for the data), verify any OCR'd data or enter the data if OCR can't get it.
- Index everything into a SQLite database, along with non-receipt expenses such as checks or online payments. Categorize and report using something analogous to the Access database I built in the 90's.
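The last step might look something like the following sketch, written in Python with the standard `sqlite3` module. The table layout, column names, and function names are assumptions of mine, not a design taken from the old Access database, and the sample rows are invented.

```python
import sqlite3

# Hypothetical schema: one row per expense, whether it came from a scanned
# receipt, a check, or an online payment.
SCHEMA = """
CREATE TABLE IF NOT EXISTS expenses (
    id        INTEGER PRIMARY KEY,
    spent_on  TEXT NOT NULL,     -- ISO date, e.g. 2012-03-05
    amount    INTEGER NOT NULL,  -- in cents, to avoid float rounding
    payee     TEXT NOT NULL,
    category  TEXT,
    source    TEXT NOT NULL,     -- 'receipt', 'check', or 'online'
    scan_path TEXT               -- NULL for non-receipt expenses
);
"""


def open_db(path=":memory:"):
    """Open (or create) the expense database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_expense(conn, spent_on, amount_cents, payee, category, source,
                scan_path=None):
    """Insert one expense record."""
    conn.execute(
        "INSERT INTO expenses "
        "(spent_on, amount, payee, category, source, scan_path) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (spent_on, amount_cents, payee, category, source, scan_path),
    )
    conn.commit()


def totals_by_category(conn):
    """Return (category, total_cents) pairs, sorted by category name."""
    cursor = conn.execute(
        "SELECT category, SUM(amount) FROM expenses "
        "GROUP BY category ORDER BY category"
    )
    return cursor.fetchall()


if __name__ == "__main__":
    conn = open_db()
    # Invented sample data.
    add_expense(conn, "2012-03-05", 4523, "Grocery store", "groceries",
                "receipt", "scans/0001.png")
    add_expense(conn, "2012-03-07", 8000, "Utility company", "utilities",
                "check")
    for category, total in totals_by_category(conn):
        print(category, total / 100)
```

Keeping `scan_path` on the row ties each record back to its image, which is what the verification viewer in the previous step would need.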
Server performance tips
- Yahoo!'s best practices.
- Some guy in London's recommendations.