Saturday, April 28, 2012
Friday, April 27, 2012
Data journalism link dump
A chance post on BoingBoing led me to a regular nest of data journalism organizations. Rather than figure them all out, given my sick and tired state (all I want to do is get my CPU fan to shut up!), I'm just going to dump them here.
- http://www.reportingproject.net/proxy/en/
- http://www.reportingproject.net/occrp/
- https://opendata.socrata.com/
- http://www.icfj.org/support-our-work
- http://www.investigativedashboard.org/category/blog/
- http://www.datatracker.org/2011/03/how-to-fish-for-people-in-panama/
- http://ohuiginn.net/wp/?p=205
- http://www.investigativedashboard.org/
- http://www.investigativedashboard.org/category/software/social-networks/
Analysis-free for your viewing pleasure! I'm going to bed now.
Kaggle: song challenge
Here's a new one from Kaggle: the million-song dataset challenge. Given a partial playlist, predict more playlist.
I find this one fascinating, because I seriously wonder whether it would be worthwhile to extract musical features from the tracks themselves in some way, or whether the statistics of listener similarity would already give you that information.
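To make the "statistics of listener similarity" idea concrete, here's a minimal sketch of a co-occurrence recommender that never looks at the audio at all. It is not the challenge's baseline or anyone's actual entry; the function names and the toy playlists are made up for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations

def build_cooccurrence(playlists):
    """Count how often each pair of songs shows up in the same playlist."""
    co = defaultdict(Counter)
    for songs in playlists:
        for a, b in combinations(sorted(set(songs)), 2):
            co[a][b] += 1
            co[b][a] += 1
    return co

def recommend(co, partial, k=3):
    """Rank unseen songs by summed co-occurrence with the partial playlist."""
    seen = set(partial)
    scores = Counter()
    for song in seen:
        for other, count in co[song].items():
            if other not in seen:
                scores[other] += count
    return [song for song, _ in scores.most_common(k)]

# Toy listening histories (hypothetical data, song IDs are just letters).
playlists = [
    ["a", "b", "c"],
    ["a", "b", "d"],
    ["b", "c", "d"],
    ["a", "c"],
]
co = build_cooccurrence(playlists)
print(recommend(co, ["a", "b"]))
```

Audio features would presumably earn their keep on songs with very few listeners, where counts like these have nothing to go on.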
Time for a link dump
I've been sick, and the interesting tabs are piling up in my browser. The problem is that the browser in question is on my 7-year-old XP laptop, and frankly, its CPU fan is starting to bug me. So before I go to bed, link dump time!
- jQuery file upload demo. OK, I know, but I'm a sucker for this stuff.
- Zac Stewart weighs in with an interesting proposal to use the HTTP OPTIONS verb for API self-description.
- Mojolicious boilerplate. You had me at "boilerplate".
- Building world-class ML systems, an article at the O'Reilly Radar that I haven't even had time to read. By the way, the complaints about NLP systems usually failing when normal mortals try to use them (OK, maybe I'm slanting that a little in paraphrase) arise because current NLP systems aren't actually intelligent, but people don't have the mental tools to understand that.
- wkhtmltopdf - heck of a name for a tool that uses WebKit to render an HTML page to PDF.
- Diagram.ly - I'm pretty sure I've linked them before. But still. Here they are again.
- First in a series listing the tools for modern Web development.
- Nice post mortem on a virus infiltration. I'm a sucker for that stuff, too.
- My cousin Erin's old blog has a post on responsive prototyping that's pretty slick. (See how I name-dropped there, but only for the really hip? That's just how I roll.)
- Clearly.pl, a programmable structured text editor in Javascript. [github]
- generic-ci, a generic continuous integration tool that doesn't pull in all of Java.
- Cheap hosting page.
Saturday, April 21, 2012
Kohonen maps in Perl
Just so I don't forget it, there's a Kohonen map implementation on CPAN.
I once wrote a rather nifty implementation of Kohonen maps in Visual Basic (it was another country, and besides, the wench is dead), but lost it years ago. Nice graphical demo, too. Always a shame when software gets lost.
Wednesday, April 18, 2012
Web development in Perl these days
As you know, Bob, it's been a while since I did active Web development work. I've been interviewing, though, and man, Webdev in Perl has moved on a lot since I last looked at it! Which was maybe, you know, over a decade ago - so I'm not sure why I'm surprised.
Anyway, the CGI::Application ecosystem is a pretty darned declarative framework for Web application development, and I'm totally hooked. I did a little programming test for one company, and here's the list of modules touched, just for later reference:
- CGI, of course. But it's largely invisible, being wrapped in
- CGI::Application.
- I used CGI::Application::Server for testing.
- HTML::Template was used for formatting.
- Data::FormValidator was the core of the application in this case (just a little contact form)
- Finally, MIME::Lite did the mail handling.
- And there was a database insert, of course. A nice boilerplate page, because I can never remember the details.
This little application took me a total of maybe 3 hours, with preparation (and preparation failure on a couple of platforms) and learning the new tools. The actual coding could probably be done in less than an hour if you knew the modules.
Why should it even take that long? Why don't we already have a central data dictionary with validation rules and so forth? That's another to-do for Decl.
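Here's a rough sketch of what such a central data dictionary might look like. It's in Python rather than Perl, and it is not Data::FormValidator's API; the dictionary layout, the rule names (`required`, `max_length`, `pattern`), and the `validate` function are all hypothetical.

```python
import re

# Hypothetical central data dictionary: one entry per field, shared by every form.
DATA_DICTIONARY = {
    "name":    {"required": True,  "max_length": 80},
    "email":   {"required": True,  "pattern": r"^[^@\s]+@[^@\s]+\.[^@\s]+$"},
    "message": {"required": False, "max_length": 2000},
}

def validate(form, fields, dictionary=DATA_DICTIONARY):
    """Return a dict of field -> error code; an empty dict means the form is valid."""
    errors = {}
    for field in fields:
        rules = dictionary[field]
        value = (form.get(field) or "").strip()
        if not value:
            # Empty values only matter for required fields.
            if rules.get("required"):
                errors[field] = "missing"
            continue
        if "max_length" in rules and len(value) > rules["max_length"]:
            errors[field] = "too long"
        elif "pattern" in rules and not re.match(rules["pattern"], value):
            errors[field] = "invalid"
    return errors

print(validate({"name": "Bob", "email": "bob-at-example"},
               ["name", "email", "message"]))
```

The point of the design is that a form only lists which fields it uses; the rules themselves live in one place.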
Saturday, April 14, 2012
Abhijit Mahabal strikes again
Abhijit has been working on a Python FARG framework - something I once wanted to do (OK, granted: something I still want to do, but now he's done it, or started one). It's here on Github.
Twitter open-sources MySQL work
Twitter has done a lot of work on MySQL scalability, now open-sourced.
Casual programming
Here's a very thought-provoking post, generally about the notion of "casual programming", i.e. a language/IDE/whatever that would (as I phrase it) work with you at a semantic level to co-write the software you're working on. It would do this by having context-sensitive assistance for API inclusion. This is a pretty good approach.
Saturday, April 7, 2012
WebPutty open-sourced
WebPutty is a gallery site for CSS that runs on GAE. Fog Creek just open-sourced it.
Thursday, April 5, 2012
Programming is hard - let's go scripting
Some musings by famed linguist Larry Wall about the dimensions of variability of programming languages, including a couple of obscure ones he had a hand in influencing. Very interesting article, naturally.
Wednesday, April 4, 2012
Making easy things easy with node.js
Here's a nice little article noting that true popularity for a platform comes from how easy it is to build form-based database applications: consider Visual Basic, then PHP, then Rails. The idea is that making easy things easy lets a large community with varying skill levels use a technology, and that creates a dynamic environment that lets you get things done.
The author kind of ends on the note that he hopes this will happen for node.js.
Tuesday, April 3, 2012
Frustration with statistical methods
The "custom language model" section of HW2 is impossible. I implemented a hacked Kneser-Ney smoothing (I think perhaps the corpus is too small, but the problem was lots of zeroes), a trigram-to-bigram backoff, and linear interpolation, and all failed. Linear interpolation of trigrams, bigrams, and unigrams failed miserably.
I dunno. Clearly people are making these things work, but I get the impression that mostly it's throwing a bucket of tacks at the problem and hoping something will stick.
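For reference, a minimal sketch of linear interpolation over trigrams, bigrams, and unigrams. This is not the homework code; the toy corpus, the function names, and the lambda weights are placeholders (the weights would normally be tuned on held-out data).

```python
from collections import Counter

def train(sentences):
    """Count 1-, 2-, and 3-grams ending at each real word, plus their contexts."""
    counts = Counter()    # n-gram tuples
    contexts = Counter()  # the same n-grams minus their last word
    for words in sentences:
        tokens = ["<s>", "<s>"] + list(words)
        for i in range(2, len(tokens)):
            for n in (1, 2, 3):
                gram = tuple(tokens[i - n + 1:i + 1])
                counts[gram] += 1
                contexts[gram[:-1]] += 1
    return counts, contexts

def mle(counts, contexts, gram):
    """Maximum-likelihood estimate of the last word given the rest of the n-gram."""
    denom = contexts[gram[:-1]]
    return counts[gram] / denom if denom else 0.0

def interpolated(counts, contexts, u, v, w, lambdas=(0.5, 0.3, 0.2)):
    """Weighted mix of trigram, bigram, and unigram estimates for P(w | u, v)."""
    l3, l2, l1 = lambdas  # assumed weights; they must sum to 1
    return (l3 * mle(counts, contexts, (u, v, w))
            + l2 * mle(counts, contexts, (v, w))
            + l1 * mle(counts, contexts, (w,)))

# Toy corpus, made up for illustration.
corpus = [["one", "two", "three"], ["one", "two", "four"]]
counts, contexts = train(corpus)
print(interpolated(counts, contexts, "one", "two", "three"))
print(interpolated(counts, contexts, "two", "one", "four"))  # unseen context, still nonzero
```

Because the unigram term is always in the mix, any word seen in training gets a nonzero probability even when the trigram and bigram counts are zero.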
Update: some perusing of the forum led me to the realization that I wasn't testing trigrams correctly. Looking at only the first trigram in a sentence gave me "[s] One two", so that any spelling error in "One" would be lost. Once that was fixed, trigram double-backoff worked as well as bigram backoff. A little twiddling with the backoff coefficients got me slightly better performance than my original bigram backoff with a coefficient of 0.4.
Moral of the story: the choice of backoff coefficient makes a difference. Which is why I hate statistical approaches.
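A minimal sketch of the double backoff, assuming the fixed-coefficient scheme often called "stupid backoff": use the trigram if it was seen, otherwise fall to the bigram and then the unigram, multiplying by the coefficient each time. The names and toy corpus are made up, and the result is a score, not a normalized probability.

```python
from collections import Counter

def train(sentences):
    """Count 1-, 2-, and 3-grams ending at each real word, plus their contexts."""
    counts = Counter()
    contexts = Counter()
    for words in sentences:
        tokens = ["<s>", "<s>"] + list(words)
        for i in range(2, len(tokens)):
            for n in (1, 2, 3):
                gram = tuple(tokens[i - n + 1:i + 1])
                counts[gram] += 1
                contexts[gram[:-1]] += 1
    return counts, contexts

def backoff_score(counts, contexts, gram, alpha=0.4):
    """Score the last word of `gram`, backing off to shorter contexts as needed."""
    penalty = 1.0
    while gram:
        if counts[gram] > 0:
            return penalty * counts[gram] / contexts[gram[:-1]]
        gram = gram[1:]   # drop the oldest context word
        penalty *= alpha  # and pay the backoff coefficient
    return 0.0            # word never seen in training

def sentence_score(counts, contexts, words, alpha=0.4):
    """Multiply the scores of every trigram in the sentence, not just the first."""
    tokens = ["<s>", "<s>"] + list(words)
    score = 1.0
    for i in range(2, len(tokens)):
        score *= backoff_score(counts, contexts, tuple(tokens[i - 2:i + 1]), alpha)
    return score

# Toy corpus, made up for illustration.
corpus = [["one", "two", "three"], ["one", "two", "four"]]
counts, contexts = train(corpus)
print(backoff_score(counts, contexts, ("one", "two", "three")))   # trigram seen
print(backoff_score(counts, contexts, ("four", "two", "three")))  # falls to the bigram
print(sentence_score(counts, contexts, ["one", "two", "three"]))
```

On a real corpus you'd sum log scores instead of multiplying, to avoid underflow on long sentences.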
Sunday, April 1, 2012
NLP homework II
You'd think spelling correction would be old hat, but you know what? I'm starting to get kind of excited about the idea of a spelling corrector that's trained on my actual typing, knows what language I'm translating from, and knows what customer I'm translating for. I think it will catch a whole lot more errors...
In other news, I took the plunge and signed up for Caltech's ML class. I hope I'm going to have time for it.