Friday, December 30, 2011

Dedicated hosting at a reasonable price

I might move.

Top skills for 2012

Just FYI.

Interesting cryptography/number theory library

Interesting post on HNN recently (as always): a bloke whose uncle spent a lifetime implementing some pretty amazing algorithms just open-sourced them.

Perl and Twilio

Who knew?

Perl documentation in the news

OK, so there have been some efforts this year to get Perl looking as up-to-date as it actually is; the problem is that the language has been around so long that typical search results may be a decade old, even though the community has decisively moved on. Since competing languages haven't even been around that long, the result is that it looks like it's easier to do modern tasks in, say, Ruby - because that's what you see in relevant results.

My Lego inventory project should do well for this; it's a typical Web scraping database program that will make a good article. I just have to find out how to cross-post to, say, perl.com.

Waffles command-line ML toolset

Waffles is a comprehensive set of command-line tools for doing machine learning and data mining.

I really, really need to sit down and do a survey/implementation tool thing.

Sunday, December 25, 2011

Scaling a blog

Another scaling article - worth reading, tomorrow.

Task: Lego inventory tracker

This one's actually underway, a Christmas present for my son. I'll write it up separately somewhere - perhaps on the Vivtek site itself! (A blast from the past - I haven't written anything new there since probably 2009; Blogger has been so much more convenient.)

Task: News scraper/tracker

I want to scrape the Reuters news feeds (later, others) into a database for various analytical purposes [eg]. That's going to consist of a daemon on my fileserver that checks the feeds on a periodic basis and loads things into a database. Then we'll do other analysis on that database. I'm most interested in linking stories and identifying trends.

Yeah, OK, I know this isn't groundbreaking research. It's new for me, though. And it will be a good microcosm of scraping tasks for declaratization as well as a valuable component for all kinds of things. So ... it's a task.
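A minimal sketch of the "load things into a database" step, assuming the feed is plain RSS 2.0 and that SQLite is good enough to start with. The table layout and the function name `store_feed` are made up for illustration; fetching the feed and the polling loop of the daemon are left out.

```python
import sqlite3
import xml.etree.ElementTree as ET

def store_feed(conn, feed_xml):
    """Parse an RSS 2.0 document and insert unseen items; return how many were new."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stories ("
        "guid TEXT PRIMARY KEY, title TEXT, link TEXT, published TEXT)"
    )
    before = conn.execute("SELECT COUNT(*) FROM stories").fetchone()[0]
    for item in ET.fromstring(feed_xml).iter("item"):
        link = item.findtext("link", default="")
        # Assumption: fall back to the link when the item carries no <guid>.
        guid = item.findtext("guid", default=link)
        conn.execute(
            "INSERT OR IGNORE INTO stories VALUES (?, ?, ?, ?)",
            (guid, item.findtext("title", default=""), link,
             item.findtext("pubDate", default="")),
        )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM stories").fetchone()[0]
    return after - before
```

Because the guid is the primary key, running the same feed through twice adds nothing the second time, which is what a periodic poller needs.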

Trading platforms

A couple of trading platforms that were advertised on a blog I follow. Probably stupid even to think about securities trading this year. Or next.
  • Trade Architect from TD Ameritrade.
  • OptionsHouse allows you to do simulated trading to start off and has flat trading fees for real trades. Interesting.

John Carmack on static code analysis

Short version: DO IT.

2011 Visualization Roundup

What it says on the tin.

Markov chains in Chutes and Ladders

... Cute.

How startups succeed

Good article. tl;dr:
  • Fire lots of bullets, not cannonballs (MVPs again)
  • Fanatic devotion to performance goals even when times are hard
  • Productive paranoia: cash in the bank, reduce risk whenever possible, anticipate killer strikes
  • Don't bet on luck. Bet on being good.
  • Seize opportunity when it arises.

Context.IO: mail replacement API?

Looks neat.

Questions for startups

Here's a funny little look at VC questions by a Croatian startup. Toss it into the slushpile.

Open source target: IndexTank

IndexTank was bought by LinkedIn and is now open source. It's apparently also used by Reddit. I need to learn it.

20 sites pushing the limits of JS

Another cool JS roundup.

DD-WRT

... is the open-source firmware for my new router. I want to do per-MAC bandwidth tracking, and here are some leads.

Probably my best bet is to assume a control panel on the local PC (which is the situation I've got) manipulating a remote "sensor head" on the router. The router doesn't have a huge amount of resources, after all.

Twilio set to explode

I need to be on board. Patrick usually knows what he's talking about.

Data mining without prejudice

Interesting article from MIT about a paywalled article in Science about a new technique for data mining developed at MIT, the upshot being apparently that it's doing curve fitting with no preconceived notions of the variables being fit. Or something. I need to read it after some sleep.

Open Government

Another post about open gov - "Dear Internet, it's no longer OK not to know how Congress works". The title is clever, but the post is largely about disrupting the system with better political software, which I like.

25 Time-saving generators

Another link-list Webdev post.

HTML too complex?

And by HTML, I think the industry now means HTML+CSS+Canvas, as a Flash replacement. Interesting point here about "I'm too lazy to be a HTML dev" - which just means the level of abstraction is wrong.

And that's interesting.

Saturday, December 17, 2011

Task: Write a Blogger to-do list manager

I do my thinking aloud here and on other blogs, and one of the perennial problems I have with that is that Blogger has no particular way of dealing with task lists. (Boy, that sounds stupid, doesn't it?)

Seriously - a blog is a fantastic way of entering tagged text that could be scanned for tasks, progress notifications, and even completion of tasks in a structured way. Remember how I said Blogger has an API? Well, how about the following scheme, then?

1. Introduce a task by prefixing it in the title, like "Task: Write a Blogger to-do list manager". Then introduce the tag just by making up a tag for it, e.g. "todo list manager". The tag can now be a miniblog for the task, you see - for free.

2. Progress reports are just posted using the tag. Optionally, if you put a percentage in there, you could use it as a completion estimate.

3. Completion is also flagged in the title, with the word "complete".

4. The current to-do list can now be generated automagically with a simple script you run whenever the blog is updated (or periodically, or whatever). I'd personally write it as a Perl script against the Blogger API run on my local machine.

5. If you post any post named "To-do list" or something along those lines, the to-do list can be updated into it (say, at the end, or wherever a given comment appears). The current to-do list can link back to old to-do lists of historical interest, and you can just post another one whenever you feel it's appropriate.

6. The updater can even make sure that the current to-do list post is the one linked from a sidebar highlight. You could even put your to-do list on the sidebar (perhaps in an abbreviated form).
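The core of steps 1 through 4 can be sketched without touching the Blogger API at all, assuming the posts have already been fetched, oldest first, as simple records. This is Python purely for illustration (the plan above is a Perl script); the record layout, the `Task` class and `build_todo` are hypothetical names, and the sketch assumes the first tag on a "Task:" post is the task's tag.

```python
import re
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    tag: str
    percent: int = 0
    complete: bool = False

PERCENT = re.compile(r"(\d{1,3})\s*%")
COMPLETE = re.compile(r"\bcomplete\b", re.IGNORECASE)

def build_todo(posts):
    """posts: dicts with 'title', 'tags', 'body', oldest first. Returns open tasks."""
    tasks = {}  # tag -> Task
    for post in posts:
        title = post["title"]
        if title.lower().startswith("task:"):
            # Step 1: the title prefix introduces the task; its first tag names it.
            tag = post["tags"][0]
            tasks[tag] = Task(name=title[5:].strip(), tag=tag)
            continue
        for tag in post["tags"]:
            task = tasks.get(tag)
            if task is None:
                continue
            if COMPLETE.search(title):
                # Step 3: completion is flagged in the title.
                task.complete = True
                task.percent = 100
            else:
                # Step 2: an optional percentage in the body is the estimate.
                match = PERCENT.search(post["body"])
                if match:
                    task.percent = min(100, int(match.group(1)))
    # Step 4: the current to-do list is whatever is still open.
    return [t for t in tasks.values() if not t.complete]
```

Steps 5 and 6 (writing the list back into a post or the sidebar) would need real API calls and are not shown.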

So. I should do this as soon as my vacation starts. And on that note, I'm going to get back to work to hasten that very day.

Startup escape path

Swombat again. Toss it kind of into the "procedural" pile.

Big data predictions for 2012

Interesting article, for which I have time measured in microseconds. It's more time than I have just to post a link to it here.

Google on bug prediction and Microsoft on empirical programming

Neat post on HNN about bug prediction at the Goog; answered at HNN with a link to Microsoft's publications on empirical programming, many of which are mouthwatering. Gotta look at this when I have a minute.

Blogger has an API

Blogger has an API. Lots of things have APIs, actually.

Research tools in Python

Not sure how much of this is directly useful, but don't have time right now to figure it out. Good for statistics, perhaps.

Coroutines

An absolutely thought-provoking presentation on co-routines in Python.

Things not to forget

OK, so as soon as I've finished my current project-that-will-not-die, there are a few things I've been meaning to pay more attention to. Here is something like a list, roughly in order of age.
  • Paraphrasing tools. This is something I came up with a couple of years ago that would be a lot easier now that I've spent some time thinking harder about NLP.
  • HVPT word pair trainer.
  • Depatenting, still, I guess.
  • Despammed rebirth, possibly based on CRM114.
  • Practical PHP exercises as kata.
  • Run back through the big translation project management tasks from last spring in light of Windows automation.
  • Code structure examination of OpenLogos, finally.
  • In general, continue automation of my translation workflow.
  • The Heritage Health Prize. Even doing halfway decently on it would be good advertising.
That ought to keep me out of trouble for a while. Now I can close some windows.

Target application: Todoist.com

Nice to-do/project manager application - but so very many of its features are premium! (Which is smart, sure. It's just that I've wanted to do a task manager [again] for a long time. And this one is ripe for analysis.)

Here's a top-ten list of Web to-do apps.

Wednesday, December 14, 2011

Infunl query language for clickpaths

This is really, really neat. (examples here) I want to see more of this kind of thing.

Tuesday, December 13, 2011

Sketching UI

Interesting article on designing UIs on, you know, paper.

Running R on the GPU

I guess? (Can you tell I'm in a hurry this week?)

Tokenizing the Common Crawl corpus

Interesting scalable approach.

ATS: programming language du jour

ATS. Statically typed sysadmin language?

PDFMiner

Python library for reading PDFs.

Evolutionary database design

Heck, I haven't even had time to read this. Looks promising, though.

Programming in Syn

OK, now here is a macro language to end all macro languages - Syn. The point of Syn is to provide a language that just operates on parse trees, and thus compiles to ... anything. Exactly where the code generation of Decl is aimed. Fascinating read!

Sunday, December 11, 2011

Evolved to Win

Ebook about GA evolution of gameplaying algorithms or strategies. Interesting stuff!

Friday, December 9, 2011

Target application: WildChords

Neat iPad app that listens to the mic on your iPad to analyze your guitar playing, then runs a game where you have to play particular chords to lead animals out of the zoo. Like Guitar Hero, except it actually teaches you to play the guitar!

The market cap is stupidly large. I mean, really stupidly large - do you know how many people want to learn an instrument? It's applying the superstimulus of gamification to allow you to reach a goal you desire. So I predict they're going to make money by the boatload.

I'd like to reverse engineer their signal processing (which they say is patent pending, to which I say boo!) and provide it for open-source games. That would be neat.

Infographic: What tools developers actually use

Neat survey of 500 developers on what tools they use in different categories, presented in infographic form.

Wednesday, December 7, 2011

Tuesday, December 6, 2011

XML in PostgreSQL

Dang, PostgreSQL can do some really groovy stuff, like querying on Xpaths right in the database query on XML stored in text columns. You can even index on them!

Reactive programming

Here's a term I hadn't seen before: reactive programming. Reactive programming is a declarative style in which relationships between values are defined, then changes to one value propagate to the other. A data flow graph is created, in other words. I've been stumbling towards this in Decl, of course, but here's Elm, a type-safe functional reactive programming language that compiles to JavaScript.
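To make the "changes to one value propagate to the other" idea concrete, here is a deliberately naive sketch in Python (not Elm, and not Decl). `Cell` and `Derived` are invented names; a real reactive system would also handle update ordering in diamond-shaped graphs, which this one ignores.

```python
class Cell:
    """A value that notifies its dependents when it changes."""
    def __init__(self, value=None):
        self._value = value
        self._dependents = []

    @property
    def value(self):
        return self._value

    def set(self, value):
        if value != self._value:
            self._value = value
            for dep in self._dependents:
                dep._recompute()

class Derived(Cell):
    """A cell whose value is defined as a function of other cells."""
    def __init__(self, func, *sources):
        super().__init__()
        self._func = func
        self._sources = sources
        for source in sources:
            source._dependents.append(self)
        self._recompute()

    def _recompute(self):
        # Setting our own value propagates further down the graph.
        self.set(self._func(*(s.value for s in self._sources)))

price, qty = Cell(10), Cell(3)
total = Derived(lambda p, q: p * q, price, qty)
label = Derived(lambda t: f"Total: {t}", total)
qty.set(5)  # total and label both follow along
```

The relationships are declared once; after that, nobody calls "update" by hand, which is the data flow graph in miniature.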

Apparently, there's nothing that can't be done in JavaScript these days.

I personally find this code nearly unreadable (I'm sure I'd improve with some practice), but the notion of declarative specification of JavaScript I see in the examples is utterly enthralling.

Lucene

Lucene is a (Java) indexer for full text. It's the basis for a lot of built-in search engines today, and it's probably something I need to learn first. Here's a good place to start, and here's another.

Open source target: Civic Commons

Another effort (still just getting underway, really) that makes sense; Tim O'Reilly posted about it recently and I guess O'Reilly is supporting it: Civic Commons: "Sharing Technology for the Public Good". The only actual open-source release I can find is EAS, the "Enterprise Addressing System", which apparently provides a database for civic organizations to use in keeping addresses up to date? Anyway, its issue list is actually kind of long, and it's in Python. Perhaps it makes sense to analyze this as well.

Monday, December 5, 2011

System administration

There's a Python module for system administration. That's neat!

Analyzing the Enron corpus

Fun stuff: evil vs. football.

N-grams

So I wrote some code (finally!) into NLP::Tokenizer to pull out n-grams, and ran across this article about using bilingual n-grams in translation. This article blew my mind, for a simple reason: it assumes that n-gram alignment between languages even makes sense. Fine, I guess if you're restricted to English and French, like the article (actually a set of slides, not an article - whatever), then you might be OK. But German? Hungarian? These guys aren't translators.

So anyway, I ran the n-gram extractor on a rather large German corpus extracted from some HTML files, and ... honestly, I couldn't see much of a way to use the results. I'm thinking really that something more like a Markov network and subsequent identification of ... frames or whatever you want to call them would be more useful.

Not sure yet, but later this month I want to spend some time finding out.
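For reference, n-gram extraction itself is only a few lines. This is a Python sketch rather than the NLP::Tokenizer code, and it assumes naive whitespace tokenization; `ngrams` is just an illustrative name.

```python
from collections import Counter

def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))

tokens = "der Hund und der Hund und die Katze".lower().split()
bigram_counts = Counter(ngrams(tokens, 2))
```

Counting is the easy part; deciding what to do with the counts, as noted above, is where it gets hard.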

CRAN: knncat

There's a lot of good stuff on CRAN. I really, really need to understand that.

Some neat Perl things

Ran across a few promising-looking modules in CPAN this week, all grist for the (currently hiatized) WWW::Declarative mill:

HTML::Seamstress is a module that actually uses HTML and classids as the template language for HTML output. That's ... pretty clever!

Web::Scraper is another Web scraper module, but it looks rather promising.

HTML::Entities is a fantastic module for dealing with quoted HTML entities. It came in quite handy in textual analysis of some HTML I had to do this week. Very nice indeed!

VP trees

A quick indexing structure: VP trees. Read this again.

Website spam fighting

Another article:
  • Rename default pages
  • Set up honeypot fields (hidden fields on the form)
  • Follow up on spammed companies
  • Have human moderators and an educated forum community
I gotta get back into this.
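The honeypot idea from that list is easy to sketch on the server side, assuming the form includes a field hidden from humans by CSS. The field name `website` is an arbitrary choice for the example.

```python
def is_probable_bot(form, honeypot="website"):
    """True if the hidden honeypot field came back non-empty.

    Humans never see the field, so anything typed into it
    almost certainly came from a script filling in every input.
    """
    return bool(str(form.get(honeypot, "")).strip())
```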

Webapps with R and Wt

Interesting.

Word representations

So Joseph Turian (the Metaoptimize guy) has a neat little study here about statistical measurement of the semantics or a semantic space of a list of words. Something else to grok.

Metaoptimize QA

I may already have posted this, but it's a StackOverflow thing for machine learning and NLP. If I don't die in the next week from translation overwork, then later this month I'll be spending some quality time here.

Wednesday, November 30, 2011

Scalable data mining of Twitter stream

Neat article that is interesting on multiple fronts for me.

Bayes classifier in 50 lines of code

Does what it says on the tin.

The modern Python ecosystem

Nice survey article for 2011's Python context.

Data mining of recipes

Now this is way cool: analysis of comments on recipe sites to determine, well, all kinds of things, but especially what ingredients can be substituted for others. Just mouthwatering research - both figuratively and literally!

I want to do this.

Tutorial on how to write a Dojo app

Tutorial here.

Javascript and HTML5 Canvas roundup

A foray into Google on the topic of HTML5 animation yielded (as always) some linkworthy material:
Also there was doodle-js - a port-ish thing of ActionScript into HTML5 Canvas - but it seems to have vanished. Or at least its example pages did; the code is still on Github.

False claim checker

Fact checking articles against the Politifact database: a moderately breathless article [truth goggles?] gets listed on HNN, of course, but then turns up at LanguageLog as well, leading to a reference to the failed commercial venture Spinoculars (nothing to do with this research, just a general similarity in goals).

Consensus seems to be that it's still a neat idea, and still probably impossible to do without being rather brittle. The research project appears to be using keywords to reference particular claims, which is really not a bad start.

There's just a boatload of things you could do in this arena.

Sunday, November 27, 2011

Supercollider

Supercollider is kind of a Processing for music, I guess: a dedicated music-oriented programming/performance system. It would be neat to look into it.

SourceMap: finding where the JavaScript comes from

Apparently SourceMap is a code project tracker that, given a JavaScript error, can find out which of your many files in PHP and CoffeeScript and what-have-you is actually responsible for the code that produces the error. Nice!

Functional programming

Nice intro to functional programming for OO-trained programmers. I found it useful.

The 12-factor app: best practices in Web app development

Good article/site on Web app development.

Tuesday, November 22, 2011

IMDb API in Perl

Neat project post, "Create your own movie database".

Codebot bounty

Here's a neat open-source bounty on a code bot.

Doom3 open-sourced

This boggles the mind - great target for code understanding!

Monday, November 21, 2011

Nice candidate for open-source PHP learning

Here's a nice to-do point I could tackle.

Web fonts

Google has a neat Web font search.

Business management

A really handy list of things startup guys shouldn't worry about.

Sentiment analysis

Greed and fear index, an online apparently free sentiment index of current financial news. They want to open an API.

Update 2013-03-13: dead. Maybe I should revive it.

Spanish morphology in Haskell

This is a cool paper about modelling the morphology of Spanish with a functional language.

Sunday, November 20, 2011

(Natural) language recognition

The recognition of the probable language of a text is problematic, of course, but there are a number of different ways to get a "pretty good" estimate that don't even extend entirely to just spell checking the words. (After all, until you know the language, it's not possible to get all the word boundaries right - although you're still going to see most of them, of course. Except for Chinese and Japanese.)

Wikipedia has a really nice and thorough language recognition chart. It would be nice to put that into a Perl module. The Wikipedia page also lists a couple of additional leads that are kind of neat:
  • Translated online guesser - uses a vector space model
  • Huh. The other two links are dead. That's a shame - but it may be worth following up on them at a later date.
Perl already has Lingua::Identify (0.30 in 2011), but I don't know how accurate it is or what its coverage is. It's definitely worth looking at, though. There's also a statistical approach in Lingua::Ident (1.7 in 2010).
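One of the cheapest "pretty good" estimators is counting hits against short lists of very common words per language. The sketch below is Python, not either Perl module, and the word lists are tiny hand-picked samples for illustration rather than anything taken from the Wikipedia chart.

```python
import re

# Assumption: a handful of high-frequency function words per language.
COMMON_WORDS = {
    "en": {"the", "and", "of", "to", "is", "in", "that", "it"},
    "de": {"der", "die", "und", "das", "ist", "nicht", "ein", "zu"},
    "fr": {"le", "la", "et", "les", "des", "est", "un", "une"},
}

def guess_language(text):
    """Return the language code with the most common-word hits, or None."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(word in vocab for word in words)
        for lang, vocab in COMMON_WORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

This relies on whitespace-ish word boundaries, so it has exactly the Chinese and Japanese problem mentioned above; character n-gram profiles are the usual way around that.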

Saturday, November 19, 2011

The original code katas

Here.

These are nice thought-provoking exercises to do to study programming - both your toolset and your mental structures.

Thursday, November 17, 2011

State machines in Perl

I just did a quick CPAN search and turned up a number of interesting packages:
  • FSA::Rules is the package I initially started building a wrapper for. It's actually pretty nice, and has a couple of constructs that my last post probably is missing.
  • Parse::FSM builds a parser based on an FSM constructed laboriously by function call.
  • State::ML provides a utility for converting XML-encoded state machines into other things or even code. I like the code generation aspect!
  • Win32::CtrlGUI::State is a slick little state-machine controller for Win32 GUIs.
  • Basset::Machine builds a state machine class in much the same way Term::Shell builds a command line shell.
All pretty cool aspects of the state-machine paradigm. If you really wanted to start getting into the semantic programming approach writ large, you'd think of ways to produce code generators for any of these starting from a Decl state machine description, and ways to organize that kind of code generator family into a semantic domain.

State machine redux redux

My random link strategy is working - it brought me to my state machine post of last year, whereupon I realized I'd kind of forgotten about state machines entirely. And yet a clear state machine presentation is such an improvement over code!

Incidentally, the Wikipedia page for finite-state machines is pretty nice.

Here's my tentative DSL for state machines, then:
  • The overall tag is "statemachine" and has a name. This name will resolve as a function outside the state machine.
  • The state machine tag is an iffy executor; that is, if it's the last tag in a program, it will be in control.
  • Within the state machine node, the children are named. Children named "prepare", "output", or "input" that occur before "start" are special.
  • The "prepare" child is code executed on each input to prepare it. The local variable $input contains the prepared input (the return from the "prepare" code) and $raw contains the raw input should it be required (this is @_ in the "prepare" code).
  • The "output" child is code executed on each out-of-band output in the state machine (see below). The default output is the same as anywhere in Decl; it's to pass output to the parent, where eventually it just gets printed to stdout if you don't redirect it.
  • The "input" child is code executed to obtain the next input token, if the state machine is in control. If there is no input, then the state machine can't be in control; it must be called from other code for each input token.
  • The "start" child is the first state - every child after "start" is another state, so you can still call a state "prepare" or "input" if you need to.
  • Within a state, we still have special parse rules, but in general, execution goes down the list of the state's children.
  • A string followed by "->" consumes an input token if it matches, and changes the state.
  • A line introduced by "->" just changes the state.
  • Either of those may have code attached; if so, this code executes before the state transition. But with or without code, both of those act like a "do".
  • A line that doesn't consist of string and -> or just -> is parsed as normal Decl code and does whatever it's supposed to.
  • The code morpher will be updated to understand "->" at the start of a line as a state transition if you're inside a state machine. (That's probably a trickle-up thing as well, actually.)
It should be possible to compile a state machine to Perl or C or something; run it in interpreted mode for testing, then spin out C for performance later. Something like that. I'm only vaguely starting to apprehend this aspect of Decl - that eventually it should be entirely master of its own fate and understand how to generate high-performance code from its own programs when necessary.

I still think this would be a pretty powerful feature; I nearly always run aground when coding things based on state machines because it's just so hard to keep track of the darn things, but this lets me pseudocode my way right through it.

Here's a possible rendering of a recognizer for "nice", each letter being an input token:
statemachine nice
   start
      n -> n_found
      -> error
   n_found
      i -> i_found
      -> error
   i_found
      c -> c_found
      -> error
   c_found
      e -> success
      -> error
   success (accept)
      >> Yay!
      -> start
   error (fail)
      >> bad!
      -> start

That seems to do what I want; it doesn't show any code, but it does show the basic pseudocode I want to use.
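As a sanity check on the semantics, here is the same "nice" recognizer as a table plus a tiny interpreter in Python. This is not Decl and not generated code; the table layout and `run` are invented for the sketch. One assumption goes beyond the rules listed above: a token that matches nothing is discarded when the fallback transition fires, since otherwise the machine would bounce between "start" and "error" forever on bad input.

```python
MACHINE = {
    "start":   {"on": {"n": "n_found"}, "else": "error"},
    "n_found": {"on": {"i": "i_found"}, "else": "error"},
    "i_found": {"on": {"c": "c_found"}, "else": "error"},
    "c_found": {"on": {"e": "success"}, "else": "error"},
    "success": {"emit": "Yay!", "else": "start"},
    "error":   {"emit": "bad!", "else": "start"},
}

def run(machine, tokens, state="start"):
    """Feed tokens through the machine and collect its out-of-band output."""
    tokens = list(tokens)
    output = []
    pos = 0
    while True:
        node = machine[state]
        if "emit" in node:
            # Output states (">> ...") consume nothing and move straight on.
            output.append(node["emit"])
            state = node["else"]
            continue
        if pos >= len(tokens):
            return output
        token = tokens[pos]
        pos += 1  # assumption: the token is consumed whether or not it matched
        state = node["on"].get(token, node["else"])
```

With this reading, `run(MACHINE, "nice")` yields one "Yay!", and a stray letter after it yields one "bad!" before the machine resets.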

Influence Explorer

Interesting data journalism project - and built on an API. Cool!

Wednesday, November 16, 2011

Another open-sourced Ruby app

Another app open-sourced in frustration - grist for the code analysis mill. Ruby.

PR diving

A fun article by Paul Graham about the PR industry, incidentally proposing "PR diving" - finding articles in news sources that were sent in by PR people. I'll bet that could be tracked semiautomatically. I love that kind of idea.

Static typing

As a Perl devotee, you can imagine I'm no fan of static typing - but I acknowledge it has its place when trying to reason about software correctness.

So I wonder how doable it would be to introduce type assertions into Decl in such a way that they aren't mandatory? Perhaps assertions about duck typing - instead of declaring something an Animal, just assert that it's an Animal if it meets certain "can" criteria?
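A rough sketch of what such an optional assertion could look like, in Python rather than Decl, with the "can" criteria modelled on Perl's `can` method check. The names `can`, `assert_can` and the `ANIMAL` criteria are all hypothetical.

```python
ANIMAL = ("eat", "sleep")  # hypothetical "can" criteria for being an Animal

def can(obj, *methods):
    """True if obj has a callable attribute for every listed method name."""
    return all(callable(getattr(obj, name, None)) for name in methods)

def assert_can(obj, role, methods):
    """Optional check: raise only if obj fails the criteria; otherwise pass it through."""
    missing = [name for name in methods if not callable(getattr(obj, name, None))]
    if missing:
        raise TypeError(f"not usable as {role}: missing {', '.join(missing)}")
    return obj
```

Nothing has to be declared up front: code that never calls `assert_can` stays fully dynamic, and code that does gets an early, readable failure.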

Tuesday, November 15, 2011

Pure: template language

Here's an interesting template language: Pure. In Pure, the HTML is the template language, since fill-in is done based on the class attribute of the container. Data is provided in JSON. Slick!

Wx task

There's a Wx CPAN shell app that would probably be a great Decl target when I resume Wx development.

Random posts

I found and installed a random posts link in the sidebar. This way I have a finite probability of not forgetting any given topic I've found in the past two years. I think this will be pretty interesting going forward.

Codefixbot

I just had the neatest idea. One of the nicest things about Perl/CPAN is the CPAN Testers Network - if you package your module with tests and put it on CPAN, hundreds of automated testing systems running different versions of Perl on different machines under different operating systems test it for you and email you the results.

I can't say how outstanding that is for code quality.

So. Code quality. I posted a couple of days ago about code quality. Here's the first iteration of my idea: a generic code quality tester that would crawl open-source repositories (aside: since the demise of Google Code Search, there is no universal code index, and that should change), identify problems, and if on e.g. Github, automatically create and submit a pull request to fix common errors. Otherwise, it would attempt to format an easy patch and get in touch with the authors.

That's cool enough - a sort of universal code quality assurance system that would just ... fix everything it finds. But then I came to the second iteration of my Good Idea, which is something even more interesting: a code generator bot. Let's say I have some kind of idea and a language to express it in - the bot could come by and generate, say, C++ or Java code to implement my idea.

OK, granted, that's vague. But surely there's some kind of continuum there from the easy to imagine (code quality automation on the loose) to the Singularity (write a blog post about something you want, and the Internet implements it for you and links to it from a comment). And honestly - what a philanthropic opportunity!

So now I know what I want to do this year. Just gotta jam it onto the priority list.

Monday, November 14, 2011

Meta-learning

Meta-learning is learning which algorithms to use for machine learning, automatically. Sounds perilously close to metaprogramming, actually, which is what I ultimately want to be doing.

Update: I've registered with Springer-Verlag as a book reviewer and I'll be reviewing this book. That means I get free online access to the text for six months. This should dovetail nicely with the machine-learning-in-perl tutorial site idea, actually.

Error handling in Decl

As in, there is none. Except for Perl's native system, which is mostly inappropriate.

Anyway, LISP has "conditions" instead of exceptions, and some of this might be applicable to the Decl situation.

Free programming books

OK, so what prompted that last post was this StackOverflow list of free books. Here are some promising ones:

Time for a link dump: more ML/NLP

My browser tab is getting full, this time with free programming and NLP books. So here we go:
And that's it, for NLP. For now. You know, I could spend a year or two internalizing this stuff and be the best natural language programmer on the planet.

Wednesday, November 9, 2011

Superdesk: journalism tool by and for journalists

I haven't read all this, but it looks neat. Workflow for journalism, I believe.

Local files in JavaScript

Well, this is new. Under HTML5, JavaScript has access to the local filesystem. Wasn't the lack of local access a specific security feature, though?

NLP

I actually started working with natural language this week - by trying my hand at some translation tools at last. First, some links I ran across:
  • A list of R packages for NLP, with an intriguing link to Weka, a set of Java implementations of data mining algorithms.
  • StackOverflow reference to NLTK and n-gram extraction.
  • Note "PMI", point-wise mutual information, cited in the SO link.
  • Lucene is NLP for Apache; there is a PyLucene as well. But honestly I think I'm going to have to get my hands dirty in Java, because Java seems inordinately popular in the NLP field.
  • JCC is a code generator developed for use in PyLucene.
OK, so I got Text::Aspell working on Windows with MingW32, which was no trivial task (in the end it just required some library setup and a small change to the installation script for the module, but it took a full day to figure that out), and got down to the business of building word lists and checking them. This works almost well, except that it immediately became obvious that I need a better tokenizer.

While prowling the Net for English tokenizers in Perl (note: there aren't any good ones - yet), I found:
In the end, I resolved to write a proper tokenizer for English; it would recognize certain entities embedded in the text (URLs, numbers, numbers with units, some abbreviations, possibly chemical names, and IDs with capital letters and digits and dashes and the like), mark punctuation as such, and attempt to deal with quotes in a sane way, including distinguishing between quotes and apostrophes. The result would be an arrayref of words and arrayrefs - same as the other Decl tokenizers, and should work fine not only for English, but for the other European languages as well. I don't yet know enough about non-European languages to know how well I can tokenize them, except for the fact that I know it's non-trivial to tokenize Chinese and Japanese, and presumably Korean.
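A first cut at that kind of tokenizer can be a single prioritized regular expression. The sketch below is Python, not the planned Perl module; it returns (kind, text) pairs instead of nested arrayrefs, and the entity patterns (especially the short unit list) are assumptions made up for the example. Abbreviations and chemical names are not handled.

```python
import re

TOKEN = re.compile(r"""
    (?P<url>https?://\S+[^\s.,;:!?)])                        # URL, minus trailing punctuation
  | (?P<id>\b(?=[A-Z0-9-]*\d)[A-Z][A-Z0-9]*(?:-[A-Z0-9]+)*\b)  # capitals, digits, dashes
  | (?P<quantity>\d+(?:[.,]\d+)*\s?(?:kg|km|cm|mm|MHz|GB)\b)   # number with a unit
  | (?P<number>\d+(?:[.,]\d+)*)
  | (?P<word>[^\W\d_]+(?:['\u2019-][^\W\d_]+)*)               # letters, inner apostrophes/hyphens
  | (?P<punct>[^\w\s])
""", re.VERBOSE)

def tokenize(text):
    """Return a list of (kind, text) pairs in reading order."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(text)]
```

An apostrophe only stays inside a word when letters follow it, so "don't" is one token while the quotes in 'nice' come out as separate punctuation; that is the simplest version of the quote/apostrophe distinction.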

Finally, searching CPAN for NLP, I found an interface to the Stanford parser (a link on how to use it in Java as part of this book).

It's looking like I'll end up with classes in Xlat for basic translation tool functionality (Xlat::Wordlist and Xlat::Speller), along with probably NLP::Tokenizer if I end up making that a HOP-based pure Perl endeavor.

Update: But as always, note that using a deadline for a job (in this case an editing job) to induce urgency for basic research is generally going to lead to sleep deprivation and depression. I've switched to manually doing this job like a schlub, but hopefully the remembered urgency will get me through the next development cycle on this. I gained some good insights.

Monday, November 7, 2011

Cinder: C++ for creative work

Cinder: Kind of like Processing, but C++. The Medusae project is breathtaking.

A quick note on dates and times

The true depths of date and time calculations are frightening; see Perl's DateTime. So it might be nice for Decl to incorporate that from the get-go, as part of the kitchen-sink philosophy.

Sunday, November 6, 2011

XDoclet: "attribute-oriented" programming in Java

Interesting approach to boilerplate generation in Java (of which there is a great deal!) - XDoclet apparently allows you to specify some semantic attributes in comments and then generates all the cruft for you. That's kind of neat.

The philosophy of artificial intelligence

The history of AI (to about the mid-80's) from the point of view of a philosopher. I need to reread this a couple of times.

Decl hits CPAN

For reals this time! I still don't have rights to the Decl namespace (which is why CPAN took its own sweet time indexing the module) so it appears with a honking bold red "** UNAUTHORIZED RELEASE **", but it's up there under my name, which makes me pretty damned happy.

Also nice: except for one test result that I've already fixed for next release, it passes its smoke tests on every system. I love CPAN.

So what's coming up for Decl?
  • Traversal: this is hierarchical structure walking (e.g. directory walk) and mapping (e.g. something like XSLT)
  • Boilerplate and macros in modules, then release of declarative CSS and HTML modules
  • Rewrite Word using some macros (the "select" tag usage is changing) and rerelease it
  • Look again at Wx now that macros work, maybe release Wx 0.01
  • Look at macros in the PDF context, probably release PDF 0.01
  • Database management and access, then release Decl 0.12 with that
  • An error management system, finally, which will probably be Decl 0.13
  • Literate programming and PHP katas and examples, then release Publisher
  • Probably look at Inline next and integrate with Python; I want access to the NLTK.
  • Declarative logic somewhere in here, based on AI::Prolog.
After that, I'm not sure. But it will surely be obvious by then - and I'm equally sure this list won't survive contact with the enemy, either. For example, maybe I'll start thinking more about the Lexicon for real by the time I'm halfway through that list.

Two years in

I started this blog on November 5, 2009, with every intention of investigating a specifically semantic framework for programming that might have borne fruit by 2011. It's November 6, 2011, so where do we stand?

I started work on Decl in February of 2010, according to my notes (the first SourceForge checkin was on February 15, but I'd posted on Wx::DefinedUI on the 10th, and honestly I think a Markov-chained snippet from my earlier writing may have triggered the concept in January), and it quickly grew to take over my every waking thought. Essentially, all my progress with semantic programming has been in the implementation of Decl. As I noted on February 10th, my earlier effort in late 2009 foundered on the shoals of syntax. At least that's no longer a problem.

The idea of Decl is to define semantic domains and tags that declare various types of programming construct, then to build programs of those. Eventually, the semantic domains should have enough macro machinery involved that the programs will largely self-construct, but I'm nowhere near that level of detail yet. I just finished the v1.0 macro system last month, after all, and it's by no means clear how to get from point A to point B.

But that's where things stand. I have done some musing about shoehorning my old Hofstadter microdomain work into Decl - not that that would require much shoehorning at all, which is the raison d'etre of Decl in the first place - but haven't really made a serious move in that direction yet.

I'll leave you with this notion: the Decl tag is an instance of a concept. As such, it's a token from a Lexicon. I haven't implemented the actual Lexicon yet - but at least Decl will be a language capable of expressing it right from the start. And that's why Decl is important.

Hmm...

Ping.

So there are others out there!

Javascript pitfall: missing var

A heartrending account of mistaken globality. Killer comment from HNN: jshint, stupid.

So.... Code quality tools in general. I want to build a framework. Gauntlet thrown.

Mulberry: app boilerplate generator

Another boilerplate generator for Web apps.

Underrated Features of PostgreSQL

Another good survey.

Statistical comparison of programming languages

This is a pretty fascinating project - they have lots of different implementations of various algorithms in lots of different programming languages. The article does some comparison between them in terms of expressiveness and speed.

ML at Khan Academy

Very nice article on the application of linear regression.

Clojure DML for SQL

Interesting - this is the conceptual-level kind of thing that fascinates me.

Another CPAN for PHP

Composer and Packagist.

Overview of numerical analysis software

So looking at alternatives to Octave, it turns out - to what should not have been my surprise - that there are a boatload of alternatives:
  • Wikipedia has a nice table
  • The Octave Wiki recommends Inline::Octave, which I find a little questionable, but hey.
  • PDL is probably the best Perl alternative; has direct support for sparse matrices, interestingly.
  • The Monks look at some comparisons between R/S, Octave, and PDL.
Again: I'd essentially like to distill the semantics out of this and have a system that knows how to code for a set of alternatives.

Good maxims for consulting programming

Five things to do for programming on a deadline:
  • Set up continuous deployment before you start
  • Write tests first
  • Be transparent
  • Maintain daily todo lists
  • Do the right thing
Not bad.

Thursday, November 3, 2011

Automated freaking writing in the news again

This makes me so envious I could explode. I know it's always the same story. Still.

Wednesday, November 2, 2011

An aside on machine learning, and open-source contribution

So having forced my brain to code a vectorized cost function in Octave starting from the equation - a task that truly taxed skills that had lain dusty for decades, and involved a brief discussion with my private theoretical physicist - I've started to think maybe I might be capable of learning a new trick or two. This Stanford class just barely scratches the surface, of course, and my mathematical background is essentially nil, so I've got a steep hill to climb.
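
The post doesn't say which cost function this was. Assuming it was the usual squared-error cost for linear regression (with m training examples, design matrix X, and targets y), the step from the equation to the vectorized form looks like this:

```latex
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( \theta^{T} x^{(i)} - y^{(i)} \right)^{2}
          = \frac{1}{2m} \, (X\theta - y)^{T} (X\theta - y)
```

The second form works because row i of X is the transposed example x^(i), so X\theta - y is the vector of all m residuals at once and its inner product with itself is the sum of their squares.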

But. There are open-source machine learning projects out there. Perhaps it might be best to start contributing. So on that note: the mloss.org project database. 334 projects and counting.

And one of the things that caught my eye this week on the software development front is PVS-Studio, a static C/C++ code analyzer that finds common coding errors. There was briefly an article on it listing 91 such errors, but it was deleted. Of course, it would be a hell of a lot more interesting to have an open-source equivalent. If there isn't one, I intend to damn well start one, with a curated set of flags (this may be why the article disappeared, of course...).

Update after reading this: OK, so I'm an idiot. Sometimes it's easy to forget the last twenty years and the Internet and all. [also]

Anyway, the whole concept of static code analysis fits well with my vague idea of a "code understander" set loose on open-source code.

Monday, October 31, 2011

B2Brev

B2B tool reviews by startups. Nice idea! Goldmine of component information.

Less talking, more doing

And now I'll violate that by writing about it. I had a partial payment from a customer today with whom I have about 10 open invoices, and there was no telling which invoices the payment was intended to cover. After dithering, I decided that really I should be able to write a brute-force knapsack search to find if there was any combination that matched. (Turns out there wasn't, but there was a combination that came within 2 Euros, so at least I can flag something as paid in the database, close enough.)

It took me about an hour, which is pretty damned pathetic, really. About two minutes of that was realizing it was a recursive problem, and 58 minutes was spent debugging it. I'm pretty sure that I could come up with a programming system that would allow me to express this algorithm without being quite so brittle in its implementation. I mean, essentially that's what I want to do in the first place here.
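
For reference, a minimal sketch of that recursive brute-force search in Python. The function name and the sample amounts are made up; amounts are integer cents so that sums compare exactly.

```python
def closest_subset(amounts, target):
    """Try every include/exclude choice; return (best_total, indices_of_best_subset)."""
    best = (0, [])  # the empty subset is always a candidate

    def search(i, total, chosen):
        nonlocal best
        if abs(total - target) < abs(best[0] - target):
            best = (total, list(chosen))
        if i == len(amounts):
            return
        # Branch 1: invoice i is covered by the payment.
        chosen.append(i)
        search(i + 1, total + amounts[i], chosen)
        chosen.pop()
        # Branch 2: invoice i is not covered.
        search(i + 1, total, chosen)

    search(0, 0, [])
    return best
```

With ten invoices this visits 2^10 leaf combinations, which is nothing. Returning the closest total rather than only exact matches covers the "within 2 Euros" case as well.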

To that end, I'm wondering if I shouldn't start looking for this kind of algorithmic problem and implement more of them. At least I'd stop dithering about whether it was a good use of time or not. I'll bet I could shave most of those 58 minutes off, anyway.

Sunday, October 30, 2011

IPEDS

More public data online: the National Center for Education Statistics has published databases of colleges in the US.

Saturday, October 29, 2011

Blitzsummary of linguistics for NLP

Good summary of How Linguistics Works.

Unbounce: component and target

Unbounce is a quick way to build landing pages for pre-startups or other purposes.

Fast test for startup ideas

Some of the pieces of this procedure could be automated. Maybe all of it, essentially...

DARPA Shredder Challenge

Solve all five puzzles first, win $50K. Not bad pay for November.

Marketing one product under two names

Another little technique for company construction.

Shakespeare, the programming language

So there's a cute little language called Shakespeare that I've run across before. It's a little silly, of course (well, that's the point!) but it came up on HNN, as does everything eventually, and one of the posts there suggests a "sort of obfuscator" that would, you know, write Shakespeare for you.

Which I think is a pretty dandy idea, in conjunction with a Shakespeare interpreter. It might use a Markov chain to generate the fluff, select characters at random, and so on and so forth - and everything should compile/interpret correctly.
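
The Markov-chain part is small enough to sketch right here. This is a word-level chain in Python with hypothetical function names; it only produces the fluff, not a valid Shakespeare program.

```python
import random
from collections import defaultdict


def build_chain(words, order=1):
    """Map each run of `order` consecutive words to the words seen right after it."""
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain


def generate(chain, start, length, rng=random):
    """Walk the chain from `start` (a tuple of words), stopping early at a dead end."""
    out = list(start)
    state = tuple(start)
    while len(out) < length and chain.get(state):
        nxt = rng.choice(chain[state])
        out.append(nxt)
        state = state[1:] + (nxt,)
    return out
```

Feeding it the text of the plays and joining the output with spaces gives plausible-sounding filler; making that filler carry the right values for the interpreter is the actual obfuscator problem.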

Underlying the Shakespeare, of course, is a kind of bytecode language. That would be the interpreted language, and that would have a more normal expression as well that could be Shakespearified.

This could showcase parsers in Decl, and might not be a bad way to start doing some fun NLP-type stuff, too. Think about it!

Thursday, October 27, 2011

Perl tutorials suck

They have a point.

Dissection of a viral launch

Interesting data-packed post about viral marketing.

Hello from a libc-free world

A fascinating look at melding assembly with C.

Target application: listofthingsforsale.com

This is very cool.

GUI vs. CLI

A post by Vivek Haldar musing on the differences between a GUI and a CLI, boiling down to the fact that a GUI is an operating interface while a CLI is an expressional or definitional interface. Honestly, though, I don't like CLIs; they require too much memory capacity for me. I much prefer a model of GUI plus scripting API.

The not-so-secret capitalist cabal that owns us all

Here's kind of an interesting analysis: turns out just 147 companies, mostly financial institutions, control about 40% of the wealth in the global network of transnational corporations. Article in New Scientist, publication on Arxiv.org. Seems like really a very straightforward query of the Orbis 2007 database of corporate information and some nifty visualization.

Wednesday, October 26, 2011

Language: Elephant

I don't believe it was ever actually implemented, but John McCarthy proposed a language he called Elephant based on speech acts, largely commitments, questions, and the like.

It's an intriguing concept, but he loses me pretty quickly on the implementation details. But the notion bears further thought.

Math: sympy

A symbolic manipulation library for Python.

Here's what I want to do:
  • Handwriting recognition on a tablet PC to be translated into OpenMath and thence TeX.
  • Selection of portions of a large mathematical formula and specification of specific operations to be carried out (e.g. "solve for this" or "call this theta" or what have you, said operations to be discovered by observation of my private theoretical physicist)
  • Maintenance of a log of the trajectory through formula space
  • n-fold productivity increases for theoretical physicists
  • Public perception of my private theoretical physicist as highly productive physics genius
  • Live on p.t.p.'s CERN salary while enjoying Geneva
It's freaking brilliant!

Tuesday, October 25, 2011

Nice interactive graphic

Data journalism! Here's a neat interactive graphic for exploring gas prices over time in all 50 states.

Analysis of Steve Jobs tribute messages

Nice application of NLTK to get some interesting statistics about the tribute messages to Steve Jobs on Apple's site.

Rocketcharts

Nice library for statistical/financial charting in HTML5: Rocketcharts.

JavaScript roundup

Speaking of JavaScript, I've got a couple of links - one of which is itself a JavaScript roundup, so this is really more of a meta-roundup.

Tangle: a JS library for reactive documents

OK, now this is cool: Tangle is a Javascript library for spinning reactive documents [e.g.]. The idea is to write text that can be explored - or, as further down the page on the second link, you can build some graphics as well!

Monday, October 24, 2011

Some more open source projects

I keep running across open-source projects it would be nice to contribute to. Weird.
  • Qt has officially been spun off by Nokia. Along the same lines would of course be Tk and Wx, and I suppose native W32 by direct DLL access. All these share a lot of concepts that should be organized in parallel, and ultimately a feature in one should always migrate into the others so we're all working with the same set of concepts. They do eventually anyway, so it's kind of an obvious step to formalize that path.
  • MediaWiki is, of course, in PHP, and always has bugs outstanding. Hone the semantic understanding tools on that. Same goes for Drupal and WordPress, of course.
  • Which brings us to open-access science. This guy, a chemist at Cambridge, appears to be doing some actual data mining of open-access journals. I need to look a little closer at that. And remember: closed source kills.
  • And then there's WikiData.

Sunday, October 23, 2011

Decl striving mightily to hit CPAN

Turns out the Decl namespace was reserved by mistake, sort of namespace shrapnel from a project a couple of years ago. They're working on resolving it.

I'm looking at a "walk" tag/action that would justify (in my mind) adding another version number for the next upload. I might finish it up tomorrow; it's essentially just a built-in that permits easy traversal of directories or other directory-like hierarchical data.

Thursday, October 20, 2011

Decl doesn't actually hit CPAN

CPAN users FGLOCK and AVAR have registered the top-level "Decl" namespace for reasons as yet unknown. I've asked. We'll see.

Decl hits CPAN

I told myself I was going to make the switch from the old Class::Declarative to Decl on CPAN when I got templates and macros working. That happened today. (I'm still kind of giddy.)

I had built the CPAN module on Windows, of course (my desktop runs Windows because I'm a translator and it's kind of de rigueur, but also because even though I have a beard, I first got serious about programming on Windows, not Unix) - and I'd forgotten why I'd never done that. Windows makes tar.gz files with world-writeable files and directories, and PAUSE really doesn't like that.

Well, this time I Googled it. Duh. Memo to self - always Google problems. Somebody else has probably fixed it or at least can help you make sense of it.
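
The post doesn't record which fix turned up. One general way around the problem is to build the tarball with a tool that lets you set the modes explicitly instead of inheriting them from the filesystem. A minimal sketch using Python's tarfile module; the function names and the 0755/0644 choice are assumptions, not what was actually used for Decl:

```python
import tarfile


def normalize_permissions(info):
    """tarfile filter: force 0755 on directories and 0644 on everything else."""
    info.mode = 0o755 if info.isdir() else 0o644
    return info


def build_dist(source_dir, archive_path, top_level_name):
    """Write source_dir into a gzipped tarball with no world-writeable entries."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(source_dir, arcname=top_level_name, filter=normalize_permissions)
```

Note that this flattens everything to 0644, so any script in the distribution that needs its executable bit would have to be special-cased in the filter.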

Google AI challenge

This year's Google AI Challenge is a robot-army kind of tournament. Looks fun! But is that AI?

Graphics by Kevin Karsch

Kevin Karsch has done some pretty cool graphics work (see the first two links on this page... um, the first two as of this writing).

Tuesday, October 18, 2011

Ioke

New language roundup: Ioke.

The point of Ioke seems to be homoiconic macros, meaning it's probably something I should study: its goal is to maximize expressiveness and write code to write code. Aaand that's where I am.

Oh, what a tangled web we weave

... when first we practice to hashtie variables and then try to refer to those hashes in the setvalue for a node at a different level and forget that they're hashtied when building our freaking debugging print statements and get into a loop for that reason alone.

Sigh. Talk about your heisenbugs.

Monday, October 17, 2011

Sunday, October 16, 2011

NLP

I spent a few hours last night avoiding work with yet another feverish trajectory through ongoing NLP research and books available.
All this just makes me drool. I swear, it's a mid-life crisis.

Hyde

A static website generator in Python (a sorta-port of Ruby's Jekyll). My question: what's the tradeoff between using Jekyll or Hyde as opposed to rolling my own in Decl, now that I have a template engine already? The community would help, sure, but ... how much would I actually use this? And wouldn't my debugging time be better used on my own dog food?

A possible approach

My ultimate goal with Decl is, of course, not only to provide a quick way to bootstrap data-structure-heavy Perl scripts into being, but also to describe software systems at a high level and provide a framework to implement them.

So in that second sense, a semantic domain of "natural language processing" would describe tasks at a high level, describe the algorithms and approaches to take in performing those tasks, and would be amenable to at least some degree of automation in coding the tasks in other languages using various toolkits already available. In other words, the semantic domain encodes at least some of the professional knowledge about that domain that a seasoned programmer would be expected to have; a programmer-in-a-box solution.

To that end, and maybe in NLP to an extent that's a little unusual in comparison with other domains I've wanted to get into, there has to be a means of describing a given toolkit and its basic approach - a theoretical framework if you will - that allows a given task/algorithm to be expressed using it.

Not sure how that's going to happen yet; I just want to throw the gauntlet down here.

It looks like a lot of NLP toolkits are in Java, for whatever reason, with NLTK in Python being a strong contender. Nothing, really, in Perl. Which is why God moved Ingy to create Inline, of course, and Decl will be incorporating Inline very soon.

Windows PE format in painstaking detail

Here's a Google Code repository with the most complete, wonderful definition of the Windows PE executable file format I've ever seen. It's a wonder to behold.

Saturday, October 15, 2011

Stanford's NLP class

I'm a little frustrated with Stanford's free access to their NLP class, because it's not really terribly self-contained. I've managed to find what I believe is their support code on Google Code, but I don't have access to their readings.

I'm starting to think I need to examine this more closely and maybe come up with a course/book of my own, offer it on my site. My site's badly in need of new content anyway. I seem to put everything here on Blogspot nowadays.

Data journalism

I ran across an article 5 Tips for Getting Started in Data Journalism, to wit:
  • Be mercenary: do what works. But do it.
  • Shave yaks as needed: take the time to learn details when you need them.
  • Develop sources
  • Become the resident expert
  • Be the data project you want to see on the Web
I particularly liked "develop sources", because the author points to some data journalism blogs, including the Chicago Tribune and the New York Times.

Data journalism to me is like flame to a moth - I'm not too interested in the journalism itself, but the data work, oh, it's lovely.

Friday, October 14, 2011

Target application: web automation

So I found this linked from PC World. Nice feature list. English, maybe a little rocky... My guess is it's based on OLE automation of IE, which is almost interesting. It's cognate to any other way of automating a Web bot. (Note to self: think harder about cognate tasks.)

Description of Djuggler Enterprise

Data Juggler automates repetitive Web & data tasks without programming code. Use it to create sophisticated scripts for collecting data from the Web, filling Web forms, transforming text files, XML, CSV and database data. The easy-to-use drag-and-drop interface creates scripts that can be deployed as stand-alone Windows executables. Typical application examples:

  • Extract competitor's price list from Web pages regularly.
  • Extract people data from a Web pages.
  • Download Web images op a regular basis.
  • Get search results from multiple search engines.
  • Automated Web testing and load testing.
  • Export data to Web based applications using fill Web forms.
  • Automate web based workflow processes like timesheets.
  • Search & replace actions to clean data.
  • Transform data from one format to another.
  • Convert data from legacy applications to industry standards.
  • Automate database migration with Business Intelligence.
  • Comparing data and create reports.
  • Send emails with personalized attachments.
  • Server monitoring and reporting.
  • Synchronize folders, databases, etc.
  • Automate file management & data backup.

Automate IT operations by deploying stand-alone Djuggler scripts. The powerful script designer has many actions and functions like loops, 'if then else' conditions, get text between from html, get html table, get pictures, strip HTML, web macro's, read and save Excel, support for popular databases and many more. Demo's are included in the setup. Visit www.djuggler.com for the script repository and script service. A Djuggler Personal edition is available as freeware.

Keywords: Web data collection, Application Integration, Data Aggregation, Data Transformation, Report Generation, Batch Processing, Business Intelligence, System Monitoring, Form Filling, Web Scripting, Data Extraction, Web Testing.

Postmark spam filter has an API - Despammed should, too

Darn it, why don't I think of these things? (Their API.)

So I'm moving towards a plan to revitalize Despammed.com. Maybe the time has come? I'm not sure yet, but I want to do it.

Imagine, if you will, a spam filtration service that offers:
  • SpamAssassin
  • Procmail
  • Green and redlighting of known-good, known-bad actors on a per-account basis
  • CRM114
  • Bayesian training
  • Tracking of spamvertised URLs
as well as
  • Both forwarding and Webmail access
  • Arbitrary forwarding (including taking Web API action or Twilio phone action) based on rules, including rules that can be expressed in arbitrary JavaScript
  • Spam discussion with specific examples and other community action
  • Blogging about spam topics, including botnet identification and such
as well as
  • Uniform treatment of both email and Web spam
  • and yeah, an API...
Wouldn't that be cool? Maybe cool enough to pay for, even? Maybe, at this point, a manageable thing to put together because it needn't all be from scratch?

CSS game engine

Neat JS-and-CSS in-browser game engine.

CSS tricks

Here's a nice CSS-and-dynamic-code trick from Airbnb to display rating stars. Nice solution!

Tuesday, October 11, 2011

CRM114

Jesus H. Tap-Dancing Christ, I have seen the light. And it is CRM114.

A little background here. Back in 1999, a friend of mine said, "Hey, there's no free spam filtering forwarder in the world. You should write one, and then we'll figure out how to make money with it." So I did. Despammed.com was born. For the next five years or so, I learned a whole lot about how not to administer the firehose that is email. Due to lack of attention, the server crashed and burned and much of the code was lost (not the filter!) and the service, although its zombie is still on the Net, never really recovered. Because I didn't really care any more (we never had figured out how to make money with it).

In early 2007, Web spam was getting to be a hassle on a forum I had for my then-Web comic (by that time, it was already essentially in permanent hiatus, but a few friends still hung out on the forum). Xrumer had been released in November of 2006. I wrote a despammer for my forum and it worked for a while, but eventually I was forced to close the forum entirely. The release of that seems to have been on February 4, 2007.

In May of the same year, I refined some of the techniques I'd been working with, and the result was the modbot in Perl, a modular framework for applying various tests to determine the spam nature of a given post. It worked kinda well, but I couldn't drum up much interest in the wider world, and it died mostly a-borning.

OK. So that's my history in despamming. Now along comes something I never heard of: CRM114, which is a programming language invented specifically for expressing spam filters. And I'm looking at it, and I love it so much. Seriously. Also, here's a paper about it. Here are some publications by its perpetrator, William S. Yerazunis, who now works for Mitsubishi Electric and comes across to me as a really funny guy who would doubtless be a hoot to have a good supper with. Also, my envy for him burns with the heat of a thousand suns, because he basically seems to do all the stuff I really wish I had time to do.

So anyway. I am going to learn the shit out of CRM114.

Sunday, October 9, 2011

Music prototyping

An AskHN with interesting responses.

CmdrTaco: not dead - scaling

Rob "CmdrTaco" Malda writes about building a basically free-of-charge scalable cloud website.

Startup tools

A fantastic curated list of startup tools.

Puppet vs. Chef

As deployment solutions (at least in the Ruby world) Puppet and Chef are turning out to be pretty popular. Neither jumps out at me as a really beautiful syntax, but deployment (i.e. system configuration) strikes me as a sensible thing to start analyzing, starting with Puppet and Chef. What are their commonalities? What are their differences? How interchangeable are they?

Saturday, October 8, 2011

Concurrent Constraint Programming in Oz for Natural Language Processing

A book! Oz is a ... neat language. Its standard interface is, sigh, Emacs. You can imagine how I like that. But hey, I really, really need to get my head into NLP, so this would be another good place to start.

XSB Prolog

XSB is an open-source, tabled (i.e. memoized) Prolog. It has a Perl binding. It would be interesting to pursue. Very interesting, actually.

HNN: what data structure does the brain use?

I didn't expect much from this thread, but it ended up chock full of interesting things to follow up.

Test-driven Django tutorial

Does what it says on the tin.

Survey of debugging techniques

Or rather, "bug-avoidance techniques", perhaps. Good article.

TermL: another specification for expressing symbolic trees

No further comment, except to note that Decl support for this would be convenient.

OMeta: pattern-matching language

I'm a tad surprised I hadn't already blogged this, but OMeta is a language for expressing pattern matches. It can be embedded in Python as PyMeta. Interestingly, PyMeta includes a parser for TermL (about which see next post).

Pattern-matching a la OMeta/XSLT/what have you is definitely going to be one of the modes supported by Decl, but I still don't really grok it. So ... OMeta. For study and illumination.

One-liner music

So there's been a Thing about one-line algorithms fed into /dev/audio to create music (some pleasing, some not) [js in-browser equivalent].
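
The whole trick is that a single integer expression in a sample counter t gets evaluated for t = 0, 1, 2, ... and the low 8 bits of each result are played as raw audio. A minimal sketch in Python; the function names and the sample formula are illustrative only:

```python
def bytebeat(formula, n_samples):
    """Evaluate formula(t) for t = 0..n_samples-1, keeping the low 8 bits of each result."""
    return bytes(formula(t) & 0xFF for t in range(n_samples))


def example_formula(t):
    # An illustrative one-liner; any integer expression in t will make some sound.
    return t * ((t >> 12 | t >> 8) & 63 & t >> 4)
```

Writing bytebeat(example_formula, 8000 * 30) to a file gives thirty seconds of raw unsigned 8-bit mono, assuming the player is told to read it at 8000 samples per second.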

It would be cool to do some kind of social evolutionary variant of the JS one. If only to provide a convenient way to tag your favorites, you know?

Linear regression and linear algebra

OK, OK, I shouldn't be so excited about this, but my machine learning class hasn't even started and I'm already grooving on the preparation parts. Including linear regression and linear algebra.
So, yeah, that's all a valuable domain. I could particularly see a code generator writing literate programs to solve linear algebra problems, then running them in a separate process. This is the kind of thing I want to get into.

Monday, October 3, 2011

E-discovery

This is exactly what I want to do: mechanical discovery of facts and structure from large collections of documents. The New York Times has an article.

That article mentions the Enron corpus, the collection of emails collected - and then published - by the Justice Department. There are various versions here and there, including one from the EDRM organization (Electronic Discovery Reference Model). That organization deserves a closer look.

Saturday, October 1, 2011

OpenMath

OK, I have to admit, OpenMath is really cool. Here's the list of software and tools that work with it - all pretty thin, actually, but their heart's in the right place. This is exactly what I was looking for. It's always such a relief to find somebody else has done the work already!

Note from the software-and-tools page: there's an OpenMath-to-LaTeX translator (apparently written in Perl, no less!) that ... well, it does what I was discussing earlier today. So very cool. (Update: it was written in 2000 and is therefore not at all OO, but it's unencumbered and built on a rather slick modular architecture, so I've asked the author if I could polish it up [rewrite it] and put it on CPAN. Very, very slick.)

So here's the plan, more or less:
  • XML, binary, and Declarative versions of representation
  • LaTeX output
  • Octave output and manipulation and parsing back in
  • Some kind of overarching systems description a la "semantic Excel"
  • Some kind of graphical presentation as active areas a la Equation Editor (but better)
I'm this close to being able to put together that stylus-to-LaTeX math manipulation tool I was thinking about in the 90's, just by using off-the-shelf components. I need a tablet. I badly need a tablet.

Visual Modeling and Programming with Graph Transformations

Dorothea Blostein at Queen's University is really into some very cool stuff. (Ran across her at the link from the previous post - she's working in knowledge representation.)

Graph transformation languages look really neat. She's written a book. It's in pieces of PDF on her site, so I should download them - but I don't really have an effective way to organize downloaded PDFs and papers yet, so instead I've just linked to her page, above.

Math

Ah, math, my old nemesis.

Necessarily, a machine learning class uses math (which is one of the reasons I'm taking it) and so I'm thinking about How People Think About Math. This would be a good thing to work on anyway - someday I really hope to get back to that Hofstadterian AI research track - and so here I am, thinking.

Here, by the way, are some neat Javascript tools for learning and working with math. One spinoff of all this is that I'd like to do something that generates things like this - kind of like a big Javascript Excel generator. That's something I've wanted to do for a long time, actually. So we'll see how well I do on that subgoal.

But the larger goal is this: when working with mathematical functions, we typically have a boatload of different representations floating around. Typesetting is done in TeX, of course, but there also has to be a more semantically-oriented form that's useful for tossing to Mathematica/Maple/Octave/whatever the heck you're using (and that includes expressing it as Python or C or Perl).

But the key is this: underlying all that, there is a semantic structure that is the actual equation or expression. That is what I want to approach. And in fact it's an area of active research (of course) - most of which is behind paywalls. Thanks, Springer-Verlag! But searching on names still turns up fascinating links [OMDoc]. If I only had all the time in the world, I could start reading arbitrary numbers of interesting papers. (I'm actually more interested in building a research tool to support the reading of arbitrary numbers of interesting papers in a more efficient manner. But that's a story for another day.)

As far as I can tell in half an hour's search, the state of the art for representing mathematical semantic structures appears to be MathML or something more or less like it. Yeah, XML as serialization, which makes my eyelid twitch, but hey, there you go.
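
To make the "one semantic structure, many representations" idea concrete, here is a minimal sketch in Python: a tiny expression tree with two renderers, one for typesetting and one for evaluation. The class and function names are invented for the example and have nothing to do with MathML's or OpenMath's actual vocabularies.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Num:
    value: float


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class BinOp:
    op: str  # one of "+", "-", "*", "/", "^"
    left: "Expr"
    right: "Expr"


Expr = Union[Num, Var, BinOp]


def _wrap(text, needs_parens):
    return f"\\left({text}\\right)" if needs_parens else text


def _is_sum(e):
    return isinstance(e, BinOp) and e.op in ("+", "-")


def to_latex(e):
    """Render the tree for typesetting."""
    if isinstance(e, Num):
        return f"{e.value:g}"
    if isinstance(e, Var):
        return e.name
    left, right = to_latex(e.left), to_latex(e.right)
    if e.op == "/":
        return f"\\frac{{{left}}}{{{right}}}"
    if e.op == "^":
        return f"{_wrap(left, isinstance(e.left, BinOp))}^{{{right}}}"
    if e.op == "*":
        return f"{_wrap(left, _is_sum(e.left))} \\cdot {_wrap(right, _is_sum(e.right))}"
    # "+" or "-": only the right operand of "-" can need grouping
    return f"{left} {e.op} {_wrap(right, e.op == '-' and _is_sum(e.right))}"


def to_python(e):
    """Render the same tree as a fully parenthesized Python expression."""
    if isinstance(e, Num):
        return f"{e.value:g}"
    if isinstance(e, Var):
        return e.name
    op = "**" if e.op == "^" else e.op
    return f"({to_python(e.left)} {op} {to_python(e.right)})"
```

For example, the tree BinOp("/", BinOp("*", Var("m"), BinOp("^", Var("v"), Num(2))), Num(2)) renders as \frac{m \cdot v^{2}}{2} for TeX and as ((m * (v ** 2)) / 2) for Python. An Octave or C renderer would be one more function over the same tree.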

I'll get further into this as the class progresses, of that I'm sure.

Update: OpenMath is the thing I'm looking for.

Gamification

A long post (and another) by Tim Rogers on the evil brain-sucking parasite that is Sims Social and other games. Here's what would be cool:
  • Economic analyses of popular games
  • Simulations of popular games
  • Genetic algorithm to devise new ones. Hee.
Or: how to take over the world without actually working.

Stripe

A new payment gateway that looks quite promising.

Notificon

A JavaScript tool to permit a page's favicon to include two characters of indication. Very neat!

What would be neater: a tagging system that was semantic in some way, to permit the functionality-based indexing of this kind of component.

Spambot combat

Here's an article with some very nice techniques for building more spamproof submission forms. Tl;dr:
  • Timestamp: don't allow a long period between reading and posting. (I had mixed success with this way back when.)
  • Hash: sign the IP, timestamp, and post #, then check the signature on submission - prevents replay attacks.
  • Randomized field names.
  • Honeypot fields: invisible (not hidden) fields that, if filled in, are a spam indicator.
The author of the post uses these and only these to block spam - no content-based filters at all. That's cool.
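A rough sketch of how those four checks might fit together, in Python. Everything named here (`SECRET`, `MAX_AGE`, `issue_form`, `check_submission`, the `f...` field naming) is my own invention for illustration, not the article's code, and the one-hour window is an arbitrary assumption.

```python
import hashlib
import hmac
import time

SECRET = b"change-me"   # hypothetical server-side secret
MAX_AGE = 3600          # assumed limit, in seconds, between rendering and posting


def _mac(*parts):
    """HMAC-SHA256 over the given parts, hex-encoded."""
    msg = "|".join(str(p) for p in parts).encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()


def _field_name(token, logical):
    """Randomized-looking field name derived from the form token."""
    return "f" + _mac(token, logical)[:12]


def issue_form(ip, post_id, now=None):
    """Values to embed when rendering the form for one visitor and one post."""
    ts = int(time.time() if now is None else now)
    token = _mac(ip, ts, post_id)
    names = {name: _field_name(token, name) for name in ("comment", "honeypot")}
    return {"ts": ts, "token": token, "names": names}


def check_submission(ip, post_id, fields, now=None):
    """Return the comment text if the submission passes every check, else None."""
    now = int(time.time() if now is None else now)
    try:
        ts = int(fields["ts"])
    except (KeyError, ValueError):
        return None
    token = str(fields.get("token", ""))
    expected = _mac(ip, ts, post_id)
    if not hmac.compare_digest(token.encode(), expected.encode()):
        return None     # hash check: wrong IP, post, or timestamp
    if not 0 <= now - ts <= MAX_AGE:
        return None     # timestamp check: form is too old (or from the future)
    if fields.get(_field_name(token, "honeypot")):
        return None     # honeypot check: a bot filled in the invisible field
    return fields.get(_field_name(token, "comment"))
```

The honeypot field would be rendered in the page but hidden from humans with CSS, so anything that fills it in is presumably not a person.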

As you know, Bob, I have long wanted to produce a workflow system of sorts that would include spam content filters; form generation is something I hadn't even considered - but it's a great idea. So ... keep this in mind.

Learning algorithms

Here's a nice presentation about (1) learning to program, (2) why algorithms matter, (3) a lot of maze algorithms, and (4) how a general algorithmic approach can often generate better solutions.

Nice stuff. Also, the final slide generates mazes using different algorithms. Neat!
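For my own reference, here is one maze generator sketched in Python: the depth-first "recursive backtracker", written with an explicit stack. I picked it because it's the shortest one I know; I'm not claiming it's the variant the presentation leads with. The function names and the ASCII rendering are mine.

```python
import random


def generate_maze(width, height, seed=None):
    """Carve a perfect maze; return the set of passages between adjacent cells."""
    rng = random.Random(seed)
    start = (0, 0)
    visited = {start}
    passages = set()
    stack = [start]
    while stack:
        x, y = stack[-1]
        unvisited = [
            (x + dx, y + dy)
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= x + dx < width
            and 0 <= y + dy < height
            and (x + dx, y + dy) not in visited
        ]
        if not unvisited:
            stack.pop()          # dead end: backtrack
            continue
        nxt = rng.choice(unvisited)
        passages.add(frozenset(((x, y), nxt)))
        visited.add(nxt)
        stack.append(nxt)
    return passages


def render(width, height, passages):
    """Draw the maze as ASCII art, one 3-character column per cell."""
    lines = ["+" + "--+" * width]
    for y in range(height):
        row, floor = "|", "+"
        for x in range(width):
            east = frozenset(((x, y), (x + 1, y))) in passages
            south = frozenset(((x, y), (x, y + 1))) in passages
            row += "  " + (" " if east else "|")
            floor += ("  " if south else "--") + "+"
        lines.append(row)
        lines.append(floor)
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(8, 5, generate_maze(8, 5, seed=1)))
```

Because every cell is visited exactly once and each visit carves exactly one passage, the result is a spanning tree of the grid: one route between any two cells, no loops.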

Wednesday, September 28, 2011

Final flurry of Stanford-related links

Here is a fantastic compendium of "nature-inspired" AI algorithms, each described with pseudocode and Ruby. It's this kind of thorough survey work that really makes my world go round.

NLP

I decided against Stanford's online AI class, but their NLP class has a nice set of notes that would probably be worth working through in the same sort of way as I intend to do for machine learning.

They're here.

Machine learning

So as I mentioned earlier, I'm going to be working through Stanford's online machine learning course this fall. I expect there to be a lot of things I'll want to work into Decl and its approach.

The class itself is here; I've also got some supporting links I'll want to keep track of.

Ideally, by the time I finish this course, I'd like to have a set of reusable tools, literate-programmed using Decl. Might be too ambitious, but we'll see.

Sending email: best practices

When sending email from a host, here are the things you want to do to avoid getting onto anybody's block list. Good luck on that.

Target application: Promoter

This is a great example of a simple, clear, well-implemented application that represents the kind of thing I want to be aiming for. It combines a database with Web monitoring. Perfect.

haXe

haXe is the follow-on to MTASC, the open-source ActionScript compiler; instead of compiling ActionScript, they decided just to define their own language that could then compile to SWF, JavaScript, and some other stuff.

I find that a tad irritating, because I'd like to have an open-source tool that could compile ActionScript 3.0 into SWF for me. Still, the effort itself is pretty cool and worth looking at.

Diagramming again

Along with Octave, I've been thinking that presentation of systems might benefit from a diagramming language. Not that this is a new thing for me; it recurs with tiresome periodicity, actually.

But this is its first recurrence since I've had a pseudocode parser at the ready.

So I've been looking at different diagramming systems for inspiration. I'm not even primarily interested in a diagram editor - just a display based on a structural language. Like graphviz, really. I could see diagrams being editable later, and I could certainly see them being clickable links in a system description or the like, but right now my primary focus is typesetting of readable documents, I think.
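The "display based on a structural language" part is easy to sketch: keep the structure as plain data and emit Graphviz DOT text from it. This is a minimal Python sketch under my own assumptions; `to_dot` and the example pipeline are hypothetical, and rendering is left to the `dot` tool.

```python
def _q(text):
    """Quote a string as a DOT identifier."""
    return '"%s"' % str(text).replace("\\", "\\\\").replace('"', '\\"')


def to_dot(name, edges):
    """Turn (source, target, label-or-None) triples into a DOT digraph."""
    lines = ["digraph %s {" % _q(name)]
    for src, dst, label in edges:
        attr = " [label=%s]" % _q(label) if label else ""
        lines.append("    %s -> %s%s;" % (_q(src), _q(dst), attr))
    lines.append("}")
    return "\n".join(lines)


# A made-up system description
pipeline = [
    ("feeds", "scraper", "poll"),
    ("scraper", "database", None),
    ("database", "analysis", "query"),
]
dot_text = to_dot("pipeline", pipeline)
```

Saving `dot_text` to a file and running it through `dot -Tpng` should give a picture; the same triples could just as well feed a TeX backend later.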

This post has been pretty thin in terms of doing anything except establishing a point of interest, hasn't it?

TeX

I've been meaning to get back to TeX lately, for two reasons. First, my wife is doing things with physics that need to be typeset, and second, TeX has some pretty neat diagramming tools, like xy-pic.

The only problem with xy-pic (and with TeX in general) is that it's syntactically unreadable. I am so terribly uninterested in decoding @{0->}<>33x\dot or whatever, and so again I say "Decl macro system".

So I'm mulling over kind of a TeX/wiki mashup, I guess. More on that later, I suppose.

Octave

I've enrolled in the Stanford online machine learning course, and even before it's started it's making me think. It looks as though Octave will figure strongly in the course.

Octave has a pretty groovy set of capabilities, but I'm just not the unadorned command line kind of guy. I'm sure I'll be mostly running Octave on files. And here's the thing - what I really want to do with that is to generate the Octave code from Decl, because I'd like to be able to generate presentations from the same code, keep data in databases, and so forth. (Actually, I believe Octave knows how to talk to databases, but still.)

So I'm looking at Inline::Octave and we'll see how that works out. This is also a natural place to finish my macro system, which seems to have run aground again. Literate Octave programming.

Tuesday, September 27, 2011

Learn REST: blog post series

The title kind of says everything it needs to.

DevOps choices at AppNexus

I don't even know what AppNexus is, but they need to scale, and their devops infrastructure is outlined in a handy-dandy blog post here.

Open source targets: BuddyPress and CUNY Academic Commons

BuddyPress is a WordPress plugin that implements a social network. It's used as the platform for the CUNY Academic Commons, an open-source platform for, well, academics at CUNY. It would be nice to help out open-source projects where possible, so this would be one place to start.

Draw a Stick Man

Very neat HTML Canvas design. [hnn] You could really imagine all kinds of storytelling on a platform like this.

Ticket Servers: Distributed Unique Primary Keys on the Cheap

A nice trick they use over at Flickr to get unique keys using a simple MySQL instance.
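The general idea, as I understand it, is a tiny table whose only job is to hand out auto-incremented numbers. Here's a minimal sketch in Python with SQLite standing in for MySQL; the table name, column names, and function names are all assumptions of mine, not Flickr's schema.

```python
import sqlite3


def open_ticket_server(path=":memory:"):
    """Open (and if needed create) a one-row ticket table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tickets ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " stub TEXT NOT NULL UNIQUE)"
    )
    return conn


def next_ticket(conn):
    """Replace the single stub row and return its freshly assigned id.

    AUTOINCREMENT makes SQLite remember the highest id ever issued,
    so ids keep growing even though the table never has more than one row.
    """
    with conn:  # commit on success, roll back on error
        cur = conn.execute("REPLACE INTO tickets (stub) VALUES ('a')")
        return cur.lastrowid


conn = open_ticket_server()
ids = [next_ticket(conn) for _ in range(5)]
```

To spread this over two servers without collisions, one would presumably have each instance step its counter by two from a different starting offset, so one issues odd ids and the other even; that part isn't shown here.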

This prompts me again to muse about an infrastructure description language.

Real time face substitution

Very cool! This tool finds the face in a video feed in real time and substitutes it - kind of - with another face assembled from stock photos or whatever. It's ... uncanny valley squared.

But it's interesting, because it uses something called openFrameworks (a C++ library for creative work) as a platform, then combines it with FaceTracker (a C++ library for ... face tracking) and an image cloning library that looks pretty rad, too.

All of that is pretty neat stuff, so I wanted to bookmark it.

Saturday, September 24, 2011

mrjob

Yelp has a parallel job framework called "mrjob" - no, not Mr. Job, but map-reduce job - that currently supports only Python but could, in principle, be used to manage map-reduce jobs in any language.

Might be cool to try that with Perl.
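For the record, the shape of a map-reduce job is small enough to sketch in plain Python. This is deliberately not mrjob's API (I haven't tried it yet); `mapper`, `reducer`, and `run_job` are hypothetical names, and everything runs in one process.

```python
from collections import defaultdict


def mapper(line):
    """Map step: emit (word, 1) for every word on a line."""
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    """Reduce step: total the counts collected for one word."""
    return word, sum(counts)


def run_job(lines):
    """Run map, shuffle (group by key), and reduce in a single process."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


word_counts = run_job(["the cat sat", "The cat ran"])
```

A real framework distributes the map and reduce steps across machines and handles the shuffle in between; the two functions a user writes look much like these either way, which is why a Perl version seems plausible.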

NLP

At some point I'm just going to have to start, but:

OK, so here's the idea, and it's always the same idea. In a given NLP-domain problem, I'd model the data and the toolchain in Decl. Thus given a problem, you'd state the problem in Decl, and refine your solution progressively, always keeping the Decl semantic structure for the problem intact at each step. Here, it's almost a note-taking or documentation tool; the actual program would be written in Python and/or Java and invoked by Decl. It could also be embedded, of course, via Inline - but the point is that Decl needn't be seen as an exclusively Perl-based tool. It's also a litprog tool that can use macros to build anything else.

Ah, well. That's probably not all too clear. I'm tired.

What prompted this flurry of NLP searching was this Yelp blog post about a data set they're releasing to researchers.

Wednesday, September 21, 2011

Book: Mining of Massive Datasets

A Stanford book/course on the topic named. I should really just work through the thing.

Rhetorical analysis

I'm not even sure how the analysis of rhetoric fits into semantic programming, except that (1) it's NLP kind of, (2) it's research and therefore database-oriented kind of, and (3) I keep coming back to it.

The trigger is an article on CNN [HNN discussion] by Bill Bennett of the Claremont Institute tearing down the concept of spending public money on education (god forbid the teachers' unions should get tax money). There are a few little nasty tricks he throws in. I think it would be possible to analyze this kind of rhetorical treatment, maybe. Eventually. I'm not sure how to start, but it fascinates me.

Anyway, the article just pissed me off, so I thought I'd bookmark the stuff with this post.

Slick Perl trick: "enchantment" of coderefs

I should do this throughout Decl, actually: equip coderefs with debugging facilities. [monks] [another post]
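To pin down what I mean by "debugging facilities", here is a rough Python analogue: a callable wrapper that still behaves like the function but carries a label and a call log. This is not the Perl technique from the linked posts, just the same idea in another language; the `Enchanted` class and its attributes are hypothetical.

```python
import functools


class Enchanted:
    """A callable that wraps a function and remembers how it was used."""

    def __init__(self, func, label=None):
        functools.update_wrapper(self, func)   # copies __name__, sets __wrapped__
        self.label = label or func.__name__
        self.calls = []                        # (args, kwargs, result) per call

    def __call__(self, *args, **kwargs):
        result = self.__wrapped__(*args, **kwargs)
        self.calls.append((args, kwargs, result))
        return result

    def __repr__(self):
        return "<Enchanted %s, %d call(s)>" % (self.label, len(self.calls))


# An anonymous function gets a readable name and a history
double = Enchanted(lambda x: x * 2, label="double")
double(4)
double(10)
```

After those two calls, `double.calls` holds the arguments and results, and printing `double` shows the label instead of an opaque function address, which is most of what I'd want when a Decl callback misbehaves.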