Sunday, September 29, 2013

Diagramming again

I think I may have posted draw.io before; jgraph offers it as a free demonstrator for their non-free JavaScript diagramming drop-in. The HNN thread, as usual, includes a few alternatives, none of which really seem to measure up. One commenter is using jgraph's code as part of his very cool InsightMaker tool, which uses the diagrammer to build a system, then runs simulations on that system to provide numerical graphs. (Now that is cool.)

Saturday, September 28, 2013

Database reconciliation

A thing that comes up a lot in working with data is the need for reconciliation: you have two data sources and you need to match them up and see whether everything in one is in the other, and vice versa. (This is part of the whole data quality issue.) So here I am, today, seeing what's out there - the answer is, unless you're buying SAP, very little (there is a Perl module, kind of, for doing table comparison) - and lo! the process of data reconciliation was patented in 1999 by Qwest; JPMorgan Chase later inherited the patent and then sold it off.

Can you believe that?  Can you honestly believe that somebody patented the very idea of finding records in one list and matching them with records in another list?  This is the kind of nonsense up with which we really should not put.
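
For the record, here's about how little it takes to do the basic version. This is a minimal sketch, not anybody's product: the function name, the sample records, and the choice of "id" as the matching key are all made up for illustration, and it assumes each key appears at most once per source.

```python
def reconcile(left, right, key):
    """Match two lists of records on a key function.

    Returns (matched, only_left, only_right), where matched is a list of
    (left_record, right_record) pairs. Assumes keys are unique per source.
    """
    left_by_key = {key(r): r for r in left}
    right_by_key = {key(r): r for r in right}

    matched = [(left_by_key[k], right_by_key[k])
               for k in left_by_key if k in right_by_key]
    only_left = [r for k, r in left_by_key.items() if k not in right_by_key]
    only_right = [r for k, r in right_by_key.items() if k not in left_by_key]
    return matched, only_left, only_right


# Hypothetical sample data: two sources that should hold the same customers.
crm = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]
billing = [{"id": 2, "name": "Grace"}, {"id": 3, "name": "Edsger"}]

matched, only_crm, only_billing = reconcile(crm, billing, key=lambda r: r["id"])
```

The real work in practice is in choosing the key and dealing with duplicates and near-matches, which this sketch ignores entirely.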

Some NLTK posts

Finally, the last two things I bookmarked in April (!), two posts from the same guy about using NLTK to do neat things: Build your own summary tool, and an Efficient way to extract sentence topics. Note the comment section of that latter one, with a commenter weighing in with his own Prolog sentence parser. I really, really need to spend some time in this arena.

PHP commandments

Some wisdom about PHP.

Wikidata

I'll just leave this here. Lots and lots of data.

Scaling Django

A slideshow.

Elixir

A language built on the Erlang VM focusing on metaprogramming, with the following highlights, some of which I like:

  • Everything is an expression.
  • Specific attention paid to metaprogramming and DSLs.
  • Polymorphism via "protocols", whatever that is in this context.
  • First-class documentation of subs, using Markdown. This is neat.
  • Pattern matching.

Problems with Markdown

Here are some problems with Markdown when you get down to serious formatting with it - but the basic idea I still love.

Selections in D3

D3 is cool, and apparently has selections. (You can tell I've spent a lot of time with it...)

Tutorial site for Foundation

Foundation is yet another HTML5/CSS framework. Here is a tutorial site with clever gamification and pretty design.

Shark: ML in C++

Shark is a machine learning library for C++.

Basic quantitative finance reading list

Good stuff.

Adding types to PHP

Hack is kinda cool, a way to gradually add static types and type checking to PHP without disturbing programmer workflow. It appears to run on the HipHop VM and is gaining some traction. That's an interesting evolution.

Friday, September 27, 2013

Maps! Maps!

... map icons, anyway. Kinda neat!

Using D3 to produce SVG

More graphics stuff.

AnyYolk: example HTML5 game

Here's an HTML5 game example written on Backbone and Parse. Interesting architecture.

PyEphem

Astronomical calculations in Python. Very slick.

A comparison of process managers

Hey, finally some halfway modern technology for dealing with Linux sysadmin issues - here's a comparative look at some of the newer process managers available.  Very nice!

Churnalism

Churnalism is a tool for tracking journalistic plagiarism by scanning a database of existing text for phrases found in a given article. It's a pretty straightforward application of the open-source SuperFastMatch [git] text comparison tool, which I should probably investigate in greater detail.
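
To get a feel for the idea, here's a rough sketch of phrase matching using word n-grams ("shingles"). I haven't read the SuperFastMatch source, so this is not its algorithm, just the simplest thing that fits the description; the function names and the sample texts are invented for the example.

```python
import re


def shingles(text, n=5):
    """Return the set of n-word phrases (shingles) in text, lowercased."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_phrases(article, corpus, n=5):
    """Map each corpus document name to the phrases it shares with article."""
    article_shingles = shingles(article, n)
    hits = {}
    for name, text in corpus.items():
        common = article_shingles & shingles(text, n)
        if common:
            hits[name] = common
    return hits


# Hypothetical corpus and article.
corpus = {
    "press_release": "Acme today announced a revolutionary new widget for the home.",
    "unrelated": "The weather was mild across the region on Tuesday.",
}
article = "In a statement, Acme today announced a revolutionary new widget, sources say."

hits = shared_phrases(article, corpus, n=4)
```

Here only the press release comes back as a hit. A real system would need an index rather than a scan over every document, which is presumably where the "super fast" part comes in.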

Excellent happy synthetic eyesight

So back in April (yeah, I'm still chewing through April's bookmarks, can you believe that?) I bookmarked Qbix, a new social platform attempt that ... doesn't seem to have done a lot since April. And I was going to mark it as a kind of interesting point, but got distracted by a piece of comment spam on their blog:

Excellent happy synthetic eyesight with regard to details and can anticipate problems just before they happen.

That's kind of poetic, and Googling it turns up a lot of similar variations. This kind of thing always draws me in, because there's a template at work here that could easily be reverse-engineered, allowing us to classify the link spam and identify specific actors.

I just love that kind of plan. I really ought to do something with it.

Signs you're a good programmer

... and how to cultivate the art.

Code comprehensibility

Here's a paper asking: what makes code hard to understand?

Good question...

Machine learning for link spam

Double-whammy for tickling my fancy: a blow-by-blow account of applying machine learning to detection of link spam.

Usability checklist

This checklist for Website usability is a goldmine of best-practice information!

Code organization for AngularJS

Oh, a code project template post!

Here's the thing. All code organization schemes provide a mapping between filenames and semantics that reveals the semantic structure of a given project. That's pretty interesting; the code organization reflects the programmer's mental model of the project.

Status page hosting

Here's a service that provides status-page hosting for your site.

NLP hacking in Python with Scripted

Here's a neat post.

H2O math for Hadoop

The H2O project provides a math runtime for Hadoop, extending it for big data, statistics, machine learning, all that jazz.

Liquid Helium

Liquid Helium provides a linguistic analysis API that makes rule-based decisions about various textual markers (formal register, etc.) - interesting.

Datalog

Datalog is a declarative language for data, a subset of Prolog, apparently. Zef Hemel has a neat little taste-test. The company he just joined, LogicBlox, has developed a new, high-performance commercial implementation, but there are various interesting-looking open-source alternatives.

TokuDB

TokuDB apparently scales MySQL/MariaDB instances by improving indexing?

Conception IDE

Conception is a general IDE for assembling snippets and macros into code (that's probably an inadequate description). It looks pretty slick.

Optimization with the Excel solver

This is a fascinating little article about how to set up and solve optimization problems in Excel.

Structure of a good open-source project

Yeah, yeah. I can't resist structure descriptions.

2014-04-19: Darn. This is clearly not the right link. I wonder what I did intend to link to?

pip and virtualenv in Python

Here's something I always get confused about. Good post.

Datasets released by Google

Google has actually released a lot of interesting ML datasets. Here's a short list.

Getting started with HFT

Quantstart's starter post. HFT is no longer the cash cow it was five years ago, but it's still a fascinating intersection of statistics and big data with real-world things (for certain values of "real world").

Webspam template

Some spammer mistakenly posted the template instead of the Webspam to a comment section - here's the gist!

Thursday, September 26, 2013

Drawingboard.js

Drop-in drawing board widget. This is getting pretty cool lately.

OCR in Perl

Neat little DIY article that hits my sweet spots.

Probabilistic programming languages

Apparently I was on a real language-design tear in April, too - here's a post on probabilistic programming languages, proposing semantic primitives for, well, probabilistic programming. Where do DSLs stop and plain old programming languages start? ... Good question.

I have to say, the BUGS language [that link goes to the old WinBUGS; here's OpenBUGS, the current project] looks pretty darned interesting - you're really using this to set up a model in a declarative manner, then invoking an engine that writes the "query results" into the original file, it looks like.  I really like the cut of that jib.

Then there's Church. Wow. I think this might have Hofstadterian implications, honestly.

Sapir-Whorf on the Lua forums

Sapir-Whorf as applied to computer languages... Hmm...

Open-source quant platform

Now here's something you don't see every day.

Voice-operated queries to ... things

This is really keen - and it's pure Python, making it groovier.

Wednesday, September 25, 2013

PyCharm

A Python IDE from JetBrains, built on their Java-based IntelliJ platform.

Hoodie local-only Web app framework

Hoodie is a Webapp framework for local use only. Neat! Claims to be very fast.

Decompiling, reverse engineering tools

More on code analysis!  Apparently April was the month for it. Note that this is an HNN post, not an article; the article pointed to is not actually all that interesting but the discussion is.

Valgrind

Open-source code analysis and profiling tool.

PHP refactoring browser

And while we're on the topic of code understanding (which I think is kind of prerequisite to code refactoring... or related to it, anyway), here's a magical set of PHP refactoring tools written in PHP that ... you know, help move models around at the syntactic level.

Partially-powered languages

Here's an interesting polemic about, well, everything that's not Haskell and using a declarative data model for its data structures, essentially, but especially where that concerns the Java/Ant/XSLT ecology. There are some interesting comments with corrections, but I get the author's concerns.

Icon-like expression evaluation

OK, so here's a little break from the usual - programming languages differ not only in their syntax, but also in the vernacular they provide for the expression of programming ideas, right? To that end, there are still some semantic atoms out there that aren't in general use (yet). Here's a paper about an evaluation system used in a research language in the 1970s that permitted an interesting type of backtracking during evaluation.

Essentially, it builds on the concept of generators, like the ones offered by Python, which can deliver any number of values before they "fail", having run out of values. Icon explicitly calls this succeeding and failing; Python signals the same thing by raising StopIteration once the generator returns, which is pretty reasonable.

If you chain generator and expression calls together with &, then Icon will try to retrieve a value from the first thing in the chain, then go on to evaluate the rest of the chain. Only if each link in the chain succeeds does the overall expression succeed; a failure at any step causes the evaluator to backtrack to the previous link in the chain. And you can assign "temporary variables" within the chain, whose values revert to the earlier value as you backtrack up through the chain.
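
Here's a rough Python imitation of that chaining, as I understand it from the description above. This is not how Icon implements it; conjoin, pick, and require are names I made up, and the "temporary variables" are modeled as a dict that gets copied at each step so that backing up restores the earlier values.

```python
def conjoin(*stages):
    """Yield every set of bindings that lets all stages succeed, in order.

    Each stage is a function taking the bindings so far (a dict) and
    returning an iterable of new bindings. An empty iterable is "failure",
    which sends control back to the previous stage for its next value.
    """
    def run(index, env):
        if index == len(stages):
            yield env
            return
        for new_env in stages[index](env):
            yield from run(index + 1, new_env)
    return run(0, {})


def pick(name, values):
    """Stage that generates candidate values for a variable."""
    def stage(env):
        for v in values:
            # Copy rather than mutate, so backtracking restores old bindings.
            yield {**env, name: v}
    return stage


def require(predicate):
    """Stage that succeeds once if predicate holds, and fails otherwise."""
    def stage(env):
        if predicate(env):
            yield env
    return stage


# Find pairs (x, y) with x * y == 12 and x < y.
search = conjoin(
    pick("x", range(1, 10)),
    pick("y", range(1, 10)),
    require(lambda e: e["x"] * e["y"] == 12 and e["x"] < e["y"]),
)
first = next(search, None)  # first overall success, or None on failure
```

Every time the require stage fails, control falls back to the y generator for another value, and when y runs out, back to x. Asking search for more values resumes the backtracking where it left off.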

This is really a pretty cool notion, but I have to start asking, first: what other "semantic primitives" are permitted by programming languages, and how can they be categorized in terms of ease of comprehension? How far can you go, designing a language, before people just don't get it?

Second: it would be cool to categorize this kind of semantic primitive and see how they move between languages.  If a given algorithm is expressed using such a primitive, how easy is it to "recast" the concepts into other idioms? This kind of thing is also related to the notion - often seen in Python discussions - of "idiomatic" programming, that is, programming that makes use of the community-condoned semantic primitives to achieve elegance and evidence of community membership, of "getting it".

There's a sliding scale of complexity here. Programming languages are, when you get down to it, just another human medium of expression - they're just specialized for the expression of algorithms and procedures. Are they as good as they can be? How easily can software "understand" the same things humans do?

Moose for software analysis

Aside from Moose for Perl object-oriented programming, there is also a Moose for the analysis of software. There's a book as well. Moose appears to be about the model-based facilitation of software engineering, especially in the research arena. It's Swiss, meaning that there is this FrancoGerman assumption of underlying ontologies I find nearly incomprehensible, but they appear to be doing a lot of things I want to understand as well.

So I should come back to it. Sometime when I can grok what meta-meta-modelling is supposed to be about.

A practical intro to data science

Here's a good, link-rich article for ya.

Data sharing

Caitlin Rivers deplores the current state of the art in data sharing, and offers some tips. I wonder how much could be done with some kind of semantic "data presentation understanding tool".

CommitQ

We want to retire the plain old generic-text diff and replace it with a programming-language aware semantic diff tool.

Sounds good!
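
I don't know how CommitQ does it, but the core idea is easy to sketch for a single language: compare syntax trees instead of text. This uses Python's standard ast module as a stand-in; same_structure and the sample sources are my own inventions.

```python
import ast


def same_structure(source_a, source_b):
    """True if two Python sources parse to the same syntax tree.

    ast.dump omits line and column information by default, so whitespace,
    comments, and redundant parentheses do not count as differences.
    """
    return ast.dump(ast.parse(source_a)) == ast.dump(ast.parse(source_b))


old = "def area(w, h):\n    return w * h\n"
reformatted = "def area(w,h):\n    # width times height\n    return (w * h)\n"
changed = "def area(w, h):\n    return w + h\n"

cosmetic_only = same_structure(old, reformatted)  # formatting differs, tree does not
real_change = not same_structure(old, changed)    # the operator really changed
```

A plain text diff would flag both edits; the tree comparison flags only the one that changes behavior. A real tool would also have to report where the trees differ, which this doesn't attempt.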

Tuesday, September 24, 2013

Quandl

Quandl is a search engine for datasets. Cool!

Media queries are a hack

(An aside - yeah, the last few posts are things other people posted in April. April is when I started throwing up my hands and storing links instead of blogging them, so there's a bit of backlog that will hopefully be working its way out into the world over the next couple of months.)

So here is a fascinating little post about responsive design and how it's focusing too much on medium instead of design situations. I like the way this guy thinks. Anyway, worth a read.

Quantopian

Quantopian appears to be a development platform/incubator kind of thing for amateur quants. Interesting stuff there that you could really spend some time grokking.

PSPP

PSPP is an open source alternative to SPSS, IBM's statistical analysis package.

Thomas Friedman op-ed generator

I know, I know, it's just template filling - but I'm a sucker for these things. I love'em. [inspiration]

Raven Software open-sources code for Star Wars games

This is always cool stuff: a couple of games got open-sourced after Disney's acquisition of Lucas.

I'm posting this under "open source target", but my understanding of what that means seems to have drifted a little. Originally I considered open source targets to be interesting things that could be done for programming using declarative styles and semantic programming. Now I find myself also including things that could be used as existing code for the purpose of exegesis and code understanding.

This is kinda both.

Note 2013-10-10: I just now noticed I didn't link to the post in question, but it doesn't matter. Raven apparently undecided to release, and all trace of their code is gone from SourceForge. That irks me, but there doesn't seem to be anything I can do about it.

Hosting options

There have been a few new hosting options lately - it's really getting very cheap to host a server. Case in point: Digital Ocean, which provides $5 root-access IP addresses. Not much storage, granted (20 gig), but the servers in question are blazingly fast, with SSD drives, and multiple cores are available for a little more money. Outgoing bandwidth is measured in terabytes, and incoming is not metered at all.

You can set up new servers with an API call.

And that's just one such hosting company. I'm going to start tracking the ones I find under this "hosting" tag.  Right now I'm paying $60 a month for a dedicated server that's, what, seven years old and feeling it? That's just money wasted these days.

Another cheap hosting alternative I've seen lately is Uberspace.de, remarkable for being in Germany, which could be quite useful.

Wit.ai

Wit purports to be a (voice) NLP API for arbitrary apps. I'm skeptical, but it's still a neat idea.

Email message threading

Jamie Zawinski explains his email threading algorithm here, the one used in Netscape back in the day. I love reading his work.

Sunday, September 22, 2013

Schema.org scraper

Does what it says on the tin, apparently. Interesting!

OpenCV

http://opencv.org/ is the open-source computer vision library I keep hearing about. There's currently a Kickstarter up for using it to interpret hand drawings of a mobile UI and generate the UI skeleton, which raises the question - couldn't I use it for sketches and concept maps?

I don't see why not!

General SEO tricks for any Website template

Here's a short list of some SEO best practices that seems pretty good.

Calculating rolling cohort retention - with SQL

This kind of trick is great stuff. I don't even know how to categorize it. Well, "data science", of course, but this general kind of algorithmic sleight-of-hand is always attractive.

Dictionary of Algorithms and Data Structures

Semantic gold mine! A long-running personal project at NIST cataloging data structures and the algorithms that use them.

svg.js

A lightweight library for manipulating SVG in JS.

Data table editor for jQuery

And another nice drop-in component: a data table viewer/editor built on jQuery.

Outline editor Concord

Oh, this is nice - a drop-in outline editor component in open-source JavaScript.

Saturday, September 21, 2013

Skill trees for Webdev work

A new skill tree (cheat sheet) site for Webdev work (bentobox.io [github]) hit HNN the other day, and the hivemind came up with a couple of interesting alternatives: The Odin Project with a self-contained curriculum, and the very cute Dungeons & Developers.

Note that a skill tree is essentially a semantic map of the domain of interest. I'm just sayin.

How to build a MOBI

The SICP book has been essentially open-sourced and there are spinoffs for different formats. The Kindle version is generated using this Github project, so it would be nice to go in and figure out how the content is handled.  (It appears to reside in HTML files, oddly.)

Wednesday, September 11, 2013

Summer hiatus

Due to health issues and travel (and it's always fun when those coincide) I have not really done any programming or thinking about programming for about two or three months now.  So I'm coming back to a lot of my old ongoing efforts with a fresh eye, and today I had a strange epiphany:

I'm thinking of the platform for a given piece of software as ephemeral now.

For instance, one of the things I'm working on is a parser of English in order to automate some of the language-quality work I do professionally. I'd like to implement that on my usual machine, but for performance reasons it would be convenient to offload it onto the Parallella platform, since I expect the parser will really benefit from the parallel hardware.

So I can't really write it in Perl because of platform conflict. OK, I know Perl will probably run fine on the managing processors - but the point here is not whether Perl will or won't work, the point is that I really want to develop the algorithms and then "compile" them to Perl or C or whatever, as needs require.

This is what Java purports to address, by the way.  But I'm seeing a lot of new languages that "compile" to various high-level languages, notably JavaScript and C, and maybe this is a new modality.

Maybe what semantic programming is about, I tell myself yet again, is working out the semantic content of an algorithm, expressing it at that level, then having it run in whatever platform is required - and if that means "compiling" to a given language, then in a sense it's really coding in that language. The semantic structure is expressed in C or in Perl, but at some level it's also expressed as a bunch of semantic units that could also be used to express an explanation of the code in English, or even to derive a domain-specific language for intermediate work, a set of macros or something like that.

In other words, what I'm internalizing is that in a semantic programming paradigm the computer should be doing more of the work of coding, at a level that reflects a knowledge of the underlying purpose of each part of the code. That naturally ties back into code understanding to reverse-engineer this kind of semantic structure given existing syntactic expressions, but it's output that should logically come first.
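
To make that a little more concrete, here's a toy sketch of the idea: one small expression tree, rendered as C, as Perl, and as an English explanation. Everything here (the Num/Var/BinOp classes, emit, explain) is hypothetical and handles only arithmetic; it's meant to show the shape of the thing, not a design.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Num:
    value: int


@dataclass
class Var:
    name: str


@dataclass
class BinOp:
    op: str          # one of "+", "-", "*"
    left: "Expr"
    right: "Expr"


Expr = Union[Num, Var, BinOp]

OP_WORDS = {"+": "plus", "-": "minus", "*": "times"}


def emit(expr, target):
    """Render expr as source text for target ("c" or "perl")."""
    if isinstance(expr, Num):
        return str(expr.value)
    if isinstance(expr, Var):
        # Perl scalars carry a "$" sigil; C identifiers do not.
        return "$" + expr.name if target == "perl" else expr.name
    if isinstance(expr, BinOp):
        return "({} {} {})".format(
            emit(expr.left, target), expr.op, emit(expr.right, target))
    raise TypeError("unknown node: {!r}".format(expr))


def explain(expr):
    """Render the same tree as a plain-English phrase."""
    if isinstance(expr, Num):
        return str(expr.value)
    if isinstance(expr, Var):
        return expr.name
    if isinstance(expr, BinOp):
        return "{} {} {}".format(
            explain(expr.left), OP_WORDS[expr.op], explain(expr.right))
    raise TypeError("unknown node: {!r}".format(expr))


# price * quantity + 5, expressed once at the "semantic" level.
tree = BinOp("+", BinOp("*", Var("price"), Var("quantity")), Num(5))

c_source = emit(tree, "c")        # "((price * quantity) + 5)"
perl_source = emit(tree, "perl")  # "(($price * $quantity) + 5)"
english = explain(tree)           # "price times quantity plus 5"
```

The hard part, of course, is that real algorithms need semantic units far richer than arithmetic, and going the other direction (from existing C or Perl back to the tree) is the code-understanding problem.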