Friday, December 30, 2011
Interesting cryptography/number theory library
Perl documentation in the news
Waffles command-line ML toolset
Sunday, December 25, 2011
Task: Lego inventory tracker
Task: News scraper/tracker
Trading platforms
- Trade Architect from TD Ameritrade.
- OptionsHouse allows you to do simulated trading to start off and has flat trading fees for real trades. Interesting.
How startups succeed
- Fire lots of bullets, not cannonballs (MVPs again)
- Fanatic devotion to performance goals even when times are hard
- Productive paranoia: cash in the bank, reduce risk whenever possible, anticipate killer strikes
- Don't bet on luck. Bet on being good.
- Seize opportunity when it arises.
Questions for startups
Open source target: IndexTank
DD-WRT
- From the dd-wrt forum.
- Refers to Google code here.
- Possibly upgraded here.
Data mining without prejudice
Open Government
HTML too complex?
Saturday, December 17, 2011
Task: Write a Blogger to-do list manager
Big data predictions for 2012
Google on bug prediction and Microsoft on empirical programming
Research tools in Python
Things not to forget
- Paraphrasing tools. This is something I came up with a couple of years ago that would be a lot easier now that I've spent some time thinking harder about NLP.
- HVPT word pair trainer.
- Depatenting, still, I guess.
- Despammed rebirth, possibly based on CRM114.
- Practical PHP exercises as kata.
- Run back through the big translation project management tasks from last spring in light of Windows automation.
- Code structure examination of OpenLogos, finally.
- In general, continue automation of my translation workflow.
- The Heritage Health Prize. Even doing halfway decently on it would be good advertising.
Target application: Todoist.com
Wednesday, December 14, 2011
Infunl query language for clickpaths
Tuesday, December 13, 2011
Programming in Syn
Sunday, December 11, 2011
Friday, December 9, 2011
Target application: WildChords
Infographic: What tools developers actually use
Wednesday, December 7, 2011
Tuesday, December 6, 2011
XML in PostgreSQL
Reactive programming
Lucene
Open source target: Civic Commons
Monday, December 5, 2011
N-grams
Some neat Perl things
Website spam fighting
- Rename default pages
- Set up honeypot fields (hidden fields on the form)
- Follow up on spammed companies
- Have human moderators and an educated forum community
Word representations
Metaoptimize QA
Wednesday, November 30, 2011
Data mining of recipes
Javascript and HTML5 Canvas roundup
- Basic tunnel animation with moving random spheres.
- From the same site, a really nice set of presentation slides. The language is Volapük, if you're wondering.
- Some of the new HTML5 semantic tags. Wow!
- Alex the Alligator, a platform game ported to HTML5, based on the engine melonJS.
- Basic drawing with HTML5 Canvas.
False claim checker
There's just a boatload of things you could do in this arena.
Sunday, November 27, 2011
Supercollider
SourceMap: finding where the JavaScript comes from
Functional programming
Tuesday, November 22, 2011
Monday, November 21, 2011
Sentiment analysis
Update 2013-03-13: dead. Maybe I should revive it.
Spanish morphology in Haskell
Sunday, November 20, 2011
(Natural) language recognition
- Translated online guesser - uses a vector space model
- Huh. The other two links are dead. That's a shame - but it may be worth following up on them at a later date.
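To make "vector space model" a bit more concrete, here's a toy sketch of the usual approach: turn each text into a vector of character trigram counts and pick the language whose sample is closest by cosine similarity. This is my own illustration, not the code behind the linked guesser; the function names and the tiny sample texts are made up, and a real guesser would train on far more text.

```python
import math
from collections import Counter


def trigram_vector(text):
    """Count overlapping character trigrams in lowercased text."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def guess_language(text, samples):
    """Return the label of the sample whose trigram vector is closest."""
    query = trigram_vector(text)
    return max(
        samples, key=lambda lang: cosine(query, trigram_vector(samples[lang]))
    )


# Hypothetical training samples, far too small for real use.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y el gato",
}

print(guess_language("the dog and the fox", SAMPLES))
```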
Saturday, November 19, 2011
The original code katas
Thursday, November 17, 2011
State machines in Perl
- FSA::Rules is the package I initially started building a wrapper for. It's actually pretty nice, and has a couple of constructs that my last post probably is missing.
- Parse::FSM builds a parser based on an FSM constructed laboriously by function call.
- State::ML provides a utility for converting XML-encoded state machines into other things or even code. I like the code generation aspect!
- Win32::CtrlGUI::State is a slick little state-machine controller for Win32 GUIs.
- Basset::Machine builds a state machine class in much the same way Term::Shell builds a command line shell.
State machine redux redux
- The overall tag is "statemachine" and has a name. This name will resolve as a function outside the state machine.
- The state machine tag is an iffy executor; that is, if it's the last tag in a program, it will be in control.
- Within the state machine node, the children are named. Children named "prepare", "output", or "input" that occur before "start" are special.
- The "prepare" child is code executed on each input to prepare it. The local variable $input contains the prepared input (the return from the "prepare" code) and $raw contains the raw input should it be required (this is @_ in the "prepare" code).
- The "output" child is code executed on each out-of-band output in the state machine (see below). The default output is the same as anywhere in Decl; it's to pass output to the parent, where eventually it just gets printed to stdout if you don't redirect it.
- The "input" child is code executed to obtain the next input token, if the state machine is in control. If there is no input, then the state machine can't be in control; it must be called from other code for each input token.
- The "start" child is the first state - every child after "start" is another state, so you can still call a state "prepare" or "input" if you need to.
- Within a state, we still have special parse rules, but in general, execution goes down the list of the state's children.
- A string followed by "->" consumes an input token if it matches, and changes the state.
- A line introduced by "->" just changes the state.
- Either of those may have code attached; if so, this code executes before the state transition. But with or without code, both of those act like a "do".
- A line that doesn't consist of string and -> or just -> is parsed as normal Decl code and does whatever it's supposed to.
- The code morpher will be updated to understand "->" at the start of a line as a state transition if you're inside a state machine. (That's probably a trickle-up thing as well, actually.)
statemachine nice
  start
    n -> n_found
    -> error
  n_found
    i -> i_found
    -> error
  i_found
    c -> c_found
    -> error
  c_found
    e -> success
    -> error
  success (accept)
    >> Yay!
    -> start
  error (fail)
    >> bad!
    -> start
That seems to do what I want; it doesn't show any code, but it does show the basic pseudocode I want to use.
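For comparison, here's roughly the same machine as a plain transition table in Python. This is only a sketch to pin down the semantics, not Decl and not how Decl would implement it. The names `TRANSITIONS`, `MESSAGES`, and `run` are made up, each character counts as one input token, and I'm assuming that a failed match also consumes the offending token (otherwise the error state would bounce back to start and see the same token forever).

```python
# Transition table: state -> (expected token, next state on match).
# Any other token falls through to "error", mirroring the bare "->" lines.
TRANSITIONS = {
    "start": ("n", "n_found"),
    "n_found": ("i", "i_found"),
    "i_found": ("c", "c_found"),
    "c_found": ("e", "success"),
}

# Out-of-band output emitted on entering a terminal state (the ">>" lines).
MESSAGES = {"success": "Yay!", "error": "bad!"}


def run(tokens):
    """Feed tokens through the machine and return the list of outputs."""
    outputs = []
    state = "start"
    for token in tokens:
        expected, target = TRANSITIONS[state]
        # Assumption: a failed match consumes the token as well.
        state = target if token == expected else "error"
        if state in MESSAGES:
            outputs.append(MESSAGES[state])
            state = "start"  # both terminal states go back to "start"
    return outputs


print(run("nice"))      # ['Yay!']
print(run("nicenope"))  # ['Yay!', 'bad!', 'bad!', 'bad!']
```

The table version has no room for code attached to a transition, which is exactly the part the Decl syntax is supposed to make easy.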
Wednesday, November 16, 2011
Another open-sourced Ruby app
PR diving
Static typing
Tuesday, November 15, 2011
Pure: template language
Wx task
Random posts
Codefixbot
Monday, November 14, 2011
Meta-learning
Update: I've registered with Springer-Verlag as a book reviewer and I'll be reviewing this book. That means I get free online access to the text for six months. This should dovetail nicely with the machine-learning-in-perl tutorial site idea, actually.
Error handling in Decl
Free programming books
Time for a link dump: more ML/NLP
- Introduction to Information Retrieval
- Foundations of Statistical Natural Language Processing (sadly not free online, but deemed valuable)
- Mining of Massive Datasets
- Two books on Computational Semantics (Blackburn & Bos)
- Elements of Statistical Learning
- Data-Intensive Text Processing with MapReduce
- CMU's machine learning course. I might work through it after Ng's course is done.
- Apache Mahout [hnn]
- The PET parser [online demo] [article], part of DELPH-IN, which has a truly painfully formatted home page but looks promising
- Natural Language Engineering journal
- StackExchange discussion of NL parsers and starting points for NLP
- A list of what's in the Ubuntu NLP stack
- The Porter stemmer
- Apache OpenNLP - probably a good place to help out
- ANTLR
Wednesday, November 9, 2011
Superdesk: journalism tool by and for journalists
Local files in JavaScript
NLP
- A list of R packages for NLP, with an intriguing link to Weka, a set of Java implementations of data mining algorithms.
- StackOverflow reference to NLTK and n-gram extraction.
- Note "PMI", point-wise mutual information, cited in the SO link.
- Lucene is Apache's text search and indexing library; there is a PyLucene as well. But honestly I think I'm going to have to get my hands dirty in Java, because Java seems inordinately popular in the NLP field.
- JCC is a code generator developed for use in PyLucene.
- Europarl sentence splitter
- Europarl tokenizer
- A post on sentence splitting options (2007)
- Lingua::EN::Sentence, Text::Sentence
- A tech report on tokenizing for biomedical text indexing
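Since PMI came up in the list above: for a word pair it's log2( p(x,y) / (p(x) p(y)) ), so it's high when two words occur together more often than chance would predict. Here's a minimal sketch over adjacent word pairs using only the standard library. The function name `bigram_pmi` is mine, and the probabilities are raw relative frequencies with no smoothing, so scores for rare pairs will be unreliable.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return {(w1, w2): PMI in bits} for each adjacent word pair."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)  # number of adjacent pairs
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


scores = bigram_pmi("new york is new york".split())
for pair, value in sorted(scores.items()):
    print(pair, round(value, 3))
```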
Monday, November 7, 2011
Cinder: C++ for creative work
A quick note on dates and times
Sunday, November 6, 2011
XDoclet: "attribute-oriented" programming in Java
The philosophy of artificial intelligence
Decl hits CPAN
- Traversal: this is hierarchical structure walking (e.g. directory walk) and mapping (e.g. something like XSLT)
- Boilerplate and macros in modules, then release of declarative CSS and HTML modules
- Rewrite Word using some macros (the "select" tag usage is changing) and rerelease it
- Look again at Wx now that macros work, maybe release Wx 0.01
- Look at macros in the PDF context, probably release PDF 0.01
- Database management and access, then release Decl 0.12 with that
- An error management system, finally, which will probably be Decl 0.13
- Literate programming and PHP katas and examples, then release Publisher
- Probably look at Inline next and integrate with Python; I want access to the NLTK.
- Declarative logic somewhere in here, based on AI::Prolog.
Two years in
Javascript pitfall: missing var
Statistical comparison of programming languages
Overview of numerical analysis software
- Wikipedia has a nice table
- The Octave Wiki recommends Inline::Octave, which I find a little questionable, but hey.
- PDL is probably the best Perl alternative; has direct support for sparse matrices, interestingly.
- The Monks look at some comparisons between R/S, Octave, and PDL.
Good maxims for consulting programming
- Set up continuous deployment before you start
- Write tests first
- Be transparent
- Maintain daily todo lists
- Do the right thing
Thursday, November 3, 2011
Automated freaking writing in the news again
Wednesday, November 2, 2011
An aside on machine learning, and open-source contribution
Monday, October 31, 2011
Less talking, more doing
Sunday, October 30, 2011
IPEDS
Saturday, October 29, 2011
Unbounce: component and target
Fast test for startup ideas
Shakespeare, the programming language
Thursday, October 27, 2011
GUI vs. CLI
The not-so-secret capitalist cabal that owns us all
Wednesday, October 26, 2011
Language: Elephant
Math: sympy
- Handwriting recognition on a tablet PC to be translated into OpenMath and thence TeX.
- Selection of portions of a large mathematical formula and specification of specific operations to be carried out (e.g. "solve for this" or "call this theta" or what have you, said operations to be discovered by observation of my private theoretical physicist)
- Maintenance of a log of the trajectory through formula space
- n-fold productivity increases for theoretical physicists
- Public perception of my private theoretical physicist as highly productive physics genius
- Live on p.t.p.'s CERN salary while enjoying Geneva
Tuesday, October 25, 2011
Nice interactive graphic
Analysis of Steve Jobs tribute messages
JavaScript roundup
- So you want to write JavaScript for a living. Interesting list of some of the things one should know about JS.
- Badass JavaScript, a blog.
Tangle: a JS library for reactive documents
Monday, October 24, 2011
Some more open source projects
- Qt has officially been spun off by Nokia. Along the same lines would of course be Tk and Wx, and I suppose native W32 by direct DLL access. All these share a lot of concepts that should be organized in parallel, and ultimately a feature in one should always migrate into the others so we're all working with the same set of concepts. They do eventually anyway, so it's kind of an obvious step to formalize that path.
- MediaWiki is, of course, in PHP, and always has bugs outstanding. Hone the semantic understanding tools on that. Same goes for Drupal and WordPress, of course.
- Which brings us to open-access science. This guy, a chemist at Cambridge, appears to be doing some actual data mining of open-access journals. I need to look a little closer at that. And remember: closed source kills.
- And then there's WikiData.
Sunday, October 23, 2011
Decl striving mightily to hit CPAN
Thursday, October 20, 2011
Decl doesn't actually hit CPAN
Decl hits CPAN
Google AI challenge
Graphics by Kevin Karsch
Tuesday, October 18, 2011
Oh, what a tangled web we weave
Monday, October 17, 2011
Sunday, October 16, 2011
NLP
- NLTK has a book. It might be a reasonable place to start, just working through that. And there are online courses available.
- I actually got a lot of useful information from Wikipedia, starting with UIMA, the Unstructured Information Management Architecture.
- GATE comes up a lot. It's Java-based.
- Apache OpenNLP is out there. Java.
- Book: Handbook of Natural Language Processing
- Oh, and Amazon recommendations come up with Syntax-Based Collocation Extraction
- Looking for the individual chapters of HNLP seems fruitful: Bing Liu has a whole page on opinion mining and sentiment analysis and even links to a PDF of his chapter of the book (I wonder if the entire book couldn't be reassembled in that manner)
- Liu has his own book on Web data mining.
Hyde
A possible approach
Windows PE format in painstaking detail
Saturday, October 15, 2011
Stanford's NLP class
Data journalism
- Be mercenary: do what works. But do it.
- Shave yaks as needed: take the time to learn details when you need them.
- Develop sources
- Become the resident expert
- Be the data project you want to see on the Web
Friday, October 14, 2011
Target application: web automation
Description of Djuggler Enterprise
Data Juggler automates repetitive Web & data tasks without programming code. Use it to create sophisticated scripts for collecting data from the Web, filling Web forms, transforming text files, XML, CSV and database data. The easy-to-use drag-and-drop interface creates scripts that can be deployed as stand-alone Windows executables. Typical application examples:
- Extract competitor's price list from Web pages regularly.
- Extract people data from Web pages.
- Download Web images on a regular basis.
- Get search results from multiple search engines.
- Automated Web testing and load testing.
- Export data to Web-based applications by filling Web forms.
- Automate web based workflow processes like timesheets.
- Search & replace actions to clean data.
- Transform data from one format to another.
- Convert data from legacy applications to industry standards.
- Automate database migration with Business Intelligence.
- Compare data and create reports.
- Send emails with personalized attachments.
- Server monitoring and reporting.
- Synchronize folders, databases, etc.
- Automate file management & data backup.
Automate IT operations by deploying stand-alone Djuggler scripts. The powerful script designer has many actions and functions like loops, 'if then else' conditions, get text between from html, get html table, get pictures, strip HTML, web macros, read and save Excel, support for popular databases and many more. Demos are included in the setup. Visit www.djuggler.com for the script repository and script service. A Djuggler Personal edition is available as freeware.
Keywords: Web data collection, Application Integration, Data Aggregation, Data Transformation, Report Generation, Batch Processing, Business Intelligence, System Monitoring, Form Filling, Web Scripting, Data Extraction, Web Testing.
Postmark spam filter has an API - Despammed should, too
- SpamAssassin
- Procmail
- Green and redlighting of known-good, known-bad actors on a per-account basis
- CRM114
- Bayesian training
- Tracking of spamvertised URLs
- Both forwarding and Webmail access
- Arbitrary forwarding (including taking Web API action or Twilio phone action) based on rules, including rules that can be expressed in arbitrary JavaScript
- Spam discussion with specific examples and other community action
- Blogging about spam topics, including botnet identification and such
- Uniform treatment of both email and Web spam
- and yeah, an API...
CSS tricks
Tuesday, October 11, 2011
CRM114
Sunday, October 9, 2011
CmdrTaco: not dead - scaling
Puppet vs. Chef
Saturday, October 8, 2011
Concurrent Constraint Programming in Oz for Natural Language Processing
XSB Prolog
HNN: what data structure does the brain use?
TermL: another specification for expressing symbolic trees
OMeta: pattern-matching language
One-liner music
Linear regression and linear algebra
- Linear regression in financial analysis [investopedia] - this is magic to a lot of people.
- Linear algebra is nearly universally based on BLAS: the Fortran-written Basic Linear Algebra Subprograms.
- Here's a textbook on elementary linear algebra.
- ATLAS (Automatically Tuned Linear Algebra Software) is a tuned implementation of the BLAS routines, plus part of LAPACK.
- And in general this all leads into numerical linear algebra.
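To take some of the magic out of the first link: fitting a straight line by least squares is only a few lines of arithmetic. This is a plain-Python sketch for the one-variable case, with a made-up function name `fit_line`; anything serious would hand the general matrix problem to a library that sits on BLAS.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)  # 2.0 1.0
```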
Monday, October 3, 2011
E-discovery
Saturday, October 1, 2011
OpenMath
- XML, binary, and Declarative versions of representation
- LaTeX output
- Octave output and manipulation and parsing back in
- Some kind of overarching systems description a la "semantic Excel"
- Some kind of graphical presentation as active areas a la Equation Editor (but better)
Visual Modeling and Programming with Graph Transformations
Math
Gamification
Notificon
Spambot combat
- Timestamp: don't allow a long period between reading and posting. (I had mixed success with this way back when.)
- Hash: check the IP, timestamp, post # - prevents replay attacks.
- Randomized field names.
- Honeypot fields: invisible (not hidden) fields that, if filled in, are a spam indicator.
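Three of these checks fit together nicely in one small function. What follows is a sketch under my own assumptions, not code from any of the linked articles: the secret, the one-hour limit, the honeypot field name `website`, and the function names are all made up, and randomized field names are left out to keep it short.

```python
import hashlib
import hmac
import time

SECRET = b"change-me"  # hypothetical server-side secret
MAX_AGE = 3600         # assumed limit, in seconds, between render and post


def make_token(ip, post_id, issued_at):
    """Sign the IP, timestamp, and post number when the form is rendered."""
    message = f"{ip}|{issued_at}|{post_id}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def looks_like_spam(form, ip, post_id, now=None):
    """Return True if the submission fails any of the checks."""
    now = time.time() if now is None else now
    # Honeypot: humans never see this field, so it should be empty.
    if form.get("website", ""):
        return True
    try:
        issued_at = int(form["issued_at"])
    except (KeyError, ValueError):
        return True
    # Timestamp: reject forms posted too long after they were rendered.
    if now - issued_at > MAX_AGE:
        return True
    # Hash: the token must match this IP, timestamp, and post number.
    expected = make_token(ip, post_id, issued_at)
    supplied = str(form.get("token", ""))
    return not hmac.compare_digest(expected.encode(), supplied.encode())
```

The form would carry `issued_at` and `token` as ordinary hidden fields, with the honeypot field hidden from humans by CSS rather than by `type="hidden"`.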
Learning algorithms
Wednesday, September 28, 2011
Final flurry of Stanford-related links
Machine learning
- BYU's machine learning and data mining course.
- Book: The Elements of Machine Learning. More math, I think.
- Octave documentation.
- Reddit on Machine Learning and on this class.
Sending email: best practices
Target application: Promoter
haXe
Diagramming again
TeX
Octave
Tuesday, September 27, 2011
DevOps choices at AppNexus
Open source targets: BuddyPress and CUNY Academic Commons
Draw a Stick Man
Ticket Servers: Distributed Unique Primary Keys on the Cheap
Real time face substitution
Saturday, September 24, 2011
mrjob
NLP
- NLTK, and a list of suggested NLTK projects for further thought
- OpenNLP is an umbrella project for all kinds of NLP open-source projects
- ClearTK is a Java-based NLP library
- LingPipe ditto
- GATE ditto
- Xerox has a finite-state tool
Wednesday, September 21, 2011
Book: Mining of Massive Datasets
Rhetorical analysis
The trigger is an article on CNN [HNN discussion] by Bill Bennett of the Claremont Institute tearing down the concept of spending public money on education (god forbid the teachers' unions should get tax money). There are a few little nasty tricks he throws in. I think it would be possible to analyze this kind of rhetorical treatment, maybe. Eventually. I'm not sure how to start, but it fascinates me.