Friday, December 30, 2011
Sunday, December 25, 2011
- Fire lots of bullets, not cannonballs (MVPs again)
- Fanatic devotion to performance goals even when times are hard
- Productive paranoia: cash in the bank, reduce risk whenever possible, anticipate killer strikes
- Don't bet on luck. Bet on being good.
- Seize opportunity when it arises.
Saturday, December 17, 2011
- Paraphrasing tools. This is something I came up with a couple of years ago that would be a lot easier now that I've spent some time thinking harder about NLP.
- HVPT word pair trainer.
- Depatenting, still, I guess.
- Despammed rebirth, possibly based on CRM114.
- Practical PHP exercises as kata.
- Run back through the big translation project management tasks from last spring in light of Windows automation.
- Code structure examination of OpenLogos, finally.
- In general, continue automation of my translation workflow.
- The Heritage Health Prize. Even doing halfway decently on it would be good advertising.
Wednesday, December 14, 2011
Tuesday, December 13, 2011
Sunday, December 11, 2011
Friday, December 9, 2011
Tuesday, December 6, 2011
Monday, December 5, 2011
Wednesday, November 30, 2011
- Basic tunnel animation with moving random spheres.
- From the same site, a really nice set of presentation slides. The language is Volapük, if you're wondering.
- Some of the new HTML5 semantic tags. Wow!
- Alex the Alligator, a platform game ported to HTML5 using the melonJS engine.
- Basic drawing with HTML5 Canvas.
There's just a boatload of things you could do in this arena.
Sunday, November 27, 2011
Tuesday, November 22, 2011
Monday, November 21, 2011
Sunday, November 20, 2011
- Translated online guesser - uses a vector space model
- Huh. The other two links are dead. That's a shame - but it may be worth following up on them at a later date.
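For my own reference, here is a minimal sketch of how a vector space guesser can work. This is not the linked tool's code; the character-trigram features, the two tiny sample texts, and all the function names are my assumptions for illustration. Each language becomes a sparse vector of trigram counts, and the guess is whichever profile has the highest cosine similarity to the query.

```python
import math
from collections import Counter

def trigram_vector(text):
    """Map text to a sparse vector (Counter) of character trigrams."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy training samples (hypothetical); a real guesser needs far more text.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog "
               "and then the cat sleeps in the sun",
    "german": "der schnelle braune fuchs springt über den faulen hund "
              "und die katze schläft in der sonne",
}
PROFILES = {lang: trigram_vector(text) for lang, text in SAMPLES.items()}

def guess_language(text):
    """Return the language whose profile is closest to the text."""
    query = trigram_vector(text)
    return max(PROFILES, key=lambda lang: cosine(query, PROFILES[lang]))
```

With profiles this small the guesses are only as good as the overlap in trigrams, so treat it as a toy.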
Saturday, November 19, 2011
Thursday, November 17, 2011
- FSA::Rules is the package I initially started building a wrapper for. It's actually pretty nice, and has a couple of constructs that my last post probably is missing.
- Parse::FSM builds a parser based on an FSM constructed laboriously by function call.
- State::ML provides a utility for converting XML-encoded state machines into other things or even code. I like the code generation aspect!
- Win32::CtrlGUI::State is a slick little state-machine controller for Win32 GUIs.
- Basset::Machine builds a state machine class in much the same way Term::Shell builds a command line shell.
- The overall tag is "statemachine" and has a name. This name will resolve as a function outside the state machine.
- The state machine tag is an iffy executor; that is, if it's the last tag in a program, it will be in control.
- Within the state machine node, the children are named. Children named "prepare", "output", or "input" that occur before "start" are special.
- The "prepare" child is code executed on each input to prepare it. The local variable $input contains the prepared input (the return from the "prepare" code) and $raw contains the raw input should it be required (this is @_ in the "prepare" code).
- The "output" child is code executed on each out-of-band output in the state machine (see below). The default output is the same as anywhere in Decl; it's to pass output to the parent, where eventually it just gets printed to stdout if you don't redirect it.
- The "input" child is code executed to obtain the next input token, if the state machine is in control. If there is no input, then the state machine can't be in control; it must be called from other code for each input token.
- The "start" child is the first state - every child after "start" is another state, so you can still call a state "prepare" or "input" if you need to.
- Within a state, we still have special parse rules, but in general, execution goes down the list of the state's children.
- A string followed by "->" consumes an input token if it matches, and changes the state.
- A line introduced by "->" just changes the state.
- Either of those may have code attached; if so, this code executes before the state transition. But with or without code, both of those act like a "do".
- A line that doesn't consist of string and -> or just -> is parsed as normal Decl code and does whatever it's supposed to.
- The code morpher will be updated to understand "->" at the start of a line as a state transition if you're inside a state machine. (That's probably a trickle-up thing as well, actually.)
n -> n_found
i -> i_found
c -> c_found
e -> success
That seems to do what I want; it doesn't show any code, but it does show the basic pseudocode I want to use.
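To check my own thinking, here is the same machine as a table-driven sketch in Python rather than Decl. The target state names come from the pseudocode above; the source state for each transition, the rule that an unmatched token leaves the state unchanged, and the names TRANSITIONS, step and run are all assumptions on my part.

```python
# Transition table: state -> {input token: next state}.
# Assumption: each "x -> y" line above lives in the state reached by the
# previous line, beginning with "start".
TRANSITIONS = {
    "start":   {"n": "n_found"},
    "n_found": {"i": "i_found"},
    "i_found": {"c": "c_found"},
    "c_found": {"e": "success"},
}

def step(state, token):
    """Consume one token; stay put if no transition matches (assumption)."""
    return TRANSITIONS.get(state, {}).get(token, state)

def run(tokens, state="start"):
    """Feed tokens one by one and stop as soon as 'success' is reached."""
    for token in tokens:
        state = step(state, token)
        if state == "success":
            break
    return state
```

Calling run("nice") walks start, n_found, i_found, c_found and ends in success. Here the caller supplies the tokens, which corresponds to the case above where the state machine is not in control.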
Wednesday, November 16, 2011
Tuesday, November 15, 2011
Monday, November 14, 2011
Update: I've registered with Springer-Verlag as a book reviewer and I'll be reviewing this book. That means I get free online access to the text for six months. This should dovetail nicely with the machine-learning-in-Perl tutorial site idea, actually.
- Introduction to Information Retrieval
- Foundations of Statistical Natural Language Processing (sadly not free online, but deemed valuable)
- Mining of Massive Datasets
- Two books on Computational Semantics (Blackburn & Bos)
- Elements of Statistical Learning
- Data-Intensive Text Processing with MapReduce
- CMU's machine learning course. I might work through it after Ng's course is done.
- Apache Mahout [hnn]
- The PET parser [online demo] [article], part of DELPH-IN, which has a truly painfully formatted home page but looks promising
- Natural Language Engineering journal
- StackExchange discussion of NL parsers and starting points for NLP
- A list of what's in the Ubuntu NLP stack
- The Porter stemmer
- Apache OpenNLP - probably a good place to help out
Wednesday, November 9, 2011
- A list of R packages for NLP, with an intriguing link to Weka, a set of Java implementations of data mining algorithms.
- StackOverflow reference to NLTK and n-gram extraction.
- Note "PMI", point-wise mutual information, cited in the SO link.
- Lucene is Apache's full-text indexing and search library; there is a PyLucene as well. But honestly I think I'm going to have to get my hands dirty in Java, because Java seems inordinately popular in the NLP field.
- JCC is a code generator developed for use in PyLucene.
- Europarl sentence splitter
- Europarl tokenizer
- A post on sentence splitting options (2007)
- Lingua::EN::Sentence, Text::Sentence
- A tech report on tokenizing for biomedical text indexing
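Since PMI came up in the notes above, here is a quick sketch of it for adjacent word pairs: PMI(x, y) = log2( p(x, y) / (p(x) p(y)) ). The function name bigram_pmi and the plain maximum-likelihood estimates (raw counts, no smoothing) are my assumptions; this is not taken from the StackOverflow answer or from NLTK.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Return {(w1, w2): PMI in bits} for every adjacent pair in tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)        # number of word positions
    n_bi = len(tokens) - 1     # number of adjacent pairs
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores
```

A high score means the pair occurs together more often than its individual word frequencies would predict. Rare pairs get inflated scores with raw counts, so a minimum-frequency cutoff is usually worth adding.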
Monday, November 7, 2011
Sunday, November 6, 2011
- Traversal: this is hierarchical structure walking (e.g. directory walk) and mapping (e.g. something like XSLT)
- Boilerplate and macros in modules, then release of declarative CSS and HTML modules
- Rewrite Word using some macros (the "select" tag usage is changing) and rerelease it
- Look again at Wx now that macros work, maybe release Wx 0.01
- Look at macros in the PDF context, probably release PDF 0.01
- Database management and access, then release Decl 0.12 with that
- An error management system, finally, which will probably be Decl 0.13
- Literate programming and PHP katas and examples, then release Publisher
- Probably look at Inline next and integrate with Python; I want access to the NLTK.
- Declarative logic somewhere in here, based on AI::Prolog.
- Wikipedia has a nice table
- The Octave Wiki recommends Inline::Octave, which I find a little questionable, but hey.
- PDL is probably the best Perl alternative; has direct support for sparse matrices, interestingly.
- The Monks look at some comparisons between R/S, Octave, and PDL.
Thursday, November 3, 2011
Wednesday, November 2, 2011
Monday, October 31, 2011
Sunday, October 30, 2011
Saturday, October 29, 2011
Thursday, October 27, 2011
Wednesday, October 26, 2011
- Handwriting recognition on a tablet PC to be translated into OpenMath and thence TeX.
- Selection of portions of a large mathematical formula and specification of specific operations to be carried out (e.g. "solve for this" or "call this theta" or what have you, said operations to be discovered by observation of my private theoretical physicist)
- Maintenance of a log of the trajectory through formula space
- n-fold productivity increases for theoretical physicists
- Public perception of my private theoretical physicist as highly productive physics genius
- Live on p.t.p.'s CERN salary while enjoying Geneva
Tuesday, October 25, 2011
Monday, October 24, 2011
- Qt has officially been spun off by Nokia. Along the same lines would of course be Tk and Wx, and I suppose native W32 by direct DLL access. All these share a lot of concepts that should be organized in parallel, and ultimately a feature in one should always migrate into the others so we're all working with the same set of concepts. They do eventually anyway, so it's kind of an obvious step to formalize that path.
- MediaWiki is, of course, in PHP, and always has bugs outstanding. Hone the semantic understanding tools on that. Same goes for Drupal and WordPress, of course.
- Which brings us to open-access science. This guy, a chemist at Cambridge, appears to be doing some actual data mining of open-access journals. I need to look a little closer at that. And remember: closed source kills.
- And then there's WikiData.
Sunday, October 23, 2011
Thursday, October 20, 2011
Tuesday, October 18, 2011
Monday, October 17, 2011
Sunday, October 16, 2011
- NLTK has a book. It might be a reasonable place to start, just working through that. And there are online courses available.
- I actually got a lot of useful information from Wikipedia, starting with UIMA, the Unstructured Information Management Architecture.
- GATE comes up a lot. It's Java-based.
- Apache OpenNLP is out there. Java.
- Book: Handbook of Natural Language Processing
- Oh, and Amazon recommendations come up with Syntax-Based Collocation Extraction
- Looking for the individual chapters of HNLP seems fruitful: Bing Liu has a whole page on opinion mining and sentiment analysis and even links to a PDF of his chapter of the book (I wonder if the entire book couldn't be reassembled in that manner)
- Liu has his own book on Web data mining.
Saturday, October 15, 2011
- Be mercenary: do what works. But do it.
- Shave yaks as needed: take the time to learn details when you need them.
- Develop sources
- Become the resident expert
- Be the data project you want to see on the Web
Friday, October 14, 2011
Description of Djuggler Enterprise
Data Juggler automates repetitive Web & data tasks without programming code. Use it to create sophisticated scripts for collecting data from the Web, filling Web forms, transforming text files, XML, CSV and database data. The easy-to-use drag-and-drop interface creates scripts that can be deployed as stand-alone Windows executables. Typical application examples:
- Extract competitor's price list from Web pages regularly.
- Extract people data from Web pages.
- Download Web images on a regular basis.
- Get search results from multiple search engines.
- Automated Web testing and load testing.
- Export data to Web-based applications by filling in Web forms.
- Automate web based workflow processes like timesheets.
- Search & replace actions to clean data.
- Transform data from one format to another.
- Convert data from legacy applications to industry standards.
- Automate database migration with Business Intelligence.
- Compare data and create reports.
- Send emails with personalized attachments.
- Server monitoring and reporting.
- Synchronize folders, databases, etc.
- Automate file management & data backup.
Automate IT operations by deploying stand-alone Djuggler scripts. The powerful script designer has many actions and functions like loops, 'if then else' conditions, get text between from html, get html table, get pictures, strip HTML, web macros, read and save Excel, support for popular databases and many more. Demos are included in the setup. Visit www.djuggler.com for the script repository and script service. A Djuggler Personal edition is available as freeware.
Keywords: Web data collection, Application Integration, Data Aggregation, Data Transformation, Report Generation, Batch Processing, Business Intelligence, System Monitoring, Form Filling, Web Scripting, Data Extraction, Web Testing.
- Green and redlighting of known-good, known-bad actors on a per-account basis
- Bayesian training
- Tracking of spamvertised URLs
- Both forwarding and Webmail access
- Spam discussion with specific examples and other community action
- Blogging about spam topics, including botnet identification and such
- Uniform treatment of both email and Web spam
- and yeah, an API...
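For the "Bayesian training" item, here is a minimal naive Bayes sketch, assuming whitespace tokenization and add-one smoothing. The class and method names are hypothetical, and CRM114 does something considerably more sophisticated; this only shows the shape of the train-then-score loop.

```python
import math
from collections import Counter

class NaiveBayesFilter:
    """Toy two-class (spam/ham) naive Bayes text filter."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = Counter()

    def train(self, text, label):
        """Record one message under label 'spam' or 'ham'."""
        self.word_counts[label].update(text.lower().split())
        self.message_counts[label] += 1

    def spam_log_odds(self, text):
        """Log of P(spam | text) / P(ham | text), add-one smoothed."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        size = len(vocab) + 1  # +1 leaves room for unseen words
        totals = {k: sum(c.values()) for k, c in self.word_counts.items()}
        score = math.log((self.message_counts["spam"] + 1)
                         / (self.message_counts["ham"] + 1))
        for word in text.lower().split():
            p_spam = (self.word_counts["spam"][word] + 1) / (totals["spam"] + size)
            p_ham = (self.word_counts["ham"][word] + 1) / (totals["ham"] + size)
            score += math.log(p_spam / p_ham)
        return score

    def is_spam(self, text):
        return self.spam_log_odds(text) > 0
```

Per-account training would just mean one filter instance per account, with the green and redlight lists checked before the filter is consulted at all.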
Tuesday, October 11, 2011
Sunday, October 9, 2011
Saturday, October 8, 2011
- Linear regression in financial analysis [investopedia] - this is magic to a lot of people.
- Linear algebra is nearly universally based on BLAS: the Fortran-written Basic Linear Algebra Subprograms.
- Here's a textbook on elementary linear algebra.
- ATLAS (Automatically Tuned Linear Algebra Software) is an optimized implementation of the BLAS interface, along with a few LAPACK routines.
- And in general this all leads into numerical linear algebra.
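To take some of the magic out of linear regression, here is the one-variable case as a sketch in plain Python. It uses the closed-form least-squares formulas (slope = Sxy / Sxx, intercept = mean_y - slope * mean_x); the name fit_line is mine, and real libraries hand the general multi-variable problem to BLAS/LAPACK-style routines instead of spelling it out like this.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # undefined if every x is identical
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

For points that lie exactly on y = 2x + 1, such as (1, 3), (2, 5), (3, 7), (4, 9), this recovers slope 2 and intercept 1.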
Monday, October 3, 2011
Saturday, October 1, 2011
- XML, binary, and Declarative versions of representation
- LaTeX output
- Octave output and manipulation and parsing back in
- Some kind of overarching systems description a la "semantic Excel"
- Some kind of graphical presentation as active areas a la Equation Editor (but better)
- Economic analyses of popular games
- Simulations of popular games
- Genetic algorithm to devise new ones. Hee.
- Timestamp: don't allow a long period between reading and posting. (I had mixed success with this way back when.)
- Hash: check the IP, timestamp, post # - prevents playback attacks.
- Randomized field names.
- Honeypot fields: invisible (not hidden) fields that, if filled in, are a spam indicator.
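The four form-spam ideas above fit together roughly like this. It is a sketch under assumptions: the secret key, the age limits, the form layout (a dict with "issued_at" and "token" keys) and every function name here are hypothetical, not anything I have actually deployed.

```python
import hashlib
import hmac
import time

SECRET_KEY = b"replace-me"   # hypothetical server-side secret
MIN_AGE = 3                  # seconds; faster than this looks like a bot
MAX_AGE = 3600               # seconds; older forms are treated as stale

def make_token(ip, post_id, issued_at):
    """Hash binding the form to an IP, a post and an issue time."""
    message = f"{ip}|{post_id}|{issued_at}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()

def field_name(token, logical_name):
    """Randomized-looking field name derived from the form's token."""
    digest = hashlib.sha256(f"{token}|{logical_name}".encode()).hexdigest()
    return "f" + digest[:12]

def check_submission(form, ip, post_id, now=None):
    """Return True if the submitted form passes all four checks."""
    now = time.time() if now is None else now
    try:
        issued_at = int(form["issued_at"])
    except (KeyError, ValueError):
        return False
    expected = make_token(ip, post_id, issued_at)
    supplied = str(form.get("token", ""))
    # Hash check: wrong IP, post or timestamp changes the expected token.
    if not hmac.compare_digest(expected.encode(), supplied.encode()):
        return False
    # Timestamp check: neither too fast nor too long after the form was served.
    if not MIN_AGE <= now - issued_at <= MAX_AGE:
        return False
    # Honeypot check: the invisible field must come back empty.
    if form.get(field_name(expected, "honeypot"), ""):
        return False
    return True
```

The page would render its inputs using field_name(token, "comment") and so on, plus one invisible input named field_name(token, "honeypot"). A captured submission only replays from the same IP, against the same post, within the time window.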
Wednesday, September 28, 2011
- BYU's machine learning and data mining course.
- Book: The Elements of Machine Learning. More math, I think.
- Octave documentation.
- Reddit on Machine Learning and on this class.
Tuesday, September 27, 2011
Saturday, September 24, 2011
- NLTK, and a list of suggested NLTK projects for further thought
- OpenNLP is an umbrella project for all kinds of NLP open-source projects
- ClearTK is a Java-based NLP library
- LingPipe ditto
- GATE ditto
- Xerox has a finite-state tool
Wednesday, September 21, 2011
The trigger is an article on CNN [HNN discussion] by Bill Bennett of the Claremont Institute tearing down the concept of spending public money on education (god forbid the teachers' unions should get tax money). There are a few little nasty tricks he throws in. I think it would be possible to analyze this kind of rhetorical treatment, maybe. Eventually. I'm not sure how to start, but it fascinates me.