Sunday, March 31, 2013
Quantopian
Cool. Algorithmic trading platform with arbitrary algorithms and lots of data. This is very cool. Here's the example that brought me to it.
Saturday, March 30, 2013
Put another way...
"A language should be designed in terms of an abstract syntax and it should have perhaps, several forms of concrete syntax: one which is easy to write and maybe quite abbreviated; another which is good to look at and maybe quite fancy... and another, which is easy to make computers manipulate... all should be based on the same abstract syntax... the abstract syntax is what the theoreticians will use and one or more of the concrete syntaxes is what the practitioners will use."
-- John McCarthy, creator of Lisp
Alchemy and mechanical computing
Cool little article about a book of alchemy that contains a mysterious set of tables. It took four hundred years to reverse-engineer the algorithm for generating the tables, and this article is about not only that, but further reverse-engineering a mechanical computer that could have been built when the book was written.
Neat stuff! Very Italian in nature.
Macros and washing machines
Well, here's a timely screed about macros and why they make sense - to which I can only say, "Yeah!"
Which brings me to discussion (again, and we'll keep discussing this until it comes out right!) of macros and code generation.
Let us imagine a system of articles and books that describe a codebase. Some of the codebase may be maintained outside this system; some of it within, because the articles include some literate programming tools that can generate sections of code. (This way the system can be used to start analyzing an existing codebase and slowly grow to encompass all of it, as needed.)
An article is equivalent to a book section, that is, a book consists of a hierarchical organization of multiple articles, presumably related. An article may still have hierarchical structure within it, though, because sometimes you just need that for clarity.
In general, though, a single article addresses a single "thing". That topic could be a feature or a specific function, or it could be a change request touching many different parts of the system. Ideally the maintenance of a complex system would thus have a narrative made up of multiple articles explaining the thinking at each stage.
OK. So in that context, let's assume that some of our literate programming-type tools include arbitrary macros that can be reused. (Literate programming can be seen as writing a number of single-use macros, so generalization of that to reusable macros is no great leap.) Some languages are easier to macro-ize than others, of course: to make truly effective use of macros without leaving the native syntax, we have to parse things. But by extending the native syntax with a template language (as we do in literate programming, actually) we can build macros for any language. The key is the code generation, you see.
It might be a good idea, though, if particularly questionable or novel macros were to be given a kind of "half-way existence", where the macro as well as its expansion are shown in the presentation. Maintenance then has a template or macro to work with, but the full code is shown for clarity. There are plenty of instances where that makes a lot of sense to me.
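To make the macro idea concrete, here's a minimal sketch of the kind of expansion I have in mind: named chunks referenced with a <<name>> syntax and expanded until none remain. The chunk names and contents are invented for illustration; this isn't any particular literate programming tool.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Named chunks, as a literate-programming tool might collect them from
    # articles.  Names and contents are purely illustrative.
    my %chunk = (
        'read config' => 'my %cfg = read_config("app.conf");',
        'open log'    => 'open my $log, ">>", $cfg{logfile} or die $!;',
        'setup'       => "<<read config>>\n<<open log>>",
    );

    # Expand <<name>> references repeatedly until none remain.
    sub expand {
        my ($text) = @_;
        1 while $text =~ s{<<([^>]+)>>}
                          { exists $chunk{$1} ? $chunk{$1} : die "unknown chunk '$1'\n" }ge;
        return $text;
    }

    print expand("<<setup>>\n");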
Curation and CPAN
Construing CPAN as citations made it clear to me what isn't provided in the existing CPAN ecosystem.
That's an interesting thought.
Friday, March 29, 2013
New approach to Decl
I came to Decl from the wrong direction, last time.
My original thought was to build wxWidgets stuff in a way that didn't kill the beauty in my soul, and it kinda worked - but there was a lot of work involved, so I got distracted by some other shiny stuff like PDFs and Word documents and went off doing it. What I ended up with was a clunky interpreter instead of a semantic description language.
What I should have done instead was, well, a semantic description language. I've been giving some thought to that lately, because spring means it's time to start new things, or bring old ones back to new life.
What occurs to me is:
- The basic parser into a data structure is a good thing. Let's keep that. Tags, good. Parameters, good (maybe drop the distinction between options and parameters because I couldn't keep them straight anyway). But the basic idea is good: a set of nested tags describing arbitrary structure that can then be mapped onto programming constructs, the tags representing nouns instead of the verbs we see in non-declarative programming. (There's a little parsing sketch after this list.)
- The semantics of tags need to be represented more explicitly. I did a lot of dancing around trying to map tags onto concepts behind the scenes, and it was unmanageable. Instead, I should explicitly state a mapping - as an appendix or footnote.
- Yeah, you heard me. We've developed an entire structure of documentation over centuries that is used to address semantic complexities by using out-of-band channels to clarify ambiguity or add information that could be distracting to the flow of presentation. Why should everything be linear in source code? Because it's easier for the compiler to understand? Balderdash. This is really close to where Knuth is going with literate programming - but Knuth worked with compiled languages, and I don't. I don't want to respect identifier uniqueness; my compiler should get from context what I mean. Knuth worked at too low a level. It's time to kick it up a notch.
- I should be able to use citations, too, to include semantic presentation that bears on a given solution topic. Citations here are just ("just") libraries - or macros. Boilerplate and templates. APIs to external functionality. CPAN modules. Anything that has already been worked out to address a solution space, can be a cited reference.
- Decl, although written in Perl, should not be Perl-bound. Ideally I should be able to use Decl to define a program in any language, or to define it in several languages at once. I should be able to use the same framework to use NLTK or data science stuff in Python, describe machine learning algorithms in Octave or R, define modules for CPAN, spin out a Web app in JavaScript, or write low-level things in C or assembler. Decl should be an approach, not a chain. The first time around, it was an interpreted language written in Perl. That was wrong. Decl should actually be a compiler - maybe a compiler on the fly, but still a compiler.
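And here's the parsing sketch promised above: a bare-bones reading of the "nested tags" idea, where deeper indentation nests one tag under another. The sample input is made up and this is only the shape of the thing, not the real Decl parser.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Data::Dumper;

    # A bare-bones indentation parser: each line is "tag rest-of-line", and a
    # deeper indent nests the tag under the one above it.
    sub parse_decl {
        my ($text) = @_;
        my $root  = { tag => 'root', children => [] };
        my @stack = ([ -1, $root ]);                  # [indent level, node]
        for my $line (split /\n/, $text) {
            next unless $line =~ /\S/;
            my ($indent) = $line =~ /^(\s*)/;
            (my $stripped = $line) =~ s/^\s+//;
            my ($tag, $rest) = split ' ', $stripped, 2;
            my $node = { tag => $tag, text => $rest // '', children => [] };
            pop @stack while length($indent) <= $stack[-1][0];
            push @{ $stack[-1][1]{children} }, $node;
            push @stack, [ length($indent), $node ];
        }
        return $root;
    }

    # Illustrative input only - not real Decl syntax.
    print Dumper(parse_decl(<<'END'));
    window "Main"
       field name
       button ok "OK"
    END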
How's that for an Easter resolution?
Context restoration
One of the things I have the worst trouble with, in programming, is context restoration. I suspect this is a large part of what makes an IDE valuable, honestly.
The context of a task is just that - it's workflow. It's all the files involved in an activity, where you are in them, your notes and insights, bookmarks to documentation involved, and so on. It's something a workflow engine would have to provide anyway for any longer-running task.
But it can also be seen as the semantic context of a given task. A semantic programming system should include not just the current code that addresses a given set of needs, but also the history of that code, why changes were needed, and the thoughts of human programmers as they addressed them. There should be rich context for everything.
If you have to laboriously write it all up, though, you'll never get anything done. A semantic programming system should be a sort of assistant that (hopefully) understands what you're doing to the point that it can explain it to others.
Thursday, March 28, 2013
Web framework benchmark
Wow - here's a pretty amazing graph comparing speed of trivial JSON serialization of a freshly created object over about twenty different frameworks. There's a huge spread.
On that note, let's get a Web framework link dump here, OK?
- Flask Megatutorial - a series on Flask that I think I'm going to work through.
- Getting started with Django
- and Django Best Practices
Wednesday, March 27, 2013
R: master troll
A good article on R and how it's essentially the uber-Perl in terms of having more than one way to do things.
Tuesday, March 26, 2013
Content generation
So apparently "content spinning" is a thing (like here) - take an article and munge the text so a search engine will accept it as being a different article. It's a ... I guess it's for SEO, to make blog posts look real. Or something.
I still think it would be fun to autogenerate articles using found content on the web. Give it a keyword or two, it finds some articles, spins them in kinda this way, writes a blog post. People would pay for that.
And I still want to write those paraphrasing tools for repurposing content from The AP.
I need a sabbatical, to get some of these projects off the ground.
PDF generation as a service
This is neat. Template generation into PDF documents as a service - very slick!
Diagramming link dump
I got a few diagramming links piling up.
- JS sequence diagrams, very nice look, parses a sequence diagram description language.
- Chart.js looks neat.
- A guy that blogs about UML, talking about different modes of use for it, which I find pretty enlightening. In case you're wondering, I favor "UML as programming language".
- A UML sharing service (UMLbin). Neat idea. But the only download option is a PNG - no source. So that's disappointing.
- And Visualizing Social Structures, a historical retrospective.
Python data tools
O'Reilly: "they keep getting better and better". Yeah. I'm gonna have to reimmerse myself in Python, I guess. *snf* I'll miss you, CPAN!
Monday, March 25, 2013
Expect
Expect is a neat Tcl extension that provides scripting on Unix for command-line programs, especially interactive ones. This is essentially what I'd like to do (unfortunately from my Windows box) through ssh, although it would be great to be able to do it on Windows as well.
Turns out Windows basically doesn't permit this, so the Windows ports don't work - apparently at all - because Windows treats the console specially in some way, and things invoked through a pipe don't stay interactive. It's complicated.
But still - the basic idea is quite sound. For really effective sysadmin work, I'd like to put together two ends: first, an "action worksheet" for complex command-line invocations, and second, just such an interactive back-and-forth. If it only works through ssh, then so be it - I've got a Cygwin ssh working fine that could probably do the trick.
[Also, note on using expect in noisy connection environments.]
Oh! Later, I found Net::SSH2::Expect (along with Net::SSH2::SCP and ::SFTP and the fact that Net::SSH2 is itself part of Strawberry, so part of my faith in the CPAN ecosystem was restored this day). [The Monks have some useful sample code. I think maybe SSH2 could use a tutorial.]
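For reference, a rough sketch of the kind of interactive back-and-forth I mean, using the CPAN Expect module to drive ssh. The host, password, and prompt patterns are all made up, and on my Windows box this would only fly under Cygwin.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Expect;    # the CPAN Expect module; Unix or Cygwin, not native Windows

    # Host, password, and prompt patterns are invented for illustration.
    my $exp = Expect->spawn('ssh', 'admin@example.com')
        or die "couldn't spawn ssh: $!";

    $exp->expect(30, '-re', 'password:') or die "never saw a password prompt\n";
    $exp->send("sekrit\n");

    $exp->expect(30, '-re', '\$\s*$')    or die "never saw a shell prompt\n";
    $exp->send("uptime\n");
    $exp->expect(10, '-re', 'load average[^\n]*');
    print "saw: ", $exp->match, "\n" if defined $exp->match;

    $exp->send("exit\n");
    $exp->soft_close;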
Postgres
A couple of excellent slideshows about Postgres and its Django bindings that are really pretty convincing - brief, but that's because they're slideshows.
Sunday, March 24, 2013
Life in APL
Wow. This YouTube presentation is really quite good and kind of makes me want to learn both APL and linear algebra.
Email bounce parsing and mapping of modules
The definitive CPAN module for extracting information from email bounces is Mail::DeliveryStatus::BounceParser, and it's still relatively actively maintained - it just had an update adding a couple of cases in January. There are a few problems, though. First, there are bugs listed that are years old and (to me) look relatively important - a bounce with multiple bad addresses doesn't get all of its addresses parsed out, that kind of thing.
Second, its report object subclasses Mail::Header. So sure, that's not horrible, but still - I'd much rather have something that can look at an Email::Abstract object or just the headers of one (Email::Simple::Header) and extract an object that doesn't hook into an alternative mail ecology.
Third, some of the cases might be dubious. I'd rather have something that uses some kind of tabular organization of filtration cases or something - this is wishy-washier, but it does contribute to my sense of unease in going with this module casually.
But the amount of information in this module is outstandingly valuable - it represents years of tinkering with bounce messages and trying to deal with the weird ones. So I don't want to just start over, either, and if I port it into a different framework I'd like to be able to keep up with any further updates.
This is a common set of problems in reusable software development. The source code/module level is not really a fantastic level of granularity for knowledge preservation - it's just better than anything else we've got yet in common use.
Another case I ran across was a translation tool called Anaphraseus - it's written as an open-source replacement for the TRADOS Word tools (essentially a work-alike), but it works only in OpenOffice. I use Word, so if I want to use Anaphraseus I'd need to port it - but I'd like to be able to keep in sync with the official release, because they do things that I don't think of.
In general I think of these situations as requiring a tool I think of as a "cross-parser"; they take a text in one high-level language and translate it into another, and maybe back. They allow a continuous mapping of knowledge into two different expressions, in other words.
I need to research this general area of problems. I'm sure it's old hat to somebody.
Specifically, though, this week, I'm proposing Email::Simple::BounceParser, which would mirror Mail::DeliveryStatus::BounceParser, perhaps through some kind of database representation in the middle. I have no idea how that would work exactly.
In general, it would be nice to be able to define some kind of abstract module for a specific set of "knowledge" that would be crystallized as specific module instances. This is essentially what parser generators already do (Parse::RecDescent does exactly this); it would be nice to formalize this technique with a more visible system of expressing it. Kind of a code template thing, I guess.
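Just to pin down what I mean by a tabular organization of filtration cases, here's a toy sketch. The two patterns are invented examples - nowhere near the accumulated wisdom in Mail::DeliveryStatus::BounceParser - and the function names are mine.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Each "case" is a pattern plus an extraction rule.  Toy patterns only.
    my @cases = (
        { name    => 'rfc3464-dsn',
          match   => qr/^Final-Recipient:\s*rfc822;\s*(\S+)/mi,
          extract => sub { lc $1 } },
        { name    => 'qmail-style',
          match   => qr/^<(\S+)>:\s*\n.*?does not exist/msi,
          extract => sub { lc $1 } },
    );

    # Return every bad address found, not just the first one.
    sub parse_bounce {
        my ($raw_message) = @_;
        my @addresses;
        for my $case (@cases) {
            while ($raw_message =~ /$case->{match}/g) {
                push @addresses, $case->{extract}->();
            }
        }
        return @addresses;
    }

    # usage: my @bad = parse_bounce($raw_bounce_text);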
Saturday, March 23, 2013
Followup to email
One thing that's actually quite important but is easy to forget is reliability. When I lose contact with the mail server, I need to fail gracefully. If I can't contact the server because the machine is offline, I need to take appropriate action. Under no circumstances should things just crash or die.
There's probably a principled way of approaching this as a design goal, but I don't know precisely what it is.
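One candidate principle: wrap every contact with the server in a retry-with-backoff that reports failure instead of dying. A bare-bones sketch, where connect_to_imap() is a placeholder rather than a real API:

    use strict;
    use warnings;

    # Retry a flaky operation with exponential backoff instead of dying.
    # $op is a coderef; returns its result, or undef once every attempt fails.
    sub with_retries {
        my ($op, %opt) = @_;
        my $tries = $opt{tries} || 5;
        my $delay = $opt{delay} || 2;          # seconds
        for my $attempt (1 .. $tries) {
            my $result = eval { $op->() };
            return $result unless $@;
            warn "attempt $attempt failed: $@";
            sleep $delay;
            $delay *= 2;
        }
        return undef;                          # the caller decides what "offline" means
    }

    # usage sketch - connect_to_imap() is a placeholder, not a real function:
    # my $imap = with_retries(sub { connect_to_imap() }, tries => 4)
    #     or warn "mail server unreachable; queuing work for later";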
Friday, March 22, 2013
Email workflow
So here's my current thinking about email handling.
- Incoming mail has to go through various filters and classifiers before humans ever see it.
- Bounces are first - they contain some information no matter what.
- Some filters are database-backed: assignment to customers, assignment to ongoing threads by response header (and thus to tasks). Since a lot of what I want to do with email is actually workflow, assignment to tasks, projects, and task categories is important and that happens at this stage, mostly.
- Then there are automatic notifications - logs and heartbeats. That stuff can go into an automatic-processing bucket without much further ado.
- Next comes categorization of undifferentiated stuff. The first line of defense is spam filtration. Distributed solutions are useful here because it's best to triangulate over a wider recipient base. (Despammed comes in here - and more on Despammed in a bit.)
- Ham can be autofiltered as well, although I'm not as convinced that's terribly useful. It's an active area of research, though (see a list below).
- Finally, you end up with a little dribble of personal mail and "things that could be new tasks" or related to old ones.
- Except for that last stage, all this happens automatically in the background.
- Once categorized, each incoming mail triggers a system response: this can range from simply marking or putting into a folder, to kicking off an arbitrary program for data storage or statistics, or notifying the user. In other words, workflow.
- Task mail is intended to be relatively small in scale; when a task is finished, its mail is archived with it, possibly leaving a trace in an index for a while.
- Known task messages have their message IDs stored for detection of follow-ups.
- Known task contacts have their email IDs or domains marked for the same reason.
- Longer-term projects can be archived monthly or when the number of messages gets unmanageable.
- Personal conversations are treated as projects in this sense.
- Responses (outgoing mail) are stored with the conversations responded to. Gmail does this and it really makes sense.
- Document management of attachments happens in there somewhere, maybe even organization into some kind of versioned track (identical attachments can at least certainly be stored singly and noted as identical or something).
And again, all that could happen on the server or on your machine, all using an IMAP or other client or just autoinvoked upon receipt. And in fact it might be rather useful to have some subset of this running at Despammed and get serious about that poor old thing.
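As a sketch of how those stages could hang together, here's the pipeline as a chain of classifiers where the first one to claim a message names the action. All the patterns, task names, and actions are invented.

    use strict;
    use warnings;

    # The stages above as a chain of classifiers: the first one to claim a
    # message wins and names an action.  Patterns, tasks, and actions invented.
    sub looks_like_spam { 0 }                  # stand-in for a real spam filter

    my @pipeline = (
        sub { $_[0]{subject} =~ /delivery status|returned mail/i
                  && { action => 'parse_bounce' } },
        sub { $_[0]{from}    =~ /\@bigcustomer\.example\.com$/i
                  && { action => 'file_under_task', task => 'customer-acme' } },
        sub { $_[0]{subject} =~ /^\[cron\]|heartbeat/i
                  && { action => 'autoprocess' } },
        sub { looks_like_spam($_[0])
                  && { action => 'spam_folder' } },
    );

    sub route_message {
        my ($msg) = @_;                        # hashref with from/subject/body
        for my $classifier (@pipeline) {
            my $verdict = $classifier->($msg);
            return $verdict if $verdict;
        }
        return { action => 'human_inbox' };    # the little dribble at the end
    }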
A couple of promising links I found searching on "email categorization":
- http://people.cs.umass.edu/~ronb/papers/email.pdf looks like a great place to start with machine learning of email categories.
- Implicit Mailroom seems to do a lot of the above, in the Microsoft Exchange context. Certainly a good idea to glom onto their feature list, anyway.
Anyway, the client is more or less independent of all that and could be anything. I'm going on the assumption that client-facing folders are going to be virtual (that is, tag index instead of a physical folder) and stored with Email::Store. Virtual, because right now moving archived folders back and forth to reflect current activity levels of customers takes forever and really is backup-unfriendly. Thunderbird's method of email management just isn't really cutting it for me any more. (Gmail's even worse - even apart from the fact that Google is reading my mail.)
So there's my line in the sand. That's what I want in an email client engine. Kind of an assistant in a box.
Wednesday, March 20, 2013
Website design tools
Machine learning post roundup
Ugh, I've been falling behind in blogging out the links lately. Here are a few things I've run across having to do with machine learning, NLP, and neural networks lately (all the stuff people don't call AI any more):
- Ersatz is an upcoming platform for neural network training in the cloud. Neat!
- Topology optimization for physical structures with genetic programming, with a fun animation
- A CMU class in ML
- Sentiment analysis in Python
- A field guide to genetic programming, nice name
And finally, an interesting post about the background:
- Prognostication about the future of the ML industry
Bounce processing
This bubbled up on my random post bar on the right, and deserves a little reinforcement: MRjob for Postfix bounce processing.
Since I'm looking at email again lately, this was good timing: an autohandler for postmaster bounce notices would be an incredible boon to me, and honestly shouldn't be difficult to write. So ... I should write it.
Taking over CPAN modules
Here are some notes on taking over CPAN modules. I can officially confirm that this works, because I have taken on co-maintainership of Iterator::Simple. (Yeah, I'm still pretty enthused.)
- The official PAUSE guide to taking over a module.
- Identifying CPAN modules that need help
In the comments to the latter one, Neil Bowers describes his own procedure, which is the one I used:
- Find and solve a problem with a module. (This is sometimes the hard part.)
- Post a patch to RT.
- Fork the module on Github (or, if it isn't on Github yet - which is a little pie-in-the-sky - just put the module on Github yourself; the point is to have an open repository you can refer to)
- Don't forget to update the repo with meta-information to point to itself. (I didn't do this for Iterator::Simple, durr. There's a Makefile.PL sketch after this list.) More Github integration is a benefit to everyone.
- Two weeks after this, email the author at whatever email addresses you can find, remind him or her of the patch, point to the Github repo, and offer to take over co-maint. Copy modules@perl.org on this email.
- One month after that, notify PAUSE that you haven't heard back.
- Wrap and upload your new version.
Wash, rinse, repeat: slow but certain world domination.
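And since I keep forgetting the meta-information step: in an ExtUtils::MakeMaker Makefile.PL it looks roughly like this. The GitHub URL and file paths are placeholders, and META_MERGE wants a reasonably recent ExtUtils::MakeMaker.

    use ExtUtils::MakeMaker;

    # A minimal fragment; the repository and bugtracker URLs are placeholders.
    WriteMakefile(
        NAME         => 'Iterator::Simple',
        VERSION_FROM => 'lib/Iterator/Simple.pm',
        META_MERGE   => {
            resources => {
                repository => 'https://github.com/your-id/Iterator-Simple',
                bugtracker => 'https://rt.cpan.org/Dist/Display.html?Name=Iterator-Simple',
            },
        },
    );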
The dire state of WordPress
Here's a pretty fascinating post on WordPress. It's the backbone of 17.4% of the sites in the world, and its structure is weird. Semantically, in other words, its concepts map poorly onto what they're being asked to do.
Which makes me ask: what would a ... "concept mapper" look like that could look at a WP site, map it back and forth onto what it's doing, and allow people to work with it more rationally? In other words, treat the WP site as the compiler target, and allow people to write higher-level things and compile them into WordPress.
Think on that.
Friday, March 15, 2013
Expression
Template expression is a subject that comes up a lot. It's how you create a document, a document being a more-or-less structured chunk of text.
In its more restricted form, template expression takes a map of values and writes them into a tagged template to create a document. As you increase the complexity of the template language, you approach natural language and you start talking about structured text or structured documents.
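In its simplest form that's maybe ten lines of code. A sketch, with a made-up [% name %] tag syntax rather than any particular template engine:

    use strict;
    use warnings;

    # A map of values written into a tagged template.  The [% name %] syntax
    # is just an illustration, not any particular template engine.
    sub fill_template {
        my ($template, $values) = @_;
        (my $out = $template) =~ s/\[%\s*(\w+)\s*%\]/
            defined $values->{$1} ? $values->{$1} : ''/ge;
        return $out;
    }

    print fill_template(
        "Dear [% name %],\n\nYour invoice [% invoice %] is attached.\n",
        { name => 'Frau Schmidt', invoice => '2013-041' },
    );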
I think of this as "expression" in general of a data structure of arbitrary complexity. There's a sliding scale there from simple stringification to serialization to templates to structured text that probably should be explicitly represented in this procedure language - not least because the preparation of documents comes up in business so very often.
(In a sense, if we take binary data to be a kind of document, then compilers are a form of expression as well. That's kind of an interesting notion.)
That's your disjoint thought stream for the day. Back to work, you!
Thursday, March 14, 2013
Data modeling and semantics
I've been musing more about the semantics of data modeling lately - or really rather about the fact that data modeling is a form of semantic manipulation. The thing that makes semantics interesting is how semantic structures can be mapped onto other semantic structures. That is, the mapping, or recognition, of structures is really what semantics buys us.
In the case of accounting (sorry, I do tend to fixate on particular applications for months or years at a time), it would be instructive to gather the various data models used in open-source software (well, and open formats such as QIF used for non-open software) and do a kind of line-by-line comparison. A mapping, in fact - a mapping onto the semantic constructs that accountants use to talk about accounting.
That nexus is where semantic programming resides, in potential anyway.
At any rate, comparison of projects in this manner would allow us to identify certain features of accounting data structures that were incorporated into or absent from different models. Description of those variants is also part of modeling, and a full description would permit us to auto-generate data migration tools.
And once you can migrate data back and forth between different representations, well, then you have semantic data management, I guess. Not (semantic data) management, that is, but semantic (data management). You've started to graduate from data to knowledge.
Wednesday, March 13, 2013
Followers
To my surprise, this blog now has a follower (hi, Karsten!). Clearly, this is but the first step on my journey to world domination.
Sunday, March 10, 2013
Accounting and data management
So I've been looking at accounting in more detail lately (it's coming up on tax season, and then there's data modeling, and business plans) and I've been having some various epiphanies.
I've always thought of accounting as being primarily a database application. Which it is, naturally, but I've come to realize that the purpose of accounting isn't actually data management - the purpose of accounting is to predict the future. Well - and satisfy the tax authorities and make sure your customers don't forget to pay you, but one of the main reasons you do accounting is so you can plan.
In looking for data models for accounting, I first looked at the mother of all data model sites for a basic model. Data models are fluid (they don't get treated as very fluid, but at the semantic level, they should be seen as fluid). In other words, there are a lot of different ways to model accounting data. There are standards, of course, some of which are mandated by governments so that corporate reports and statements take a standard form, and some of which are just good ideas - but they only make sense if you accept that the purpose of accounting is to tell a story about a company and explain how that history allows you to make such-and-so a prediction about next year.
Case in point: the chart of accounts. The chart of accounts is the list of all the separate accounts that a given accounting system tracks for a company. By convention, they're numbered, and the first digits of the numbers have meaning. ('1' being assets, '2' liabilities, and so on.) The reason for this numbering system may not be obvious to the programmer - but account numbers must often be written down on papers or whatever, and if they have internal structure it's easier to see what's what on these paper documents. In other words, the numbering system is an interface to traditional document management and is justified.
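In code, that convention is nothing more than a lookup on the leading digit. The 1 and 2 are from the convention above; the 3-5 assignments below follow the usual textbook scheme and aren't mandated by anything in particular:

    use strict;
    use warnings;

    # The leading digit of an account number carries the category.
    my %category = (
        1 => 'Assets',
        2 => 'Liabilities',
        3 => 'Equity',
        4 => 'Income',
        5 => 'Expenses',
    );

    sub account_category {
        my ($account_number) = @_;
        return $category{ substr($account_number, 0, 1) } || 'Unknown';
    }

    print account_category(1010), "\n";    # Assets    (e.g. a bank account)
    print account_category(5200), "\n";    # Expenses  (e.g. office supplies)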
So here's what I learned about planning a chart of accounts. If you initially use a very simple arrangement, but then as the company grows you start ramifying into separate accounts for separate purposes, then as the page I linked to above notes (and I highly recommend that entire site, actually - very information-rich!), you lose the ability to compare year to year. Because the point of accounting is to compare year to year! (And also to make sure you get paid for invoices, and you pay your own incoming invoices, etc.)
Which brings me to data management. There's a concept of master data management (MDM, which to me means the Manufacturing Data Management system I worked on at Eli Lilly in the 90's, but that's another story entirely) which can be seen as version control for data that warrants it. Master data tends to be complex, slow-changing, benefits from versioning, and is largely global (although for reasons of performance it can be mirrored here and there). The processes of master data management can be seen as relatively independent of the processes that simply use the master data for other purposes (which are transactional processes).
Now clearly, master data management and data model management are essentially the same thing: they involve definition of the semantics of a given company. They can evolve over time, but if they do, you need to keep track of how they've done so. For example, our list of customers naturally changes over time; a properly versioned customer master can tell us when a customer was a valid customer, when they stopped being a customer, and so on, and the customer master at a given point in time can be seen as a snapshot of that process. The same can be said of the data model; as our data management needs grow, we start to make distinctions that were unimportant before - perhaps we have different processes for retail customers on our website and larger contracted customers, and so having a single record that addresses both sets of needs may be too complex. This is an area of data management that seems to be really poorly considered and addressed, but maybe I'm just too naive at this point.
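A sketch of what I mean by a properly versioned customer master: each record carries a validity range, and "the customer master at a point in time" is just a filtered view. The dates and fields are invented.

    use strict;
    use warnings;

    # Each customer record carries a validity range; "the customer master as
    # of date X" is just a filtered view.  Dates and fields invented.
    my @customer_versions = (
        { id => 42, name => 'Acme GmbH', segment => 'retail',
          valid_from => '2010-01-01', valid_to => '2012-06-30' },
        { id => 42, name => 'Acme GmbH', segment => 'contract',
          valid_from => '2012-07-01', valid_to => '9999-12-31' },
    );

    sub customer_as_of {
        my ($id, $date) = @_;              # ISO dates compare correctly as strings
        for my $v (@customer_versions) {
            return $v if $v->{id} == $id
                      && $v->{valid_from} le $date
                      && $date le $v->{valid_to};
        }
        return undef;                      # not a customer on that date
    }

    my $c = customer_as_of(42, '2011-03-15');
    print "$c->{name} was a $c->{segment} customer\n" if $c;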
Anyway, back to accounting. In terms of the chart of accounts, you can easily see that accounts fall into a hierarchy that can ramify to an arbitrary extent - but as they ramify, if you want to preserve comparability, you need to "back-categorize" existing entries to fall into one of the new subcategories. This is arguably what that post from last week (or last month, time flies) is doing using machine learning; taking the posts from Chase as a sort of general ledger, it categorizes them into subcategories using a machine-learning algorithm I haven't examined in any detail. The same kind of thing could be done if we just split a general asset ledger into a petty cash and bank account setup, for example.
If we look at the overall process of accounting, the "accounting cycle", we see that there are actually two phases involved. The first phase is really not even a phase - it's an ongoing thing. As each transaction happens (an invoice is received, money changes hands, etc.), it's identified and a determination is made of its significance to the accounting system. That is, if we receive money, we determine which account should be credited, why we got it (which invoice the customer is paying), and so on.
Then, periodically, we close the books - we reconcile all the outstanding weirdnesses, fix things up with corrections if necessary, and issue statements that can be given to shareholders, governments, and management planners to explain what happened to the company and what can be expected to happen next period or next year.
That's what I learned about accounting in my browsing today. There are a couple of side points on data management I'd like to address as well.
First is business rules. As noted in the data model tutorial here, business rules are generally implemented as constraints on the database that prevent certain nonsensical things from happening - an order for a non-existent product, for example.
Second, a canonical data model ([here] and [here]), a popular concept lately due to service-oriented architecture, can be seen as a lingua franca between two specific data models. We can define transformations between each specific model to the canonical model to permit communication between the specific systems.
Third, a link to Microsoft's modeling tool, such as it is, and the observation that my own notion of modular data modeling really seems underrepresented out there still, and maybe there's a need for it.
Actually, for basic accounting concepts the GnuCash manual is pretty fantastic.
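To make the canonical-model point concrete: one transformation per system to and from a shared shape, so N systems need 2N mappings instead of N*(N-1) point-to-point ones. All three record shapes below are invented for illustration.

    use strict;
    use warnings;

    # Two system-specific invoice shapes mapped through a made-up canonical one.
    sub legacy_to_canonical {
        my ($rec) = @_;
        return { customer_id => $rec->{custno},
                 amount      => $rec->{amt},
                 currency    => $rec->{cur},
                 issued_on   => $rec->{invdate} };
    }

    sub canonical_to_webshop {
        my ($c) = @_;
        return { customer   => $c->{customer_id},
                 total      => $c->{amount},
                 currency   => $c->{currency},
                 created_at => $c->{issued_on} };
    }

    my $webshop_invoice = canonical_to_webshop(
        legacy_to_canonical({ custno => 42, amt => 99.50,
                              cur => 'EUR', invdate => '2013-03-10' }));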
Saturday, March 9, 2013
OpenNews Source
A soon-to-be-available repository for journalism code (or an information source about said code and repositories): OpenNews Source. Where it might end up is unclear, but it's probably something to watch for.
Friday, March 8, 2013
GAE for static file hosting
A post noting that App Engine is actually another valid strategy for hosting static files. That's pretty cool.
Tern
Tern is to be a JavaScript analysis engine for embedding into editors. The author made his Kickstarter goal, so the code will be open (still not sure what I think of this trend).
I'm ambivalent about Tern per se, but I really like the general idea of parsing editors; it's halfway to a semantic editor.
A couple of JavaScript posts
And here are a couple of neat site-design JS posts from this last week:
A couple of sysadmin/security posts
Two good ones:
- My First 5 Minutes On A Server; Or, Essential Security for Linux Servers | Bryan Kennedy
- mmb.pcb.ub.es/~carlesfe/unix/tricks.txt
That last one has a lot of stuff I don't see great use for (personally) but I want to kind of get a little database of snippets going for this kind of thing.
NoSQL pondering
This guy seems to have some good thoughts about databases. I should probably read more of him.
Thursday, March 7, 2013
More accounting
- SQL-Ledger ERP
- Its newer fork Open Source ERP: accounting, CRM and more | LedgerSMB
- And command-line based Ledger, a powerful command-line accounting system
I'm really looking for the back end here, essentially an accounting ORM or something. It ought to be easy to extract that from SQL-Ledger/LedgerSMB.
Note that both of those are firmly PostgreSQL-based. I don't particularly want to be bound to a single database system, because I think the overall concepts should be defined at a higher level. They're not interested in that approach (quoting the LedgerSMB guys), so extraction is probably my best bet.
Also it would be good to compare-and-contrast with OpenERP and Sage ERP. And anything else I can find. The usual plan. Probably on the usual timetable. Sigh.
PHP still the right tool for many applications
This same question keeps coming up, but this answer is a good one. Tl;dr: the usual - PHP is ubiquitous and reliable. End of story.
Sunday, March 3, 2013
PaperPort .MAX files
I've still got about 85 PaperPort .MAX files from 2004-2006, when we had a scanner whose bundled software saved everything in that format. It's proprietary, which is a real pain now that I no longer have the software to read it. None of the files are drop-dead crucial, but it turns out there's no converter available at all. And I'm hardly the only person in this pickle. Proprietary formats are a Bad Thing.
Well, but honestly - how hard could it possibly be to come up with something that could convert these to a bitmap or something?
That would be an interesting exercise in file format management. I should do it. I already know the first four bytes.
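Step one of that exercise is just sniffing the header. Something like this - and I'm deliberately not writing down what the actual .MAX signature is here:

    use strict;
    use warnings;

    # Dump the first few bytes of a file as hex - step one with an
    # undocumented format.
    my $file = shift @ARGV or die "usage: $0 file.max\n";
    open my $fh, '<:raw', $file or die "can't open $file: $!";
    read $fh, my $header, 16;
    printf "%s: %s\n", $file,
        join ' ', map { sprintf '%02x', ord } split //, $header;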
Friday, March 1, 2013
Memetracker
Considering, just for a moment, that I might take up the practice of following blogs (which would be stupid because I don't even have time for the things I already do right now), I looked at tools for following blogs, went from there to the concept of a memetracker, and from there came to a Drupal module of that name that's looking for a maintainer.
I could see that.
But yeah. Meme tracking.
- Memetracker.org - moribund, but follow the authors' links for loads of interesting things.
- Data from Memetracker - for 2008 and 2009 in support of a paper. As is the whole site, it appears. I could see doing that, too.
Math/pattern/Mathematica blog
This entire Tumblr blog is just fantastic. It makes me want to do this stuff, too.
Some more that he links to (note I'm not even troubling myself to pretty the links up):
CSS for layout considered dangerous
This is a little old, but its point is very well-taken. For tabular layout, just use a freaking table.
Crowdfunding science
Seems like steam-engine time for scientific crowdfunding. I love Internet disruption.
The Chaos Monkey
I like this; a post on Netflix's approach to scalability on Amazon's cloud: they have a Chaos Monkey that randomly takes down their component servers to see if the remainder are really running as resiliently and failing as gracefully as they should.
That's brilliant.
Literate programming reprise
So Jeremy Ashkenas has a post on literate CoffeeScript, which is a feature of 1.5, apparently. And he links to Knuth's own CWEB write-up of ADVENTURE, the original adventure game.
As always when reading Knuth, it makes me think. One of the thoughts it provoked yesterday was this. Literate programming as currently conceived has been criticized as not being sufficiently cognizant of the practice of reusability - which is true. On the other hand, it might actually be nice to track the evolution of subroutines (say) from version to version as one's own skill and knowledge of a given domain grows.
In other words, reusability in the form of a library is also not the end goal. You can kind of reconstruct the history of a concept in git. Kind of. I've never actually done it. But it might be interesting to have some kind of index of code at (yeah) a semantic or descriptive level that explicitly makes it clear what it is supposed to do and how it reflects increasingly refined knowledge of the domain.
I'm having trouble articulating this. Hopefully this flailing around will be enough for me to reconstruct the notion later.