- Incoming mail has to go through various filters and classifiers before humans ever see it.
- Bounces are first - they contain some information no matter what.
- Some filters are database-backed: assignment to customers, and assignment to ongoing threads (and thus to tasks) by reply headers such as In-Reply-To and References. Since a lot of what I want to do with email is actually workflow, assignment to tasks, projects, and task categories is important, and most of it happens at this stage.
- Then there are automatic notifications - logs and heartbeats. That stuff can go into an automatic-processing bucket without much further ado.
- Next comes categorization of undifferentiated stuff. The first line of defense is spam filtration. Distributed solutions are useful here because it's best to triangulate over a wider recipient base. (Despammed comes in here - and more on Despammed in a bit.)
- Ham can be autofiltered as well, although I'm not as convinced that's terribly useful. It's an active area of research, though (see a list below).
- Finally, you end up with a little dribble of personal mail and "things that could be new tasks" or related to old ones.
- Except for that last stage, all this happens automatically in the background.
- Once categorized, each incoming mail triggers a system response: this can range from simply marking the message or filing it into a folder, to kicking off an arbitrary program for data storage or statistics, or notifying the user. In other words, workflow.
- Task mail is intended to be relatively small in scale; when a task is finished, its mail is archived with it, possibly leaving a trace in an index for a while.
- Known task messages have their message IDs stored for detection of follow-ups.
- Known task contacts have their email addresses or domains marked for the same reason.
- Longer-term projects can be archived monthly or when the number of messages gets unmanageable.
- Personal conversations are treated as projects in this sense.
- Responses (outgoing mail) are stored with the conversations responded to. Gmail does this and it really makes sense.
- Document management of attachments happens in there somewhere, maybe even organization into some kind of versioned track (at the very least, identical attachments can be stored once and noted as identical).
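A minimal sketch of the filter order in the list above, using Python's standard `email` package. Everything named here is an assumption made for illustration: the lookup tables stand in for the database-backed filters, the `is_spam` hook stands in for whatever spam filter is plugged in, and the bounce test is only a rough heuristic.

```python
from email import message_from_string
from email.message import Message
from email.utils import parseaddr

# Hypothetical lookup tables; a real engine would back these with a database.
KNOWN_TASK_MESSAGE_IDS = {"<task-42@example.com>": "task-42"}
KNOWN_CUSTOMER_DOMAINS = {"customer.example": "acme"}
NOTIFICATION_SENDERS = {"cron@server.example", "heartbeat@server.example"}


def is_bounce(msg: Message) -> bool:
    # Rough heuristic: delivery reports are usually multipart/report,
    # and bounces usually carry an empty envelope sender.
    if msg.get_content_type() == "multipart/report":
        return True
    return msg.get("Return-Path", "").strip() == "<>"


def find_task(msg: Message):
    # Follow-ups quote earlier Message-IDs in In-Reply-To / References.
    refs = (msg.get("In-Reply-To", "") + " " + msg.get("References", "")).split()
    for ref in refs:
        if ref in KNOWN_TASK_MESSAGE_IDS:
            return KNOWN_TASK_MESSAGE_IDS[ref]
    return None


def find_customer(msg: Message):
    # Match the sender's domain against known customer domains.
    _, addr = parseaddr(msg.get("From", ""))
    domain = addr.rpartition("@")[2].lower()
    return KNOWN_CUSTOMER_DOMAINS.get(domain)


def classify(msg: Message, is_spam=lambda m: False):
    """Return (category, detail), trying the filters in the order listed above."""
    if is_bounce(msg):
        return ("bounce", None)
    task = find_task(msg)
    if task is not None:
        return ("task", task)
    customer = find_customer(msg)
    if customer is not None:
        return ("customer", customer)
    _, sender = parseaddr(msg.get("From", ""))
    if sender.lower() in NOTIFICATION_SENDERS:
        return ("notification", None)
    if is_spam(msg):
        return ("spam", None)
    # Whatever is left is the dribble a human has to look at.
    return ("personal", None)
```

The category returned by `classify` is what would then select the system response (mark, file, run a program, notify).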
And again, all that could happen on the server or on your machine, driven by an IMAP or other client or just invoked automatically upon receipt. In fact, it might be rather useful to have some subset of this running at Despammed and get serious about that poor old thing.
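Storing identical attachments only once, as suggested in the list above, could be sketched by keying each attachment on a hash of its content. The `AttachmentStore` class and its in-memory dictionaries are hypothetical; a real engine would presumably keep the blobs on disk and the sightings in a database.

```python
import hashlib
from email.message import Message


class AttachmentStore:
    """Hypothetical content-addressed store: one blob per distinct attachment."""

    def __init__(self):
        self.blobs = {}      # sha256 hex digest -> attachment bytes
        self.sightings = {}  # sha256 hex digest -> [(message_id, filename), ...]

    def add_message(self, msg: Message):
        """Record every attachment in msg; return the digests seen."""
        digests = []
        for part in msg.walk():
            filename = part.get_filename()
            if filename is None:
                continue  # not an attachment (body text or multipart container)
            payload = part.get_payload(decode=True) or b""
            digest = hashlib.sha256(payload).hexdigest()
            # Store the bytes only the first time this content appears.
            self.blobs.setdefault(digest, payload)
            self.sightings.setdefault(digest, []).append(
                (msg.get("Message-ID"), filename)
            )
            digests.append(digest)
        return digests
```

Two messages carrying the same file then share one entry in `blobs`, while `sightings` records where each copy showed up, which is the raw material for any versioned track later on.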
A couple of promising links I found searching on "email categorization":
- http://people.cs.umass.edu/~ronb/papers/email.pdf looks like a great place to start with machine learning of email categories.
- Implicit Mailroom seems to do a lot of the above, in the Microsoft Exchange context. Certainly a good idea to glom onto their feature list, anyway.
Anyway, the client is more or less independent of all that and could be anything. I'm going on the assumption that client-facing folders are going to be virtual (that is, a tag index instead of a physical folder) and stored with Email::Store. Virtual, because right now moving archived folders back and forth to reflect current activity levels of customers takes forever and is really backup-unfriendly. Thunderbird's method of email management just isn't cutting it for me any more. (Gmail's even worse - even apart from the fact that Google is reading my mail.)
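To make the virtual-folder idea concrete, here is a small Python sketch of a tag index. It does not use or imitate Email::Store; the `TagIndex` class, its method names, and the tag strings are all made up for illustration. The point is only that "moving" a conversation between active and archived becomes a tag change rather than a file move.

```python
from collections import defaultdict


class TagIndex:
    """Hypothetical tag index: a folder is just the set of messages with a tag."""

    def __init__(self):
        self._by_tag = defaultdict(set)  # tag -> message IDs
        self._by_msg = defaultdict(set)  # message ID -> tags

    def tag(self, message_id, *tags):
        for t in tags:
            self._by_tag[t].add(message_id)
            self._by_msg[message_id].add(t)

    def untag(self, message_id, tag):
        self._by_tag[tag].discard(message_id)
        self._by_msg[message_id].discard(tag)

    def folder(self, *tags):
        """Return the message IDs that carry all of the given tags."""
        if not tags:
            return set()
        sets = [self._by_tag.get(t, set()) for t in tags]
        return set.intersection(*sets)


# Archiving touches only the index; the stored message never moves.
idx = TagIndex()
idx.tag("<1@example.com>", "customer:acme", "active")
idx.untag("<1@example.com>", "active")
idx.tag("<1@example.com>", "archive")
```

Since the message store itself never changes when a customer goes quiet or wakes up again, backups only have to pick up new mail and the small index.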
So there's my line in the sand. That's what I want in an email client engine. Kind of an assistant in a box.