The result, when self-described, is a readable overview of the contents of the PDF. And since the nodal structure still hooks back into the CAM::PDF::Node structure (well, it doesn't right at the moment, but you know what I mean) it ought to be relatively simple to modify and write the file back out. I haven't explored that; right now, I'm much more interested in introspection.
Reading takes place in multiple phases. First, we build the list of objects and add PDF::Declarative::InternalValue objects (which are nodes), each tagged with the type of internal value it holds (dictionary, array, hexstring, stream, and so on). Names are the dictionary keys from the PDF data structure, and labels generally hold the values of scalars; they go unused for other data types.
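To make the tag/name/label idea concrete, here is a minimal sketch in Python rather than Perl. It is not the PDF::Declarative code: `Node`, `build_node`, and `describe` are hypothetical names, and plain Python dicts and lists stand in for values already parsed out of a PDF. It also lumps PDF names and strings together under one "string" tag, which a real reader would keep apart.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str        # type of internal value: dictionary, array, number, ...
    name: str = ""  # dictionary key this value was stored under, if any
    label: str = "" # scalar value; left empty for containers
    children: list = field(default_factory=list)


def build_node(value, name=""):
    """Turn a parsed value (assumed to be dict/list/scalar) into a Node tree."""
    if isinstance(value, dict):
        node = Node("dictionary", name)
        node.children = [build_node(v, k) for k, v in value.items()]
        return node
    if isinstance(value, list):
        node = Node("array", name)
        node.children = [build_node(v) for v in value]
        return node
    if isinstance(value, bool):  # checked before int, since bool is an int subclass
        return Node("boolean", name, str(value).lower())
    if isinstance(value, (int, float)):
        return Node("number", name, str(value))
    return Node("string", name, str(value))


def describe(node, depth=0):
    """Return the indented, self-described outline as a list of lines."""
    parts = [node.tag] + [p for p in (node.name, node.label) if p]
    lines = ["  " * depth + " ".join(parts)]
    for child in node.children:
        lines.extend(describe(child, depth + 1))
    return lines


page = {"Type": "Page", "MediaBox": [0, 0, 612, 792]}
print("\n".join(describe(build_node(page, "page"))))
```

Containers carry a tag and a name but no label, while scalars carry all three, which is the split described above.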
I'm still working on interpreting page content streams, but the idea is to locate the text strings and group them into paragraphs according to their mutual proximity and fonts. To do this, I'm going to have to develop a simple PDF command interpreter, so things are going a little slowly. But I really think it's doable, and good handling of PDFs is essential for all kinds of tasks.
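As a rough illustration of what such an interpreter might look like, here is a sketch in Python (again, not Perl, and not the code I'm actually writing). It understands only the BT, ET, Tf, Td and Tj operators, assumes literal strings without nested parentheses, ignores the horizontal advance of each string, and leaves escapes undecoded. The helper names `text_runs` and `paragraphs` and the `max_gap` threshold are invented for the example.

```python
import re

# A literal string in parentheses, or a run of non-delimiter characters
# (operators, numbers, /Names). Hex strings and arrays are not handled.
TOKEN = re.compile(r"\((?:[^()\\]|\\.)*\)|/?[^\s()/]+")


def text_runs(stream):
    """Return (x, y, font, size, text) for each Tj in a simplified stream."""
    stack, runs = [], []
    x = y = 0.0
    font, size = None, 0.0
    for tok in TOKEN.findall(stream):
        if tok == "BT":                # begin text: reset the line position
            x = y = 0.0
        elif tok == "Tf":              # /Font size Tf
            size = float(stack.pop())
            font = stack.pop().lstrip("/")
        elif tok == "Td":              # tx ty Td: move relative to line start
            ty = float(stack.pop())
            tx = float(stack.pop())
            x, y = x + tx, y + ty
        elif tok == "Tj":              # (string) Tj: show text
            runs.append((x, y, font, size, stack.pop()[1:-1]))
        elif tok == "ET":
            stack.clear()
        else:                          # operand: keep it for the next operator
            stack.append(tok)
    return runs


def paragraphs(runs, max_gap=1.5):
    """Join consecutive runs that share a font and sit within max_gap * size."""
    groups, prev = [], None
    for run in runs:
        _, y, font, size, text = run
        close = (prev is not None and prev[2] == font and prev[3] == size
                 and abs(prev[1] - y) <= max_gap * size)
        if close:
            groups[-1].append(text)
        else:
            groups.append([text])
        prev = run
    return [" ".join(g) for g in groups]


stream = ("BT /F1 12 Tf 72 720 Td (Hello) Tj 0 -14 Td (world) Tj "
          "0 -40 Td (Next) Tj ET")
print(paragraphs(text_runs(stream)))  # ['Hello world', 'Next']
```

The first two strings are 14 units apart in a 12-point font, so they merge; the third sits 40 units lower and starts a new paragraph. Real pages would also need the text and transformation matrices, TJ arrays, and font metrics before proximity means much.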
I ran across a nice technical paper on PDF structure here (dated 1999, but still a great overview).
Update: The NitroPDF package does what I need to do. Of course, it's not a scripting solution, but at least it will get me what I need today, plus provide a benchmark for performance. It really munges font spacing in order to get a Word document that corresponds closely to the PDF (otherwise your text will overlap any graphical decoration). I hate that; it makes it impossible to work with TagEditor. Of course, I have my unmunge.dpl script, but still: I need something more scriptable and flexible in the long run.