Run Around
I was handed a 2,405-page PDF today (tabular gate log data) and asked if I could convert it into an Excel file. Now, it’s safe to say I’m a lame duck at work with no direct responsibilities, so a project like this actually looks both interesting and doable in the current environment.
As one might fear, the data inside the PDF is not pure. There are page numbers, various column headers, and other stuff that makes this even more non-trivial than the ideal base case. It might be possible to copy-paste data like this directly into Excel, but not really for 112,999 individual records (far more than Excel’s 65,536-row per-worksheet limit).
So, first off one must pretty much eliminate the work computer as the problem-solving vehicle. We’re brutally locked down and only capable of invoking VBScript at best.
Up comes PuTTY connecting to Joey back home. Up comes the Linux terminal. Tilde, sweet tilde. We have Python.
Up comes WinSCP also connecting to Joey. Push the file across the net and into my home drive.
Okay, so let’s bust up the file: pdfminer to the rescue. Extracted the tarball and ran the test script with success.
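For the record, the extraction step looks roughly like this with pdfminer’s current high-level API. This is a sketch, not what actually ran (the real run used the bundled command-line script), and the file names are placeholders:

    # Rough equivalent of the extraction step, using pdfminer.six's high-level
    # API rather than the bundled command-line script that was actually run.
    # File names are placeholders.
    from pdfminer.high_level import extract_text_to_fp
    from pdfminer.layout import LAParams

    with open("gatelog.pdf", "rb") as pdf_file, open("gatelog.html", "wb") as html_file:
        # output_type="html" produces the absolutely-positioned HTML described
        # below, mimicking the original page layout.
        extract_text_to_fp(pdf_file, html_file, output_type="html", laparams=LAParams())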
Now, this tool outputs some really filthy HTML (CSS absolute positioning to make the webpage look like the original PDF). It might be possible to reinvent the wheel and write a better output template, but there’s no time. So, what I ultimately got was a bunch of HTML-tag-separated tokens from the file. This is a tabular file, and pdfminer seems to read top-down, left-to-right, so what I got was a sequential listing of each column on each page, one after another.
So, in comes sed, the dirty old stream editor from early UNIX history. Google revealed an HTML tag-stripping script for the program, so out comes a reduced tokenized version of the original PDF-turned-HTML, in a somewhat predictably scrambled order.
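A Python stand-in for that tag-stripping pass would look roughly like this (the actual sed script came from Google and isn’t reproduced here; file names are placeholders):

    # Python stand-in for the sed tag-stripping pass. File names are placeholders.
    import re

    with open("gatelog.html", encoding="utf-8") as src, \
            open("gatelog_tokens.txt", "w", encoding="utf-8") as dst:
        for line in src:
            stripped = re.sub(r"<[^>]+>", "", line)  # drop anything tag-shaped
            if stripped.strip():                     # keep only non-empty token lines
                dst.write(stripped)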
On to the real Python processing. I hate writing parsers. I hate writing code that interfaces with parsers. Ugh.
So, I think, what’s the best way to do this? I figure the column headers (which actually appear below the data they describe, go figure) are signals that we’re changing “state,” meaning I can steer my interpretation of the data using the very trash that was tokenized along with it.
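In sketch form, the idea is something like the following. The column names are invented, and for simplicity the sketch assumes a header precedes its column; in the real file the headers trailed their data, which is part of why things got messy:

    # Illustrative only: the real gate log's column headers and their exact
    # ordering aren't reproduced here, just the header-as-state-signal idea.
    COLUMN_HEADERS = {"Name", "Badge", "Gate", "Date", "Time"}  # invented names

    def parse_by_state(tokens):
        """File each data token under whichever column header was seen last."""
        columns = {header: [] for header in COLUMN_HEADERS}
        state = None
        for token in tokens:
            if token in COLUMN_HEADERS:
                state = token                 # header token: switch state
            elif state is not None:
                columns[state].append(token)  # data token: belongs to current state
        return columns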
This was miserable but tantalizingly half-successful. I end up some 10,000 rows short of complete, with off-by-one errors all over the place, the first two pages refusing to process at all, and so on.
I then realize I’m writing a general case for a fixed subset. Each page happens to have 47 rows with predictable garbage in between. Hack in a special case for the 11-row final page, and you have an isolated, predictable set. This solution took about a fifth of the time of the original, failed general-case attempt. And it worked to the row.
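The fixed layout also squares with the record count: assuming every one of the 2,405 pages carried data rows, 2,404 pages of 47 plus the 11-row final page comes to exactly 112,999. A quick sanity check of that bookkeeping:

    # The fixed layout accounts for every record, assuming all 2,405 pages
    # carried data rows: 2,404 pages of 47 plus the 11-row final page.
    ROWS_PER_PAGE = 47
    LAST_PAGE_ROWS = 11
    TOTAL_PAGES = 2405

    def expected_rows(page_index):
        """Row count for a page under the fixed-layout assumption."""
        return LAST_PAGE_ROWS if page_index == TOTAL_PAGES - 1 else ROWS_PER_PAGE

    assert sum(expected_rows(i) for i in range(TOTAL_PAGES)) == 112999  # (2404 * 47) + 11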
Let this be a lesson to me: once-off means don’t worry about next time!
So, the output of my script was a tab-separated, de-junkified version of the original unordered tokens, but still well over the 65k row limit for Excel.
So, send the file back over with WinSCP and import it into Access. Make it all pretty and set up custom filters to split the data into four subgroups based on the first letter of the last name. Paste into Excel and enjoy.
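The split itself happened in Access, but in Python terms it amounts to roughly the following. The letter boundaries and the position of the last-name column are assumptions, since the exact break points aren’t recorded here:

    # Hypothetical version of the Access split: the four letter ranges and the
    # position of the last-name column are assumptions, not what was configured.
    import csv
    from collections import defaultdict

    GROUPS = {"A-F": set("ABCDEF"), "G-L": set("GHIJKL"),
              "M-R": set("MNOPQR"), "S-Z": set("STUVWXYZ")}

    def split_by_last_name(tsv_path, last_name_col=0):
        """Bucket the tab-separated rows on the first letter of the last name
        so each group can fit on its own Excel worksheet."""
        buckets = defaultdict(list)
        with open(tsv_path, encoding="utf-8") as src:
            for row in csv.reader(src, delimiter="\t"):
                initial = row[last_name_col][:1].upper()
                label = next((g for g, letters in GROUPS.items() if initial in letters), "S-Z")
                buckets[label].append(row)
        for label, rows in buckets.items():
            with open(f"gatelog_{label}.tsv", "w", newline="", encoding="utf-8") as out:
                csv.writer(out, delimiter="\t").writerows(rows)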
What’s really brutal about this is that it probably came out of a SQL Server database to start, and the vendor just doesn’t offer raw data formats. But, it was a heck of a way to spend a day, ugly parser code or not.