RSS

Reading Compound File Binary Format Files generated by PHPExcel with Python/Django xlrd

September 19

This was a monster for me to track down, so I’m hoping it helps many others.

The issue I ran into was that when trying to read an Excel file using Python’s xlrd package that was generated by PHPExcel, the following error was generated:

File read error [row 1]: Workbook corruption: seen[2] == 4

In search of a fix, preferably one that didn’t assume PHPExcel was just buggy (which it may be, but that’s another topic), I perused dozens of online articles with the actual developer of xlrd (John Machin) commenting on threads from people with nearly identical issues, but all the solutions were just specific workarounds that didn’t seem to apply to me.

Somewhere, I caught the tiniest glimpse of the phrase “Compound File Binary.” At first it didn’t lead to much, but it ultimately led me here:

OleFileIO_PL – a Python module to read MS OLE2 files

A quick “pip install OleFileIO-PL” and the deployment of this code:


data = file.read()
try:
f = StringIO.StringIO(data)
ole = OleFileIO_PL.OleFileIO(f)
    if ole.exists('Workbook'):
         d = ole.openstream('Workbook')
         data = d.read()

except Exception as e:
     # try to read it the normal way
     pass

And I was on my way.

Note that “Workbook” stream might be specific to PHPExcel, but there is a list_dir() method on the ole object that will show you the streams available.  There’s also decent documentation in the zip download from the above link.

 

Posted by on September 19, 2013 in Computers, How-To, Programming

Leave a comment

Leave a Reply

Your email address will not be published. Required fields are marked *