Tabula
December 21, 2019
Let’s examine a piece of technology that I’ve used to great success in my application for automatically parsing emails. The application, which is written in C#, will be detailed in a future post.
An essential part of the app is parsing email attachments for tabular data, restricted to certain areas of the PDF, since we are only concerned with those fields.
For this, Tabula is a great tool. Let’s see how. Note that Tabula requires Java, and with Oracle now charging for Java support under commercial licenses, it is advisable to use OpenJDK or another Java implementation rather than Oracle’s. For personal use, Oracle’s may be fine. As always, please read the terms before use.
If you are inclined to use Python, there is Camelot, which I might migrate to soon.
First, download Tabula for your operating system. I’ll describe the steps for Windows 10.
After unzipping the file, locate tabula.exe and run it. It will open a browser window with an interface for loading PDF files. Please note that Tabula is for extracting tables and does not have an OCR module. For that, you can experiment with Tesseract. Basically, your PDF should be machine-generated, not scanned.
Enough with the limitations; let us see what it can do.
Once you import the PDF file, you can view it page by page and use the mouse to select the area that should be analyzed for tabular data. Then select Preview & Export Extracted Data. You can choose the Stream or Lattice extraction method. Stream is for tables whose cells are separated by whitespace, and Lattice is for tables with gridlines. Check the preview to see whether the column data is correct, then save it as CSV. Alternatively, you can save the area coordinates to a script file (.sh). If you open that file in a text viewer, it will look like the following.
If two areas were selected, you will have two lines of code (one for each area). These commands can be run from the command line; refer to the Tabula command-line details. For example, a sample command:
java -jar ./target/tabula-0.9.0-jar-with-dependencies.jar -p all -a 49.5,52.3285714,599.6571428,743.91428571 -o data_table.csv report.pd
For this, we need to give the paths of the input PDF and the output CSV; otherwise, the output will be written to the same folder.
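If you want to drive these commands from a script rather than typing them, here is a minimal Python sketch of how the command line could be assembled, one command per selected area. It only uses the -p, -a and -o options that appear in the sample command above, and it assumes the four area values are in the order top, left, bottom, right. The function name, the file names invoice.pdf and area_N.csv, and the second area are made up for illustration; they are not part of my app.

```python
import shlex
import subprocess


def build_tabula_command(jar_path, pdf_path, csv_path, area, pages="all"):
    """Build the argument list for one tabula run over one page area.

    `area` is assumed to be (top, left, bottom, right), the same order
    as the -a value in the sample command.
    """
    top, left, bottom, right = area
    if bottom <= top or right <= left:
        raise ValueError("area must be (top, left, bottom, right)")
    area_arg = ",".join(str(value) for value in area)
    return [
        "java", "-jar", jar_path,
        "-p", str(pages),   # pages to process, e.g. "all" or "1"
        "-a", area_arg,     # area to analyze
        "-o", csv_path,     # output CSV file
        pdf_path,           # input PDF file
    ]


def run_tabula(command):
    """Run one command; raises CalledProcessError if Java exits with an error."""
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Jar path taken from the sample command; everything else is hypothetical.
    jar = "./target/tabula-0.9.0-jar-with-dependencies.jar"
    areas = [
        (49.5, 52.3285714, 599.6571428, 743.91428571),  # from the sample command
        (610.0, 52.0, 780.0, 744.0),                    # made-up second area
    ]
    for index, area in enumerate(areas, start=1):
        command = build_tabula_command(jar, "invoice.pdf", f"area_{index}.csv", area)
        # Print instead of running, so the sketch works without Java installed.
        print(shlex.join(command))
```

To actually run the extraction, call run_tabula(command) inside the loop instead of printing. Passing the arguments as a list avoids quoting problems when a path contains spaces.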