Using `tabula-extractor` to liberate tables from their PDF imprisonment

by pdfkungfoo
macOS ◆ xterm-256color ◆ bash

Extracting tables from a PDF and converting them into a spreadsheet format is not simple. Until quite recently it was almost impossible with any tool, let alone with Free Software utilities.

The reason: PDF as a document format does not store the meaning or the context of its text elements. (This may improve in the future with the proliferation of tagged PDF and PDF/UA, where UA stands for Universal Accessibility.)

This has changed with the arrival of tabula-extractor. The tool can extract tables that are trapped in a PDF’s fixed layout, even when their data is not tagged at all as ‘table’, ‘column’, ‘row’ or ‘cell’.
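To make the problem concrete, here is a minimal Python sketch of the general idea behind recovering a table from untagged text. It is not tabula-extractor's actual algorithm, only an illustration: it assumes the PDF has already been reduced to text fragments with page coordinates (the `fragments` list, the `group_into_rows` helper and the `y_tolerance` value are all made up for this example), and it rebuilds rows purely from position.

```python
# Hypothetical input: text fragments as (x, y, text), the way a fixed-layout
# page knows them. Nothing here says "table", "row" or "cell".
fragments = [
    (72.0, 700.0, "Name"), (200.0, 700.0, "Qty"),
    (72.0, 685.2, "Apples"), (200.0, 685.0, "3"),
    (72.0, 670.0, "Pears"), (200.0, 670.1, "12"),
]


def group_into_rows(fragments, y_tolerance=2.0):
    """Group fragments into rows by similar y, then order each row by x."""
    rows = []  # list of (reference_y, [(x, text), ...])
    # Walk the page top to bottom (larger y is higher up in this sketch).
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            # Close enough to the current row's baseline: same row.
            rows[-1][1].append((x, text))
        else:
            # Otherwise this fragment starts a new row.
            rows.append((y, [(x, text)]))
    # Within each row, sort left to right and keep only the text.
    return [[text for _, text in sorted(cells)] for _, cells in rows]


table = group_into_rows(fragments)
for row in table:
    print(",".join(row))
```

Real PDFs are far messier than this (cells that wrap over several lines, empty cells, columns that only show up as whitespace gaps), which is why a dedicated tool is needed for the job.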

This asciinema screencast introduces you to tabula-extractor. It covers installation from Git, preparing a small wrapper script, and putting it to use on a 293-page PDF consisting of one loooong table…

(At certain spots the cast may progress too fast for you to follow closely. Use the Pause button in the lower left corner if you need more time to read and understand the screen contents. You can also scroll back if needed.)