Using `tabula-extractor` to liberate tables from their PDF imprisonment


by pdfkungfoo 9 years ago

macOS ◆ xterm-256color ◆ bash 27066 views

Extracting tables from a PDF and converting them into a spreadsheet format is not simple. Until quite recently it was almost impossible, and certainly not with Free Software utilities alone.

The reason: PDF as a document format does not store the meaning or the context of text elements. (This may improve in the future with the proliferation of tagged PDF/UA, where UA stands for Universal Accessibility.)

This has changed with the arrival of `tabula-extractor`. The tool can extract tables that are trapped in a PDF’s fixed layout even when their data is not tagged at all as ‘table’, ‘column’, ‘row’ or ‘cell’.
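To illustrate why untagged tables are hard to recover, here is a minimal Python sketch. It is not how `tabula-extractor` works internally; it only shows the general idea of rebuilding rows and columns from positions alone. The fragment list, the coordinates, and the helper names `group_into_rows` and `rows_to_csv` are all made up for this example.

```python
import csv
import io

# Hypothetical text fragments as a PDF page might position them: (x, y, text).
# Nothing in this data says "table", "row", "column" or "cell".
fragments = [
    (72.0, 700.0, "Name"), (200.0, 700.0, "Qty"), (300.0, 700.0, "Price"),
    (72.0, 685.2, "Widget"), (200.0, 685.0, "4"), (300.0, 685.1, "9.99"),
    (72.0, 670.0, "Gadget"), (200.0, 670.3, "12"), (300.0, 670.0, "24.50"),
]


def group_into_rows(fragments, y_tolerance=2.0):
    """Cluster fragments with nearly equal y into rows, ordered top to bottom."""
    rows = []
    # PDF y coordinates grow upwards, so sort by descending y to read top-down.
    for x, y, text in sorted(fragments, key=lambda f: (-f[1], f[0])):
        if rows and abs(rows[-1]["y"] - y) <= y_tolerance:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    # Within each row, order the cells left to right by x.
    return [[text for _, text in sorted(row["cells"])] for row in rows]


def rows_to_csv(rows):
    """Serialize the reconstructed rows as CSV text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


print(rows_to_csv(group_into_rows(fragments)))
```

Real PDFs are much messier (cells that wrap over several lines, merged cells, tables spanning pages), which is why a dedicated tool is needed rather than a heuristic this naive.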

This asciinema screencast introduces you to `tabula-extractor`. It covers installation from Git, preparing a small wrapper script, and putting it to use on a 293-page PDF consisting of one loooong table…

(The cast may progress too fast in places for you to follow closely. Use the Pause button in the lower left corner if you need more time to read and understand the screen contents. You can also scroll back if needed.)

You can download this recording in asciicast v1 format, as a .json file.

Replay in terminal

You can replay the downloaded recording in your terminal using the asciinema play command:

asciinema play 22761.json

If you don't have the asciinema CLI installed, see the installation instructions.
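Because an asciicast v1 recording is plain JSON, it can also be inspected without the player. The sketch below assumes the usual v1 layout, with `width`, `height`, and a `stdout` list of `[delay, data]` pairs where each delay is the number of seconds since the previous frame. The `sample` document and the `summarize` helper are made up for illustration and are not taken from the real `22761.json`.

```python
import json

# A tiny made-up recording in asciicast v1 layout (not the real 22761.json).
sample = json.dumps({
    "version": 1,
    "width": 116,
    "height": 44,
    "duration": 1.5,
    "command": "/bin/bash",
    "title": "demo",
    "env": {"TERM": "xterm-256color", "SHELL": "/bin/bash"},
    "stdout": [[0.5, "$ echo hi\r\n"], [1.0, "hi\r\n"]],
})


def summarize(cast_json):
    """Return terminal size, frame count, playback time and raw output text."""
    cast = json.loads(cast_json)
    if cast.get("version") != 1:
        raise ValueError("not an asciicast v1 document")
    frames = cast["stdout"]
    return {
        "size": (cast["width"], cast["height"]),
        "frames": len(frames),
        # Each delay is relative to the previous frame, so the sum is the runtime.
        "seconds": sum(delay for delay, _ in frames),
        # Concatenating the data gives everything that was written to the terminal.
        "text": "".join(data for _, data in frames),
    }


print(summarize(sample))
```

For a downloaded file, something like `summarize(open("22761.json").read())` should work the same way, provided the file really follows the v1 layout assumed here.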

Use with stand-alone player on your website

Download the asciinema player from the releases page (you only need the .js and .css files), then use it like this:

<!DOCTYPE html>
<html>
<head>
  <link rel="stylesheet" type="text/css" href="asciinema-player.css" />
</head>
<body>
  <div id="player"></div>
  <script src="asciinema-player.min.js"></script>
  <script>
    AsciinemaPlayer.create(
      '/assets/22761.json',
      document.getElementById('player'),
      { cols: 116, rows: 44 }
    );
  </script>
</body>
</html>

See the asciinema player quick-start guide for full usage instructions.

While this site doesn't provide GIF conversion at the moment, you can still do it yourself with agg, the asciinema GIF generator utility.

Once you have it installed, generate a GIF with the following command:

agg https://asciinema.org/a/22761 demo.gif

Or, if you already downloaded the recording file:

agg demo.cast demo.gif

Check agg --help for all available options. You can change the font family and size, select a color theme, adjust the speed, and more.

See the agg manual for full usage instructions.