Using ack to grep through Secretary Hillary Clinton's released emails (truncated)

by dann
macOS ◆ xterm-256color ◆ bash

Demonstration of basic ack usage with the Clinton emails as packaged by the WSJ:

http://graphics.wsj.com/hillary-clinton-email-documents/

I downloaded a couple of the zip files, unzipped them into subdirectories, and then ran poppler’s pdftotext to create text files of each PDF:

                  ls *.pdf | xargs -n 1 pdftotext -layout
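The one-liner above works as long as no filename contains spaces. A slightly more defensive variant is sketched below; `convert_pdfs` is a hypothetical helper name, not something from the recording, and it assumes poppler's `pdftotext` is on your PATH.

```shell
# Convert every PDF in a directory (default: the current one) to a .txt
# file next to it. convert_pdfs is a hypothetical helper name.
convert_pdfs() {
  dir=${1:-.}
  for pdf in "$dir"/*.pdf; do
    # If nothing matched, the glob stays literal; skip it.
    [ -e "$pdf" ] || continue
    # With no output name given, pdftotext writes <name>.txt beside the PDF.
    pdftotext -layout "$pdf"
  done
}

# Usage: convert_pdfs clinton-emails/june
```

Quoting `"$pdf"` is what keeps a name such as `my file.pdf` from being split into two arguments.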

With the given directory tree of:

    hereiam/
    |__ clinton-emails/
         |__ dec
         |__ june

The following ack command searches the text files in those subdirectories for any line containing the word “To” or “From”, followed by a colon, zero or more spaces, a single capital H (which is how Clinton’s name shows up in the email fields), and then one character that is not a letter, digit, or underscore. That last condition keeps the H from matching the start of a longer word, i.e. she goes by “H” and not “Hillary”:

                ack '(?:From|To): *H\W' .

Example result:

       clinton-emails/june/C05763789.txt
       19:From: H [mailto:HDR22@clintonemail.com]

The top line is the relative path to the file. The second line is the actual match, and the number 19 is its line number within that file (ack has options to leave out this metadata if you want).

Here’s that same search with the -C flag, which by default adds the 2 lines before and after each match. Note how the context lines start with the line number followed by a hyphen, while the line with the actual match starts with the line number followed by a colon, as before:

        clinton-emails/june/C05763789.txt
        17-
        18-    Original Message
        19:From: H [mailto:HDR22@clintonemail.com]
        20-Sent: Thursday, July 16, 2009 7:48 AM
        21-To: Chollet, Derek H
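The colon-versus-hyphen convention makes it possible to pull the real matches back out of saved context output. The sketch below uses plain `grep -E` on the sample above, with the filename heading and the page's indentation removed; the variable names are only illustrative.

```shell
# Sample -C output copied from the example above (filename heading omitted).
context_output='17-
18-    Original Message
19:From: H [mailto:HDR22@clintonemail.com]
20-Sent: Thursday, July 16, 2009 7:48 AM
21-To: Chollet, Derek H'

# Match lines start with digits followed by a colon.
match_lines=$(printf '%s\n' "$context_output" | grep -E '^[0-9]+:')
# Context lines start with digits followed by a hyphen.
context_lines=$(printf '%s\n' "$context_output" | grep -E '^[0-9]+-')

printf 'matches:\n%s\n' "$match_lines"
printf 'context:\n%s\n' "$context_lines"
```

Anchoring on the leading digits matters here: the colon inside “Sent: Thursday” on line 20 would otherwise make a context line look like a match.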

(Note that this asciinema playback shows a truncated version of the output. Showing all of the matches created an asciinema file so big that it would make your browser struggle.)