Parsing PDFs

Details

Much of the world's data is stored in Portable Document Format (PDF) files. This is not my preferred storage or presentation format, so I often convert such files into databases, graphs, or spreadsheets. When I read PDF files, I ask these questions:

• Do we need to read the file contents at all?

• Do we only need to extract the text and/or images?

• Do we care about the layout of the file?

I take different approaches to parsing depending on the answers to these questions. In this talk, I'll show a few different approaches to parsing and analyzing PDF files, and I'll discuss which approaches make sense in which situations.
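As a rough illustration of how the three questions can lead to different code, here is a minimal Python sketch. It is not necessarily what the talk will cover. It assumes the Poppler `pdftotext` command-line tool is installed, and the function names (`file_summary`, `pdftotext_command`, `extract_text`) are made up for this example.

```python
import os
import subprocess


def file_summary(path):
    """Question 1: describe a file without parsing its contents.

    Only the size and the first five bytes are inspected.
    """
    with open(path, "rb") as f:
        header = f.read(5)
    return {
        "path": path,
        "bytes": os.path.getsize(path),
        # PDF files normally begin with the marker "%PDF-".
        "looks_like_pdf": header == b"%PDF-",
    }


def pdftotext_command(path, keep_layout=False):
    """Questions 2 and 3: build a pdftotext command line.

    Assumes Poppler's pdftotext is on the PATH. The -layout option asks
    it to keep the physical layout of the page as best it can.
    """
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")
    cmd += [path, "-"]  # "-" writes the text to standard output
    return cmd


def extract_text(path, keep_layout=False):
    """Run pdftotext and return the extracted text as a string."""
    result = subprocess.run(
        pdftotext_command(path, keep_layout),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

If the answer to the first question is "no", `file_summary` is enough and nothing needs to be parsed. If only the text matters, `extract_text(path)` will do. If the layout matters, `extract_text(path, keep_layout=True)` is one starting point, though documents with complicated layouts may call for other tools.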

Our Teacher: Thomas Levine

Having played with computers since he was young, Tom eventually developed back and wrist pain, so he started studying ergonomics and conducting quantitative ergonomics research. At some point, people started calling him a data scientist. And his back and wrists now hurt less. He has recently been playing music and studying how people share data.

Data Engineers DC
GWU, Funger Hall, Room 103
2201 G St. NW · Washington, DC