Parsing PDFs

Name: Parsing PDFs
Start: 2014-04-02T18:30:00-04:00
End: 2014-04-02T21:00:00-04:00
Location: GWU, Funger Hall, Room 103

Hosted By Data Engineers DC

public group

Details

Much of the world’s data are stored in portable document format (PDF) files. This is not my preferred storage or presentation format, so I often convert such files into databases, graphs, or spreadsheets. When I’m reading PDF files, I ask these questions:

• Do we need to read the file contents at all?

• Do we only need to extract the text and/or images?

• Do we care about the layout of the file?

I take different approaches to parsing depending on the answers to these questions. In the talk, I’ll show a few different approaches to parsing and analyzing PDF files, and I’ll discuss which approaches make sense in which situations.
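As a rough illustration of how the three questions might steer the work (a sketch under stated assumptions, not code from the talk), the Python below first maps the answers to a broad approach, then shows the cheapest case: describing a PDF without parsing its body. The names `choose_approach` and `pdf_summary`, and the approach labels, are hypothetical. The only format detail relied on is that a PDF begins with a `%PDF-` header comment carrying the version number.

```python
import os
import re
import tempfile


def choose_approach(need_contents, need_layout):
    """Map answers to the three questions onto a broad approach (labels are illustrative)."""
    if not need_contents:
        return "metadata only"
    if need_layout:
        return "layout-aware parsing"
    return "text/image extraction"


def pdf_summary(path):
    """Describe a PDF without parsing its body: file size plus the header version."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(1024)  # the %PDF- header comment sits at the start of the file
    match = re.search(rb"%PDF-(\d+\.\d+)", head)
    version = match.group(1).decode("ascii") if match else None
    return {"bytes": size, "version": version}


if __name__ == "__main__":
    # Write a tiny stand-in file so the sketch runs without a real PDF at hand.
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(b"%PDF-1.4\n%%EOF\n")
    print(choose_approach(need_contents=False, need_layout=False))
    print(pdf_summary(tmp.name))
    os.remove(tmp.name)
```

When the answer to the first question is "no", a summary like this (size, version, perhaps file names and dates) may already be enough. Extracting text, images, or layout would need a dedicated PDF library or command-line tool, which this sketch deliberately leaves out.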

Our Teacher: Thomas Levine

Having played with computers since he was young, Tom eventually developed back and wrist pain, so he started studying ergonomics and conducting quantitative ergonomics research. At some point, people started calling him a data scientist, and his back and wrists now hurt less. He has recently been playing music and studying how people share data.


Wednesday, April 2, 2014 at 6:30 PM to Wednesday, April 2, 2014 at 9:00 PM EDT

GWU, Funger Hall, Room 103

2201 G St. NW · Washington, DC
