May Meetup - Elsevier's DataSearch Platform & Harvesting Data from PDFs


Our first talk is from Peter Cotroneo, Senior Product Manager at Elsevier, who will present DataSearch, Elsevier’s award-winning search engine. DataSearch lets scientists and researchers search for many data types and formats across domain-specific and cross-domain institutional data repositories and other data sources. Peter will discuss the challenges of building DataSearch, its technology stack, and its future direction.

The second talk is from Michael Hardwick, founder and Managing Director of Elite Software, where he is responsible for PDF data extraction technology. He has a strong preference for C++ and a background in mathematics, physics, and astronomy. Michael will talk about harvesting data from PDFs, covering a short history of the format and how extraction tools must cope with typefaces, fonts, spacing, columns and reading order, encryption, and more! If you've ever wrestled with indexing PDF data, this talk will be of interest.