Extracting a Million-Record Dataset from Historical NYC City Directories


Segarmakers, Wheelwrights, Merchants: Extracting a Million-Record Dataset from Historical NYC City Directories
Almost a year ago, during the first meetup in this series (https://www.meetup.com/historical-data-and-maps-at-nypl/events/235450812/), we showed how the New York Public Library is digitizing its volumes of city directories, and how, as part of the NYC Space/Time Directory project, we have started to extract data from these books and turn it into historical datasets.
City directories present a tantalizing data source for the demographic, occupational, and spatial history of urban environments. New York City’s listings are no exception, with more than 120 years of directories and over a million entries documenting the city’s inhabitants available for public use.
While these directories have been digitized and made publicly available by the NYPL and other institutions, extracting the directory entries for data analysis poses additional challenges involving computer-assisted field detection and language parsing. Come hear updates on this ongoing effort, carried out in collaboration with members of New York University’s Data Services team.
This event will be more technical than the one last year, and the focus will be on how we are using optical character recognition, statistical modeling and historical addresses to extract, geocode, and visualize the people and businesses listed in the city directories.
Project members:
• Bert Spaan, NYC Space/Time Directory, NYPL
• Stephen Balogh, NYU Spatial Data Repository
• Nicholas Wolf, NewYorkScapes
Each of us will talk about our role in the project. Afterwards, there will be time for questions and discussion.