Spark ETL and Canon for NLP

Get ready for another scintillating evening of presentations on Data Science.

The Dos and Don'ts of Spark ETL
Apache Spark is a fantastic framework for large-scale data processing. It can be used to transform and prepare structured data quickly and efficiently. However, it's not a drop-in replacement for a traditional RDBMS. Learn the concepts, tips, tricks, and pitfalls of using Apache Spark for your data processing needs.

Speaker: Ilia Fishbein is Director of Software Engineering at HealthVerity, managing the Logistics team. His background is in parallel computing, map-reduce, and distributed data processing. Ilia and his team routinely transform hundreds of gigabytes of healthcare data.

Canon for NLP
Canon is a datastore for natural language documents that currently supports sentence segmentation, tokenization, part-of-speech tagging, and named entity recognition, with dependency parsing coming soon. It comes with an extensible scraper that does machine learning engineering on text streams in flight. It can be deployed at the press of a button.

Speaker: Alex Tecce is a Data Architect at MachineQ, working on storing and deriving insight from time series data from a range of IoT devices. He also has an interest in NLP, both academically and industrially, and will be showing his pet project, Canon.

Event Sponsors:
HealthVerity - HealthVerity offers a cloud-based platform to discover, license, and link HIPAA-compliant, de-identified healthcare data.
Revzilla - A global eCommerce retailer providing motorcycle enthusiasts with premium apparel, accessories, and parts for any riding adventure.

We are thankful to the event sponsors for their generous support of DataPhilly! If you're interested in sponsoring future events please fill out our form at