May Meetup Night


Details
Our sponsor is Copyright Clearance Center (CCC): http://www.copyright.com/
Food will arrive at 6 pm.
In our first talk we outline an approach to auto-scaling Amazon EMR clusters running Spark Streaming. The motivation is twofold:
first, to eliminate manual scale-out operations in response to spikes in event volume, and second, to save costs
by scaling in a cluster when event volume drops off. The auto-scaling steps are described in detail, along with code examples that use the EMR and Spark APIs.
Our first speakers are:
Nick Afshartous is a Principal Big Data Engineer leading development and QA automation efforts for the streaming analytics platform at WB Games.
Leveraging his passion for big data, distributed systems and functional programming, Nick is a veteran of the industry with more than a decade
of experience building systems that are scalable and maintainable.
Helen Liu is a computer science major at Northeastern University, currently on co-op with WB Games.
Our second talk is “How to (Not) Light a Pile of Money on Fire Using On-Demand Web Services for ETL”
On-demand serverless ETL sounds appealing: there’s no expensive infrastructure for workloads that can be large and infrequent. It also offers centralized management for jobs without the capital expenditures of commercial products. However, failing to understand the pricing model and retaining out-of-the-box configurations can lead to sticker shock when you get your first bill. This talk tells the tale of how we learned from our experience and made the services work for us.
Glenn Street is the Data Architect for CCC and prefers roasting marshmallows to money.
The third talk is "Author Disambiguation in a Knowledge Graph"
In navigating scholarly research, finding all the papers written by an author and finding all the papers citing an author’s work are frequent operations. However, names as reported by publishers are imperfect tools for performing these searches.
During this talk we’ll cover:
• How we built a knowledge graph of authorships and citations
• How we used Spark to identify similar authorships
• How we propagated the similarities back to the graph
We’ll present illustrative examples of our approach to these challenges.
Matt Kleiderman is the Director of Architecture at CCC and has worked in the management of documents and their metadata at Thomson Reuters, Standard & Poor's and Xerox Global Services.
