Akka streams & Bloom filters; and ML Pipeline for Text Classification


44 people went


IMPORTANT: If your Meetup.com name is not your real first AND LAST NAME, please e-mail that info to [masked]

Oracle security requires us to provide a list of attendees in advance. Please also bring a GOVERNMENT PHOTO ID (driver's license, passport, etc.). You may be turned away if security does not have your full name in advance or if you don't bring an ID.


RSVPs will close at 1pm on Wednesday, August 31, 2016, so that I can provide the attendee list to Oracle security. If you miss the RSVP deadline, e-mail me at [masked] and I will add your name to an addendum that I will send to Oracle security on September 7; hopefully they will accept it.

On the evening of the event, just come to building 1 and check in with security, then proceed down the hall to the "conference center". We'll be in the nice 75-person conference/training room again this time.


6:00pm Pizza and networking
6:30pm Announcements -- DOOR PRIZE: a print copy of Michael Malak's book Spark GraphX in Action (http://www.manning.com/malak?a_aid=sparkgraphx&bid=28876901) will be given to a randomly selected attendee
6:40pm An Investigation of Akka Streams and Bloom Filters, by Anthony May
7:25pm Spark ML Pipeline for Text Classification, by Adam Hicks
8:10pm Adjourn

An Investigation of Akka Streams and Bloom Filters - Abstract

Anthony is involved in replacing a legacy batch-processing system with stream processing in order to monetize the data more quickly and decrease costs. Akka Streams promised to be fast and good at managing state, while Bloom Filters promised to save a lot of costly Cassandra joins while being at least 99% correct. This is the story of their investigation into using Akka Streams and Bloom Filters for part of their new system.
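
For readers new to the idea, here is a minimal Bloom filter sketch in plain Python. It is not the speaker's implementation (the talk concerns Akka Streams, and the abstract does not describe the code); the class name, the sizing parameters, and the SHA-256 double-hashing scheme are all assumptions chosen for illustration. It shows why such a filter can stand in for an expensive lookup: a "no" answer is always right, while a "yes" answer is wrong only at roughly the configured false-positive rate (1% here, matching the "at least 99% correct" figure above).

```python
import hashlib
import math


class BloomFilter:
    """Hypothetical illustration: a probabilistic set with no false negatives."""

    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        # Standard sizing formulas: m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2.
        self.num_bits = max(
            1,
            math.ceil(-expected_items * math.log(false_positive_rate) / math.log(2) ** 2),
        )
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key: str):
        # Double hashing: derive all k bit positions from two halves of one digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so the step is never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "probably present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


# Usage: remember 10,000 keys, then probe with 10,000 keys that were never added.
known = BloomFilter(expected_items=10_000, false_positive_rate=0.01)
for i in range(10_000):
    known.add(f"user-{i}")

false_positives = sum(known.might_contain(f"other-{i}") for i in range(10_000))
print(f"false positives: {false_positives} of 10000 probes")
```

In a streaming setting, a filter like this would typically be consulted first, and the costly lookup performed only when `might_contain` returns True.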

Anthony May - Bio

Anthony is a Senior Software Engineer with the Oracle Data Cloud, where he works on the Campaign Dev team. His work is focused on the ingest of large volumes of unbounded streaming datasets using Akka, Spark, Kafka, Mesos, etc. Previously he built software for healthcare, education and retail. He originally hails from New Zealand and moved to the USA in 2012.

Spark ML Pipeline for Text Classification - Abstract

Text classification is a broad topic ranging from binary spam filtering to enterprise document management. Spark on Hadoop’s distributed approach to machine learning is ideally suited to classifying documents at large scale. This talk will cover the use of another Apache product, Tika, together with the open source Tesseract OCR engine, to get a handle on multi-class document categorization. The focus will be on text extraction and cleansing through Spark ML Pipeline Models.

Adam Hicks - Bio

Adam’s background is where data analyst, full stack developer and DevOps engineer converge. He has an academic background in philosophy, pure mathematics and graduate computer science. His work spans business automation with Visual Basic, microservice application development, continuous integration for the enterprise, and management of a Hadoop stack from the bottom to the top with a focus on Spark development. Adam has an undying passion for all things open source and *nix, and currently works for a small financial services company, helping provide solutions anywhere they need a fresh approach.