All Things Spark: Machine Learning, Atlas integration, ORC & Hive EDW updates


Details
Location details: Find us in Room LL 20A!
See the venue map: https://www.sanjose.org/sites/default/files/styles/overlay_gallery_image/public/2018-01/cc_floorplan_parkway_1200typea.jpg
Apache Spark has become one of the most popular in-memory compute engines due to its elegant and expressive development APIs combined with enterprise readiness. At the meetup we will focus on machine learning and deep learning use cases and performance; Apache Atlas integration to enable governance and metadata; performance improvements and Parquet parity with Apache ORC (high-performance columnar storage); and finally, the Apache Hive EDW connector, which enables data warehouse initiatives for advanced business analytics.
Agenda
6:00 - 6:15 PM Food & drinks
6:15 - 6:20 PM Kickoff
6:20 - 6:40 PM Talk 1
6:40 - 7:00 PM Talk 2
7:00 - 7:20 PM Talk 3
7:20 - 7:40 PM Talk 4
7:40 - 8:00 PM Q&A
8:00 PM+ Networking
Talks
SparkML – PySpark performance, image integration, and deep learning use cases – Yanbo Liang and Mingjie Tang (20 min)
Spark Atlas integration – Yanbo Liang and Mingjie Tang (20 min)
Spark + ORC – Dongjoon Hyun (20 min)
Spark + HiveEDW connector – Eric Wohlstadter (20 min)
Bios
Robert Hryniewicz (host)
Robert is a Data Evangelist with over 11 years of experience working on a variety of technologies from AI and robotics to IoT and blockchain. He’s part of the Hortonworks community team, driving data science sandbox product strategy, thought leadership on AI, delivering crash courses and lectures on Spark, data science + deep learning, and making sure that the community has all the resources needed to build kickass next-gen products. Robert will be your host for the evening.
Arun Iyer
Arun Iyer has been involved in the design and development of various streaming analytics platforms at Hortonworks. He contributes to the Apache Storm project and is currently a committer and PMC member of the project. Prior to Hortonworks, he was involved in the development of various streaming and distributed systems at Informatica and Yahoo.
Jerry Shao
Jerry Shao is a member of technical staff at Hortonworks, focused mainly on Spark, especially Spark core, Spark on YARN, and Spark Streaming. He is an Apache Spark committer and an Apache Livy (incubating) PPMC member. Prior to Hortonworks, he was a software engineer at Intel working on performance tuning and optimization of Hadoop and Spark.
Yanbo Liang
Yanbo is a staff software engineer at Hortonworks. His main interests center on implementing effective machine learning and deep learning algorithms and models. He is an Apache Spark PMC member and contributes to many open source projects, such as TensorFlow, Apache MXNet, and XGBoost. He delivered the implementation of some core Spark MLlib algorithms. Prior to Hortonworks, he was a software engineer at Yahoo! and France Telecom working on machine learning and distributed systems.
Mingjie Tang
Mingjie Tang is an engineer at Hortonworks. He works on SparkSQL, Spark MLlib, and Spark Streaming. He has broad research interests in database management systems, similarity query processing, data indexing, big data computation, data mining, and machine learning. Mingjie completed his PhD in Computer Science at Purdue University.
Dongjoon Hyun
Dongjoon Hyun is an Apache REEF PMC member and committer. He currently works at Hortonworks, focusing on Apache Spark and Apache ORC.
Eric Wohlstadter
Eric is a principal engineer at Hortonworks. He works on Hive, Tez, and Spark-Hive interoperability. His interests are in database systems and distributed query execution. Eric completed his PhD in Computer Science at the University of California, Davis.
