Open Source Tools for Data Science
Details
For many of us, Open Source is an integral part of our lives. We are excited to host an event focused on several Open Source tools used in data-driven products.
We are happy to have five speakers: Woo Jae Jung, Principal Data Scientist at Pivotal; Timothy Farkas, back end engineer at DataTorrent; Moon Soo Lee, co-founder of Apache Zeppelin and CTO at NFLabs; Daria Mehra (https://www.linkedin.com/in/dmehra), Director of Quality at Jut; and Alexy Khrabrov (https://www.linkedin.com/in/chiefscientist), Chief Scientist at Nitro. The event will consist of four longer talks by Woo, Timothy, Moon, and Alexy, and one lightning talk by Daria.
This is the first time that Pivotal Labs (http://pivotal.io) is hosting us. We are grateful to Pivotal for offering us their venue and sponsoring this event.
Schedule
6:30 - 7:00pm Social (food + drinks)
7:00 - 8:20pm Talks
8:20 - 8:30pm Social
Talks:
Talk by Timothy Farkas (20 minutes): Apache Apex for Dimensions Computation
DataTorrent has implemented a scalable aggregation engine called Dimensions Computation, which allows statistics like Average and Variance to be computed over time, for different dimensions, and visualized in real time for enormous streaming data sets. The engine can be applied to a number of areas ranging from the processing of sensor data to digital advertising. This talk will cover the key concepts behind Dimensions Computation and outline how to apply Dimensions Computation to Digital Advertising.
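As a rough illustration of the idea (not DataTorrent's implementation and not the Apache Apex API), the Python sketch below groups events into time buckets by a chosen set of dimensions and keeps a running average and variance per group. The event fields (timestamp, advertiser, location, cost), the RunningStats class, and the aggregate function are all hypothetical names chosen for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Incremental count, mean, and variance using Welford's method."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def add(self, value: float) -> None:
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        # Population variance; zero when no values have been seen.
        return self.m2 / self.count if self.count else 0.0


def aggregate(events, dimensions, bucket_seconds=60):
    """Aggregate the hypothetical 'cost' field per time bucket and dimension combination."""
    stats = defaultdict(RunningStats)
    for event in events:
        # Round the timestamp down to the start of its time bucket.
        bucket = event["timestamp"] // bucket_seconds * bucket_seconds
        key = (bucket,) + tuple(event[d] for d in dimensions)
        stats[key].add(event["cost"])
    return dict(stats)


# Made-up advertising events, for illustration only.
events = [
    {"timestamp": 0, "advertiser": "a", "location": "SF", "cost": 1.0},
    {"timestamp": 30, "advertiser": "a", "location": "NY", "cost": 3.0},
    {"timestamp": 70, "advertiser": "a", "location": "SF", "cost": 5.0},
    {"timestamp": 10, "advertiser": "b", "location": "SF", "cost": 2.0},
]

by_advertiser = aggregate(events, ["advertiser"])
by_advertiser_and_location = aggregate(events, ["advertiser", "location"])
```

Because each group only stores a count, a mean, and a sum of squared deviations, events can be folded in one at a time as they arrive, which is what makes this style of aggregation a natural fit for streaming data.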
Lightning talk by Daria Mehra (5 minutes): Last mile analytics with Juttle - a dataflow language for everyone
Daria will present Juttle, an analytics system and language for developers built on stream processing. It enables analytics-driven visualization in your application. Juttle gives you an agile way to query, analyze, and visualize live and historical data from many different big data backends or web services.
Talk by Moon Soo Lee (20 minutes): Apache Zeppelin for Data Science
In his talk, Moon will describe how Apache Zeppelin improves the data science lifecycle and shed light on the project's future roadmap.
Talk by Woo Jae Jung (20 minutes): Intro to Open Source Data Science Tools @ Pivotal
In his talk, Woo will introduce open source tools that the Pivotal Data Science team uses in its work. One of them is MADlib (https://github.com/madlib/madlib), an open-source library for scalable machine learning on databases and Hadoop-based platforms. If time allows, he will also introduce PivotalR (https://cran.r-project.org/web/packages/PivotalR/index.html), an R interface for big data analytics, and applications of Procedural Language extensions such as PL/R (http://www.joeconway.com/plr/) in large data settings.
Talk by Alexy Khrabrov (15 minutes): Apache Spark
Apache Spark is transforming data science and data engineering. This brief overview of Apache Spark comes from two angles. Spark can be seen as:
• a DSL written in Scala, the JVM language of choice for modern data pipelines
• a pivotal component of these data pipelines, going full-stack from the API to actionable insights.
Bios
Timothy Farkas has a BS in Electrical Engineering and a BS in Mathematical Sciences from Carnegie Mellon University. He built a drag-and-drop script editor for social scientists at CMU and a machine data analysis system for Oracle, and is currently a back end engineer at DataTorrent. In his free time he likes to bike and work on computer graphics projects.
Daria Mehra (https://www.linkedin.com/in/dmehra) is now Director of Quality at Jut, Inc., a data analytics startup, but prefers the title "Bug Huntress". A winner of Testathon 2014, she has done development and QA for distributed systems, data storage and analytics. Her favorite programming language is Juttle. When not debugging, she likes to read books on paper.
Moon Soo Lee is a creator of Apache Zeppelin (incubating) and Co-Founder and CTO at NFLabs. For the past few years he has been working on bootstrapping the Zeppelin project and its community. His recent focus is growing the Zeppelin community and driving adoption.
Woo Jae Jung is currently a Principal Data Scientist at Pivotal. He is also a contributor to the MADlib open source project. Woo came to Pivotal with a background in industrial applications of both humble and advanced inferential statistics. He is focused on delivering a wide range of data science projects at Pivotal, including engagements in the retail, digital, telco, and energy verticals. Woo is passionate about the adoption and usability of advanced analytics tools in the Big & Fast Data ecosystem, including the interoperability of R with the Pivotal platform. He was previously Senior Statistician at the Bay Area startup M-Factor (now IBM), where he built and delivered demand analysis solutions powered by Bayesian hierarchical models. He holds an MSc in Statistics from Stanford and a BSc from Cornell.
Alexy Khrabrov is the Chief Scientist at Nitro, as well as the founder and organizer of the SF Scala, SF Spark, SF Text, and Reactive Systems meetups in San Francisco. Alexy also started and runs the By the Bay series of conferences, including Scala By the Bay, Big Data Scala, and Text By the Bay, with the latter expanding into a four-day, six-conference sequence called Data By the Bay: data.bythebay.io (http://data.bythebay.io/), May 17-20, San Francisco.
Extra information:
Apache Apex is an open source stream and batch processing platform. It is used within the GE Predix (IoT cloud platform) solution, where it handles ingestion of time series data.
Apache Zeppelin is a web-based notebook server that helps data scientists with data exploration and visualization. Zeppelin supports Spark as one of its backends; other backends, such as Hive, Markdown, and D3, are also available.
