The Data Scientists' Guide to Apache Spark


Details
Data scientists are finding themselves working with increasingly large and complex data in their day-to-day work. The standard toolset of a data scientist, however, has not evolved to meet this need. There is currently a divide between the tools of engineers (such as Java and Hadoop), which were developed to handle production tasks, and those of data scientists (Python and R), which facilitate rapid prototyping and modeling.
While tooling for dealing with data at scale has improved considerably with the development of higher-level abstractions such as Pig, Hive, Spark, and Scalding, there hasn’t been equivalent adoption in the workflow of many data scientists. Part of this is due to awareness, and part is due to the availability of resources. Because most of these tools are written in languages that data scientists may not be comfortable with (Java, Scala), there is a perceived high barrier to entry.
This talk will teach practicing data scientists the best practices for using Spark in the context of their standard workflow. By using Spark’s APIs for Python and R to present practical applications, it lowers the barrier to entry and makes the technology much more accessible.
Prerequisites:
Intermediate
What To Bring:
Laptop
Meet Your Instructor:
Jonathan Dinu is currently the VP of Academic Excellence at Galvanize. Previously, he founded Zipfian Academy, which was recently acquired by Galvanize. He first discovered his love of all things data while studying Computer Science and Physics at UC Berkeley. In a former life, he worked for Alpine Data Labs developing distributed machine learning algorithms for predictive analytics on Hadoop.
Jonathan has always had a passion for sharing what he has learned in the most creative ways he can. At Galvanize, he gets to combine his two favorite things: humans and code. When he is not working with students, you can find him blogging about data, visualization, and education at http://hopelessoptimism.com
