The Spark-Notebook and Enterprise enabled Data Science

Hosted By
Ari V.
Details

Notebooks have become the standard way of exploring datasets. The Spark Notebook is dedicated to Spark and the Scala language. In the first part of the talk, we run a hands-on example showing data processing with H2O integration.

We also discuss where notebooks are headed to make Data Science work more valuable, given the proven needs for git integration and security.

The second part of the talk discusses a valuable, language-independent feature of Spark: lineage. It matters not only to Data Scientists, developers, and system engineers, but also to other parts of the business, such as governance and data privacy tracking.

About the Speaker

Xavier Tordoir started his career in academia in Experimental Physics, where he focused on data processing. He took part in projects in finance, genomics, and software development for academic research. He worked on modelling massively interacting systems using machine learning, and also developed solutions to manage and process data distributed across data centres. He then worked as a consultant for Data Science and Spark in banking and IoT, and created training material for Spark, genomics use cases, and the O’Reilly Distributed Pipeline training. He is co-founder of Kensu, a company dedicated to Data Science Governance.

Instructions for Lab

If you'd like to follow along with the presenter, here are the instructions to set up the Spark Notebook on your PC/Mac:
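Before starting, it may help to confirm that the command-line tools used in the steps below are available. The following is a minimal sketch, not part of the Spark Notebook distribution: `missing_tools` is a hypothetical helper name, and the tool list is an assumption based on the commands in these instructions plus `java`, which the Spark Notebook needs because it runs on the JVM.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with the Spark Notebook):
# prints each command name from its arguments that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Tools used by option a (precompiled build).
# For option b (compile from sources), check git, sbt and java instead.
missing=$(missing_tools wget tar git java)
if [ -n "$missing" ]; then
  echo "Missing tools:" $missing
else
  echo "All tools for the precompiled build were found."
fi
```

If anything is reported as missing, install it with your system's package manager before continuing.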
## Install the Spark Notebook and the example notebooks

### a. Precompiled build:
The easiest option is to install from a precompiled distribution (Scala 2.11, Spark 2.0.2):
wget --no-check-certificate https://s3.eu-central-1.amazonaws.com/spark-notebook/tgz/spark-notebook-0.7.0-scala-2.11.8-spark-2.0.2-hadoop-2.7.2.tgz

tar -xzf spark-notebook-0.7.0-scala-2.11.8-spark-2.0.2-hadoop-2.7.2.tgz
ln -s spark-notebook-0.7.0-scala-2.11.8-spark-2.0.2-hadoop-2.7.2 spark-notebook
cd spark-notebook

git clone https://github.com/kensuio/public-notebooks.git
mv public-notebooks/* notebooks/.

## launch the spark-notebook server

bin/spark-notebook -Dhttp.port=9000
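The long archive name appears three times in the precompiled steps above, which makes typos easy. As a hedged sketch, the name can be assembled once from its version components. The variable names here are illustrative assumptions, and only the version combination given in these instructions is known to exist at this URL.

```shell
#!/bin/sh
# Version components copied from the archive name used in option a.
# Other combinations are an assumption and may not be published.
SN_VERSION=0.7.0
SCALA_VERSION=2.11.8
SPARK_VERSION=2.0.2
HADOOP_VERSION=2.7.2

DIST="spark-notebook-${SN_VERSION}-scala-${SCALA_VERSION}-spark-${SPARK_VERSION}-hadoop-${HADOOP_VERSION}"
BASE_URL="https://s3.eu-central-1.amazonaws.com/spark-notebook/tgz"

# Print the commands instead of running them, so they can be reviewed first.
echo "wget --no-check-certificate ${BASE_URL}/${DIST}.tgz"
echo "tar -xzf ${DIST}.tgz"
echo "ln -s ${DIST} spark-notebook"
```

The printed lines match the download, extract and symlink commands listed above and can be pasted into a terminal once checked.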

### b. Compile from sources
The second option is to compile from the sources. You need sbt installed (see http://www.scala-sbt.org).

git clone https://github.com/andypetrella/spark-notebook.git
cd spark-notebook

git clone https://github.com/kensuio/public-notebooks.git
mv public-notebooks/* notebooks/.

## launch the spark-notebook server

## (first run will download all dependencies from repositories)
sbt -Dscala.version=2.11.8 -Dspark.version=2.0.2 -Dhttp.port=9000 run

## Open the Spark Notebook
After installing and starting the server, open the Spark Notebook in your browser:
http://localhost:9000/tree/H2O
Then open a specific notebook:
[http://localhost:9000/notebooks/H2O/Chicago-Crime.snb](http://localhost:9000/notebooks/H2O/Chicago-Crime.snb)

Houston Spark Meetup
Microsoft Houston
750 Town and Country Blvd., Suite 1000 (top floor) · Houston, TX