
SPARKling Analytics by Ravi Nair

Hosted By
Katie B.

Details

PLEASE NOTE THE CHANGE OF DAY. This year, we're back to Tuesday nights instead of Wednesday nights. To kick off the new year, we're welcoming back Ravi Nair (Hadoopean) to talk about Spark. Hope to see a great crowd!

Apache® Spark™ is an open-source cluster computing framework that uses in-memory processing to speed up analytic applications by up to 100 times compared to other technologies on the market today. Developed in the AMPLab at UC Berkeley, Apache Spark can help reduce data interaction complexity, increase processing speed and enhance mission-critical applications with deep intelligence. JaxBigData is bringing this latest trend to Jacksonville and the surrounding area for the first time.

Highly versatile in many environments, Apache Spark is known for its ease of use in creating algorithms that harness insight from complex data. Spark was elevated to a top-level Apache project in 2014 and continues to expand today.

Ravi Nair, CTO of Percipient (http://www.percipientcx.com/), will elaborate on the amazing power of Spark computing, showing how it is suited for interactive, batch and real-time applications. He will dive deep into the architecture, unpacking RDDs, which act as the core of Spark. With many examples and demos, Ravi will be speaking on:

Spark Core

Spark Core contains the basic Spark functionality required for running jobs and needed by the other components. The most important of these is the RDD, or resilient distributed dataset, which is the main element of the Spark API. It is an abstraction of a distributed collection of items with operations and transformations applicable to the dataset. It is resilient because it is capable of rebuilding datasets in case of node failures. Spark Core also provides a means of sharing information between computing nodes with broadcast variables and accumulators. Other fundamental functions, like networking, security, scheduling and data shuffling, are also part of Spark Core.
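To make the "resilient" part concrete, here is a minimal pure-Python sketch of the idea, not Spark's actual API. The class name ToyRDD and its methods (including lose_partition) are hypothetical and exist only for illustration. The sketch assumes that the source data can always be re-read, and that a lost partition can be rebuilt by replaying the recorded transformations over it.

```python
from functools import reduce


class ToyRDD:
    """Toy stand-in for an RDD: partitions plus the recipe (lineage) to rebuild them."""

    def __init__(self, source_partitions, lineage=()):
        # Original input, assumed to be re-readable (for example from disk).
        self._source = source_partitions
        # Ordered per-partition functions recorded by transformations.
        self._lineage = tuple(lineage)
        # Computed partitions; dropping an entry simulates a node failure.
        self._cache = {}

    def map(self, fn):
        # Transformations are lazy: they only extend the lineage.
        step = lambda part: [fn(x) for x in part]
        return ToyRDD(self._source, self._lineage + (step,))

    def filter(self, pred):
        step = lambda part: [x for x in part if pred(x)]
        return ToyRDD(self._source, self._lineage + (step,))

    def _compute(self, index):
        # Rebuild one partition by replaying the lineage over its source data.
        part = list(self._source[index])
        for step in self._lineage:
            part = step(part)
        return part

    def partition(self, index):
        if index not in self._cache:
            self._cache[index] = self._compute(index)
        return self._cache[index]

    def lose_partition(self, index):
        # Hypothetical helper: pretend the node holding this partition died.
        self._cache.pop(index, None)

    def collect(self):
        # Action: materialize every partition and concatenate the results.
        return [x for i in range(len(self._source)) for x in self.partition(i)]

    def reduce(self, fn):
        return reduce(fn, self.collect())


numbers = ToyRDD([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

before = even_squares.collect()
even_squares.lose_partition(1)   # simulate a node failure
after = even_squares.collect()   # partition 1 is recomputed from its lineage
```

In this toy, only the lost partition is recomputed; the other two are served from the cache. Real Spark tracks lineage in a far richer way, so treat this purely as a mental model.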

Spark SQL

Spark SQL is the newest Spark component and is very actively developed. It provides functions for manipulating large sets of distributed, structured data using SQL (actually, an SQL subset supported by Spark) and the Hive query language (HQL). Spark SQL can also be used for querying JSON data as well as for writing and reading Parquet files, an increasingly popular file format that stores the schema along with the data. It provides a query optimization framework called Catalyst, which can be extended with custom optimization rules, and includes a Thrift server, which external systems such as BI tools can use to query data through Spark SQL over the classic JDBC and ODBC protocols.

Spark Streaming

Spark Streaming is a framework for ingesting real-time streaming data from various sources. The supported sources include HDFS, Kafka, Flume, Twitter, ZeroMQ and custom ones. Its operations recover from failure automatically, which is, of course, very important for online data processing. Spark Streaming can be combined with other Spark components in a single program, unifying real-time processing with machine learning, SQL and graph operations, which is something not seen in the Hadoop ecosystem.

Spark GraphX - Intro

Graphs are data structures composed of vertices and the edges connecting them. GraphX provides functions for building graphs and implementations of the most important graph theory algorithms, such as PageRank, connected components, shortest paths, SVD++ and others. It also provides a Pregel message-passing API, the same API for large-scale graph processing implemented by Apache Giraph, a project that implements graph algorithms and runs on Hadoop.
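As a rough illustration of the Pregel message-passing style mentioned above, the sketch below computes connected components in plain Python. It is not the GraphX or Giraph API; the function name pregel_connected_components and the max_supersteps parameter are hypothetical. The assumed model is the usual one: in each superstep a vertex reads the messages sent to it in the previous superstep, updates its own state, and sends messages to its neighbors only if its state changed.

```python
def pregel_connected_components(vertices, edges, max_supersteps=20):
    """Label each vertex with the smallest vertex id in its component."""
    # Build an undirected adjacency list.
    neighbors = {v: set() for v in vertices}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Initial vertex state: every vertex is its own component.
    label = {v: v for v in vertices}

    # Superstep 0: every vertex sends its label to its neighbors.
    inbox = {v: [] for v in vertices}
    for v in vertices:
        for n in neighbors[v]:
            inbox[n].append(label[v])

    for _ in range(max_supersteps):
        outbox = {v: [] for v in vertices}
        any_message = False
        for v in vertices:
            if not inbox[v]:
                continue  # no incoming messages: the vertex stays inactive
            smallest = min(inbox[v])
            if smallest < label[v]:
                # State changed, so notify the neighbors in the next superstep.
                label[v] = smallest
                for n in neighbors[v]:
                    outbox[n].append(smallest)
                    any_message = True
        if not any_message:
            break  # no vertex sent anything: the computation has converged
        inbox = outbox
    return label


components = pregel_connected_components(
    vertices=[1, 2, 3, 4, 5, 6],
    edges=[(1, 2), (2, 3), (4, 5)],
)
```

Here vertices 1, 2 and 3 end up sharing one label, 4 and 5 share another, and the isolated vertex 6 keeps its own. A distributed engine would run the per-vertex step in parallel across the cluster; this single-process loop only shows the control flow.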

Spark MLlib - Intro

Spark MLlib is a library of machine learning algorithms that grew out of the MLbase project at UC Berkeley. Supported algorithms include logistic regression, naive Bayes classification, SVM, decision trees, random forests, linear regression, k-means clustering and others.

As usual, it'll be a fun group with pizza, beer, and ping pong! Tentatively, we'll arrive at 5:30 for pizza, beer, and socializing; Ravi will start talking around 6:30, and we'll have time for questions at 8:00.

AI Connect Jax
Ignite
6 East Bay Street, 4th Floor · Jacksonville, FL