As companies work to gain insight from ever-increasing amounts of data, data platform practitioners need tools which can scale along with the data. Early big data solutions in the Hadoop ecosystem assumed that data sizes overwhelmed available memory, emphasizing heavy disk usage to coordinate work between nodes. As the cost of memory decreases and the amount of memory available per server increases, we see a shift in the makeup of big data systems, emphasizing heavy memory usage instead of disk. Apache Spark, which focuses on memory-intensive operations, has taken advantage of this hardware shift to become the dominant solution for problems requiring distributed data. In this talk, we will take an introductory look at Apache Spark. We will review where it fits in the Hadoop ecosystem, cover how to get started and some of the basic functional programming concepts needed to understand Spark, and see examples of how we can use Spark to solve issues like calculating PageRank and analyzing large data sets.
Kevin Feasel is a Data Platform MVP and Engineering Manager of the Predictive Analytics team at ChannelAdvisor, where he specializes in T-SQL and R development, fighting with Kafka, and pulling rabbits out of hats on demand. He is the lead contributor to Curated SQL (https://curatedsql.com), a contributing author to Tribal SQL (http://www.tribalsql.com), and one of the contributors behind We Speak Linux (https://wespeaklinux.com). A resident of Durham, North Carolina, he can be found cycling the trails along the triangle whenever the weather's nice enough.