Exponea is a full-stack, omni-channel, real-time marketing cloud. At Exponea, we build a wide range of practical AI applications, from predictions and recommendations to simple simulated annealing. Whatever the application, each one needs data, and a lot of it, which Exponea can provide efficiently.
A major issue when building any AI application or ML model is data preprocessing. It becomes especially hard when you need to process high-volume datasets or high-velocity data streams. We build such data pipelines mostly with Spark (or, more precisely, PySpark) and Python, though we have adopted many other tools as well.
In this talk we will go through the steps we took to build such pipelines. We will show you how to get Spark running easily and cover basic data wrangling with PySpark and Spark Streaming. Finally, we will use our data pipeline in a real application and close with the joys and sorrows of resource management.
About the speaker:
Matus Cimerman
1+ year in data science at Exponea; previously a BI intern, among other roles, at Orange.
Finishing a master's degree at FIIT STU; thesis: Data stream analysis.