This time, engineers from Integral Ad Science, a global ad tech company that works with brands like Google, Admeld, AppNexus, PulsePoint, and Yahoo, and from Grid Dynamics, a global company with expertise in integrating existing Big Data frameworks into new solutions for clients in the advertising, telecom, and eCommerce industries, will share their experience building real-time, large-scale production systems with the Storm framework, Apache Kafka, and Cassandra.
At Integral, we process heavy volumes of click-stream traffic: 50K QPS of ad impressions at peak, and close to 200K QPS of all browser calls. We build analytics on these streams of data. Two applications require quite significant computational effort: 'sessionization' and fraud detection.
Fraud detection is a process that looks at various signals in browser requests, together with substantial historical evidence data, to classify each ad impression as either legitimate or fraudulent.
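The shape of that classification step can be sketched as follows. This is purely illustrative: the actual signals, weights, and thresholds used at Integral are not described in this announcement, so the ones below are made up for the example.

```python
# Illustrative sketch of combining per-request signals with a historical
# evidence score to label an impression. Signal names, weights, and the
# threshold are hypothetical placeholders, not Integral's real model.

def classify_impression(request_signals, history_score, threshold=0.5):
    """Return 'fraudulent' or 'legitimate' for one ad impression."""
    score = history_score  # prior evidence accumulated for this source
    if request_signals.get("headless_ua"):     # hypothetical signal
        score += 0.4
    if request_signals.get("datacenter_ip"):   # hypothetical signal
        score += 0.3
    return "fraudulent" if score >= threshold else "legitimate"
```

In near real time, the same function runs per request as it streams in, instead of over an hourly batch.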
We've been doing both (as well as all our other analytics) in batch mode, once an hour at best. Both processes, and fraud detection in particular, are time-sensitive and much more meaningful if done in near real time.
This talk is about our experience migrating from once-per-day offline batch processing of impression data using Hadoop to in-memory stream processing using Kafka, Storm, and Cassandra. We will touch upon our choices and our reasoning for selecting the products used in this solution.
Hadoop is no longer the only, or always the preferred, option in the Big Data space. In-memory stream processing is far more effective for time-series data preparation and aggregation: real-time processing of the same data may need several times less processing power than its offline batch counterpart. The ability to scale at a significantly lower cost means more customers, better accuracy, and better business practices; since only in-stream processing allows for low-latency data and insight delivery, it opens entirely new opportunities. However, transitioning non-trivial data pipelines to in-stream processing raises a number of questions previously hidden within the offline nature of batch processing. How will you join several data feeds? How will you implement failure recovery? In addition to handling terabytes of data per day, our streaming system has to be guided by the following considerations:
• Recovery time
• Time relativity and continuity
• Geographical distribution of data sources
• Limit on data loss
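One way to reason about the last two considerations together is offset checkpointing: commit the input position only after messages are processed, so a restart replays at most the uncommitted tail, trading a bounded amount of duplicate work for a hard limit on data loss. The sketch below is a minimal, hypothetical illustration of that idea (not the production code; Kafka's actual consumer API differs):

```python
# Minimal sketch of at-least-once recovery via offset checkpointing.
# A crash loses no data: on restart, processing resumes from the last
# committed offset and replays at most `commit_every` messages.

class CheckpointedConsumer:
    def __init__(self, source, process, commit_every=100):
        self.source = source            # iterable of (offset, message)
        self.process = process          # user-supplied handler
        self.commit_every = commit_every
        self.committed_offset = -1      # durable checkpoint (simulated)

    def run(self):
        for offset, message in self.source:
            if offset <= self.committed_offset:
                continue                # already processed before a crash
            self.process(message)
            if (offset + 1) % self.commit_every == 0:
                self.committed_offset = offset  # persist the checkpoint
```

Shorter recovery time argues for frequent commits; higher throughput argues for batching them, which is exactly the trade-off the bullet list is pointing at.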
The system produces complex cross-correlation analysis of several data feeds, plus aggregation for client analytics, with input feed rates of up to 100K msg/sec. One of the more challenging problems we had to solve was joining sequential requests from the same browser session. For such processes, a window of up to 30 minutes has to be kept open in anticipation of session-closing calls.
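The core of that session join can be sketched as a keyed in-memory buffer with a 30-minute expiry. This is a simplified, hypothetical illustration of the windowing pattern, not the actual Storm topology:

```python
import time

SESSION_WINDOW_SEC = 30 * 60  # window kept open awaiting the closing call

class SessionJoiner:
    """Buffer browser requests by session key; emit the joined session
    when a closing call arrives, or expire it after the window elapses."""

    def __init__(self, window=SESSION_WINDOW_SEC, clock=time.time):
        self.window = window
        self.clock = clock        # injectable clock, for testing
        self.open = {}            # session_id -> (first_seen, [events])

    def on_event(self, session_id, event, closing=False):
        _, events = self.open.setdefault(session_id, (self.clock(), []))
        events.append(event)
        if closing:
            # Session completed normally: emit all joined events.
            return self.open.pop(session_id)[1]
        return None

    def expire(self):
        """Flush sessions whose window has elapsed without a closing call."""
        now = self.clock()
        stale = [sid for sid, (t0, _) in self.open.items()
                 if now - t0 >= self.window]
        return {sid: self.open.pop(sid)[1] for sid in stale}
```

At 100K msg/sec with a 30-minute window, the buffer can hold millions of open sessions, which is why memory footprint and state recovery dominate the design.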
While most of the processing and aggregation is done in memory, some of the more granular aggregates have to be stored in a persistent key-value layer.
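A common pattern for this split is write-behind: accumulate deltas in memory and periodically merge them into the persistent store. In the sketch below, a plain dict stands in for the key-value layer (Cassandra in the talk), and the names and threshold are illustrative assumptions:

```python
from collections import Counter

class AggregateStore:
    """Accumulate granular counters in memory; flush merged deltas to a
    persistent key-value store once a threshold of pending updates is hit."""

    def __init__(self, kv_store, flush_threshold=1000):
        self.kv = kv_store              # persistent key-value layer (stand-in)
        self.pending = Counter()        # in-memory deltas not yet flushed
        self.flush_threshold = flush_threshold

    def increment(self, key, n=1):
        self.pending[key] += n
        if sum(self.pending.values()) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Merge the buffered deltas into the persistent layer, then reset.
        for key, delta in self.pending.items():
            self.kv[key] = self.kv.get(key, 0) + delta
        self.pending.clear()
```

Batching writes this way cuts the per-message round trips to the store, at the cost of the buffered deltas being lost on a crash, so the flush threshold interacts directly with the "limit on data loss" consideration above.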
This talk will benefit anyone interested in learning an alternative approach to big data analytics, especially the process of joining multiple streams in memory with Cassandra. The presentation will also highlight certain optimization patterns used in the implementation that can be useful tricks to apply in similar situations.
Alexey Kharlamov is a Solution Architect with more than 15 years of experience in the software industry. Alexey specializes in big data and low-latency systems in the advertising, finance, and eCommerce industries. He has worked for large institutions and startups across the globe, including Integral Ad Science, Bank of America, and Macys.com.
Kiril Tsemekhman leads data science and data infrastructure development at Integral Ad Science, an online advertising analytics and data provider. Kiril has designed and led the implementation of several large-scale systems, all created to solve complex data science and analytics problems on terabytes of data in the online advertising space. He has also developed and applied machine learning algorithms to solve these problems. Kiril holds a Ph.D. in theoretical physics from the University of Washington.
Rahul Ratnakar is a Data Engineer with more than 10 years of experience in the software industry. He specializes in designing and optimizing low-latency, high-volume systems. Rahul has worked in various capacities for companies including Integral Ad Science, EA, Ingersoll Rand, Bobcat, and Tavant.
As usual, we will have a book raffle from our sponsors, plus pizza and drinks.
Please share this event with your network on LinkedIn, Twitter, and Facebook.