
Building a large-scale analytics platform with Storm, Kafka and Cassandra

Dear Friends!

This time, engineers from Integral Ad Science (a global ad tech company that works with brands such as Google, Admeld, AppNexus, PulsePoint, and Yahoo) and from Grid Dynamics (a global company with expertise in integrating existing Big Data frameworks into new solutions for clients in the advertising, telecom, and eCommerce industries) will share their experience building real-time, large-scale production systems with the Storm framework, Apache Kafka, and Cassandra.

At Integral, we process heavy volumes of click-stream traffic: 50K QPS of ad impressions at peak, and close to 200K QPS across all browser calls. We build analytics on these streams of data. Two applications require quite significant computational effort: sessionization and fraud detection.

Sessionization means linking a series of requests from the same browser, starting with an initial request for JavaScript and followed by subsequent pings capturing different user interactions with the page and with the ad. There can be five or more requests in total, spread over 15-30 minutes, which we need to link to each other.
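As a rough illustration of the idea (not our production code), sessionization can be sketched as keeping a per-browser window open and attaching later pings to the session started by the initial JavaScript request; the `Sessionizer` class and its field names below are hypothetical:

```python
SESSION_WINDOW_SEC = 30 * 60  # keep a session open for up to 30 minutes

class Sessionizer:
    """Link a browser's pings to the session opened by its initial request."""

    def __init__(self, window=SESSION_WINDOW_SEC):
        self.window = window
        self.open_sessions = {}  # browser_id -> (start_ts, [events])

    def accept(self, browser_id, ts, event):
        """Attach an event to an open session, or start a new one.

        Returns a session that aged out (ready for aggregation), or None.
        """
        session = self.open_sessions.get(browser_id)
        if session and ts - session[0] <= self.window:
            session[1].append(event)  # same browser, inside the window
            return None
        closed = self.open_sessions.pop(browser_id, None)
        self.open_sessions[browser_id] = (ts, [event])
        return closed

s = Sessionizer()
s.accept("b1", 0, "js")      # initial javascript request opens a session
s.accept("b1", 600, "ping")  # ping 10 minutes later joins the same session
```

A real implementation also has to expire idle sessions proactively and shard this state across workers, which is where Storm's grouping by browser id comes in.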

Fraud detection is a process that looks at various signals in browser requests, along with substantial historical evidence, to classify each ad impression as either legitimate or fraudulent.
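To make the shape of such a classifier concrete, here is a minimal, purely illustrative sketch; the signal names, weights, and threshold are all invented for the example and do not describe Integral's actual model:

```python
def classify_impression(request, history, threshold=0.5):
    """Score one impression from request signals plus historical evidence."""
    score = 0.0
    if not request.get("user_agent"):             # headless/scripted clients
        score += 0.4
    if request.get("requests_per_min", 0) > 100:  # implausible request rate
        score += 0.4
    # historical evidence: prior fraud score accumulated for this source IP
    score += history.get(request.get("ip"), 0.0)
    return "fraudulent" if score >= threshold else "legitimate"
```

The point is the structure: cheap per-request signals are combined with a lookup into accumulated history, which is exactly the part that benefits from moving to near-real-time processing.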

We've been doing both (as well as all other analytics) in batch mode, once an hour at best. Both processes, and fraud detection in particular, are time sensitive and much more meaningful if done in near real time.

This talk is about our experience migrating once-per-day offline batch processing of impression data on Hadoop to in-memory stream processing using Kafka, Storm, and Cassandra. We will touch upon our choices and the reasoning behind the products selected for this solution.

Hadoop is no longer the only, or always the preferred, option in the Big Data space. In-memory stream processing is far more effective for time-series data preparation and aggregation: real-time processing of the same data may need several times less processing power than its offline batch counterpart. The ability to scale at a significantly lower cost means more customers, better accuracy, and better business practices, and since only in-stream processing allows for low-latency data and insight delivery, it opens entirely new opportunities. However, transitioning non-trivial data pipelines to in-stream processing raises a number of questions previously hidden within the offline nature of batch processing. How will you join several data feeds? How will you implement failure recovery? In addition to handling terabytes of data per day, our streaming system has to be guided by the following considerations:

• Recovery time

• Time relativity and continuity

• Geographical distribution of data sources

• Limit on data loss

• Maintainability
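One question raised above, joining several data feeds in-stream, can be sketched as a keyed in-memory join with a bounded wait for the matching event, so the limit on data loss stays explicit. This is a simplified illustration under assumed names (`StreamJoiner`, `max_age`), not the production design:

```python
class StreamJoiner:
    """Join two feeds on a shared key (e.g. impression id), buffering
    unmatched events for a bounded time so late arrivals can still pair up."""

    def __init__(self, max_age):
        self.max_age = max_age
        self.pending = {}  # key -> (ts, event) awaiting its counterpart

    def push(self, key, ts, event):
        """Emit a joined pair if the counterpart already arrived."""
        other = self.pending.pop(key, None)
        if other is not None:
            return (other[1], event)  # matched: emit the joined record
        self.pending[key] = (ts, event)
        return None

    def expire(self, now):
        """Drop events whose partner never arrived (the bounded data loss)."""
        stale = [k for k, (ts, _) in self.pending.items()
                 if now - ts > self.max_age]
        for k in stale:
            del self.pending[k]
        return stale
```

The `max_age` bound is where several of the considerations meet: it caps memory and recovery time, but also defines how much data you are willing to lose when feeds drift apart.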

The system produces complex cross-correlation analysis of several data feeds and aggregations for client analytics, with input feed rates of up to 100K msg/sec. One of the more challenging problems we had to solve was joining sequential requests from the same browser session. For such processes, a window of up to 30 minutes has to be kept open in anticipation of session-closing calls.

While most of the processing and aggregation is done in memory, some of the more granular aggregates have to be stored in a persistent key-value layer.
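The split between in-memory and persistent aggregates can be illustrated as periodically flushing in-memory counters to a key-value client. Here the persistent layer is stubbed with a plain dict; in the stack described above it would be Cassandra, and `AggregateStore` and `flush_every` are names invented for this sketch:

```python
class AggregateStore:
    """Accumulate counters in memory, flushing increments to a k/v layer."""

    def __init__(self, kv, flush_every=1000):
        self.kv = kv                  # persistent key-value layer (stubbed)
        self.flush_every = flush_every
        self.counts = {}              # in-memory deltas since last flush
        self.seen = 0

    def add(self, key, n=1):
        self.counts[key] = self.counts.get(key, 0) + n
        self.seen += 1
        if self.seen >= self.flush_every:
            self.flush()

    def flush(self):
        for key, n in self.counts.items():
            # persist the delta; with Cassandra this maps naturally onto
            # a counter-column update rather than read-modify-write
            self.kv[key] = self.kv.get(key, 0) + n
        self.counts.clear()
        self.seen = 0
```

Batching the writes keeps the hot path in memory while bounding how much aggregate data is lost if a worker fails between flushes.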

This talk will benefit anyone interested in learning an alternative approach to big data analytics, especially the process of joining multiple streams in memory with Cassandra as the persistent layer. The presentation will also highlight certain optimization patterns used in the implementation that can be useful tricks in similar situations.

About Presenters:

Alexey Kharlamov is a Solution Architect with more than 15 years of experience in the software industry. Alexey specializes in big data and low-latency systems in the advertising, finance, and eCommerce industries. He has worked for large institutions and startups across the globe, including Integral Ad Science and Bank of America.

Kiril Tsemekhman leads data science and data infrastructure development at Integral Ad Science, an online advertising analytics and data provider. Kiril has designed and led the implementation of several large-scale systems, all created to solve complex data science and analytics problems on terabytes of data in the online advertising space. He has also developed and applied machine learning algorithms to solve these problems. Kiril holds a Ph.D. in theoretical physics from the University of Washington.

Rahul Ratnakar is a Data Engineer with more than 10 years of experience in the software industry. He specializes in designing and optimizing low-latency, high-volume systems. Rahul has worked in various capacities for companies including Integral Ad Science, EA, Ingersoll Rand, Bobcat, and Tavant.

As usual, we will have a book raffle from our sponsors, plus pizza and drinks.

Please share this event with your social networks on LinkedIn, Twitter, and Facebook.



  • Alexey K.

    Thank you for all your time and attention. I really enjoyed the conversation. Here is the slide deck.

    Any questions are welcome.

    2 · November 24, 2013

  • Mauricio V.

    Great session, definitely appreciated the level of detail in discussing the solution, challenges, design choices, and performance testing results. Will the presentation be made available to attendees?

    November 22, 2013

    • James Q

      I second that question, I hope the slides and / or video will be uploaded :)

      November 22, 2013

  • Larry T.

Awesome group. For a first meetup, the topic was very specific on how to combine technologies, with interesting insight on the limitations of modeling analytic workloads in the AWS cloud, and excellent handling of a broad range of questions. Thanks to Eugene, WebMD, and all the sponsors for the location, books, pizza, etc. Appreciated the wide range of companies looking for personnel -- what job shortage!?! All the best. Larry

    November 21, 2013

  • Mark G.

    Awesome discussion, thanks for hosting!

    November 21, 2013

  • Suhas V.

    Can't attend due to sickness. Are slides going to be available?

    November 21, 2013

  • A former member

    Will be there.

    November 18, 2013

    • Eugene D.

      Hi Noel, Please email me your full name for building security.

      November 19, 2013

  • Otis G.

This is a great topic, and some of the issues mentioned here are ones we are solving at Sematext, too. I'll try to attend, but in case I don't make it, I'm wondering if the talk will be recorded?

    November 8, 2013

    • Eugene D.

Hi Otis, It would be great if you can attend in person. I don't have the proper equipment for video recording and editing, so if anybody can help with this, we would all appreciate it.

      November 9, 2013
