
Apache Kafka @ Production

Hosted By
Ori D. and 2 others

Details

18:00 - 18:30: Networking, mingling & refreshments.

18:30 - 19:30: So You've Inherited Kafka….Now What?
Alon Gavra, Platform Team Lead @ AppsFlyer. YouTube Livestream: http://bit.ly/2SrZNc0

19:30 - 20:00: Handling Transient Failures in Kafka Streams.
David Ostrovsky @ Proofpoint. YouTube Livestream: http://bit.ly/2S6xel7

*** All talks are delivered in English and live-streamed via YouTube ***

First session description:
Kafka is often just a piece of the production stack that no one wants to touch, because it just works. At AppsFlyer, a mobile attribution and analysis platform that generates a constant "storm" of 70B+ events (HTTP requests) daily, Kafka sits at the core of our infrastructure.
Recently I inherited the daunting task of managing our Kafka operation and discovered a lot of technical debt that we had to pay down in order to sustain our next phase of growth. This talk will dive into how to safely migrate from outdated versions, how to earn developers' trust so that they migrate their production services, how to monitor the right metrics and build resiliency into the architecture, as well as how to plan for continued improvements through paradigms such as sleep-driven design, and much more.

Bio:
Alon Gavra has been with AppsFlyer for the past two years and today serves as the Platform Team Lead. Originally a backend developer, he transitioned to lead the real-time infrastructure team and took on the role of bringing some of the most heavily used infrastructure in AppsFlyer to the next level. A strong believer in sleep-driven design, Alon focuses mainly on stability and resiliency in building massive data ingestion and storage solutions.

Second session description:
In-order processing and strong delivery guarantees are two of Kafka Streams’ greatest strengths. However, they come with an inherent weakness: you must finish processing each message before moving to the next one in the partition. There is no built-in mechanism to retry handling a message without blocking the processing for the partition in question.
At Proofpoint we rely on Kafka to move a lot of data between dozens of different services, which call external third-party APIs, perform I/O, or do various other things that are prone to temporary failures. We care a lot about end-to-end latency, so we’re quite reluctant to implement local retry logic in every service because that would add multiple seconds to the total processing time. We had to implement our own solution to retry temporary processing failures asynchronously, without blocking the processing of subsequent messages. In this session, we’ll talk about the various considerations we had for designing an asynchronous retry mechanism, why we eventually settled on our current implementation, and whether you might do something different for your own use case.

Bio:
A software developer and architect with over 18 years of industry experience, trainer, author of multiple courses and books. Currently specializing in big data systems, NoSQL, distributed architecture and cloud computing.
Participated in designing and building dozens of large-scale distributed systems, using NoSQL databases such as Couchbase Server, Cassandra and MongoDB, and open source tools like ElasticSearch, Hadoop, Spark, Storm, Kafka and more. Hands-on experience with cloud environments, including Microsoft Azure, Amazon Web Services, and various private cloud stacks.
Certified trainer, with over 20 successful courses taught in Israel and abroad. Experience leading a team of developers, defining tasks and project goals, and managing development resources. In-depth knowledge of .NET Framework, including WPF, Win8, and ASP.NET. Extensive knowledge of database administration and programming, with MS-SQL, MySQL, and Oracle.

#ApacheKafkaIL