• Stream Processing with Apache Kafka & Apache Samza

    LinkedIn Building R (LSNR)

    Welcome: Welcome to the upcoming Stream Processing Meetup hosted by LinkedIn in Sunnyvale. This meetup focuses on Apache Kafka, Apache Samza, and related streaming technologies.

    Location: Unify Conference Room, LinkedIn Corporate HQ in Sunnyvale. We will be on the 1st floor of 950 W Maude Ave, Sunnyvale, CA 94085.

    Agenda:

    6 PM: Doors open

    6-6:35 PM: Networking

    6:35-7:10 PM: Apache Samza 1.0: Recent Advances and our plans for future in Stream Processing
    Speaker: Prateek Maheshwari, LinkedIn
    Apache Samza has reached a major milestone with its recent 1.0 release. In this talk, we step back and take stock of the major new features and enhancements in Samza 1.0. We also take a sneak peek at what's next on our roadmap. Both stream processing veterans and developers new to stream processing will discover useful new features to leverage in their applications.

    7:15-7:50 PM: How and why we moved away from Kafka Mirror Maker to Brooklin - LinkedIn's story
    Speaker: Shun-ping Chiu, LinkedIn
    For several years, LinkedIn has been using Kafka MirrorMaker as the mirroring solution for copying data between Kafka clusters across data centers. However, as LinkedIn data continued to grow, mirroring trillions of Kafka messages per day across data centers uncovered scale limitations and operability challenges. To address these issues, we have developed a new mirroring solution built on top of our stream ingestion service, Brooklin. Brooklin Mirror Maker (BMM) aims to provide improved performance and stability while facilitating better management through finer control of data pipelines. In this talk, we will dive deeper into the challenges LinkedIn has faced with Kafka MirrorMaker, how BMM is designed to tackle these problems, and our plans for iterating further on it.

    7:55-8:30 PM: Puma - Stream Processing in Facebook
    Speaker: Rithin Shetty, Facebook
    In this talk, we’ll discuss ‘Puma’, a stream processing service at Facebook. Puma, developed internally at Facebook, is a mature stream processing system that has been in production for over 7 years. Users author their stream processing applications in a SQL-like declarative language called Puma Query Language (PQL). Puma is used by hundreds of teams across Facebook for their stream processing needs. Thanks to its familiar SQL-like syntax and support for a rich testing environment, application development is simple and fast. The user only needs to focus on the business logic while Puma takes care of the rest (provisioning jobs, handling variations in input traffic, load balancing, DR events, etc.). Puma serves a wide range of use cases including accelerating batch pipelines, analyzing user behavior on Facebook, ingesting to various sinks, machine learning, etc. The talk will go into the overall Puma architecture and its main building blocks. It’ll also touch upon the SLA model and our learnings from running a stream processing service at scale.

    8:30-9 PM: Additional networking and Q&A

    RSVP: Please RSVP *only* if you plan to attend in person. Our facility can host 250 guests.

    Parking: You can park in the uncovered parking that is along the building or in the parking garage located next to the building.

    NDA: You will need to sign a standard NDA when you enter the lobby.

    Food & Drink: Food & drink will be provided.

    Can’t join us in person?: Join us online - https://primetime.bluejeans.com/a2m/live-event/wgexhshj

    Want to talk at a future meetup?: Please contact us via the “Contact” button in meetup.com.

  • Stream Processing with Apache Kafka & Apache Samza

    LinkedIn Building R (LSNR)

    Welcome: Welcome to the upcoming Stream Processing Meetup hosted by LinkedIn in Sunnyvale. This meetup focuses on Apache Kafka, Apache Samza, and related streaming technologies.

    Location: Unify Conference Room, LinkedIn Corporate HQ in Sunnyvale. We will be on the 1st floor of 950 W Maude Ave, Sunnyvale, CA 94085.

    Agenda:

    6 PM: Doors open

    6-6:35 PM: Networking

    6:35-7:10 PM: How LinkedIn navigates Streams Infrastructure using Cruise Control (Speaker: Efe Gencer, LinkedIn)
    We’ll share our work and experiences towards alleviating the management overhead of large-scale Kafka clusters using Cruise Control at LinkedIn. The talk will consist of two parts. The first part will provide an overview of Cruise Control, including the operational challenges that it solves, its high-level architecture, and some evaluation results from real-world scenarios. The second part will go through a hands-on tutorial to demonstrate how we can manage a real Kafka cluster using Cruise Control.

    7:15-7:50 PM: Stream Analytics Manager (Speaker: Sriharsha Chintalapani, Uber)
    Stream Analytics Manager provides a simplified UI for building complex big data applications. It lets the end user not only build but also deploy and monitor streaming applications. It offers pluggable interfaces for user-supplied business logic through custom processors and UDFs. Streamline’s main goal is to let developers build, deploy, manage, and monitor streaming applications easily, in minutes. In this talk we will go through how we can add other engines like Flink, Spark, and Airflow into Streamline and allow users to build both batch and streaming applications.

    7:55-8:30 PM: Operating Samza at LinkedIn (Speakers: Abhishek Shivanna, Stephan Soileau, LinkedIn)
    Operating Samza at LinkedIn, which processes around a trillion messages a day across several thousand jobs, is a daunting task. This talk will go over the best practices of running Samza as a managed service and will take a look at how SREs at LinkedIn use intelligent automation to operate at LinkedIn scale.

    8:30-9 PM: Additional networking and Q&A

    RSVP: Please RSVP *only* if you plan to attend in person. Our facility can host 250 guests.

    Parking: You can park in the uncovered parking that is along the building or in the parking garage located next to the building.

    NDA: You will need to sign a standard NDA when you enter the lobby.

    Food & Drink: Food & drink will be provided.

    Can’t join us in person?: Join the live video stream! https://primetime.bluejeans.com/a2m/live-event/ekbhwdpw

    Want to talk at a future meetup?: Please contact us via the “Contact” button in meetup.com.

  • Stream Processing with Apache Kafka & Apache Samza (July 2018)

    LinkedIn Building F (LSNF)

    Welcome to the upcoming Stream Processing Meetup hosted by LinkedIn! This event focuses on Apache Kafka, Apache Samza, and related streaming technologies. We will be hosting the actual event at our Sunnyvale office, and we will also host a "viewing party" in San Francisco.

    LOCATION:
    Main Event - Yosemite Conference Room, LinkedIn Corporate HQ in Sunnyvale. 2nd floor of 605 W Maude Ave, Sunnyvale, CA. (Capacity for 200)
    Viewing Party - Lotta’s Fountain Conference Room, LinkedIn in San Francisco at 222 2nd Street, San Francisco, CA. (Capacity for 70)

    AGENDA:

    6 PM: Doors open

    6-6:35 PM: Networking & Welcome

    6:35-7:10 PM: Beam me up Samza: How we built a Samza Runner for Apache Beam (Xinyu Liu, LinkedIn)
    Apache Beam provides an easy-to-use and powerful model for state-of-the-art stream and batch processing, portability across a variety of languages, and the ability to converge offline and nearline data processing. At LinkedIn, we have developed a Samza Runner to leverage the cutting-edge features of Beam. This runner combines the large-scale stream processing capabilities and first-class state support in Samza with the advancements in Beam data processing. In this talk, we will discuss the Beam API and its implementation in Samza, and the benefits of the Beam Runner to the Samza and Beam communities.

    7:15-7:50 PM: uReplicator: Uber Engineering’s Scalable Robust Kafka Replicator (Hongliang Xu, Uber)
    At Uber, we operate 20+ Kafka clusters to collect system and application logs as well as event data from rider and driver apps. We need a Kafka replication solution to replicate data between Kafka clusters across multiple data centers for different purposes. This talk will introduce the history behind uReplicator and its high-level architecture. As the original uReplicator ran into scalability challenges and operational overhead as the scale of our Kafka clusters increased, we built the Federated uReplicator, which addresses these issues and provides an extensible architecture for further scaling.

    7:55-8:30 PM: Concourse - Near real time notifications platform at LinkedIn (Ajith Muralidharan & Vivek Nelamangala, LinkedIn)
    Concourse is LinkedIn’s first near-real-time targeting and scoring platform for notifications. In this talk, we will provide an in-depth overview of the design and discuss various scaling optimizations. We'll explain how Concourse can score millions of notifications per second while supporting the use of feature-rich machine learning models based on terabytes of feature data.

    RSVP: Please RSVP *only* if you plan to attend in person. Our facility can host ~200 guests in Sunnyvale and ~70 guests in San Francisco.

    Parking: You can park in the uncovered parking that is along the building or in the parking garage located next to the building.

    NDA: You will need to sign a standard NDA when you enter the lobby.

    Food & Drink: Food (pizza, wings) & drink (water, beer, wine) will be provided.

    Live Stream: https://primetime.bluejeans.com/a2m/live-event/pawhxrsd

    Want to talk at a future meetup?: Please contact us via the “Contact” button in meetup.com.

  • Stream Processing with Apache Kafka & Apache Samza

    LinkedIn Building R (LSNR)

    Welcome: Welcome to the upcoming Stream Processing Meetup hosted by LinkedIn in Sunnyvale. This meetup focuses on Apache Kafka, Apache Samza, and related streaming technologies.

    Location: Unify Conference Room, LinkedIn Corporate HQ in Sunnyvale. We will be on the 1st floor of 950 W Maude Ave, Sunnyvale, CA 94085.

    Agenda:

    6 PM: Doors open

    6-6:35 PM: Networking & Welcome

    6:35-7:10 PM: Apache Pulsar - The next generation messaging system (Karthik Ramasamy, Co-Founder at Streamlio)
    This talk introduces Apache Pulsar, a durable, distributed messaging system underpinned by Apache BookKeeper, a streaming storage system. It was originally developed at Yahoo, open sourced in November 2016, and is incubating at Apache. Apache Pulsar introduces a segment-centric architecture that provides durability, separation of storage and serving, and low publish latency. It incorporates several enterprise-grade features for multi-tenancy, geo-replication, support for different delivery semantics, and a unified messaging model for queuing and streaming. In this talk, Karthik will discuss the Apache Pulsar architecture and how it decreases the complexity of development and operations.

    7:15-7:50 PM: Conquering the Lambda architecture in LinkedIn metrics platform with Apache Calcite and Apache Samza (Khai Tran, Staff Software Engineer, LinkedIn)
    Metrics play an important role in data-driven companies like LinkedIn, where we leverage them extensively for reporting, experimentation, and in-product applications. We built an offline platform to help people define and produce metrics driven through their transformation code, mostly in Pig or Hive, and metadata-rich configurations. Many of our users would like to look at these metrics in real time. To support this, we recently built an extension to the platform that auto-generates a Samza real-time flow from existing offline transformation code with just a single command. Combined with the existing offline platform, this delivers a Lambda architecture without maintaining multiple code bases. In this talk, we will describe how we use Apache Calcite to translate our offline logic, which serves as the single source of truth, into both Samza code and configuration for real-time execution.

    7:55-8:30 PM: Building Venice with Apache Kafka & Samza (Gaojie Liu, Senior Software Engineer, LinkedIn)
    Over the last two years at LinkedIn, we have been working on a distributed key-value store called Venice, which specializes in serving the datasets computed in Hadoop and Samza. Venice "Hybrid Stores" can ingest data from both Hadoop and Samza and internally combine it, thus offering first-class support for lambda architectures. In this talk, we will share how we built Venice by leveraging Kafka and how it empowers new Samza use cases at LinkedIn.

    RSVP: Please RSVP *only* if you plan to attend in person. Our facility can host 200 guests.

    Parking: You can park in the uncovered parking that is along the building or in the parking garage located next to the building.

    NDA: You will need to sign a standard NDA when you enter the lobby.

    Food & Drink: Food & drink will be provided.

    Can’t join us in person?: Live Stream is available here: https://primetime.bluejeans.com/a2m/live-event/vbawkkue

    Want to talk at a future meetup?: Please contact us via the “Contact” button in meetup.com.

  • Stream Processing with Apache Kafka & Apache Samza

    LinkedIn Building F (LSNF)

    Welcome: Welcome to the upcoming Stream Processing Meetup hosted by LinkedIn in Sunnyvale. This meetup focuses on Apache Kafka, Apache Samza, and related streaming technologies.

    Location: Yosemite Conference Room, LinkedIn Corporate HQ in Sunnyvale. We will be on the 2nd floor of 605 W Maude Ave, Sunnyvale, CA 94085.

    Agenda:

    6 PM: Doors open

    6-6:35 PM: Networking & Welcome

    6:35-7:10 PM: Stream processing using Samza-SQL@LinkedIn (Srinivasulu Punuru, LinkedIn)
    Imagine if you could develop and run a stream processing job in a few minutes, and imagine if the vast majority of your organization (business analysts, product managers, data scientists) could do this on their own without needing a development team. The need for real-time insights into big data is increasing at a rapid pace. The traditional Java-based model of developing, deploying, and managing stream processing applications is becoming a huge constraint. With Samza SQL we can simplify application development by enabling users to create stream processing applications and get real-time insights into their business using SQL. In this talk, we try to answer the following questions:
      • How can the SQL language be used to perform stream processing?
      • How is Samza SQL implemented (architecture)?
      • How can you deploy Samza SQL in your company?

    7:15-7:50 PM: Streaming data pipelines @ Slack (Ananth Packkildurai, Slack)
    Slack is a communication and collaboration platform for teams. Our millions of users spend 10+ hrs connected to the service on a typical working day. They expect reliability, low latency, and extraordinarily rich client experiences across a wide variety of devices and network conditions. It is crucial for developers to get real-time insights into Slack operational metrics. In this talk, I will describe how our data platform evolved from a batch system to near real-time. I will also touch on how Samza helps us build low-latency data pipelines and our experimentation framework.

    7:55-8:30 PM: Improving Kafka at-least-once performance (Ying Zheng, Uber)
    At Uber, we are seeing an increasing demand for Kafka at-least-once delivery. So far, we have been running a dedicated at-least-once Kafka cluster with special settings. With a very low workload, the dedicated at-least-once cluster has been working well for more than a year. When we wanted to allow at-least-once producing on the regular Kafka clusters, producing performance became a concern. We spent some effort on this issue in recent months and managed to reduce at-least-once producer latency by about 80% with code changes and configuration tuning. Most of these improvements also help increase Kafka throughput and reduce Kafka end-to-end latency in general, not only for at-least-once.

    RSVP: Please RSVP *only* if you plan to attend in person. Our facility can host 200 guests.

    Parking: You can park in the uncovered parking that is along the building or in the parking garage located next to the building.

    NDA: You will need to sign a standard NDA when you enter the lobby.

    Food & Drink: Food & drink will be provided.

    Can’t join us in person?: Live Stream will be available here: https://primetime.bluejeans.com/a2m/live-event/ezeuvzqd

    Want to talk at a future meetup?: Please contact us via the “Contact” button in meetup.com.

  • Stream Processing with Apache Kafka & Apache Samza

    LinkedIn 5th Floor

    Welcome: Welcome to the upcoming Stream Processing Meetup hosted by LinkedIn in Sunnyvale. This meetup focuses on Apache Kafka, Apache Samza, and related streaming technologies.

    Location: Our new Corporate HQ in Sunnyvale. We will be on the 5th floor of 580 Mary.

    Agenda:

    6 PM: Doors open

    6-6:35 PM: Networking & Welcome

    6:35-7:10 PM: Real-time Indexing of LinkedIn’s Economic Graph (Almog Gavra, LinkedIn)
    In this presentation, we will cover the basics of LinkedIn’s Search Engine indexing pipeline, focusing on how we leverage Kafka and Samza to ingest over 10K events per second of real-time updates. Furthermore, we will examine how we made the system both flexible and horizontally scalable; our pipeline accepts different input system streams and supports both full and partial document updates, but remains agnostic to the type of document (e.g. member profile, job posting or company page).

    7:15-7:50 PM: Samza at Redfin: Using Streaming to Help Home Buyers and Sellers (Brian Hanks, Redfin)
    Redfin sends millions of notifications per day to our customers to help them buy and sell homes. In a hot market, customers who learn about new homes first have an advantage, and we want to be faster than any of our competitors. I'll talk about how we developed a streaming system based on Samza to provide a low latency, resilient, horizontally scalable, high throughput system to send notifications to our customers. I'll also speak about some of the challenges we have combining data from multiple sources, how we use some Samza features (such as local store) in unusual ways, some other ways Samza is being used at Redfin, and suggest some features that we'd like to see in Samza.

    7:55-8:30 PM: Kafka Controller Internals (Onur Karaman, LinkedIn)
    The Kafka controller plays a critical role in the functioning of a Kafka cluster. It is responsible for broker coordination, topic creation, partition reassignments, and more. We will deep-dive into the controller's internals, protocols, best practices on controller operations, monitoring, as well as some recent enhancements.

    RSVP: Please RSVP *only* if you plan to attend in person. Our facility can host 300 guests.

    Parking & Entrance: You can park in the uncovered parking that is along the building or in the parking garage located behind the building. There is also street parking available for overflow.

    NDA: You will need to sign a standard NDA when you enter the lobby.

    Food & Drink: Food & drink will be provided.

    Can’t join us live?: Live Stream will be available here: https://primetime.bluejeans.com/a2m/live-event/hhzcgqqj

    Want to talk at a future meetup?: Please contact us via the “Contact” button in meetup.com.

  • Stream Processing with Apache Kafka & Apache Samza

    LinkedIn Building R (LSNR)

    Welcome: Welcome to the May 2017 Stream Processing Meetup hosted by LinkedIn in Sunnyvale. This meetup focuses on Apache Kafka, Apache Samza, and related streaming technologies.

    Location: Our new Corporate HQ in Sunnyvale. We will be in a 300-person auditorium named Unity at 950 W Maude Ave in Sunnyvale.

    Agenda:

    6 PM: Doors open

    6-6:35 PM: Networking & Welcome

    6:35-7:10 PM: Streaming Data Pipelines with Brooklin (Samarth Shetty, LinkedIn)
    In recent years, data and streaming applications have grown by leaps and bounds, and streaming data quickly and reliably from the storage layer to streaming applications has become a non-trivial problem. Building one-off data pipelines that serve the requirements of every application and dataset combination is not sustainable. At LinkedIn, we’ve developed a system called Brooklin to create data pipelines connecting streaming data sources (i.e. Kafka, EventHubs, Change-Capture streams) with nearline applications. In this talk we will cover Brooklin, the problems it addresses, its design, usage, and future directions.

    7:15-7:50 PM: Kafka at Half the Price (Dong Lin, LinkedIn)
    At LinkedIn we have 1500+ machines running Kafka, which cost millions of dollars in operation and maintenance. As our cluster size increases and hardware becomes older, we observed an increasing number of double broker failures in the last year, which motivates us to increase the replication factor from 2 to 3 to keep our data available to users. However, this change in replication factor is prohibitively expensive, as it increases our hardware cost by another 50%, which means millions of dollars a year. In this talk we present our work on supporting a JBOD setup in Kafka, which allows us to save 50% of the cost, or to increase the replication factor to 3 and still save 25% of the hardware cost. We will compare JBOD with alternatives including RAID and one-broker-per-disk, explain its high-level design, and discuss possible future work to further reduce Kafka's operation cost.

    7:55-8:30 PM: Managed or stand alone, streaming or batch; Unified processing with the Samza Fluent API (Yi Pan, LinkedIn)
    Samza 0.13 improves the simplicity and portability of Samza applications. The new fluent API supports common operations like windowing, map, and join on streams. Developers can now express application logic concisely in a few lines of code and accomplish what previously required several jobs. The other exciting Samza[masked] feature is Standalone Deployment. It empowers developers to deploy and scale Samza applications as a simple embedded library, which is much more flexible than the original YARN deployment model. This talk will cover the new Fluent API and Standalone as well as batch processing, both in terms of what is available in the[masked] release and what is coming in the future.

    RSVP: Please RSVP *only* if you plan to attend in person. Our facility can host 300 guests.

    Parking & Entrance: You can park in the uncovered parking that is along 950 Maude or in the parking garage located behind the building. There is also street parking available for overflow.

    NDA: You will need to sign a standard NDA when you enter the lobby.

    Food & Drink: Food & drink will be provided.

    Can’t join us live?: Live Stream will be at https://primetime.bluejeans.com/a2m/live-event/ow53483. Recording will be posted in a few days.

    Want to talk at a future meetup?: Please contact us via the “Contact” button in meetup.com.

  • Stream Processing with Apache Kafka & Apache Samza

    LinkedIn 5th Floor

    Welcome: Welcome to the February 2016 Stream Processing Meetup hosted by LinkedIn in Sunnyvale. This meetup focuses on Apache Kafka, Apache Samza, and related streaming technologies.

    Location: This will be LinkedIn's second Stream Processing meetup at our new Corporate HQ in Sunnyvale. We have a beautiful facility on the top floor of the building full of comfortable couches and chairs.

    Agenda:

    6 PM: Doors open

    6-6:35 PM: Networking & Welcome

    6:35-7:10 PM: SSD Benchmarks for Apache Kafka (Mingmin Chen, Uber)
    At Uber, we operate 20+ Kafka clusters on commodity hardware with spinning disks. We used to run into disk IO saturation from time to time. In addition, some of our Kafka clusters are dedicated to business-critical use cases with acks=all, which require a very low latency SLA. In this talk we present our work on benchmarking SSD-based Kafka clusters and the impact on end-to-end producer latency, partition scalability, failure recovery, and so on. We will also discuss how this helps power our zero-data-loss cluster used for financial pipelines.

    7:15-7:50 PM: Asynchronous Processing and Multithreading in Apache Samza (Xinyu Liu, LinkedIn)
    With the Apache Samza 0.11 release, Samza becomes the first stream processing framework to support both asynchronous processing and parallel processing models. This is unique among current open source stream processors because not only can Samza run traditional synchronous processing in parallel on multiple threads, but it also provides first-class support for asynchronous processing. Users can now perform non-blocking I/O directly for remote data access. This new model also introduces out-of-order processing to maximize parallelism with certain semantics guaranteed. In this talk we will discuss the Samza asynchronous API and model, explore the details of the asynchronous event loop and its semantics, and finally study the performance enhancements using benchmark jobs.

    7:55-8:30 PM: Batching to Streaming Analytics at Optimizely (Vignesh Sukumar, Mike Davis, Hao Xia; Optimizely)
    At Optimizely, we are building a cutting-edge experimentation platform that ingests billions of click-stream events a day from millions of visitors for analysis. In this talk, we will highlight our transition to stream processing to provide real-time metrics on top of this event stream. We will also explain how Samza fits our needs and walk through a production-level use case of sessionization and aggregation.

    RSVP: Please RSVP *only* if you plan to attend in person. Our facility can host 200 guests.

    Parking & Entrance: You can park in the uncovered parking that is along 580 Mary Ave or in the parking garage located behind the building. There is also street parking available for overflow. You need to enter 580 Mary from the rear of the building (opposite Maude Ave).

    NDA: You will need to sign a standard NDA when you enter the lobby of 580 Mary.

    Food & Drink: Food & drink will be provided.

    Can’t join us live?: We will be live-streaming this event as well as posting recordings of the presentations. http://www.ustream.tv/channel/yBwP2uf4xFk

    Want to talk at a future meetup?: Please contact us via the “Contact” button in meetup.com.

  • Stream Processing Meetup @ LinkedIn - Wednesday, November 2 2016

    Welcome: Welcome to the November 2016 Stream Processing Meetup hosted by LinkedIn in Sunnyvale. This meetup focuses on Apache Kafka, Apache Samza, and related streaming technologies.

    New Location: LinkedIn is in the process of moving our Corporate HQ from Mountain View to Sunnyvale. This meetup will be hosted at a new location in Sunnyvale instead of the Mountain View location where previous meetups were hosted.

    Agenda:

    6 PM: Doors open

    6-6:35 PM: Networking & Welcome

    6:35-7:15 PM: Apache Samza: Past, Present, and Future (Kartik Paramasivam, LinkedIn)
    Samza was registered with the Apache Incubator in July 2013 and became a top-level project in Jan of 2015. In this presentation we will quickly walk through the journey so far. We will then spend some time on the current state of Samza and cover some of its key differentiators in the crowded field of stream processing. In closing, we will go into the details of some of the big changes that are coming next in Samza.

    7:15-7:55 PM: Cruise Control: Dynamic Workload Balancing for Kafka (Jiangjie (Becket) Qin, LinkedIn)
    At LinkedIn, we have over 80 Kafka clusters and more than 1800 Kafka brokers. When running Kafka at this scale, operation becomes critical to ensure the availability as well as the performance of the system. Kafka Cruise Control was developed to help us solve this problem by automatically managing the Kafka clusters. It handles broker failure/addition and performs dynamic workload balancing for the Kafka clusters. We will talk in detail about how it works and the challenges we have faced and solved.

    7:55-8:30 PM: General Q&A
    MC'd by Kartik Paramasivam, will likely pull in committers.

    RSVP: Please RSVP *only* if you plan to attend in person. Our facility can host 200 guests.

    Parking: You can park in the uncovered parking that surrounds 580 Mary Ave or in the nearby parking garage. There is also street parking available for overflow.

    NDA: You will need to sign a standard NDA when you enter the lobby of 605 W Maude Ave.

    Food & Drink: Food & drink will be provided.

    Can’t join us live?: We will be live-streaming this event (http://www.ustream.tv/linkedin-events) as well as posting recordings of the presentations.

    Want to talk at a future meetup?: Please contact us via the “Contact” button in meetup.com.

  • Stream Processing Meetup @ LinkedIn - Tuesday, August 23 2016

    LinkedIn (Unite Conference Room)

    Welcome to the Tuesday, August 23rd Stream Processing Meetup hosted at LinkedIn in Mountain View. This meetup focuses on Apache Kafka, Apache Samza, and related streaming technologies.

    Agenda:

    6 PM: Doors open

    6-6:30 PM: Networking

    6:30-7:05 PM: Consumer Group Internals: Rebalancing, Rebalancing, Rebalancing, Rebalancing, Jason Gustafson & Onur Karaman
    Getting data out of Kafka means working with consumer groups. In 0.9, the Kafka team introduced a new coordination protocol built on top of Kafka itself and a new consumer client which leverages it. But how does it work and how does it scale? In this talk, you will find out from two of its main developers.
    Bio: Jason Gustafson is a software engineer at Confluent Inc. who has spent the last year working on Kafka internals and the Confluent Stream Data Platform. Onur Karaman is a software engineer on the Kafka team at LinkedIn. Before LinkedIn, Onur studied computer science at UIUC.

    7:05-7:40 PM: Nearline Topic Tagging of News Articles on Samza, Eric Huang
    At LinkedIn, to provide meaningful and fresh content to our users at scale, we automatically tag news articles with the topics that they are about. We do this at global scale for each article entering the LinkedIn ecosystem within minutes, using topic models for concepts from "3M" to "Zoology" that exceed the size of the typical Samza container. In this talk, I will present a distributed architecture for our nearline topic tagger built on Samza, offline-to-online model delivery, the overarching machine learning workflow, and interesting problems and solutions we have encountered along the way.
    Bio: Eric Huang is an analytics engineer at LinkedIn, helping to build and scale LinkedIn's big data analytics and personalization platforms, enabling their products to support hundreds of millions of users worldwide. Prior to this, Eric was a scientist at Palo Alto Research Center (PARC) researching graph algorithms, automated planning, and automated data integration. Eric received his Ph.D. in Computer Science from UCLA.

    7:40-8:20 PM: How to convert a legacy Hadoop Map/Reduce ETL system to Samza Streaming, Louis Calisi
    In this presentation Louis Calisi will describe how Tripadvisor converted its legacy Hadoop Map/Reduce jobs to Samza Streaming. This system feeds thousands of tables and downstream reports. No data loss and full backwards compatibility were required.
    Bio: Louis is a Principal Software Engineer at Tripadvisor. He helps lead the architecture and development of the core ETL and reporting systems.

    RSVP: Please RSVP *only* if you plan to attend in person. Our facility can host 200 guests.

    Parking: Anywhere that you see an open spot!

    NDA: You will need to sign a standard NDA when you enter the lobby of 2025.

    Food & Drink: Food & drink will be provided.

    Can’t join us in Mountain View?: We will be live-streaming this event as well as posting recordings of the presentations. We will post the live-stream URL in this group within 1 hour of the event.

    Want to talk at a future meetup?: Please contact us via the “Contact” button in meetup.com.
