• Bay Area Druid Meetup @ Pinterest

    Pinterest HQ [505 Building]

    *** Notes ***
    Please register with your full name and email for NDA purposes. IDs will be checked at the front door.

    *** Presentations ***

    Talk 1: Pinterest: Our Journey to Operationalizing Druid at Scale
    Speaker(s): Filip Jaros + colleagues
    Abstract: We will present the unique challenges of migrating our ads metrics store, backed by HBase, to a new stack on top of Druid. First, we will discuss why we chose Druid, highlighting the features that were impossible to implement with the old solution and that unlocked opportunities to bring deeper analytical capabilities to our ads platform. Next, we will take you through the long journey it took to change our access patterns and productionize Druid in our ecosystem:
    - Development of our custom ingestion process on top of Apache Spark
    - Namespacing of segments to allow ingestion from multiple pipelines
    - A metadata service to short-circuit queries for non-existent data
    - The choice of native vs. SQL querying (see the short query example after this listing)
    - The effort of tuning access patterns from HBase-friendly to Druid-friendly
    Bio: Filip Jaros is a software engineer at Pinterest. Starting his career in VoIP technologies, he has worked in the ads space for the last five years, focusing on developing scalable ETL pipelines, designing database schemas, and building data retrieval services. Outside of the office, he is passionate about learning foreign languages, computer game design philosophy, and nutrition science.
    ------------------------------------------------------------------------------------------------------
    Talk 2: Rethinking Druid's user experience
    Speaker: Vadim Ogievetsky
    Abstract: Apache Druid has always been fast, powerful, and scalable, but it has never been "user friendly" from a UX perspective. This talk will examine how the Druid UX is being redesigned from the ground up. This will make Druid straightforward to get started with, load data into, and manage at scale.
    Bio: Vadim Ogievetsky, co-founder of Imply, cares about two things: making huge datasets accessible to everyone, and making highly complex distributed systems easier to manage. Previously Vadim led the Application team at Metamarkets (acquired by Snap). He holds an MS in Computer Science from Stanford and a BA in Mathematics and Computer Science from Oxford.
    ------------------------------------------------------------------------------------------------------
    Talk 3: Druid Roadmap Discussion
    Speaker: Gian Merlino
    Abstract: We will talk about Druid news, including details about the latest roadmap and releases.
    Bio: Gian is a co-founder of Imply, a San Francisco based technology company, and a committer on Apache Druid. Previously, Gian led the data ingestion team at Metamarkets (now a part of Snapchat) and held senior engineering positions at Yahoo. He holds a BS in Computer Science from Caltech.
    ------------------------------------------------------------------------------------------------------
    *** Schedule ***
    6:00 - 6:25 -- People shuffle in, get food and beverage, and talk
    6:25 - 6:30 -- "Hi, Welcome to Druid Meetup Group" talk and introduction
    6:30 - 7:30 -- Talk 1 + Q&A
    7:30 - 8:00 -- Talk 2 + Q&A
    8:00 - 8:15 -- Druid roadmap discussion
    8:15 - 9:00 -- Networking, exit
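
    For anyone curious about the "native vs. SQL querying" point in Talk 1, here is a minimal sketch (not from the talk) of how the same question can be asked of a Druid Broker in both styles over its HTTP API. The Broker address, datasource name (ads_metrics), and column names are invented for illustration.

        import requests

        BROKER = "http://localhost:8082"  # hypothetical Broker address

        # Native query: a JSON object POSTed to the Broker's /druid/v2/ endpoint.
        native_query = {
            "queryType": "timeseries",
            "dataSource": "ads_metrics",  # hypothetical datasource
            "granularity": "hour",
            "intervals": ["2019-06-01/2019-06-02"],
            "aggregations": [
                {"type": "longSum", "name": "impressions", "fieldName": "impressions"}
            ],
        }
        native_result = requests.post(f"{BROKER}/druid/v2/", json=native_query).json()

        # SQL query: the same aggregation expressed through the /druid/v2/sql/ endpoint.
        sql_query = {
            "query": """
                SELECT TIME_FLOOR(__time, 'PT1H') AS hour_bucket,
                       SUM(impressions) AS impressions
                FROM ads_metrics
                WHERE __time >= TIMESTAMP '2019-06-01'
                  AND __time <  TIMESTAMP '2019-06-02'
                GROUP BY 1
                ORDER BY 1
            """
        }
        sql_result = requests.post(f"{BROKER}/druid/v2/sql/", json=sql_query).json()

    Native queries expose engine features directly as JSON, while the SQL endpoint is generally easier to hand to analysts and existing tooling; choosing between them is the kind of tradeoff the talk describes.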

  • Rethinking Druid's user experience (@ Big Data and Cloud Meetup)

    1st Talk:
    TITLE: Combining Large Scale Data Processing and Low Latency on the Cloud with Spark + Druid
    SPEAKER: Minesh Patel
    SPEAKER BIOGRAPHY: Minesh Patel is a seasoned enterprise software professional with years of customer-facing experience. He excels at establishing solid relationships with customers and partners across a variety of business verticals. Minesh has architected and been hands-on with some of the largest big data implementations in the cloud.
    ABSTRACT: Apache Spark is one of the most widely used cluster computing frameworks. Spark is used to process tons of data for different workloads, including batch/streaming ingestion and ETL, ad hoc analytics, and data science. Apache Druid focuses on enabling low latency queries for OLAP workloads. Combining these technologies on a cloud substrate allows for massive scalability with managed cost and faster time to delivery.

    2nd Talk:
    TITLE: Rethinking Druid's user experience
    SPEAKER: Vadim Ogievetsky, Co-Founder and CPO of Imply
    LinkedIn: https://www.linkedin.com/in/vogievetsky/
    ABSTRACT: Apache Druid has always been fast, powerful, and scalable, but it has never been "user-friendly" from a UX perspective. This talk will examine how the Druid UX is being redesigned from the ground up. This will make Druid straightforward to get started with, load data into, and manage at scale.
    SPEAKER BIOGRAPHY: Vadim Ogievetsky cares about two things: making huge datasets accessible to everyone, and making highly complex distributed systems easier to manage. Previously Vadim led the Application team at Metamarkets (acquired by Snap). He holds an MS in Computer Science from Stanford and a BA in Mathematics and Computer Science from Oxford.

    SPONSOR: Imply.io

  • SF Big Analytics Meetup Group (Co-op event): Apache Druid and YuniKorn

    Agenda:
    6:00 pm -- 6:30 pm: Check-in + Networking
    6:30 pm -- 7:20 pm: Talk 1 (Cloudera)
    7:20 pm -- 8:10 pm: Talk 2 (Imply)
    8:30 pm -- 9:00 pm: Networking
    9:00 pm -- Closing

    Talk 1: YuniKorn: A Universal Resource Scheduler for both Kubernetes and YARN
    Abstract: We will talk about our open source work, the YuniKorn scheduler project (Y for YARN, K for K8s, uni- for Unified). YARN schedulers are optimized for high-throughput, multi-tenant batch workloads: they can scale up to 50k nodes per cluster and schedule 20k containers per second. Kubernetes schedulers, on the other hand, are optimized for long-running services, but many features like hierarchical queues, fair resource sharing, and preemption are either missing or not yet mature. Underneath, however, both are responsible for the same job: making resource allocation decisions. We see the need to run services on YARN as well as jobs on Kubernetes. This motivated us to create a universal scheduler that can work for both YARN and Kubernetes and be configured in the same way. The YuniKorn scheduler brings long-wanted features such as hierarchical queues, fairness between users/jobs/queues, and preemption to Kubernetes, and it brings service scheduling enhancements to YARN. Most importantly, it lets YARN and Kubernetes share the same user experience for scheduling big data workloads, and any improvement to the scheduler benefits both communities. In this talk, we will cover our efforts to design and implement the YuniKorn scheduler, which we have integrated with both YARN and Kubernetes, and we will show demos and best practices.
    Speaker: Wangda Tan (Apache Hadoop PMC, Cloudera)
    Wangda is a Project Management Committee (PMC) member of Apache Hadoop and Sr. Engineering Manager of the computation platform team at Cloudera. He manages all efforts related to Kubernetes and YARN for both on-cloud and on-prem use cases at Cloudera. His primary areas of interest are the YuniKorn scheduler (scheduling containers across YARN and Kubernetes) and the Hadoop Submarine project (running deep learning workloads across YARN and Kubernetes). He has also led features like resource scheduling, GPU isolation, node labeling, and resource preemption in the Hadoop YARN community. Before joining Cloudera, he worked at Pivotal on integrating OpenMPI/GraphLab with Hadoop YARN. Before that, he worked at Alibaba cloud computing, where he participated in creating a large-scale machine learning, matrix, and statistics computation platform using Map-Reduce and MPI.

    Talk 2: Swimming in the Data River
    Abstract: The dirty secret of most "streaming analytics" technologies is that they are just stream processors: they sit on a stream and continuously compute the results of a particular query. They're good for alerting, keeping a dashboard up-to-date in real time, and streaming ETL, but they're not good at powering apps that give you true insight into what is happening: for this, you need the ability to explore, slice/dice, drill down, and search into the data. This talk will cover the current state of the streaming analytics world and what Apache Druid, a real-time analytical database, brings to the table.
    Speaker: Gian Merlino (Imply)
    Gian is a co-founder and CTO of Imply, a San Francisco based technology company. Gian is also one of the main committers of Druid. Previously, Gian led the data ingestion team at Metamarkets and held senior engineering positions at Yahoo. He holds a BS in Computer Science from Caltech.

  • Apache Druid Bay Area Meetup @ Intuit

    Intuit Building 20

    *** Location ***
    Our July Bay Area Druid meetup will be hosted by Intuit in Mountain View. Parking is available in the multi-story parking garage to the left of Building 20. Press the call button at the gate and security will buzz you into the parking garage. The reception desk at Building 20 will have pre-printed name badges available. The meetup itself will be on the 4th floor.

    *** RSVP early ***
    Please RSVP by July 3 to ensure a spot at the event! After July 3, we may need to switch from open signups to a wait list. Please also double-check that your Meetup account has your full name, which building security will need in order to verify attendees.

    *** Presentations ***

    Talk 1: MiloSage: parallel data processing using SageMaker
    Abstract: MiloSage is an innovative homegrown solution that uses SageMaker in an unconventional way, splitting the data into chunks to train the model in parallel with a shorter turnaround. The framework allows any model to be trained in a container and configured to scale. Come and see!
    Speakers:
    Kevin Geraghty, Principal Software Engineer at Intuit, leverages big data to deliver insights and personalized experiences to customers. Passionate about all things data!
    Andrew Conegliano is a data engineer who traded the cold winters of NJ for the earthquakes of CA. He likes to solve complex customer problems using Spark and AWS, and to work on new platforms that help solve current and future data and ML instrumentation challenges. When not banging his head because of AWS misconfigurations, he's either binging a TV series, at the gym, or watching YouTube videos of track cars.
    ------------------------------------------------------------------------------------------------------
    Talk 2: Rethinking Druid's User Experience
    Abstract: Apache Druid has always been fast, powerful, and scalable, but it has never been "user friendly" from a UX perspective. This talk will examine how the Druid UX is being redesigned from the ground up. This will make Druid straightforward to get started with, load data into, and manage at scale.
    Speaker: Vadim Ogievetsky, co-founder of Imply, cares about two things: making huge datasets accessible to everyone, and making highly complex distributed systems easier to manage. Previously Vadim led the Application team at Metamarkets (acquired by Snap). He holds an MS in Computer Science from Stanford and a BA in Mathematics and Computer Science from Oxford.
    ------------------------------------------------------------------------------------------------------
    Talk 3: Druid Roadmap Discussion
    Abstract: We will talk about Druid news, including details about the latest roadmap and releases.
    Speaker: Gian Merlino, co-founder of Imply, a San Francisco based technology company, and a committer on Apache Druid. Previously, Gian led the data ingestion team at Metamarkets (now a part of Snapchat) and held senior engineering positions at Yahoo. He holds a BS in Computer Science from Caltech.
    ------------------------------------------------------------------------------------------------------
    *** Schedule ***
    6:30 - 6:50 -- People shuffle in, get food and beverage, and talk
    6:50 - 7:15 -- First speaker, Q&A
    7:15 - 7:45 -- Second speaker, Q&A
    7:45 - 8:15 -- Druid roadmap discussion, wrap up

  • Apache Druid Bay Area Meetup @ Unity Technologies

    Unity Technologies

    *** DATE CHANGE ***
    Please note that the date of this meetup has changed. Unfortunately, because of scheduling conflicts we have to postpone the meetup until April.

    *** Presentations ***

    Talk 1: High cardinality aggregations using Spark and Druid
    Abstract: Unity's monetization business generates billions of in-game events in a multi-sided marketplace, which creates complexity, slowness, and overhead for reporting. To work around these issues, Unity deploys a Kafka, Spark, and Druid based ingestion and aggregation pipeline. In this talk, we'll discuss how Unity has built a high-cardinality data cube that allows various business, product, and engineering groups to evaluate and act on the same data source.
    Speaker: Mehdi Asefi, Unity Technologies
    Mehdi Asefi completed his MS and PhD at the University of Waterloo, Canada, in Electrical and Computer Engineering. He has worked on a range of problems in big data, machine learning, and data science at startups, midsize, and large companies. Prior to joining Unity Technologies, Mehdi was part of the personalization team at Yahoo, building machine learning pipelines for the Yahoo home page. Currently, Mehdi is the technical lead for the ads monetization data team, designing and building the pipelines that feed data science model training and business reports.
    ------------------------------------------------------------------------------------------------------
    Talk 2: Druid Ecosystem at Yahoo
    Abstract: Flurry Analytics enables you to measure and analyze activity across your app portfolio to answer your hardest questions and optimize your app experience. On a typical day, over 100B events stream into the system, with over 1M companies using Flurry. In this session, we will talk about how we have developed an ecosystem with Druid at its core along with Kafka, Airflow, Superset, and Hive. We will also discuss ingesting data into Druid, querying your data, and monitoring and tuning Druid to work best for you.
    Speaker: Niketh Sabbineni, Oath
    Niketh is a Principal Engineer at Yahoo/Oath. In his current role, he works with Flurry and helps build the infrastructure and applications required to support analytics at petabyte scale. Niketh was previously the CTO at Bookpad, which was acquired by Yahoo in 2014. He holds a BTech in Computer Science from IIT.
    ------------------------------------------------------------------------------------------------------
    Talk 3: Druid Roadmap Discussion
    Abstract: Gian will talk about Druid news, including details about the latest roadmap and releases.
    Speaker: Gian Merlino
    Gian is an Apache Druid (incubating) PMC member and a co-founder of Imply. Previously, Gian led the data ingestion team at Metamarkets and held senior engineering positions at Yahoo. He holds a BS in Computer Science from Caltech.
    ------------------------------------------------------------------------------------------------------
    *** Schedule ***
    6:30 - 7:00 -- People shuffle in, get food and beverage, and talk
    7:00 - 7:05 -- "Hi, Welcome to Druid Meetup Group" talk and introduction
    7:05 - 7:25 -- First speaker
    7:25 - 7:30 -- Q/A
    7:30 - 7:50 -- Second speaker
    7:50 - 7:55 -- Q/A
    7:55 - 8:15 -- Druid roadmap discussion
    8:15 - 8:20 -- Q/A

  • Druid Bay Area Meetup @ Netflix

    Netflix

    *** Notes ***
    Please RSVP on this meetup page as well as at https://druidmeetup.splashthat.com/

    *** Presentations ***
    Netflix is excited to host the Druid User meetup group. Druid is used in many different projects at Netflix, and we would love to share our experience of using Druid and how it has helped us provide the best streaming experience to our users through a series of lightning talks. We look forward to seeing you!

    Talk 1: Scaling & Sketch Strings: How Netflix analyzes 160B daily customer actions to improve application performance
    Speakers: Matt Herman, Vivek Pasari
    Summary: Learn how Netflix uses Druid to analyze billions of daily actions on hundreds of different device types across the world to improve our customers' Application Quality of Experience. We will discuss data ingestion, querying, and the visualization decisions made to empower data-driven decision making without overwhelming business stakeholders.

    Talk 2: Druid Deployment and Use Cases @ Netflix
    Speaker: Samarth Jain

    Talk 3: Druid Roadmap
    Speaker: Gian Merlino
    ------------------------------------------------------------------------------------------------------
    *** Schedule ***
    6:00 - 7:00 pm: Welcome, Registration, Light Bites & Networking
    7:00 - 8:00 pm: Lightning Talks
    8:00 - 9:00 pm: Dessert & Additional Networking

  • Druid Bay Area Meetup @ Lyft

    Lyft

    *** Notes ***
    We are hosting this meetup together with the SF Big Analytics meetup group (https://www.meetup.com/SF-Big-Analytics/events/252678379/). Lyft is asking all attendees to register for the event on ti.to (free) before arriving at the event. After registration, an eNDA will be sent to you, and after you sign it, a badge will be pre-printed for you when you arrive at the event. If for some reason you are not able to sign the eNDA online, you can still attend; however, you may have to wait in line to sign in at the front desk.
    Ti.to link: https://ti.to/big-data/big-data-systems-for-operational-analytics/with/vryrkr9c46c

    *** Presentations ***

    Talk 1: The rise of operational analytic data stores
    Abstract: Operational analytic data stores are a new, emerging class of databases that merges ideas from logsearch systems (Elastic, Splunk, etc.) and traditional analytic databases (Vertica, Teradata, etc.). Popular open source projects in this class include Apache Druid (incubating), ClickHouse (from Yandex), Pinot (from LinkedIn), Palo (from Baidu), and more. We will discuss the motivation behind these databases, and discuss in detail the history, architecture, and future of Druid.
    Speaker: Gian Merlino
    Gian is an Apache Druid (incubating) PMC member and a co-founder of Imply. Previously, Gian led the data ingestion team at Metamarkets and held senior engineering positions at Yahoo. He holds a BS in Computer Science from Caltech.
    ------------------------------------------------------------------------------------------------------
    Talk 2: Data modeling tradeoffs with Druid (Lyft)
    Abstract: When dealing with an in-memory database and large volumes of data, limiting infrastructure costs can quickly become a concern. Luckily, there are many data modeling techniques and Druid functionalities that can be used to mitigate costs. Between summarization techniques, leveraging sketches, sampling data, and more, methods can be combined to achieve the desired results while staying within reasonable cost boundaries. In this talk, we'll explore how we can do more with less, and describe a methodology to limit Druid data source sizes while delivering reliable, fast analytics. (See the short code example after this listing.)
    Speaker: Maxime Beauchemin
    Maxime Beauchemin works as a Senior Software Engineer at Lyft, where he develops open source products that reduce friction and help generate insights from data. He is the creator and a lead maintainer of Apache Airflow (incubating), a data pipeline workflow engine, and Apache Superset (incubating), a data visualization platform, and is recognized as a thought leader in the data engineering field. Before Lyft, Maxime worked at Airbnb on the Analytics & Experimentation Products team. Previously, he worked at Facebook on computation frameworks powering engagement and growth analytics, on clickstream analytics at Yahoo!, and as a data warehouse architect at Ubisoft.
    ------------------------------------------------------------------------------------------------------
    Talk 3: Streaming SQL and Druid (Lyft)
    Abstract: Druid provides sub-second query latency, and Flink provides SQL on streams, allowing rich transformation and enrichment of events as they happen. In this talk we will learn how Lyft uses Flink SQL and Druid together to support real-time analytics.
    Speaker: Arup Malakar
    Arup is a Software Engineer at Lyft, working on the Data Platform team. Prior to Lyft, Arup helped build data platforms at Ooyala and Yahoo!
    ------------------------------------------------------------------------------------------------------
    *** Schedule ***
    6:00 - 6:30 pm -- Check in and settle, networking
    6:30 - 6:35 pm -- Intros
    6:35 - 7:10 pm -- Talk #1
    7:15 - 7:50 pm -- Talk #2
    7:55 - 8:30 pm -- Talk #3
    8:30 - 8:45 pm -- Wrap up
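
    As a rough, illustrative companion to the "leveraging sketches" idea in Talk 2 (this example is not from the talk): data sketches trade exact answers for a small, fixed memory footprint, which is one lever for keeping Druid data sources small. The snippet below uses the Apache DataSketches Python bindings with made-up user IDs and an arbitrarily chosen lg_k parameter.

        # pip install datasketches  (Apache DataSketches Python bindings)
        from datasketches import hll_sketch

        # Pretend these are high-cardinality user IDs flowing through an ingestion job.
        user_ids = (f"user-{i % 750_000}" for i in range(2_000_000))

        # lg_k=12 keeps the sketch to a few kilobytes while bounding the estimation error.
        sketch = hll_sketch(12)
        for uid in user_ids:
            sketch.update(uid)

        print(f"approximate distinct users: {sketch.get_estimate():,.0f}")
        print(f"bounds at 2 std devs: "
              f"{sketch.get_lower_bound(2):,.0f} - {sketch.get_upper_bound(2):,.0f}")

    Druid can maintain this same family of sketches as aggregated columns at ingestion time, so approximate distinct counts stay cheap even after heavy rollup.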

  • Druid Bay Area Meetup @ eBay

    2065 Hamilton Ave

    *** Presentations ***

    1. Druid for real-time monitoring & analytics at eBay
    Speaker: Saurabh Mehta, Engineering Manager, Monitoring Platform @ eBay
    At eBay, we are using Druid for application monitoring and analytics in near real time. We process data coming from thousands of apps, which generate hundreds of billions of events per day. Druid has helped us monitor critical events and the health of our apps in near real time, and to scale for our needs. We will present in detail how we built and scaled Druid at eBay for our use case, and will share the challenges, learnings, and our journey building on this platform.

    2. Apache Druid (incubating) and the future
    Speaker: Gian Merlino, Co-founder, Imply
    Gian will talk about Druid news, including details about the latest roadmap and an update about the upcoming migration of the project to the Apache Software Foundation.

    *** Schedule ***
    6:30 - 7:00 -- People shuffle in, get food and beverage, and talk
    7:00 - 7:05 -- "Hi, Welcome to Druid Meetup Group" talk and introduction
    7:05 - 7:35 -- First speaker
    7:35 - 7:40 -- Q/A
    7:40 - 8:10 -- Second speaker
    8:10 - 8:15 -- Q/A

    *** What to bring ***
    Government-issued photo ID

  • Druid Nov Meetup @ MZ

    MZ

    *** Parking and check-in logistics ***
    Our hosts at MZ are providing off-site parking located at 3300 El Camino Real, Palo Alto, CA 94304 (https://goo.gl/maps/hidYF4zSaeJ2), ~0.6 miles from the meetup location. There is a complimentary shuttle service from the parking lot to 1050 Page Mill Road (the MZ HQ campus). Please arrive by 6:15pm to catch the shuttle timed for the start of the meetup. Please bring a valid photo ID in order to board the shuttle, and let the driver know that you are attending the Druid meetup. If you arrive after 6:15pm and there is no shuttle immediately available, you can call [masked], extension 1111 to request another shuttle, take Uber/Lyft, or walk (0.6 miles).
    MZ has asked us to pass along the following details about check-in at their office:
    • Once onsite, you will be required to sign a Non-Disclosure Agreement (NDA). Please bring a government-issued photo ID such as a driver's license, passport, military ID, or state ID.
    • We ask attendees to stay with the Druid Meetup group and try not to wander to other areas of the building. Due to the nature of our business we manage a lot of sensitive data, and we would appreciate attendees being mindful of that.

    *** Presentations ***

    1. Druid @ MZ
    Speakers: Pushkar Priyadarshi, Igor Yurinok, Bikrant Nepune
    MZ is re-inventing how the entire world experiences data via our mobile games division MZ Games Studios, our digital marketing division Cognant, and our live data platform division Satori. We will present how we use Druid to democratize access to data across teams at MZ. We will go into detail about our journey of building this real-time data exploration platform using Druid and Superset, and the lessons we learned along the way.

    2. Measuring Slack API performance using Druid
    Speaker: Ananth Packkildurai
    Slack is a communication and collaboration platform for teams. Our millions of users spend 10+ hours connected to the service on a typical working day. They expect reliability, low latency, and extraordinarily productive client experiences across a wide variety of devices and network conditions. It is crucial for our developers to get real-time insights into Slack operational metrics. We will present how we set up Druid with autoscaling, the monitoring metrics we use to build trust with our clients, and our wishlist for Druid.

    3. Druid roadmap
    Speaker: Gian Merlino
    Gian will discuss the upcoming Druid roadmap and the latest features.

    *** Schedule ***
    6:30 - 7:00 -- People shuffle in, get food and beverage, and talk
    7:00 - 7:05 -- "Hi, Welcome to Druid Meetup Group" talk and introduction
    7:05 - 7:25 -- First speaker
    7:25 - 7:30 -- Q/A
    7:30 - 7:50 -- Second speaker
    7:50 - 7:55 -- Q/A
    7:55 - 8:15 -- Druid roadmap discussion
    8:15 - 8:20 -- Q/A

  • Druid May Meetup @ Target

    Target

    Our next Druid meetup will be held at Target (http://www.target.com/). We will be having 2 talks and a roadmap discussion.

    1. Creating real-time data pipelines for analytical workloads
    Speakers: Gurudev Karanth, Karthik Rajagopalan, Mithun Yarlagadda
    Recent advancements in streaming technologies have allowed us to solve use cases that were very challenging and expensive in the not-so-distant past. We built our own real-time data warehousing solution using purely open source technologies: Apache Kafka (message brokering), Apache Apex (stream processing), and Druid (data warehousing). Our solution (under development) reads data from different sources and makes it available for consumption in real time (in less than a few hundred milliseconds). The entire solution is built to be highly available and fault tolerant, and is monitored with self-healing capabilities. We'll walk through our experiences and challenges in this session.

    2. Automated Druid Cluster Deployment and Monitoring with Chef
    Speakers: Dan Getzke, Karthik Rajagopalan
    Dan Getzke will describe the automated deployment and monitoring of Druid clusters at Target using Chef. Cluster deployment, configuration management, monitoring, and metric gathering will be topics for this talk.

    3. Druid roadmap
    Speaker: Gian Merlino
    Gian will discuss the upcoming Druid roadmap and the latest features.

    Schedule
    6:30 - 7:00 -- People shuffle in, get food and beverage, and talk
    7:00 - 7:05 -- "Hi, Welcome to Druid Meetup Group" talk and introduction
    7:05 - 7:25 -- First speaker
    7:25 - 7:30 -- Q/A
    7:30 - 7:50 -- Second speaker
    7:50 - 7:55 -- Q/A
    7:55 - 8:15 -- Druid roadmap discussion
    8:15 - 8:20 -- Q/A
