• Meetup at AI NextCon 2019: In-Stream data process, Data Orchestration & More

    This is a joint event with Dynamic Talks at AI NextCon (http://aisf19.xnextcon.com/), sponsored by AICamp, Grid Dynamics and Alluxio.

    Agenda:
    5:45 pm -- 6:10 pm -- Check-in
    6:10 pm -- 6:15 pm -- Intro
    6:15 pm -- 6:50 pm -- Talk 1 + QA
    6:50 pm -- 7:25 pm -- Talk 2 + QA
    7:25 pm -- 8:05 pm -- Talk 3 + QA

    Talk 1: Alluxio - Data Orchestration for Analytics and AI in the Cloud

    Data storage is migrating from the colocated model (e.g., HDFS) to a more cost-effective and scalable, but often fully disaggregated and remote, data lake model (e.g., S3). This has created a strong need for data orchestration in the cloud -- analogous to what Kubernetes does for container-based workloads -- so that data can be presented in the right layout at the right location for data applications in the cloud. Originally developed from the UC Berkeley AMPLab project "Tachyon", Alluxio (www.alluxio.io) implements the world's first open-source data orchestration system in the cloud: a unified access layer for data-driven applications in big data and ML, enabling Spark, Presto or TensorFlow to transparently access different external storage systems while actively leveraging an in-memory cache to accelerate data access. In this talk, we will present: trends and challenges in the data ecosystem in the cloud era; data engineering in the cloud with data orchestration; and use cases combining tech stacks (Presto or TensorFlow) with Alluxio on S3.

    Speakers: Haoyuan Li, Bin Fan
    Haoyuan (H.Y.) Li is the Founder and CTO of Alluxio. He co-created Alluxio (formerly Tachyon), an open source virtual distributed file system. Bin Fan is the VP of Open Source at Alluxio. Prior to Alluxio, he worked at Google building the next-generation storage infrastructure.

    Talk 2: Analytics evolution from columnar data stores to transactional data lakes

    Self-service analytics has been evolving rapidly over the last two decades. This talk will highlight some of the important innovations that have happened, what they have enabled, and what the next big challenges are going forward.
    Technologies and research areas that will be covered include: in-memory data stores, ETL, data visualization, the Grammar of Graphics, data catalogs, massively scalable dashboards, algorithm- and ML-driven data modeling and query building, Delta Lake (Databricks), Apache Spark, Apache Arrow, R, Python, HyPer, Weld, GraalVM and SQL2 (SlamData). There might also be some anecdotes about companies in the BI/Analytics field that the speaker has interacted with or worked for.

    Speaker: Jonas Lagerblad
    Jonas Lagerblad is a Sr. Architect at Alteryx. He has developed BI products that made it to the leader quadrant of the Gartner BI Magic Quadrant for three different companies: (TIBCO) Spotfire, Oracle and ClearStory Data.

    Talk 3: Building an In-Stream Data Summarization Pipeline Using Spark

    In-stream data summarization is important for many applications that deal with extreme data volumes or require low-latency analytics. Although in-stream processing frameworks have been evolving rapidly over recent years, building fault-tolerant, high-performance in-stream pipelines still represents a challenge for certain types of data summarization. In this talk, we present lessons learned from building a fault-tolerant data summarization pipeline that processes 10B+ events per day. We discuss the core Spark/Cassandra-based architecture, failure recovery design, deployment approach, and monitoring components. We also discuss techniques for challenging cases of data summarization, such as counting distinct elements and finding the most frequent elements in a data stream.

    Speakers: Max Martynov, Ilya Katsov
    Max Martynov is the VP of Technology at Grid Dynamics, leading High Performance Computing practices for enterprises. Ilya Katsov manages the Industrial AI Consulting Practice at Grid Dynamics. He is the author of several scientific articles and international patents, and also authored the book "Introduction to Algorithmic Marketing: Artificial Intelligence for Marketing Operations" (2017).
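    The two hard summarization cases named in Talk 3 are usually handled with small sketch data structures rather than exact counters. As an illustrative sketch only (the talk's actual Spark/Cassandra implementation is not shown here), the plain-Python functions below estimate distinct counts in the style of the Flajolet-Martin/HyperLogLog family and track frequent elements with the Misra-Gries algorithm; the stream contents are invented.

```python
import hashlib

def approx_distinct(stream, num_buckets=64):
    """LogLog-style estimate: per bucket, track the maximum number of
    trailing zero bits seen among item hashes. No small-range correction,
    so the result is only meaningful for large cardinalities."""
    max_zeros = [0] * num_buckets
    for item in stream:
        h = int(hashlib.sha1(str(item).encode()).hexdigest(), 16)
        bucket = h % num_buckets
        rest = h // num_buckets
        zeros = 0
        while rest % 2 == 0 and zeros < 64:  # count trailing zeros
            zeros += 1
            rest //= 2
        max_zeros[bucket] = max(max_zeros[bucket], zeros)
    avg = sum(max_zeros) / num_buckets
    # 0.79402 is the bias-correction constant for trailing-zero counts
    return int(0.79402 * num_buckets * 2 ** avg)

def heavy_hitters(stream, k=3):
    """Misra-Gries summary with at most k-1 counters. Any element that
    occurs more than len(stream)/k times is guaranteed to survive."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            for key in list(counters):  # decrement everything instead
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

events = ["a"] * 50 + ["b"] * 30 + list(range(20))
print(heavy_hitters(events, k=3))  # {'a': 30, 'b': 10}
```

    The surviving counts are lower bounds, not exact frequencies; a second pass (or an exact counter over the survivors) recovers true counts if needed.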

  • [AICamp Event] AI developer conference 10/8-11

    Santa Clara Convention Center

    This is a partner event: our partner AICamp organized the following event. Use the discount code SFBIGANALYTICS for 10% off for your members. I have attended several conferences in the last few years and I really like this conference. http://aisf19.xnextcon.com/

    ================

    AI developer conference 10/8-11: Machine/Deep Learning, NLP, CV, ML lifecycle; 60+ tech talks and hands-on workshops.

    The annual AI NEXTCon Developer Conference is headed to San Francisco on Oct 8-11. Join us with 60+ top-notch tech lead speakers for in-depth tech talks on Computer Vision, NLP, Machine Learning, Deep Learning, AutoML, the machine learning life cycle, and more. The event is specially geared to developers, engineers, data scientists and researchers to learn the latest on AI and practical experiences in applying deep learning and machine learning in production.

    Featured Speakers:
    Michael Jordan, Distinguished Professor, UC Berkeley
    Danny Lange, VP of AI, Unity Technologies
    Lukasz Kaiser, Sr Research Scientist, Google Brain
    Anoop Deoras, Lead AI Researcher, Netflix
    Ted Way, Machine Learning Researcher, Microsoft
    Anish Sarma, Engineering Manager, Airbnb
    Cibele Halasz, Sr. Machine Learning Engineer, Twitter
    Rui Wang, AI Researcher, Uber AI
    View all 60+ speakers.

    If you are a developer looking to hone your skills, a tech lead or manager looking to learn the latest AI tech to apply to your engineering teams to innovate products and services, or someone who just wants to learn more about the AI industry that's re-shaping the tech world, AI NEXTCon is right for you. Best Price (starting at just $349) is available for a limited time. Additional 10% off for members, with code SFBIGANALYTICS. http://aisf19.xnextcon.com/

    The Organizer: AICamp, with the mission of "Make AI available to all developers", is a global online AI learning platform for developers, engineers and data scientists to learn and practice AI technology. We're one of the largest AI developer communities, with 90,000+ developers worldwide in the group and 40+ local learning groups in 40+ cities across 10+ countries.
    Online AI learning platform: https://learn.xnextcon.com
    AI Developers Conference (Seattle, NYC, San Francisco, Beijing): http://www.xnextcon.com

  • Machine Learning on Big Data (talks from Lyft, Netflix and Walmart Labs)

    In this meetup, we will focus on the art and science of doing Machine Learning on Big Data. We will have talks on best practices for ML models, and then dive deep into what a scalable ML infrastructure looks like. It's an evening not to be missed! Food and drinks sponsored by Lyft.

    Agenda:
    6:00 - 6:30 pm: Check in, food, networking
    6:30 - 6:35 pm: Intros
    6:35 - 8:30 pm: 3 Talks
    8:30 - 8:45 pm: Wrap up

    Important Note: You are required to register for the event (free) on ti.to before the event. You will then be sent an eNDA which needs to be signed 24 hours before the event, for security reasons. A badge will be pre-printed for you when you arrive at the event. Please register here (https://ti.to/big-data/machine-learning-on-big-data/with/pv-t9pxogse). If for some reason you are not able to sign the eNDA online, you can still attend; however, you may have to wait in a long line at the sign-in desk.

    Talk #1: Ridesharing - Accounting for uncertainty in dispatch decisions to optimize marketplace balance

    Dispatch is one of the most powerful levers to optimize a two-sided marketplace of physical goods, as it is able to use rider payments to reallocate supply within a network. However, uncertainty in user behavior, such as riders canceling or drivers rejecting dispatches, makes achieving perfect optimality a challenge. In this talk, Parker discusses how Lyft has accounted for uncertainty in ride-sharing networks to achieve better overall outcomes. The talk will dive into modeling challenges with sparsity and discontinuity in various ML models, preventing moral hazard in user behavior from these assumptions, and understanding the biases different model assumptions have on the overall objective.

    Speaker Bio: Parker Spielman has extensive experience in ridesharing, both at Lyft and previously at Uber, where he has worked on a variety of problems including dynamic pricing, dispatch, and incentives. All of these areas contribute to a set of levers focused on better overall control systems for real-time marketplaces.

    Talk #2: More Data Science with Less Engineering: ML Infrastructure at Netflix

    Netflix is known for its unique culture that gives an extraordinary amount of freedom and responsibility to individual engineers and data scientists. Our data scientists are expected to develop and operate large machine learning workflows autonomously. However, we do not expect all our scientists to be deeply experienced with systems or data engineering. Instead, we provide them with delightfully usable machine learning infrastructure that they can use to manage the whole lifecycle of a data science project. In this talk, we will share the key concepts that have made our ML infrastructure successful at Netflix.

    Speaker Bio: Ville Tuulos manages the machine learning infrastructure team at Netflix. Prior to Netflix, Ville designed and led ML and data infrastructure efforts at various startups and large companies in the Bay Area for over a decade, with a particular focus on human-centric tooling.

    Talk #3: Machine learning and large-scale data analysis on a centralized platform at Walmart

    In this talk, the speakers explore the design of a centralized risk and abuse management platform and how this highly sophisticated platform enables dynamic and complex analytics of large-scale data from different domains. They share a study of protecting customer accounts by linking customer behaviors across their purchases, returns, and financial services. You'll get an introduction to the Walmart risk and abuse management platform, risk and abuse problems in the Walmart ecosystem, the data-driven analytics and advanced machine learning algorithms used to defend against fraud and abuse, and case studies of customer account protection.

    Speaker Bios: James Tang is a senior director of engineering at Walmart Labs. Yiyi Zeng is a senior manager and principal data scientist at Walmart Labs. Linhong Kang is a manager and staff data scientist at Walmart Labs.

  • Apache Druid and YuniKorn: Universal Resource Scheduler for both K8s and YARN

    Sponsors: Workday (venue) + IBM (food)

    Agenda:
    6 pm -- 6:30 pm Check-in + Networking
    6:30 pm -- 7:20 pm Talk 1 (Cloudera)
    7:20 pm -- 8:10 pm Talk 2 (Imply)
    8:30 pm -- 9 pm Networking
    9 pm -- closing

    Talk 1: YuniKorn: A Universal Resource Scheduler for both Kubernetes and YARN

    YARN schedulers are optimized for high-throughput, multi-tenant batch workloads; they can scale up to 50k nodes per cluster and schedule 20k containers per second. Kubernetes schedulers, on the other hand, are optimized for long-running services, but many features such as hierarchical queues, fair resource sharing, and preemption are either missing or not mature enough at this point in time. Underneath, however, both are responsible for the same job: making resource allocation decisions. We see the need to run services on YARN as well as to run jobs on Kubernetes. This motivated us to create a universal scheduler that works for both YARN and Kubernetes and is configured in the same way. Our open source YuniKorn scheduler project (Y for YARN, K for K8s, uni- for Unified) brings long-wanted features such as hierarchical queues, fairness between users/jobs/queues, and preemption to Kubernetes, and brings service scheduling enhancements to YARN. Most importantly, it provides the opportunity to let YARN and Kubernetes share the same user experience when scheduling big data workloads, and any improvement to the scheduler can benefit both the Kubernetes and YARN communities. In this talk, we will discuss our efforts to design and implement the YuniKorn scheduler. We have integrated it with both YARN and Kubernetes, and we will show demos and best practices.

    Speakers: Wangda Tan, Suma Shivaprasad (Cloudera)

    Wangda is a PMC member of Apache Hadoop and Sr. Engineering Manager of the computation platform team at Cloudera. He manages all efforts related to Kubernetes and YARN for both on-cloud and on-prem use cases at Cloudera. His primary areas of interest are the YuniKorn scheduler (scheduling containers across YARN and Kubernetes) and the Hadoop Submarine project (running deep learning workloads across YARN and Kubernetes). He has also led features such as resource scheduling, GPU isolation, node labeling, and resource preemption in the Hadoop YARN community. Previously, he worked at Pivotal on OpenMPI/GraphLab and at Alibaba on cloud computing, large-scale machine learning, and a matrix and statistics computation platform built with Map-Reduce and MPI. Suma Shivaprasad is an Apache Hadoop Committer and a member of the Apache Atlas Project Management Committee. She works in the compute platform team at Cloudera, which focuses on Hadoop, YARN, and Kubernetes and on enabling these platforms in the public cloud.

    Talk 2: Swimming in the Data River

    The dirty secret of most "streaming analytics" technologies is that they are just stream processors: they sit on a stream and continuously compute the results of a particular query. They're good for alerting, keeping a dashboard up-to-date in real time, and streaming ETL, but they're not good at powering apps that give you true insight into what is happening: for this you need the ability to explore, slice/dice, drill down, and search into the data. This talk will cover the current state of the streaming analytics world and what Apache Druid, a real-time analytical database, brings to the table.

    Speaker: Gian (Imply)

    Gian is a co-founder and CTO of Imply, a San Francisco based technology company. Gian is also one of the main committers of Druid. Previously, Gian led the data ingestion team at Metamarkets and held senior engineering positions at Yahoo. He holds a BS in Computer Science from Caltech.

  • All About Streaming: Monitor Kafka Like a Pro and Apache Pulsar

    We have two talks, sponsored by Google (venue) and IBM (food + drink). IBM would also like you to participate in the Data Science Community: https://www.ibm.com/community/datascience/

    Agenda:
    6 - 6:30 pm Check-In + Networking
    6:35 pm -- 7:15 pm Talk 1
    7:20 pm -- 8:00 pm Talk 2
    8:05 pm -- 8:30 pm Networking + closing

    Talk 1: Unifying Messaging, Queuing, Streaming & Lightweight Compute in Apache Pulsar

    Online event processing applications often require the ability to ingest, store, dispatch and process events. Until now, supporting all of these needs has required a different system for each task: stream processing engines, message queuing middleware, and pub/sub messaging systems. This has led to unnecessary complexity in the development and operation of such applications, raising the barrier to adoption in enterprises. In this talk, Karthik will outline the need to unify these capabilities in a single system and to make them easy to develop and operate at scale. Karthik will delve into how Apache Pulsar was designed to address this need with an elegant architecture. Apache Pulsar is a next-generation distributed pub-sub system that was originally developed and deployed at Yahoo and is now running in production at more than 100 companies. Karthik will explain how the architecture and design of Pulsar provide the flexibility to support developers and applications needing any combination of queuing, messaging, streaming and lightweight compute for events. Furthermore, he will present real-life use cases of how Apache Pulsar is used for event processing, ranging from data processing tasks to web processing applications.

    Speaker: Karthik Ramasamy (Streamlio)

    Karthik Ramasamy is the co-founder and CEO of Streamlio, which focuses on building next-generation event processing infrastructure using Apache Pulsar. Before Streamlio, he was the engineering manager and technical lead for real-time infrastructure at Twitter, where he co-created Twitter Heron. He co-founded Locomatix, a company specializing in real-time stream processing on Hadoop and Cassandra using SQL, which was acquired by Twitter. Karthik has a Ph.D. in computer science from the University of Wisconsin, Madison with a focus on big data and databases. Karthik is the author of the book "Network Routing: Algorithms, Protocols and Architectures".

    Talk 2: Monitor Kafka Like a Pro

    Kafka operators need to provide guarantees to the business that Kafka is working properly and delivering data in real time, and they need to identify and triage problems so they can solve them before end users notice. This elevates the importance of Kafka monitoring from a nice-to-have to an operational necessity. In this talk, Kafka operations experts Xavier Léauté and Gwen Shapira share their best practices for monitoring Kafka and the streams of events flowing through it: how to detect duplicates, catch buggy clients, and triage performance issues -- in short, how to keep the business's central nervous system healthy and humming along, like a Kafka pro.

    Speakers: Gwen Shapira, Xavier Léauté (Confluent)

    Gwen is a software engineer at Confluent working on core Apache Kafka. She has 15 years of experience working with code and customers to build scalable data architectures. She currently specializes in building real-time, reliable data processing pipelines using Apache Kafka. Gwen is an author of "Kafka: The Definitive Guide" and "Hadoop Application Architectures", and a frequent presenter at industry conferences. Gwen is also a committer on the Apache Kafka and Apache Sqoop projects. Xavier Léauté was one of the first engineers on the Confluent team; he is responsible for analytics infrastructure, including real-time analytics in Kafka Streams. He was previously a quantitative researcher at BlackRock. Prior to that, he held various research and analytics roles at Barclays Global Investors and MSCI.
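    The single most-watched signal in the kind of Kafka monitoring described above is consumer lag: the gap between each partition's log end offset and the consumer group's committed offset. A minimal, dependency-free sketch of that computation (topic names and offset numbers are invented; in practice they come from Kafka's admin API or tools such as kafka-consumer-groups.sh):

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag = log end offset - committed offset.
    Partitions with no committed offset yet count as fully lagged."""
    lag = {}
    for partition, end in end_offsets.items():
        committed = committed_offsets.get(partition, 0)
        lag[partition] = max(end - committed, 0)
    return lag

# Hypothetical snapshot for one consumer group on topic "clicks".
end = {("clicks", 0): 1_000, ("clicks", 1): 1_200, ("clicks", 2): 900}
committed = {("clicks", 0): 1_000, ("clicks", 1): 950}  # partition 2: none yet

lags = consumer_lag(end, committed)
print(lags)                # {('clicks', 0): 0, ('clicks', 1): 250, ('clicks', 2): 900}
print(max(lags.values()))  # 900 -- alert if this keeps growing
```

    A single snapshot matters less than the trend: steadily growing lag means consumers are falling behind, which is exactly the condition operators want to catch before end users notice.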

  • Scale@Uber/Lyft: Managing data lake, workflow & Spark on Kubernetes

    GoPro Headquarters Building D

    Agenda:
    6 -- 6:30 pm Check-in & light food + Networking
    6:35 -- 6:40 pm Intro
    6:40 -- 7:15 pm Talk 1 (Lyft)
    7:15 -- 7:50 pm Talk 2 (Uber)
    7:50 -- 8:25 pm Talk 3 (Uber)

    Talk 1: Scaling Apache Spark on Kubernetes at Lyft

    As part of its mission, Lyft invests heavily in open source infrastructure and tooling. At Lyft, Kubernetes has emerged as the next generation of cloud-native infrastructure to support a wide variety of distributed workloads. Apache Spark at Lyft has evolved to solve both machine learning and large-scale ETL workloads. By combining the flexibility of Kubernetes with the data processing power of Apache Spark, Lyft is able to take ETL data processing to a different level. In this talk, we will cover the challenges the Lyft team faced and the solutions they developed to support Apache Spark on Kubernetes in production and at scale. Topics include:
    - Key traits of Apache Spark on Kubernetes.
    - Deep dive into Lyft's multi-cluster setup and operationality to handle petabytes of production data.
    - How Lyft extends and enhances Apache Spark to support capabilities such as Spark pod lifecycle metrics and state management, resource prioritization, and queuing and throttling.
    - Dynamic job scale estimation and runtime dynamic job configuration.
    - How Lyft powers internal Data Scientists, Business Analysts, and Data Engineers via a multi-cluster setup.

    Speaker: Li Gao

    Li Gao is the tech lead of the cloud-native Spark compute initiative at Lyft. Prior to Lyft, Li held various technical leadership positions at Salesforce, Fitbit, Marin Software, and a few startups, working on cloud-native and hybrid cloud data platforms at scale. Besides Spark, Li has scaled and productionized other open source projects, such as Presto, Apache HBase, Apache Phoenix, Apache Kafka, Apache Airflow, Apache Hive, and Apache Cassandra.

    Talk 2: Managing Uber's Data Workflow at Scale

    Uber's microservices serve millions of rides a day, generating over 100 PB of data. To democratize data pipelines, Uber needed a central tool that provides a way to author, manage, schedule, and deploy data workflows at scale. This talk details Uber's journey toward a unified and scalable data workflow system used to manage this data, and shares the challenges faced and how the company has rearchitected several components of the system -- such as scheduling and serialization -- to make them highly available and more scalable.

    Speaker: Alex Kira (Uber)

    Alex Kira is an engineering tech lead at Uber, where he works on the data workflow management team. His team provides a data infrastructure platform. Over his 19-year career, he has gained experience across several software disciplines, including distributed systems, data infrastructure, and full stack development.

    Talk 3: Building highly efficient data lakes using Apache Hudi (Incubating)

    Even with the exponential growth in data volumes, ingesting, storing, and managing big data remains unstandardized and inefficient. Data lakes are a common architectural pattern for organizing big data and democratizing access across the organization. In this talk, we will discuss different aspects of building data lake architectures, pinpointing technical challenges and areas of inefficiency. We will then rearchitect the data lake using Apache Hudi (Incubating), which provides streaming primitives right on top of big data. We will show how the upserts and incremental change streams provided by Hudi help optimize data ingestion and ETL processing. Further, Apache Hudi manages growth and sizes the files of the resulting data lake using purely open-source file formats, while also providing optimized query performance and file system listing. We will also provide hands-on tools and guides for trying this out on your own data lake.

    Speaker: Vinoth Chandar (Uber)

    Vinoth is a Technical Lead on the Uber Data Infrastructure team.

  • Making big data easy (talks from Lyft, Netflix and Quilt Data)

    Agenda:
    6:00 - 6:30 pm: Check in, food, networking
    6:30 - 6:35 pm: Intros
    6:35 - 8:30 pm: 3 Talks
    8:30 - 8:45 pm: Wrap up

    Important Note: You are required to register for the event (free) on ti.to before the event. You will then be sent an eNDA which needs to be signed 24 hours before the event, for security reasons. A badge will be pre-printed for you when you arrive at the event. Please register here (https://ti.to/big-data/data-science-best-practices-and-productivity). If for some reason you are not able to sign the eNDA online, you can still attend; however, you may have to wait in a long line at the sign-in desk.

    Talk #1: Disrupting data discovery

    Before any analysis can begin, a data scientist needs to discover the right data sources, understand them, and trust them. Most of the time this is done by Slacking coworkers, which is inefficient and does not scale. We will discuss how we solved this issue by building Amundsen: a map of Lyft's data, powered by rich metadata and leveraged through an intuitive search interface.

    Speaker Bios: Jin Hyuk Chang is a software engineer on the Lyft data platform team working on various data products. Jin is a main contributor to Apache Gobblin and Azkaban. Previously, Jin worked at LinkedIn and Amazon Web Services, focused on big data and service-oriented architecture. Phil Mizrahi is an Associate Product Manager on the Data Discovery team at Lyft. Previously, Phil worked at a fintech startup in Berlin and an investment bank in Paris, and served as an officer in the French Air Force.

    Talk #2: Scaling data lineage at Netflix to improve data infrastructure reliability and efficiency

    Data lineage plays a central role at Netflix: it improves platform reliability by enabling accurate job SLA forecasts, increases company-wide trust in the data, enhances developer productivity by providing better visibility into data movement, and improves the efficiency of the data infrastructure by establishing appropriate data retention levels. Please join us to understand how Netflix built a centralized lineage service to better understand the movement and evolution of data and related data artifacts within the company's data warehouse.

    Speaker Bios: Di Lin is a senior data engineer on the infrastructure and information security team at Netflix, where he focuses on building and scaling complex data systems to help infrastructure teams improve reliability and efficiency. Previously, he was a data engineer at Facebook, where he built company-wide data products related to identity and subscriber growth. Girish Lingappa is a senior data engineer on the infrastructure and information security team at Netflix, helping build applications and data assets aimed at creating an efficient and intelligent platform. Previously, he spent several years solving various data problems at a few Bay Area tech companies.

    Talk #3: Manage Data Like Code

    In this talk, Michael and Aleksey will use examples from their experience working with messy transit data and building traffic demand models to motivate Continuous Data Integration and Delivery (CDID), a process inspired by CI/CD workflows in software development for managing, testing, and versioning heterogeneous data sets. CDID automates data quality checks for a pipeline to ensure consistency, reduce bugs, and avoid costly production pipeline failures.

    Speaker Bios: Michael Sindelar is Director of Engineering at Quilt Data. Before Quilt, Michael developed ML pipelines to predict mobility and traffic patterns in cities at Sidewalk Labs and at Uber. Aleksey Bilogur is a Data Scientist and Developer Advocate at Quilt Data. He is a recent graduate of the Recurse Center and formerly worked at Kaggle.
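    To make the CDID idea concrete, an automated data quality gate of the kind Talk #3 describes might look like the following sketch. The transit-stop schema and the specific checks are invented for illustration, not taken from Quilt's implementation.

```python
import csv, io

# Hypothetical transit feed snippet; in a CDID workflow, checks like
# these would run on every new version of the data set before release.
RAW = """stop_id,lat,lon
S1,37.77,-122.41
S2,37.78,-122.40
"""

def check_transit_stops(text):
    """Return a list of data-quality violations (empty list means pass)."""
    errors = []
    rows = list(csv.DictReader(io.StringIO(text)))
    if not rows:
        errors.append("data set is empty")
    seen = set()
    for i, row in enumerate(rows):
        if row["stop_id"] in seen:
            errors.append(f"row {i}: duplicate stop_id {row['stop_id']}")
        seen.add(row["stop_id"])
        if not (-90 <= float(row["lat"]) <= 90
                and -180 <= float(row["lon"]) <= 180):
            errors.append(f"row {i}: coordinates out of range")
    return errors

assert check_transit_stops(RAW) == []  # this version may be published
```

    The point is the workflow, not the checks themselves: like unit tests gating a code merge, a failing check blocks the data version from being promoted to production.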

  • [AICamp]: AI Tech Meetup -- by LinkedIn, Uber and Facebook

    LinkedIn Building 4 (LMVE)

    FYI, this is a CROSS POSTING of AICamp's event. Please register at https://www.eventbrite.com/e/ai-tech-meetup-tickets-61073014029

    Here are the details of the event. Join us at the AI tech meetup on 5/9, 6-9pm, at LinkedIn (Mountain View). We have speakers from Uber AI Labs, LinkedIn AI, and Facebook to share their latest work in AI and practical experiences with machine learning, with deep dives into the technical details of how they solve engineering problems:
    · Overview of AI at LinkedIn
    · Reinforcement learning with open-ended algorithms for autonomous driving
    · Optimizing machine learning models with compilers and runtimes
    · Deep Natural Language Processing in Search Systems

    This will be a great opportunity for you to connect with other like-minded engineers and data scientists, and to share and learn from each other's experiences.

    Schedule:
    · 6:00pm - 6:30pm: Mix & Dinner
    · 6:30pm - 6:45pm: Welcome and opening notes, by Liang Zhang, Director of AI at LinkedIn
    · 6:45pm - 7:25pm: Reinforcement learning with open-ended algorithms, by Rui Wang at Uber
    · 7:25pm - 8:05pm: Machine learning compiler and runtime, by Garret Catron at Facebook
    · 8:05pm - 8:40pm: Deep Natural Language Processing in Search Systems, by Weiwei Guo & Huiji Gao at LinkedIn
    · 8:40pm - Mixing & Close

  • [Apache Heron-Bay Area] : Apache Heron First Anniversary

    This is a collaboration event, cross-posted from the Apache Heron -- Bay Area meetup. The original post is at https://www.meetup.com/Apache-Heron-Bay-Area/events/xvmlqqyzgbtb/

    Details from the event description:

    *** Come join us to celebrate *** This is the First Anniversary (2019) of Apache Heron, the fastest stream processing engine. *** Space is limited *** In addition to RSVPing on Meetup, all event attendees MUST register at https://apacheheronmeetup.splashthat.com/ with their full name (first and last), must bring photo ID matching the name, and MUST bring their registration confirmation email.

    Agenda:
    6:00 pm: Registration Verification, Networking over Food and Drinks
    7:00 pm: Apache Heron
    8:00 pm: Q & A, Networking Continues
    9:00 pm: Doors Close

    Speaker Bio: Karthik Ramasamy is the co-founder of Streamlio, which focuses on building next-generation real-time processing engines. Before Streamlio, he was the engineering manager and technical lead for real-time analytics at Twitter, where he co-created Twitter Heron. At the University of Wisconsin he worked extensively on parallel database systems, query processing, scale-out technologies, storage engines, and online analytical systems. Several of these research projects were later spun off as a company acquired by Teradata. Karthik is the author of several publications, patents, and the book "Network Routing: Algorithms, Protocols and Architectures". He has a Ph.D. in computer science from the University of Wisconsin, Madison with a focus on big data and databases.

    Host Bio: Sree Vaddi is a Java veteran and an Apache committer. He started his journey in Java in 1995, downloading a copy of JDK 1.0.2 from javasoft.com, and has worked in Java ever since, including a stint at JavaSoft at Sun Microsystems in 1998. He is experienced in Core Java, Enterprise Edition, Mobile, Big Data, IoT and ML/AI, and has been contributing to open source from the user forums to apache.org to today.

    What we covered:
    2018:
    Month #0: An intro to Apache Heron and its major features - San Francisco
    Sep 2018: An intro to Apache Heron and its major features - South Bay
    Oct 2018: Caladrius
    Nov 2018: Apache Heron - Stateful
    Dec 2018: Caladrius (by popular demand)
    2019:
    Jan 2019: Streamlet: Heron Functional API
    Feb 2019: Use of Heron at Twitter for Network Analysis
    Mar 2019: *A Series of Talks on Apache Heron - Part 1* (Low-Level API in Java & Python. Migrate a Storm Topology to a Heron Topology.)

    Please join me in thanking our sponsors:
    1. Twitter.com, for hosting, gourmet food and drinks. https://twitter.com/
    2. Streaml.io, for providing speakers and content. https://streaml.io/
    3. Foundation For Excellence, for generously providing swag to our meetup. https://ffe.org/site/

  • AirBnB/Lyft/Google: End-to-End ML Platform, Airflow and More

    Google SF office info: 345 Spear Street, 7th floor, room "Batgirl". Enter via the west elevator lobby of the Google office. We recommend the Hills Plaza Garage for parking, as it is right underneath the SPE building and costs $10 per vehicle after 5:00 PM. It's open until 11:00 PM.

    Agenda:
    6 - 6:30 pm Networking + food
    6:30 pm -- 6:40 pm Introduction
    6:40 pm -- 7:15 pm Talk 1 (Google) + QA
    7:15 pm -- 7:50 pm Talk 2 (Airbnb) + QA
    7:50 pm -- 8:25 pm Talk 3 (Lyft) + QA
    8:30 pm -- 9 pm Closing

    Talk 1: Demystifying Hybrid Data Management using CDAP

    Cloud has emerged as a critical enabler of digital transformation, with the aim of reducing IT overhead and costs. However, cloud migration is not instantaneous, for a variety of reasons including data sensitivity, compliance, and application performance. This results in the creation of diverse hybrid and multi-cloud environments and amplifies data management and integration challenges. This talk demonstrates how CDAP's flexibility allows you to utilize your existing on-premises infrastructure as you evolve to the latest big data and cloud services at your own pace, all while providing you a single, unified view of all your data, wherever it resides.

    Speaker: Bhooshan Mogal (Google)

    Bhooshan Mogal is a Product Manager at Google, where he is focused on delivering best-in-class data and analytics services to GCP users. Prior to Google, he worked on data systems at Cask Data Inc., Pivotal, and Yahoo.

    Talk 2: Bighead: Airbnb's end-to-end machine learning platform

    Airbnb has a wide variety of ML problems, ranging from models on traditional structured data to models built on unstructured data such as user reviews, messages, and listing images. The ability to build, iterate on, and maintain healthy machine learning models is critical to Airbnb's success. Bighead aims to tie together various open source and in-house projects to remove incidental complexity from ML workflows. Bighead is built on Python, Spark, and Kubernetes. The components include a lifecycle management service, an offline training and inference engine, an online inference service, a prototyping environment, and a Docker image customization tool. Each component can be used individually. In addition, Bighead includes a unified model-building API that smoothly integrates popular libraries including TensorFlow, XGBoost, and PyTorch. Each model is reproducible and iterable through standardization of data collection and transformation, model training environments, and production deployment. This talk covers the architecture, the problems that each individual component and the overall system aim to solve, and a vision for the future of machine learning infrastructure. Bighead is widely adopted within Airbnb, with a variety of models running in production, and we plan to open source it to allow the wider community to benefit from our work.

    Speaker: Andrew Hoh

    Andrew Hoh is the Product Manager for the ML Infrastructure and Applied ML teams at Airbnb. Previously, he spent time building and growing Microsoft Azure's NoSQL distributed database. He holds a degree in computer science from Dartmouth College.

    Talk 3: Apache Airflow at Lyft

    Lyft was one of the first companies to adopt Airflow in production. Today Airflow powers many Lyft use cases: from executive dashboards to metrics aggregation, derived data generation, machine learning feature computation, and more. In this talk, we will first cover how we operate Airflow at Lyft in production, then discuss the improvements we have made to Airflow to boost internal ETL development productivity. Lastly, we will talk about some of our open source contributions which could benefit the whole community.

    Speaker: Tao Feng

    Tao Feng is a software engineer on the Lyft data platform team working on various data products. Tao is also a committer and PMC member on Apache Airflow. Previously, Tao worked at LinkedIn and Oracle on data infrastructure, tooling, and performance.