What we're about

SF Big Analytics YouTube channel (https://www.youtube.com/channel/UC9MOf69YTbmDKqr22l7aigA)

The SF Big Analytics meetup focuses on all aspects of big data analytics, from data ETL and feature generation to AI/machine learning theory, algorithms, and implementation, as well as the technologies and infrastructure associated with big data analytics. Topics include AI/machine learning (algorithms and ML infrastructure), data processing and monitoring, data infrastructure, data visualization, the data science lifecycle, and more. The meetup covers the full range of big data analytics topics and data mining pipelines.

We strive to provide high-quality talks at each meetup. Here are some of the policies for talks we have followed over the last few years:

-- Technically focused

-- No marketing

-- No product promotion (unless it is an open-source project)

-- No high-level business talks (unless from a highly respected leader)

Upcoming events (3)

The missing story of the columnar format & Apache Kylin

Online event

Register at https://us02web.zoom.us/webinar/register/WN_k0nytglMTLSp-UJJEfAXLw

We have two exciting talks. Eric Sun from LinkedIn will discuss the missing story of the columnar format: how to take full advantage of it. Kaige Liu from Kyligence will discuss how the Apache Kylin query engine achieves sub-second response times.

Agenda
12:00 -- 12:05 pm Introduction
12:05 -- 12:40 pm Talk 1 + Q&A
12:40 -- 1:15 pm Talk 2 + Q&A
1:15 -- 1:30 pm Closing

Talk 1: Are We Taking Only Half of the Advantage of the Columnar File Format?

The offline data ecosystem mainly serves batch (small to huge) ETL/analytics/ML/DL workloads, which means the majority of useful data files are ingested once and then scanned/read hundreds of thousands of times. More than 90% of the workload on HDFS/S3/ADLS is reads, so it is very important to optimize for read operations. Simply keeping the data format and schema identical to the online upstreams (Kafka, RDBMS, Cassandra, and MongoDB) in Avro or JSON can actually prevent us from leveraging modern compute engines and their optimizations. Switching to a columnar format (such as Parquet or ORC) is only about halfway to getting more done for less; this talk will explain the other half. Among the many areas related to storage, the following optimizations can give a data lake the most significant ROI with relatively low investment (a rough code sketch follows this event listing):

- sorting by the mostly-filtered field (with low-to-medium cardinality)
- bucketing the big dimension/lookup tables (removing the shuffle stage for joins), or simply distributing the records by the almost-unique field without bucketing
- sub-partitioning (multi-level partitioning) for big and frequently used tables
- rolling hourly partitions into daily ones instead of daily compaction

Speaker: Eric Sun (LinkedIn)

Talk 2: Apache Kylin: Achieve Exact COUNT DISTINCT with Sub-Second Latency at PB Scale

With over 450 million customers, Didi (the world's largest rideshare company) conducts complex user behavior analysis on huge datasets daily. Exact count distinct is one of Didi's most critical metrics, but it is known for being computationally heavy and notoriously slow. The difference between exact count distinct and approximate count distinct can cost Didi millions of dollars. In this talk, Kaige Liu of the Apache Kylin project will explain how Didi uses Apache Kylin to return exact count distinct results on billions of rows of data with sub-second latency, generating the most accurate picture of its business. You will also learn about the latest developments in modern OLAP technologies. Kaige will share how Didi and Truck Alliance (a truck-hailing company that processes $100 billion worth of goods yearly) use Apache Kylin to power analytics platforms that let hundreds of analysts achieve sub-second latency on petabyte-scale data.

Speaker: Kaige Liu (Kyligence)

Kaige is a senior solutions architect at Kyligence, where he works on building the next-generation big data analytics platform. Previously, he worked on the OpenStack and Bluemix teams at IBM, focusing on cloud computing and virtualization technology. Kaige loves the open source community and is an active Apache Kylin committer.
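
The layout optimizations listed in Talk 1 above are not tied to any single engine. As a rough illustration, here is a minimal PySpark sketch, with hypothetical paths, column names, and bucket count, of writing a Parquet table that is date-partitioned, bucketed on a join key, and sorted on a frequently filtered low-cardinality field:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("columnar-layout-sketch").getOrCreate()

# Hypothetical raw upstream data kept in a row-oriented format (JSON here; could be Avro).
events = spark.read.json("/data/raw/events")

(events
    .repartition("country")              # co-locate rows sharing the commonly filtered value
    .sortWithinPartitions("country")     # sort on the mostly-filtered, low/medium-cardinality field
    .write
    .format("parquet")
    .partitionBy("event_date")           # sub-partitioning could add another level, e.g. event_hour
    .bucketBy(64, "member_id")           # bucketing the join key lets the engine skip the shuffle for joins
    .sortBy("member_id")
    .mode("overwrite")
    .saveAsTable("warehouse.events_columnar"))

When both sides of a join are bucketed on the same key with matching bucket counts, Spark can join them without a shuffle stage, which is a large part of the "other half" the abstract refers to.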
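
For the exact-versus-approximate distinction in Talk 2, the sketch below uses plain PySpark aggregations on a hypothetical rides table (not Kylin's precomputation-based engine) to show the trade-off: an exact distinct count must de-duplicate every value, while the HyperLogLog-based approximation is far cheaper but carries a bounded relative error.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("count-distinct-sketch").getOrCreate()

# Hypothetical fact table of ride events containing a rider_id column.
rides = spark.read.parquet("/data/warehouse/rides")

# Exact distinct count: accurate, but de-duplicates every rider_id (expensive at PB scale).
rides.agg(F.countDistinct("rider_id").alias("exact_riders")).show()

# Approximate distinct count: HyperLogLog sketch with a ~1% target relative error.
rides.agg(F.approx_count_distinct("rider_id", rsd=0.01).alias("approx_riders")).show()

Broadly speaking, Kylin sidesteps both query-time costs by precomputing exact distinct-count measures (bitmap-based) in its cubes, which is how it reaches sub-second latency at PB scale.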

"Apache Submarine" & "Bootstrapping meaning with embodied language learning"

Register at Zoom: https://us02web.zoom.us/webinar/register/WN_eP2WYjsURa-9MPddQvvpYg

Agenda
12:00 -- 12:05 pm Intro
12:05 -- 12:40 pm Talk 1 + Q&A
12:40 -- 1:15 pm Talk 2 + Q&A
1:30 pm Close

Talk 1: Bootstrapping meaning with embodied language learning

NLP models like GPT-3, word2vec, and transformers have been making huge leaps in machine learning and textual understanding. Cracks in these models are starting to appear, though, and many research scientists worry that we have reached the limits of what these models can do. While these models show seemingly amazing abilities to understand what we are saying, we will show that they actually understand nothing: they have no frame of reference to our world. While there is much debate about what scientists believe is missing to take AI to the next level, we will talk about one very key point: embodiment. AIs cannot understand things the way we do without having some way to interact with the world. Words like "rough" and "heavy" obtain their meaning from the physical world around us, not from parsing billions of lines of text. We will take a tour of why grounding meaning is key and cover recent developments.

Speaker: Jason Toy (CloudApp)

Jason Toy is a startup generalist focused on technology, operations, and growth, and an occasional angel investor and advisor. He is currently COO at CloudApp, a remote communications platform. He has spent a lot of time working with machine learning and artificial intelligence in both production environments and research. His current area of research is embodied cognition and sensorimotor representations in the brain.

Talk 2: Apache Submarine: State of the union

Apache Submarine is the ONE PLATFORM that allows data scientists to create end-to-end machine learning workflows. ONE PLATFORM means data scientists can finish their jobs on the same platform without frequently switching toolsets: from dataset exploration and data pipeline creation, to model training (experiments), to pushing models to production (model serving and monitoring), all of these steps can be completed within the ONE PLATFORM.

In this talk, we'll start with the current status of Apache Submarine: how it is used today in deployments large and small. We'll then move on to the exciting present and future of Submarine, features that further strengthen Submarine as the ONE PLATFORM for data scientists to train and manage machine learning models. We'll discuss highlights of the newly released 0.4.0 version and new features in the 0.5.0 release, planned for 2020 Q3:

- New features to run model training (experiments) on K8s and submit model training jobs via an easy-to-use Python/REST API or UI (a rough sketch of REST submission follows this listing).
- Integration with Jupyter notebooks, allowing data scientists to provision and manage notebook sessions and submit offline machine learning jobs from notebooks.
- Integration with Conda kernels and Docker images for a hassle-free experience managing reusable notebook/model-training experiments within a team/company.
- Pre-packaged training templates that let data scientists focus on domain-specific tasks (like using DeepFM to build a CTR prediction model).

We will also share the mid-term/long-term roadmap for Submarine, including model management for model serving/versioning/monitoring, etc.

Speaker: Wangda Tan (Cloudera)

Wangda Tan is Sr. Manager of the Compute Platform engineering team at Cloudera, responsible for all engineering efforts related to Kubernetes, Apache Hadoop YARN, resource scheduling, and the internal container cloud. In the open-source world, he is a member of the Apache Software Foundation (ASF) and PMC Chair of the Apache Submarine project, as well as a project management committee (PMC) member of Apache Hadoop and Apache YuniKorn (incubating). Before joining Cloudera, he led high-performance computing on Hadoop work at EMC/Pivotal. Before that, he worked at Alibaba Cloud and participated in the development of a distributed machine learning platform (which later became ODPS XLIB).
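
As a rough illustration of the "easy-to-use Python/REST API" feature mentioned in the Submarine talk above, here is a minimal Python sketch of submitting a training experiment to a Submarine server over REST. The endpoint, payload fields, resource strings, and image name are hypothetical placeholders rather than Submarine's documented API; consult the Apache Submarine documentation for the real experiment spec.

import requests

# Hypothetical Submarine server address and experiment endpoint (illustrative only).
SUBMARINE_URL = "http://localhost:32080/api/v1/experiment"

# Hypothetical experiment spec: a single-worker TensorFlow training job on K8s.
experiment_spec = {
    "meta": {
        "name": "mnist-demo",                  # placeholder experiment name
        "framework": "TensorFlow",             # placeholder framework field
        "cmd": "python /opt/train.py",         # training command inside the container
    },
    "environment": {
        "image": "example/mnist-train:latest"  # hypothetical Docker image
    },
    "spec": {
        "Worker": {"replicas": 1, "resources": "cpu=2,memory=4G"}  # placeholder resources
    },
}

response = requests.post(SUBMARINE_URL, json=experiment_spec, timeout=30)
response.raise_for_status()
print("Submitted experiment:", response.json())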

Magnet Shuffle Service: Push-based Shuffle at LinkedIn

Online event

Please register at https://us02web.zoom.us/webinar/register/WN_rdNofU3uQ4eJlM4_YffS6A

We are very happy to have Min Shen from LinkedIn do a deep dive on the Spark shuffle service.

Agenda
12:00 -- 12:05 pm Intro
12:05 -- 12:50 pm Talk
12:50 -- 1:05 pm Q&A
1:05 -- 1:30 pm Closing

The number of daily Spark applications at LinkedIn has increased by more than 3X in the past year. The shuffle process alone now handles 10+ PB of data and billions of blocks daily in our clusters. With such a rapid increase in Spark workloads, we quickly realized that the shuffle process can become a severe bottleneck for both infrastructure scalability and workload efficiency. In our production clusters, we have observed both reliability issues due to shuffle fetch connection failures and efficiency issues due to random reads of small shuffle blocks on disk. To tackle these challenges and optimize shuffle performance in Spark, we have developed the Magnet shuffle service, a push-based shuffle mechanism that works natively with Spark. Our paper describing this work has recently been accepted by VLDB 2020. In this talk, we will introduce how push-based shuffle can drastically increase shuffle efficiency compared with the existing pull-based shuffle. In addition, by combining push-based and pull-based shuffle, we show how the Magnet shuffle service helps harden the shuffle infrastructure at LinkedIn scale, both by reducing shuffle-related failures and by removing scaling bottlenecks. Furthermore, we will cover a few highlights of the implementation behind the Magnet shuffle service, which works natively with Spark and does not require deploying any external infrastructure or specialized hardware.

Speaker: Min Shen (LinkedIn)

Min Shen is a tech lead at LinkedIn. His team's focus is to build and scale LinkedIn's general-purpose batch compute engine based on Apache Spark. The team empowers multiple use cases at LinkedIn, ranging from data exploration and data engineering to ML model training. Prior to this, Min mainly worked on Apache YARN. He holds a PhD in Computer Science from the University of Illinois at Chicago.
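
For readers who want to experiment, a push-based shuffle implementation derived from this work has since been merged into upstream Apache Spark (3.2 and later). The following is a minimal, hedged PySpark sketch of the client-side settings, assuming a YARN cluster whose external shuffle services are already configured for push-based merging; the values are illustrative, not a production recipe.

from pyspark.sql import SparkSession

# Assumes a YARN cluster with the external shuffle service running and
# server-side merged-shuffle support enabled; illustrative values only.
spark = (
    SparkSession.builder
    .appName("push-based-shuffle-sketch")
    .config("spark.shuffle.service.enabled", "true")  # external shuffle service is a prerequisite
    .config("spark.shuffle.push.enabled", "true")     # turn on push-based (Magnet-style) shuffle
    .getOrCreate()
)

# Any wide transformation now shuffles via pushed-and-merged blocks: mapper output is
# pushed to remote shuffle services and merged into larger per-reducer chunks, replacing
# many small random disk reads with fewer sequential ones.
df = spark.range(0, 10_000_000).withColumnRenamed("id", "key")
df.groupBy((df.key % 1000).alias("bucket")).count().show(5)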

Past events (110)

Photos (435)