What we're about

[b]Youtube channel for archived talk videos[/b]

SF Big Analytics Youtube Channel (https://www.youtube.com/channel/UC9MOf69YTbmDKqr22l7aigA)

The SF Big Analytics meetup focuses on all aspects of big data analytics, from data ETL, feature generation, and AI/machine learning theory, algorithms and implementation to the technologies and infrastructure associated with big data analytics. Topics include AI/machine learning, data processing and monitoring (Hadoop, Spark, Hive, streaming with Flink, Apex, Kafka, etc.), data visualization, and the data science lifecycle. This meetup covers the full range of big data analytics topics and data mining pipelines.

We try to provide high-quality talks at each meetup. Here are some of the policies on talks that we have followed over the last few years:

-- Technically focused

-- No marketing

-- No product promotion (unless it is an open-source project)

-- No high-level business talks (unless from highly respected leaders)

Upcoming events (2)

Data driven development in autonomous-driving and Spark performance Tuning

We are excited to have three talks from Blacksesame, Databricks and Uber.

Venue and food sponsors: J.J. Lake and Harnham

Agenda

6:00 -- 6:30 pm Check-in
6:30 -- 6:35 pm Sponsors (J.J. Lake + Harnham) intro
6:35 -- 7:10 pm Talk 1 (BlackSesame)
7:10 -- 7:45 pm Talk 2 (Databricks)
7:45 -- 8:20 pm Talk 3 (Uber)
8:30 pm Closing

Talk 1: Case studies of data-driven development in autonomous driving

An autonomous driving software system is highly complex, mission-critical, and rapidly iterating. The development cycle of such a system involves software, hardware, and continuous testing. To guarantee a closed and efficient feedback loop from the software engineering process to road tests, there has to be a data-driven pipeline that facilitates it. In this talk, we will use an in-garage vision-based autonomous driving system as an example to discuss not only some unique challenges in designing such a system, but also how to use data-driven development approaches to enable fast delivery of close-to-production-level software in this vertical.

Speaker: Guan Wang (BlackSesame)

Dr. Guan Wang is the Head of AI at Blacksesame Technologies (an AI chip startup in Santa Clara). His team has built a pure vision-based autonomous driving platform in the in-garage driving vertical, and a pure vision-based crowdsourcing HD mapping system. Prior to Blacksesame, he was one of the founding team members of NIO US (an electric car startup, IPO'19), where he worked on productionizing deep learning systems in embedded environments. Before NIO, he worked at LinkedIn on a cloud-based machine learning platform.

Talk 2: Uncovering performance regressions in the TCP SACKs vulnerability fixes

In early July 2019, Databricks noticed some Apache Spark workloads regressing by as much as 6x. In this talk, we'll discuss how we traced these regressions back to the Linux kernel and the fixes for the TCP SACKs vulnerabilities. We will explain the symptoms we were seeing, walk through how we debugged the TCP connections, and dive into the Linux source to uncover the root cause.

Speaker: Chris Stevens (Databricks)

Chris Stevens is a software engineer at Databricks, where he works on the reliability, scalability, and security of Apache Spark clusters. His work focuses on auto-scaling compute, auto-scaling storage, node initialization performance, and node health monitoring. Prior to Databricks, Chris founded the Minoca OS project, where he built a POSIX-compliant, general-purpose OS from scratch to run on resource-constrained devices. He got his start at Microsoft working on the Windows kernel team, porting the Windows boot environment from BIOS to UEFI.

Talk 3: How to performance-tune Spark applications in large clusters

Uber developed a new Spark ingestion system, Marmaray, for data ingestion from various sources. It's designed to ingest billions of Kafka messages every 30 minutes. The amount of data handled by the pipeline is on the order of hundreds of TBs. Omkar details how to tackle such scale and shares insights into the optimization techniques. Some key highlights are how to understand bottlenecks in Spark applications, whether to cache your Spark DAG to avoid rereading your input data, how to effectively use accumulators to avoid unnecessary Spark actions, how to inspect your heap and non-heap memory usage across hundreds of executors, how to change the layout of data to save long-term storage cost, how to effectively use serializers and compression to save network and disk traffic, how to amortize the cost of your application by multiplexing your jobs, and different techniques for reducing memory footprint, runtime, and on-disk usage. CGI was able to significantly (~10%–40%) reduce memory footprint, runtime, and disk usage.

Speaker: Omkar Joshi (Uber)

Omkar Joshi is a senior software engineer on Uber's Hadoop platform team, where he's architecting Marmaray. Previously, he led object store and NFS solutions at Hedvig and was an initial contributor to Hadoop's YARN scheduler.

Building data lineage, data orchestration and data mesh

We are jointly organizing this event with Google. It focuses on how to build a data platform: data lineage, data lake/mesh and data storage.

Agenda

- 06:00 - 06:25 - Reception and networking
- 06:25 - 06:30 - Introductions
- 06:30 - 07:00 - Talk 1 (Google)
- 07:00 - 07:30 - Talk 2 (ThoughtWorks)
- 07:30 - 08:00 - Talk 3 (Alluxio)
- 08:00 - 08:30 - Social time

Talk 1: Fine-grained root cause and impact analysis with CDAP Lineage

Lineage is a critical aspect of data governance in large enterprises, and provides traceability for data as it flows through a data system. It can unlock various use cases such as root cause analysis (discovering the cause of a bad data event) and impact analysis (gauging the impact of a change before making it). In this talk, the speaker will demonstrate how CDAP's granular data lineage capabilities can solve these use cases for enterprises.

Speaker: TBD (Google)

Talk 2: How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh

Many enterprises are investing in their next-generation data lake, with the hope of democratizing data at scale to provide business insights and ultimately make automated intelligent decisions. Data platforms based on the data lake architecture have common failure modes that lead to unfulfilled promises at scale. To address these failure modes, we need to shift away from the centralized paradigm of a lake, or its predecessor, the data warehouse, to a paradigm that draws from modern distributed architecture: considering domains as the first-class concern, applying platform thinking to create self-serve data infrastructure, and treating data as a product.

Speaker: Zhamak Dehghani (ThoughtWorks)

Zhamak is a principal technology consultant at ThoughtWorks with a focus on distributed systems architecture and digital platform strategy in the enterprise. She is a member of the ThoughtWorks Technology Advisory Board and contributes to the creation of the ThoughtWorks Technology Radar.

Talk 3: Alluxio - "Accelerating EMR Spark with Alluxio on S3"

Apache Spark and Alluxio are cousin open-source projects that originated from UC Berkeley's AMPLab. Running Spark with Alluxio is a popular stack, particularly for hybrid environments. In this session, Bin will briefly introduce Alluxio, share the top 10 tips for performance tuning for real-world workloads, and demo Alluxio with Spark.

Speaker: Bin Fan (Alluxio)
