• Reactive Summit + SBTB 2020 CFP Open through July 31

    Online event

    Scale By the Bay is now following Reactive Summit, the conference of the Reactive Foundation, a Linux Foundation project focused on cloud-native applications. We're happy to report that Martin Odersky and Matei Zaharia, the creators of Scala and Spark, will keynote SBTB 2020, among other awesome keynote speakers. The joint CFP is extended through July 31. There's still time to submit a talk: https://scale.bythebay.io! We look forward to more great speakers joining us in November.

  • Exabytes Delivered Daily: Lessons Building Cloud Software at Databricks

    This is a joint event with SF Scala. RSVP there! The Zoom link is available only to those who do: https://www.meetup.com/SF-Scala/events/270818386

    Exabytes Delivered Each Day: Some Lessons Building Large-Scale Cloud Software at Databricks

    Cloud service developers need to handle massive-scale workloads from thousands of customers with no downtime or regressions. In this talk, we’ll present our experience building a very large scale cloud service at Databricks, which provides a data and ML platform as a service running over AWS and Azure, used by some of the largest enterprises in the world. Databricks launches millions of VMs per day in each of these clouds that process exabytes of data per day for interactive, streaming and batch production applications. This means that our control plane has to handle a wide range of workload patterns and cloud issues such as outages. We will describe how we built our control plane for Databricks using Scala services and open source infrastructure such as Kubernetes, Envoy, Prometheus, Apache Spark and Delta Lake, and various design patterns and engineering processes that we learned along the way to make development and operations reliable.

    Speaker: Matei Zaharia is the Chief Technologist and cofounder at Databricks, the creator of Apache Spark, and an Assistant Professor at Stanford University, where he cofounded the DAWN Lab.

    Note: Matei will keynote the Scale By the Bay 2020 Conference in November. The CFP is still open at https://scale.bythebay.io!

  • First Choice Scale By the Bay 2020 CFP is now Open through May 31

    The 8th Annual Scale By the Bay developer conference will be held either online or in person in November 2020. The CFP is now open at https://scale.bythebay.io. The First Choice CFP will run until May 31st, when 1/2 of the program will be selected. The next 1/4 will be selected by June 30th, and so on. The bar will move higher in each iteration, while still allowing the strongest talks to join. Please submit your best talk early; we hope to see you on the program!

  • Starting 2020 at Microsoft Reactor: Making Apache Spark Better with Delta Lake

    From now on, all the meetups By the Bay will be announced at the umbrella meetup, Scale By the Bay: https://www.meetup.com/scale-by-the-bay We'll keep the downstream meetups for consistency and will cross-post. If you are a member of any of these, join us and you will be up to date on the holistic, full-stack approach to software systems.

    We’re kicking off the year at our new partner venue, Microsoft Reactor!

    Apache Spark™ is the dominant processing framework for big data. Delta Lake adds reliability to Spark so your analytics and machine learning initiatives have ready access to quality, reliable data. This session covers the use of Delta Lake to enhance data reliability for Spark environments.

    Topics:
    - The role of Apache Spark in big data processing
    - Use of data lakes as an important part of the data architecture
    - Data lake reliability challenges
    - How Delta Lake helps provide reliable data for Spark processing
    - Specific improvements that Delta Lake adds
    - The ease of adopting Delta Lake for powering your data lake

    Speaker: Chris Hoshino-Fish is a Solutions Architect at Databricks. Chris is an active member of the Performance Subject Matter Expert group and a former Principal Consultant focused on Data Engineering, working with several Fortune 500 Databricks customers. Prior to Databricks, Chris worked for an adtech company as a data engineer managing pipelines using Apache Spark for 3.5 years. Chris has a B.A. in Computational Mathematics from the University of California, Santa Cruz.

    Lightning Talks -- for the rest of the meetup, we'll open the floor to the lightning talks proposed in the comments!

  • [External Registration][Conference] Scale By the Bay 2019, November 13-15

    This year Scale By the Bay (https://scale.bythebay.io) runs for only two days. But we packed an incredible 70 sessions into these two days! We start with a hot breakfast and excellent coffee. Coffee never ends -- a continuous, uninterruptible supply of great coffee is a hallmark of every conference By the Bay. Each morning there is a keynote where we all gather as a community, and a panel closes each conference day, where we all get together again before the happy hour -- also every day.

    The heart of the conference is its iconic four tracks: Thoughtful Software Engineering, Service Architectures, and End-to-end Data Pipelines up to ML/AI, which we historically call Functional, Reactive, and Data. That's three, right? The fourth is the hallway track -- and we're legendary for it!

    The core theme this year is Distributed Systems. Joe Beda, Principal Engineer at VMware and the co-creator of Kubernetes, keynotes one day, and Heather Miller, Professor at CMU and former Director of the Scala Center, keynotes the other. We have multiple talks considering cloud deployments on Kubernetes in concert with other systems, such as Kafka, Spark, and Flink. We cover important new directions with Unison, and inherent issues such as Change Data Capture from Disney Streaming. We will learn about the new GIS features for Google BigQuery from their author, about the Databricks Data Lake approach, and infrastructure as code at Target.

    Our "reactive" track started as reactive microservice architectures but came to encompass all kinds of systems, as well as data manipulation techniques. We'll see how Lyft is enabling real-time queries with Apache Kafka, Flink and Druid. We'll hear about the lessons learned developing and running Netty from its creator. We'll see how Serverless is developed at Google.

    Machine Learning and AI are only as scalable as the data pipeline feeding them. Moreover, you need to ensure your data is typesafe and your predictions are based on data whose integrity or even privacy is provable. This year, we have three talks on Swift for TensorFlow, including from the original Google team developing it, as well as Coinbase and Quarkworks. We hear from Sony Entertainment on near real-time, low latency predictions, and many, many other leaders.

    And we'll uphold the rigorous and thoughtful software engineering that is the underpinning of every system scalable in time and tech space -- a system that can deliver but also grow with companies and their people. We'll hear from Comcast and Netflix on human-centric software engineering and ML organizations. We'll hear about community-first Open-Source approaches. We'll see how F# invigorates the .NET ecosystem with a functional approach, including JavaScript apps, and how Scala with React is doing the same for full-stack development on the JVM. We'll hear about Rust, Haskell, Scala, Java, Python, F# and other ecosystems used for quality development and production deployment. We'll see how the sausage is made at JetBrains to power our IDEs. We'll dig deeper into GraalVM with Oracle and Twitter, as well as Scala Native.

    We pioneered GraphQL at Scale By the Bay three years ago when almost nobody had heard about it. Furthermore, our focus was not on the frontend alone but on middleware usage of GraphQL. This year Nick Schrock, a co-creator of GraphQL, joins us.

    The day before the conference, we run a bespoke, all-day, hands-on training that we build specifically for SBTB. This year, it's the Portable Serverless Workshop with Ryan Knight and James Ward. James is now at GCP and has a driver's seat on the road to the serverless future. You'll go home with a complete serverless backend under your belt!

    All in all, we'll have a lot of fun, pack a year of learning into just two or three days, and again experience the magic that makes Scale By the Bay a legend! Reserve your Early Bird seat soon at https://scale.bythebay.io.


    Note: ML Model Versioning, Deployment, and Monitoring are core themes of the https://scale.bythebay.io 2019, 11/14-15, Oakland. Reserve your seat today using the code MEETSFHADOOP15 for 15% off all passes, including the complete Serverless workshop!

    Joint meetup -- please RSVP at http://bay.area.ai!

    (1) Model Versioning: Why, When, and How

    Models are the new code. While machine learning models are increasingly being used to make critical product and business decisions, the process of developing and deploying ML models remains ad hoc. In the “wild-west” of data science and ML tools, versioning, management, and deployment of models are massive hurdles in making ML efforts successful. As creators of ModelDB, an open-source model management solution developed at MIT CSAIL, we have helped manage and deploy a host of models ranging from cutting-edge deep learning models to traditional ML models in finance. In each of these applications, we have found that the key to enabling production ML is an often-overlooked but critical step: model versioning. Without a means to uniquely identify, reproduce, or roll back a model, production ML pipelines remain brittle and unreliable. In this talk, we draw upon our experience with ModelDB and Verta to present best practices and tools for model versioning and how having a robust versioning solution (akin to Git for code) can streamline DS/ML, enable rapid deployment, and ensure high quality of deployed ML models.

    Speakers: Manasi Vartak, CEO, Verta.ai, and Conrado Miranda, CTO, Verta.ai

    Manasi Vartak is the founder and CEO of Verta.ai (www.verta.ai), an MIT spinoff building software to enable high-velocity machine learning. Manasi previously worked on deep learning for content recommendation as part of the feed-ranking team at Twitter and dynamic ad-targeting at Google.

    Conrado Miranda is the CTO at Verta.ai. Conrado has a PhD in Machine Learning and a focus on building platforms for AI. He was the tech lead for the Deep Learning platform at Twitter’s Cortex, where he designed and led the implementation of TensorFlow for model development and PySpark for data analysis and engineering. He also led efforts on NVIDIA’s self-driving car initiative, including the Machine Learning platform, large scale inference for the Drive stack, and build and CI for Deep Learning models.

    (2) Model Monitoring in Production

    In production, machine learning models continuously encounter new data patterns they have never seen during training and testing iterations. The best offline experiment can lose in production. The most accurate model is not always tolerant to a minor data drift or adversarial input. Neither prodops, data science, nor engineering teams are equipped to detect, monitor and debug model degradation behaviour. Real mission-critical AI systems require an advanced monitoring and model observability ecosystem that enables continuous and reliable delivery of machine learning models into production.

    Common production incidents include:
    - Data anomalies
    - Data drifts, new data, wrong features
    - Vulnerability issues, adversarial attacks
    - Concept drifts, new concepts, expected model degradation
    - Domain drift
    - Biased training set

    In this demo-based talk we discuss algorithms for monitoring text and image use cases as well as for classical tabular datasets. The demo will cover the full cycle of a machine learning model in production:
    - Model training and deployment with Kubeflow pipelines
    - Production traffic simulation
    - Model monitoring metrics configuration
    - Data drift detection
    - Drift exploration and monitoring metadata mining
    - New training dataset generation from production feature store
    - Model retraining and redeployment

    Stepan Pushkarev is the CTO of Hydrosphere.io, a Model Management platform, and co-founder of Provectus, an AI Solutions provider and consultancy and the parent company of Hydrosphere.io.

  • OmniSci-StreamSets F1 Demo + Scaling StreamSets On Azure Kubernetes Service

    This is a joint event with the StreamSets User Group: https://www.meetup.com/San-Francisco-StreamSets-User-Group-Meetup/events/261489457

    Join us for two great presentations, food, and beverages, and take a turn in OmniSci's instrumented F1 simulator!

    Agenda:

    6:30pm - Food, beverages, and try out OmniSci's F1 simulator!

    7:15pm - Creating the OmniSci F1 Demo: Real-Time Data Ingestion With StreamSets
    Veda Shankar - Senior Developer Advocate - OmniSci
    https://www.linkedin.com/in/veda-shankar-6260a516
    Telematics is a rapidly growing use case for IoT and Big Data, and OmniSci hacked an F1 racing game to demonstrate how telematics data can be collected and analyzed in real time. A combination of open source tools was used to generate, capture, process, analyze, and chart the data from a Formula 1 racing simulation. StreamSets was used to visually architect and implement the data flows with an open-source Docker container. Read this blog for more details: https://streamsets.com/blog/omnisci-f1-demo-real-time-data-ingestion-streamsets/

    7:45pm - Scaling StreamSets On Azure Kubernetes Service
    Speaker TBD
    Provisioning agents are containerized applications that run within a container orchestration framework, such as Kubernetes. You can run Kubernetes on-premises, or leverage cloud-based solutions such as Azure Kubernetes Service (AKS) and Google Kubernetes Engine for a "pay-as-you-consume" model without the complexity of implementation, installation, and maintenance. In this session, we will show how to scale StreamSets Data Collector instances on Azure Kubernetes Service (AKS) using provisioning agents that help automate upgrading and scaling resources on demand, without having to stop execution of dataflow pipeline jobs.

    8:30pm - Close

  • Scale By the Bay 2019 CFP Open until May 31

    Friends, the month of May is when the Scale By the Bay (SBTB) CFP always runs, for the conference in November. The CFP is now open at https://scale.bythebay.io

    There are three tracks, as usual:
    - Functional Programming
    - Service Architectures
    - Data Pipelines, including ML/AI

    The theme for this year is the emergence of new distributed systems and their applications, including Edge, IoT, DLT, and AI on the Edge. Helena Edelson led a team at Apple enabling ML/AI with Spark, Joe Beda started Google Compute Engine and Kubernetes, and Heather Miller led the Scala Center at EPFL and now advances distributed and edge systems at CMU.

    We have two talk lengths, 20 minutes and 40 minutes. There are 5-10 minute breaks between some, but not all, talk slots, and excellent coffee is served all day long so every break is a coffee break. Please check each time length you can work with. We often ask 40 min talks to shrink to 20 min as we try to accommodate all the best talks, and our acceptance rate has been going down over the years, toward 1:3.

    We also serve a hot breakfast and a great lunch, and amazing happy hours follow the main program each day. The hallway track is legendary, facilitated by the high ratio of speakers: 100+ out of the 600 attendees.

    We are committed to community above all and are working with underrepresented groups to send speakers. Please share this CFP with your diversity advocates and community managers, and encourage female engineers, African-American developers, and others to submit talks. If you could send such speakers on behalf of your company, it will help the community a lot. We’re also proactively reaching out to meetups, our core constituents, to help our established diversity program. We also work with companies like Stripe on diversity scholarships; let us know if you’d like to partner on this.

    Submit your best talks at https://scale.bythebay.io by May 31!

  • [Register at Bay.Area.AI] Applied Machine Learning: a Netflix Production

    This is a joint meetup By the Bay: register at http://bay.area.ai!

    -----

    Applied Machine Learning is about as mature as Software Engineering circa 1998. For Data Scientists, it’s hard to collaborate, hard to be productive and hard to deploy to production. In the last 20 years, Software Engineers have become far more collaborative thanks to tools like git, far more productive thanks to cloud computing and far more effective at delivering quality software thanks to CI/CD and agile development practices.

    At Netflix, I get to work on problems like: how do we scale Data Science innovation by making collaboration effortless? How do we enable Data Scientists to single-handedly and reliably introduce their models to production? How do we make it easy to develop ML models that humans trust? More importantly, how do we use ML to make humans BETTER? In this talk, we’ll explore how Netflix is approaching these problems to further our mission of creating joy for our 125 Million+ members worldwide!

    Speaker: Julie Pitt leads the Machine Learning Infrastructure at Netflix, with the goal of scaling Data Science while increasing innovation. She previously built streaming infrastructure behind the "play" button while Netflix was transitioning from a domestic DVD-by-mail service to an international streaming service. Julie also co-founded Order of Magnitude Labs, with a mission to build AI capable of doing things that humans find easy and today’s machines find hard: exploration, communication, creativity and accomplishing long-range goals. Early in her career, Julie developed data processing software at Lawrence Livermore National Laboratory that enabled scientists to study the newly-sequenced human genome.

    -----

    Julie is a regular speaker at Scale By the Bay. The 2019 CFP opens May 1 and ends May 31; submit your best talks early, starting May 1, at http://scale.bythebay.io!

  • Managing Globally Distributed Data for Deep Learning using TensorFlow on YARN

    The benefits of large datasets for deep learning are well known. But what if the source of this data is globally distributed? Jagane Sundar shares a system for replicating data across geographically distributed data centers, discusses the benefits of consistently replicating data that is used by TensorFlow for training, and explores the advantages of using a Paxos-based distributed coordination algorithm for replication. Jagane then details the resultant unique capability to maintain consistent writable copies of the data in multiple data centers.

    Speaker: Jagane Sundar is the CTO at WANdisco. Jagane has extensive big data, cloud, virtualization, and networking experience. He joined WANdisco through its acquisition of AltoStor, a Hadoop-as-a-service platform company. Previously, Jagane was founder and CEO of AltoScale, a Hadoop- and HBase-as-a-platform company acquired by VertiCloud. His experience with Hadoop began as director of Hadoop performance and operability at Yahoo. Jagane’s accomplishments include creating Livebackup, an open source project for KVM VM backup, developing a user mode TCP stack for Precision I/O, developing the NFS and PPP clients and parts of the TCP stack for JavaOS for Sun Microsystems, and creating and selling a 32-bit VxD-based TCP stack for Windows 3.1 to NCD Corporation for inclusion in PC-Xware. Jagane is currently a member of the technical advisory board of VertiCloud. He holds a BE in electronics and communications engineering from Anna University.

    WANdisco will be giving away 3 iPad Airs (the new model!) at the meetup. To enter the drawing, take this 3-question quiz https://forms.gle/op8PChvjuz2NXfSK6 by 12pm PST on Wed, 27 Mar and show up at the meetup for the drawing.