• Scale By the Bay 2021 CFP is now open until May 31/June 15!

    Scale By the Bay 2021 returns online in October this year.


    A major independent conference for the Bay Area and the world, we're in our 9th year. Our defining characteristics are:

    -- deeply technical content accepted on merit
    -- data engineering for AI working together with software engineering and devops
    -- soup to nuts, high performance to distributed systems approach

    The CFP works in two stages as always:
    -- Submit a first-choice talk by May 31
    -- Submit a talk to be considered for the program by June 15.

    Early bird registration is also open!


  • Scale By the Bay 2020 begins this Thursday!

    Online event

    Folks -- the first-ever online SBTB is this week!


    The online ticket is already a low $125, and we give you 20% off that with the code SFHADOOP20.


    Some highlights:

    Martin Odersky opens SBTB on 11/12 with Countdown to 3!

    Matei Zaharia and Anima Anandkumar keynote

    Li Haoyi, Getting Things Done in the Scala REPL

    Julien Truffaut, Monocle 3: a peek into the future

    Adam Warski, Project Loom? Better Futures? What’s next for JVM concurrent programming

    Prof. Bayer, the co-creator of B-trees, presents C-chain: the Integration of 5G and real time Blockchain

    Shameera Rathnayaka of Spotify, Materialize Typeclasses with Magnolia

    Justin Heyes-Jones, YoppWorks, Applicative: The Origin Story

    Steve Cosenza, Twitter, Rebuilding Twitter’s public API

    Greg Kesler, Intuit, Query Planning in GraphQL

    Lei Gao, Workday, Goku Flow: A Self-Service Data Pipeline Builder

    Prashant Sharma, IBM, Apache Spark meets FIPS standard

    Dean Wampler, Domino Data Lab, Ray: A System for High-performance, Distributed Machine Learning Applications

    Dirk Slama, VP Co-Innovation, Bosch, AIoT: Why now? And How To?

    Antje Barth, AWS, Put Your Machine Learning on Autopilot

    We have three debate panels we are (in)famous for:

    Will AI Kill Programming?

    Were Microservices a Huge Mistake?

    Programming Languages in the Era of the Cloud

    See the full program at https://www.scale.bythebay.io/schedule, and register!

  • Scale By the Bay 2020 Program is Live, Early Bird Registration Open until 9/30

    Scale By the Bay, the iconic developer conference, returns for its 8th year, by the global bay, at https://scale.bythebay.io -- register for Early Bird passes until September 30th!

    As always, we have three tracks:
    -- thoughtful software engineering
    -- cloud-native applications (reactive systems and microservices)
    -- end-to-end data pipelines (with ML and AI)

    This year, due to our global nature, we expect thousands of attendees. The three tracks are spread over two days, with two tracks running each day.

    We start at 6:30AM San Francisco time and have two morning keynotes, at 6:45AM and at 9:30AM, to cover both the EU and US East Coast as well as other timezones, and run until 7:30PM. We run a dedicated Q&A track and an innovative community track that will aim to recreate our legendary hallway track experience, with 20-40% of the attendees mingling and old and new friends (re)connecting.

    This year's theme is Cloud-Native Applications. We focus on the best way to build them with functional programming, and the best way to feed them data and use ML/AI to make them deliver for their companies.

    Our keynote speakers include Martin Odersky, the creator of Scala (with a Scala 3 update), Jaana Dogan of Google, and Matei Zaharia (Stanford and Databricks).

    The companies presenting at SBTB 2020 include Adobe, AWS, Databricks, Facebook, Google, IBM, Twitch, Twitter, Salesforce, SAP, Spotify, Wix, and more.

    We cover topics such as Reactive Systems, GraphQL, JAMstack, Cloud Streaming, Stateful Serverless, and TypeScript, Rust, Scala, Swift and Haskell in production, among others.

    Some highlights:
    -- the new Twitter API from the creator of Finatra
    -- Machine learning with Scala 3 at Twitter
    -- Blockchain in IoT from the inventor of the B-tree Prof. Bayer (TU Munich and catena)
    -- Programming hardware in Rust
    -- Using memory to optimize Facebook Presto latency
    -- MLOps on Azure
    -- and more.

    Registration is open, and the price is much lower than for a physical conference. It is not free, as we work to produce the event and pay for the staff and the tools we use to stream and connect, but we are able to do it at a lower cost than a physical conference. Your support allows us to do the best job and also to sustain Scale By the Bay until it returns to the actual Bay for all of us to attend. To that end, we created a Patron of the Arts and Sciences By the Bay guild, whose members will be invited to exclusive events and will form the core of those who support the community By the Bay over the years. We also invite instant sponsors to register and share their logo with us. If your company would like to sponsor at higher levels, the prospectus is available.

    We hope to see you by the bay soon, online and for real!

  • Reactive Summit + SBTB 2020 CFP Open through July 31

    Online event

    Scale By the Bay is now following Reactive Summit, the conference of the Reactive Foundation, a Linux Foundation project focused on cloud-native applications.

    We're happy to report that Martin Odersky and Matei Zaharia, the creators of Scala and Spark, will keynote SBTB 2020, among other awesome keynote speakers.

    The joint CFP is extended through July 31.

    There's still time to submit a talk: https://scale.bythebay.io!
    We look forward to more great speakers joining us in November.

  • Exabytes Delivered Daily: Lessons Building Cloud Software at Databricks

    This is a joint event with SF Scala. RSVP there! The Zoom link is available only to those who do.


    Exabytes Delivered Each Day: Some Lessons Building Large-Scale Cloud Software at Databricks

    Cloud service developers need to handle massive scale workloads from thousands of customers with no downtime or regressions. In this talk, we’ll present our experience building a very large scale cloud service at Databricks, which provides a data and ML platform as a service running over AWS and Azure used by some of the largest enterprises in the world. Databricks launches millions of VMs per day in each of these clouds that process exabytes of data per day for interactive, streaming and batch production applications. This means that our control plane has to handle a wide range of workload patterns and cloud issues such as outages. We will describe how we built our control plane for Databricks using Scala services and open source infrastructure such as Kubernetes, Envoy, Prometheus, Apache Spark and Delta Lake, and various design patterns and engineering processes that we learned along the way to make development and operations reliable.

    Speaker: Matei Zaharia is the Chief Technologist and cofounder at Databricks, the creator of Apache Spark, and an Assistant Professor at Stanford University, where he cofounded the DAWN Lab.

    Note: Matei will keynote the Scale By the Bay 2020 Conference in November; the CFP is still open at https://scale.bythebay.io!

  • First Choice Scale By the Bay 2020 CFP is now Open through May 31

    The 8th Annual Scale By the Bay developer conference will be held either online or in person in November, 2020.

    The CFP is now open at https://scale.bythebay.io.

    The First Choice CFP will run until May 31st, when 1/2 of the program will be selected. The next 1/4 will be selected by June 30th, and so on. The bar will move higher in each iteration, allowing for the strongest talks to still join.

    Please submit your best talk early. We hope to see you on the program!

  • Starting 2020 at Microsoft Reactor: Making Apache Spark Better with Delta Lake

    From now on, all the meetups By the Bay will be announced at the umbrella meetup, Scale By the Bay.


    We'll keep the downstream meetups for consistency and will cross-post. If you are a member of any of these, join us and you will be up to date on the holistic, full-stack approach to software systems.

    We’re kicking off the year at our new partner venue, Microsoft Reactor!

    Apache Spark™ is the dominant processing framework for big data. Delta Lake adds reliability to Spark so your analytics and machine learning initiatives have ready access to quality, reliable data. This session covers the use of Delta Lake to enhance data reliability for Spark environments.

    -- The role of Apache Spark in big data processing
    -- Use of data lakes as an important part of the data architecture
    -- Data lake reliability challenges
    -- How Delta Lake helps provide reliable data for Spark processing
    -- Specific improvements that Delta Lake adds
    -- The ease of adopting Delta Lake for powering your data lake

    Chris Hoshino-Fish is a Solutions Architect at Databricks. Chris is an active member of the Performance Subject Matter Expert group and a former Principal Consultant focused on Data Engineering, working with several Fortune 500 Databricks customers. Prior to Databricks, Chris worked for an adtech company as a data engineer managing pipelines using Apache Spark for 3.5 years. Chris has a B.A. in Computational Mathematics from the University of California, Santa Cruz.

    Lightning Talks
    -- we'll open the floor for the rest of the meetup to the lightning talks proposed in the comments!

  • [External Registration][Conference] Scale By the Bay 2019, November 13-15

    This year Scale By the Bay (https://scale.bythebay.io) runs for only two days, but we packed an incredible 70 sessions into them! We start with a hot breakfast and excellent coffee. Coffee never ends -- a continuous, uninterruptible supply of great coffee is a hallmark of every conference By the Bay. Each morning there is a keynote where we all gather as a community, and a panel closes each conference day, where we all get together again before the happy hour -- also every day.

    The heart of the conference is its four iconic tracks: Thoughtful Software Engineering, Service Architectures, and End-to-end Data Pipelines up to ML/AI, which we historically call Functional, Reactive, and Data. That's three, right? The fourth is the hallway track -- and we're legendary for it!

    The core theme this year is Distributed Systems. Joe Beda, Principal Engineer at VMware and the co-creator of Kubernetes, keynotes one day, and Heather Miller, Professor at CMU and former Director of the Scala Center, keynotes the other. We have multiple talks considering cloud deployments on Kubernetes in concert with other systems, such as Kafka, Spark, and Flink. We cover important new directions with Unison, and inherent issues such as Change Data Capture from Disney Streaming. We will learn about the new GIS features for Google BigQuery from their author, about the Databricks Data Lake approach, and infrastructure as code at Target.

    Our "reactive" track started as reactive microservice architectures but came to encompass all kinds of systems, as well as data manipulation techniques. We'll see how Lyft is enabling real-time queries with Apache Kafka, Flink and Druid. We'll hear about the lessons learned developing and running Netty from its creator. We'll see how Serverless is developed at Google.

    Machine Learning and AI are only as scalable as the data pipeline feeding them. Moreover, you need to ensure your data is typesafe and your predictions are based on data whose integrity or even privacy is provable. This year, we have three talks on Swift for TensorFlow, including from the original Google team developing it, as well as Coinbase and Quarkworks. We'll hear from Sony Entertainment on near real-time, low-latency predictions, and from many, many other leaders.

    And we'll uphold the rigorous and thoughtful software engineering that is the underpinning of every system scalable in time and tech space -- a system that can deliver but also grow with companies and their people. We'll hear from Comcast and Netflix on human-centric software engineering and ML organizations. We'll hear about community-first Open-Source approaches. We'll see how F# invigorates the .NET ecosystem with a functional approach, including JavaScript apps, and how Scala with React is doing the same for full-stack development on the JVM. We'll hear about Rust, Haskell, Scala, Java, Python, F# and other ecosystems used for quality development and production deployment. We'll see how the sausage is made at JetBrains to power our IDEs. We'll dig deeper into GraalVM with Oracle and Twitter, as well as Scala Native.

    We pioneered GraphQL at Scale By the Bay three years ago, when almost nobody had heard about it. Furthermore, our focus was not on the frontend alone but on middleware usage of GraphQL. This year Nick Schrock, a co-creator of GraphQL, joins us.

    The day before the conference, we run a bespoke, all-day, hands-on training that we build specifically for SBTB. This year, it's the Portable Serverless Workshop with Ryan Knight and James Ward. James is now at GCP and is in the driver's seat on the road to the serverless future. You'll go home with a complete serverless backend under your belt!

    All in all, we'll have a lot of fun, pack a year of learning into just two or three days, and again experience the magic that makes Scale By the Bay a legend! Reserve your Early Bird seat soon at https://scale.bythebay.io.


    Note: ML Model Versioning, Deployment, and Monitoring are core themes of Scale By the Bay 2019 (https://scale.bythebay.io), 11/14-15, Oakland. Reserve your seat today using the code MEETSFHADOOP15 for 15% off all passes, including the complete Serverless workshop!

    Joint meetup -- please RSVP at http://bay.area.ai!


    Models are the new code. While machine learning models are increasingly being used to make critical product and business decisions, the process of developing and deploying ML models remains ad hoc. In the “wild west” of data science and ML tools, versioning, management, and deployment of models are massive hurdles in making ML efforts successful. As creators of ModelDB, an open-source model management solution developed at MIT CSAIL, we have helped manage and deploy a host of models ranging from cutting-edge deep learning models to traditional ML models in finance. In each of these applications, we have found that the key to enabling production ML is an often-overlooked but critical step: model versioning. Without a means to uniquely identify, reproduce, or roll back a model, production ML pipelines remain brittle and unreliable. In this talk, we draw upon our experience with ModelDB and Verta to present best practices and tools for model versioning and show how having a robust versioning solution (akin to Git for code) can streamline DS/ML, enable rapid deployment, and ensure high quality of deployed ML models.
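
    As a rough illustration of what "uniquely identify" can mean in practice, here is a minimal sketch of a content-derived model identifier. It is not ModelDB's or Verta's API; the function name `model_version`, its parameters, and the sample values are all hypothetical, and a real system would record far more (code revision, environment, metrics).

```python
import hashlib
import json


def model_version(hyperparams: dict, weights: bytes, training_data_id: str) -> str:
    """Return a short, deterministic identifier for one trained model (hypothetical helper)."""
    digest = hashlib.sha256()
    # Sort keys so that dict ordering does not change the identifier.
    digest.update(json.dumps(hyperparams, sort_keys=True).encode("utf-8"))
    digest.update(b"\x00")  # separator so adjacent fields cannot run together
    digest.update(training_data_id.encode("utf-8"))
    digest.update(b"\x00")
    digest.update(weights)
    return digest.hexdigest()[:12]


if __name__ == "__main__":
    v1 = model_version({"lr": 0.01, "depth": 6}, b"weights-a", "dataset-2019-10")
    v2 = model_version({"depth": 6, "lr": 0.01}, b"weights-a", "dataset-2019-10")
    v3 = model_version({"lr": 0.02, "depth": 6}, b"weights-a", "dataset-2019-10")
    print(v1 == v2)  # same model described with a different key order
    print(v1 == v3)  # one hyperparameter changed
```

    Because the identifier depends only on the inputs, the same hyperparameters, weights, and dataset reference always map to the same version, which is one simple way to make a deployed model traceable and a rollback target unambiguous.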

    Speakers: Manasi Vartak, CEO, Verta.ai; Conrado Miranda, CTO, Verta.ai

    Manasi Vartak is the founder and CEO of Verta.ai (www.verta.ai), an MIT-spinoff building software to enable high-velocity machine learning. Manasi previously worked on deep learning for content recommendation as part of the feed-ranking team at Twitter and dynamic ad-targeting at Google.

    Conrado Miranda is the CTO at Verta.ai. Conrado has a PhD in Machine Learning and a focus on building platforms for AI. He was the tech lead for the Deep Learning platform at Twitter’s Cortex, where he designed and led the implementation of TensorFlow for model development and PySpark for data analysis and engineering. He also led efforts on NVIDIA’s self-driving car initiative, including the Machine Learning platform, large-scale inference for the Drive stack, and build and CI for Deep Learning models.

    (2) Model Monitoring in Production

    In production, Machine Learning models continuously encounter new data patterns they have never seen during training and testing iterations.

    The best offline experiment can lose in production. The most accurate model is not always tolerant of minor data drift or adversarial input. Neither prodops, data science, nor engineering teams are equipped to detect, monitor, and debug model degradation behaviour.

    Real mission-critical AI systems require an advanced monitoring and model observability ecosystem that enables continuous and reliable delivery of machine learning models into production. Common production incidents include:
    - Data anomalies
    - Data drifts, new data, wrong features
    - Vulnerability issues, adversarial attacks
    - Concept drifts, new concepts, expected model degradation
    - Domain drift
    - Biased Training set

    In this demo-based talk we discuss algorithms for monitoring text and image use cases as well as for classical tabular datasets.
    The demo will cover the full cycle of a machine learning model in production:
    - Model training and deployment with Kubeflow pipelines
    - Production traffic simulation
    - Model monitoring metrics configuration
    - Data drift detection
    - Drift exploration and monitoring metadata mining
    - New training dataset generation from production feature store
    - Model retraining and redeployment
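
    To make the data drift detection step concrete, here is a minimal sketch for a single numeric feature. It is not the algorithm used in the talk or in Hydrosphere.io; it swaps in a plain two-sample Kolmogorov-Smirnov statistic, and the function names and the 0.3 threshold are assumptions chosen for illustration.

```python
from bisect import bisect_right


def ks_statistic(reference, production):
    """Largest gap between the empirical CDFs of two samples (two-sample KS statistic)."""
    ref = sorted(reference)
    prod = sorted(production)
    if not ref or not prod:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    # Both CDFs are step functions, so the largest gap occurs at an observed value.
    for x in ref + prod:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_prod = bisect_right(prod, x) / len(prod)
        gap = max(gap, abs(cdf_ref - cdf_prod))
    return gap


def has_drifted(reference, production, threshold=0.3):
    """Flag drift when the KS statistic exceeds a hypothetical, tunable threshold."""
    return ks_statistic(reference, production) > threshold


if __name__ == "__main__":
    training_values = [1, 2, 3, 4]
    production_values = [3, 4, 5, 6]
    print(ks_statistic(training_values, production_values))  # 0.5
    print(has_drifted(training_values, production_values))   # True
```

    In a monitoring setup, a check like this would run per feature on a window of production traffic against the training sample; a flagged feature is then a candidate for the drift exploration and retraining steps listed above.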

    Stepan Pushkarev is the CTO of Hydrosphere.io, a model management platform, and a co-founder of Provectus, an AI solutions provider and consultancy and the parent company of Hydrosphere.io.

  • OmniSci-StreamSets F1 Demo + Scaling StreamSets On Azure Kubernetes Service

    This is a joint event with the StreamSets User Group.


    Join us for two great presentations, food, and beverages, and take a turn in OmniSci's instrumented F1 simulator!


    6:30pm - Food, beverages, and try out OmniSci's F1 simulator!

    7:15pm - Creating the OmniSci F1 Demo: Real-Time Data Ingestion With StreamSets

    Veda Shankar - Senior Developer Advocate - OmniSci

    Telematics is a rapidly growing use case for IoT and Big Data, and OmniSci hacked an F1 racing game to demonstrate how telematics data can be collected and analyzed in real time.

    A combination of open source tools was used to generate, capture, process, analyze, and chart the data from a Formula 1 racing simulation. StreamSets was used to visually architect and implement the data flows with an open-source Docker container. Read this blog for more details: https://streamsets.com/blog/omnisci-f1-demo-real-time-data-ingestion-streamsets/

    7:45pm - Scaling StreamSets On Azure Kubernetes Service

    Speaker TBD

    Provisioning agents are containerized applications that run within a container orchestration framework, such as Kubernetes. You can run Kubernetes on-premises, or leverage cloud-based solutions such as Azure Kubernetes Service (AKS) and Google Kubernetes Engine for a "pay-as-you-consume" model without the complexity of implementation, installation, and maintenance.

    In this session, we will show how to scale StreamSets Data Collector instances on Azure Kubernetes Service (AKS) using provisioning agents that help automate upgrading and scaling resources on demand, without having to stop execution of dataflow pipeline jobs.

    8:30pm - Close