- [External Registration][Conference] Scale By the Bay 2019, November 13-15
Oakland Scottish Rite Center
- MODEL VERSIONING: WHY, WHEN, AND HOW
Note: model versioning and deployment is an integral part of the https://scale.bythebay.io data pipelines track. Join us very soon, in mid-November, using the code MEETBAYAREAAI15 to get 15% off all passes, including the bespoke Serverless workshop with Google! A special discount for Scale By the Bay will be revealed at the event for actual attendees only. We have two talks.
(1) MODEL VERSIONING: WHY, WHEN, AND HOW
Models are the new code. While machine learning models are increasingly being used to make critical product and business decisions, the process of developing and deploying ML models remains ad hoc. In the “wild west” of data science and ML tools, versioning, management, and deployment of models are massive hurdles to making ML efforts successful. As creators of ModelDB, an open-source model management solution developed at MIT CSAIL, we have helped manage and deploy a host of models ranging from cutting-edge deep learning models to traditional ML models in finance. In each of these applications, we have found that the key to enabling production ML is an often-overlooked but critical step: model versioning. Without a means to uniquely identify, reproduce, or roll back a model, production ML pipelines remain brittle and unreliable. In this talk, we draw upon our experience with ModelDB and Verta to present best practices and tools for model versioning, and show how a robust versioning solution (akin to Git for code) can streamline DS/ML, enable rapid deployment, and ensure high quality of deployed ML models.
Speakers: Manasi Vartak, CEO, Verta.ai; Conrado Miranda, CTO, Verta.ai
Manasi Vartak is the founder and CEO of Verta.ai (www.verta.ai), an MIT spinoff building software to enable high-velocity machine learning. Manasi previously worked on deep learning for content recommendation as part of the feed-ranking team at Twitter and on dynamic ad targeting at Google.
Conrado Miranda is the CTO at Verta.ai. Conrado has a PhD in Machine Learning and a focus on building platforms for AI. He was the tech lead for the Deep Learning platform at Twitter’s Cortex, where he designed and led the implementation of TensorFlow for model development and PySpark for data analysis and engineering. He also led efforts on NVIDIA’s self-driving car initiative, including the Machine Learning platform, large-scale inference for the Drive stack, and build and CI for Deep Learning models.
(2) Model Monitoring in Production
Machine Learning models continuously discover new data patterns in production that they have never seen during training and testing iterations. The best offline experiment can lose in production. The most accurate model is not always tolerant of minor data drift or adversarial input. Neither prodops, data science, nor engineering teams are equipped to detect, monitor, and debug model degradation behaviour. Real mission-critical AI systems require an advanced monitoring and model observability ecosystem that enables continuous and reliable delivery of machine learning models into production.
Common production incidents include:
  - Data anomalies
  - Data drifts, new data, wrong features
  - Vulnerability issues, adversarial attacks
  - Concept drifts, new concepts, expected model degradation
  - Domain drift
  - Biased training set
In this demo-based talk we discuss algorithms for monitoring text and image use cases as well as classical tabular datasets. The demo part will cover the full cycle of a machine learning model in production:
  - Model training and deployment with Kubeflow pipelines
  - Production traffic simulation
  - Model monitoring metrics configuration
  - Data drift detection
  - Drift exploration and monitoring metadata mining
  - New training dataset generation from the production feature store
  - Model retraining and redeployment
Speaker: Stepan Pushkarev is the CTO of Hydrosphere.io, a model management platform, and a co-founder of Provectus, an AI solutions provider and consultancy that is the parent company of Hydrosphere.io.
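One simple way to flag the data drift mentioned in the second talk is to compare a feature's distribution in production traffic with its distribution in the training set. The sketch below is only an illustration and is not the method presented in the talk: it computes a two-sample Kolmogorov-Smirnov statistic in plain Python. The function names, the synthetic data, and the 0.1 threshold are assumptions made for this example.

```python
import random
from bisect import bisect_right


def ks_statistic(reference, production):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest gap between the empirical CDFs of the two samples:
    0.0 for identical samples, 1.0 for samples that do not overlap at all.
    """
    ref = sorted(reference)
    prod = sorted(production)
    max_gap = 0.0
    # The empirical CDFs only change at sample points, so checking those is enough.
    for x in ref + prod:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_prod = bisect_right(prod, x) / len(prod)
        max_gap = max(max_gap, abs(cdf_ref - cdf_prod))
    return max_gap


def has_drifted(reference, production, threshold=0.1):
    """Flag drift when the KS statistic exceeds a threshold (0.1 is an assumed value)."""
    return ks_statistic(reference, production) > threshold


# Synthetic feature values: training data, fresh data from the same
# distribution, and production data whose mean has shifted.
rng = random.Random(0)
training = [rng.gauss(0.0, 1.0) for _ in range(2000)]
same_distribution = [rng.gauss(0.0, 1.0) for _ in range(2000)]
shifted = [rng.gauss(1.0, 1.0) for _ in range(2000)]

print("same distribution drifted:", has_drifted(training, same_distribution))
print("shifted distribution drifted:", has_drifted(training, shifted))
```

In practice the threshold would be tuned per feature, and text or image inputs would first need to be mapped to numeric summaries before a test like this applies.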
- [Register at https://swift.tf!] Swift as syntactic sugar for MLIR
Please register at https://swift.tf! This is a joint meetup with Swift for TensorFlow. If you RSVP here, you'll be waitlisted and nothing else will happen! Swift for TensorFlow is covered at the https://scale.bythebay.io conference in November. Reserve your seat to learn more!
-----
We need a video sponsor for this meetup, at $500. You will be mentioned in the video, if it happens, and on the meetup!
-----
Swift works great as an infinitely hackable syntactic interface to semantics that are defined by the compiler underneath it. The two options today are LLVM (there's a running joke that Swift is just syntactic sugar for LLVM) and TensorFlow graphs (the contribution of early versions of Swift for TensorFlow). Multi-Level Intermediate Representation (MLIR) is a generalization of both the LLVM IR and TensorFlow graphs that represents arbitrary computations at multiple levels of abstraction. This enables domain-specific optimizations and code generation (e.g. for CPUs, GPUs, TPUs, and other hardware targets). In the talk, we'll present some thoughts on how Swift could compile down to MLIR and show a few demos of prototype technologies that we've developed.
Eugene Burmako is working on Swift for TensorFlow at Google AI. Before joining Google, he made major contributions to Scala at EPFL and Twitter, founding the Reasonable Scala compiler, Scalameta, and Scala macros. Eugene loves compilers, and his mission is to change the world with compiler technology.
Alex Suhan is also working on Swift for TensorFlow at Google AI. He has been using LLVM to accelerate machine learning and data analytics workloads for the last five years. Alex enjoys working at the interface between software and various hardware accelerators.
Our work is the result of discussions and collaboration with many folks: our colleagues from Google, the Swift compiler team from Apple, and our community members, including Jeremy Howard from http://fast.ai.
We're very grateful for everyone's input and contributions!
- Scale By the Bay 2019 CFP Open until May 31
Friends, the month of May is when the Scale By the Bay (SBTB) CFP always runs, for the conference in November. The CFP is now open at https://scale.bythebay.io
There are three tracks, as usual:
  - Functional Programming
  - Service Architectures
  - Data Pipelines, including ML/AI
The theme for this year is the emergence of new distributed systems and their applications, including Edge, IoT, DLT, and AI on the Edge. Helena Edelson led a team at Apple enabling ML/AI with Spark, Joe Beda started Google Compute Engine and Kubernetes, and Heather Miller led the Scala Center at EPFL and now advances distributed and edge systems at CMU.
We have two talk lengths, 20 minutes and 40 minutes. There are 5-10 minute breaks between some, but not all, talk slots, and excellent coffee is served all day long, so every break is a coffee break. Please check each time length you can work with. We often ask 40-minute talks to shrink to 20 minutes as we try to accommodate all the best talks, and our acceptance rate has been going down over the years, to 1:3.
We also serve a hot breakfast and a great lunch, and amazing happy hours follow the main program between conference days. The hallway track is legendary, facilitated by the high ratio of speakers: 100+ out of the 600 attendees.
We are committed to community above all and are working with underrepresented groups to send speakers. Please share this CFP with your diversity advocates and community managers, and encourage female engineers, African-American developers, and others to submit talks. If you can send such speakers on behalf of your company, it will help the community a lot. We’re also proactively reaching out to meetups, our core constituents, to help our established diversity program. We also work with companies like Stripe on diversity scholarships; let us know if you’d like to partner on this.
Submit your best talks at https://scale.bythebay.io by May 31!
- Applied Machine Learning: a Netflix Production, Deep Recommendations at Twitch
This is a megameetup hosted by Twitch! The hosts present their tech alongside talks from Netflix and ApertureData engineers. Thank you so much, Twitch! This meetup will be twitched -- expect a link shortly!
(1) Applied Machine Learning is about as mature as Software Engineering circa 1998. For Data Scientists, it’s hard to collaborate, hard to be productive, and hard to deploy to production. In the last 20 years, Software Engineers have become far more collaborative thanks to tools like git, far more productive thanks to cloud computing, and far more effective at delivering quality software thanks to CI/CD and agile development practices. At Netflix, I get to work on problems like: how do we scale Data Science innovation by making collaboration effortless? How do we enable Data Scientists to single-handedly and reliably introduce their models to production? How do we make it easy to develop ML models that humans trust? More importantly, how do we use ML to make humans BETTER? In this talk, we’ll explore how Netflix is approaching these problems to further our mission of creating joy for our 125 Million+ members worldwide!
Speaker: Julie Pitt leads the Machine Learning Infrastructure at Netflix, with the goal of scaling Data Science while increasing innovation. She previously built the streaming infrastructure behind the "play" button while Netflix was transitioning from a domestic DVD-by-mail service to an international streaming service. Julie also co-founded Order of Magnitude Labs, with a mission to build AI capable of doing things that humans find easy and today’s machines find hard: exploration, communication, creativity, and accomplishing long-range goals. Early in her career, Julie developed data processing software at Lawrence Livermore National Laboratory that enabled scientists to study the newly sequenced human genome.
(2) Deep Recommendations at Twitch
Abstract: Twitch is a social video platform that democratizes broadcasting, with 15 million+ daily viewers. In this talk we'll explore some of the difficulties that live content introduces to recommendations, and the recommender we built to personalize many products at Twitch. In particular, we'll explore some of the architecture decisions we made and what informed them. We'll also discuss some of our learnings around offline metrics and things to keep an eye on as you move to online experiments.
Speaker: Mark Ally is a Senior Applied Scientist at Twitch, working on deep learning techniques for recommendation systems.
(3) Let Us Manage Your Visual Data So You Can Make Machines Learn Better
ApertureData's platform accelerates AI applications through its Data Management solution, which redefines how large visual data sets are stored, searched, and processed. It exposes a unified interface that allows users to store and search both the data and the metadata associated with visual artifacts (images or videos). ApertureData's platform provides several innovative features: the ability to evolve metadata easily without requiring costly schema changes, first-class status for feature vectors and bounding boxes, the ability to perform similarity searches, and the ability to perform common pre-processing operations close to the data. The platform will be pluggable, allowing data to be stored on different backends and to serve any machine learning pipeline.
Speaker: Vishakha Gupta is the Founder and CEO at ApertureData. Prior to that, she was at Intel Labs for over 7 years, where she led the design and development of VDMS (the Visual Data Management System), which forms the core of ApertureData's platform. Vishakha graduated from the Georgia Institute of Technology with a Ph.D. in Computer Science, where her work focused on virtualization.
-----
Julie is a regular speaker at Scale By the Bay. The 2019 CFP opens May 1 and ends May 31; submit your best talks early, starting May 1, at http://scale.bythebay.io!
- Managing Globally Distributed Data for Deep Learning using TensorFlow on YARN
The benefits of large datasets for deep learning are well known. But what if the source of this data is globally distributed? Jagane Sundar shares a system for replicating data across geographically distributed data centers, discusses the benefits of consistently replicating data that is used by TensorFlow for training, and explores the advantages of using a Paxos-based distributed coordination algorithm for replication. Jagane then details the resultant unique capability to maintain consistent writable copies of the data in multiple data centers. Speaker: Jagane Sundar is the CTO at WANdisco. Jagane has extensive big data, cloud, virtualization, and networking experience. He joined WANdisco through its acquisition of AltoStor, a Hadoop-as-a-service platform company. Previously, Jagane was founder and CEO of AltoScale, a Hadoop- and HBase-as-a-platform company acquired by VertiCloud. His experience with Hadoop began as director of Hadoop performance and operability at Yahoo. Jagane’s accomplishments include creating Livebackup, an open source project for KVM VM backup, developing a user mode TCP stack for Precision I/O, developing the NFS and PPP clients and parts of the TCP stack for JavaOS for Sun Microsystems, and creating and selling a 32-bit VxD-based TCP stack for Windows 3.1 to NCD Corporation for inclusion in PC-Xware. Jagane is currently a member of the technical advisory board of VertiCloud. He holds a BE in electronics and communications engineering from Anna University.
- The Feature Stores: the missing API between Data Engineering and Data Science?
This meetup is focused on Feature Stores, with three talks from Jim Dowling (Logical Clocks), Varant Zanoyan (Airbnb), and Nick Handel (Branch). Thanks to Mesosphere for hosting the event and ArangoDB for sponsoring pizza!
*The Feature Store: the missing API between Data Engineering and Data Science?*
Machine Learning (ML) pipelines are the key building block for productionizing ML code. However, pipelines are often developed as "silos": features tend not to be easily re-used across pipelines or even within the same pipeline. Silos lead to duplication, with features unnecessarily re-implemented, and in the worst case to correctness problems, for example when the features used for training and serving have inconsistent implementations. The Feature Store solves the problem of siloed and ad-hoc machine learning pipelines by providing a data layer where feature engineering can be separated from the usage of features to generate training data. That is, the Feature Store should provide a clean API separating Data Engineering from Data Science. In this talk, we will introduce the world's first open-source Feature Store, built on Hopsworks, Apache Spark, and Apache Hive and targeting both TensorFlow/Keras and PyTorch. Walking through an end-to-end pipeline, we will show how the Feature Store works as a natural interface between Data Engineers and Data Scientists, and how you can write end-to-end ML pipelines in Python only (if you so choose).
Speaker Bio: Jim Dowling is the CEO of Logical Clocks AB, as well as an Associate Professor at KTH Royal Institute of Technology in Stockholm. He is the lead architect of Hops, the world's fastest and most scalable Hadoop distribution and the first Hadoop platform with support for GPUs as a resource. He is a regular speaker at AI industry conferences, and blogs at O'Reilly on AI.
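To make the Feature Store idea from the first talk a little more concrete, here is a minimal in-memory sketch, assuming each feature is a named function of a raw record. It is not the Hopsworks API: the `FeatureStore` class, its methods, and the example features are all hypothetical. The point it illustrates is that training data and serving vectors are built from the same registered definitions, so their implementations cannot diverge.

```python
from typing import Callable, Dict, List


class FeatureStore:
    """Toy feature store: exactly one shared definition per feature name."""

    def __init__(self) -> None:
        self._features: Dict[str, Callable[[dict], float]] = {}

    def register(self, name: str, compute: Callable[[dict], float]) -> None:
        """Data engineering side: publish a feature definition once."""
        if name in self._features:
            raise ValueError(f"feature {name!r} is already registered")
        self._features[name] = compute

    def vector(self, names: List[str], record: dict) -> List[float]:
        """Serving side: compute the requested features for one record."""
        return [self._features[name](record) for name in names]

    def training_data(self, names: List[str], records: List[dict]) -> List[List[float]]:
        """Data science side: build training rows from the same definitions."""
        return [self.vector(names, record) for record in records]


store = FeatureStore()
store.register("trip_km", lambda r: r["trip_meters"] / 1000.0)
store.register("is_weekend", lambda r: 1.0 if r["weekday"] >= 5 else 0.0)

history = [
    {"trip_meters": 2500, "weekday": 2},
    {"trip_meters": 12000, "weekday": 6},
]
names = ["trip_km", "is_weekend"]

training_rows = store.training_data(names, history)
serving_row = store.vector(names, {"trip_meters": 12000, "weekday": 6})
print(training_rows)
print(serving_row)
```

A real feature store would also persist feature values, version the definitions, and serve them at low latency; this sketch only shows the separation between defining features and consuming them.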
*Zipline at Airbnb*
Zipline is Airbnb’s soon-to-be open-sourced data management platform, specifically designed for ML use cases. It has cut the time needed for training data generation from months to days and offers data management solutions from model training to serving. This talk will cover the framework at a high level, focusing on the specific challenges of data engineering for ML and how Zipline provides a solution.
Speaker Bio: Varant Zanoyan is a software engineer on the Machine Learning Infrastructure team at Airbnb, where he focuses on Zipline, a data management framework for Machine Learning. Previously, he solved data infrastructure problems at Palantir Technologies.
*Machine Learning Infrastructure at an Early Stage*
Good machine learning is built on infrastructure, but many startups don't have the bandwidth or resources to build this foundation while scaling. It's difficult to prioritize the pieces of ML infrastructure that data scientists and engineers need to be productive and successful when these projects can take small teams of engineers months or years. The dividends are large down the road, but pursuing infrastructure that doesn't work or doesn't solve the right problems can leave a team months down the road without the necessary progress. This talk focuses on the foundation that any good machine learning system is built on and the elements of ML infrastructure to focus on first.
Speaker Bio: Nick Handel serves as Branch International's Head of Data Science. Prior to joining Branch, he was a Product Manager for Airbnb's machine learning infrastructure teams. Before moving to centralize the company's artificial intelligence efforts, he was an early member of the company's data science team, helping the company expand internationally between 2014 and 2015 and leading a data science team that launched Airbnb's Trips product in 2016. Before joining Airbnb, he was a research economist at BlackRock, focusing on emerging market debt.
- Aggregations and knowledge extraction from social data: challenges and lessons
This talk is about the construction of new data assets from social media using techniques drawn from the areas of information retrieval, machine learning, graphs, and social networks. I’ll describe three projects based on Twitter and Foursquare data sets that use social data in different ways to help users in information-seeking scenarios. The first is a recommender system for recreational queries using location-based social networks. The second is a social knowledge graph derived from Twitter, with the goal of discovering relationships between people, links, and topics. The third is an application for archiving and Wikification of stories.
Omar Alonso is a Principal Applied Scientist at Microsoft, where he works at the intersection of information retrieval, social data, human computation, and knowledge graph generation. He is the co-chair of the Human Computation and Crowdsourcing track at WWW'19 and is on the organizing committee for HCOMP'19.
- Scale By the Bay 2018
Folks -- our flagship yearly gathering, scale.bythebay.io, is fast approaching, and the program is bursting at the seams with amazing talks. We are closing all the gaps with the strongest additions, many related to AI fed by our favorite data pipelines, which we know how to build so well.
  - Clément Farabet, VP of AI Infrastructure at Nvidia, will share the updates from the GPU land on AI for Self-Driving Cars
  - Alex Sergeev, the creator of Horovod from Uber, will show how to speed up your Deep Learning dramatically with it
  - Aleksandra Kudriashova, Head of Product at Astro Digital, will show how satellite imagery can help analyze the world food economy
  - Salesforce will show how data pipelines and cutting-edge R&D connect in production with Salesforce Einstein and graph analysis, with Richard Socher, Chief Scientist of Salesforce, following the engineering talks with a fireside chat and a panel.
Our Data Pipelines for AI panel this year includes Richard Socher; Peter Bailis, professor at Stanford and member of the DAWN lab there, as well as the founder of sisu.ai; Pete Skomoroch, the founder of SkipFlag (acquired by Workday); Lukas Biewald, founder of CrowdFlower and Weights and Biases; and Michelle Casbon, Google Cloud Platform ML/Big Data engineer.
Our Thoughtful Software Engineering panel, including Martin Odersky, Julie Pitt, Marius Eriksen, Runar Bjarnason, and Bryan Cantrill, will be moderated by Cliff Click -- the creator of the HotSpot JIT and cofounder of H2O.ai, who is also teaching the bespoke Advanced Software Engineering workshop the day before.
The Cloud, Edge and IoT panel now includes Anoop Nannra, the Head of Cisco Blockchain Initiative and Chairman, Trusted IoT Alliance; Roman Shaposhnik, cofounder, Zededa, and board member, Apache Software Foundation; Bernard Golden, Head of Cloud Strategy, Capital One (to be continued) -- we are looking for strong panelists representing GCP/Azure/AWS as well.
Other talks on the program include High-Performance Bayesian Inference with Rainier by Avi Bryant (Stripe), Graph Analysis by Alexis Roos (Salesforce), Privacy-Preserving Data Science in Scala by David Andrzejewski (Sumo Logic), Towards Typesafe Deep Learning by Tongfei Chen (Johns Hopkins University), The Evolution of the GoPro data platform by David Winters (GoPro), Labels to Inference by Jeff Fenchel (Zignal Labs), Structured Deep Learning by Jayant Krishnamurthy (Semantic Machines), Hadoop Future in the AI World by Milind Bhandarkar (Ampool), and many, many more -- see the full schedule at http://scale.bythebay.io. Use the code BAYHAREAAI15 for 15% off all passes while they are available! Late Bird only from November 1st.
- Scale By the Bay 2018 CFP is accepting late submissions by 6/30
Due to the overwhelming clamor for late submissions and great talks coming in still, the CFP is logarithmically extended as follows. 1/2 the program will be formed with the submissions added by 5/31. The next quarter will take into account those sent by 6/15 and the rest of the submissions that didn’t make the cut yet. The next part will be selected from all those plus the talks submitted by 6/30. A block of time is reserved for the invited talks of exceptionally high quality and importance, expanding the scope of the conference. Submit your talk at scale.bythebay.io!