  • Challenging Web-Scale Graph Analytics with Apache Spark

    Hello NYC data scientists! We're excited to announce our next meetup, featuring a talk from Joseph Bradley, an engineer at Databricks and committer on the Apache Spark project. Come hear from Joseph about graph analytics at scale with Spark! The meetup is hosted at Datadog's new HQ in the New York Times building. Datadog is sponsoring food and drink for the meetup as well.

    Agenda:
      • 6:30pm: Arrive, mingle, grab a bite and a drink
      • 7:00pm: Welcome from JM Saponaro (Datadog)
      • 7:05pm: "Challenging Web-Scale Graph Analytics with Apache Spark" by Joseph Bradley (Databricks)
      • 7:50pm: Q&A
      • 8:00pm: Wrap-up

    Joseph's talk: Graph analytics has a wide range of applications, from information propagation and network flow optimization to fraud and anomaly detection. The rise of social networks and the Internet of Things has given us complex web-scale graphs with billions of vertices and edges. However, in order to extract the hidden gems within those graphs, you need tools to analyze the graphs easily and efficiently. At Spark Summit 2016, Databricks introduced GraphFrames, which implemented graph queries and pattern matching on top of Spark SQL to simplify graph analytics. In this talk, you’ll learn about work that has made graph algorithms in GraphFrames faster and more scalable. For example, new implementations like connected components have received algorithm improvements based on recent research, as well as performance improvements from Spark DataFrames. Discover lessons learned from scaling the implementation from millions to billions of nodes; compare its performance with other popular graph libraries; and hear about real-world applications.

    About Joseph: Joseph Bradley is an Apache Spark Committer and PMC member working as a Machine Learning Software Engineer at Databricks. Previously, he was a postdoc at UC Berkeley after receiving his Ph.D. in Machine Learning from Carnegie Mellon.

  • Detecting Anomalies and Outliers in Real-Time

    The Yard Herald Square

    For our next meetup, we are pleased to welcome Datadog data scientist Homin Lee (https://www.linkedin.com/in/homin-lee-6644513)! Homin will be discussing how he and his colleagues have implemented systems for outlier and anomaly detection in real-time. Datadog is a monitoring service that collects and processes hundreds of billions of data points every day from web servers, databases, cloud providers, and other infrastructural components. Outlier and anomaly detection helps users make sense of the flood of data coming from their systems and identify deviations from normal levels, even when "normal" fluctuates over time. Homin will discuss the lessons that his team has learned from using these alerts on their own systems, along with some real-life examples of how to avoid false positives and negatives. Thanks to Datadog (https://www.datadoghq.com/) for sponsoring a taco bar, beers, and soft drinks at this meetup. They will also be on hand to collect entries for a post-meetup raffle. One lucky attendee will win an Apple Watch!
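    To make the idea of detecting deviations from a moving "normal" concrete, here is a minimal sketch. It is not Datadog's method, which the announcement does not describe. It assumes one simple technique: compare each point with the median of the preceding window, scaled by that window's median absolute deviation (MAD). The function name, window size, and threshold are illustrative choices.

```python
from collections import deque
from statistics import median


def rolling_anomalies(values, window=5, threshold=3.5):
    """Return indices of points far from a rolling median baseline.

    Illustrative only: each point is compared with the median of the
    previous `window` points, scaled by their median absolute deviation.
    """
    recent = deque(maxlen=window)
    flagged = []
    for i, x in enumerate(values):
        if len(recent) == window:
            base = median(recent)
            mad = median(abs(v - base) for v in recent)
            # Avoid dividing by zero when the recent history is flat.
            scale = mad if mad > 0 else 1e-9
            if abs(x - base) / scale > threshold:
                flagged.append(i)
        # The baseline follows the data, so "normal" may drift over time.
        recent.append(x)
    return flagged


# A steadily rising series with a single spike at index 8.
series = [10, 11, 12, 13, 14, 15, 16, 17, 60, 19, 20, 21]
print(rolling_anomalies(series))  # [8]
```

    Because the baseline is recomputed from recent points, the steady upward trend is not flagged; only the spike is. Using the median rather than the mean keeps the spike from distorting the baseline for the points that follow it.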

  • When Recommendation Systems Go Bad

    Meetup Inc

    For our next and final meetup of 2015, we are pleased to welcome Evan Estola. Evan is Senior Machine Learning Engineer at Meetup.com, where he works on recommendations and other large-scale data problems.

    When Recommendation Systems Go Bad

    Machine learning and recommendation systems have changed the way we interact with not just the internet, but some of the basic products and services that we use to organize and run our lives. As the people who build these systems, we have a social responsibility to consider how these systems affect people, and furthermore, we should do whatever we can to prevent these systems and models from perpetuating some of the prejudice and bias that exist in our society today. In this talk I intend to cover some of the recommendation systems that have gone wrong across various industries, and attempt to provide some solutions for prevention. First and foremost among these is awareness, but approaches involving interpretable models, using ensemble models to separate features that shouldn't interact, and designing test data sets for capturing bias will also be explored.

    Many thanks to Meetup.com for hosting this talk, and for providing food and beverages!

  • Best Practices for PySpark, with Juliet Hougland of Cloudera

    We are thrilled to welcome Juliet Hougland (https://twitter.com/j_houg) to the NYC Data Science meetup. Juliet will be talking about the Python API for Apache Spark, known as PySpark, and best practices for its use. Note: The RSVP list for this talk will close at 10:00am on the day of the event.

    Summary: PySpark (the component of Spark that allows users to write their code in Python) has grabbed the attention of Python programmers who analyze and process data for a living. The appeal is obvious: you don't need to learn a new language, and you still have access to modules (pandas, nltk, statsmodels, etc.) that you are familiar with, but you are able to run complex computations quickly and at scale using the power of Spark. In this talk, we will examine a real PySpark job that runs a statistical analysis of time series data to motivate the issues described above and provide a concrete example of best practices for real world PySpark applications. We will cover:
      • Python package management on a cluster using virtualenv.
      • Testing PySpark applications.
      • Spark's computational model and its relationship to how you structure your code.

    Bio: Juliet is a Data Scientist at Cloudera, and contributor/committer/maintainer for the Sparkling Pandas project. Her commercial applications of data science include developing predictive maintenance models for oil & gas pipelines at Deep Signal, and designing/building a platform for real-time model application, data storage, and model building at WibiData. Juliet was the technical editor for Learning Spark by Karau et al. and Advanced Analytics with Spark by Ryza et al. She holds an MS in Applied Mathematics from University of Colorado, Boulder and graduated Phi Beta Kappa from Reed College with a BA in Math-Physics.

    Thanks to Cloudera (http://www.cloudera.com/content/cloudera/en/home.html) for sponsoring food and drink for this meetup, and to work-bench (http://www.work-bench.com/) for providing meeting space.

  • Spark DataFrames and ML Pipelines for Large-Scale Data Science

    We are pleased to welcome Reynold Xin and Joseph Bradley from Databricks (https://databricks.com/) to the NYC Data Science meetup. Reynold and Joseph will be speaking about two new tools for doing data science in Apache Spark, one of today's most exciting data technologies. Food and drink will be provided by eBay, our hosts for the evening.

    **NOTE: The meeting room has a maximum capacity of 72 people. We have set the RSVP limit higher to accommodate some number of no-shows, but we will have to turn people away if we reach capacity.**

    Abstract: Data frames in R and Python have become the de facto standards for data science. However, when it comes to Big Data, neither R nor Python data frames integrate well with big data tooling or scale up to large datasets. In this talk, we will introduce the two latest efforts in Spark to scale up data science: DataFrames and machine learning pipelines. Inspired by R and Pandas, DataFrame in Spark provides concise, powerful programmatic interfaces designed for structured data manipulation. In particular, it features:
      • Ability to scale from kilobytes of data on a single laptop to petabytes on a large cluster
      • Support for a wide array of data formats and storage systems
      • State-of-the-art optimization and code generation through the Spark SQL Catalyst optimizer
      • Seamless integration with all big data tooling and infrastructure via Spark
      • APIs for Python, Java, Scala, and R (in development via SparkR)

    On top of DataFrames, we have built a new machine learning (ML) pipeline API inspired by the similarly named concept in scikit-learn. ML pipelines enable users to express ML workflows as a sequence of processing and learning stages. For example, classifying text documents might involve cleaning the text, transforming raw text into feature vectors, and training a classification model.

    Speakers: Reynold Xin is a committer on Apache Spark and a co-founder of Databricks. Before Databricks, he was pursuing a PhD at UC Berkeley AMPLab. He holds the current world record for sorting 100TB of data, and wrote the two highest-cited papers in SIGMOD 2013 and SIGMOD 2011. Joseph Bradley is a Software Engineer at Databricks, working on Spark MLlib. Previously, he was a postdoc at UC Berkeley after receiving his Ph.D. in Machine Learning from Carnegie Mellon University in 2013. His research included probabilistic graphical models, parallel sparse regression, and aggregation mechanisms for peer grading in MOOCs.

    Reynold and Joseph will be in NYC for Spark Summit (http://spark-summit.org/east/2015/agenda). Members of the meetup group can get 20% off registration by using code "NYC-DATA-SCI".
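    The text-classification workflow mentioned in the abstract (clean the text, turn it into feature vectors, train a classifier) can be sketched as a chain of stages. The following is a toy, pure-Python illustration of the pipeline idea only. It does not use Spark's or scikit-learn's API, and every class name and the tiny dataset are hypothetical.

```python
from collections import Counter


class Cleaner:
    """Processing stage: lowercase text and drop punctuation."""

    def fit(self, docs, labels):
        return self

    def transform(self, docs):
        return ["".join(c for c in d.lower() if c.isalnum() or c.isspace())
                for d in docs]


class BagOfWords:
    """Processing stage: map cleaned text to word-count vectors."""

    def fit(self, docs, labels):
        self.vocab = sorted({w for d in docs for w in d.split()})
        return self

    def transform(self, docs):
        vectors = []
        for d in docs:
            counts = Counter(d.split())
            vectors.append([counts[w] for w in self.vocab])
        return vectors


class NearestCentroid:
    """Learning stage: predict the class whose mean vector is closest."""

    def fit(self, vectors, labels):
        sums, counts = {}, Counter(labels)
        for v, y in zip(vectors, labels):
            acc = sums.setdefault(y, [0.0] * len(v))
            for i, x in enumerate(v):
                acc[i] += x
        self.centroids = {y: [s / counts[y] for s in acc]
                          for y, acc in sums.items()}
        return self

    def predict(self, vectors):
        def dist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))
        return [min(self.centroids, key=lambda y: dist(v, self.centroids[y]))
                for v in vectors]


class Pipeline:
    """Run processing stages in order, then the final learning stage."""

    def __init__(self, transformers, model):
        self.transformers = transformers
        self.model = model

    def fit(self, docs, labels):
        data = docs
        for stage in self.transformers:
            data = stage.fit(data, labels).transform(data)
        self.model.fit(data, labels)
        return self

    def predict(self, docs):
        data = docs
        for stage in self.transformers:
            data = stage.transform(data)
        return self.model.predict(data)


train_docs = ["Spark runs on a cluster!", "Cluster jobs use Spark.",
              "I love tacos and beer", "Tacos, beer, and pizza"]
train_labels = ["tech", "tech", "food", "food"]

pipeline = Pipeline([Cleaner(), BagOfWords()], NearestCentroid())
pipeline.fit(train_docs, train_labels)
print(pipeline.predict(["spark cluster", "pizza and tacos"]))
```

    The point of the abstraction is that the same ordered stages are applied at training and prediction time, so the whole workflow can be fitted, reused, and swapped stage by stage as a single object.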

  • Data Science and Journalism, with Daeil Kim of the New York Times

    For our next meetup, we are pleased to welcome Daeil Kim (http://www.daeilkim.com/) from the New York Times! Daeil is a data scientist at the Times and is also finishing up a PhD in machine learning at Brown University. His research focus at Brown is "the development of scalable Bayesian Nonparametric models to help discover hidden structures within complex datasets such as documents, networks, and more." Daeil will be discussing how machine learning was used to support a recent story about Takata airbags, as well as ways that the latest, state-of-the-art machine learning can be used to assist in the investigative journalism process. Food and drink will be provided by Metis (http://www.thisismetis.com/data-science), our hosts for the evening, who run a data science bootcamp here in NYC. Metis is currently hiring a Data Science Lead (http://www.thisismetis.com/data-science-lead) and a Data Science Bootcamp Instructor (http://www.thisismetis.com/data-science-bootcamp-instructor).

  • Building data pipelines using Luigi, with Erik Bernhardsson of Spotify

    For our next meetup we are pleased to welcome Erik Bernhardsson (http://erikbern.com/), the Engineering Manager for Music Discovery & Machine Learning at Spotify. Erik will be talking about Luigi (https://github.com/spotify/luigi), a powerful and versatile open-source framework for building data pipelines, of which he is one of the principal authors. Luigi is a Python module for building automated data pipelines for complex workflows—it provides dependency management, step-by-step workflow execution, and visualization, and is designed to be scalable and fault-tolerant. Luigi has found use at Spotify (of course) and lots of other companies—for two great examples, check out how Asana (https://eng.asana.com/2014/11/stable-accessible-data-infrastructure-startup/) and Buffer (https://overflow.bufferapp.com/2014/10/31/buffers-new-data-architecture/) use it to orchestrate complex analytics pipelines. Erik's bio: "Swedish computer nerd living in NYC. I graduated with a master's degree in Physics from KTH in Stockholm, but I've been writing code for 20+ years. My work has ranged from embedded systems to high frequency trading to machine learning. Since joining Spotify in 2009, I've designed and built many large-scale machine learning algorithms we use to power the recommendation features. I've led the team that built and released features like the radio feature, the 'Discover' page, 'Related Artists', and much more." Pizza, beer and soft drinks will be provided, co-sponsored by Mortar (https://www.mortardata.com/) and Spotify (https://www.spotify.com/us/).
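    As a rough illustration of what dependency management and step-by-step execution mean in a data pipeline, here is a toy sketch in plain Python. It is not Luigi's implementation or API. The `Task` class, the `build` function, and the task names are all hypothetical, chosen only to show tasks running after the tasks they depend on.

```python
class Task:
    """Toy task: names its dependencies and records when it has run."""

    def __init__(self, name, requires=()):
        self.name = name
        self.requires = list(requires)
        self.done = False

    def run(self, log):
        # A real task would read inputs and write outputs here.
        log.append(self.name)
        self.done = True


def build(task, log, visiting=None):
    """Run `task` after its dependencies, skipping finished tasks."""
    visiting = set() if visiting is None else visiting
    if task.done:
        return
    if task.name in visiting:
        raise ValueError(f"dependency cycle at {task.name}")
    visiting.add(task.name)
    for dep in task.requires:
        build(dep, log, visiting)
    visiting.discard(task.name)
    task.run(log)


# A small workflow: report needs clean and stats, and stats needs clean.
extract = Task("extract")
clean = Task("clean", [extract])
stats = Task("stats", [clean])
report = Task("report", [clean, stats])

order = []
build(report, order)
print(order)  # ['extract', 'clean', 'stats', 'report']
```

    Asking for the final task is enough: each upstream task runs exactly once, in dependency order, and a second call to `build(report, order)` does nothing because every task is already marked done.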

  • Computational Social Science, with Jake Hofman of Microsoft Research

    We're pleased to welcome Jake Hofman of Microsoft Research to the NYC Data Science Meetup! Jake's talk will provide an overview of several recent projects in modeling social data that incorporate various aspects of applied statistical inference and machine learning to address questions in social science. This will include one study on the value of aggregate search activity for predicting offline events, another on variation in online activity by different demographic groups, and a third exploring how information spreads in social networks. Bio: Jake Hofman (@jakehofman (https://twitter.com/jakehofman)) is a Researcher at Microsoft Research in New York City, where his work in computational social science involves applications of statistics and machine learning to large-scale social data. Prior to joining Microsoft, he was a member of the Microeconomics and Social Systems group at Yahoo! Research. Jake is also an Adjunct Assistant Professor of Applied Mathematics at Columbia University, where he has designed and taught classes on a number of topics ranging from biological physics to applied machine learning. More information is available at http://jakehofman.com. Food and drinks sponsored by our hosts for the evening, Metis, which runs a Data Science Bootcamp (http://www.thisismetis.com/data-science) right here in NYC!

  • Online Data Visualization with Matt Sundquist of Plotly

    We've covered a lot of topics at the NYC Data Science meetup, but we haven't yet had an entire talk devoted to data visualization. For our next meetup, we're changing that! We're excited to welcome Matt Sundquist, a co-founder at Plotly (https://plot.ly), which is a great online tool for graphing and sharing data. Says Matt: "Plotly is a platform for analyzing data and making interactive graphs. It's collaborative, free, and entirely online. In this talk, we'll look at how Plotly works, using the web interface to make 2D, 3D, and live-streaming graphs rendered with D3.js, a JavaScript visualization library. We'll delve into IPython and Plotly's packages and support for Python, matplotlib, R, ggplot2, MATLAB, Julia, Excel, Mathematica, and more. We'll also discuss lessons learned in building a web-based graphing product." See you there! Also stay tuned for details about our next meetup, featuring a talk by Jake Hofman of Microsoft Research.

  • Data for Good, with Jake Porway

    Meetup Inc

    We're delighted to welcome Jake Porway, Founder and Executive Director of DataKind (http://www.datakind.org/), to the NYC Data Science Meetup! Jake will be talking about DataKind's excellent work applying data science for social good. As you may know, DataKind is an organization headquartered in New York that brings high-impact organizations dedicated to solving the world’s biggest challenges together with leading data scientists to improve the quality of, access to and understanding of data in the social sector. About Jake Porway (http://www.datakind.org/aboutus/): Jake is a machine learning and technology enthusiast who loves nothing more than seeing good values in data. He founded DataKind™ in the hopes of creating a world in which every social organization has access to data capacity to better serve humanity. He was most recently the data scientist in the New York Times R&D lab and remains an active member of the data science community. He holds a B.S. in Computer Science from Columbia University and his M.S. and Ph.D. in Statistics from UCLA. Follow Jake and DataKind on Twitter: @jakeporway (https://twitter.com/jakeporway/), @datakind (https://twitter.com/DataKind/) We hope to see you there. Many thanks to Meetup HQ for hosting!