• EGG NYC 2019

    Spring Street Studios

    EGG NYC, our annual thought-leadership conference, is all about the human side of AI. From advancements in human progress to discussions of transparency, interpretability, and ethics, EGG NYC will spotlight leaders like Hinge, GE Aviation, and WIRED (including their Editor-in-Chief, Nicholas Thompson!), and the work they're doing to shape the future of AI. As a special thank-you for being a member of our Meetup community, we're offering 50% off the ticket price with the discount code MEETUP. Join us and let's redefine the rules of AI, together. Reserve your spot here: https://nyc.egg.dataiku.com/

  • Essential Problem Solving Skills for Data Scientists (feat. Macy's)

    In order to gain entry, you must RSVP on both meetup AND the GA site here: https://generalassemb.ly/education/essential-problem-solving-skills-for-data-scientists/new-york-city/78862 Tentative Schedule: 6:30pm: Pizza + Beer networking 7:00pm: Hey Birdie: Building a Mini Voice Assistant for Data Science with Triveni Gandhi, Data Scientist at Dataiku 7:30pm: Growing Data Science Talent through Structured Problem Solving with Jolene Mork, Senior Data Scientist at Macy's Talk Abstracts: Hey Birdie: Building a Mini Voice Assistant for Data Science with Triveni Gandhi, Data Scientist at Dataiku In this talk, I'll review the process I used to create a mini voice assistant that can be used to navigate a dashboard hands free. I'll discuss building and deploying a speech recognition model to an API Endpoint, creating a pyaudio listener, and how you can use python to control your browser. Growing Data Science Talent through Structured Problem Solving with Jolene Mork, Senior Data Scientist at Macy's: What makes a data science team more than a collection of smart individuals feeding data through algorithms? At Macy's, the core skills that unite the data science team are a knack for structured problem solving—taking a vague business ask, breaking it down to understand the underlying question, and then abstracting the question to components solvable with math—and structured communication, which allows us to communicate and effect change in our organization. This talk will focus on a specific business case drawn from pricing to illustrate the transformation of a vague business ask into a solvable data science problem, and discuss how it reflects the structured problem solving philosophy that underlies the work of the data science team at Macy's. Speaker bios: Triveni is a Data Scientist with Dataiku. She works with clients to determine best practices around data science and their specific projects. 
Previously, she worked as a Data Analyst with a large non-profit dedicated to improving education outcomes in NYC. Triveni holds a Ph.D in Political Science from Cornell University. Jolene is a Senior Data Scientist working within the Macy’s Supply Chain organization, leading projects related to improving the return on investment in fashion inventory through enhanced pricing and allocation strategies. She is passionate about developing intuitive explanations for how models work, and influencing the organization to adopt analytics into business practices. Before Macy’s, Jolene received her Ph.D. in physical chemistry from MIT, where she performed laser spectroscopy experiments to measure properties of the fluorescent nanoparticles used to generate the color in Samsung’s QLED TVs.

  • Machine Learning Process Automation & Operationalization (feat. Meetup)

    For this meetup, we're going a little meta with Meetup's own Shayak Banerjee, a senior Machine Learning Engineer at the event platform, discussing process automation. In order to gain entry, you must RSVP on both meetup AND the GA site here: https://generalassemb.ly/education/deploying-process-automation-for-meetup-groups-approval/new-york-city/78083 Tentative Schedule: 6:30pm: Pizza + Beer networking 7:00pm: AI Meets Mail Processing by Vincent Houdebine, Data Scientist at Dataiku 7:30pm: Deploying Review Process Automation for Meetup Groups with Shayak Banerjee, Senior Machine Learning Engineer at Meetup Abstracts: AI Meets Mail Processing by Vincent Houdebine, Data Scientist at Dataiku: Virtual assistants have never sounded more human and cars are becoming driverless, yet companies still have to deal with a massive amount of mail. From unsolicited mail and bills to registered mail, mail processing solutions are a necessity. In an effort to bring AI to mail processing, we will present a prototype we've developed for a client in the insurance industry. Using Computer Vision and Deep Learning techniques, it automatically processes typed and hand-written letters to send them to the correct department within the organization. Deploying Review Process Automation for Meetup Groups with Shayak Banerjee, Senior Machine Learning Engineer at Meetup: The Meetup platform sees in excess of 500 groups created daily across the world. A team of humans reviews each group to ensure it adheres to community guidelines, and disapproves groups that do not. In this talk we’ll discuss an internal machine learning tool called Fastpass that auto-approves a small fraction of these groups based on mimicking the understanding of a human reviewer. We will talk about the model itself (features + XGBoost), the nuances and pitfalls of determining “approvability”, and how we deployed and operationalized it using Spark, Airflow and DynamoDB. We’ll touch on some of the precision/recall tradeoffs driven by product choices, and we’ll end with a look at how we set up continuous monitoring on the model performance.
Speaker bios: Vincent is a data scientist at Dataiku in Paris, where he supports Data Science teams in building efficient Data Science projects and deploying them into production. In the past few years, he has been dealing with a variety of data science and machine learning problems, from fraud detection to churn prevention and product recommendation. Shayak Banerjee is the Engineering Lead on the Machine Learning team at Meetup. He has a PhD in Electrical Engineering from The University of Texas - Austin and spent several years in the hardware industry, including co-founding his own wearable tech company. For the past few years he has been working on the Machine Learning challenges of delivering personalization at low-latency on the Meetup platform.

  • Tools for Productionalizing Data Science (feat. The New York Times)

    In order to gain entry, you must RSVP on both meetup AND the GA site here: https://generalassemb.ly/education/tools-for-productionalizing-data-science-feat-the-new-york-times/new-york-city/76456 Please note that space is limited - those who have signed up on the GA website and the meetup will be prioritized. Tentative Schedule: 6:30pm: Pizza + Beer networking 7:00pm: So you built a model, now what? by Jed Dougherty, Lead Data Scientist from Dataiku 7:30pm: Intro to Airflow for Data Analysts and Data Scientists with Brian Lavery, Senior Data Engineer and John Paletto, Data Scientist from The New York Times Talk Abstracts: So you built a model, now what? by Jed Dougherty, Lead Data Scientist from Dataiku: Dataiku’s lead data scientist, Jed Dougherty, will dive into an often overlooked aspect of the data science lifecycle: model deployment. Once they’ve constructed a data science model that does a good job accurately predicting their test set, many data scientists think the job is over. But really, it’s just begun. In this talk we’ll look at tracking model quality, updating models in production, benchmarking model API response times, A/B testing, and deciding on cluster deployment strategies. Intro to Airflow for Data Analysts and Data Scientists with Brian Lavery, Senior Data Engineer and John Paletto, Data Scientist from The New York Times: Why do Data Scientists and Engineers at the New York Times use Apache Airflow to chain together their batch jobs into workflows and how well does it scale? What is Airflow's role in productionalizing models and what other tools come into play for training models? And if you want to get started with Airflow, what are your options and how hard is the system to maintain? Speaker bios: Jed leads Dataiku's Data Science team in North America. He works with a wide variety of Fortune 500 clients and specializes in helping large companies spin up and organize Data Science teams. 
Before coming to Dataiku he worked on event detection, spam filtering, and survival analysis in the fields of breaking news, social media, and child welfare. He earned his masters at Columbia University in its QMSS program. Brian Lavery is a Senior Data Engineer at the New York Times. He currently co-hosts the NYC Apache Airflow Meetup Group. His IT career has spanned 20 years but he's been in the data engineering world for the past 12, most of that at the New York Times. Since he's worked for the Times, it has moved off of star schemas on relational databases and into the big data world. The Times has tried a lot of technologies and therefore Brian has gotten to play with a lot of different tools, from EMR and Redshift on AWS to an in-house Hadoop cluster to where the Times is now: BigQuery and Airflow on Google Cloud Platform. JD is a Data Scientist at the New York Times. He has over 3 years of experience working in and deploying data science at scale. Prior to The Times, his work focused on applying data science to predictive maintenance in the fields of aerospace and performance materials. Since joining The Times he has worked on machine learning data products for advertising, subscription growth, and print distribution. He currently spends most of his time in Python, Airflow, and Google Cloud Platform. He enjoys most sports, all ice cream and lots of cookies (preferably together).

  • Diversity as a Data Science Imperative (feat. Spotify, MongoDB, WiMLDS & Turner)

    In order to gain entry you MUST RSVP on this meetup AND General Assembly: https://generalassemb.ly/education/diversity-as-a-data-science-imperative-feat-spotify-mongodb-wimlds-turner/new-york-city/74871 Schedule 6:30pm: Pizza, Beer, Networking 7:00pm: Panel discussion 7:30pm: Break-outs w/ panelists Data science has drastically transformed in recent years, drawing attention to its ability to produce real impact across a number of industries. However, despite its many evolutions, the field has seriously lagged in successes in diversity - with the lowest gender, race, and education diversity among technical fields (https://www.forbes.com/sites/priceonomics/2017/09/28/the-data-science-diversity-gap/#2c5ff90a5f58). Thus, we're bringing together leaders from Spotify, MongoDB, WiMLDS, and Turner Broadcasting to discuss the impact diversity can have on the field, ranging from AI ethics to the perception of the data scientist role. Following the panel, we'll break out into speaker-led groups for smaller discussions on how we can actively contribute to building diversity in our own work. We will also be collecting donations for the AnitaB.org foundation through a voluntary $5.00 Venmo to @Dataiku-Inc. Panelists Inga Chen, Product Manager @ Spotify and NYC Chapter Lead for Women in Product Haile Owusu, SVP of Analytics, Decision and Data Sciences @ Turner Broadcasting A. Jesse Jiryu Davis, Staff Software Engineer @ MongoDB Reshama Shaikh, Board Member of Women in Machine Learning & Data Science Triveni Gandhi, Data Scientist @ Dataiku Speaker bios Inga Chen leads two personalization & discovery product teams at Spotify focused on building data and ML models for Discover Weekly, Release Radar, Daily Mix, Home, Voice & Search. Before Spotify, she led user-facing analytics products at Squarespace, turning data into actionable insights across web & mobile.
Before moving to New York, Inga was a product manager in San Francisco, leading a variety of consumer and enterprise products at Automatic Labs, which leveraged ML and data science to empower drivers with insights into their driving behavior and car health. Outside of her day job, Inga hosts tea ceremony pop-ups and leads the New York Chapter of Women in Product, a community of over 1,100 women product managers and leaders in NYC. Haile Owusu is senior vice president of analytics, decisions & data sciences at Turner. In this role, Owusu focuses on building out Turner’s DS capabilities, expanding the company’s scope in applying analytics, data and decision sciences to enhance its products. His DS team works closely with many of Turner’s business groups to translate strategies into execution plans for new decision support systems and audience insight strategies. Jesse is a Staff Engineer at MongoDB in New York City. He and Guido van Rossum are coauthors of "A Web Crawler With asyncio Coroutines", a chapter in the "500 Lines or Less" book in the Architecture of Open Source Applications series. Jesse lives in Manhattan with his partner Jennifer Armstrong, and their dwarf hamsters Hazel and Gertrude. Reshama is a freelance data scientist/statistician with skills in Python, R and SAS. She earned her M.S. in statistics from Rutgers University. She earned her M.B.A. from NYU Stern School of Business studying strategy, business analytics & technology management. She began her career at Educational Testing Service, then worked for over 10 years as a biostatistician in the pharmaceutical industry at companies including PPD, Merck, Thomas Jefferson University and Pfizer. She also taught math and statistics for 2 years at Temple University. Triveni is a Data Scientist with Dataiku, working with clients to determine best practices around DS and their specific projects. 
She previously worked as a Data Analyst with a large non-profit dedicated to improving education outcomes in NYC. She holds a Ph.D in Political Science from Cornell University.

  • Blockchain Technologies: Intuitive Tutorial for Data Scientists (feat. JPMorgan)

    In order to gain entry, you must RSVP on both meetup AND the GA site here: https://generalassemb.ly/education/blockchain-technologies-an-intuitive-introduction-for-data-scientists/new-york-city/73133 Tentative Schedule: 6:30pm: Pizza + Beer networking 6:45pm: A Notebook Diatribe by Will Nowak, Data Scientist at Dataiku 7:10pm: Blockchain Technologies: An Intuitive Introduction for Data Scientists by Bruno Gonçalves, Senior Data Scientist at JPMorgan Chase Talk Abstracts: A Notebook Diatribe by Will Nowak, Data Scientist at Dataiku: Use of the Jupyter notebook has exploded amongst data scientists recently - but is it really the best IDE for data science? This (not-quite-) diatribe will explore some of the limitations of the tool, primarily focusing on its effects on working on a production-ready data science environment. Blockchain Technologies: An Intuitive Introduction for Data Scientists by Bruno Gonçalves, Senior Data Scientist at JPMorgan Chase: Bitcoin has brought about a true revolution in how we think about money. In one fell stroke it solved the main problems that afflicted previous attempts at a truly digital currency: distributed consensus, double spending, and external attacks. Perhaps more importantly, it provided the first working version of a blockchain or distributed ledger. However, despite their relative simplicity, the underlying concepts on which these technologies are built are not well known and often obscured by hype and technical jargon. Since the days of Bitcoin’s founding, many other crypto-currencies have been proposed and released. This tutorial will introduce these technologies in an intuitive way for data scientists, explaining their driving algorithms, motivations, and data structures. Recent developments and proposals such as SegWit, Lightning Network and smart contracts will also be covered.
Speaker bios: Will helps power the data science team at Dataiku, implementing machine learning solutions for clients while also working internally on DSS algorithm development. Prior to Dataiku, Will was a machine learning engineer at Unbox Research. Will has a BA in Mathematics and Economics from Northwestern University, and also received an MA in education administration from Columbia University earlier in his career. Bruno Gonçalves is currently a Senior Data Scientist while on leave from a tenured faculty position at Aix-Marseille Université. Since completing his PhD in the Physics of Complex Systems in 2008 he has been pursuing the use of Data Science and Machine Learning to study Human Behavior. Using large datasets from Twitter, Wikipedia, web access logs, and Yahoo! Meme he studied how we can observe both large scale and individual human behavior in an unobtrusive and widespread manner. The main applications have been to the study of Computational Linguistics, Information Diffusion, Behavioral Change and Epidemic Spreading. He is the author of 60+ publications with 7300+ Google Scholar citations and an h-index of 30. In 2015 he was awarded the Complex Systems Society's Junior Scientific Award for "outstanding contributions in Complex Systems Science" and in 2018 was named a Scientific Fellow of the Foundation for Scientific Interchange in Turin, Italy. He is also the editor of the book Social Phenomena: From Data Analysis to Models (Springer, 2015).

  • Image Recognition Models: Ensuring Accuracy, Reproducibility, and Efficiency

    We are very excited to kick off March in partnership with the ACM to deep-dive into image recognition and labeling patterns. Tentative Schedule: 6:30pm: Pizza + Beer networking 7:00pm: Building an Image Recognition Model: Hotdog/Notdog Edition by Guilherme de Oliveira, Data Scientist at Dataiku 7:30pm: Human in the Loop: Crowdsourcing Visual Advertisements for Data Labeling with Lydia Chilton, Assistant Professor in Computer Science at Columbia University Talk Abstracts: Building an Image Recognition Machine: Hotdog/Notdog Edition by Guilherme de Oliveira, Data Scientist at Dataiku: Do you sometimes wonder if something is a hotdog or not? Do you want to build an image processor that can do it for you? Wonder no more! In this presentation Guilherme will show you how to create a simple image recognition pipeline in Dataiku DSS. You'll leave the presentation with the skills to train, test, and implement your very own hotdog image recognition model - giving Jian Yang a run for his money. (https://www.youtube.com/watch?v=ACmydtFDTGs) Human in the Loop: Crowdsourcing Visual Advertisements for Data Labeling with Lydia Chilton, Assistant Professor in Computer Science at Columbia University Images have the power to convey messages in striking and memorable ways. Although constructing visual messages is currently too hard for computers or novice users, by combining the intelligence of people and computers we can create compelling visual messages computationally. In this talk, we present VisiBlends, a flexible workflow for creating visual blends that follows the design process with steps involving brainstorming, synthesis, and iteration. An evaluation of the workflow shows that (1) decentralized groups of people can generate blends in independent microtasks, (2) co-located groups can collaboratively make visual blends for their own messages, and (3) VisiBlends improves novices’ ability to make visual blends. 
We will discuss how to decompose other complex tasks so that people and computers can collaborate in generating novel, useful and creative solutions to problems. Speaker bios: Guilherme is a Data Scientist at Dataiku. He works out of the headquarters in NYC where he helps customers build and deploy predictive applications. Before joining Dataiku, he was a fellow at the Insight Data Science Fellowship program, and prior to that he worked in quantitative finance. He holds a PhD in applied mathematics. Lydia Chilton is an Assistant Professor in Computer Science at Columbia University. She did her undergraduate work at MIT, her PhD at the University of Washington and her post-doc at Stanford. She is a member of the inaugural class of the ACM Future of Computing Academy and an early pioneer in crowdsourcing complex tasks on Mechanical Turk. Professor Chilton’s interests are in coordinating people and computers to complete complex and creative tasks that neither computers nor individuals can do alone. These tasks involve conveying messages implicitly through text and images, translating and adapting research and ideas to new areas, and finding actionable insights from data.

  • Speed-Data & Marketing Data Science at Squarespace

    General Assembly

    For Dataiku's second event of the month, we're proud to partner with Alation (https://alation.com/) and bring you a presentation on data science & marketing at Squarespace, followed by a "Speed Data" networking event to get face-to-face time with other data scientists (and maybe even our keynote speaker from Squarespace!). IMPORTANT: Please RSVP on both the meetup AND the GA website here: https://generalassemb.ly/education/speed-data-marketing-data-science-at-squarespace/new-york-city/71481 Please note that those who are signed up on BOTH sites will get priority to the event. Also, due to space constraints, we may have to turn away attendees at the door - to avoid being turned away, we recommend arriving early. Tentative schedule: 6:30pm: Pizza + beer 7:00pm: Marketing Data Science at Squarespace: The Surprising Effectiveness of Invisible Ads by Braden Purcell, Data Scientist at Squarespace 7:30pm: Speed-Data What is "Speed-data"? Imagine speed-dating, but instead of finding love, you're sharing your love for data science! Attendees who wish to participate will be paired off after the talk and will get 10-minute rounds to talk about projects they're working on, ask questions, give or get career advice, and so on. Braden, our keynote speaker from Squarespace, will also be in the mix, and you'll get the chance to get face-time with real, practicing data scientists! Marketing Data Science at Squarespace: The Surprising Effectiveness of Invisible Ads by Braden Purcell, Data Scientist at Squarespace: Squarespace makes beautiful products to help people with creative ideas succeed. We use many advertising methods to reach consumers that can benefit from our products, but careful analysis is important to determine if these ads truly provide a favorable return on investment. In this talk, I will give a broad overview of marketing data science at Squarespace.
I will then dive deeper into a recent project in which we combined experiments, data analysis, and statistical modeling to assess the effectiveness of digital ads. The results highlight how data-driven measurement can dramatically improve marketing decision-making. Braden is a data scientist on the marketing analytics team at Squarespace where he develops tools and analyses to optimize marketing spending and forecast performance. Before that he was a postdoctoral scientist at the NYU Center for Neural Science where he used computational modeling and neurophysiology to understand how the brain makes decisions. He has a PhD in cognitive neuroscience from Vanderbilt University. About Alation: Alation is the first company to bring a data catalog to market. The Alation Data Catalog combines machine learning and human collaboration to change the way people find, understand, trust, use and reuse data. More than 100 organizations leverage the Alation Data Catalog to gain confidence in data-driven decisions. Learn more at https://alation.com/

  • 10 Machine Learning Issues that Nobody Talks About feat. Twitter

    Dataiku and General Assembly will be hosting two talks exploring the best practices and common setbacks teams run into when building ML systems into their infrastructure. Please RSVP on both Meetups and the GA website here: https://generalassemb.ly/education/bigger-problems-than-big-data-10-machine-learning-issues-with-twitter-cortex/new-york-city/70327 Priority will be given to those who have RSVP'd on both sites. Tentative Schedule: 6:30pm: Pizza + Beer 7:00pm: DS Best Practices (at Scale) with Jordan Volz, Senior Data Scientist at Dataiku 7:30pm: Bigger Problems than Big Data: 10 Machine Learning Issues that Nobody Talks About with Dan Shiebler, Senior Machine Learning Engineer at Twitter Cortex Abstracts: DS Best Practices (at Scale) with Jordan Volz, Senior Data Scientist at Dataiku: Although Data Science and Big Data are two worlds that are unwieldy on their own, their intersection has proven quite cumbersome for many businesses. In this talk, we will review some strategies for success in working with big and small data, common pitfalls in the data science process, building a collaborative data science experience, and how to overcome common obstacles when making the leap to large-scale data science. Bigger Problems than Big Data: 10 Machine Learning Issues that Nobody Talks About with Dan Shiebler, Senior Machine Learning Engineer at Twitter Cortex: In this presentation, we will explore the opportunities and growing pains of Machine Learning as a serious industry force. Through this exploration, we will learn how recent research in the Machine Learning space can enable large companies to become exponentially more productive in sharing and distributing Machine Learning models and insights. We will also see how Machine Learning systems can dramatically increase system complexity and technical debt. Bios: Jordan Volz is a Senior Data Scientist at Dataiku, where he helps customers design and implement ML applications. 
Prior to Dataiku, Jordan specialized in big data technologies as a systems engineer at Cloudera, and enterprise search technology as a technical consultant at Autonomy, frequently working with large financial organizations in the US and Canada. He holds degrees from Bard College and the University of Amherst, and is academically trained in pure mathematics. Dan works at Twitter Cortex, where he develops Machine Learning Models that make sense of the world's data. In his spare time he works with the Serre Lab at Brown University to train neural networks to think like humans. Previously, Dan designed smartphone sensor algorithms for car insurance at TrueMotion.

  • Coding Outside of IT: Lessons on Automation From Risk Reporting

    NYC Data Science Academy

    Dataiku is pairing up with NYC Data Science Academy to host two presentations on the importance of automation in business practice - and how to actually achieve it. Schedule: 6:30pm: Pizza + Beer & Networking 7:00pm: Automation: Maintaining your ML systems by Kasim Patel, Data Scientist at Dataiku 7:30pm: Coding Outside of IT: Lessons in Automation From Risk Reporting by James Long, VP of Risk Management at RenaissanceRe Abstracts: Automation: Maintaining your ML systems by Kasim Patel, Data Scientist at Dataiku Automation plays a critical role in moving data through data pipelines. Typical examples include automated nightly or weekly computations of similarity matrices for recommendation systems, retraining and deployment of machine learning models in fraud or churn cases, and so on. In this talk, Kasim will present some of Dataiku’s specific use-cases in automation, including monitoring of platform usage amongst organizations and automated support ticket tagging, all from within the Dataiku DSS platform. Coding Outside of IT: Lessons in Automation From Risk Reporting by James Long, VP of Risk Management at RenaissanceRe From XKCD cartoons to HBR articles, there's a recurring trope that automation is justified exclusively by time saved from turning manual activities into automated processes. JD will challenge this assumption and show there are many other benefits to business process automation which might be of much more value than simple time savings. He will also present lessons he's learned from increasing automation in the business by improving business analyst skills, tools, and attitudes toward process automation. Bios: Kasim is a Data Scientist at Dataiku. He works on the strategy and growth team where he works with the company's own data to make Dataiku more data-driven. Before joining Dataiku, he worked as a Researcher in the Center for Brains, Minds and Machines (CBMM) at MIT.
He holds an MS in Electrical and Computer Engineering from Boston University. JD Long is a native Kentuckian, an agricultural economist, insurance quant, stochastic modeler, and cocktail party host. He's an avid user of Python, R, AWS and colorful metaphors. JD is currently a risk management VP at the global reinsurer RenaissanceRe. He lives in Jersey City, NJ with his wife, a recovering trial lawyer, and their 11-year-old Roblox-obsessed daughter.