• Recommendation Systems of Tumblr || Realtime Schema Enrichment || Apache Druid

    ****** Presentations ****** Talk 1: Personalized Related Blog Recommendation with User Feedback Speaker/Bio: Zhisheng Li, Principal Tech Lead. Zhisheng Li currently leads the "Recommendations team", which is responsible for Tumblr's near real-time and offline recommendation systems. Abstract: Tumblr, a popular microblogging platform, hosts hundreds of millions of blogs. As on other social networking sites, one major behavior for Tumblr users is to follow interesting blogs so they can browse the content from these followed blogs directly on their dashboards. Blog recommendation has proven to be an effective way to help users follow blogs and contributes 50% of our daily blog follows. Specifically, related blog recommendation is the best performer among all Tumblr recommendation techniques: it recommends the most relevant blogs immediately after a user follows a specific blog. However, the old related blog recommendation system was not personalized and couldn't provide the best user experience. In this work, we propose innovative approaches that leverage user feedback to adjust the rankings of related blogs, so as to improve the relevancy and freshness of related blogs for each particular user. The A/B test results show that this personalized related blog recommendation approach increased the related blog daily follow count and follow rate significantly. We launched the personalized related blog recommendation system into production. To date, it brings 2 million daily blog follows at a 17% blog follow rate. *** Talk 2: Schema Management and Real-time Enrichment with Kafka Speaker/Bio: Max McKittrick is a data engineer at Capital One, where he works on the company's enterprise clickstream application, applying DevOps best practices to real-time stream processing. 
Prior to joining Capital One, he completed his MS in information science at the University of Illinois, where he worked as an NLP researcher and consultant and was later selected as an Insight Data Engineering fellow in summer 2017. In his spare time, he enjoys analyzing data in R and playing modular synthesizers. Abstract: At Capital One, engineers on the Enterprise Customer Intelligence team maintain a clickstream application that serves the entire company. Kafka is an important part of this application, and messages must be enriched prior to being consumed by other internal teams. In this talk, I will discuss the challenges and lessons learned in developing real-time enrichments. *** Talk 3: Inside Apache Druid: Designed for Performance Speaker/Bio: Gian Merlino, co-founder of Imply, a San Francisco-based technology company, and a committer on Apache Druid. Previously, Gian led the data ingestion team at Metamarkets (now a part of Snapchat) and held senior engineering positions at Yahoo. He holds a BS in Computer Science from Caltech. Abstract: A technical talk - Apache Druid is a modern analytical database that implements a memory-mappable storage format, indexes, compression, late tuple materialization, and a query engine that can operate directly on compressed data. There is also a patch out to add vectorized processing, which we can expect to see in a future release. This talk goes into detail on how Druid's query processing layer works and how each component contributes to achieving top performance for analytical queries.

  • Building an AWS-hosted Data Platform || Design Financial Data Interfaces

    Schedule: 6:00 - Doors & Food 6:30 - Talk 1 7:15 - Talk 2 7:45 - Wrap & Chat ********* Talk 1: An Opinionated Guide to Building an AWS-hosted Data Platform Presenters: Tom LeRoux, VP of Data Engineering and Analytics @ Disney Streaming Abstract: These days there are many ways to build a cloud-based data warehouse. While AWS makes it easier to deploy infrastructure, it does not provide a prescriptive way to build out a data and analytics platform that meets the needs of both data producers and data consumers. In this talk we will dive into the particular design biases that helped us choose our data architecture for The Walt Disney Company’s direct-to-consumer video businesses globally, including the ESPN+ premium sports streaming service and Disney+, the upcoming Disney subscription video service. We will dig into the different patterns of streaming and batch data ingestion, and talk about how different types of data are transformed and made available to the organization. Bio: Tom LeRoux is VP of Data Engineering at Disney Streaming Services. Tom joined DSS in July of 2018 and runs the data platform that powers Disney+ and ESPN+. Prior to DSS, Tom worked at Goldman Sachs, where he led the team that built Goldman's new consumer banking data and analytics platform. ********* Talk 2: How to Design and Scale Financial Data Models and Interfaces Presenter: Liwei Mao, Senior Software Engineer @ Button Abstract: Building a financial data store can be hard. You have many users of financial data within a company. There's the finance team, who sends out monthly invoices and makes projections; the marketing team, who uses financial data to gauge the efficiency of campaigns; the analytics team, who integrates financial data to provide company KPIs; and lastly, your company's external users, to whom you promised easy and accurate access to the transactions you process for them. In engineering, we often hear "let's have a single source of truth". 
It’s easy to mistake that to mean "let’s aim to have a single financial data interface that serves all these users' needs!" In this talk, we'll detail why that doesn't work. We'll discuss how to design financial data models and interfaces that flexibly and performantly serve user needs while fulfilling the high accuracy requirements for financial data. Lastly, we'll talk about some strategies for scaling and optimization. Bio: Liwei Mao is a Senior Software Engineer at Button. She loves designing data products, nerding out over databases, and is a firm believer that good data design removes friction in building products.

  • Data Validation and Alerting. How does Airflow fit in?

    New York Times Building

    Note: This meetup event is being organized as a special joint effort with the NYC Apache Airflow Meetup group: https://www.meetup.com/NYC-Apache-Airflow-Meetup/events/260257700/ Schedule: 6:00 - Doors & Food 6:30 - Talk 1 7:15 - Talk 2 7:45 - Wrap & Chat Talk 1: Data Validation and Alerting. How does Airflow fit in? Abstract: After your ETL runs, a new kind of fun starts. -Is my output data 'right' compared to my 'source of truth'? -Wait a second, how do I even know if my input data was ok? -How do I get alerted if a metric violates some threshold/tolerance or if some dimensional data is messed up? -What if I want alerts to be triggered based on dynamic thresholds? -How hard is it to maintain my checks and alerts? Like everyone else, the New York Times' Data Engineers, Data Analysts and Data Scientists have been wrestling with the above questions. This presentation will cover what the Times has tried and the approach that's been settled on (for now). And yes, Airflow plays an important part. Presenters: Brian Lavery, Data Engineer, New York Times Mariam Melikadze, Manager-Advertising Analytics, New York Times Talk 2: Abstract: Apache Airflow is a Python-based task orchestrator that has seen widespread adoption among startups and enterprises alike to author, schedule, and monitor data workflows. By deploying the Airflow stack via Helm on Kubernetes, fresh environments can be easily spun up or down, scaling to near 0 when no jobs are running. As companies scale up their Airflow usage, they need more control and observability over their stack as it becomes more ingrained in their culture and more important to the business. This talk will go through the technical challenges of supporting thousands of Airflow deployments, how to monitor them, how to reliably push DAG updates, and how to build all the supporting infrastructure of a rock-solid Airflow system in a cloud native environment using open source software. 
Presenter: Viraj Parekh, Data Engineer, Astronomer Instructions to follow upon arrival: Enter the lobby on the north side of the building. A representative will be waiting next to one of the north end elevator turnstiles with a sign that says 'Airflow Meet-Up'. They will assist you in getting through security and send you up to the 15th floor, where another representative will be waiting to direct you to the room.
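The "dynamic thresholds" question raised in the Talk 1 abstract can be illustrated with a small sketch. This is not the approach used at the Times (the listing does not describe it); it is a generic illustration that assumes a threshold derived from the mean and standard deviation of recent history. The function name `check_metric` and the parameter `k` are hypothetical.

```python
from statistics import mean, stdev


def check_metric(history, value, k=3.0):
    """Compare `value` against a band computed from recent history.

    Assumption: the band is mean +/- k sample standard deviations of
    `history`. This is one common choice, not the only one.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical points")
    mu = mean(history)
    sigma = stdev(history)
    lower, upper = mu - k * sigma, mu + k * sigma
    return {"ok": lower <= value <= upper, "lower": lower, "upper": upper}


# Example: daily row counts from the last five runs (made-up numbers).
recent_counts = [100, 102, 98, 101, 99]
normal_run = check_metric(recent_counts, 101)
suspicious_run = check_metric(recent_counts, 120)
```

A check like this could run as its own task after an ETL step, with the alert raised only when `ok` is false, so the threshold follows the data instead of being hard-coded.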

  • Data Council San Francisco 2019

    San Francisco


    Data Council (https://www.datacouncil.ai) is coming to San Francisco. Will you join us? The main event was born out of a meetup group similar to this one, and we're excited to have become a cornerstone of the growing data community on meetup. What you will get out of Data Council SF 2019 (https://www.datacouncil.ai/san-francisco-2019): - 2 days & 50+ insightful talks by leading data scientists and engineers from top companies like Facebook, Salesforce, IBM, Netflix, Google, WeWork, Lyft, Stitch Fix, Datadog, Segment, Datacoral, Stanford University and many more. - 6 unique tracks: Data Platforms & Pipelines, Databases & Tools, Data Analytics, Machine & Deep Learning, and our all-new tracks: Hero Engineering and AI Products. - All-new content including our brand new Founders Panel of top founders in the data space. - Extensive networking opportunities at the conference, or connect with speakers & attendees at our Wednesday night after-party between conference days. - Small group Speaker Office Hours following each talk with an opportunity to dive deeper into the subject matter 1:1 with the speaker. - Attendees who are highly technical data scientists, engineers, analysts & technical founders from top tech, media, and finance companies around the SF area. - Connect with our great partner companies at Sponsor Spotlight to discover their available data jobs and latest product developments. This year Data Council San Francisco ‘19 takes place on April 17th & 18th. As you are members of this meetup group and our community, I wanted to extend to you a sweet deal: tickets for $100 less than our lowest early bird pricing. To redeem your $100 discount, go to https://www.datacouncil.ai/san-francisco-2019 and use coupon code: 100offeb 
Why should you join this year? If you believe in Quality Content > $ and would like to learn from companies like Facebook, Apache Foundation, Google, Netflix, Salesforce, Spotify, WeWork, Beeswax, Stitch Data, Capital One, Airbnb, Datadog, Lyft, Segment, Starburst, Datacoral, Columbia University, Uber, TapRecruit, Figure Eight, Dia&Co and many more, along with many awesome speakers, you should join! Cheers, -Pete

  • Fraud Detection with Apache Kafka and KSQL

    The Orchard

    Schedule: 6:00 - Doors & Food 6:30 - Talk 1 7:15 - Wrap & Chat Talk 1: ATM Fraud Detection with Apache Kafka and KSQL Speaker: Robin Moffatt, Developer Advocate @ Confluent Abstract: Detecting fraudulent activity in real time can save a business significant amounts of money, but has traditionally been an area requiring a lot of complex programming and frameworks, particularly at scale. Using KSQL, it's possible to use just SQL to build scalable real-time applications. In this talk, we'll look at what KSQL is, and how its ability to join streams of events can be used to detect possibly fraudulent activity based on a stream of ATM transactions. We'll also see how easy it is to integrate Kafka with other systems—both upstream and downstream—using Kafka Connect to stream from a database into Kafka, and from Kafka into Elasticsearch. Bio: Robin is a Developer Advocate at Confluent, the company founded by the original creators of Apache Kafka®, as well as an Oracle Groundbreaker Ambassador and ACE Director (alumnus). His career has always involved data, from the old worlds of COBOL and DB2, through the worlds of Oracle and Hadoop, and into the current world with Kafka. His particular interests are analytics, systems architecture, performance testing and optimization. He blogs at http://cnfl.io/rmoff and http://rmoff.net/ (and previously http://ritt.md/rmoff) and can be found tweeting grumpy geek thoughts as @rmoff. Outside of work he enjoys drinking good beer and eating fried breakfasts, although generally not at the same time.
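The stream-join idea in the abstract can be sketched outside of KSQL. The example below is plain Python, not KSQL syntax, and it is not the speaker's implementation: it assumes a simple rule (two transactions on the same account at different ATMs within a short time window are worth flagging), and the `Txn` fields and the `suspicious_pairs` function are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Txn:
    txn_id: str
    account_id: str
    atm: str
    ts: datetime


def suspicious_pairs(txns, window=timedelta(minutes=10)):
    """Self-join transactions per account within a time window.

    Assumed rule: a pair is suspicious when both transactions share an
    account, happen at different ATMs, and are at most `window` apart.
    """
    by_account = {}
    for txn in sorted(txns, key=lambda t: t.ts):
        by_account.setdefault(txn.account_id, []).append(txn)

    pairs = []
    for account_txns in by_account.values():
        for i, first in enumerate(account_txns):
            for second in account_txns[i + 1:]:
                if second.ts - first.ts > window:
                    break  # sorted by time, so later ones are further away
                if second.atm != first.atm:
                    pairs.append((first, second))
    return pairs


# Made-up sample data.
t1 = Txn("t1", "acct-1", "ATM-A", datetime(2019, 1, 1, 12, 0))
t2 = Txn("t2", "acct-1", "ATM-B", datetime(2019, 1, 1, 12, 5))
t3 = Txn("t3", "acct-1", "ATM-A", datetime(2019, 1, 1, 13, 0))
t4 = Txn("t4", "acct-2", "ATM-C", datetime(2019, 1, 1, 12, 1))
flagged = suspicious_pairs([t3, t1, t4, t2])
```

In a streaming engine the same logic would be expressed as a windowed join of the transaction stream with itself, so that matches are emitted as events arrive instead of over a finished list.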

  • Beyond Data Engineering: Careers in AI + DevOps

    *This is a joint event with Insight Tech Talks Meetup Group: https://www.meetup.com/insight-tech-talks/events/259516414/ Schedule: 6:00 - Doors & Food 6:30 - Talk 1 7:15 - Talk 2 7:45 - Wrap & Chat Talk 1: Lessons from Building and Deploying AI Systems Speaker: Chuck Yee, ML Research Engineer @ Bloomberg Abstract: AI has experienced unprecedented hype over the past seven years, but what is the reality of applied AI within a business ecosystem? In this talk, I’ll be sharing stories of success and failure building and deploying AI systems, and personal observations from my time in the industry. Bio: Chuck-Hou is an Insight Fellow who recently joined Bloomberg’s NLP team after gaining experience building deep learning models for x-ray interpretation at Imagen Technologies. Talk 2: CICD From the Ground Up Speaker: Max McKittrick, Data Engineer, ECI @ Capital One Abstract: At Capital One, the ECI team manages a clickstream application and pipeline durability is a critical consideration. In this talk, I want to discuss our lessons learned and how we have improved our CICD practices over the past few months. Bio: Max McKittrick is a data engineer at Capital One, where he started after completing Insight Data Engineering New York. He received his MS from the University of Illinois and worked as a researcher and data consultant prior to attending Insight.

  • Containerizing Data Workflow and Testing Data Pipelines

    Schedule: 6:00 - Doors & Food 6:30 - Talk 1 7:15 - Talk 2 7:45 - Wrap & Chat Talk 1: Pros and cons of containerizing data workflows (and how to have the best of both worlds) Speaker: Tian Xie, Data Engineer @ Enigma Abstract: At Enigma, we run over one hundred workflows to ingest public data into our system. Running so many workflows also means managing dependencies and deployment for each of those workflows. Over time, we have iterated over several solutions to this problem and this is our story. Spoiler: Docker is involved, but (*plot twist*) it only leads to another set of problems in the 2nd act. Bio: Tian Xie has been working in the NYC tech start-up scene for the last eight years on consumer video rendering, on-demand shipping, and now data engineering at Enigma Technologies. Talk 2: Building a Data Pipeline with Testing in Mind Speaker: Jiaqi Liu, Software Engineer @ Button, Inc Abstract: It’s one thing to build a robust data pipeline process in Python but a whole other challenge to find tooling and build out the framework that allows for testing a data process. In order to truly iterate on and develop a codebase, one has to be able to confidently test during the development process and monitor the production system. In this talk, I hope to address the key components for building out end-to-end testing for data pipelines by borrowing concepts from how we test Python web services. Just as we want to check for healthy status codes from our API responses, we want to be able to check that a pipeline is working as expected given the correct inputs. We’ll talk about key features that allow a data pipeline to be easily testable and how to identify timeseries metrics that can be used to monitor the health of a data pipeline.
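The idea in Talk 2 of checking a pipeline the way one checks an API response can be sketched as follows. This is a minimal illustration, not the speaker's framework: it assumes a pipeline step written as a pure function, plus a wrapper that reports simple counts that could be emitted as timeseries metrics. The names `clean_events` and `run_with_metrics` and the record fields are hypothetical.

```python
def clean_events(rows):
    """Drop rows without a user_id, normalize event names, and dedupe."""
    seen = set()
    cleaned = []
    for row in rows:
        if row.get("user_id") is None:
            continue
        event = row["event"].lower()
        key = (row["user_id"], event)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"user_id": row["user_id"], "event": event})
    return cleaned


def run_with_metrics(rows):
    """Run the step and return its output plus health metrics."""
    cleaned = clean_events(rows)
    metrics = {
        "rows_in": len(rows),
        "rows_out": len(cleaned),
        "rows_dropped": len(rows) - len(cleaned),
    }
    return cleaned, metrics


# Known input with a duplicate and an invalid row (made-up data).
sample = [
    {"user_id": 1, "event": "Click"},
    {"user_id": 1, "event": "click"},
    {"user_id": None, "event": "view"},
    {"user_id": 2, "event": "View"},
]
output, metrics = run_with_metrics(sample)
```

Because the step is a pure function of its input, the same fixture can be asserted on in a unit test during development, while the returned counts can be tracked over time in production to spot unusual drop rates.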

  • Monitoring in Google Cloud Dataflow || AWS Lambda Best Practices

    Google NYC (8th Ave Entrance)

    Schedule: 6:00 - Doors & Food 6:30 - Talk 1 7:15 - Talk 2 7:45 - Wrap & Chat **Talk 1: Monitoring and Measuring Performance in Google Cloud's Dataflow** Speaker: Leigh Pember, Google Cloud Customer Engineer @ Google Abstract: This intermediate-level session will dive into some of the different ways to measure, monitor and debug performance when running Dataflow jobs in Google Cloud. The session will primarily be centered around a demo; some of the insights will be applicable to open source Apache Beam, whereas others will be more specific to GCP Dataflow. I put together this presentation based on lessons learned from debugging Dataflow with a GCP gaming/media customer, but the demo and examples should be generally applicable to anyone running these jobs in the cloud. Some high level topics will include important metrics and their meaning, logging, pipeline code optimization and infrastructure resource utilization. **Talk 2: Best Practices and Container reuse in AWS Lambda** *Speaker:* Angela Razzell, Senior Software Engineer @ Capital One *Abstract:* AWS Lambda’s Functions-as-a-Service (FaaS) saves developer time by removing the need to manage infrastructure and scaling. During this talk, Angela will demonstrate the powers and pitfalls of setting up your lambdas the right way. By following these best practices in your code, you can improve the running time and efficiency of lambda functions. Connecting to other services such as databases is an expensive operation, and handling these tasks correctly can cut the execution time to a fraction of the cold start value, which is especially valuable in a real-time context. *Bio:* Angela Razzell is a Senior Data Engineer at Capital One, currently working on a messaging application in Retail Direct Bank.
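The container-reuse point in Talk 2 can be sketched in Python. This is a general illustration of a widely used pattern, not the speaker's code: an expensive client is created once at module scope and reused by later invocations that land on the same warm container. `FakeConnection` is a hypothetical stand-in for a real database client so the sketch runs without AWS; `handler(event, context)` follows the usual Python Lambda handler signature.

```python
_connection = None  # module scope survives between warm invocations


class FakeConnection:
    """Hypothetical stand-in for an expensive-to-create database client."""

    created = 0  # counts how many times a connection was opened

    def __init__(self):
        FakeConnection.created += 1

    def query(self, key):
        return {"key": key, "value": len(key)}


def get_connection():
    """Create the connection on first use, then reuse it."""
    global _connection
    if _connection is None:
        _connection = FakeConnection()
    return _connection


def handler(event, context):
    # Only the first (cold) invocation pays the connection cost.
    conn = get_connection()
    return conn.query(event["key"])


# Simulate two invocations on the same warm container.
first = handler({"key": "abc"}, None)
second = handler({"key": "hello"}, None)
```

Creating the client inside `handler` instead would open a new connection on every invocation, which is the kind of avoidable per-request cost the abstract refers to.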

  • DataEngConf NYC | Nov 8th - 9th, 2018 | Discounted Tickets Available

    We couldn't be more excited to announce our next DataEngConf NYC, this year with four unique tracks! DataEngConf NYC ‘18 (https://bit.ly/NYC-Event) is the premier deeply technical event that bridges the gap between data scientists, data engineers, data analysts and technical founders - in NYC from Nov 8th-9th, 2018. Our upcoming conference will feature: - 2 days of insightful talks by 40+ leading data scientists and engineers from top teams at Salesforce, Facebook, Netflix, WeWork, TapRecruit, Beeswax, Datadog, Starburst, Figure Eight, Dia&Co, VividCortex, BuzzFeed, Stitch Data, Columbia University, Segment, Datacoral and many more. - All-new content including an AI Products and Hero Engineering track, our Keynote Panel plus our popular Investor Panel of top VCs in the data space. - Extensive networking opportunities at the conference, or connect with speakers & attendees at our after-party between conference days - Our unique Small group Office Hours with speakers, allowing you to dive deeper into the subject matter 1:1 - Connect with partner companies at Sponsor Spotlight to discover their available data jobs and latest product developments -------------------------- DataEngConf was born out of this meetup group, and we're excited to have become a cornerstone of the growing data community. As a member of our group you can get a special discount of $160 off tickets, but only when you purchase using code: “NYCData18”. Grab your tickets right here: https://bit.ly/NYC-Tickets18 (Note: special recruiting opportunities and group tickets are available for your data team. Contact [masked] for more info.) Hoping to see you there! Cheers, -Keira and Pete

  • Abstracted Interfaces for Domain-driven Dev & Geospatial Data Analytics in Spark

    Schedule: 6:00 - Doors & Food 6:30 - Talk 1 7:15 - Talk 2 8:00 - Wrap & Chat **Talk 1: Abstracted Interfaces for Domain-driven Development** Speaker: Soren Larson, Data Scientist & AI Engineer Abstract: Abstraction is ever increasing. Services we used to have to build and maintain in low level languages are now abstracted into the cloud, with sturdy SLAs and frameworks that accommodate most of our use cases. Stakeholders are changing. With enterprise interfaces becoming more consumer oriented in their outlook and design, stakeholders and even users of these interfaces can reasonably be without some of the skills previously needed to shape and manage large pieces of data. With fewer resources needed to spin up a reliable and powerful big data system, we can spend more time on the nuances of data manipulation and offer nontechnical stakeholders interfaces that are more expressive than before, with performance secured by the guardrails of our increasingly sturdy infrastructure. I'll talk about one type of interface I've found success with, and what enterprise b2b design might say about the future of data engineering. **Talk 2: Using Spark for real time telemetry and geospatial data analytics at scale** Speaker: Dillon Bostwick, Solutions Engineer @ Databricks Abstract: We will use public NYC neighborhood data sitting on Azure Blob Storage and telemetry streams from Azure Event Hubs to analyze routes through NYC. This will open up discussion on how the Magellan geospatial analytics library uses Spark’s Catalyst optimizer to conduct spatial joins, as well as how we can use Databricks Delta to improve performance as we build an optimized real-time pipeline at scale. Finally, we will discuss how we can leverage Azure Databricks to move the application from development to production. Technologies discussed: Azure Blob Storage, Azure Event Hubs, Azure Databricks, Magellan, Spark Catalyst optimizer, Spark Structured Streaming, Databricks Delta