- Lightning Talks
Location info: Phone2Action, Inc. 1500 Wilson Blvd Suite 700, Arlington, VA 22209 ======================================================================= This month we are having lightning talks! Several 7-minute talks back-to-back. Speakers, abstracts, and bios are below. ============================================================== Speaker: Snehal Shinde Title: Time Series Analysis with Facebook Prophet Abstract: Curious about how time series analysis and predictive forecasting can be used for predicting new sales? This lightning talk will be about how you can leverage Facebook Prophet to analyze business data in Python. Bio: Snehal Shinde is a Database Analyst at Phone2Action Inc. She holds a Master's in Information Technology and Analytics from Rutgers and a Bachelor's in Computer Engineering from the University of Mumbai. Previously, Snehal was a Software Engineer at Tata Consultancy Services. ======================================================================= Speaker: Travis Hoppe Title: Stupid TensorFlow tricks: Part II. Abstract: Think TensorFlow is only about deep learning? Learn how to tackle a classical physics problem with automatic differentiation. First solved by Newton, Bernoulli, and Leibniz, you too can find the brachistochrone curve with minimal effort! Bio: Travis Hoppe is a local data scientist, recovering physicist, co-organizer of DC Hack & Tell, and creator of many irrelevant things. ======================================================================= Speaker: Nina Lopatina Title: Cross-lingual word embeddings 101 Abstract: Cross-lingual word embeddings augment machine translation and other cross-lingual tasks. This lightning talk introduces cross-lingual word embeddings along with a few quick & simple methods to evaluate embeddings and training and validation data quality. Bio: Nina Lopatina is a research data scientist at IQT Labs, currently working on machine translation interpretability & quality estimation. 
Previously, Nina researched machine learning privacy attacks on speaker identification models and neural processes and computations underlying decision-making. ======================================================================= Speaker: Miller Wilt Title: Classification in the real world Abstract: Traditional classification algorithms operate under the assumption of a static environment, i.e., one where the test set is drawn from the same distribution as the training set. However, in the real world these algorithms start to break when exposed to data from different distributions. We'll introduce open set recognition to tackle this problem by simultaneously classifying samples from known classes while rejecting samples from unknown classes. Bio: Miller Wilt is a machine learning engineer at the Johns Hopkins University Applied Physics Lab. He uses deep learning to study wireless communication and reverse engineering. In his free time, Miller is either practicing archery, playing DnD, working out, or reading. ======================================================================= Speaker: Alex Gold Title: Let's get some love for linear models Abstract: Linear regression models are decidedly out of vogue in data science circles but are workhorses in many disciplines. Is it time to show some affection for these 19th century statistical models? Bio: Alex is a solutions engineer at RStudio and lives in Silver Spring. He once dropped out of an economics PhD program, did policy research and political data work, and ran a data science team at a federal contractor. ======================================================================= Speaker: Michael McKenzie Title: Hacking the Panama Papers Abstract: Creating connections between unconnected data. How did ICIJ uncover offshore tax shelters using Neo4j and graph databases? Bio: Michael is a problem-solver and graphista at heart and is a developer at CALIBRE in Alexandria. He is the organizer of GraphDB DC.
- Data Science Best Practices
This meetup is in partnership with Statistical Programming DC. You can register here or on their page: https://www.meetup.com/stats-prog-dc/events/263461832/ SCHEDULE 6:30 - 7:00 pm: Mingling, food and beverages 7:00 - 8:30 pm: Presentation 8:30 - ???: Data Drinks at Tonic (2036 G St NW, Washington, DC) ABOUT THE SPEAKER Dr. Simina Boca analyzes "omics" data, including metabolomics and genomics, and considers their downstream application in precision medicine. In particular, she developed novel computational and statistical methods for high-dimensional data analysis, led the first comprehensive metabolomic study for Duchenne muscular dystrophy, and contributed to several of the early exome sequencing projects of human tumors. Dr. Boca is an Assistant Professor at the Innovation Center for Biomedical Informatics (ICBI) and the Departments of Oncology and Biostatistics, Bioinformatics & Biomathematics at the Georgetown University Medical Center (GUMC), as well as a member of the Cancer Prevention and Control Program at the Lombardi Comprehensive Cancer Center. Dr. Boca was a postdoctoral fellow in the Biostatistics Branch within the Division of Cancer Epidemiology and Genetics at the National Cancer Institute and holds a Ph.D. in Biostatistics and an M.H.S. in Bioinformatics from the Johns Hopkins Bloomberg School of Public Health and a B.S. in Mathematics from the University of Illinois at Urbana-Champaign.
- Machine Learning Visualization with Yellowbrick
This event is in partnership with Statistical Seminars DC. You can RSVP there or here. https://www.meetup.com/Statistical-Seminars-DC/events/261446049/ Yellowbrick is a suite of visual diagnostic tools called “Visualizers” that extend the Scikit-Learn API to allow human steering of the model selection process. In a nutshell, Yellowbrick combines scikit-learn with matplotlib in the best tradition of the scikit-learn documentation, but to produce visualizations for your models! In this presentation, Dr. Larry Gray and Ms. Prema Roman will give a demonstration and lecture about its capabilities. The presentation will be an introductory tutorial on Yellowbrick. It will begin with an introduction to what Yellowbrick is, and how and where it fits within a data scientist’s pipeline. We will then show some examples in a Jupyter notebook. Bios: Dr. Larry Gray: Dr. Larry Gray is a postdoctoral fellow in the field of computational biology at the National Center for Biotechnology Information within the National Institutes of Health. He has spent the past 16 years as a biomedical researcher trying to better understand human disease. In his spare time he serves as an advisor and core contributor for Yellowbrick and enjoys volunteering at Python-related conferences. Dr. Gray earned his doctorate from Johns Hopkins University, School of Medicine in Cellular and Molecular Physiology. Additionally, he will begin lecturing this fall at the School of Continuing Studies at Georgetown University for the Data Science Certificate Program. Prema Roman: Prema Roman is a Senior Data Engineer at Excella. She has several years of experience in software engineering and data analysis and has worked in the financial services, internet, and consulting industries. Outside of work, she contributes to Yellowbrick and attends local meetups to learn and contribute to the tech community. Ms. 
Roman holds BS degrees in Management Information Systems and Marketing and an MS degree in Software Engineering from George Mason University. Difficulty: Beginner/Undergrad You should bring: Pen/paper, laptop Software: Python 3, Jupyter notebook Location: GWU - Corcoran Hall, room COR 101 [masked]st St NW · Washington, DC
- BIG OPPORTUNITY : AMA (Ask Me Anything)
Come and join us for a night with four experienced data scientists: Dr. John Kaufhold, Martin Skarzynski, Dr. Dunstan Matekenya, and Tommy Jones. The panel will be moderated by Janet Dobbins, President of Data Community DC and Vice President of Business Development and Strategic Partnerships at Statistics.com. Have you ever browsed the Internet and given up without having your questions answered? Do the enlightening and engaging Data Science DC Meetups lead you to more questions? This meetup is for you. This special event aims to provide participants with the opportunity to ask any data science questions. Speaker Bios John Kaufhold is a data scientist and managing partner of Deep Learning Analytics, a data science company named one of the four fastest growing companies by revenue in Arlington, Virginia in 2015, and again in 2016. Dr. Kaufhold also serves as Secretary of the Washington Academy of Sciences and moderates the DC2 Deep Learning Discussion list. Prior to founding Deep Learning Analytics, Dr. Kaufhold investigated deep learning algorithms as a staff scientist at NIH. Before that, over 7 years at SAIC, Dr. Kaufhold served as principal investigator or technical lead on a number of large government contracts. Prior to joining SAIC, Dr. Kaufhold investigated machine learning algorithms for medical image analysis and image and video processing at GE's Global Research Center. Dr. Kaufhold is named inventor on >10 issued patents in image analysis, and author/coauthor on >40 publications in the fields of machine learning, image understanding and neuroscience. Martin Skarzynski, a Cancer Prevention Fellow since 2017, is passionate about Bioinformatics, Data Science, Epidemiology, and Statistical Computing. Martin uses the Python and R programming languages and command line tools to explore, analyze, visualize and present data and has a strong interest in reproducibility, scientific publishing workflows, and open data/science best practices. 
Martin is excited to apply his computational skills in combination with his Genomics and Immunology background to the study and prevention of cancer. Martin is co-chair of the Bioinformatics and Data Science Department at the Foundation for the Advancement of Education in the Sciences (FAES), where he has been an instructor since 2015. Martin is also an instructor for Software and Data Carpentry, non-profit organizations that teach computational skills. Dr. Dunstan Matekenya is a consummate Data Scientist with over 10 years’ experience in both traditional statistics and modern machine learning methods. Currently, he works as a Data Scientist at the World Bank Group HQ in Washington DC. Prior to joining the WBG, Dunstan completed his PhD at the University of Tokyo in 2016. His PhD research focused on the use of machine learning methods to explore insights from mobile phone data. Before re-orienting his career into Data Science, Dunstan worked as a Statistician at the National Statistical Office in Malawi from 2007 until 2017. While there he actively contributed to flagship projects such as the 2008 Malawi Population and Housing Census and led the GIS unit. His passions include contributing to the modernization of official statistics in developing countries through alternative data sources such as mobile phone data, as well as improving capacity in Data Science. Tommy Jones is a member of the technical staff at In-Q-Tel and a coordinator for Data Science DC. He holds an MS in mathematics and statistics from Georgetown University and a BA in economics from the College of William and Mary. Tommy is a Ph.D. student in the George Mason University Department of Computational and Data Sciences. He specializes in statistical models of language and time series modeling and is the author of the textmineR package for the R language. Tommy is also a Marine Corps veteran. Agenda 6:30 - 7:00PM Gather, network, eat 7:00 - 7:10PM Intro 7:10 - 8:15PM AMA 8:30 PM Drinks @ Tonic
- Using Graph Algorithms for Improving Machine Learning Predictions
Dear Community! We are pleased to invite you to the Global Graph Celebration Day. Come and join us! Special Note: Please RSVP here if you want a T-shirt. Your RSVP Link: https://neo4j.typeform.com/to/USb6It?event=dc 100 units available = RSVP ASAP! And please don't forget to RSVP through the meetup page. Agenda: 6:30pm – 7:00pm Networking and Refreshments Food will be kindly sponsored by Neo4j 7:00pm – 7:10pm Introduction, Announcements 7:10pm – 7:20pm Tommy Jones 7:20pm – 7:50pm Amy E. Hodler 7:50pm – 8:00pm Q&A 8:00pm – 8:30pm Data Drinks @Tonic (2036 G St NW) Description: Talk 1: What are graphs? Why do people use them? What does graph data look like? What are some common measures applied to graphs? Tommy will be giving a very brief introduction to graphs and graph theory. Tommy Jones is a member of the technical staff at In-Q-Tel and a coordinator for Data Science DC. He holds an MS in mathematics and statistics from Georgetown University and a BA in economics from the College of William and Mary. He is a PhD student in the George Mason University Department of Computational and Data Sciences. He specializes in statistical models of language and time series modeling and is the author of the textmineR package for the R language. Tommy is also a Marine Corps veteran. Talk 2: Using Graph Algorithms to Improve Machine Learning Predictions Relationships are one of the most predictive indicators of behavior and preferences. One of the most practical ways to improve our machine learning predictions right away is by using graphs for connected features. In this session, we will start with an overview of which algorithms to apply for various features related to influence in a network, similarities, link prediction, and community detection. You’ll learn how graph algorithms can provide more predictive features as well as aid in feature selection to reduce over-fitting. 
We’ll also look at ways to improve machine learning efficiency, such as graph filtering to avoid running ML across an entire dataset or having to manually pare data down. Amy E. Hodler is a network science devotee and AI and Graph Analytics Program Manager at Neo4j. She promotes the use of graph analytics to reveal structures within real-world networks and predict dynamic behavior. Amy helps teams apply novel approaches to generate new opportunities at companies such as EDS, Microsoft, Hewlett-Packard (HP), Hitachi IoT, and Cray Inc. Amy has a love for science and art with a fascination for complexity studies and graph theory. She tweets @amyhodler.
- Lightning Talks: new projects & great ideas
We welcome your participation in the upcoming lightning talk meetup. Come and join us for a night full of new projects and new ideas: Aaron Schumacher - http://www.linkedin.com/in/ajschumacher - Action and Analysis in the 2018 NY Senate District 5 Election Stephanie Eckman - http://stepheckman.com/ - Improving Quality of Training Data Jeff Hale - https://medium.com/@jeffhale - Different data types besides categorical and numerical types Anastassia Kornilova - http://www.akornilo.com/about/ - Forecasting how legislators will vote in the US Congress Brendan Freehart - Using traffic camera images to build a model that determines whether bike lanes are blocked Agenda: 6:30 - 7:00PM Mingle with speakers over great food (empanadas) 7:00 - 7:10PM Announcements/Intros/Sponsors 7:10 - 8:00PM Data Science Projects 8:00 - 8:20PM Q&A 8:30 - 9:30PM Data Drinks @Tonic
- Using MLflow to Manage the Machine Learning Life Cycle
Sponsorship ----------------------------- This month's event is sponsored by Think Gov 2019. They are offering a *free* developer day (Code@Think Gov) on March 13. More details will be offered in an upcoming email and at this month's meetup. Code@Think Gov Registration Link: http://www.cvent.com/d/0bqf1d/4W?RefID=GEMeet Code@Think Gov Event Page: https://www.ibm.com/industries/federal/thinkgov-code Think Gov Event Page: https://www.ibm.com/industries/federal/thinkgov Abstract ----------------------------- As data science teams adopt tools for experimentation and deployment and scale out, MLflow can serve as a powerful tool to integrate the development of AI models with the overall platform surrounding them. MLflow is an open-source platform to manage the machine learning lifecycle, and in this talk, we will show how this tool can be leveraged in Databricks to track experiments from multiple runs, reproduce results, perform remote runs, and deploy models for real-time testing. About the Speaker ----------------------------- Ricardo Portilla works at Databricks as a Solutions Architect. He completed his PhD in Mathematics at the University of Michigan, and after that led Spark migrations, engineered various solutions in Spark on large-scale financial data, and more recently focused on data science at scale using time series analysis and unsupervised learning methods. He is passionate about enabling data science on the Databricks platform and showing MLflow in action for model lifecycle management. http://linkedin.com/in/ricardo-portilla-a51b6a19 Agenda ---------------------------- 6:30pm – 7:00pm Networking and Refreshments 7:00pm – 7:10pm Introduction, Announcements 7:10pm – 7:40pm Presentation 7:40pm – 7:55pm Q&A 8:00pm – 8:30pm Data Drinks @Tonic (2036 G St NW)
- Which skills are employers looking for in Data Scientists?
Happy New Year, data scientists! Let’s start the year with this amazing topic: “Skills that Employers are Looking for in Data Scientists”. Abstract: Getting hired (and hiring) in data science is fun, fun, fun! Whether you are hiring, looking for a job, or thinking about transitioning to the field, come and hear from a panel of folks with experience in different parts of the process. Speakers: Kristin Abkemeier is a data scientist at Improvix Technologies supporting the U.S. Department of State. She recently made a mid-career transition to data science after working as a software developer in information technology, serving as a subject matter expert on batteries for electric vehicles for the U.S. Department of Energy, and earning a doctorate in physics. Aron Ahmadia is a Senior Data Scientist at Capital One. Susan Fallon Brown is Vice President of Global Strategy and Business Development at Monster Government Solutions. Jeff Hale is an entrepreneur and data scientist who writes about data science at https://medium.com/@jeffhale. Agenda: 6:30 - 7:00pm mingle 7:00 - 8:00pm discussion 8:00 - 8:20pm Q&A
- Time Series Data Platforms
ABSTRACT Time series, a data type long neglected by Silicon Valley, is finally seeing its time in the sun with investors opening their wallets. TimeScale, a startup adding time series capabilities to Postgres, closed its $16M series A in January; InfluxData closed its $35M series C in February to continue developing its time series platform; and PingThings, Inc. is currently raising its series A. While much of the recent activity is being driven by server monitoring metrics and the rising Internet of Things, time series data comes from a variety of sources, from time-stamped events arriving asynchronously to sensors continuously measuring physical processes. Further, time series data permeates numerous disciplines including economics and econometrics, finance, DevOps, medicine, and most of the sciences and engineering. In this presentation, Sean and Michael will examine the time series ecosystem with a focus on the various data stores and platforms that are purpose-built for this data at scale and the various categories of analysis techniques that can be performed on this data. The presentation will then examine a particular open source sensor analytics platform in detail, discussing some of the data structures and architectural decisions that enable performant time series analytics at scale. BIOGRAPHIES Michael Andersen is an EECS PhD student at the University of California, Berkeley working on technology for a secure internet of things. This includes high performance time series databases for next generation high-density telemetry, energy efficiency through Software Defined Buildings, and resiliency through instrumentation and analysis of smart grids. BTrDB originated in his PhD research into scalable analytics on grid data. Sean Patrick Murphy (https://www.linkedin.com/in/seanpatrickmurphy1/) is the co-CEO of PingThings, Inc. 
(http://www.pingthings.io/), an AI-focused startup founded in 2014 bringing advanced data science and machine learning to the nation’s electric grid. After earning dual undergraduate degrees with honors in mathematics and electrical engineering from the University of Maryland College Park, Sean completed his graduate work in biomedical engineering at Johns Hopkins University, also with honors. He stayed on as a senior scientist at the Johns Hopkins University Applied Physics Laboratory for over a decade, where he focused on machine learning, high-performance and cloud-based computing, image analysis and anomaly detection. Switching from the sciences into an MBA program, he graduated with distinction from Oxford. Using his business acumen, he built an email analytics startup and a data sciences consulting firm. Sean has also served as the chief data scientist at a series A-funded health care analytics company and the director of research and instructor at Manhattan Prep, a boutique graduate educational company. He is the author of multiple books and several dozen papers in multiple academic fields. He co-founded and served as a long-time board member for Data Community DC and the Data Innovation DC Meetup. ---------------------------- Agenda: • 6:30pm -- Networking and Refreshments • 7:00pm -- Introduction, Announcements • 7:15pm -- Presentation and Discussion • 8:30pm -- Data Drinks (Tonic , 2036 G St NW) ----------------------------
- Building Data Pipelines for Astronomical Data
This month, we're turning the reins over to Dataiku for a great meetup. Details Dataiku is returning to DC and is excited to join with ACM to present two talks focused on bringing data science to the field of astronomy! Tentative Schedule: 6:30pm: Networking 6:45pm: Weighing the Benefits of Simulated NASA Data for Model Training by Patrick Masi-Phelps, Data Scientist at Dataiku 7:15pm: Building Data Pipelines for Astronomical Data by Ignacio Toledo, Data Analyst and Astronomer at ALMA Labs Abstracts: Weighing the Benefits of Simulated NASA Data for Model Training by Patrick Masi-Phelps, Data Scientist at Dataiku: In December 2017, researchers at Google and University of Texas, Austin announced the discovery of two exoplanets using deep learning techniques. In this talk, Patrick Masi-Phelps will discuss the Dataiku data science team's efforts to follow up on this research. We've incorporated simulated planetary transits and false positives in addition to the real, observed data used by Google and UT Austin. Patrick will talk about the pros and cons of using simulated data in the model training process, along with other challenges like accessing terabytes of data from NASA, chaining data pipelines, and tuning different network architectures. Building Data Pipelines for Astronomical Data by Ignacio Toledo, Data Analyst and Astronomer at ALMA Labs: ALMA is a radio astronomy observatory that collects over 4300 hours of high-quality data annually across its 66 antennas, amounting to more than 1 TB of scientific data daily. Due to limited resources, this data is often only inspected for quality assurance purposes and is then sent out immediately to be processed by astronomers. Meanwhile, at least 750 GB of monitoring and operational data are being stored daily – and no one is using it. This leaves a lot of room for error and ignores a lot of potentially fruitful data. 
To fill these gaps, we’ve begun a data science initiative at ALMA focused on creating pipelines for more efficient data collection and educating our engineers and astronomers on data science methodologies. This meetup aims to share our experiences building out a data science infrastructure within the field of astronomy, particularly through the use of data science platforms. Audience members will learn how to build more efficient data pipelines, and how data science can be used to generate productive results in fields like astronomy. Bio: Patrick Masi-Phelps is a Data Scientist at Dataiku, where he helps clients build and deploy predictive models. Before joining Dataiku, he studied math and economics at Wesleyan University and was most recently a fellow at NYC Data Science Academy. Patrick is always keeping up with the latest machine learning techniques in astronomical and public policy research. Ignacio Toledo is a Data Analyst and Astronomer on Duty at the Atacama Large Millimeter/Submillimeter Array (ALMA), currently the world's biggest ground-based observatory. His primary work has been the implementation of an optimal scheduler for ALMA's astronomical observations, and he has recently been involved in the efforts to build a modern data science team.