• Time Series Data Platforms
    ABSTRACT Time series, a data type long neglected by Silicon Valley, is finally seeing its time in the sun with investors opening their wallets. TimeScale, a startup adding time series capabilities to Postgres, closed its $16M series A in January; InfluxData closed its $35M series C in February to continue developing its time series platform; and PingThings, Inc. is currently raising its series A. While much of the recent activity is being driven by server monitoring metrics and the rising Internet of Things, time series data comes from a variety of sources, from time-stamped events arriving asynchronously to sensors continuously measuring physical processes. Further, time series data permeates numerous disciplines including economics and econometrics, finance, DevOps, medicine, and most of the sciences and engineering. In this presentation, Sean and Michael will examine the time series ecosystem with a focus on the various data stores and platforms that are purpose-built for this data at scale and the various categories of analysis techniques that can be performed on this data. The presentation will then go in depth into a particular open source sensor analytics platform, discussing some of the data structures and architectural decisions that enable performant time series analytics at scale. BIOGRAPHIES Michael Andersen is an EECS PhD student at the University of California, Berkeley working on technology for a secure internet of things. This includes high performance time series databases for next generation high-density telemetry, energy efficiency through Software Defined Buildings, and resiliency through instrumentation and analysis of smart grids. BTrDB originated in his PhD research into scalable analytics on grid data. Sean Patrick Murphy (https://www.linkedin.com/in/seanpatrickmurphy1/) is the co-CEO of PingThings, Inc. 
(http://www.pingthings.io/), an AI-focused startup founded in 2014 bringing advanced data science and machine learning to the nation’s electric grid. After earning dual undergraduate degrees with honors in mathematics and electrical engineering from the University of Maryland College Park, Sean completed his graduate work in biomedical engineering at Johns Hopkins University, also with honors. He stayed on as a senior scientist at the Johns Hopkins University Applied Physics Laboratory for over a decade, where he focused on machine learning, high-performance and cloud-based computing, image analysis and anomaly detection. Switching from the sciences into an MBA program, he graduated with distinction from Oxford. Using his business acumen, he built an email analytics startup and a data sciences consulting firm. Sean has also served as the chief data scientist at a series A-funded health care analytics company and the director of research and instructor at Manhattan Prep, a boutique graduate educational company. He is the author of multiple books and several dozen papers in multiple academic fields. He co-founded and served as a long-time board member for Data Community DC and the Data Innovation DC Meetup. ---------------------------- Agenda: • 6:30pm -- Networking and Refreshments • 7:00pm -- Introduction, Announcements • 7:15pm -- Presentation and Discussion • 8:30pm -- Data Drinks (Tonic, 2036 G St NW) ----------------------------

    GWU Elliott School of International Affairs, Room 113

    1957 E Street NW · Washington, DC

  • Building Data Pipelines for Astronomical Data
    This month, we're turning the reins over to Dataiku for a great meetup. Dataiku is returning to DC and is excited to join with ACM to present two talks focused on bringing data science to the field of astronomy! Tentative Schedule: 6:30pm: Networking 6:45pm: Weighing the Benefits of Simulated NASA Data for Model Training by Patrick Masi-Phelps, Data Scientist at Dataiku 7:15pm: Building Data Pipelines for Astronomical Data by Ignacio Toledo, Data Analyst and Astronomer at ALMA Labs Abstracts: Weighing the Benefits of Simulated NASA Data for Model Training by Patrick Masi-Phelps, Data Scientist at Dataiku: In December 2017, researchers at Google and University of Texas, Austin announced the discovery of two exoplanets using deep learning techniques. In this talk, Patrick Masi-Phelps will discuss the Dataiku data science team's efforts to follow up on this research. We've incorporated simulated planetary transits and false positives in addition to the real, observed data used by Google and UT Austin. Patrick will talk about the pros and cons of using simulated data in the model training process, along with other challenges like accessing terabytes of data from NASA, chaining data pipelines, and tuning different network architectures. Building Data Pipelines for Astronomical Data by Ignacio Toledo, Data Analyst and Astronomer at ALMA Labs: ALMA is a radio astronomy observatory that collects over 4300 hours of high-quality data annually across its 66 antennas, amounting to more than 1 TB of scientific data daily. Due to limited resources, this data is often only inspected for quality assurance purposes and is then sent out immediately to be processed by astronomers. Meanwhile, at least 750 GB of monitoring and operational data are being stored daily – and no one is using it. This leaves a lot of room for error and ignores a lot of potentially fruitful data. 
To fill these gaps, we’ve begun a data science initiative at ALMA focused on creating pipelines for more efficient data collection and educating our engineers and astronomers on data science methodologies. This meetup aims to share our experiences building out a data science infrastructure within the field of astronomy, particularly through the use of data science platforms. Audience members will learn how to build more efficient data pipelines, and how data science can be used to generate productive results in fields like astronomy. Bio: Patrick Masi-Phelps is a Data Scientist at Dataiku, where he helps clients build and deploy predictive models. Before joining Dataiku, he studied math and economics at Wesleyan University and was most recently a fellow at NYC Data Science Academy. Patrick is always keeping up with the latest machine learning techniques in astronomical and public policy research. Ignacio Toledo is a Data Analyst and Astronomer on Duty at the Atacama Large Millimeter/Submillimeter Array (ALMA), currently the world's biggest ground-based observatory. His primary work has been the implementation of an optimal scheduler for ALMA's astronomical observations, and he has recently been involved in the efforts to build a modern data science team.

    GWU Monroe Hall

    2115 G Street NW · Washington, DC

  • DC DataCon 2018
    DC DATACON: A focus on the applied tools, technologies, and methodologies that make data science work. There is a cost to attend. To register and see the agenda, visit: https://www.ncsi.com/event/dcdatacon/ Join this multidisciplinary conversation as we explore the technology, economics, ethics, and strategy of the world’s fastest growing discipline – Data Science! This year’s conference builds on the success of our inaugural 2017 event with plans to go even bigger. Data Community DC is proud to again partner with George Washington University and the greater DC data community to produce DC DATACON 2018. The conference will be held on Wednesday, November 7, 2018 as we take over the Marvin Center – the thriving center of GWU student life. We aim to bring together nearly 1,000 stakeholders in the data community of the national capital region representing all facets of the data science ecosystem – Commercial, Government, Academia, Research, and Non-Profit. Please come as we continue charting the path ahead for data science. Our robust agenda of highly relevant speakers and presenters keeps improving by the day and we invite you to visit this site often for updates. The theme for 2018 is Simply Data Science. We will be focusing on how, by leveraging design principles and making the right design decisions throughout the entire data science technology stack, we can create intuitive, accessible systems that anyone can use. As Einstein said, “Everything should be made as simple as possible, but not simpler.” You get to simple by being thoughtful, deliberate, disciplined and highly creative. And only when we can reach such a state have we mastered our craft. DC DATACON 2018 is about achieving simplicity in data science and creating a conversation to help us all think a little deeper, more completely and more creatively about what we do as data professionals. 
We’ll also be further exploring the social aspects of the conference, striving to make it more collaborative and interactive to serve any interest you have in attending – networking, recruiting, collaborating, shaping, marketing, educating, learning and so on… There are many reasons to attend DC DATACON, not the least of which is to help you be a better citizen in the data-driven society in which we now live. DC DATACON: Let your voice shape the future of data science.

    George Washington University - Marvin Center

    800 21st St NW · Washington, DC

  • How NASA Finds Critical Data Through a Knowledge Graph
    GWU Elliott School, Room B-12 1957 E St. NW Washington, DC This event is co-hosted by our friends at GraphDB DC https://www.meetup.com/GraphDB-DC/. ** Abstract Ask any project manager and they will tell you the importance of reviewing lessons learned prior to starting a new project. Lessons learned databases are filled with nuggets of valuable information to help project teams increase the likelihood of project success. Why then do most lessons learned databases go unused by project teams? In my experience, they are difficult to search through and require hours of time to review the result set. Recently I had a project engineer ask me if we could search our lessons learned using a list of 22 key terms the team was interested in. Our current keyword search engine would require him to enter each term individually, select the link, and save the document for review. Also, there was no way to search only the database; the query would search our entire corpus, close to 20 million URLs. This would not do. I asked our search team if they would run a special query against the lessons learned database only, using the terms provided. They returned a spreadsheet with a link to each document containing the terms. The engineer had his work cut out for him: over 1100 documents were on the list. I started thinking there had to be a better way. I had been experimenting with topic modeling, in particular to assist our users in connecting seemingly disparate documents through an easier visualization mechanism. Something better than a list of links on multiple pages. I gathered my toolbox: R/RStudio, for the topic modeling and exploring the data; Neo4j, for modeling and visualizing the topics; and Linkurious, a web front end for our users to search and visualize the graph database. ** About the Speaker David Meza currently serves as the Chief Knowledge Architect at NASA. 
During his tenure at NASA, he has worked in all aspects of the Information Technology field, developing and deploying several IT systems in use at JSC. His desire to improve IT processes and systems led him to earn Master’s certificates in Project Management and Six Sigma in addition to becoming a NASA certified Lean Six Sigma Master Black Belt. In his current role at JSC, he established the Operational Excellence program, promoting a viewpoint of organizational leadership that stresses the application of a variety of principles, systems, and tools toward the sustainable improvement of key performance metrics by focusing on the needs of the customer, empowering employees, and optimizing existing activities in the process. Mr. Meza is conducting research on Automatic Classification algorithms, domain specific search interfaces, topic modeling, data-driven visualization and a Bayesian network model for risk analysis. He holds a Master’s in Engineering Management from the University of Houston Clear Lake. ** Food and beverages kindly sponsored by Neo4j

    GWU Elliott School, Room B-12

    1957 E St. NW · Washington, DC

  • Interpretable machine learning and Machine Learning with Oracle Cloud
    For our August Data Science DC Meetup, we are excited to organize two different talks. The first talk is by Daniel Byler and Jason Lewris from Deloitte on interpretable machine learning models. The second talk is by Siyuan Yin from Oracle on machine learning with Oracle Cloud. ---------------------------- Agenda: • 6:30pm -- Networking, Empanadas, and Refreshments • 7:00pm -- Introduction, Announcements • 7:10pm -- First Presentation and Discussion [Interpretable machine learning] • 7:45pm -- Second Presentation and Discussion [Machine Learning with Oracle Cloud] • 8:30pm -- Data Drinks (Tonic, 2036 G St NW) ---------------------------- Talk 1 Abstract: How does my model believe the world works? This talk will give an overview of existing techniques and also demonstrate Deloitte’s open source and model-agnostic approach to making machine learning models interpretable to humans. The techniques we discuss will primarily be focused on structured data but will include leading model types like neural networks. By the end of the talk, participants should feel confident that they can gain some insight into how their models believe the world operates. ---------------------------- Talk 1 Speakers Bio: Daniel Byler is a data scientist with Deloitte where he manages a portfolio of quantitative projects across Deloitte’s research agenda including projects ranging from Data USA (https://datausa.io/) to regulatory reform (https://www2.deloitte.com/us/en/pages/public-sector/articles/advanced-analytics-federal-regulatory-reform.html). Prior to his current role, he supported clients in large federal agencies on data-focused projects. Jason Lewris is a data scientist at Deloitte with experience in text analytics and using Python for data science. He uses data science to develop solutions that help enable people to make more informed decisions. 
Jason recently developed RegXplorer (https://www2.deloitte.com/us/en/pages/public-sector/articles/advanced-analytics-federal-regulatory-reform.html), a data science tool which applies modern text analytics methods to regulations at the global, federal and state levels to identify duplicative, overlapping regulations and construct hierarchies of regulatory dependence. ---------------------------- Talk 2 Abstract: With computing resources becoming easily accessible, every day we see AI/ML being leveraged in healthcare, agriculture, manufacturing, and other fields to help us discover the future from the past. Here at Oracle Cloud Solution Hub, we tailor innovative Cloud Solutions and Adaptive Intelligent business Apps for our partners that fit their business strategies, utilizing our knowledge, experience, and the broad spectrum of Oracle Cloud services. Join us to learn how Oracle Cloud empowers data scientists to derive business-changing insights and solutions with our cloud platform! ---------------------------- Talk 2 Speakers Bio: Siyuan Yin is currently a Solution Engineer focusing on data science at Oracle Cloud Solution Hub in Reston, VA. Prior to Oracle, she received a bachelor’s degree in psychology from UW Seattle and a master’s degree in Information Science from Cornell University. Siyuan is passionate about customizing data science solutions using her knowledge of the “reasoning” process of both human and computer.

    GWU Elliott School of Int'l Affairs

    1957 E St NW - Room 213 · Washington, DC

  • Applied Text Analysis Book Release Party
    Interested in integrating natural language processing into your applications? Want to learn how to build language-aware data products? Like free O’Reilly books?? Come to celebrate the release of Applied Text Analysis with Python with authors Benjamin Bengfort, Rebecca Bilbro, and Tony Ojeda. ( http://shop.oreilly.com/product/0636920052555.do ) This new O’Reilly title presents a data scientist’s approach to building language-aware products with applied machine learning. The authors will be signing copies of their book and hors-d'oeuvres will be served. About Applied Text Analysis with Python: From news and speeches to informal chatter on social media, natural language is one of the richest and most underutilized sources of data. Not only does it come in a constant stream, always changing and adapting in context; it also contains information that is not conveyed by traditional data sources. The key to unlocking natural language is through the creative application of text analytics. This practical book presents a data scientist’s approach to building language-aware products with applied machine learning. Readers will learn robust, repeatable, and scalable techniques for text analysis with Python, including contextual and linguistic feature engineering, vectorization, classification, topic modeling, entity resolution, graph analysis, and visual steering. About the Authors: Benjamin Bengfort is a data scientist who lives inside the Beltway but ignores politics (the normal business of DC) favoring technology instead. He is currently working to finish his Ph.D. at the University of Maryland where he studies machine learning and distributed computing. His lab does have robots (though this field of study is not one he favors) and, much to his chagrin, they seem to constantly arm said robots with knives and tools; presumably to pursue culinary accolades. 
A professional programmer by trade, a Data Scientist by vocation, Benjamin’s writing pursues a diverse range of subjects from Natural Language Processing to Data Science with Python to analytics with Hadoop and Spark. Dr. Rebecca Bilbro is a data scientist, Python programmer, teacher, speaker, and author in Washington, DC. She specializes in visual diagnostics for machine learning, from feature analysis to model selection and hyperparameter tuning, and has conducted research on natural language processing, semantic network extraction, entity resolution, and high dimensional information. An active contributor to the open source software community, Rebecca enjoys collaborating with other developers on inclusive, high-impact projects like Yellowbrick, a pure Python package that aims to take predictive modeling out of the black box. Rebecca earned her doctorate from the University of Illinois, Urbana-Champaign, where her research centered on communication and visualization practices in engineering. Tony Ojeda is a data scientist, author, and entrepreneur with expertise in business process optimization and over a decade of experience creating and implementing innovative data products and solutions. He is the founder of District Data Labs, a data science consulting and corporate training firm, research lab, and open source collaborative where people from diverse backgrounds come together to work on interesting projects. He also co-founded Data Community DC. Tony has a Master’s in Finance from Florida International University and an MBA with concentrations in Strategy and Entrepreneurship from DePaul University in Chicago. The parking garage serves several buildings; follow the green columns. If you don't see any parking on the first level, drive just past the exit on the first level, and make a sharp left turn to access "Additional Parking" down the ramp. Proceed down the ramp, make 2 left turns and you'll see additional green columns.


    2231 Crystal Drive, suite 401, 22202 · Arlington, VA

  • Cooperative and Competitive Machine Learning through Question Answering
    For our July Data Science DC Meetup, we are excited to have Jordan Boyd-Graber (http://legacydirs.umiacs.umd.edu/~jbg/) join us to talk about Machine Learning through Question Answering. Jordan is an associate professor in the Department of Computer Science (http://cs.umd.edu/) at the University of Maryland. ---------------------------- Agenda: • 6:30pm -- Networking, Empanadas, and Refreshments • 7:00pm -- Introduction, Announcements • 7:15pm -- Presentation and Discussion • 8:30pm -- Data Drinks (Tonic, 2036 G St NW) ---------------------------- Abstract: My research goal is to create machine learning algorithms that are interpretable to humans, that can understand human strengths and weaknesses, and can help humans improve themselves. In this talk, I'll discuss how we accomplish this through a trivia game called quiz bowl. These questions are written so that they can be interrupted by someone who knows more about the answer; that is, harder clues are at the start of the question and easier clues are at the end of the question: a player must decide when it has enough information to "buzz in". I'll talk briefly about how we've built systems that can do well at quiz bowl games, beating the best human players, including Ken Jennings. However, playing trivia games isn't the whole story. I'll discuss how playing trivia games in some ways gives us a false impression of computer science ability and discuss how to make more realistic and challenging question answering datasets. The game of quiz bowl also allows opportunities to better understand interpretability in deep learning models to *help* human players perform better with machine cooperation. This cooperation helps us with a related task, simultaneous machine translation. 
Finally, I'll discuss opportunities for broader participation through open human-computer competitions: http://qanta.org ---------------------------- Bio: Jordan Boyd-Graber is an associate professor in the University of Maryland's Computer Science Department, iSchool, UMIACS, and Language Science Center. Jordan's research focus is in applying machine learning and Bayesian probabilistic models to problems that help us better understand social interaction or the human cognitive process. He and his students have won "best of" awards at NIPS (2009, 2015), NAACL (2016), and CoNLL (2015), and Jordan won the British Computer Society's 2015 Karen Spärck Jones Award and a 2017 NSF CAREER award. His research has been funded by DARPA, IARPA, NSF, NCSES, ARL, NIH, and Lockheed Martin and has been featured by CNN, Huffington Post, New York Magazine, and the Wall Street Journal.

    The George Washington University, 1957 E St., Room 113

    1957 E St. NW · Washington, DC

  • Running Machine Learning at Scale by Combining Python with Vertica
    Abstract: Databases leverage a range of programming languages. Data scientists prefer R and Python to build and test their Machine Learning models, and they are not particularly interested in changing their tools. However, using R and Python alone does not address the Big Data opportunity to develop more robust Machine Learning models, based on the full corpus of data without downsampling. Join us to learn how you can take a best-of-both-worlds approach by harnessing the power of the Vertica SQL analytical database and in-database Machine Learning capabilities with the vPython Library to develop, test, score, and perfect the end-to-end Machine Learning process at scale. Bio: Badr Ouali, a Data Scientist, joined Vertica in November 2017. Prior to Vertica, Badr received both an undergraduate and a Master’s degree in Computer Science/Mathematics from the National School of Computer Science and Applied Mathematics in Grenoble, France. Badr is passionate about sharing knowledge and insights about anything related to data analytics with colleagues.

    The George Washington University, 1957 E St., Room 113

    1957 E St. NW · Washington, DC

  • The Emerging Role of Quantum Computing in Machine Learning
    Join us for an enriching evening of data science at George Washington University! We'll explore the intersection of quantum computing and machine learning with John Kelly, Director of Analytics at QxBranch. Prior to this, Brian Wright, Co-Director of the data science program at GWU, will give an overview of the data science activities at GWU. Abstract: Data science has been rapidly growing over the past decade, and its applications have become ubiquitous in our daily lives. As these applications consume more data and need faster response times, new technologies and algorithms are needed to meet the computational demands. Quantum computing is a highly promising emerging technology that could present significant opportunities to accelerate the training of machine learning algorithms and improve data science methods. This presentation will provide an overview of data science, with a focus on practical applications in industry. The current state of quantum computing technologies will also be explored, including some of the ways that quantum computing can be harnessed to advance machine learning. Bio: John Kelly, Ph.D., Director of Analytics at QxBranch. John is leading the company’s development of advanced data analytics technologies. Previously, he was the Technical Lead for Corporate Data Analytics at Lockheed Martin. John has experience applying machine learning to a diverse set of domains including healthcare, supply chain optimization, sustainment, and program management. He completed his BS and MS in Electrical Engineering at NC State and his Ph.D. in Electrical and Computer Engineering at Carnegie Mellon University, where his work focused on machine learning and signal processing algorithms for brain-computer interfaces. 
Agenda 6:30 PM: Pizza, refreshments, & networking 6:50 PM: Introduction by Brian Wright, Co-Director of the data science program at GWU 7:00 PM: Presentation by John Kelly, Ph.D., Director of Analytics at QxBranch 7:30 PM: Q&A Note: This event is jointly organized with Dataiku Washington DC. https://www.meetup.com/Analytics-Data-Science-by-Dataiku-WashingtonDC/events/248738296/

    GWU, Funger Hall, Room 103

    2201 G St. NW · Washington, DC

  • Simulation Modeling: Generating Insight and Data About System Performance
    Simulation Modeling: A Powerful Approach for Generating Insight and Data About System Performance by Averill M. Law, Ph.D. Abstract Simulation modeling is the most widely used operations research/systems engineering technique for generating insight and data on new or proposed systems/processes. We give a tutorial on simulation and a detailed example showing the benefits that you can obtain from its use. Since a simulation model (or any model for that matter) is a surrogate for actual physical experimentation with a system, which is generally impossible, we discuss the most important techniques for the critical activity of model validation. We also talk about the unique challenges in statistically analyzing the output data from a simulation model, since it is generally non-stationary and positively correlated, counter to the assumptions of classical statistics. The talk concludes with a discussion of how design of experiments and machine learning can provide insights into what factors most impact your system of interest. Bio Dr. Averill Law is the President of Averill M. Law and Associates, Inc., a company specializing in courses and consulting for simulation modeling and statistics. Previously, he was a tenured professor at the University of Wisconsin, Madison and the University of Arizona. He has a Ph.D. in operations research from the University of California, Berkeley. He has presented more than 575 short courses on simulation and statistics in 20 countries. His book Simulation Modeling and Analysis (5th edition, McGraw-Hill) has been cited more than 18,300 times and has 172,000 copies in print. In 2009 he was awarded the INFORMS Simulation Society’s Lifetime Professional Achievement Award. Dr. Law developed the ExpertFit© distribution-fitting software.

    The George Washington University, 1957 E St., Room 113

    1957 E St. NW · Washington, DC