• Bringing Together Open Source Data Science

    Abstract: This talk will cover several different open source methods and algorithms for visualization and data modeling, with a focus on R, Python, H2O.ai, and Keras within KNIME Analytics Platform. The material will include sample workflows and code snippets for you to explore.

    This talk will be given by Scott Fincher (https://www.linkedin.com/in/scottfincher/).

    Bio: Scott Fincher works for KNIME, Inc. as a Data Scientist. He has presented several talks on KNIME's open source Analytics Platform, and enjoys assisting other data scientists with optimizing and deploying their models. Prior to his work at KNIME, he worked for almost 20 years as an environmental consultant, with a focus on numerical modeling of atmospheric pollutants. He holds an MS in Statistics and a BS in Meteorology, both from Texas A&M University.

    After the talk, we will conduct the Austin ACM SIGKDD Chapter Elections for the year 2019. The elections will be for the following offices: Chair, Vice Chair, and Secretary/Treasurer. To run for any position, you must be both an ACM member and a SIGKDD member. To vote, you do not need to be an ACM or SIGKDD member; you just need to show up for the election meeting. To learn more about these positions and duties, you can contact the current officers:
    Chair: Omar Odibat [masked]
    Vice Chair: Robert Chong [masked]
    Secretary/Treasurer: Francisco Marquez [masked]

    Agenda:
    6:30-7:00 Network and Food
    7:00-7:45 Talk
    7:45-8:15 Elections

  • Invited Speaker - Mind and Models: Deep Learning in its Victories and Defeats

    Dr. Alan Lockett has been invited as a guest speaker to give a very interesting talk regarding the future of deep learning and how it is reshaping the landscape of data science.

    Abstract: Under the moniker of deep learning, neural networks have achieved a number of breakthroughs in applied AI over the past five years, from image classification to machine translation. Counter to early narratives suggesting deep learning could only be applied to large datasets, techniques such as representation learning and transfer learning enable successes on small datasets as well. In a certain sense, deep learning automates the construction of ML pipelines, replacing junctures previously constructed by hand with automatically learned interfaces. New APIs and frameworks have improved the accessibility of these techniques, challenging the long-term viability of the traditional data science toolkit composed of SVMs, random forests, decision trees, and logistic regression.

    Nonetheless, deep learning as presently construed cannot lead to general-purpose AI for very fundamental reasons. A cursory inspection of the outputs of chatbots, machine translation, and language generation reveals that these systems fail to capture or express a consistent narrative thread. Quite simply, deep learning systems lack the qualia of human thought. In their present form, they do not build or maintain simple, consistent models, which makes interpreting or explaining their results difficult. Furthermore, unconstrained and uncurated learning from data replicates statistical biases present in the dataset, which leads to ethical questions regarding their deployment for practical purposes, such as hiring employees or awarding parole. Future work in AI, both industrial and academic, needs to consider how to address these shortcomings by superimposing an artificial mind as a curator over a statistical learning system.

    In this talk, the successes and challenges of deep learning will be reviewed, followed by discussion of how these challenges might be mitigated and eventually overcome in order to enable general-purpose AI for practical applications.

    Bio: Alan J. Lockett is the Principal Data Scientist at CS Disco, Inc., a fast-growing legal technology start-up with $60 million invested. He received his Ph.D. in Computer Science from the University of Texas at Austin within the Artificial Intelligence Lab, where he studied neural networks, graphical models, and neuroevolution for applications in games, optimization, and control with Risto Miikkulainen. After his time at UT, he was awarded an NSF postdoctoral fellowship and worked with Jürgen Schmidhuber at the Dalle Molle Institute for Artificial Intelligence Studies in Lugano, Switzerland on humanoid robotics and deep learning. He is the author of a dozen journal articles and conference papers and holds two patents applying deep learning to legal technology.

    Agenda:
    6:30 Social
    7:00 Presentation + QA

    Location: Visa, 12301 Research Blvd, Bldg 3, Austin, TX 78759, United States

    RSVP:
    • Seating is limited to the first 140 to RSVP.
    • Please let any of the ACM Officers know if you have any questions about RSVP.

  • Real-time Data Analytics with Apache Spark Streaming

    Abstract: With streaming data processing, computing is done in real time as data arrives rather than as an offline batch process. Real-time data analytics is becoming a critical component of the big-data strategy for many organizations. In this presentation, I'll discuss Apache Spark Streaming and how it can be used for processing data in real time. We'll look at a sample application using technologies like Zookeeper and Kafka to see how Spark Streaming works.

    Presentation Outline:
    • Real-time data analytics
    • Streaming data use cases
    • Spark Streaming API
    • Structured Streaming vs Spark Streaming
    • Sample application

    Bio: Srini Penchikala currently works as a senior software architect in Austin, Texas. He has over 22 years of experience in software architecture, design, and development. He recently published a book on the Apache Spark framework. He is the co-author of the Spring Roo in Action book from Manning Publications. He has presented at conferences like JavaOne, SEI Architecture Technology Conference (SATURN), IT Architect Conference (ITARC), No Fluff Just Stuff, NoSQL Now, Enterprise Data World, OWASP AppSec, and Project World Conference. Penchikala has also published several articles on software architecture, security and risk management, NoSQL, and big data on websites like InfoQ, TheServerSide, O’Reilly Network (ONJava), DevX Java Zone, Java.net and JavaWorld. He is the lead editor of the Data Science community at InfoQ.

    Agenda:
    6:30 Food + Networking
    7:00 Presentation + QA

    Location: Visa, 12301 Research Blvd, Bldg 3, Austin, TX 78759, United States

    RSVP:
    • Seating is limited to the first 75 to RSVP.
    • Please let me know if you have any questions about RSVP.

  • Real-time Data Analytics with Apache Spark Streaming

    Abstract: With streaming data processing, computing is done in real time as data arrives rather than as an offline batch process. Real-time data analytics is becoming a critical component of the big-data strategy for many organizations. In this presentation, I'll discuss Apache Spark Streaming and how it can be used for processing data in real time. We'll look at a sample application using technologies like Zookeeper and Kafka to see how Spark Streaming works.

    Bio: Srini Penchikala currently works as a Senior IT Architect in Austin, Texas. He is also the Lead Editor of the Data Science community at InfoQ (http://www.infoq.com/author/Srini-Penchikala). Srini has over 20 years of experience in software architecture, design, development, and delivery. He is currently authoring a book on Big Data Processing with Apache Spark. He is also the co-author of the book "Spring Roo in Action" (http://www.manning.com/SpringRooinAction) from Manning Publications. Srini has presented at conferences like Enterprise Data World, JavaOne, and NoSQL Now! Conference.

    Agenda:
    6:30 Food + Networking
    7:00 Presentation + QA

    Location: Visa, 12301 Research Blvd, Bldg 3, Austin, TX 78759, United States

    RSVP:
    • Seating is limited to the first 75 to RSVP.
    • Please let me know if you have any questions about RSVP.

  • Visual Spark Development with KNIME

    Visa

    Abstract: The KNIME Analytics Platform is the leading open source solution for data-driven innovation. KNIME can be used to help discover the potential hidden in your data, mine for fresh insights, or predict new futures. KNIME provides a visual development environment enabling data scientists of varying experience levels to quickly build complex solutions. The visual environment also enables collaboration with peers and other groups who may not be as technically savvy. KNIME nodes (functions) provide a wide variety of capabilities including sourcing data, data transformation, and modeling using a variety of algorithms. KNIME also has integrations with Python, R, Spark, H2O and deep learning. And all on an open source platform.

    The Big Data Extensions integrate the power of Apache Hadoop and Spark with the ease-of-use of KNIME Analytics Platform. They consist of two complementary node libraries:
    • KNIME Big Data Connectors enable you to import/export HDFS data and perform SQL analytics within Hive and Impala through a series of KNIME nodes.
    • KNIME Extension for Apache Spark enables you to create and run Spark jobs for data transformation and model learning through another set of KNIME nodes.

    In this talk we'll provide a quick overview of the KNIME Analytics Platform, then jump right into building Spark workflows using Hive and HDFS-based data. The examples and demonstrations will illustrate using a visual environment to build machine learning workflows that execute on a Hadoop cluster using Spark. KNIME also enables mixing visual development with coding using the Spark SQL and Java Snippet nodes.

    Bio: Jim has worked with KNIME for the past year helping to get their US-based operations up and running. His work includes evangelizing the KNIME open source platform and supporting customers through their journeys in data science. Jim's background mixes data science and computer science, including building a dataflow-based, distributed computation platform for deep data analysis (similar to Spark).

    Agenda:
    6:30 Food + Networking
    7:00 Presentation + QA

    Location: Visa, 12301 Research Blvd, Bldg 3, Austin, TX 78759, United States

    RSVP:
    • Seating is limited to the first 75 to RSVP.
    • Please let me know if you have any questions about RSVP.

  • Introduction to Semi-Supervised Learning

    data.world

    Semi-supervised learning is a relatively new approach to working with data that does not come from canonical and well-prepared sources. In practice, it is very rare to come across data that has class labels readily and abundantly available to satisfactorily train a classifier. This is when semi-supervised learning techniques can be beneficial: they synthetically apply labels to unlabeled data by leveraging the underlying statistical distributions.

    In this talk, you will hopefully walk away with the following:
    - Deeper understanding of semi-supervised learning
    - Introduction to several different algorithms such as CPLE & S3VM
    - Demonstration of semi-supervised learning in Notebooks

    Robert Chong is the Vice-Chair of the local Austin ACM chapter. He has a very deep interest in machine learning and is always learning (when he's not spending time with his family). https://www.linkedin.com/in/robertjchong

    Agenda:
    6:30 Food + Networking
    7:00 Presentation + QA

    Location: TBD - Check back later

    RSVP:
    • Seating is limited to the first 75 to RSVP.
    • Please let me know if you have any questions about RSVP.
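As a rough illustration of the idea in the abstract above (assigning synthetic labels to unlabeled points based on where they fall relative to the labeled data), here is a minimal self-training sketch. It is not CPLE or S3VM, and it is not taken from the talk; the nearest-centroid rule, the `margin` parameter, and the toy one-dimensional data are all assumptions made for illustration.

```python
def fit_centroids(labeled):
    """Return the mean x value of class 0 and of class 1."""
    zeros = [x for x, y in labeled if y == 0]
    ones = [x for x, y in labeled if y == 1]
    return sum(zeros) / len(zeros), sum(ones) / len(ones)


def self_train(labeled, unlabeled, margin=1.0, max_rounds=10):
    """Self-training: repeatedly pseudo-label the unlabeled points that the
    current nearest-centroid model is confident about, then refit.

    A point counts as "confident" when its distances to the two centroids
    differ by at least `margin` (a hypothetical confidence rule).
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        c0, c1 = fit_centroids(labeled)
        confident, rest = [], []
        for x in pool:
            d0, d1 = abs(x - c0), abs(x - c1)
            if abs(d0 - d1) >= margin:
                confident.append((x, 0 if d0 < d1 else 1))
            else:
                rest.append(x)
        if not confident:
            break  # nothing new to learn from; stop early
        labeled.extend(confident)
        pool = rest
    c0, c1 = fit_centroids(labeled)
    return c0, c1, pool


# Toy data: two labeled points and seven unlabeled ones.
labeled = [(1.0, 0), (9.0, 1)]
unlabeled = [0.5, 1.5, 2.0, 8.0, 8.5, 9.5, 5.2]

c0, c1, leftover = self_train(labeled, unlabeled)
print("class 0 centroid:", c0)
print("class 1 centroid:", c1)
print("still unlabeled:", leftover)
```

The ambiguous point near the middle stays unlabeled rather than being forced into a class, which is the usual safeguard in self-training against reinforcing early mistakes.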

  • Invited Speaker: George Trujillo - Designing the Next Generation Data Lake

    Designing the Next Generation Data Lake

    Abstract: Data Lakes and Analytic Platforms are continuing to evolve to address the changing needs of customers. Most existing data lakes were built one use case at a time without a proper data architecture or analytics strategy. In this presentation, we will show the capabilities and characteristics of the next generation of data lakes (Data Management and Analytic platforms). We will discuss how new data lakes will accelerate time to insight, leverage cloud capabilities, separate compute and storage, and optimize cloud storage. Platform decisions, data ingestion architectures, data governance, model governance, security, and designing an enterprise-grade data lake are topics of discussion. The goal of this presentation is to show how the next generation of data lakes will greatly reduce time to insight and platform complexity.

    Bio: George Trujillo is a passionate and energetic technology leader with over seven years' experience in the delivery and implementation of enterprise big data platforms as well as twenty years' experience in analytics and data management. George's roles in big data have included Master Principal Big Data Specialist, Managing Director of Big Data, Vice President of Big Data, and Global Director of Big Data and Cloud technologies. Industry recognitions include Oracle Double Ace, VMware vExpert, and being recognized as one of the "Oracles of Oracle". He is also the author of two books on Big Data and Virtualization.

    Agenda:
    6:30 Social
    7:00 Presentation + QA

    Location: HomeAway North @ Domain (https://maps.google.com/maps?q=11800%20Domain%20Blvd.%2C%20Suite%20300%20%2C%20Austin%2C%20TX) 11800 Domain Blvd., Suite 300, Austin, TX

    RSVP:
    • Seating is limited to the first 140 to RSVP.
    • Please let me know if you have any questions about RSVP.

  • Invited Speaker: Dr. Jennifer Prendki - WHAT TO DO WHEN YOUR DATA IS LYING TO YOU

    TELL ME THE TRUTH, THE WHOLE TRUTH, AND NOTHING BUT THE TRUTH (WHAT TO DO WHEN YOUR DATA IS LYING TO YOU)

    Abstract: Our job as data scientists is to demand answers from the data, even if these answers are sometimes not in line with what we would like to hear. As they say, “The only thing worse than not knowing is not wanting to know”. There are many ways in which our data, the models we build with it, or the laws of statistics can mislead us into drawing the wrong conclusions. To be successful, we have to navigate our way through common pitfalls ranging from outliers to overfitting, selection biases, etc. And even that sometimes is not enough: we also have to be wary of more subtle phenomena, such as Simpson’s paradox.

    So how do we figure out if the results of an analysis are unbiased and indeed depict a proper view of reality? How do we check if a model is accurate? And how do we even check if the data we are using are correct? Validation might just be the most important part of the analytical process, yet it is often the most overlooked one. Thankfully, generations of statisticians have developed methods to confirm or refute their results, and it is usually possible to catch your data in a lie before the model starts impacting the business irrevocably.

    In this talk, I will discuss not only the many ways that data can deceive analysts (both human-driven and technical), but also some of the tools to avoid it and the consequences that can result if you don’t ensure that your data is actually telling you the truth, the whole truth, and nothing but the truth.

    Bio: Dr. Jennifer Prendki is the Head of Data Science at Atlassian, where she leads all Search and Machine Learning initiatives and is in charge of leveraging the massive amount of data collected by the company to load the suite of Atlassian products with smart features. She received her PhD in Particle Physics from University UPMC - La Sorbonne in 2009 and has since worked as a data scientist in many different industries. Prior to joining Atlassian, Jennifer was a Senior Data Science Manager in the Search team of Walmart eCommerce. She enjoys addressing both technical and non-technical audiences at conferences and sharing her knowledge and experience with aspiring data scientists.

    Agenda:
    6:30 Pizza + Networking
    7:00 Presentation + QA

    Location: Visa (12301 Research Blvd, Bldg 3, Austin, TX 78759) will sponsor the classes and provide pizza and drinks. Parking is available in front of the Visa building. If you cannot find a spot, you can park in the garage behind the building.

    RSVP: If you are planning to attend, please answer the question when you RSVP and provide your first name, last name, and email address so that I can create a badge for you and a username/password to access the Visa wifi.
    • Seating is limited to the first 100 to RSVP.
    • Please bring a picture ID and arrive early to assist with the sign-in process. Please arrive no later than 6:30PM.
    • RSVP will end on 12/11 @ 1:00PM.
    • Please let me know if you have any questions about RSVP.
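The Simpson's paradox mentioned in the abstract can be seen with a few lines of arithmetic. The sketch below is not from the talk; the counts are illustrative (successes out of trials for two hypothetical treatments, A and B, split into two subgroups) and are chosen so that A has the higher success rate inside every subgroup while B has the higher rate once the subgroups are pooled.

```python
# Illustrative counts: (successes, trials) per treatment within each subgroup.
groups = {
    "small": {"A": (81, 87), "B": (234, 270)},
    "large": {"A": (192, 263), "B": (55, 80)},
}


def rate(successes, trials):
    """Success rate as a fraction."""
    return successes / trials


def pooled(treatment):
    """Total (successes, trials) for a treatment, ignoring the subgroup."""
    s = sum(groups[g][treatment][0] for g in groups)
    n = sum(groups[g][treatment][1] for g in groups)
    return s, n


# A beats B inside every subgroup...
a_wins_each_group = all(
    rate(*groups[g]["A"]) > rate(*groups[g]["B"]) for g in groups
)
# ...yet B beats A once the subgroups are pooled together.
b_wins_pooled = rate(*pooled("B")) > rate(*pooled("A"))

for g in groups:
    print(g, "A:", round(rate(*groups[g]["A"]), 3),
          "B:", round(rate(*groups[g]["B"]), 3))
print("pooled A:", round(rate(*pooled("A")), 3),
      "B:", round(rate(*pooled("B")), 3))
print("A wins in each group:", a_wins_each_group)
print("B wins when pooled:", b_wins_pooled)
```

The reversal comes from the unequal subgroup sizes: A is mostly applied to the harder "large" cases and B to the easier "small" ones, so the pooled rates mostly reflect case mix rather than treatment quality.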

  • Boosting Algorithms in Python

    Visa

    Boosting algorithms are very powerful techniques for building predictive models, and they are widely used in data science competitions. In this talk Omar Odibat will cover: Introduction to Boosting, AdaBoost, Gradient Boosting and XGBoost. A demo of how to use Boosting algorithms in Python will be included.

    Omar is a data scientist at Visa, and he is the current chair of the Austin ACM SIGKDD chapter. Omar holds a PhD in Computer Science from Wayne State University. His research interests include machine learning, data mining and big data analytics. https://www.linkedin.com/in/omarodibat/ (https://www.linkedin.com/in/tuhinmahmud/)

    Agenda:
    6:30 Pizza + Networking
    7:00 Presentation + QA

    Location: Visa (12301 Research Blvd, Bldg 3, Austin, TX 78759) will sponsor the classes and provide pizza and drinks. Parking is available in front of the Visa building. If you cannot find a spot, you can park in the garage behind the building.

    RSVP: If you are planning to attend, please answer the question when you RSVP and provide your first name, last name, and email address so that I can create a badge for you and a username/password to access the Visa wifi.
    • Seating is limited to the first 100 to RSVP.
    • Please bring a picture ID and arrive early to assist with the sign-in process. Please arrive no later than 6:30PM.
    • RSVP will end on 11/13 @ 1:00PM.
    • Please let me know if you have any questions about RSVP.
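To give a feel for the AdaBoost topic listed above, here is a from-scratch sketch using one-dimensional decision stumps. It is not the speaker's demo and does not use any boosting library; the function names, the toy dataset, and the choice of three rounds are assumptions made for illustration. No single stump can separate the + + - - + + pattern, but a weighted vote of three stumps can.

```python
import math


def stump_predict(x, threshold, polarity):
    """Predict `polarity` when x is above the threshold, else the opposite."""
    return polarity if x > threshold else -polarity


def best_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    best = None
    sorted_xs = sorted(set(xs))
    # Candidate thresholds: one below the minimum, then midpoints.
    candidates = [sorted_xs[0] - 1.0] + [
        (a + b) / 2 for a, b in zip(sorted_xs, sorted_xs[1:])
    ]
    for threshold in candidates:
        for polarity in (1, -1):
            err = sum(
                w for x, y, w in zip(xs, ys, weights)
                if stump_predict(x, threshold, polarity) != y
            )
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best


def adaboost_fit(xs, ys, rounds):
    """Train AdaBoost; labels must be +1 or -1."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) / division by 0
        alpha = 0.5 * math.log((1 - err) / err)  # vote weight of this stump
        ensemble.append((alpha, threshold, polarity))
        # Up-weight misclassified points, down-weight correct ones.
        weights = [
            w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
            for x, y, w in zip(xs, ys, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def adaboost_predict(ensemble, x):
    """Sign of the alpha-weighted vote over all stumps."""
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1, 1, -1, -1, 1, 1]

one_round = adaboost_fit(xs, ys, rounds=1)
three_rounds = adaboost_fit(xs, ys, rounds=3)
correct_one = sum(adaboost_predict(one_round, x) == y for x, y in zip(xs, ys))
preds_three = [adaboost_predict(three_rounds, x) for x in xs]
print("correct with 1 stump:", correct_one, "of", len(xs))
print("predictions with 3 stumps:", preds_three)
```

In practice a library implementation (for example scikit-learn or XGBoost) would be used instead; the point here is only the reweighting loop that makes each new weak learner focus on the examples the previous ones got wrong.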

  • Graphs in Spark: GraphX

    Visa

    Prashant Shetty, a Lead Data Engineer at Phunware, will give an overview of Apache Spark GraphX and GraphFrames. GraphX is Apache Spark's API for graphs and graph-parallel computation. GraphFrames are to DataFrames as GraphX is to RDDs. The talk will also include a hands-on tutorial of the GraphX and GraphFrames APIs using a Databricks notebook.

    Agenda:
    6:30 Pizza + Networking
    7:00 Presentation + QA

    Location: Visa (12301 Research Blvd, Bldg 3, Austin, TX 78759) will sponsor the classes and provide pizza and drinks. Parking is available in front of the Visa building. If you cannot find a spot, you can park in the garage behind the building.

    RSVP: If you are planning to attend, please answer the question when you RSVP and provide your first name, last name, and email address so that I can create a badge for you and a username/password to access the Visa wifi.
    • Seating is limited to the first 100 to RSVP.
    • Please bring a picture ID and arrive early to assist with the sign-in process. Please arrive no later than 6:30PM.
    • RSVP will end on 11/6 @ 1:00PM.
    • Please let me know if you have any questions about RSVP.
