- Holden Karau on "Why PySpark is the way it is"
- NLP Community Night (on the eve of Data Day)
Come join us for an extended happy hour meet and greet with the Seattle NLP Community (https://www.meetup.com/seattle-bellevue-natural-language-processing/). We'll be welcoming the speakers to town for the upcoming NLP Day at Data Day Seattle (http://datadayseattle.com/). RSVP at: https://nlp-community-night.eventbrite.com Some of the visiting speakers on hand will be:
- Jonathon Morgan (http://goodattheinternet.com/) (LinkedIn: https://www.linkedin.com/in/jonathonmorgan)
- Sanghamitra Deb (https://www.linkedin.com/in/sanghamitra-deb-217b4122)
- Garrett Eastham (https://www.linkedin.com/in/garretteastham/)
- Jason Kessler (http://www.jasonkessler.com/) (LinkedIn: https://www.linkedin.com/in/jasonskessler)
- Gunnar Kleemann (https://www.linkedin.com/in/gunnarkleemann)
- Zornitsa Kozareva (http://www.kozareva.com/) (LinkedIn: https://www.linkedin.com/in/zkozareva)
- Stefan Krawczyk (https://www.linkedin.com/in/skrawczyk)
- Rob McDaniel (https://www.linkedin.com/in/robmcdan)
- Jonathan Mugan (http://www.jonathanmugan.com/) (LinkedIn: http://www.linkedin.com/in/jonathanmugan)
- Julia Silge (http://www.juliasilge.com/) (LinkedIn: https://www.linkedin.com/in/juliasilge / GitHub: https://github.com/juliasilge)
You do not need to have a ticket to Data Day to join us. However, you do have to RSVP. RSVP at: https://nlp-community-night.eventbrite.com See you there!
- TinkerPop / Gremlin Workshop
Course outline, requirements, and registration at: http://datadayseattle.com/ddsea17-workshops/tinkerpop There will only be one section of this class, and enrollment is limited to 30. Don't miss this opportunity! *If you want to understand how to take advantage of the graph functionality of Azure Cosmos DB, you need to learn Gremlin and TinkerPop* We originally commissioned Josh Perryman (https://www.linkedin.com/in/josh-perryman-5650208/), of Expero (https://www.experoinc.com/), to teach this TinkerPop/Gremlin workshop for the Graph Day conference held in Austin in January 2017. To our knowledge, no one has a more encyclopedic knowledge of the commercial graph database landscape than Josh. The workshop sold out and received rave reviews. When Josh offered it again at Graph Day in San Francisco, it sold out as well - once again, to rave reviews. We have asked Josh to come to Seattle and offer the course yet again in conjunction with Graph Day (http://datadayseattle.com/ddsea17/graphday) at Data Day Seattle (http://datadayseattle.com/). As far as we know, this is currently the only TinkerPop / Gremlin training workshop in the world. Course outline, requirements, and registration at: http://datadayseattle.com/ddsea17-workshops/tinkerpop What is Gremlin? Gremlin (https://tinkerpop.apache.org/gremlin.html), part of the Apache TinkerPop framework, is an incredibly rich and powerful query language for property graphs. Its functional roots and novel execution model can make it a little difficult to get started with; many instincts from set-based query languages like SQL don't translate directly. In this four-hour workshop, led by Gremlin expert Josh Perryman, you will work through a series of increasingly complex exercises. What is TinkerPop? Apache TinkerPop (http://tinkerpop.apache.org/) is an open source, vendor-agnostic graph computing framework distributed under the commercially friendly Apache 2 license.
When a data system is TinkerPop-enabled, its users are able to model their domain as a graph and analyze that graph using the Gremlin query language. All TinkerPop-enabled systems integrate with one another, allowing vendors to easily expand their offerings and users to choose the appropriate graph technology for their application. About the instructor Josh Perryman (https://www.linkedin.com/in/josh-perryman-5650208/) is a Managing Consultant / Data Junkie / Technology Lead at Expero, Inc (http://experoinc.com/). His deep familiarity with a multitude of graph platforms and tools makes him a highly sought-after speaker, trainer, and consultant in the graph space.
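The traversal style Gremlin uses, chained steps like g.V().has('name','marko').out('knows').out('created').values('name'), can be sketched very loosely in plain Python over a toy property graph. The data below is in the style of TinkerPop's tutorial graph, and the helper functions are illustrative only, not Gremlin's actual API:

```python
# A toy property graph: vertices with properties, plus labeled edges.
vertices = {
    1: {"label": "person", "name": "marko", "age": 29},
    2: {"label": "person", "name": "josh", "age": 32},
    3: {"label": "software", "name": "lop"},
}
edges = [  # (out_vertex, edge_label, in_vertex)
    (1, "knows", 2),
    (1, "created", 3),
    (2, "created", 3),
]

def out(vertex_ids, edge_label):
    """Follow outgoing edges with a given label (like Gremlin's out() step)."""
    return [dst for src, lbl, dst in edges
            if src in vertex_ids and lbl == edge_label]

def values(vertex_ids, key):
    """Extract a property from each vertex (like Gremlin's values() step)."""
    return [vertices[v][key] for v in vertex_ids]

# Rough analogue of:
# g.V().has('name','marko').out('knows').out('created').values('name')
marko = [v for v, props in vertices.items() if props.get("name") == "marko"]
print(values(out(out(marko, "knows"), "created"), "name"))  # ['lop']
```

Real Gremlin traversals are lazily evaluated and run over TinkerPop-enabled stores of any size; this sketch only illustrates the step-chaining mental model that the workshop exercises build on.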
- Extending Spark ML for Custom Models (with Holden Karau)
This meetup is co-hosted with our friends at the Seattle Spark Meetup (https://www.meetup.com/Seattle-Spark-Meetup/). RSVP here or there. You don't need to RSVP with both groups. Special thanks to Blueprint Consulting Services (http://www.bpcs.com/) for hosting this meetup. Spark Committer/Author and Global Data Geek friend Holden Karau will be passing through town, so we invited her to spend an evening talking Spark and Python. Abstract This is an updated version of Holden's Spark Summit West talk with new material including Python support as well as Scala. Apache Spark’s machine learning (ML) pipelines provide a lot of power, but sometimes the tools you need for your specific problem aren’t available yet. This talk introduces Spark’s ML pipelines, and then looks at how to extend them with your own custom algorithms. By integrating your own data preparation and machine learning tools into Spark’s ML pipelines, you will be able to take advantage of useful meta-algorithms, like parameter searching and pipeline persistence (with a bit more work, of course). Even if you don’t have your own machine learning algorithms that you want to implement, this session will give you an inside look at how the ML APIs are built. It will also help you make even more awesome ML pipelines and customize Spark models for your needs. And if you don’t want to extend Spark ML pipelines with custom algorithms, you’ll still benefit by developing a stronger background for future Spark ML projects. Agenda 6:30 Meet and Greet / Networking 7:00 Announcements and Featured Talk 8:30 Adjourn for drinks
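The core design the talk explores, Estimators whose fit() produces Transformers that are then chained into a Pipeline, can be sketched in plain Python without a Spark cluster. The class names below echo Spark ML's concepts, but the signatures are simplified and are not the real pyspark.ml API:

```python
# Minimal sketch of the Estimator/Transformer pattern behind Spark ML pipelines.

class Transformer:
    def transform(self, data):
        raise NotImplementedError

class Estimator:
    def fit(self, data):
        """Learn from data and return a fitted Transformer (a 'Model')."""
        raise NotImplementedError

class MeanCenterer(Estimator):
    """Illustrative estimator: learns the mean so its model can center data."""
    def fit(self, data):
        return MeanCentererModel(sum(data) / len(data))

class MeanCentererModel(Transformer):
    def __init__(self, mean):
        self.mean = mean
    def transform(self, data):
        return [x - self.mean for x in data]

class Pipeline(Estimator):
    """Fits each stage in order, feeding transformed data forward."""
    def __init__(self, stages):
        self.stages = stages
    def fit(self, data):
        fitted = []
        for stage in self.stages:
            model = stage.fit(data) if isinstance(stage, Estimator) else stage
            data = model.transform(data)
            fitted.append(model)
        return PipelineModel(fitted)

class PipelineModel(Transformer):
    def __init__(self, stages):
        self.stages = stages
    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data

model = Pipeline([MeanCenterer()]).fit([1.0, 2.0, 3.0])
print(model.transform([4.0]))  # [2.0]
```

Because a custom stage only has to honor this fit/transform contract, it slots into meta-algorithms like parameter search for free, which is exactly the payoff the talk describes for extending Spark ML.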
- Analyzing The Trumpworld Graph: Applying Network Analysis to Public Data
We've invited William Lyon, one of the most requested speakers at the Data Day conferences, to visit Seattle and share his latest talk. This is going to be a good one. Special thanks to the folks at Context Relevant (https://www.contextrelevant.com/) for hosting this meetup. The Story A few weeks ago BuzzFeed released a public dataset (https://www.buzzfeed.com/johntemplon/help-us-map-trumpworld) of people and organizations connected to Donald Trump and members of his administration. As they say in their blog post (https://www.buzzfeed.com/johntemplon/help-us-map-trumpworld) announcing the data: No American president has taken office with a giant network of businesses, investments, and corporate connections like that amassed by Donald J. Trump. His family and advisers have touched a staggering number of ventures, from a hotel in Azerbaijan to a poker company in Las Vegas. In this meetup we will show how to import this data into Neo4j, write Cypher queries to find interesting connections and visualize the results using Neo4j Browser. In addition, we will show how to add public data from USASpending.gov on government contracts and campaign finance from the FEC, allowing us to answer questions like: How are members of the Trump administration connected to vendors of government contracts? Who are the most influential people in the network and how are they connected to Trump? Bring a laptop if you'd like to follow along or just watch as we cover importing the data and writing queries to apply network analysis concepts. About the speaker William Lyon (https://lyonwj.com/) is a software developer at Neo4j (http://neo4j.com/), the open source graph database. As an engineer on the Developer Relations team, he works primarily on integrating Neo4j with other technologies, building demo apps, helping other developers build applications with Neo4j, and writing documentation.
Prior to joining Neo, William worked as a software developer for several startups in the real estate software, quantitative finance, and predictive API fields. William holds a Masters degree in Computer Science from the University of Montana. You can find him online at lyonwj.com (https://lyonwj.com/) Agenda 6:30 - Meet and Greet / Networking 7:00 - Announcements and Featured Talk 8:30 - Adjourn to pub
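One of the questions above, who are the most influential people in the network, is at its simplest a degree-centrality computation. The talk itself uses Neo4j and Cypher; here is the underlying idea in plain Python, with invented node names rather than anything from the BuzzFeed data:

```python
from collections import Counter

# Hypothetical person-to-organization connections (names invented).
connections = [
    ("Person A", "Org X"),
    ("Person A", "Org Y"),
    ("Person B", "Org X"),
    ("Person C", "Org X"),
]

# Degree centrality: count how many edges touch each node.
degree = Counter()
for a, b in connections:
    degree[a] += 1
    degree[b] += 1

print(degree.most_common(2))  # [('Org X', 3), ('Person A', 2)]
```

In a graph database the same question is a short query over relationships; richer influence measures (betweenness, PageRank) follow the same pattern of scoring nodes by their position in the network.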
- Luca Garulli on OrientDB - Multi-Model Graph Database
Without a lot of hype and fanfare, OrientDB has become one of the most widely deployed graph databases. DB-engines.com currently ranks OrientDB #2 among graph databases (http://db-engines.com/en/ranking/graph+dbms) -- and now it's multi-model, able to store documents as well. We've been trying for a while now to bring Luca Garulli (https://www.linkedin.com/in/garulli), the author of OrientDB (http://orientdb.com/orientdb/), to Seattle. It's finally happening. For those of you not yet familiar, OrientDB is an open source NoSQL (https://en.wikipedia.org/wiki/NoSQL) database management system (https://en.wikipedia.org/wiki/Database_management_system) written in Java (https://en.wikipedia.org/wiki/Java_(programming_language)). It is a multi-model database, supporting graph (https://en.wikipedia.org/wiki/Graph_database), document (https://en.wikipedia.org/wiki/Document-oriented_database), key/value (https://en.wikipedia.org/wiki/Key-value_database), and object (https://en.wikipedia.org/wiki/Object_database) models, with relationships managed as in graph databases: direct connections between records. It supports schema-less, schema-full, and schema-mixed modes. It has a strong security profiling system based on users and roles, and supports querying with Gremlin (https://en.wikipedia.org/wiki/Gremlin_(programming_language)) along with SQL (https://en.wikipedia.org/wiki/SQL) extended for graph traversal. This is a rare opportunity. Come join us to welcome Luca to Seattle! Agenda 6:30 Networking / Meet and Greet 7:00 Announcements and Featured Presentation 8:30 Q/A and adjourn to pub
- Top 5 mistakes when writing Spark Applications
When we found out that Mark Grover (https://www.linkedin.com/in/grovermark) of Cloudera (http://cloudera.com) was coming to town, we asked if he would save a night to share the latest on Spark with the community. We twisted his arm and he said: yes. Don't miss this talk. Abstract Top 5 mistakes when writing Spark Applications In the world of distributed computing, Spark has simplified development and opened the doors for many to start writing distributed programs. Folks with little to no distributed coding experience can now write just a couple of lines of code that will immediately get hundreds or thousands of machines working on creating business value. However, even though Spark code is easy to write and read, that doesn’t mean that users don’t run into issues with long-running, slow-performing jobs or out-of-memory errors. Thankfully, most of the issues with using Spark have nothing to do with Spark itself, but with the approach we take when using it. This session will go over the top 5 things that we’ve seen in the field that prevent people from getting the most out of their Spark clusters. When some of these issues are addressed, it is not uncommon to see the same job running 10x or 100x faster on the same clusters, with the same data, just with a different approach. Speaker Bio Mark Grover (https://www.linkedin.com/in/grovermark) is a software engineer working on Apache Spark at Cloudera (http://cloudera.com). He is a co-author of the O'Reilly book Hadoop Application Architectures (http://shop.oreilly.com/product/0636920033196.do) and also wrote a section of the book Programming Hive. Mark is also a committer on Apache Bigtop (http://bigtop.apache.org/) and a committer and PMC member on Apache Sentry (https://sentry.apache.org/). He has contributed to a number of open source projects, including Apache Hadoop, Apache Hive, Apache Sqoop, and Apache Flume. Mark is a sought-after speaker on Big Data topics at national and international conferences.
He occasionally blogs about technology. Agenda 6:30 Networking 7:00 Announcements and featured talk 8:30 Adjourn to pub
- Integrating Data using Graphs and Semantics
Special thanks to the folks at Whitepages for hosting! Juan Sequeda (https://www.google.com/search?q=juan%20sequeda), co-founder of Capsenta (https://capsenta.com/), longtime authority on the Semantic Web (W3C RDB2RDF Working Group (http://www.w3.org/2001/sw/rdb2rdf/) / Direct Mapping of Relational Data to RDF (http://www.w3.org/TR/rdb-direct-mapping/)), is coming to town for Data Day Seattle (http://datadayseattle.com). We've asked Juan to spend an evening giving a bit of historical background and sharing recent developments with respect to the Semantic Web (https://en.wikipedia.org/wiki/Semantic_Web), Linked Data (https://en.wikipedia.org/wiki/Linked_data), and related topics. Abstract Organizations need to derive immediate value from the data and knowledge residing in disparate and heterogeneous legacy systems, regardless of platform, data structure, or data model. Integrating data from legacy systems is hard, complex, and perhaps one of the most difficult jobs in the world. This talk will present how graphs can be used to integrate data coming from different sources, and how semantics can be added to make your data smarter and enhance search, analysis, and interpretation of your data. I will also discuss different implementation architectures. Additional details to follow. This should be a great talk. Check out Juan's recent interview with Global Data Geeks: https://www.youtube.com/watch?v=xQniB_Z9aoU Agenda 6:30 pm Networking 7:00 pm Featured Talk 8:30 pm Adjourn to pub
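The core idea behind using graphs and semantics for integration can be sketched minimally in plain Python, with invented records and vocabulary: once data from different systems is expressed as subject-predicate-object triples (the RDF model behind the Semantic Web standards Juan works on), integration becomes a union of triple sets, and queries can span sources uniformly:

```python
# Records from two hypothetical systems, expressed as (s, p, o) triples.
crm_triples = [
    ("alice", "worksFor", "acme"),
    ("alice", "email", "alice@example.com"),
]
billing_triples = [
    ("acme", "hasInvoice", "inv-001"),
    ("inv-001", "amount", "1200"),
]

graph = crm_triples + billing_triples  # integration is just a union

def query(s=None, p=None, o=None):
    """Match triples against a pattern; None acts as a wildcard."""
    return [(ts, tp, to) for ts, tp, to in graph
            if (s is None or ts == s)
            and (p is None or tp == p)
            and (o is None or to == o)]

# A cross-source question: which invoices belong to Alice's employer?
employers = [o for _, _, o in query(s="alice", p="worksFor")]
invoices = [o for e in employers for _, _, o in query(s=e, p="hasInvoice")]
print(invoices)  # ['inv-001']
```

Real systems add a shared vocabulary (an ontology) on top of this, so that "worksFor" means the same thing across sources; that shared meaning is the "semantics" that makes the integrated data smarter.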
- Natural Language Processing with Python/NLTK Workshop
Benjamin Bengfort (https://www.linkedin.com/in/bbengfort), Head Faculty for District Data Labs (http://www.districtdatalabs.com/) and co-author of the upcoming O'Reilly publication Data Analytics with Hadoop: An Introduction for Data Scientists (http://shop.oreilly.com/product/0636920035275.do), is coming to speak at Data Day Seattle (http://datadayseattle.com/). We asked him if, while he was in town, he would take a day to offer his Natural Language Processing with Python workshop. Fortunately, he agreed. Don't miss this opportunity! Full details and registration at: http://nlp-python-workshop-seattle.eventbrite.com WORKSHOP OVERVIEW Natural Language Processing (NLP) is often taught at the academic level from the perspective of computational linguists. However, as data scientists, we have a richer view of the natural language world: unstructured data that by its very nature has latent information that is important to humans. NLP practitioners have benefited from machine learning techniques to unlock meaning from large corpora, and in this class we’ll explore how to do that, particularly with Python and the Natural Language Toolkit (NLTK). NLTK is an excellent library for machine-learning-based NLP, written in Python by experts from both academia and industry. Python allows you to create rich data applications rapidly, iterating on hypotheses. The combination of Python + NLTK means that you can easily add language-aware data products to your larger analytical workflows and applications. WHAT YOU WILL LEARN In this course we will begin by exploring NLTK through the corpora it already comes with, and in this way we will get a feel for the various features and functionality that NLTK has. This will take up the first part of the course.
However, most NLP practitioners want to work on their own corpora, so during the second half of the course we will focus on building a language-aware data product from a specific corpus: a topic identification and document clustering algorithm built from a web crawl of blog sites. The clustering algorithm will use a simple Lesk K-Means clustering to start, and then will improve with an LDA analysis. COURSE OUTLINE The following represents the one-hour modules that will make up the course. Part One: Using NLTK
- Introduction to NLTK: code + resources = magic
- The counting of things: concordances, frequency distributions, tokenization
- Tagging and parsing: PoS tagging, NERC, syntactic parsing
- Classifying text: sentiment analysis, document classification
Part Two: Building an NLP Data Product
- Using the NLTK API to wrap a custom corpus
- Word vectors for K-Means clustering
- LDA for topic analysis
Notably not mentioned: morphology, n-gram language models, search, raw text preprocessing, word sense disambiguation, pronoun resolution, language generation, machine translation, textual entailment, question and answer systems, summarization, etc. After taking this workshop, students will be able to create a Python module that wraps their own corpora and begin to leverage NLTK tools against it. They will also have an understanding of the features and functionality of NLTK, and a working knowledge of how to architect applications that use NLP. Finally, students who complete this course will have built an information extraction system that performs topic analyses on a corpus of documents. PREREQUISITES This course is an intermediate Python course as well as an intermediate Data Science course. Students will be expected to have a beyond-beginner knowledge and understanding of both Python and software development, as well as of the analytical and mathematical techniques used in Data Science.
In particular, students will be required to have the following knowledge and preparations before the course:
- Python installed on their system
- Knowledge of how to write and execute Python programs
- Understanding of how to use the command line
- NLTK installed along with all corpora and NLTK Data
- Knowledge of the English language (adjectives, verbs, nouns, etc.)
- Basic probability and statistical knowledge
Full details and registration at: http://nlp-python-workshop-seattle.eventbrite.com INSTRUCTOR: BENJAMIN BENGFORT Benjamin Bengfort is a Data Scientist who lives inside the beltway but ignores politics (the normal business of DC), favoring technology instead. He is currently working to finish his PhD at the University of Maryland, where he studies machine learning and distributed computing. His focus is on highly consistent local distributed storage and visual diagnostics for data modeling. The lab next door does have robots and, much to his chagrin, they seem to constantly arm said robots with knives and tools; presumably to pursue culinary accolades. Having seen a robot attempt to slice a tomato, Benjamin prefers his own adventures in the kitchen, where he specializes in fusion French and Guyanese cuisine as well as BBQ of all types. A professional programmer by trade and a Data Scientist by vocation, Benjamin's writing pursues a diverse range of subjects, from Natural Language Processing to Data Science with Python to analytics with Hadoop and Spark.
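As a tiny preview of the "counting of things" module, here is tokenization and a frequency distribution in plain Python. NLTK's word_tokenize and FreqDist do these same jobs with far more linguistic care; the sentence below is just sample text:

```python
from collections import Counter
import re

text = "The quick brown fox jumps over the lazy dog. The dog sleeps."

# A naive tokenizer: lowercase, then pull out runs of letters.
tokens = re.findall(r"[a-z]+", text.lower())

# A frequency distribution is just a count of each token.
freq = Counter(tokens)

print(freq.most_common(2))  # [('the', 3), ('dog', 2)]
```

NLTK's FreqDist supports the same most_common-style queries, plus plotting, hapaxes, and conditional distributions, which is where the workshop picks up.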
- Data Product Architectures - with Benjamin Bengfort
NOTE: only the West doors of the Rainier Square building will be unlocked after 5 pm. Please make sure to enter the building from 4th St. Benjamin Bengfort (https://www.linkedin.com/in/bbengfort) of District Data Labs (http://www.districtdatalabs.com/) is coming to town for Data Day Seattle (http://datadayseattle.com/). We asked him if, while he was in town, he would spend an evening with the community. He said yes. This is a great opportunity. If your company would like to host this presentation, send a note to data at lynnbender dot com ([masked]). Data Product Architectures Data products derive their value from data and generate new data in return; as a result, machine learning techniques must be applied to their architecture and their development. Machine learning fits models to make predictions on unknown inputs, and those models must be _generalizable_ and _adaptable_. As such, fitted models cannot exist in isolation; they must be operationalized and user-facing so that applications can benefit from the new data, respond to it, and feed it back into the data product. Data product architectures are therefore _life cycles_, and understanding the data product life cycle will enable architects to develop robust, failure-free workflows and applications. In this talk we will discuss the data product life cycle, exploring how to connect a model build, evaluation, and selection phase with an operation and interaction phase. Following the lambda architecture, we will investigate wrapping a central computational store for speed and querying, and incorporate a discussion of monitoring, management, and data exploration for hypothesis-driven development. From web applications to big data appliances, this architecture serves as a blueprint for handling data services of all sizes! Speaker Bio Benjamin Bengfort (https://www.linkedin.com/in/bbengfort) is a Data Scientist who lives inside the beltway but ignores politics (the normal business of DC), favoring technology instead.
He is currently working to finish his PhD at the University of Maryland, where he studies machine learning and distributed computing. His focus is on highly consistent local distributed storage and visual diagnostics for data modeling. The lab next door does have robots and, much to his chagrin, they seem to constantly arm said robots with knives and tools; presumably to pursue culinary accolades. Having seen a robot attempt to slice a tomato, Benjamin prefers his own adventures in the kitchen, where he specializes in fusion French and Guyanese cuisine as well as BBQ of all types. A professional programmer by trade and a Data Scientist by vocation, Benjamin's writing pursues a diverse range of subjects, from Natural Language Processing to Data Science with Python to analytics with Hadoop and Spark. Agenda 6:30PM - Networking 7:00PM - Featured Talk 8:30PM - Adjourn to pub
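The life cycle the abstract describes (build a model, operationalize it, collect the new data it generates, and feed that back into the next build) can be sketched minimally in plain Python. The mean-predicting "model" and the numbers here are purely illustrative:

```python
def fit(observations):
    """Build phase: 'train' a trivial model that predicts the mean."""
    mean = sum(observations) / len(observations)
    return lambda _x: mean

def serve(model, requests, feedback_log):
    """Operation phase: answer requests, logging outcomes as new data."""
    for x, actual in requests:
        _prediction = model(x)      # the user-facing part
        feedback_log.append(actual)  # new data generated by usage

history = [10.0, 12.0, 14.0]
feedback = []

model_v1 = fit(history)
serve(model_v1, [(None, 20.0)], feedback)

# Next cycle: refit on the original data plus the collected feedback.
model_v2 = fit(history + feedback)
print(model_v1(None), model_v2(None))  # 12.0 14.0
```

The point of the architecture talk is everything this sketch omits: evaluating and selecting among candidate models, wrapping the central store so both the speed (serving) and batch (refitting) paths see consistent data, and monitoring the loop in production.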