
Distributed Computing with Hydra and HPCC Systems

Hosted By
Matt A.

Details

We will follow the two-speaker format again for this meetup. The meetup will feature two great open-source distributed systems that aren't named Hadoop! Matt Abrams from AddThis will be presenting Hydra (https://github.com/addthis/hydra). The second talk will discuss large-scale entity extraction using LexisNexis' High Performance Computing Cluster (http://hpccsystems.com/) (HPCC Systems).

AddThis is sponsoring the beer for this event, and LexisNexis is sponsoring the pizza. LexisNexis has also provided a Kindle Fire HDX that we will be raffling off at the event!

Hydra: An Introduction

Matt Abrams will present a practical introduction to using Hydra for distributed data processing. This talk will focus on technical execution rather than the low-level design and theory behind Hydra. Several examples of different types of Hydra jobs and queries will be discussed in order to demonstrate Hydra's capabilities. The goal of this talk is to give the audience a core understanding of what Hydra is and how to use it to solve data challenges. The examples discussed during the presentation will be available for download so that you can replicate the experiments in your own development environment.

Title: Large-scale Entity Extraction and Probabilistic Record Linkage

Short Description: Large-scale entity extraction, disambiguation and linkage in Big Data can challenge the traditional methodologies developed over the last three decades. Entity linkage, in particular, is a cornerstone of a wide spectrum of applications, such as Master Data Management, Data Warehousing, Social Graph Analytics, Fraud Detection and Identity Management. Traditional rules-based heuristic methods usually don't scale properly, are language-specific, and require significant maintenance over time.

We will introduce the audience to the use of probabilistic record linkage, also known as specificity-based linkage, on Big Data to perform language-independent, large-scale entity extraction, resolution and linkage across diverse sources. We will also present a live demonstration reviewing the different steps required during the data integration process (ingestion, profiling, parsing, cleansing, standardization and normalization), and show the basic concepts behind probabilistic record linkage in a real-world application.
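For readers curious about the core idea before the talk, here is a minimal Python sketch of specificity-based matching: agreement on a rare field value (a surname like "Kowalczyk") carries more linking weight than agreement on a common one (like "Smith"). The field names, sample records, weighting scheme, and threshold are illustrative assumptions only and are not drawn from SALT or the HPCC Systems implementation.

# A minimal sketch of specificity-based (probabilistic) record linkage.
# Field values that are rare in the population contribute more evidence
# when they agree; common values contribute less. Field names, scoring,
# and threshold are illustrative assumptions, not SALT's actual method.
import math
from collections import Counter

records = [
    {"id": 1, "last_name": "Smith",     "city": "Vienna"},
    {"id": 2, "last_name": "Smith",     "city": "Arlington"},
    {"id": 3, "last_name": "Smith",     "city": "Reston"},
    {"id": 4, "last_name": "Kowalczyk", "city": "Vienna"},
    {"id": 5, "last_name": "Kowalczyk", "city": "Vienna"},
]

FIELDS = ["last_name", "city"]

# Specificity of a value = -log2(relative frequency) within the data set.
counts = {f: Counter(r[f] for r in records) for f in FIELDS}
total = len(records)

def specificity(field, value):
    return math.log2(total / counts[field][value])

def link_score(a, b):
    """Sum specificity weights over fields where the two records agree."""
    return sum(specificity(f, a[f]) for f in FIELDS if a[f] == b[f])

# Pairs scoring above an (assumed) threshold are proposed as the same entity.
THRESHOLD = 1.5
for i, a in enumerate(records):
    for b in records[i + 1:]:
        score = link_score(a, b)
        if score > THRESHOLD:
            print(f"link {a['id']} <-> {b['id']}  score={score:.2f}")

In this toy data set, only the pair sharing the rare surname and the same city crosses the threshold, while pairs agreeing only on the common surname "Smith" do not; that asymmetry is the intuition behind specificity-based linkage.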

Speaker:

Joe Barter is a Consulting Software Engineer for LexisNexis Special Services, where he is involved in a variety of “big data” projects pertaining to LexisNexis’ High Performance Computing Cluster (HPCC Systems). Joe has worked extensively with the platform since 2008 and with HPCC Systems’ Scalable Automated Linking Technology (SALT) since 2010. Joe’s expert knowledge of SALT has enabled him to develop solutions that address complex entity disambiguation and non-obvious relationship problems, particularly in his role as a key contributor to Smart View.

Joe is a graduate of the University of Dayton with over 25 years of software development experience. While he cut his coding teeth with IBM’s 360/370 assembly, his primary development languages are currently SALT, ECL, and Java. He has a passionate interest in employing advanced analytics on “big data” to produce actionable information. Other interests include Machine Learning and Natural Language Processing.

AI Performance Engineering Meetup (Arlington VA)
AddThis HQ
1595 Spring Hill Road, Suite 300 · Vienna, VA