Skill Level: any
Discipline: mainly developers, but business analysts, managers, data scientists, and others will find value.
Location: OC Tanner building. Please use South Entrance.
We'll use this meeting to continue the discussion around the Air Quality Competition (Fall 2014), focusing on the technical aspects of using Pig for ETL and data processing in an effort to win the competition.
Matt Davies will be leading this interactive session using real scripts and lessons learned from recent production deployments at a large specialty retailer.
Data is messy and ETL logic may be simple or complex, but we have Pig to make our lives simpler. The catch is that Pig is yet another language, with its own strengths and limitations compared to Java / C++ / Ruby / etc. The better we understand where it fits in the Hadoop ecosystem, the better we know which tool to use for which job.
We'll be covering topics such as:
• "Why Pig? When to use Hive, Java M/R, streams..."
• "How do I load data using Pig?"
• "How can I scrub data using Pig?"
• "How do I do a custom compute operation within Pig?"
• "WTH - why is this so darn slow?? I thought this was supposed to be fast?"
We'll end with a very brief overview of the amazing Lipstick tool created by Netflix, and how we can leverage its power on a daily basis.
Please tell your friends/enemies/coworkers - this will be useful to all skill levels and foci.
About the Speaker:
Matt Davies is Principal, Big Data Architecture for Miller & Associates, based out of Dallas, TX. He has recently deployed a novel Hadoop-driven software system that correlates various data feeds to provide a much higher level of insight around a "customer" and, along the way, has overcome difficult technical challenges that many said could not be solved.
Previously, Matt worked at Nike as a technical lead for their ID product line, at Symantec as a Principal Software Engineer, and at Tynt (now 33Across) as a Senior Software Engineer.