- Microservices, Event Streaming and Machine Learning
5:15 PM Members arrive - Pizza, soft drinks and beer
6:00 PM Announcements & Recruiting Shoutouts
6:15 PM Speakers (25 min each)

Seeking speakers and sponsors: Contact us if you would like to speak at future events!

Agenda

1) Microservices and Event Streaming - Steve Howard
Microservices have historically followed a request-response pattern, with a single database per microservice. Tonight, we will discuss how the observer pattern lets event streams serve as the source of data for microservices.

2) MLaaS (Machine Learning as a Service) - Sagar Kewalramani
Managing and deploying machine learning models is another paradigm shift from the traditional software development life cycle. In this talk, we will demonstrate a proof of concept in which a financial enterprise moved from a traditional Jupyter-notebook environment to models deployable behind a REST API for several kinds of fraud detection, including credit card fraud, mobile payment fraud, and deposit fraud.

3) Distributed NoSQL Warehouses - Sagar Mangam
Developing and deploying schema-agnostic analytical databases.

Speaker Bios

1) Steve Howard, Systems Engineer, Confluent
Steve Howard is a Systems Engineer with Confluent. His background includes enterprise architecture, data engineering, application development, and infrastructure management. He has led enterprise initiatives including customer analytics, enterprise integration, and ecommerce. Steve's most recent role was Principal Architect at EXPRESS, an American fashion retailer. Steve has a degree in Finance from Bowling Green and is based in Columbus, Ohio. His passion is getting the right information into the right hands at the right time to drive results, and to have fun in the process.

2) Sagar Kewalramani, Strategic Solution Architect & Data Scientist, Cloudera
Sagar Kewalramani is a Strategic Solution Architect & Data Scientist at Cloudera, where he helps customers install, build, secure, optimize, and tune their big data environments. Sagar has worked with customers across verticals, including banking, manufacturing, healthcare, and retail. He has led the discovery and development of big data and machine-learning applications to accelerate digital business and simplify data management and analytics. He has spoken at multiple Hadoop and big data conferences, including O'Reilly Strata.

3) Sagar Mangam, Senior Consultant, Navigator Management Partners
Sagar Mangam has spent 14+ years in the Business Intelligence and Analytics domain. He started his career as an Excel programmer and worked his way through the plethora of tools and technologies in the BI space. In his current role, he works with agencies and helps them adopt and implement projects in the big data ecosystem. Sagar has an MBA in Operations and Analytics from the Fisher School of Business.
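The pattern from the first talk can be sketched in a few lines: rather than querying a shared database on demand, a service subscribes to an event stream and builds its own local view from the events it observes. Below is a minimal plain-Python sketch, with an in-memory stand-in for a real broker such as Kafka; the event shapes and service name are invented for illustration, not taken from the talk.

```python
from collections import defaultdict

class EventStream:
    """Minimal in-memory stand-in for a Kafka-style topic."""
    def __init__(self):
        self.observers = []
        self.log = []  # append-only event log

    def subscribe(self, observer):
        self.observers.append(observer)

    def publish(self, event):
        self.log.append(event)          # the stream is the source of truth
        for observer in self.observers:  # observer pattern: notify subscribers
            observer.on_event(event)

class OrderCountService:
    """A 'microservice' whose data comes from the stream, not a shared DB."""
    def __init__(self):
        self.orders_per_customer = defaultdict(int)

    def on_event(self, event):
        if event["type"] == "order_placed":
            self.orders_per_customer[event["customer"]] += 1

stream = EventStream()
service = OrderCountService()
stream.subscribe(service)

stream.publish({"type": "order_placed", "customer": "acme"})
stream.publish({"type": "order_placed", "customer": "acme"})
stream.publish({"type": "payment_received", "customer": "acme"})

print(service.orders_per_customer["acme"])  # → 2
```

Because the log is retained, a new service can subscribe later and replay it to build a completely different local view of the same events.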
- MOHUG is relaunching as MODUG at a new location
5:15 PM Members arrive - Pizza, soft drinks and beer
6:00 PM Announcements & Recruiting Shoutouts
6:15 PM Speakers (25 min each)

Seeking speakers and sponsors: Contact us if you would like to speak at future events!

Agenda

1) Kafka 101 - Patrick Druley
Apache Kafka fundamentals for architects, admins, and developers!

2) Building a real-time CDC pipeline - Meher Bezawada
Using Kafka Connect to load transactions from an RDBMS source into Hadoop.

Bios

1) Patrick Druley is a Systems Engineer at Confluent Inc, the event streaming platform company. His career has focused on databases and data warehousing, and has included various analytics and big data projects with companies across multiple industries while working at Teradata and Oracle. In addition to talking to folks about Apache Kafka and stream processing every day at Confluent, he is also the Confluent Cloud subject matter expert for the East team. Patrick is an Ohio native, currently based out of Medina, OH.

2) Meher Bezawada is a Sr. Consultant at Navigator Management Partners, where he is a Hadoop technical architect and developer. Over his tenure, Meher has worked with a number of clients to implement and optimize their Hadoop deployments and help them maximize the value of their investment.
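For a query-based flavor of the CDC pipeline in the second talk, Kafka Connect's JDBC source connector can poll an RDBMS table and publish new or updated rows to a topic. Here is a hedged sketch of such a connector config, built as a Python dict ready to POST to the Connect REST API; the connector name, connection URL, and table/column names are placeholders, not details from the talk (log-based CDC tools work differently and need their own connectors).

```python
import json

# Hypothetical connector definition; every name below is a placeholder.
connector_config = {
    "name": "transactions-source",
    "config": {
        # Confluent's JDBC source connector class
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:oracle:thin:@db-host:1521/ORCL",
        "mode": "timestamp+incrementing",   # pick up both new and updated rows
        "timestamp.column.name": "UPDATED_AT",
        "incrementing.column.name": "TXN_ID",
        "table.whitelist": "TRANSACTIONS",
        "topic.prefix": "rdbms.",           # rows land on topic rdbms.TRANSACTIONS
        "poll.interval.ms": "5000",
    },
}

# This JSON would be POSTed to the Connect REST API,
# e.g. http://connect-host:8083/connectors
payload = json.dumps(connector_config, indent=2)
print(payload)
```

From there, a sink connector (HDFS, for example) on the same Connect cluster can drain the topic into Hadoop, completing the RDBMS-to-Hadoop path the talk describes.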
- Finally! We're having another MOHUG!
It's been a long time coming, but we're getting back on track. The Fuse data nerds have been locked away in the dev cave, but we're ready to come out and talk about all things big data. Hope you can make it.

Topics

Big Data and Machine Learning - Joe Intrakamhang from Google
Learn about BigQuery, a fully managed, petabyte-scale, serverless database. We will also run a live machine learning demo using TensorFlow and then productionize it with Google's Cloud ML service.
Joe works at Google as a Solutions Engineer on the Google Cloud team. In this role, he focuses on architecting and designing solutions for companies migrating to Google Cloud. He is a passionate developer who loves technology and continuously works on fine-tuning his software craftsmanship.

Stream Processing for Analytic Workloads - Ron Buckley from Hortonworks
Stream processing has become the de facto standard for building real-time ETL and stream analytics applications. We see batch workloads move into stream processing to act on data and derive insights faster. With the explosion of data carrying "perishable insights", such as IoT and machine-generated data, stream processing plus predictive analytics is driving tremendous business value. This is evidenced by the explosion of stream processing frameworks: the proven and evolving Apache Storm, and newer frameworks such as Apache Flink, Apache Apex, and Spark Streaming. Today, users have to choose among these frameworks and understand the benefits of each; on top of that, they have to learn new APIs and operationalize their applications. To create value faster, we are introducing a new open source tool, Streamline. It is a self-service tool that eases building streaming applications and deploying them across whichever frameworks/engines users prefer. It simplifies integration with machine learning models for scoring and classification of data for predictive analytics. It also provides an elegant way to build analytics dashboards that derive business insights from streaming data and let business users consume them easily. In this talk, we will outline the fundamentals of real-time stream processing and demonstrate Streamline's capabilities to show how it simplifies building real-time streaming analytics applications.
Ron Buckley is a Solutions Engineer at Hortonworks. Prior to Hortonworks, Ron worked on teams at Nationwide Children's Research Institute and OCLC, implementing Hadoop for healthcare and library-centric systems. Ron has presented multiple times on HBase at HBaseCon and various other events.

Google Cloud Spanner: Worldwide consistent database at scale - Joe Intrakamhang from Google
A worldwide consistent database that is fully managed. What is this database you speak of? It is a product from Google called Cloud Spanner. In this talk, we will share an overview, explain how Cloud Spanner works, and give an awesome demo.
- October Edition of MOHUG
Thank you to Sandy Simpson (https://www.linkedin.com/in/sandysimpson) from Illumination Works (http://www.illuminationworksllc.com/) for sponsoring this October edition of the MOHUG Meet-Up!

There has been significant growth in the volume, variety, and velocity of digital data being produced by and shared between computing systems today. Behavioral, social, sensor, and spatial data sources have exploded, personal and wearable devices have become ubiquitous, cloud computing has become commonplace, and wireless connectivity has become nearly universal. Traditional data warehousing is unable to keep up with the vast amount of information being created by the minute. Advances in information technology infrastructure and environments, such as Hadoop and similar data lake concepts, are enabling greater scalability, utilization, and flexibility when dealing with free text data, log files, and other non-traditional data sources. This session will examine the shortcomings of traditional data warehousing and discuss best practices, lessons learned, and important considerations in transitioning to a new big data environment while leveraging existing EDW investments.
Shawn Huntington (https://www.linkedin.com/in/shawn-huntington-pmp-50737b27) has over seventeen years of Information Technology experience across a broad range of data capabilities. He has a passion for keeping up with technological advances in the big data and advanced analytics revolution. With hands-on experience in big data implementations as an Illumination Works (http://www.illuminationworksllc.com/) Senior Consultant, Shawn understands how to effectively transition from traditional data warehouses to modern big data platforms. Most recently, Shawn and his team developed and enhanced a Cloudera Hadoop environment for the world's leading provider of critical infrastructure technologies and life cycle services for information and communications technology systems.

Data Ingestion Patterns - Jordan Martz from Attunity
This talk highlights ingestion methods from CDC to batch, covering native Hadoop tooling, third-party real-time tools, and enterprise vendor solutions. At a high level, it addresses the ingestion concerns and operational issues around landing and managing the data.
Jordan has extensive experience delivering enterprise IT solutions, including software development for web, core/enterprise, and client/server applications, and implementing analytics solutions such as business intelligence, machine learning/artificial intelligence, simulation, optimization, predictive analytics, data warehousing, big data, master data management, and data governance. Prior to Attunity, Jordan worked at Oracle, Domino's Pizza (as Hadoop/BI/data warehouse architect), Kalido, and Information Builders. He is also the founder and lead data scientist of DataMartz (http://www.datamartz.com/) (a lucky pun on his last name and profession).

Leveraging Hadoop - Scott Howser from Nutonian
Scott (https://www.linkedin.com/in/scotthowser) serves as Executive Vice President at Nutonian (http://www.nutonian.com/), with responsibility for product management, marketing, and business development. Prior to joining Nutonian, Scott served as vice president of products and marketing at Hadapt, with responsibility for product management and marketing. Before Hadapt, Scott was vice president of product marketing at Vertica, an HP Company, with responsibility for product messaging, corporate branding, and establishing best practices for deployment and solutions architectures. Scott earned an M.B.A. from the University of Notre Dame, an M.S.I.S.M. from Loyola University Chicago, and a bachelor's degree from Ohio Dominican University.
- August edition of MOHUG
Happy 1st birthday MOHUG! It's been one year since we went public and officially started with Meetup.

Topics

Doing Data Science with Apache Spark - Dong Meng from MapR (https://www.mapr.com/blog/author/dong-meng)
Spark is a distributed computational framework that makes data science practical over huge datasets. This presentation will open with an introduction to Spark core, then dive into use cases: running ad-hoc analytical queries with Spark SQL, building machine learning pipelines with MLlib, and doing graph modeling with GraphX.

Impala performance benchmarks and use cases - Derek Kane from Cloudera (https://www.linkedin.com/in/derek-kane-13a655)

Security in the cluster - Erik Nor from Moser Consulting (https://www.linkedin.com/in/eriknor)
As data in enterprise Hadoop clusters continues to grow, securing that data remains an important part of any implementation, yet it is often an afterthought. This presentation will cover best practices for securing a cluster, including authentication via Kerberos, authorization, ongoing administration, auditing via Ranger, access via Knox, and encryption via TDE, SASL, and SSL. It will demonstrate why each aspect of security is needed, how it is implemented, and what each tool does to protect the data. If time allows, live examples of how the tools are configured and how they protect your data will be shown.

Speaker Bios

Derek (https://www.linkedin.com/in/derek-kane-13a655) has spent the last 20 years building solutions with data. For ten of those years he was with JP Morgan as a Lead Architect. As part of the JP Morgan innovation team, Derek led the creation of big data solutions for an organization that managed $2 trillion in assets. He has also built out multiple Centers of Excellence covering business intelligence and data visualization, and holds a patent for an application that manages the total cost of ownership of technology solutions. Derek has worked at Cloudera as a Systems Engineer since 2015 and is based in Columbus, Ohio.

Erik (https://www.linkedin.com/in/eriknor) is a Principal Consultant and Big Data Tech Lead at Moser Consulting. He has been working with Hadoop since 2012, when he naively installed it onto a cluster of Solaris servers. Since then he has become certified in a variety of distributions and travels around the country architecting, implementing, and supporting solutions for clients big and small.

Dong Meng (https://www.linkedin.com/in/dong-meng-6b7a7b19) is a Data Scientist at MapR, focused on building data science solutions for customers that leverage the MapR tech stack. He has several years of experience in machine learning, data mining, and big data software development. Previously, Dong was a senior data scientist at ADP, where he built a machine learning pipeline and data products to power ADP Analytics. Prior to ADP, Dong was a staff software engineer at SPSS, where he helped build Analytic Catalyst (now part of Watson Analytics). During graduate study he served as a research assistant at The Ohio State University, where he concentrated on compressive sensing and solving point estimation problems from a Bayesian perspective.
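The MLlib pipelines mentioned in the Spark talk chain stages that each expose fit and transform steps. To show the shape of that idea without a Spark cluster, here is a toy plain-Python analogue; the stage names echo MLlib's Tokenizer and CountVectorizer, but this is a sketch of the pattern, not Spark code.

```python
class Tokenizer:
    """Stage: split text into words (analogous to MLlib's Tokenizer)."""
    def fit(self, data):
        return self  # stateless stage, nothing to learn
    def transform(self, data):
        return [doc.lower().split() for doc in data]

class CountVectorizer:
    """Stage: learn a vocabulary, then map docs to term-count vectors."""
    def fit(self, docs):
        self.vocab = sorted({w for doc in docs for w in doc})
        return self
    def transform(self, docs):
        return [[doc.count(w) for w in self.vocab] for doc in docs]

class Pipeline:
    """Chain stages, fitting each on the output of the previous one."""
    def __init__(self, stages):
        self.stages = stages
    def fit_transform(self, data):
        for stage in self.stages:
            data = stage.fit(data).transform(data)
        return data

pipe = Pipeline([Tokenizer(), CountVectorizer()])
vectors = pipe.fit_transform(
    ["spark makes pipelines easy", "pipelines chain stages"]
)
print(len(vectors), len(vectors[0]))  # → 2 6  (two docs, six-word vocabulary)
```

In real MLlib the same chain runs distributed over DataFrames, and the fitted pipeline can be saved and reused for scoring, which is what makes the pattern useful beyond a notebook.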
- April Edition of MOHUG
Topics

How to Keep your Friends Close and your Enemies Even Closer: Shared Ontologies and Best Practices - Jeff Young from OCLC
Creating ontologies is hard. Using a graph-based approach to data facilitates vocabulary and data reuse, and doing it well should make it easier to address new business opportunities by simply expanding the graph.
Speaker Bio: Jeff Young is a Software Architect at OCLC Research. Jeff has served on technical committees for various NISO standards and was a coauthor of the W3C Library Linked Data Incubator Group Final Report (https://www.w3.org/2005/Incubator/lld/XGR-lld-20111025/) in 2011.

Hadoop and Data Exploration for the other 95% - Brandon Culver from Cardinal Health
Tech and data are cool. Unfortunately, the work is hard for most people to do, let alone to understand why they should invest in the 'magical black box of analytics'. Let's take a break from the cool tech stuff and review a few of my experiences with 'self-service Hadoop', and how you can enable your users to see the value of tech and data through self service.
Speaker Bio: Brandon Culver is a Senior Advisor of Data Analytics at Cardinal Health. With experience as an IT professional as well as many years in sales and marketing, he constantly looks for opportunities to wear his Lean Six Sigma black belt and connect speed and quality in each person's gemba.

Hawq - A true SQL engine for Hadoop - Shailesh Doshi from Pivotal
Over 50% of enterprises struggle to get more value out of their existing data lakes. To unlock the true value of the data stored within Hadoop, enterprises need a simpler way to connect to the data to query, analyze, and perform deep analytics. While MapReduce and other powerful paradigms exist, nothing compares to the ease of SQL. SQL support on Hadoop began with Apache Hive, a SQL-like query engine that compiles a limited subset of SQL to MapReduce or Tez jobs; however, its latencies and limited SQL support make it useful primarily for batch-mode operation. HDB (aka Hawq) provides enterprise-class SQL-based analysis capability on Hadoop.
Speaker Bio: Shailesh is a Data Specialist at Pivotal Software. He specializes in MPP and big data technologies, helping Fortune 500 companies plan and implement analytics and data science initiatives focused on smart app builds, profitability, IoT, machine learning, and in-memory/cloud-native systems.
- February edition of MOHUG
We had to make a slight change to this meetup as we had a conflict here at Fuse (and I lost). We'll be meeting on Thursday this time instead of Tuesday.

Topics

Kibana, Timelion, and Graph API: Three ways to explore data within Elasticsearch - Nick Drost from Elastic
Kibana: Intuitively discover meaningful insights into your data in near real-time. In this scenario we will show how you can use the power of search and analytics to deep-dive into web logs and investigate what's interesting. Oh yeah, dark-themed dashboards are back.
Timelion: Taking time series data in Elasticsearch and Kibana to the next level.
Graph API: Explore interesting connections in your data using the Elasticsearch Graph API. This is a pre-release feature that uses the power of Elasticsearch relevancy to tune into the useful signals in your data.
No experience with Elasticsearch is needed. Just bring your curiosity to explore data!
Speaker Bio: Nick Drost, a solution architect at Elastic, has over 15 years of industry experience as an employee, consultant, and pre-sales architect working with numerous open systems and distributed technologies, in the cloud and on premises.

Marketing Analytics on Big Data - Paul Mazak from Impact Radius
Come hear a case study about a next-gen platform in the marketing analytics space. We leverage Spark, HBase, and Impala to get the performance we need. You'll see how our focus on flexibility allows us to keep up with a fast-paced business strategy.
Bio: Paul is a Software Engineer at Impact Radius. He leads the Hadoop development on their Media Manager Attribution product. He enjoys innovation and working with a creative team! Paul is always thinking of ways to improve processes, frameworks, and automated testing. He currently subscribes to the motto that "anything can be automated" (...except the parenting of his 3 kids!)

SQL on HBase: Phoenix by Example - Alex Daher from Hortonworks
Apache Phoenix is a relational database layer over HBase that allows low-latency queries and mutations over HBase data. We'll discuss what it's good for, its architecture, SQL syntax, indexes, practical examples, and the future roadmap.
Bio: Alexander Daher is a Solution Engineer with Hortonworks. Alex likes to innovate and tinker aimlessly and vicariously with software development languages and frameworks, usually late at night. Favorite food: all (and it shows). Favorite quote: "Success consists of going from failure to failure without loss of enthusiasm." Favorite vacation spot: the Bavarian Alps. Favorite sports: college football and Yankees baseball.
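To give a flavor of the "by example" part of the Phoenix talk: Phoenix exposes familiar-looking SQL over HBase, with UPSERT in place of INSERT and secondary indexes to avoid full table scans. The statements below follow Phoenix's SQL grammar, but the table and column names are invented for illustration; with a driver such as phoenixdb (or sqlline.py) they would be executed against a running Phoenix cluster, which this sketch does not assume.

```python
# Hypothetical table/column names; statements follow Phoenix's SQL grammar.
ddl = """CREATE TABLE IF NOT EXISTS web_stat (
    host VARCHAR NOT NULL,
    visit_date DATE NOT NULL,
    usage_core BIGINT
    CONSTRAINT pk PRIMARY KEY (host, visit_date))"""

# Phoenix has no INSERT; UPSERT VALUES inserts or updates in one statement.
upsert = "UPSERT INTO web_stat VALUES ('NA', CURRENT_DATE(), 100)"

# A secondary index so queries filtering on usage_core avoid a full scan.
index = "CREATE INDEX usage_idx ON web_stat (usage_core)"

query = "SELECT host, SUM(usage_core) FROM web_stat GROUP BY host"

# With a Phoenix connection these would each be passed to cursor.execute();
# here we just list the statement types.
for stmt in (ddl, upsert, index, query):
    print(stmt.split()[0])
```

The composite PRIMARY KEY in the DDL becomes the HBase row key, which is what makes point lookups and range scans on (host, visit_date) fast.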
- December edition of MOHUG - early this month!
Take note that we're meeting early this month! Our usual schedule (last Tue of every even month) is not the best fit for the holiday season, so we're moving it up to December 8th. We're still meeting at Fuse as normal.

Topics

Knowledge from Noise: Geospatial Analytics at Progressive - Brian Durkin from Progressive
How do you visually analyze trillions of records using only millions of pixels? Progressive needed to solve this big data challenge with Snapshot, its industry-leading usage-based insurance offering. Learn how data scientists on Progressive's product research and development team integrated Hadoop, D3, and Tableau into the technology stack to enable quick data exploration and rapid hypothesis testing. See how noisy vehicle telemetry data can lead to unexpected results... and new insights.
Brian Durkin is an innovation strategist in Progressive's Enterprise Architecture organization. Throughout his eleven years at Progressive he has played many roles, ranging from application developer to enterprise architecture consultant, the common thread being a passion for making data more useful. He is currently part of the product research and development team focusing on geospatial analytics for usage-based insurance, where he uses technology to power data exploration, ideation, and rapid hypothesis testing on big datasets.

Overview of how running Hadoop in AWS differs from running traditional on-premises clusters - Erik Swensson from Amazon Web Services
With all the press around AWS building new data centers here in central Ohio, I thought it would be great if we could get it straight from the source. If you've thought about running Hadoop in the cloud but have a bunch of questions, this is a talk you don't want to miss.
Erik is an experienced cloud solution architect who has been helping companies utilize the cloud to drive their business for 5+ years. He is the author of the Big Data Analytics Options on AWS (https://d0.awsstatic.com/whitepapers/Big_Data_Analytics_Options_on_AWS.pdf) whitepaper and a few big data blog posts, which can be found here (https://blogs.aws.amazon.com/bigdata/blog/author/Erik+Swensson). He is currently a Solution Architect & Manager at AWS.

Kudu: A new storage layer for Hadoop - Brandon Freeman from Cloudera
Storing data in Hadoop generally means a choice between HDFS and Apache HBase. The former is great for high-speed writes and scans; the latter is ideal for random-access queries - but you can't get both behaviors at once. Kudu, the new storage engine built by Cloudera, combines the best of both HDFS and HBase in a single package and could make Hadoop a general-purpose data store with uses far beyond analytics. This presentation will give an overview of Kudu, the motivations behind creating it, and what's available in the beta today. After the presentation, a demonstration of Kudu will showcase fast analytics on fast data. Additionally, Kudu and Impala have been submitted to the Apache Incubator. Learn more about Kudu here (http://getkudu.io/).
Brandon recently joined Cloudera from Explorys in Cleveland, OH, where he was the infrastructure architect for hundreds of Hadoop nodes in production and non-production environments. As one of the architects, he was responsible for scalability, stability, performance, hardware selection, and assessing various technologies for adoption within Explorys.
- Next meetup
Topics

Manage Your Dataflow with NiFi - Shawn Hooton
What is Apache NiFi? Put simply, NiFi was built to automate the flow of data between systems. Come experience Apache NiFi as Shawn demonstrates some common ingestion patterns seen in many different enterprises. Click here (https://nifi.apache.org/) to read more about NiFi.

Rock your data with Zeppelin - Jeff Graham
Check out how we're using Zeppelin for collaborative data analytics and visualizations at Fuse. Zeppelin is an Apache incubator project for distributed, general-purpose data processing systems such as Apache Hive, Spark, and Flink, with a number of other interpreters. Click here (http://zeppelin-project.org/) to read more about Zeppelin.

Identifying copied clinical notes using HBase - Ron Buckley
Medical clinicians are busy. Sometimes, when updating a record, it's faster to copy and paste data from existing notes than to re-enter the same data. How often does this happen, and what's the most frequently copied text? Find out, using Apache HBase.
Ron Buckley is the Big Data Manager for the Research Institute at Nationwide Children's Hospital. NCH Research is implementing a data strategy that brings together all the varying types of health data in a central system. Previously, Ron was a systems architect and development manager at OCLC, leading the implementation of Apache HBase as the central datastore for OCLC WorldCat. Ron has presented multiple times on HBase at HBaseCon and various other events.
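Ron's abstract doesn't say how the copied-note detection works, but one common approach to this kind of problem is shingling: hash every run of n consecutive words in each note, then measure how many hashes two notes share. Here is a toy plain-Python sketch of that idea; it is an illustration of the technique, not the actual HBase implementation from the talk, and the sample notes are made up.

```python
def shingles(text, n=4):
    """Hash each run of n words; shared hashes suggest copied passages."""
    words = text.lower().split()
    return {hash(" ".join(words[i:i + n])) for i in range(len(words) - n + 1)}

def overlap(note_a, note_b, n=4):
    """Fraction of note_b's shingles that also appear in note_a."""
    a, b = shingles(note_a, n), shingles(note_b, n)
    return len(a & b) / len(b) if b else 0.0

original = "patient reports mild chest pain radiating to left arm since tuesday"
copied = "follow up visit patient reports mild chest pain radiating to left arm"

score = overlap(original, copied)
print(round(score, 2))  # → 0.67: two-thirds of the newer note is shared text
```

At scale, the shingle hashes become row keys in a store like HBase, so that finding every note sharing a passage is a lookup rather than a pairwise comparison of all notes.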