- Sanjay Radia, original Hadoop team member & hear about Manulife's Hadoop journey
After a short hiatus, the Toronto Hadoop User Group (THUG) is excited to announce its latest meetup on Nov 13th at 7pm. The evening will consist of two presentations. The first presentation will be an in-depth summary of the Hadoop journey at Manulife. Manulife has been rolling out a 100% cloud strategy for Hadoop globally to tackle the most challenging analytic use cases. Derek Hwong (AVP, Global Data Engineering and Operations) and Bill Graves (AVP, Data Management Services) will present on Manulife’s Hadoop strategy, lessons learned, and the forward-leaning data management vision. The second presenter for the evening is Sanjay Radia (Hortonworks founder and Chief Architect). Sanjay will provide a glimpse into several topics: an overview of Hadoop 3.0, advancements in cloud storage, the enterprise data management strategy with Data Plane, and how to obtain interactive Hive query performance with LLAP. Access to the event is limited by available space and restricted to those who have RSVP’d. Please note that only individuals on the RSVP list will be permitted access.
Derek Hwong’s Bio: Derek has many years of experience leading and building award-winning, innovative and patented data-driven solutions, advanced technology platforms and high-performance data ecosystems. He leads data engineering globally, building out data capabilities for the enterprise data lake and enabling analytics that ensure accessibility, consistent governance and well-modeled architecture to meet the needs of the business and regional divisions and to position the company competitively for the future. He is also responsible for global data operations.
Bill Graves’ Bio: In his current role, Bill is focused on enhancing the Data and Information Management capabilities of Manulife's Canadian Division. He enables and delivers data management solutions in support of the division's business strategy, including the introduction of Master Data Management and Enterprise Data Lake capabilities. He is also responsible for the Canadian Division's Data Governance program.
Sanjay Radia’s Bio: Sanjay is an Apache Hadoop committer and member of the Apache Hadoop PMC. Prior to co-founding Hortonworks, Sanjay was the architect of the Hadoop HDFS project at Yahoo!. He has also held senior engineering positions at Sun Microsystems and INRIA, where he developed software for distributed systems and grid/utility computing infrastructures. Sanjay has a PhD in Computer Science from the University of Waterloo in Canada.
- Live Webinar: Hadoop 2.x Introduction to Big Data and Hadoop Using Hive
Hello, We'd like to invite you to an expert live Webinar on 'Hadoop 2.x Introduction to Big Data and Hadoop Using Hive (http://www.mylanderpages.com/hadoop/Hadoop-Courses)', scheduled for Wednesday, 3rd August, 9:00PM to 10:45PM EST. The session agenda is as follows: TOPICS • Introduction to Big Data • Challenges of Big Data and Introduction to Hadoop • Hadoop definition and its characteristics • Hadoop ecosystem and Hadoop core components • Introduction to Hive, Components and Architecture of Hive • Tables in Hive, Data types and operations in Hive • Partitioning and Bucketing in Hive. This promises to be an extremely enriching session and we hope you can make it - Register Now (http://www.mylanderpages.com/hadoop/Hadoop-Courses). If you can't make it, sign up anyway and we'll send you the recording. Cheers!
- State of Resource Management in Hadoop & An Intro to Kudu
We will have 2 presentations this month from experienced implementers. The first presentation is a look at implementing products with YARN & Mesos from some of IBM's best. Our second presentation introduces Kudu, an open source project that could potentially revolutionize storage in the Hadoop community. Kudu is a scalable storage system that provides a happy middle ground between HDFS and random I/O filesystems.
1. State of Resource Management in Hadoop: Why should you care?
Speakers: Yong Feng and Khalid Ahmed
Yong Feng is a Software Architect at IBM Platform. He has more than 10 years of experience designing and implementing cluster, grid and cloud computing systems with a focus on scheduling, resource and workload management. He has deep knowledge of related open-source projects such as OpenStack, Mesos, Swarm, Kubernetes, Spark, etc. and leads the IBM teams who work on them. Yong holds a Ph.D. from Northwestern Polytechnical University of China.
Khalid Ahmed is an STSM, Chief Architect of Infrastructure Software at IBM Platform. He works on the design and architecture of large scale grid and cloud computing systems with a focus on scheduling, resource, workload and data management. With over 20 years of industry experience, he has worked in a number of roles including development, product management, and architecture. His latest interests include big data systems, container technology, and data center operating system concepts. Khalid has an M.A.Sc. from the University of Toronto.
2. Introduction to Kudu
Speaker: Mladen Kovacevic
Mladen Kovacevic is a Solutions Architect at Cloudera and has architected and developed end-to-end Hadoop applications providing meaningful insight for clients. He has operationalized Hadoop clusters targeted for multi-tenant use, meeting security and performance requirements, and specializes in the telco space. Mladen has over a decade of professional experience in software development, building industry-leading RDBMS technology as well as SQL on Hadoop. He has worked on SQL on Hadoop performance benchmarks, systems optimization and architecture for workload-optimized systems, and has architected Hadoop applications leveraging the entire ecosystem while contributing to open source projects such as the Kite SDK.
- Apache NiFi Introduction
Rob Sader will join us from Onyara, now Hortonworks, to discuss the Apache NiFi project and how it works. Apache NiFi is an extensible data processing and integration framework. NiFi can construct highly structured data flows with connectors into many traditional and Hadoop-related technologies. https://nifi.apache.org/ Rob's Bio: Rob is the Emerging Product Specialist for Hortonworks in Canada, responsible for educating organizations on new open source frameworks from the Apache Software Foundation. Within the Emerging Products Team, Rob focuses primarily on educating companies on the capabilities and architecture of Apache NiFi and how it fits into the broader Big Data Ecosystem. Prior to Hortonworks, Rob was a Co-Founder and Head of Business Development for Onyara, a startup that was formed to commercialize service and support for Apache NiFi and was sold to Hortonworks in August of 2015. Prior to Onyara, Rob led early business development efforts at a variety of enterprise software startups and was a Product Specialist at Salesforce.com.
- Architecture Review Session
We are bringing back our Architecture Review sessions by popular demand. This is open to speakers; please contact me at adam.muise at gmail.com
Sessions:
Richard Xu (Hortonworks) - YARN & Hive Tuning
Oliver Meyn ( http://www.gbif.org/ ) - A review of how the Global Biodiversity Information Facility
Adam Muise (Paytm Labs) - Realtime Data Pipeline with Hadoop, Kafka/Confluent, Spark Streaming, and Cassandra
Chen Zhang (Graph Intelligence) - Using Graph databases to enhance your analytics
- *-* FREE Internet Marketing Seminar - Toronto, CA *-*
2 Hours Internet Marketing Seminar - Free Ticket - Toronto, CA How to Start an Internet Marketing Business Price: FREE Admission (limited seating) - Click Below - Confirm Your Date & Session -- http://bit.ly/imf-toronto September 1st Crowne Plaza Hamilton Session 1: 12:30 pm – 2:30 pm Session 2: 6:00 pm – 8:00 pm September 2nd Hilton Toronto Markham Suites Session 1: 12:30 pm – 2:30 pm Session 2: 6:00 pm – 8:00 pm September 3rd Four Points by Sheraton Mississauga Meadowvale Session 1: 12:30 pm – 2:30 pm Session 2: 6:00 pm – 8:00 pm Price: FREE Admission (limited seating) - Click Below - Confirm Your Date & Session -- http://bit.ly/imf-toronto If you are not willing to risk the usual, you will have to settle for the ordinary.
- Advanced Data Science on Spark
Guest Speaker: Reza Zadeh Overview: We discuss how to combine the scalability of Spark with machine learning and graph processing. This talk covers a subset of material covered in Stanford’s CME 323: Distributed Algorithms and Optimization. Lessons focus on building and using machine learning at scale via MLlib and GraphX. Topics covered include: - Building scalable Machine Learning algorithms on Spark, discussing design decisions inside MLlib and GraphX - Understanding how primitives like Matrix Factorization are implemented in a distributed framework, from the designers of MLlib CME 323: http://stanford.edu/~rezab/dao Bio: http://stanford.edu/~rezab/bio.html Note: this event is cross-listed with the new Spark User Group: http://www.meetup.com/Toronto-Apache-Spark/events/224035398/ I am not setting RSVP limits as this is a cross-listed event with the Spark Meetup Group. Please RSVP to at least one of the events to let us know you are coming; we can union the lists. Be warned that there will be some people who will not have a seat. For those who do find a seat, please be kind and be ready to give it to those who need it more than you do. Food and Drinks will be provided.
- Data Ingest and Processing - spotlight on Streaming
Data Ingest and Processing: A lot of companies are looking to reduce the time it takes to get from ingest to intelligence with their critical business data. Moving from batch-focused processing to realtime/micro-batch can be difficult for both startups and established organizations. With these sessions, we will explore some solutions that have come up at various companies.
Presentations:
1. Stateful Stream Processing with Kafka and Samza (50min) Integration with in-memory local state is one of Samza's most interesting features, but how do you maintain and update the local state with fault-tolerance and multi-tenancy in mind? How do you test it? We will talk about our solutions and the problems still to be solved. Speaker Bio: George Li, Software Team Lead @ Vericent, an IBM Company - a little Schemer, diehard fan of "How to Solve It" by George Pólya
2. Moving to a Realtime Ingestion and Processing Architecture (50min) Postponed until next free presentation meetup.
3. Realtime Streaming Analytics Topic: Real-time Streaming Analytics. If we look at where time to business insights from data is being significantly delayed in the entire analytics modeling life cycle, we can easily identify several areas. This presentation identifies model deployment and execution as the two major bottlenecks and shows how they can be addressed using a standards-based approach. It will cover batch, real-time and streaming analytics. Speaker Bio: Eddie Soong is a software engineer by training who transitioned to business development for enterprise software, and a self-taught big data analytics enthusiast. His working experience ranges from Data Management to BI to big data predictive analytics. He is a member of Zementis, a standards-based predictive model deployment and execution engine on big data infrastructure for batch and real-time scoring.
Note: We are going to move our Q&A session to the Elephant and Castle so that thirsty THUGs can converse and so our presentations finish on time. :) Please excuse the summer hiatus; we are back on track. :)
- Creating a Data Science Practice
UPDATE - We will have 2 presentations on the same subject with different speakers.
_______________________________________
Presentation 1 - 60 mins
Speaker: Adam Muise, Chief Architect, Paytm Labs
So you want to Data Science: Don't forget to Data Engineer.
What people make it sound like: Step 1: Collect data scientists. Step 2: .. Step 3: Profit. Right?
The realities from an Architect's perspective: The Leadership & Team; How fast can you Hadoop?; Hire Data Engineers; The Architecture; Making Use of the Team.
Disclaimer: I am not a Data Scientist and this presentation will not be about Data Science techniques or practices specifically. This presentation is from a technical startup perspective. We will discuss team building, where Hadoop and other technologies fit in, as well as when to introduce said technologies. As usual, the target audience is engineers and technical leaders.
_______________________________________
Presentation 2 - 60 mins
Speaker: Ashish Bansal, Head of Data and Analytics Platforms at Gale Partners
Idiot's Guide to Data Science
Agenda: Context of Gale’s Data Science Practice; Knowing what you want; Data Science vs Big data; Data Scientists vs. Data Engineers; Exploration vs Operations; How to hire if you are not a data scientist yourself; Process of delivering work; Q&A
_______________________________________
- Apache Ignite: Introducing the Future of Fast Data
THUGs, This meetup is cross-listed with the Toronto In-memory Computing Group as we feel it is a relevant discussion for all. Please note that this will be a technical presentation about Apache Ignite and not a marketing presentation. http://ignite.incubator.apache.org/
What does this have to do with Hadoop? Why do I care? As an In Memory Data Grid, Ignite will be complementary to a Hadoop cluster. This can be part of an extended lambda architecture as a "fast layer" due to the cache, streaming capabilities, and txn support. If fast processing is done on data in Ignite, it is quite easy to push data to HDFS as a secondary file system. Beyond that, Apache Ignite has some write-through and read-through caching for HDFS via its own file system, IGFS. Bring your Hadoop-integration questions.
Synopsis: Join us in Toronto at 6:30 PM EDT on March 26 for an evening of networking with Big Data professionals and learning about the latest Apache project, which provides critical capabilities for the emerging world of Fast Data. In this presentation, we will provide an introduction to Apache Ignite(TM) (incubating), which is an open source, distributed framework for a unified In-Memory Data Fabric, originally developed by GridGain Systems. Apache Ignite provides a high-performance, distributed in-memory data management software layer that has been designed to operate between both new and existing data sources and applications, boosting application performance and scale by orders of magnitude. We will start with a summary of the technical drivers and market forces, and will cover popular and emerging use cases for in-memory computing, from financial industry trading platforms to mobile payment processing, online advertising, online/mobile gaming back-ends and more. We will then present some foundational concepts and terminology, and discuss the architecture, capabilities and benefits of the Apache Ignite In-Memory Data Fabric in some detail.
Speaker Bio: Nikita Ivanov - CTO, GridGain Nikita Ivanov is founder and CTO of GridGain Systems, the leading Java in-memory data fabric starting every 10 seconds around the world today. Nikita has over 20 years of experience in software application development, building HPC and middleware platforms, contributing to the efforts of other startups and notable companies including Adaptec, Visa and BEA Systems.