Hadoop for big data. An intro.
by Steven Lembark. Apache Hadoop is an open source, Java-based software platform and ecosystem that manages processing and storage for big data applications. It handles datasets ranging in size from gigabytes to petabytes. The data can be stored across many distributed disk farms and read and analyzed by many distributed computers.
Hadoop is an ecosystem of open source components that fundamentally changes the way enterprises store, process, and analyze data.
In the infancy of the Internet, there was the quest to "find stuff". "Search engines" were needed. Google, AltaVista, Yahoo, Ask Jeeves, and others all had ideas about how to do it.
Hadoop grew out of Apache Nutch, an open source web search project that Doug Cutting and Mike Cafarella started in 2002. Its design was inspired by Google's MapReduce, a programming model that divides an application into small pieces that run on different nodes.
In 2003, Google published the academic paper describing the "Google File System", which became the model for Hadoop's distributed file system. In 2006, the Apache Software Foundation released Hadoop as an open source project.
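For readers new to the MapReduce idea mentioned above, here is a minimal single-machine sketch in plain Python. It is not Hadoop's API; the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample text are made up for illustration. It only shows the shape of the model: each chunk of input is mapped independently, the results are grouped by key, and each group is reduced to one value.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values for each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    """Run the three phases in sequence on one machine (illustration only)."""
    pairs = []
    for chunk in chunks:
        # On a real cluster each chunk could be mapped on a different node.
        pairs.extend(map_phase(chunk))
    return reduce_phase(shuffle_phase(pairs))


if __name__ == "__main__":
    # Hypothetical input, split into two chunks as if stored on two disks.
    print(word_count(["big data big cluster", "data on many disks"]))
```

Because each call to map_phase only looks at its own chunk, the chunks can be handled in parallel, which is the property that lets this model spread over many nodes.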
Although there are now other tools used for such large data (e.g. Apache Hive, Apache Spark, Amazon EMR, Azure Data Lake Storage, IBM Analytics Engine, Hortonworks Data Platform, Apache Pig, Clarissa, ...), there are still those depending on Hadoop, including Netflix.
So, Steven will tell us…
An Overview of the Apache Hadoop Ecosystem:
There is stuff that's growing on your data warehouse hard disks.
In the beginning was Hadoop, and it was, well, Google's. And everyone tried
it.
But as Google dropped the approach as ineffective, lots of other
folks found ways to make pieces of it work, added new pieces to it,
and out of the ashes of single-purpose Hadoop grew the Apache Hadoop
ecosystem.
Today this includes a variety of software for intake,
querying, mapping SQL to key:value stores, and a few other cute tricks.
This talk will look at the pieces of this ecosystem, a bit about
how they fit together, and how they can be used for Really Truly
HUUUUUUGE data processing.
++++++++++++++++++++
https://stllinux.org/
The link to this Zoom meeting is posted on the day of the meeting on the home page above. It is the link labeled "linked here".
TOPIC: Hadoop for big data. An intro.
Presenter: Steven Lembark
++++++++
ONLINE MEETINGS ONLY until further notice.
ONLINE session will use remote video software.
HOW TO CONNECT instructions are on the https://stllinux.org/ web page and our mailing lists. Note that you may need to refresh your browser cache each time you check that page for the instructions. We will open the remote session at about 6:00 PM Central Standard Time (CST), so that you can join early to test your microphone, screen and video sharing.
The Saint Louis, MO, STL Linux Users Group (STLLUG) meets monthly to talk about Linux. This GNU/Linux Users Group usually holds its meetings on the third or fourth Thursday of every month. Meetings are free and open to everyone.
At 6:30 PM CST we start with introductions, announcements, current events of interest, and a general CALL FOR HELP segment. Then we will go into the presentation of our main topic, sometime around or after 7:00 PM CST.
