Hold the Date! Hands-on Session!

Brian Baillod, Cloudera, will be leading a hands-on session focused on Pig. 

This meeting continues the previous Cloudera hands-on session. Where that session looked at MapReduce, Hive, and Impala, this one will focus on Pig and Cloudera Search. We will use the Cloudera Quickstart VM and the NFL dataset, and perhaps some other data as well.

 


  • Brian B.

    cut and paste material
    arrests = LOAD 'arrests.csv' USING PigStorage(',') AS (year:int, team:chararray, player:chararray);
    Describe arrests;
    Dump arrests;
    grouped_arrests = GROUP arrests BY team;
    num_arrests = FOREACH grouped_arrests GENERATE group AS team, COUNT(arrests) AS total;
    Dump num_arrests;
    ordered_arrests = ORDER num_arrests BY total;
    bad_boys = FILTER ordered_arrests BY (total>25);

    CREATE EXTERNAL TABLE arrests
    ( YEAR INTEGER,
    TEAM STRING,
    PLAYER STRING
    ) ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/user/cloudera/arrests'; -- copied arrests.csv to /user/cloudera/arrests folder in Hue
    invalidate metadata;
    Show tables;

    select team, year, count(*) from arrests
    group by 1,2
    order by 3 desc
    limit 10;
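
    As a quick local sanity check on the Pig GROUP/COUNT above, the same per-team totals can be sketched with standard shell tools. This is only an illustration: the function name count_by_team is hypothetical, and it assumes a local copy of arrests.csv with no header row and no commas inside fields.

```shell
#!/usr/bin/env bash
# Hypothetical cross-check (not part of the session material): count rows
# per team in a local year,team,player CSV and sort ascending by count.
count_by_team() {
  awk -F',' '{ n[$2]++ } END { for (t in n) print t "," n[t] }' "$1" |
    sort -t',' -k2,2n
}

# Example: count_by_team arrests.csv
```

    The output is one team,total line per team, in the same ascending order that ORDER num_arrests BY total produces.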

    August 5

  • Dharmesh R.

    Can someone tell me which building we are meeting in? Randy, can you share your phone number?

    August 5

    • Randall S. K.

      Note for folks: Dharmesh did make it. Drive to the parking lot at the east end of campus (toward 62nd Street). You can't miss the MBDUG signboard by the driveway. :)

      August 5

  • Dharmesh R.

    I went to 6767 and was told to go to 6687. How do I get into the building?

    August 5

  • Randall S. K.

    Pizza will arrive by 5:30. We would like to start the formal presentation as close to 6pm as possible, so please arrive early. :) Please park in the fenced parking lot on the east end of our campus on Industrial Road. Security will be stationed at the gate by 5:15pm. After you walk around the water retention pond, look for the signs showing which door to use to enter the building. See you soon!!

    August 5

  • Brian B.

    Hi Shaoli,
    My apologies, I believe you will need to start the Oozie and SOLR services in Cloudera Mgr before running the setupsearch.sh. They are turned off by default in the 4.7 quickstart vm. I ran into this myself when testing on a fresh VM today.

    August 5

  • Shaoli L.

    Updated line 7 for setupsearch.sh, but got errors (invalid identifiers, etc.) when I ran the script.
    ---
    [cloudera@localhost nfldata]$ ./setupsearch.sh
    Setting up env
    ./setupsearch.sh: line 7: export: `/': not a valid identifier
    ./setupsearch.sh: line 7: export: `-iname': not a valid identifier
    ./setupsearch.sh: line 7: export: `search-mr*-job.jar': not a valid identifier
    Cleanup any old configs
    Error: can't discover Solr URI. Please specify it explicitly via --solr.
    Setup config directory
    Uploading configs from /home/cloudera/workspace/nfldata/nflsearch/conf to localhost:2181/solr. This may take up to a minute.
    Error: can't discover Solr URI. Please specify it explicitly via --solr.
    Exception in thread "main" java.io.IOException: Error opening job jar: sudo
    at org.apache.hadoop.util.RunJar.main(RunJar.java:135)
    Caused by: java.util.zip.ZipException: error in opening zip file
    ...

    August 5

  • Brian B.

    See you all this evening.
    2 things regarding the prep.
    Misspelling below should be GIT clone not GET.
    git clone https://github.com/bosshart/nfldata
    The search UI won't work until you turn on a couple of services in Cloudera Mgr. Specifically SOLR. Also turn on Oozie and Impala.
    The scripts prepare the data for our search exercise. We don't need them run for the first exercise. The setup.sh script runs for over a half hour so you should try to do it in advance (or it could run during the Pig exercise)

    August 5

  • Brian B.

    We've updated the setupsearch.sh so you should not have to edit it now, it has the correct jar for CDH 4.7.
    Remember to give your VM 4GB of RAM if possible. That's what I've tested the process with.

    August 3

  • Brian B.

    download link for CDH 4.7 quickstart vm: http://www.cloudera.com/content/support/en/downloads/quickstart_vms/cdh-4-7-x.html

    Steps to do before the meetup:
    1. start up Quickstart VM. Open terminal window. cd workspace
    2. get clone https://github.com/bosshart/nfldata.
    3. cd nfldata. update setupsearch.sh script 7th line per instructions below
    4. run scripts ./setup.sh ./setupsearch.sh

    reply here with any questions or email me at [masked]

    August 2

  • Brian B.

    Update - can you please download the Cloudera 4.7 Quickstart for this exercise? We've noticed that having Yarn built into CDH5+ increases the required memory such that most laptops won't have enough RAM to work with joining large 500K tables. Here's the Github statement:
    git clone https://github.com/bosshart/nfldata

    You will need to update the 7th line of the setupsearch.sh script in the nfldata directory
    export CLOUDERA_SEARCH_MR_PATH=
    set it to the results of this command:
    sudo find / -iname 'search-mr*-job.jar'

    Then run 2 scripts:
    1. ./setup.sh
    2. ./setupsearch.sh
    Script 1 imports the data, transforms it, and loads the hive tables.
    Script 2 runs morphlines ETL and indexes the data in Cloudera Search
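
    A minimal sketch of one way to capture that jar path for line 7, assuming the export expects a single path. The function name find_search_mr_jar is hypothetical and not part of the nfldata repo. Pasting the command text itself after the = sign only assigns the word sudo, so command substitution is used to capture the command's output instead.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the nfldata repo): print the first file
# under the given root whose name matches the Search MapReduce job jar.
find_search_mr_jar() {
  local root="${1:-/}"
  # Quote the pattern so the shell hands it to find unexpanded;
  # discard permission-denied noise and keep only the first match.
  find "$root" -iname 'search-mr*-job.jar' 2>/dev/null | head -n 1
}

# Example use inside setupsearch.sh (assumed form of line 7):
#   export CLOUDERA_SEARCH_MR_PATH="$(find_search_mr_jar /)"
```

    If more than one jar matches, this keeps only the first. Parts of the filesystem may be unreadable without sudo, so the script may need to be run with elevated privileges.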

    August 2

  • Brian B.

    Regarding the preparation for the hands on:
    We are building a new github for you to download the data and code from, I will share that next week when it is finished. At a high level you will be doing the following preparation:
    1. Download the Cloudera Quickstart VM and verify that it works. Open a browser and check out Cloudera Manager, make sure the services are running (you may need to start them). Feel free to check out Cloudera Manager and HUE from the splash screen.
    2. Github download - we are building a different github that will include the search info and a tableau report. I will share more info next week.
    3. Running a setup script in advance - this was taking too long for lower powered laptops in the last session. We'll go through what the script did but it will be nice to have finished ahead of time.

    July 17

    • Paul Van D.

      Any news on Github?

      July 30

    • Brian B.

      we are close. I tested it this morning and it has a bug. we are trying to fully automate the cloudera search indexing of the nfl data. biggest problem is our day jobs. :)

      July 30

  • Jagdish M. D.

    Before the meeting, could you please pass along all the instructions and data that we are going to work with?

    July 12

  • Randall S. K.

    We will post specific instructions on how best to prepare soon. If you are able to set up a virtual machine on your laptop and have a little familiarity with the Linux command line, you should get value from the session. It may be a bit much for a first step, but depending on your other experience it could be just fine.

    July 9

  • Dhia

    Hello
    I know only the term "Big Data". Will the material provided be suitable for someone taking their first steps in the topic?

    July 8
