Launch the Zeppelin container and map a volume from the local machine into the container.

docker run -p 8080:8080 --rm -d -v D:\data:<container vol> -e ZEPPELIN_LOG_DIR=<log dir> <zeppelin image>

(note the -v flag takes a single <local vol>:<container vol> pair, separated by a colon)

Change the Pig interpreter to run in local mode.
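In Zeppelin's interpreter settings this is the `zeppelin.pig.execType` property (property name from the standard Zeppelin Pig interpreter; adjust if your setup differs):

```
zeppelin.pig.execType = local
```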

Restart Zeppelin.

Start a new notebook in Zeppelin.

%sh --> the shell interpreter

Try running the ll command; it should list the directory contents (confirming the volume mapping works).

%pig

A = LOAD '/shared/<whatever path you shared in your container definition>';

--> here we will load the business.json file into Pig.

%pig

B = LIMIT A 10;

DUMP B;

--> should print the first 10 records of the file.

%pig

B = LIMIT A 10;

DESCRIBE A; --> should say the schema for A is unknown (we loaded it without an AS clause or a JSON loader).

A JSON loader for Pig already exists from Twitter: Elephant Bird (core).

You can use elephant-bird-hadoop-compat, elephant-bird-pig, and the json-simple lib from Apache.

All four are on the Maven repository.

You need these four jars to run with Pig for JSON interpretation.

%pig

REGISTER '/shared/<the path to the jar files you downloaded above>';

REGISTER '/shared/<etc..>';

Do REGISTER for all four libraries.
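The four REGISTER statements look roughly like this (file names and versions are hypothetical; use whatever you downloaded from the Maven repository):

```pig
-- register the three Elephant Bird jars plus json-simple
REGISTER '/shared/elephant-bird-core-<version>.jar';
REGISTER '/shared/elephant-bird-hadoop-compat-<version>.jar';
REGISTER '/shared/elephant-bird-pig-<version>.jar';
REGISTER '/shared/json-simple-<version>.jar';
```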

A = LOAD '/shared..business.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (Json: MAP[]);

--> here you are telling Pig to load the business.json file, using the four libraries, as a JSON map.

Now do DESCRIBE A; --> A: {Json: map[]}

%pig

B = FOREACH A GENERATE Json#'city';

In the HW: analyse a Twitter JSON dataset.

Know: GENERATE, FOREACH ... GENERATE ..., FLATTEN, LOAD, DUMP.

B = LIMIT A 10;

DUMP B;

FLATTEN un-nests each record's inner structure: take the BUSID key and flatten the structure nested under it.

e.g. BUSID: {('Monday', '11:00'), ('Tuesday', '788'), ...}

FLATTEN will output one row per inner tuple:

(BUSID, Monday, 11:00)

(BUSID, Tuesday, 788)

...

category = FOREACH A GENERATE FLATTEN(..); --> core part of the HW.
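Putting the FLATTEN step together, a minimal sketch (field names like 'business_id' and 'categories' are assumptions about the dataset, not confirmed by these notes):

```pig
-- hypothetical sketch: one output row per (business, category) pair
A = LOAD '/shared/business.json'
    USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')
    AS (Json: map[]);
-- assume Json#'categories' holds a nested bag of category names
category = FOREACH A GENERATE Json#'business_id', FLATTEN(Json#'categories');
B = LIMIT category 10;
DUMP B;
```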

The HW has two parts: the Pig code, and doing the same thing using Spark (extra credit).
