Spark does not implement a job tracker.
It does not have any knowledge of any file system, and it is not rack aware.
It just has readers and writers.
When you run Spark, you have access to four major components:
Streaming
MLlib
GraphX
Spark SQL
First benefit: a DataFrame is like a matrix where each column can be of a different type. I can change its usage from SQL to something else without changing it in memory; we can use the same data structure.
We don't need to use four different tools; we already have streaming, SQL, etc. built in.
Zeppelin:
%spark
sc
--> res0: SparkContext. In Hadoop, a MapReduce job has a context variable; similarly, in Spark, the built-in sc is the context.
%spark
spark
--> built-in SparkSession
val a = spark.read.
In Scala, every statement returns a value; you don't have to write a return statement.
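For example, a minimal sketch of completing that read (the header option and file path are just placeholders, not from the lecture):
val a = spark.read.option("header", "true").csv("/path/to/file.csv")
a.printSchema()   // the read expression itself returns a DataFrame; no explicit return needed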
In Spark, the distribution of work among the workers is done automatically.
RDD blocks: the file distribution that Spark is going to do --> 4 blocks, for example.
Data file --> split into chunks.
If I had four cores, the four parts would be processed in parallel.
The basic data structure in Spark 1 used to be the RDD --> the first implementation of a data frame on Hadoop. It knows how to sort itself, iterate, etc.; it's a full-fledged class with behavior.
This was in Spark 1.
Spark 2: it's called a DataFrame. Difference between an RDD and a DataFrame: a DataFrame is an RDD with a schema.
What's a data frame?
An abstraction of a distributed file system; it has knowledge of the chunks of the file, etc.
A data frame might exist in memory; now we can preload it into memory, one of Spark's improvements over Hadoop.
A data frame in Spark is immutable.
Each RDD has multiple partitions.
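A small sketch with sc.parallelize (the collection and the partition count are arbitrary):
val rdd = sc.parallelize(1 to 100, 4)   // request 4 partitions
rdd.getNumPartitions                    // --> 4; with four cores, the partitions run in parallel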
Spark data frames are modeled after Python pandas.
If you are using SparkR or PySpark, you can convert a local (R or pandas) data frame into a Spark data frame.
data.groupBy("dept").avg("age")
HW in notebook format.
columnar storage
Python Spark (PySpark):
pyspark --master local[2]
Go to the Spark documentation --> Scala --> DataFrame.
select, filter, foreach, first, count; the HW will require an aggregate or a select/filter.
What is a dataframe?
It's a collection of rows and columns, like a matrix, except the columns can be of different types.
DataFrame is an alias for Dataset:
type DataFrame = Dataset[Row]
Row is a type in Spark.
val people = spark.read.parquet("...").as[Person] // Scala
Dataset<Person> people = spark.read().parquet("...").as(Encoders.bean(Person.class)); // Java
read can read everything: JSON, text, etc.
Parquet is like a zip file: compressed.
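A few of the readers, sketched (all paths are placeholders):
val j = spark.read.json("/path/people.json")        // JSON, one object per line
val t = spark.read.text("/path/notes.txt")          // plain text, one row per line in a column named value
val p = spark.read.parquet("/path/people.parquet")  // compressed, columnar Parquet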
val names = people.map(_.name) // in Scala; names is a Dataset[String]
Dataset<String> names = people.map((Person p) -> p.name, Encoders.STRING); // in Java 8
A map operation goes row by row and creates a new DataFrame; the output is a new DataFrame whose schema is whatever the mapper produces.
So above, it creates a new Dataset containing just the names, as shown.
val x = df.select("lime")
Column is a separate type.
map, select, and filter are the ones the professor uses most frequently.
// The following creates a new column that increases everybody's age by 10.
people("age") + 10 // in Scala
people.col("age").plus(10); // in Java
The + and plus do the same thing.
// To create Dataset[Row] using SparkSession
val people = spark.read.parquet("...")
val department = spark.read.parquet("...")
people.filter("age > 30")
.join(department, people("deptId") === department("id"))
.groupBy(department("name"), "gender")
.agg(avg(people("salary")), max(people("age")))
df is the DataFrame variable.
df.count
df.head(10)
df.filter("top10 > 0").count
Scala lets you omit the parentheses; it works either way.
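For example, both of these work:
df.count      // parentheses omitted
df.count()    // parentheses written out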
https://spark.apache.org/docs/2.1.1/api/scala/index.html#org.apache.spark.sql.Dataset
sc.parallelize
If you want to read a CSV file that is delimited by ';' instead of ',':
val bank = spark.read.option("header","true").option("delimiter",";").csv("path to the csv file")
pivot
No more quizzes and homeworks.
Only one HW on Spark: either in Spark or in Pig. If you do both, you get double points.
Pig: FLATTEN.
______________________________________________________________________________________
sc is the built-in variable from Spark Core (the SparkContext).
spark is the built-in variable from the SQL context (the SparkSession).
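The two are tied together; a quick check in the shell (just a sketch):
spark.sparkContext   // the SparkContext held by the SparkSession
sc                   // the same object, exposed directly by the shell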
val df = spark.read.csv("/path").withColumn("last", $"_c26".cast("Double") * 3)
This creates a new data frame where _c26 is cast to Double, multiplied by 3, and stored in a new column named last.
val df = spark.read.csv("/path").withColumn("_c26", $"_c26".cast("Double") * 3)
Here it won't create a new column: reusing the same column name replaces the existing _c26.
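A quick way to see the difference, sketched:
df.columns         // with "last" there is one extra column; with "_c26" the column list is unchanged
df.printSchema()   // shows the Double type after the cast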
text - single line of entire data.
textFile - lines of data
df.select("docID","_c26");
This will create a new DF with only docID and _c26.
spark.sql("select docID, _c26 from df")
This is exactly the same as the one above.
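For the SQL form to run, the DataFrame first has to be registered as a view; a minimal sketch (the view name df is just a choice here):
df.createOrReplaceTempView("df")
val same = spark.sql("select docID, _c26 from df")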
map
filter
sort
groupBy
sample
randomSplit
pivot
df.select("docID","_c26").sort(desc("_c26")).head
Spark was created to make Hadoop MapReduce faster.
Spark SQL and Hive ...
Whenever you have a df, use persist to save the output in Spark; otherwise it will repeat the same steps again and again.
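A minimal sketch of persisting, assuming a df with a numeric _c26 column:
val top = df.filter("_c26 > 0").persist()
top.count   // the first action computes and caches the result
top.count   // later actions reuse the cached data instead of repeating the steps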
Group By
import org.apache.spark.sql.functions.{avg, max}
val result = df.groupBy("Building","Name").agg(avg("temp"), max("Time"))
%spark
val bank = spark.read.option("delimiter",";").option("header","true").csv("/path").withColumn("age", $"age".cast("Int"))
%spark
val d3 = bank.groupBy("marital").pivot("loan").count
d3.show
Functions like groupBy and pivot always come in pairs: groupBy expects a follow-up (an aggregation), and so does pivot.
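Another small sketch of the pairing, using the same bank data (avg is just one possible follow-up):
import org.apache.spark.sql.functions.avg
bank.groupBy("marital").agg(avg("age"))            // groupBy followed by an aggregation
bank.groupBy("marital").pivot("loan").avg("age")   // pivot also needs an aggregation after it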