Spark does not implement a job tracker.
It does not have any knowledge of any file system, and it is not rack aware.
It just has readers and writers.
When you run Spark, you have access to four major components:
Streaming
MLlib
GraphX
Spark SQL
First benefit: a DataFrame is like a matrix where each column can be of a different type. I can change its usage from SQL to something else without changing it in memory; we can use the same data structure.
We don't need to use four different tools; we already have streaming, SQL, etc. built in.
Zeppelin:
%spark
sc
--> res0: SparkContext. In Hadoop, a MapReduce job has a context variable; similarly, in Spark, the built-in sc is the context.
%spark
spark
--> built-in SparkSession
val a = spark.read.
In Scala, every statement returns a value; you don't have to write a return statement.
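For example, a minimal sketch of completing that read (the header option and file path are just placeholders, not from the lecture):
val a = spark.read.option("header", "true").csv("/path/to/file.csv")
a.printSchema()   // the read expression itself returns a DataFrame; no explicit return needed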
In Spark, the distribution of work among the workers is done automatically.
RDD blocks: the file distribution that Spark is going to do --> 4 blocks, for example.
Data file --> split into chunks.
If I had four cores, the four parts would be processed in parallel.
The basic data structure in Spark 1 used to be the RDD --> the first implementation of a data frame on Hadoop. It knows how to sort itself, iterate, etc.; it's a full-fledged class with behavior.
This was in Spark 1.
Spark 2: it's called a DataFrame. Difference between an RDD and a DataFrame: a DataFrame is an RDD with a schema.
What's a data frame?
An abstraction of a distributed file system; it has knowledge of the chunks of the file, etc.
A data frame might exist in memory; now we can preload it into memory, one of Spark's improvements over Hadoop.
A data frame in Spark is immutable.
Each RDD has multiple partitions.
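A small sketch with sc.parallelize (the collection and the partition count are arbitrary):
val rdd = sc.parallelize(1 to 100, 4)   // request 4 partitions
rdd.getNumPartitions                    // --> 4; with four cores, the partitions run in parallel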
Spark data frames are modeled after Python pandas.
If you are using SparkR or PySpark, you can convert a local (R or pandas) data frame into a Spark data frame.
data.groupBy("dept").avg("age")
HW in notebook format.
columnar storage
Python Spark (PySpark):
pyspark --master local[2]
Go to the Spark documentation --> Scala --> DataFrame.
select, filter, foreach, first, count; the HW will require an aggregate or a select/filter.
What is a dataframe?
It's a collection of rows and columns, like a matrix, except the columns can be of different types.
DataFrame is an alias for Dataset:
type DataFrame = Dataset[Row]
Row is a type in Spark.
val people = spark.read.parquet("...").as[Person] // Scala
Dataset<Person> people = spark.read().parquet("...").as(Encoders.bean(Person.class)); // Java
read can read everything: JSON, text, etc.
Parquet is like a zip file: compressed.
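A few of the readers, sketched (all paths are placeholders):
val j = spark.read.json("/path/people.json")        // JSON, one object per line
val t = spark.read.text("/path/notes.txt")          // plain text, one row per line in a column named value
val p = spark.read.parquet("/path/people.parquet")  // compressed, columnar Parquet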
val names = people.map(_.name) // in Scala; names is a Dataset[String]
Dataset<String> names = people.map((Person p) -> p.name, Encoders.STRING); // in Java 8
A map operation goes row by row and creates a new DataFrame; the output is a new DataFrame whose schema is whatever the mapper produces.
So above, it creates a new Dataset containing just the names, as shown.
val x = df.select("lime")
Column is a separate type.
map, select, and filter are the ones the professor uses most frequently.
// The following creates a new column that increases everybody's age by 10.
people("age") + 10 // in Scala
people.col("age").plus(10); // in Java
The + and plus do the same thing.
// To create Dataset[Row] using SparkSession
val people = spark.read.parquet("...")
val department = spark.read.parquet("...")
people.filter("age > 30")
.join(department, people("deptId") === department("id"))
.groupBy(department("name"), "gender")
.agg(avg(people("salary")), max(people("age")))
df is the DataFrame variable.
df.count
df.head(10)
df.filter("top10 > 0").count
Scala lets you omit the parentheses; it works either way.
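For example, both of these work:
df.count      // parentheses omitted
df.count()    // parentheses written out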
https://spark.apache.org/docs/2.1.1/api/scala/index.html#org.apache.spark.sql.Dataset
sc.parallelize
If you want to read a CSV file that is delimited by ';' instead of ',':
val bank = spark.read.option("header","true").option("delimiter",";").csv("path to the csv file")
pivot
No more quizzes and homeworks.
Only one HW on Spark: either in Spark or in Pig. If you do both, you get double points.
Pig: FLATTEN.
______________________________________________________________________________________
sc is the built-in variable from Spark Core (the SparkContext).
spark is the built-in variable from the SQL context (the SparkSession).
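The two are tied together; a quick check in the shell (just a sketch):
spark.sparkContext   // the SparkContext held by the SparkSession
sc                   // the same object, exposed directly by the shell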
val df = spark.read.csv("/path").withColumn("last", $"_c26".cast("Double") * 3)
This creates a new data frame where _c26 is cast to Double, multiplied by 3, and stored in a new column named last.
val df = spark.read.csv("/path").withColumn("_c26", $"_c26".cast("Double") * 3)
Here it won't create a new column: reusing the same column name replaces the existing _c26.
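A quick way to see the difference, sketched:
df.columns         // with "last" there is one extra column; with "_c26" the column list is unchanged
df.printSchema()   // shows the Double type after the cast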
text - single line of entire data.
textFile - lines of data
df.select("docID","_c26");
This will create a new DF with only docID and _c26.
spark.sql("select docID, _c26 from df")
This is exactly the same as the one above.
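For the SQL form to run, the DataFrame first has to be registered as a view; a minimal sketch (the view name df is just a choice here):
df.createOrReplaceTempView("df")
val same = spark.sql("select docID, _c26 from df")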
map
filter
sort
groupBy
sample
randomSplit
pivot
df.select("docID","_c26").sort(desc("_c26")).head
Spark was created to make Hadoop MapReduce faster.
Spark SQL and Hive ...
Whenever you have a df, use persist to save the output in Spark; otherwise it will repeat the same steps again and again.
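A minimal sketch of persisting, assuming a df with a numeric _c26 column:
val top = df.filter("_c26 > 0").persist()
top.count   // the first action computes and caches the result
top.count   // later actions reuse the cached data instead of repeating the steps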
Group By
import org.apache.spark.sql.functions.{avg, max}
val result = df.groupBy("Building","Name").agg(avg("temp"), max("Time"))
%spark
val bank = spark.read.option("delimiter",";").option("header","true").csv("/path").withColumn("age", $"age".cast("Int"))
%spark
val d3 = bank.groupBy("marital").pivot("loan").count
d3.show
Functions like groupBy and pivot always come in pairs: groupBy expects a follow-up (an aggregation), and so does pivot.
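Another small sketch of the pairing, using the same bank data (avg is just one possible follow-up):
import org.apache.spark.sql.functions.avg
bank.groupBy("marital").agg(avg("age"))            // groupBy followed by an aggregation
bank.groupBy("marital").pivot("loan").avg("age")   // pivot also needs an aggregation after it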