Introduction to Pig

http://pig.apache.org/docs/r0.14.0/basic.html

Source: https://www.tutorialspoint.com/apache_pig/apache_pig_architecture.htm

Pig makes writing mappers and reducers for Hadoop easier.

It was developed at Yahoo!.

Yahoo! wanted to make things easier for developers so that they could concentrate on the complex analysis task itself, instead of hand-writing mappers and reducers just to be able to run on the cluster.

Apache® Pig™ is made up of two components: the first is the language itself, which is called PigLatin (yes, people naming various Hadoop projects do tend to have a sense of humor associated with their naming conventions), and the second is a runtime environment where PigLatin programs are executed. Think of the relationship between a Java Virtual Machine (JVM) and a Java application. In this section, we’ll just refer to the whole entity as Pig.

Apache Pig is an abstraction over MapReduce. It is a tool/platform used to analyze large sets of data by representing them as data flows. Pig is generally used with Hadoop; we can perform all the data manipulation operations in Hadoop using Pig.

To write data analysis programs, Pig provides a high-level language known as Pig Latin. This language provides various operators with which programmers can develop their own functions for reading, writing, and processing data.

To analyze data using Apache Pig, programmers need to write scripts using Pig Latin language. All these scripts are internally converted to Map and Reduce tasks. Apache Pig has a component known as Pig Engine that accepts the Pig Latin scripts as input and converts those scripts into MapReduce jobs.
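To make this concrete, here is a minimal Pig Latin sketch. The file path, delimiter, and field names are hypothetical; the point is that these few lines replace a hand-written MapReduce job, with the Pig Engine compiling them into Map and Reduce tasks behind the scenes.

```pig
-- Hypothetical input: a tab-separated file of (name, age) records in HDFS.
users  = LOAD 'input/users.txt' USING PigStorage('\t')
         AS (name:chararray, age:int);

-- Keep only the adult users.
adults = FILTER users BY age >= 18;

-- Writing the result triggers compilation into MapReduce jobs.
STORE adults INTO 'output/adults';
```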

Pig Latin is an SQL-like language, and it is easy to learn Apache Pig if you are already familiar with SQL.

Apache Pig provides many built-in operators to support data operations like joins, filters, ordering, etc. In addition, it also provides nested data types like tuples, bags, and maps that are missing from MapReduce.
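The nested data types show up naturally when grouping. In the sketch below (file path and field names are assumptions), GROUP produces one record per user whose second field is a bag of all that user's rows, something plain MapReduce gives you no built-in type for.

```pig
-- Hypothetical (user, url) click log.
clicks  = LOAD 'input/clicks.txt' USING PigStorage('\t')
          AS (user:chararray, url:chararray);

-- Each output record is a tuple: (group, {bag of matching click tuples}).
grouped = GROUP clicks BY user;

-- Operate on the bag, e.g. count the clicks per user.
counts  = FOREACH grouped GENERATE group, COUNT(clicks) AS n;

DUMP counts;
```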

Apache Pig analyzes all kinds of data, both structured as well as unstructured. It stores the results in HDFS.

Properties of Pig:

  • A dataflow programming environment for processing very large files

  • An execution engine on top of Hadoop

  • Removes the need for users to tune Hadoop themselves.

  • Insulates users from changes in Hadoop interfaces.

  • A Pig Latin program consists of a directed acyclic graph where each node represents an operation.

  • Pig is schema on read:

Source: https://www.thomashenson.com/schema-read-vs-schema-write-explained/

Let's first understand what schema on write means:

Schema on write means defining a schema for data before writing it into the database, as in a traditional SQL database. Because the schema must be defined before the data is written, only structured data can be added.

Schema on read:

Schema on read differs from schema on write in that the schema is created only when the data is read. Structure is applied to the data at read time, which allows unstructured data to be stored. Since it is not necessary to define a schema before storing the data, it is easier to bring in new data sources on the fly.
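In Pig this is visible in the LOAD statement: the same raw file can be loaded with or without a schema, chosen at read time. The file path, delimiter, and field names below are assumptions for illustration.

```pig
-- No schema: each record is a single untyped field (bytearray).
raw    = LOAD 'input/events.log';

-- Schema applied while reading: the same file, now with named, typed fields.
typed  = LOAD 'input/events.log' USING PigStorage(',')
         AS (ts:long, level:chararray, msg:chararray);

errors = FILTER typed BY level == 'ERROR';
DUMP errors;
```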

  • Lazy Evaluation: a statement is not evaluated until it produces an output file or an output message. This benefits the logical plan: the optimizer can look at the program from beginning to end and produce an efficient plan for execution.

Source : http://bugra.github.io/work/notes/2014-02-08/pig-advantages-and-disadvantages/
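Lazy evaluation can be sketched as follows (field names and paths are hypothetical): none of the first three statements execute anything; they only build up the logical plan, which Pig optimizes as a whole once an output is requested.

```pig
a = LOAD 'input/data.txt' AS (x:int, y:int);
b = FILTER a BY x > 0;
c = FOREACH b GENERATE x + y AS s;

-- Nothing has run yet: Pig has only built a logical plan for a, b, and c.

-- The DUMP (or a STORE) triggers plan optimization and MapReduce execution.
DUMP c;
```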
