
Pig Latin Reference Manual 2
Use this manual together with Pig Latin Reference Manual 1.
Conventions
Conventions for the syntax and code examples in the Pig Latin Reference Manual are described here.
( )
Parentheses enclose one or more items. Parentheses are also used to indicate the tuple data type.
Multiple items:
(1, abc, (2,4,6) )
[ ]
Straight brackets enclose one or more optional items. Straight brackets are also used to indicate the map data type. In this case <> is used to indicate optional items.
Optional items:
[INNER | OUTER]
{ }
Curly brackets enclose two or more items, one of which is required. Curly brackets are also used to indicate the bag data type. In this case <> is used to indicate required items.
Two items, one required:
{ gen_blk | nested_gen_blk }
…
Horizontal ellipsis points indicate that you can repeat a portion of the code.
Pig Latin syntax statement:
cat path [path …]
UPPERCASE and lowercase
In general, uppercase type indicates elements the system supplies. In general, lowercase type indicates elements that you supply.
Note: The names (aliases) of relations and fields are case sensitive. The names of Pig Latin functions are case sensitive. All other Pig Latin keywords are case insensitive.
Pig Latin statement:
A = LOAD 'data' AS (f1:int);
LOAD and AS are supplied by the system; A and f1 are names (aliases); data is supplied by you.
italics
Italic type indicates placeholders or variables for which you must supply values.
Pig Latin syntax:
alias = LIMIT alias n;
You supply the values for placeholder alias and variable n.
Pig keywords are listed here.
and, any, all, arrange, as, asc, AVG
bag, BinStorage, by, bytearray
cache, cat, cd, chararray, cogroup, CONCAT, copyFromLocal, copyToLocal, COUNT, cp, cross
%declare, %default, define, desc, describe, DIFF, distinct, double, du, dump
e, E, eval, exec, explain
f, F, filter, flatten, float, foreach, full
generate, group
if, illustrate, inner, input, int, into, is
l, L, left, limit, load, long, ls
map, matches, MAX, MIN, mkdir, mv
or, order, outer, output
parallel, pig, PigDump, PigStorage, pwd
register, right, rm, rmf, run
sample, set, ship, SIZE, split, stderr, stdin, stdout, store, stream, SUM
TextLoader, TOKENIZE, through, tuple
union, using
-- V, W, X, Y, Z
-- Symbols
<, >, <=, >=
Data Types and More
Relations, Bags, Tuples, Fields
Pig Latin statements work with relations. A relation can be defined as follows:
A relation is a bag (more specifically, an outer bag).
A bag is a collection of tuples.
A tuple is an ordered set of fields.
A field is a piece of data.
A Pig relation is a bag of tuples. A Pig relation is similar to a table in a relational database, where the tuples in the bag correspond to the rows in a table. Unlike a relational table, however, Pig relations don't require that every tuple contain the same number of fields or that the fields in the same position (column) have the same type.
Also note that relations are unordered which means there is no guarantee that tuples are processed in any particular order. Furthermore, processing may be parallelized in which case tuples are not processed according to any total ordering.
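As a small illustration of this flexibility (hypothetical data; the field values shown are invented), tuples in one relation may differ in both width and type:

```pig
-- Load with no schema: every field defaults to bytearray,
-- and tuples are free to differ in field count and content.
A = LOAD 'data';
-- A might contain, for example:
-- (John,18,4.0)
-- (Mary,19)          two fields instead of three
-- (Bill,twenty,3.9)  second field is not numeric
```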
Referencing Relations
Relations are referred to by name (or alias). Names are assigned by you as part of the Pig Latin statement. In this example the name (alias) of the relation is A.
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
(John,18,4.0F)
(Mary,19,3.8F)
(Bill,20,3.9F)
(Joe,18,3.8F)
Referencing Fields
Fields are referred to by positional notation or by name (alias).
Positional notation is generated by the system. Positional notation is indicated with the dollar sign ($) and begins with zero (0); for example, $0, $1, $2.
Names are assigned by you using schemas (or, in the case of the GROUP operator and some functions, by the system). You can use any name that is not a Pig keyword; for example, f1, f2, f3 or a, b, c or name, age, gpa.
Given relation A above, the three fields are separated out in this table.
First Field / Second Field / Third Field
Positional notation (generated by system): $0 / $1 / $2
Possible name (assigned by you using a schema): name / age / gpa
Field value (for the first tuple): John / 18 / 4.0F
As shown in this example when you assign names to fields you can still refer to the fields using positional notation. However, for debugging purposes and ease of comprehension, it is better to use names.
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
X = FOREACH A GENERATE name,$2;
(John,4.0F)
(Mary,3.8F)
(Bill,3.9F)
(Joe,3.8F)
In this example an error is generated because the requested column ($3) is outside of the declared schema (positional notation begins with $0). Note that the error is caught before the statements are executed.
A = LOAD 'data' AS (f1:int,f2:int,f3:int);
B = FOREACH A GENERATE $3;
23:03:46,715 [main] ERROR org.apache.pig.tools.grunt.GruntParser - java.io.IOException:
Out of bound access. Trying to access non-existent column: 3.
Schema {f1: bytearray,f2: bytearray,f3: bytearray} has 3 column(s).
Referencing Fields that are Complex Data Types
As noted, the fields in a tuple can be any data type, including the complex data types: bags, tuples, and maps.
Use the schemas for complex data types to name fields that are complex data types.
Use the dereference operators to reference and work with fields that are complex data types.
In this example the data file contains tuples. A schema for complex data types (in this case, tuples) is used to load the data. Then, dereference operators (the dot in t1.t1a and t2.$0) are used to access the fields in the tuples. Note that when you assign names to fields you can still refer to these fields using positional notation.
(3,8,9) (4,5,6)
(1,4,7) (3,7,5)
(2,5,8) (9,5,8)
A = LOAD 'data' AS (t1:tuple(t1a:int, t1b:int,t1c:int),t2:tuple(t2a:int,t2b:int,t2c:int));
((3,8,9),(4,5,6))
((1,4,7),(3,7,5))
((2,5,8),(9,5,8))
X = FOREACH A GENERATE t1.t1a,t2.$0;
Data Types
Simple and Complex
Simple Data Types
int
Signed 32-bit integer. Example: 10
long
Signed 64-bit integer.
Data: 10L or 10l
Display: 10L
float
32-bit floating point.
Data: 10.5F or 10.5f or 10.5e2f or 10.5E2F
Display: 10.5F or 1050.0F
double
64-bit floating point.
Data: 10.5 or 10.5e2 or 10.5E2
Display: 10.5 or 1050.0
chararray
Character array (string) in Unicode UTF-8 format. Example: hello world
bytearray
Byte array (blob).
Complex Data Types
tuple
An ordered set of fields. Example: (19,2)
bag
A collection of tuples. Example: {(19,2), (18,1)}
map
A set of key value pairs. Example: [open#apache]
Note the following general observations about data types:
Use schemas to assign types to fields. If you don't assign types, fields default to type bytearray and implicit conversions are applied to the data depending on the context in which that data is used. For example, in relation B, f1 is converted to integer because 5 is integer. In relation C, f1 and f2 are converted to double because we don't know the type of either f1 or f2.
A = LOAD 'data' AS (f1,f2,f3);
B = FOREACH A GENERATE f1 + 5;
C = FOREACH A generate f1 + f2;
If a schema is defined as part of a load statement, the load function will attempt to enforce the schema. If the data does not conform to the schema, the loader will generate a null value or an error.
A = LOAD 'data' AS (name:chararray, age:int, gpa:float);
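As a sketch of the null case (assuming an input line whose age field is not a valid integer), the loader would substitute null for the nonconforming value:

```pig
-- Suppose one line of 'data' is:  joe  xyz  3.8
A = LOAD 'data' AS (name:chararray, age:int, gpa:float);
-- 'xyz' cannot be converted to int, so for that tuple
-- the age field is loaded as null.
B = FOREACH A GENERATE age;
```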
If an explicit cast is not supported, an error will occur. For example, you cannot cast a chararray to int.
A = LOAD 'data' AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE (int)name;
This will cause an error.
If Pig cannot resolve incompatible types through implicit casts, an error will occur. For example, you cannot add chararray and float (see the Types Table for addition and subtraction).
A = LOAD 'data' AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name + gpa;
This will cause an error.
A tuple is an ordered set of fields.
( field [, field …] )
A tuple is enclosed in parentheses ( ).
A piece of data. A field can be any data type (including tuple and bag).
You can think of a tuple as a row with one or more fields, where each field can be any data type and any field may or may not have data. If a field has no data, then the following happens:
In a load statement, the loader will inject null into the tuple. The actual value that is substituted for null is loader specific; for example, PigStorage substitutes an empty field for null.
In a non-load statement, if a requested field is missing from a tuple, Pig will inject null.
In this example the tuple contains three fields.
(John,18,4.0F)
A bag is a collection of tuples.
Syntax: Inner bag
{ tuple [, tuple …] }
An inner bag is enclosed in curly brackets { }.
Note the following about bags:
A bag can have duplicate tuples.
A bag can have tuples with differing numbers of fields. However, if Pig tries to access a field that does not exist, a null value is substituted.
A bag can have tuples with fields that have different data types. However, for Pig to effectively process bags, the schemas of the tuples within those bags should be the same. For example, if half of the tuples include chararray fields while the other half include float fields, only half of the tuples will participate in any kind of computation because the chararray fields will be converted to null.
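The notes above can be sketched as follows (hypothetical data; a row of the input that is missing its third field yields a null f3 in the corresponding tuple):

```pig
A = LOAD 'data' AS (f1:int, f2:int, f3:int);
B = GROUP A BY f1;
-- The inner bags in B may hold duplicate tuples and tuples whose
-- f3 field was missing in the input; projecting f3 yields null
-- for those tuples:
X = FOREACH B GENERATE group, A.f3;
```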
Bags have two forms: outer bag (or relation) and inner bag.
Example: Outer Bag
In this example A is a relation or bag of tuples. You can think of this bag as an outer bag.
A = LOAD 'data' as (f1:int, f2:int, f3:int);
Example: Inner Bag
Now, suppose we group relation A by the first field to form relation X.
In this example X is a relation or bag of tuples. The tuples in relation X have two fields. The first field is type int. The second field is type bag; you can think of this bag as an inner bag.
X = GROUP A BY f1;
(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
(8,{(8,3,4)})
A map is a set of key value pairs.
Syntax (<> denotes optional)
[ key#value <, key#value …> ]
Maps are enclosed in straight brackets [ ].
Key value pairs are separated by the pound sign #.
key - Must be chararray data type. Must be a unique value.
value - Any data type.
Key values within a relation must be unique.
In this example the map includes two key value pairs.
[name#John,phone#5551212]
In Pig Latin, nulls are implemented using the SQL definition of null as unknown or non-existent. Nulls can occur naturally in data or can be the result of an operation.
Nulls and Operators
Pig Latin operators interact with nulls as shown in this table.
Operator - Interaction
Comparison operators: ==, !=, <, >, <=, >=
If either sub-expression is null, the result is null.
Comparison operator: matches
If either the string being matched against or the string defining the match is null, the result is null.
Arithmetic operators: +, -, *, /
If either sub-expression is null, the resulting expression is null.
Null operator: is null
Returns true if the tested value is null; otherwise, returns false.
Null operator: is not null
Returns true if the tested value is not null; otherwise, returns false.
Dereference operators: tuple (.) or map (#)
If the de-referenced tuple or map is null, returns null.
Cast operator
Casting a null from one type to another type results in a null.
Functions: AVG, MIN, MAX, SUM
These functions ignore nulls.
Function: COUNT
This function counts all values, including nulls.
Function: CONCAT
If either sub-expression is null, the resulting expression is null.
Function: SIZE
If the tested object is null, returns null.
For Boolean sub-expressions, note the results when nulls are used with these operators:
FILTER operator - If a filter expression results in a null value, the filter does not pass it through (if X is null, !X is also null, and the filter will reject both).
Bincond operator - If a Boolean sub-expression results in a null value, the resulting expression is null (see the interactions above for arithmetic operators).
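A minimal sketch of the FILTER behavior (hypothetical data): a tuple whose f1 is null fails both the test and its negation, so it appears in neither result.

```pig
A = LOAD 'data' AS (f1:int);
X = FILTER A BY f1 == 8;        -- null == 8 evaluates to null: rejected
Y = FILTER A BY NOT (f1 == 8);  -- NOT null is still null: rejected again
```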
Nulls and Constants
Nulls can be used as constant expressions in place of expressions of any type.
In this example a and null are projected.
A = LOAD 'data' AS (a, b, c);
B = FOREACH A GENERATE a, null;
In this example of an outer join, if the join key is missing from a table it is replaced by null.
A = LOAD 'student' AS (name: chararray, age: int, gpa: float);
B = LOAD 'votertab10k' AS (name: chararray, age: int, registration: chararray, donation: float);
C = COGROUP A BY name, B BY name;
D = FOREACH C GENERATE FLATTEN((IsEmpty(A) ? null : A)), FLATTEN((IsEmpty(B) ? null : B));
Like any other expression, null constants can be implicitly or explicitly cast.
In this example both a and null will be implicitly cast to double.
A = LOAD 'data' AS (a, b, c);
B = FOREACH A GENERATE a + null;
In this example both a and null will be cast to int, a implicitly, and null explicitly.
A = LOAD 'data' AS (a, b, c);
B = FOREACH A GENERATE a + (int)null;
Operations That Produce Nulls
As noted, nulls can be the result of an operation. These operations can produce null values:
Division by zero
Returns from user defined functions (UDFs)
Dereferencing a field that does not exist.
Dereferencing a key that does not exist in a map. For example, given a map, info, containing [name#john, phone#5551212], if a user tries to use info#address a null is returned.
Accessing a field that does not exist in a tuple.
Example: Accessing a field that does not exist in a tuple
In this example nulls are injected if fields do not have data.
A = LOAD 'data' AS (f1:int,f2:int,f3:int);
B = FOREACH A GENERATE f1,f2;
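The map case can be sketched the same way (hypothetical data; the info map holds only name and phone keys, so dereferencing any other key yields null):

```pig
A = LOAD 'data' AS (info:map[]);
-- For a value such as [name#john,phone#5551212], a key that is
-- not present in the map dereferences to null:
B = FOREACH A GENERATE info#'address';
```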
Nulls and Load Functions
As noted, nulls can occur naturally in the data. If nulls are part of the data, it is the responsibility of the load function to handle them correctly. Keep in mind that what is considered a null value is loader specific; however, the load function should always communicate null values to Pig by producing Java nulls.
The Pig Latin load functions (for example, PigStorage and TextLoader) produce null values wherever data is missing. For example, empty strings (chararrays) are not loaded; instead, they are replaced by nulls.
PigStorage is the default load function for the LOAD operator. In this example the is not null operator is used to filter names with null values.
A = LOAD 'student' AS (name, age, gpa);
B = FILTER A BY name is not null;
Pig provides constant representations for all data types except bytearrays.
Simple Data Types - Constant Example
int: 19
long: 19L
float: 19.2F or 1.92e2f
double: 19.2 or 1.92e2
chararray: 'hello world'
bytearray: Not applicable.
Complex Data Types - Constant Example
tuple: (19, 2, 1)
A constant in this form creates a tuple.
bag: { (19, 2), (1, 2) }
A constant in this form creates a bag.
map: [ 'name' # 'John', 'ext' # 5555 ]
A constant in this form creates a map.
Please note the following:
On UTF-8 systems you can specify string constants consisting of printable ASCII characters such as 'abc'; you can specify control characters such as '\t'; and, you can specify a character in Unicode by starting it with '\u', for instance, '\u0001' represents Ctrl-A in hexadecimal. In theory, you should be able to specify non-UTF-8 constants on non-UTF-8 systems, but as far as we know this has not been tested.
To specify a long constant, l or L must be appended to the number (for example, 10L). If the l or L is not specified, but the number is too large to fit into an int, the problem will be detected at parse time and the processing is terminated.
Any numeric constant with decimal point (for example, 1.5) and/or exponent (for example, 5e+1) is treated as double unless it ends with f or F in which case it is assigned type float (for example, 1.5f).
The data type definitions for tuples, bags, and maps apply to constants:
A tuple can contain fields of any data type.
A bag is a collection of tuples.
A map value can be any data type.
Complex constants (either with or without values) can be used in the same places scalar constants can be used; that is, in FILTER and GENERATE statements.
A = LOAD 'data' USING MyStorage() AS (T: tuple(name:chararray, age: int));
B = FILTER A BY T == ('john', 25);
D = FOREACH B GENERATE T.name, [25#5.6], {(1, 5, 18)};
Expressions
In Pig Latin, expressions are language constructs used with the FILTER, FOREACH, GROUP, and SPLIT operators as well as the eval functions.
Expressions are written in conventional mathematical infix notation and are adapted to the UTF-8 character set. Depending on the context, expressions can include:
Any Pig data type (simple data types, complex data types)
Any Pig operator (arithmetic, comparison, null, boolean, dereference, sign, and cast)
Any Pig built-in function.
Any user-defined function (UDF) written in Java.
In Pig Latin:
An arithmetic expression could look like this:
X = GROUP A BY f2*f3;
A string expression could look like this, where a and b are both chararrays:
X = FOREACH A GENERATE CONCAT(a,b);
A boolean expression could look like this:
X = FILTER A BY (f1==8) OR (NOT (f2+f3 > f1));
Field expressions
Field expressions represent a field or a dereference operator applied to a field. See Referencing Fields above for more details.
Star expression
The star symbol, *, can be used to represent all the fields of a tuple. It is equivalent to writing out the fields explicitly. In the following example the definition of B and C are exactly the same, and MyUDF will be invoked with exactly the same arguments in both cases.
A = LOAD 'data' USING MyStorage() AS (name:chararray, age: int);
B = FOREACH A GENERATE *, MyUDF(name, age);
C = FOREACH A GENERATE name, age, MyUDF(*);
A common error when using the star expression is the following:
G = GROUP A BY $0;
C = FOREACH G GENERATE COUNT(*);
In this example, the programmer really wants to count the number of elements in the bag in the second field: COUNT($1).
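The corrected statement projects the bag in the second field, so COUNT sees the tuples of each group:

```pig
G = GROUP A BY $0;
C = FOREACH G GENERATE COUNT($1);  -- counts the tuples in each group's bag
```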
Boolean expressions
Boolean expressions can be made up of UDFs that return a boolean value or boolean operators (see Boolean Operators).
Tuple expressions
Tuple expressions form subexpressions into tuples. The tuple expression has the form (expression [, expression …]), where expression is a general expression. The simplest tuple expression is the star expression, which represents all fields.
General expressions
General expressions can be made up of UDFs and almost any operator. Since Pig does not consider boolean a base type, the result of a general expression cannot be a boolean. Field expressions are the simplest general expressions.
Schemas enable you to assign names to and declare types for fields. Schemas are optional but we encourage you to use them whenever possible; type declarations result in better parse-time error checking and more efficient code execution.
Schemas are defined using the AS keyword with the LOAD, STREAM, and FOREACH operators. If you define a schema using the LOAD operator, then it is the load function that enforces the schema (see the LOAD operator for more information).
Note the following:
You can define a schema that includes both the field name and field type.
You can define a schema that includes the field name only; in this case, the field type defaults to bytearray.
You can choose not to define a schema; in this case, the field is un-named and the field type defaults to bytearray.
If you assign a name to a field, you can refer to that field using the name or by positional notation. If you don't assign a name to a field (the field is un-named) you can only refer to the field using positional notation.
If you assign a type to a field, you can subsequently change the type using the cast operators. If you don't assign a type to a field, the field defaults to type bytearray; you can change the default type using the cast operators.
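Both points can be sketched together (assuming the student data used in earlier examples; gpa is left untyped so it defaults to bytearray):

```pig
A = LOAD 'student' AS (name:chararray, age:int, gpa);
X = FOREACH A GENERATE name, $1;    -- name by alias, age by position
Y = FOREACH A GENERATE (float)gpa;  -- cast changes gpa's default bytearray type
```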
Schemas with LOAD and STREAM Statements
With LOAD and STREAM statements, the schema following the AS keyword must be enclosed in parentheses.
In this example the LOAD statement includes a schema definition for simple data types.
A = LOAD 'data' AS (f1:int, f2:int);
Schemas with FOREACH Statements
With FOREACH statements, the schema following the AS keyword must be enclosed in parentheses when the FLATTEN operator is used. Otherwise, the schema should not be enclosed in parentheses.
In this example the FOREACH statement includes FLATTEN and a schema for simple data types.
X = FOREACH C GENERATE FLATTEN(B) AS (f1:int, f2:int, f3:int);
In this example the FOREACH statement includes a schema for simple data types.
X = FOREACH A GENERATE f1+f2 AS x1:int;
Schemas for Simple Data Types
Simple data types include int, long, float, double, chararray, and bytearray.
(alias[:type]) [, (alias[:type]) …] )
alias - The name assigned to the field.
type - (Optional) The simple data type assigned to the field.
The alias and type are separated by a colon ( : ).
If the type is omitted, the field defaults to type bytearray.
Multiple fields are enclosed in parentheses and separated by commas.
In this example the schema defines multiple types.
John 18 4.0
A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
DESCRIBE A;
A: {name: chararray,age: int,gpa: float}
(John,18,4.0F)
(Mary,19,3.8F)
(Bill,20,3.9F)
(Joe,18,3.8F)
In this example field "gpa" will default to bytearray because no type is declared.
John 18 4.0
Mary 19 3.8
Bill 20 3.9
Joe 18 3.8
A = LOAD 'data' AS (name:chararray, age:int, gpa);
DESCRIBE A;
A: {name: chararray,age: int,gpa: bytearray}
(John,18,4.0)
(Mary,19,3.8)
(Bill,20,3.9)
(Joe,18,3.8)
Schemas for Complex Data Types
Complex data types include tuples, bags, and maps.
Tuple Schema
A tuple is an ordered set of fields.
alias[:tuple] (alias[:type]) [, (alias[:type]) …] )
alias - The name assigned to the tuple.
:tuple - (Optional) The data type, tuple (case insensitive).
( ) - The designation for a tuple, a set of parentheses.
alias[:type] - The constituents of the tuple, where the schema definition rules for the corresponding type apply to the constituents of the tuple:
alias - the name assigned to the field
type (optional) - the simple or complex data type assigned to the field
In this example the schema defines one tuple. The load statements are equivalent.
A = LOAD 'data' AS (T: tuple (f1:int, f2:int, f3:int));
A = LOAD 'data' AS (T: (f1:int, f2:int, f3:int));
DESCRIBE A;
A: {T: (f1: int,f2: int,f3: int)}
In this example the schema defines two tuples.
(3,8,9) (mary,19)
(1,4,7) (john,18)
(2,5,8) (joe,18)
A = LOAD 'data' AS (F:tuple(f1:int,f2:int,f3:int),T:tuple(t1:chararray,t2:int));
DESCRIBE A;
A: {F: (f1: int,f2: int,f3: int),T: (t1: chararray,t2: int)}
((3,8,9),(mary,19))
((1,4,7),(john,18))
((2,5,8),(joe,18))
Bag Schema
A bag is a collection of tuples.
alias[:bag] {tuple}
alias - The name assigned to the bag.
:bag - (Optional) The data type, bag (case insensitive).
{ } - The designation for a bag, a set of curly brackets.
tuple - A tuple (see Tuple Schema).
In this example the schema defines a bag. The two load statements are equivalent.
A = LOAD 'data' AS (B: bag {T: tuple(t1:int, t2:int, t3:int)});
A = LOAD 'data' AS (B: {T: (t1:int, t2:int, t3:int)});
DESCRIBE A;
A: {B: {T: (t1: int,t2: int,t3: int)}}
({(3,8,9)})
({(1,4,7)})
({(2,5,8)})
Map Schema
A map is a set of key value pairs.
Syntax (where <> means optional)
alias<:map> [ ]
alias - The name assigned to the map.
:map - (Optional) The data type, map (case insensitive).
[ ] - The designation for a map, a set of straight brackets [ ].
In this example the schema defines a map. The load statements are equivalent.
[open#apache]
[apache#hadoop]
A = LOAD 'data' AS (M:map []);
A = LOAD 'data' AS (M:[]);
DESCRIBE A;
A: {M: map[ ]}
([open#apache])
([apache#hadoop])
Schemas for Multiple Types
You can define schemas for data that includes multiple types.
In this example the schema defines a tuple, bag, and map.
A = LOAD 'mydata' AS (T1:tuple(f1:int, f2:int), B:bag{T2:tuple(t1:float,t2:float)}, M:map[] );
A = LOAD 'mydata' AS (T1:(f1:int, f2:int), B:{T2:(t1:float,t2:float)}, M:[] );
Parameter Substitution
Description
Substitute values for parameters at run time.
Syntax: Specifying parameters using the Pig command line
pig {-param param_name = param_value | -param_file file_name} [-debug | -dryrun] script
Syntax: Specifying parameters using preprocessor statements in a Pig script
{%declare | %default} param_name param_value
Note: exec, run, and explain also support parameter substitution.
-param
Flag. Use this option when the parameter is included in the command line.
Multiple parameters can be specified. If the same parameter is specified multiple times, the last value will be used and a warning will be generated.
Command line parameters and parameter files can be combined with command line parameters taking precedence.
param_name
The name of the parameter.
The parameter name has the structure of a standard language identifier: it must start with a letter or underscore followed by any number of letters, digits, and underscores.
Parameter names are case insensitive.
If you pass a parameter to a script that the script does not use, this parameter is silently ignored. If the script has a parameter and no value is supplied or substituted, an error will result.
param_value
The value of the parameter.
A parameter value can take two forms:
A sequence of characters enclosed in single or double quotes. In this case the unquoted version of the value is used during substitution. Quotes within the value can be escaped with the backslash character ( \ ). Single word values that don't use special characters such as % or = don't have to be quoted.
A command enclosed in back ticks.
The value of a parameter, in either form, can be expressed in terms of other parameters as long as the values of the dependent parameters are already defined.
-param_file
Flag. Use this option when the parameter is included in a file.
Multiple files can be specified. If the same parameter is present multiple times in the file, the last value will be used and a warning will be generated. If a parameter is present in multiple files, the value from the last file will be used and a warning will be generated.
Command line parameters and parameter files can be combined with command line parameters taking precedence.
file_name
The name of a file containing one or more parameters.
A parameter file will contain one line per parameter. Empty lines are allowed. Perl-style (#) comment lines are also allowed. Comments must take a full line and # must be the first character on the line. Each parameter line will be of the form: param_name = param_value. White spaces around = are allowed but are optional.
-debug
Flag. With this option, the script is run and a fully substituted Pig script is produced in the current working directory named original_script_name.substituted.
-dryrun
Flag. With this option, the script is not run and a fully substituted Pig script is produced in the current working directory named original_script_name.substituted.
script
A pig script. The pig script must be the last element in the Pig command line.
If parameters are specified in the Pig command line or in a parameter file, the script should include a $param_name for each param_name included in the command line or parameter file.
If parameters are specified using the preprocessor statements, the script should include either %declare or %default.
In the script, parameter names can be escaped with the backslash character ( \ ) in which case substitution does not take place.
%declare
Preprocessor statement included in a Pig script.
Use to describe one parameter in terms of other parameters.
The declare statement is processed prior to running the Pig script.
The scope of a parameter value defined using declare is all the lines following the declare statement until the next declare statement that defines the same parameter is encountered.
%default
Preprocessor statement included in a Pig script.
Use to provide a default value for a parameter. The default value has the lowest priority and is used if a parameter value has not been defined by other means.
The default statement is processed prior to running the Pig script.
The scope is the same as for %declare.
Parameter substitution enables you to write Pig scripts that include parameters and to supply values for these parameters at run time. For instance, suppose you have a job that needs to run every day using the current day's data. You can create a Pig script that includes a parameter for the date. Then, when you run this script you can specify or supply a value for the date parameter using one of the supported methods.
Specifying Parameters
You can specify parameter names and parameter values as follows:
As part of a command line.
In a parameter file, as part of a command line.
With the declare statement, as part of a Pig script.
With the default statement, as part of a Pig script.
Precedence
Precedence for parameters is as follows:
Highest - parameters defined using the declare statement
Next - parameters defined in the command line
Lowest - parameters defined in a script
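As a sketch of the precedence rules (hypothetical script and command line), a declare in the script wins over a command-line value, which in turn wins over a default:

```pig
-- myscript.pig
%default DATE '19700101';   -- lowest priority
%declare DATE '20200101';   -- highest priority
A = LOAD '/data/mydata/$DATE';
```

Running "pig -param DATE=20100101 myscript.pig" would still load the '20200101' path, because the declare value takes precedence.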
Processing Order and Precedence
Parameters are processed as follows:
Command line parameters are scanned in the order they are specified on the command line.
Parameter files are scanned in the order they are specified on the command line. Within each file, the parameters are processed in the order they are listed.
Declare and default preprocessor statements are processed in the order they appear in the Pig script.
Example: Specifying parameters in the command line
Suppose we have a data file called 'mydata' and a pig script called 'myscript.pig'.
myscript.pig
A = LOAD '$data' USING PigStorage() AS (f1:int, f2:int, f3:int);
In this example the parameter (data) and the parameter value (mydata) are specified in the command line. If the parameter name in the command line (data) and the parameter name in the script ($data) do not match, the script will not run. If the value for the parameter (mydata) is not found, an error is generated.
$ pig -param data=mydata myscript.pig
Example: Specifying parameters using a parameter file
Suppose we have a parameter file called 'myparams'.
# my parameters
data1 = mydata1
cmd = `generate_name`
In this example the parameters and values are passed to the script using the parameter file.
$ pig -param_file myparams script2.pig
Example: Specifying parameters using the declare statement
In this example the command is executed and its stdout is used as the parameter value.
%declare CMD `generate_date`;
A = LOAD '/data/mydata/$CMD';
B = FILTER A BY $0 > '5';
Example: Specifying parameters using the default statement
In this example the parameter (DATE) and the value ('') are specified in the Pig script using the default statement. If a value for DATE is not specified elsewhere, the default value '' is used.
%default DATE '';
A = load '/data/mydata/$DATE';
Examples: Specifying parameter values as a sequence of characters
In this example the characters (in this case, Joe's URL) can be enclosed in single or double quotes, and quotes within the sequence of characters can be escaped.
%declare DES 'Joe\'s URL';
A = LOAD 'data' AS (name, description, url);
B = FILTER A BY description == '$DES';
In this example single word values that don't use special characters (in this case, mydata) don't have to be enclosed in quotes.
$ pig -param data=mydata myscript.pig
Example: Specifying parameter values as a command
In this example the command is enclosed in back ticks. First, the parameters mycmd and date are substituted when the declare statement is encountered. Then the resulting command is executed and its stdout is placed in the path before the load statement is run.
%declare CMD '$mycmd $date';
A = LOAD '/data/mydata/$CMD';
B = FILTER A BY $0 > '5';
Arithmetic Operators and More
Arithmetic Operators
Operator - Description
+ - addition
- - subtraction
* - multiplication
/ - division
% - modulo. Returns the remainder of a divided by b (a%b). Works with integral numbers (int, long).
? : - bincond
(condition ? value_if_true : value_if_false)
The bincond should be enclosed in parentheses.
The schemas for the two conditional outputs of the bincond should match.
Use expressions only (relational operators are not allowed).
Suppose we have relation A.
A = LOAD 'data' AS (f1:int, f2:int, B:bag{T:tuple(t1:int,t2:int)});
(10,1,{(2,3),(4,6)})
(10,3,{(2,3),(4,6)})
(10,6,{(2,3),(4,6),(5,7)})
In this example the modulo operator is used with fields f1 and f2.
X = FOREACH A GENERATE f1, f2, f1%f2;
In this example the bincond operator is used with fields f2 and B. The condition is "f2 equals 1"; if the condition is true, return 1; if the condition is false, return the count of the number of tuples in B.
X = FOREACH A GENERATE f2, (f2==1?1:COUNT(B));
Types Table: addition (+) and subtraction (-) operators
* bytearray cast as this data type
cast as int
cast as long
cast as float
cast as double
cast as double
Types Table: multiplication (*) and division (/) operators
* bytearray cast as this data type
cast as int
cast as long
cast as float
cast as double
cast as double
Types Table: modulo (%) operator
cast as int
cast as long
Comparison Operators
Description
less than
greater than
less than or equal to
greater than or equal to
pattern matching
Regular expression matching. Use the Java format for regular expressions.
Use the comparison operators with numeric and string data.
Example: numeric
X = FILTER A BY (f1 == 8);
Example: string
X = FILTER A BY (f2 == 'apache');
Example: matches
X = FILTER A BY (f1 matches '.*apache.*');
Types Table: equal (==) and not equal (!=) operators
* bytearray cast as this data type
boolean (see Note 1)
(see Note 2)
cast as boolean
cast as boolean
cast as boolean
cast as boolean
cast as boolean
Note 1: boolean (Tuple A is equal to tuple B if they have the same size s, and for all 0 <= i < s, A[i] == B[i])
Note 2: boolean (Map A is equal to map B if A and B have the same number of entries, and for every key k1 in A with a value of v1, there is a key k2 in B with a value of v2, such that k1 == k2 and v1 == v2)
boolean (bytearray cast as int)
boolean (bytearray cast as long)
boolean (bytearray cast as float)
boolean (bytearray cast as double)
boolean (bytearray cast as chararray)
Types Table: matches operator
*Cast as chararray (the second argument must be chararray)
bytearray*
Null Operators
Description
is null
is not null
X = FILTER A BY f1 is not null;
Types Table
The null operators can be applied to all data types. For more information, see Nulls.
Boolean Operators
Description
Pig does not support a boolean data type. However, the result of a boolean expression (an expression that includes boolean and comparison operators) is always of type boolean (true or false).
X = FILTER A BY (f1==8) OR (NOT (f2+f3 > f1));
Dereference Operators
Description
tuple dereference
tuple.id or tuple.(id, …)
Tuple dereferencing can be done by name (tuple.field_name) or position (tuple.$0). If a set of fields are dereferenced (tuple.(name1, name2) or tuple.($0, $1)), the expression represents a tuple composed of the specified fields. Note that if the dot operator is applied to a bytearray, the bytearray will be assumed to be a tuple.
bag dereference
bag.id or bag.(id, …)
Bag dereferencing can be done by name (bag.field_name) or position (bag.$0). If a set of fields are dereferenced (bag.(name1, name2) or bag.($0, $1)), the expression represents a bag composed of the specified fields.
map dereference
Map dereferencing must be done by key (field_name#key or $0#key). If the pound operator is applied to a bytearray, the bytearray is assumed to be a map. If the key does not exist, the empty string is returned.
Example: Tuple
Suppose we have relation A.
A = LOAD 'data' AS (f1:int, f2:tuple(t1:int,t2:int,t3:int));
(1,(1,2,3))
(2,(4,5,6))
(3,(7,8,9))
(4,(1,4,7))
(5,(2,5,8))
In this example dereferencing is used to retrieve two fields from tuple f2.
X = FOREACH A GENERATE f2.t1,f2.t3;
Example: Bag
Suppose we have relation B, formed by grouping relation A (see the GROUP operator for information about the field names in relation B).
A = LOAD 'data' AS (f1:int, f2:int,f3:int);
B = GROUP A BY f1;
(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
(7,{(7,2,5)})
(8,{(8,3,4),(8,4,3)})
ILLUSTRATE B;
----------------------------------------------------------
| group: int | a: bag({f1: int,f2: int,f3: int}) |
----------------------------------------------------------
In this example dereferencing is used with relation X to project the first field (f1) of each tuple in the bag (a).
X = FOREACH B GENERATE a.f1;
({(4),(4)})
({(8),(8)})
Example: Tuple and Bag
Suppose we have relation B, formed by grouping relation A (see the GROUP operator for information about the field names in relation B).
A = LOAD 'data' AS (f1:int, f2:int, f3:int);
B = GROUP A BY (f1,f2);
((1,2),{(1,2,3)})
((4,2),{(4,2,1)})
((4,3),{(4,3,3)})
((7,2),{(7,2,5)})
((8,3),{(8,3,4)})
((8,4),{(8,4,3)})
ILLUSTRATE B;
-------------------------------------------------------------------------------
| group: tuple({f1: int,f2: int}) | a: bag({f1: int,f2: int,f3: int}) |
-------------------------------------------------------------------------------
| {(8, 3, 4), (8, 3, 4)} |
-------------------------------------------------------------------------------
In this example dereferencing is used to project a field (f1) from a tuple (group) and a field (f1) from a bag (a).
X = FOREACH B GENERATE group.f1, a.f1;
Example: Map
Suppose we have relation A.
A = LOAD 'data' AS (f1:int, f2:map[]);
(1,[open#apache])
(2,[apache#hadoop])
(3,[hadoop#pig])
(4,[pig#grunt])
In this example dereferencing is used to look up the value of key 'open'.
X = FOREACH A GENERATE f2#'open';
Sign Operators
Description
positive
Has no effect.
negative (negation)
Changes the sign of a positive or negative number.
A = LOAD 'data' as (x, y, z);
B = FOREACH A GENERATE -x, y;
Types Table: negation ( - ) operator
double (as double)
Flatten Operator
The FLATTEN operator looks like a UDF syntactically, but it is actually an operator that changes the structure of tuples
and bags in a way that a UDF cannot. Flatten un-nests tuples as well as bags. The idea is the same, but the operation and
result is different for each type of structure.
For tuples, flatten substitutes the fields of a tuple in place of the tuple. For example, consider a relation that has a tuple
of the form (a, (b, c)). The expression GENERATE $0, flatten($1), will cause that tuple to become (a, b, c).
For bags, the situation becomes more complicated. When we un-nest a bag, we create new tuples. If we have a
relation that is made up of tuples of the form ({(b,c),(d,e)}) and we apply GENERATE flatten($0), we end up with two
tuples (b,c) and (d,e). When we remove a level of nesting in a bag, sometimes we cause a cross product to happen.
For example, consider a relation that has a tuple of the form (a, {(b,c), (d,e)}), commonly produced by the GROUP operator.
If we apply the expression GENERATE $0, flatten($1) to this tuple, we will create new tuples: (a, b, c) and (a, d, e).
For examples using the FLATTEN operator, see .
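The tuple and bag behaviors described above can also be sketched end to end. This is a hypothetical relation, not one of the manual's example files:

```
-- each row is (chararray, bag of (int,int) tuples), e.g. (a,{(1,2),(3,4)})
A = LOAD 'data' AS (a:chararray, B:bag{T:tuple(b:int, c:int)});
-- flattening the bag un-nests it: the row above becomes (a,1,2) and (a,3,4)
X = FOREACH A GENERATE a, FLATTEN(B);
```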
Cast Operators
Description
Pig Latin supports casts as shown in this table.
{(data_type) | (tuple(data_type)) | (bag{tuple(data_type)}) | (map[]) } field
(data_type)
The data type you want to cast to, enclosed in parentheses. You can cast to any data type except bytearray (see the table above).
The field whose type you want to change.
The field can be represented by positional notation or by name (alias). For example, if f1 is the first field and type int, you can cast to type long using (long)$0 or (long)f1.
Cast operators enable you to cast or convert data from one type to another, as long as conversion is supported (see the table above). For example, suppose you have an integer field, myint, which you want to convert to a string. You can cast this field from int to chararray using (chararray)myint.
Please note the following:
A field can be explicitly cast. Once cast, the field remains that type (it is not automatically cast back). In this example $0 is explicitly cast to int.
B = FOREACH A GENERATE (int)$0 + 1;
Where possible, Pig performs implicit casts. In this example $0 is cast to int (regardless of underlying data) and $1 is cast to double.
B = FOREACH A GENERATE $0 + 1, $1 + 1.0
When two bytearrays are used in arithmetic expressions or with built-in aggregate functions (such as SUM) they are implicitly cast to double. If the underlying data is really int or long, you'll get better performance by declaring the type or explicitly casting the data.
Downcasts may cause loss of data. For example casting from long to int may drop bits.
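As a sketch of the bytearray performance note above (the file name and field name are hypothetical):

```
A = LOAD 'data' AS (f1);               -- no type declared, so f1 is bytearray
B = FOREACH A GENERATE (int)f1 AS f1;  -- explicit cast; otherwise SUM would cast to double
C = GROUP B ALL;
X = FOREACH C GENERATE SUM(B.f1);
```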
In this example an int is cast to type chararray (see relation X).
A = LOAD 'data' AS (f1:int,f2:int,f3:int);
B = GROUP A BY f1;
(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
(7,{(7,2,5)})
(8,{(8,3,4),(8,4,3)})
DESCRIBE B;
B: {group: int,A: {f1: int,f2: int,f3: int}}
X = FOREACH B GENERATE group, (chararray)COUNT(A) AS total;
DESCRIBE X;
X: {group: int,total: chararray}
In this example a bytearray (fld in relation A) is cast to type tuple.
A = LOAD 'data' AS fld:bytearray;
DESCRIBE A;
a: {fld: bytearray}
B = FOREACH A GENERATE (tuple(int,int,float))fld;
DESCRIBE B;
b: {(int,int,float)}
In this example a bytearray (fld in relation A) is cast to type bag.
{(0522200L)}
{(2837493L)}
{(7398783L)}
A = LOAD 'data' AS fld:bytearray;
DESCRIBE A;
A: {fld: bytearray}
({(0522200L)})
({(2837493L)})
({(7398783L)})
B = FOREACH A GENERATE (bag{tuple(long)})fld;
DESCRIBE B;
B: {{(long)}}
({(0522200L)})
({(2837493L)})
({(7398783L)})
In this example a bytearray (fld in relation A) is cast to type map.
[open#apache]
[apache#hadoop]
[hadoop#pig]
[pig#grunt]
A = LOAD 'data' AS fld:bytearray;
DESCRIBE A;
A: {fld: bytearray}
([open#apache])
([apache#hadoop])
([hadoop#pig])
([pig#grunt])
B = FOREACH A GENERATE (map[])fld;
DESCRIBE B;
B: {map[ ]}
([open#apache])
([apache#hadoop])
([hadoop#pig])
([pig#grunt])
Relational Operators
COGROUP is the same as GROUP. For readability, programmers usually use GROUP when only one relation is involved and COGROUP when multiple relations are involved. See
for more information.
Computes the cross product of two or more relations.
alias = CROSS alias, alias [, alias …] [PARALLEL n];
The name of a relation.
PARALLEL n
Increase the parallelism of a job by specifying the number of reduce tasks, n. The default value for n is 1 (one reduce task). Note the following:
Parallel only affects the number of reduce tasks. Map parallelism is determined by the input file, one map for each HDFS block.
If you don't specify parallel, you still get the same map parallelism but only one reduce task.
For more information, see the .
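The PARALLEL clause is appended to the end of the statement; a minimal sketch (the value 10 is an arbitrary choice):

```
X = CROSS A, B PARALLEL 10;  -- request 10 reduce tasks for this job
```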
Use the CROSS operator to compute the cross product (Cartesian product) of two or more relations.
CROSS is an expensive operation and should be used sparingly.
Suppose we have relations A and B.
A = LOAD 'data1' AS (a1:int,a2:int,a3:int);
B = LOAD 'data2' AS (b1:int,b2:int);
In this example the cross product of relation A and B is computed.
X = CROSS A, B;
(1,2,3,2,4)
(1,2,3,8,9)
(1,2,3,1,3)
(4,2,1,2,4)
(4,2,1,8,9)
(4,2,1,1,3)
Removes duplicate tuples in a relation.
alias = DISTINCT alias [PARALLEL n];
The name of the relation.
PARALLEL n
Increase the parallelism of a job by specifying the number of reduce tasks, n. The default value for n is 1 (one reduce task). Note the following:
Parallel only affects the number of reduce tasks. Map parallelism is determined by the input file, one map for each HDFS block.
If you don't specify parallel, you still get the same map parallelism but only one reduce task.
For more information, see the .
Use the DISTINCT operator to remove duplicate tuples in a relation. DISTINCT does not preserve the original order of the contents (to eliminate duplicates, Pig must first sort the data). You cannot use DISTINCT on a subset of fields. To do this, use FOREACH … GENERATE to select the fields, and then use DISTINCT.
Suppose we have relation A.
A = LOAD 'data' AS (a1:int,a2:int,a3:int);
In this example all duplicate tuples are removed.
X = DISTINCT A;
Selects tuples from a relation based on some condition.
alias = FILTER alias BY expression;
The name of the relation.
Required keyword.
expression
A boolean expression.
Use the FILTER operator to work with tuples or rows of data (if you want to work with columns of data, use the FOREACH … GENERATE operation).
FILTER is commonly used to select the data that you want; or, conversely, to filter out (remove) the data you don't want.
Suppose we have relation A.
A = LOAD 'data' AS (f1:int,f2:int,f3:int);
In this example the condition states that if the third field equals 3, then include the tuple in relation X.
X = FILTER A BY f3 == 3;
In this example the condition states that if the first field equals 8, or if the sum of fields f2 and f3 is not greater than the first field, then include the tuple in relation X.
X = FILTER A BY (f1 == 8) OR (NOT (f2+f3 > f1));
Generates data transformations based on columns of data.
alias = FOREACH { gen_blk | nested_gen_blk } [AS schema];
The name of relation (outer bag).
FOREACH … GENERATE used with a relation (outer bag). Use this syntax:
alias = FOREACH alias GENERATE expression [, expression …];
nested_gen_blk
FOREACH … GENERATE used with an inner bag. Use this syntax:
alias = FOREACH nested_alias {
   alias = nested_op; [alias = nested_op; …]
   GENERATE expression [, expression …];
};
The nested block is enclosed in opening and closing brackets { … }.
The GENERATE keyword must be the last statement within the nested block.
expression
An expression.
nested_alias
The name of the inner bag.
Allowed operations are DISTINCT, FILTER, LIMIT, ORDER and SAMPLE.
The FOREACH … GENERATE operation itself is not allowed since this could lead to an arbitrary number of nesting levels.
A schema using the AS keyword (see Schemas).
If the FLATTEN operator is used, enclose the schema in parentheses.
If the FLATTEN operator is not used, don't enclose the schema in parentheses.
Use the FOREACH … GENERATE operation to work with columns of data (if you want to work with tuples or rows of data, use the FILTER operation).
FOREACH … GENERATE works with relations (outer bags) as well as inner bags:
If A is a relation (outer bag), a FOREACH statement could look like this.
X = FOREACH A GENERATE f1;
If A is an inner bag, a FOREACH statement could look like this.
X = FOREACH B {
   S = FILTER A BY 'xyz';
   GENERATE COUNT(S.$0);
}
Suppose we have relations A, B, and C (see the GROUP operator for information about the field names in relation C).
A = LOAD 'data1' AS (a1:int,a2:int,a3:int);
B = LOAD 'data2' AS (b1:int,b2:int);
C = COGROUP A BY a1 inner, B BY b1 inner;
(1,{(1,2,3)},{(1,3)})
(4,{(4,2,1),(4,3,3)},{(4,6),(4,9)})
(8,{(8,3,4),(8,4,3)},{(8,9)})
ILLUSTRATE C;
--------------------------------------------------------------------------------------
| group: int | a: bag({a1: int,a2: int,a3: int}) | B: bag({b1: int,b2: int}) |
--------------------------------------------------------------------------------------
| {(1, 2, 3)}
| {(1, 3)}
-------------------------------------------------------------------------------------
Example: Projection
In this example the asterisk (*) is used to project all tuples from relation A to relation X. Relations A and X are identical.
X = FOREACH A GENERATE *;
In this example two fields from relation A are projected to form relation X.
X = FOREACH A GENERATE a1, a2;
Example: Nested Projection
In this example, if one of the fields in the input relation is a tuple, bag or map, we can perform a projection on that field (using a dereference operator).
X = FOREACH C GENERATE group, B.b2;
(4,{(6),(9)})
In this example multiple nested columns are retained.
X = FOREACH C GENERATE group, A.(a1, a2);
(1,{(1,2)})
(4,{(4,2),(4,3)})
(8,{(8,3),(8,4)})
Example: Schema
In this example two fields in relation A are summed to form relation X. A schema is defined for the projected field.
X = FOREACH A GENERATE a1+a2 AS f1:int;
DESCRIBE X;
x: {f1: int}
Y = FILTER X BY f1 > 10;
Example: Applying Functions
In this example the built-in function SUM() is used to sum a set of numbers in a bag.
X = FOREACH C GENERATE group, SUM (A.a1);
Example: Flattening
In this example the FLATTEN operator is used to eliminate nesting.
X = FOREACH C GENERATE group, FLATTEN(A);
Another FLATTEN example.
X = FOREACH C GENERATE group, FLATTEN(A.a3);
Another FLATTEN example. Note that for the group '4' in C, there are two tuples in each bag. Thus, when both bags are flattened, the cross product of these tuples is returned; that is, tuples (4, 2, 6), (4, 3, 6), (4, 2, 9), and (4, 3, 9).
X = FOREACH C GENERATE FLATTEN(A.(a1, a2)), FLATTEN(B.$1);
Another FLATTEN example. Here, relations A and B both have a column x. When forming relation E,
you need to use the :: operator to identify which column x to use - either relation A column x (A::x) or relation B column x (B::x). This example uses relation A column x (A::x).
A = load 'data' as (x, y);
B = load 'data' as (x, z);
C = cogroup A by x, B by x;
D = foreach C generate flatten(A), flatten(B);
E = group D by A::x;
Example: Nested Block
Suppose we have relations A and B. Note that relation B contains an inner bag.
A = LOAD 'data' AS (url:chararray,outlink:chararray);
(,www.xyz.org)
(,www.cvn.org)
(,www.kpt.net)
(,www.xyz.org)
(,www.xyz.org)
B = GROUP A BY url;
(,{(,www.cvn.org)})
(,{(,www.xyz.org),(,www.xyz.org)})
(,{(,www.kpt.net),(,www.xyz.org)})
In this example we perform two of the operations allowed in a nested block, FILTER and DISTINCT. Note that the last statement in the nested block must be GENERATE.
X = foreach B {
   FA = FILTER A BY outlink == 'www.xyz.org';
   DA = DISTINCT FA;
   GENERATE group, COUNT(DA);
}
Groups the data in one or more relations. GROUP is the same as COGROUP. For
readability, programmers usually use GROUP when only one relation is involved and COGROUP when multiple relations are involved.
alias = GROUP alias { ALL | BY expression } [, alias ALL | BY expression …]
[USING 'collected'] [PARALLEL n];
The name of a relation.
Keyword. Use ALL if you want all tuples to go to a single group;
for example, when doing aggregates across entire relations.
B = GROUP A ALL;
Keyword. Use this clause to group the relation by field, tuple or expression.
B = GROUP A BY f1;
expression
A tuple expression. This is the group key or key field. If the result of the tuple expression is a single field, the key will be the value of the first field rather than a tuple with one field. To group using multiple keys, enclose the keys in parentheses:
B = GROUP A BY (key1,key2);
'collected'
Allows for more efficient computation of a group if the loader guarantees that the data for the
same key is continuous and is given to a single map. As of this release, only the Zebra loader makes this
guarantee. The efficiency is achieved by performing the group operation in map
rather than reduce (see ). This feature cannot be used with the COGROUP operator.
PARALLEL n
Increase the parallelism of a job by specifying the number of reduce tasks, n. The default value for n is 1 (one reduce task). Note the following:
Parallel only affects the number of reduce tasks. Map parallelism is determined by the input file, one map for each HDFS block.
If you don't specify parallel, you still get the same map parallelism but only one reduce task.
For more information, see the .
The GROUP operator groups together tuples that have the same group key (key field). The key field will be a tuple if the group key has more than one field, otherwise it will be the same type as that of the group key. The result of a GROUP operation is a relation that includes one tuple per group. This tuple contains two fields:
The first field is named "group" (do not confuse this with the GROUP operator) and is the same type as the group key.
The second field takes the name of the original relation and is type bag.
The names of both fields are generated by the system as shown in the example below.
Note that the GROUP (and thus COGROUP) and JOIN operators perform similar functions. GROUP creates a nested set of output tuples while JOIN creates a flat set of output tuples.
Suppose we have relation A.
A = load 'student' AS (name:chararray,age:int,gpa:float);
DESCRIBE A;
A: {name: chararray,age: int,gpa: float}
(John,18,4.0F)
(Mary,19,3.8F)
(Bill,20,3.9F)
(Joe,18,3.8F)
Now, suppose we group relation A on field "age" to form relation B. We can use the DESCRIBE and ILLUSTRATE operators to examine the structure of relation B. Relation B has two fields. The first field is named "group" and is type int, the same as field "age" in relation A. The second field is named "A" after relation A and is type bag.
B = GROUP A BY age;
DESCRIBE B;
B: {group: int, A: {name: chararray,age: int,gpa: float}}
ILLUSTRATE B;
----------------------------------------------------------------------
| group: int | A: bag({name: chararray,age: int,gpa: float}) |
----------------------------------------------------------------------
| {(John, 18, 4.0), (Joe, 18, 3.8)}
| {(Bill, 20, 3.9)}
----------------------------------------------------------------------
(18,{(John,18,4.0F),(Joe,18,3.8F)})
(19,{(Mary,19,3.8F)})
(20,{(Bill,20,3.9F)})
Continuing on, as shown in these FOREACH statements, we can refer to the fields in relation B by names "group" and "A" or by positional notation.
C = FOREACH B GENERATE group, COUNT(A);
C = FOREACH B GENERATE $0, $1.name;
(18,{(John),(Joe)})
(19,{(Mary)})
(20,{(Bill)})
Suppose we have relation A.
A = LOAD 'data' as (f1:chararray, f2:int, f3:int);
In this example the tuples are grouped using an expression, f2*f3.
X = GROUP A BY f2*f3;
(2,{(r1,1,2),(r2,2,1)})
(16,{(r3,2,8),(r4,4,4)})
Suppose we have two relations, A and B.
A = LOAD 'data1' AS (owner:chararray,pet:chararray);
(Alice,turtle)
(Alice,goldfish)
(Alice,cat)
B = LOAD 'data2' AS (friend1:chararray,friend2:chararray);
(Cindy,Alice)
(Mark,Alice)
(Paul,Bob)
(Paul,Jane)
In this example tuples are co-grouped using field 'owner' from relation A and field 'friend2' from relation B as the key fields. The DESCRIBE operator shows the schema for relation X, which has three fields, "group", "A", and "B" (see the GROUP operator for information about the field names).
X = COGROUP A BY owner, B BY friend2;
DESCRIBE X;
X: {group: chararray,A: {owner: chararray,pet: chararray},B: {friend1: chararray,friend2: chararray}}
Relation X looks like this. A tuple is created for each unique key field. The tuple includes the key field and two bags. The first bag is the tuples from the first relation with the matching key field. The second bag is the tuples from the second relation with the matching key field. If no tuples match the key field, the bag is empty.
(Alice,{(Alice,turtle),(Alice,goldfish),(Alice,cat)},{(Cindy,Alice),(Mark,Alice)})
(Bob,{(Bob,dog),(Bob,cat)},{(Paul,Bob)})
(Jane,{},{(Paul,Jane)})
In this example tuples are co-grouped and the INNER keyword is used to ensure that only bags with at least one tuple are returned.
X = COGROUP A BY owner INNER, B BY friend2 INNER;
(Alice,{(Alice,turtle),(Alice,goldfish),(Alice,cat)},{(Cindy,Alice),(Mark,Alice)})
(Bob,{(Bob,dog),(Bob,cat)},{(Paul,Bob)})
In this example tuples are co-grouped and the INNER keyword is used asymmetrically on only one of the relations.
X = COGROUP A BY owner, B BY friend2 INNER;
(Bob,{(Bob,dog),(Bob,cat)},{(Paul,Bob)})
(Jane,{},{(Paul,Jane)})
(Alice,{(Alice,turtle),(Alice,goldfish),(Alice,cat)},{(Cindy,Alice),(Mark,Alice)})
This example shows how to group using multiple keys.
A = LOAD 'allresults' USING PigStorage() AS (tcid:int, tpid:int, date:chararray, result:chararray, tsid:int, tag:chararray);
B = GROUP A BY (tcid, tpid);
This example shows a map-side group.
register zebra.jar;
A = LOAD 'studentsortedtab' USING org.apache.hadoop.zebra.pig.TableLoader('name, age, gpa', 'sorted');
B = GROUP A BY name USING 'collected';
C = FOREACH B GENERATE group, MAX(A.age), COUNT_STAR(A);
JOIN (inner)
Performs an inner equijoin of two or more relations based on common field values.
alias = JOIN alias BY {expression|'('expression [, expression …]')'} (, alias BY {expression|'('expression [, expression …]')'} …) [USING 'replicated' | 'skewed' | 'merge'] [PARALLEL n];
The name of a relation.
expression
A field expression.
Example: X = JOIN A BY fieldA, B BY fieldB, C BY fieldC;
'replicated'
Use to perform replicated joins (see ).
Use to perform skewed joins (see ).
Use to perform merge joins (see ).
PARALLEL n
Increase the parallelism of a job by specifying the number of reduce tasks, n. The default value for n is 1 (one reduce task). Note the following:
Parallel only affects the number of reduce tasks. Map parallelism is determined by the input file, one map for each HDFS block.
If you don't specify parallel, you still get the same map parallelism but only one reduce task.
For more information, see the .
Use the JOIN operator to perform an inner equijoin of two or more relations based on common field values.
The JOIN operator always performs an inner join. Inner joins ignore null keys, so it makes sense to filter them out before the join.
Note that the JOIN and COGROUP operators perform similar functions.
JOIN creates a flat set of output records while COGROUP creates a nested set of output records.
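Filtering out null keys before the join, as suggested above, can be sketched like this (the join fields a1 and b1 are assumptions):

```
A1 = FILTER A BY a1 is not null;  -- an inner join would drop these rows anyway
B1 = FILTER B BY b1 is not null;
X = JOIN A1 BY a1, B1 BY b1;
```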
Suppose we have relations A and B.
A = LOAD 'data1' AS (a1:int,a2:int,a3:int);
B = LOAD 'data2' AS (b1:int,b2:int);
In this example relations A and B are joined by their first fields.
X = JOIN A BY a1, B BY b1;
(1,2,3,1,3)
(4,2,1,4,6)
(4,3,3,4,6)
(4,2,1,4,9)
(4,3,3,4,9)
(8,3,4,8,9)
(8,4,3,8,9)
JOIN (outer)
Performs an outer join of two or more relations based on common field values.
alias = JOIN left-alias BY left-alias-column [LEFT|RIGHT|FULL] [OUTER], right-alias BY right-alias-column
[USING 'replicated' | 'skewed'] [PARALLEL n];
The name of a relation. Applies to alias, left-alias and right-alias.
alias-column
The name of the join column for the corresponding relation. Applies to left-alias-column and right-alias-column.
Left outer join.
Right outer join.
Full outer join.
(Optional) Keyword
'replicated'
Use to perform replicated joins (see ).
Only left outer join is supported for replicated joins.
Use to perform skewed joins (see ).
PARALLEL n
Increase the parallelism of a job by specifying the number of reduce tasks, n. The default value for n is 1 (one reduce task). Note the following:
Parallel only affects the number of reduce tasks. Map parallelism is determined by the input file, one map for each HDFS block.
If you don't specify parallel, you still get the same map parallelism but only one reduce task.
For more information, see the .
Use the OUTER JOIN operator to perform left, right, or full outer joins. The Pig Latin syntax closely adheres to the SQL standard.
The keyword OUTER is optional for outer joins (the keywords LEFT, RIGHT and FULL will imply left outer, right outer and full outer joins respectively
when OUTER is omitted).
Please note the following:
Outer joins will only work provided the relations which need to produce nulls (in the case of non-matching keys) have schemas.
Outer joins will only work for two-way joins; to perform a multi-way outer join, you will need to perform multiple two-way outer join statements.
This example shows a left outer join.
A = LOAD 'a.txt' AS (n:chararray, a:int);
B = LOAD 'b.txt' AS (n:chararray, m:chararray);
C = JOIN A by $0 LEFT OUTER, B BY $0;
This example shows a full outer join.
A = LOAD 'a.txt' AS (n:chararray, a:int);
B = LOAD 'b.txt' AS (n:chararray, m:chararray);
C = JOIN A BY $0 FULL, B BY $0;
This example shows a replicated left outer join.
A = LOAD 'large';
B = LOAD 'tiny';
C= JOIN A BY $0 LEFT, B BY $0 USING 'replicated';
This example shows a skewed full outer join.
A = LOAD 'studenttab' as (name, age, gpa);
B = LOAD 'votertab' as (name, age, registration, contribution);
C = JOIN A BY name FULL, B BY name USING 'skewed';
Limits the number of output tuples.
alias = LIMIT alias n;
The name of a relation.
The number of tuples.
Use the LIMIT operator to limit the number of output tuples. If the specified number of output tuples is equal to or exceeds the number of tuples in the relation, the output will include all tuples in the relation.
There is no guarantee which tuples will be returned, and the tuples that are returned can change from one run to the next. A particular set of tuples can be requested using the ORDER operator followed by LIMIT.
Note: The LIMIT operator allows Pig to avoid processing all tuples in a relation. In most cases a query that uses LIMIT will run more efficiently than an identical query that does not use LIMIT. It is always a good idea to use LIMIT if you can.
Suppose we have relation A.
A = LOAD 'data' AS (a1:int,a2:int,a3:int);
In this example output is limited to 3 tuples. Note that there is no guarantee which three tuples will be output.
X = LIMIT A 3;
In this example the ORDER operator is used to order the tuples and the LIMIT operator is used to output the first three tuples.
B = ORDER A BY f1 DESC, f2 ASC;
X = LIMIT B 3;
Loads data from the file system.
LOAD 'data' [USING function] [AS schema];
The name of the file or directory, in single quotes.
If you specify a directory name, all the files in the directory are loaded.
You can use Hadoop-supported globbing to specify files at the file system or directory levels (see Hadoop
for details on globbing syntax).
If the USING clause is omitted, the default load function PigStorage is used.
The load function.
You can use a built-in function (see the ). PigStorage is the default load function and does not need to be specified (simply omit the USING clause).
You can write your own load function
if your data is in a format that cannot be processed by the built-in functions (see the ).
A schema using the AS keyword, enclosed in parentheses (see Schemas).
The loader produces the data of the type specified by the schema. If the data does not conform to the schema, depending on the loader, either a null value or an error is generated.
Note: For performance reasons the loader may not immediately convert the data to the specified type;
however, you can still operate on the data assuming the specified type.
Use the LOAD operator to load data from the file system.
Suppose we have a data file called myfile.txt. The fields are tab-delimited. The records are newline-separated.
In this example the default load function, PigStorage, loads data from myfile.txt to form relation A. The two LOAD statements are equivalent. Note that, because no schema is specified, the fields are not named and all fields default to type bytearray.
A = LOAD 'myfile.txt';
A = LOAD 'myfile.txt' USING PigStorage('\t');
In this example a schema is specified using the AS keyword. The two LOAD statements are equivalent. You can use the DESCRIBE and ILLUSTRATE operators to view the schema.
A = LOAD 'myfile.txt' AS (f1:int, f2:int, f3:int);
A = LOAD 'myfile.txt' USING PigStorage('\t') AS (f1:int, f2:int, f3:int);
DESCRIBE A;
a: {f1: int,f2: int,f3: int}
ILLUSTRATE A;
---------------------------------------------------------
| f1: bytearray | f2: bytearray | f3: bytearray |
---------------------------------------------------------
---------------------------------------------------------
---------------------------------------
| f1: int | f2: int | f3: int |
---------------------------------------
---------------------------------------
For examples of how to specify more complex schemas for use with the LOAD operator, see Schemas for Complex Data Types and Schemas for Multiple Types.
Sorts a relation based on one or more fields.
alias = ORDER alias BY { * [ASC|DESC] | field_alias [ASC|DESC] [, field_alias [ASC|DESC] …] } [PARALLEL n];
The name of a relation.
Required keyword.
The designator for a tuple.
Sort in ascending order.
Sort in descending order.
field_alias
A field in the relation.
PARALLEL n
Increase the parallelism of a job by specifying the number of reduce tasks, n. The default value for n is 1 (one reduce task). Note the following:
Parallel only affects the number of reduce tasks. Map parallelism is determined by the input file, one map for each HDFS block.
If you don't specify parallel, you still get the same map parallelism but only one reduce task.
For more information, see the .
In Pig, relations are unordered (see Relations, Bags, Tuples, and Fields):
If you order relation A to produce relation X (X = ORDER A BY * DESC;) relations A and X still contain the same thing.
If you retrieve the contents of relation X (DUMP X;) they are guaranteed to be in the order you specified (descending).
However, if you further process relation X (Y = FILTER X BY $0 > 1;) there is no guarantee that the contents will be processed in the order you originally specified (descending).
Suppose we have relation A.
A = LOAD 'data' AS (a1:int,a2:int,a3:int);
In this example relation A is sorted by the third field, a3, in descending order. Note that the order of the three tuples ending in 3 can vary.
X = ORDER A BY a3 DESC;
Selects a random sample of data based on the specified sample size.
The name of a relation.
Sample size, range 0 to 1 (for example, enter 0.1 for 10%).
Use the SAMPLE operator to select a random data sample with the stated sample size.
SAMPLE is a probabilistic operator; there is no guarantee that the exact same number of tuples will be returned for a particular sample size
each time the operator is used.
In this example relation X will contain 1% of the data in relation A.
A = LOAD 'data' AS (f1:int,f2:int,f3:int);
X = SAMPLE A 0.01;
Partitions a relation into two or more relations.
SPLIT alias INTO alias IF expression, alias IF expression [, alias IF expression …];
The name of a relation.
Required keyword.
Required keyword.
expression
An expression.
Use the SPLIT operator to partition the contents of a relation into two or more relations based on some expression. Depending on the conditions stated in the expression:
A tuple may be assigned to more than one relation.
A tuple may not be assigned to any relation.
In this example relation A is split into three relations, X, Y, and Z.
A = LOAD 'data' AS (f1:int,f2:int,f3:int);
SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6);
Stores or saves results to the file system.
STORE alias INTO 'directory' [USING function];
The name of a relation.
Required keyword.
'directory'
The name of the storage directory, in quotes. If the directory already exists, the STORE operation will fail.
The output data files, named part-nnnnn, are written to this directory.
Keyword. Use this clause to name the store function.
If the USING clause is omitted, the default store function PigStorage is used.
