A good comment opportunity: describe what one of the RDD transformations at the top of this table does.
ChandlerBing
flatMap(): The function f is applied to each input element of the RDD of type T and returns a sequence of outputs of type U; the resulting sequences are flattened into a single RDD of type U. (Source: Spark Programming Guide)
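To make the flattening step concrete, here is a plain-Python sketch of the semantics, with an ordinary list standing in for the RDD (this mimics the behavior; it is not Spark's actual API):

```python
# Plain-Python sketch of flatMap semantics (a list stands in for the RDD).
def flat_map(f, rdd):
    # f maps each element of type T to a sequence of type U;
    # the resulting sequences are flattened into one output collection.
    return [u for t in rdd for u in f(t)]

lines = ["to be", "or not"]
words = flat_map(lambda s: s.split(), lines)
# words == ["to", "be", "or", "not"]
```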
landuo
join(): Return an RDD containing all pairs of elements with matching keys in this and other. Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in this and (k, v2) is in other. Performs a hash join across the cluster.
Source: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions
More on hash join: http://en.wikipedia.org/wiki/Hash_join
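The hash-join idea can be sketched in plain Python on local lists of key-value pairs (an illustration of the semantics, not Spark's implementation):

```python
# Plain-Python sketch of join semantics: hash join on the key.
def join(this, other):
    # Build a hash table over `other`, then probe it with `this`,
    # emitting (k, (v1, v2)) for every matching pair of values.
    table = {}
    for k, v2 in other:
        table.setdefault(k, []).append(v2)
    return [(k, (v1, v2)) for k, v1 in this for v2 in table.get(k, [])]

pairs = join([("a", 1), ("b", 2)], [("a", "x"), ("a", "y")])
# pairs == [("a", (1, "x")), ("a", (1, "y"))]; "b" has no match, so it is dropped
```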
map(): apply function f on each entry t of RDD T to generate new RDD U whose entries u = f(t)
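In plain-Python terms (a sketch of the semantics, with a list standing in for the RDD), map is a one-to-one transformation:

```python
# Plain-Python sketch of map semantics: each output entry is u = f(t).
def rdd_map(f, rdd):
    return [f(t) for t in rdd]

squares = rdd_map(lambda t: t * t, [1, 2, 3])
# squares == [1, 4, 9]
```

Unlike flatMap, the output has exactly one entry per input entry.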
mingf
Transformations are operations on RDDs that return a new RDD. It is worth noting that Spark defers the actual computation of a transformed RDD until you use it in an action. filter() returns a new RDD that contains only the elements for which the given predicate is true.
ankit1990
filter(): applies function f to each entry of the RDD (T) to generate a new RDD (U) containing only the entries for which f returns true.
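A plain-Python sketch of the same semantics (a list stands in for the RDD; this is an illustration, not Spark's API):

```python
# Plain-Python sketch of filter semantics.
def rdd_filter(f, rdd):
    # Keep only the entries for which the predicate f returns True.
    return [t for t in rdd if f(t)]

evens = rdd_filter(lambda t: t % 2 == 0, [1, 2, 3, 4])
# evens == [2, 4]
```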
msebek
sample(fraction: Float), where 0 <= fraction <= 1.
A more precise signature is:
sample(withReplacement: Boolean, fraction: Float, seed: Long), with 0 <= fraction <= 1.
Sample a fraction of the data, chosen randomly, using seed as the seed value for the random number generator. This random-but-deterministic approach allows experiments/bugs/strange behavior to be replayed. Returns this sample for further computation.
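The random-but-deterministic point can be shown with a plain-Python sketch of sampling without replacement (an illustration of the semantics; `sample` here is a local helper, not Spark's method):

```python
import random

# Sketch of sample-without-replacement semantics: the same seed always
# yields the same sample, so runs can be replayed exactly.
def sample(rdd, fraction, seed):
    rng = random.Random(seed)  # seeded RNG => deterministic choices
    return [t for t in rdd if rng.random() < fraction]

data = list(range(100))
assert sample(data, 0.2, seed=42) == sample(data, 0.2, seed=42)  # replayable
```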
partitionBy(partitioner: Partitioner, mapSideCombine: Boolean = false):
Return a copy of the RDD partitioned using the specified partitioner. If mapSideCombine is true, Spark will group values with the same key before the repartitioning, so that each key is sent over the network only once.
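The core idea of a hash partitioner can be sketched locally in plain Python (an illustration only; `partition_by` is a hypothetical helper, not Spark's API):

```python
# Sketch of hash partitioning: each key-value pair is routed to a
# partition by hashing its key, so equal keys land in the same partition.
def partition_by(pairs, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for k, v in pairs:
        partitions[hash(k) % num_partitions].append((k, v))
    return partitions

parts = partition_by([("a", 1), ("b", 2), ("a", 3)], 4)
# Both entries with key "a" end up in the same partition.
```

Co-locating equal keys this way is what lets later per-key operations (join, reduceByKey) avoid another shuffle.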
uncreative
In class it was asked whether or not reduce was associative.
According to the Spark documentation, the function given to reduce must be associative (and, in fact, commutative as well):
def reduce(f: (T, T) => T): T
Reduces the elements of this RDD using the specified commutative and associative binary operator.
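The reason for the requirement can be sketched in plain Python: Spark reduces each partition independently and then combines the partial results, so the grouping of applications is not fixed (this is an illustration of the idea, not Spark's implementation):

```python
from functools import reduce as fold

# Sketch of a distributed reduce: reduce each partition locally,
# then combine the partial results. Only an associative, commutative
# operator guarantees the same answer regardless of partitioning.
def spark_reduce(f, partitions):
    partials = [fold(f, part) for part in partitions]  # per-partition reduce
    return fold(f, partials)                           # combine partials

add = lambda a, b: a + b
total = spark_reduce(add, [[1, 2], [3, 4, 5]])
# total == 15, the same as reducing [1, 2, 3, 4, 5] in a single pass
```

A non-associative operator like subtraction would give different answers for different partitionings.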
It's also interesting to look at the full list of available operations.
afzhang
cogroup(): takes two key-value RDDs and returns a new RDD in which each key is paired with a tuple containing the list of its values from each of the input RDDs.
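A plain-Python sketch of the semantics (lists stand in for the RDDs; keys missing from one input get an empty list on that side):

```python
# Sketch of cogroup semantics: for each key in either input, pair it
# with the tuple of its value lists from each input.
def cogroup(rdd1, rdd2):
    keys = {k for k, _ in rdd1} | {k for k, _ in rdd2}
    return {k: ([v for kk, v in rdd1 if kk == k],
                [v for kk, v in rdd2 if kk == k]) for k in keys}

grouped = cogroup([("a", 1), ("a", 2)], [("a", "x"), ("b", "y")])
# grouped == {"a": ([1, 2], ["x"]), "b": ([], ["y"])}
```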
muyangya
mapValues() is an operation that only makes sense for a key-value pair RDD. It passes each value of the pair to a "map" function without changing its key. Because the keys are untouched, the operation also retains the original RDD's partitioning.
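In plain-Python terms (a sketch of the semantics on a local list of pairs):

```python
# Sketch of mapValues semantics: transform each value, leave keys
# (and hence the key-based partitioning) untouched.
def map_values(f, pairs):
    return [(k, f(v)) for k, v in pairs]

doubled = map_values(lambda v: v * 2, [("a", 1), ("b", 2)])
# doubled == [("a", 2), ("b", 4)]
```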
abist
sort(c): sorts a key-value RDD based on a comparator c for K (key)
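The comparator-on-keys idea can be sketched in plain Python (an illustration; `sort_by_key` and the three-way comparator are local helpers, not Spark's API):

```python
import functools

# Sketch of sort semantics: order key-value pairs by a comparator c
# over the keys (c returns negative/zero/positive, like a C comparator).
def sort_by_key(pairs, c):
    return sorted(pairs, key=functools.cmp_to_key(lambda p, q: c(p[0], q[0])))

three_way = lambda x, y: (x > y) - (x < y)  # natural ordering on keys
by_key = sort_by_key([("b", 2), ("a", 1)], c=three_way)
# by_key == [("a", 1), ("b", 2)]
```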
rhnil
union() returns a new RDD that contains the elements from either of the input RDDs. Duplicate elements will not be removed. A call to union() followed by distinct() can be used to remove the duplicates.
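A plain-Python sketch of the semantics (lists stand in for the RDDs; a set stands in for the distinct() step):

```python
# Sketch of union semantics: concatenate the inputs, keeping duplicates.
def union(rdd1, rdd2):
    return rdd1 + rdd2

merged = union([1, 2, 2], [2, 3])
# merged == [1, 2, 2, 2, 3]; a distinct() step applied afterwards
# (mimicked here by sorted(set(merged))) removes the duplicates.
deduped = sorted(set(merged))
# deduped == [1, 2, 3]
```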