kayvonf

A good comment opportunity: describe what one of the RDD transformations at the top of this table does.

ChandlerBing

flatMap(): The function f is applied to each element of the input RDD of type T; each application returns a sequence of outputs of type U, and those sequences are flattened into the resulting RDD. (Source: Spark Programming Guide)
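
A minimal sketch in the Scala API (the SparkContext sc and the input data are illustrative assumptions):

```scala
// flatMap: f is applied to every element, each application yields a
// sequence, and the sequences are flattened into a single RDD[U].
val lines = sc.parallelize(Seq("hello spark", "hello world"))   // RDD[String]
val words = lines.flatMap(line => line.split(" "))              // RDD[String]
// words contains: "hello", "spark", "hello", "world"
```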

landuo

join(): Return an RDD containing all pairs of elements with matching keys in this and other. Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in this and (k, v2) is in other. Performs a hash join across the cluster.

Source: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions

More on hash join: http://en.wikipedia.org/wiki/Hash_join
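
A small sketch of the join behavior (assuming an existing SparkContext sc; the data and names are made up):

```scala
val ages   = sc.parallelize(Seq(("alice", 30), ("bob", 25)))
val cities = sc.parallelize(Seq(("alice", "PGH"), ("bob", "SFO"), ("carol", "NYC")))
// Only keys present in both inputs survive the join:
// ("alice", (30, "PGH")), ("bob", (25, "SFO")) -- "carol" is dropped.
val joined = ages.join(cities)   // RDD[(String, (Int, String))]
```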

sanchuah

map(): apply function f to each entry t of an RDD of type T to generate a new RDD of type U whose entries are u = f(t)
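
For comparison with flatMap above, a minimal sketch (assuming a SparkContext sc is already available):

```scala
val nums    = sc.parallelize(1 to 5)   // RDD[Int], entries t
val squares = nums.map(x => x * x)     // RDD[Int], entries u = f(t)
```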

mingf

Transformations are operations on RDDs that return a new RDD. It is worth noting that Spark defers the actual computation of a transformed RDD until you use it in an action. filter() returns a new RDD that contains only the elements for which the given predicate is true.
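
A small sketch of the deferred evaluation (assuming a SparkContext sc; the data is illustrative):

```scala
val nums  = sc.parallelize(1 to 1000000)
// filter() only records the transformation in the RDD's lineage; nothing runs yet.
val evens = nums.filter(x => x % 2 == 0)
// count() is an action, so the filter is actually executed here.
val count = evens.count()   // 500000
```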

ankit1990

filter(): applies a predicate f to each entry of the RDD and returns a new RDD containing only the entries for which f returns true.

msebek

sample(fraction: Float): a more precise signature is sample(withReplacement: Boolean, fraction: Float, seed: Long), where 0 <= fraction <= 1.

Sample a fraction of the data, chosen randomly, using seed as the seed value for the random number generator. This random-but-deterministic approach allows experiments/bugs/strange behavior to be replayed. The sample is returned as a new RDD for further computation.
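
A sketch of a call with that signature (assuming a SparkContext sc; the fraction and seed are arbitrary):

```scala
val data = sc.parallelize(1 to 100)
// Roughly 10% of the elements, sampled without replacement.
// Reusing the same seed makes the sample reproducible across runs.
val tenPercent = data.sample(withReplacement = false, fraction = 0.1, seed = 42L)
```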

zhanpenf

partitionBy(partitioner: Partitioner, mapSideCombine: Boolean = false):

Return a copy of the RDD partitioned using the specified partitioner. If mapSideCombine is true, Spark will group values of the same key before the repartitioning, to only send each key over the network once.
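
A sketch of repartitioning a pair RDD with a hash partitioner (assuming a SparkContext sc; the data and partition count are made up):

```scala
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)))
// All pairs with the same key land in the same partition, so later
// key-based operations (joins, groupByKey) can avoid extra shuffles.
val partitioned = pairs.partitionBy(new HashPartitioner(4))
```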

uncreative

In class it was asked whether the function given to reduce has to be associative.

According to the Spark documentation, it must be both commutative and associative:

def reduce(f: (T, T) => T): T
Reduces the elements of this RDD using the specified commutative and associative binary operator.

It's also interesting to look at the full list of available operations.
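
A minimal sketch of why the operator's properties matter (assuming a SparkContext sc):

```scala
val nums = sc.parallelize(1 to 100)
// Addition is commutative and associative, so the result does not depend
// on how the per-partition partial sums are combined.
val total = nums.reduce((a, b) => a + b)   // 5050
// An operator like subtraction is neither, and would give a result that
// depends on partitioning and evaluation order.
```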

afzhang

cogroup(): takes two key-value RDDs and returns a new RDD in which each key is paired with a tuple of two lists: the values for that key in the first input RDD and the values for that key in the second.
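
A small sketch (assuming a SparkContext sc; the data is made up):

```scala
val scores = sc.parallelize(Seq(("alice", 90), ("alice", 85), ("bob", 70)))
val labels = sc.parallelize(Seq(("alice", "A"), ("carol", "B")))
// Every key from either input appears, paired with both value lists:
// ("alice", ([90, 85], ["A"])), ("bob", ([70], [])), ("carol", ([], ["B"]))
val grouped = scores.cogroup(labels)
```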

muyangya

mapValues() is an operation that only makes sense for a key-value pair RDD. It passes each value of the key-value pairs through a "map" function without changing the keys. Because the keys are unchanged, the operation also retains the original RDD's partitioning.
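
A minimal sketch (assuming a SparkContext sc):

```scala
val pairs  = sc.parallelize(Seq(("a", 1), ("b", 2)))
// Keys are untouched, so the partitioning of `pairs` is preserved.
val scaled = pairs.mapValues(v => v * 10)   // ("a", 10), ("b", 20)
```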

abist

sort(c): sorts a key-value RDD by key, using a comparator c on the key type K
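
In the Scala API the closest operation is sortByKey(); a minimal sketch (assuming a SparkContext sc):

```scala
val pairs  = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b")))
// Sorts the pair RDD by key using the key type's ordering.
val sorted = pairs.sortByKey()   // (1, "a"), (2, "b"), (3, "c")
```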

rhnil

union() returns a new RDD that contains the elements from both input RDDs. Duplicate elements will not be removed; calling distinct() on the result of union() removes them.
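
A minimal sketch (assuming a SparkContext sc):

```scala
val a = sc.parallelize(Seq(1, 2, 3))
val b = sc.parallelize(Seq(3, 4))
val withDups    = a.union(b)              // 1, 2, 3, 3, 4 (duplicates kept)
val withoutDups = a.union(b).distinct()   // 1, 2, 3, 4
```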