kayvonf

I'd like to see someone describe how this code works.

rflood

Note: This comes from lecture discussions and the Python documentation for Apache Spark at https://spark.apache.org/docs/latest/api/python/pyspark.html. I have tried to keep the description general.

First, Spark partitions the lines RDD across its workers, hopefully in some smart way (an equal distribution, or one adjusted for each worker's performance).
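
For instance, here is a minimal PySpark sketch of the idea, with a parallelized toy list standing in for the real lines RDD (sc.textFile(path, minPartitions=4) would do the same for an actual log file):

```python
from pyspark import SparkContext

sc = SparkContext("local[4]", "partition-demo")

# Toy stand-in for the slide's lines RDD, split into 4 partitions.
lines = sc.parallelize(["log line %d" % i for i in range(10)], numSlices=4)

print(lines.getNumPartitions())                  # 4
# glom() turns each partition into a list, exposing the distribution.
print([len(p) for p in lines.glom().collect()])  # e.g. [2, 3, 2, 3]
```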

Then it tells each worker to keep only the members of the RDD for which isMobileClient(member) returns true. This could be done by loading the worker's entire partition of the file into memory first and deleting the entries that do not match, or by loading the partition line by line and only allocating space for the entries that do match (the streaming option is sketched below).
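
The streaming option is essentially what Spark does: each task pulls records through an iterator chain rather than materializing the whole partition. A rough pure-Python sketch of that idea (the body of isMobileClient here is a made-up stand-in):

```python
def filter_partition(lines_iter, predicate):
    """Stream records past a predicate one at a time; the full
    partition is never held in memory at once."""
    for line in lines_iter:
        if predicate(line):
            yield line

def isMobileClient(line):   # hypothetical stand-in for the slide's helper
    return "Mobile" in line

log = ["GET /a Mobile Safari", "GET /b Desktop Chrome", "GET /c Mobile Chrome"]
print(list(filter_partition(iter(log), isMobileClient)))
# ['GET /a Mobile Safari', 'GET /c Mobile Chrome']
```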

Next, it tells its workers to replace each record with the key-value pair (parseUserAgent(x), 1). This could be done as a separate pass over the completed result of the filter, or applied to each record as it emerges from the filter step (see the sketch below).
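
The second option corresponds to how Spark behaves: filter and map are both narrow transformations, so they are pipelined into a single pass over each partition. Continuing the iterator sketch (parseUserAgent is again a made-up stand-in):

```python
def map_partition(records_iter, fn):
    # Chained onto the filter generator, so each record is filtered
    # and mapped before the next record is even read.
    for rec in records_iter:
        yield fn(rec)

def parseUserAgent(line):   # hypothetical stand-in for the slide's helper
    return line.split()[-1]

log = ["GET /a Mobile Safari", "GET /b Desktop Chrome"]
filtered = (l for l in log if "Mobile" in l)          # filter stage
pairs = map_partition(filtered, lambda l: (parseUserAgent(l), 1))
print(list(pairs))                                    # [('Safari', 1)]
```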

It then runs a reduceByKey step, which takes a lambda function (required to be commutative and associative, according to the documentation) and operates on the values of two key-value pairs, reducing the dataset to one record per key. This can begin as soon as the map step produces records: since each reduce operation consumes two records, having more work immediately available helps any in-worker parallelism.
According to the PySpark documentation, "This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a 'combiner' in MapReduce." So the workers also communicate among themselves to produce a final result.
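
That two-phase behavior, a local combine per partition followed by a merge across workers, might look like this pure-Python sketch, with two hard-coded partitions standing in for the workers:

```python
def combine_locally(pairs, op):
    """The 'combiner' pass: reduce values per key within one
    partition before anything crosses the network."""
    acc = {}
    for key, value in pairs:
        acc[key] = op(acc[key], value) if key in acc else value
    return acc

def merge(partials, op):
    """The shuffle/merge pass: combine per-worker results by key."""
    out = {}
    for d in partials:
        for key, value in d.items():
            out[key] = op(out[key], value) if key in out else value
    return out

op = lambda a, b: a + b   # must be commutative and associative

# Two partitions of (userAgent, 1) pairs, as produced by the map step.
part0 = [("Safari", 1), ("Chrome", 1), ("Safari", 1)]
part1 = [("Chrome", 1), ("Chrome", 1)]

partials = [combine_locally(p, op) for p in (part0, part1)]
print(partials)             # [{'Safari': 2, 'Chrome': 1}, {'Chrome': 2}]
print(merge(partials, op))  # {'Safari': 2, 'Chrome': 3}
```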

Finally, it collects, which returns the reduced records to the driver program (in PySpark, as a plain Python list) for further computation.
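
Putting the steps together, here is a minimal end-to-end PySpark sketch of the pipeline under discussion (the helper bodies and sample lines are hypothetical stand-ins for whatever the slide assumes):

```python
from pyspark import SparkContext

sc = SparkContext("local[4]", "useragent-count")

def isMobileClient(line):   # hypothetical stand-in
    return "Mobile" in line

def parseUserAgent(line):   # hypothetical stand-in
    return line.split()[-1]

# Sample lines standing in for sc.textFile("hdfs://...") on a real log.
lines = sc.parallelize([
    "GET /a Mobile Safari",
    "GET /b Desktop Chrome",
    "GET /c Mobile Chrome",
])

counts = (lines
          .filter(isMobileClient)                 # keep mobile requests
          .map(lambda x: (parseUserAgent(x), 1))  # (userAgent, 1) pairs
          .reduceByKey(lambda a, b: a + b)        # one record per key
          .collect())                             # materialize at the driver

print(counts)   # e.g. [('Safari', 1), ('Chrome', 1)]
```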

jazzbass

@rflood I believe your description is correct; however, no work is actually done by Spark until the RDD is materialized, that is, until you call an action on it such as collect or count. This means that all of the steps you described are deferred until the last call (collect). More specifically, if we had omitted the collect call, none of the filter, map, and reduceByKey operations would have been performed.
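
A quick way to see this laziness in local mode (where executor-side prints land in the same console): the map below does no work until collect is called.

```python
from pyspark import SparkContext

sc = SparkContext("local", "lazy-demo")

def tag(x):
    print("processing", x)   # side effect reveals when work actually runs
    return (x % 2, 1)

pairs = sc.parallelize(range(4)).map(tag).reduceByKey(lambda a, b: a + b)
print("transformations defined; nothing printed yet")

result = pairs.collect()     # the action: only now do the 'processing'
print(result)                # lines appear, then e.g. [(0, 2), (1, 2)]
```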