Question: It might be interesting to compare the role of the partitionBy() and persist() RDD operations and the role of the schedule in a Halide program.
Here is a lesson I learned in 719 class: if you want to use the joined RDD later, force it to be evaluated right after join() (e.g. by calling count()); otherwise things can break unexpectedly.
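The pitfall is easier to see with a plain-Python analogue (my own illustration, not actual Spark): a generator pipeline, like an RDD lineage, records work to do but runs nothing until something forces it, the way count() forces a join.

```python
# Plain-Python analogue of lazy evaluation (not actual Spark):
# a generator pipeline records work, but runs nothing until forced.
log = []

def expensive_map(x):
    log.append(x)          # side effect lets us observe when work happens
    return x * 2

data = [1, 2, 3]
pipeline = (expensive_map(x) for x in data)  # "transformation": nothing runs yet
assert log == []                              # no work has been done so far

result = list(pipeline)                       # "action": forces evaluation
assert log == [1, 2, 3]
assert result == [2, 4, 6]
```

The `count()` trick in Spark plays the role of `list()` here: it is an action that forces the lazily recorded pipeline to actually execute at a point you choose.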
The reason is that partitionBy(), persist() and join() are all lazily evaluated: they don't take effect at the point where you call them. The catch is that if Spark doesn't run partitionBy() on both RDDs within the same job, their data will not be co-located on the same nodes.
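A toy simulation of hash partitioning (plain Python, not Spark) shows why co-partitioning matters: when two datasets use the same partitioner, equal keys land in the same partition index, so a join can be computed partition-by-partition with no cross-partition data movement. This is the locality that is lost when the two partitionBy() calls don't end up in the same job.

```python
# Toy simulation (plain Python, not Spark) of hash partitioning.
# If two datasets use the same partitioner, equal keys land in the
# same partition index, so a join needs no cross-partition traffic.
NUM_PARTITIONS = 4

def partition_by(pairs, num_partitions):
    parts = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        parts[hash(key) % num_partitions].append((key, value))
    return parts

left  = partition_by([("a", 1), ("b", 2), ("c", 3)], NUM_PARTITIONS)
right = partition_by([("a", 10), ("b", 20)], NUM_PARTITIONS)

# Join partition-by-partition: every matching key pair is found locally.
joined = []
for lpart, rpart in zip(left, right):
    for k1, v1 in lpart:
        for k2, v2 in rpart:
            if k1 == k2:
                joined.append((k1, (v1, v2)))

assert sorted(joined) == [("a", (1, 10)), ("b", (2, 20))]
```

In real Spark the partition index additionally maps to a node, so "same index" becomes "same machine" and the join avoids a shuffle.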
More detailed information can be found here
partitionBy() and persist() can be compared to the schedule in a Halide program because both are ways of hinting at, or controlling, the framework's actual implementation of its abstractions. With a Halide schedule we did the same thing by telling the framework to use tiles of x*y size, where to use SIMD, where to use parallel for loops, and so on.
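A hand-written tiled loop nest (a Python sketch of my own, not Halide output) shows what a schedule directive like tile() is asking the compiler to generate: the algorithm is unchanged, and only the loop structure that executes it changes.

```python
# Sketch of what a Halide schedule controls: loop structure, not the algorithm.
# Algorithm: out[y][x] = x + y, over a W x H grid.
W, H, TILE = 8, 8, 4

# Default schedule: plain row-major loops.
default = [[x + y for x in range(W)] for y in range(H)]

# "Tiled" schedule: same values, computed tile by tile
# (roughly what tile(x, y, xo, yo, xi, yi, TILE, TILE) would request).
tiled = [[0] * W for _ in range(H)]
for yo in range(0, H, TILE):
    for xo in range(0, W, TILE):
        for yi in range(TILE):
            for xi in range(TILE):
                tiled[yo + yi][xo + xi] = (xo + xi) + (yo + yi)

assert tiled == default  # the schedule changes execution order, not results
```

This is the same separation Spark's hints give you: partitionBy() and persist() change where and when work happens without changing what the RDD pipeline computes.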
If we don't use partitionBy() or persist(), the framework may be intelligent enough to pick a good implementation on its own, but by using them explicitly we take control of the implementation.
Another parallel between Spark transformations and Halide scheduling: in Halide, the scheduling directive compute_root() indicates that a stage should be computed in its entirety before the next stage starts. Spark's join() has a similar effect, since its shuffle forms a stage boundary that the preceding stage must complete before the next begins. Likewise, Halide's compute_at() has an effect similar to operator fusion in Spark, where narrow transformations are pipelined within a single stage.
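The two strategies can be sketched in plain Python (my own illustration, not Halide or Spark code) for a two-stage pipeline g(x) = f(x) + 1: the compute_root()-style version fully materializes the producer f before the consumer g runs, while the compute_at()/fusion-style version computes f inline inside g's loop with no intermediate buffer.

```python
# Sketch (plain Python) of compute_root-style vs. fused evaluation of
# the pipeline g(x) = f(x) + 1, where f(x) = x * x.
N = 5

def f(x):
    return x * x

# compute_root-style: fully materialize f before g starts (a stage barrier,
# like the buffer Spark writes at a shuffle boundary).
f_buf = [f(x) for x in range(N)]
g_root = [f_buf[x] + 1 for x in range(N)]

# compute_at / fusion-style: compute f inline in g's loop, no buffer
# (like pipelining narrow transformations within one Spark stage).
g_fused = [f(x) + 1 for x in range(N)]

assert g_root == g_fused == [1, 2, 5, 10, 17]
```

Both strategies produce the same values; they trade off intermediate storage against recomputation and locality, which is exactly the choice a Halide schedule (or Spark's stage planner) is making.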
.persist() seems like an awkward break in the abstraction and semantics of Spark. It suggests to Spark that some data should be saved, but doesn't inform Spark about why the data should be saved.
Are there systems that have a cleaner solution to this problem?