Previous | Next --- Slide 15 of 46
Back to Lecture Thumbnails

How do programmers decide when to save a checkpoint? Is it based on time or certain flags?


MPI has infrastructure for both kernel level and user defined checkpoint / restores. See here Fault tolerance for parallel MPI jobs.


Can someone give more examples of what the failing components can comprise of, that lead to performance scaling?


@stride16 I think the point was that if you expect performance scaling based on the number of nodes/CPUs/etc. and then some of those nodes fail, you get significantly worse scaling than expected even if only a few of them fail.