Previous | Next --- Slide 15 of 46
Back to Lecture Thumbnails
althalus

How do programmers decide when to save a checkpoint? Is it based on time or certain flags?

lol

MPI has infrastructure for both kernel level and user defined checkpoint / restores. See here Fault tolerance for parallel MPI jobs.

stride16

Can someone give more examples of what the failing components can comprise of, that lead to performance scaling?

jsunseri

@stride16 I think the point was that if you expect performance scaling based on the number of nodes/CPUs/etc. and then some of those nodes fail, you get significantly worse scaling than expected even if only a few of them fail.