I would start off with the example from Assignment 4, in particular the "projectidea" request. Its working set is 14MB, which fits just inside the CPU's 15MB L3 cache. To avoid unnecessary cache evictions, the best strategy is to ensure that two "projectidea" requests are never processed concurrently on the same processor (for example, by setting pthread processor affinity).
Another example from Assignment 4 is the caching used for countprimes/compareprimes. The countprimes cache could live on either the master or the workers. Putting it on the master avoids the communication cost of forwarding cached requests to a worker, which would likely improve performance. The trade-off is that compareprimes (which decomposes into four countprimes requests) no longer benefits from worker-side caching of countprimes; however, the chance that one of those four sub-requests would have landed on the same worker that had already cached it was probably minimal anyway.
There are simple-to-use tools such as Cachegrind that measure the cache hit/miss rates of your program, which is handy when you want to diagnose and improve its cache locality.
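For example, assuming your binary is named `./proxy_server` (a placeholder), a typical Cachegrind run is just two commands:

```shell
# Run the program under Cachegrind's simulated caches
# (expect a significant slowdown while profiling).
valgrind --tool=cachegrind ./proxy_server

# Summarize the miss rates recorded in the output file it wrote.
cg_annotate cachegrind.out.<pid>
```

The summary reports instruction and data cache misses, so a locality fix like the affinity change above should show up directly as a drop in last-level data misses.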
An example from last lecture is rewriting loops to take advantage of temporal locality. Another example is the grid solver we did in class.
There are several examples from class of loops being rewritten using the idea of chunking in order to improve cache locality.