Some drawbacks of multi-threading to keep in mind include synchronization problems such as race conditions and deadlocks, and the question of what happens if one thread crashes.
Is it possible to have a scenario where the time and effort needed to switch between different threads outweighs the benefit that this will bring?
Sometimes the task is memory-I/O-bound rather than computation-bound. For example, adding corresponding elements of two long vectors into a new vector: each position needs only one addition but three memory accesses (two loads and one store). In this kind of situation, bandwidth is still the bottleneck, so a thread may be unable to make progress even after another thread switches in: once its small bit of computation is done, it just waits on memory along with all the other threads.
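To put rough numbers on this, here's a back-of-envelope sketch. (The 100 GFLOP/s and 20 GB/s figures are made-up illustrative values, not a real machine.)

```python
# Arithmetic intensity of c[i] = a[i] + b[i] with 8-byte doubles:
# two loads + one store = 24 bytes moved per 1 floating-point add.
bytes_per_elem = 3 * 8
flops_per_elem = 1
intensity = flops_per_elem / bytes_per_elem   # FLOPs per byte, ~0.042

# Hypothetical machine: 100 GFLOP/s peak compute, 20 GB/s memory bandwidth.
peak_flops = 100e9
bandwidth = 20e9
machine_balance = peak_flops / bandwidth      # 5 FLOPs per byte

# The kernel's intensity is far below the machine balance, so it is
# bandwidth-bound: adding more threads just adds demand on the same
# saturated memory bus instead of hiding the stalls.
print(intensity < machine_balance)  # True -> bandwidth-bound
```

So more hardware threads only help if there is latency to hide; they can't create bandwidth that isn't there.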
To answer the question from @POTUS, I think the assumption here is that there's fairly little cost to switching between hardware threads, since each thread's execution context is stored on the chip (in per-thread registers, I think), unlike the context switch that happens in the OS.
I like @POTUS' question. Let's discuss that a bit.
@POTUS another scenario where switching threads doesn't pay off is when a stall in a thread does not actually last very long (maybe even less time than the switch itself), in which case switching would leave us below maximum efficiency. Although I would hope there is a mechanism that only switches threads when the stall is long enough to be worth it.
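A toy model of that tradeoff (all cycle counts here are made up for illustration; real hardware interleaving usually has near-zero switch cost, unlike OS context switches):

```python
def utilization_stay(work, stall):
    # Keep running one thread and simply absorb the stall.
    return work / (work + stall)

def utilization_switch(work, stall, switch_cost):
    # Switch to another ready thread during the stall; assume the other
    # thread fully hides the stall, but we pay the switch cost twice
    # (switching away and switching back).
    return work / (work + 2 * switch_cost)

# Short stall (10 cycles) vs. a 20-cycle switch: staying put wins.
print(utilization_stay(100, 10))        # ~0.91
print(utilization_switch(100, 10, 20))  # ~0.71

# Long stall (200 cycles): switching wins.
print(utilization_stay(100, 200))       # ~0.33
print(utilization_switch(100, 200, 20)) # ~0.71
```

Under this model, switching only helps when the stall is longer than roughly twice the switch cost, which matches the intuition above.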
@POTUS: One scenario that I can think where hardware-level interleaved multithreading is not useful: If a thread requires high responsiveness (for example, in a real time system where timing constraints are critical) and the switching back and forth causes it to miss its deadline.
@POTUS: Another scenario where the cost of switching may outweigh its benefit is when the cache is small and the data of one thread evicts the data of the other, i.e. the two threads are working on separate data. In this scenario, switching between threads will cause a lot of cache misses. On the other hand, a single thread with good locality and a lot of CPU-intensive work may give better performance than two threads.
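A quick simulation of that effect, as a sketch (a tiny fully-associative LRU cache with made-up sizes; real caches are set-associative, but the thrashing behavior is the same):

```python
from collections import OrderedDict

def miss_rate(accesses, capacity):
    """Simulate a fully-associative LRU cache; return the miss fraction."""
    cache = OrderedDict()
    misses = 0
    for addr in accesses:
        if addr in cache:
            cache.move_to_end(addr)        # refresh LRU position
        else:
            misses += 1
            cache[addr] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least-recently-used line
    return misses / len(accesses)

CAP = 64
# One thread repeatedly scanning a working set that fits in the cache:
single = [a for _ in range(10) for a in range(CAP)]

# Two threads interleaved, each with its own CAP-sized working set;
# together they need 2*CAP lines and keep evicting each other's data.
t0 = [a for _ in range(10) for a in range(CAP)]
t1 = [a + 1000 for _ in range(10) for a in range(CAP)]
interleaved = [x for pair in zip(t0, t1) for x in pair]

print(miss_rate(single, CAP))       # 0.1: only cold misses
print(miss_rate(interleaved, CAP))  # 1.0: the threads thrash each other
```

One thread alone only pays cold misses, while the interleaved pair misses on every access, even though each thread's working set fits in the cache on its own.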
@butterfly: In real-life scenarios, do we need to take these tradeoffs into consideration? In particular, in @butterfly's scenario, is it common to optimize the total throughput of a program with multi-threading while also considering locality? If so, what would be a good way to measure these effects?