In class, we talked about how the size of the program and the size of the machine matter - architects continuously think about this. What I found interesting was the case where the program is too small for the machine. Does this mean we must match program size to machine size? How exactly does that work?
I think scaling problem size to machine size is analogous to accounting for constant factors in runtime analysis. When analyzing an algorithm, we're usually instructed to ignore constant factors. In reality, though, constants matter, especially when the problem size is small.
For example, insertion sort, though $O(n^2)$, is often preferred over $O(n \log n)$ sorts for small arrays due to its low overhead. Python's built-in sort (Timsort) falls back to insertion sort on short runs precisely because it has better constants when $n$ is small.
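To make that concrete, here's a minimal sketch of the hybrid idea: a merge sort that switches to insertion sort below a cutoff. The `CUTOFF` value of 16 is a hypothetical threshold I picked for illustration; real libraries tune this empirically.

```python
import random

CUTOFF = 16  # hypothetical threshold; real libraries tune this empirically

def insertion_sort(a, lo, hi):
    # Sort a[lo:hi] in place: O(n^2) comparisons, but very low overhead.
    for i in range(lo + 1, hi):
        key = a[i]
        j = i - 1
        while j >= lo and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key

def hybrid_sort(a, lo=0, hi=None):
    # Merge sort that falls back to insertion sort for small slices.
    if hi is None:
        hi = len(a)
    if hi - lo <= CUTOFF:
        insertion_sort(a, lo, hi)
        return
    mid = (lo + hi) // 2
    hybrid_sort(a, lo, mid)
    hybrid_sort(a, mid, hi)
    # Merge the two sorted halves back into a[lo:hi].
    merged = []
    i, j = lo, mid
    while i < mid and j < hi:
        if a[i] <= a[j]:
            merged.append(a[i]); i += 1
        else:
            merged.append(a[j]); j += 1
    merged.extend(a[i:mid])
    merged.extend(a[j:hi])
    a[lo:hi] = merged

data = [random.randrange(1000) for _ in range(200)]
hybrid_sort(data)
```

The asymptotics are unchanged, but the base case now does the small-$n$ work with the cheaper algorithm - the same reasoning behind scaling the approach to the problem size.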
When your problem is small, constants matter, and the overhead of spinning up massive parallelism on a large machine will dominate the actual work performed.
I suppose, in general, the best way to determine whether a task will benefit from some parallel optimization is to benchmark it and find where the bottlenecks lie.
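A quick micro-benchmark along those lines, assuming a deliberately tiny workload: dispatch the same trivial function serially and through a thread pool, and compare. The specific numbers will vary by machine, but on inputs this small the pool's dispatch overhead typically swamps the work itself.

```python
import timeit
from concurrent.futures import ThreadPoolExecutor

def work(x):
    # A trivially small unit of work.
    return x * x

data = list(range(100))  # a deliberately small problem

def serial():
    return [work(x) for x in data]

def parallel(pool):
    return list(pool.map(work, data))

pool = ThreadPoolExecutor(max_workers=4)
assert serial() == parallel(pool)  # same answer either way

# Time both; on small inputs, pool dispatch overhead dominates the work.
t_serial = timeit.timeit(serial, number=1000)
t_parallel = timeit.timeit(lambda: parallel(pool), number=1000)
print(f"serial: {t_serial:.4f}s, thread pool: {t_parallel:.4f}s")
pool.shutdown()
```

Rerunning with a much larger `data` or a more expensive `work` is exactly the kind of test that tells you where the crossover point is for your machine.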