Do most modern CPUs employ inter-core prefetching (using prefetching threads on idle cores in addition to the current compute thread), and is it a better alternative to the hardware controlled prefetcher shown in the pre-multicore era processor (slide 14 of this lecture)?
Inter-core prefetching is a software-controlled prefetching technique, so it would seem to be a slower alternative to hardware-controlled prefetchers. There would also be considerable communication overhead between threads if several prefetching threads are involved.
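To make the overhead concern concrete, here is a rough sketch of the helper-thread idea in C with POSIX threads. It is a simplification, not the published inter-core prefetching scheme: one helper thread touches the next chunk of an array while the main thread sums the current one, in the hope that the data is already in a shared cache level when the main thread gets there. The names `warm_chunk`, `chunked_sum`, and `struct warm_job`, and the 64-byte line size, are made up for illustration. Build with `-pthread`.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define CACHE_LINE 64 /* assumed cache line size in bytes */

/* Hypothetical job description handed to the helper thread. */
struct warm_job {
    const char *base; /* first byte of the chunk to touch */
    size_t bytes;     /* length of the chunk in bytes */
};

/* Helper thread: read one byte per cache line so the chunk is pulled
 * into whatever cache levels the two cores share. */
static void *warm_chunk(void *arg)
{
    const struct warm_job *job = arg;
    volatile char sink = 0; /* volatile keeps the loads from being optimized away */
    for (size_t off = 0; off < job->bytes; off += CACHE_LINE)
        sink = job->base[off];
    (void)sink;
    return NULL;
}

/* Sum data[0..n) chunk by chunk; while chunk k is being summed,
 * a helper thread touches chunk k+1. */
long chunked_sum(const long *data, size_t n, size_t chunk)
{
    assert(chunk > 0);
    long sum = 0;
    for (size_t start = 0; start < n; start += chunk) {
        size_t end = start + chunk < n ? start + chunk : n;
        size_t next_end = end + chunk < n ? end + chunk : n;

        struct warm_job job = { (const char *)(data + end),
                                (next_end - end) * sizeof(long) };
        pthread_t helper;
        /* Only start a helper if there is a next chunk to warm. */
        int have_helper = end < n &&
            pthread_create(&helper, NULL, warm_chunk, &job) == 0;

        for (size_t i = start; i < end; i++)
            sum += data[i];

        /* Synchronization point: this is part of the overhead. */
        if (have_helper)
            pthread_join(helper, NULL);
    }
    return sum;
}
```

Creating and joining a thread per chunk is the simplest way to write this, and it shows where the cost comes from: if the chunks are small, the thread management can easily outweigh any cache benefit. A real implementation would presumably keep long-lived helper threads and choose chunk sizes to match the cache.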
As far as I have found, most architectures now support both hardware and software prefetching. Comparing the two is difficult because different architectures are better at different things: "best prefetching technique on SandyBridge performs worst on Xeon Phi and vice-versa". It is always important to also consider the problem when deciding which to use.
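For reference, software prefetching on a single core usually looks something like the sketch below. It uses the GCC/Clang builtin `__builtin_prefetch`. The function name `gather_sum` and the prefetch distance of 16 elements are assumptions chosen for illustration; a good distance depends on the machine and the loop. The access pattern is indirect (`data[idx[i]]`), which is the kind of pattern a simple stride-based hardware prefetcher tends to handle poorly.

```c
#include <assert.h>
#include <stddef.h>

/* Prefetch distance in elements; a hypothetical tuning knob. */
#define PREFETCH_DISTANCE 16

/* Sum data[idx[i]] for i in [0, n), hinting the cache about the
 * element that will be needed PREFETCH_DISTANCE iterations from now. */
long gather_sum(const long *data, const size_t *idx, size_t n)
{
    assert(data != NULL && idx != NULL);
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* Arguments: address, 0 = prefetch for read,
             * 1 = low expected temporal locality. */
            __builtin_prefetch(&data[idx[i + PREFETCH_DISTANCE]], 0, 1);
        }
        sum += data[idx[i]];
    }
    return sum;
}
```

The prefetch is only a hint, so the result is the same with or without it. Whether it helps at all is exactly the architecture-dependent and problem-dependent question raised above.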
Inter-core prefetching has been found to greatly speed up certain problems (2.8x) and, more interestingly, to save power when applied to certain problems.
If you are interested in the specifics, these two papers go into further detail