I could be wrong, but I think the 64 concurrent instruction streams comes from (4 threads/core * 16 simultaneous instruction streams), and then your calculation about the independent pieces of work needed to run chip with maximal latency hiding ability is correct.

aeu

I am confused about the '512 independent pieces of work' calculation as well.

My first source of confusion is how this number is calculated. Is it, as @yimmyz noted, 4 threads per core x 4 cores = 64 concurrent instruction streams, each of which can use the 8 SIMD ALUs, which gives us 64 x 8 = 512 independent pieces of work?

My second confusion is regarding the 'independency' of these 512 pieces of work. Since SIMD ALUs need to run the same instruction stream to operate in parallel on a bunch of data, wouldn't that mean each and every single one of those 512 pieces of work are not truly independent of each other but rather it's the case that 64 truly independent streams of work, each of which can work on 8 pieces of data at the same time?

monkeyking

I think both @yimmyz and @bdebebe are correct. Because "16 cores" just means "16 simultaneous instruction streams" since this is SIMD.

Dracula08MS

Where does this 4 threads per core coming from? Is it just an arbitrary number?

yimmyz

@Dracula08MS I believe it is an arbitrary number specific to this chip.

skylake

Yep, each core allows an instruction stream to operate - Remember, each instruction stream needs to be able to fetch its instructions independently. Thus, 16 cores ~ 16 instruction streams.
Please correct me if I am missing something.

captainFlint

@skylake You are correct, 16 cores means 16 instruction streams and due to 4 threads per core, we have 64 independent instruction streams.
@aeu I think the definition of independent is causing the confusion. Here 512 independent pieces of work means none of these 512 things should be waiting for each other or are in any way dependent on each other. More specifically, these 512 pieces of work perform their calculation on independent pieces of data.

Am I understanding the calculations correctly?

64 concurrent instruction streams = 16 cores * 4 threads/core

512 independent work = 64 concurrent instruction streams * 8 SIMD ALUs/core

I could be wrong, but I think the 64 concurrent instruction streams comes from (4 threads/core * 16 simultaneous instruction streams), and then your calculation about the independent pieces of work needed to run chip with maximal latency hiding ability is correct.

I am confused about the '512 independent pieces of work' calculation as well.

My first source of confusion is how this number is calculated. Is it, as @yimmyz noted, 4 threads per core x 4 cores = 64 concurrent instruction streams, each of which can use the 8 SIMD ALUs, which gives us 64 x 8 = 512 independent pieces of work?

My second confusion is regarding the 'independency' of these 512 pieces of work. Since SIMD ALUs need to run the same instruction stream to operate in parallel on a bunch of data, wouldn't that mean each and every single one of those 512 pieces of work are not truly independent of each other but rather it's the case that 64 truly independent streams of work, each of which can work on 8 pieces of data at the same time?

I think both @yimmyz and @bdebebe are correct. Because "16 cores" just means "16 simultaneous instruction streams" since this is SIMD.

Where does this 4 threads per core coming from? Is it just an arbitrary number?

@Dracula08MS I believe it is an arbitrary number specific to this chip.

Yep, each core allows an instruction stream to operate - Remember, each instruction stream needs to be able to fetch its instructions independently. Thus, 16 cores ~ 16 instruction streams. Please correct me if I am missing something.

@skylake You are correct, 16 cores means 16 instruction streams and due to 4 threads per core, we have 64 independent instruction streams. @aeu I think the definition of independent is causing the confusion. Here 512 independent pieces of work means none of these 512 things should be waiting for each other or are in any way dependent on each other. More specifically, these 512 pieces of work perform their calculation on independent pieces of data.