Jawed
The important parameter is how many threads are required to hide a fetch from DDR. When the chip is starved of threads, the ALUs will sit idle as soon as DDR becomes the bottleneck.
NVidia is aiming for a programming model where DDR fetch latency is minimised by optimising the application's memory hierarchy. So even if the number of threads required to hide DDR latency goes up, if the chip does fewer fetches from DDR then it matters less.
And if the compiler inserts "pre-fetch" instructions into the kernel, for example, so that the latency being hidden by threads is L2 (or L1), then the number of fetches from DDR matters even less.
ALU scheduling, in this case, is eased by pre-computed memory access patterns.
This is essentially what all GPGPU compute has been about: optimisation of the memory hierarchy against latency hiding by hardware threads. NVidia plans to take more control of this with its tools, because it's a nightmare for programmers (the optimisation space is vast even with only a few dimensions).