Jawed
Legend
I'm saying that the compiler is optimising ALU scheduling at the instruction level and it's NVidia's plan to pre-program the memory hierarchy usage pattern pro-actively in order to minimise the effort programmers spend on trying to get the hardware to use an optimal memory hierarchy. As I understood it, he is suggesting the compiler has some say in which other warp to switch to.
E.g. using pre-fetch so that data is in cache rather than DDR, which has the effect of improving ALU throughput.
No different than the kinds of techniques we've discussed with respect to Larrabee.
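
To make the pre-fetch point concrete, here's a rough CPU-side sketch in C. This is only an analogy and not what NVidia's compiler actually emits: the function name and PREFETCH_DISTANCE are made up for illustration, and __builtin_prefetch is the GCC/Clang builtin, so it assumes one of those compilers. The idea is just that the request for data goes out well before the ALU needs it, so the add doesn't stall waiting on DDR.

```c
#include <stddef.h>

/* Hypothetical tuning knob: how many elements ahead of the current one to prefetch. */
#define PREFETCH_DISTANCE 16

/* Sum an array while hinting that data a few iterations ahead should be pulled into cache. */
float sum_with_prefetch(const float *data, size_t n)
{
    float total = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        if (i + PREFETCH_DISTANCE < n) {
            /* GCC/Clang builtin: address, rw (0 = read), locality (3 = keep in all cache levels). */
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 3);
        }
        total += data[i]; /* ALU work on data that should already be in cache. */
    }
    return total;
}
```

A plain linear walk like this is something a hardware prefetcher would probably catch anyway, so don't expect a measurable win from this exact loop. It's only meant to show the shape of the technique: the prefetch is a hint that changes timing, not results.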