Mintmaster
Veteran
I'm still not sure if you really understand it. The compiler is not telling the GPU what to load; it's telling it when to load. How can the compiler construct load clauses when it doesn't know what the program will need to load at runtime? I'm not talking about static texture lookups in a shader here.
For example, it knows that it can't load values needed in ALU instructions 17 and 28 until ALU instructions 15 and 16 determine the addresses for those loads. Hence there is a tex clause consisting of two loads before ALU instruction 17. After instruction 16, the batch is put aside until the two loads arrive, and then it continues. Basically the compiler makes a big dependency graph and groups together loads when it can. The average group size effectively multiplies latency hiding.
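A rough sketch of that clause-building pass (the greedy policy and data layout here are guesses for illustration, not real compiler internals): each load has a point where its address is ready and a first consumer, and loads get merged into one clause when they are all ready before the earliest consumer needs its result.

```python
def form_tex_clauses(loads):
    """Greedily group texture loads into clauses.

    loads: list of (name, ready_after_alu, first_use_alu) tuples, where a
    load's address exists after ALU instruction `ready_after_alu` and its
    result is first consumed by ALU instruction `first_use_alu`.
    Returns [(issue_after_alu, [names]), ...].
    """
    pending = sorted(loads, key=lambda l: l[2])   # order by first consumer
    clauses = []
    while pending:
        issue_at = pending[0][2] - 1              # just before the earliest consumer
        clause = [l for l in pending if l[1] <= issue_at]   # address already computed
        pending = [l for l in pending if l[1] > issue_at]
        clauses.append((issue_at, [name for name, _, _ in clause]))
    return clauses

# The example above: two loads whose addresses come from ALU 15 and 16,
# consumed by ALU 17 and 28 -> one clause of two loads issued after ALU 16.
print(form_tex_clauses([("A", 15, 17), ("B", 16, 28)]))
# -> [(16, ['A', 'B'])]
```

If the second load's address weren't ready until, say, ALU 20, the same pass would split it off into its own clause before instruction 28.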
More complex hardware would issue loads at instructions 15 and 16, put the batch aside until one of them got back, and then have the option of either waiting until the next load came back or executing up to instruction 27 and then waiting. This flexibility is important if you don't have enough threads to saturate either the ALU or TEX throughput with the previous method, but if you do have enough threads then it's overkill.
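For concreteness, here's a toy single-batch timeline comparing the two policies on the example above (one cycle per ALU instruction, a fixed made-up load latency; the numbers are purely illustrative):

```python
LATENCY = 100
LOADS = [(15, 17), (16, 28)]   # (address ready after ALU i, first used by ALU j)
TOTAL_ALU = 28

def clause_finish():
    """Compiler-built clause: both loads issue together after ALU 16 and the
    batch sleeps until every load in the clause has returned."""
    p = min(use for _, use in LOADS) - 1          # clause sits before ALU 17
    return p + LATENCY + (TOTAL_ALU - p)

def scoreboard_finish():
    """More complex hardware: each load issues the moment its address exists,
    and execution stalls only at the first instruction that consumes it."""
    done = [0] * (TOTAL_ALU + 1)   # done[i] = cycle ALU instruction i completes
    returns = {}                   # consumer instruction -> cycle its load arrives
    for i in range(1, TOTAL_ALU + 1):
        start = max(done[i - 1], returns.get(i, 0))
        done[i] = start + 1
        for ready, use in LOADS:
            if ready == i:
                returns[use] = done[i] + LATENCY
    return done[TOTAL_ALU]

print(clause_finish(), scoreboard_finish())   # -> 128 127
```

Per batch the gap here is tiny because the two loads return almost together; the flexible scheduler's real value, as noted above, is covering latency with ALU work when there aren't enough other batches to switch to.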
I made a little GPU simulation program for Jawed about a year ago to illustrate how all this affects latency hiding and thus efficiency for any given program, but it was based on the simple scheduler. Maybe I'll add a more complicated scheduler to see what kind of difference it makes.
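A much cruder back-of-the-envelope version of that effect: with simple round-robin batch switching, each batch contributes its clause's worth of ALU work toward covering one memory latency, so utilization scales with group size (the numbers below are illustrative, not measurements).

```python
def alu_utilization(batches, alu_per_load, group_size, latency):
    """ALU busy fraction under simple round-robin batch scheduling.

    Each batch executes `group_size * alu_per_load` ALU cycles, then sleeps
    for one memory latency while its whole tex clause is in flight (loads
    within a clause overlap, so the clause costs roughly one latency).
    """
    work = group_size * alu_per_load
    return min(1.0, batches * work / (work + latency))

# Grouping loads multiplies latency hiding: with a 100-cycle latency and
# 5 ALU instructions per load, full utilization needs 21 batches at
# clause size 1, but only 6 batches at clause size 4.
print(alu_utilization(21, 5, 1, 100))   # -> 1.0
print(alu_utilization(6, 5, 4, 100))    # -> 1.0
print(alu_utilization(6, 5, 1, 100))    # about 0.29
```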