http://semiaccurate.com/2011/06/29/amd-southern-islands-possible-for-september/
So now Charlie thinks we'll see SI on 40nm.

> So now Charlie thinks we'll see SI on 40nm.

Doubtful.

By the way, New Zealand appears to be the dual-GPU card with two Tahiti chips, as has been known since April.

That means the SI series consists of three GPUs: Tahiti, Thames (a small town in the north of New Zealand? maybe that's the connection), and Lombok (an Indonesian island). Three GCN device IDs also appeared in the driver.

> Thames = laptop chip, not desktop; the other reported laptop chips are Heathrow, Wimbledon and Chelsea.

That would mean there are only two desktop GPUs (Tahiti and Lombok) but four mobile ones. I would say that doesn't add up unless the other parts are coming significantly later. They didn't get added in the last three months or so since Tahiti appeared.

> This might allow for latency hiding even if code is not as arithmetically dense as the earlier SIMD design would have liked. If the compiler can collect enough loads, 15 outstanding loads and 10 wavefronts would give 150 cycles of latency hiding before a wavefront would stall on the first load.

Guess you would have to multiply that number by the 4 cycles each SIMD gets scheduled for.

> It falls apart if a dependence in data or control pops up, though this scheme does expose risk if there is divergence.

At some point the dependency on the loads has to be resolved either way. The solution can't do anything Nvidia's scoreboarding couldn't do either; it just requires less hardware.

> I wonder if the hardware can meet the implied number of in-flight loads. 4 SIMDs with 150 issued loads per L1 cache?

As a first point, it is the maximum the instruction format can handle; it doesn't tell us directly what the first incarnation of the hardware can do (though Cayman already supported 8 or even 16 [I would have to look it up] outstanding memory accesses per wavefront). Secondly, it is of course the maximum allowed per wavefront. There are probably other limits per CU (the memory pipeline is shared by all SIMDs) which, depending on your code and the conditions, you may hit earlier or not. And as a last point, in principle it defines only the maximum you can specify in that barrier-like instruction; there could be even more loads in flight in sections where it is not used (unlikely in my opinion, though).
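
To make that barrier-like instruction concrete, here is a minimal sketch (Python, with invented names; only the limit of 15 and the wait-until-at-most-N semantics are taken from the discussion) of a per-wavefront load counter. Note that one vector memory instruction counts once for the whole wavefront, however many cycles its issue takes.

```python
# Sketch of a counter-based load barrier, as described above.
# Names and the return model are invented for illustration.

class Wavefront:
    VMCNT_MAX = 15  # maximum count the instruction format can encode

    def __init__(self):
        self.outstanding = 0  # loads issued but not yet returned

    def issue_load(self):
        # Past this limit, a later waitcnt could no longer name the load.
        assert self.outstanding < self.VMCNT_MAX, "format limit reached"
        self.outstanding += 1

    def waitcnt(self, n):
        # Barrier-like: stall this wavefront until at most n loads are
        # still in flight (n = 0 waits for all of them). While stalled,
        # the scheduler would run other wavefronts; here we simply model
        # the oldest loads returning one by one.
        while self.outstanding > n:
            self.outstanding -= 1  # oldest outstanding load returns

wf = Wavefront()
for _ in range(15):   # collect the maximum number of loads
    wf.issue_load()
wf.waitcnt(14)        # continue as soon as the oldest load is back
wf.waitcnt(0)         # full barrier before dependent ALU work
```

Assuming loads are counted in issue order, a single number is enough to name "everything older than the last N loads", which is presumably why no per-register scoreboard is needed.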

> Does the 4-cycle wavefront issue count as 4 separate cache loads to be tracked?

That is highly unlikely, as only complete wavefronts get issued, so the load for the complete wavefront has to be ready either way.

> The number of LDS/GDS accesses is even higher. This must hint at the expectation that the data share will be heavily contended and that there is a significant amount of latency involved in serializing around bank conflicts with the hardware used to manage the data share scheduling.

You have to take into account that it is shared with GDS and scalar memory accesses (the access to the scalar L1 is shared by 4 CUs). I would expect the LDS itself to be pretty fast, as you can even broadcast a value from a single location in the LDS to all lanes of the vector ALU directly by choosing the LDS value as a source operand. This obviously takes precedence over pending LDS load/store instructions, where bank conflicts may need to be handled (increasing the latency), but it shows that the latency itself can't be very high, as it appears to be hidden by the pipeline without any additional waiting for the single value (where of course no bank conflict can occur).
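
As a rough illustration of why such a broadcast is cheap while scattered LDS accesses may serialize, here is a sketch of the usual bank-conflict cost model (Python; the bank count of 32 and the 4-byte bank width are assumptions about the LDS organization, not something stated above).

```python
from collections import Counter

LDS_BANKS = 32  # assumed number of banks
BANK_BYTES = 4  # assumed bank width

def lds_access_cycles(byte_addresses):
    """Estimate serialization for one set of per-lane LDS addresses.

    All lanes hitting the same address is a broadcast and costs a
    single access; otherwise the cost is the largest number of distinct
    addresses that fall into any one bank.
    """
    if len(set(byte_addresses)) == 1:
        return 1  # broadcast: one read, fanned out to all lanes
    per_bank = Counter()
    for addr in set(byte_addresses):
        per_bank[(addr // BANK_BYTES) % LDS_BANKS] += 1
    return max(per_bank.values())

print(lds_access_cycles([4 * lane for lane in range(32)]))    # 1: stride 1, no conflicts
print(lds_access_cycles([128 * lane for lane in range(32)]))  # 32: all lanes hit one bank
print(lds_access_cycles([64] * 32))                           # 1: broadcast
```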

> At some point the dependency on the loads has to be resolved either way. The solution can't do anything Nvidia's scoreboarding couldn't do either; it just requires less hardware.

If the scheduler is as dependent on the barrier instruction as it appears, it can do more. It may not be correct or defined behavior if a wavefront is set to run past N loads and in that window there is divergence or a set of ALU instructions dependent on one of the loads, which may or may not be ready in time.

> If the scheduler is as dependent on the barrier instruction as it appears, it can do more. It may not be correct or defined behavior if a wavefront is set to run past N loads and in that window there is divergence or a set of ALU instructions dependent on one of the loads, which may or may not be ready in time.

In some sense, that would just be invalid code generated by the compiler. I'm pretty sure the compiler will always insert the appropriate instruction so that correct and reproducible behavior is guaranteed. If you bypass the compiler and write the code directly in assembly yourself, you can get to this point, of course. But I wouldn't promote that as a feature.
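
A toy version of the insertion pass the compiler would need (Python; the tuple spelling of instructions is schematic, invented for illustration, not real ISA syntax): track which registers still have a load in flight, and before the first dependent use emit a wait that allows only the not-yet-needed loads to stay outstanding.

```python
# Toy waitcnt-insertion pass over (op, dest, srcs) instruction tuples.

def insert_waitcnts(instrs):
    in_flight = []  # destination registers of loads, oldest first
    out = []
    for op, dest, srcs in instrs:
        # If this instruction reads a register whose load may still be
        # outstanding, wait until that load (and all older ones) is done.
        pending = [i for i, r in enumerate(in_flight) if r in srcs]
        if pending:
            newest_needed = max(pending)
            remaining = len(in_flight) - newest_needed - 1
            out.append(("waitcnt", remaining, ()))  # `remaining` may stay in flight
            in_flight = in_flight[newest_needed + 1:]
        out.append((op, dest, srcs))
        if op == "load":
            in_flight.append(dest)
    return out

prog = [
    ("load", "v0", ("addr0",)),
    ("load", "v1", ("addr1",)),
    ("add",  "v2", ("v0", "v9")),  # first use of v0: waitcnt 1 inserted before
    ("mul",  "v3", ("v1", "v2")),  # first use of v1: waitcnt 0 inserted before
]
for ins in insert_waitcnts(prog):
    print(ins)
```

The same logic would simply run once per code path after a branch, which is the point raised below: each divergent path that depends on other in-flight loads gets its own wait.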

> Going forward, the use of the x86 memory model also could lead to faults that prefetches typically do not need to cover, but a load would require handling, unless there is a no-register no-fault load option.

Like what? Loads can complete out of order. Just stall when the destination register of the load is being used, else keep the load in flight.

> This might allow for latency hiding even if code is not as arithmetically dense as the earlier SIMD design would have liked. If the compiler can collect enough loads, 15 outstanding loads and 10 wavefronts would give 150 cycles of latency hiding before a wavefront would stall on the first load.

A load can be issued only every 4 cycles, so it should be 600 cycles. Quite enough.
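
Spelling out the arithmetic behind that correction, with the numbers used in the posts above:

```python
loads_per_wavefront = 15  # maximum outstanding loads per wavefront
wavefronts_per_simd = 10  # resident wavefronts on one SIMD
issue_cadence = 4         # cycles between instruction issues

naive = loads_per_wavefront * wavefronts_per_simd
print(naive)                  # 150: assumes one cycle per load issue
print(naive * issue_cadence)  # 600: each load issue occupies a 4-cycle slot
```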

> Like what? Loads can complete out of order. Just stall when the destination register of the load is being used, else keep the load in flight.

A prefetch could fault by reading from an address that could be invalid for cacheability or protection reasons. A prefetch would give up with no ill effect.
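
A sketch of that distinction (Python standing in for the hardware behavior; the address range and names are invented):

```python
# Toy contrast between a prefetch and a load hitting a bad address.

class Fault(Exception):
    pass

MAPPED = range(0x1000, 0x2000)  # invented set of valid addresses
cache = set()

def prefetch(addr):
    # A prefetch is only a hint: if the address would fault, drop it
    # silently and warm nothing.
    if addr in MAPPED:
        cache.add(addr)

def load(addr):
    # A real load must either produce a value or raise a fault; it
    # cannot just give up.
    if addr not in MAPPED:
        raise Fault(hex(addr))
    cache.add(addr)
    return addr  # stand-in for the loaded value

prefetch(0xDEAD0)   # invalid address: ignored with no ill effect
load(0x1004)        # valid address: completes
try:
    load(0xDEAD0)   # invalid address: somebody has to handle this
except Fault as e:
    print("load faulted at", e)
```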

> A load can be issued only every 4 cycles, so it should be 600 cycles. Quite enough.

If the hardware is able to support that number of loads in flight, which I questioned earlier. That's a lot of in-flight accesses; the memory pipeline might max out at a lower number of outstanding loads.

> Why wouldn't the compiler simply insert another waitcnt instruction after the branch if that code path is dependent on other in-flight loads?

In the case that the compiler does insert another waitcnt, that shows a theoretical limit of the functionality, and one area where it may not be practical to use it to the fullest.

> A prefetch could fault by reading from an address that could be invalid for cacheability or protection reasons. A prefetch would give up with no ill effect.

This type of prefetching is strictly in the programmer's hands, so it's safe to shift that burden onto them.