AMD: Southern Islands (7*** series) Speculation/ Rumour Thread

They could make a 5700-class SI GPU on 40nm. Something like 8 or 12 CUs, 2 primitive pipes and 2 pixel pipes. And maybe 8 ROPs per pixel pipe for 16 ROPs.
With TSMC's track record they probably made some plan B. 28nm could end up being well into 2012 (I mean mass production).
 
By the way, New Zealand appears to be the dual-GPU card with two Tahiti chips, as has been known since April.

That means the SI series consists of 3 GPUs: Tahiti, Thames (a small town in the north of New Zealand? Maybe), and Lombok (an Indonesian island). And three GCN IDs also appeared in the driver.
 
By the way, New Zealand appears to be the dual-GPU card with two Tahiti chips, as has been known since April.

That means the SI series consists of 3 GPUs: Tahiti, Thames (a small town in the north of New Zealand? Maybe), and Lombok (an Indonesian island). And three GCN IDs also appeared in the driver.

Thames = laptop chip, not desktop; the other reported laptop chips are Heathrow, Wimbledon and Chelsea
 
Thames = laptop chip, not desktop; the other reported laptop chips are Heathrow, Wimbledon and Chelsea
That would mean there are only two desktop GPUs (Tahiti and Lombok) but four mobile ones. I would say that doesn't add up unless the other parts are coming significantly later. They haven't been added in the 3 months or so since Tahiti appeared.

The device ID database in Cat 11.7 knows only 4 basic Southern Islands SKUs:
New Zealand (desktop enthusiast part, dual Tahiti)
Tahiti (desktop highend, as XT and Pro)
Thames (mobile? as XT/GL, Pro, and LE)
Lombok (entry/mainstream, as XT/XT GL, Pro, and AIO = all-in-one)

That would leave the performance part missing, and most former mobile GPUs also had LP parts. Probably it's just still far from complete, and we may see an introduction stretching well into 2012 with the other parts coming later.
 
Let's go back to the question of scheduling and tracking dependencies. I argued that GCN looks as if AMD wanted to keep the overhead of advanced scheduling techniques down, and that could even be an incentive to reduce the (arithmetic) latencies to just 4 cycles, which would remove any dependencies between back-to-back instructions so they wouldn't need to be tracked.

But of course we still have memory accesses, which need to be taken care of. GCN uses a different approach for that than one might have assumed. It does not compare the source operands of instructions against memory accesses in flight. They use a far simpler scheme: all they track is the number of outstanding memory accesses for each wavefront, split into 3 classes: vector memory reads, memory writes/exports, and data share/scalar memory reads. (Judging from the code I looked at, memory requests retire in order, so a later memory request from the same class isn't counted as completed while an older one is still in flight.) Waiting for the completion of memory requests is basically the responsibility of the running program itself; normally the compiler will insert the required instruction.

The internal instruction s_waitcnt is used for this purpose (it is "executed"/consumed in the instruction buffer itself, not issued to the units). It allows specifying the acceptable number of outstanding memory accesses needed to proceed. If the actual number is higher, the wavefront is not considered for instruction issue until the count drops to the specified value. In the simplest case it looks like this:

s_waitcnt 0 // all memory requests have to be completed

But any combination of the three classes can be specified:

s_waitcnt vmcnt(4) & expcnt(3) & lgkmcnt(1)
// the last 4 vector memory reads, the last 3 writes and the last local/scalar memory access don't have to be completed; only older ones must complete before execution continues

That way the scheduler doesn't have to track the individual registers for dependencies on memory accesses. And it doesn't have to be done on every instruction, only when this kind of barrier instruction appears. No scoreboarding in GCN! :D
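To make the bookkeeping concrete, here is a toy model in Python (my own sketch, not actual AMD hardware behavior) of the counter-per-class scheme described above:

```python
# Toy model of GCN-style counter-based memory dependency tracking.
# Instead of scoreboarding individual registers, each wavefront just
# keeps three counters of outstanding requests; s_waitcnt blocks issue
# until the counters drop to the specified thresholds.

class Wavefront:
    def __init__(self):
        # vmcnt: vector memory reads, expcnt: writes/exports,
        # lgkmcnt: LDS/GDS/scalar memory reads
        self.counters = {"vmcnt": 0, "expcnt": 0, "lgkmcnt": 0}

    def issue_memory_op(self, cls):
        self.counters[cls] += 1    # one more request in flight

    def complete_memory_op(self, cls):
        self.counters[cls] -= 1    # requests retire in order per class

    def waitcnt_satisfied(self, vmcnt=0, expcnt=0, lgkmcnt=0):
        # s_waitcnt semantics: the wavefront may proceed only if no more
        # than the given number of requests per class is still outstanding
        return (self.counters["vmcnt"] <= vmcnt
                and self.counters["expcnt"] <= expcnt
                and self.counters["lgkmcnt"] <= lgkmcnt)

wf = Wavefront()
for _ in range(3):
    wf.issue_memory_op("vmcnt")          # three vector loads in flight
print(wf.waitcnt_satisfied(vmcnt=2))     # False: 3 outstanding > 2
wf.complete_memory_op("vmcnt")
print(wf.waitcnt_satisfied(vmcnt=2))     # True: 2 outstanding <= 2
```

Note how the check is a handful of comparisons per wavefront rather than a register-by-register scoreboard lookup.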
 
Interesting, so there are two types of instructions: those that take one quad cycle, and those that don't.

I had thought the scheme was to issue 1 instruction and not come back until it is done.
This statically defined wait length is potentially even simpler for the scheduler, which now doesn't need to care about what happens to the instructions it spits out. On the plus side, it allows for bunches of loads to increase MLP and coalescing opportunities while reducing the burden of tracking them.

This expends an instruction slot to define a memory clause of variable length.
AMD provided a separate sequencer for the TMUs in the previous GPUs. It's as if they took the sequencers, merged them, and exposed them to software.

Couldn't an aggressive compiler emit a load for register X, and then issue ALU ops using that same location, assuming they will use the old value prior to the load's completion? That might not be reliable if the vagaries of the scheduler or preemption delay some of them until near the time the load completes.
 
VERY sneaky. :)

It seems to have all the advantages of clauses, without many of the associated overheads.

Besides, you can now do prefetching for free.

It's surprising how much you can save by just getting rid of ILP and relying on TLP.
 
The loads would be sort of prefetches, though a load used in that way would still involve traffic from the cache to a register, which a normal prefetch would not do.
Going forward, the use of the x86 memory model also could lead to faults that prefetches typically do not need to cover, but a load would require handling, unless there is a no-register no-fault load option.
 
By the way, the highest number of tolerable outstanding fetches one can specify for continuation of execution is 15 vector memory reads (4-bit field), 7 memory writes (3-bit field) and 31 LDS/GDS/scalar memory accesses (5-bit field).
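For illustration, the three fields could be packed into a 16-bit immediate as in the Python sketch below; the bit positions (vmcnt in [3:0], expcnt in [6:4], lgkmcnt in [12:8]) are an assumption consistent with the field widths above, not something stated in the thread:

```python
# Sketch of packing an s_waitcnt immediate. The exact bit positions
# (vmcnt in [3:0], expcnt in [6:4], lgkmcnt in [12:8]) are assumed
# from the field widths mentioned above.

def s_waitcnt_imm(vmcnt=15, expcnt=7, lgkmcnt=31):
    """Pack the three counters into a 16-bit immediate.

    The defaults are the 'don't wait' maxima: a fully set field
    means that class of memory access is not waited on at all.
    """
    assert 0 <= vmcnt <= 15      # 4-bit field
    assert 0 <= expcnt <= 7      # 3-bit field
    assert 0 <= lgkmcnt <= 31    # 5-bit field
    return vmcnt | (expcnt << 4) | (lgkmcnt << 8)

# s_waitcnt 0 from the earlier example: wait for everything
print(hex(s_waitcnt_imm(0, 0, 0)))    # 0x0
# s_waitcnt vmcnt(4) & expcnt(3) & lgkmcnt(1)
print(hex(s_waitcnt_imm(4, 3, 1)))    # 4 | 3<<4 | 1<<8 = 0x134
```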
 
This might allow for latency hiding even if code is not as arithmetically dense as the earlier SIMD design would have liked. If the compiler can collect enough loads, 15 outstanding loads and 10 wavefronts would give 150 cycles of latency hiding before a wavefront would stall on the first load.
It falls apart if a data or control dependence pops up, and this scheme does expose risk if there is divergence.

I wonder if the hardware can meet the implied number of in-flight loads. 4 SIMDs with 150 issued loads per L1 cache? Does the 4-cycle wavefront issue count as 4 separate cache loads to be tracked?
The numbers involved seem to be huge.

The number of LDS/GDS accesses is even higher. This must hint at the expectation that the data share will be heavily contended and that there is a significant amount of latency involved in serializing around bank conflicts with the hardware used to manage the data share scheduling.
 
This might allow for latency hiding even if code is not as arithmetically dense as the earlier SIMD design would have liked. If the compiler can collect enough loads, 15 outstanding loads and 10 wavefronts would give 150 cycles of latency hiding before a wavefront would stall on the first load.
Guess you would have to multiply that number by the 4 cycles for which each SIMD gets scheduled.
It falls apart if a data or control dependence pops up, and this scheme does expose risk if there is divergence.
At some point the dependency on the loads has to be resolved either way. The solution can't do anything Nvidia's scoreboarding couldn't do; it just requires less hardware.
I wonder if the hardware can meet the implied number of in-flight loads. 4 SIMDs with 150 issued loads per L1 cache?
As a first point, it is the maximum the instruction format can handle. It doesn't tell us directly what the first incarnation of the hardware can do (but Cayman already supported 8 or even 16 [I would have to look it up] outstanding memory accesses per wavefront). Secondly, these are of course the maximums allowed per wavefront. There are probably other limits per CU (the memory pipeline is shared by all SIMDs) which, depending on your code and the conditions, you may hit earlier or not. And as a last point, in principle it defines only the maximum you can specify in that barrier-like instruction; there could be even more in flight in sections where it is not used (but that's unlikely in my opinion).
Does the 4-cycle wavefront issue count as 4 separate cache loads to be tracked?
That is highly unlikely as only complete wavefronts get issued, so the complete loads have to be ready either way.
The number of LDS/GDS accesses is even higher. This must hint at the expectation that the data share will be heavily contended and that there is a significant amount of latency involved in serializing around bank conflicts with the hardware used to manage the data share scheduling.
You have to take into account that it is shared with GDS and scalar memory accesses (access to the scalar L1 is shared by 4 CUs). I would expect the LDS to be pretty fast, as you can even broadcast a value from (a single location in) the LDS to all lanes of the vector ALU directly by choosing the LDS value as a source operand. This obviously takes precedence over pending LDS load/store instructions, where bank conflicts may need to be handled (increasing the latency), but it shows that the latency itself can't be very high, as it appears to be hidden by the pipeline without any additional waiting for the single value (where of course no bank conflict can occur).

Edit:
And from what I can tell (just my impression, I would need to look into it more) the compiler appears to try to allocate scalar registers as storage for small constant arrays and similar things. So you may have quite a few scalar memory instructions at the beginning of your kernel just loading that stuff from memory. You probably don't want that stalling the start of your kernel's execution.
 
At some point the dependency on the loads has to be resolved either way. The solution can't do anything Nvidia's scoreboarding couldn't do; it just requires less hardware.
If the scheduler is as dependent on the barrier instruction as it appears, it can do more. It may not be correct or defined behavior if a wavefront is set to run past N loads and in that window there is divergence or a set of ALU instructions dependent on one of the loads, which may or may not be ready in time.
 
If the scheduler is as dependent on the barrier instruction as it appears, it can do more. It may not be correct or defined behavior if a wavefront is set to run past N loads and in that window there is divergence or a set of ALU instructions dependent on one of the loads, which may or may not be ready in time.
In some sense, that would just be invalid code generated by the compiler. I'm pretty sure the compiler will always insert the appropriate instruction so that correct and reproducible behavior is guaranteed. If you bypass the compiler and write the code directly in assembly yourself, you can get to this point of course. But I wouldn't promote that as a feature :LOL:
 
Going forward, the use of the x86 memory model also could lead to faults that prefetches typically do not need to cover, but a load would require handling, unless there is a no-register no-fault load option.
Like what? Loads can complete out of order. Just stall when the destination register of the load is being used, else keep the load in flight.
 
This might allow for latency hiding even if code is not as arithmetically dense as the earlier SIMD design would have liked. If the compiler can collect enough loads, 15 outstanding loads and 10 wavefronts would give 150 cycles of latency hiding before a wavefront would stall on the first load.
A load can be issued only every 4 cycles, so it should be 600 cycles. Quite enough.
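Just to tally the numbers in this exchange (plain arithmetic; the 4-cycle issue cadence is this post's own assumption):

```python
# Latency-hiding back-of-the-envelope from the posts above.
loads_per_wavefront = 15    # max vmcnt the instruction format allows
wavefronts_per_simd = 10
cycles_per_issue = 4        # a SIMD issues one instruction every 4 cycles

# Naive count: one cycle per outstanding load
print(loads_per_wavefront * wavefronts_per_simd)                      # 150
# Accounting for the 4-cycle issue cadence
print(loads_per_wavefront * wavefronts_per_simd * cycles_per_issue)   # 600
```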
 
If the scheduler is as dependent on the barrier instruction as it appears, it can do more. It may not be correct or defined behavior if a wavefront is set to run past N loads and in that window there is divergence or a set of ALU instructions dependent on one of the loads, which may or may not be ready in time.

Why wouldn't the compiler simply insert another waitcnt instr after the branch if that code path is dependent on other in-flight loads?
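In GCN-flavored pseudo-assembly (a hypothetical sketch with made-up operands, not actual compiler output), that would look something like:

```asm
buffer_load_dword v0, ...     // load A
buffer_load_dword v1, ...     // load B
s_waitcnt vmcnt(1)            // only load A is needed before the branch
v_cmp_gt_f32 vcc, v0, v2
s_cbranch_vccz skip           // branch on the value of load A
s_waitcnt vmcnt(0)            // this path uses load B, so wait for it too
v_add_f32 v3, v1, v3
skip:
```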
 
Like what? Loads can complete out of order. Just stall when the destination register of the load is being used, else keep the load in flight.
A prefetch could fault by reading from an address that could be invalid for cacheability or protection reasons. A prefetch would give up with no ill effect.

A load can be issued only every 4 cycles, so it should be 600 cycles. Quite enough.
If the hardware is able to support that number of loads in flight, which I questioned earlier. That's a lot of in-flight accesses. The memory pipeline might max out at a lower number of outstanding loads.


Why wouldn't the compiler simply insert another waitcnt instr after the branch if that code path is dependent on other in-flight loads?
In the case that the compiler is correct and does insert another waitcnt, it shows a theoretical limit of the functionality and one area where it may not be practical to use it to the fullest.
If the compiler is incorrect, the code was created manually, or this was done on purpose, it may lead to interesting software behavior.
 
A prefetch could fault by reading from an address that could be invalid for cacheability or protection reasons. A prefetch would give up with no ill effect.
This type of prefetching is strictly in the programmer's hands, so it's safe to shift this burden onto them.
 