AMD: Southern Islands (7*** series) Speculation/ Rumour Thread

Can anyone confirm unified virtual address space for SI?
Or any of the features related to this slide?
 
I wasn't saying they aren't there. Just that the SIMDs can be scaled independently of TMUs now. See my earlier posts in this thread regarding AMD's ALU:TEX ratio.

Are you sure about this? And is it fully independent, meaning any ratio is possible, or is there still some constraint?

Full independence would sure be helpful for CU scaling.
 
I wasn't saying they aren't there. Just that the SIMDs can be scaled independently of TMUs now. See my earlier posts in this thread regarding AMD's ALU:TEX ratio.
If there is some exploitation of a relation between the number of SIMDs and the instruction latency(*) to keep the scheduling simple, it can't be scaled without taking some serious constraints into account. Anyway, the significant rise of the effective shader power of a CU will raise the effective ALU:TEX ratio anyway (edit: it will be the same as GF100/110-type SMs and 25% higher than GF104/114-style SMs).

(*):
With the round robin approach of scheduling wavefronts belonging to the individual SIMDs, it's basically required that the instruction latency is an integer multiple of the number of SIMDs. Easiest would be a 1:1 ratio. ;) But 2:1 still looks good (4 vector ALUs, latency 8 cycles; the next step to a higher ALU:TEX ratio would be 8 SIMDs, anything in between would require more sophisticated scheduling).
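The footnote's constraint can be put into a toy check (my own sketch, not anything from the slides): a scheduler issuing one instruction per cycle round-robin over N SIMDs revisits each SIMD every N cycles, so dependent instructions can be issued without any dependency tracking only if the ALU latency is an integer multiple of N.

```python
# Toy model (invented helper, not AMD's actual scheduler): one scheduler
# visits N SIMDs round-robin, one instruction per cycle, so a given SIMD is
# revisited every N cycles. A dependent instruction can issue safely without
# dependency tracking only if the result is ready on some revisit, i.e. the
# ALU latency is an integer multiple of N.
def back_to_back_ok(num_simds: int, alu_latency: int) -> bool:
    revisit_interval = num_simds  # cycles between visits to the same SIMD
    return alu_latency % revisit_interval == 0

assert back_to_back_ok(4, 4)       # 1:1 ratio: result ready exactly on revisit
assert back_to_back_ok(4, 8)       # 2:1 also works (two interleaved wavefronts)
assert not back_to_back_ok(6, 8)   # 6 SIMDs + 8-cycle latency needs real tracking
```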
 
If there is some exploitation of a relation between the number of SIMDs and the instruction latency(*) to keep the scheduling simple, it can't be scaled without taking some serious constraints into account. Anyway, the significant rise of the effective shader power of a CU will raise the effective ALU:TEX ratio anyway (edit: it will be the same as GF100/110-type SMs and 25% higher than GF104/114-style SMs).

(*):
With the round robin approach of scheduling wavefronts belonging to the individual SIMDs, it's basically required that the instruction latency is an integer multiple of the number of SIMDs. Easiest would be a 1:1 ratio. ;) But 2:1 still looks good (4 vector ALUs, latency 8 cycles; the next step to a higher ALU:TEX ratio would be 8 SIMDs, anything in between would require more sophisticated scheduling).

Hmmm, I don't follow. How does scheduling benefit from the relationship between SIMD count and instruction latency? It should just follow the normal mantra of having enough wavefronts available to hide instruction latency. There shouldn't be any hardcoded assumptions at play - that's why they got rid of clauses.

The number of SIMDs is determined by the number of instruction schedulers/dispatchers, the width of the SIMD and the wavefront size.

# SIMDs (x) = instruction dispatch rate * wavefront size / SIMD width.

Since the SIMDs are pipelined you should ALWAYS be able to issue a new wavefront to a given SIMD every x clocks regardless of instruction latency.
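Plugging in the numbers from this thread (speculative, of course) makes the relation concrete:

```python
# Worked example of the relation above, using the figures discussed in this
# thread (one instruction dispatched per clock, 64-wide wavefronts, 16-wide
# vector ALUs). These numbers are speculation, not confirmed hardware specs.
dispatch_rate = 1      # instructions issued per clock
wavefront_size = 64    # work-items per wavefront
simd_width = 16        # lanes per vector ALU

num_simds = dispatch_rate * wavefront_size // simd_width
assert num_simds == 4  # each SIMD accepts a new instruction every 4 clocks
```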
 
AMD pulled a fermi, but better.
Well, that remains to be seen. It looks OK so far on paper, but we don't really know yet how well it'll work in practice...
Though a 32 CU chip would be quite a monster indeed, basically twice the shader capacity of GF110. The raw flop number wouldn't be all that much higher than Cypress/Cayman, but if efficiency (not per area but per theoretical throughput) improves to GF110 levels, that would be quite a beast. Granted, maybe Nvidia will be able to fit twice a GF110 on 28nm too...
Though wouldn't that also need quite a bit more memory bandwidth to really shine? I wonder what we're gonna see there.


Anyway, the significant raise of the effective shader power of a CU will raise the effective ALU:TEX ratio anyway (edit: will be the same as GF100/110 type SMs and 25% higher than GF104/114 style SMs).
64 FMA : 4 bilinear samples looks like a 33% increase from 96 FMA : 8 bilinear samples to me...
 
In his presentation, Demers said they did a lot of work in general on the caches and memory utilization/efficiency for the architecture. So this is aimed not just at improving APU performance, where main memory bandwidth is shared and constrained, but also at the larger discrete parts, which will have their own dedicated memory.

I tried my best to cover his presentation and the materials passed to me, but what I was given just had too many holes in it to get a complete overview and understanding: http://www.pcper.com/reviews/Graphi...ecture-Overview-Southern-Isle-GPUs-and-Beyond
 
64 FMA : 4 bilinear samples looks like a 33% increase from 96 FMA : 8 bilinear samples to me...
My bad, I used the wrong base. So a GF104-style SM has a 25% lower ALU:TEX ratio than AMD's CU (or the current SIMD engine; at least theoretically, it's harder to compare), or AMD's ratio is 33% higher.
 
Thanks PcPer and hardware.fr! Realworldtech is going to do an article, but I'm really disappointed there hasn't been more coverage.
 
AMD pulled a fermi, but better.
It's certainly an improvement over previous AMD architectures, but Fermi likely still has more efficient caching (not just because it has more L1, but also since it can likely cover L1 latency with fewer workgroups in flight).
 
Hmmm, I don't follow. How does scheduling benefit from the relationship between SIMD count and instruction latency?
You can use a simple round robin scheme for the schedulers (as implied in the presentation) and you are guaranteed to be able to issue a dependent instruction without tracking dependencies between arithmetic instructions. It's the same reasoning as for the two interleaved wavefronts used from R600 through Cayman.
It should just follow the normal mantra of having enough wavefronts available to hide instruction latency. There shouldn't be any hardcoded assumptions at play - that's why they got rid of clauses.
AMD currently hides the instruction latencies with the execution of two interleaved wavefronts. They don't need to track dependencies between individual instructions at all. They only need to track dependencies between clauses (i.e. involving memory clauses; arithmetic ones don't depend on each other and can simply be executed in order).
They get rid of the clauses because control flow (which opens up a new clause) was inefficient (as was, generally, all code with a lot of short clauses). Clauses simplify checking for dependencies (at least one kind of them). The arithmetic dependencies were completely hidden by the interleaved wavefront issue.
The number of SIMDs is determined by the number of instruction schedulers/dispatchers, the width of the SIMD and the wavefront size.

# SIMDs (x) = instruction dispatch rate * wavefront size / SIMD width.

Since the SIMDs are pipelined you should ALWAYS be able to issue a new wavefront to a given SIMD every x clocks regardless of instruction latency.
But if you increase the number of SIMDs to, let's say, 6 and you use the same scheduling scheme as suggested (round robin), it's not going to be any faster than putting in just 4 SIMDs ;)

Wavefront size stays at 64 as stated and the width of the vector ALUs is 16, so a new instruction can be accepted every 4 clocks. If you have more than 4 SIMDs, you can't serve them round robin anymore, because you will come by only every 6 cycles for instance (when you have 6 SIMDs). There is basically no easy way to increase that number arbitrarily; you have to double your scheduling width (twice the number of instructions per cycle) to accommodate 8 SIMDs. Something in between doesn't make much sense.
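The 6-SIMD mismatch can be shown with a little arithmetic (a toy model of my own, under the thread's assumed numbers): a 16-wide SIMD chews through a 64-wide wavefront in 4 clocks, so it can accept a new instruction every 4 clocks, while a round-robin scheduler over n SIMDs only comes back every n clocks.

```python
# Toy model (my own, assuming 64-wide wavefronts and 16-wide vector ALUs):
# fraction of a SIMD's issue slots a single round-robin scheduler can fill.
def simd_utilization(num_simds, wavefront_size=64, simd_width=16):
    accept_interval = wavefront_size // simd_width       # SIMD ready every 4 clocks
    revisit_interval = max(num_simds, accept_interval)   # scheduler-limited beyond 4
    return accept_interval / revisit_interval

assert simd_utilization(4) == 1.0                # fed exactly as fast as it can accept
assert abs(simd_utilization(6) - 4 / 6) < 1e-9   # each SIMD idle 2 of every 6 clocks
```

So 6 SIMDs under one scheduler would be no faster than 4; only doubling the scheduling width (8 SIMDs) changes the picture.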

Let's go back to the 4 SIMDs seen on the slides. The instruction latency comes into play when you think along the lines of keeping things simple. If it is 8 cycles for instance, you need to issue the next instruction for that SIMD from another wavefront. Welcome back to the interleaved wavefront scheme (now just for more than two). Only with a 4-cycle latency do you get a "vector back-to-back wavefront issue" as mentioned on the slides and get rid of this interleaving as promised on one of the slides (but one may argue about it, I agree). Everything between 5 and 8 cycles requires at least 2 wavefronts/SIMD, 9-12 will require 3, and so on. While just marking a wavefront as "busy", i.e. suppressing the issue of an instruction for the next (one or two) times, sounds fairly easy, it will nevertheless complicate things a bit (it will have to be checked when considering instructions for issue) and will also raise the minimum number of wavefronts necessary to hide the arithmetic latencies. => Bad

An old SIMD engine could attain peak rates with only 2 wavefronts (128 data elements). With a 4-cycle latency a CU could get to peak rates with 4 wavefronts (256 data elements). With an 8-cycle latency (everything in between does not make a difference and is only a possible reason for limiting the clock speed) you need at least 8 wavefronts (512 data elements), and for 12 cycles you are at 12 wavefronts (768 data elements).

For comparison, a GF100 SM needs 576 data elements (18 warps, the instruction latency) in the worst case (absolutely no ILP present in the instruction stream), down to 192 data elements (6 warps) with ILP=4 (it does not scale beyond that; that is obviously the size of the instruction window the Fermi schedulers look at). So especially for lowly threaded problems, there is a strong incentive to keep the latencies down so one doesn't fall behind Fermi (AMD's CUs probably don't care about ILP at all, as this would again mean one would need to track dependencies between arithmetic instructions). And as the effective size of the register file available to one thread is quite a bit lower in the new architecture (because each SIMD has a separate one; before, the instructions in VLIW slot z for instance could access elements from all other banks), that's another incentive to keep the number of threads somewhat in check, i.e. the latencies down.
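The occupancy numbers above follow from a one-line formula (my own restatement of the argument, assuming 4 SIMDs, 64-wide wavefronts, and one issue per SIMD every 4 clocks):

```python
import math

# Minimum wavefronts per CU to reach peak ALU rate at a given ALU latency
# (toy model restating the occupancy argument above; all figures speculative).
def min_wavefronts(alu_latency, num_simds=4, issue_interval=4):
    per_simd = math.ceil(alu_latency / issue_interval)  # wavefronts interleaved per SIMD
    return per_simd * num_simds

assert min_wavefronts(4) * 64 == 256    # 4-cycle latency: 4 wavefronts, 256 elements
assert min_wavefronts(8) * 64 == 512    # 8-cycle latency: 8 wavefronts, 512 elements
assert min_wavefronts(12) * 64 == 768   # 12-cycle latency: 12 wavefronts, 768 elements
```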

The alternative is Fermi: in comparison, much more scheduling effort to hide the long latencies, not only for memory accesses but also for plain arithmetic instructions. I would guess AMD tries to save something on the latter part, especially as they claimed that the die real estate of a CU will not rise excessively compared to Cayman's SIMD engines (it would have to rise either way, with the scalar unit, double the LDS, and double the L1, which is now R/W).
 
Sure, but designing your hw assuming devs will use little LDS when the spec exposes 32K is a poor design choice. Although it's something that will work.
Why should LDS be different from any other resource? If you oversubscribe a resource, performance goes down; that's not complicated nor different from anything else.
 
I think devs would expect to get good perf even if they use a lot of LDS if the spec says they can do so.
Naa, the spec allows for textures with dimensions of 8192x8192 (or 16k x 16k now?). I think no developer would expect good performance if he really used that excessively. :rolleyes:
 
You can use a simple round robin scheme for the schedulers (as implied in the presentation) and you are guaranteed to be able to issue a dependent instruction without tracking dependencies between arithmetic instructions. It's the same reasoning as for the two interleaved wavefronts used from R600 through Cayman.

Ah, so you're taking the position that they're not doing any fancy scoreboarding and just relying on the same sort of assumptions they made for the interleaving of clauses. How well is that going to work when you toss variable latency memory transactions in there? If you size your CU based on instruction latencies only, memory latencies could make things go south real fast.

Wavefront size stays at 64 as stated and the width of the vector ALUs is 16, so a new instruction can be accepted every 4 clocks. If you have more than 4 SIMDs, you can't serve them round robin anymore, because you will come by only every 6 cycles for instance (when you have 6 SIMDs). There is basically no easy way to increase that number arbitrarily; you have to double your scheduling width (twice the number of instructions per cycle) to accommodate 8 SIMDs. Something in between doesn't make much sense.

I agree, think I summed it up in my equation above :)

The alternative is Fermi: in comparison, much more scheduling effort to hide the long latencies, not only for memory accesses but also for plain arithmetic instructions. I would guess AMD tries to save something on the latter part, especially as they claimed that the die real estate of a CU will not rise excessively compared to Cayman's SIMD engines (it would have to rise either way, with the scalar unit, double the LDS, and double the L1, which is now R/W).
Right and that's where I'm missing something. 10 wavefronts per SIMD limits latency hiding to 40 cycles. Isn't that a bit on the low side, especially for an architecture that can be running different apps and kernels in parallel? What happens when a few wavefronts hit long latency memory instructions (> 40 cycles)? Do you get a 4 cycle bubble in the pipeline or does the entire thing stall?

In contrast, if we look at Fermi:

Pipeline depth: Fermi: 36 clocks, FSA: 4 clocks (if consecutive issue of dependent instructions is really possible)
Wavefronts per SIMD: Fermi: 24, FSA: 10
Latency hiding: Fermi: 48 clocks, FSA: 40 clocks

I have to think that FSA will be doing some sort of scoreboarding / readiness tracking of its wavefronts and not just blindly doing round-robin issue.
 
Right and that's where I'm missing something. 10 wavefronts per SIMD limits latency hiding to 40 cycles. Isn't that a bit on the low side, especially for an architecture that can be running different apps and kernels in parallel? What happens when a few wavefronts hit long latency memory instructions (> 40 cycles)? Do you get a 4 cycle bubble in the pipeline or does the entire thing stall?
It's round-robin amongst the SIMDs, and then it picks amongst the ~10 wavefronts on a SIMD in that cycle.
It's 40*4 or 160 cycles before you potentially need to worry about the next instruction, and that's only if all the other wavefronts are stalled. (edit: to clarify, if all the other wavefronts issue once and then stall after the instruction issue in question)
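Reproducing the arithmetic as stated above (a speculative model of the thread's assumed hardware, not confirmed behavior):

```python
# Numbers from the post above (speculative): ~10 wavefronts resident per SIMD,
# 4 SIMDs per CU, each vector instruction occupying 4 clocks on its 16-wide SIMD.
wavefronts_per_simd = 10
num_simds = 4
clocks_per_instruction = 4

# All other resident wavefronts issuing once each ("40*4") before the stalled
# wavefront is needed again:
total_wavefronts = wavefronts_per_simd * num_simds       # 40
cu_window = total_wavefronts * clocks_per_instruction    # 160 cycles of slack
assert cu_window == 160
```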
 
Ah, so you're taking the position that they're not doing any fancy scoreboarding and just relying on the same sort of assumptions they made for the interleaving of clauses. How well is that going to work when you toss variable latency memory transactions in there? If you size your CU based on instruction latencies only, memory latencies could make things go south real fast.
As I said already, you still have to track memory dependencies of course. You just don't have to care about dependencies between arithmetic instructions anymore. That basically reduces the dimensionality of a scoreboard to a score line. Appears simpler to me ;)
Right and that's where I'm missing something. 10 wavefronts per SIMD limits latency hiding to 40 cycles. Isn't that a bit on the low side, especially for an architecture that can be running different apps and kernels in parallel? What happens when a few wavefronts hit long latency memory instructions (> 40 cycles)? Do you get a 4 cycle bubble in the pipeline or does the entire thing stall?
Are you assuming that all 10 wavefronts per SIMD issue memory operations at the same time and can't continue until they complete (because the next arithmetic instruction accesses the destination register of a pending memory instruction), with no wavefront left doing calculations? How probable is that? According to the presentation, all wavefronts on one SIMD are evaluated for the possibility of scheduling their next instruction (which means the scheduler has to check 10 candidate instructions for conflicts with pending memory operations of the respective wavefront, but not with all instructions in flight, and it basically has 4 cycles for that check). There is no round robin scheme there. So if any wavefront on the SIMD has a vector instruction left, the SIMD will not be idle.
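The "score line" idea can be sketched as a per-wavefront readiness flag instead of a full instruction scoreboard (purely illustrative, names invented, nothing here is from AMD's documentation):

```python
# Illustrative sketch (invented names): instead of a scoreboard over all
# instructions in flight, keep one bit per resident wavefront saying whether
# its next instruction waits on a pending memory operation. The scheduler
# then just picks any ready wavefront on the SIMD.
def pick_wavefront(waiting_on_memory):
    """Return the index of the first ready wavefront, or None if all wait."""
    for i, waiting in enumerate(waiting_on_memory):
        if not waiting:
            return i
    return None  # the SIMD idles only if every resident wavefront is blocked

assert pick_wavefront([True, True, False, True]) == 2
assert pick_wavefront([True] * 10) is None
```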
In contrast, if we look at Fermi:

Pipeline depth: Fermi: 36 clocks, FSA: 4 clocks (if consecutive issue of dependent instructions is really possible)
Afaik, Fermi is at 18 cycles for single precision instructions done on the vector ALUs. It isn't that bad.
Wavefronts per SIMD: Fermi: 24, FSA: 10
Latency hiding: Fermi: 48 clocks, FSA: 40 clocks
That would mean that in absolute terms FSA can hide more latency time (because clockrate is significantly lower)? So where is the problem then?
I have to think that FSA will be doing some sort of scoreboarding / readiness tracking of its wavefronts and not just blindly doing round-robin issue.
As explained above, it has to do something for memory instructions of course, but in my opinion not for dependencies between arithmetic instructions.

What FSA misses compared to Fermi is load balancing between the SIMDs. With FSA, the register files are pinned to a certain vector ALU, unlike with the VLIW approach (only the write port was fixed; each bank of the register file could be read) and also unlike with Fermi's register file (both vector ALUs have access to the whole register file). That needed some complexity and time for collecting the operands, which can now be shaved off with FSA (the disadvantage is a smaller effective register file).
As mentioned already, the Cayman VLIWs apparently support dependent instructions within a VLIW (also multiplications, which is basically the maximum effort). That tells me that the ALUs themselves have no problem completing the operations within 4 clock cycles at <= 1GHz. But reading the operands for the operations from the 4 register file banks is distributed over 3 cycles, and they are swizzled to the right VLIW slot (that's where some of the weird rules came from) according to the documentation. That will not be needed anymore, which makes things much more straightforward and also saves some time. A similar problem as sketched for the VLIW applies to Fermi, too. Both vector ALUs and also the SFUs need to access the full register file. And as opposed to VLIW (where the slot determines the bank to write to), the results have to be routed back. So you need more ports on your register files and a more sophisticated operand collector and result network. And all that costs time (latency).

Edit:
Btw., I really think one has to give AMD some credit for naming things what they are. No stupid "scalar" ALUs (besides the one which really is scalar ;)) or such stuff. No, they named a SIMD a SIMD, and a vector ALU a vector ALU.
 
We were told directly by Eric Demers that the first products using this technology will be released in Q4 of this year. But on the APU side it looks like Trinity is based on the VLIW4 of the Cayman chips. So it could be another 1.5 to 2 years before we see this technology in the APUs. Now, having said that, I am curious whether the Brazos update will in fact utilize the FSA?
 