I wasn't saying they aren't there. Just that the SIMDs can be scaled independently of TMUs now. See my earlier posts in this thread regarding AMD's ALU:TEX ratio.
So can someone explain the meaning of all this for someone who is humbly ignorant of what has transpired recently?
> I wasn't saying they aren't there. Just that the SIMDs can be scaled independently of TMUs now. See my earlier posts in this thread regarding AMD's ALU:TEX ratio.

If there is some exploitation of a relation between the number of SIMDs and instruction latency(*) to keep the scheduling simple, it can't be scaled without taking some serious constraints into account. In any case, the significant rise in the effective shader power of a CU will raise the effective ALU:TEX ratio (edit: it will be the same as GF100/110-style SMs and 25% higher than GF104/114-style SMs).
(*): With the round-robin approach of scheduling wavefronts belonging to the individual SIMDs, it's basically required that the instruction latency is an integer multiple of the number of SIMDs. Easiest would be a 1:1 ratio, but 2:1 still looks good (4 vector ALUs, latency 8 cycles; the next step to a higher ALU:TEX ratio would be 8 SIMDs, and anything in between would require more sophisticated scheduling).
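To put numbers on that footnote, here is a minimal sketch (my illustration only, not anything from the presentation; the scheduler model is an assumption): a single scheduler visits one SIMD per cycle in round robin, so a given SIMD gets an issue slot every num_simds cycles, and a dependent instruction can only go on the first visit at or after the ALU latency has elapsed.

```python
# Toy model of the footnote's argument (illustration only): a round-robin
# scheduler revisits a given SIMD every `num_simds` cycles; a dependent
# instruction issues on the first visit at or after the ALU latency.
def dependent_issue_gap(num_simds, alu_latency):
    visit = num_simds                  # first revisit of the same SIMD
    while visit < alu_latency:         # too early, result not ready yet
        visit += num_simds
    return visit - alu_latency         # 0 means back-to-back issue works

for n in (2, 4, 6, 8):
    gap = dependent_issue_gap(n, alu_latency=8)
    print(f"{n} SIMDs, 8-cycle latency: {gap} wasted cycle(s) per dependent issue")
```

With an 8-cycle latency, 2, 4, and 8 SIMDs give a gap of 0 (the latency is an integer multiple of the revisit interval), while 6 SIMDs waste 4 cycles per dependent issue; exactly the "anything in between" problem.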
> AMD pulled a Fermi, but better.

Well, that remains to be seen. It looks OK so far on paper, but we don't really know yet how well it'll work in practice...
> Anyway, the significant rise in the effective shader power of a CU will raise the effective ALU:TEX ratio (edit: it will be the same as GF100/110-style SMs and 25% higher than GF104/114-style SMs).

64 FMA : 4 bilinear samples looks like a 33% increase over 96 FMA : 8 bilinear samples to me...
> 64 FMA : 4 bilinear samples looks like a 33% increase over 96 FMA : 8 bilinear samples to me...

My bad, I used the wrong base. So a GF104-style SM has a 25% lower ALU:TEX ratio than AMD's CU (or the current SIMD engine, at least theoretically, though that's harder to compare), or equivalently, AMD's ratio is 33% higher.
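The two percentages are the same comparison expressed against different bases; a quick check using only the numbers from the posts above:

```python
# Just the arithmetic behind the two percentages (same comparison, two
# different bases); no assumptions beyond the posts' numbers.
cu_ratio = 64 / 4   # GCN-style CU: 16 FMAs per bilinear sample
sm_ratio = 96 / 8   # GF104-style SM: 12 FMAs per bilinear sample

print(f"CU vs SM base: {cu_ratio / sm_ratio - 1:+.0%}")  # +33%: AMD's ratio is higher
print(f"SM vs CU base: {sm_ratio / cu_ratio - 1:+.0%}")  # -25%: GF104's ratio is lower
```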
> AMD pulled a Fermi, but better.

It's certainly an improvement over previous AMD architectures, but Fermi likely still has more efficient caching (not just because it has more L1, but also because it can likely cover L1 latency with fewer workgroups in flight).
> Hmmm, I don't follow. How does scheduling benefit from the relationship between SIMD count and instruction latency?

You can use a simple round-robin scheme for the schedulers (as implied in the presentation) and you are guaranteed to be able to issue a dependent instruction without tracking dependencies between arithmetic instructions. It's the same reasoning as for the two interleaved wavefronts used from R600 through Cayman.
> It should just follow the normal mantra of having enough wavefronts available to hide instruction latency. There shouldn't be any hardcoded assumptions at play - that's why they got rid of clauses.

AMD currently hides the instruction latencies with the execution of two interleaved wavefronts. They don't need to track dependencies between individual instructions at all. They only need to track dependencies between clauses (involving memory clauses; arithmetic clauses don't have dependencies on each other and can simply be executed in order).
> The number of SIMDs is determined by the number of instruction schedulers/dispatchers, the width of the SIMD and the wavefront size:
>
> #SIMDs (x) = instruction dispatch rate * wavefront size / SIMD width
>
> Since the SIMDs are pipelined, you should ALWAYS be able to issue a new wavefront to a given SIMD every x clocks regardless of instruction latency.

But if you increase the number of SIMDs to, let's say, 6 and you use the same scheduling scheme as suggested (round robin), it's not going to be any faster than putting in just 4 SIMDs.
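Plugging the thread's numbers into the quoted formula (the 1-instruction-per-clock dispatch rate is my assumption for illustration; wavefront size and SIMD width are as stated):

```python
# The quoted formula with the numbers used elsewhere in the thread.
dispatch_rate = 1     # vector instructions dispatched per clock (assumed)
wavefront_size = 64   # as stated in the presentation
simd_width = 16       # lanes per vector ALU, as stated

x = dispatch_rate * wavefront_size // simd_width
print(x)  # 4 SIMDs; each one accepts a new instruction every 4 clocks
```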
> Sure, but designing your hw assuming devs will use little LDS when the spec exposes 32K is a poor design choice. Although it's something that will work.

Why should LDS be different from any other resource? If you oversubscribe a resource, performance goes down; that's not complicated nor different from anything else.
> I think devs would expect to get good perf even if they use a lot of LDS if the spec says they can do so.

Naa, the spec allows for textures with dimensions of 8192x8192 (or 16k x 16k now?). I think no developer would expect good performance if he really used that excessively.
> You can use a simple round-robin scheme for the schedulers (as implied in the presentation) and you are guaranteed to be able to issue a dependent instruction without tracking dependencies between arithmetic instructions. It's the same reasoning as for the two interleaved wavefronts used from R600 through Cayman.

Wavefront size stays at 64 as stated, and the width of the vector ALUs is 16, so every 4 clocks a new instruction can be accepted. If you have more than 4 SIMDs, you can't serve them round-robin anymore, because you would come by only every 6 cycles, for instance (when you have 6 SIMDs). There is basically no easy way to increase that number arbitrarily; you have to double your scheduling width (twice the number of instructions per cycle) to accommodate 8 SIMDs. Something in between doesn't make much sense.
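A toy throughput model of that claim (my sketch, not the real scheduler): one dispatcher issues at most one vector instruction per clock, round-robin over the SIMDs, and a 16-wide SIMD spends 4 clocks consuming a 64-wide wavefront instruction.

```python
# Why 6 SIMDs buy nothing over 4 under a single round-robin dispatcher
# (toy model, illustration only).
def run(num_simds, cycles=1200):
    next_free = [0] * num_simds    # clock at which each SIMD can accept again
    issued = 0
    for clk in range(cycles):
        simd = clk % num_simds     # the dispatcher visits one SIMD per clock
        if next_free[simd] <= clk:
            next_free[simd] = clk + 4   # 64-wide wavefront / 16 lanes = 4 clocks
            issued += 1
    issue_rate = issued / cycles                     # capped at 1.0 by dispatch
    simd_busy = issued * 4 / (cycles * num_simds)    # fraction of time each SIMD works
    return issue_rate, simd_busy

for n in (4, 6, 8):
    rate, busy = run(n)
    print(f"{n} SIMDs: {rate:.2f} instr/clk total, each SIMD busy {busy:.0%}")
```

Total throughput stays at 1 instruction per clock in all three cases; the extra SIMDs just sit idle a growing fraction of the time (100%, 67%, 50% busy), which is why going past 4 needs a wider scheduler rather than more SIMDs.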
> The alternative is Fermi: in comparison, much more effort for the scheduling to hide the long latencies, not only for memory accesses but also for plain arithmetic instructions. I would guess AMD tries to save something on the latter part, especially as they claimed that the die real estate of a CU will not rise excessively compared to Cayman's SIMD engines (it would have to rise either way with the scalar units, double the LDS, and double the L1, which is now R/W).

Right, and that's where I'm missing something. 10 wavefronts per SIMD limits latency hiding to 40 cycles. Isn't that a bit on the low side, especially for an architecture that can be running different apps and kernels in parallel? What happens when a few wavefronts hit long-latency memory instructions (> 40 cycles)? Do you get a 4-cycle bubble in the pipeline, or does the entire thing stall?
> Right, and that's where I'm missing something. 10 wavefronts per SIMD limits latency hiding to 40 cycles. [...]

It's round-robin amongst the SIMDs, and then it picks amongst the ~10 wavefronts on a SIMD in that cycle.
> Ah, so you're taking the position that they're not doing any fancy scoreboarding and just relying on the same sort of assumptions they made for the interleaving of clauses. How well is that going to work when you toss variable-latency memory transactions in there? If you size your CU based on instruction latencies only, memory latencies could make things go south real fast.

As I said already, you still have to track memory dependencies, of course. You just don't have to care about dependencies between arithmetic instructions anymore. That basically reduces the dimensionality of a scoreboard to a score line. Appears simpler to me.
> Right, and that's where I'm missing something. 10 wavefronts per SIMD limits latency hiding to 40 cycles. [...]

Are you assuming that all 10 wavefronts per SIMD are issuing memory operations at the same time and can't continue until they complete (because the next arithmetic instruction accesses the destination register of a pending memory instruction), with no wavefront left doing any calculations? How probable is that? According to the presentation, all wavefronts on one SIMD are evaluated for the possibility of scheduling their next instruction (which means the scheduler has to check 10 possible instructions for conflicts with pending memory operations of the respective wavefront, but not with all instructions in flight, and it basically has 4 cycles for that check). There is no round-robin scheme there. So if any wavefront for the SIMD has a vector instruction left, the SIMD will not be idle.
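A toy model of that "score line" idea (my reading of the posts, not AMD's actual hardware; all names here are made up for illustration): per wavefront, track only the registers awaiting a pending memory result, and an instruction is issuable if it doesn't touch one of them. That's a one-dimensional check over ~10 candidates, not an NxN instruction scoreboard.

```python
# Toy "score line": readiness is a per-wavefront check against pending
# memory writes only; arithmetic-to-arithmetic dependencies need no tracking.
class Wavefront:
    def __init__(self, program):
        self.program = program        # list of (op_name, registers_touched)
        self.pc = 0
        self.pending_regs = set()     # registers awaiting a memory result

    def ready(self):
        if self.pc >= len(self.program):
            return False
        op, regs = self.program[self.pc]
        return not (regs & self.pending_regs)   # conflict with a pending load?

def pick(wavefronts):
    # Evaluate all ~10 candidate wavefronts (one instruction each), oldest first.
    for wf in wavefronts:
        if wf.ready():
            return wf
    return None  # only if *every* wavefront is blocked does the SIMD go idle

wfs = [Wavefront([("v_add", {"v0", "v1"})]) for _ in range(10)]
wfs[0].pending_regs = {"v0"}     # wavefront 0 waits on a load into v0
print(pick(wfs) is wfs[1])       # True: the scheduler just skips to the next ready one
```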
> In contrast, if we look at Fermi:
> Pipeline depth: Fermi: 36 clocks, FSA: 4 clocks (if consecutive issue of dependent instructions is really possible)

Afaik, Fermi is at 18 cycles for single-precision instructions done on the vector ALUs. It isn't that bad.

> Wavefronts per SIMD: Fermi: 24, FSA: 10
> Latency hiding: Fermi: 48 clocks, FSA: 40 clocks

That would mean that in absolute terms FSA can hide more latency time (because the clock rate is significantly lower)? So where is the problem then?
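Putting rough numbers on "absolute terms": the cycle counts are from the quoted post, but the clock rates below are my assumption for illustration (a Fermi hot clock around 1.54 GHz as on a GTX 580, and an AMD shader clock around 0.88 GHz as on Cayman).

```python
# Rough absolute-time version of the comparison above; clock rates assumed.
fermi_hiding_ns = 48 / 1.54   # 48 clocks at ~1.54 GHz: ~31 ns hidden
fsa_hiding_ns   = 40 / 0.88   # 40 clocks at ~0.88 GHz: ~45 ns hidden

print(f"Fermi: {fermi_hiding_ns:.0f} ns, FSA: {fsa_hiding_ns:.0f} ns")
```

Under those assumptions the lower-clocked part does hide more wall-clock latency with fewer wavefronts, which is the point of the question.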
> I have to think that FSA will be doing some sort of scoreboarding / readiness tracking of its wavefronts and not just blindly doing round-robin issue.

As explained above, it has to do something for memory instructions of course, but in my opinion not for dependencies between arithmetic instructions.
> I tried my best to cover his presentation and the materials passed to me, but what I was given just had too many holes in it to get a complete overview and understanding: http://www.pcper.com/reviews/Graphi...ecture-Overview-Southern-Isle-GPUs-and-Beyond

Thank you!