AMD: R7xx Speculation

I don't agree with most of your assertions here. You don't have to be able to branch every scalar instruction. Every fourth would match branching performance with the old design. As for transcendentals, I said let's ignore them for simplicity, but if you want to go there then I will.
Maybe I misinterpreted what design you are proposing.

Okay, so let's compare the old design (A) with the new one (B). A has 16x(4x1D + 1D) SIMD units, and B has 64x1D + 16x1D SIMD units (MAD + transcendental).
You are saying that design B has 64x1D +16x1D in a SIMD.
How many individual elements are processed per clock?
Design A has 16 elements being processed in a given clock cycle, hence why the 80 units per SIMD are divided up into 16 groups of 5 ALUs.
What has design B changed, exactly, other than distributing the 16 over the terms in the parenthesis?
By your doing so, I interpret it as meaning that all 64 elements have one component evaluated per clock.
If not, why did you change the 16-unit division of ALUs?
I don't see how it's related to the thread-switching scheme that follows.

A has a "macrobatch" of two 64-thread batches, and B's consists of eight 64-thread batches. A's macrobatches can be switched every 8 cycles, B's every 32 cycles. Both have instruction packets of (4x1D + 1D), but B has the additional flexibility of dependency in the MAD parts.
Design A has to have enough registers to handle two 64-thread batches.
Design B needs enough to handle eight.
Either that, or each clause is 1/4 the size of those found in A, and clause setup overhead is quadruple that of A. The absolute amount of overhead is not something I'm aware of.

You can see that instruction packet throughput and branch throughput are the same in both systems, so you don't really need more resources there for decoding/fetching/whatever.
Each thread gets a sequencer in the SIMD's control logic.
Design A has two.
Why wouldn't B have eight?
The live set of registers is also 8 times as large, over most of the 32 clocks of execution.
The most design A will have to worry about is 2 clauses' worth.
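
To pin down the cadence I'm describing, here's the arithmetic as a quick Python toy (the function and parameter names are mine, purely for illustration):

Code:
# Toy model of the two SIMD layouts: design A = 16x(4x1D + 1D), design B = 64x1D + 16x1D.
# Numbers are the ones from this thread, not measurements.
def packet_cadence(elements_per_clock, components_per_element_per_clock,
                   batches_per_macrobatch, batch_size=64, packet_mad_slots=4):
    """Clocks for one batch to finish a (4x1D + 1D) packet, plus the
    macrobatch switch interval that falls out of it."""
    mad_work = batch_size * packet_mad_slots                # MAD slots per packet per batch
    mad_rate = elements_per_clock * components_per_element_per_clock
    clocks_per_batch = mad_work // mad_rate
    return clocks_per_batch, clocks_per_batch * batches_per_macrobatch

# Design A: 16 elements/clock, each gets its full 4 MAD components that clock, macrobatch = 2 batches.
print("A:", packet_cadence(16, 4, 2))   # -> (4, 8):  switch every 8 clocks
# Design B: 64 elements/clock, one MAD component each, macrobatch = 8 batches.
print("B:", packet_cadence(64, 1, 8))   # -> (4, 32): switch every 32 clocks

Same instruction-packet throughput either way; only the switch interval changes.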
 
It's a PhysX test alright. And PhysX was originally meant to be done on the CPU or - if you had one installed - on a PPU. I'm sure they would've removed the test had they known that NV was going to take over Ageia and turn their GPU into a PPU.

Futuremark needs to take a stand. They might approve of it if you use a separate GF card (not in SLI) to process physics instead of doing partial physics on the GPU like it's being done now. The way it's done now just goes against their own Driver Approval Policy... it artificially boosts the 3DM score in a way it's not meant to be benched...

I thought this was the intent of the CPU2 test.
 
That's not under debate. The question is whether it's Nvidia's duty to disable their PhysX driver when 3dmark is running. IMO it's not. 3dmark makes calls to the PhysX API. If GPU accelerated PhysX wasn't factored into the assumptions that FM made in designing the test that way then they are at fault. They also vastly underestimated how much faster a GPU would be than a PPU (assuming Nvidia really isn't cheating to get to those performance numbers).
It's not Nvidia's duty to disable it, but wouldn't the scores be termed useless if they don't meet the testing policy?
 
That means a given element's instruction on a SIMD will not hit the N+1 instruction for 8 cycles.
If branches can be resolved and the next instruction picked in that time frame, branching won't insert a bubble.
Branches are never resolved without switching hardware thread (context) - that's because the Sequencer evaluates the branch (i.e. decides where to jump next).

So ALU bubbles can only occur when the ALU SIMD runs out of threads, e.g. when all threads are waiting for TEX results.

Also, result forwarding isn't needed if the results can make the round trip from ALU to register file, and back again, which saves hardware.
This is never an issue because the pipeline has a "previous" register that holds a copy of the last register result for each of the five ALU lanes. This register (seems to be 2 in fact, vec4 + scalar) can be sampled in any successive instruction (its lifetime is until it's overwritten).
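
A toy model of how I picture that "previous" register working (not the real R600 datapath, just the idea):

Code:
# Each lane keeps a copy of its last ALU result; the next instruction can name
# that copy as an operand instead of reading the register file again.
class Lane:
    def __init__(self):
        self.regs = {}      # register file (written with latency in real HW)
        self.prev = None    # "previous" register: the last result, always current

    def execute(self, op, a, b, dst):
        av = self.prev if a == "PV" else self.regs.get(a, 0.0)
        bv = self.prev if b == "PV" else self.regs.get(b, 0.0)
        result = {"ADD": av + bv, "MUL": av * bv}[op]
        self.prev = result       # lives until the next instruction overwrites it
        self.regs[dst] = result  # the register-file write can lag; PV hides that
        return result

lane = Lane()
lane.regs.update({"r0": 2.0, "r1": 3.0})
lane.execute("MUL", "r0", "r1", "r2")         # r2 = 6.0
print(lane.execute("ADD", "PV", "r0", "r3"))  # consumes the previous result: 8.0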

There are register file bandwidth constraints but it's easier to point you at the R600 ISA document which spends pages on the subject.

Jawed
 
It's not Nvidia's duty to disable it, but wouldn't the scores be termed useless if they don't meet the testing policy?

Well technically they'd be even more useless...but yeah :)

Looking at some of the numbers these cards are putting up nowadays, maybe it would be a good thing to grab some of those cycles for physics acceleration. As always, we don't have the software to evaluate that approach, so we're just jerking off into the wind.
 
One thing I wondered is why R600 (and later) didn't have a 64x1D single-clock-switch SIMD instead of 16x5x1D 4-clock-switch SIMD, if you know what I mean. Let's ignore the 5th channel for now. This would give ATI the same dependency-free scalar performance that NVidia has.
It seems ATI wanted to build a vec4-structured register file. Though with the irony that any 32-bit aligned word can be fetched.

It seems to me that after a certain point register file layout/porting/bandwidth/read-ordering trumps a lot of other things when you're building ALUs.

Also, you can't ignore the requirement to build ALUs that do more than just MAD/MUL/ADD. Transcendentals need to be proportionate (i.e. 1/4 MAD rate) and then the myriad of SM4's new instructions need to be distributed across the widths of the ALU types, without adding too much specialisation.

Once you add these instruction types you then get into the "co-issue" problem and serial dependency issues. ATI decided to go with an entirely static, compiled, solution. Doing that they presumably then decided that for now 4 MADs + 1 T was the right way to go, instead of 1 MAD + 1 T that runs at 1/4 speed, or whatever...
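
For flavour, here's roughly what "entirely static, compiled" co-issue means, as a hand-wavy Python packer (the opcode sets and the greedy rule are mine; real R600 clause formation is far more involved):

Code:
MAD_OPS = {"MUL", "ADD", "MAD"}
T_OPS   = {"RCP", "RSQ", "SIN", "COS", "EXP", "LOG"}

def pack(ops):
    """ops: list of (opcode, dst, srcs).  Greedily fill (4 MAD + 1 T) bundles,
    only co-issuing ops whose sources aren't written earlier in the same bundle."""
    bundles, current, written = [], {"mad": [], "t": None}, set()
    for opcode, dst, srcs in ops:
        dependent = any(s in written for s in srcs)
        slot_free = (opcode in T_OPS and current["t"] is None) or \
                    (opcode in MAD_OPS and len(current["mad"]) < 4)
        if dependent or not slot_free:          # close the bundle, start a new one
            bundles.append(current)
            current, written = {"mad": [], "t": None}, set()
        if opcode in T_OPS:
            current["t"] = (opcode, dst, srcs)
        else:
            current["mad"].append((opcode, dst, srcs))
        written.add(dst)
    bundles.append(current)
    return bundles

prog = [("MUL", "r0", ("a", "b")),
        ("ADD", "r1", ("c", "d")),
        ("RCP", "r2", ("e",)),
        ("MAD", "r3", ("r0", "r1", "f"))]   # depends on r0/r1 -> forces a new bundle
for b in pack(prog):
    print(b)

The hardware then just executes whatever the compiler managed to pack, which is the whole point of going static.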

Jawed
 
Okay, here's some more proof:
http://www.anandtech.com/video/showdoc.aspx?i=3275&p=4

The 9800 GTX has 10% higher core clock, 13% higher shader clock, and over 120% higher bilinear texturing rate than the 8800 Ultra. However, in Crysis they perform the same, because the Ultra has 47% more BW/ROPs.
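
Back-of-the-envelope check of those figures, using the published specs as I remember them (so a couple of percent either way):

Code:
ultra = dict(core=612, shader=1512, bilinear_per_clk=32, mem_mhz=2160, bus_bits=384)
gtx98 = dict(core=675, shader=1688, bilinear_per_clk=64, mem_mhz=2200, bus_bits=256)

def bw(c):        # GB/s
    return c["mem_mhz"] * 1e6 * c["bus_bits"] / 8 / 1e9

def bilinear(c):  # GTexels/s
    return c["core"] * 1e6 * c["bilinear_per_clk"] / 1e9

print("core     +%.0f%%" % (100 * (gtx98["core"] / ultra["core"] - 1)))
print("shader   +%.0f%%" % (100 * (gtx98["shader"] / ultra["shader"] - 1)))
print("bilinear +%.0f%%" % (100 * (bilinear(gtx98) / bilinear(ultra) - 1)))
print("Ultra BW +%.0f%%" % (100 * (bw(ultra) / bw(gtx98) - 1)))
# -> roughly +10%, +12%, +121%, +47%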


http://www.anandtech.com/video/showdoc.aspx?i=3338&p=6

In more recent benchmarks, the 9800 GTX has considerably less bandwidth than the 8800 GTX, but it outperforms it by 10%+.

Considering the GTX 260 is also in those charts, it's got a huge bandwidth advantage, but we don't see any of that.
 
There are register file bandwidth constraints but it's easier to point you at the R600 ISA document which spends pages on the subject.

Somehow I've never run across the pdf. My google-fu is weak. :cry:
I think I've found it now, so I'll be going through it.
 
You are saying that design B has 64x1D +16x1D in a SIMD.
How many individual elements are processed per clock?
Design A has 16 elements being processed in a given clock cycle, hence why the 80 units per SIMD are divided up into 16 groups of 5 ALUs.

...

The live set of registers is also 8 times as large, over most of the 32 clocks of execution.
This is where the SOA and AOS mentality comes in. The register file doesn't need to access all 4 channels of all 64 elements in design B. It's organized in groups of 64 elements by channel, instead of groups of 16 pixels.

Assuming two register operands are allowed to be read per instruction, A's SIMD reads 10 groups of 16 FP32's per clock. In B, the SIMD reads 2 blocks of 64 FP32's every clock and 2 more every 4 clocks for the transcendental units. "Live registers" is the same.

This actually makes B's register file design simpler, because you don't need as much granularity.
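
Spelling that arithmetic out, since it's the crux (two operand reads per instruction assumed, as above):

Code:
# Design A: 16 elements per clock, each issuing 5 ops (4 MAD + 1 T),
# each op reading 2 operands.
a_reads_per_clock = 16 * 5 * 2          # 160 FP32 reads per clock

# Design B: 64 MAD lanes read 2 operands per clock; the 16 T lanes read
# another 2 each per clock (i.e. 2 blocks of 64 every 4 clocks).
b_reads_per_clock = 64 * 2 + 16 * 2     # 128 + 32 = 160 FP32 reads per clock

assert a_reads_per_clock == b_reads_per_clock == 160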

By your doing so, I interpret it as meaning that all 64 elements have one component evaluated per clock.
Yes.
If not, why did you change the 16-unit division of ALUs?
I don't see how it's related to the thread-switching scheme that follows.
Otherwise I can't make the same claims as before. I guess it could be 4 groups of 16, but isn't that the same thing?

Design A has to have enough registers to handle two 64-thread batches.
Design B needs enough to handle eight.
Not sure what you're talking about. Register files are the same size. FP32's in flight in the ALUs is the same as well.

Each thread gets a sequencer in the SIMD's control logic.
Design A has two.
Why wouldn't B have eight?

The most design A will have to worry about is 2 clauses' worth.
A's sequencer can be used in B for the most part. It can handle 2 per 8 cycles in A, so 8 per 32 cycles in B isn't a problem. It's still updating instruction pointers and loading instruction clauses and dealing with branches at the same rate. Yeah, insanely long instruction sequences that don't fit in any cache could thrash worse in this design, but it's a corner case.

For clarity, how big is a clause in the way you use the term? Are you talking about a 5x1D instruction packet, or something variable and longer? (Well, not always 5x1D, as tex or branch instructions are possible too, but they don't go into the SIMDs, obviously)
 
http://www.anandtech.com/video/showdoc.aspx?i=3338&p=6

In more recent benchmarks, the 9800 GTX has considerably less bandwidth than the 8800 GTX, but it outperforms it by 10%+.
That's just because the 8800 GTX's other deficits are too big now. The 9800 GTX has 25% higher shader clock, 17% higher core, 135% higher bilinear, etc.

My point is that, all else being equal, BW helps in Crysis even without AA. You said it doesn't.

Take a 4850 and bump clocks (mem and core) by 20%, and you should get a 20% boost if CPU/PCI-e isn't a factor at the testing resolution. Now increase just the mem clock another 50% to match the 4870, and you'll get a further boost. Not another 50%, of course, but a boost nonetheless.
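
As a toy illustration of why the memory-only bump gives less than a 1:1 return, pretend a frame is part core-limited and part BW-limited (the 60/40 split and the 4850-ish base clocks below are just for the example):

Code:
def fps(core, mem, core_share=0.6, bw_share=0.4, base_core=625, base_mem=993):
    frame_time = core_share * base_core / core + bw_share * base_mem / mem
    return 1.0 / frame_time

base   = fps(625, 993)                      # stock HD 4850-ish clocks
plus20 = fps(625 * 1.2, 993 * 1.2)          # +20% on both core and memory
mem_up = fps(625 * 1.2, 993 * 1.2 * 1.5)    # then +50% on memory only
print("both +20%%: +%.0f%%" % (100 * (plus20 / base - 1)))    # +20%
print("mem  +50%%: +%.0f%%" % (100 * (mem_up / plus20 - 1)))  # a further boost, well short of +50%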
 
That's just because the 8800 GTX's other deficits are too big now. The 9800 GTX has 25% higher shader clock, 17% higher core, 135% higher bilinear, etc.

My point is that, all else being equal, BW helps in Crysis even without AA. You said it doesn't.

Take a 4850 and bump clocks (mem and core) by 20%, and you should get a 20% boost if CPU/PCI-e isn't a factor at the testing resolution. Now increase just the mem clock another 50% to match the 4870, and you'll get a further boost. Not another 50%, of course, but a boost nonetheless.


How much of a deficit? The 8800 GT is also in that graph, and it has ~35% less bandwidth than the 8800 GTX, yet its performance is only 5% less at the highest resolution tested. To expect miracles from the 4870 just because of bandwidth in Crysis (edit: without AA active)... I just can't see where that is coming from.
 
It seems to me that after a certain point register file layout/porting/bandwidth/read-ordering trumps a lot of other things when you're building ALUs.
I totally agree, but as I mentioned above, this modification simplifies register layout a bit. Porting remains the same.

Also, you can't ignore the requirement to build ALUs that do more than just MAD/MUL/ADD. Transcendentals need to be proportionate (i.e. 1/4 MAD rate) and then the myriad of SM4's new instructions need to be distributed across the widths of the ALU types, without adding too much specialisation.

Once you add these instruction types you then get into the "co-issue" problem and serial dependency issues.
I included this in more detail after the post you replied to. Co-issue is a problem that exists with NVidia's architecture, too, but if you look again, my solution simply modifies an instruction packet (i.e. what the SIMD can do in a clock) to allow dependent MADs. Transcendentals are scheduled the same way as they are now.

ATI decided to go with an entirely static, compiled, solution. Doing that they presumably then decided that for now 4 MADs + 1 T was the right way to go, instead of 1 MAD + 1 T that runs at 1/4 speed, or whatever...
My solution is the same in that sense. Same number of MADs, same number of T's. It just operates on elements in a different order, thus allowing dependencies and easier register file access, but has a few costs that IMHO are small.
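
To make "operates on elements in a different order" concrete, here are the two schedules written out as (clock, element, packet slot) tuples, schematic only:

Code:
# Design A: 16 elements per clock, each finishing its whole 4-slot MAD packet.
sched_a = [(e // 16, e, s) for e in range(64) for s in range(4)]

# Design B: all 64 elements per clock, one slot each; slot s of every element
# runs on clock s, so slot s+1 can consume slot s's result one clock later.
sched_b = [(s, e, s) for s in range(4) for e in range(64)]

assert len(sched_a) == len(sched_b) == 256          # same work either way
print(sorted({c for c, _, _ in sched_a}),           # both span clocks 0..3
      sorted({c for c, _, _ in sched_b}))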
 
I totally agree, but as I mentioned above, this modification simplifies register layout a bit. Porting remains the same.
I would caution you to get your head round sections 4.6 and 4.7 of R600 ISA :devilish:

My eyes glazed over half way through 4.7.4 :LOL:

Jawed
 