Perhaps each TU can have short-cycle access to its nearest neighbour's L1, but it seems sensible to assume there are frequent cases where one L1's data set falls in the same locality as another's.
To keep replication from gutting the effectiveness of the caches, maybe each TU can have delayed access to other L1s, or the global data share picks up on shared lines and saves a copy.
In prior GPUs (going back to R300) L1s are per filter pipe - that is, per pixel. So a texel actually appears in multiple L1s. I think it was in uncompressed form - certainly in Xenos the texels are decompressed before being put in L1.
In RV770 we can see a decompression unit within the TU, and indeed R6xx is the same. I did think for a while that texels were stored in uncompressed form within R6xx's texture caches, but a decompression unit sitting inside the TU implies they stay compressed until they're fetched out of L1.
So historically it's been normal to have duplication across the L1s in ATI's GPUs.
But now we're talking about duplication across quads, not within a quad. Again, in the past, I think that was par for the course.
What I'm wary of is that making the TUs interconnect many:many with the L1s is just way more complexity than necessary. With a crossbar between L1 and L2 there's already a many:many. The L1s should be 10s of KB. In R600 the L1s are 32KB for texels and 32KB for vertex data. 16KB for texels in RV630/5 and 0KB for vertex data in RV610/20.
There could be some kind of relationship with the local data share per SIMD, and I wish I had some clarity as to its use.
Synchronization within and between SIMDs would be facilitated by the data shares, but they could also be used to house temp copies of L1 lines, or even contexts for pending clauses from other batches to keep more in flight.
I suppose LDS is for inter-element sharing and perhaps is blind to context.
I suspect LDS is for only one context at a time:
- fetch from register file into LDS (e.g. R1, R4, R12)
- then any number of inter-element reads
- since LDS is not a register file, inter-element writes are not allowed, so there is no "clean-up", it's just released
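To make that concrete, here's a rough CUDA-flavoured analogy (my own sketch, not ATI's ISA - the kernel name, sizes and access pattern are all made up): each element deposits one of its register values into the share exactly once, then any number of inter-element reads follow, and nothing is ever written back, so the share is simply released at the end.

// Hypothetical analogy in CUDA terms, not ATI hardware behaviour.
// Launch with up to 256 threads per block.
__global__ void lds_model(const float* in, float* out, int n)
{
    __shared__ float lds[256];          // stand-in for the per-SIMD LDS

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float r1 = (i < n) ? in[i] : 0.0f;  // the "register file" value, e.g. R1

    lds[threadIdx.x] = r1;              // the one-off fetch from registers into LDS
    __syncthreads();

    // any number of inter-element reads; no inter-element writes, so no clean-up
    float left  = lds[(threadIdx.x + blockDim.x - 1) % blockDim.x];
    float right = lds[(threadIdx.x + 1) % blockDim.x];

    if (i < n)
        out[i] = r1 + 0.5f * (left + right);
}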
In R6xx there's a constant cache that's under programmer control. This cache is to support D3D10 constant buffers. CBs can be huge (4096 float4s - 64KB) and there can be 16 of them bound to a shader at any one time. So one of the Sequencer instructions is to fetch specific cache lines for the ALU instructions to then read from.
Because R6xx issues clauses of ALU instructions (from 1 to 128 instructions) which are "atomic", when the sequencer performs a constant cache fetch the cache is set up for the entire duration of the clause.
So I'm thinking that LDS inter-element sharing would work the same way. Instead of a Sequencer instruction to fetch from global constant cache into the ALU's constant cache lines (I think there are 2 lines), there's an LDS fetch instruction that reads from register file into LDS. Naturally these fetch instructions are invisible to the ALUs, there's no latency as such.
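As a toy software analogy (mine, not how the hardware exposes it), the "set it up before the clause, then the clause reads it for free" pattern looks like this - here the one-off fetch stages part of a constant buffer before the run of ALU work:

// Toy CUDA analogy only - on R6xx the Sequencer's fetch happens in hardware,
// invisibly to the ALUs; this just mimics the shape of it.
__global__ void clause_model(const float* cb,   // stand-in for a bound constant buffer
                             const float* in, float* out, int n)
{
    __shared__ float line[64];                  // stand-in for the pinned cache lines

    // the "Sequencer fetch": performed once, before the ALU clause starts
    for (int j = threadIdx.x; j < 64; j += blockDim.x)
        line[j] = cb[j];
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // the "ALU clause": every instruction reads the pre-fetched lines,
    // so none of them pays a fetch latency of its own
    float x = in[i];
    x = x * line[0] + line[1];
    x = x * line[2] + line[3];
    out[i] = x;
}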
Presumably GDS is similar, but implies that Sequencers talk to each other. Presumably the bandwidth is low in this case, e.g. 1 register per element, as opposed to, say, 4 registers per element in LDS.
Fun.
If that were the case, vertex work and synchronization operations would be bottlenecked at one access per cycle for the entire chip, assuming the data shares enable synchronization primitives.
The GDS would become a global serialization structure, something that could have been handled with a few "mass halt" signal lines instead.
I'm thinking those caches might be banked or multiported. Maybe not 10-way, but definitely more than single-ported.
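The pay-off from banking, just to spell it out with illustrative numbers: split the share into 16 banks and any cycle where the elements hit 16 distinct banks completes in one pass, while same-bank collisions serialise. A CUDA-ish sketch (the bank count is an assumption for illustration, not a known ATI figure):

// Illustrative only: stride-1 addressing spreads across the banks,
// a stride equal to the bank count piles everything onto one bank.
__global__ void banking_model(float* out)
{
    __shared__ float share[16 * 16];

    int t = threadIdx.x;                      // assume a 16-thread block
    for (int j = t; j < 16 * 16; j += blockDim.x)
        share[j] = (float)j;                  // fill the whole share
    __syncthreads();

    float conflict_free = share[t];           // bank = t % 16 -> 16 distinct banks, one pass
    float serialised    = share[t * 16];      // bank = (t * 16) % 16 = 0 -> all hit bank 0
    out[t] = conflict_free + serialised;
}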
I presume synchronisation of elements across a wodge of contexts (hardware threads) is something that the Sequencers would signal to each other. This would be just another status bit for each context.
NVidia has the concept of blocks, within which elements can share data, each block consisting of multiple contexts. The developer specifies the block size based on the amount of per-element data that needs to be shared through PDC.
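For reference, in CUDA that trade-off is explicit: the PDC (per-block shared memory, 16KB per multiprocessor on G80) is a fixed pot, so the amount each element wants to share caps the block size. A minimal sketch with made-up sizes:

// Sketch of the block-size trade-off: each thread shares ELEMS_PER_THREAD
// floats through shared memory, so the block size is bounded by the budget.
#define ELEMS_PER_THREAD 8
#define BLOCK_SIZE       128    // 128 threads * 8 floats * 4 bytes = 4KB shared

__global__ void block_share(const float* in, float* out, int n)
{
    __shared__ float pdc[BLOCK_SIZE * ELEMS_PER_THREAD];

    int base = (blockIdx.x * BLOCK_SIZE + threadIdx.x) * ELEMS_PER_THREAD;
    for (int j = 0; j < ELEMS_PER_THREAD; ++j)
        pdc[threadIdx.x * ELEMS_PER_THREAD + j] = (base + j < n) ? in[base + j] : 0.0f;
    __syncthreads();

    // every element in the block can now read every other element's shared values
    float sum = 0.0f;
    for (int j = 0; j < BLOCK_SIZE * ELEMS_PER_THREAD; ++j)
        sum += pdc[j];

    if (base < n)
        out[blockIdx.x * BLOCK_SIZE + threadIdx.x] = sum;
}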
In theory, since ATI virtualises the register file, it's possible to share amongst all extant contexts' registers. Looking at the newer RV770 diagrams there's no "memory read/write cache" (which I presume was the mechanism for moving register file data to/from video memory) as there was in R6xx, so presumably that's what GDS is doing. But GDS looks rather isolated. Bit confusing.
Jawed