AMD: R7xx Speculation

Status
Not open for further replies.
Looks like a typo to me (should maybe be 2065?), in all other tests the 8800U is very close to the 9800GTX.

Nope, the Ultra is quite a bit faster than the GTX at Vantage Extreme. The GTX score looks too low though; it should be around 2500.
 
While R580 got a fair boost from adding ALUs they were also math bound at the time. I'm not sure I'd go as far as calling RV670 limited by shader power.

I could see all AA/AF being performed by the actual shaders but they're still left with actually feeding data to those shaders quickly enough. I wouldn't be surprised if the TMUs got turned into a bunch of really large point samplers. That might even trim enough transistors that the 800 SPs become more reasonable. RV670 seemed to scale reasonably well with higher levels of AA/AF, so if they could feed in data faster things should balance out. Especially since the higher levels rely more on cached results than fetching new unique values.
 
The 256mm² die size is just what GPU-Z shows, so it could be wrong/off.

But someone also did a calculation based on the leaked chip footage... according to this calculation the chip won't exceed a 275 mm² die size.

The stand-out there is 8800U scoring 3065 in Vantage Extreme. That's quite a bit higher than the seemingly rumoured 2800 for HD4870, which would indicate that ATI is still far off being able to fully use ~124GB/s.

The indication is also that GTX 280 will end up ~60%+ faster than 8800U. In this case that's a scaling that's better than the increase in bandwidth.

Jawed

According to some source, HD4850 scores about X26xx, so HD4870 could have a score higher than X3200 (+30% on HD4850).
Today some Chinese source claimed that HD4870 could outperform 8800U by 35% ...
 
My problem with this ALU number is this. I'm assuming it's still 16 shaders / 80 ALUs per processor. So we're looking at 10 processors - a 150% increase in shading and control logic.
Yeah, it's been bugging me too. So 5 SIMDs of 160 ALUs with the horrible batch size is still a contender.

Now my question is....if there's a 150% increase in control + shading logic accompanied by a 25% increase in die size that means control+ALU in RV670 is taking up very little space. So what's taking up the rest?
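
To put a rough number on "very little space": a back-of-envelope sketch, assuming only the control + ALU logic grows, everything else on the die stays the same size, and transistor density is unchanged. The function name is made up for illustration; the +150% and +25% figures are the ones quoted above.

```python
def implied_logic_fraction(logic_growth: float, die_growth: float) -> float:
    """Share of the old die occupied by the logic that grew.

    Model: new_area / old_area = 1 + f * logic_growth, where f is the share
    of the old die taken by control + ALU logic. Solving for f gives
    f = die_growth / logic_growth.
    """
    if logic_growth <= 0:
        raise ValueError("logic_growth must be positive")
    return die_growth / logic_growth


# Figures from the post: +150% control/shading logic, +25% die size.
fraction = implied_logic_fraction(logic_growth=1.50, die_growth=0.25)
print(f"implied control + ALU share of the RV670 die: {fraction:.1%}")
```

Under those assumptions control + ALU would be only about a sixth of RV670, which is why the question of what fills the rest comes up.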

As I understand it, there's 3 levels to the control logic:
  1. Command Processor - this interfaces with the CPU (and other GPUs?). It configures pipeline state, e.g. the loading of shader programs into the 3 shader-types: VS, GS, PS; per shader-type thread counts; Z blend mode
  2. Sequencer(s) - this manages extant threads and how they are issued to the ALU and TU SIMDs. It tracks the status of each thread (can the thread be issued?, what's it waiting for?) and makes control-flow (dynamic branching) decisions, e.g. if all the elements in a hardware thread have completed a dynamic loop then exit the loop. It also handles the DMA of all data in and out of the register file and constant cache and it also initiates on-die data movements (e.g. moving completed fragments out of the register file into the RBEs)
  3. SIMDs - these take a clause of code (up to 128 ALU instructions on ALUs or 8 TU instructions on TUs) and run it to completion. In terms of control there's: instruction fetch/decode; operand fetching (which has complex rules due to the limited port bandwidth); constant fetching; constant buffer fetching; predication update; program counter
Then there's the memory system, where every memory client has virtualised access to memory. I presume this is reasonably costly.

I've failed so far to produce any kind of model of the costs. I just can't squeeze the 800-ALU-quart into the 250mm² pint pot :cry:

In terms of R580 they only increased ALUs and control was untouched so batch size tripled as well. Of course batch size on RV770 might have gone up too or batches could run for less than 4 clocks which would offset any increases in width.
In my opinion reducing the clocking of batches is as unlikely as a different clock for the ALUs. I think it would be a huge architectural change.
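
For reference, a minimal sketch of the batch-size arithmetic being argued over. It assumes a batch is simply (lanes per SIMD) x (clocks per batch), with 16 five-wide shader units per 80-ALU SIMD and 4 clocks per batch as stated in this thread. The function name and the 2-clock case are illustrative only, not anything confirmed about RV770.

```python
def batch_size(alus_per_simd: int, alu_width: int = 5, clocks_per_batch: int = 4) -> int:
    """Elements per hardware thread (batch) on one SIMD."""
    if alus_per_simd % alu_width:
        raise ValueError("ALU count must be a multiple of the ALU width")
    lanes = alus_per_simd // alu_width  # shader units running in lockstep
    return lanes * clocks_per_batch


rv670 = batch_size(80)                             # 16 lanes x 4 clocks
wide = batch_size(160)                             # the "5 SIMDs of 160 ALUs" option
wide_short = batch_size(160, clocks_per_batch=2)   # hypothetical shorter batch

print(rv670, wide, wide_short)
```

So doubling SIMD width doubles the batch unless the clocks per batch are halved, which is the trade-off both posts are pointing at.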

But hey, I thought 800 was impossible for space reasons - without doing a "custom" re-implementation. I suppose it's possible they have utilised custom logic...

Jawed
 
That's another thing...wouldn't RV770's global register file be much larger than RV670's to support that kind of increase in arithmetic? Like Jawed, I would be gobsmacked if this is confirmed but I'm still a bit hesitant.
Register file memory should be very dense. Though I'm guessing that for porting reasons there are multiple copies of the register file (writes go to all copies - in reading, separate instances of the register file fetch distinct operands for the same clock).

RV670 has 1MB of register file, before taking account of porting. If there's 4 ports created using separate instances, then that's 4MB. That's huge.

RV770 would need 2.5MB (10MB) which is starting to look insane...
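
The same scaling as a quick sketch, assuming register file capacity grows linearly with ALU count and that each read port is a full copy. The 320-ALU baseline is inferred from the thread (10 processors of 80 ALUs being a 150% increase implies 4 today); the helper name is invented for illustration.

```python
def register_file_mb(alus: int, base_alus: int = 320, base_mb: float = 1.0,
                     copies: int = 4) -> tuple[float, float]:
    """Return (logical MB, physical MB with one copy per port)."""
    logical = base_mb * alus / base_alus
    return logical, logical * copies


rv670_rf = register_file_mb(320)   # baseline from the post
rv770_rf = register_file_mb(800)   # rumoured 800-ALU part

print("RV670:", rv670_rf)
print("RV770:", rv770_rf)
```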

Jawed
 
You don't really know this. You could be correct that control logic could be quite large, but I've got some doubts about the TMU size being that big, and am pretty sure you're dead wrong about cache size using a large amount of die space (unless you also consider the register file as cache, but I count that as part of the ALU). (Compared to CPUs, GPU caches are still small. Think about it: Intel fits 4MB on about 70 mm^2 on a 65nm node, and RV670 only has 256KB L2 and 4x32KB L1 texture cache. Sure, it has other caches too (for ROPs for instance) but probably nowhere near 4MB in total.)

I meant that TMUs + logic (including register file, etc.) + caches have a bigger impact on total size than the ALUs themselves; I wasn't talking about the TMUs or caches alone (even if comparing different processes is difficult, it's known that Intel's caches are denser even at the same "nominal" process node). Finally, look at the RV635->RV670 comparison: size goes up only 70 mm^2 and you have almost 2.5 times the shaders, double the TMUs and bus width.
So it's easy to believe that the shaders themselves are not very big.
 
Finally, look at the RV635->RV670 comparison: size goes up only 70 mm^2 and you have almost 2.5 times the shaders, double the TMU and bus width.
Now that you mention it - that seems about the biggest point against 800/40/16 in RV770 - the 256mm² as a given.


Because going from RV635 to RV670 you'd cram 200 ALUs, 8 TMUs and a 128-bit memory interface into the 70mm² (which I take for granted, as I don't know the die size of RV635 myself).

Now, RV770 is supposed to be about 64mm² larger than RV670, and the only thing you could take off that list is the 128-bit memory interface, since the bus width allegedly stays the same in RV770 (except for the mysterious non-CrossFire CrossFire card R700, which according to some would need some kind of chip-to-chip interconnect through the ring buses).

So, with even less additional die space, you not only want to add 200 shaders and 8 TMUs but at least 480 shaders and 16 TMUs. Hm.
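
A rough sketch of that comparison, using only the figures quoted in this post (70mm² for the RV635 to RV670 step, 64mm² for RV670 to RV770). It deliberately ignores the TMUs, bus and ROPs that share the added area, so treat the result as a crude ratio, not a density measurement. The helper name is hypothetical.

```python
def alus_per_mm2(added_alus: int, added_area_mm2: float) -> float:
    """ALUs added per mm² of extra die, ignoring everything else added."""
    if added_area_mm2 <= 0:
        raise ValueError("added area must be positive")
    return added_alus / added_area_mm2


rv635_to_rv670 = alus_per_mm2(200, 70.0)  # area also paid for 8 TMUs + 128 bits of bus
rv670_to_rv770 = alus_per_mm2(480, 64.0)  # area would also have to cover 16 TMUs

ratio = rv670_to_rv770 / rv635_to_rv670
print(f"{rv635_to_rv670:.2f} vs {rv670_to_rv770:.2f} ALUs/mm² -> {ratio:.2f}x denser")
```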

Besides: does anyone here believe AMD's not going to stick to their single-cycle FP16 TMUs?
 
Does the use of a different 55nm node compared to RV670 change anything at all die-size-wise (more tightly/loosely packed transistors)?
 
Because going from RV635 to RV670 you'd cram 200 ALUs, 8 TMUs and a 128-bit memory interface into the 70mm² (which I take for granted, as I don't know the die size of RV635 myself)
What about additional 12 ROPs?

Kaotik: Exactly. There were rumours about new process libraries many months ago. There was also a mention of a new transistor design. Maybe this is true, and the reason for it is higher transistor density, which could be the cause of the conservative clock speeds (like NV40 vs. R420: same size, different density, a 62 million transistor difference, lower clock speed for NV40).
 
Psycho is right, the RV670 is not square... my bad. :oops:

RV670-RV770-R580-Dies.jpg


GPU-Dies.png


RV670:
Size: ----> ~14.44mm x ~13.30mm = 192mm²
Pixels ---> 89 x 82

RV770:
Pixels ---> 103 x 102
Length -> 103 / 89 = 1.1573 * 14.44mm -> ~16.7115mm
Width ---> 102 / 82 = 1.2439 * 13.30mm -> ~16.5439mm
Size -----> 16.7115mm * 16.5439mm = ~276.47mm²

RV770 ~= 276 mm² (±5%) :yes:
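
The same estimate as a small script, in case anyone wants to plug in their own pixel measurements. It assumes both dies are photographed at the same scale with no perspective distortion, which is the big "if" behind the ±5%. The function and variable names are made up for this sketch.

```python
def scale_side(ref_mm: float, ref_px: int, target_px: int) -> float:
    """Scale a known side length by the ratio of pixel measurements."""
    return ref_mm * target_px / ref_px


# RV670 reference: ~14.44mm x ~13.30mm measured as 89 x 82 pixels.
rv770_length = scale_side(14.44, 89, 103)
rv770_width = scale_side(13.30, 82, 102)
rv770_area = rv770_length * rv770_width

# The +/-5% band quoted above.
low, high = rv770_area * 0.95, rv770_area * 1.05

print(f"RV770: {rv770_length:.4f}mm x {rv770_width:.4f}mm = {rv770_area:.2f}mm²")
print(f"range: {low:.0f}mm² to {high:.0f}mm²")
```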
 
What about additional 12 ROPs?
Somehow I counted them under the memory interface, but you're right, they should be mentioned separately. But nonetheless, the ROPs seem to need a little rework from RV670 to RV770 too, so that does not change my overall view.
 
I thought R6xx are still Vec4+scalar but in some cases more flexible.

Maybe in other words:

If you go to Octa-TMUs, you would also go to Octa-ALUs?
Dave has explained multiple times that the ALU configuration of R6xx is super scalar.

scalar | scalar | scalar | scalar | scalar+
 
Now, RV770 is supposed to be about 64mm² larger than RV670, and the only thing you could take off that list is the 128-bit memory interface, since the bus width allegedly stays the same in RV770 (except for the mysterious non-CrossFire CrossFire card R700, which according to some would need some kind of chip-to-chip interconnect through the ring buses).

So, with even less additional die space, you not only want to add 200 shaders and 8 TMUs but at least 480 shaders and 16 TMUs. Hm.

While I don't know exactly how much the 128-bit memory interface takes, do keep in mind exactly how much the die shrank going from 80nm to 55nm, R600 to RV670, with the removal of the 256-bit external/512-bit internal ring bus. Remember there's an internal component too, so I'm sure it's not a small part of the die at all.

Plus, add in the fact that they are supposedly on TSMC's premium 55nm process, and densities might be increased as well.
 
I have no idea how cache/register files/control logic work, but is it possible ATi built surplus amounts into their architecture for easy scaling (like RV635 to RV670, for example) and thus didn't really need to change it all that much?

They put excess bandwidth into R600, so is it possible, or doesn't it work like that?
 
I think 800 SPs make sense.

480 SPs at a 750MHz clock will not be enough to beat the 9800GTX by a 30% margin, as some rumours suggested.
Also, 800 SPs / 750MHz clock / 256-bit GDDR5 would fit the 'terascale' claim in some leaked ATi marketing papers...
More than that... we know from experience that ATi likes to refresh products at the end of the year. It should be easy to do this by raising clocks, easier anyway than changing the marchitecture.
 
I have no idea how cache/register files/control logic work, but is it possible ATi built surplus amounts into their architecture for easy scaling (like RV635 to RV670, for example) and thus didn't really need to change it all that much?

They put excess bandwidth into R600, so is it possible, or doesn't it work like that?
I guess that control logic doesn't grow very much. e.g. there's only 1 more SIMD in RV670, although the SIMDs do get twice as wide. The latter would indicate the SIMD control logic cost is halved per ALU.


Compare the real RV635 and an imaginary one:
  • real: 3 SIMDs (each of 2 quads) + 2 quads TU (128KB of L2 cache) + 1 RBE
  • imaginary: 2 SIMDs (each of 3 quads) + 3 quads TU (192KB of L2 cache) + 1 RBE
The imaginary one has fewer SIMDs, which implies less control logic (since each SIMD is independent). But the imaginary one has more TUs and more cache and perhaps more control logic for the TUs.

The imaginary RV635 should perform better on the face of it. Perhaps it would be bigger though, implying that the TUs cost more than control logic.

Or, maybe when there's an L2 a 2:1 ALU:TEX ratio is not high enough - i.e. the performance gain isn't there?
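
To make the two layouts easier to compare, here is a small sketch that tabulates them. It assumes L2 scales at 64KB per TU quad (128KB / 2 quads, as in the real part above) and that ALU:TEX is just the ratio of ALU quads to TU quads. The class and field names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuLayout:
    simds: int
    quads_per_simd: int
    tu_quads: int
    l2_kb_per_tu_quad: int = 64  # assumption: 128KB shared by 2 TU quads

    @property
    def alu_quads(self) -> int:
        return self.simds * self.quads_per_simd

    @property
    def alu_tex_ratio(self) -> float:
        return self.alu_quads / self.tu_quads

    @property
    def l2_kb(self) -> int:
        return self.tu_quads * self.l2_kb_per_tu_quad


real = GpuLayout(simds=3, quads_per_simd=2, tu_quads=2)
imaginary = GpuLayout(simds=2, quads_per_simd=3, tu_quads=3)

for name, cfg in (("real", real), ("imaginary", imaginary)):
    print(f"{name}: {cfg.alu_quads} ALU quads, {cfg.tu_quads} TU quads, "
          f"ALU:TEX {cfg.alu_tex_ratio:.0f}:1, L2 {cfg.l2_kb}KB")
```

Both layouts carry the same 6 ALU quads; the difference is one SIMD's worth of control logic traded for one more TU quad and 64KB more L2.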

Jawed
 