AMD: R7xx Speculation

Status
Not open for further replies.
Looking at the physical size of the arrays gives me pause.
Which is harder to upclock, the superscalar arrays that span most of the chip's width or pretty much every other portion of it?

Nvidia likely kept their SIMD widths small and the units scalar just for this reason.
Of course, but what's the point in upclocking the arrays if you need so much more space to do it? The G71 had very good ALU density, although it used a monstrous batch size so that doesn't really count, I guess.

(FYI, the arrays only span 26.5% of the chip, so I wouldn't say "most". Also, to see the importance of improving performance in the rest of the chip, look at how close G94 is to G92 in games, despite having half the clusters.)
 
Of course, but what's the point in upclocking the arrays if you need so much more space to do it?
It's upclocked arrays with significantly better branch granularity and better scalar and dependent code efficiency.

(FYI, the arrays only span 26.5% of the chip, so I wouldn't say "most". Also, to see the importance of improving performance in the rest of the chip, look at how close G94 is to G92 in games, despite having half the clusters.)

Looking at the die shot, the array length is a good chunk of the length of the chip. I guess I included too much of the per-SIMD hardware on the side of each SIMD.
 
Looking at the physical size of the arrays gives me pause.
Which is harder to upclock, the superscalar arrays that span most of the chip's width or pretty much every other portion of it?

Nvidia likely kept their SIMD widths small and the units scalar just for this reason.
The "SuperScalar" nature has nothing to do with the size; superscalar refers to the 5D scalars. SIMD sizes are fully scalable down to a single quad (1/4 the maximum seen so far).
 
Yes, because AMD said so. But there's a reason why they changed it. And I didn't say that it will be old-fashioned :cool:
No you didn't. But until quite recently, one was supposed to think that every kind of crossbar was old-fashioned compared to a modern concept like ring-bus.
 
SIMD sizes are fully scalable down to a single quad (1/4 the maximum seen so far).
Hmm yes, but now that the TMUs seem tied to the arrays, would it make sense? You'd have 4 TMUs for an array with length 4 (tell me about configurations of future RV7xx chips! :) ).
 
The "SuperScalar" nature has nothing to do with the size; superscalar refers to the 5D scalars. SIMD sizes are fully scalable down to a single quad (1/4 the maximum seen so far).
It has more to do with the number of long signal lines going across the SIMD.
It's more hardware that has to remain synchronous.
How much easier would it have been to clock higher if it were split up 5 ways?
It would be easier to clock the 1 quad SIMD much higher.

Actually, if Larrabee's supposed clocks turn out to be true, we already know the answer.
 
If we assume that's actually a working proto-type or reference R700...

Look closely at the power connectors: there's a 1x6 and a 1x8. However, only 2x6-pin power connectors are plugged into the board, meaning it's possible that R700 will consume less than 225 watts. :oops:

Regards,
SB
I think it's more likely that's all they had to power it up. If it's anything like the 3870X2 it will run on 2x6 pins, just with no overclocking tab in the control panel.
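For reference, the 225 W figure comes from adding up the PCI Express power limits: 75 W from the x16 slot, 75 W per 6-pin connector, and 150 W per 8-pin connector. A minimal sketch of that sum follows; the function name `board_power_limit` is purely illustrative, and the result is the spec ceiling, not a measurement of what R700 actually draws.

```python
# Power ceilings from the PCI Express specifications, in watts.
SLOT_W = 75        # x16 graphics slot
SIX_PIN_W = 75     # each 6-pin auxiliary connector
EIGHT_PIN_W = 150  # each 8-pin auxiliary connector


def board_power_limit(six_pin: int, eight_pin: int) -> int:
    """Maximum in-spec board power for a given connector configuration."""
    return SLOT_W + six_pin * SIX_PIN_W + eight_pin * EIGHT_PIN_W


# Cables actually plugged in on the pictured board: 2x 6-pin.
print(board_power_limit(six_pin=2, eight_pin=0))  # 225

# Connectors present on the board: 1x 6-pin + 1x 8-pin.
print(board_power_limit(six_pin=1, eight_pin=1))  # 300
```

So running on 2x6-pin only shows the board can operate within 225 W; the 6+8 layout leaves headroom up to 300 W.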
 
It's upclocked arrays with significantly better branch granularity and better scalar and dependent code efficiency.
I'll give you the branching advantage, but the point of my post was that even for serially dependent scalar code, RV770's ALUs are almost as effective per mm2 as GT200's. We're looking at 5 times the ALUs in half the space.

This is assuming, of course, that RV770's ALUs aren't waiting for texture data due to TMU saturation or insufficient latency hiding. In some shader tests GT200 is showing huge gains over G92.

Not that I think NVidia has the ability to make ALUs as dense as RV770's (heck, even ATI didn't up to now), but if they did, would they really choose their current design instead when the only time it wins is with branchy code?
 
Heh, you guys are probably going to ridicule me now (as well as later if I'm wrong), but my theory is that RV770 has (mostly) double pumped ALUs and TMUs. I also think the eight-wide TMUs are what you see on the left and the top left, while the continuation of the ALUs on the left are the schedulers. Also, you know what is going to be fun for AMD now? Figure out how to get a coherent roadmap on 40nm that isn't pad limited for every single chip... :)
 
Heh, you guys are probably going to ridicule me now (as well as later if I'm wrong), but my theory is that RV770 has (mostly) double pumped ALUs and TMUs. I also think the eight-wide TMUs are what you see on the left and the top left, while the continuation of the ALUs on the left are the schedulers. Also, you know what is going to be fun for AMD now? Figure out how to get a coherent roadmap on 40nm that isn't pad limited for every single chip... :)
I beg your pardon - you think RV770 has 400 double pumped ALUs or what?
 
I'll give you the branching advantage, but the point of my post was that even for serially dependent scalar code, RV770's ALUs are as effective per mm2 as GT200's. We're looking at 5 times the ALUs in half the space.
Now you've lost me. What chips exactly are you talking about? GT200's supposed to have 240 ALUs + 30 FP64-Units (you may count'em in or not), while RV770 is more than rumored to sport 800 single ALUs.

I can follow you on the die-size side of things, but not on the ALU-count.


@Arun:
What about the picture of the die w0mbat posted here? Aren't there 10 identical-looking longish "things" right in the middle?
 
Heh, you guys are probably going to ridicule me now (as well as later if I'm wrong), but my theory is that RV770 has (mostly) double pumped ALUs and TMUs. I also think the eight-wide TMUs are what you see on the left and the top left, while the continuation of the ALUs on the left are the schedulers. Also, you know what is going to be fun for AMD now? Figure out how to get a coherent roadmap on 40nm that isn't pad limited for every single chip... :)

Weren't you also the one who started the missing MUL stuff? ;)

By the way, an unbiased comparative review of the HD 4850 and 9800 GTX+ shows that ATI wins hands down over the Nvidia card with 4x AA:
http://www.hardware.fr/article...force-9800-gtx-v2.html
 
No more ringbus + fixed performance (by way of lots more units while keeping transistor count fairly low) seems to indicate that the ringbus naysayers were right & the ringbus was a waste of transistors?

Do we know which bit of ATI/AMD designed the RV770?
Team A: R300 -> Xenos -> RV770?
Team B: R420 -> R520 -> R600

The families are more like

R300 -> R420 -> R520 -> R580

Xenos -> R600 -> RV670 -> RV770
 
Now you've lost me. What chips exactly are you talking about? GT200's supposed to have 240 ALUs + 30 FP64-Units (you may count'em in or not), while RV770 is more than rumored to sport 800 single ALUs.
Yeah, I came into this thread hoping I could correct my error before anyone replied. :oops:

It's 5 times the ALUs in a little less space, not half. I'm comparing with a theoretical GT200b.

Nonetheless, let's just look at the current silicon, and, as mczak suggests, say I'm wrong about that sliver being part of the ALUs. 800 SPs are 25.2% of 260mm2. 240 SPs are 26.5% of 576mm2. That's a factor of 7.8 in areal density.

So even in the worst case of serially dependent scalar code where RV770 is running at 20% efficiency, GT200 is only 11% faster per mm2.
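The arithmetic above can be checked with a quick sketch. The die areas, ALU area fractions, and SP counts are the ones quoted in the post. The clocks (750 MHz for RV770, 1296 MHz for GT200's shader domain) are an assumption, since the post doesn't state them, but they appear to be what the 11% figure implies.

```python
# Figures quoted in the post.
RV770_DIE_MM2, RV770_ALU_FRACTION, RV770_SPS = 260, 0.252, 800
GT200_DIE_MM2, GT200_ALU_FRACTION, GT200_SPS = 576, 0.265, 240

# ASSUMPTION: clocks are not given in the post (MHz).
RV770_CLOCK_MHZ = 750
GT200_SHADER_CLOCK_MHZ = 1296

# ALU area on each die, then SPs per mm^2 of ALU area.
rv770_alu_mm2 = RV770_DIE_MM2 * RV770_ALU_FRACTION
gt200_alu_mm2 = GT200_DIE_MM2 * GT200_ALU_FRACTION
rv770_density = RV770_SPS / rv770_alu_mm2
gt200_density = GT200_SPS / gt200_alu_mm2
density_ratio = rv770_density / gt200_density

# Worst case for RV770: serially dependent scalar code uses only
# one of the five lanes in each 5-wide unit, i.e. 20% efficiency.
rv770_efficiency = 1 / 5

# Useful throughput per mm^2 of ALU area (SP * MHz / mm^2).
rv770_throughput = rv770_density * rv770_efficiency * RV770_CLOCK_MHZ
gt200_throughput = gt200_density * GT200_SHADER_CLOCK_MHZ
gt200_advantage = gt200_throughput / rv770_throughput

print(f"density ratio: {density_ratio:.1f}x")             # about 7.8x
print(f"GT200 advantage: {(gt200_advantage - 1):.0%}")    # about 11%
```

With different clock assumptions the 11% moves accordingly, but the density ratio depends only on the numbers in the post.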
 
chavvdarrr said:
I beg your pardon - you think RV770 has 400 double pumped ALUs or what?
Yeah, and 20 double pumped TMUs (or more precisely, part of the TMUs at the very least would maybe be 4-wide and double-pumped instead of 8-wide).
CarstenS said:
What about the picture of the die w0mbat posted here? Aren't there 10 identical-looking longish "things" right in the middle?
Yes, the way this would be implemented is that each of the four 'squares' per row is actually 2x5 ALUs double-pumped, rather than 4x5.

Anyway, I could be completely wrong here, just that I've got some massive difficulty believing ATI could achieve this perf/mm² out of thin air. That would imply that they were previously completely incompetent when it comes to minimising area and suddenly turned into absolute geniuses. And honestly, it's much harder to halve the area than to double the clock, and I'm pretty damn convinced (based on public data from Icera mostly) that the 'sweetspot' clock on 65nm (if you can achieve it) in terms of perf/mm² and perf/watt is nearer 1500MHz than 750MHz...

EDIT: And FWIW, if I'm right that the left part of the central block is also part of the ALUs (i.e. it's the local scheduler/fetch/... unit), then they're really occupying 40% of the die while they occupy 26% of the die on GT200... 0.4*260=104 & 0.26*595=155 - now, if you shrink that by 19%, it becomes 125. At least that's my estimate, if anyone has a better one please let me know.
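The estimate in the EDIT can be reproduced in a few lines. All inputs (40% of 260 mm², 26% of 595 mm², a 19% area shrink) are taken straight from the post; the variable names are just illustrative, and this only re-runs the poster's arithmetic rather than validating the area fractions themselves.

```python
# Inputs as stated in the post.
RV770_DIE_MM2 = 260
RV770_ALU_FRACTION = 0.40   # ALUs plus the suspected scheduler/fetch block
GT200_DIE_MM2 = 595
GT200_ALU_FRACTION = 0.26
SHRINK = 0.19               # area reduction applied to the GT200 figure

rv770_alu_mm2 = RV770_DIE_MM2 * RV770_ALU_FRACTION
gt200_alu_mm2 = GT200_DIE_MM2 * GT200_ALU_FRACTION
gt200_shrunk_alu_mm2 = gt200_alu_mm2 * (1 - SHRINK)

print(round(rv770_alu_mm2))         # 104
print(round(gt200_alu_mm2))         # 155
print(round(gt200_shrunk_alu_mm2))  # 125
```

On these numbers the ALU blocks end up within roughly 20% of each other in area after the shrink, which is the point the post is making.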
 
R300 -> R420 -> R520 -> R580
Amazing -> Good -> Disappointing -> OK

Xenos -> R600 -> RV670 -> RV770
V.good -> good? -> ok? -> Amazing

R300 = RV770 = R700 = Amazing (relative to their price points)

Awesome
 
Well yes but designing the ALUs for twice the clock would surely make them bigger - maybe not quite twice, but by a significant amount.


Hope I'm not beating a dead horse, but the whole idea behind Fast14 is to make it faster without increasing the pipeline stages. They talk about it here with a processor that actually uses Fast14 design.
In the case of the ALUs, it's a good guess they'll be using this design for a while, so applying it to one small unit (since it's repeated 160 times on the chip) seems like a killer way to negate Nvidia's shader clock speed advantage.
 