Anand has the details about R520, RV530, RV515

_xxx_ · Sep 28, 2005

Jawed said:
Those "relative performance" numbers are very interesting:

RV515 - 1
RV530 - 3
R520 - 5

Jawed

Dave Baumann said:
RV530 Shader processors = 12.

Has the penny dropped on 4-1-3-2 yet? (or at least, part of it?).

4+1+3+2 = 10
10/2 = 5!!!

EDIT: typo

wireframe · Sep 28, 2005

Could someone remind me why the 4 and 3 in 4-1-3-2 don't stand for number of ROPs and Shader Units per ROP? It makes an easy 12.

Edit: was this thrown out simply based on the logic that the R580 16-1-3-1 would equal 48 SUs and that this was impossible?
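The arithmetic behind this reading can be sketched in a few lines of Python. This is only an illustration of the theory under discussion, not a confirmed decoding: it assumes the first field is the ROP count and the third field is shader units per ROP, and the function name `shader_units` is made up for the example.

```python
def shader_units(notation: str) -> int:
    """Total shader units under the ASSUMED reading:
    field 1 = ROPs, field 3 = shader units per ROP."""
    fields = [int(f) for f in notation.split("-")]
    if len(fields) != 4:
        raise ValueError("expected four dash-separated numbers")
    rops, _, sus_per_rop, _ = fields
    return rops * sus_per_rop


# The two notations mentioned in the post above.
for chip, notation in [("RV530", "4-1-3-2"), ("R580", "16-1-3-1")]:
    print(chip, notation, shader_units(notation))
```

Under that assumption RV530 comes out at 12, matching the quoted shader processor count, and R580 comes out at 48.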

Geo · Sep 28, 2005

wireframe said:
Could someone remind me why the 4 and 3 in 4-1-3-2 don't stand for number of ROPs and Shader Units per ROP? It makes an easy 12.

Because then you have to sign up for R580 having 48 shaders. . .and the CW is still quailing at the transistor/die size implications of putting our collective John Hancock at the bottom of that one, particularly for a "refresh". . .

Edit: With honorable exceptions (see below).

Jawed · Sep 28, 2005

wireframe said:
Could someone remind me why the 4 and 3 in 4-1-3-2 don't stand for number of ROPs and Shader Units per ROP? It makes an easy 12.

Why not just state that's your theory. We only have theories round here.

Edit: was this thrown out simply based on the logic that the R580 16-1-3-1 would equal 48 SUs and that this was impossible?

I don't see why 48 shader pipelines is impossible.

Jawed

wireframe · Sep 28, 2005

geo said:
Because then you have to sign up for R580 having 48 shaders. . .and the CW is still quailing at the transistor/die size implications of putting our collective John Hancock at the bottom of that one. . .

I'll sign up.

I guess the other problem then becomes "if R580 can fit three times as many of what R520 has, the R520's SUs must be really tiny. Hmmm." But that assumes that they use the same shader unit structure, which I suppose makes a lot of sense, but who knows... Maybe the R580 has slightly less complex shader units that are decoupled to remove all the qualifications that we see with the dual-issuing shader units on NV40, for example.

RoOoBo · Sep 28, 2005

Jawed said:
Why not just state that's your theory. We only have theories round here.

I don't see why 48 shader pipelines is impossible.

Jawed

The G70 already has 48 'full' SIMD ALUs for shading.

Jawed · Sep 28, 2005

wireframe said:
I'll sign up.

I guess the other problem then becomes "if R580 can fit three times as many of what R520 has, the R520's SUs must be really tiny. Hmmm." But that assumes that they use the same shader unit structure, which I suppose makes a lot of sense, but who knows... Maybe the R580 has slightly less complex shader units that are decoupled to remove all the qualifications that we see with the dual-issuing shader units on NV40, for example.

I'm going to do a Dave on you now, and ask you to think about the TMU pipelines in R5xx.

How many? Assume they're decoupled from the shader pipelines.

Jawed

wireframe · Sep 28, 2005

Jawed said:
Why not just state that's your theory. We only have theories round here.

Well, I am pretty sure that this is the first theory (not mine) I read on here, but then a whole slew of them followed that seemed to assume that this one is wrong. I never got a grip on why it was deemed "obviously wrong". That's all.

I don't see why 48 shader pipelines is impossible.

I don't either. I think you could squeeze at least 128 in there if you keep them simple.

That, again, is part of my questioning why this line of thinking was "collectively" dropped. Is it an assumption that the shader units of R520 and R580 have to be "the same" so that having 3 times the number of them would scale performance and transistors that way as well? Or, like I said in my previous post, is it possible that ATI did what Nvidia did going from the NV40 to the G70 and tore up the shader units a bit, but to a greater extent, going for "more but simpler...but collectively more powerful"?

Geo · Sep 28, 2005

RoOoBo said:
The G70 already has 48 'full' SIMD ALUs for shading.

I'm open minded on this one, certainly, tho a little more cautious than brethren Jawed and wireframe.

Have you considered that if R520 is roughly the same number of transistors as G70 (maybe even more?), and then you triple ALUs from there? On a refresh?

The other thing tho, and I've been a proponent of this theory in the past, is we could be seeing R520-->R580 as a particularly odd use-case of ATI's "mid-range first" new process strategy. . .

Jawed · Sep 28, 2005

wireframe said:
Well, I am pretty sure that this is the first theory (not mine) I read on here, but then a whole slew of them followed that seemed to assume that this one is wrong. I never got a grip on why it was deemed "obviously wrong". That's all.

It's just one of several theories.

It doesn't account for decoupled texture pipes, a feature of Xenos that's seen (by me, at least) as extremely likely to be in R5xx. Hence my signature!!!!!!!!!!!!!!!

That, again, is part of my questioning why this line of thinking was "collectively" dropped. Is it an assumption that the shader units of R520 and R580 have to be "the same" so that having 3 times the number of them would scale performance and transistors that way as well? Or, like I said in my previous post, is it possible that ATI did what Nvidia did going from the NV40 to the G70 and tore up the shader units a bit, but to a greater extent, going for "more but simpler...but collectively more powerful"?

http://www.beyond3d.com/forum/showpost.php?p=556870&postcount=917

The reality is that the #-#-#-# discussions have fallen out of favour, cos we've been at it for most of a year now.

Jawed

wireframe · Sep 28, 2005

geo said:
have you considered that if R520 is roughly the same number of transistors as G70 (maybe even more?), and then you triple ALUs from there? On a refresh?

The G70 is the NV40 refresh part. The G70 has roughly 36% more transistors than NV40 (220M v 300M), but 50% more "shader pipelines" (16 v 24). Now, if you also consider the fact that the G70 features restructured shader units that remove the conditionals that the NV40 suffered from (where the NV40 could count as 32 shader units under certain conditions), you can now see it as more of a 48 shader unit part. It's really not that dramatic.

I repeat myself, but I really think ATI may be doing a similar thing here between R520 and R580. R520 has "heavy shaders" a la NV40, and R580 decouples them so you get three out of one, where those three may have even more features, but, more importantly, can work more independently.

What really has my brain doing loops though is when Dave asked if the coin has dropped. This makes me think I am all wrong about this because otherwise the coin dropped the first time he mentioned 4-1-3-2 and was then discarded for other theories.
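For anyone wanting to check the NV40 to G70 percentages quoted in this post, here is a minimal Python sketch. The transistor and pipeline figures are the ones given in the post (220M v 300M, 16 v 24), not independently verified numbers, and `pct_increase` is just a helper name chosen for the example.

```python
def pct_increase(old: float, new: float) -> float:
    """Percentage growth going from old to new."""
    return (new - old) / old * 100.0


# Figures as quoted in the post, in millions of transistors and pipeline counts.
nv40 = {"transistors_m": 220, "pipelines": 16}
g70 = {"transistors_m": 300, "pipelines": 24}

transistor_growth = pct_increase(nv40["transistors_m"], g70["transistors_m"])
pipeline_growth = pct_increase(nv40["pipelines"], g70["pipelines"])

print(f"transistors: +{transistor_growth:.0f}%")
print(f"pipelines:   +{pipeline_growth:.0f}%")
```

The point of the comparison is that the pipeline count grew faster than the transistor budget, which is the pattern being suggested for R520 to R580.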

Jawed · Sep 28, 2005

I maintain that Xenos fits 4 full shader pipeline arrays (64 pipes) into 232m transistors. I'm guessing that those four arrays only consume about 33% of the die area.

Which makes them around 1.2m transistors per pipe. Say 20m transistors per array of 16 pipes.

An array of 16 pipes in R520/R580 would be slightly bigger, overall, because of prior-DX support (fixed function hardware).

At an absolute minimum it seems R580 is 40m more transistors than R520. I'm guessing it's more like 70m, just to be safe, though.

Jawed
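The back-of-envelope estimate in the post above can be reproduced with a short Python sketch. Every input is a guess taken from the post (232m transistors, roughly 33% of them in the shader arrays, 64 pipes, 16 pipes per array), and the 16 versus 48 pipe counts for R520 and R580 are the theory being debated in this thread, not confirmed specs. Variable names are made up for the example.

```python
# Inputs, all taken from the post and all speculative.
XENOS_TRANSISTORS_M = 232.0   # total transistors, in millions
SHADER_SHARE = 0.33           # guessed share of the die spent on shader arrays
XENOS_SHADER_PIPES = 64       # pipe count as stated in the post
PIPES_PER_ARRAY = 16

per_pipe_m = XENOS_TRANSISTORS_M * SHADER_SHARE / XENOS_SHADER_PIPES
per_array_m = per_pipe_m * PIPES_PER_ARRAY

# Theory under discussion: R520 has 16 shader pipes, R580 has 48.
R520_PIPES = 16
R580_PIPES = 48
extra_arrays = (R580_PIPES - R520_PIPES) // PIPES_PER_ARRAY
extra_m = extra_arrays * per_array_m

print(f"per pipe:  ~{per_pipe_m:.2f}m transistors")
print(f"per array: ~{per_array_m:.1f}m transistors")
print(f"R580 over R520: {extra_arrays} extra arrays, ~{extra_m:.0f}m transistors")
```

With these inputs the per-pipe figure lands close to the 1.2m quoted, and two extra arrays come out just under the 40m minimum, before allowing for the fixed-function hardware the post mentions.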

Jawed · Sep 28, 2005

wireframe said:
I repeat myself, but I really think ATI may be doing a similar thing here between R520 and R580. R520 has "heavy shaders" a la NV40, and R580 decouples them so you get three out of one, where those three may have even more features, but, more importantly, can work more independently.

You're really going wrong here - Xenos points the way to "light" shader pipelines.

R3xx...R4xx both have "light-ish" pipelines, certainly lighter than any NVidia pipeline from the past 3 or 4 years.

Common wisdom round here is that heavy pipelines are just not ATI's bag. For a start it goes against the "maximum utilisation" ethic of Xenos - you can't utilise two dual-issue MADD-capable ALUs if the instructions in your code make the second instruction dependent on the first instruction's result.

(Well, you can, by staggering the dual-issue - but NVidia's pipelines don't do that - maybe they will in G80?)

Jawed

RoOoBo · Sep 28, 2005

ALUs don't take as much logic or area as some people think. In most processors there is a lot of other stuff that accounts for most of the area and transistors: buses, registers, memory, queues, and even more memory. If we go to the CPU world, for example in a P4, the problem isn't in adding more integer or FP ALUs but how to feed them. In a GPU you can feed ALUs much more easily than in a CPU. Current GPU designs seem to work by feeding a group of four inputs into an array of 4 to 8 (in cascade?) SIMD ALUs, but there may be alternative implementations.

wireframe · Sep 28, 2005

Jawed said:
You're really going wrong here - Xenos points the way to "light" shader pipelines.

Are we not on the same rollercoaster and only at different hills and valleys?

IIRC the R520 was denoted as 16-1-1-1, leaving very little room for interpreting this as a VPU with more shader pipelines than ROPs. I therefore think that R520 has 16 shader units that are "heavy" and that R580 makes the transition to divvy up those heavy shader units into 3 independent ones.

BTW, I am tempted to say that R520 will be ~250M-270M transistors and the R580 will be ~300-320M. Does this sound ballpark probable to you?

EDIT: I am saying that I think ATI will go heavy shaders with R520 before going ultra-light with R580. In other words, they are going to NV40 on steroids with R520 and a G70 on steroids with R580, if that makes sense. (Not necessarily with the same limitations as NV40 or G70, but the general idea.)

Jawed · Sep 28, 2005

Take a look at Xenos.

Notice how it is 3 arrays of 16 shader pipelines, 1 array of 16 texture pipelines and "half an array" of ROP pipelines (8).

I think Xenos paints a pretty clear picture of the fragment shader architecture of the entire R5xx series. But that's just my theory.

I've argued it at length for months now...

Jawed

Jawed · Sep 28, 2005

wireframe said:
EDIT: I am saying that I think ATI will go heavy shaders with R520 before going ultra-light with R580. In other words, they are going to NV40 on steroids with R520 and a G70 on steroids with R580, if that makes sense. (Not necessarily with the same limitations as NV40 or G70, but the general idea.)

Seriously, what you're suggesting doesn't make sense at all to me. If you can justify why instead of just saying it, then...

Also, it's my opinion that in terms of shader and texture pipes, RV530 is a mini-R580. In R580 each array is 16 pipes. In RV530 each of those two array types is 4 pipes.

RV530 appears to be the "prototype" R580, using multiple shader arrays in a fairly small form (12 shader pipes total) before going big guns with R580's 48 pipes.

And also, as I've said before, ATI stresses that a 3:1 ratio between shader ops and texture ops is the target that devs should aim for. That fits nicely with both R580 and RV530 according to my theories.

As well as Xenos, in reality, of course.

Jawed

wireframe · Sep 28, 2005

Jawed said:
Take a look at Xenos.

Notice how it is 3 arrays of 16 shader pipelines, 1 array of 16 texture pipelines and "half an array" of ROP pipelines (8).

I think Xenos paints a pretty clear picture of the fragment shader architecture of the entire R5xx series. But that's just my theory.

Ok, so I think you are both right and wrong, if that is possible.

First of all, I wouldn't look too closely at Xenos because that is a USA (unified shader architecture) chip with eDRAM for a closed system, with all the convenience that brings. However, I think the general idea of multiple independent shader units (or ALUs), or pools, rather than a strict pipeline, solves a problem that is more general. We only need to look at the NV40 to see that. It has a lot of potential that is going unused because "it's complicated."

So, if you think R520 is going to use this many-but-light paradigm, how do you get power out of it? It only seems to suggest 16 <something> (16-1-1-1) and even if you multiply you end up with 16 of whatever the rest is hanging off of that. This would mean that R520 is an absolute wimp. This is why I think R520 will feature some truly "extreme pipes" because if it only has 16 of them they better be.

I think Xenos may paint a more accurate picture for R600 though. The idea is sound - "use 1,000,000 ants to move a mountain instead of one ogre" - but I don't think we are there yet and I don't see the numbers in the mysterious notation to suggest that this is the case with R5xx. My guess is that R580 will be the first move in that direction and R600 will feature a "pool" of logic instead of the "pipelines" as we see them now. This would mean you need a new scheduler (as your sig points out) because you have to cleverly reschedule ops to take advantage of the available "ant" instead of "having the whole ogre to themselves"...as they say... :???:

Jawed · Sep 28, 2005

Hint: with fully decoupled shader and texture pipelines, a Xenos-like scheduler becomes extremely useful.

Even if you only have 4 shader pipes and 4 texture pipes.

The key to ATI's new architectures is, I believe, out of order batch scheduling - which essentially means that a batch that needs texturing can be texturing, while a batch that doesn't need texturing can be calculating.

Batches do a little dance through the GPU, each taking their turn on the dancefloors, according to whether they want to waltz (texture) or breakdance (calculate), and whether there's space on those dancefloors for them.

Jawed
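The "dancefloor" idea can be illustrated with a toy Python simulation. This is a deliberately simplified sketch, not a model of any real ATI hardware: it assumes one texture unit and one ALU unit, every op takes exactly one cycle, and a batch can only be on one unit per cycle. The function names and the example batches are invented for the illustration.

```python
from collections import deque

VALID_OPS = ("tex", "alu")


def coupled_cycles(batches):
    """Coupled pipe: one batch at a time owns everything, so ops never overlap."""
    return sum(len(ops) for ops in batches)


def decoupled_cycles(batches):
    """Decoupled units with out-of-order batch pick: each cycle, the texture
    unit and the ALU unit each take the first waiting batch whose next op
    matches that unit."""
    for ops in batches:
        for op in ops:
            if op not in VALID_OPS:
                raise ValueError(f"unknown op: {op!r}")
    pending = [deque(ops) for ops in batches]
    cycles = 0
    while any(pending):
        busy = set()  # batches already served this cycle
        for unit in VALID_OPS:
            for i, ops in enumerate(pending):
                if ops and i not in busy and ops[0] == unit:
                    ops.popleft()
                    busy.add(i)
                    break
        cycles += 1
    return cycles


# Three hypothetical batches at roughly a 3:1 ALU-to-texture ratio.
batches = [
    ["tex", "alu", "alu", "alu"],
    ["alu", "alu", "tex", "alu"],
    ["tex", "alu", "alu", "alu"],
]
print("coupled:  ", coupled_cycles(batches))
print("decoupled:", decoupled_cycles(batches))
```

In this toy case the decoupled version finishes in as many cycles as there are ALU ops, because the texture work of one batch overlaps with the calculation of another. Real texture fetches have long and variable latency, which this sketch ignores.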

wireframe · Sep 28, 2005

Jawed said:
Hint: with fully decoupled shader and texture pipelines, a Xenos-like scheduler becomes extremely useful.

Even if you only have 4 shader pipes and 4 texture pipes.

The key to ATI's new architectures is, I believe, out of order batch scheduling - which essentially means that a batch that needs texturing can be texturing, while a batch that doesn't need texturing can be calculating.

Batches do a little dance through the GPU, each taking their turn on the dancefloors, according to whether they want to waltz (texture) or breakdance (calculate), and whether there's space on those dancefloors for them.

I agree with that completely. I thought I expressed my acceptance of that in my previous post. What I am having problems believing is that this will be the case with R5xx. Again, I base this on the notations:
16-1-1-1 for R520
16-1-3-1 for R580

This, to me, seems to imply that R520 does not feature the "magic dance of the working ants" whereas the R580 hints at it and R600 will probably go flat out with it like Xenos. I just don't see where you link Xenos to what we know of R520. RV530 is said to be of the R580 mould so that is a different matter. It should also feature the ALU independence of the R580 if this notation is indeed ROPs-1-SUs per ROP-2.

Am I missing something? We seem to be both agreeing and arguing different things.
