RXXX Series Roadmap from AnandTech

Given Wavey's piece, and now this one, I have to assume there is a .pdf or .ppt floating about somewhere. Somebody hunt it down. . .
 
tEd said:
that's where the hybrid vertex textures might fit in
I suppose if the TMU pipeline pool just has a queue on the front (which seems pretty likely anyway) then either the vertex or fragment shader pipelines could push a texture address and filter-type command into the queue.

Jawed
 
My guess based on some facts and other guesses: :)

RV530: 4-1-3-2 = ROPs / ? / pipelines or ALUs per ROP / Z, Stencil

RV530 is a 12-pipeline chip and it is too small for a 256-bit memory bus. So 12 or 8 ROPs would be useless with a 128-bit memory bus (more transistors, higher price, higher chance of defects...). But 4 ROPs would be a limitation for Z/Stencil operations, so 4 "extreme" ROPs were used (2x Z/Stencil) as compensation. It is possible that R5xx ROPs are able to do 4x MSAA, so 4 ROPs combined with a clock speed of around 600MHz wouldn't be a sacrifice even when using AA.

R520...

version a) 16 ROPs, 16 pipelines with ? ALUs, (chip would be similar to other RV5x0 architecture)
version b) 16 ROPs, 16 "extreme" ALUs, (chip would be similar to other RV5x0 architecture)

R580...

version a) 16 ROPs, 48 simple pipelines, 1 ALU per pipeline
version b) 16 ROPs, 48 ALUs, a bit similar to Xenos architecture, but w/o US
 
Dave Baumann said:
WRT the texture address processing - their past does one thing and their future does (more or less) the same; it's likely their present would do the same as well.
k, thanks for that :) I don't have access to all the R600 info though, so I didn't really consider that as a certainty for it. Now that I think about it, I don't see how it would work better otherwise.

Uttar, you are reading too much into the "designed at NV30 time", it could mean a multitude of things, for instance it could mean they have paid particular attention to FP32 register performance....
Heh, I see - tbh, that's not really interesting though. The R3xx architecture is quite different from NV3x's, and has no register performance penalty that I'm aware of. By going from FP24 to FP32, you increase the cost of each register by 50%, and you potentially increase the pipeline length, and thus the number of stages required. I've got no data on how long typical single-cycle FP24/FP32 ALU pipelines are, but let's assume FP32 is twice the length to take a worst-case scenario, so the register "costs" would be 3x higher. That's hardly unmanageable.
As far as I understand, the NV3x pipeline length was significantly greater than R3xx's to cope with texture latency; the R3xx, on the other hand, had decoupled texture operations (per-quad; in the R520, it would be a single pool). That's just what I remember from NV/ATI developer documents though, so don't trust me too much on this.

Hmm, well you're not taking account of fragment shader pipeline count, which is, frankly, pointless.
I don't think you really understood what I was trying to measure - we don't know exactly what the 2 and 3 of the RV530 stand for, and as such, we don't know how bandwidth-intensive those would be. I'll agree that based on that, however, I shouldn't have posted R520's numbers.

Uttar
 
The R3xx architecture is quite different from NV3x's, and has no register performance penalty that I'm aware of.
No, it's just that its performance characteristics are somewhat more predictable.

By going from FP24 to FP32, you increase the cost of each register by 50%
33%.
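
The arithmetic behind that correction is easy to check. A quick sketch, assuming register cost scales linearly with bit width and taking the "FP32 pipeline is twice as long" figure as the worst case from the quoted post (the variable names are purely illustrative):

```python
# Register width in bits for the two shader precisions being compared.
FP24_BITS = 24
FP32_BITS = 32

# Relative growth in per-register storage when moving from FP24 to FP32.
width_ratio = FP32_BITS / FP24_BITS
increase_pct = (width_ratio - 1) * 100

# Worst-case assumption from the post: the FP32 pipeline is twice as long,
# so twice as many registers have to be kept in flight.
PIPELINE_LENGTH_FACTOR = 2
worst_case_cost = width_ratio * PIPELINE_LENGTH_FACTOR

print(f"per-register increase: {increase_pct:.0f}%")
print(f"worst-case register cost: {worst_case_cost:.2f}x")
```

Under those assumptions the worst-case register cost comes out closer to 2.7x than 3x.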

and you potentially increase the pipeline length, and thus the number of stages required
There's no requirement to do anything of the sort.

As far as I understand, the NV3x pipeline length was significantly greater than R3xx's to cope with texture latency; the R3xx, on the other hand, had decoupled texture operations (per-quad; in the R520, it would be a single pool).

NV30's pipeline consisted of one FP32 ALU and Tex address unit, two FX units and a register combiner - if you required FP processing then you were using the single FP ALU and competing for its resources against the texture processing.
 
Uttar said:
and their texture caches can store filtered texels.
Why would anyone want to store filtered values in a cache?

Dave Baumann said:
NV30's pipeline consisted of one FP32 ALU and Tex address unit, two FX units and a register combiner
"Register combiner" is just the name NVidia used for those FX units (FX9 in NV10 to 28, FX12 in NV3x).
 
Dave Baumann said:
No, it's just that its performance characteristics are somewhat more predictable.
So you're saying there IS a register limit on the R3xx/R4xx that halves performance before the API limit? If so, I'd love to have some tests of it - I do distinctly remember some ATI guys talking about the opposite on this very forum, but my memory could be at fault here.
Good point; that's why I rarely post after 10PM :)
NV30's pipeline consisted of [...]
Of course, but that hardly explains the register problems of the NV3x. Actually, now that I look back at my info, I cannot seem to explain it anymore... The NV4x has higher absolute quad engine length/latency (but lower than a traditional increase would be) and four times the number of quad engines.
So, assuming the NV4x quad engine's length isn't higher than the NV35's quad engine's length plus redundancy (for power of 2 reasons), you've still got 8x more registers (16x more than the NV30, unless the NV35 was just a length reduction). Bleh, I'll have to check my info on this.
Why would anyone want to store filtered values in a cache?
It's a system in the NV4x, afaik, that allows NVIDIA to save registers and reduce the "TMU" latency. I'll be blunt and say I don't know all the details, but it does exist; in that case, it isn't used as cache, but rather as temporary storage.
Insane bastards might think that's why there is shimmering on the G70 (and NV4x), but I can't see how in the world such a thing would work. So don't overspeculate (well, I probably did in my above posts, but that's no reason to do worse than I did ;)).

Uttar
 
Uttar said:
So you're saying there IS a register limit on the R3xx/R4xx that halves performance before the API limit? If so, I'd love to have some tests of it - I do distinctly remember some ATI guys talking about the opposite on this very forum, but my memory could be at fault here.

Well, that makes two of us. :) But thinking of those cagey engineers, I doubt that they actually said it was unlimited. . .probably what they said is that you'd be bottlenecked somewhere else before you hit their register limit, so that as a practical matter there was no register limitation. . .not that there were unlimited registers. Genteel engineer-speak for "we're more balanced than you, nyah nyah!".
 
Uttar, yes, you're right, I didn't really understand what you were trying to say.

You could re-work your argument to take in likely scenarios for the new GPUs, e.g. 8 or 12 TMUs for RV530.

I think it could be interesting to look at the ratio of memory bandwidth to single-texture rate, bytes/texel as it were:

R520 - 44.8/10.4 = 4.31B/T
X850XTPE - 37.6/8.64 = 4.35B/T
X800XL - 32/6.4 = 5B/T
RV530 - 22.4/4.8 = 4.67B/T (8 TMUs)
RV530 - 22.4/7.2 = 3.11B/T (12 TMUs)
X700XT - 16.8/3.8 = 4.42B/T
RV515 - 12.8/1.8 = 7.11B/T (450/400)
RV515 - 8/1.8 = 4.44B/T (450/250)
X550 - 8/1.6 = 5B/T
X300 - 6.4/1.3 = 4.92B/T
Xenos - 22.4/8 = 2.8B/T
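
For anyone who wants to re-run or extend the list, here is a rough sketch of the same division. It assumes bandwidth is in GB/s and single-texture rate in GTexels/s (so the X850XTPE rate is entered as 8.64), and the function and variable names are made up for illustration:

```python
# (label, memory bandwidth in GB/s, single-texture rate in GTexels/s),
# using the figures quoted in the list above.
parts = [
    ("R520", 44.8, 10.4),
    ("X850XTPE", 37.6, 8.64),
    ("X800XL", 32.0, 6.4),
    ("RV530 (8 TMUs)", 22.4, 4.8),
    ("RV530 (12 TMUs)", 22.4, 7.2),
    ("X700XT", 16.8, 3.8),
    ("RV515 (450/400)", 12.8, 1.8),
    ("RV515 (450/250)", 8.0, 1.8),
    ("X550", 8.0, 1.6),
    ("X300", 6.4, 1.3),
    ("Xenos", 22.4, 8.0),
]


def bytes_per_texel(bandwidth_gbs: float, texel_rate_gts: float) -> float:
    """Theoretical bytes of memory bandwidth available per texel."""
    return bandwidth_gbs / texel_rate_gts


ratios = {label: round(bytes_per_texel(bw, rate), 2) for label, bw, rate in parts}

for label, ratio in ratios.items():
    print(f"{label:<16} {ratio:.2f} B/T")
```

These are theoretical peaks only; nothing here accounts for framebuffer traffic.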

I suppose it would be instructive to compare these theoretical values with the measured values. Of course framebuffer writes have a huge impact on the achievable B/T rate. Xenos is interesting because the framebuffer bandwidth is effectively subtracted out of the overall bandwidth consumption.

But anyway, with the theoretical data above, what do you conclude?

Jawed
 
But anyway, with the theoretical data above, what do you conclude?

That it sounds more like 8 TMUs to me, and the mysterious "3" might stand for ops rather than physical units.

If I haven't shot myself in the foot with some weird math, the RV530 might exceed 100 GFLOP/s with ease in its highest variant.
 
16-1-1-1 R520 -> 16 x 1 = 16 pipelines
16-1-3-1 R580 -> 16 x 3 = 48 pipelines :)
4-1-3-2 RV530 -> 4 x 3 = 12 pipelines
4-1-1-1 RV515 -> 4 x 1 = 4 pipelines
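
That reading of the X-X-X-X strings can be written down as a small sketch. It assumes, purely as this post's guess, that the first field multiplied by the third field gives the pipeline count; the meaning of the fields is not confirmed anywhere in the thread, and the function name is hypothetical:

```python
def pipelines_from_config(config: str) -> int:
    """Pipeline count under the 'first field x third field' guess."""
    fields = [int(field) for field in config.split("-")]
    if len(fields) != 4:
        raise ValueError(f"expected four dash-separated fields, got {config!r}")
    first, _, third, _ = fields
    return first * third


# Configuration strings as quoted from the roadmap.
roadmap = {
    "R520": "16-1-1-1",
    "R580": "16-1-3-1",
    "RV530": "4-1-3-2",
    "RV515": "4-1-1-1",
}

counts = {chip: pipelines_from_config(cfg) for chip, cfg in roadmap.items()}

for chip, count in counts.items():
    print(f"{chip}: {count} pipelines")
```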
 
dzulkeply said:
16-1-1-1 R520 -> 16 x 1 = 16 pipelines
16-1-3-1 R580 -> 16 x 3 = 48 pipelines :)
4-1-3-2 RV530 -> 4 x 3 = 12 pipelines
4-1-1-1 RV515 -> 4 x 1 = 4 pipelines

That's an awfully nice estimate, with the only difference being that it doesn't make sense.

Apart from each chip's capabilities and improvements, in the strict sense both R520 and R580 are still 4 quad designs.
 
Sunday said:
I find it somewhat nonsensical to have a special CrossFire card in the R(V)5xx series! I mean, why would you buy a non-CF card? Maybe right now you don’t want CF (because of the lack of mobos, or the lack of extra $ for the CF-capable model), but some day you’ll wish to have a CF setup, and it would be very convenient to be able to use your existing card. A second DVI output shouldn’t be a problem with an adequate adapter… In my opinion each R(V)5xx card should be a CF card; that is the only way to popularize the CrossFire idea…
I think you're missing the point. You only need one CrossFire card to use two cards in a CrossFire setup. Buy the normal card first and then buy the CrossFire card later.
 
Uttar said:
The R3xx architecture is quite different from NV3x's, and has no register performance penalty that I'm aware of.
All architectures have a register performance penalty. The number of registers that must be used to hit a penalty will be different of course. Using too many registers in a shader program will make it more difficult for the hardware to hide memory latencies.
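
A toy model makes the trade-off concrete. Every number below is hypothetical and does not describe any real GPU; it only assumes a fixed-size register file shared among the threads in flight, with each thread contributing a few ALU cycles of useful work while another thread waits on memory:

```python
REGISTER_FILE_ENTRIES = 256   # hypothetical register file size
MEMORY_LATENCY_CYCLES = 100   # hypothetical texture/memory latency
ALU_CYCLES_PER_FETCH = 4      # hypothetical ALU work per thread between fetches


def threads_in_flight(register_file_entries: int, registers_per_thread: int) -> int:
    """How many threads fit if each one reserves a fixed number of registers."""
    return register_file_entries // registers_per_thread


def latency_hidden(registers_per_thread: int) -> bool:
    """True if the other threads' ALU work can cover one memory fetch."""
    threads = threads_in_flight(REGISTER_FILE_ENTRIES, registers_per_thread)
    return threads * ALU_CYCLES_PER_FETCH >= MEMORY_LATENCY_CYCLES


for regs in (2, 4, 8, 16):
    threads = threads_in_flight(REGISTER_FILE_ENTRIES, regs)
    print(f"{regs:>2} regs/thread -> {threads:>3} threads, "
          f"latency hidden: {latency_hidden(regs)}")
```

In this sketch the penalty only shows up at 16 registers per thread; a different register file size or latency would move that threshold, which is the point being made above.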
 
Ailuros said:
That it sounds more like 8 TMUs to me, and the mysterious "3" might stand for ops rather than physical units.
Confused? Do you mean 3x ALUs (vec3+scalar) per fragment pipeline? Making R580 a triple-issue architecture?

Jawed
 
Jawed said:
Confused? Do you mean 3x ALUs (vec3+scalar) per fragment pipeline? Making R580 a triple-issue architecture?

Jawed

Whatever that 3 stands for, it indicates 3x that of R520. 3x the physical units (ALUs) sounds highly unlikely to me; 3x the throughput sounds more like it, and as a peak value. I think under specific conditions that ballpark can also be reached (or come damn close to it) between NV40 and G70.

Could be entirely wrong though.
 
I would like to point out, IIRC, that Anandtech had R420 down as an eight extreme pipe part until very close to the time that part launched. They are an awfully long way from infallible - they tend to either be under NDA and therefore not say anything, or spout fairly crackpot third-hand info. I also find the 1400MHz memory spec on a mid-range card somewhat unlikely, given that NVIDIA have not been able to specify that on the 7800 GTX.

Have we yet reached a consensus on the X-X-X-X syntax, or are we still scratching around in the dark?
 