RXXX Series Roadmap from AnandTech

Damn good thinking XMas, so that means:

ROPS - ? - fragment pipelines per ROP - TMU pipelines per ROP

Jawed
 
I find it somewhat nonsensical to have a special CrossFire card in the R(V)5xx series! I mean, why would you buy a non-CF card? Maybe right now you don’t want CF (‘cos of the lack of mobos, or the lack of extra $ that goes for a CF-capable model), but some day you’ll wish to have a CF setup, and it would be very convenient to be able to use your existing card. A second DVI output shouldn’t be a problem with an adequate adapter… In my opinion each R(V)5xx card should be a CF card; that is the only way to popularize the CrossFire idea…
 
If the 128-bit bus is dictated by the size of the mainstream chips, are we to assume that all future mainstream chips, even RV930, will have a 128-bit wide bus? Is it possible to increase the amount of data transferred per pin in future memory technologies, e.g. GDDR4, GDDR5 etc., to increase the effective width of the bus? BTW where the heck is QDR?
 
Sunday said:
I find it somewhat nonsensical to have a special CrossFire card in the R(V)5xx series! I mean, why would you buy a non-CF card? Maybe right now you don’t want CF (‘cos of the lack of mobos, or the lack of extra $ that goes for a CF-capable model), but some day you’ll wish to have a CF setup, and it would be very convenient to be able to use your existing card. A second DVI output shouldn’t be a problem with an adequate adapter… In my opinion each R(V)5xx card should be a CF card; that is the only way to popularize the CrossFire idea…

You forget that there are OEMs out there that buy the majority of these chips, and most of their systems don't ship with dual-GPU configurations.
 
back to 128-bit bus
R300 was 256-bit @0.13
How big is 6600 compared to R300? And RV530? I really doubt the "chip is too small" explanation
 
Yas

There was talk of a clock-speed-based increase in "pipelines" with the R420 launch, hints at "double pumped", comparisons to Intel's Netburst, and agreements concerning design tools to achieve significantly higher speed operation.

Viewing the numbers listed, the base number seems to be the first, and the mysterious number seems to be the 3rd one. Focusing on that, and assuming accuracy in the numbers, I see two sets of numbers that seem especially important: The "12 pipe" and "4-1-3-2" Wavey is making some sort of hint about, and the 16-1-1-1 and 16-1-3-1 between the R520 and R580.

One thing that makes sense is "ROPs" and then ALUs per ROP, but I don't think that fits...that seems too drastic a change in transistor count for the R520 to R580 change. That doesn't rule it out, but it doesn't seem to fit a sane refresh path.

What does seem to fit is having a design intended to achieve that type of throughput without adding silicon, which fits with some of the indicators listed at the beginning (if they're not simply fiction). That is, by having ALU processing multi-"pumped" per clock.

  • This would maintain the "locked" pixel/ROP/ALU pipeline relationship (silicon-wise) that would seem to explain the "R3xx legacy"
  • This would correspond to some speculation I've had concerning how some of their mobile-technology solutions could be of benefit to desktop parts in terms of performance (there seems to be varying clock usage in mobile parts already, geared toward minimum power instead of maximum performance)
  • It might offer an alternative to performance scaling, depending on the profile for leakage, power, heat, etc., to execute units capable of this type of scaling on a given process

I can't evaluate how feasible this is to be done right now, but this seems to be the type of thing that makes sense and is planned by both IHVs, with ATI already having made announcements last year that seem to directly relate to it, and there being evidence of it for nVidia for separating clocking by increasing degrees going forward.

Using this guess does seem to indicate that the R520 would be an "underperformer" without high base clock speeds, but it might also indicate a similarity between R580 and R520 that directly relates to the issues reported in relation to the "R520" delay, depending on how this might be implemented.

...

There are problems with this guess, and some remaining mysteries. What does the "2" at the end mean, and why does only the RV530 have it? DDR2? Is it the latest generation of high-clocked DDR1 for everything else? Also, why the apparently huge jump from R520 to R580? This guess does perhaps explain how it might be achievable in a refresh, but not why such a large jump in performance would be attempted. Along with this, it is significant that there is no "2" in this column between "1" and "3"...both together seem to strongly indicate that this guess is wrong, unless there is some implementation detail to explain it.

Also, there doesn't seem to be a listing for vertex processing in the numbers. The 2nd could be "TMUs per pipe", which would fit as well for the idea of R3xx lineage, but the last remains a mystery...why would the middle range have a larger number than any other?

Finally, why would the R420 have 16 pipes and the next generation have the same count? The R580 would certainly(!) address this if the 3rd number relates to ALU throughput somehow, but the R520 would mainly seem a fairly "dissatisfactory" stepping stone in relation. I could guess that the R420 might have already implemented something like this (making it a jump from 8 double-pumped to 16), but this wouldn't seem to fit the ROP/pixel processing relationship guesses.

...

Hmm, well, the numbers could just be wrong or incomplete, but this guess doesn't seem to hold together accurately with what is known. I hope it might touch on some relevant things, though.
 
chavvdarrr said:
back to 128-bit bus
R300 was 256-bit @0.13
How big is 6600 compared to R300? And RV530? I really doubt the "chip is too small" explanation


From B3D 3D tables:

R300: 218 mm2 (107M transistors at 150nm process)
NV43 (GeForce 6600): 150 mm2 (143M transistors at 110nm process)


R300 is probably the smallest chip with a 256-bit bus. There is no die size info for the NV35, so I don't know if it was a bit smaller than R300.
 
demalion said:
Finally, why would the R420 have 16 pipes and the next generation have the same count? The R580 would certainly(!) address this if the 3rd number relates to ALU throughput somehow, but the R520 would mainly seem a fairly "dissatisfactory" stepping stone in relation.

Don't forget that ATI is also revamping the shader core to some extent: at a minimum a jump from SM 2.0 and FP24 to SM 3.0 and FP32. There may be pretty significant efficiency gains in general vs. the R3xx/R4xx core. In other words, given the same number of "pipelines"...clock for clock, R5xx may be significantly faster wrt shading than R3xx/R4xx.

We'll just have to wait and see.
 
John Reynolds said:
Would the next person who is in the same room as Dave punch him in the arm or chest for me? TIA!

Should be me I think, and gladly. A slap with the thick end of a 5800 Ultra should do it :LOL:
 
Joe DeFuria said:
Don't forget that ATI is also revamping the shader core to some extent: at a minimum a jump from SM 2.0 and FP24 to SM 3.0 and FP32. There may be pretty significant efficiency gains in general vs. the R3xx/R4xx core.

It might well be wrong to assume so. For starters it takes almost twice the silicon budget to move up from FP24 to FP32, and then they still have to use silicon for non-trivial SM 3.0 features like dynamic branching. Which, I might add, ATI in the past promised to be more useful than nVidia's first attempts at it. On top of this I'm pretty certain the R4xx has very high efficiency after the tweaks to the already awesome R300.
 
Here's a slightly modified table including some other numbers:
16-1-1-1 R520(XL) @ 500/500 256-bit; 100% per-clock relative bandwidth
04-1-3-2 RV530(XT) @ 600/700 128-bit; 58% per-clock relative bandwidth
04-1-1-1 RV515(Pro) @ 450/400 128-bit; 45% per-clock relative bandwidth
Now, obviously, some parts could be more bandwidth limited than others. But you don't triple the number of pipelines yet only increase relative bandwidth by 29%.
One could argue that with 29% more relative bandwidth, you could double ROP or texturing performance, but the benefits would be relatively weak in each case. You could argue that the RV515 has too much bandwidth, but that seems unlikely for a low-end part.

Another important point is that apparently, R520 was designed in a timeframe where ATI had the ALU superiority, so they wouldn't have focused as much on it; with the R580 however, they might have realized they weren't going to have that advantage with the R520, and decided they had to fix it. This would also have affected the RV530.

That would imply that: the first number is four times the number of "pipelines" (4xQuads); the second number is the number of dedicated texture addressing units per "pipeline"; the third number is the Vec4 ALU throughput per "pipeline"; the last number is the texture filtering throughput per "pipeline".

Why am I differentiating the addressing and the filtering? Simple question here: how many texture filtering operations are run in a single cycle nowadays, with trilinear and AF, if the IHV doesn't "bypass" the cost? Well, simply put, not that many. If correct, this would give an interesting market position for the RV530: Great image quality for the low/mid-end. Of course, it also could be the ROP throughput, but that seems to be a bit too bandwidth limited to me...

Uttar
 
In Xenos the texture address calculation ALU is in the TMU pipeline - so there's a one-to-one correspondence between filtering TMUs and texture address calculation ALUs.

I don't see how you're calculating per-clock relative bandwidth. All your numbers there seem completely screwy to me.

Jawed
 
Jawed said:
In Xenos the texture address calculation ALU is in the TMU pipeline - so there's a one-to-one correspondence between filtering TMUs and texture address calculation ALUs.
This isn't Xenos. The GF6/GF7 architectures handle addressing very differently too, and their texture caches can store filtered texels. I'm not saying my speculation is correct, but disregarding it for such reasons is a bit ridiculous.

I don't see how you're calculating per-clock relative bandwidth. All your numbers there seem completely screwy to me.
Memory Frequency*(Bus Width/256)/Core Frequency.
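
As a minimal sketch of that formula in Python, applied to the three rumoured configs from the table earlier in the thread (the function and variable names are just illustrative, and the clocks and bus widths are the unconfirmed figures quoted there):

```python
# Per-clock relative bandwidth, as defined in the post:
#   memory frequency * (bus width / 256) / core frequency
# Clocks (MHz) and bus widths (bits) are the rumoured figures from the thread,
# not confirmed specifications.

def relative_bandwidth(core_mhz, mem_mhz, bus_bits, reference_bus_bits=256):
    """Memory bandwidth per core clock, normalised to a 256-bit bus."""
    return mem_mhz * (bus_bits / reference_bus_bits) / core_mhz


# name -> (core clock MHz, memory clock MHz, bus width in bits)
PARTS = {
    "R520 (XL)": (500, 500, 256),
    "RV530 (XT)": (600, 700, 128),
    "RV515 (Pro)": (450, 400, 128),
}

if __name__ == "__main__":
    for name, (core, mem, bus) in PARTS.items():
        print(f"{name}: {relative_bandwidth(core, mem, bus):.1%}")
```

This works out to 100% for the R520 config, about 58.3% for the RV530 and about 44.4% for the RV515, in line with the 100% / 58% / 45% listed in the table.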


Uttar
 
LeStoffer said:
It might well be wrong to assume so.

Which is why I'm not assuming anything. ;)

I'm just re-emphasizing (like you have) that there is quite a bit of difference between R5xx and R3/4xx in terms of shader capability and precision...there's going to be lots of new transistors in there to cover that. So "only" being marginally faster than the current high-end R4xx would not be all that surprising considering the new capabilities. There is a chance, though, that since the shader design had to be revamped more than "trivially", that additional efficiency gains could have been incorporated as well.
 
WRT the texture address processing - their past does one thing and their future does more or less the same; it's likely their present would do the same as well.

Uttar, you are reading too much into the "designed at NV30 time", it could mean a multitude of things, for instance it could mean they have paid particular attention to FP32 register performance....
 
Uttar said:
Memory Frequency*(Bus Width/256)/Core Frequency.
Hmm, well you're not taking account of fragment shader pipeline count, which is, frankly, pointless.

If you take Bandwidth/Single-texture rate (or Bandwidth/fragment rate) as the basis of your argument, I think it'll be more convincing.
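
A rough sketch of what that could look like in Python, dividing memory bandwidth by peak fragment rate. This assumes the first number in each rumoured config is the fragment pipeline count and that memory transfers twice per listed clock (DDR); both are assumptions rather than known specs, and the names are made up for illustration:

```python
# Bandwidth per fragment: memory bandwidth divided by peak fragment rate.
# ASSUMPTIONS (not confirmed anywhere in the thread):
#   - the first number of each config (16, 4, 4) is the fragment pipeline count
#   - memory is double data rate, i.e. 2 transfers per listed memory clock

def bytes_per_fragment(core_mhz, mem_mhz, bus_bits, pipelines,
                       transfers_per_clock=2):
    """Bytes of memory bandwidth available per fragment at peak rate."""
    bandwidth = mem_mhz * 1e6 * transfers_per_clock * bus_bits / 8  # bytes/s
    fragment_rate = core_mhz * 1e6 * pipelines                      # fragments/s
    return bandwidth / fragment_rate


# name -> (core MHz, memory MHz, bus width in bits, assumed pipeline count)
CONFIGS = {
    "R520 (XL)": (500, 500, 256, 16),
    "RV530 (XT)": (600, 700, 128, 4),
    "RV515 (Pro)": (450, 400, 128, 4),
}

if __name__ == "__main__":
    for name, cfg in CONFIGS.items():
        print(f"{name}: {bytes_per_fragment(*cfg):.2f} bytes/fragment")
```

Under those assumptions the R520 config gets 4 bytes per fragment, against roughly 9.3 for the RV530 and 7.1 for the RV515, so the ranking reverses once pipeline count is taken into account.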

Jawed
 
vb said:
let's start this chronologically

1 ATI decoupled texture units from fragment quads to make them available to vertex units ->R520
Is this an established fact or just conjecture and speculation?

(I've missed quite a few topics on this in the past few months, especially the really long threads, so a link for me to read up on this would be great. Thanks!)
 
incurable said:
Is this an established fact or just conjecture and speculation?

(I've missed quite a few topics on this in the past few months, especially the really long threads, so a link for me to read up on this would be great. Thanks!)

that's where the hybrid vertex textures might fit in
 