Anand talks R580

Well, if we make the following two assumptions:
1. nVidia needs a core clock of the G70 of about 500MHz to do equal to or better than the Radeon X1800 XT across the board (or nearly so).
2. A R580 will obtain approximately double the performance of the X1800 XT in the best-case at the same clock speeds, and will be clocked similarly.

Then we can imagine what a 90nm G70, with no changes to the core, would have to be to beat this new product. First, let's imagine that it is a 32-pipeline architecture. Since 32 pipelines is only 4/3 of the current 24, doubling the performance of the G70 at 500MHz would still require 50% higher clocks, i.e. 750MHz.
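Here's that arithmetic spelled out (a back-of-the-envelope sketch in Python, assuming performance scales linearly with pipelines x clock, which real workloads won't quite do):

Code:
# Hypothetical 90nm G7x vs. the current 24-pipe G70, assuming
# performance ~ pipelines * clock (a deliberate simplification).
current_pipes, target_pipes = 24, 32
baseline_clock = 500.0                  # MHz, from assumption 1 above

# Clock needed for the 32-pipe part to double the baseline:
required_clock = 2.0 * (current_pipes / target_pipes) * baseline_clock
print(f"required clock: {required_clock:.0f} MHz")   # -> 750 MHz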

Personally, I think it's unlikely that a 90nm G7x will be able to be that fast. As such, nVidia will really need to be working on some significant core changes if they want their 90nm product to beat ATI's R580 across the board.
 
Of course, I was only stating what would be required to beat the R580 across the board. It is completely possible for a 90nm G7x with zero core changes to beat an R580 on average, since I expect the improvements from adding more ALU units will be all over the place, depending upon the game. But we can always hope for more, right? :)
 
Unknown Soldier said:
Mintmaster, I see what you're saying...but people are interested in which is better. All they'll see is that Nvidia has a faster card than ATI. I'm still waiting for SC:CT, because the X1800XT walked the old GTX. All I know is it's gonna hurt ATI in the end.
You're right. We need reviewers to be more forward looking. I think they did an okay job in the R3xx vs. FX days. NVidia tweaked the crap out of certain shaders to save face in major games, but people knew it just plain wasn't a real PS 2.0 card.

For the last year and a half, NVidia has been tooting the PS3.0 horn. In all fairness, they did have FP blending when ATI didn't, but the PS3.0 advantage was 90% marketing and 10% substance.

(For the more technical folk: The only really useful pixel shader I know that needs NV40's PS3.0 capabilities and can't be done as well on ATI's PS2.0 hardware is parallax mapping with distance functions. Even then, only the unlimited dependent texture read ability is important, and this technique hasn't been used in any game yet.)
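(To make that dependent-read point concrete, here's a minimal CPU-side sketch of the sphere-tracing loop at the heart of that technique, in Python with an invented 1D distance map. Every iteration samples at a position derived from the previous sample, i.e. a dependent texture read; ps_2_0 caps such chains at four levels, while ps_3_0 doesn't.)

Code:
# Sphere tracing a heightfield via a precomputed distance map.
def trace(distance_map, origin, direction, steps=16):
    t = 0.0
    for _ in range(steps):            # up to 16 chained, dependent "reads"
        x = origin + t * direction
        idx = min(int(x * len(distance_map)), len(distance_map) - 1)
        d = distance_map[idx]         # sample depends on previous result
        if d < 1e-2:                  # close enough to the surface
            break
        t += d                        # step by the stored safe distance
    return t

# Toy map: the surface sits at x = 0.5.
dmap = [abs(0.5 - (i + 0.5) / 64) for i in range(64)]
print(trace(dmap, 0.0, 1.0))          # lands near 0.5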

Dynamic branching is far and away the most important feature of PS3.0. No shader without branching can honestly be called PS3.0. And here are the results for such shaders:
Tech-Report: Soft shadow mapping w/ dynamic branching: R520 is 60% faster
ATI SDK samples: R520 is 2x as fast
Parallax Occlusion Mapping: R520 is 2-3 times as fast
Parallax mapping with distance functions: R520 >100% faster
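The common thread in those results is branch granularity: the smaller the batch of pixels forced down the same path, the less often one divergent pixel drags all its neighbours through the expensive path. A toy Python model (the batch sizes are the commonly cited figures, roughly 16 pixels for R520 versus on the order of a thousand for G70, and the 5% penumbra workload is invented):

Code:
import random

def shaded_cost(pixels, batch_size, cheap=1, expensive=8):
    """The whole batch pays the expensive path if ANY pixel needs it."""
    cost = 0
    for i in range(0, len(pixels), batch_size):
        batch = pixels[i:i + batch_size]
        cost += len(batch) * (expensive if any(batch) else cheap)
    return cost

random.seed(0)
frame = [random.random() < 0.05 for _ in range(1 << 16)]  # 5% slow path

for name, size in [("R520-ish, 16px batches", 16),
                   ("G70-ish, 1024px batches", 1024)]:
    print(name, shaded_cost(frame, size))

With these invented numbers the small-batch machine comes out roughly 60% ahead, in the same ballpark as the Tech Report result above.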

And here's the HOLY CRAP! benchmark:
GPUMANIA: PVR Voxel demo (bottom of page): R520 is 5-9 times as fast as G70

Now, having said all this, NV40 and G70 are fantastic architectures. They do all the really important stuff very well. Use of dynamic branching and real PS3.0 shaders won't impact image quality nearly as much as the introduction of FP calculations in PS2.0 did, or even HDR. Even if RSX has the same branching cost, it will carry the PS3 a long way. But there are a few big techniques where it matters, like raytracing-esque shaders (POM, voxels) and soft shadow maps.

But I don't think NV40/G70 is any more PS3.0 than NV3x is PS2.0. The former is a PS2.0+ architecture with PS3.0 tacked on, just like the latter is PS1.1 with PS2.0+ tacked on. If reviewers emphasize this even a bit and run some tests to prove it, then R520 should look more appealing.
 
^^^^OUCH!

Chalnoth said:
Of course, I was only stating what would be required to beat the R580 across the board. It is completely possible for a 90nm G7x with zero core changes to beat an R580 on average, since I expect the improvements from adding more ALU units will be all over the place, depending upon the game. But we can always hope for more, right? :)

Most certainly :)
 
Mint, I think your argument goes too far to the extreme: first defining dynamic branching as the only feature that makes something PS3.0, and then declaring that the G70 is really a PS2.0+ part. Well, if that's the case, then the R3xx/PS2.0 was really PS1.4+ when we consider some of the strange limitations it and PS2.0 had (e.g. texture read limits).

Gradients, a better input register model, vastly more temporary registers, longer shader lengths, indexable register files: these all also contribute to making PS3.0 a better model to program. R520's better-performing DB is nice (although I discount the pathological voxel cases) and it would be pointless to argue the point, but the situation is not the same as NV3x and PS2.0.

Much of supporting 3.0 goes beyond DB support, such as being able to deal with much longer shaders, unlimited texture reads, FP32, etc. These require substantial architectural changes, and the R520 would not be PS3.0 compliant if it were simply an R420 + dynamic branching.
 
Mintmaster said:
Were you referring to the voxels demo or was I bashing NV40/G70 too much? I really wasn't trying to come off that way.

No no...I was referring to R520's superior handling of dynamic branching...referring to what appears to be a spanking of G70. That's it. As for G70 being a SM2.0 card with SM3.0 tacked on...I really disagree with that, but hey, everyone is entitled to their own opinion, right?

I just wonder if dynamic branching will be important before the next architectures are introduced...but then the R520 should still be a somewhat viable option at that point in time. (speaking as if I had money in any case :))
 
scificube said:
There is already an ATI email circulating around that states Xenos only has slightly more pixel shading power than the X800XT.
Umm, I'm going to assume you meant X1800XT judging by the rest of your post, because otherwise it doesn't make any sense.

Even if you're referring to the X1800XT, though, that really shocks me. I've heard that a few times on these boards too. Given how closely they launched, I can't imagine Xenos being any less efficient than RV530 on a per-pipe, per-clock basis.

Maybe this is in reference to current typical shaders. RV530 tracks very closely to the 6800 in Shadermark, even though it is clocked 80% higher. The 6800, in turn, is a bit more efficient than R580 in the same benchmark, per pipe per clock. So I could imagine that a design with 48 shader pipes and 16 texture units is only around 40-50% faster on average than R520 at the same clock. If you take into account that Dave B said a Xenos shader < R520 shader (maybe mini-ALU is less capable), then it makes sense for the 625MHz R520 to be only a bit behind the 500MHz Xenos.
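Putting rough numbers on that (a quick sketch, assuming throughput scales with per-clock advantage x clock, and using my 40-50% guess above):

Code:
r520_clock, xenos_clock = 625, 500          # MHz

# Guess from above: 48 shader pipes + 16 TMUs come out 1.4-1.5x an
# R520 per clock on today's typical shaders.
for per_clock_advantage in (1.4, 1.5):
    ratio = per_clock_advantage * xenos_clock / r520_clock
    print(f"Xenos/R520 at real clocks: {ratio:.2f}x")   # 1.12x-1.20x

So the 625MHz R520 lands only 10-20% behind the 500MHz Xenos under those assumptions.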

But surely you can have substantially better speed with Xenos if the shader has enough math, or if it uses the 16 non-filtered texture samplers (say for shadow mapping) to reduce the burden on the filtering units.

I don't buy the theory that 8-12 units on average will be occupied with vertex transformation, because few games on the market today have anywhere near that kind of workload. R520 has 10 vertex shaders because they're cheap, and because it's costly to hold up your pixel pipeline when you have clumps of tiny or zero-pixel triangles. The average time actually spent vertex shading can't be more than 20-30%. I've worked at ATI analyzing this stuff, and while it was several years ago, I still know how to spot a heavy vertex load via resolution scaling.
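For anyone wanting to try the resolution-scaling trick themselves: vertex work is constant per frame while pixel work scales with pixel count, so measurements at two resolutions let you solve for the split. A toy model (all numbers invented purely to show the method):

Code:
# Frame time = fixed vertex cost + per-pixel cost * pixel count.
resolutions = {"1024x768": 1024 * 768, "1600x1200": 1600 * 1200}

t_vertex, per_pixel = 4.0, 2e-5             # ms and ms/pixel, made up
times = {r: t_vertex + per_pixel * p for r, p in resolutions.items()}

# Recover the split from the two "measurements":
(p1, t1), (p2, t2) = [(p, times[r]) for r, p in resolutions.items()]
k = (t2 - t1) / (p2 - p1)                   # per-pixel cost
v = t1 - k * p1                             # resolution-independent part
print(f"vertex share at 1024x768: {v / t1:.0%}")   # ~20% here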
 
Bear in mind that RV530's dynamic branching efficiency is reduced in comparison to RV515's; or, at least, the branching performance is the same, which can negate the increased ALU capacity.
 
That's a typo. I did mean the X1800XT.

As for where I got that: it was in an email from an ATI fellow, seen here in this thread:

http://www.beyond3d.com/forum/showthread.php?t=24771&highlight=ati+rsx+x1800

As for the 8-12 ALUs for vertex processing, I just picked a random number that made sense. If I did badly with what I selected, I blame the devil...the devil made me do it! I was only trying to make sense for myself of those aforementioned comments.

In any case, it still appears to me that Xenos isn't going to hang with R580, unless I'm misinterpreting what 16-1-3-1 means.

Oh, and thanks for the info. I was unaware Dave had commented on the capability of Xenos's ALUs as compared to those in R520.
 
Each "ALU" on Xenos consists of a single Vector[4 component] + Scalar ALU (so each ALU is is similar to a vurrent vertex shader ALU), whereas each "ALU" from R420+ consists of a single vector ALU with full capabilities and another vector ALU with ADD capabilities (both of which can co-issue up to a Vec3 with a scalar op).
 
^^^Thanks. I'll figure out what that means in a little while :)

...on the face of it...it would appear ATI's R420-and-above's...um, pixel shader processors?...would be more capable than those in Xenos.

...and if I got it right, for Nvidia both ALUs in a pixel shader processor (I need...containers...or I get lost!) have full capabilities, so that's why some would classify them as more capable.

(I understand the difference between a vector and a scalar ALU...the things that comprise what I call, in this case, a pixel shader processor. Rather...I should say I know the difference between scalars and vectors, as I'm just a curious programmer...)

--------------------------------------------------------------------------------------------------------------------------

I know this isn't a classroom or anything, but would it be correct to say an ALU in one of Xenos's shader arrays can perform a 5D op in a clock cycle, while R420/R520's can do two? (ok...1.5...or something) Or are R420/R520's vector ALUs not capable of Vec4s?
 
Bah, I went through all the trouble of registering (I went through a Yahoo account before realizing it was one of the (duh) banned email providers, and then two more e-mail accounts; the first wouldn't receive any messages ;) ), and I end up beaten repeatedly. In any case, since Xenos only has 16 filtered texture units, it's somewhat more limited than the X1800XT. But shading power should range upward from roughly X1800XT level (20+20 ALUs for the PS, and 8 ALUs for vertex work; considering the X1800XT's 25% higher clock, this would put Xenos just below it). That would be the worst shading case possible, though, since 50% of the instructions executed would have to fall in line with that simpler ALU (ADD or input modify, no?). At the other extreme, if none did (equally improbable/impossible), Xenos could have the shading capability of an X1800XT with 32 PS + 16 VS "pipes" running at 500MHz instead of 625MHz. So the average should realistically fall somewhere in between...and, I would think, lean towards the more-shading-capability end (of course, I don't actually know the average composition of executed instructions, but for the compiler to find so many places where strictly ADD could be done, as opposed to MADD or any of the other instructions, doesn't seem too realistic).

Or is that far off?
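Putting crude ALU x MHz numbers on those two extremes (a sketch; I'm reading the X1800XT as 40-odd ALUs total for the worst case, and using the 32+16 "pipes" framing for the best case):

Code:
xenos = 48 * 500                    # 48 unified ALUs at 500MHz

# Worst case: each Xenos ALU counts as just one X1800XT-class ALU
# (of which the X1800XT has about 40, running at 625MHz).
print(f"worst: {xenos / (40 * 625):.2f}x")   # 0.96x, just below

# Best case: each Xenos ALU counts as a full pipe, vs the X1800XT's
# 24 (16 PS + 8 VS) pipes.
print(f"best:  {xenos / (24 * 625):.2f}x")   # 1.60x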

On the subject of the filtered and unfiltered texture fetches, do normal maps require any sort of filtering given what they're used for? It would be useful to dump those in with the unfiltered texture fetches, thus freeing the filtering units to apply heavier filtering to the textures that actually need it (until the desired filtering is done, only one texel loops through the unit, correct? And does this block the shading units in G70 if higher than bilinear filtering is used?)
 
With ATI finished with the development of R520, Xenos, and R580, and working on finishing R600, I would imagine that ATI has a roadmap with Microsoft for the next-next-gen console GPU, which I will call 'C2' or 'Xenos2' for now: the Xbox3 GPU. Of course, it is only a glimmer in their eyes at this time, but they have to be doing the early R&D, since Sony and Nvidia have a roadmap for the successor to RSX/PS3.
 
http://www.penstarsys.com/previews/graphics/nvidia/512_7800gtx/512gtx_3.htm

credit to JoshMST for the find!

Here it's said Nvidia's 90nm low-k G70 part will have 32 pixel pipes and 10 vertex pipes, and could be clocked up to 700MHz. This info is towards the bottom of the page, and I don't want to copy it here because that may be against the rules.

Seeing as R580 is already a 90nm part (unless the 80nm rumor is true) using low-k, it doesn't seem wrong to think ATI won't be getting R580 too much faster than 700MHz, unless they've been really conservative with R520.

That said...it's become apparent there are subtleties in play here that I have yet to understand, so I ask: how would these two parts stack up against each other, at least theoretically?
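Here's the crudest possible version of that comparison, leaning on the two assumptions from the start of the thread (a 24-pipe G70 at 500MHz roughly matching an X1800XT, and R580 at roughly 2x an X1800XT) plus PenStarSys's speculated 32 pipes at 700MHz. Purely illustrative:

Code:
# Assumption 1: a 24-pipe G70 at 500MHz is roughly one X1800XT.
one_x1800xt = 24 * 500                  # pipe-MHz worth one X1800XT

g7x_90nm = 32 * 700                     # PenStarSys's speculated part
g7x_est = g7x_90nm / one_x1800xt        # ~1.87x an X1800XT

r580_est = 2.0                          # assumption 2: ~2x an X1800XT
print(f"G7x ~ {g7x_est:.2f}x X1800XT vs. R580 ~ {r580_est:.1f}x")

By that (very rough) yardstick, even the 700MHz part falls just short of a best-case R580.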
 
Josh had quite a laundry list of goodies in there to go with 32/10 and 700MHz. I can't see it all happening, at least not at 380M transistors. I can imagine 380M transistors at 700MHz, as this would seem to be the neighborhood that ATI is aiming at (see the Hexus bean from last week). But you also have to take into account that NV has been loudly insistent that they are not going to rely on clocks to the degree that they perceive ATI doing, and why. And there's been a goodly bit of positive reinforcement for that strategy over the last eighteen months, and even goodlier negative reinforcement in the eighteen months previous to that. So you have to factor that in as well.
 
DemoCoder, I did say PS2.0+, didn't I? That was on purpose.

I thought the ps_2_x profile included unlimited dependent texture reads, gradients, and long shaders. More temporary registers are nice if you're coding in ASM, but with HLSL or GLSlang I don't see it being anything more than a minor convenience. Maybe useful for some seriously hardcore GPGPU stuff, but not much more. Indexable constant registers don't seem to do anything more than a lookup table does, except maybe a performance boost in some bizarre, pathological cases.

I stand by my statement. Add DB to ps_2_x, and there's very, very little more that you can do with PS3.0. I'd like to hear some concrete examples if you disagree.

And R420 didn't support that profile, so I'm not suggesting R520 = R420 + DB. But even so, the architectural changes required to add all of the things you mentioned absolutely pale in comparison to what's needed for fast dynamic branching. You can cite the "60M transistors for SM3.0" figure that NVidia gave, but a lot of that is in FP blending, FP filtering, and VS3.0 (vertex texturing and MIMD branching), along with a heavy dose of BS, i.e. not comparing equally performing architectures.

As for saying R300 really is PS1.4+, I think that's pretty accurate, except for instruction count limits/flexibility (i.e. no limit between "phases") and the FP pipeline. However, PS1.4 was a very heavy departure from PS1.1-PS1.3, so your statement is not really fair even though I agreed with it.

The main reason R300 was so radically more capable (in terms of graphics output, not performance) than R200 was you could get high precision data in and out of the shader, and all calculations were done in FP. Then add in the architectural differences that allowed huge throughput, very good dependent texture access, and fast MSAA, and you have R300.

And I'm not reducing NV40/G70 to NV30. If you read my reply again, I said fast dynamic branching will NOT have the same impact on graphics that PS2.0 did. That alone separates the R520/G70 comparison from R300/NV30. I'll say it again: Beyond DB, NV40/G70 is a very, very good architecture.

Nonetheless, just like NV3x will get raped by R3xx in any modern pixel shaded game, there will be a few examples in the future where G70 will get owned by R520. However, the latter will occur much less frequently than the former owing to what I said in the previous paragraph. I think most readers are getting the impression of "never" instead of "much less frequently", since both cards are PS3.0 and NVidia is winning all the synthetic pixel shader benchmarks. That was the main point of my post.
 