3D Technology & Architecture

Discussion in 'Architecture and Products' started by Frank, May 18, 2007.

  1. santyhammer

    Newcomer

    Joined:
    Apr 22, 2006
    Messages:
    85
    Likes Received:
    2
    Location:
    Behind you
    I think the R600 should be faster than the 8800GTX... It can use GDDR4, it doubles the stream units and the bandwidth is a lot higher... I suspect the problem is immature drivers (and also immature DX10). All the new stuff needs to mature a while before serious benchmark testing... I remember when Metal Gear Solid came to PC... the Psycho Mantis invisibility cubemap made my PC almost hang because it was so slow... a year later it ran like silk. Vista and the Catalyst drivers need some time to be well optimized.

    I'm definitely gonna buy an R600 because:

    1) It's cheaper than the NVIDIA ones.

    2) The preliminary benchmarks don't do it justice. Let's wait a while so the drivers and tests mature a bit (personally I'll wait for 3DMark 2007 before judging the R600's performance). Speed is not important for me, I just want all the DX10 features at low cost.

    3) It comes in AGP (or at least that's what I saw here http://xtreview.com/addcomment-id-2...-HD-2900-XTX-and-AGP-Radeon-HD-2600-2400.html) because I lack a PCI-E motherboard. The HD 2600 Pro looks very nice to me... no external power connector needed, 128-bit GDDR3 at 1 GHz, a silent passive heatsink, small and nice.

    4) I saw that wonderful terrain tessellation, and Humus said it wasn't done with the GS, so I bet the R600 has a tessellator, which I want to play with! If the R600's GS really is better than the 8800's, that will help too.

    5) The HDMI and the integrated "sound card" look good.

    6) Its new AA modes are very interesting.

    So I'm trying to decide whether to get that HD 2600 Pro or to wait a while and get a Barcelona CPU with a DX10 integrated VGA on the motherboard! (Or pray to NVIDIA that they release the 8600 in AGP!)
     
    #21 santyhammer, May 19, 2007
    Last edited by a moderator: May 19, 2007
  2. Unknown Soldier

    Veteran

    Joined:
    Jul 28, 2002
    Messages:
    2,461
    Likes Received:
    178
    By the time there is a use for GS at the limits Humus explained, Nvidia might have a proper solution at hand. I doubt they are sleeping at the moment, and I also doubt there'll be an 8900 whatever. I expect them to be working on the Next Gen Card.

    US
     
  3. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    I'm surprised nobody has mentioned the scalar architecture yet. G80 at 1.35 GHz can deliver 346 GFLOPS all the time, while R600 at 742 MHz can drop to 95 GFLOPS for shaders with all-dependent scalars. To reach its peak performance of 475 GFLOPS it needs 5 independent scalars every clock cycle (multiply-add dependencies are ok).

    While I'm confident that driver improvements can offer some nice speedups, that worst case which is less than a third of G80's performance still exists. Frankly, I don't know of any reason to prefer a VLIW architecture over a scalar one.

    Anyway, am I exaggerating the importance of this, or is it indeed one of the primary reasons R600 doesn't compete with the GTX and Ultra?
     
    #23 Nick, May 19, 2007
    Last edited by a moderator: May 19, 2007
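A quick back-of-the-envelope check of the figures in the post above, as a Python sketch; it assumes the commonly quoted configurations (128 scalar MAD units on G80, 64 five-wide VLIW units on R600) and counts a MAD as two flops, none of which is stated in the post itself:

```python
# Peak vs. worst-case ALU throughput, counting a MAD as 2 flops.
# Assumed configurations: G80 = 128 scalar units at 1.35 GHz,
# R600 = 64 five-wide VLIW units at 742 MHz.
G80_UNITS, G80_CLK = 128, 1.35e9
R600_UNITS, R600_WIDTH, R600_CLK = 64, 5, 742e6

g80_peak   = G80_UNITS * 2 * G80_CLK                 # ~346 GFLOPS, sustained regardless of dependencies
r600_peak  = R600_UNITS * R600_WIDTH * 2 * R600_CLK  # ~475 GFLOPS, needs 5 independent scalars per clock
r600_worst = R600_UNITS * 1 * 2 * R600_CLK           # ~95 GFLOPS, fully dependent scalar chain

print(f"G80 peak {g80_peak/1e9:.0f}, R600 peak {r600_peak/1e9:.0f}, "
      f"R600 worst case {r600_worst/1e9:.0f} GFLOPS")
```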
  4. ERP

    ERP Moderator
    Moderator Veteran

    Joined:
    Feb 11, 2002
    Messages:
    3,669
    Likes Received:
    49
    Location:
    Redmond, WA
    For the most part shaders work on vectors anyway. I very much doubt you will see anything close to the worst case in any useful shader.

    What surprises me is the lack of benchmarks that explain the differences in performance between the expected results and the measured results. It surprises me that sites don't do things like run a relatively standard shader, then progressively modify the shader to ascertain what stops it from hitting its peak.

    As a dev I do things like reduce texture resolution all the way down to the minimum possible on the hardware to ensure the texture is in the cache. I run in progressively smaller windows (down to 1x1) to isolate the performance of the vertex shader. I remove texture reads and ALU instructions to try and understand what the bottlenecks are.

    What I see online is a few standard benchmarks run at different resolutions, and rampant speculation.
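The kind of isolation sweep described above could be scripted along these lines; a minimal sketch where `render_frame(width, height, shader_variant)` is a hypothetical callback standing in for the real renderer, not an actual API:

```python
import time

def time_variant(render_frame, width, height, shader_variant, frames=200):
    """Average seconds per frame for one (resolution, shader variant) combination."""
    start = time.perf_counter()
    for _ in range(frames):
        render_frame(width, height, shader_variant)
    return (time.perf_counter() - start) / frames

def sweep(render_frame, variants):
    # Shrinking the window toward 1x1 isolates vertex/setup cost; comparing
    # shader variants with texture reads or ALU ops stripped out isolates those.
    for width, height in [(1600, 1200), (640, 480), (1, 1)]:
        for name, variant in variants.items():
            ms = time_variant(render_frame, width, height, variant) * 1000.0
            print(f"{width}x{height}  {name:<24} {ms:7.3f} ms/frame")
```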
     
  5. AlexV

    AlexV Heteroscedasticitate
    Moderator Veteran

    Joined:
    Mar 15, 2005
    Messages:
    2,528
    Likes Received:
    107
    A major reason for VLIW is that you've got a shitpile of money invested in such a thing somewhere along the line, you've had major shuffling of your roadmap, and you thought that your primary competitor would do a sucky hack job, but they turned out to have some balls attached and tried a serious departure from what they did before.

    I'm partially kidding there, as there are certainly advantages to the way ATi chose... as long as your compiler monkeys work proper magic and you have a good enough relationship with developers.
     
  6. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    Vectors, yes, but of what dimension? Texture coordinates typically have two components, colors three, normals three, camera and light positions three, etc. Furthermore, fog is a scalar, point size is a scalar, depth is a scalar...
    Worst case, obviously not, but not the best case either. The amount of instruction-level parallelism they need to extract is considerable. Also, at branches and at the end of the shader it's highly unlikely you can fill all 5 ALUs.

    So maybe with a great compiler they can reach an average of 4 operations per shader unit. That's enough to beat the GTX, but still not enough to beat the Ultra. On paper R600 should have gone for gold, but clearly something is preventing it from reaching best-case performance.
    Why is that surprising? Reviewers rarely know how to write shaders, let alone evaluate the architecture with them. And even if they could do it, it takes considerable effort, delaying their review and costing them page hits. Furthermore, readers rarely want to be bothered with math and stuff. They want to see pretty pictures of benchmarks and performance graphs they can understand.

    Anyway, nobody's stopping you from creating a site/blog where you post your results and invite other professional developers to post theirs... ;)
     
    #26 Nick, May 19, 2007
    Last edited by a moderator: May 20, 2007
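The arithmetic behind the "4 operations per unit" claim above, as a sketch; the Ultra's shader clock is assumed at roughly 1.5 GHz and a MAD is counted as two flops:

```python
# If the compiler fills 4 of R600's 5 slots on average:
r600_avg = 64 * 4 * 2 * 742e6    # ~380 GFLOPS
gtx      = 128 * 2 * 1.35e9      # ~346 GFLOPS
ultra    = 128 * 2 * 1.5e9       # ~384 GFLOPS (assumed ~1.5 GHz shader clock)

print(f"R600 @ 4 ops/unit: {r600_avg/1e9:.0f}  GTX: {gtx/1e9:.0f}  Ultra: {ultra/1e9:.0f} GFLOPS")
# ~380 GFLOPS edges out the GTX but falls just short of the Ultra.
```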
  7. swaaye

    swaaye Entirely Suboptimal
    Legend

    Joined:
    Mar 15, 2003
    Messages:
    8,580
    Likes Received:
    662
    Location:
    WI, USA
    I don't think that's true of the average B3D reader. :) (tho I do like pretty pictures and graphs interspersed to replenish my mental energies)
     
  8. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,875
    Likes Received:
    767
    Location:
    London
    GTX is dual-issue, not scalar. It has plenty of hazards arising out of the SF units: think about contiguous scalar-dependent instructions that feed into or depend upon SF.

    Also think of texturing instructions, whose coordinates need interpolating (one ordinate at a time) before the texturing instruction can be fired off to the TMUs.

    Think about the apparently non-existent branch-evaluation unit in G80. Branching in G80 is a fair old mystery - you'll get hints of it in the CUDA guide.

    The result is compiler complexity, and less ALU-instruction throughput than the shader outwardly indicates.

    If you want to see a trivial example of G80's ALU throughput falling over for no apparent reason:

    http://forum.beyond3d.com/showthread.php?p=1005099#post1005099

    There's still not been an explanation for this behaviour. If you read on you'll see the shader code. R600 also has unexpected behaviour in these ALU tests, but G80's is rather more eye-catching.

    I presume you're playing catch-up on this ALU-throughput storyline in the R600 soap opera :lol: That's the 3rd appearance of that storyline, at least...

    Jawed
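A toy model of the issue behaviour described above, purely illustrative and not the real G80 scheduler: assume one MAD-style port and one SF/interpolation port can issue per clock, and a dependent instruction can't issue in the same clock as its producer. Independent work co-issues; a dependent chain serializes and leaves a port idle.

```python
def issue_cycles(instrs):
    """instrs: list of (unit, depends_on_previous) pairs, unit being 'MAD' or 'SF'."""
    cycles, i = 0, 0
    while i < len(instrs):
        cycles += 1
        issued, start = set(), i
        while i < len(instrs):
            unit, dep = instrs[i]
            if unit in issued:        # that port is already busy this clock
                break
            if dep and i > start:     # its producer only issues this same clock
                break
            issued.add(unit)
            i += 1
    return cycles

independent = [('MAD', False), ('SF', False)] * 2
dependent   = [('MAD', False), ('SF', True), ('MAD', True), ('SF', True)]
print(issue_cycles(independent))   # 2 clocks: a MAD and an SF co-issue each clock
print(issue_cycles(dependent))     # 4 clocks: the dependent chain issues one at a time
```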
     
  9. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    GTX can do 43 GFLOPS worth of special-function operations, while R600 does 47 GFLOPS. I don't think that's a significant difference. The SFUs are also used for interpolation, but that only costs one cycle, as you've explained to me. ;)

    Anyway, I don't see how this 'dual-issue' architecture could be made 'fully' scalar. So I don't think we can consider this a weakness of G80 unless there's a better way. Or am I missing something?
    Do they first compute 'u' for a whole batch, then 'v' for a whole batch? Or can they compute 'u' of the first 8 pixels, then 'v' of the first 8 pixels, then the rest of the pixels in the batch?
    Interesting. Any chance it's just limited by the CPU? Since this is an AMD test, have these numbers been confirmed with recent drivers for G80?
    Yeah, sorry, long threads scare me away. I don't have the time now to read everything, and the things I'm interested in only recently started to surface in shorter threads. :D
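Where those special-function figures roughly come from, as a sketch; the unit counts are assumptions (32 SFUs on G80 at the 1.35 GHz shader clock, one transcendental-capable lane per R600 five-wide unit at 742 MHz, one special-function op per unit per clock), not something stated in the posts:

```python
g80_sf  = 32 * 1.35e9   # ~43 G special-function ops/s
r600_sf = 64 * 742e6    # ~47 G special-function ops/s
print(f"G80 SF {g80_sf/1e9:.0f}, R600 SF {r600_sf/1e9:.0f} Gops/s")
```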
     
  10. aeryon

    Newcomer

    Joined:
    Oct 5, 2006
    Messages:
    85
    Likes Received:
    3
    Location:
    France / China
    No time to reply to the other points (and everybody can easily give good arguments against them), but one shocks me, since a lot of people are confused about the R600 specs:

    5) The HD 2900 XT has no UVD engine. The new UVD is only for RV610/630, and the HDMI is only 1.2, so it's not compatible with HD-DVD and Blu-ray (for that you need HDMI 1.3). Without proper HDMI, this feature is IMHO useless...
     
    #30 aeryon, May 20, 2007
    Last edited by a moderator: May 20, 2007
  11. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,439
    Likes Received:
    280
    There are too many differences to come to any reliable conclusions. R600 might be based on Xenos, but it took so long to develop because there are a lot of differences.
     
  12. aeryon

    Newcomer

    Joined:
    Oct 5, 2006
    Messages:
    85
    Likes Received:
    3
    Location:
    France / China
    hmmmm

    [two benchmark graphs]


    source: dynamic branching test on hardware.fr

    not really what you say, in fact quite the opposite...
     
  13. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,582
    Likes Received:
    625
    Location:
    New York
    Jawed, you keep hanging onto those AMD benchmark numbers as gospel, yet you ignore Rys' own findings:

    Also, you keep referring to this SF dual-issue hazard as G80's Achilles' heel, yet we have no evidence of such a thing occurring outside of contrived cases. Do you really believe it will be an issue in real shaders, or are you just trying to dispel the G80 scalar myth?
     
  14. Reverend

    Banned

    Joined:
    Jan 31, 2002
    Messages:
    3,266
    Likes Received:
    24
    Why do you want all the DX10 features?
     
  15. Geeforcer

    Geeforcer Harmlessly Evil
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,297
    Likes Received:
    465
    Call me Ishmael.
     
  16. Arun

    Arun Unknown.
    Moderator Legend Veteran

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    There definitely is a branch evaluation unit in both G80 and R600, not sure where you got the idea there isn't? I'm not sure exactly how it works though, because I'm not sure you get a free branch every cycle (which would be ridiculous overkill, anyway); perhaps it's clocked at 675MHz, which is the scheduler's clock? That'd still be more branches/clock (no matter how useless that metric is ;)) than R600, although I'd presume neither architecture is really starved there.

    Also, one thing to keep in mind for the CUDA guide: they always, always explain things in terms of "number of clocks taken", so you cannot really conclude much about throughput there. For example, based on that documentation, you might conclude the main ALU and the SFU cannot execute instructions at the same time, but this is obviously incorrect. They explain things that way so that the doc is fairly abstract and not too architecture-dependent.

    As for the SFU/FMUL dual-issue, I want to finish my triangle setup & ROP testers, and then I'll try fiddling a bit with G80 again based on, let us say, new information... :)
     
  17. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,805
    Likes Received:
    1,093
    Location:
    Guess...
    True, there are differences, but it's fairly safe to assume they didn't make anything slower than Xenos; everything should be as fast or faster. So if you simply decrease R600's benchmark scores by 50% to account for the clock speed difference, that should pretty much be a best-case scenario for Xenos (under texture addressing constraints).
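For what it's worth, a straight 50% cut works out close to the combined ALU-count and clock ratio between the two parts; a rough sketch, assuming the commonly quoted 240 ALU lanes at 500 MHz for Xenos versus 320 lanes at 742 MHz for R600:

```python
xenos = 240 * 500e6     # ALU lanes x clock
r600  = 320 * 742e6
print(f"Xenos/R600 ALU-rate ratio: {xenos / r600:.2f}")   # ~0.51
```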
     
  18. Rys

    Rys PowerVR
    Moderator Veteran Alpha

    Joined:
    Oct 9, 2003
    Messages:
    4,164
    Likes Received:
    1,461
    Location:
    Beyond3D HQ
    Branching isn't free on G80 or R600, so there's overhead there, but both have dedicated logic for it and the overhead is minimal (compared to the truly free branching on R5xx).

    EDIT: And Jawed's right, G80 isn't the paragon of simplicity in terms of scheduling that it can be made out to be, but that needs to be thought about in broader context.
     
  19. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    Is that compared to requiring no scheduling effort at all, or compared to actual architectures you have in mind that would have better scheduling properties (without increasing complexity to the point that you have to lower the number of ALUs and/or clock frequency)?

    I mean, they'll always have to rely on some compiler work. But no matter how hard you try, a VLIW architecture can only use a fraction of its ALUs with dependent scalars.

    So I'm going to ask again: Is there ever a reason to prefer VLIW over scalar? Or in other words: Will any new architecture ever use VLIW again?
     
  20. Arun

    Arun Unknown.
    Moderator Legend Veteran

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    In theory, you could have somewhat lower control overhead in terms of transistor count, I'd imagine. Whether that's worth it or not depends on how efficient it is in practice. Arguably, this is already much less important for GPUs than for CPUs, because the control overhead is much lower since the ALUs are SIMD anyway, even if they were "scalar" from a programming point of view.

    And thanks for the correction regarding the cost of branching Rys - oopsie! :eek:
     