G8x vs R6xx in Rightmark3D 2.0

I understand Vista complicates matters, but surely they can do some sort of resource utilization analysis to differentiate between Aero's resource usage and that of a "real" 3D app...

No, that would be the easy way; the ATI way would be for the driver to contain a list of every 3D app ever made :D
 
3dcgi said:
There should be no need to guide execution with parentheses in this case.
That's not true for IEEE-754 floating point.

For example:
a = 1000000
b = 0
c = .00001
d = .00001

a + b + c + d = 1000000 + 0 + .00001 + .00001 = 1000000
(a+b) + (c+d) = (1000000 + 0) + (.00001 + .00001) = 1000000.00002

Or so.
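The effect above can be reproduced directly, though the magnitudes need rescaling: Python floats are 64-bit doubles, so the post's values (which assume 32-bit precision) are replaced here with 2**53 and 1.0, where the spacing between adjacent doubles is exactly 2.0. A minimal sketch:

```python
# Floating-point addition is not associative. At 2**53 the spacing
# between adjacent doubles is 2.0, so adding 1.0 one at a time is
# absorbed, while summing the small terms first lets them survive.
a = 2.0 ** 53      # 9007199254740992.0; ulp here is 2.0
b = 0.0
c = 1.0
d = 1.0

left_to_right = a + b + c + d      # each +1.0 rounds back down to a
grouped = (a + b) + (c + d)        # a + 2.0 is exactly representable

print(left_to_right == a)          # True: the small terms vanished
print(grouped == a + 2.0)          # True: grouping preserved them
```

Same structure as the example above: left-to-right evaluation loses the small addends, while the parenthesized grouping keeps them.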

Incidentally, the optimization mentioned above also benefits G80 in some small way.
 
It's not because the R600 compiler is immature. It should find all common opportunities for slot filling in common code, including the case you mention (in fact it does find that case, I just checked).

The point is that if you can explicitly hint it in any way to extract parallelism, do so.
 
There should be no need to guide execution with parentheses in this case. Compiler writers are smart, I'm sure they'll figure it out.

The compiler is capable of optimizing it; however, to be compliant with DX10 you can't reorder operations in a way that may affect the result, as in Bob's example. It's up to the developer to write good code. This is very important in SM4 shaders, where HLSL preserves a lot more of the original semantics.
 
The compiler is capable of optimizing it; however, to be compliant with DX10 you can't reorder operations in a way that may affect the result, as in Bob's example. It's up to the developer to write good code. This is very important in SM4 shaders, where HLSL preserves a lot more of the original semantics.
This leads me slightly off topic, but it seems to me that the compiler should be free to optimize unless the programmer guides optimization with parentheses, with no parentheses meaning "don't care". This puts the burden on programmers that have order-of-operation requirements and yields the most performance for everyone else.

I'm sure there are a lot of applications that won't care about a bit of mixed precision here and there, and any bit of work that can be offloaded from developers who are not GPU experts is a good thing. Especially if GPU programming is ever to become mainstream.

I consider myself an average programmer, and most of the time I don't have time to optimize every little thing. In most C++ programs I've written I use floats because I need decimals and precision isn't terribly important. I can envision GPU programmers taking this approach at times.

And yes, I realize this means a program might yield different results with different compilers, and Microsoft has a bunch of rules to ensure this doesn't happen. I'm just not sure it's the best approach. Microsoft's Internet Explorer team must agree with me, since they've often worked outside of standards. ;)
 
This leads me slightly off topic, but it seems to me that the compiler should be free to optimize unless the programmer guides optimization with parentheses, with no parentheses meaning "don't care". This puts the burden on programmers that have order-of-operation requirements and yields the most performance for everyone else.
Well, it matters a lot for floating point. The canonical example is something like (5+1e10) - 1e10 = 0, but 5 + (1e10 - 1e10) = 5. Optimizations that affect output are a terrible, terrible, terrible idea. Go back to any of the UT2003 or 3DMark 03 discussions if you want more insight there.
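The canonical 1e10 example assumes 32-bit floats (in 64-bit doubles, 1e10 + 5 is exact). A quick way to reproduce it with only the standard library is to round every intermediate result through a 32-bit float, the way a single-precision GPU ALU would; the `f32` helper here is my own illustrative name:

```python
import struct

def f32(x):
    # Round a Python float (a 64-bit double) to the nearest 32-bit
    # float, emulating single-precision arithmetic step by step.
    return struct.unpack('f', struct.pack('f', x))[0]

big = f32(1e10)      # ulp at 1e10 is 1024 in float32, so +5 is lost
small = f32(5.0)

print(f32(f32(small + big) - big))    # 0.0 : (5 + 1e10) - 1e10
print(f32(small + f32(big - big)))    # 5.0 : 5 + (1e10 - 1e10)
```

Two groupings of the same three operands, two different answers, which is exactly why a DX10 compiler isn't allowed to reassociate them behind the programmer's back.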
 
Well, it matters a lot for floating point. The canonical example is something like (5+1e10) - 1e10 = 0, but 5 + (1e10 - 1e10) = 5. Optimizations that affect output are a terrible, terrible, terrible idea. Go back to any of the UT2003 or 3DMark 03 discussions if you want more insight there.
When I made my first post I wasn't thinking about floating point, though I did wonder whether I should even make the second post before doing so. I was curious what kind of response I'd get. To clarify, I'm not talking about having the compiler make unwanted optimizations; I'm just theorizing that it might be useful to flip the programming model around and have programmers define when they don't want optimizations. If a programmer is operating with as little precision as in the example, then they'll definitely want to define the order of operations.

Anyway, I'm going to let this drop since it's off topic and all I'm doing is presenting a half-baked theory on a programming model for ease of use. As I haven't done much with floating point in some time, precision issues with floats may hamper ease of use more than the automatic parallelization would help it. Obviously, bugs caused by the order of operations could matter much more than some easily gained performance.
 
Here you are with pretty recent drivers:
http://www.pcgameshardware.de/?article_id=607300

A little Demirug-birdie told me that the instructions for the fiery bunny are full of dependent scalars - true?
Well, the question is:
Is this a correct example of how this effect is created, i.e. are these dependencies artificially added, and are they easily removed?

If this is the easy way to create such an effect, then it's an architecture problem, not a shader-writer one. Who is going to spend hours writing and tuning?
In the NV30 days the situation was exactly the same, afaik - NV30 had potential, but the time needed to extract it was 5x the time needed to simply write an R300-friendly shader, no?
 
There's a weighting function (8 clocks on R600 I think) which does a bunch of dependent scalar MUL/ADD/MADD, and a gradient calculation (16 clocks on R600, MULs + transcendentals), run 32 times each to generate the shader's noise, which then feeds a small loop over scalar floats to accumulate.

That's the bulk of the math-only stuff in that fire shader (at a quick look).

There's a decent mix of integer instructions too (3:1 float:int or thereabouts in total), incidentally. So all-in, not a shader that R600 likes too much. I'd profile on NVIDIA but they don't have a profiler for ps_4_0 that works on x64 :devilish:
 
So weighting + gradient is 24 clocks. Repeated 32 times = 768 clocks. This is 62M fragments/s. At 2560x1600 that's 15fps, for a screen filling quad of this PS. Since it isn't a screen filling quad and since there's more to this shader anyway, it's not a hugely useful metric.
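The arithmetic above can be checked with a quick back-of-the-envelope script. Note the ALU count (64 VLIW units) and 742 MHz clock are my assumptions about R600, not figures stated in the post:

```python
# Rough throughput estimate for the fire shader's math-only portion.
# Assumes R600 has 64 VLIW shader units at 742 MHz (my assumption).
alu_units = 64
core_clock_hz = 742e6
clocks_per_fragment = 24 * 32         # weighting + gradient, 32 iterations

fragments_per_sec = core_clock_hz * alu_units / clocks_per_fragment
print(fragments_per_sec / 1e6)        # ~61.8M fragments/s

pixels = 2560 * 1600
print(fragments_per_sec / pixels)     # ~15.1 fps for a screen-filling quad
```

Under those assumptions the numbers come out to roughly the 62M fragments/s and 15 fps quoted.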

Jawed
 
So weighting + gradient is 24 clocks. Repeated 32 times = 768 clocks. This is 62M fragments/s. At 2560x1600 that's 15fps, for a screen filling quad of this PS. Since it isn't a screen filling quad and since there's more to this shader anyway, it's not a hugely useful metric.

Jawed

You did not account for the dependencies of the scalar stuff, but instead loaded up R600 to the fullest possible extent. Additionally, texturing is not one of R600's primary strengths, which can add to this.
 
You did not account for the dependencies of the scalar stuff, but instead loaded up R600 to the fullest possible extent. Additionally, texturing is not one of R600's primary strengths, which can add to this.
The clock calculation I did does take the dependencies into account. Jawed then just extrapolated using the chip's ALU count and clock for a full-screen quad, and I think it's correct. There's very limited texture sampling happening in the shader, although it is filtered.
 
Well, the question is:
Is this a correct example of how this effect is created, i.e. are these dependencies artificially added, and are they easily removed?

That's a good question. I wonder how many upcoming titles have explicitly taken advantage of G80's narrow units. AMD might want to get a little more serious about their dev rel if that's the case.
 