3D Technology & Architecture

Wouldn't it even be possible to avoid pipeline flushes for these situations? It's probably not worth the complexity, but I'm just thinking theoretically.
Yeah, I was thinking in the case of using the previous rendertarget as a texture in the next pass. There's only so much you can do to prevent a pipeline flush in that case, as far as I can see...
Jawed said:
There's no pipeline flush there. Read more carefully.
A pipeline flush (explicit or not) is the only explanation for what you are describing. If there is no pipeline flush, then the rest of your post is wrong, and that's that.

Here's an extreme example that should be easy to understand. Consider a full-screen quad covering millions of pixels. Every multiprocessor has up to 768 pixels in flight at a given time. As soon as all the multiprocessors are filled with threads, every time a warp exits, a new one will enter. As long as the scheduler is smart enough not to run all the warps perfectly in sync, you won't see bubbles then.

To make sure we have the same definition of a pipeline flush, let's also take an example with CUDA. After every call to the API, you are guaranteed that all stages on the GPU are empty, because CUDA is fully synchronous. When the pipeline is full and begins emptying, the pipeline flush has started. When new threads begin entering again, the pipeline flush has ended.
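To make that concrete, here's a minimal CUDA sketch of where a flush, by that definition, starts and ends - the kernel and launch sizes are made up purely for illustration:
Code:
#include <cuda_runtime.h>

/* Stand-in kernel: each thread does one MADD, like a pixel in the shader. */
__global__ void madd_kernel(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2.0f + 1.0f;
}

void run_pass(float *d_out, const float *d_in, float *h_out, int n)
{
    /* The launch fills the machine with threads again - the previous
     * flush (if any) ends here. */
    madd_kernel<<<(n + 255) / 256, 256>>>(d_out, d_in, n);

    /* This blocking copy cannot return until every thread has retired,
     * so between the last warps draining and the next launch, the
     * pipeline is empty: that's the flush as defined above. */
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
}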
 
A pipeline flush (explicit or not) is the only explanation for what you are describing. If there is no pipeline flush, then the rest of your post is wrong, and that's that.
Well, maybe someone else will explain it to you. I've shown you precisely how the performance of your shader can be explained. You have no explanation whatsoever; you merely present the number.

Why is the shader running at 89% pixel-throughput?

To make sure we have the same definition of a pipeline flush, let's also take an example with CUDA. After every call to the API, you are guaranteed that all stages on the GPU are empty, because CUDA is fully synchronous. When the pipeline is full and begins emptying, the pipeline flush has started. When new threads begin entering again, the pipeline flush has ended.
There is no pipeline flush in what I described. The 2304-clock batches I've described will butt against each other very neatly, thank you.

Jawed
 
Hmm, I'll admit I dismissed that explanation rather quickly because it didn't explain my results for this:
So, to see if that was because of a "bubble", I retried with 32x(4xMADD+1xSF), with all instructions being dependent on the previous one. I got ~1150MPixels/s with LOG2 (lower!) and with SIN, I got ~1000MPixels/s again. Doing this 16x with a Vec2 instead (->independent instructions...) gave practically identical scores.
Your explanation does give perfectly valid results for the 1200MPixels/s case, though, which implies that, if you are correct, this is the assumption I got wrong:
As long as the scheduler is smart enough not to run all the warps perfectly in sync, you won't see bubbles then.
I think if I added a variable-latency instruction (texture fetch at pseudo-random location), you'd see higher efficiency, but I'm not entirely sure.

I'll admit I'm not 100% convinced by your explanation, though, because it doesn't explain why the case without bubbles performs slightly worse, rather than better. Am I wrong to assume your explanation is basically "whatever fits the numbers", or did you come up with the right numbers on the first try?
 
Am I wrong to assume your explanation is basically "whatever fits the numbers", or did you come up with the right numbers on the first try?
I modelled different sized batches in Excel:
  • 1-warp, 32-pixel batch runs at 900 MP/s
  • 2-warp, 64-pixel batch runs at 1080 MP/s
  • 6-warp, 192-pixel batch runs at 1246 MP/s
I picked the 4-warp batch from the list because it fits the measured throughput. 4 warps is also the second batch size (after 3 warps) that "fully occupies" G80, according to the CUDA Occupancy Calculator for a 1-register shader.

One of the things that's not clear, based on the COC, is whether the joint limitation of 24 warps and 8 blocks (batches, effectively) also applies to graphics shaders. 8 batches on their own just don't seem enough to hide texturing latency - it seems the combination of serial scalar issue and intra-batch instruction re-ordering is required to fill in this time.

3-warp batches (at 1156 MP/s) may actually be how your shader is being issued - we're victims of the error margin. That would be 8 batches then.
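(Just spelling out the warp arithmetic behind that, using the 768-threads-in-flight and 24-warp figures from earlier - nothing new here:)
Code:
#include <stdio.h>

int main(void)
{
    int threads_per_sm = 768;                  /* pixels in flight per multiprocessor */
    int warps_per_sm   = threads_per_sm / 32;  /* = 24 warps */

    printf("3-warp batches in flight: %d\n", warps_per_sm / 3);  /* 8 - the 8-block limit */
    printf("4-warp batches in flight: %d\n", warps_per_sm / 4);  /* 6 */
    return 0;
}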

I hadn't tackled the other scenarios you listed because I wanted to solidify the idea that bubbles can explain the performance shortfall seen in the first shader.

If we naively schedule:

So, to see if that was because of a "bubble", I retried with 32x(4xMADD+1xSF), with all instructions being dependent on the previous one. I got ~1150MPixels/s with LOG2 (lower!) and with SIN, I got ~1000MPixels/s again. Doing this 16x with a Vec2 instead (->independent instructions...) gave practically identical scores.

then 160 instructions (no co-issue) should produce 1080 MP/s. If you do the overlapping I described earlier, there should only be 8 clocks of bubbles (4 at the start, 4 at the end), resulting in ~1344 MP/s.

The measured 1150 MP/s implies 2x160 clocks of single-issue bubbles topping and tailing each batch. With a base of 128 instructions and a 4-warp batch, this would be ~1168 MP/s theoretically.
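Here's the same arithmetic as a small C snippet rather than the spreadsheet, assuming the usual G80 figures (16 multiprocessors at 1.35GHz, 4 clocks per warp-instruction, so 16 clocks per instruction for a 4-warp batch):
Code:
#include <stdio.h>

/* Pixels per second for one 4-warp (128-pixel) batch per multiprocessor,
 * given the clocks that batch occupies: 16 multiprocessors, 1.35GHz. */
static double mp_per_s(double clocks_per_batch)
{
    return 128.0 / clocks_per_batch * 1.35e9 * 16.0 / 1e6;
}

int main(void)
{
    double per_instr = 4 * 4;  /* 4 warps x 4 clocks per warp-instruction */

    printf("%6.0f MP/s - 2304-clock batch (the earlier shader)\n",
           mp_per_s(2304.0));                          /* 1200  */
    printf("%6.0f MP/s - 160 serial instructions, no co-issue\n",
           mp_per_s(160 * per_instr));                 /* 1080  */
    printf("%6.0f MP/s - 128 instructions, SFs overlapped, 8 bubble clocks\n",
           mp_per_s(128 * per_instr + 8));             /* ~1344 */
    printf("%6.0f MP/s - 128 instructions + 2x160 clocks of bubbles\n",
           mp_per_s(128 * per_instr + 2 * 160));       /* ~1168 */
    return 0;
}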

Clearly this shader is running too slow - much slower than a co-issued ALU pipeline supports, and also much slower than your suggested independent SF pipeline. I'd prefer not to conclude that it's fuzzy compilation, but I can't see what else to say. Until I can find some plausible execution plan for this shader, I'm bemused... Try a different driver?

Jawed
 
I'm surprised nobody has mentioned the scalar architecture yet. G80 at 1.35 GHz can deliver 346 GFLOPS all the time, while R600 at 742 MHz can drop to 95 GFLOPS for shaders made of all dependent scalars. To reach its peak of 475 GFLOPS, R600 needs 5 independent scalars every clock cycle (multiply-add dependencies are ok).
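Those numbers fall straight out of the unit counts - assuming the usual figures of 128 scalar ALUs for G80 and 64 five-wide units for R600, each ALU doing one MADD (2 flops) per clock:
Code:
#include <stdio.h>

int main(void)
{
    /* G80: 128 scalar ALUs x 2 flops (MADD) x 1.35GHz - dependent or not. */
    double g80 = 128 * 2 * 1.35;
    /* R600: 64 five-wide units x 2 flops x 0.742GHz. */
    double r600_peak  = 64 * 5 * 2 * 0.742;  /* 5 independent scalars per unit */
    double r600_worst = 64 * 1 * 2 * 0.742;  /* fully dependent scalar chain   */

    printf("G80: %.0f GFLOPS, R600 peak: %.0f GFLOPS, R600 worst case: %.0f GFLOPS\n",
           g80, r600_peak, r600_worst);      /* 346, 475, 95 */
    return 0;
}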
One thing to note here is that G80 will look better in theoretical shader benchmarks than in real game shaders, because games often have parallel instruction streams. Shadow mapping, lighting, base/detail textures and so on all appear together in games, whereas demos and benchmarks usually showcase just one of them.

While the scalar architecture helps a bit, my guess is that the texture filtering prowess is where G80 gets its real strength from. For 3D textures and anisotropic filtering, the GTX is 4x as fast as R600 per clock! I'm of the belief that graphics based on data can be far more realistic than those based on math, especially if we're talking about math that doesn't need lots of data.

Until someone builds a really fast GPU raytracer (and even then I'm not sure it's the best way forward), offline computation in some form or the other is the key to realistic graphics.
 
One thing to note here is that G80 will look better in theoretical shader benchmarks than in real game shaders, because games often have parallel instruction streams. Shadow mapping, lighting, base/detail textures and so on all appear together in games, whereas demos and benchmarks usually showcase just one of them.

Isn't that the case for all architectures? I thought G80's strength was that it could come closer to its best case numbers in real world scenarios.
 
Isn't that the case for all architectures? I thought G80's strength was that it could come closer to its best case numbers in real world scenarios.
Yes, that's the overall idea. Fewer bottlenecks, and less compiler / game designer wizardry needed.

Then again, if you go those extra miles, R600 can beat the G80 by a good margin. You could optimise for the G80 as well, but the difference between average and peak will be a lot smaller for that chip.

And improved drivers/compilers can make a difference for both, but again mostly for R600.

Edit: and they both like math much better than texture lookups, R600 especially.
 
Code:
struct vertex 
{ 
    float4 colorA     : color0; 
    float4 colorB     : color1; 
}; 
 
float4 main(vertex IN) : COLOR 
{ 
    float4 A = IN.colorA.xyzw; 
    float4 B = IN.colorB.xyzw; 
    return normalize((A+B)/(A*B));                 
}
that runs in 9 cycles on R580. How many on G80 and R600? What's the resulting utilisation in both? I'll prolly get a slap on the wrist from Bob for over-egging the SF/interpolations...
9 cycles on R580? Is that measured performance or theoretical? My calculations show 9 cycles on R520 and ~4 cycles on R580.
 
9 cycles on R580? Is that measured performance or theoretical? My calculations show 9 cycles on R520 and ~4 cycles on R580.
I wrote that inside GPU Shader Analyzer, so yes, it shows as 9 and 4 cycles respectively. I didn't want to confuse things by posting the latter. To be fair, I was being sloppy; I should have said "9 instruction slots", not "cycles". Sorry about that.

Hey, while you're here, is that Cg or HLSL I wrote?

Ooh, and, how many instruction slots in R600? Pretty please? Just curious...

Jawed
 
I wrote that inside GPU Shader Analyzer, so yes, it shows as 9 and 4 cycles respectively. I didn't want to confuse things by posting the latter. To be fair, I was being sloppy; I should have said "9 instruction slots", not "cycles". Sorry about that.

Hey, while you're here, is that Cg or HLSL I wrote?

Ooh, and, how many instruction slots in R600? Pretty please? Just curious...

Jawed
HLSL, and 9 or 10 clocks for R600......
 
How many games (or any apps) using GS will be launched in the next 12 months?
Last I heard, AMD wouldn't name even a single one after being asked directly.
Until such an app is released, the GS hardware in R600 is just a bunch of useless transistors.

Maybe this was already answered, but one such title should be Flight Simulator X with the DX10 patch - someone on the team working on it talked about how much faster R600 was than G80 with geometry shaders in the game.
As for the "useless transistors": the G80 and R600 both have unified shaders, and DX10 specifies a unified feature set (or whatever it should be called) for pixel, vertex and geometry shaders, so I fail to see where the "useless transistors" sit in this scheme.
 
Sorry, but this is complete nonsense.

UVD is not related to HDMI and HDMI audio - the HDMI and HDMI audio capabilities are a function of the HD 2900 XT. HDMI 1.2 is fully compatible with HD DVD and Blu-ray; HDMI 1.3 contains only optional specs, and at present the majority of HD users will miss very little from not having the functionality.

Hmmm, I don't want to be rude, but sorry too, I don't think it's nonsense. I think you misread me and/or did not understand what I said. I will try to be clearer here:

1/ I agree that UVD has nothing to do with HDMI, and I never said that. (Look, there is a full stop separating my sentences about HDMI and UVD, but if you prefer, next time I will put them on separate lines as clearly different points, like I'm doing now.)


2/ When I said that R600 has no UVD engine, I was referring to the full VC-1 / H.264 hardware engine present in RV610/630 but not in R600 - like PureVideo 2 in G84/86 but not in G80. This picture from AMD shows it:

[attached image: 15685.jpg]

3/ HDMI 1.2 is NOT fully compatible with Blu-ray and HD DVD. As you said, it lacks features, so it cannot be 100% compatible. From the official HDMI website:
http://www.hdmi.org/about/faq.asp#hdmi_1.3
Blu-ray and HD DVD offer the option to use the Dolby TrueHD and DTS-HD Master Audio lossless audio formats, which are only available if you have HDMI 1.3. With HDMI 1.2 you can still play these two lossless audio streams as PCM if your player and receiver are compatible - but with all this HDMI mess, nothing guarantees you that... The latest revision of HDMI also offers deeper color (10, 12 or 16 bits per component instead of only 8 bits for HDMI 1.2) and a broader color space (the IEC 61966-2-4 xvYCC standard, with no color space limitation). These limitations are not what I call "very little", as you said... Finally, there are other minor things added in HDMI 1.3, like lip sync and increased interface speed, but they are not as important as the ones above.
 
From my perspective the only remaining reason why I might want to buy an R600 card is its potential for use in an HTPC system (as I am specifically trying to put together a rig that can function both as a high(ish)-end gaming setup and also as a hi-def video player). So I am extremely keen to get some clarification on this "R600 doesn't contain UVD" question.

Could Rys or Dave Baumann (or anybody!) spell out exactly what R600 does or does not do on this front, and how that compares with what the advertising says it is supposed to do, or with what RV610 and RV630 actually will do?
 
R600 in the HTPC? Wish you lots of fun looking for the case which will swallow it and the PSU which will feed it enough power, let alone the cooling issues.
 
From my perspective the only remaining reason why I might want to buy an R600 card is its potential for use in an HTPC system (as I am specifically trying to put together a rig that can function both as a high(ish)-end gaming setup and also as a hi-def video player). So I am extremely keen to get some clarification on this "R600 doesn't contain UVD" question.

Could Rys or Dave Baumann (or anybody!) spell out exactly what R600 does or does not do on this front, and how that compares with what the advertising says it is supposed to do, or with what RV610 and RV630 actually will do?

Maybe because of the drivers, we cannot enable the hardware acceleration of the 2900XT, including H.264 and VC-1.
It is a pity the HD 2900 XT didn't add the UVD engine, but no consumer will choose a Sempron processor to pair with a 2900XT.
Link
I'm sure not too many users want to use their $400 VGA for hardware VC-1/H.264 acceleration when the CPU is more than enough :smile:
 
R600 in the HTPC? Wish you lots of fun looking for the case which will swallow it and the PSU which will feed it enough power, let alone the cooling issues.
With a single 2900XT card the entire system under load uses barely 400W. Even an R600 Crossfire rig only hits about 550W for the whole system. Are you suggesting I will have problems finding a PSU capable of producing 400W? :rolleyes:

As for a case, I think an Antec P182 will probably fit the bill.

vertex_shader said:
I'm sure not too many users want to use their $400 VGA for hardware VC-1/H.264 acceleration when the CPU is more than enough
Well, I dunno. One benchmark I saw over at Anandtech with a Core 2 Duo E6300 and an 8800GTX had the CPU hitting 90% utilisation during BluRay playback, and there's no way to be sure that the disc they were using is the most difficult BluRay disc that will ever be published. I'd like a bit more headroom than G80 offers.

Granted, I wouldn't be using an E6300 (a 6600 is more likely), but, given the amount of noise the video card cooler would be generating (even at 2D clockspeeds), it'd be nice to have the option of slightly underclocking the CPU during playback (so the CPU fan can spin down slowly) without risking dropped frames. One could then push the CPU clock up high for playing Oblivion. :)
 
Well, I dunno. One benchmark I saw over at Anandtech with a Core 2 Duo E6300 and an 8800GTX had the CPU hitting 90% utilisation during BluRay playback, and there's no way to be sure that the disc they were using is the most difficult BluRay disc that will ever be published. I'd like a bit more headroom than G80 offers.

Granted, I wouldn't be using an E6300 (a 6600 is more likely), but, given the amount of noise the video card cooler would be generating (even at 2D clockspeeds), it'd be nice to have the option of slightly underclocking the CPU during playback (so the CPU fan can spin down slowly) without risking dropped frames. One could then push the CPU clock up high for playing Oblivion. :)

Newer drivers could improve G80 results.

As far as I can see, both G80 and R600 offer quite similar offload abilities. And R600, if I'm not mistaken, uses its stream processors for CPU assist - which is hardly efficient from an energy point of view.

Zvekan
 
With a single 2900XT card the entire system under load uses barely 400W. Even an R600 Crossfire rig only hits about 550W for the whole system. Are you suggesting I will have problems finding a PSU capable of producing 400W? :rolleyes:

When you mention an HTPC, I think of an SFF and not a full-blown tower. And an SFF will hardly have enough room for the card, or a >400W power supply, or proper cooling for anything of the sort.

:rolleyes: right back at ya for your PSU "wisdom", too.
 