3D Technology & Architecture

"Fully scalar" would be a single execution pipeline that handles every instruction, with repeats for vector instructions.

Not sure I follow you there...why can't you have a dual-issuable scalar ALU?

As far as I can tell G80 can serialise its instruction issues in a disjoint fashion:

vec2 for 16 pixels
scalar for 32 pixels
vec2 for 16 pixels

where lines 1 and 3 are the same instruction in the program. Each of those 3 lines results in 4 clocks being consumed (the ALU is 8 pixels wide), i.e., each instruction is issued four times. That example doesn't show it, but I believe this is the mechanism by which it can optimise SF/interpolation, which would otherwise cause bubbles in the MAD ALU.
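Rough arithmetic, if I have the issue rate right (assuming that 8-pixel-wide ALU):

Code:
// Rough issue-rate arithmetic, assuming the 8-pixel-wide MAD array above.
const int lanes = 8;
const int clocks_vec2_16px   = 16 * 2 / lanes;   // 16 pixels x 2 components = 4 clocks
const int clocks_scalar_32px = 32 * 1 / lanes;   // 32 pixels x 1 component  = 4 clocks
// so each of the three issue lines consumes 4 clocks of the array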

Don't follow this either (sensing a trend here :oops: ). How would this help with SF scheduling?
 
So, clearly, there is no "bubble" or anything similar happening for LOG2, with these instructions coming nearly for "free" since the SFU pipeline (which is decoupled, just like the TMUs) was empty in the MADD-only case.

Why is the log2 free? In your code snippet there it looks like all 128 MADDs have to be executed before the first log2 is scheduled. Or was that confirmation that the main ALU and SFU pipelines are scheduled separately and can work on different batches in parallel?

Just getting all my dumb questions out in one go :)
 
I expect many of the DX10 titles to use GS at least to some extent. Especially since it's pretty much required for some stuff, like point sprites. If the game has particle systems, chances are they'll implement that with GS as that's the most logical solution.

So I hope AMD will improve DX10 particle performance in R650:

[attached benchmark charts: 080.gif, 092.gif]
 
Or was that confirmation that the main ALU and SFU pipelines are scheduled separately and can work on different batches in parallel?
Yeah, that's what it confirms. So G7x/R5xx/R6xx definitely would have a 'bubble' here (some ALUs would be left unused), but G8x does not.
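A rough way to picture the difference (just a toy model on my part, assuming the two pipelines really can be fed from different warps at the same time):

Code:
// Toy model of the difference (my simplification, not vendor data).
// Coupled:   the special-function work adds directly to the shader's cost.
// Decoupled: it can hide behind the MAD work, given enough warps in flight.
int shader_clocks(int mad_clocks, int sfu_clocks, bool decoupled)
{
    if (decoupled)
        return mad_clocks > sfu_clocks ? mad_clocks : sfu_clocks;
    return mad_clocks + sfu_clocks;
}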
 
Why would people still think shaders are R600's problem?

Xbit Labs ran two tests that use little texturing (Perlin noise and vertex shading), and R600 obliterated G80 in both.

[Xbit Labs chart: vertex shader test (xbitmark_vshader.gif)]

[Xbit Labs chart: Perlin noise test (06_perlin.gif)]

As expected, the Radeon HD 2900 XT takes first place in this test: with arithmetic instructions outnumbering texture lookups by a factor of 10, the R600 architecture feels right at home here.
 
Let's take your first shader, 128 MADs and 128 LOG2s, serially dependent, so that the first LOG2 has to wait until the last MAD has finished.

Let's issue a 4-warp batch (block). Looking inside a single ALU array we see, in 1 second:
  • 1350 million instruction clocks
  • no LOG2 can be issued until 8 pixels (that's the width of the array) have executed their 128 MADs
  • once those 128 clocks have passed, each pixel's LOG2 is co-issued against pixel+8's MAD (roughly speaking)
  • this continues for a long time, until...
  • the final 8 pixels execute their final 32 LOG2s. At this point there are no more MADs to issue. These LOG2s take 128 clocks, because they issue at 1/4 rate
  • The 128 pixels in the batch therefore consume: 4 warps * 32 pixels per warp * 128 instructions / 8 pixels in array = 2048 clocks PLUS 128 clocks of MAD that run without co-issue at the start PLUS 128 clocks of LOG2 that run without co-issue at the end = 2304 clocks
  • In 1350 million clocks, a 2304-clock batch of 128 pixels can be run 585937.5 times, which is 75 million pixels
  • 16 arrays all doing the same thing is 1200 million pixels per second.
There's your bubbles. If you test this under CUDA you can vary your block size and you should see a corresponding variation in performance, effectively caused by the granularity of the single-issue instructions. You should see a similar effect when you vary the ratio of MAD and LOG2 instructions.
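Something along these lines should do it (an untested sketch on my part; the kernel and launch parameters are just placeholders, not a tuned benchmark):

Code:
// Untested CUDA sketch of the experiment: 128 serially dependent MADs
// followed by 128 serially dependent LOG2s per thread.  Vary the block
// size and watch the throughput change with single-issue granularity.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void mad_then_log2(float *out, float a, float b)
{
    float x = threadIdx.x * 0.001f + 1.0f;
    #pragma unroll
    for (int i = 0; i < 128; ++i)
        x = x * a + b;            // MAD chain (each op depends on the last)
    #pragma unroll
    for (int i = 0; i < 128; ++i)
        x = __log2f(x) + 1.0f;    // LOG2 chain on the SFU, kept positive
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

int main()
{
    const int numThreads = 1 << 20;
    float *out;
    cudaMalloc(&out, numThreads * sizeof(float));

    for (int block = 32; block <= 512; block *= 2)
    {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        mad_then_log2<<<numThreads / block, block>>>(out, 1.0001f, 0.0001f);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("block %3d : %.3f ms\n", block, ms);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }
    cudaFree(out);
    return 0;
}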

Jawed
 
Why would people still think shaders are R600's problem?

Xbit Labs ran two tests that use little texturing (Perlin noise and vertex shading), and R600 obliterated G80 in both.

The problem is that shaders access textures. The vertex shading test has almost no texture access, and the Perlin noise test has very little. Shaders that access textures more often have lower efficiency on R600 (see ShaderMark and the 3DMark pixel shader test, for example), so I think texture access from within shaders is slow.
 
If you're having trouble visualising the instruction reordering then this might help. The code snippet, as ever:

[image: the code snippet (b3d83.jpg)]


And this is how I think it issues:

[image: issue order for the 32-pixel warp (b3d98.gif)]

Sorry for the length, but it's the only way to show a 32-pixel warp executing these instructions. The bunching-up of the MADs for all 32 pixels at the start seems to be the key to issuing the RCPs soon enough that they don't inflict a bubble on the main part of the ALU.

Jawed
 
Shaders that access textures more often have lower efficiency on R600 (see ShaderMark and the 3DMark pixel shader test, for example), so I think texture access from within shaders is slow.
For the tests referenced here it's more to do with the number of texture commands in relation to the number of ALU commands.
 
5) HD2900XT has no UVD engine. This new UVD is only for RV610/630, and HDMI is only 1.2, so it is not compatible with HD DVD and Blu-ray (for that you need HDMI 1.3). Without proper HDMI, this feature is IMHO useless...
Sorry, but this is complete nonsense.

UVD is not related to HDMI and HDMI audio - the HDMI and HDMI audio capabilities are a function of HD 2900 XT. HDMI 1.2 is fully compatible with HD DVD and Blu-ray, HDMI 1.3 contains only optional specs and at present the majority of HD users will miss very little from not having the functionality.
 
For the tests referenced here it's more to do with the number of texture commands in relation to the number of ALU commands.

Could you please clarify? That would be very helpful for me (I'm only a layman with a big passion for graphics cards :) ). Looking at the block diagrams, for texture addressing and point sampling R600 should theoretically be faster than G80 (both have 32 TA units, but R600 runs at higher frequencies), yet much slower at filtering. Talking about "commands" confuses me, though: I thought these shaders were slower due to a slow texture filtering rate, so what is really going on there?
Thank you in advance :)

Edit: OK, I understood: in many shaders the TEX:ALU instruction ratio is higher than the hardware's texture-unit to shader-unit throughput ratio, which makes them texture rate limited (or something close to it).
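In other words, roughly (my own back-of-the-envelope, with made-up numbers rather than real specs):

Code:
// Back-of-the-envelope bottleneck test, with placeholder numbers only.
// A shader ends up texture-rate limited when the clocks its texture
// instructions need exceed the clocks its ALU instructions need.
const float alu_instr  = 10.0f;   // ALU instructions per pixel (example)
const float tex_instr  = 4.0f;    // texture instructions per pixel (example)
const float alu_rate   = 20.0f;   // ALU instructions retired per clock (example)
const float tex_rate   = 4.0f;    // filtered texels per clock (example)
const float alu_clocks = alu_instr / alu_rate;   // 0.5 clocks per pixel
const float tex_clocks = tex_instr / tex_rate;   // 1.0 clocks per pixel
const bool  texture_limited = tex_clocks > alu_clocks;   // true here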
 
The 128 pixels in the batch therefore consume: 4 warps * 32 pixels per warp * 128 instructions / 8 pixels in array = 2048 clocks PLUS 128 clocks of MAD that run without co-issue at the start PLUS 128 clocks of LOG2 that run without co-issue at the end = 2304 clocks
You are assuming there is a pipeline flush, i.e. no other threads/batches/blocks running after/before/at the same time as these 128 pixels. That's a bit ridiculous, and if you had such a case, you'd run into other bottlenecks anyway. If you tried to do this in CUDA, you'd be overhead-limited btw, I'd presume.

Consider a real-world game instead: complete pipeline flushes only happen at the start of the frame, whenever you switch rendertarget, and whenever you end the frame. In most cases, between these various events, you might execute hundreds of thousands or millions of threads. If you want to consider pipeline flushes, then also consider how many stages there are in a modern GPU and how fast you can fill them. You should quickly realize that the inefficiency of what you are describing is fairly negligible compared to that!
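To put an illustrative number on "negligible" (reusing the 8-wide array and the 128 MAD + 128 LOG2 shader from your example, and assuming, purely for illustration, a million pixels between flushes with successive batches backing up behind one another, so the MAD-only lead-in of one batch overlaps the LOG2-only tail of the previous):

Code:
// Illustrative amortisation of the lead-in/tail cost, reusing the numbers
// from the example above and assuming (purely for illustration) 1,000,000
// pixels rendered between pipeline flushes.
const long pixels_between_flushes = 1000000L;
const long arrays                 = 16;
const long pixels_per_array       = pixels_between_flushes / arrays;  // 62,500
const long mad_clocks_per_array   = pixels_per_array * 128 / 8;       // 1,000,000
const long lead_in_and_tail       = 128 + 128;                        // paid once per flush, not per batch
// lead_in_and_tail / (mad_clocks_per_array + lead_in_and_tail) is roughly 0.03%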

P.S.: Also consider that attribute interpolation is probably more common than you think. The utilisation of the SFU pipeline in your diagram would obviously be higher if some of the operations used interpolated attributes as one of the operands, rather than registers.
 
You are assuming there is a pipeline flush, i.e. no other threads/batches/blocks running after/before/at the same time as these 128 pixels. That's a bit ridiculous, and if you had such a case, you'd run into other bottlenecks anyway.

Consider a real-world game: complete pipeline flushes only happen at the start of the frame, whenever you switch rendertarget, and whenever you end the frame. In most cases, between these various events, you might execute hundreds of thousands or millions of threads.

If you want to consider pipeline flushes, then also consider how many stages there are in a modern GPU and how fast you can fill them. You should quickly realize that the inefficiency of what you are describing is fairly negligible compared to that!
There's no pipeline flush there. Read more carefully.

Jawed
 
...complete pipeline flushes only happen at the start of the frame, whenever you switch rendertarget, and whenever you end the frame.
Wouldn't it even be possible to avoid pipeline flushes for these situations? It's probably not worth the complexity, but I'm just thinking theoretically.
 
In the current AMD 2900 specifications page :
http://ati.amd.com/products/radeonhd2900/specs.html

The integrated Xilleon HDTV encoder and HD decode acceleration for H.264, VC-1, DivX and other media formats are still specified.
So is AMD misleading users in its specifications, and these features are actually missing?
Or does it simply mean that the video processor in the HD 2900 XT is different from the one in the rest of the series?

I'm really not following your point. I think perhaps you're making unwarranted assumptions about what UVD brings to the table that's new and above/beyond what Avivo was already bringing.
 
I'm really not following your point. I think perhaps you're making unwarranted assumptions about what UVD brings to the table that's new and above/beyond what Avivo was already bringing.

There was another user saying there's no "UVD" in R600. Several reviews pointed out that it is fully present in R600, as in this example: http://www.anandtech.com/video/showdoc.aspx?i=2988&p=13

So I was only trying to understand what's behind both claims, because the official R600 specifications do indeed point to H.264 and VC-1 acceleration, but Dave Baumann's answer seems to indicate that the current R600 implementation does not fully offload decoding, as the RV630 and RV610 boards presumably will.
 