Jawed: Here's a little shader I made for you that runs at ~1200MPixels/s on my G80...
Please note that this is with the 101.41 drivers, because those are the only ones exposing the MUL (at least under OpenGL; I didn't check some of the 158.xx series under D3D10, but I'm pretty sure Rys did) and I'm not going to go through three zillion reboots just now, hehe.
struct PS_Output {
    float4 color : COLOR;
};

PS_Output ps(uniform sampler2D mytexture : TEX0)
{
    PS_Output OUT;
    float4 texCoord = 0.1f;
    float scalar = tex2D(mytexture, texCoord.xy).x;

    // 128 x MADD
    scalar = scalar*scalar+scalar;
    scalar = scalar*scalar+scalar;
    scalar = scalar*scalar+scalar;
    ...

    // 32 x LOG2
    scalar = log2(scalar);
    scalar = log2(scalar);
    scalar = log2(scalar);
    ...

    OUT.color = scalar.xxxx;
    return OUT;
}
Now, interestingly, with sin() instead of log2(), I get ~1000MPixels/s. And removing the SFs completely, I get ~1300MPixels/s (96% of peak). Without the MADDs, I get ~1340MPixels/s (99% of peak) with either sin() or log2().
So, clearly, there is no "bubble" or anything similar happening for LOG2: those instructions come nearly for "free", since the SFU pipeline (which is decoupled, just like the TMUs) was empty in the MADD-only case. In the sin() case, however, performance goes down. To see whether that was caused by a "bubble", I retried with 32x(4xMADD+1xSF), with every instruction dependent on the previous one. I got ~1150MPixels/s with LOG2 (lower!) and, with SIN, ~1000MPixels/s again. Doing this 16x with a Vec2 instead (-> independent instructions...) gave practically identical scores.
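As a sanity check on those percentages, here's the back-of-the-envelope arithmetic. This sketch assumes the commonly quoted G80 configuration of 128 scalar ALUs at a 1.35GHz shader clock; those figures are my assumption for the calculation, not something measured by the shader itself:

```python
# Theoretical peak for the 128-MADD shader, assuming a G80 with
# 128 scalar MADD ALUs running at a 1.35GHz shader clock (assumed, not measured).
ALUS = 128
SHADER_CLOCK = 1.35e9  # Hz

madds_per_pixel = 128
peak_mpixels = ALUS * SHADER_CLOCK / madds_per_pixel / 1e6
print(round(peak_mpixels))               # 1350 MPixels/s theoretical peak

# The measured figures from the post, as fractions of that peak:
print(round(100 * 1300 / peak_mpixels))  # 96 -> "96% of peak" (no SFs)
print(round(100 * 1340 / peak_mpixels))  # 99 -> "99% of peak" (no MADDs)
```

Note that the "no MADDs" case hitting the same ~1350 MPixels/s ceiling is consistent with the SFUs issuing at a quarter of ALU rate (32 SF ops/clock for 32 SF instructions per pixel), if that assumption holds.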
Part, if not all, of the difference between LOG2 and SIN can most likely be explained by the fact that the latter abuses the MADD unit to put the value in range, as explained in the following diagram from Stuart Oberman's FPU patent:
http://www.notforidiots.com/G80/ALUOps.png. Also, as per Bob's suggestion (we had already tested something quite similar back in November, I think, or at least reached the same conclusion another way):
struct PS_Output {
    float4 color : COLOR;
};

PS_Output ps(uniform sampler2D mytexture : TEX0,
             float4 texCoord1 : TEXCOORD0,
             ...
             float4 texCoord8 : TEXCOORD11)
{
    PS_Output OUT;
    float4 texCoord = 0.1f;
    float scalar = tex2D(mytexture, texCoord.xy).x;

    // 32 x MADD with Attribute Interpolation
    scalar = scalar*scalar+texCoord1.x;
    scalar = scalar*scalar+texCoord1.y;
    scalar = scalar*scalar+texCoord1.z;
    scalar = scalar*scalar+texCoord1.w;
    ...
    scalar = scalar*scalar+texCoord8.x;
    scalar = scalar*scalar+texCoord8.y;
    scalar = scalar*scalar+texCoord8.z;
    scalar = scalar*scalar+texCoord8.w;

    OUT.color = scalar.xxxx;
    return OUT;
}
This runs at ~4000MPixels/s, which is 74% of peak. I'd have expected higher, but perhaps the fact that there *only* are dependent instructions (each needing results from BOTH the SFU and the ALU!) exposes a latency-related bottleneck: there aren't enough warps/threads in flight to hide the ALU's and SFU's own latencies. That's just a theory, though. It's still quite a bit higher than the theoretical 20% peak of R600 anyway, not that such a worst-case scenario matters much...
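Same sanity check for this variant, under the same assumed configuration (128 scalar ALUs at 1.35GHz; assumed figures, not measurements):

```python
# Theoretical peak for the 32-MADD-with-interpolation shader,
# again assuming 128 scalar ALUs at a 1.35GHz shader clock.
ALUS = 128
SHADER_CLOCK = 1.35e9  # Hz

madds_per_pixel = 32
peak_mpixels = ALUS * SHADER_CLOCK / madds_per_pixel / 1e6
print(round(peak_mpixels))               # 5400 MPixels/s theoretical peak
print(round(100 * 4000 / peak_mpixels))  # 74 -> matches the "74% of peak" figure
```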
And finally, the RCP for perspective correction is indeed done once at the beginning of the shader. This can easily be verified by writing an SFU-limited program that uses one interpolated attribute and another, otherwise identical program that does not. The former will take 5 extra clocks (4 for the RCP, 1 for the interpolation).
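The verification logic above can be written down as a tiny model. Everything here is hypothetical bookkeeping to illustrate where the 5-clock delta comes from, not measured data; the per-SF-op issue cost of 1 clock is an assumption:

```python
# Hypothetical per-pixel SFU clock model for the two test programs.
RCP_CLOCKS = 4     # one-time RCP for perspective correction (per the post)
INTERP_CLOCKS = 1  # one attribute interpolation on the SFU (per the post)

def sfu_clocks(n_sf_ops, uses_attribute):
    """SFU clocks for a program with n_sf_ops special-function instructions,
    assuming 1 issue clock per SF op (assumption) plus the one-time
    RCP + interpolation cost if any interpolated attribute is read."""
    extra = (RCP_CLOCKS + INTERP_CLOCKS) if uses_attribute else 0
    return n_sf_ops + extra

# The attribute-using program takes exactly 5 extra clocks:
print(sfu_clocks(32, True) - sfu_clocks(32, False))  # 5
```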
As for the MUL, that information will have to be published at another date, probably in an article, so stay tuned... There still are some things I'd like to understand better first though, sigh...
EDIT: Added info & link to the patent diagram showcasing the FPU abuse for SIN.