Typical GPU Efficiency

nAo · Jun 16, 2005

Jawed said:
So the reason that pipeline efficiency falls off in NV40 is due to dependency in code.
Jawed

New mantra, old assertion. You have still to give some proof of this statement, as the previous one was obviously broken.
I'm not saying there are no dependency problems, but the hw just needs (most of the time) one or two extra non-dependent instructions to insert per clock in order to use the second ALU.

KimB · Jun 16, 2005

Yeah, it's something I've been talking about for a long while, actually. The earliest post I could find where I actually described why a unified architecture could be more efficient was from July of last year:
http://www.beyond3d.com/forum/viewtopic.php?t=14250&start=12

But unified pipelines had been talked about for a good year or so before that.

Jawed · Jun 16, 2005

nAo said:
I'm not saying there are no dependency problems, but the hw just needs (most of the time) one or two extra non-dependent instructions to insert per clock in order to use the second ALU.

And we still come back to the problem that the real-world code snippet, the longest shader from FarCry, that I quoted earlier in the thread, is running at 2.2 instructions per clock, not 6, nor 5, nor even 4.

Jawed

nAo · Jun 16, 2005

Jawed said:
And we still come back to the problem that the real-world code snippet, the longest shader from FarCry, that I quoted earlier in the thread, is running at 2.2 instructions per clock, not 6, nor 5, nor even 4.
Jawed

Jawed, you're making the same mistake again. You agreed with me that it doesn't make any sense to count total instructions and average instructions per clock.
You should count total flops and average flops per clock cycle.
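
To illustrate the difference between the two metrics, here is a small sketch with made-up numbers. The two instruction traces are hypothetical (they are not the FarCry shader quoted earlier in the thread), and the convention of counting a MAD as 2 flops per component is an assumption:

```python
# Hypothetical traces: (opcode, components written, clock the instruction issues on).
# All values are invented for illustration only.
FLOPS_PER_COMPONENT = {"MUL": 1, "ADD": 1, "MAD": 2}  # assumed convention

def summarize(trace):
    """Return (instructions per clock, flops per clock) for a trace."""
    clocks = max(clock for _, _, clock in trace) + 1
    total_flops = sum(FLOPS_PER_COMPONENT[op] * width for op, width, _ in trace)
    return len(trace) / clocks, total_flops / clocks

# Two 4-wide instructions per clock.
wide = [("MAD", 4, 0), ("MUL", 4, 0), ("MAD", 4, 1), ("MUL", 4, 1)]

# Four scalar instructions per clock.
narrow = [("MAD", 1, 0), ("MUL", 1, 0), ("ADD", 1, 0), ("MUL", 1, 0),
          ("MAD", 1, 1), ("MUL", 1, 1), ("ADD", 1, 1), ("MUL", 1, 1)]

print(summarize(wide))    # fewer instructions per clock, more flops per clock
print(summarize(narrow))  # more instructions per clock, fewer flops per clock
```

In this toy case the trace with the lower instruction count per clock does more arithmetic per clock, which is why an average instructions-per-clock figure alone says little about ALU utilisation.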

Jawed · Jun 16, 2005

bloodbob, why not post the quote in full:

Rather than separate pixel and vertex pipelines, we've created a single unified pipeline that can do both. Providing developers throw instructions at our architecture in the right way, Xenos can run at 100% efficiency all the time, rather than having some pipeline instructions waiting for others. For comparison, most high-end PC chips run at 50-60% typical efficiency. The super cool point is that 'in the right way' just means 'give us plenty of work to do'. The hardware manages itself.

Jawed

Xmas · Jun 16, 2005

Jawed said:
So the reason that pipeline efficiency falls off in NV40 is due to dependency in code. e.g. the MUL and MAD can only operate in one clock if there's no dependency between them.

No, actually it's the other way round. NV40 is sometimes register read limited, therefore it's better to feed the MAD with results from the MUL.

bloodbob · Jun 16, 2005

Jawed said:
bloodbob, why not post the quote in full
Jawed

Because the rest of it was irrelevant to my point and would just take up space on the forum, and I would have hoped you and the other viewers would have bothered to click the link in the first post?

Jawed · Jun 16, 2005

Xmas said:
Jawed said:

So the reason that pipeline efficiency falls off in NV40 is due to dependency in code. e.g. the MUL and MAD can only operate in one clock if there's no dependency between them.


No, actually it's the other way round. NV40 is sometimes register read limited, therefore it's better to feed the MAD with results from the MUL.

So I think what you're saying is that a MUL r0, r1, r2 and a non-dependent MAD r3, r4, r5, r6 chokes NV40's ability to read registers, so if you can serialise these two instructions by making the MAD dependent on the MUL you avoid the register-choke.
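
A toy model of that reading, counting register-file reads for two instructions issued together. The forwarding rule (a result produced in the same group costs no register read) and the choice to count each distinct register once are assumptions made for illustration; the thread does not state NV40's actual read-port budget, so none is modelled here:

```python
def register_reads(instructions):
    """Count distinct register-file reads for instructions issued together.

    Each instruction is (dest, sources). A source written by an earlier
    instruction in the same group is assumed to be forwarded directly,
    so it does not cost a register-file read (an assumption of this sketch).
    """
    produced = set()
    reads = set()
    for dest, sources in instructions:
        for src in sources:
            if src not in produced:
                reads.add(src)
        produced.add(dest)
    return len(reads)

# MUL r0, r1, r2 followed by a non-dependent MAD r3, r4, r5, r6
independent = [("r0", ("r1", "r2")), ("r3", ("r4", "r5", "r6"))]

# MUL r0, r1, r2 followed by a MAD that consumes the MUL result
dependent = [("r0", ("r1", "r2")), ("r3", ("r0", "r4", "r5"))]

print(register_reads(independent), register_reads(dependent))
```

Under these assumptions the dependent pair needs one fewer register read than the independent pair, which is the direction of the effect being described.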

Jawed

Geo · Jun 16, 2005

Rather than separate pixel and vertex pipelines, we've created a single unified pipeline that can do both. Providing developers throw instructions at our architecture in the right way, Xenos can run at 100% efficiency all the time, rather than having some pipeline instructions waiting for others. For comparison, most high-end PC chips run at 50-60% typical efficiency. The super cool point is that 'in the right way' just means 'give us plenty of work to do'. The hardware manages itself.

Okay, so then what are the scenarios when "the wrong way" is in play that reduce efficiency? This seems to be "when there isn't plenty of work to do", but what does that mean really, and how much does it happen? Is that a legacy game non-shader enabled point and nothing more, or is there more complexity hiding in there?

It *sounds* like it ought to be irrelevant: even if technically you are less efficient in those scenarios for those units, those units by definition aren't going to be the performance bottleneck in those scenarios (which is the point in the first place, right?)

Xmas · Jun 16, 2005

Yes, Jawed, that's what I meant

GPU Gems chapter 30 linked by nAo said:
Similarly, the register file has enough read and write bandwidth to keep all the units
busy if reading fp16×4 values, but it may run out of bandwidth to feed all units if
using fp32×4 values exclusively.

Jawed · Jun 16, 2005

That 3DCenter article on NV40 is excellent.

One thing is for sure, the complexity of NV40 makes it very hard to identify where the bottlenecks are. You're stuck with benchmarks and hoping that the driver compiler wants to cooperate.

So, ahem, we're still in the dark about current GPU efficiency

Jawed

ERP · Jun 16, 2005

You're kind of missing the point. The efficiency that ATI is really talking about is how often either the pixel shader is completely idle (i.e. vertex bound) or the vertex shader is idle (pixel bound).

It will vary by application, and you won't be able to derive it, only measure it.

NV2A has a number of performance counters, but it's not enough to measure both ends of the pipe effectively. You can measure when you are vertex bound, but not trivially when you are pixel bound.

If I was guessing from the benchmarks I've run, I don't think ATI are miles off on the 40%-50% idle estimate. It's one of the things that intrigues me about Xenos: how will it perform relative to a non-unified architecture?
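
A rough way to picture that kind of idle time is the sketch below. It is a simplified model, not a measurement: the unit counts and workloads are invented, it assumes the vertex and pixel stages run fully overlapped, and it treats every unit as interchangeable in cost:

```python
def split_utilization(vertex_work, pixel_work, vertex_units, pixel_units):
    """Fraction of ALU capacity used when the unit split is fixed.

    Work is in arbitrary unit-cycles. The frame takes as long as the
    slower of the two stages; the other stage sits idle for the rest.
    """
    time = max(vertex_work / vertex_units, pixel_work / pixel_units)
    return (vertex_work + pixel_work) / ((vertex_units + pixel_units) * time)

# Hypothetical split of 6 vertex units and 16 pixel units.
balanced = split_utilization(60, 160, 6, 16)      # work matches the split
pixel_bound = split_utilization(6, 320, 6, 16)    # vertex units mostly idle
vertex_bound = split_utilization(120, 16, 6, 16)  # pixel units mostly idle

print(balanced, pixel_bound, vertex_bound)
```

In this model an idealised unified design would always sit at 1.0, because any unit can take either kind of work; real hardware would lose some of that to scheduling and other overheads that the sketch ignores.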

nAo · Jun 16, 2005

ERP said:
If I was guessing from the benchmarks I've run, I don't think ATI are miles off on the 40%-50% idle estimate. It's one of the things that intrigues me about Xenos: how will it perform relative to a non-unified architecture?

I'm also very intrigued to know how Xenos will perform against a non-unified architecture.
RSX would probably have more ALUs than Xenos (I believe the pixel pipelines alone on RSX will have the same raw power as all of Xenos's ALUs, and RSX has a clock advantage too), but if Xenos is much more efficient than a 'standard' GPU (and I've no doubt current GPUs are very inefficient) it could be way faster than RSX at shading.

Jawed · Jun 16, 2005

From:

Windowing on a 3D Pipeline

Page 11:

Peak FP Performance
- Vertex Engine (FP32)
--- 6*5*2 * 400 MHz = 24 GFlops

- Pixel Engine (FP32)
--- 16*4*3 * 400 MHz = 76 GFlops

- Texture Math Engine (FP16)
--- 16*4*6 * 400 MHz = 154 GFlops

- FP Blend (FP16)
--- 16*4*3 * 550 MHz = 106 GFlops

Total = 260 FP16 & 92 FP32 GFlops

So, even with the vertex engine "doing nothing" while the GPU waits for the current pixel batch to complete, and only counting pixel ops, NV40 would be capable of running at 76% efficiency, peak.
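
The arithmetic behind those figures can be checked with a short sketch. The parameter names (units, lanes, flops per lane) are an interpretation of the three factors on each line of the slide, not labels taken from it:

```python
def peak_gflops(units, lanes, flops_per_lane, mhz):
    """Peak GFlops as units * lanes * flops per lane per clock * clock (MHz)."""
    return units * lanes * flops_per_lane * mhz / 1000.0

vertex = peak_gflops(6, 5, 2, 400)     # Vertex Engine (FP32)
pixel = peak_gflops(16, 4, 3, 400)     # Pixel Engine (FP32)
texture = peak_gflops(16, 4, 6, 400)   # Texture Math Engine (FP16)
blend = peak_gflops(16, 4, 3, 550)     # FP Blend (FP16)

fp16_total = texture + blend
fp32_sum = vertex + pixel
pixel_share = pixel / fp32_sum

print(vertex, pixel, texture, blend)
print(fp16_total, fp32_sum, round(pixel_share, 3))
```

The FP16 lines add up to roughly the 260 GFlops quoted. The two FP32 lines as listed add up to about 100 rather than 92; the reason for that difference is not clear from the slide text. The 76% figure appears to be the pixel engine's share of that vertex-plus-pixel sum.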

Jawed

psurge · Jun 16, 2005

Regarding Xenos: if a group contains 64 pixels/verts, 4 per ALU, then the AOS style registers exposed to the programmer can be transformed into SOA style HW registers. If you are doing lots of ops on parts of registers, this approach should give you pretty much perfect execution unit usage without having to support co-issue (at the cost of operating on more data elements in parallel, e.g. register space).
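
A minimal sketch of the AOS to SOA idea, using a group of 4 pixels instead of 64 to keep it short. The function names are hypothetical and this only illustrates the data layout, not how Xenos actually stores its registers:

```python
def aos_to_soa(pixels):
    """Transpose [(x, y, z, w), ...] into one list per component."""
    xs, ys, zs, ws = zip(*pixels)
    return {"x": list(xs), "y": list(ys), "z": list(zs), "w": list(ws)}

def scale_x(soa, k):
    """A scalar op on .x only.

    In AOS form this would use 1 of 4 lanes of a vec4 ALU. In SOA form
    each lane holds .x of a different pixel, so all lanes do useful work.
    """
    return [k * x for x in soa["x"]]

# One register as the programmer sees it (AOS): an (x, y, z, w) tuple per pixel.
group = [(1.0, 2.0, 3.0, 4.0), (5.0, 6.0, 7.0, 8.0),
         (9.0, 10.0, 11.0, 12.0), (13.0, 14.0, 15.0, 16.0)]

soa = aos_to_soa(group)
print(soa["x"])
print(scale_x(soa, 2.0))
```

The cost mentioned above shows up here too: the SOA form only pays off when several pixels are kept in flight together, so the register storage scales with the group size.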

_xxx_ · Jun 16, 2005

Jawed said:
That 3DCenter article on NV40 is excellent.

Pretty much all of their articles are. "AthlonXP vs. Athlon64" is a definite must-read.

nAo · Jun 17, 2005

psurge said:
(at the cost of operating on more data elements in parallel, e.g. register space).

Unfortunately I think this is a huge cost.

KimB · Jun 17, 2005

Up to a point, supporting more instructions within a single pipeline makes sense. It's clearly going to be less expensive in terms of transistors to do this than to add additional pipelines. But it's going to be less efficient, too. So there's got to be an optimal amount of operations within a single pipeline (which will depend upon a large number of factors, not least of which is the actual makeup of the pipelines), to get the highest performance out of the same transistor count, and I doubt it's just one instruction allowed per clock per pipeline.

psurge · Jun 17, 2005

nAo, I think you're probably right. Plus branching performance would be worse.

On the other hand, wouldn't it simplify the execution units and instruction set? You could get rid of the crossbars and instruction bits for swizzling, and, assuming your basic unit is a madd, you wouldn't need to organize adders in a tree after the multipliers (for dot products). Anyway, just thinking out loud, I have no idea if those things are significant or not...
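
To make the dot product point concrete, here is a sketch of a 3-component dot product done in SOA form with nothing but a multiply and two multiply-adds per lane. The function name and data are hypothetical:

```python
def dp3_soa(a, b):
    """Per-pixel dot(a.xyz, b.xyz) with a and b stored as SOA dicts of lists.

    Each step is one scalar op applied across all pixels in the group,
    so no adder tree is needed after the multipliers.
    """
    acc = [ax * bx for ax, bx in zip(a["x"], b["x"])]              # MUL
    acc = [ay * by + s for ay, by, s in zip(a["y"], b["y"], acc)]  # MAD
    acc = [az * bz + s for az, bz, s in zip(a["z"], b["z"], acc)]  # MAD
    return acc

# Two pixels, one list per component.
a = {"x": [1.0, 2.0], "y": [0.0, 1.0], "z": [3.0, 0.5]}
b = {"x": [4.0, 1.0], "y": [5.0, 2.0], "z": [1.0, 4.0]}

print(dp3_soa(a, b))
```

The trade-off is that the dot product takes three issue slots instead of one, but each slot processes a full group of pixels and no swizzle is needed to line up the components.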

bloodbob · Jun 17, 2005

Jawed said:
So, even with the vertex engine "doing nothing" while the GPU waits for the current pixel batch to complete, and only counting pixel ops, NV40 would be capable of running at 76% efficiency, peak.
Jawed

Who says ATI is using flops to measure the performance?
ATI give no detailed analysis, frankly because this is a FUD campaign and we should all treat it as such.
