Typical GPU Efficiency

Jawed said:
So the reason that pipeline efficiency falls off in NV40 is due to dependency in code.
Jawed
New mantra, old assertion. You have yet to give any proof of this statement, as the previous one was obviously broken.
I'm not saying there are no dependency problems, but the hw just needs (most of the time) one or two extra non-dependent instructions per clock in order to use the second ALU ;)
 
nAo said:
I'm not saying there are no dependency problems, but the hw just needs (most of the time) one or two extra non-dependent instructions per clock in order to use the second ALU ;)

And we still come back to the problem that the real-world code snippet, the longest shader from FarCry, that I quoted earlier in the thread, is running at 2.2 instructions per clock, not 6, nor 5, nor even 4.

Jawed
 
Jawed said:
And we still come back to the problem that the real-world code snippet, the longest shader from FarCry, that I quoted earlier in the thread, is running at 2.2 instructions per clock, not 6, nor 5, nor even 4.
Jawed
Jawed, you're making the same mistake again. You agreed with me that it doesn't make any sense to count total instructions and average instructions per clock.
You should count total flops and average flops per clock cycle.
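
To illustrate the difference between the two metrics, here is a minimal sketch. The per-op flop costs, the two snippets, and the clock counts are all made-up assumptions for illustration, not measurements from NV40 or from the FarCry shader.

```python
# Flops per component for a few common shader ops (illustrative assumption).
FLOPS_PER_COMPONENT = {"MUL": 1, "ADD": 1, "MAD": 2}


def flops(op, width):
    """Return the flops for one instruction operating on `width` components."""
    if op == "DP3":
        return 5  # 3 multiplies + 2 adds, producing a scalar
    return FLOPS_PER_COMPONENT[op] * width


def rates(instructions, clocks):
    """Return (instructions per clock, flops per clock)."""
    total_flops = sum(flops(op, width) for op, width in instructions)
    return len(instructions) / clocks, total_flops / clocks


# Two hypothetical 4-instruction snippets, each assumed to take 2 clocks.
scalar_heavy = [("MUL", 1), ("ADD", 1), ("MUL", 1), ("ADD", 1)]
vector_heavy = [("MAD", 4), ("MAD", 4), ("MAD", 4), ("DP3", 3)]

print(rates(scalar_heavy, 2))
print(rates(vector_heavy, 2))
```

Both snippets come out at the same instructions per clock, yet the flops per clock differ by a large factor, which is why an instruction count alone says little about how busy the ALUs are.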
 
bloodbob, why not post the quote in full:

Rather than separate pixel and vertex pipelines, we’ve created a single unified pipeline that can do both. Providing developers throw instructions at our architecture in the right way, Xenos can run at 100% efficiency all the time, rather than having some pipeline instructions waiting for others. For comparison, most high-end PC chips run at 50-60% typical efficiency. The super cool point is that ‘in the right way’ just means ‘give us plenty of work to do’. The hardware manages itself.

Jawed
 
Jawed said:
So the reason that pipeline efficiency falls off in NV40 is due to dependency in code. e.g. the MUL and MAD can only operate in one clock if there's no dependency between them.
No, actually it's the other way round. NV40 is sometimes register read limited, therefore it's better to feed the MAD with results from the MUL.
 
Jawed said:
bloodbob, why not post the quote in full
Jawed
Because the rest of it was irrelevant to my point and would just take up space on the forum, and I would have hoped you and the other viewers would have bothered to click the link in the first post.
 
Xmas said:
Jawed said:
So the reason that pipeline efficiency falls off in NV40 is due to dependency in code. e.g. the MUL and MAD can only operate in one clock if there's no dependency between them.
No, actually it's the other way round. NV40 is sometimes register read limited, therefore it's better to feed the MAD with results from the MUL.

So I think what you're saying is that a MUL r0, r1, r2 and a non-dependent MAD r3, r4, r5, r6 chokes NV40's ability to read registers, so if you can serialise these two instructions by making the MAD dependent on the MUL you avoid the register-choke.
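
A rough way to picture this is to count register-file reads for the pair. This is only a toy model: it assumes that a source operand matching the previous instruction's destination is forwarded and costs no register-file read. That forwarding rule is an assumption for illustration, not a documented NV40 behaviour, and the register names are arbitrary.

```python
def register_file_reads(pair):
    """Count register-file reads for a (first, second) instruction pair.

    Each instruction is (destination, [sources]). A source of the second
    instruction that equals the first instruction's destination is assumed
    to be forwarded, so it is not counted as a register-file read.
    """
    (dst1, srcs1), (_dst2, srcs2) = pair
    reads = len(srcs1)
    reads += sum(1 for s in srcs2 if s != dst1)
    return reads


# MUL r0, r1, r2 followed by an independent MAD r3, r4, r5, r6
independent = (("r0", ["r1", "r2"]), ("r3", ["r4", "r5", "r6"]))
# MUL r0, r1, r2 followed by a dependent MAD r3, r0, r5, r6
dependent = (("r0", ["r1", "r2"]), ("r3", ["r0", "r5", "r6"]))

print(register_file_reads(independent))
print(register_file_reads(dependent))
```

Under that assumption the dependent pair needs one fewer register-file read than the independent pair, which is the effect being described.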

Jawed
 
Rather than separate pixel and vertex pipelines, we’ve created a single unified pipeline that can do both. Providing developers throw instructions at our architecture in the right way, Xenos can run at 100% efficiency all the time, rather than having some pipeline instructions waiting for others. For comparison, most high-end PC chips run at 50-60% typical efficiency. The super cool point is that ‘in the right way’ just means ‘give us plenty of work to do’. The hardware manages itself.

Okay, so then what are the scenarios when "the wrong way" is in play that reduce efficiency? This seems to be "when there isn't plenty of work to do", but what does that mean really, and how much does it happen? Is that only an issue for legacy, non-shader-enabled games, or is there more complexity hiding in there?

It *sounds* like it ought to be irrelevant --even if technically you are less efficient in those scenarios for those units, those units by definition aren't going to be the performance bottleneck in those scenarios (which is the point in the first place, right?)
 
Yes, Jawed, that's what I meant

GPU Gems chapter 30 linked by nAo said:
Similarly, the register file has enough read and write bandwidth to keep all the units busy if reading fp16×4 values, but it may run out of bandwidth to feed all units if using fp32×4 values exclusively.
 
That 3DCenter article on NV40 is excellent.

One thing is for sure, the complexity of NV40 makes it very hard to identify where the bottlenecks are. You're stuck with benchmarks and hoping that the driver compiler wants to cooperate.

So, ahem, we're still in the dark about current GPU efficiency :LOL:

Jawed
 
You're kind of missing the point: the efficiency that ATI is really talking about is how often either the pixel shader is completely idle (i.e. vertex bound) or the vertex shader is idle (i.e. pixel bound).

It will vary by application, and you won't be able to derive it, only measure it.

NV2A has a number of performance counters, but it's not enough to measure both ends of the pipe effectively. You can measure when you are vertex bound, but not trivially when you are pixel bound.

If I was guessing from the benchmarks I've run, I don't think ATI are miles off on the 40%-50% idle estimate. It's one of the things that intrigues me about Xenos: how will it perform relative to a non-unified architecture?
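
As a toy illustration of vertex-bound versus pixel-bound idling, the sketch below models a split design where vertex and pixel units run concurrently and the frame takes as long as the slower side. The unit counts and workloads are invented numbers, and the unified case is idealised (no scheduling overhead), so this only shows the shape of the argument, not real figures.

```python
def split_utilisation(vertex_work, pixel_work, vertex_units, pixel_units):
    """ALU utilisation when vertex and pixel units are separate.

    Work is in arbitrary unit-clocks. The frame time is set by whichever
    side is the bottleneck; the other side idles for the remainder.
    """
    frame_time = max(vertex_work / vertex_units, pixel_work / pixel_units)
    total_units = vertex_units + pixel_units
    return (vertex_work + pixel_work) / (total_units * frame_time)


def unified_utilisation(vertex_work, pixel_work, units):
    """Idealised unified design: any unit can take any work."""
    frame_time = (vertex_work + pixel_work) / units
    return (vertex_work + pixel_work) / (units * frame_time)


# Hypothetical chip with 4 vertex units and 12 pixel units.
print(split_utilisation(40, 24, 4, 12))   # vertex-bound frame
print(split_utilisation(8, 96, 4, 12))    # pixel-bound frame
print(unified_utilisation(40, 24, 16))    # same work, unified units
```

In this model the split design only reaches full utilisation when the vertex-to-pixel work ratio happens to match the unit ratio, which is why the figure has to be measured per application.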
 
ERP said:
If I was guessing from the benchmarks I've run, I don't think ATI are miles off on the 40%-50% idle estimate. It's one of the things that intrigues me about Xenos: how will it perform relative to a non-unified architecture?
I'm also very intrigued to know how Xenos will perform against a non-unified architecture.
RSX would probably have more ALUs than Xenos (I believe the pixel pipelines alone on RSX will have the same raw power as all of Xenos' ALUs, and RSX has a clock advantage too), but if Xenos is much more efficient than a 'standard' GPU (and I've no doubt current GPUs are very inefficient) it could be way faster than RSX at shading.
 
From:

Windowing on a 3D Pipeline

Page 11:

Peak FP Performance
- Vertex Engine (FP32)
--- 6*5*2 * 400 MHz = 24 GFlops

- Pixel Engine (FP32)
--- 16*4*3 * 400 MHz = 76 GFlops

- Texture Math Engine (FP16)
--- 16*4*6 * 400 MHz = 154 GFlops

- FP Blend (FP16)
--- 16*4*3 * 550 MHz = 106 GFlops

Total = 260 FP16 & 92 FP32 GFlops

So, even with the vertex engine "doing nothing" while the GPU waits for the current pixel batch to complete, and only counting pixel ops, NV40 would be capable of running at 76% efficiency, peak.
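
The arithmetic can be checked with a short sketch. The factors are taken straight from the slide; what each factor stands for (units, components, ops per clock) is not spelled out there, so the code just multiplies them. Note that the two FP32 lines as listed sum to about 101 GFlops rather than the 92 quoted in the total, and the 76% figure corresponds to the pixel engine's share of that sum.

```python
import math


def peak_gflops(factors, mhz):
    """Multiply the slide's factors by the clock (MHz) and convert to GFlops."""
    return math.prod(factors) * mhz / 1000.0


vertex = peak_gflops((6, 5, 2), 400)    # Vertex Engine (FP32)
pixel = peak_gflops((16, 4, 3), 400)    # Pixel Engine (FP32)
texture = peak_gflops((16, 4, 6), 400)  # Texture Math Engine (FP16)
blend = peak_gflops((16, 4, 3), 550)    # FP Blend (FP16)

# Share of FP32 peak that the pixel engine alone accounts for.
pixel_share = pixel / (vertex + pixel)

print(vertex, pixel, texture, blend)
print(round(pixel_share * 100), "%")
```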

Jawed
 
Regarding Xenos: if a group contains 64 pixels/verts, 4 per ALU, then the AOS style registers exposed to the programmer can be transformed into SOA style HW registers. If you are doing lots of ops on parts of registers, this approach should give you pretty much perfect execution unit usage without having to support co-issue (at the cost of operating on more data elements in parallel, i.e. more register space).
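
A minimal sketch of the AOS-to-SOA idea, assuming a 4-lane ALU and a group of 4 pixels per ALU. The function names and the utilisation model are hypothetical and only meant to show why partial-register ops stop wasting lanes; they say nothing about how Xenos actually lays out its registers.

```python
def aos_to_soa(registers):
    """Transpose per-pixel (x, y, z, w) registers into per-component rows."""
    xs, ys, zs, ws = zip(*registers)
    return {"x": list(xs), "y": list(ys), "z": list(zs), "w": list(ws)}


def utilisation_aos(num_pixels, components_used, lanes=4):
    """One vec4 issue per pixel; only the used components keep lanes busy."""
    issues = num_pixels
    busy = num_pixels * components_used
    return busy / (issues * lanes)


def utilisation_soa(num_pixels, components_used, lanes=4):
    """One issue per used component per group of `lanes` pixels."""
    groups = -(-num_pixels // lanes)  # ceiling division
    issues = groups * components_used
    busy = num_pixels * components_used
    return busy / (issues * lanes)


pixels = [(1, 2, 3, 4), (5, 6, 7, 8), (9, 10, 11, 12), (13, 14, 15, 16)]
soa = aos_to_soa(pixels)
print(soa["x"])                 # x components of all four pixels

# An op that touches only .xy of a register, on 4 pixels:
print(utilisation_aos(4, 2))    # half the lanes idle
print(utilisation_soa(4, 2))    # every lane busy
```

The trade-off mentioned above shows up here too: the SOA form needs all four pixels' registers resident at once, so the register footprint grows with the group size.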
 
Up to a point, supporting more instructions within a single pipeline makes sense. It's clearly going to be less expensive in terms of transistors to do this than to add additional pipelines. But it's going to be less efficient, too. So there's got to be an optimal amount of operations within a single pipeline (which will depend upon a large number of factors, not least of which is the actual makeup of the pipelines), to get the highest performance out of the same transistor count, and I doubt it's just one instruction allowed per clock per pipeline.
 
nAo, I think you're probably right. Plus, branching performance would be worse.

On the other hand, wouldn't it simplify the execution units and instruction set? You could get rid of the crossbars and instruction bits for swizzling, and, assuming your basic unit is a madd, you wouldn't need to organize adders in a tree after the multipliers (for dot products). Anyway, just thinking out loud, I have no idea if those things are significant or not...
 
Jawed said:
So, even with the vertex engine "doing nothing" while the GPU waits for the current pixel batch to complete, and only counting pixel ops, NV40 would be capable of running at 76% efficiency, peak.
Jawed
Who says ATI is using flops to measure the performance?
ATI gives no detailed analysis, frankly, because this is a FUD campaign and we should all treat it as such.
 