Xbox 360 Direct3D Performance Update

I know this is going to hurt some people's feelings, but one of these days it needs to be said at least once.

If there needs to be a software update to tweak the load-balancing characteristics of Xenos, it can only mean that the thing is fundamentally broken.

Unified shading's key benefit is automatic load balancing: shuffling compute resources around on the fly, at any time, which specifically has to include mid-batch. Software can't do this. The hardware must manage itself. If it doesn't do that, the effort is wasted. If Xenos doesn't do that, well, that's ... bad.
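To make concrete what I mean by "the hardware must manage itself", here's a toy sketch in Python of what automatic balancing amounts to. Everything here is invented for illustration - the ALU count, the arbitration rule - and has nothing to do with real Xenos internals:

```python
# Toy model of automatic load balancing: every cycle the hardware
# arbiter hands idle ALUs whatever work is ready, vertex or pixel,
# with no driver involvement and no batch boundary in sight.
from collections import deque

NUM_ALUS = 48  # Xenos-like count, used here purely for flavour

def run(vertex_work, pixel_work):
    vq, pq = deque(vertex_work), deque(pixel_work)
    cycles = 0
    while vq or pq:
        # Each cycle, every ALU pulls from whichever queue is deeper.
        for _ in range(NUM_ALUS):
            if vq and (not pq or len(vq) >= len(pq)):
                vq.popleft()
            elif pq:
                pq.popleft()
        cycles += 1
    return cycles

# A vertex-heavy frame and a pixel-heavy frame finish in similar time,
# because the split follows the work rather than a preset configuration.
print(run(vertex_work=[0] * 9000, pixel_work=[0] * 1000))
print(run(vertex_work=[0] * 1000, pixel_work=[0] * 9000))
```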

Ever install a driver for your home PC? ATI, in fact, has a programmable memory controller of sorts: a driver update can be released and, what do you know, an extra 15% performance increase. Same for Nvidia cards and so on.
 
Yeah, so what?

That doesn't mean there is anything that hasn't been 'yet revealed', as the topic subject line says. We've known about these features for a long time now, since before the 360 was launched in fact; they're not new, and programmers have had access to them before now as well. Just because they're getting (better) integrated into Microsoft's software libraries doesn't mean it's suddenly new stuff. Quite the opposite.
Actually there are new features, such as support for constant buffers (which I've already pointed out), which devs requested. This is a feature in common with D3D10 - though I don't know the degree of commonality. The Xenos shading language is definitely in continual evolution - if nothing else then because it was far from complete at launch.

Jawed
 
Arctic Permafrost said:
If they already did, we must've heard of it, and we have not.
Right - because the primary/only objective of anything we implement in a game, is to advertise it in PR.
:rolleyes:
 
I know this is going to hurt some people's feelings, but one of these days it needs to be said at least once.

If there needs to be a software update to tweak the load-balancing characteristics of Xenos, it can only mean that the thing is fundamentally broken.

Unified shading's key benefit is automatic load balancing: shuffling compute resources around on the fly, at any time, which specifically has to include mid-batch. Software can't do this. The hardware must manage itself. If it doesn't do that, the effort is wasted. If Xenos doesn't do that, well, that's ... bad.

I disagree. I think it was Jawed that had a pretty good explanation of how manual control by devs could avoid some context switching that the software load balancing would produce.

I mean, OoOE CPUs are supposed to "just work" as well, but tweaking code has always given performance benefits.
 
I think what this topic is getting at is basically two points, Jawed's post and the watch.impress article.

The watch.impress article basically stated that shading performance on the GPU was subpar compared to RSX (whilst it has advantages in some other areas), but that developers felt they could solve this with better programming for Xenos in the future. However, this was not really elaborated on, other than the problem being described as "thread stalls in the GPU and a lack of threading resources" or something like that.

Seemed to have something to do with threads in the GPU... and going by Jawed's post, it's not something a "normal" GPU would necessarily be able to do?

Basically I hoped it was something like: Xenos skimped on the registers and other fine-grained control resources (natural for a console part, where small size and price are paramount); however, Xenos also allows a "deeper", finer level of programming control than "normal" GPUs, which eventually could see much better utilization of the already existing shader power, and therefore a great increase in effective performance.

Also, hasn't automated tiling been in the SDK for a while now?
 
I found some interesting info... Not sure if this has been posted before... What do they mean by the "true power" of the 360 GPU??? Have they been holding back before, or have they only now learned how to make tools that take advantage of the 360?

Xbox 360 Direct3D and GPU Performance Update

Speakers: Andrew Goossen and Michael Dougherty

Are you craving a deep understanding of Direct3D and GPU performance features for Xbox 360? Learn about Xbox 360 enhancements to Direct3D, including accelerated predicated tiling and z-prepass, precompiled command buffers, and significant runtime performance improvements. We will also discuss the usage and under-the-hood architecture of a new PIX (Performance Investigator for Xbox) feature: the GPU analysis tab. This analyzer inspects over a hundred hardware GPU counters in conjunction with an accurate GPU simulation to generate a detailed performance analysis of GPU events.

HLSL Shader Compiler Update for Xbox 360 and Windows

Speakers: Jack Palevich and John Rapp

Several recent enhancements have been made to HLSL to support the new graphics features available in Xbox 360 and on Windows. This talk covers the HLSL language additions that expose the full power of the Xbox 360 GPU, as well as new annotations, flags, and runtime changes that can make your shaders run faster on Xbox 360. For Windows, the presentation covers the HLSL language enhancements that expose the new features of Direct3D 10, how the new features map onto down-level hardware, and basic strategies for writing high-performance, cross-platform HLSL code.

http://www.microsoftgamefest.com/session_abstracts.htm#GRAPHICS
link^^^
 
Now we're getting somewhere. We finally have official confirmation from MS themselves of unlocking the full power of the Xbox 360 GPU, instead of some random forum post. Interesting....
 
Very, very interested in several of these... but I'm out of hard drive space and living on flaky hotel internet in NYC for the week.

Any info posted on updates would be greatly appreciated.
 
I disagree. I think it was Jawed that had a pretty good explanation of how manual control by devs could avoid some context switching that the software load balancing would produce.

I mean, OoOE CPUs are supposed to "just work" as well, but tweaking code has always given performance benefits.
Would be nice if you could dig up what Jawed posted. I might start a thread about what Xenos can do in terms of load balancing.

For what I'm getting at, here's my hypothetical worst case:
The chip operates with a certain split between vertex and pixel ALUs that can be configured, but only between batches, by the (driver) software. The software would count instructions every time new vertex and/or pixel shader programs are made current and flush/reconfigure the pipes.

If that's what unified shading works like in Xenos, it totally sucks. That's what I'm saying. I'm also saying that I really don't know, hence the surprising amount of ifs. It just sounds fishy. And there's no public info that I've seen.
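For illustration, the worst case above in toy form - a hypothetical driver-side balancer, with made-up names and costs, not anything I claim Xenos actually does:

```python
# Sketch of the hypothetical worst case: the driver counts instructions
# whenever new shaders are bound, then flushes the pipeline and fixes a
# vertex/pixel ALU split that stays frozen for the whole next batch.

NUM_ALUS = 48
FLUSH_COST = 500  # cycles lost to the flush; invented number

def set_shaders(state, vs_instr_count, ps_instr_count):
    # Pick a split proportional to static instruction counts - a crude
    # heuristic that knows nothing about the actual runtime workload.
    total = vs_instr_count + ps_instr_count
    new_split = max(1, round(NUM_ALUS * vs_instr_count / total))
    if new_split != state.get("vertex_alus"):
        state["stall_cycles"] = state.get("stall_cycles", 0) + FLUSH_COST
        state["vertex_alus"] = new_split
    # Mid-batch the split is frozen: if the batch turns out pixel-bound,
    # the vertex ALUs simply idle until the next reconfiguration point.

state = {}
set_shaders(state, vs_instr_count=40, ps_instr_count=10)  # vertex-heavy split
set_shaders(state, vs_instr_count=5, ps_instr_count=45)   # flush, re-split
print(state)  # {'stall_cycles': 1000, 'vertex_alus': 5}
```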
 
Last edited by a moderator:
I'm pretty sure this is the post people are referring to:

Xenos ALUs can stall when you have very short loops (say 2 instructions), or when you don't have enough ALU code to hide TEX latency.

One solution is to re-sequence your code. e.g. unroll loops or pre-fetch textures.

The sequencer in Xenos is programmable, allowing devs to fine-tune the coherency of code execution. That is, to define thread-switching points in their code, rather than allowing Xenos to apply its default behaviour. This means Xenos will run with fewer thread-switches (in total), where every thread-switch costs time through latency (if the latency can't otherwise be hidden).

Xenos's default behaviour is to thread-switch whenever it sees a latency-inducing instruction (e.g. TEX or BR). So, by lumping TEX operations together and then saying "now you can switch threads", Xenos can reduce the 2 or 3 separate thread-switches to a single thread-switch. That reduces the total number of ALU instructions that are required to hide these TEX instructions.

The first task for devs is to tweak data formats (stored as textures or vertex streams) so that access patterns are efficient. i.e. the minimum number of fetches are performed. Additionally, since Xenos offers a choice of access techniques, the dev has to evaluate them.

In a unified architecture, you can't evaluate the performance of a shader in isolation. You can't write a shader and say "TEX ops take this long, ALU ops this long and branches this long, so the total time is xxx, so we can get X fps". You can only say that's the minimum time they'll take. When the pipelines are doing a mix of work (for other pixels, say, or for vertices as well as pixels) then bandwidth limits or full buffers will cause blips. Ultimately the programmer is exposed to concurrency issues.

Another way of putting it is that the programmer has better control of concurrency issues in Xenos - in traditional GPUs, when resource limits are reached, the dev has to tackle the problem indirectly, rewriting code in the hope that the new usage pattern will eke out better performance. In theory Xenos provides direct control and more options to control execution and resource usage.

Since the SDK for Xenos is still not complete, devs are currently in see-but-can't-touch hell...

Naturally, as someone who has never written shader code, nor coded for Xenos, I can only summarise the general concepts.

Jawed
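To put rough numbers on the TEX-grouping idea, here's a toy Python count of thread switches under the default behaviour versus dev-placed switch points. The instruction stream is invented and the reordering ignores data dependencies - purely illustrative, not real Xenos microcode:

```python
# Toy count of thread switches for one shader's instruction stream:
# default rule (switch at every latency-inducing instruction) versus
# lumping the TEX ops together under one explicit switch point.

shader = ["ALU", "TEX", "ALU", "TEX", "ALU", "TEX", "ALU", "ALU"]

def switches_default(instrs):
    # Default: every TEX (or BR) triggers a thread switch.
    return sum(1 for op in instrs if op in ("TEX", "BR"))

def switches_grouped(instrs):
    # Grouped: hoist the TEX ops together (legal only if dependencies
    # allow it!), then a single switch point covers the whole run.
    grouped = sorted(instrs, key=lambda op: op != "TEX")  # TEXs first
    switches, in_run = 0, False
    for op in grouped:
        if op == "TEX" and not in_run:
            switches, in_run = switches + 1, True
        elif op != "TEX":
            in_run = False
    return switches

print(switches_default(shader))  # 3 switches, each needing ALU work to hide
print(switches_grouped(shader))  # 1 switch for the lumped fetches
```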
 
Would be nice if you could dig up what Jawed posted. I might start a thread about what Xenos can do in terms of load balancing.

For what I'm getting at, here's my hypothetical worst case:
The chip has a fixed split between vertex and pixel ALUs that must be configured before starting a batch. The software could count instructions every time new vertex and/or pixel shader programs are made current and flush/reconfigure the pipes.

If that's what unified shading works like in Xenos, it totally sucks. That's what I'm saying. I'm also saying that I really don't know, hence the surprising amount of ifs. It just sounds fishy. And there's no public info that I've seen.

Toying with the balancing algorithm is supposedly useless, and tweaking it gains you under 5%, according to ERP.

This is also why I think TurnDragon says it should just work.

Everybody tends to want to focus on the load balancing, when my theory is that what's important about this stuff has nothing to do with it, but rather everything to do with threading.
 
Would be nice if you could dig up what Jawed posted. I might start a thread about what Xenos can do in terms of load balancing.

For what I'm getting at, here's my hypothetical worst case:
The chip has a fixed split between vertex and pixel ALUs that must be configured before starting a batch. The software could count instructions every time new vertex and/or pixel shader programs are made current and flush/reconfigure the pipes.

If that's what unified shading works like in Xenos, it totally sucks. That's what I'm saying. I'm also saying that I really don't know, hence the surprising amount of ifs. It just sounds fishy. And there's no public info that I've seen.
No worries as there isn't a fixed split between ALUs.
 
Wait a second, I thought Xenos had a fixed 3-group split of 16 ALUs each? Or maybe I'm not understanding :???:

Yes, that's the batch granularity, and on any one clock it's split that way, but unless you're dealing with really small numbers of verts it's irrelevant to the load balancing.
 
If there needs to be a software update to tweak the load-balancing characteristics of Xenos, it can only mean that the thing is fundamentally broken.
Not necessarily related solely to the load balancing, per se, but that statement isn't really true. The instruction scheduler is microcode programmable, so software updates could easily be applied that would alter the characteristics of the hardware - an example of why this may happen is that the instruction scheduler might be tuned for shaders from prior history, but further profiling on apps that are being coded specifically to this hardware may yield better results with some tweaking. (Whether you'd want to do that on a console is an entirely different matter, of course.)
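One way to picture "microcode programmable" is that an update ships new values for the scheduler's tuning knobs rather than new hardware. A minimal sketch, with a parameter and heuristic I've invented purely for illustration:

```python
# Hypothetical tunable scheduler: the same silicon, but a software
# update can change how eagerly it switches threads. The knob below
# (max_outstanding_tex) is made up; real Xenos microcode is not public.

def thread_switches(instrs, max_outstanding_tex):
    # Switch threads only once this many TEX ops are in flight,
    # instead of switching on every single one.
    switches = outstanding = 0
    for op in instrs:
        if op == "TEX":
            outstanding += 1
            if outstanding >= max_outstanding_tex:
                switches += 1
                outstanding = 0
    return switches

profile = ["TEX", "ALU", "TEX", "TEX", "ALU", "TEX"]  # from app profiling
print(thread_switches(profile, max_outstanding_tex=1))  # 4: launch tuning
print(thread_switches(profile, max_outstanding_tex=2))  # 2: retuned update
```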
 
For what I'm getting at, here's my hypothetical worst case:
The chip operates with a certain split between vertex and pixel ALUs that can be configured, but only between batches, by the (driver) software. The software would count instructions every time new vertex and/or pixel shader programs are made current and flush/reconfigure the pipes.
At a basic level the chip monitors the activity of the vertex and pixel buffers and tries to schedule workloads according to that. There's no fixed allocation in terms of what work can be applied, merely a queue of vertex threads and a queue of fragment threads. The only dependency on where work goes is that when a thread begins processing on a SIMD but is then slept again (as it has some dependency), it has to be assigned to the same SIMD when it's woken, as the registers are local to a SIMD and can't be swapped.
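A rough sketch of that constraint, if it helps - two free-floating queues, but SIMD affinity for anything that has already started. Structure and names are mine, not a real Xenos scheduler:

```python
# Two work queues feed the SIMDs freely, but a thread that sleeps
# mid-execution must wake on the SIMD it started on, because its
# registers live in that SIMD's local register file.
from collections import deque

class Thread:
    def __init__(self, kind):
        self.kind = kind        # "vertex" or "pixel"
        self.home_simd = None   # set once the thread first runs

def schedule(free_simds, vertex_q, pixel_q, sleeping):
    assignments = []
    for simd in free_simds:
        # Woken threads are pinned: they resume only on *their* SIMD.
        pinned = next((t for t in sleeping if t.home_simd == simd), None)
        if pinned:
            sleeping.remove(pinned)
            assignments.append((simd, pinned))
            continue
        # Fresh threads have no affinity yet; take from the deeper queue.
        q = vertex_q if len(vertex_q) >= len(pixel_q) else pixel_q
        if q:
            thread = q.popleft()
            thread.home_simd = simd
            assignments.append((simd, thread))
    return assignments

vq = deque(Thread("vertex") for _ in range(4))
pq = deque(Thread("pixel") for _ in range(8))
print(schedule([0, 1, 2], vq, pq, sleeping=[]))
```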
 
Actually there are new features, such as support for constant buffers (which I've already pointed out) which was requested by devs.
Yet it isn't anything that the hardware hasn't had the ability to do all along; just because the SDK gains more built-in capabilities isn't reason to go all trumpeting around and making threads left and right. I'm sure the PS and N64 SDKs gained new capabilities too every now and then back in the day, and later on, PS2, GC, Xbox etc. I don't remember anyone making any big deal out of it, but I guess the internet's evolved somewhat since then. Developers are more outspoken, as are the companies themselves (and especially MS). I guess there's a certain amount of PR value in revealing stuff like this and playing it up perhaps a bit more than strictly necessary... :)

The Xenos shading language is definitely in continual evolution - if nothing else then because it was far from complete at launch.
These are also merely software updates. If MS really was adding hardware functionality and new shading processor opcodes it'd be quite sensational and I'd be as excited as the next guy, but unfortunately, neither Xenos nor Xenon are giant FPGAs running at hundreds or even thousands of MHz, so it's not possible.

Moderation note: Part removed: do not respond to personal attacks, it's a waste of everyone's time
 
At a basic level the chip monitors the activity of the vertex and pixel buffers and tries to schedule workloads according to that. There's no fixed allocation in terms of what work can be applied, merely a queue of vertex threads and a queue of fragment threads. The only dependency on where work goes is that when a thread begins processing on a SIMD but is then slept again (as it has some dependency), it has to be assigned to the same SIMD when it's woken, as the registers are local to a SIMD and can't be swapped.
Thanks, Dave.
 
Yet it isn't anything that the hardware hasn't had the ability to do all along; just because the SDK gains more built-in capabilities isn't reason to go all trumpeting around and making threads left and right. I'm sure the PS and N64 SDKs gained new capabilities too every now and then back in the day, and later on, PS2, GC, Xbox etc.
That seems the crux of it. Regarding the OP's title, the 'true power' isn't revealed. It never is in a console until devs are years into its life and have learnt the optimizations and tricks to make it shine. But the hardware is the same throughout its life and isn't going to be unlocked, except perhaps in the case of forced reductions like the PSP's downclocking, for which I can't see any reason in a big-box console.

If it's really wanted, a 'console x, true power of the system not yet revealed' thread can be created for every console on its anniversary for at least 3 years, and be a true enough observation, but I can't see that it's a great starting point for discussion. More appropriate would be a thread 'SDK improvements for XB360' or somesuch.
 