Xbox 360 Direct3D Performance Update

I know this is going to hurt some people's feelings, but one of these days it needs to be said at least once.

If there needs to be a software update to tweak the load-balancing characteristics of Xenos, it can only mean that the thing is fundamentally broken.

Unified shading's key benefit is automatic load balancing: shuffling compute resources around on the fly, at any time, which specifically has to include mid-batch. Software can't do this. The hardware must manage itself. If it doesn't do that, the effort is wasted. If Xenos doesn't do that, well, that's ... bad.
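To make concrete what I mean by "the hardware must manage itself", here's a toy sketch in Python of what automatic balancing amounts to. Everything here is invented for illustration - the ALU count, the arbitration rule - and has nothing to do with real Xenos internals:

```python
# Toy model of automatic load balancing: every cycle the hardware
# arbiter hands idle ALUs whatever work is ready, vertex or pixel,
# with no driver involvement and no batch boundary in sight.
from collections import deque

NUM_ALUS = 48  # Xenos-like count, used here purely for flavour

def run(vertex_work, pixel_work):
    vq, pq = deque(vertex_work), deque(pixel_work)
    cycles = 0
    while vq or pq:
        # Each cycle, every ALU pulls from whichever queue is deeper.
        for _ in range(NUM_ALUS):
            if vq and (not pq or len(vq) >= len(pq)):
                vq.popleft()
            elif pq:
                pq.popleft()
        cycles += 1
    return cycles

# A vertex-heavy frame and a pixel-heavy frame finish in similar time,
# because the split follows the work rather than a preset configuration.
print(run(vertex_work=[0] * 9000, pixel_work=[0] * 1000))
print(run(vertex_work=[0] * 1000, pixel_work=[0] * 9000))
```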

Ever install a driver for your home PC? ATI, in fact, has a programmable memory controller of sorts: a driver update can be released and, what do you know, an extra 15% performance increase. Same for Nvidia cards and so on.
 
Yeah, so what?

That doesn't mean there is anything that hasn't been 'yet revealed', as the topic subject line says. We've known about these features for a long time now, since before the 360 was launched in fact; they're not new, and programmers have had access to them before now as well. Just because they're getting (better) integrated into Microsoft's software libraries doesn't mean it's suddenly new stuff. Quite the opposite.
Actually there are new features, such as support for constant buffers (which I've already pointed out), which devs requested. This is a feature in common with D3D10 - though I don't know the degree of commonality. The Xenos shading language is definitely in continual evolution - if nothing else then because it was far from complete at launch.

Jawed
 
Arctic Permafrost said:
If they already did, we must've heard of it, and we have not.
Right - because the primary/only objective of anything we implement in a game, is to advertise it in PR.
:rolleyes:
 
I know this is going to hurt some people's feelings, but one of these days it needs to be said at least once.

If there needs to be a software update to tweak the load-balancing characteristics of Xenos, it can only mean that the thing is fundamentally broken.

Unified shading's key benefit is automatic load balancing: shuffling compute resources around on the fly, at any time, which specifically has to include mid-batch. Software can't do this. The hardware must manage itself. If it doesn't do that, the effort is wasted. If Xenos doesn't do that, well, that's ... bad.

I disagree. I think it was Jawed that had a pretty good explanation of how manual control by devs could avoid some context switching that the software load balancing would produce.

I mean, OoOE CPUs are supposed to "just work" as well, but tweaking code has always given performance benefits.
 
I think what this topic is getting at is basically two points, Jawed's post and the watch.impress article.

The watch.impress article basically stated that shading performance on the GPU was subpar compared to RSX (whilst it has advantages in some other areas), but that developers felt they could solve this with better programming for Xenos in the future. However, this was not really elaborated on, other than the problem being described as "thread stalls in the GPU and a lack of threading resources" or something like that.

Seemed to have something to do with threads in the GPU... and going by Jawed's post, it's not something a "normal" GPU would necessarily be able to do?

Basically I hoped it was something like: Xenos skimped on the registers and other fine-grained control resources (natural for a console part, where small size and price are paramount); however, Xenos also allows a "deeper", finer level of programming control than "normal" GPUs, which eventually could see much better utilization of the already existing shader power, and therefore a great increase in effective performance.

Also, hasn't automated tiling been in the SDK for a while now?
 
I found some interesting info... Not sure if this has been posted before... What do they mean by the "true power" of the 360 GPU??? Have they been holding back before, or have they only now learned how to make tools that take advantage of the 360?

Xbox 360 Direct3D and GPU Performance Update

Speakers: Andrew Goossen and Michael Dougherty

Are you craving a deep understanding of Direct3D and GPU performance features for Xbox 360? Learn about Xbox 360 enhancements to Direct3D, including accelerated predicated tiling and z-prepass, precompiled command buffers, and significant runtime performance improvements. We will also discuss the usage and under-the-hood architecture of a new PIX (Performance Investigator for Xbox) feature: the GPU analysis tab. This analyzer inspects over a hundred hardware GPU counters in conjunction with an accurate GPU simulation to generate a detailed performance analysis of GPU events.

HLSL Shader Compiler Update for Xbox 360 and Windows

Speakers: Jack Palevich and John Rapp

Several recent enhancements have been made to HLSL to support the new graphics features available in Xbox 360 and on Windows. This talk covers the HLSL language additions that expose the full power of the Xbox 360 GPU, as well as new annotations, flags, and runtime changes that can make your shaders run faster on Xbox 360. For Windows, the presentation covers the HLSL language enhancements that expose the new features of Direct3D 10, how the new features map onto down-level hardware, and basic strategies for writing high-performance, cross-platform HLSL code.

http://www.microsoftgamefest.com/session_abstracts.htm#GRAPHICS
link^^^
 
Now we're getting somewhere. We finally have official confirmation from MS themselves of unlocking the full power of the Xbox 360 GPU, instead of some random forum post. Interesting....
 
Very, very interested in several of these... but I'm out of hard drive space and living on flaky hotel internet in NYC for the week.

Any info posted on updates would be greatly appreciated.
 
I disagree. I think it was Jawed that had a pretty good explanation of how manual control by devs could avoid some context switching that the software load balancing would produce.

I mean, OoOE CPUs are supposed to "just work" as well, but tweaking code has always given performance benefits.
Would be nice if you could dig up what Jawed posted. I might start a thread about what Xenos can do in terms of load balancing.

For what I'm getting at, here's my hypothetical worst case:
The chip operates with a certain split between vertex and pixel ALUs that can be configured, but only between batches, by the (driver) software. The software would count instructions every time new vertex and/or pixel shader programs are made current and flush/reconfigure the pipes.

If that's what unified shading works like in Xenos, it totally sucks. That's what I'm saying. I'm also saying that I really don't know, hence the surprising amount of ifs. It just sounds fishy. And there's no public info that I've seen.
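For illustration, the worst case above in toy form - a hypothetical driver-side balancer, with made-up names and costs, not anything I claim Xenos actually does:

```python
# Sketch of the hypothetical worst case: the driver counts instructions
# whenever new shaders are bound, then flushes the pipeline and fixes a
# vertex/pixel ALU split that stays frozen for the whole next batch.

NUM_ALUS = 48
FLUSH_COST = 500  # cycles lost to the flush; invented number

def set_shaders(state, vs_instr_count, ps_instr_count):
    # Pick a split proportional to static instruction counts - a crude
    # heuristic that knows nothing about the actual runtime workload.
    total = vs_instr_count + ps_instr_count
    new_split = max(1, round(NUM_ALUS * vs_instr_count / total))
    if new_split != state.get("vertex_alus"):
        state["stall_cycles"] = state.get("stall_cycles", 0) + FLUSH_COST
        state["vertex_alus"] = new_split
    # Mid-batch the split is frozen: if the batch turns out pixel-bound,
    # the vertex ALUs simply idle until the next reconfiguration point.

state = {}
set_shaders(state, vs_instr_count=40, ps_instr_count=10)  # vertex-heavy split
set_shaders(state, vs_instr_count=5, ps_instr_count=45)   # flush, re-split
print(state)  # {'stall_cycles': 1000, 'vertex_alus': 5}
```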
 
Last edited by a moderator:
I'm pretty sure this is the post people are referring to:

Xenos ALUs can stall when you have very short loops (say 2 instructions), or when you don't have enough ALU code to hide TEX latency.

One solution is to re-sequence your code. e.g. unroll loops or pre-fetch textures.

The sequencer in Xenos is programmable, allowing devs to fine-tune the coherency of code execution. That is, to define thread-switching points in their code, rather than allowing Xenos to apply its default behaviour. This means Xenos will run with fewer thread-switches (in total), where every thread-switch costs time through latency (if the latency can't otherwise be hidden).

Xenos's default behaviour is to thread-switch whenever it sees a latency-inducing instruction (e.g. TEX or BR). So, by lumping TEX operations together and then saying "now you can switch threads", Xenos can reduce the 2 or 3 separate thread-switches to a single thread-switch. That reduces the total number of ALU instructions that are required to hide these TEX instructions.

The first task for devs is to tweak data formats (stored as textures or vertex streams) so that access patterns are efficient. i.e. the minimum number of fetches are performed. Additionally, since Xenos offers a choice of access techniques, the dev has to evaluate them.

In a unified architecture, you can't evaluate the performance of a shader in isolation. You can't write a shader and say "TEX ops take this long, ALU ops this long and branches this long, so the total time is xxx, so we can get X fps". You can only say that's the minimum time they'll take. When the pipelines are doing a mix of work (for other pixels, say, or for vertices as well as pixels) then bandwidth limits or full buffers will cause blips. Ultimately the programmer is exposed to concurrency issues.

Another way of putting it is that the programmer has better control of concurrency issues in Xenos - in traditional GPUs, when resource limits are reached, the dev has to tackle the problem indirectly, rewriting code in the hope that the new usage pattern will eke out better performance. In theory Xenos provides direct control and more options to control execution and resource usage.

Since the SDK for Xenos is still not complete, devs are currently in see-but-can't-touch hell...

Naturally, as someone who has never written shader code, nor coded for Xenos, I can only summarise the general concepts.

Jawed
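To put rough numbers on the TEX-grouping idea, here's a toy Python count of thread switches under the default behaviour versus dev-placed switch points. The instruction stream is invented and the reordering ignores data dependencies - purely illustrative, not real Xenos microcode:

```python
# Toy count of thread switches for one shader's instruction stream:
# default rule (switch at every latency-inducing instruction) versus
# lumping the TEX ops together under one explicit switch point.

shader = ["ALU", "TEX", "ALU", "TEX", "ALU", "TEX", "ALU", "ALU"]

def switches_default(instrs):
    # Default: every TEX (or BR) triggers a thread switch.
    return sum(1 for op in instrs if op in ("TEX", "BR"))

def switches_grouped(instrs):
    # Grouped: hoist the TEX ops together (legal only if dependencies
    # allow it!), then a single switch point covers the whole run.
    grouped = sorted(instrs, key=lambda op: op != "TEX")  # TEXs first
    switches, in_run = 0, False
    for op in grouped:
        if op == "TEX" and not in_run:
            switches, in_run = switches + 1, True
        elif op != "TEX":
            in_run = False
    return switches

print(switches_default(shader))  # 3 switches, each needing ALU work to hide
print(switches_grouped(shader))  # 1 switch for the lumped fetches
```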
 
Would be nice if you could dig up what Jawed posted. I might start a thread about what Xenos can do in terms of load balancing.

For what I'm getting at, here's my hypothetical worst case:
The chip has a fixed split between vertex and pixel ALUs that must be configured before starting a batch. The software could count instructions every time new vertex and/or pixel shader programs are made current and flush/reconfigure the pipes.

If that's what unified shading works like in Xenos, it totally sucks. That's what I'm saying. I'm also saying that I really don't know, hence the surprising amount of ifs. It just sounds fishy. And there's no public info that I've seen.

Toying with the balancing algorithm is supposedly useless, and tweaking it gains you under 5%, according to ERP.

This is also why I think TurnDragon says it should just work.

Everybody tends to want to focus on the load balancing, when my theory is that what's important about this stuff has nothing to do with it, but rather everything to do with threading.
 
Would be nice if you could dig up what Jawed posted. I might start a thread about what Xenos can do in terms of load balancing.

For what I'm getting at, here's my hypothetical worst case:
The chip has a fixed split between vertex and pixel ALUs that must be configured before starting a batch. The software could count instructions every time new vertex and/or pixel shader programs are made current and flush/reconfigure the pipes.

If that's what unified shading works like in Xenos, it totally sucks. That's what I'm saying. I'm also saying that I really don't know, hence the surprising amount of ifs. It just sounds fishy. And there's no public info that I've seen.
No worries as there isn't a fixed split between ALUs.
 
Wait a second, I thought Xenos had a fixed 3-group split of 16 ALUs each? Or maybe I'm not understanding :???:

Yes, that's the batch granularity, and on any one clock it's split that way, but unless you're dealing with really small numbers of verts it's irrelevant to the load balancing.
 
If there needs to be a software update to tweak the load-balancing characteristics of Xenos, it can only mean that the thing is fundamentally broken.
Not necessarily related solely to the load balancing, per se, but that statement isn't really true. The instruction scheduler is microcode programmable, so software updates could easily be applied that would alter the characteristics of the hardware - an example of why this may happen is that the instruction scheduler might be tuned for shaders from prior history, but further profiling on apps that are being coded specifically to this hardware may yield better results with some tweaking. (Whether you'd want to do that on a console is an entirely different matter, of course.)
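One way to picture "microcode programmable" is that an update ships new values for the scheduler's tuning knobs rather than new hardware. A minimal sketch, with a parameter and heuristic I've invented purely for illustration:

```python
# Hypothetical tunable scheduler: the same silicon, but a software
# update can change how eagerly it switches threads. The knob below
# (max_outstanding_tex) is made up; real Xenos microcode is not public.

def thread_switches(instrs, max_outstanding_tex):
    # Switch threads only once this many TEX ops are in flight,
    # instead of switching on every single one.
    switches = outstanding = 0
    for op in instrs:
        if op == "TEX":
            outstanding += 1
            if outstanding >= max_outstanding_tex:
                switches += 1
                outstanding = 0
    return switches

profile = ["TEX", "ALU", "TEX", "TEX", "ALU", "TEX"]  # from app profiling
print(thread_switches(profile, max_outstanding_tex=1))  # 4: launch tuning
print(thread_switches(profile, max_outstanding_tex=2))  # 2: retuned update
```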
 
For what I'm getting at, here's my hypothetical worst case:
The chip operates with a certain split between vertex and pixel ALUs that can be configured, but only between batches, by the (driver) software. The software would count instructions every time new vertex and/or pixel shader programs are made current and flush/reconfigure the pipes.
At a basic level the chip monitors the activity of the vertex and pixel buffers and tries to schedule workloads according to that. There's no fixed allocation in terms of what work can be applied, merely a queue of vertex threads and a queue of fragment threads. The only dependency on where work goes is that when a thread begins processing on a SIMD but is then slept again (as it has some dependency), it has to be assigned to the same SIMD when it's woken, as the registers are local to a SIMD and can't be swapped.
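A rough sketch of that constraint, if it helps - two free-floating queues, but SIMD affinity for anything that has already started. Structure and names are mine, not a real Xenos scheduler:

```python
# Two work queues feed the SIMDs freely, but a thread that sleeps
# mid-execution must wake on the SIMD it started on, because its
# registers live in that SIMD's local register file.
from collections import deque

class Thread:
    def __init__(self, kind):
        self.kind = kind        # "vertex" or "pixel"
        self.home_simd = None   # set once the thread first runs

def schedule(free_simds, vertex_q, pixel_q, sleeping):
    assignments = []
    for simd in free_simds:
        # Woken threads are pinned: they resume only on *their* SIMD.
        pinned = next((t for t in sleeping if t.home_simd == simd), None)
        if pinned:
            sleeping.remove(pinned)
            assignments.append((simd, pinned))
            continue
        # Fresh threads have no affinity yet; take from the deeper queue.
        q = vertex_q if len(vertex_q) >= len(pixel_q) else pixel_q
        if q:
            thread = q.popleft()
            thread.home_simd = simd
            assignments.append((simd, thread))
    return assignments

vq = deque(Thread("vertex") for _ in range(4))
pq = deque(Thread("pixel") for _ in range(8))
print(schedule([0, 1, 2], vq, pq, sleeping=[]))
```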
 
Actually there are new features, such as support for constant buffers (which I've already pointed out) which was requested by devs.
Yet it isn't anything that the hardware hasn't had the ability to do all along; just because the SDK gains more built-in capabilities isn't reason to go all trumpeting around and making threads left and right. I'm sure the PS and N64 SDKs gained new capabilities too every now and then back in the day, and later on, PS2, GC, Xbox etc. I don't remember anyone making any big deal out of it, but I guess the internet's evolved somewhat since then. Developers are more outspoken, as are the companies themselves (and especially MS). I guess there's a certain amount of PR value in revealing stuff like this and playing it up perhaps a bit more than strictly necessary... :)

The Xenos shading language is definitely in continual evolution - if nothing else then because it was far from complete at launch.
These are also merely software updates. If MS really was adding hardware functionality and new shading processor opcodes it'd be quite sensational and I'd be as excited as the next guy, but unfortunately, neither Xenos nor Xenon are giant FPGAs running at hundreds or even thousands of MHz, so it's not possible.

Moderation note: Part removed: do not respond to personal attacks, it's a waste of everyone's time
 
At a basic level the chip monitors the activity of the vertex and pixel buffers and tries to schedule workloads according to that. There's no fixed allocation in terms of what work can be applied, merely a queue of vertex threads and a queue of fragment threads. The only dependency on where work goes is that when a thread begins processing on a SIMD but is then slept again (as it has some dependency), it has to be assigned to the same SIMD when it's woken, as the registers are local to a SIMD and can't be swapped.
Thanks, Dave.
 
Yet it isn't anything that the hardware hasn't had the ability to do all along; just because the SDK gains more built-in capabilities isn't reason to go all trumpeting around and making threads left and right. I'm sure the PS and N64 SDKs gained new capabilities too every now and then back in the day, and later on, PS2, GC, Xbox etc.
That seems the crux of it. Regarding the OP's title, the 'true power' isn't revealed. It never is in a console until devs are years into its life and have learnt the optimizations and tricks to make it shine. But the hardware is the same throughout its life and isn't going to be unlocked, except perhaps in the case of forced reductions like the PSP's downclocking, for which I can't see any reason in a big-box console.

If it's really wanted, a 'console x, true power of the system not yet revealed' thread can be created for every console on its anniversary for at least 3 years, and be a true enough observation, but I can't see that it's a great starting point for discussion. More appropriate would be a thread 'SDK improvements for XB360' or somesuch.
 