New Technical Information About RSX???...But In Japanese

I wonder if nAo (et al.) might comment on RSX's efficiency relative to what one might find in a similar part in the PC space. I remember that ATI made a big deal about the efficiency of Xenos compared to something like RSX, putting it at 95% vs. something like 50% respectively. Thanks for any info you can share on this.
 
If you don't have enough bandwidth for HDR+AA you can be more clever and use less bandwidth
If you don't have enough fillrate for your uber particles effects you can be more clever and use less fillrate.
If you don't have enough computational power for your shaders you can be more...no sorry, you're f&%£"d :)

So basically, are you saying that with the PS3 you are able to somehow bypass those problems, while with the X360 you're f&%£"d??
 
Your ability to read the thread and understand nAo's post is quite impressive.

Is that computational shading power the result of Cell+RSX or just RSX... is RSX that far ahead of Xenos in shading power? I always had the impression that it was the opposite...
 
Also, if you're low on pixel shading or vertex shading, the CELL could help out, according to some of Sony's own papers and abstracts on the net.
 
According to the following, the CELL should at least be able to provide some amount of extra pixel shading to the RSX. Now, I know when you have physics, animation, AI, sound, and other stuff happening on the CELL you don't want to overburden it. But if you have some power to spare, perhaps something like the following could help:


Mapping Deferred Pixel Shaders onto the Cell Architecture
Alan Heirich - Sony Computer Entertainment America

Abstract
This paper studies a deferred pixel shading algorithm implemented on a Cell-based computer entertainment system. The pixel shader runs on the Synergistic Processing Units (SPUs) of the Cell and works concurrently with the GPU to render images. The system's unified memory architecture allows the Cell and GPU to exchange data through shared textures. The SPUs use the Cell DMA list capability to gather irregular fine-grained fragments of texture data generated by the GPU. They return resultant shadow textures the same way. The shading computation ran at up to 85 Hz at HDTV 720p resolution on 5 SPUs and generated 30.72 gigaops of performance. This is comparable to the performance of the algorithm running on a state of the art high end GPU. These results indicate that a hybrid solution in which the Cell and GPU work together can produce higher performance than either device working alone.
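As a back-of-envelope check of the abstract's figures, a toy calculation (the peak-rate assumptions are mine, not from the paper):

```python
# Sanity-check the 30.72 gigaops figure against an assumed SPU peak.
# FLOPS_PER_CYCLE = 8 assumes a 4-wide SIMD fused multiply-add per cycle
# per SPU; this is my assumption, not a number from the paper.
CLOCK_GHZ = 3.2
FLOPS_PER_CYCLE = 8
SPUS = 5  # SPUs used in the paper's experiment

peak_gigaops = CLOCK_GHZ * FLOPS_PER_CYCLE * SPUS  # 128 gigaops across 5 SPUs
achieved = 30.72                                   # figure from the abstract

print(peak_gigaops)                       # 128.0
print(round(achieved / peak_gigaops, 2))  # 0.24
```

Under those assumptions the reported throughput is roughly a quarter of the 5-SPU peak, which is plausible for an irregular-gather workload.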
 
Is that computational shading power the result of Cell+RSX or just RSX... is RSX that far ahead of Xenos in shading power? I always had the impression that it was the opposite...

Don't know why you're asking me… but yes, going by that translated article it seems to be only RSX, and theoretically RSX beats out Xenos in shading, but that's only in theory.
 
"For the 360, they have little memory bandwidth issues with the 10MB eDRAM, and less FSAA issues. But some developers are having issues with a lack of shader ALU performance and threading resources, however performance will increase as developers get more familiar with unified shader architecture."

I am really curious what this means. On GAF another translation, I believe, was "(360) developers are having trouble with thread stalls", implying in the GPU; then a mini debate broke out over whether you can have thread stalls in a GPU, or whether they were referring instead to CPU thread stalls.

Basically, is it possible to gain Xenos GPU performance through different programming techniques, as the article implies? Somebody in this thread has already stated that unified shading should "just work", which sort of contradicts what the article is implying.

Any 360 devs wish to speak on what this means? Fran?
 
"For the 360, they have little memory bandwidth issues with the 10MB eDRAM, and less FSAA issues. But some developers are having issues with a lack of shader ALU performance and threading resources, however performance will increase as developers get more familiar with unified shader architecture."

I am really curious what this means. On GAF another translation, I believe, was "(360) developers are having trouble with thread stalls", implying in the GPU; then a mini debate broke out over whether you can have thread stalls in a GPU, or whether they were referring instead to CPU thread stalls.

Basically, is it possible to gain Xenos GPU performance through different programming techniques, as the article implies? Somebody in this thread has already stated that unified shading should "just work", which sort of contradicts what the article is implying.

Any 360 devs wish to speak on what this means? Fran?

Although the unified architecture should just work, I guess it is very reasonable to expect that this may not be the best way to get the most out of Xenos. I fully expect developers won't find the 360 GPU as easy to develop for as a PC GPU, and that somewhat different approaches/programming may be required to get the full power...
 
Although the unified architecture should just work, I guess it is very reasonable to expect that this may not be the best way to get the most out of Xenos.
Can you explain why?

If you're feeding Xenos pixel and vertex data, it should be automatically load balancing. You also have the choice of feeding just vertex data, say, and it'll load balance to do just vertex work. ATi said there was the option for devs to provide their own thread management, but the standard one was supposed to be generally good. This (managing the load balancing) was nVidia's complaint with US, and I always felt ATi went ahead with it because they had worked out the best methods that nVidia hadn't.

If the devs need to do the load balancing themselves, that to me suggests the default systems aren't very effective. I guess that is a possibility, with US being untried before this part. If Xenos works as well as it was described on paper, there shouldn't be any trouble accessing its shader performance, and it should be benefiting from improved shader-use efficiency to boot.
 
One other option I can think of is that within the Console space, manual unified shading would be more beneficial than in the PC space, because console developers are expected to make the most out of a system by hand.
 
Xenos ALUs can stall when you have very short loops (say 2 instructions), or when you don't have enough ALU code to hide TEX latency.

One solution is to re-sequence your code. e.g. unroll loops or pre-fetch textures.

The sequencer in Xenos is programmable, allowing devs to fine-tune the coherency of code execution. That is, to define thread-switching points in their code, rather than allowing Xenos to apply its default behaviour. This means Xenos will run with fewer thread-switches (in total), where every thread-switch costs time through latency (if the latency can't otherwise be hidden).

Xenos's default behaviour is to thread-switch whenever it sees a latency-inducing instruction (e.g. TEX or BR). So, by lumping TEX operations together and then saying "now you can switch threads", Xenos can reduce the 2 or 3 separate thread-switches to a single thread-switch. That reduces the total number of ALU instructions that are required to hide these TEX instructions.
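The batching arithmetic above can be sketched as a toy model (the latency figure and the one-window-per-switch simplification are my own assumptions, not real Xenos numbers):

```python
# Toy model of thread-switch cost: each switch exposes one latency
# window that must be covered by ALU work from other threads; TEX
# fetches batched before a single switch point share one window.
TEX_LATENCY = 100  # assumed cycles to satisfy one texture fetch (hypothetical)

def alu_cover(switch_points):
    """ALU instructions needed elsewhere to hide the exposed latency."""
    return switch_points * TEX_LATENCY

# Default behaviour: one switch per TEX instruction (3 fetches -> 3 switches).
default_cost = alu_cover(3)
# Tuned sequencer: 3 fetches issued together, then a single switch point.
batched_cost = alu_cover(1)

print(default_cost, batched_cost)  # 300 100
```

Under these assumptions, grouping the fetches cuts the ALU work required to hide them to a third.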

The first task for devs is to tweak data formats (stored as textures or vertex streams) so that access patterns are efficient. i.e. the minimum number of fetches are performed. Additionally, since Xenos offers a choice of access techniques, the dev has to evaluate them.

In a unified architecture, you can't evaluate the performance of a shader in isolation. You can't write a shader and say "TEX ops take this long, ALU ops this long and branches this long, so the total time is xxx, so we can get X fps". You can only say that's the minimum time they'll take. When the pipelines are doing a mix of work (for other pixels, say, or for vertices as well as pixels) then bandwidth limits or full buffers will cause blips. Ultimately the programmer is exposed to concurrency issues.
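That "minimum time only" point can be made concrete with a small sketch (the per-op costs and the shader trace are hypothetical, purely for illustration):

```python
# Isolated-shader timing gives only a lower bound in a unified
# architecture: contention with other pixel/vertex threads adds on top.
# All cycle costs here are made-up illustrative figures.
COST = {"ALU": 1, "TEX": 4, "BR": 2}  # assumed cycles per instruction

shader = ["TEX", "ALU", "ALU", "BR", "ALU"]  # hypothetical shader trace
lower_bound = sum(COST[op] for op in shader)

print(lower_bound)  # 9 cycles at best; full buffers or bandwidth limits add more
```

The sum is the best case the programmer can compute in isolation; the actual time depends on what else the pipelines are doing.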

Another way of putting it is that the programmer has better control of concurrency issues in Xenos - in traditional GPUs, when resource limits are reached, the dev has to tackle the problem indirectly, rewriting code in the hope that the new usage pattern will eke out better performance. In theory Xenos provides direct control and more options to control execution and resource usage.

Since the SDK for Xenos is still not complete, devs are currently in see-but-can't-touch hell...

Naturally, as someone who has never written shader code, nor coded for Xenos, I can only summarise the general concepts.

Jawed
 
Another way of putting it is that the programmer has better control of concurrency issues in Xenos - in traditional GPUs, when resource limits are reached, the dev has to tackle the problem indirectly, rewriting code in the hope that the new usage pattern will eke out better performance. In theory Xenos provides direct control and more options to control execution and resource usage.
Thanks for the explanation. Do you see this thread management as a possible reason for the idea that XB360 devs are having trouble getting shader performance? Is unoptimized data packaging causing excess thread changes and incurring overhead, reducing time spent on actual shader work?
 
Thanks for the explanation. Do you see this thread management as a possible reason for the idea that XB360 devs are having trouble getting shader performance? Is unoptimized data packaging causing excess thread changes and incurring overhead, reducing time spent on actual shader work?
It's bound to. Thread management is merely a tool for tackling the problem, since Xenos uses thread-switching to hide all latency, whether incurred by texturing, branching or thread-prioritisation (e.g. vertices versus pixels).

Unrolling loops or pre-fetching (textures) are common techniques on GPU and CPU to minimise latency. The programmer can fine tune these techniques by specifying when a thread-switch is performed.

Sub-optimal data formats are bad no matter what GPU you're using. They'll cause stalls (or bubbles) in shader pipelines regardless.

---

Another dimension in the "latency" problem is the number of threads in flight. Xenos is dependent on having ready-to-execute threads available when thread-switching. A stall happens if all threads are waiting for a texture, or branching (i.e. not ready-to-execute). The chances of this happening increase as the number of threads in flight falls. The number of threads in flight falls when shaders make excessive use of registers. But ... the amount of texture-induced latency tends to fall as shaders increase in complexity (use more registers), which means the chances of thread-switching fall ... so it doesn't necessarily matter so much that you have fewer threads in flight :!:
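The registers-versus-threads trade-off can be shown with a tiny model (the register budget is an assumed round number, not Xenos's real figure):

```python
# Threads in flight are limited by the register file: each resident
# thread must hold all its live registers simultaneously.
REGISTER_BUDGET = 256  # assumed registers per SIMD (hypothetical figure)

def threads_in_flight(regs_per_thread):
    """How many threads fit when each needs regs_per_thread registers."""
    return REGISTER_BUDGET // regs_per_thread

print(threads_in_flight(4))   # 64: plenty of ready threads to switch to
print(threads_in_flight(32))  # 8: far fewer candidates for hiding latency
```

As the post notes, the heavier shader also tends to have more ALU work per fetch, so the smaller thread pool is not automatically a problem.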

So, there's no simple way to model this, you have to suck it and see.

The list of recommendations for performance goes on and on and on...

I don't know what proportion of devs have never programmed with shaders before (or how arguably similar or different graphics programming is on PS2, say) but I think it's fair to say that hardcore SM3 programming is still pretty much virgin territory.

When graphics-engine devs get to build their second graphics engine on a next-gen console, they should be well over the learning-curve and the SDK-pitfalls (whatever remain).

Jawed
 
Mapping Deferred Pixel Shaders onto the Cell Architecture
Alan Heirich - Sony Computer Entertainment America

Abstract
This paper studies a deferred pixel shading algorithm implemented on a Cell-based computer entertainment system. The pixel shader runs on the Synergistic Processing Units (SPUs) of the Cell and works concurrently with the GPU to render images. The system's unified memory architecture allows the Cell and GPU to exchange data through shared textures. The SPUs use the Cell DMA list capability to gather irregular fine-grained fragments of texture data generated by the GPU. They return resultant shadow textures the same way. The shading computation ran at up to 85 Hz at HDTV 720p resolution on 5 SPUs and generated 30.72 gigaops of performance. This is comparable to the performance of the algorithm running on a state of the art high end GPU. These results indicate that a hybrid solution in which the Cell and GPU work together can produce higher performance than either device working alone.

Hunh? These statements are strange at the very least...5 spus?
 
What's so strange about it? It just says that CELL is fast enough to seriously "help" out the GPU if the DEV wants that...

Even with 1 SPU disabled and 1 SPU reserved for the OS, there should still be 6 SPUs. At least I think that's what blakjedi found strange. But I think the authors of that paper just chose to use 5 SPUs and leave 1 SPU idle to handle other hypothetical code.
 
Even with 1 SPU disabled and 1 SPU reserved for the OS, there should still be 6 SPUs. At least I think that's what blakjedi found strange. But I think the authors of that paper just chose to use 5 SPUs and leave 1 SPU idle to handle other hypothetical code.
Yeah, that's how I read the quote and blakjedi's comments. Also, could the Unified Memory Architecture be referring to uniform memory addressing? Even though you have two pools of RAM, you can access them as one just with the memory address, no? Or, putting that another way, RSX and Cell can work with both pools of RAM directly.
 
I don't think they say Xenos lacks shader power at all...

It says *some* developers are having poor ALU performance... Going by that, it could be a million factors causing this, including concurrency in shaders, because if I remember correctly Xenos is also a heavily threaded design...

I also remember a slide from an Epic presentation where they claimed they resolved concurrency issues in their shaders, allowing them to use many more ALU ops.

Since it's a Japanese web site, I'm guessing that's Japanese developers talking, and judging by the average 360 games coming from there I would say they are not using any shaders at all...

That's of course a problem MS needs to address; it should be easy to develop for the X360 and achieve at least a minimum graphical bar, at least that's what MS claims...

Reading comprehension FTW. That article said NOTHING about RSX having "more" shader power. It just said that devs are struggling to get shader performance out of Xenos, but that that should change as they get more familiar with the architecture. Revolutionary new shader technology says a big "Well durr!!!" to that idea.
 