Xenos & XCPU core locking

Agisthos · Oct 30, 2005

The close association between RSX and Cell gets brought up here so often, but Sony has released no substantial information because of the tight-lipped NDAs. So while we wait...

The Xenos has the ability to lock onto any of the CPU cores' caches directly (bypassing the memory bus?). This feature has been mentioned in a few Xbox PR interviews.

I would have thought it would be a very interesting topic for discussion, and yet it almost never gets mentioned.

ihamoitc2005 · Oct 30, 2005

No RAM latency

Agisthos said:
The close association between RSX and Cell gets brought up here so often, but Sony has released no substantial information because of the tight-lipped NDAs. So while we wait...

The Xenos has the ability to lock onto any of the CPU cores' caches directly (bypassing the memory bus?). This feature has been mentioned in a few Xbox PR interviews.

I would have thought it would be a very interesting topic for discussion, and yet it almost never gets mentioned.

Having the GPU read CPU-decompressed data (produced by one CPU thread) straight from the locked cache avoids reading and writing that data through high-latency main RAM, which removes a big limit on software performance.

Titanio · Oct 30, 2005

ihamoitc2005 said:
Having the GPU read CPU-decompressed data (produced by one CPU thread) straight from the locked cache avoids reading and writing that data through high-latency main RAM

Is the latency going across the CPU-GPU bus much less than that from GPU to RAM? I wouldn't have thought so. Where it can be used, I'd have thought it was more about saving bandwidth than anything else.

Anyway, this has actually been discussed quite a lot.

ihamoitc2005 · Oct 30, 2005

Latency and bandwidth

Titanio said:
Is the latency going across the CPU-GPU bus much less than that from GPU to RAM? I wouldn't have thought so. Where it can be used, I'd have thought it was more about saving bandwidth than anything else.

Anyway, this has actually been discussed quite a lot.

Latency in software performance comes not only from moving data across the bus but also from the time the RAM itself takes to fulfil read/write requests. It is like getting medicine straight from the factory instead of from the pharmacy. By not using RAM, that latency is avoided completely, so only the one-way transmission latency on the bus remains, which is unavoidable.

There is a good article on Ars Technica:

http://arstechnica.com/paedia/b/bandwidth-latency/bandwidth-latency-1.html

dukmahsik · Oct 30, 2005

Is this the memexport feature or the CPU-core-slaving-to-GPU feature?

Jawed · Oct 30, 2005

Titanio said:
Is the latency going across the CPU-GPU bus much less than that from GPU to RAM?

Latency isn't much of an issue in a streaming dataflow. i.e. if the dataflow is continuous then it doesn't matter if the data takes 1 tick or 1000 ticks to get to the other end.
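To put rough numbers on the streaming point, here is a minimal sketch. The model (a fixed startup latency, then one item delivered per tick once the pipe is full) and the function name are illustrative assumptions, not anything measured on Xenos.

```python
def stream_time(items: int, latency_ticks: int, ticks_per_item: int = 1) -> int:
    """Ticks until the last item arrives in a fully pipelined stream.

    Assumed model: the first item arrives after `latency_ticks`, and each
    later item follows `ticks_per_item` behind the previous one.
    """
    if items <= 0:
        return 0
    return latency_ticks + (items - 1) * ticks_per_item


items = 1_000_000
fast = stream_time(items, latency_ticks=1)
slow = stream_time(items, latency_ticks=1000)

# Relative cost of the extra 999 ticks of latency over the whole stream.
overhead = (slow - fast) / fast

print(f"1-tick latency:    {fast} ticks")
print(f"1000-tick latency: {slow} ticks")
print(f"extra time:        {overhead:.4%}")
```

Under this model the latency is paid once, at the start of the stream, so its share of the total shrinks as the stream gets longer.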

But there's no reason to suppose that the latency between CPU-L2-cache and GPU is high, like that associated with DDR RAM, because most of the latency when accessing DDR RAM is actually in the memory devices themselves.

Jawed

Jawed · Oct 30, 2005

dukmahsik said:
is this the memexport feature or the cpu core slaving to gpu feature?

http://arstechnica.com/articles/paedia/cpu/xbox360-1.ars/2

Jawed

Titanio · Oct 30, 2005

Jawed said:
Latency isn't much of an issue in a streaming dataflow. i.e. if the dataflow is continuous then it doesn't matter if the data takes 1 tick or 1000 ticks to get to the other end.

Agreed, I don't think it matters much anyway for these purposes, but I do wonder if it's so much about avoiding latency as saving bandwidth. That the kinds of things this would be used for aren't latency-sensitive only further suggests that.

Jawed said:
But there's no reason to suppose that the latency between CPU-L2-cache and GPU is high, like that associated with DDR RAM, because most of the latency when accessing DDR RAM is actually in the memory devices themselves.

Perhaps indeed. I suppose it comes down to how cache access behaves.

Powderkeg · Oct 30, 2005

Titanio said:
Perhaps indeed. I suppose it comes down to how cache access behaves.

In part, but the main factor is simply speed.

If you had a 20-clock-cycle access on both RAM and L2 cache, you would still have RAM clocked at 700 MHz versus a CPU at 3.2 GHz.
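Working that through as a sketch: the 20-cycle figure is the hypothetical from this post, not a measured latency for either part, and the code assumes the L2 is timed at the full 3.2 GHz core clock, as the post implies.

```python
def access_time_ns(cycles: int, clock_hz: float) -> float:
    """Wall-clock time for `cycles` clock cycles at `clock_hz`, in nanoseconds."""
    return cycles / clock_hz * 1e9


CYCLES = 20           # hypothetical access cost, same for both
RAM_CLOCK_HZ = 700e6  # 700 MHz, from the post
CPU_CLOCK_HZ = 3.2e9  # 3.2 GHz, from the post

ram_ns = access_time_ns(CYCLES, RAM_CLOCK_HZ)
l2_ns = access_time_ns(CYCLES, CPU_CLOCK_HZ)

print(f"RAM: {ram_ns:.2f} ns")
print(f"L2:  {l2_ns:.2f} ns")
print(f"ratio: {ram_ns / l2_ns:.2f}x")
```

The same cycle count costs about 4.6 times longer at the slower clock, which is just the ratio of the two clock rates.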

Jawed · Oct 30, 2005

Titanio said:
Agreed, I don't think it matters much anyway for these purposes, but I do wonder if it's so much about avoiding latency as saving bandwidth. That the kinds of things this would be used for aren't latency-sensitive only further suggests that.

I think it's entirely about saving bandwidth against DDR RAM. GPUs are latency-tolerant by design. The newest GPU designs take latency-tolerance to new extremes - hence the whole concept of out-of-order threading.

Why put stuff into main memory if it doesn't need to go there? You've just saved the 10.8GB/s needed to put it into memory, and the 10.8GB/s to read it back out again.
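As a back-of-the-envelope sketch of that saving: the 10.8 GB/s figure is the one quoted above, while the 22.4 GB/s peak for main memory is an assumption (the commonly quoted figure for the 360's GDDR3) that is not stated in this thread. It is an upper bound, since a real stream would rarely run at the full bus rate.

```python
STREAM_GBPS = 10.8    # per-direction rate quoted in the post
MAIN_RAM_GBPS = 22.4  # assumption: commonly quoted peak main-memory bandwidth


def ram_traffic_saved(stream_gbps: float) -> float:
    """Main-memory traffic avoided when a stream skips the RAM round trip:
    one write by the producer plus one read by the consumer."""
    return 2 * stream_gbps


saved = ram_traffic_saved(STREAM_GBPS)
share = saved / MAIN_RAM_GBPS

print(f"traffic kept out of RAM: {saved:.1f} GB/s")
print(f"share of assumed peak RAM bandwidth: {share:.1%}")
```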

---

I'm sure we'll see something similar in PS3.

Additionally, within Cell, bandwidth is saved by having the LSs able to send/fetch data amongst themselves. None of those tasks touch memory, or even Cell cache. This is entirely dependent on the algorithm being streaming.

If your algorithm is streaming in nature (e.g. "here are the polys that make up a monster as he swings his hammer down to crush you") then there's a great opportunity to keep that data away from memory. In that sense RSX should appear like "another Cell" connected to PS3's Cell. At the very least RSX should consume procedural poly and texture data without that data having to go into either XDR RAM or GDDR3 RAM.

---

None of this rules out Xenos running some shaders on the data coming direct from Xenon L2 and stuffing the results of those shaders into memory. For example, tessellation is a two-pass process, with the implication that main memory is used to hold the intermediate data, but this topic is very sketchy. Tessellation (including adaptive tessellation for level-of-detail control) leads to some polys being deleted as well as others being created.

Then there's the memory consumed by vertex/poly data in performing predicated tiling (e.g. because the back-buffer needs to be split into 3 tiles).
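A rough sketch of where a figure like 3 tiles can come from. Everything here is an assumption for illustration and not from this thread: 10 MB of EDRAM, 32-bit colour plus 32-bit depth/stencil per sample, and a 1280x720 target with 4x multisampling. It also ignores any hardware alignment of tile sizes.

```python
import math

EDRAM_BYTES = 10 * 1024 * 1024  # assumption: 10 MB of EDRAM
BYTES_PER_SAMPLE = 4 + 4        # assumption: 32-bit colour + 32-bit depth/stencil


def tiles_needed(width: int, height: int, msaa_samples: int) -> int:
    """Smallest number of tiles so that each tile's share of the
    back-buffer fits in EDRAM (no tile-alignment rules modelled)."""
    buffer_bytes = width * height * msaa_samples * BYTES_PER_SAMPLE
    return math.ceil(buffer_bytes / EDRAM_BYTES)


for samples in (1, 2, 4):
    print(f"1280x720 at {samples}x: {tiles_needed(1280, 720, samples)} tile(s)")
```

Each extra tile means the geometry has to be submitted again for that tile, which is why the vertex/poly data has to stay somewhere it can be re-read.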

So, even though the Xbox Procedural Synthesis streaming pipeline starts with a portion of Xenon's L2 locked for the GPU, with the data going straight into Xenos's shader pipelines, the requirement to loop through that data for both tessellation and tile predication means that the vertex/poly data (or at least some of it) needs to be kept in memory.

Well, that's the way I interpret it. The detail is just not there, sadly.

Jawed

blakjedi · Oct 30, 2005

It would be interesting to hear just how many titles are using Xenos as a traditional GPU versus those utilizing tiling, CPU cache locking, etc. We expect increases in complexity, speed, and effects if those techniques are used, but I wonder if we can actually *SEE* the impact of those techniques in games.
