XCPU and Xenos: What does "cpu slaved to the GPU" mean?

dukmahsik

Can anyone expand more on this?

"1 or more of the 6 CPU threads can be "slaved" to the GPU, so the GPU would actually dictate all program flow through that chip. This would be especially useful in the procedural synthesis/dynamic geometry that MS is talking up a storm about."

What are the implications for games? Does Cell/RSX have this feature?
 
You might want to provide a source for that quote. For all we know it's something someone made up on a forum with no basis for hardware application.
 
I believe Ars Technica noted something similar in their XCPU article, if not in exactly those words (it may come from the patents).

As I understand it, one (or more! You could use two cores for this as well) of the CPU cores can be dedicated to a task like procedural synthesis (e.g. generating all the trees in a massive forest) and use the "L2 cache lock" feature to stream that data directly to the GPU.

So if a CPU core and 128K of L2 cache are dedicated to streaming geometry information to the GPU, you could consider this a "slave" situation--the CPU core's purpose is to generate information for the GPU to render.

This conserves memory and bandwidth for the 3D model (and the feature has what they call "D3D" compression, which is said to save 50% on bandwidth), and it is a way for artists/game designers to create more content more cheaply.

As DeanoC has noted, this is a situation CELL will excel at. Three PPC cores are not enough to compete with Sony in this area--an important one in many respects due to its implications for developers. So procedural synthesis was a hurdle MS wanted to overcome. Their solution was to beef up the VMX units, allow a cache-lock feature for streaming, let the CPU write directly to the GPU (and vice versa, in a very limited fashion), and so forth (MEMEXPORT can be used for other things, but this may be an area where it helps).
 
It might refer to the XeGPU being able to interrupt a XeCPU thread, effectively causing a CPU callback from the GPU command flow...

There are probably lots of fine details, but the idea might be to render a bounding volume for, say, a tree, put a conditional branch on z-pass that issues a CPU interrupt to generate the actual tree geometry, have the GPU render some other stuff while the CPU works, then actually render the tree later on in the command buffer.

Very handy for procedural geometry, as you essentially have a CPU slaved to the GPU command buffer; of course the hard bit is hiding all the latencies and delays so as not to stall the GPU.

Of course I could be barking up the wrong tree completely (pun intended :) )
 
THIS is what makes the X360 so interesting, IMO. :D It REALLY does seem to be a very tightly knit system; while it perhaps doesn't have the highest raw specs ('only' 8 pixel pipes in the GPU, in-order CPU cores with just 1M of cache, 128-bit UMA memory, etc.), through these interesting functions it will probably still be able to do quite amazing things once programmers find their way around it. I'm very enthusiastic.

PS3 seems more like a big-block V8 with a blower mounted to it in comparison, more raw power but perhaps not as efficient. There's a certain appeal to that approach also though... :D
 
Yeah, the differences in the design approaches are interesting.

Memory
-MS: The slides note the irregular load the backbuffer creates and how it sucks up a ton of bandwidth. Solution? Isolate it so it never bottlenecks, and use a simple UMA for the rest of the system.
-Sony: Large bandwidth needs require a lot of bandwidth resources. CELL needs low-latency memory and RSX needs a lot of bandwidth. Solution? A NUMA with two 256MB segments, effectively doubling the bandwidth of the 360's UMA. Yeah, devs still need to deal with the irregular needs of the backbuffer and the extra work of handling the memory segmentation... but boy, ~50GB/s of bandwidth?! w00t!

Graphics
-MS: We want something to dovetail with our next-gen OS to create a new graphics platform... so what features do we want? How about something more efficient that maximizes the hardware and tears down the barrier between VS and PS. Solution? Unified shaders. At ~257M transistors for logic it looks to be a pretty capable chip.
-Sony: Oops, our experimental VPU is not working out for some reason or another. Solution? nVidia, what is your fastest, most feature-rich chip? OK, can you shrink that down to 90nm and up the frequency 25% or more? Oh, it has vertex shaders? Well, CELL can already do that, but the more the merrier!

The same approaches come to mind in the CPUs, but in a more complicated way. MS obviously wanted more FP performance from their CPU. The cache-lock feature clearly indicates they expect the CPUs to be generating 3D models. Add in the beefed-up VMX units. But they also wanted to keep things simple: symmetric cores, shared cache, etc. And then there was the issue of balance. MS has noted that something like 80% of game code is general purpose, but on the other hand your game engine may spend 80% of its execution time on FP-type tasks.

Sony is obviously taking a totally different approach with a stream processor. Just a lot of power there. Obviously more to juggle with more cores and an asymmetric design, but also a ton of potential. In some ways it is hard to compare the CPUs because they are so different... they are also different sizes. Die size aside, from a transistor perspective the DD2 CELL is 50% larger than the XeCPU (250M vs. 165M). Looking at the die, it seems the 3 cores and the cache each take up about 25% of the space. Heat, power consumption, etc. all aside, to bring the XeCPU up to a similar transistor count you could get 5 cores and 1MB of cache (~190 GFLOPs) or 4 cores and 2MB of cache (~150 GFLOPs). Total BS numbers, but it makes you go hmmm.

The same issue of more cores being more complicated applies, AND a stream processor is an attempted solution to the problem of making a lot of cores efficient... but I do think people would look at it differently if the XeCPU had 4 cores and 2MB of cache. So in that regard I DO expect CELL to be more powerful in general because it is larger... and as DeanoC noted, CELL is gonna FLY with procedural tasks. It is really designed with that type of work in mind! Yet Xenon is not chopped liver either... very efficient, very streamlined.

The Xbox 360 really reminds me of a super GCN. Obviously the Xbox 360 hardware compares more favorably to this-gen PC hardware than the GCN did, but the same philosophy of streamlining and maximizing potential seems to be there. The fact that MS did not splurge on the HDD, media ports, and so forth also reminds me of Nintendo. Quick, someone ring Redmond... did Yamauchi resign at Nintendo to secretly take control of MS's game division?! :LOL:

The PS3 sounds a lot like the PS2... but this time instead of the GS, Sony hired NV and got an NV2A :D They also decided to put in a lot of RAM with a lot of bandwidth to offset the lack of eDRAM. And we all heard how hard the PS2 was to program for, yet it did alright, I guess ;)

Guden said:
('only' 8 pixel pipes in GPU, in-order CPU cores with just 1M cache, 128-bit UMA memory etc)

My only nitpick... it does not have 8 pixel pipelines--really no pipelines at all. I think 3 shader arrays is more accurate. There's really no use comparing it to older architectures, since it is really different and WON'T be used in the same way. If it were in the PC space we could say it might be limited like an 8-pipe part at times, but overall it is pretty clear that never applies.

Xenos has 257M transistors and no video decoding or output (the scaler chip is external). The RSX (if like the G70) is 300M, but ~25M of that is PureVideo. That puts it at ~275M for logic. To compare:

RSX 275M
Xenos 257M => 7% difference
--------------
NV40 200M (222M with video)
R420 160M => 25% difference

R420 kept up with NV40 pretty well, yet the difference between Xenos and RSX is very marginal: ~20M transistors and 50MHz. And if we are to take ATI's claims at face value, Xenos is about 95% efficient, compared to the 50-70% efficiency of their current-gen offerings. Xenos is more adaptable to different games, different parts of a game, and even different stages of the rendering pipeline.

If there is a V8 anywhere in the Xbox 360, it is definitely the GPU!
 
Guden Oden said:
PS3 seems more like a big-block V8 with a blower mounted to it in comparison, more raw power but perhaps not as efficient. There's a certain appeal to that approach also though... :D
Achieves the same thing, but sounds SOOO much better :p

'Slaving' suggests the GPU tells the CPU what to do, but that doesn't sound too feasible to me. How would the GPU issue instructions to get the CPU to, say, process some vertices? I'd be more inclined to believe the actual concept is 'dedicating' a CPU to GPU assist rather than 'enslaving' it. The CPU would be supplied code, with perhaps 'guidelines' coming from the GPU, e.g. a displacement thread that gets data from Xenos via MemExport to process some weird procedural geometry. Without setting up the CPU this way, the GPU won't be able to commandeer it into processing something else.
 
I distinctly remember a picture of the XCPU showing its internals...

2 of the cores were quite near the FSB...

That's probably how it will be used...

the GPU can access it for certain needs...
 