Ars Technica part 2: Inside the Xbox 360

DemoCoder said:
Sony had always planned to build a GPU. They actually built one, but for various reasons, they had to go with the RSX. There was never any realistic chance IMHO that the PS3 was designed for pure software rendering. The SPEs can in no way compete with GPU pixel pipelines in terms of fetching texels, filtering, and doing ROP operations.

The original patent had a half-GPU stuck on the end of 2 CELLs that just did rasterization didn't it? Something like a 4 CELL CPU, and a 2 CELL + rasterizer back end GPU.
 
I think going with Nvidia was a better choice for Sony than going with their own solution... more powerful or not, having Nvidia's GPU made the PS3 a lot easier to develop for, at least..
 
Agreed - as discussed before in another thread, gaining access to NVidia's development tools is just too much of an advantage to pass up when competing against someone like Microsoft, who is very strong in that area.

I personally agree, aaa. I think Sony probably had a 'shoot the moon' strategy brewing in their heads based around that patent, but they likely realized fairly quickly that it simply wouldn't be feasible for the PS3.
 
aaaaa00 said:
The original patent had a half-GPU stuck on the end of 2 CELLs that just did rasterization didn't it? Something like a 4 CELL CPU, and a 2 CELL + rasterizer back end GPU.
It was just wishful patenting :)
 
BTW, here is a paper that shows how force integration can be run on 1,024 processors on a 3+ million triangle scene. The speedup is *superlinear*. Or, for collision detection:

The parallel efficiency on 1,500 processors is 65% (mostly synchronization overhead), for a speedup of 980. The dataset used for this run was almost 100 million polygons.

http://charm.cs.uiuc.edu/papers/ArrayJournal03.www/

I think this shows that all the components of physics (collision detection, constraint solving, and integration) can be parallelized. 65% may sound "low", but for many concurrent problems communication overhead is worse, and it is not unheard of to see much lower sub-linear gains. 65% is quite nice, and the synchronization overhead on the SPEs is likely to be lower than on ASCI Red, given that you have a shared embedded communication system (FlexIO) versus ASCI Red's network-based sharing. Also, game datasets are likely to use much more special-purpose data structures.
 
It has to be noted that parallelizing something to run on some SPEs is not enough to achieve good performance; it is a necessary but not sufficient condition.
A lot of microthreads running on one or more SPEs will hardly exploit CELL's power, as thread switching is not a first-class operation on SPEs; moreover, thread code has to be written to leverage the SPEs' strengths (no branchy code that the compiler can't predict and hint).
Branches should be removed (with predication) or batched (exploiting DLP)
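As a minimal sketch of the predication idea, here is a branch-free select in plain, portable C (standing in for the SPE's compare-and-select-bits intrinsics; the function name is purely illustrative):

```c
#include <stdint.h>

/* Branch-free select, the scalar analogue of SPE-style predication:
   turn the comparison into an all-ones/all-zeros mask, then blend the
   two candidate values with bitwise ops -- there is no branch left for
   the compiler (or the SPE's static hinting) to get wrong. */
static int32_t select_if_less(int32_t x, int32_t y, int32_t a, int32_t b) {
    int32_t mask = -(int32_t)(x < y);   /* 0xFFFFFFFF if x < y, else 0 */
    return (a & mask) | (b & ~mask);    /* a when x < y, otherwise b   */
}
```

On a real SPE the same pattern runs four lanes wide per instruction; the scalar version just shows control flow being turned into data flow.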
 
nAo said:
aaaaa00 said:
The original patent had a half-GPU stuck on the end of 2 CELLs that just did rasterization didn't it? Something like a 4 CELL CPU, and a 2 CELL + rasterizer back end GPU.
It was just wishful patenting :)

Looking back at 1999, who knew that fixed function, and then programmable, GPUs would take off like they did?

This is not directed at you, nAo... while thinking about that question I did some digging on the 1999/2000 timeframe and what the industry was like. Looking at it from a 1999 perspective, I would say that Sony was probably at least semi-seriously exploring the BE option. Not setting their hearts on it, mind you, but looking at how things had developed it was definitely an option worth looking at.

In 1997 we have the Riva 128, and in 1998 the Riva 128 ZX. 1998 also brought the Voodoo 2. (I actually had a Riva 128 paired with a Voodoo 2 on a PII 233MHz w/ MMX and 128MB of PC100 memory... boy that was fast!) The TNT came out in late 1998. In spring of 1999 we saw the TNT 2, and 3Dfx shipped the Voodoo 3.

In 1999 we get the first GeForce (GeForce256) which basically moved the geometry load (T&L) to the GPU from the CPU. And it was not until the GF3 and DX8 (in early 2001?) that we got our first taste of programmable shaders.

From Sony's perspective, when they first mentioned the PS3 and "CELL" (was that in 1999?), GPUs were still in their infancy. They were only beginning to migrate away from fixed function, so maybe a really fast CPU with a lot of floating-point performance would be much more flexible, and would be best for letting developers decide how much to use for graphics versus general game code. In theory it sounds like a great idea, at least. And the fact that CELL is a streaming architecture, very similar to how a modern GPU works, would make it the best way to get a CPU to do GPU-like tasks.

We know that Intel overestimated where chip frequencies would be. I believe we were projected to surpass 8GHz a while back. Chip makers have also hit leakage issues that have resulted in increased heat and power consumption. While these were to be expected, I am sure that not even Intel planned on a 100W+ CPU.

And back to CELL... the SPEs are called "synergistic processing elements". Right now I believe the SPEs are SIMD, but you could put other stuff there like a scalar unit... or a rasterizer?

So if we had reached 65nm, and clock speeds were up in the 8GHz-10GHz range, we would be looking at ~2 CELLs in the same area as 1, with 2-3x the frequency... that gets us into the TFLOPs range. Add in another pair (instead of the RSX GPU) and that is 2 TFLOPs. Interestingly, that is the same claimed total programmable performance of the PS3. Specialize some of the SPEs for raster work... voila! Broadband Engine!

Now that never happened, and maybe Sony never planned on it happening. It could have been a twinkle in Kutaragi's eye. It could have been a "Plan B if everything lines up... and if it does not lineup our Plan A is a traditional GPU while having CELL as the CPU" or vice versa. Maybe Toshiba had a CELL based GPU in the works and also a more traditional GPU design, but Sony opted for NV because of better IPs, NV was in a crunch and would give them a great deal, and most importantly the tools.

Maybe NV/Sony approached each other way back in 2002 when MS/NV were having a pricing tiff. I surely do not know :D

But as SLOW as this industry seems to move at times (Come on DAVE! We want that Xenos article!!), looking back at the last 5 years gives you a lot of perspective. A LOT has changed, and very quickly. We went from less than 1GHz to over 3GHz quite fast... and then hit a wall. We went from very basic GPUs to including T&L, then to offering programmable GPUs. All of that in the initial planning stages of the PS3.

To move forward to today... think about what the next 5-6 years will bring. I bet Sony and MS will put out quite a few patents of ideas, especially over the next 3 years. Then they will select the best couple of scenarios and begin moving forward with designs (each design probably having more than one variant, e.g. more CPU cores with less cache and less fast memory, or fewer CPU cores with more cache and more slow memory).

Who knows, we may eventually see the PS3 vision in PS4 :D

Me on the other hand... I have always been a big fan of GPUs. The first computer I paid for out of my own pocket had a Riva 128 and eventually a Voodoo2. Seeing how the GPU market has blossomed, and how there is a strong current toward more flexibility and programmability, I do not see them going anywhere, especially given how they have pushed memory development and efficiency. So I have always been on the GPU bandwagon, I guess... but only because of what my eyes were showing me. What Sony showed of CELL at E3 seemed to indicate it is no weakling... so who knows. If not PS4, maybe PS5!
 
bbot said:
Fafalada said:
aaaa0 said:
Major Nelson vindicated?
In fairy land maybe.
Compared to cramming the most processing-intensive physics components into VU0 (and I know I'm not the only one to have done that on PS2), using SPEs for the same job(s) is a walk in the park.

bbot said:
Go back and read it now.
I did, still sounds retarded :p


Exactly what part sounds retarded? Care to enlighten me?


I'm waiting.
 
nAo said:
Branches should be removed (with predication) or batched (exploiting DLP)

Right. For example, searching down an oct-tree: at each node do a SIMD (parallel) compare on the x, y, z components, a mask and a vertical OR, and you instantly have the index for the pointer to the next node, without a single branch.

Batching is going to add some overhead, but there are 7 SPUs so I'm guessing it's a wash between CELL and XeCPU.

Cheers
Gubbi
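A scalar sketch of Gubbi's oct-tree step, in plain C (the function name is made up for illustration; a real SPE version would do the three compares in a single SIMD instruction):

```c
/* Branch-free oct-tree descent: compare the query point against the
   node centre on each axis, turn each compare into one bit, and OR the
   bits into a child index 0..7 -- no branches, just compares and ORs. */
static int octree_child(const float p[3], const float centre[3]) {
    int ix = (p[0] >= centre[0]);       /* bit 0: which half in x */
    int iy = (p[1] >= centre[1]) << 1;  /* bit 1: which half in y */
    int iz = (p[2] >= centre[2]) << 2;  /* bit 2: which half in z */
    return ix | iy | iz;                /* index into node->child[8] */
}
```

The index feeds straight into a child-pointer array, so the whole descent loop contains no data-dependent branches at all.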
 
At any rate, Playstation 3 fanboys shouldn't get all flush over the idea that the Xenon will struggle on non-graphics code. However bad off Xenon will be in that department, the PS3's Cell will probably be worse. The Cell has only one PPE to the Xenon's three, which means that developers will have to cram all their game control, AI, and physics code into at most two threads that are sharing a very narrow execution core with no instruction window. (Don't bother suggesting that the PS3 can use its SPEs for branch-intensive code, because the SPEs lack branch prediction entirely.) Furthermore, the PS3's L2 is only 512K, which is half the size of the Xenon's L2. So the PS3 doesn't get much help with branches in the cache department. In short, the PS3 may fare a bit worse than the Xenon on non-graphics code, but on the upside it will probably fare a bit better on graphics code because of the seven SPEs.
Isn't one of the Cell PPE's intended primary roles to orchestrate the SPEs, and to replace the majority of the control logic cluttering a conventional CPU with programmable power?
 
Squeak said:
....the majority of the control logic cluttering a conventional CPU....

Another way of saying it is that they took out most of the smarts that make modern CPUs easy to program.

Moving complexity to software in order to fit as many cores on a die as possible.

Cheers
Gubbi
 
There seems to be a particular interest in the VMX128 units on XENON... does anyone know exactly what the differences are between a normal VMX unit and the VMX128? Another question: as the name implies, does the VMX128 support 128-bit precision, or 64-bit precision? I have my theories... but...

One other question... does anyone know how many DOT products per cycle each SPE can process? I don't think it is 1 DOT product per cycle as it takes 4 VMX operations to process a DOT product if I understand that correctly.

Finally... here is an interesting theory: what if the 115 GFLOPs claimed for XENON was actually referring to DOUBLE PRECISION GFLOPs instead of single precision? I do know that if they want to do procedural synthesis they will need more than 32-bit precision, and from what I understand these VMX128 units are the heart of the procedural synthesis technology. One more thing... Microsoft's current specification list says there is only 1 VMX128 unit per core, and this article claims 2 VMX128 units per core... I wonder which is correct? Or was this a last-minute change that Microsoft was hiding until recently...

I am going to need to look further into this mysterious VMX128 unit...

The GameMaster...
 
The GameMaster said:
One other question... does anyone know how many DOT products per cycle each SPE can process? I don't think it is 1 DOT product per cycle as it takes 4 VMX operations to process a DOT product if I understand that correctly.
It takes 4 instructions to process 4 DOT4s, 3 instructions to process 4 DOT3s, or 2 instructions to process 4 DOT2s, so a SPE can do 1 DOT4 per cycle, 1.33 DOT3s per cycle, and 2 DOT2s per cycle.

Finally... here is an interesting theory... what if that 115 GFLOPs that was claimed on XENON was actually referring to DOUBLE PRECISION GFLOPs instead of single precision.
It's a wrong theory; the 115 Gflop/s figure is about single-precision math.
I do know that if they want to do procedural synthesis that they will need more than 32-bit precision
why?
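nAo's 4-instructions-for-4-DOT4s rate relies on a structure-of-arrays layout: keep all the x components together, all the y components together, and so on, so each step maps onto one 4-wide multiply-add. A plain-C sketch of the idea (function name is illustrative):

```c
/* Four dot4 products in one mul plus three madds. Each loop iteration
   here is one SIMD lane; on a SPE the loop body would be 4 instructions
   operating on 128-bit registers rather than 4 scalar iterations. */
static void dot4_x4(const float ax[4], const float ay[4],
                    const float az[4], const float aw[4],
                    const float bx[4], const float by[4],
                    const float bz[4], const float bw[4],
                    float out[4]) {
    for (int i = 0; i < 4; i++) {
        out[i]  = ax[i] * bx[i];   /* multiply:     instruction 1 */
        out[i] += ay[i] * by[i];   /* multiply-add: instruction 2 */
        out[i] += az[i] * bz[i];   /* multiply-add: instruction 3 */
        out[i] += aw[i] * bw[i];   /* multiply-add: 4 total, 4 dot4s */
    }
}
```

Drop the w step and you get 4 DOT3s in 3 instructions, matching the 1.33-per-cycle figure above.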
 
The GameMaster said:
Finally... here is an interesting theory... what if that 115 GFLOPs that was claimed on XENON was actually referring to DOUBLE PRECISION GFLOPs instead of single precision.

Interesting theory. But... No. Just... No.

1) 115 GFLOPS of double precision is closing in on the supercomputer realm, for one CPU.

2) If that was the case, don't you think MS would have shoved that piece of information down our throats over and over again till we believed our middle name is Double Precision?
 
Thanks for clearing that up for me, nAo... but yeah, it was just an interesting theory that crossed my mind as I was reading that article. Again, though: does anyone know what the differences are between regular VMX, such as the unit used in Cell, and the VMX128 unit in XENON? Anyway, it *IS* important that XENON is capable of higher than 32-bit precision, as it will be dealing with graphics-related functions, but I will assume that the 115 GFLOPs is single precision unless I find evidence to the contrary... because if that were double precision, it would imply roughly 1.15 TFLOPs of single-precision performance (well, actually Microsoft did state their hardware has over a TFLOP of performance, so it *MAY* be possible they were referring to only the CPU). I have no proof of this at the moment...

Additionally it seems that the VMX128... according to that article...
The exact number of VMX-128 execution units is not yet known, but it's probably at least two, and possibly three. There is probably one VMX ALU that handles most vector arithmetic operations (add, multiply, multiply-add, etc.). There is also probably a separate unit for handling vector permutes and pack/unpack operations. This unit would also handle VMX-128's new, 3D-oriented, single-cycle dot product instruction(s), which allows programmers to perform dot products on arrays in a much more efficient and programmer-friendly manner. This new vector unit functionality is described quite well in the following portion of the Microsoft patent application referenced in Part I of this series:
Each VMX128 unit seems to be able to process 1 DOT product per cycle according to the article as well as one of the patents that Microsoft has...

Ok so if it takes 4 cycles to process a DOT4 product per SPE (it sure as heck can't do 1 DOT4 product in 1 cycle)... and with 8 VMX units that should give the Cell CPU roughly 6.4 Billion DOT products per second and another 35.2 Billion DOT products per second with the RSX (assuming 24 pixel pipelines and 8 vertex pipelines and the fact that each pipeline should be capable of 2 DOT products per cycle). Is that an accurate statement to make?

Now Microsoft listed 1 VMX128 as part of their specification list... if there is indeed TWO VMX128 units now instead of one, would that effectively double the FLOPs performance of XENON? More specifically we need to verify that article's claim that there is 2 VMX128 units instead of one...

But considering the information that has come out *AFTER* the unveiling of the system and its rough specifications, Microsoft has been systematically understating their machine's specifications. I am sure we all can agree on these three points, though...

*XENON (The TriCore CPU) is more powerful than we first thought and at the same time is more Cell-like than we first thought WITH the exception being the VMX128 which will be important for procedural synthesis.
*XENOS (The GPU) is more powerful than we first thought.
*The software technology employed is more impressive than we first thought (especially the procedural synthesis technology).

The GameMaster...
 
The GameMaster said:
...but yea I will assume that the 115 GFLOPs is using single precision unless I find evidence to the contrary... because if that was double precision that would be roughly 1.15 TFLOPs performance of single precision (...well actually Microsoft did state their hardware has over a TFLOP of performance, so it *MAY* be possible they was referring to only the CPU) I have no proof of this at the moment...

GameMaster, have you lost your mind? I can't believe you are even entertaining the possibility of 115 GFLOPs DP on this chip, let alone 'upconverting' to 1.15 TFLOPs of SP performance. It just doesn't scale that way. Just because the SPEs on Cell drop from full SP performance to roughly 10% of it in DP doesn't mean that *every* chip does.
 
The GameMaster said:
Ok so if it takes 4 cycles to process a DOT4 product per SPE (it sure as heck can't do 1 DOT4 product in 1 cycle)... and with 8 VMX units that should give the Cell CPU roughly 6.4 Billion DOT products per second
Read again. It takes 4 cycles for four dot4 products per SPE. For a total of 25.6 billion dot4 products ...or 34.1 billion dot3...or 51.2 billion dot2...
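Those totals are just units × clock × dot-products-per-cycle, assuming the 3.2GHz clock and 8 vector units (PPE VMX + 7 SPEs) being discussed in this thread:

```c
/* Peak dot-product rate in billions per second:
   units * clock-in-GHz * dot-products-per-cycle-per-unit. */
static double gdots_per_sec(int units, double ghz, double per_cycle) {
    return units * ghz * per_cycle;
}

/* 8 units * 3.2GHz * 1    dot4/cycle ~= 25.6 Gdot4/s
   8 units * 3.2GHz * 1.33 dot3/cycle ~= 34.1 Gdot3/s
   8 units * 3.2GHz * 2    dot2/cycle ~= 51.2 Gdot2/s */
```

Peak figures only, of course; sustained rates depend on keeping the pipelines fed.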
 
I did state it was a theory with no information that can back up that theory at the moment. Don't get so riled up over nothing... :LOL:

Thowllly said:
The GameMaster said:
Ok so if it takes 4 cycles to process a DOT4 product per SPE (it sure as heck can't do 1 DOT4 product in 1 cycle)... and with 8 VMX units that should give the Cell CPU roughly 6.4 Billion DOT products per second
Read again. It takes 4 cycles for four dot4 products per SPE. For a total of 25.6 billion dot4 products ...or 34.1 billion dot3...or 51.2 billion dot2...
I must be getting my math messed up somewhere, or there is something else that I am missing detail-wise... looks like I have to do more research. If it takes 1 cycle to process 1 DOT product per SPE with a single normal VMX... then how many DOT products can the VMX128 process per cycle? There seems to be more information we do not know about in regards to XENON. Back to the grind!

The GameMaster...
 
The GameMaster said:
But again though does anyone know what the differences is between regular VMX such as the one used in the Cell and this VMX128 unit that is in XENON?
VMX on Xenon should be somewhat extended, with extra instructions handling dot products, data packing and unpacking, etc.
Anyway it *IS* important that XENON is capable of higher than 32 bit precision as it will be dealing with graphics related functions
Why is it important?
You keep repeating that without telling us why it is.

Ok so if it takes 4 cycles to process a DOT4 product per SPE (it sure as heck can't do 1 DOT4 product in 1 cycle)...
You should carefully read what I wrote ;) I said a SPE can process FOUR DOT4s in 4 instructions.
Can a SPE have a throughput of ONE DOT4 per clock cycle? Yes it can!
After all, a single-cycle dot product on a CPU is "NOT" something new or special - we've had this capability in mainstream CPUs at least since 1997/98 (DC)


and with 8 VMX units that should give the Cell CPU roughly 6.4 Billion DOT products per second and another 35.2 Billion DOT products per second with the RSX (assuming 24 pixel pipelines and 8 vertex pipelines and the fact that each pipeline should be capable of 2 DOT products per cycle). Is that an accurate statement to make?
No, it isn't
CELL doesn't have 8 VMX units, it has 1 VMX unit and 7 SPEs. SPEs are not VMX units.
Nevertheless, CELL can do 8 DOT4s per clock cycle -> 25.6 GDot4/s.
That is almost 3 times what the Xenon CPU can do on dot products.
Does it tell us something about the relative power of these 2 processors? Not much.
 
The GameMaster said:
But again though does anyone know what the differences is between regular VMX such as the one used in the Cell and this VMX128 unit that is in XENON?

We don't really know what VMX in Cell looks like; actually, the second revision seems to have seen VMX tweaked a lot. It may be very similar to the VMX in X360, minus some of the instructions (like a dot product instruction).

The GameMaster said:
Anyway it *IS* important that XENON is capable of higher than 32 bit precision as it will be dealing with graphics related functions, but yea I will assume that the 115 GFLOPs is using single precision unless I find evidence to the contrary... because if that was double precision that would be roughly 1.15 TFLOPs performance of single precision (...well actually Microsoft did state their hardware has over a TFLOP of performance, so it *MAY* be possible they was referring to only the CPU) I have no proof of this at the moment...

That 1Tflop figure is CPU + GPU + everything else. That 115Gflops figure is single precision ops, no doubt about it. In fact, the WatchImpress Allard interview seems to suggest they took a lot if not all of the DP ALUs out of X360's CPU. X360 doesn't need DP for "graphics related work" - if you're talking about procedural vertex generation, for that 32-bit is fine. Maybe if you were dealing with pixel ops you might like higher precision, but I don't think that's the suggestion.

The GameMaster said:
Each VMX128 unit seems to be able to process 1 DOT product per cycle according to the article as well as one of the patents that Microsoft has..

I believe every VMX unit can do a dot product per cycle? A dot product is 7 floating-point ops for a 4-component vector anyway, and every VMX unit can do 8 ops per cycle. You may have to jig them around, but I believe it's possible to effectively get 1 per cycle (?). X360's VMX does one per cycle per unit too: 3 VMX units give 3 per cycle * 3200MHz = ~9bn per sec, as per the spec. It's just all packaged up in one instruction versus multiple instructions, but that's an issue of convenience rather than performance.

The GameMaster said:
Ok so if it takes 4 cycles to process a DOT4 product per SPE (it sure as heck can't do 1 DOT4 product in 1 cycle)... and with 8 VMX units that should give the Cell CPU roughly 6.4 Billion DOT products per second and another 35.2 Billion DOT products per second with the RSX (assuming 24 pixel pipelines and 8 vertex pipelines and the fact that each pipeline should be capable of 2 DOT products per cycle). Is that an accurate statement to make?

No, I don't think so at all. Again, you should effectively be able to get 1 per cycle per SPE. Previously, with a 4Ghz Cell chip, I believe people were estimating 36bn dot products per second on paper (PPE + SPEs). Apply the same math to the PS3 configuration, you'd get 25.6bn dot products per sec.

The GameMaster said:
Now Microsoft listed 1 VMX128 as part of their specification list... if there is indeed TWO VMX128 units now instead of one, would that effectively double the FLOPs performance of XENON? More specifically we need to verify that article's claim that there is 2 VMX128 units instead of one...

I think you're confusing execution units within one VMX unit with VMX units. Like people were confusing ALUs within ALUs in Xenos.

The GameMaster said:
*XENON (The TriCore CPU) is more powerful than we first thought and at the same time is more Cell-like than we first thought WITH the exception being the VMX128 which will be important for procedural synthesis.

I'd actually disagree. The flops tweaking with the FPUs was kind of to be expected, but I'm more surprised by the suggestion the X360's CPU seems to have made many of the same tradeoffs as Cell - it apparently really isn't much better for "general purpose processing" (whatever that is) - but without anywhere near the same kind of gain in other areas as we get for those tradeoffs on Cell.

edit - nAo and others beat me to some of my points..
 