GDC Europe

Ok, I have gone long enough without asking this.
Version, just who exactly are you?
I've been reading on here and PsiNext and you've been throwing out hints here and there and all over the place (especially at PsiNext).
I just wanted to know who you were. :D
 
mistan said:
Ok, I have gone long enough without asking this.
Version, just who exactly are you?
I've been reading on here and PsiNext and you've been throwing out hints here and there and all over the place (especially at PsiNext).
I just wanted to know who you were. :D

only a PS fan

 
mistan said:
Ok, I have gone long enough without asking this.
Version, just who exactly are you?
I've been reading on here and PsiNext and you've been throwing out hints here and there and all over the place (especially at PsiNext).
I just wanted to know who you were. :D

He's just a boy who likes to throw numbers and schematics around. Mostly false. Though lots of people have fallen for it and think he's in the industry. Just to set the record straight, he is not a game developer/programmer/coder/whatever.
 
version said:
compute again after TGS with 4 GHz :)

Um, the critical path in both the XeCPU and the Sony CELL is most likely going to be in the PPE. What makes you think that the same design in the same process is going to yield high enough to get an additional 25% in clock frequency?

In other words, keep dreaming. Both Sony and MS are using the same design for all intents and purposes, and the likelihood is that the XeCPU has more frequency headroom than CELL.

Aaron Spink
speaking for myself inc.
 
aaronspink said:
Um, the critical path in both the XeCPU and the Sony CELL is most likely going to be in the PPE. What makes you think that the same design in the same process is going to yield high enough to get an additional 25% in clock frequency?

In other words, keep dreaming. Both Sony and MS are using the same design for all intents and purposes, and the likelihood is that the XeCPU has more frequency headroom than CELL.

Aaron Spink
speaking for myself inc.

Riiiight, except that the Cell has reached in excess of 5 GHz in tests thus far, and to our knowledge the XeCPU never has. Not to mention that the XeCPU throws out a lot more heat than the Cell, even when both are at 3.2 GHz. I don't think that Cell *will* see a clockspeed increase in PS3, but I certainly feel that it would be more likely to see one than the XeCPU. I see the PPEs as the greater clockspeed bottlenecks - the SPEs should clock quite readily.
 
The CELL GFLOPS number of PS3 should be:

PPE: 3.2 GHz*1 VMX*4D VMX FMAC + 3.2 GHz *1 FPU*2D Paired-single FMAC
=3.2 GHz*1*8 FLOPs + 3.2 GHz*1*4 FLOPs
=25.6 GFLOPS + 12.8 GFLOPS

SPE: 3.2 GHz*7 SPEs*4D SPE FMAC
=3.2 GHz*7*8 FLOPs
=179.2 GFLOPS

PPE + SPEs = 25.6 + 12.8 + 179.2 = 217.6 GFLOPS
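
For anyone who wants to re-run the sum above with a different clock or unit count, here is a minimal C sketch. The function names are made up for illustration, and it assumes the usual convention that one FMAC per SIMD lane counts as 2 FLOPs per cycle, so a 4-wide unit gives 8 and a 2-wide paired-single unit gives 4.

```c
#include <assert.h>

/* Peak GFLOPS for a group of identical units:
   clock (GHz) * number of units * SIMD lanes * 2 (one FMAC = 2 FLOPs).
   Hypothetical helper, not taken from any vendor document. */
static double peak_gflops(double clock_ghz, int units, int lanes)
{
    assert(clock_ghz > 0.0 && units > 0 && lanes > 0);
    return clock_ghz * units * lanes * 2.0;
}

/* Total for the configuration in the post: one 4-wide PPE VMX unit,
   one 2-wide paired-single PPE FPU, and seven 4-wide SPEs. */
static double cell_peak_gflops(double clock_ghz)
{
    return peak_gflops(clock_ghz, 1, 4)   /* PPE VMX */
         + peak_gflops(clock_ghz, 1, 2)   /* PPE FPU */
         + peak_gflops(clock_ghz, 7, 4);  /* SPEs    */
}
```

Calling cell_peak_gflops(3.2) reproduces the 217.6 figure. This is a paper peak only; it says nothing about whether the units can actually be kept fed.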
 
xbdestroya said:
Riiiight, except that the Cell has reached in excess of 5 GHz in tests thus far, and to our knowledge the XeCPU never has. Not to mention that the XeCPU throws out a lot more heat than the Cell, even when both are at 3.2 GHz. I don't think that Cell *will* see a clockspeed increase in PS3, but I certainly feel that it would be more likely to see one than the XeCPU. I see the PPEs as the greater clockspeed bottlenecks - the SPEs should clock quite readily.

Looking at shmoo plots from ISSCC and predicting yieldable frequencies is pretty pointless. Pentium 4s have reached well beyond 5 GHz, yet you don't see Intel selling them. Likewise, K8s have reached well in excess of 3 GHz, yet you don't see AMD selling them.

A shmoo plot is interesting in a technical sense, but you have to be careful to examine the practical aspects of the ranges used to generate the shmoo plot.

The heat of both XeCPU and CELL are fairly indeterminate but if you want to believe that XeCPU throws out a lot more heat than CELL then be my guest, but you'll be doing it on jack all for data.

Look, the PPE design, which you assume has the greater clockspeed bottlenecks, is pretty much the exact same design between X360 and PS3, for the most part down to the polygon level. On a given process they are both going to be within spitting distance of each other in frequency.

Aaron Spink
speaking for myself inc.
 
cho said:
The CELL GFLOPS number of PS3 should be:

PPE: 3.2 GHz*1 VMX*4D VMX FMAC + 3.2 GHz *1 FPU*2D Paired-single FMAC
=3.2 GHz*1*8 FLOPs + 3.2 GHz*1*4 FLOPs
=25.6 GFLOPS + 12.8 GFLOPS

SPE: 3.2 GHz*7 SPEs*4D SPE FMAC
=3.2 GHz*7*8 FLOPs
=179.2 GFLOPS

PPE + SPEs = 25.6 + 12.8 + 179.2 = 217.6 GFLOPS

All nice and good, but this doesn't allow the PPE to load any data. Either you get the other VMX unit/FP unit or you load data. Pretty much the most trivial loop is a DAXPY loop, which requires a load per FMAC. Unless the PPE in Cell has become a 3-issue design, the real peak for the PPE is pretty much 25.6 GFLOPS.
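
For reference, a minimal sketch of the DAXPY loop mentioned above, in plain C rather than VMX code (the function name and signature here are just illustrative). The point to notice is that each iteration does one multiply-add but also has to load x[i] and y[i] and store the result, so the memory operations compete with the arithmetic for issue slots.

```c
#include <assert.h>
#include <stddef.h>

/* DAXPY: y[i] = a * x[i] + y[i] for i in [0, n).
   Per element: two loads (x[i], y[i]), one multiply-add, one store. */
static void daxpy(size_t n, double a, const double *x, double *y)
{
    assert((x != NULL && y != NULL) || n == 0);
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

On a 2-issue core, a cycle spent issuing a load or store is a cycle that cannot issue a second arithmetic op, which is the argument being made here.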

Aaron Spink
speaking for myself inc.
 
aaron said:
So sustained is actually: 204.8 GFLOPs

Oh come on - for one, calling anything that assumes code composed of 100% MADDs "sustained" is more than a little silly.
And second, if we DO write an idealized benchmark that will get that kind of utilization, I can do a LOT better than a 1:1 ratio with non-arithmetic ops. So it would definitely be higher than 25.6 GFLOPS if we could truly dual-issue VMX MADDs.

But on that note - I'd like to know a bit more about this idea of double FP execution units on the PPE VMX. I know version has been dreaming about this since DD2 was first announced, but I still find it hard to swallow.
Although, given how similar the die-graphs for PPE and PPX look, and with PPX using 4x larger register files, it's a good question what the extra budget is spent on in PPE.
But since both MS and Sony claim >8 flops/cycle on their PPC core, I always assumed that whatever creative PR is at work, it's the same for both CPUs.

aaronspink said:
In other words, keep dreaming. Both sony and MS are using the same design for all intents and purposes and the likely hood is that the XeCPU has more frequency headroom than CELL.

Well, people have been advertising PPX cores as being more complex since day one, and with three of them, wouldn't that make XeCPU less likely to scale? I mean, if, like you say, it's the PPC cores that limit the clock speed most.
 
Deano, you say no special stuff for branches, but what do you think about this:

ISA support to eliminate branches
The SPU ISA defines compare instructions to set masks that can be used in three
operand select instructions to create efficient conditional assignments. Such conditional
assignments can be used to avoid difficult-to-predict branches.
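
As a rough illustration of the compare-then-select idea in that excerpt, here is a plain C analogue working on one 32-bit value instead of a SIMD register. These are not SPU intrinsics; the function names are invented for the sketch, and a real compiler is free to turn this into whatever it likes.

```c
#include <assert.h>
#include <stdint.h>

/* "Compare" step: all-ones mask when a > b, all-zeros otherwise. */
static uint32_t cmp_gt_mask(int32_t a, int32_t b)
{
    return (uint32_t)0 - (uint32_t)(a > b);
}

/* "Select" step: take bits from if_true where mask is 1,
   and from if_false where mask is 0. */
static uint32_t select_bits(uint32_t if_false, uint32_t if_true, uint32_t mask)
{
    return (if_false & ~mask) | (if_true & mask);
}

/* Conditional assignment with no branch: max(a, b).
   The cast back to int32_t assumes the usual two's complement conversion. */
static int32_t max_branchless(int32_t a, int32_t b)
{
    uint32_t mask = cmp_gt_mask(a, b);
    assert(mask == 0u || mask == UINT32_MAX);
    return (int32_t)select_bits((uint32_t)b, (uint32_t)a, mask);
}
```

Both candidate values are computed and the mask picks one, so there is nothing for the hardware to mispredict.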

and this:

1.7. Programmer Directed Branch Prediction
Branch prediction can be significantly improved by using feedback-directed optimization. However, feedback-directed optimization is not always practical in situations where typical data sets do not exist. Instead, programmer-directed branch prediction is provided using an enhanced version of GCC’s __builtin_expect function.
int __builtin_expect(int exp, int value)
Programmers can use __builtin_expect to provide the compiler with branch prediction information. The return value of __builtin_expect is the value of the exp argument, which must be an integral expression. For dynamic prediction, the value argument can be either a compile-time constant or a variable. The __builtin_expect function assumes that exp equals value.
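
A small sketch of how such a hint is typically written. This uses the ordinary __builtin_expect from mainline GCC/Clang with a constant hint, on the assumption that the SPU variant behaves the same way for that case; the variable-hint "dynamic prediction" described in the excerpt is specific to the SPU toolchain and is not shown. The LIKELY/UNLIKELY macro names and the example function are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical convenience wrappers around the GCC/Clang builtin. */
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Sum the non-negative entries; negative entries are hinted as rare,
   so the compiler can lay out the loop for the common path. */
static long sum_non_negative(const int *values, size_t n)
{
    long total = 0;
    assert(values != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        if (UNLIKELY(values[i] < 0)) {
            continue;  /* hinted cold path */
        }
        total += values[i];
    }
    return total;
}
```

The hint only affects code layout and prediction, never the result: the builtin returns the value of its first argument unchanged.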
 
Peter Hofstee:

So far we haven't disclosed exactly what the power of cell (at any frequency) is. Usually, once you are at the point where voltage has to be increased to achieve higher frequencies, the power goes with the cube of the operating frequency (and worse if leakage and tunneling play a significant role). Thus, for example, an increase in frequency from 3.2 to 4GHz, should result in nearly 2x the power.

The design philosophy of Cell was to design for as high an operating frequency as we could without making the processor inefficient ( if 1% more performance costs more than 3% power you would know for sure you've gone too far ) and then achieve maximum operating efficiency by running the processor at its minimum operating voltage.

Some of the graphs we've shown indicate an operating frequency somewhat over 3GHz at the minimum operating voltage. Personally I think that it is better to go to SMP configurations ( like the 2-way blade prototype IBM has shown ) if you have a higher power budget.
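
The cube-law estimate in the first paragraph of that statement is easy to check. A minimal sketch, assuming power scales exactly with the cube of frequency and ignoring the leakage and tunneling effects he mentions (the function name is made up):

```c
#include <assert.h>

/* Relative power when moving from f_old to f_new, under the assumption
   that power grows with the cube of operating frequency. */
static double relative_power_cubic(double f_old_ghz, double f_new_ghz)
{
    assert(f_old_ghz > 0.0 && f_new_ghz > 0.0);
    double ratio = f_new_ghz / f_old_ghz;
    return ratio * ratio * ratio;
}
```

Going from 3.2 to 4 GHz is a ratio of 1.25, and 1.25 cubed is about 1.95, which matches the "nearly 2x the power" figure.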
 
Some reporting on the Harrison presentation:

http://www.next-gen.biz/index.php?option=com_content&task=view&id=928&Itemid=2

Not a whole lot of detail there; perhaps there wasn't much to report. Hopefully transcripts or videos will go up at some stage, but in the meantime, I hope next-gen or the like report on the other presentations too.

edit - Another article from Gamasutra. Seems Harrison really held back on detail; there's little really new there:

http://www.gamasutra.com/php-bin/news_index.php?story=6379
 
aaronspink said:
Looking at shmoo plots from ISSCC and predicting yieldable frequencies is pretty pointless. Pentium 4s have reached well beyond 5 GHz, yet you don't see Intel selling them. Likewise, K8s have reached well in excess of 3 GHz, yet you don't see AMD selling them.

A shmoo plot is interesting in a technical sense, but you have to be careful to examine the practical aspects of the ranges used to generate the shmoo plot.

The heat of both XeCPU and CELL are fairly indeterminate but if you want to believe that XeCPU throws out a lot more heat than CELL then be my guest, but you'll be doing it on jack all for data.

Look, the PPE design, which you assume has the greater clockspeed bottlenecks, is pretty much the exact same design between X360 and PS3, for the most part down to the polygon level. On a given process they are both going to be within spitting distance of each other in frequency.

Aaron Spink
speaking for myself inc.

Well, the heat thing isn't something I'm just taking a stab at, however; it comes from this Q&A from an interview with IBM's Paul McKenney.

I have read that the Cell processor was designed in part to run an RTOS -- I guess that's obvious, given its gaming focus. What other embedded processors are interesting right now?

Paul: ARM is heavily used for embedded and Linux runs on both 64-bit and 32-bit ARM. There are even SMP ARM parts out there.

It really blew my mind when I first saw that -- a four-core, single-chip ARM, running at 350 and 550MHz, providing 1,440 Dhrystone MIPS, all on 600 milliwatts of power.

Of course, compare that to our PowerPC processor for Xbox which does 700 times as many floating-point operations per second as the four-core ARM does integer operations per second. It has only three cores instead of four, but it does use quite a bit more power. Still, 85 watts is well within range for a consumer device and not that long ago you couldn't buy a supercomputer that could do what PowerPC can now do, regardless of how much power you had available.

But again, the important question is "what does your application need?" If you're running off a battery, the ARM processor we just talked about is high power. You get only a few minutes of that kind of power from a D cell. On the other hand, if you have a wall outlet, 85 watts is trivial -- less than an amp.
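
The two numbers in that answer can be sanity-checked with a couple of lines of arithmetic. This is only a sketch: the helper names are invented, the 120 V mains voltage is my assumption (the interview does not state one), and DMIPS are treated loosely as millions of integer operations per second.

```c
#include <assert.h>

/* Current in amps for a given power draw at a given supply voltage. */
static double current_amps(double power_watts, double supply_volts)
{
    assert(supply_volts > 0.0);
    return power_watts / supply_volts;
}

/* Implied floating-point rate in GFLOPS when a part does `multiplier`
   times as many FLOPS as another part does integer ops (given in MIPS). */
static double implied_gflops(double multiplier, double mips)
{
    return multiplier * mips / 1000.0;
}
```

With those assumptions, 85 W at 120 V is roughly 0.71 A, and 700 times 1,440 MIPS works out to about 1,008 GFLOPS, i.e. around a teraflop.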

So the implication here is that the XeCPU is an 85-watt TDP part; well above Cell at the same frequency, I would imagine, even though we only have shmoo plots for the SPEs.

http://www-128.ibm.com/developerworks/power/library/pa-nl14-directions.html
 
Fafalada said:
Oh come on - for one, calling anything that assumes code composed of 100% MADDs "sustained" is more than a little silly.

Hey, don't tell me, I'm not the one who came up with the convention of counting MADs as 2 ops.

And second, if we DO write an idealized benchmark that will get that kind of utilization, I can do a LOT better than a 1:1 ratio with non-arithmetic ops. So it would definitely be higher than 25.6 GFLOPS if we could truly dual-issue VMX MADDs.

It is pretty much impossible to do better than a 1:1 ratio with non-arithmetic ops. Basically you run into the input problem. You could, say, set up a test case where you load the value for 1 register and then use the rest of the 32 registers for constants, but in that case you pretty much have a bad optimization rather than high performance.

In order to do a mathematical operation, you require a minimum of 1 input per operation. This is pretty much the DAXPY case. While within a short burst it is often possible to see a linear progression of FP ops, this will only be the case because you've pre-loaded the operands.

Aaron Spink
speaking for myself inc.
 