Activision Developers take on Cell

Status
Not open for further replies.

gosh

Newcomer
"1. The Reason MS didn't "correct" their GFLOP evaluation is because a SYSTEM cannot be described as a mere Additive resolution of its constituent parts. Sony is basically lying.

2. The SPEs do NOT have the described memory bandwidth - they CANNOT really access memory in that way, as they have to DMA from main memory into the local store BEFORE it can be used.

3. X360 is IN ORDER, this removed a HUGE amount of the silicon AND thus removed a LARGE proportion of the G5's heat issues; the bulk of code written for Macs is simply Mac-centric - the code written for X360 is X360 SPECIFIC and thus can be far better optimised by the compilers and the programmers.

4. Total system bandwidth does not provide a valid comparison - stop doing it, it's like saying look... my house has more ducting than yours... in the end it's the effective use of the bandwidth that is the REAL measure and we can't quantify that measure in any easy manner. In a while - when the systems are better known we will know their bottlenecks and thus know where MORE bandwidth was required... but that's about as close to quantification as is possible"

"

The pipeline for the SPE is VERY deep - a jump statement may or may NOT be taken, but regardless the SPE WILL read ahead those instructions that would be executed should the jump NOT take place. Each instruction will be at a different point in the execution pipeline... so let's say there are 21 instructions... 0 being the jump on condition, the next 10 being the NO-jump code, the last 10 being the code run when a jump is executed.

It might take say 5 cycles (arbitrary for example's sake) to actually execute the jump-on-condition command... during which time the dual-issue SPE has begun processing the next 10 instructions (assuming no dependency issues). The result of the jump on condition comes in and et voilà - all those instructions (1-10) are suddenly WRONG... we don't need them to execute any more... here comes a NICE flush of the pipeline... removing ALL the new code and returning to the state BEFORE the read-ahead was done. It then starts reading from instructions 11-20 and execution continues.

THIS is the cost of branching, THIS is the reason we WILL have to write more serial code that uses instructions to make decisions on vectors THIS is ONE of the reasons coding for the ps3 WILL take longer."

"
Here's a question - on PS2 the VU units are used largely to process vertex/lighting/skinning data from a known stream source specifically designed around the VU itself... we use VU0 for sporadic vector math, thus keeping the actual values in the VU registers during complex math. These things HAD to be designed around before they worked... PS3 has similar requirements for alignment and data read/write, HOWEVER we are expected to run FAR more complex and random code accessing FAR more random (historically speaking) data... in order to get the speed out of the SPEs we HAVE to redesign core systems around the SPE requirements... economies of scale WILL play a factor in whether this ACTUALLY gets done... if we can produce an ok game that sells 2M units by NOT redesigning but putting out mediocre code, whilst making the same game on X360 looking VERY cool and doing it easily and selling 3M units... I think the business speaks for itself... off-the-cuff example it may be... wrong it is not.

Developing for PS3 WILL cost more than X360 - doesn't mean I like one better than the other... it's just a fact. In maybe 3 years' time they'll be approaching similar, at which point we could see a change - but for now it is what it is."

"Hate to mention this but... someone quoted Tim Sweeney as saying the Cell is easy to program for... general opinion is that his role as "Chief Architect" is largely moot, he is the CEO and nothing more... the crux of it is... he IS NO LONGER a programmer.

and regarding those articles - read them both and by and large its guesswork.

deftones - read between the lines, I said I can't give you my source, not that I'm guessing. My university career saw my final-year project getting sent back twice due to being TOO complex, at which point I had to employ those metaphors you love so much in order to explain the intricacies of scheduling in the U/V pipes of the FIRST Pentium CPU and how it could be programmed with this in mind. My CPU work doesn't stop there, as I worked at the lowest level on PS1 for several years writing renderers at the assembly level and optimising the game for instruction-cache (I-cache) and data-cache (D-cache) thrashing. That year, due to the entire team's efforts, Sony gave us a pat on the back saying we were the most optimal engine out that year. I then switched to PS2 for a while and then GameCube, where most of my PowerPC experience comes from. Again I wrote at the lowest level for GameCube AND optimised at a higher level knowing the hardware's weak points and where we could rely upon its strong points.

I'm not a novice, I'm not guessing, and I've stayed up with the CPUs I use or WILL use over the years. I know almost nothing about the current batch of PC CPUs, mainly because I don't program for them... but I do know an awful lot about PowerPC architecture, the architecture of the X360, and I'm currently learning the architecture of the Cell's SPEs, which may take a few more weeks before I'm up to making global design decisions.

so please - answer me this, what do you do? and what exactly do you disagree with regarding my previous LARGE post?"

wow - I go away for ONE day at E3 and ALL **** breaks loose

To answer a few questions:

1. I'm not a fanboy, nor do I lean towards any particular hardware. I do currently work on Xenon BUT I have been working on PS1, PS2, DC and GameCube for a LONG time.

2. Regarding the Cell's SPEs and WHY they are not good at general-purpose code OR random data access.

When a decision is made in a processor (an if statement) the PC (program counter) often has to jump to a separate section of the code, returning later to the common code. This interrupts the caching of the instructions, as the processor does NOT know where the code is going to jump, as this is more often than not data dependent. Modern CPUs have a method they use to predict which branch will be taken. Branch prediction is extremely complex and I won't even try to explain it here, BUT I can say that it often hides the caching issues I mentioned. A branch miss results in the instruction pipeline being flushed, and the CPU then starts reading from its new location... X360 HAS branch prediction on ALL its cores, thus ofttimes it avoids this flush. PS3 has it only on the PPE - the SPEs have ZERO branch prediction but CAN be given hints by the programmer as to which branch might be taken... this is NOT as fast as full branch prediction AND it takes longer to program. Code can be designed around this problem and no doubt it WILL be... but it takes longer.

Further - the SPEs rely upon a LOCAL area of memory (256KB) to store both the data they use during calculations AND the results of said calculations. This data is streamed in by DMA as and when needed, and thus needs to be controlled either by the SPE itself or the main CPU. The DMA is HEAVILY relied upon but AFAIK can only process one request at a time, albeit at a VERY fast rate. Now... the DMA reads a set number of bytes from a memory address; each address+numbytes is a single request... consider accessing 1,000 vertices... they would normally be stored linearly in memory, thus a SINGLE DMA request would be used. If you're following me then you know where I'm going... consider the SPE accessing data from 100 different memory addresses in a NON-linear fashion; random. Each access would represent a NEW DMA; more DMA requests == further delays for ALL SPEs.

There is an area of memory that is shared between the SPEs, so the above example IS a worst-case scenario... but like many things on the Cell... this shared memory is non-trivial and is small, thus requiring management; more programming.

3. Cell is VERY good with large datasets requiring linear access and having a relatively simple code pathway - lighting, vertex transform, A* pathing (if designed properly), character skinning, texture processing, animation. However, to get the most out of it, each will require its own specific handling and design around the SPE limitations and DMA request optimisation.

4. Information on RSX is sketchy at best... they have said 100B shader ops but haven't told us much else. It will be fast, it will be able to run most if not all the things that the ATI part in X360 will run. It DOES NOT use the same language - more work required.

5. X360's CPU == 240 GFLOPS with 1 thread per core, PS3 CPU == 218 GFLOPS - TOTAL when you ADD the 2 threads running on the PPE... if we take the same standpoint as Sony... X360's CPU is 480 GFLOPS... that's >2x PS3... we just need the numbers on the GFX cards."

"
note - "IF we take the same standpoint as Sony" in that - Sony, in their comments regarding the STATS on the Cell, HAVE ADDED THE NUMBERS across the board in order to reach the 2.18 TFLOPS; they have added 2 threads on the PPE and dual issue on the SPEs... it was MEANT to show people what the X360 numbers come out to WHEN we use the same "technique" as Sony.

and - you say that it shouldn't be much more difficult than PS2 - hmm... that puts MOST things at around 5x programming time over X360 - thanks for backing me up.

regarding branch prediction NOT being an issue - go to the Cell docs IF you have them (anyone else who has them can back me up here):

/cell/0_3_0/documents/hardware/BE-Overview_e.pdf
In that PDF go to SPE -> SPU Performance Characteristics and READ the section on BRANCH PREDICTION AND WHY IT'S IMPORTANT.


THEN come back here and try again - sorry if I'm harsh but you came on here DIDN'T read my post properly and THEN got offensive."

"

We were looking at these stats a few days ago and no matter WHAT equation we use we CANNOT get the system to 2.18 TFLOPS.

The CPU is 218 GFLOPS - not that great.
The GPU is a black box in the released stats claiming 1.8 TFLOPS... now... THAT is absurd... not impossible, but certainly outside the realms of a single die - our general thinking is that they are using an SLI setup, however that would KILL profitability.

My own opinion says they are more likely to hit 1 TFLOPS and be similar to X360...

The main issue is... that 218 GFLOPS on the CPU is gonna be VERY hard to utilize and/or maximise"

----These are his words, not mine; this is not really supposed to be publicized, but he is a developer for Activision who is working on the next-generation Tony Hawk for both systems, thus he has an intimate knowledge of both. In addition there are other devs, but this should be enough to chew on for now.


Someone Dissect this
 
As a programmer, most of what he said makes sense to me. But this:

5. X360's CPU == 240 GFLOPS with 1 thread per core, PS3 CPU == 218 GFLOPS - TOTAL when you ADD the 2 threads running on the PPE... if we take the same standpoint as Sony... X360's CPU is 480 GFLOPS... that's >2x PS3... we just need the numbers on the GFX cards."

Come on! The number of hardware "threads" doesn't affect the maximum possible flops/ips a processor core can do. What matters is the number of execution units in those cores. If there's one VMX unit per core, then having 1 or a zillion hardware threads won't allow the units to do more. Threads are only useful to maximize the use of the execution units on dead/lost cycles.

And what's that? 240 GFLOPS for a 3-core CPU at 3.2 GHz? That's 25 flops per cycle per core; can anyone explain this to me?
 
gosh said:
"1. The Reason MS didn't "correct" their GFLOP evaluation is because a SYSTEM cannot be described as a mere Additive resolution of its constituent parts. Sony is basically lying.

We sure were not hearing that when the Xbox stats were released.
 
smarth, is that what he's saying in his post? That you "can't" add the hardware threads up to increase the GFLOPS rating, and that's what he thought Sony did?
 
5. X360's CPU == 240 GFLOPS with 1 thread per core, PS3 CPU == 218 GFLOPS - TOTAL when you ADD the 2 threads running on the PPE... if we take the same standpoint as Sony... X360's CPU is 480 GFLOPS... that's >2x PS3... we just need the numbers on the GFX cards."

:LOL: Who is this guy? His credibility just kinda flew right out the window with this one...
 
The point of the comment was to show what the X360's numbers would be IF you used Sony's method for calculating FLOPS.

he says that explicitly.

"5. X360's CPU == 240 GFLOPS with 1 thread per core, PS3 CPU == 218 GFLOPS - TOTAL when you ADD the 2 threads running on the PPE... if we take the same standpoint as Sony... X360's CPU is 480 GFLOPS... that's >2x PS3... we just need the numbers on the GFX cards."

"
note - "IF we take the same standpoint as Sony" in that - Sony, in their comments regarding the STATS on the Cell, HAVE ADDED THE NUMBERS across the board in order to reach the 2.18 TFLOPS; they have added 2 threads on the PPE and dual issue on the SPEs... it was MEANT to show people what the X360 numbers come out to WHEN we use the same "technique" as Sony. "
 
gosh, where does that come from?
BTW, the Tony Hawk series was made on RenderWare middleware. Not a technical showcase.

He sounds like a very frustrated guy.
 
"5. X360's CPU == 240 GFLOPS with 1 thread per core, PS3 CPU == 218 GFLOPS - TOTAL when you ADD the 2 threads running on the PPE... if we take the same standpoint as Sony... X360's CPU is 480 GFLOPS... that's >2x PS3... we just need the numbers on the GFX cards."

Ohh, I see, the little 5 at the start of the quote counts. He means: 5. X360's CPU == 240 GFLOPS.



Tim Sweeney is no longer a programmer? Is this guy sick?
 
IIRC, Epic said they have not started to tap the SPEs yet. So I would take anything they have to say about the ease of programming Cell with a grain of salt. Wait till they make a game that has to utilize the SPEs and then talk to them.

Anyway, it was a somewhat interesting read, but it contained little I could not deduce on my own.
 
I can't wait to see the link of where this article was found....

*looks at his watch as he waits for gosh...
 
The lack of branch prediction on the SPEs is well known, and CELL's power comes from having ***EIGHT*** SPEs all running at 3.2 GHz, with 256 KB SRAM each, dual instruction pipelines (not dual threaded), 128 x 128-bit registers, and a DMA engine each, with massive SPE-to-SPE/PPE/main-memory/GPU bandwidth.

Terrible! :eek:

A possible answer to the 2 TFLOP spec:

"I also scratched my head over what kind of wild marketing optimism led to these numbers. Things I could think of:

If texture interpolation now can be done in FP32:
Bilinear interpolation per one component takes 4 multiplies, 3 adds and 2 subs: 9 ops, x4 channels = 36 ops.
If the result blend operation (when writing the results into an FP32 buffer) can be done in FP32, it would add another 12 ops.
Add the original 8 ops of the other shader unit (4x MADD).

We have 56 ops/cycle/pipe.
At (suppose) 32 PS pipes, it would be 1792 FP ops/cycle.
At 550 MHz, we have 985 GFLOPS.

Well, we do have a (theoretical) teraflop; we are almost there. I am not too sure about the number of PS pipes (24 or 32?). Anyway, I did not count the VS pipes (8, likely?). Also, if anisotropic filtering can be done in FP32, that would double the texture-unit FP count, and we would be at 1.8 TFLOPS.

Now, before we claim a record, let's see how it compares to a Cray supercomputer, model X1E, in a single liquid-cooled cabinet configuration:

2.3 TFlops, memory bandwidth 3200Gb/s

The teraflops could be comparable, but it seems we have a memory bandwidth problem: the Cray has 100x more with which to do any useful job with all these FP units.


Roman" - from here: http://www.nvnews.net/vbulletin/showthread.php?t=52171
 
Even if he has the docs in front of him, if he hasn't actually used it I'd say he's more likely to have false ideas about Cell.
 
Sounds about right. But we've known from the beginning that developing for the Cell would require a paradigm shift in programming.

If you had some yardwork that had to be done, hiring the XeCPU would be like hiring 3 teenagers who each brought 3 slacker friends and had to share some tools. Hiring the Cell would be like hiring 1 adult who brought his 8 children.
 