Official CELL Speculation Thread

...

Even a [highly sexy - heh] ATI engineer agreed that an APU could run a shader program damn well.
It could? Then what's the purpose of a GPU? CELL doesn't need one!!! PSX3 GS3 is nothing but just another CELL with eDRAM frame buffer!!!

Unless you can explain how an APU differs from, say, an R3x0 Vertex Shader.
Vertex Shader carries transform and lighting specific instructions, APU doesn't.

as you seem to think an APU can't which is so very obtuse
I said does not/should not, never can't.
 
Re: ...

DeadmeatGA said:
It could? Then what's the purpose of a GPU? CELL doesn't need one!!! PSX3 GS3 is nothing but just another CELL with eDRAM frame buffer!!!

Perhaps that's all it needs to be. That's basically what we'll see with the DX Next move to a fully programmable Unified Shader, and topology support that's almost unbounded compared to what's traditionally been seen.

In a perfect world you'd want fully programmable acceleration, but this isn't feasible, so concessions must be made. When I look at 3D graphics, I see two distinct areas of acceleration: that which is static and scales linearly (such as per-fragment ops), and that which is dynamic and scales in [almost] indeterminate ways (per vertex, surface/object, and higher abstractions) at the whim of the developer.

EDIT: I'm talking about the increase in computation per task as anticipated by a future hardware designer.

Things like filtering and AA are highly iterative (as I talked about earlier... something like 28 ops per fragment for bilinear) and lend themselves well to dedicated logic constructs, as you don't need true programmability. Dedicated (concurrent) logic is the way to go, ergo the need for a "GPU".
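
To put that op count in context, here's a minimal single-channel bilinear sketch in C (my own illustration, not anyone's actual hardware): four texel fetches, then a fixed ladder of muls and adds, repeated per channel, per fragment. The pattern never changes, which is exactly why dedicated logic wins here.

    /* single-channel bilinear fetch; assumes u,v >= 0 and in range */
    float bilinear(const float *tex, int width, float u, float v)
    {
        int   x0 = (int)u,        y0 = (int)v;
        float fx = u - (float)x0, fy = v - (float)y0;  /* fractional weights */

        float t00 = tex[ y0      * width + x0    ];
        float t10 = tex[ y0      * width + x0 + 1];
        float t01 = tex[(y0 + 1) * width + x0    ];
        float t11 = tex[(y0 + 1) * width + x0 + 1];

        float top = t00 + fx * (t10 - t00);            /* horizontal lerp */
        float bot = t01 + fx * (t11 - t01);
        return top + fy * (bot - top);                 /* vertical lerp   */
    }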

Yet, everything else - I lump it together as the "front-end" - should be programmable. Topology, Shaders, Physics, et al. should all be open to developers' exploitation as they see fit.

So, there is a nice duality between the "GPU", which has some dedicated logic and output constructs, and the "MPU", which is programmable.

Vertex Shader carries transform and lighting specific instructions, APU doesn't.

I'm going to laugh.
 
Yet, everything else - I lump it together as the "front-end" - should be programmable. Topology, Shaders, Physics, et al. should all be open to developers' exploitation as they see fit.

Vince, this actually fits under the DXNEXT model rather well. In addition, it shifts the bottleneck in how CPU-to-GPU sync is handled (effectively relegating the CPU to AI and process housekeeping for the most part).

Isn't the problem with CPU-centric rendering (and let's limit our discussion to just this for now) that, flexible though it is (you've effectively got the superset of functionality available), it has major hurdles in managing both GFX and non-GFX tasks in an efficient manner?
 
It has been a problem with typical x86 implementations, which are considerably bottlenecked between CPU and GPU, and the typical x86 CPU would be far too overwhelmed handling both duties compared to what the GPU can do with graphics functions alone.
 
Re: ...

DeadmeatGA said:
Vertex Shader carries transform and lighting specific instructions, APU doesn't.

Many of these are nothing more than macros which decompose at compile/assemble time into sequences of more generic instructions. Example: m4x3 (a 4-vector transformed by a 4x3 matrix) takes three instruction slots on the VS. Why? Because it's implemented in hardware as a sequence of muls and mads.
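
A rough C sketch (mine, not from any SDK) of what m4x3 decomposes into: three dot products, each a mul followed by three mads. Nothing here is transform-specific; it's generic mul/mad math that any APU-like unit could run.

    void m4x3(float out[3], const float v[4], const float m[3][4])
    {
        for (int row = 0; row < 3; row++) {
            float acc = m[row][0] * v[0];        /* mul */
            acc += m[row][1] * v[1];             /* mad */
            acc += m[row][2] * v[2];             /* mad */
            out[row] = acc + m[row][3] * v[3];   /* mad */
        }
    }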
 
DMGA said:
CELL is not a GPU. It does not/should not run shader programs.
:oops:
Let's trace back - YOU've been claiming for the last year (at least) that the APU equals the VU.
--->
DMGA said:
each CELL core is built around single PPC core serving as the I/O engine, while 8 VUs handles the computational tasks dispatched from the Linux kernel.

So what, you're saying that every PS2 game to date should not run, because it runs shaders on VUs?
And further, according to you, the VU has no instructions to run T&L?? And is that 'the reason' why the VU IS the T&L processor in PS2?
:oops:
 
MfA said:
You are not going to run VS-3.0 or pixel shaders very well without hardware multithreading.

I think SSNC ( Semiconductor Solution Network Company ) and Toshiba have been aware of the challenges in pushing CELL for 3D graphics applications for quite a while:

[0137] Other dedicated structures can be established among a group of APUs and their associated sandboxes for processing other types of data. For example, as shown in FIG. 27, a dedicated group of APUs, e.g., APUs 2702, 2708 and 2714, can be established for performing geometric transformations upon three dimensional objects to generate two dimensional display lists. These two dimensional display lists can be further processed (rendered) by other APUs to generate pixel data. To perform this processing, sandboxes are dedicated to APUs 2702, 2708 and 2414 for storing the three dimensional objects and the display lists resulting from the processing of these objects. For example, source sandboxes 2704, 2710 and 2716 are dedicated to storing the three dimensional objects processed by, respectively, APU 2702, APU 2708 and APU 2714. In a similar manner, destination sandboxes 2706, 2712 and 2718 are dedicated to storing the display lists resulting from the processing of these three dimensional objects by, respectively, APU 2702, APU 2708 and APU 2714.

[0138] Coordinating APU 2720 is dedicated to receiving in its local storage the display lists from destination sandboxes 2706, 2712 and 2718. APU 2720 arbitrates among these display lists and sends them to other APUs for the rendering of pixel data.
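
A loose sketch, in C form, of the flow those two paragraphs describe ( every name below is invented for illustration; the real primitives are the patent's sandbox and DMAC facilities ):

    #include <stddef.h>

    struct display_list { size_t bytes; unsigned char data[4096]; };

    /* hypothetical stand-ins for the patent's sandbox/DMAC operations */
    int  sandbox_pop(int sandbox_id, struct display_list *out);
    void dispatch_to_render_apu(int apu_id, const struct display_list *dl);
    int  pick_idle_render_apu(void);

    /* the coordinating APU's ( APU 2720's ) main loop: drain each
       geometry APU's destination sandbox and hand each display list
       to whichever rendering APU is free */
    void coordinating_apu(const int dst_sandbox[], int n_geometry_apus)
    {
        for (;;) {
            for (int i = 0; i < n_geometry_apus; i++) {
                struct display_list dl;
                if (sandbox_pop(dst_sandbox[i], &dl))
                    dispatch_to_render_apu(pick_idle_render_apu(), &dl);
            }
        }
    }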

APUs can work on the Vertex side and the Pixel side of things.

A CELL-based GPU could work well: of course, things like texture filtering, etc. would get dedicated silicon to accelerate them. The patent mentions a different configuration of a PE, or group of PEs, called the Visualizer, which substitutes a Pixel Engine and Image Cache into the space taken by 4 APUs.

You could produce a Visualizer version of the Broadband Engine ( maybe clocked at half the speed to reduce total heat [2 GHz or so for the Broadband Engine and 1 GHz for the Visualizer] ) in the same fab as the Broadband Engine, basically allowing Nagasaki #2 and Oita #2 to each build both the CPU and GPU for the system.

In the Pixel Engine we should likely find the logic for texture sampling and filtering, so those kinds of ultra-basic tasks will be taken care of there.
 
...

Faf

So what, you're saying that every PS2 game to date should not run, because it runs shaders on VUs?
Let us consider the rendering stages of a frame.

1. World construction(New objects added/deleted).
2. Read controller/network input.
3. Physics calculation.
4. Modify the world.
5. Generate the display list
6. T&L
7. Rasterization/Pixel Shading

Stages 1 through 5 are traditionally handled by the CPU, and this is what I expect EE3 to handle. Stages 6 & 7 are the domain of GPUs. Should the PSX3 be implemented as two chips, stages 1 through 5 will be allocated to EE3, while 6 & 7 will be allocated to GS3. While any stage can technically be allocated to any processor, the stages have to be grouped this way to minimize inter-processor bandwidth, because EE3 and GS3 won't have a lot of bandwidth between them.

This is why I don't expect the APUs in EE3 to handle shading functionality, because it is not a logical allocation of limited bandwidth.
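
A sketch of that allocation in C form (labels invented): grouping this way keeps the only cross-chip traffic at the stage 5 to stage 6 hand-off, i.e. the display list.

    enum chip { EE3, GS3 };

    static const enum chip stage_owner[7] = {
        /* 1. world construction      */ EE3,
        /* 2. controller/net input    */ EE3,
        /* 3. physics                 */ EE3,
        /* 4. world update            */ EE3,
        /* 5. display-list generation */ EE3,
        /* 6. T&L                     */ GS3,
        /* 7. raster/pixel shading    */ GS3,
    };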
 
MfA said:
That's dandy, but it didn't address hardware multithreading.

Well, each APU has its PC, its Register File and its LS ( which is its real System RAM ).

The SMT implementation on the Pentium 4 does some of that ( there is some duplication of resources ) and some more tricks ( in the trace cache, u-ops are tagged by thread_id ), so we can at least say that CELL has some hooks to help MT.

The rest is done by the PU, which would do the thread-scheduling work for its APUs.

The APUs depend on the PU for that, for direction on what to do... they can do the rest themselves.

They can do some I/O with external devices on their own, they can access the shared DRAM alone using the DMAC.

The PU can help them with I/O and memory access ( the PU controls the memory sandboxes ), but its main tasks are message passing ( even that could be done by the APUs... have the PU set up a shared area of DRAM so that all APUs doing a certain job can access it [through APU ID mask modification] and pass messages using that area, or other tricks can be devised to sort of automate that ) and job scheduling for the APUs.

The PU does that by DMAing the program to the APU's LS ( ARPC, or APU RPC ), and in that program it includes the contents of the PC and the stack the APU should have.
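
A hypothetical C sketch of that ARPC dispatch ( all names below are invented, not from the patent ):

    #include <stdint.h>
    #include <stddef.h>

    struct arpc_image {
        uint32_t entry_pc;    /* initial program counter inside the LS */
        uint32_t stack_top;   /* initial stack pointer inside the LS   */
        size_t   code_bytes;  /* program text + data, must fit the LS  */
        uint8_t  code[];
    };

    /* hypothetical stand-ins for the DMAC and APU-control interfaces */
    void dmac_write_ls(int apu_id, uint32_t ls_offset, const void *src, size_t n);
    void apu_kick(int apu_id, uint32_t pc, uint32_t sp);

    /* the PU loads the APU's LS over DMA, then starts it at the given PC */
    void pu_schedule(int apu_id, const struct arpc_image *img)
    {
        dmac_write_ls(apu_id, 0, img->code, img->code_bytes);
        apu_kick(apu_id, img->entry_pc, img->stack_top);
    }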

The APUs are like musicians in an orchestra, with the PU deciding if they use the same instruments ( share e-DRAM portions ), if they should all play the same tune, or if one should go solo.

I hope I got the "example" right :)

What would you want to see as HW assisted MT ?

The thing with the Pentium 4 is that it handles 2 threads in parallel at the same time... in a PE you would have to support from 1 to N threads, as there is no specific limit ( except PU performance ) to the number of APUs per PE.
 
Re: ...

DeadmeatGA said:
Faf

So what, you're saying that every PS2 game to date should not run, because it runs shaders on VUs?
Let us consider the rendering stages of a frame.

1. World construction(New objects added/deleted).
2. Read controller/network input.
3. Physics calculation.
4. Modify the world.
5. Generate the display list
6. T&L
7. Rasterization/Pixel Shading

Stages 1 through 5 are traditionally handled by the CPU, and this is what I expect EE3 to handle. Stages 6 & 7 are the domain of GPUs. Should the PSX3 be implemented as two chips, stages 1 through 5 will be allocated to EE3, while 6 & 7 will be allocated to GS3. While any stage can technically be allocated to any processor, the stages have to be grouped this way to minimize inter-processor bandwidth, because EE3 and GS3 won't have a lot of bandwidth between them.

This is why I don't expect the APUs in EE3 to handle shading functionality, because it is not a logical allocation of limited bandwidth.

How do you know that the CPU and GPU of PlayStation 3 will not have a lot of bandwidth between them?

Redwood is not advertised as a slow bus architecture.

I could see the Visualizer ( CELL-based GS3 ) handling step 7 alone just fine.

Since it would handle only that part of the whole procedure, we could get by with using only 2 PEs and more e-DRAM.

Clocking the VS at 1/2 the clock-speed of the BE ( the e-DRAM would still not pass 1 GHz... either 1 GHz SDR or 500 MHz DDR ) in the case of a 2.5 GHz BE, or at the same speed as the BE in the case of a 2 GHz BE, would do the trick IMHO.

Fill-rate would then be between 2.5-4 GPixels/s, which is good enough even for HDTV resolution ( the important thing next-gen will be shader execution speed ).

The VS would have 4 APUs per PE and 2 PEs meaning 8 APUs in total ( the space of 4 APUs in each PE is taken by the Pixel Engine and its Image Cache ).

The clock-speed of the APUs would then be ( according to the calculations above ) 1.25-2 GHz which would mean 10-16 GFLOPS or GOPS per APU.

This would yield 80-128 GFLOPS or GOPS for Fragment Shading alone.
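
For reference, the arithmetic behind those figures ( assuming each APU retires a 4-wide madd, i.e. 8 ops per cycle, which is the usual reading of the patent's APU ):

    1.25 GHz x 8 ops/cycle = 10 GFLOPS per APU ... 2 GHz x 8 = 16 GFLOPS per APU
    8 APUs x 10-16 GFLOPS  = 80-128 GFLOPS for Fragment Shading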

Texture filtering would be handled by the Pixel Engines.
 
MfA said:
That's dandy, but it didn't address hardware multithreading.
Well... one can just process more vertices (or pixels) at the same time.
Most PS2 coders already do this; it's quite common to code T&L inner loops that process 2 or 4 vertices at the same time to make full use of any free instruction slots. It's a simple and quite effective technique.
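
A sketch of the trick in C (my illustration; real VU code would be assembly): two vertices in flight per iteration, so the two accumulation chains are independent and their muls/adds can interleave to fill slots a single-vertex loop would leave empty.

    /* transform pairs of 4-vectors by matrix m; odd leftover vertex
       omitted for brevity */
    void transform2(float out[][4], const float in[][4],
                    const float m[4][4], int n)
    {
        for (int i = 0; i + 1 < n; i += 2) {
            for (int r = 0; r < 4; r++) {
                float a = 0.0f, b = 0.0f;        /* one accumulator each */
                for (int c = 0; c < 4; c++) {
                    a += m[r][c] * in[i][c];     /* vertex i   */
                    b += m[r][c] * in[i + 1][c]; /* vertex i+1 */
                }
                out[i][r]     = a;
                out[i + 1][r] = b;
            }
        }
    }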

ciao,
Marco
 
T&L doesn't have to deal with external memory access.

Without hardware multithreading you need to software-pipeline and insert prefetches where necessary; for the same performance you need more temporary storage than an equivalent design with hardware multithreading (dynamic adjustment versus static).
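
A rough sketch of the difference, using a GCC-style prefetch hint (illustrative only; the shade step is a stand-in):

    /* Without hardware threads the loop itself hides memory latency:
       issue the fetch for iteration i+AHEAD while working on i.
       AHEAD is fixed at compile time: the "static" adjustment above. */
    #define AHEAD 8

    void shade_all(float *out, const float *texels,
                   const int *index, int count)
    {
        for (int i = 0; i < count; i++) {
            if (i + AHEAD < count)
                __builtin_prefetch(&texels[index[i + AHEAD]]);
            out[i] = texels[index[i]] * 0.5f;  /* stand-in for real shading */
        }
    }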
 
...

slide13.jpg

For those of you who doubted the significance of this recurring diagram, which has shown up in a dozen Kutaragi presentations over the past two years, here is something else confirming that this diagram really is a representation of Kutaragi's plan...

Cell-Computing is PlayStation 3
Here, Kutaragi clearly defined Cell Computing as PlayStation 3; hell, that's the very title of the interview. Now can you read Kutaragi's mind? The teraflop CELL processor is not a single chip; it is a 32-chip MCM module that can be used to build server racks. Too bad your $399 PSX3 box won't have such an MCM module....
 
Deadmeat is probably right.

We're going to see 128-256 GFLOPS for the CPU and another 64-128 GFLOPS for the GPU. Both are peak figures, so sustained, real-world amounts will be much lower.


At most, 1/3 of a TFLOP for the entire PS3 architecture.


One third of a TFLOP is obviously ~333 GFLOPS :) and obviously that is a major improvement over PS2's 6.2 GFLOPS.
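
Spelling the arithmetic out:

    low end:  128 + 64  = 192 GFLOPS peak
    high end: 256 + 128 = 384 GFLOPS peak
    ~333 GFLOPS is roughly 54x PS2's 6.2 GFLOPS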

Now, I highly doubt Deadmeat will disagree with what I've said here.
 
Excepting that he keeps clutching at the slide, which would show an amazing LOSS for the machine compared to '99-'00 tech, and everything else has been debated five ways from Sunday already, with DMGA going straight back to square one each time.

The space above is very vast and open to speculation--obviously.
 