IBM's Ashwini Nanda on Cell blades, raycasting, and more

ralexand said:
Was this being output in realtime? I thought it was a series of JPEGs assembled and displayed. I have no idea how this compares to what they could do in realtime. To be honest I don't see any significant qualitative difference between these images and what could be done with your standard shaders.

It's the same thing as realtime, apart from the JPEG compression...

What you see on your screen are raw images from the graphics card. The Cell blade did the same, but the benchmark involved saving the frames in JPEG format to a G5 station on the network (if this was done the same way as at E3) instead of displaying them directly on the screen.

From what I gather, the benchmark was testing a "military or medical" use case where the results are saved and not just displayed on the screen.
It's a bit like FRAPS recording a movie but without having to display the game, just saving it to disk.

I don't know the scale of those benchmark results (50 means what? Frames per second? It doesn't say). They could mean anything, but seeing the G5 scoring 1 and Cell scoring from 35 up, it's pretty impressive!
 
dskneo said:
I don't know the scale of those benchmark results (50 means what? Frames per second? It doesn't say). They could mean anything, but seeing the G5 scoring 1 and Cell scoring from 35 up, it's pretty impressive!
As it says, it's relative. It doesn't say what speed it runs at on the G5 (could be 2 FPS, could be 200), but the Cell @ 3.2 GHz is 50 times faster at it.
 
ralexand said:
Was this being output in realtime? I thought it was a series of JPEGs assembled and displayed. I have no idea how this compares to what they could do in realtime.

To further clarify what DSKNeo said: it was realtime. In previous articles about it, they mentioned a joystick hooked up to it that you could use to navigate with (it also gets a mention in this whitepaper, as being part of the "client"). This is also the same demo shown at E3 at Sony's conference - it didn't look to be 60fps, but it was certainly realtime.
 
one said:
Hehe, also the performance number 50 for 3.2 GHz uniprocessor Cell is effectively taken with 7 SPEs as 1 SPE is used for the JPEG image compression kernel, so it seems 50 is the number for the PS3 Cell too.
Err, in that case the PS3 CPU will have 6 SPEs for the ray casting, as 1 SPE is used for the JPEG image compression kernel.

So the performance will be lower.


Still pretty impressive.
 
jvd said:
Err, in that case the PS3 CPU will have 6 SPEs for the ray casting, as 1 SPE is used for the JPEG image compression kernel.
"1 (No Image Encode)" for G5 is without jpeg compression. So if you assign all 7 SPEs for ray kernel, you still get 50 (No Image Encode) performance on PS3 Cell. On PS3, you don't need to compress it as you don't need to send it via gigabit ethernet, why not get 50 perf ;-) The image compression kernel wouldn't be a bottleneck, as the entire single SPE is allocated for it in all cases. PPE thread 1 does task management and PPE thread 2 does network management.
The TRE was implemented using a client server model. The client, implemented on an Apple G5 system, is connected to the server via a gigabit Ethernet. The client specifies the map, path, and rendering parameters to the server which in turn streams compressed images back to the client for display.
 
How do you figure that one? There is no "(No Image Encode)" for any of the Cell numbers. Thus using the 50 number is flawed, as that is obviously with encode on.
 
Quick question. Is the Cell processor inside PS3 2-way SMP as well? If so, at 3.2 GHz would it have performance around or over 100?
 
Andy said:
Quick question. Is the Cell processor inside PS3 2-way SMP as well? If so, at 3.2 GHz would it have performance around or over 100?

No, PS3 just has one Cell processor with 7 SPEs; this test uses all 8 SPEs, I believe.
 
jvd said:
How do you figure that one? There is no "(No Image Encode)" for any of the Cell numbers. Thus using the 50 number is flawed, as that is obviously with encode on.

Encode is extra overhead, though not a lot of it.
 
V3 said:
No, PS3 just has one Cell processor with 7 SPEs; this test uses all 8 SPEs, I believe.

Cheers, I knew that there was only 1 Cell processor inside, but I'm not exactly up to date with the technical jargon. But I remembered something about the revised Cell processor, and it had 2-way something, where the original design could only do 1 something. I thought that "something" might have been SMP, but I couldn't remember exactly.
 
jvd said:
How do you figure that one? There is no "(No Image Encode)" for any of the Cell numbers. Thus using the 50 number is flawed, as that is obviously with encode on.
Image Compression Kernel
In addition to the SPEs running the ray-kernels one SPE is reserved for image compression. The compression kernel operates on the finished accumulation buffer which is organized in a column major 2D array of single precision floating point (red, green, blue, count) data, one float per channel. The count channel contains the number of samples accumulated in the pixel. The image compression kernel is responsible for the following tasks:
1) Normalization of the accumulation buffer
2) Compression of the normalized buffer.
3) Clearing of the accumulation buffer
The compression kernel reads sixteen by sixteen pixel tiles from the accumulation buffer into local store using DMA lists. These tiles are then normalized by dividing the color channels by the sample count for each pixel. The tile is then compressed using a multistage process involving color conversion, discrete cosine transformation (DCT), quantization, and run length encoding. The resulting compressed data is then written to a holding buffer in local store which when full is DMA written back to system memory for network delivery. As each tile is processed a tile of zeros is returned to the accumulation buffer providing the clear operation.
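For anyone trying to picture steps 1 and 3 above, here is a minimal Python sketch of normalizing one 16x16 tile and then zeroing it. It is only an illustration: the function name, the list-of-lists tile layout, and the handling of pixels with a zero sample count are assumptions, not anything taken from the TRE code. The actual compression step (color conversion, DCT, quantization, run length encoding) is left out.

```python
TILE = 16  # tile edge in pixels, as described in the quoted whitepaper text


def normalize_and_clear(tile):
    """Normalize one tile of [r, g, b, count] floats, then zero it in place.

    `tile` is a list of TILE * TILE mutable [r, g, b, count] lists
    (a hypothetical layout chosen for readability).
    Returns a list of (r, g, b) tuples averaged over the sample count.
    """
    normalized = []
    for pixel in tile:
        r, g, b, count = pixel
        if count > 0:
            # Step 1: divide each color channel by the number of samples.
            normalized.append((r / count, g / count, b / count))
        else:
            # Assumption: a pixel that received no samples is treated as black.
            normalized.append((0.0, 0.0, 0.0))
        # Step 3: return zeros to the accumulation buffer (the "clear").
        pixel[:] = [0.0, 0.0, 0.0, 0.0]
    return normalized
```

On the real hardware the tile would arrive in local store via DMA and be handled with SIMD instructions; this version only shows the arithmetic.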
The server implements a three frame deep pipeline to exploit all the parallelism in the Cell processor. In stage one, the image preparation phase, PPE thread one decomposes the view screen into work blocks based on the vertical cuts dictated by the view parameters. Stage two, the sample generation phase, is where the SPEs decompose the vertical cuts into samples and store them in the accumulation buffer. In stage three, the image compression/network phase, the compression SPE encodes the finished accumulation buffer and PPE thread two delivers it to the network layer. All three stages are simultaneously active on different execution cores of the chip maximizing the image throughput.
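The three frame deep pipeline can be sketched as a simple schedule: on each tick, stage one works on the newest frame while stages two and three work on the previous two. The Python below is a toy model under that reading of the text. The tick-based timing and all names are assumptions for illustration, since the whitepaper excerpt gives no timings.

```python
# Stage labels follow the quoted description; the wording is ours.
STAGES = (
    "prepare (PPE thread 1)",
    "sample (ray-kernel SPEs)",
    "compress/send (compression SPE + PPE thread 2)",
)


def pipeline_schedule(num_frames):
    """Return, for each tick, the frame index each stage works on.

    Each entry is a tuple with one slot per stage; None means the stage
    is idle during pipeline fill or drain.
    """
    depth = len(STAGES)
    schedule = []
    for tick in range(num_frames + depth - 1):
        row = []
        for stage in range(depth):
            frame = tick - stage  # stage k lags k frames behind stage 0
            row.append(frame if 0 <= frame < num_frames else None)
        schedule.append(tuple(row))
    return schedule
```

Once the pipeline is full, every tick has all three stages busy on different frames, which is the "all three stages are simultaneously active" point in the quote.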
So the actual work is done on 7 SPEs. Each stage is distributed across different processors and done as stream processing. Note that I assume the image compression kernel was not a bottleneck and the 7 SPEs for the ray kernel could show their full performance under this condition, which is 50. If the image compression kernel is a bottleneck, then the perf numbers in the chart are limited by it and the real numbers may be even higher, but I don't think that's the case.

BTW, in another thread in this forum, there was a discussion about PPE cache and SPE limited by it, but according to TRE doc
The SPEs are therefore free to run at their own pace processing each vertical cut without any data synchronization on the input or output. None of the SPE’s input or output data is touched by the PPE, thereby protecting the PPE’s cache hierarchy from transient streaming data.
So it's false to assume SPEs are limited by PPE cache. Apparently this consideration on programming without touching PPE cache also matches the observation on large dataset processing in Suzuoki's keynote speech in Rambus Developers Forum 2005.
 
one said:
So the actual work is done on 7 SPEs. Each stage is distributed across different processors and done as stream processing. Note that I assume the image compression kernel was not a bottleneck and the 7 SPEs for the ray kernel could show their full performance under this condition, which is 50. If the image compression kernel is a bottleneck, then the perf numbers in the chart are limited by it and the real numbers may be even higher, but I don't think that's the case.

BTW, in another thread in this forum, there was a discussion about PPE cache and SPE limited by it, but according to TRE doc

And on the PS3 the Cell only has 7 SPEs. So the actual work will be done on 6 of them, with the 7th used for the kernel.

Thus the numbers aren't comparable unless you think the PS3 suddenly has an 8th SPE.
 
jvd said:
And on the PS3 the Cell only has 7 SPEs. So the actual work will be done on 6 of them, with the 7th used for the kernel.

Thus the numbers aren't comparable unless you think the PS3 suddenly has an 8th SPE.

What do you mean, "7th used for the kernel"?

There are 2 kernels.

Image Compression Kernel running on 1 SPE
Ray-Casting Kernel running on 7 SPEs.
 
For the JPEG image compression kernel.

It states 1 is used for this and 7 are used for the other. In the PS3 you don't have 8 SPEs, which means if 1 is used for the JPEG kernel then there are only 6 left for the other tasks.


Which means the PS3 chip doesn't equal the benchmark numbers.
Originally Posted by one
Hehe, also the performance number 50 for 3.2 GHz uniprocessor Cell is effectively taken with 7 SPEs as 1 SPE is used for the JPEG image compression kernel, so it seems 50 is the number for the PS3 Cell too.

This is what my reply was to. His thinking is flawed. The Cell in the PS3 is 1x7, not 1x8. These tests show the performance for 1x8. With 1 less SPE there will surely be differences in performance. Pretty large, actually: you're taking away 1/8th of the power from the SPEs.
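To put a rough number on the 6-vs-7 SPE argument, here is a back-of-the-envelope Python sketch. It assumes the ray-casting score scales linearly with the number of ray-kernel SPEs, which nothing quoted in this thread confirms, and it assumes the benchmark's 50 was measured with 7 ray-kernel SPEs plus 1 compression SPE as the quoted whitepaper text describes. The function name is made up for illustration.

```python
def scaled_score(score, spes_measured, spes_available):
    """Scale a relative score linearly by SPE count (an assumption)."""
    return score * spes_available / spes_measured


# 50 measured with 7 ray-kernel SPEs; a 7-SPE chip that still spends
# one SPE on JPEG compression would have 6 left for ray casting.
estimate = scaled_score(50, 7, 6)
fraction_lost = 1 - 6 / 7

print(f"estimated relative score: {estimate:.1f}")
print(f"fraction of ray-casting SPEs lost: {fraction_lost:.3f}")
```

Under these assumptions the estimate lands a little under 43, and the loss is one seventh of the ray-casting SPEs. If the compression SPE isn't needed (no network streaming), all 7 go to ray casting and the figure stays at 50, which is the other side of the argument above.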
 
Quaz51 said:
And how can a dual-Cell at 2.4 GHz give a +108% performance boost??
Oh, that I overlooked. Then a uniprocessor Cell has a bottleneck somewhere.
 