Alan Heirich's paper on PS3 Deferred Shading (Cell pixel shading)

Titanio

It must be about a year since we first caught whiff of Alan Heirich's work on deferred shading across Cell+GPU. If you recall the abstract:

Mapping Deferred Pixel Shaders onto the Cell Architecture
Alan Heirich
Sony Computer Entertainment America
aheirich(at)Playstation.sony.com

Abstract
This paper studies a deferred pixel shading algorithm implemented on a Cell-based computer entertainment system. The pixel shader runs on the Synergistic Processing Units (SPUs) of the Cell and works concurrently with the GPU to render images. The system's unified memory architecture allows the Cell and GPU to exchange data through shared textures. The SPUs use the Cell DMA list capability to gather irregular fine-grained fragments of texture data generated by the GPU. They return resultant shadow textures the same way. The shading computation ran at up to 85 Hz at HDTV 720p resolution on 5 SPUs and generated 30.72 gigaops of performance. This is comparable to the performance of the algorithm running on a state of the art high end GPU. These results indicate that a hybrid solution in which the Cell and GPU work together can produce higher performance than either device working alone.

Discussion was kind of blunted by the fact that further details weren't forthcoming, but it looks like SCEA Research finally added the full paper to their page:


Deferred Pixel Shading on PLAYSTATION3


I've only been able to glance through it, but I'm sure it'll be of interest. This actually looks like an updated paper, credited to Alan Heirich and Louis Bavoil. I would quote the abstract and conclusion here, but my PDF viewer under Linux doesn't seem to let me copy text.
 
Here it is...

Abstract said:
This paper studies a deferred pixel shading algorithm
implemented on a Cell/B.E.-based computer entertainment
system.

The pixel shader runs on the Synergistic Processing Elements
(SPEs) of the Cell/B.E. and works concurrently with the GPU to
render images. The system's unified memory architecture allows
the Cell/B.E. and GPU to exchange data through shared textures.
The SPEs use the Cell/B.E. DMA list capability to gather
irregular fine-grained fragments of texture data generated by the
GPU. They return resultant shadow textures the same way. The
shading computation ran at up to 85 Hz at HDTV 720p
resolution on 5 SPEs and generated 30.72 gigaops of
performance. This is comparable to the performance of the
algorithm running on a state of the art high end GPU. These
results indicate that the Cell/B.E. can effectively enhance the
throughput of a GPU in this hybrid system by alleviating the
pixel shading bottleneck.
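
The "DMA list" gather the abstract mentions can be pictured like this. What follows is a toy Python model of the behavior, not Cell SDK code (on a real SPE this would be an `mfc_getl` transfer); the function name and data layout here are my own:

```python
# Toy model of a DMA-list gather: each list entry names an
# (offset, size) pair in main memory, and the transfer streams those
# irregular, fine-grained fragments into one contiguous local-store buffer.
def dma_list_gather(memory, entries):
    """Gather the fragments described by (offset, size) entries."""
    local_store = []
    for offset, size in entries:
        local_store.extend(memory[offset:offset + size])
    return local_store

# Scattered reads from "texture memory" land in one dense buffer:
texture_memory = list(range(100))
fragments = dma_list_gather(texture_memory, [(3, 2), (40, 4), (7, 1)])
```

Returning the resultant shadow textures "the same way" is the mirror image: the same kind of list entries, with data flowing from local store back out to main memory.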

Closing Remarks said:
We have explored moving pixel shaders from the GPU to
the Cell/B.E. processor of the PLAYSTATION®3
computer entertainment system. Our initial results are
encouraging as they show it is feasible to attain scalable
speedup and high performance even for shaders with
irregular fine-grained data access patterns. Removing the
computation from the GPU effectively increases the frame
rate, or more likely, the geometric complexity of the models
that can be rendered in real time.

We can also conclude that the performance of the Cell/B.E.
is superior to a current state of the art high end GPU in that
we achieved comparable performance despite performance
limitations and despite using only part of the available
processing power. Our current implementation loses
substantial performance due to DMA waiting. This results
from the fine-grained irregular access to memory and is
specific to the type of shaders we have chosen to
implement. We have explored shaders based on shadow
mapping [15] which require evaluating GPU fragments
generated from multiple viewpoints. These multiple
viewpoints are related to each other by a linear viewing
transformation. Gathering the data from these multiple
viewpoints requires fine-grained irregular memory access.

This represents worst-case behavior for any memory
system.
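
The "linear viewing transformation" between viewpoints that the closing remarks describe boils down to a matrix reprojection followed by a depth comparison. Here's a minimal sketch of that per-fragment test; the names, the 4x4 row-major layout, and the bias value are my own, and the paper's actual soft-shadow algorithm is more involved than this:

```python
def transform(m, p):
    # Apply a 4x4 row-major matrix to the homogeneous point (x, y, z, 1)
    # and divide through by w: camera space -> light (shadow-map) space.
    x, y, z = p
    v = [m[r][0] * x + m[r][1] * y + m[r][2] * z + m[r][3] for r in range(4)]
    return [c / v[3] for c in v[:3]]

def shadow_test(light_matrix, world_pos, shadow_map_lookup, bias=0.005):
    # Reproject the camera fragment into the light's view, then compare
    # its depth against the nearest occluder stored in the shadow map.
    u, v, depth = transform(light_matrix, world_pos)
    return depth > shadow_map_lookup(u, v) + bias
```

The lookup step is what makes this expensive on the SPEs: (u, v) lands on effectively random shadow-map texels, which is exactly the fine-grained irregular memory access the authors blame for the DMA waiting.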
 
A few things that raise questions
The system's unified memory architecture allows the Cell and GPU to exchange data through shared textures.
???

We can also conclude that the performance of the Cell/B.E.
is superior to a current state of the art high end GPU in that
we achieved comparable performance despite performance
limitations and despite using only part of the available
processing power.

???

Can someone explain what exactly he meant?
 
Where are these kinds of papers published? What is the impact factor for these kinds of journals? Having only experience publishing papers in the medicine/biomedicine field, I have no clue what journals there are for the "electronic" sciences, apart from Nature and Science, which cover most sciences...
 
A few things that raise questions

The system's unified memory architecture allows the Cell and GPU to exchange data through shared textures.

???

Indeed, especially as in the concluding remarks he talks about how you lose performance due to DMA waiting...
 
Indeed, especially as in the concluding remarks he talks about how you lose performance due to DMA waiting...

Wouldn't he just mean that the CPU and GPU can both access the XDR?
 
We can also conclude that the performance of the Cell/B.E. is superior to a current state of the art high end GPU in that we achieved comparable performance despite performance limitations and despite using only part of the available processing power.

???

Can someone explain what exactly he meant?

7800GTX is mentioned in the PDF article.
 
Cell is comparable or superior to a 7800GTX?? In what? Graphics? Doesn't that sound far-fetched, if that's what he meant?
 
Cell is comparable or superior to a 7800GTX?? In what? Graphics? Doesn't that sound far-fetched, if that's what he meant?

In the pixel shading part of the algorithm they tested (a soft shadowing algorithm). The paper breaks down the different parts of the process, which parts Cell handles, and what it's being compared on.
 
How old is that paper?

Seems to be a discrepancy between what we know and what he knows.

RSX is listed as 550MHz with 700MHz memory, as opposed to 500/650.
 
The diagram on page 5, with the 550MHz RSX, looks like it was lifted from somewhere else, probably a pre-release reference. I can't see a reference to that anywhere else. The paper is at most a year old, I'd think.
 
It's from SCEA Research. It's the second link. It's odd that it has no date of publication or submission. The most recent reference is SIGGRAPH 2006, so it's no earlier than August/September 2006.
 
Kai

A few things that raise questions

???

"The system's unified memory architecture allows the Cell and GPU to exchange data through shared textures."

It's saying the CPU and GPU can share the memory, hence "unified". Microsoft uses the word "uniform" (uniform memory architecture) to describe its singular shared memory.

???

"We can also conclude that the performance of the Cell/B.E.
is superior to a current state of the art high end GPU in that
we achieved comparable performance despite performance
limitations and despite using only part of the available
processing power."

Can someone explain what exactly he meant?

I think they are referring to the DMA waits ("only part of the available processing power").


EDIT: What the ... ? How did the word "Kai" get into the post title? I didn't type it. :|
 
Unfortunately I don't have time now to read the full article, however this caught my eye:
The shading computation ran at up to 85 Hz at HDTV 720p resolution on 5 SPEs and generated 30.72 gigaops of performance

Unless I'm mistaken (which I probably am; it's 1am), doesn't this add up to ~450 ops/pixel? And doesn't that seem a tad overcomplex?
gigaop/gigaflop?

[edit]
I guess I'm not taking texture reads into account, etc., but it still sounds high.
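
For what it's worth, taking the quoted figures at face value (30.72 gigaops sustained, 85 fps, 1280x720), the back-of-envelope arithmetic comes out closer to 390 ops per pixel per frame, which is in the same ballpark:

```python
ops = 30.72e9        # reported sustained operation rate
fps = 85             # reported frame rate
pixels = 1280 * 720  # HDTV 720p
ops_per_pixel = ops / (fps * pixels)
print(round(ops_per_pixel))  # prints 392
```

Whether those are single-cycle SIMD ops or something looser (the gigaop vs. gigaflop question above), the paper doesn't make obvious.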
 
They claim to be executing complex shaders. Perhaps it is 450 ops per pixel, and you'd get much better performance (in terms of fewer resources used) with more realistic shaders?

Although if this performance is accurate, who needs RSX in Linux?!
 