Alan Heirich's paper on PS3 Deferred Shading (Cell pixel shading)

Titanio

It must be about a year since we first caught whiff of Alan Heirich's work on deferred shading across Cell+GPU. If you recall the abstract:

Mapping Deferred Pixel Shaders onto the Cell Architecture
Alan Heirich
Sony Computer Entertainment America
aheirich(at)Playstation.sony.com

Abstract
This paper studies a deferred pixel shading algorithm implemented on a Cell-based computer entertainment system. The pixel shader runs on the Synergistic Processing Units (SPUs) of the Cell and works concurrently with the GPU to render images. The system's unified memory architecture allows the Cell and GPU to exchange data through shared textures. The SPUs use the Cell DMA list capability to gather irregular fine-grained fragments of texture data generated by the GPU. They return resultant shadow textures the same way. The shading computation ran at up to 85 Hz at HDTV 720p resolution on 5 SPUs and generated 30.72 gigaops of performance. This is comparable to the performance of the algorithm running on a state of the art high end GPU. These results indicate that a hybrid solution in which the Cell and GPU work together can produce higher performance than either device working alone.

Discussion was kind of blunted by the fact that further details weren't forthcoming, but it looks like SCEA Research finally added the full paper to their page:


Deferred Pixel Shading on PLAYSTATION3


I've only been able to glance through it, but I'm sure it'll be of interest. This actually looks like an updated paper, credited to Alan Heirich and Louis Bavoil. I would quote the abstract and conclusion here, but my PDF viewer under Linux doesn't seem to let me copy text.
 
Here it is...

Abstract said:
This paper studies a deferred pixel shading algorithm
implemented on a Cell/B.E.-based computer entertainment
system.

The pixel shader runs on the Synergistic Processing Elements
(SPEs) of the Cell/B.E. and works concurrently with the GPU to
render images. The system's unified memory architecture allows
the Cell/B.E. and GPU to exchange data through shared textures.
The SPEs use the Cell/B.E. DMA list capability to gather
irregular fine-grained fragments of texture data generated by the
GPU. They return resultant shadow textures the same way. The
shading computation ran at up to 85 Hz at HDTV 720p
resolution on 5 SPEs and generated 30.72 gigaops of
performance. This is comparable to the performance of the
algorithm running on a state of the art high end GPU. These
results indicate that the Cell/B.E. can effectively enhance the
throughput of a GPU in this hybrid system by alleviating the
pixel shading bottleneck.
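
The "DMA list" gather the abstract mentions can be pictured like this. What follows is a toy Python model of the behavior, not Cell SDK code (on a real SPE this would be an `mfc_getl` transfer); the function name and data layout here are my own:

```python
# Toy model of a DMA-list gather: each list entry names an
# (offset, size) pair in main memory, and the transfer streams those
# irregular, fine-grained fragments into one contiguous local-store buffer.
def dma_list_gather(memory, entries):
    """Gather the fragments described by (offset, size) entries."""
    local_store = []
    for offset, size in entries:
        local_store.extend(memory[offset:offset + size])
    return local_store

# Scattered reads from "texture memory" land in one dense buffer:
texture_memory = list(range(100))
fragments = dma_list_gather(texture_memory, [(3, 2), (40, 4), (7, 1)])
```

Returning the resultant shadow textures "the same way" is the mirror image: the same kind of list entries, with data flowing from local store back out to main memory.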

Closing Remarks said:
We have explored moving pixel shaders from the GPU to
the Cell/B.E. processor of the PLAYSTATION®3
computer entertainment system. Our initial results are
encouraging as they show it is feasible to attain scalable
speedup and high performance even for shaders with
irregular fine-grained data access patterns. Removing the
computation from the GPU effectively increases the frame
rate, or more likely, the geometric complexity of the models
that can be rendered in real time.

We can also conclude that the performance of the Cell/B.E.
is superior to a current state of the art high end GPU in that
we achieved comparable performance despite performance
limitations and despite using only part of the available
processing power. Our current implementation loses
substantial performance due to DMA waiting. This results
from the fine-grained irregular access to memory and is
specific to the type of shaders we have chosen to
implement. We have explored shaders based on shadow
mapping [15] which require evaluating GPU fragments
generated from multiple viewpoints. These multiple
viewpoints are related to each other by a linear viewing
transformation. Gathering the data from these multiple
viewpoints requires fine-grained irregular memory access.

This represents worst-case behavior for any memory
system.
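
The "linear viewing transformation" between viewpoints that the closing remarks describe boils down to a matrix reprojection followed by a depth comparison. Here's a minimal sketch of that per-fragment test; the names, the 4x4 row-major layout, and the bias value are my own, and the paper's actual soft-shadow algorithm is more involved than this:

```python
def transform(m, p):
    # Apply a 4x4 row-major matrix to the homogeneous point (x, y, z, 1)
    # and divide through by w: camera space -> light (shadow-map) space.
    x, y, z = p
    v = [m[r][0] * x + m[r][1] * y + m[r][2] * z + m[r][3] for r in range(4)]
    return [c / v[3] for c in v[:3]]

def shadow_test(light_matrix, world_pos, shadow_map_lookup, bias=0.005):
    # Reproject the camera fragment into the light's view, then compare
    # its depth against the nearest occluder stored in the shadow map.
    u, v, depth = transform(light_matrix, world_pos)
    return depth > shadow_map_lookup(u, v) + bias
```

The lookup step is what makes this expensive on the SPEs: (u, v) lands on effectively random shadow-map texels, which is exactly the fine-grained irregular memory access the authors blame for the DMA waiting.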
 
A few things that raise questions
The system's unified memory architecture allows the Cell and GPU to exchange data through shared textures.
???

We can also conclude that the performance of the Cell/B.E.
is superior to a current state of the art high end GPU in that
we achieved comparable performance despite performance
limitations and despite using only part of the available
processing power.

???

Can someone explain what exactly he meant?
 
Where are these kinds of papers published? What is the impact factor for these kinds of journals? Having only experience publishing papers in the medicine/biomedicine field, I have no clue what journals there are for the "electronic" sciences, apart from Nature and Science, which cover most sciences...
 
A few things that raise questions

The system's unified memory architecture allows the Cell and GPU to exchange data through shared textures.

???

Indeed, especially as in the concluding remarks he talks about how you lose performance due to DMA waiting...
 
Indeed, especially as in the concluding remarks he talks about how you lose performance due to DMA waiting...

Wouldn't he just mean that the CPU and GPU can both access the XDR?
 
We can also conclude that the performance of the Cell/B.E. is superior to a current state of the art high end GPU in that we achieved comparable performance despite performance limitations and despite using only part of the available processing power.

???

Can someone explain what exactly he meant?

7800GTX is mentioned in the PDF article.
 
Cell is comparable or superior to a 7800GTX?? In what? Graphics? Doesn't that sound far-fetched, if that's what he meant?
 
Cell is comparable or superior to a 7800GTX?? In what? Graphics? Doesn't that sound far-fetched, if that's what he meant?

In the pixel shading part of the algorithm they tested (a soft shadowing algorithm). The paper breaks down the different parts of the process, which parts Cell handles, and what it's being compared on.
 
How old is that paper?

Seems to be a discrepancy between what we know and what he knows.

RSX is listed as 550MHz with 700MHz memory, as opposed to 500/650.
 
The diagram on page 5, with the 550MHz RSX, looks like it was lifted from somewhere else, probably a pre-release reference. I can't see a reference to that anywhere else. The paper is at most a year old, I'd think.
 
It's from SCEA Research. It's the second link. It's odd that it has no date of publication or submission. The most recent reference is SIGGRAPH 2006, so it's no earlier than August/September 2006.
 
Kai

A few things that raise questions

???

"The system's unified memory architecture allows the Cell and GPU to exchange data through shared textures."

It's saying the CPU and GPU can share the memory, hence "unified". Microsoft uses the word "uniform" (uniform memory architecture) to describe its singular shared memory.

???

"We can also conclude that the performance of the Cell/B.E.
is superior to a current state of the art high end GPU in that
we achieved comparable performance despite performance
limitations and despite using only part of the available
processing power."

Can someone explain what exactly he meant?

I think they are referring to the DMA waits ("only part of the available processing power").


EDIT: What the ... ? How did the word "Kai" get into the post title? I didn't type it. :|
 
Unfortunately I don't have time now to read the full article, however this caught my eye:
The shading computation ran at up to 85 Hz at HDTV 720p resolution on 5 SPEs and generated 30.72 gigaops of performance

Unless I'm mistaken (which I probably am; it's 1am), doesn't this add up to ~450 ops/pixel? And doesn't that seem a tad overcomplex?
gigaop/gigaflop?

[edit]
I guess I'm not taking texture reads into account, etc., but it still sounds high.
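
For what it's worth, taking the quoted figures at face value (30.72 gigaops sustained, 85 fps, 1280x720), the back-of-envelope arithmetic comes out closer to 390 ops per pixel per frame, which is in the same ballpark:

```python
ops = 30.72e9        # reported sustained operation rate
fps = 85             # reported frame rate
pixels = 1280 * 720  # HDTV 720p
ops_per_pixel = ops / (fps * pixels)
print(round(ops_per_pixel))  # prints 392
```

Whether those are single-cycle SIMD ops or something looser (the gigaop vs. gigaflop question above), the paper doesn't make obvious.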
 
They claim to be executing complex shaders. Perhaps it is 450 ops per pixel, and you'd get much better performance (in terms of fewer resources used) with more realistic shaders?

Although if this performance is accurate, who needs RSX in Linux?!
 