Second Gen Cell info

David Wang is awesome - I hope he does indeed end up making that third article in the series, because I have been a big fan of the first two. He makes some really interesting points with regard to the size of the level 2 cache as well and how there seems to be something more there than meets the eye.

Name: David Wang 4/27/05

In the ISSCC 2005 article about the CELL processor, the die size of the processor was reported to be 221 mm2. It was thus interesting to see that the Microprocessor Reports article on the CELL processor states that IBM plans to ship the CELL processor with a die size of 235 mm2 [1]. In the MPR article, the die size differential wasn't explained, but it has since emerged that IBM went back and re-engineered the PPE, and the PPE (and the CELL processor as a whole) grew bigger.

In the die photo of the CELL processor released at ISSCC, the PPE and the "512K L2" cache have the same width, but the PPE shares that width with the self-test unit[2].

In the latest die photos, the PPE and the "512K L2" still share the same width, but the test unit has been moved to share that width with the L2 cache block rather than with the PPE. Basically, the PPE grew by roughly twice the width of the test block between the two photos.

I noticed this interesting issue about a week ago, since I have high resolution photos of not only the new die photo, but also of the prototype discussed at ISSCC 2005. I've been working on it off and on, but it seems that I've been "scooped". A separate discussion has been going on at Beyond3D in looking at the growth of the PPE and speculating as to why IBM went back and re-engineered the PPE, and what IBM did[3].

A separate issue is that the "512K L2" in the CELL processor is significantly larger than the 512K L2 in the PPC970FX processor, to the tune of 2X larger. It looks like the data and tag arrays are about the same size, but the block labelled "512K L2" in the CELL processor has a lot of other structures in it, and I was working to figure out what they are. Collectively, the die photo analysis would have been a third article in the series here, but since the cat's out of the bag and I don't have that much time, that third article looks less likely by the minute. Regardless, the fact that IBM appears to have significantly re-engineered the PPE (which sort of explains the reluctance to discuss the PPE at ISSCC) is an interesting tidbit deserving of some discussion.

[1] IBM PDF

[2] IBM Research

[3] Beyond3D discussion

By the way, much respect to Version for seemingly 'scooping' the story. ;)

(but if you're reading this David Wang - make that third article!)
 
psurge said:
See page 3-4 of this PDF for some PPE details.

The multithreading design supports fine-grained multithreading with round-robin thread scheduling. If both threads are active, the processor will fetch an instruction from each thread in turn. When one thread cannot issue a new instruction or is not active, the other active thread will be allowed to issue an instruction every cycle.

The wording is confusing (fetch != issue) - at first glance this seems to imply the PPE is single issue.

Anyway, realworldtech has a thread on this very same topic here. Hopefully David Wang will write up his thoughts on the die photos...

A strange way of doing SMT: if only one thread is active, that thread is not given full control of execution, and the wording about fetching seems to say that it fetches only one instruction each cycle.

So, basically, am I right in saying that the PPE can fetch 1 instruction per cycle, but can issue up to 2 instructions per cycle if there are two threads active?

Alpha's EV8 worked differently, and I think the Pentium 4 does too: EV8 was an 8-way core (it could execute up to 8 instructions per cycle from a single thread) and it could also divide its execution units to have 4 threads running, with each thread able to execute only two instructions per cycle, IIRC.

I will do some research, but it seems an odd way to implement SMT: in a single-threaded application the PPE would seem to work as a scalar (1-way) processor.
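The round-robin policy in the quoted PDF can be sketched as a toy model (purely illustrative; the function and its behavior are my reading of the quoted wording, not anything from IBM):

```python
# Toy model of the described fetch policy: with two active threads,
# fetch alternates round-robin; when only one thread is active, the
# lone thread gets to fetch every cycle.
def fetch_schedule(cycles, t0_active, t1_active):
    """Return which thread fetches on each cycle (None = no fetch)."""
    schedule = []
    last = 1  # so thread 0 goes first when both threads are active
    for _ in range(cycles):
        if t0_active and t1_active:
            last = 1 - last           # strict round-robin alternation
            schedule.append(last)
        elif t0_active:
            schedule.append(0)        # lone active thread fetches every cycle
        elif t1_active:
            schedule.append(1)
        else:
            schedule.append(None)
    return schedule

print(fetch_schedule(6, True, True))   # [0, 1, 0, 1, 0, 1]
print(fetch_schedule(4, True, False))  # [0, 0, 0, 0]
```

Either way, per-thread fetch bandwidth is one instruction every other cycle when both threads are active, which is what makes the single-fetch/dual-issue wording so confusing.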

PPEcompare.jpg
 
aaaaa00 said:
I built a DShow filtergraph outputting an SD MPEG2 stream to the Null renderer (to remove the video card from the equation), and on this old 1.4 GHz P4 with a 400 MHz FSB (it's an engineering sample I got from Intel many years ago), it consumed about 20% of the CPU. That's with all the overhead of DirectShow/Kernel Streaming/etc.

Ok, I have confirmed that MPEG2 decoding scales roughly linearly on this PC at least.

I got ~20% CPU usage with one stream, so I built a DirectShow filtergraph with various numbers of MPEG2 streams, output to the default Video Renderer. No audio decoding, though I suspect that would add minimal CPU load.

Remember this is far from a state-of-the-art machine: only a single 1.4 GHz P4, 256KB L2 cache, no HT, only a 400 MHz FSB, no DDR memory, no fancy PureVideo acceleration.

1 stream = ~20% CPU
4 streams = ~ 80% CPU
5 streams = CPU pegged.

You could tell that at 5 streams the machine was starting to struggle to keep up; you would occasionally get dropped frames when something running in the background kicked in and sucked away some CPU cycles.

One other thing is that Windows XP doesn't have multimedia I/O prioritization, nor are there processor reserves or any of that realtime stuff, so there's no way for the OS to prevent background apps from causing a frame drop when the system is running flat out like this.

Assuming you remove the OS from the equation, and run the MPEG2 codecs bare metal on the CPU directly (no background thread scheduling, no random I/Os from system services), I bet you could squeeze out some more from this CPU, or at the very least make 5 streams run perfectly.
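The linear scaling measured above amounts to a one-line model (back-of-the-envelope, using the ~20% per-stream figure from the test; real decoders also contend for cache and the FSB, so this ignores any such effects):

```python
# Naive linear model of MPEG2 decode load on the 1.4 GHz P4 above:
# each SD stream costs ~20% CPU, and total load saturates at 100%.
def cpu_load(n_streams, per_stream=0.20):
    """Predicted CPU load, capped at 1.0 (pegged)."""
    return min(n_streams * per_stream, 1.0)

print(cpu_load(1))  # 0.2 -> matches the measured single-stream load
print(cpu_load(4))  # 0.8 -> matches the ~80% measurement
print(cpu_load(5))  # 1.0 -> CPU pegged, as observed
```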

To be clear:

48 SD MPEG2 streams is impressive. I don't think you can currently build a non-exotic PC that can do that. That said, it doesn't seem far out of reach for near-future, or even current, PC tech to match.

To be doubly clear:

I'm not suggesting that the demo showed CELL's maximum performance.
 
The paper is confusing matters, because elsewhere it states that the PPE is a dual-issue, in-order core. There's not much point in building a dual-issue core if you can only fetch one instruction per cycle.

I just think the paper is wrong where it states that only one instruction is fetched. It should be that two instructions are fetched each cycle, alternating between the threads on successive cycles.

Cheers
Gubbi
 
Dunno, maybe it's similar to the SPE's limited dual-issue operation on page 6, since the PPE core and the SPEs share the ISA in most aspects.

"If the instruction is not properly aligned, the instruction swap operation will force single issue operation"

Any ideas?
 
Gubbi - the only reason I can think of for single fetch/dual issue is this :

there are enough pipeline bubbles from high instruction latency and issue restrictions to make IPC above 1 extremely rare (even with 100% cache-hits), making dual fetch pointless (in the average case).
Dual issue could still bring up IPC by enough to make sense:
IPC could exceed 1 for short bursts, and independent producers could issue in parallel, providing results to consumers in less time...
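That trade-off can be illustrated with a toy fetch/issue model (entirely made up for illustration; the function, the buffer, and the stall pattern are assumptions, not the real PPE pipeline): sustained IPC never exceeds the 1-per-cycle fetch rate, but a 2-wide issue port drains the backlog that builds up during bubbles, so the stalls cost nothing overall.

```python
# Toy model: fetch 1 instruction/cycle into a buffer; issue up to
# `issue_width` instructions/cycle except on stall cycles (bubbles).
def cycles_to_issue(n_instr, issue_width, stalls):
    """Cycles needed to issue n_instr instructions."""
    fetched = issued = buffered = cycle = 0
    while issued < n_instr:
        cycle += 1
        if fetched < n_instr:            # fetch stage: 1 instr/cycle
            fetched += 1
            buffered += 1
        if cycle not in stalls:          # issue stage, unless bubbled
            grab = min(issue_width, buffered)
            buffered -= grab
            issued += grab
    return cycle

# 8 instructions with issue bubbles on cycles 2 and 3:
print(cycles_to_issue(8, 1, {2, 3}))   # 10 -> single issue pays for both bubbles
print(cycles_to_issue(8, 2, {2, 3}))   # 8  -> dual issue hides them completely
print(cycles_to_issue(8, 2, set()))    # 8  -> but can never beat the fetch rate
```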
 
Seems David himself is puzzled by the changes between the two variants. Here was his question from an email he sent in correspondence.


David Wang quote:
"Let me know if you figure out why some of the
execution units look to have been "flipped", as mirror
images of each other between DD1 and DD2. It's
interesting and puzzling at the same time."

Architecturally there are differences; the functionality of those differences is still open to debate, and we might have to wait for the next chip conference, when IBM decides to let more leak out about it. There's not much to be gained by looking at G5s and the like because of the differences between a fat Power core implementation and this one. I'm not even sure these differences can be put down to just a single- or dual-issue situation by themselves.
 
psurge said:
Gubbi - the only reason I can think of for single fetch/dual issue is this :

there are enough pipeline bubbles from high instruction latency and issue restrictions to make IPC above 1 extremely rare (even with 100% cache-hits), making dual fetch pointless (in the average case).
Dual issue could still bring up IPC by enough to make sense:
IPC could exceed 1 for short bursts, and independent producers could issue in parallel, providing results to consumers in less time...

That's certainly a possibility. I still find it odd though; the fetch/decode stage is fairly cheap in terms of logic compared to the rest of the chip, and could potentially double throughput for well-behaved code.

Cheers
Gubbi
 
MYCOM PC WEB posted its coverage of Cell at Cool Chips VIII, with many slides. (Unfortunately nothing on the IBM presentation about DD2)

http://pcweb.mycom.co.jp/articles/2005/04/28/coolchips1/
(machine translation)

Part 1 covers the presentation by Masakazu Suzuoki of SCE.
He talks about the design philosophy of Cell: exploiting TLP through multiple cores while getting enough ILP from 2-way superscalar issue, affinity for multi-core with OOP modules assignable to each core, simpler core design for media processing, independent address spaces to virtualize processor resources, and the secure Isolated Mode (the "Hotel Model"), with graphs that describe the various trade-offs and the corresponding design points chosen in Cell.

http://pcweb.mycom.co.jp/articles/2005/04/28/coolchips1/001.html
(machine translation)

Part 2 covers the presentation by Seiji Maeda of Toshiba, "A Cell Software Platform for Digital Media Application". For Toshiba, Cell is a platform that removes the need for custom hardware for each SKU by implementing the engines in software. The nearest-term goal is simultaneous recording/playback of multiple movies using only Cell.

High-level languages like C/C++ can be used for programming, and Toshiba's platform supports a multi-layer programming model. The software engine is composed of a PPE Module and SPE Modules. The programming model of the SPE Module consists of 3 layers:

SPE Module - the highest-level elements, such as audio/video codecs. Each consists of one or more SPE Threads.
SPE Thread - the actual processing elements. There are 2 types, Time-Shared and Dedicated.
SPE Overlay - exchanges data between Local Stores (and main memory).

The PPE runs a Realtime Resource Scheduler that assigns each SPE Thread to an SPE.

Toshiba then gave a lecture about the scheduler, using the implementation of an MPEG2 decoder developed at Toshiba as an example. Each SPE Module has parameters such as the number of threads and processor load, and the scheduler assigns SPE Threads by consulting them. Since SPE Threads form a pipeline, it has to consider overlapping their executions. The scheduling overhead in the experimental program is under 1 ms. The overhead to output the results of all SPE Threads is under 1.5% of the processing power of all SPEs and amounts to 1/30 sec (but async DMA can hide it by overlapping SPE I/O).
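The scheduler as described (each SPE Module advertises parameters like thread count and load, and the PPE-side scheduler places SPE Threads onto SPEs) could be sketched like this; a minimal guess at the mechanism, where the class names, the load metric, and least-loaded placement are my assumptions, not Toshiba's implementation:

```python
# Hypothetical sketch of load-based SPE Thread placement, in the
# spirit of the Realtime Resource Scheduler described above.
from dataclasses import dataclass, field

@dataclass
class SPE:
    ident: int
    load: float = 0.0                         # fraction of SPE capacity in use
    threads: list = field(default_factory=list)

def assign(spe_threads, spes):
    """Place each (name, load) SPE Thread on the least-loaded SPE."""
    for name, load in spe_threads:
        target = min(spes, key=lambda s: s.load)
        target.threads.append(name)
        target.load += load
    return spes

# Three pipeline stages of a made-up decoder, spread over two SPEs:
spes = assign([("vld", 0.4), ("idct", 0.5), ("mc", 0.3)],
              [SPE(i) for i in range(2)])
for s in spes:
    print(s.ident, s.threads, round(s.load, 1))
# 0 ['vld', 'mc'] 0.7
# 1 ['idct'] 0.5
```

A real scheduler would also have to respect the Time-Shared vs. Dedicated distinction and the pipeline ordering mentioned above; this only shows the load-consulting part.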

In the lab at Toshiba, 1920x1080 (1080i) HDTV-quality MPEG-2 can be processed with 1 SPE, and an implementation of an HDTV-quality H.264 decoder has also been completed.

At the end of the presentation they showed a demo of playing 48 720x480 MPEG-2 streams with 6 SPEs without dropping a frame (another SPE does the scaling to 240x180 thumbnails). The clock speed of Cell in this demo was "secret", FWIW.
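Quick arithmetic on those demo numbers (assuming the decode work divides evenly across the SPEs; nothing here comes from the presentation beyond the stream counts, and the P4 comparison uses aaaaa00's figures from earlier in the thread):

```python
# 48 SD streams across the 6 decoding SPEs (a 7th did the thumbnails):
streams, decode_spes = 48, 6
per_spe = streams / decode_spes
print(per_spe)               # 8.0 SD streams per decoding SPE

# vs. the 1.4 GHz P4 above, which pegged at ~5 SD streams:
p4_streams = 5
print(per_spe / p4_streams)  # 1.6 -> each SPE ~1.6x that whole P4
```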
 
Tacitblue said:
Seems David himself is puzzled by the changes between the two variants. Here was his question from an email he sent in correspondence.


David Wang quote:
"Let me know if you figure out why some of the
execution units look to have been "flipped", as mirror
images of each other between DD1 and DD2. It's
interesting and puzzling at the same time."


Architecturally there are differences; the functionality of those differences is still open to debate, and we might have to wait for the next chip conference, when IBM decides to let more leak out about it. There's not much to be gained by looking at G5s and the like because of the differences between a fat Power core implementation and this one. I'm not even sure these differences can be put down to just a single- or dual-issue situation by themselves.

Well, David mentions some of the execution units being flipped... I'm still confused about the doubling of units, though. Pana's enlarged comparison above shows it quite clearly... hmmm... :?
 
Could someone independently verify the following for me, please?

Okay, if you zoom in close to the high-res DD2 CELL die shots (I'm using the SPU registers as a reference):

SPU = 128 x 128-bit register file

If you look closely at the SPU register block, there are 64 green strips making up the entire 128 x 128-bit file.

So the basic building block is one 'green strip' = (128 x 128 bits) / 64 = 2 x 128 bits = 8 x 32 bits,

i.e. each 'green strip' is an 8 x 32-bit register row.

So for this purpose, I'll label that most basic 'REGISTER UNIT' (RU) = 8 x 32 bits,

and 64 RUs = the 128 x 128-bit reg file for the SPU.



If we use the RU to describe the reg files on the PPE, without assigning them to any execution unit, going from top to bottom:

TOP reg block:

18 RU = 18 x (8 x 32 bits) = 144 x 32 bits
= 72 x 64-bit OR 2 x 36 x 64-bit reg files

MIDDLE (two aligned blocks):

36 RU = 36 x (8 x 32 bits) = 288 x 32 bits
= 72 x 128-bit OR 2 x 36 x 128-bit

BOTTOM reg block:

18 RU = 18 x (8 x 32 bits) = 144 x 32 bits
= 72 x 64-bit OR 2 x 36 x 64-bit reg files


Summary of reg files for CELL DD2:

SPU:

128 x 128-bit

PPE:

TOP: 72 x 64-bit OR 2 x 36 x 64-bit

MIDDLE: 72 x 128-bit OR 2 x 36 x 128-bit

BOTTOM: 72 x 64-bit OR 2 x 36 x 64-bit


Obviously I haven't assigned them to any execution units. However, this is what the DD2 die is showing, AFAICS. Could someone confirm this please? Thanks...
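The strip arithmetic above is at least internally consistent (a quick sanity check only; the premise that each green strip is one physical row of the register file is the poster's assumption, not confirmed):

```python
# SPU: 128 entries x 128 bits, visibly divided into 64 strips.
SPU_ENTRIES, SPU_WIDTH, STRIPS = 128, 128, 64
bits_per_ru = SPU_ENTRIES * SPU_WIDTH // STRIPS
assert bits_per_ru == 2 * 128 == 8 * 32      # one RU = 8 x 32 bits

def entries(n_ru, width_bits):
    """Entries of a given width formed by n_ru register units."""
    return n_ru * bits_per_ru // width_bits

print(entries(18, 64))    # 72 -> TOP/BOTTOM blocks: 72 x 64-bit
print(entries(36, 128))   # 72 -> MIDDLE blocks: 72 x 128-bit
```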
 
That's what I get if I count like you do.
The question is whether we can count like that; 36 registers per thread seems somewhat odd.

That is, unless the program counter and/or status registers are stored in the register bank. But if that is the case, why aren't they stored in the register bank in the SPE?

edit: spelling and/or grammar.
 