HotChips 17 - More info upcoming on CELL & PS3

Excuse me if this information has been posted before, and perhaps it doesn't have direct relevance to the topic...

Could anyone tell me whether the PPE can process the VMX unit's 8 flops/cycle together with the FPU's 4 flops/cycle (for a total of 38.4 GFlops at 3.2 GHz)?

(thanks all for the precious links)
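For what it's worth, here's the arithmetic behind that 38.4 number as a quick sketch. The per-cycle flop counts are the ones assumed in the question, not confirmed specs:

```python
CLOCK_GHZ = 3.2

# Figures as stated in the question (assumptions, not confirmed specs):
VMX_FLOPS_PER_CYCLE = 8  # 4-wide SIMD multiply-add = 8 flops/cycle
FPU_FLOPS_PER_CYCLE = 4  # FPU contribution, per the question

peak_gflops = (VMX_FLOPS_PER_CYCLE + FPU_FLOPS_PER_CYCLE) * CLOCK_GHZ
print(peak_gflops)  # ~38.4 GFlops -- but only if both units issue every cycle
```

The 38.4 figure only holds if the PPE can actually dual-issue to both units every cycle, which is exactly what the question is asking.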
 
Not much reporting on this, but:

http://www.techworld.com/opsys/news/index.cfm?NewsID=4223

Interesting that they mention a separate switch is required to connect more than one group of Cell chips. That further tallies with the "we put a switch in PS3 to let you connect Cells together, honest" line from Sony. Not that it would be of particular interest from most people's perspective, but it could be cool if that sort of functionality works from Day One (along with homebrew).
 
Titanio said:
Interesting that they mention a separate switch is required to connect more than one group of Cell chips. That further tallies with the "we put a switch in PS3 to let you connect Cells together, honest" line from Sony. Not that it would be of particular interest from most people's perspective, but it could be cool if that sort of functionality works from Day One (along with homebrew).

By "switch" they meant a network hub.
 
About Toshiba super companion chip (SCC)
http://news.zdnet.com/2100-9584_22-5833453.html
The SCC is essentially a versatile, high-speed input-output port, according to Takayuki Mihara, an engineer with Toshiba. It receives regular and high-definition TV signals, audio and other data from set-top boxes, hard drives and similar items, and then forwards it to the processor.

"The Cell processor needs access windows to communicate with outside modules," Mihara said. "I think the Cell processor will be used in multiprocessor-centric systems such as digital TV. In reality, not many will try to record all that video at once, but it makes for a cool demo."

Cell engineers also emphasized that they designed the Cell processor--and its corresponding helper chips--to ensure smooth audio and video programming.

The Cell, for instance, partitions a single video image into five elements, which then get processed on three separate streams, said Ryuji Sakai of Toshiba. The chip also comes with a bandwidth reservation feature that can allocate bandwidth dynamically to different subcomponents of the chip. Better bandwidth scheduling leads to higher performance.

The SCC sports a wide array of input/output systems. A single chip comes with four USB ports, two serial ATA ports, four PCI slots and a PCI (peripheral component interconnect) express link, and its own memory. The SCC will communicate directly with the Cell chip over a Flex I/O link, designed by Rambus, which can pass 5GB of data per second each way.

Mihara said that the SCC would likely be included in computer systems and audio-visual equipment running Cell, but he did not specifically state which products the SCC would be used in.
The southbridge for Toshiba HDTV?
 
Titanio said:
Not much reporting on this, but:

http://www.techworld.com/opsys/news/index.cfm?NewsID=4223

Interesting that they mention a separate switch is required to connect more than one group of Cell chips. That further tallies with the "we put a switch in PS3 to let you connect Cells together, honest" line from Sony. Not that it would be of particular interest from most people's perspective, but it could be cool if that sort of functionality works from Day One (along with homebrew).


I'm at the conference. No really new information was revealed, except for a little more detail on the interconnects, BIC and EIB. When I get home I can scan the relevant slides and post them (it will take me a week). Hopefully the slides get published.

Talk 1: (IBM) Nothing new.
Talk 2: (IBM) BIC and EIB; details on the control logic and how multiple Cells will connect to each other.
Talk 3: (Toshiba) They snuck in a talk on the *Japanese accent* "Super Companion Chip" they are using to assist the Cell in their TVs. Nothing new on Cell.
Talk 4: (Toshiba) A talk about how they programmed the Cell to do H.264 encoding/decoding. In sum, they used pipelining and scheduled tasks to optimize data flow. We've heard this before.

I'm looking forward to the talk on the Xbox architecture and the keynote on multithreading and parallelism. I'll try to take notes and post after the talk. I'll post in this thread.
 
Npl said:
With "Switch" they meant a Network-Hub.

Actually, the switch is going to have to have a master AC0, and all the Cells connected will have their AC0s in slave mode. The AC0 is a component in the EIB which, I think, controls data flow and addressing of the data. When two Cells are connected together without a switch, one AC0 is master and the other is slave. To connect more than 2, the special switch will be needed.

The speaker suggested that the master Cell would be able to directly address the SPEs on the slave Cell. Also, one of the slides indicates that exceptions can be thrown in one Cell and handled by another. (Note: exception handling is only possible on the PPE; the SPEs can generate them, I think.)

The connection between Cells seems to really be about a means of connecting multiple EIB rings together, (speculation)perhaps to simulate having a bigger ring?(/speculation)

My guess is that the PS3 will not be seeing any of this hot CELL on CELL action. Unless the switches are going to be on a Sony proprietary network? Would it be possible to connect to a switch like this through the internet? I do not understand how that would work.
 
AlgebraicRing said:
I'm at the conference. No really new information was revealed, except for a little more detail on the interconnects, BIC and EIB. When I get home I can scan the relevant slides and post them (it will take me a week). Hopefully the slides get published.

Talk 1: (IBM) Nothing new.
Talk 2: (IBM) BIC and EIB; details on the control logic and how multiple Cells will connect to each other.
Talk 3: (Toshiba) They snuck in a talk on the *Japanese accent* "Super Companion Chip" they are using to assist the Cell in their TVs. Nothing new on Cell.
Talk 4: (Toshiba) A talk about how they programmed the Cell to do H.264 encoding/decoding. In sum, they used pipelining and scheduled tasks to optimize data flow. We've heard this before.

I'm looking forward to the talk on the Xbox architecture and the keynote on multithreading and parallelism. I'll try to take notes and post after the talk. I'll post in this thread.


Thanks! Look forward to it :)
 
AlgebraicRing said:
Actually, the switch is going to have to have a master AC0, and all the Cells connected will have their AC0s in slave mode. The AC0 is a component in the EIB which, I think, controls data flow and addressing of the data. When two Cells are connected together without a switch, one AC0 is master and the other is slave. To connect more than 2, the special switch will be needed.

I know about that, but I was specifically referring to "we put a switch in PS3 to let you connect Cells together, honest". There's no sane way of adding a Cell switch and a Cell to the PS3 board to allow you to insert additional Cells. Clearly they were speaking about networking PS3s and other Cell-driven systems together, which should be able to throw SPUlets at each other. Not the add-another-CPU type of switch.

AlgebraicRing said:
The speaker suggested that the master Cell would be able to directly address the SPEs on the slave Cell. Also, one of the slides indicates that exceptions can be thrown in one Cell and handled by another. (Note: exception handling is only possible on the PPE; the SPEs can generate them, I think.)
There will be RPC stubs on the PPE and SPE; you can do anything with them if you make them complex enough (I don't know if throwing exceptions on the SPE will be a very common thing). AFAIK the stubs can be automatically created with an IDL compiler, and nothing stops you from writing your own, more capable ones.

AlgebraicRing said:
The connection between Cells seems to really be about a means of connecting multiple EIB rings together, (speculation)perhaps to simulate having a bigger ring?(/speculation)

My guess is that the PS3 will not be seeing any of this hot CELL on CELL action. Unless the switches are going to be on a Sony proprietary network? Would it be possible to connect to a switch like this through the internet? I do not understand how that would work.

"Switch" is used in two ways. One is the 2-Cells-to-2-Cells crossbar, which you've seen in many diagrams already; PS3 won't see anything of this.
The "we put a switch in PS3 to let you connect Cells together, honest" refers to a plain network hub that's included in PS3 (3 Ethernet ports). Think of SETI or Folding.
 
Excellent

AlgebraicRing said:
I'm looking forward to the talk on the Xbox architecture and the keynote on multithreading and parallelism. I'll try to take notes and post after the talk. I'll post in this thread.
That's cool, thanks!
 
Titanio said:
Not much reporting on this, but:

http://www.techworld.com/opsys/news/index.cfm?NewsID=4223

Interesting that they mention a separate switch is required to connect more than one group of Cell chips. That further tallies with the "we put a switch in PS3 to let you connect Cells together, honest" line from Sony. Not that it would be of particular interest from most people's perspective, but it could be cool if that sort of functionality works from Day One (along with homebrew).

Thanks for the link, Titanio.
 
Keynote II
David Kirk from NVIDIA
Multicore, Multipipes, Multithreads -- too much parallelism to handle?


I'm just going to post a small summary and whatever I thought stood out as interesting.


* Processor / System Parallelism
-- Single vs. Multi core
-- Fine vs. Coarse grained
-- Single vs. Multi pipeline
-- vector vs scalar math
-- Data vs. Thread Vs. Instruction level parallel
-- Single vs. Multithreaded processors
-- Message passing vs. Shared Memory communication
-- SISD, SIMD, MIMD ...
-- Tightly vs. Loosely connected cores & threads

* Application / Problem Parallelism
-- No parallelism in workload means System/Processor parallelism is irrelevant
-- Large problems can more easily be parallelized
-- Good Parallel Behavior:
*-- Many inputs/results
*-- Parallel Structure -- Many similar computation paths
*-- Little Interaction between data/threads
-- Data parallelism easy to map to machine "automagically"
-- task parallelism requires programmer forethought

* Adopting parallel software programming models will rely on university education.

* Cell Processor Approach to Parallelism (What is the programming model?)
-- Stuff we've seen so far

* Geforce 7800
-- 302M transistors, all computation-oriented; no transistors applied to cache or other non-computation elements.

* Translucency demo: the hand creatures with light sources behind them. This demo is much, much longer, with more intro than the E3 version had. He froze the demo with the girl in the body suit, moved the picture around, and showed the geometry density of the models. (Sorry, I didn't get any of the stats, but it's all the GeForce 7800 stats.) Up to 30 rendering passes to layer everything in. It's highly parallel computation.

*Life of a Triangle through graphics pipeline. (SLIDE 2 page 8) Simplified
Vertex Fetch
Vertex Processing
Primitive Assembly Setup
Rasterize & Z-cull (throwing away what isn't seen, keeping what is)
Pixel Shader
Texture <-> Frame Buffer
Pixel Shader
Pixel Engine (ROP)

Vertices are independent, so they are highly parallelizable.

Block Diagrams of Shaders and 7800 (Slide 1,2 page 9)

* Big GFlop #s
Geforce 6800 Ultra
- Clock : 425
- Vec4 MAD Ginstructions : 6.7568
- Gflops : 54.0544
Geforce 7800 GTX
- Clock : 430
- Vec4 MAD Ginstructions : 20.6331
- Gflops : 165.0648
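Those GFlops figures are just the Ginstruction numbers multiplied by 8, since a Vec4 MAD counts as 4 lanes times 2 flops (a multiply plus an add). A quick sanity check:

```python
def vec4_mad_gflops(ginstructions_per_sec):
    # one Vec4 MAD = 4 lanes x 2 flops (multiply + add) = 8 flops
    return ginstructions_per_sec * 8

print(vec4_mad_gflops(6.7568))   # 54.0544  (GeForce 6800 Ultra)
print(vec4_mad_gflops(20.6331))  # 165.0648 (GeForce 7800 GTX)
```

Both rows of the slide match the same flops-per-MAD convention.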


*GPU Approach to Parallelism
-- single core
-- multipipeline
-- multithreaded
-- fine grained
-- Vector
-- Explicitly and Implicitly threaded (programmer threads code for shaders, instances spawned automatically)
-- Data Parallel (no communication between threads)

* Multi Pipeline App Improvement
-- Multithreaded applications: X-times speedup, where X is the number of pipelines, due to data parallelism
-- still requires a lot of software development effort (programming languages lack expressiveness for parallelism)
-- the CPU is also a bottleneck
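The "X pipelines = X-times speedup" claim only holds when there is zero serial work. A quick Amdahl's-law sketch (my own illustration, not from the slides) shows how a serial bottleneck like the CPU caps the gain:

```python
def speedup(n_pipelines, serial_fraction):
    # Amdahl's law: the serial fraction of the workload caps the
    # benefit of adding more parallel pipelines
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_pipelines)

print(speedup(24, 0.0))  # ideal case: 24 pipelines -> ~24x
print(speedup(24, 0.1))  # 10% serial work (e.g. the CPU) -> only ~7.3x
```

Even a modest serial fraction eats most of the theoretical speedup, which is exactly why the CPU shows up as a bottleneck in the next bullet.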

* Dual Core Processors
-- no improvement in feeding GPUs, because apps are single-threaded.

* GPU Programming Languages
-- DX, OGL 1.3, Brook for GPUs (Stanford), SH for GPUs (for GPGPU)
-- various languages have parallelism baked into language

* The benefit of parallelism in GPUs comes from the data being parallel in nature, i.e. independent at a horizontal level (vertices can be processed independently).
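To illustrate that point (my own toy example, not from the talk): each vertex can be transformed with no communication between threads, so the work maps onto any number of pipelines with a plain parallel map:

```python
from multiprocessing.dummy import Pool  # stdlib thread pool

def transform(v, scale=2.0):
    # each vertex is handled independently -- no shared state, no ordering
    x, y, z = v
    return (x * scale, y * scale, z * scale)

vertices = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
with Pool(4) as pool:
    out = pool.map(transform, vertices)
print(out)  # [(2.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 2.0)]
```

This is the "automagic" data parallelism from the earlier slide: because no thread needs another thread's result, the mapping onto hardware pipelines is trivial.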

* Problems for widespread adoption of parallel programming.
-- Programming languages do not have parallelism baked into their semantics.
-- writing parallel code in a single-threaded language is INSANE
-- Language development is critical

* Design Strategies for CPU/GPU
-- CPU: Make workload (one thread) run as fast as possible
*-- Caching
*-- Instruction/Data Prefetch
*-- "Hyperthreading"
*-- Speculative Execution
*-- limited by "perimeter" - communication bandwidth
*-- multicore will help... a little
-- GPU: Make the workload (as many threads as possible) run as fast as possible.
*-- Parallelism (1000s of threads)
*-- Pipelining
*-- limited by "area" -- compute capability

* Implementable programs on a GPU (versus the CPU)
-- Graphics (of course)
-- Image Processing and Analysis
-- Correlations - Radio telescope, SETI
-- Monte Carlo Simulation - Neutron Transport
-- Neural Networks (speech recognition, handwriting recognition)
-- Ray Tracing
-- Physical Modeling and Simulation
-- Video Processing
-- Black Scholes option Pricing
(Nvidia declined to speculate about a future where graphics plateaus and whether they would expand into generic processing)


*The Good news and the Not-so-good news
-- The Good
*-- increasing Parallelism in CPU/GPU
*-- Workloads -- graphics and GP -- are highly parallel
*-- Moore's law and the "capability curve" are still our friends
-- The Bad
* -- Parallel programming is HARD (especially in a serial language/environment)
* -- Language and tool support for parallelism is poor
* -- Computer science education is not focused on parallel programming (needs to be at undergrad level, not grad level)

* Solution: More research into multithreaded development, especially language design.

-------------------------------------------------------------------------

(Guys if i suck at taking notes, let me know and tell me how to improve!) :)

I don't think anything significant was really said in the talk, I'm a language guy so I was thrilled to hear some acknowledgement that we need better language design.
 
Is it time to go back to early computing work and dig out some historic languages? I'm sure, seeing as every conceivable computer and language model was considered at computing's birth, that there exists a parallel language of some form. Maybe even C++ will give way to a new breed of language! :oops:




:LOL:
 
The present and next-gen GPUs are way too rigid to use efficiently for non-trivial physics, neural networks or even raytracing... unless you go back and forth a lot between the processor and the GPU. Cell might be good at quickly processing small batch problems (not my favourite parallel programming model, BTW) and returning results; GPUs haven't tended to be.
 
from Xbox 360 Slides (Presentation hasn't occurred yet, figure I'd get these up so I don't have to write during the session)

Hardware Specs

Triple-core 3.2 GHz custom CPU
- shared 1MB L2 cache
- customized vector floating point unit per core
- 5.4 Gbps FSB: 10.8 GB/s read and 10.8 GB/s write
** GPU can read from L2 (!!! I didn't know this !!!)
500 MHz custom GPU
- 48 parallel unified shaders
- 10 MB embedded DRAM for frame buffer: 256 GB/s
512 MB unified memory (700 MHz GDDR3: 22.4 GB/s)
12x dual-layer DVD
20 GB hard drive
High-def video out.
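A back-of-envelope check on those bandwidth numbers (the 16-bit-per-direction FSB width is my assumption; the memory figure assumes a 128-bit double-data-rate GDDR3 bus):

```python
# FSB: bits/sec per pin x link width, converted to bytes
fsb_gbps_per_pin = 5.4
fsb_width_bits = 16          # assumed width per direction
fsb_gbytes = fsb_gbps_per_pin * fsb_width_bits / 8
print(fsb_gbytes)            # 10.8 GB/s each way, matching the slide

# Memory: clock x 2 (DDR) x bus width in bytes
mem_clock_mhz = 700
mem_bus_bits = 128           # assumed bus width
mem_gbytes = mem_clock_mhz * 2 * mem_bus_bits / 8 / 1000
print(mem_gbytes)            # 22.4 GB/s, matching the slide
```

Both headline figures fall straight out of the quoted clocks with these widths, so the slide numbers are internally consistent.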

System Block Diagram (I'm just going to list the stuff hanging off the IO chip)
DVD (SATA)
HDD (SATA)
Front Controllers (2 USB)
Wireless Controllers
MU ports (2 USB)
Rear Panel USB
Ethernet
IR
Audio Out
FLASH
System Control
Video Out (hanging off of a separate Analog chip)

CPU: PPC Core Specs
* 3 3.2 GHz PowerPC cores
* Shared 1 MB L2 cache, 8-way associative
* Per-Core features
- 2 issue per cycle, in-order, decoupled vector/scalar issue queue
- 2 symmetric fine grain hardware threads
- L1 Caches: 32K 2-way I$ / 32K 4-way D$
- Execution Pipelines
-- Branch Unit, Integer Unit, Load/Store Unit
-- VMX 128 Units: Floating Point Unit, Permute Unit, Simple Unit
-- Scalar FPU
* VMX128 enhanced for game and graphics workloads
-- all execution units 4-way SIMD
-- 128 128-bit vector registers per thread
-- custom dot-product instruction
-- native D3D compressed data formats

CPU Data Streaming Specs
* High bandwidth data streaming support with minimal cache thrashing
- 128B cache line size (all cache)
- Flexible set locking in L2
- Write streaming:
* L1s are write through, writes do not allocate in L1
* 4 uncacheable write gathering buffers per core
* 8 cacheable, non-sequential write gathering buffers per core
- Read Streaming:
* xDCBT data prefetch around L2, directly into L1
* 8 outstanding load/prefetches per core
- Tight GPU data streaming integration (XPS)
* XPS -- "Xbox Procedural Synthesis"
* GPU 128B read from L2
* GPU low latency cacheable writebacks to CPU
* GPU shared D3D compressed data formats with CPU => at least 2x effective bus bandwidth for typical graphics data.

GPU Specs
* 500 MHz graphics processor
- 48 parallel shader cores (ALUs); dynamically scheduled 32-bit IEEE floating point
- 24 billion shader instructions per second
* (super scalar design; scalar and texture ops per instruction)
- Pixel fillrate: 4 billion pixels/sec (8 per cycle); 2x for depth / stencil only
* AA: 16 billion samples/sec; 2x for depth / stencil only
- Geometry rate: 500 million triangles/sec
- Texture rate: 8 billion bilinear samples / sec
* 10 MB EDRAM -> 256 GB/s fill
* Direct3d 9.0 Compatible
- High level Shader Language (HLSL) 3.0+ support
* Custom features
- Memory export; Particle physics, subdivision surfaces
- Tiling acceleration: full resolution Hi-Z, Predicated Primitives
- XPS:
* CPU cores can be slaved to GPU processing
* GPU reads geometry data directly from L2
- Hardware scaling for display resolution matching
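Those headline GPU rates all fall out of units-per-clock times the clock; a quick check (my own arithmetic, not from the slides):

```python
clock_hz = 500e6  # 500 MHz GPU clock

shader_instr_per_sec = 48 * clock_hz    # 48 ALUs, 1 instruction/cycle each
pixel_fill_per_sec = 8 * clock_hz       # 8 pixels per cycle
texture_samples_per_sec = 16 * clock_hz  # 16 bilinear samples per cycle

print(shader_instr_per_sec)    # 24 billion shader instructions/sec
print(pixel_fill_per_sec)      # 4 billion pixels/sec
print(texture_samples_per_sec)  # 8 billion bilinear samples/sec
```

All three match the slide figures, so the per-cycle unit counts and the 500 MHz clock are consistent.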


Architectural Choices
* FSAA, alpha and z place heavy load on memory BW
* Post-process effects require large depth complexity
* Enable flexible UMA solution
* Main Memory FB/ZB => unpredictable performance
* Solution: take FB/ZB fill-rate out of the equation

Software
* SMP/SMT
- Mainstream techniques
- Everything is simplified by being symmetric
* UMA
- No partitioning headaches
* OS
- All 3 cores available for game developers
* Standard APIs
- Win32, OpenMP
- Direct3d, HLSL
- Assembly (CPU & Shader) supported - direct hardware access
* Standard tools
- XNA; PIX, XACT
- Visual C++, works with multiple threads

-------------------------------------

I don't know if there is anything really new in there, I just posted everything just in case. I'm not following the XBOX360 as close as the CELL.
 
Ooh, good stuff, thanks for posting - I found the following to be new (to me) things:

* GPU low latency cacheable writebacks to CPU
* GPU shared D3D compressed data formats with CPU => at least 2x effective bus bandwidth for typical graphics data.


* CPU cores can be slaved to GPU processing

I was expecting the writebacks to the CPU, but I don't think I've seen it listed explicitly like this before.

The CPU cores being slaved to the GPU - hey we need to know more about that, that sounds pretty interesting!

Jawed
 
Xbox 360 talk, explaining design decisions and System Architecture

by Jeff Andrews and Nick Baker

(I'll post slides in a week if they are not published on the net by then)


* All games are required to support 720p (I think I heard that right)

-----
CPU
-----

* Most of the time was spent enhancing the VMX-128 units for graphics purposes.

* XPS -- small amount of read data to generate lots and lots of geometry. (used as a "decompression" algorithm)

* GPU write back to CPU is to indicate that the GPU is done reading data.

* D3D compressed data formats were customized into both the VMX units and into the GPU.

* prefetching reads can go into L1 and skip L2. Writes can skip L1 and go to L2 (this is to avoid thrashing)

* Claim: the compressed D3D effectively adds an extra 20GB/s bandwidth.
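A rough sanity check of that claim (my own back-of-envelope, assuming roughly 2:1 compression on typical graphics data over the 10.8 GB/s-each-way FSB):

```python
bus_each_way = 10.8
total_bus = 2 * bus_each_way  # 21.6 GB/s raw (read + write)
effective = 2 * total_bus     # assume ~2:1 compression on typical data
extra = effective - total_bus
print(extra)  # ~21.6 GB/s of "free" bandwidth, near the ~20 GB/s claimed
```

So the "extra 20 GB/s" claim is roughly one full bus worth of bandwidth, which is what a 2:1 compression ratio would buy.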

-----
GPU
-----

* The added EDRAM allows main memory to be dedicated to texture and vertex data (read only). This makes things easier for main memory.

-----
DEMO
-----

(XBOX 360 in form factor is here.)

* PGR3
- starts with multiscreens
- cars racing on track
- dynamic angle selection from multiple sources while playing (i.e. like multiple screens at once)

-----
Questions
-----

* CPU Slave to GPU -- Just doing XPS stuff, trying to keep the GPU fed with data. There's a whole communication chain going back and forth between the GPU and CPU.
(I'm assuming the compressed D3D could play a big part in the matter)

* Can the XBOX 360 decode h.264 (MPEG4) -- Ummm.... What is h.264? We do High Def video out.
(My interpretation: it can do the work, but it doesn't sound like these hardware guys know of any projects dealing with next-gen video format.)

* Any comment on Blu-Ray or HD-DVD: ( No real information pulled out of them, they didn't act like they knew one way or another.)

**** (The parts in parentheses are my own musings and interpretations, from what I could remember)
 
This D3D compression and the new GPU-CPU relationships are very very interesting.

It should be interesting to compare and contrast the CPU-GPU inter-capabilities between MS's CPU and the Cell. Seems some good discussion is coming soon.
 
AlgebraicRing.........

 