CELL from GDC

Panajev2001a · Mar 10, 2005

Jaws said:
one said:

This is reminiscent of BlueGene/L, as its single compute node has a mini-kernel (CNK) running on it.


Each SPE runs a mini-kernel

As I predicted they would run some sort of nano-kernel ala TAOS,

TAOS: http://www.byte.com/art/9407/sec6/art1.htm

Jaws said:

...
They get around the uniform ISA problem in a heterogeneous multi-processor environment by compiling to a 'virtual' processor with no overheads in translation due to a very efficient nano-kernel running on each processor. I know the CELL press releases mention that the CELL processors can run multiple operating systems. If this means multiple nano-kernels or equivalent on each core, then they could be borrowing many ideas from TAOS for CELL.
...


http://www.beyond3d.com/forum/viewtopic.php?p=438987#438987

Hannibal, in the article below, mentions that SPEs can have various scalar and vector configurations:

http://arstechnica.com/news.ars/post/20050308-4685.html

Hmm...So we're still on for a CELL Virtual Machine, i.e. CELL VM ISA?

You can run a VM, if you want.

I do not think the micro-kernel is there as a VM: it is more like a local work scheduler. Remember that patent by Gschwind about overlays and the need for a global module which would take care of loading, on an as-needed basis, the chunks of overlay you need?

Surely though you can implement your mini-VM idea to virtualize the SPE.

They mention the idea of running VMs on the SPEs as an example of how to use them. We will see though, I am not excluding your idea completely.

Hannibal's comment was about changing the hardware lying inside while letting the ISA stay the same: we still have to see what is exposed in the ISA and what is not. As for GPRs, you could not increase their number unless you implemented some kind of register file rotation or register renaming mechanism.

The Taos operating system uses objects from the ground up to enable processors based on different architectures to work together on the same problem.

This is not CELL's problem though.


j^aws · Mar 10, 2005

I'm keeping an open mind on EVERYTHING from ANYONE at this point. From a gamer's and a techie's point of view, this is good shit!

bbot · Mar 10, 2005

Yeah, CELL is good. Compared to the crap that is the 90Gflops xenon cpu.

j^aws · Mar 10, 2005

bbot said:
Yeah, CELL is good. Compared to the crap that is the 90Gflops xenon cpu.

http://www.beyond3d.com/forum/viewtopic.php?p=476554#476554

tri-core Xe CPU @ 3GHz ~ 144 GFlops if the VMX units can themselves be 2-way SMT.

Just keep an open mind until everything is made official and even then wait for everything to die down when all the information overload is absorbed and fully analysed without knee-jerk reactions!

nAo · Mar 10, 2005

Jaws said:
tri-core Xe CPU @ 3GHz ~ 144 GFlops if the VMX units can themselves be 2-way SMT.

I think you got it wrong. There is a register bank per thread, but a VMX unit per core. Each thread sees 128 registers, but only one VMX instruction should be executed per core (per clock). That cuts your numbers in two -> 72 GFlops

j^aws · Mar 10, 2005

nAo said:
Jaws said:

tri-core Xe CPU @ 3GHz ~ 144 GFlops if the VMX units can themselves be 2-way SMT.


I think you got it wrong. There is a register bank per thread, but a VMX unit per core. Each thread sees 128 registers, but only one VMX instruction should be executed per core (per clock). That cuts your numbers in two -> 72 GFlops

Oh okay...I was hoping for a set of 2 × 128 128-bit registers to allow for 2-way multi-threading on each VMX unit...if it doesn't then it'll be 90 GFlops then...

passerby · Mar 10, 2005

Those of you going to get sleepless over the possible 'Cell nightmare' - just ask Ty.

Panajev2001a · Mar 10, 2005

bbot said:
Yeah, CELL is good. Compared to the crap that is the 90Gflops xenon cpu.

That is wrong IMHO, very wrong.

Both the Xenon CPU and CELL have their own GREAT strengths and weaknesses: there are things to get excited about in each and things to be disappointed by in each (to give an example of a weakness for both consoles: the SIMD style on the SPEs, and only 1 MB of shared L2 to feed three hungry cores and three VERY hungry super-VMX units).

aaronspink · Mar 10, 2005

Jaws said:
nAo said:

Jaws said:

tri-core Xe CPU @ 3GHz ~ 144 GFlops if the VMX units can themselves be 2-way SMT.


I think you got it wrong. There is a register bank per thread, but a VMX unit per core. Each thread sees 128 registers, but only one VMX instruction should be executed per core (per clock). That cuts your numbers in two -> 72 GFlops


Oh okay...I was hoping for a set of 2 × 128 128-bit registers to allow for 2-way multi-threading on each VMX unit...if it doesn't then it'll be 90 GFlops then...

Just to clarify people's understanding: SMT has ZERO, NONE, NADA effect on peak performance. Never has, never will.

To calculate the peak for the processor: Take the peak issue rate, multiply by the peak flops per issue, multiply by the frequency.
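As a rough sketch of that formula in Python, using the numbers floated in this thread: the function name is made up for illustration, and the 8 flops per VMX instruction (a 4-wide multiply-add) is an assumption, not a confirmed spec.

```python
def peak_gflops(issue_per_cycle, flops_per_issue, freq_ghz):
    """Peak GFLOPS = peak issue rate x peak flops per issue x clock in GHz."""
    return issue_per_cycle * flops_per_issue * freq_ghz

# Assumption: one VMX instruction is a 4-wide multiply-add, i.e. 8 flops.
FLOPS_PER_VMX_OP = 8
CORES = 3
FREQ_GHZ = 3.0

# One VMX issue per core per clock vs. two VMX issues per core per clock.
one_vmx_per_core = peak_gflops(CORES * 1, FLOPS_PER_VMX_OP, FREQ_GHZ)
two_vmx_per_core = peak_gflops(CORES * 2, FLOPS_PER_VMX_OP, FREQ_GHZ)

print(one_vmx_per_core, two_vmx_per_core)  # 72.0 144.0
```

Note that the number of hardware threads appears nowhere in the calculation; only the number of VMX instructions issued per clock does.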

Aaron Spink
speaking for myself inc.

j^aws · Mar 10, 2005

aaronspink said:
Jaws said:

nAo said:

Jaws said:

tri-core Xe CPU @ 3GHz ~ 144 GFlops if the VMX units can themselves be 2-way SMT.


I think you got it wrong. There is a register bank per thread, but a VMX unit per core. Each thread sees 128 registers, but only one VMX instruction should be executed per core (per clock). That cuts your numbers in two -> 72 GFlops


Oh okay...I was hoping for a set of 2 × 128 128-bit registers to allow for 2-way multi-threading on each VMX unit...if it doesn't then it'll be 90 GFlops then...


Just to clarify people's understanding: SMT has ZERO, NONE, NADA effect on peak performance. Never has, never will.

To calculate the peak for the processor: Take the peak issue rate, multiply by the peak flops per issue, multiply by the frequency.

Aaron Spink
speaking for myself inc.

Yep...that's a common misconception.

Just to clarify the above 144 Gflops number, it's coming from the speculated capability of the VMX units to 2-way multi-thread themselves.

6 issue * 8 Flops per cycle * 3 GHz ~ 144 GFlops.

aaronspink · Mar 10, 2005

Jaws said:
Yep...that's a common misconception.

Just to clarify the above 144 Gflops number, it's coming from the speculated capability of the VMX units to 2-way multi-thread themselves.

A better way to state this would be that there are 2 separate VMX execution units per core.

Aaron Spink
speaking for myself inc.

bbot · Mar 10, 2005

How many GFlops for the following chips:

- 2-core G5 at 3 GHz with 2 MB cache
- 8 vector processors in another chip at 3 GHz

darkblu · Mar 10, 2005

aaronspink said:
Jaws said:

Yep...that's a common misconception.

Just to clarify the above 144 Gflops number, it's coming from the speculated capability of the VMX units to 2-way multi-thread themselves.


A better way to state this would be that there are 2 separate VMX execution units per core.

hmm. it's not clear whether it's two VMX units distributed across thread contexts, or one VMX unit with multiple contexts. the only thing that can be said so far (if anything) is that there's one VMX unit seen per thread context.

Npl · Mar 10, 2005

Offtopic: Can someone enlighten me about the difference between "arbitrary swizzle" and "permute"?

Panajev2001a · Mar 10, 2005

darkblu said:
aaronspink said:

Jaws said:

Yep...that's a common misconception.

Just to clarify the above 144 Gflops number, it's coming from the speculated capability of the VMX units to 2-way multi-thread themselves.


A better way to state this would be that there are 2 separate VMX execution units per core.


hmm. it's not clear whether it's two VMX units distributed across thread contexts, or one VMX unit with multiple contexts. the only thing that can be said so far (if anything) is that there's one VMX unit seen per thread context.

I think the multiple context issue is the most likely.

aaronspink · Mar 10, 2005

darkblu said:
hmm. it's not clear whether it's two VMX units distributed across thread contexts, or one VMX unit with multiple contexts. the only thing that can be said so far (if anything) is that there's one VMX unit seen per thread context.

It is pretty damn clear. The options for the core are as follows:

2 issue with no VMX units
2 issue with up to 1 issue to a VMX unit
2 issue with up to 2 issue to two VMX units.

Those are the options. There could be a billion or 1 Hardware Contexts per core and it wouldn't change the options.

Multi-threading is an orthogonal issue. Multi-threading is merely a method to get better utilization of resources that are already there, it does not magically create new resources.
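A toy model of that point, assuming a single VMX unit that accepts at most one op per cycle. The function name and the ready patterns are invented for illustration, and an op that is ready but not picked is simply dropped, so this is only a sketch of utilization, not of any real scheduler.

```python
def vmx_utilization(threads, cycles):
    """Fraction of cycles in which the single VMX unit issues an op.

    Each thread is a repeating list of booleans: True means the thread
    has a VMX op ready in that cycle, False means it is stalled.
    """
    issued = 0
    for c in range(cycles):
        # The unit issues at most ONE op per cycle, however many
        # threads happen to have an op ready.
        if any(t[c % len(t)] for t in threads):
            issued += 1
    return issued / cycles

stalls_odd = [True, False]    # thread stalled on odd cycles
stalls_even = [False, True]   # thread stalled on even cycles
always_ready = [True]

single = vmx_utilization([stalls_odd], 100)
smt = vmx_utilization([stalls_odd, stalls_even], 100)
ideal_smt = vmx_utilization([always_ready, always_ready], 100)

print(single, smt, ideal_smt)  # 0.5 1.0 1.0
```

The second thread fills the stall cycles of the first, so utilization goes from 0.5 to 1.0, but two always-ready threads still cap at 1.0 ops per cycle: the peak is set by the unit, not by the thread count.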

Aaron Spink
speaking for myself inc.

darkblu · Mar 10, 2005

aaronspink said:
It is pretty damn clear. The options for the core are as follows:

2 issue with no VMX units
2 issue with up to 1 issue to a VMX unit
2 issue with up to 2 issue to two VMX units.

how about 2 issue with up to 2 issue to one VMX unit?

Multi-threading is an orthogonal issue. Multi-threading is merely a method to get better utilization of resources that are already there, it does not magically create new resources.

absolutely. what did i say to leave you with the opposite impression?

aaronspink · Mar 10, 2005

darkblu said:
how about 2 issue with up to 2 issue to one VMX unit?

How about no. Think of it this way: do you know of anything that can issue 2 instructions to one ALU in a single cycle? For each functional unit you have, you can issue 1 instruction per cycle.

Aaron Spink
speaking for myself inc.

darkblu · Mar 10, 2005

aaronspink said:
darkblu said:

how about 2 issue with up to 2 issue to one VMX unit?


How about no. Think of it this way: do you know of anything that can issue 2 instructions to one ALU in a single cycle? For each functional unit you have, you can issue 1 instruction per cycle.

ok, for one, it's about issuing, nothing is known about retiring. hypothetically, you can send as many ops as you have ready (i.e. decoded) to a unit, given the unit has a wide enough entry path to accept all of them. what the unit does with them afterwards is, well, its business. if the ops were all nops, then it could just as well retire them all ; )

and then we have that info (as reliable as it is) that each thread in there sees its own VMX context, or at least, a register file. so regardless of how many ops per cycle that VMX unit can handle, it should at least keep a face to each of those couple of SMT threads out there. so how do you explain that? aside from discarding it as non-credible, that is.

Fafalada · Mar 11, 2005

Npl said:
Offtopic: Can someone enlight me about the difference between "arbitrary swizzle" and "permute"?

The former is part of the execution pipeline.
