CELL from GDC

Jaws said:
one said:
This is reminiscent of BlueGene/L, as its single compute node has a mini-kernel (CNK) running on it.
kaigai033.jpg

Each SPE runs a mini-kernel.

As I predicted, they would run some sort of nano-kernel à la TAOS.

TAOS: http://www.byte.com/art/9407/sec6/art1.htm

Jaws said:
...
They get around the uniform ISA problem in a heterogeneous multi-processor environment by compiling to a 'virtual' processor, with no translation overhead thanks to a very efficient nano-kernel running on each processor. I know the CELL press releases mention that the CELL processors can run multiple operating systems. If this means multiple nano-kernels or equivalent on each core, then they could be borrowing many ideas from TAOS for CELL.
...

http://www.beyond3d.com/forum/viewtopic.php?p=438987#438987


Hannibal, in the article below, mentions that SPEs can have various scalar and vector configurations:

http://arstechnica.com/news.ars/post/20050308-4685.html

Hmm...So we're still on for a CELL Virtual Machine, i.e. CELL VM ISA? :)

You can run a VM, if you want.

I do not think the micro-kernel is there as a VM: it is more like a local work scheduler. Remember that patent by Gschwind about overlays and the need for a global module that would load the overlay chunks you need on demand?

Surely though you can implement your mini-VM idea to virtualize the SPE.

They mention running VMs on the SPEs as an example of how to use them. We will see though; I am not excluding your idea completely.

Hannibal's comment was about changing the hardware inside while keeping the ISA the same: we still have to see what is exposed in the ISA and what is not. As for GPRs, you could not increase their number unless you implemented some kind of register file rotation or register renaming mechanism.

The Taos operating system uses objects from the ground up to enable processors based on different architectures to work together on the same problem.
This is not CELL's problem though ;).
 
I'm keeping an open mind on EVERYTHING from ANYONE at this point. From a gamer's and techie's point of view, this is good shit! :p
 
Jaws said:
tri-core Xe CPU @ 3GHz ~ 144 GFlops if the VMX units can themselves be 2-way SMT.
I think you got it wrong. There is a register bank per thread, but one VMX unit per core. Each thread sees 128 registers, but only one VMX instruction should be executed per core (per clock). That cuts your numbers in two -> 72 GFlops
 
nAo said:
Jaws said:
tri-core Xe CPU @ 3GHz ~ 144 GFlops if the VMX units can themselves be 2-way SMT.
I think you got it wrong. There is a register bank per thread, but one VMX unit per core. Each thread sees 128 registers, but only one VMX instruction should be executed per core (per clock). That cuts your numbers in two -> 72 GFlops

Oh okay...I was hoping for a set of 2 x 128 128-bit registers to allow for 2-way multi-threading on each VMX unit...if it doesn't, then it'll be 90 GFlops...
 
bbot said:
Yeah, CELL is good. Compared to the crap that is the 90Gflops xenon cpu.

That is wrong IMHO, very wrong.

Both the Xenon CPU and CELL have their own GREAT strengths and weaknesses: there are things to get excited about in each and things to be disappointed by in each. To give one example of a weakness for each console: the SIMD style on the SPEs, and only 1 MB of shared L2 to feed three hungry cores and three VERY hungry super-VMX units.
 
Jaws said:
nAo said:
Jaws said:
tri-core Xe CPU @ 3GHz ~ 144 GFlops if the VMX units can themselves be 2-way SMT.
I think you got it wrong. There is a register bank per thread, but one VMX unit per core. Each thread sees 128 registers, but only one VMX instruction should be executed per core (per clock). That cuts your numbers in two -> 72 GFlops

Oh okay...I was hoping for a set of 2 x 128 128-bit registers to allow for 2-way multi-threading on each VMX unit...if it doesn't, then it'll be 90 GFlops...

Just to clarify people's understanding: SMT has ZERO, NONE, NADA effect on peak performance. Never has, never will.

To calculate the peak for the processor: Take the peak issue rate, multiply by the peak flops per issue, multiply by the frequency.


Aaron Spink
speaking for myself inc.
 
aaronspink said:
Jaws said:
nAo said:
Jaws said:
tri-core Xe CPU @ 3GHz ~ 144 GFlops if the VMX units can themselves be 2-way SMT.
I think you got it wrong. There is a register bank per thread, but one VMX unit per core. Each thread sees 128 registers, but only one VMX instruction should be executed per core (per clock). That cuts your numbers in two -> 72 GFlops

Oh okay...I was hoping for a set of 2 x 128 128-bit registers to allow for 2-way multi-threading on each VMX unit...if it doesn't, then it'll be 90 GFlops...

Just to clarify people's understanding: SMT has ZERO, NONE, NADA effect on peak performance. Never has, never will.

To calculate the peak for the processor: Take the peak issue rate, multiply by the peak flops per issue, multiply by the frequency.


Aaron Spink
speaking for myself inc.

Yep...that's a common misconception.

Just to clarify the above 144 Gflops number, it's coming from the speculated capability of the VMX units to 2-way multi-thread themselves.

6 issue * 8 Flops per cycle * 3 GHz ~ 144 GFlops.
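
A quick sketch of the arithmetic being argued about, using the formula quoted above (peak issue rate x peak flops per issue x frequency). The function and constant names are made up for illustration; the core count, clock and the 8 flops per VMX issue are just the numbers used in this thread, not confirmed specs.

```python
def peak_gflops(issues_per_cycle, flops_per_issue, clock_ghz):
    """Peak GFlops = peak issue rate x peak flops per issue x frequency (GHz)."""
    return issues_per_cycle * flops_per_issue * clock_ghz

CORES = 3                 # tri-core, as stated in the thread
CLOCK_GHZ = 3.0           # 3 GHz, as stated in the thread
FLOPS_PER_VMX_ISSUE = 8   # figure used in the thread, not a confirmed spec

# Speculated case: 2 VMX issues per core per cycle -> 6 issues per cycle in total.
two_vmx = peak_gflops(CORES * 2, FLOPS_PER_VMX_ISSUE, CLOCK_GHZ)

# nAo's case: 1 VMX issue per core per cycle -> 3 issues per cycle in total.
one_vmx = peak_gflops(CORES * 1, FLOPS_PER_VMX_ISSUE, CLOCK_GHZ)

print(two_vmx, one_vmx)
```

With these inputs the two cases come out at 144 and 72 GFlops. The number of hardware threads never appears in the formula; only the number of VMX issues per cycle does.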
 
Jaws said:
Yep...that's a common misconception.

Just to clarify the above 144 Gflops number, it's coming from the speculated capability of the VMX units to 2-way multi-thread themselves.

A better way to state this would be that there are 2 separate VMX execution units per core.

Aaron Spink
speaking for myself inc.
 
How many GFlops for the following chips:

- 2-core G5 at 3 GHz with 2 MB cache
- 8 vector processors on another chip at 3 GHz
 
aaronspink said:
Jaws said:
Yep...that's a common misconception.

Just to clarify the above 144 Gflops number, it's coming from the speculated capability of the VMX units to 2-way multi-thread themselves.

A better way to state this would be that there are 2 separate VMX execution units per core.

hmm. it's not clear whether it's two VMX units distributed across thread contexts, or one VMX unit with multiple contexts. the only thing that can be said so far (if anything) is that there's one VMX unit seen per thread context.
 
Offtopic: Can someone enlighten me about the difference between "arbitrary swizzle" and "permute"?
 
darkblu said:
aaronspink said:
Jaws said:
Yep...that's a common misconception.

Just to clarify the above 144 Gflops number, it's coming from the speculated capability of the VMX units to 2-way multi-thread themselves.

A better way to state this would be that there are 2 separate VMX execution units per core.

hmm. it's not clear whether it's two VMX units distributed across thread contexts, or one VMX unit with multiple contexts. the only thing that can be said so far (if anything) is that there's one VMX unit seen per thread context.

I think the multiple-context option is the most likely.
 
darkblu said:
hmm. it's not clear whether it's two VMX units distributed across thread contexts, or one VMX unit with multiple contexts. the only thing that can be said so far (if anything) is that there's one VMX unit seen per thread context.

It is pretty damn clear. The options for the core are as follows:

2 issue with no VMX units
2 issue with up to 1 issue to a VMX unit
2 issue with up to 2 issue to two VMX units.

Those are the options. There could be one hardware context per core or a billion, and it wouldn't change the options.

Multi-threading is an orthogonal issue. Multi-threading is merely a method to get better utilization of resources that are already there; it does not magically create new resources.

Aaron Spink
speaking for myself inc.
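
To make the "better utilization, no new resources" point concrete, here is a toy model, not a description of the real core. The stall patterns are invented, and each ready thread is assumed to offer exactly one instruction per cycle. The only rule taken from the post above is that each functional unit accepts at most one instruction per cycle.

```python
def simulate(ready_patterns, units, cycles):
    """Count instructions issued to `units` functional units over `cycles` cycles.

    ready_patterns: one function per hardware thread, mapping a cycle number
    to True when that thread has an instruction ready (hypothetical model).
    """
    issued = 0
    for cycle in range(cycles):
        ready = sum(1 for is_ready in ready_patterns if is_ready(cycle))
        issued += min(ready, units)  # at most one instruction per unit per cycle
    return issued

# Invented stall patterns: thread A is ready on even cycles, thread B on odd ones.
def thread_a(cycle):
    return cycle % 2 == 0

def thread_b(cycle):
    return cycle % 2 == 1

CYCLES = 100
one_thread = simulate([thread_a], units=1, cycles=CYCLES)
two_threads = simulate([thread_a, thread_b], units=1, cycles=CYCLES)
four_threads = simulate([thread_a, thread_b, thread_a, thread_b], units=1, cycles=CYCLES)

print(one_thread, two_threads, four_threads)
```

In this model one thread keeps the single unit busy half the time, two threads fill it completely, and adding more threads changes nothing: the ceiling stays at one instruction per unit per cycle. Raising `units` is the only way to raise that ceiling.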
 
aaronspink said:
It is pretty damn clear. The options for the core are as follows:

2 issue with no VMX units
2 issue with up to 1 issue to a VMX unit
2 issue with up to 2 issue to two VMX units.

how about 2 issue with up to 2 issue to one VMX unit?

Multi-threading is an orthogonal issue. Multi-threading is merely a method to get better utilization of resources that are already there; it does not magically create new resources.

absolutely. what did i say to leave you with the opposite impression?
 
darkblu said:
how about 2 issue with up to 2 issue to one VMX unit?

How about no. Think of it this way: know anything that can issue 2 instructions to one ALU in a single cycle? For each functional unit you have, you can issue 1 instruction per cycle.

Aaron Spink
speaking for myself inc.
 
aaronspink said:
darkblu said:
how about 2 issue with up to 2 issue to one VMX unit?

How about no. Think of it this way: know anything that can issue 2 instructions to one ALU in a single cycle? For each functional unit you have, you can issue 1 instruction per cycle.

ok, for one, it's about issuing; nothing is known about retiring. hypothetically, you can send as many ops as you have ready (i.e. decoded) to a unit, given the unit has a wide enough entry path to accept them all. what the unit does with them afterwards is, well, its business. if the ops were all nops, then it could just as well retire them all ; )

and then we have that info (as reliable as it is) that each thread in there sees its own VMX context, or at least, a register file. so regardless of how many ops per cycle that VMX unit can handle, it should at least keep a face to each of those couple of SMT threads out there. so how do you explain that? aside from discarding it as non-credible, that is.
 