ISSCC 2005

I think Peddie went too far in the other direction. CELL is an excellent desktop chip not just for games, but also for home media and DCC work. It will run ported apps like Adobe Premiere/After Effects much faster than any dual-core Pentium.

The fact of the matter is, for desktop users, scalar performance is good enough. Productivity apps don't need much more performance, especially since a lot of their performance is dictated by system I/O anyway.

As for servers, CELL isn't optimal, but neither is Intel's architecture. I think Sun's Niagara processor is the right approach for most server tasks.

The Wintel ISA sucks, is at a dead end, and multicore is a quick hack to save it. Simplified execution units and multithreading are the future of server processing. Server tasks are inherently multithreaded, and server apps are already written that way, so expanding horizontally with lots of execution units is the way to go, not uber-out-of-order, mega-deep, mega-cache pipelines.
 
Something is wrong here:
1) SPEs have 128 x 128-bit registers
2) Each instruction is 32 bits long
3) Each instruction has 4 operands (3 in, 1 out)

Then there should be some heavy restriction on which registers can be addressed in the same instruction; otherwise, how can it be that just 4 bits are left to store the opcode? What about masking, swizzling, etc.? ;)
Any idea?
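
To make the arithmetic explicit: 128 registers need 7 address bits each, so four register fields already consume 28 of the 32 bits. A sketch in C (the field layout here is invented):

Code:
#include <stdint.h>

/* Hypothetical 4-operand encoding: 4 register fields * 7 bits = 28 bits,
   leaving 32 - 28 = 4 bits for the opcode, masks, swizzles, everything. */
static uint32_t encode(uint32_t op, uint32_t rt, uint32_t ra,
                       uint32_t rb, uint32_t rc)
{
    /* [31:28] op | [27:21] rc | [20:14] rb | [13:7] ra | [6:0] rt */
    return (op & 0xFu) << 28 | (rc & 0x7Fu) << 21 | (rb & 0x7Fu) << 14
         | (ra & 0x7Fu) << 7 | (rt & 0x7Fu);
}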
 
nAo said:
Something is wrong here:
1) SPEs have 128 x 128-bit registers
2) Each instruction is 32 bits long
3) Each instruction has 4 operands (3 in, 1 out)

Then there should be some heavy restriction on which registers can be addressed in the same instruction; otherwise, how can it be that just 4 bits are left to store the opcode? What about masking, swizzling, etc.? ;)
Any idea?

prefixes?
 
nAo said:
darkblu said:
prefixes?
Care to elaborate? :)

sure, compile times here are too long not to ;)

OK, 'prefixing' may not be the best term to use here: given that the architecture has a fixed op width, classic op prefixing-by-expansion would be out of the question. Nevertheless, for statistically rare cases (if we assume swizzles and such are rare), special 'escape' opcodes could be provided that basically modify the semantics of the (N) subsequent ops. Something that could generally be referred to as prefixing just as well, IMO.
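
A minimal sketch of the idea in C (every opcode and encoding here is invented): the escape op does no work of its own, it just arms a swizzle that changes the meaning of the next op, so the common unswizzled case still fits in one 32-bit instruction.

Code:
#include <stdint.h>
#include <stdio.h>

enum { OP_FMADD = 0x1, OP_SWZ_PREFIX = 0xF };   /* made-up opcodes */

static uint8_t pending_swizzle;   /* four 2-bit lane selectors */
static int     prefix_active;

static void decode(uint32_t insn)
{
    uint32_t op = insn >> 28;
    if (op == OP_SWZ_PREFIX) {           /* escape op: just arm a swizzle */
        pending_swizzle = insn & 0xFFu;
        prefix_active = 1;
        return;
    }
    if (op == OP_FMADD) {
        if (prefix_active)
            printf("fmadd, first source swizzled by %02x\n", pending_swizzle);
        else
            printf("plain fmadd\n");
        prefix_active = 0;               /* a prefix only covers one op */
    }
}

int main(void)
{
    decode(0x10000000u);   /* plain fmadd                  */
    decode(0xF0000000u);   /* prefix: swizzle pattern 0x00 */
    decode(0x10000000u);   /* this fmadd sees the swizzle  */
    return 0;
}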
 
nAo said:
Something is wrong here:
1) SPEs have 128 x 128-bit registers
2) Each instruction is 32 bits long
3) Each instruction has 4 operands (3 in, 1 out)

Then there should be some heavy restriction on which registers can be addressed in the same instruction; otherwise, how can it be that just 4 bits are left to store the opcode? What about masking, swizzling, etc.? ;)
Any idea?

Danack came to the same conclusion on page 18, so I'll just quote myself from page 19.

I said:
Most instructions would only need 2 source registers; the only instruction that needs three source registers that I can think of right now is MAC, where I assume that the destination register is also one of the source registers.
I.E.
mac R1, R2, R3 => R3 = R3 + R1 * R2 (totally fictional ISA)

This would still give three source operands to the execution units plus one out operand.
 
Thank you rendezvous, I missed Danack's question and your answer, but I'm still not that convinced.
Even if, in the fmadd case, the out register is also one of the in registers, too few bits are left for things like masking, swizzling, and broadcasting. Let's say there are at least 5 opcode bits and that the out register is the accumulator register; that leaves 32 - 5 - 21 = 6 bits. Another 4 bits would be needed for destination masking, so we are left with just 2 bits for everything else.
Either the ISA is very poor (a SIMD engine without component swizzling, or at least broadcasting?) or something is wrong here.. IMHO

ciao,
Marco
 
nAo said:
Either the ISA is very poor (a SIMD engine without component swizzling, or at least broadcasting?) or something is wrong here.. IMHO

Permute is in the odd pipeline; it's possible that it has a shortcut to become a fused swizzle with the even FMAC pipeline...

The instruction set could expose it to look like a general swizzle on the FMACs, but it would just use both pipes..
i.e.

Code:
FMADD r0.xxxx, r1, r2

is actually executed as (EVEN first):

Code:
EVEN                          ODD
FMADD r0.xyzw, r1, r2         PERMUTE r0.xxxx, r0, r0
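
In plain C terms (a sketch, not actual SPU code or intrinsics), that odd-pipe PERMUTE just broadcasts the x lane of the even-pipe result:

Code:
/* what PERMUTE r0.xxxx, r0, r0 computes: replicate lane x across r0 */
typedef struct { float x, y, z, w; } vec4;

static vec4 permute_xxxx(vec4 v)
{
    vec4 r = { v.x, v.x, v.x, v.x };
    return r;
}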
 
DeanoC said:
Permute is in the odd pipeline; it's possible that it has a shortcut to become a fused swizzle with the even FMAC pipeline...
That could be a good explanation, thanks Deano ;)
Unfortunately, if done that way, swizzling input operands will consume an extra slot too..
Anyway.. we should expect things like this; an SPE is clocked so high, and fmadd latency is still so low, that they probably couldn't fit all the cool things a vertex shader does into one pipe without increasing pipeline depth too much.

ciao,
Marco
 
DemoCoder said:
The fact of the matter is, for desktop users, scalar performance is good enough. Productivity apps don't need much more performance, especially since a lot of their performance is dictated by system I/O anyway.
Excuse me if this is a dumb question, but would it not be better to have even stronger scalar performance from the CPU and be able to utilize the GPU for non-scalar ops? Of course, this is assuming the bus is not a bottleneck.
 
nelg said:
DemoCoder said:
The fact of the matter is, for desktop users, scalar performance is good enough. Productivity apps don't need much more performance, especially since a lot of their performance is dictated by system I/O anyway.
Excuse me if this is a dumb question, but would it not be better to have even stronger scalar performance from the CPU and be able to utilize the GPU for non-scalar ops? Of course, this is assuming the bus is not a bottleneck.

He's mentioning the simple fact that normal applications, such as corporate apps, word processors, spreadsheets, finance management, and all such software, do NOT need lots of CPU processing power. All they need to run well is the right amount of memory and fast I/O.

Does anyone have any doubt about CELL running a word processor or some other stupid application? The only thing I see CELL might not be suitable for is database servers. And even there I have my doubts.
 
Alejux said:
He's mentioning the simple fact that normal applications, such as corporate apps, word processors, spreadsheets, finance management, and all such software, do NOT need lots of CPU processing power. All they need to run well is the right amount of memory and fast I/O.
Yes, and if scalar performance is adequate for such programs, what benefit does using a Cell system give (seeing that it will still require a GPU) that today’s CPU + GPU does not provide? Again, this is assuming that the GPU could effectively be used as a general-purpose vector processor without the bus being a limiting factor.
 
991060 said:
a dumb question: how can a 32-bit int or float constant be encoded into a 32-bit instruction?
IIRC (which is not too probable), the constant becomes part of the executable's memory image. The constant is loaded at runtime into a register via an offset or pointer. At least, that's one way to do it.

If you aren't too picky about being RISCy, you could flag somewhere in the 32 bits that the next 32 bits of the instruction stream are the constant you want. 'Course, this makes the processor a CISC chip, but we'll probably live anyway. ;)
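
A minimal sketch of the first approach (the literal pool) in C, with invented names: the full 32-bit constants live in the executable's data section, so an instruction only needs a small pool offset, which fits comfortably in a fixed 32-bit encoding.

Code:
#include <stdint.h>

/* the constants sit in the binary's data, not in the instruction stream */
static const uint32_t literal_pool[] = { 0xDEADBEEFu, 0x3F800000u };

uint32_t load_const(unsigned idx)
{
    return literal_pool[idx];   /* "loaded via an offset or pointer" */
}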
 
nelg said:
Alejux said:
He's mentioning the simple fact that normal applications, such as corporate apps, word processors, spreadsheets, finance management, and all such software, do NOT need lots of CPU processing power. All they need to run well is the right amount of memory and fast I/O.
Yes, and if scalar performance is adequate for such programs, what benefit does using a Cell system give (seeing that it will still require a GPU) that today’s CPU + GPU does not provide? Again, this is assuming that the GPU could effectively be used as a general-purpose vector processor without the bus being a limiting factor.

The point is, neither a faster Pentium, a dual-core Pentium, nor Cell is going to speed up Microsoft Word or Internet Explorer by any subjectively noticeable amount. So Cell's inability to do for integer/scalar work what it does for vector work is not a relevant critique.

Cell will accelerate vector- and multithreading-oriented tasks: coding/decoding, rendering, compression/decompression, speech and handwriting recognition, sound and some device drivers, CAD, digital content creation, simulation, games, and other desktop-oriented tasks.

It won't accelerate MS Office tasks, but scalar CPUs have diminishing returns in that area anyway. It won't accelerate server tasks like web serving, database execution, socket servers, etc., because frankly, those tasks are more amenable to a granular multithreaded approach; they are not stream-oriented and are predominantly scalar. On the other hand, I think Intel and AMD are going to get their asses handed to them in the server arena in the future, because cheap, low-power, less complex chips can be built to handle server tasks, and Linux and open source mean most applications can be easily recompiled for these commodity server-tuned systems; there's no Microsoft lock-in effect. It may take a few years, but in the server space I think the pendulum is going to swing back the other way, away from x86 and towards other architectures.
 
Scalar-oriented work will migrate onto VMs such as Java or .NET (see Java servlets or scripting languages), except for the OS, the VM itself, databases, and so on. So as long as those VMs can JIT in a manner friendly to Cell, I think Cell's relatively low scalar performance more or less won't matter in the (maybe far) future. Intel already has research on a CPU optimized for VM acceleration, but they are reluctant because it's difficult to maintain x86 compatibility.
[attached slide: kaigai01.jpg]
 
one said:
Scalar-oriented work will migrate onto VMs such as Java or .NET (see Java servlets or scripting languages), except for the OS, the VM itself, databases, and so on. So as long as those VMs can JIT in a manner friendly to Cell, I think Cell's relatively low scalar performance more or less won't matter in the (maybe far) future. Intel already has research on a CPU optimized for VM acceleration, but they are reluctant because it's difficult to maintain x86 compatibility.
Interesting how you regard Intel's slide as "truth" regarding the future.
:)
Only teasing.
However, that slide and your last sentence underscore the difficulty Intel has in truly coming up with something better than x86 and its software infrastructure for general computing. They can do it, no problem, but they are mortally afraid of hurting their cash cow. So they do some supercomputer projects, buy some mobile IP and push into that market, make some embedded designs, collect graphics IP diligently, and they came up with, and fumbled, Itanium. Intel tries to have a foothold in every market, in case it expands into greater significance, and they do a reasonably good job of it.
Fundamentally, Intel is struggling to optimize their market revenue and market control, with a healthy dose of internal politics thrown in for good measure.
And it's tricky going.
It's interesting, however, that Microsoft has recognized the Trojan horse that gaming consoles represent, whereas Intel seems not to worry too much; personally, I see that as a matter of different ambitions. Microsoft wants to be everywhere; Intel is focused on corporations and governments.

So where does this rambling lead? In the context of this forum, to a claim that Intel is capable of designing a processor that could compete better with IBM's offerings in consoles (although not necessarily beat them), but that they may well feel this is tangential to their overall strategy and interests; indeed, a deviation from x86 to maintain a foothold in the console market might even be detrimental in the greater scheme of things. It's semi-official that they offered Microsoft an x86 alternative and that Microsoft rejected it.

It seems that Microsoft won't try to migrate the XBox2 design into PC space; their XNA evangelizing has been very clear in its separation of console and PC space. But it would be truly interesting if that dividing line began to blur. It would open up new possibilities in several ways, most significantly in terms of small-computer (PC) architecture.

The most interesting aspect of the XBox2 is whether there will be a variation of it with PC-like capabilities.
 
DemoCoder said:
Cell will accelerate vector- and multithreading-oriented tasks: coding/decoding, rendering, compression/decompression, speech and handwriting recognition, sound and some device drivers, CAD, digital content creation, simulation, games, and other desktop-oriented tasks.

It won't accelerate MS Office tasks, but scalar CPUs have diminishing returns in that area anyway. It won't accelerate server tasks like web serving, database execution, socket servers, etc., because frankly, those tasks are more amenable to a granular multithreaded approach; they are not stream-oriented and are predominantly scalar. On the other hand, I think Intel and AMD are going to get their asses handed to them in the server arena in the future, because cheap, low-power, less complex chips can be built to handle server tasks, and Linux and open source mean most applications can be easily recompiled for these commodity server-tuned systems; there's no Microsoft lock-in effect. It may take a few years, but in the server space I think the pendulum is going to swing back the other way, away from x86 and towards other architectures.

Would an x86 CPU plus an R5xx (assuming good branching), able to communicate as effectively as a Cell processor + a GPU, be any better or worse? Is there something specific about Cell that would make it better for "vector- and multithreading-oriented tasks: coding/decoding, rendering, compression/decompression, speech and handwriting recognition, sound and some device drivers, CAD, digital content creation, simulation, games, and other desktop-oriented tasks" than using a GPU for such tasks?
 