The SPE as a general purpose processor

aaaaa00 said:
Not the same thing.
It's not the same thing but not that different.

aaaaa00 said:
Even though modern general purpose CPUs require alignment for efficient access, this alignment requirement is the same size as the machine word.
I don't get your point. You'll have a prefetch queue anyway; the faster you fill it the better, IMO.
aaaaa00 said:
This is not true of the SPE, the memory alignment requirement is 4x that of the machine word (there are no scalar 16-byte operations), and hence is evidence in favor of the SPE's lineage being a vector oriented co-processor, and not a general purpose CPU.

You think so? Here's the background of the 16 byte load/store straight from the horse himself (Peter Hofstee).
If you wonder why only 16B loads and stores? ... One reason is latency ( unaligned loads or smaller quantities require extra muxing stages ). For stores we have to compute an ECC (error correction code) to be stored with the data. These codes are typically calculated over larger fields (16 bytes in our case) to limit the overhead in the SRAM arrays. Writing a quantity less than 16B would therefore require a read, modify (combine new and old data for new ECC), write operation on the array. In true RISC fashion we felt that it would be better to do a load and a store and allow compiler optimization than hide what is really going on.
 
arjan de lumens said:
It also cannot queue up hints, so it cannot actually sustain more than 1 taken branch every 11 cycles (for comparison, the Pentium 4 can sustain a rate of 1 branch per clock cycle).
There is one exception, in loops the branch hint is preserved.
 
Crossbar said:
You think so? Here's the background of the 16 byte load/store straight from the horse himself (Peter Hofstee).

A bollocks argument. Any modern CPU has an ECC-protected L1 D$; how do they cope with byte accesses? They load up the data whose ECC needs to be recomputed, store the byte, and store the updated ECC data, all in a non-blocking fashion. No big deal.

The SPE saves the muxes, but not because of being "true to the RISC philosophy"; rather, they didn't care for byte/half/scalar word stores and would rather spend the transistors on something else, which is a perfectly valid design choice.

The price the SPEs pay is a big increase in store latency. You need to load the quad-word from LS, mask in your scalar, then store it. If you can't pipeline this you're absolutely hosed.
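Gubbi's three-step sequence can be sketched in plain C. This is only a hedged illustration of what a compiler must emit for a scalar store into the 16-byte-granular local store; the `qword` type and `store_u32` helper are invented for the example (real SPU code would use shuffle/insert intrinsics):

```c
#include <stdint.h>
#include <string.h>

/* Local store modeled as an array of 16-byte quad-words, the only
 * store granularity the SPE supports. */
typedef struct { uint8_t b[16]; } qword;

/* Hypothetical sketch of a scalar 32-bit store (addr assumed
 * 4-byte aligned). The hardware has no scalar store, so we must:
 *   1. load the enclosing quad-word,
 *   2. mask/insert the scalar into the right byte lanes,
 *   3. store the whole quad-word back. */
void store_u32(qword *ls, unsigned addr, uint32_t value)
{
    qword line = ls[addr / 16];             /* 1: 16-byte load        */
    memcpy(&line.b[addr % 16], &value, 4);  /* 2: insert the scalar   */
    ls[addr / 16] = line;                   /* 3: 16-byte store back  */
}
```

If this load-insert-store sequence cannot be pipelined with surrounding work, every scalar store eats the full load latency, which is exactly the cost being pointed at.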

Cheers
 
In true RISC fashion we felt that it would be better to do a load and a store and allow compiler optimization than hide what is really going on.
Sounds a bit odd. For loads, the additional MUXes needed to support smaller data types would add something like 1/5 of a cycle (the latency is already 6 cycles, which is very high for a load in a modern CPU). For writes, it is quite easy to maintain ECC data even without requiring full-width writes: e.g., when writing only some byte lanes, use the remaining byte lanes to read data, then reassemble the ECC code from the combined data.
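The byte-lane scheme can be sketched with a toy check code. XOR parity stands in for real ECC here (actual SRAM ECC would be a wider SEC-DED code over the 16-byte field); `parity16` and `write_byte` are invented names, a minimal sketch rather than real hardware behavior:

```c
#include <stdint.h>

/* Toy stand-in for ECC: a single XOR-parity byte over a 16-byte
 * field. Real ECC (e.g. SEC-DED Hamming) is wider, but the update
 * pattern on a partial write is the same. */
static uint8_t parity16(const uint8_t line[16])
{
    uint8_t acc = 0;
    for (int i = 0; i < 16; i++)
        acc ^= line[i];
    return acc;
}

/* Partial write of one byte lane: write the new byte, read the
 * remaining lanes, and reassemble the check code from the combined
 * data -- the merge described above, done in the array pipeline
 * rather than as a blocking read-modify-write. */
void write_byte(uint8_t line[16], uint8_t *ecc, unsigned lane, uint8_t v)
{
    line[lane] = v;          /* new data in one byte lane          */
    *ecc = parity16(line);   /* recompute code over all 16 bytes   */
}
```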

deathkiller said:
There is one exception, in loops the branch hint is preserved.
Which works well if you have a standalone loop with a small body, a large loop count, and no branches or sub-loops within the loop body.
 
Gubbi said:
The price the SPEs pay is a big increase in store latency. You need to load the quad-word from LS, mask in your scalar, then store it. If you can't pipeline this you're absolutely hosed.
Well, according to the guys at IBM you should work with 16-byte scalars if you are concerned about performance :p

arjan de lumens said:
Which works well if you have a standalone loop with a small body, a large loop count, and no branches or sub-loops within the loop body.
On an in-order CPU, optimizing a loop with sub-loops/branches will never work out to anything remotely optimal; no amount of branch prediction will help that.
 
Gubbi said:
A bollocks argument.
If you think so. whatever.

I think it's a pretty clever design given the transistor budget of the SPEs. I don't have a problem letting a compiler take care of the masking part, in the same way as I don't have a problem letting a compiler take care of word aligned data.
 
Fafalada said:
Well, according to the guys at IBM you should work with 16-byte scalars if you are concerned about performance :p
2^128. That's actually quite a big number. Useful if you're writing a program to count all the atoms in the universe.
 
Crossbar said:
I think it's a pretty clever design given the transistor budget of the SPEs. I don't have a problem letting a compiler take care of the masking part, in the same way as I don't have a problem letting a compiler take care of word aligned data.

I wasn't questioning the design choice, they clearly must have felt that the average SPE workload didn't require masked stores.

I was calling the "In true RISC fashion" line bollocks. The only RISC processor that didn't have masked stores was the Alpha 21064; masked stores were added in the 21164 for various reasons (one was addressing hardware with byte stores, another was C performance).

Cheers
 
We work on the SPE with two kinds of threads: a control thread and program threads.
The control thread modifies the program code when there is a cache miss.
With this we are faster than any x86 on one SPE,
and we have 8 SPEs and at most 32 program threads...
 
Gubbi said:
I wasn't questioning the design choice, they clearly must have felt that the average SPE workload didn't require masked stores.

I was calling the "In true RISC fashion" line bollocks.
OK, I agree.

I guess the 128-entry register file will help to reduce the number of masked stores.

Here's some more fuel for the general purpose debate, cut from the SPU ISA document.
Rationale for SPU Architecture
Key workloads for the SPU are:
  • The graphics pipeline, which includes surface subdivision and rendering
  • Stream processing, which includes encoding, decoding, encryption, and decryption
  • Modeling, which includes game physics
The implementations of the SPU ISA achieve better performance-to-cost ratios than general-purpose processors because the SPU ISA implementations require approximately half the power and approximately half the chip area for equivalent performance. ...
 
Crossbar said:
OK, I agree.

I guess the 128-entry register file will help to reduce the number of masked stores.

Right, it's spilling them to LS that is costly.


<snip>
The implementations of the SPU ISA achieve better performance-to-cost ratios than general-purpose processors because the SPU ISA implementations require approximately half the power and approximately half the chip area for equivalent performance. ...

Exactly, they have a big advantage over general purpose CPUs doing these very specific tasks. That makes them special purpose, not general purpose.

Cheers
 
Gubbi said:
Exactly, they have a big advantage over general purpose CPUs doing these very specific tasks. That makes them special purpose, not general purpose.

So I guess that means the 486 was not a general purpose processor, since it's nowhere near the transistor count of today's processors. ;) Not sure how you equate die size and power consumption with being special purpose or general purpose?

16B-aligned data is good, as now you lazy programmers will have to think about aligning different variables into the same register. :D There are instructions for accessing a particular byte in that 128-bit register. Just think: put all your global variables into a number of registers, load them all in with a few instructions, and have them sit there, for extremely fast access for the entire extent of your program.
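The pack-your-globals idea can be illustrated in plain C with a 16-byte union standing in for one 128-bit register; the layout, slot names, and accessors are all invented for the example (on the SPU the extraction would be shuffle/extract instructions, not memory accesses):

```c
#include <stdint.h>

/* One 16-byte "register image": four 32-bit globals packed into what
 * would be a single 128-bit SPE register. */
typedef union {
    uint8_t  byte[16];
    uint32_t word[4];
} reg128;

/* Hypothetical slot assignment for a handful of globals. */
enum { G_SCORE = 0, G_LIVES = 1, G_LEVEL = 2, G_FLAGS = 3 };

static inline uint32_t get_word(const reg128 *r, int slot)
{
    return r->word[slot];    /* on the SPU: an extract/shuffle      */
}

static inline void set_word(reg128 *r, int slot, uint32_t v)
{
    r->word[slot] = v;       /* on the SPU: an insert into the reg  */
}
```

With 128 registers of 16 bytes each there is room for hundreds of such packed scalars before anything has to spill to local store.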

Just as the 256 KB LS is excellent for forcing you lazy programmers to pack your programs into tiny sizes to make more room for data buffers.

Thanks Faf for the correction on the EE CPU. Not sure why I thought it was the R3000.

Version, dual pipelines do not equal dual threads. You would need a second set of registers for that, and that is missing from the SPE. Each SPE is single-threaded, as per the original specifications. Maybe you were talking about software threads?
 
Edge said:
So I guess that means the 486 was not a general purpose processor, since it's nowhere near the transistor count of today's processors. ;) Not sure how you equate die size and power consumption with being special purpose or general purpose?

16B-aligned data is good, as now you lazy programmers will have to think about aligning different variables into the same register. :D There are instructions for accessing a particular byte in that 128-bit register.

Just as the 256 KB LS is excellent for forcing you lazy programmers to pack your programs into tiny sizes to make more room for data buffers.

Thanks Faf for the correction on the EE CPU. Not sure why I thought it was the R3000.
No wonder you're in the red...

Version, dual pipelines do not equal dual threads. You would need a second set of registers for that, and that is missing from the SPE. Each SPE is single-threaded, as per the original specifications. Maybe you were talking about software threads?

Version needs to be ignored.
 
Edge said:
So I guess that means the 486 was not a general purpose processor, since it's nowhere near the transistor count of today's processors. ;) Not sure how you equate die size and power consumption with being special purpose or general purpose?
I think Gubbi was referring to the good performance regarding
Key workloads for the SPU are:
  • The graphics pipeline, which includes surface subdivision and rendering
  • Stream processing, which includes encoding, decoding, encryption, and decryption
  • Modeling, which includes game physics
and not only to the die size and power consumption.
 
Edge said:
Version, dual pipelines do not equal dual threads. You would need a second set of registers for that, and that is missing from the SPE. Each SPE is single-threaded, as per the original specifications. Maybe you were talking about software threads?


The SPE has 128 registers of 4x32 bits; divide them up and you get 4 threads, where old x86 got by with 8 registers :D
General code usually stalls on a cache miss or branch miss.
We are able to do a software cache and multiple threads to minimize cache misses and latency,
and to write self-modifying code for branch misses; this worked fine on the old MC68000.
More tricks are possible with this...
 
from IBM developer forum :
Barry_Minor (cell designer):
"
You're correct that the deepest pipeline latency (short of DP float) is 7 cycles; most SP float ops are 6 cycles. The integer and float ops share the same pipe while the shuffle and load/store are on the other pipe. These are very short latencies for a 3.2 GHz processor. Most processors at this clock rate have 32 registers and greater than 10 cycles of latency for such ops. When you combine the short latency with 128 registers and a good compiler you get very low CPIs. Many of the 32-register processors show good CPIs, but the instruction streams are filled with register spill loads and stores that perform no useful work and contribute to the good CPI. I can't comment on the GPUs as there is little public info on their microarchitectures.

The multi-threading example you cited is another way to cover up DMA latency (the most common being multi-buffering). This can be implemented in software on the SPEs by segmenting the large register file into smaller ranges, compiling different threads for each register range, and switching (branching) to a different thread after each thread issues a DMA read. The threads stay resident in local store (no context switching), thread switching is lightweight (a 1-cycle branch), and with some clever programming you can even defer the switch based on the DMA tag being ready (BISLED). If you're memory bound and can't predict your memory references ahead of time this is a good solution, as you could write your code for size instead of speed and pack 4-8 threads in each SPE local store.
"
 
Gubbi said:
I was calling the "In true RISC fashion" line bollocks. The only RISC processor that didn't have masked stores was the Alpha 21064
I would think the man meant 'in accordance with RISC design philosophy guidelines' (i.e. optimize for the common case, not the exceptions), and not 'like all these other RISC CPUs', with that line.

version said:
We are able to do a software cache and multiple threads to minimize cache misses and latency
You're not a Cell/PS3 developer. You're just sitting there making shit up! :LOL: With the amount of bogus posts you've made in the past, nobody takes you seriously (unless they simply don't know what a mythomaniac Sony fanboy you are)...
 
version said:
The SPE has 128 registers of 4x32 bits; divide them up and you get 4 threads, where old x86 got by with 8 registers :D
General code usually stalls on a cache miss or branch miss.
We are able to do a software cache and multiple threads to minimize cache misses and latency,
and to write self-modifying code for branch misses; this worked fine on the old MC68000.
More tricks are possible with this...

Thanks, Version. I suggested this a while back, and I'm glad to hear it can easily be done. That huge register set allows for a lot of 'general purpose' :) flexibility.

Proud to be in the red! :)
 
Edge said:
IBM called them Synergistic Processing Elements to reflect their nature. The architecture and the broad instruction set back this up. The SPE's instruction set is not lacking in integer math operations, logical operations, and flow control operations: all the instructions necessary for general purpose computing.

A 32-bit ARM microcontroller shares the same broad instruction set as an Intel XScale processor.

A lot of telecom gear shares the MIPS instruction set with the now end-of-lifed performance architecture.

IBM's been pushing embedded chips based on the PPC instruction set.

ISA does not determine whether an implementation is special purpose. There's a difference between supporting instructions and running them well.

The SPEs have a universal level of support, but they have a specialized purpose.
There's nothing wrong with sacrificing some general capability to gain in efficiency and performance.

They are powerful at floating point, but that does not mean they are weak at general purpose. Just because you have one strength does not make everything else a weakness. The large register set is as useful for holding lots of integer values as it is for holding floating point values, and the same goes for the local store. C compilers allow general C code to run on the SPEs. There are no limits to the type of C code you can use, so that's proof the SPEs are more than capable of handling general purpose code.

The registers, data path, and execution model are set up to get peak performance from a vectorized workload. That is not a general case.

Just because you can run almost any code on an SPE does not mean you should.

Synergistic Processing Elements - specialized and yet also general purpose. IBM/Toshiba/Sony themselves make this claim. They are the designers.

If that is what their marketers claim, then fine. I seriously doubt anybody is going to hobble the SPEs with code outside of the target workloads unless desperate.
 