Most games are 80% general purpose code and 20% FP...

Shaderguy :

1 : How could MS look at Cell and base their decisions on it, when Cell's design was unknown during XeCPU's development?

2 : Can we really say XeCPU has 3x the GP performance? Faf points out the SPE's aren't any slouches in this regard (though we don't know how cache/LS management impacts things), and MS's statement that they have 3x the performance is based on 1 PPE core vs. 3 (though they aren't the same cores in all respects) and totally discounts the worth of the SPEs in GP. I think what we're hearing regarding GP performance is rather nonsensical and unfounded FUD and shouldn't be taken as valid, unless someone can present some hard facts on the matter.
 
shaderguy said:
It seems that Microsoft analyzed the Cell design, and came up with an alternative design with half the floating point performance and 3 times the integer performance. If you look at it that way, then you would say that Microsoft chose to optimize for potential integer performance at the expense of potential floating point performance.

The alternative view is that Microsoft slapped together 3 general purpose cores and then basically proclaimed it as such - Occam's Razor
 
Not at all. An XeCPU core could hardly be called a GP core, given the slimming down of GP components and the beefing up of FP components. They're custom cores.
 
Shifty Geezer said:
Shaderguy :

1 : How could MS look at Cell and base their decisions on it, when Cell's design was unknown during XeCPU's development?

2 : Can we really say XeCPU has 3x the GP performance? Faf points out the SPE's aren't any slouches in this regard (though we don't know how cache/LS management impacts things), and MS's statement that they have 3x the performance is based on 1 PPE core vs. 3 (though they aren't the same cores in all respects) and totally discounts the worth of the SPEs in GP. I think what we're hearing regarding GP performance is rather nonsensical and unfounded FUD and shouldn't be taken as valid, unless someone can present some hard facts on the matter.

Don't buy into marketing. MS tries to make their system look as good as possible, as does Sony.

The SPEs, of course, have non-zero "GP" performance, so it would be wrong to discount them. On top of that there are 7 of them to make up for the two extra regular cores in Xenon.

The SPEs are handicapped though when it comes to program flow control (which has been discussed on many occasions here).

Cheers
Gubbi
 
Shifty Geezer said:
Gubbi said:
Don't buy into marketing. MS tries to make their system look as good as possible, as does Sony.
I don't think that I do :D Unless Faf's comments constitute marketing :p

Faf's comments don't constitute marketing.

Microsoft claiming 3x GP performance because they have 3 fully-fledged cores does.

Sony claiming 3x overall performance because they have 9 ( 8 ) cores compared to MS's 3 does as well.

By the way: using the R5900 as a basis of comparison is likely to skew speedup figures upwards; it is an emphatically low-performance core, and already was at the PS2's release in 1999.

Cheers
Gubbi
 
Gubbi said:
The SPEs are handicapped though when it comes to program flow control (which has been discussed on many occasions here).

Cheers
Gubbi

There's a lack of hardware branch prediction, of course - and thus you may consider it to be handicapped from that perspective, fair enough - but it is worth noting that the branch hinting in the SPEs isn't entirely useless. In fact it may be better in some instances, at least judging from discussion here.

Just a small side note..
 
As none of them need to run legacy binaries, I think the design of the program (data structures and handling, and size of threads) and the efficiency of the compiler will probably have the most impact on performance. That is as opposed to x86 cores, which just have to run any old (and poorly written or compiled) code.

As long as the programmers and compilers try to minimize random memory access as much as possible and handle branches in a consistent way (and use the hints), performance can be very good.
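
For illustration, compilers on conventional cores already accept this kind of hint; something like GCC's __builtin_expect (a rough sketch - BUF_MAX and the two paths are made up for the example) tells the compiler which way a branch will almost always go:

/* Tell the compiler the error path is almost never taken, so the
   fast path is laid out inline as the fall-through case. */
if (__builtin_expect(len > BUF_MAX, 0))
    handle_overflow();   /* cold path, moved out of the way */
else
    process(buf, len);   /* hot path, falls through */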
 
Titanio said:
Gubbi said:
The SPEs are handicapped though when it comes to program flow control (which has been discussed on many occasions here).

There's a lack of hardware branch prediction, of course - and thus you may consider it to be handicapped from that perspective, fair enough - but it is worth noting that the branch hinting in the SPEs isn't entirely useless. In fact it may be better in some instances, at least judging from discussion here.

In a normal CPU (with a proper branch apparatus) you'd have a branch target buffer (BTB) which provides the instructions at a branch target with very low latency, so no, the prepare instruction, which loads data into the SMBTB (Software Managed BTB), isn't any better. On the other hand, if branches can't be prepared, a taken branch induces a big bubble; the mispredict latency is worse.

To quote this:
Efficient software manages branches in three ways: it replaces branches with bit-wise select instructions; it arranges for the common case to be inline; it inserts branch hint instructions to identify branches and load the probable targets into the SMBTB.

So the key to performance on the SPEs is to avoid branches. This means converting conditionals into predicates wherever possible (which, incidentally, is a hallmark of stream processing).
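
To make that concrete, here's a minimal sketch in plain C (scalar, but a vector select instruction does the same thing per element): a comparison produces a mask, and a bit-wise select picks one of two values without any branch:

/* Branchy version: the hardware has to guess which way this goes. */
int max_branchy(int a, int b)
{
    return (a > b) ? a : b;
}

/* Branch-free version: the comparison result (0 or 1) is turned into
   an all-zeros or all-ones mask, then used to select a or b. */
int max_select(int a, int b)
{
    int mask = -(a > b);             /* 0x00000000 or 0xFFFFFFFF */
    return (a & mask) | (b & ~mask);
}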

Another handicap is the local store vs a demand loaded cache. The local store is under explicit control, which is great if your data has spatial coherence and known reuse, so that you can load it all in one go, and purge it when you're done.

But for structures with varying spatial and temporal locality you're hosed. Here a cache automatically does best effort for you (as much as a limited-associativity LRU cache can do). You can make a software cache, but it will have 4-5 times the latency on a hit and worse granularity, and it gives you the choice of inlining the cache_load code and thereby bloating code size, or using a function call and increasing latency even further.

In general this means that the SPEs will work best on data that can be streamed in/out and crunched upon by kernels with most functions inlined. - And that is fairly special purpose.
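
In sketch form, that streaming pattern looks something like this (dma_get, dma_wait and crunch are hypothetical stand-ins for the real DMA primitives and the inlined kernel):

#include <stdint.h>

/* Hypothetical primitives, stand-ins for the real DMA interface. */
void dma_get(void* ls, uint64_t ea, uint32_t size, int tag);
void dma_wait(int tag);
void crunch(float* block, int n);

enum { BLOCK = 4096 };        /* elements per block, sized to fit the LS */
static float buf[2][BLOCK];   /* two buffers: compute one, fetch the other */

void process_stream(uint64_t src, int nblocks)
{
    dma_get(buf[0], src, sizeof buf[0], 0);       /* prime the first block */
    for (int i = 0; i < nblocks; i++)
    {
        int cur = i & 1, nxt = cur ^ 1;
        if (i + 1 < nblocks)                      /* prefetch block i+1 */
            dma_get(buf[nxt], src + (uint64_t)(i + 1) * sizeof buf[nxt],
                    sizeof buf[nxt], nxt);
        dma_wait(cur);                            /* block i now resident */
        crunch(buf[cur], BLOCK);                  /* kernel, functions inlined */
    }
}

This way the DMA for block i+1 overlaps with the computation on block i, which is exactly the case the SPE is built for.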

Cheers
Gubbi
 
Gubbi said:
Titanio said:
Gubbi said:
The SPEs are handicapped though when it comes to program flow control (which has been discussed on many occasions here).

There's a lack of hardware branch prediction, of course - and thus you may consider it to be handicapped from that perspective, fair enough - but it is worth noting that the branch hinting in the SPEs isn't entirely useless. In fact it may be better in some instances, at least judging from discussion here.

I might be wrong but IIRC the argument was more along the lines of the programmer in some instances being in a better position to predict what branch would be taken than a hardware predictor, and hint accordingly.

Gubbi said:
Another handicap is the local store vs a demand loaded cache. The local store is under explicit control, which is great if your data has spatial coherence and known reuse, so that you can load it all in one go, and purge it when you're done.

But for structures with varying spatial and temporal locality you're hosed.

I don't disagree with your points per se - I agree, SPEs will work best with non-branching code where data access is known before execution and can be explicitly controlled - but the characterisation of the LS as a "handicap" is a little inappropriate, IMO, given its advantages over a cache with some types of code (which you hint at above - but to make it explicit, latency to the LS is lower than to a cache, right? And it can be useful not to have data shuffled around unbeknownst to the programmer, i.e. if there's no cache locking). I know it's tempting to use the typical desktop CPU as a point of reference, and perhaps more tempting to highlight the SPE's shortcomings versus it, but there are advantages too. Is a CPU with a typical cache "handicapped" versus Cell from a data access point of view when it comes to code where access is explicit and known, and where computation can mask transfers from main memory (i.e. a high computation-to-data-access ratio)?
 
Gubbi said:
In a normal CPU (with a proper branch apparatus)

What do you see as a "normal" CPU? And what would be a "proper" branch apparatus? You mean, one that is used in desktop computers and does branch prediction in hardware? And a general purpose one, at that?

Sheesh, if anyone shows you a "non-standard" design, you're telling us what it will do worse than the "normal" and "proper" ones, instead of looking at the benefits the new architecture offers.

So, a GPU or a general stream processor is not as good at doing what a CPU does? But then again, a GPU or a general stream processor is much better at doing what it does best than a normal and proper CPU...

you'd have a branch target buffer (BTB) which provides the instructions at a branch target with very low latency, so no, the prepare instruction, which loads data into the SMBTB (Software Managed BTB), isn't any better. On the other hand, if branches can't be prepared, a taken branch induces a big bubble; the mispredict latency is worse.

To quote this:
Efficient software manages branches in three ways: it replaces branches with bit-wise select instructions; it arranges for the common case to be inline; it inserts branch hint instructions to identify branches and load the probable targets into the SMBTB.

So the key to performance on the SPEs is to avoid branches. This means converting conditionals into predicates wherever possible (which, incidentally, is a hallmark of stream processing).

Exactly. Let that stream processor do what it does best.

Another handicap is the local store vs a demand loaded cache. The local store is under explicit control, which is great if your data has spatial coherence and known reuse, so that you can load it all in one go, and purge it when you're done.

But for structures with varying spatial and temporal locality you're hosed. Here a cache automatically does best effort for you (as much as a limited-associativity LRU cache can do). You can make a software cache, but it will have 4-5 times the latency on a hit and worse granularity, and it gives you the choice of inlining the cache_load code and thereby bloating code size, or using a function call and increasing latency even further.

Yes. And that is exactly the same thing you want to avoid for multi-core, multi-threaded architectures as well. You don't want all your threads to do random memory access over all the data at random times. Their logic will break down and they will stall if you do so.

Not that a "proper" CPU handles that so well either; it just has a smaller penalty. Avoiding that as much as possible will benefit any architecture.

In general this means that the SPEs will work best on data that can be streamed in/out and crunched upon by kernels with most functions inlined. - And that is fairly special purpose.

Cheers
Gubbi

Why is that fairly special purpose? It would be the general case, if you designed your OS and API like that. Which console makers are free to do.
 
DiGuru said:
Gubbi said:
In a normal CPU (with a proper branch apparatus)

What do you see as a "normal" CPU? And what would be a "proper" branch apparatus? You mean, one that is used in desktop computers and does branch prediction in hardware? And a general purpose one, at that?

Sheesh, if anyone shows you a "non-standard" design, you're telling us what it will do worse than the "normal" and "proper" ones, instead of looking at the benefits the new architecture offers.

Pay attention!

The context of discussion was SPE general purpose performance. I simply pointed out why it is inferior to a modern CPU core.

DiGuru said:
Gubbi said:
Another handicap is the local store vs a demand loaded cache. The local store is under explicit control, which is great if your data has spatial coherence and known reuse, so that you can load it all in one go, and purge it when you're done.

But for structures with varying spatial and temporal locality you're hosed. Here a cache automatically does best effort for you (as much as a limited-associativity LRU cache can do). You can make a software cache, but it will have 4-5 times the latency on a hit and worse granularity, and it gives you the choice of inlining the cache_load code and thereby bloating code size, or using a function call and increasing latency even further.

Yes. And that is exactly the same thing you want to avoid for multi-core, multi-threaded architectures as well. You don't want all your threads to do random memory access over all the data at random times. Their logic will break down and they will stall if you do so.

But you'd have a cache to extract what spatial and temporal locality it can automatically and thereby lower the impact. - AND you'd have the possibility to switch thread context on a long latency stall.

DiGuru said:
Gubbi said:
In general this means that the SPEs will work best on data that can be streamed in/out and crunched upon by kernels with most functions inlined. - And that is fairly special purpose.

Why is that fairly special purpose? It would be the general case, if you designed your OS and API like that. Which console makers are free to do.

Because it will suck at anything else!!

Cheers
Gubbi
 
Titanio said:
... but the characterisation of the LS as a "handicap" is a little inappropriate, IMO, given its advantages over a cache with some types of code (which you hint at above - but to make it explicit, latency to the LS is lower than to a cache, right? And it can be useful not to have data shuffled around unbeknownst to the programmer, i.e. if there's no cache locking).

I think latency is more or less decided by the size of the cache array. For reference, Athlons have a 3-cycle load-to-use latency in their pseudo-dual-ported 64KB D$. The SPE has a 7-cycle latency to its 256KB array, but at a much higher (potential) clock speed. You'd have roughly the same latency for a memory array of the same proportions.

Titanio said:
I know it's tempting to use the typical desktop CPU as a point of reference, and perhaps more tempting to highlight the SPE's shortcomings versus it, but there are advantages too. Is a CPU with a typical cache "handicapped" versus Cell from a data access point of view when it comes to code where access is explicit and known, and where computation can mask transfers from main memory (i.e. a high computation-to-data-access ratio)?

Cache blocking has been a known technique for years for making algorithms cache-friendly: it basically involves loading and storing data in blocks that fit in cache, which is pretty much what the SPEs will do, so no, a normal CPU doesn't suffer directly. For a given die size, however, you could implement more, simpler stream cores than regular cores, and hence end up with lower throughput using regular CPU cores on a stream-friendly workload.
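
As a minimal sketch of cache blocking (the sizes are made up; pick B so a tile fits in the cache, or in an SPE's local store):

enum { N = 1024, B = 64 };   /* matrix size and tile edge, illustrative only */

/* Transpose in BxB tiles: each tile is touched while it is still
   cache-resident, instead of striding across the whole matrix. */
void transpose_blocked(float dst[N][N], const float src[N][N])
{
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = ii; i < ii + B; i++)
                for (int j = jj; j < jj + B; j++)
                    dst[j][i] = src[i][j];
}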

Cheers
Gubbi
 
Gubbi said:
Because it will suck at anything else!!

Cheers
Gubbi

We have GPU's, because CPU's suck at doing 3D graphics. And yes, GPU's suck at doing general purpose stuff. So, what's your point? That you want a single processor core that can do everything reasonably well? While using the transistors for multiple special purpose units would give you much better performance?

You only need enough general purpose power to manage everything else.

Are you against floating point ALU's as well, because their work can be done by the integer ALU's, and so it might make more sense to come up with ways to have the integer units execute floating point operations as fast as possible?

Would you use a general purpose car to haul freight, or to run an F1 race?

The SPE's are pretty general purpose, and you have seven (!!!) of them to use as you please. And while you are complaining that they aren't proper CPU's, they will run rings around any proper CPU if they can do what they do best.

A CPU is a jack of all trades, master of none. An SPE is a master at handling streams and vectors, and an apprentice at everything else.
 
DiGuru said:
We have GPU's, because CPU's suck at doing 3D graphics. And yes, GPU's suck at doing general purpose stuff. So, what's your point?
Gubbi said:
Pay attention!

The context of discussion was SPE general purpose performance. I simply pointed out why it is inferior to a modern CPU core.

Cheers
Gubbi
 
So, use two or three. Or all seven if you need that much general purpose power. How about that?
 
My problem isn't so much the points being made - although some should be qualified - as the point-blank assessment of one as inferior to the other, e.g. "I simply pointed out why it is inferior to a modern CPU core." Nothing is black and white. You might think that its advantages over a modern CPU core are outweighed by its disadvantages, but then we're migrating into opinion.
 
Shifty Geezer said:
Shaderguy :

1 : How could MS look at Cell and base their decisions on it, when Cell's design was unknown during XeCPU's development?

Because Sony and IBM publicly announced much of the Cell architecture and PS3 performance targets several years ago. There were also various papers and patent applications publicly available which gave many details. It's true that many of the details were not revealed, but overall performance targets, as well as the overall architecture, were well known. (For example, Sony was saying that 4 Cells == 1 Teraflop, from which one could determine 1 Cell == 256 GFlops. And that a Cell would be 1 CPU + a large number of DSPs. And that there was a high-speed Rambus-designed interconnect between Cells, and that one Cell would be used as a CPU, while the other would be used as a GPU, and so on.)

Shifty Geezer said:
2 : Can we really say XeCPU has 3x the GP performance? Faf points out the SPE's aren't any slouches in this regard (though we don't know how cache/LS management impacts things), and MS's statement that they have 3x the performance is based on 1 PPE core vs. 3 (though they aren't the same cores in all respects) and totally discounts the worth of the SPEs in GP. I think what we're hearing regarding GP performance is rather nonsensical and unfounded FUD and shouldn't be taken as valid, unless someone can present some hard facts on the matter.

Since the SPEs don't have general-purpose access to main memory, I think their performance on general purpose code has to be discounted quite a bit. Wouldn't you have to implement some form of software cache? Wouldn't that make a main memory read access look like this:

int Peek(void* address)
{
    uintptr_t a = (uintptr_t)address;   // integer view of the address
    TLBEntry e = TLB[TLBHash(a)];       // look up the software-TLB slot
    if (e.base != (a & PAGE_MASK))      // miss? (parentheses matter: != binds tighter than &)
    {
        // schedule DMA transfer here... possibly context switch while waiting.
    }
    return *(int*)(e.cache + (a & PAGE_OFFSET_MASK));
}

That seems like it would take at least 20 instructions, including several branches, even for the in-cache case - slow enough that people wouldn't really want to run general-purpose algorithms on SPEs. Instead, devs will write custom SPE code that reads and writes data in a more stream-oriented fashion.

Given that a single PPC core is not too wimpy, I predict many PS3 devs will just ignore the SPEs, and tune their game to use the single PPC core and the GPU. That's effectively what Epic is doing with their Unreal Engine. (Of course, they're very diplomatic about ignoring the SPEs, saying "we're leaving the SPEs free for middleware to use".)
 