Are all FLOPS created equal?

Tim said:
Point 1: The workload has to fit the architecture to get anywhere near optimal performance.

Right on.

Point 2: It is easier to make workloads that fit general-purpose processors, like the ones in the Xbox CPU, than specialized processors like those in Cell.

Point 3: As long as we know as little as we do about the real-life workloads on the next-gen consoles, it is hard to draw conclusions.

I'm not sure about points 2 and 3. As Jaws said, we do know the nature of the work: they will be games. Games are generally pretty FP intensive, and they exhibit much greater data-level parallelism than your typical desktop application. As a result, they should map to SIMD architectures fairly well.

While it's difficult to map a game to a main thread and a bunch of SIMD streams, I don't think it's much better in the case of the Xenon. To get the maximum performance out of the Xenon, you have to map your game to six or more logical threads.

So given that the maximum performance of both architectures is strongly contingent on how well they facilitate multi-threaded programming, I'd say Cell is better equipped, due to the following:
  • The PPE can handle all the job-queue maintenance and provide global sync. You'll need to devote one of the Xenon's cores to match it (the cores are probably identical to the PPE anyway).
  • Each SPE has its own local memory. The programmer has total control over what goes in there. A well-organised program that overlaps loads and stores with execution (as facilitated by the Cell architecture) can expect to come close to saturating the SPE's peak performance.

    For Xenon, on the other hand, even the most optimised program can't expect its data to be in the cache. As such, it can never saturate its peak performance. Given that six threads are sharing the same 1MB cache, and two threads will be using the cache for OS-related tasks, there's bound to be a lot of thrashing in the cache.
  • Each SPE has its own DMA engine and MMU. In other words, each SPE has its own hardware to get the data it needs onto the chip. The threads of the Xenon, on the other hand, are at the mercy of the L2 cache.
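
To make the "overlap loads and stores with execution" point concrete, here's a rough plain-C sketch of classic double buffering. It is not real SPE code -- dma_get is a made-up stand-in for an asynchronous mfc_get-style transfer (here just a synchronous memcpy), so only the control flow is meaningful:

```c
#include <string.h>

#define CHUNK 64  /* elements per DMA-sized chunk */

/* Hypothetical stand-in for an asynchronous SPE DMA transfer; a real SPE
 * program would issue the transfer and later wait on its completion tag.
 * Here we copy synchronously so the double-buffer control flow is visible. */
static void dma_get(float *local, const float *remote, int n) {
    memcpy(local, remote, n * sizeof(float));
}

/* Process main-memory data through two alternating local-store buffers:
 * while chunk i is being computed on, chunk i+1 is (conceptually) already
 * in flight, so transfer time overlaps with execution. */
float sum_double_buffered(const float *mem, int nchunks) {
    float buf[2][CHUNK];
    float total = 0.0f;
    int cur = 0;

    dma_get(buf[cur], mem, CHUNK);             /* prime the first buffer */
    for (int i = 0; i < nchunks; i++) {
        int nxt = cur ^ 1;
        if (i + 1 < nchunks)                   /* kick off the next transfer */
            dma_get(buf[nxt], mem + (i + 1) * CHUNK, CHUNK);
        for (int j = 0; j < CHUNK; j++)        /* compute on current chunk */
            total += buf[cur][j];
        cur = nxt;                             /* swap buffers */
    }
    return total;
}
```

On real hardware the transfer for chunk i+1 would genuinely be in flight while chunk i is being summed, which is exactly how a well-organised SPE program hides its memory latency.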

I like the Cell architecture. I like it not because it's from Sony or because it's going to be in the PS3; I just think it's got the right organisation to suit multi-threaded SIMD applications.

The Xenon is a huge improvement over the XBOX 1 CPU. But it still relies on too many of the old tricks from the days of single-threaded programming. I just don't think the CPU is going to come close to its peak performance when six threads are all hoping the data they need will be in that 1MB L2 cache.

The Cell architecture has really taken some thought. The designers realised you can't execute unless the data is there, and guessing (as with cache and speculation) just isn't good enough. That's why each SPE has its own memory and its own DMA and MMU hardware. It's designed to get the data onto the chip so it has a chance of coming close to its peak performance.
 
One of the main problems when talking about the PlayStation 3 is that the final chip configuration is not known. So in reality we may only get a 1x4 or a 1x8 or a 2x16.

Most people are floating around the 1x8 chip, but in reality only devs know.

But consider how fast this forum was willing to take at face value a rumor that, due to adding 256 more megs of RAM, MS had to go with a dual-core CPU. The same can be said of the PS3: if costs are high from a next-gen optical drive, 512 megs of RAM and an NVIDIA GPU, they may use a smaller Cell chip.

So at this point it's really not wise to compare anything.
 
Lazy8s said:
Sony's FLOPS use less accurate rounding than the standard for single-precision.

True, but they're as accurate as they need to be for their intended function.
 

That is an eye-opening post, regardless of whether you're 100% right or not.

don't know what else to say.
 
One other thing, and I've said this before but I'll say it again -- PS3 is not going to measure up to a lot of the claims and speculation from years ago, even though it is still going to be far beyond anything we've ever played or used, as far as consumer devices and games go. Whatever the real PS3 does not do that we had once almost *expected*, the PS4 will -- which will also use Cell, on a much grander scale. It will be to PS3 what PS2 was to PS1.
 
JF_Aidan_Pryde said:
[*]Each SPE has its own local memory. The programmer has total control over what goes in there. A well-organised program that overlaps loads and stores with execution (as facilitated by the Cell architecture) can expect to come close to saturating the SPE's peak performance.

If the data can be preloaded from memory in time, that is. A very big "if".

JF_Aidan_Pryde said:
For Xenon, on the other hand, even the most optimised program can't expect its data to be in the cache. As such, it can never saturate its peak performance. Given that six threads are sharing the same 1MB cache, and two threads will be using the cache for OS-related tasks, there's bound to be a lot of thrashing in the cache.
Well, if you could preload the data into local store on the SPE, you could prefetch the data into the cache on the Xenon.

Consider cases where you (the programmer) can't instruct the CPU to preload data in time for its use. For example (my favourite example): going down a space-decomposition tree, at each node you don't know which node is next until you examine the current node.

On the SPE you will stall every time (and have to jump through hoops to do multiple things at once). On the Xenon you might get lucky and hit the cache (after a few collision tests, all the top nodes of the tree will be in cache, and only the leaves might miss).

And the fact that the level 2 cache is shared is a good thing. On SPEs you'll end up with a copy of the same data in each SPE.
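
The point about stalling is really about dependent loads: each step's address comes out of the previous load, so neither hardware nor the programmer can fetch ahead without speculating. A minimal plain-C sketch (the node layout here is hypothetical, not anyone's actual engine code):

```c
#include <stddef.h>

/* A node in a space-decomposition tree: which child to visit next depends
 * on comparing the query point against this node's split plane, so the
 * address of the next load is unknown until the current load completes. */
struct node {
    float split;            /* split-plane position (payload at leaves) */
    struct node *child[2];  /* below / above the plane; NULL at leaves */
};

/* Walk to the leaf containing 'point'. Each iteration is a dependent load:
 * you cannot fetch n->child[...] ahead of time without first having 'n'
 * itself. This is the per-node stall described above. */
const struct node *find_leaf(const struct node *n, float point) {
    while (n->child[0] != NULL) {         /* internal nodes have 2 children */
        n = n->child[point >= n->split];  /* data-dependent next address */
    }
    return n;
}
```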

JF_Aidan_Pryde said:
[*]Each SPE has its own DMA engine and MMU. In other words, each SPE has its own hardware to get the data it needs onto the chip. The threads of the Xenon, on the other hand, are at the mercy of the L2 cache.

Yeah, using loads and stores instead of setting up DMA is so much harder, and relying on hardware to keep memory coherent is a real disadvantage.

Cheers
Gubbi
 
Gubbi said:
Consider cases where you (the programmer) can't instruct the CPU to preload data in time for its use. For example (my favourite example): going down a space-decomposition tree, at each node you don't know which node is next until you examine the current node.

On the SPE you will stall every time
Won't other solutions be possible, though? Say, load all the next possible nodes at the same time, then when you descend a node, discard the rest and load in the next set of nodes? e.g.
Code:
            1            
           / \
         2     3
       / |    /  \
      4  5   6    7
      |  |   |   / \
      8  9   A  B   C
At node 1, load nodes 2 and 3. Follow to node 3, load nodes 6 and 7, follow node 6, load node A. The next path would always be available without waiting. You might want to load in a certain amount of lookahead, like several levels down, to be sure you don't miss a fetch.
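
That lookahead scheme can be sketched roughly like this -- a plain-C toy, not SPE code; fetch_pair is a made-up stand-in for one bulk DMA that moves both candidate children into local store before the comparison decides which one is needed (and in this toy, split doubles as the payload at the leaves):

```c
#include <stddef.h>

struct node {
    float split;            /* split-plane position (payload at leaves) */
    struct node *child[2];  /* below / above the plane; NULL at leaves */
};

/* Count of simulated bulk fetches, so the trade-off is visible: we move
 * two children per step instead of one, buying latency with bandwidth. */
static int fetches;

/* Stand-in for one bulk transfer that brings BOTH possible next nodes
 * into local store before we know which one we need. */
static void fetch_pair(struct node *dst, struct node **src) {
    dst[0] = *src[0];
    dst[1] = *src[1];
    fetches++;
}

float descend_with_lookahead(const struct node *root, float point) {
    struct node cur = *root;
    struct node next[2];                 /* local copies of both children */
    while (cur.child[0] != NULL) {
        fetch_pair(next, cur.child);     /* fetch both before deciding */
        cur = next[point >= cur.split];  /* then choose without waiting */
    }
    return cur.split;                    /* leaf payload */
}
```

By the time the comparison is done, whichever child it picks is already local, so the walk never waits on a dependent load -- at the cost of fetching twice as many nodes per step.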

The key to efficient Cell programming is finding different approaches instead of relying on existing methods. When they're successfully implemented, the chance of keeping everything running near its maximum throughput is greater than on a traditional cache model shared between six threads (even if my example above is poor at showing this!).

That's how I understand it.
 
Gubbi said:
On the SPE you will stall every time (and have to jump through hoops to do multiple things at once). On the Xenon you might get lucky and hit the cache (after a few collision tests, all the top nodes of the tree will be in cache, and only the leaves might miss).
If this algorithm won't change the top nodes after they've been discovered, then surely you'd store the "discovered nodes" in the SPE local store once you've found them.

When the tree hasn't been constructed, both architectures would be forced to go to memory. But once it's discovered, both would retain the essential part (the top of the tree) in the cache/LS. So other than the fact that the SPEs require explicit saving of the data, I don't see how they will be at a disadvantage. (Unless I misunderstood your example.)

And the fact that the level 2 cache is shared is a good thing. On SPEs you'll end up with a copy of the same data in each SPE.

I'm not sure about your algorithm example but I don't think you'd want to run that on multiple SPEs. Why would you want to evaluate the same tree multiple times? Even if you did, Cell facilitates L2 cache sharing -- the PPE's 512kB cache is accessible to all SPEs.
 
Gubbi said:
And the fact that the level 2 cache is shared is a good thing. On SPEs you'll end up with a copy of the same data in each SPE.

Plus, the shared L2 cache gets rid of all the cache-synchronization traffic between separate L2 caches. Think of the bandwidth savings.
 
aaaaa00 said:
Plus the shared L2 cache gets rid of all the cache synchronization traffic between seperate L2 caches. Think of the bandwidth savings.
Think about another stupid thread trashing all your beloved cached data.
There are pros and cons...
 
Shifty Geezer said:
Gubbi said:
Consider cases where you (the programmer) can't instruct the CPU to preload data in time for its use. For example (my favourite example): going down a space-decomposition tree, at each node you don't know which node is next until you examine the current node.

On the SPE you will stall every time
Won't other solutions be possible, though? Say, load all the next possible nodes at the same time, then when you descend a node, discard the rest and load in the next set of nodes? e.g.
Code:
            1            
           / \
         2     3
       / |    /  \
      4  5   6    7
      |  |   |   / \
      8  9   A  B   C
At node 1, load nodes 2 and 3. Follow to node 3, load nodes 6 and 7, follow node 6, load node A. The next path would always be available without waiting. You might want to load in a certain amount of lookahead, like several levels down, to be sure you don't miss a fetch.
What you're doing is collapsing multiple nodes into one, and trading off bandwidth for latency gains is certainly a possibility. Normal space-decomposition structures, octrees (or quadtrees), expand real fast though, since each node has 8 (or 4) children.
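
To put numbers on that growth: with b children per node and k levels of lookahead, you prefetch b + b^2 + ... + b^k nodes per step. A small sketch (the 32-byte node size below is an arbitrary assumption):

```c
/* Bytes that must be fetched to look ahead 'levels' levels in a tree of
 * branching factor b: b + b^2 + ... + b^levels nodes, times node_size.
 * For an octree (b = 8) this explodes quickly, which is the bandwidth
 * cost of the lookahead scheme. */
static long lookahead_bytes(int b, int levels, int node_size) {
    long nodes = 0, level = 1;
    for (int i = 0; i < levels; i++) {
        level *= b;        /* nodes at this depth */
        nodes += level;    /* running total across all lookahead levels */
    }
    return nodes * node_size;
}
```

For a binary tree one level deep that's a pair of nodes; for an octree three levels deep it's 584 nodes, already around 18KB per step at 32 bytes a node.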

Cheers
Gubbi
 
JF_Aidan_Pryde said:
Gubbi said:
On the SPE you will stall every time (and have to jump through hoops to do multiple things at once). On the Xenon you might get lucky and hit the cache (after a few collision tests, all the top nodes of the tree will be in cache, and only the leaves might miss).
If this algorithm won't change the top nodes after they've been discovered, then surely you'd store the "discovered nodes" in the SPE local store once you've found them.

Right, but even if you do implement a software cache, it'll have an order of magnitude worse latency than a hardware level-1 cache access (a mask, two loads, a compare and a branch), and that's for a one-way associative cache. And you end up saving a copy of the same data in each SPE, like I mentioned.
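
For reference, the mask/load/compare/branch sequence described above looks something like this for a direct-mapped ("one-way associative") software cache -- a plain-C sketch, not real SPE code, where the miss-path "DMA" is just a memcpy from a stand-in backing array:

```c
#include <stdint.h>
#include <string.h>

#define SC_LINES 64        /* number of lines in the software cache */
#define SC_LINE_BYTES 128  /* bytes per line */

static uint32_t sc_tag[SC_LINES];                 /* line address per set */
static uint8_t  sc_data[SC_LINES][SC_LINE_BYTES]; /* cached line contents */
static const uint8_t *backing;  /* stands in for main memory behind a DMA */
static int sc_misses;

void sc_init(const uint8_t *mem) {
    backing = mem;
    sc_misses = 0;
    for (int i = 0; i < SC_LINES; i++)
        sc_tag[i] = UINT32_MAX;                   /* mark every line invalid */
}

/* The hit path really is just: mask off the line address, index the set,
 * load the stored tag, compare, branch. Several dependent instructions
 * before the data pointer is available, even on a hit. */
const uint8_t *sc_lookup(uint32_t addr) {
    uint32_t line = addr & ~(uint32_t)(SC_LINE_BYTES - 1);  /* mask */
    uint32_t set  = (line / SC_LINE_BYTES) % SC_LINES;      /* index */
    if (sc_tag[set] != line) {                              /* compare+branch */
        memcpy(sc_data[set], backing + line, SC_LINE_BYTES); /* miss: "DMA" */
        sc_tag[set] = line;
        sc_misses++;
    }
    return &sc_data[set][addr & (SC_LINE_BYTES - 1)];
}
```

A hardware L1 hit, by contrast, is just a load that completes; the tag check happens in parallel in silicon rather than in instructions.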

JF_Aidan_Pryde said:
When the tree hasn't been constructed, both architectures would be forced to go to memory. But once it's discovered, both would retain the essential part (the top of the tree) in the cache/LS. So other than the fact that the SPEs require explicit saving of the data, I don't see how they will be at a disadvantage. (Unless I misunderstood your example.)
I think you did, see below.

JF_Aidan_Pryde said:
And the fact that the level 2 cache is shared is a good thing. On SPEs you'll end up with a copy of the same data in each SPE.

I'm not sure about your algorithm example but I don't think you'd want to run that on multiple SPEs. Why would you want to evaluate the same tree multiple times? Even if you did, Cell facilitates L2 cache sharing -- the PPE's 512kB cache is accessible to all SPEs.

A big part of the physics calculations is actual collision detection. You need to know when an object hits another object or the world geometry in order to have them interact. To speed up collision detection you use a space-decomposition structure, which you search through for each object (so you search through this tree a LOT).

Cheers
Gubbi
 
Gubbi said:
What you're doing is collapsing multiple nodes into one, and trading off bandwidth for latency gains is certainly a possibility. Normal space-decomposition structures, octrees (or quadtrees), expand real fast though, since each node has 8 (or 4) children.
You would like this:
Memory Optimization, Christer Ericson (Sony Computer Entertainment)
Look at page 31: A Compact Static KD-Tree

P.S. The whole presentation is worth reading.
 
aaaaa00 said:
That's why you can lock.
The moment you have to start prefetching way in advance and cache-locking, all the arguments about DMA setups being harder to use go out the window, IMO.
 