PS4 to be based on Cell?

Latency is not a problem for SPE. LS is the foundation of CELL. There is just no point in SPEs without LS.
Latency is obviously a problem if you cannot hide it, and it's not always possible or feasible to do so, especially if one does not have an infinite amount of time to work on it.
 
Hehe. So you want to move back to cache, and try to fix the latency problem you've just introduced with hardware threading?
Latency is not a problem for SPE. LS is the foundation of CELL. There is just no point in SPEs without LS.

Latency introduced by HW threading?

Threading is supposed to allow you to hide latency, not create a latency problem where there was none.

To be able to go from thread to thread swiftly when a particular thread stalls for a certain period of time, having a huge context to save and restore does not help you... increasing the LS's size only makes this problem worse.

Is putting a lot of man-hours and transistors into pushing up the single-thread performance of the SPUs, and forcing developers' hands, the best of the possible roads to explore? Or can we move in the direction the entire industry is going in (Sun with Rock and Niagara, Intel with LRB, their desktop CPU line and IA64, and AMD along with IBM itself... look at some of their recent patents while searching for SIMD, VTE, BTE, etc.), which is to heavily thread our cores and put more and more cores on chip?

I think the LS is a bit of a tough roadblock for the evolution of a single SPE, and so if you want to significantly move it forward, keeping on increasing the LS's size is counter-productive.

If you want to say that doubling the LS and then just keeping on adding SPE's as your heart desires might be a way forward, fine, but I am not convinced it is the only way forward.

You will increase bandwidth over time to keep those SPE's fed, but you will also face higher and higher main RAM latency, and if you keep on increasing the LS's size you will sooner or later find yourself increasing the access time to the LS as well... and then you are kind of back to square one.

Latency hiding will not become useless with a 16-32 SPE system even if you double the LS; on the contrary, it might very well become more and more critical.
Something also has to be done to allow a less steep learning curve for developers.

A cache hierarchy with flexible cache locking would give you a bit of both worlds. If you want the predictability and deterministic nature of an LS, you lock down a portion of the cache (which might be a nice substantial portion of it) and work from there, but you are not killing the developers who prefer the cache model and the HW DMA/prefetching it delivers.
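To make that concrete, here is a rough sketch of the usage pattern such a design would allow. Nothing like this exists on today's Cell; cache_lock_region()/cache_unlock_region() are purely hypothetical names for "pin this many bytes of the cache and hand me a pointer to them":

```c
/* Hypothetical sketch only: the locked region acts like a small LS,
 * the unlocked ways keep behaving as a normal cache. */
#include <stddef.h>
#include <string.h>

extern void *cache_lock_region(size_t bytes);    /* hypothetical: pin N bytes of cache */
extern void  cache_unlock_region(void *region);  /* hypothetical */

void scale(const float *input, float *output, size_t n)
{
    size_t hot = 64 * 1024 / sizeof(float);      /* elements staged into the pinned region */
    if (hot > n)
        hot = n;

    /* deterministic path: the pinned 64 KB region behaves like a small LS */
    float *scratch = cache_lock_region(64 * 1024);
    memcpy(scratch, input, hot * sizeof(float)); /* explicit staging, LS-style */
    for (size_t i = 0; i < hot; ++i)
        output[i] = scratch[i] * 2.0f;

    /* conventional path: the remaining ways still act as an ordinary cache,
       so plain loads/stores (and HW prefetching) handle the rest of the data */
    for (size_t i = hot; i < n; ++i)
        output[i] = input[i] * 2.0f;

    cache_unlock_region(scratch);
}
```

The point is only that the deterministic, LS-style staging and the lazy cached path could coexist in the same kernel, and each team could pick the mix it is comfortable with.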
 
I understood wco81's question to be somewhat rhetorical. If the programming issues aren't what's causing PS3's lacklustre sales, why would Sony care to make things easier for the devs next gen? They could provide the same tools on Cell, launch a console at $250 say, and then have it sell. The programmers' lives aren't really a factor - just look at PS2's success! Only if developers refuse to develop for your platform because it's too hard does designing hardware for their ease become important.

Looking at it from the other direction, imagine a console that was super-duper easy to create games on. What would that gain the console and how would that help it sell? I guess the positive argument there is devs could focus on creating games with more polish, but the negative side is an awful lot of dross fluffing out the library. Not that that would upset the console company, which gets paid no matter whether games sell or not. In that case, creating a console that is super-duper easy to create for - think LBP easy - and a library of zillions of games all with royalties being paid to Sony, makes some sense!
Me too ;)
Maybe I could have made my point more diplomatically or more clearly.
This discussion is interesting from the forecasting point of view, even if obviously there are things to learn from the past. When I read wco81's post I thought that we should avoid speaking too specifically about the PS3. The topic I linked has seen pretty much every opinion possible and it ended up going round in circles (it even got locked :LOL:). In the end there is no clear answer; it's a blend of complicated reasons, and we lack the crucial data, large-scale marketing studies/polls, etc. to go further. I just didn't want this one to end like that too fast.
wco81, I hope you didn't take my comment in a harsh way or as if I was trying to restrict the "freedom of talk", because that was not at all my goal. I just tried to keep the discussion from leaning too much towards the past.

But the forecast is indeed really interesting :)
 
Or using the 360 example, Waternoose was an enhanced CPU that IBM was already working on; the changes and everything took less than 18 months, correct?

Actually 24 months. link
“Microsoft's aggressive timetable required that IBM take the Xbox 360 chip design from concept to full execution in just 24 months,” said Ilan Spillinger, IBM Distinguished Engineer and director of the IBM Design Center for Xbox 360, in a statement.
So you are both right: it can be done if you go with moderate design changes on a mature process. But somehow I don't believe Sony is such a fast mover; we'll see.
 
Latency introduced by HW threading?
Before this goes any further, Vitaly's question was whether you wanted to fix the latency problem of the cache by adding SMT. :)

WRT the LS, if I remember the CBEA specs correctly, you can add cache between the SPUs and main memory in any way you want. Maybe there's something to be done there. Right now, the DMA-everything approach is a curse and a blessing. If there was a caching system (complete with stream detection) it would be easier to get more general purpose code to work at acceptable speed (even though data alignment will ruin your day). With PS3 the problem is mainly that you don't really have enough PPE cores to easily move code over to. I don't really think the SPUs need to be changed much; there just needs to be more general purpose processing power.
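For reference, the "DMA everything" style on the SPU side is roughly the double-buffered loop below, using the MFC intrinsics from spu_mfcio.h; process_chunk() and the chunk size are made up for the example, and error handling and tail cases are ignored:

```c
#include <spu_mfcio.h>
#include <stdint.h>

#define CHUNK 16384                            /* bytes per DMA (the 16 KB max) */

static uint8_t buf[2][CHUNK] __attribute__((aligned(128)));

extern void process_chunk(uint8_t *p, unsigned bytes);   /* hypothetical */

void stream(uint64_t ea, unsigned nchunks)
{
    unsigned cur = 0;

    /* kick off the first transfer */
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);

    for (unsigned i = 0; i < nchunks; ++i) {
        unsigned next = cur ^ 1;

        /* start fetching chunk i+1 while chunk i is still in flight / in use */
        if (i + 1 < nchunks)
            mfc_get(buf[next], ea + (uint64_t)(i + 1) * CHUNK, CHUNK, next, 0, 0);

        /* wait for the DMA tag of the current buffer to complete */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();

        process_chunk(buf[cur], CHUNK);
        cur = next;
    }
}
```

The blessing is that the transfer of the next chunk overlaps with the work on the current one; the curse is that you end up writing (and sizing) something like this by hand for every data structure.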
 
Before this goes any further, Vitaly's question was whether you wanted to fix the latency problem of the cache by adding SMT. :)

I do not see a latency problem inherent in a caching system, so I still do not get his question :p.

I see cache as a non-deterministic solution and LS as a deterministic one.

Besides that part about the nature of the cache (is your data in the cache or does it have to be loaded from main memory?), the situation is the same... you have a certain latency to read and write data to this "memory" (whether it is cache or LS).

Cache can be a bad friend though: it can give you a false sense of security and of easy-to-get performance... you can have very poorly performing code because you do not pay attention to how poorly you are treating the cache (breaking your data down into cache-friendly chunks helps), where that same code would not run at all on an SPU. So SPU's enforce some thought being put into your design...
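Just to illustrate what "breaking your data down into cache-friendly chunks" means on a cached CPU: the toy below computes the same sum twice, once walking the matrix in memory order and once column-by-column, so only one element of each loaded cache line gets used. The sizes are arbitrary and nothing here is Cell-specific:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 4096

int main(void)
{
    float *m = malloc((size_t)N * N * sizeof *m);
    for (size_t i = 0; i < (size_t)N * N; ++i)   /* touch the pages for real */
        m[i] = 1.0f;

    double sum = 0.0;
    clock_t t;

    /* cache-friendly: consecutive addresses, every loaded line fully used */
    t = clock();
    for (int row = 0; row < N; ++row)
        for (int col = 0; col < N; ++col)
            sum += m[row * N + col];
    printf("row-major walk:    %.3fs\n", (double)(clock() - t) / CLOCKS_PER_SEC);

    /* cache-hostile: a large stride between accesses, so a whole line is
       loaded (and soon evicted) for every single element we read */
    t = clock();
    for (int col = 0; col < N; ++col)
        for (int row = 0; row < N; ++row)
            sum += m[row * N + col];
    printf("column-major walk: %.3fs\n", (double)(clock() - t) / CLOCKS_PER_SEC);

    printf("sum = %f\n", sum);
    free(m);
    return 0;
}
```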

WRT the LS, if I remember the CBEA specs correctly, you can add cache between the SPUs and main memory in any way you want.

IIRC the LS1 cache is supposed to sit between LS and the EIB and not between SPU and LS (from what I remember about IBM's CBEA docs).
 
Latency introduced by HW threading?

Threading is supposed to allow you to hide latency, not create a latency problem where there was none.
In Larrabee latency is hidden by both hardware and software threading. While the former is "free" of latency, the latter isn't.

The little info we have on Larrabee suggests that 4 hardware threads (i.e. no software threading) are not enough to effectively hide the latency of texturing. Intel will vary the number of software threads depending...

As far as I can tell, in Cell the LS's 256KB is a fairly harsh constraint upon software threading. I guess it would generally be fine for the "lightweight threads" (~10 vec4 fp32s, say, plus a wodge of constant data shared by all threads) that are currently typical of pixel shaders though.

I think the LS is a bit of a tough roadblock for the evolution of a single SPE, and so if you want to significantly move it forward, keeping on increasing the LS's size is counter-productive.
As an alternative to 1MB of cache with 256KB lockable as "LS", each of the four hardware contexts could use the 1MB of memory as either cache or LS. The programmer could then divvy-up the thread types according to the size of context: e.g. 3 cache-based contexts with 768KB shared amongst them + 1 private LS for the 4th context.

A cache hierarchy with flexible cache locking would give you a bit of both worlds. If you want the predictability and deterministic nature of an LS, you lock down a portion of the cache (which might be a nice substantial portion of it) and work from there, but you are not killing the developers who prefer the cache model and the HW DMA/prefetching it delivers.
It's interesting that Nehalem is reported to be giving little gain in some games - seemingly solely because its total cache is a fraction of that of current processors. i.e. games appear to have what amounts to multi-MB of "context" and Nehalem simply "can't keep it all on-die". That seems rather sobering to me, as it implies that the future of game processors includes way more on-die RAM than anyone's taking seriously right now.

Obviously the standard console factors: thin-APIs and tightly-targeted code should make that less sobering, but still, I wonder...

LOL, 16 SPEs, each with 1MB of memory, would amount to a hell of a lot of on-die RAM.

Jawed
 
I think the LS is a bit of a tough roadblock for the evolution of a single SPE, and so if you want to significantly move it forward, keeping on increasing the LS's size is counter-productive.

If you want to say that doubling the LS and then just keeping on adding SPE's as your heart desires might be a way forward, fine, but I am not convinced it is the only way forward.


Erm, what you are suggesting is going to have a *higher* latency and is likely to be considerably more difficult to program.

Going to cache means more silicon, more power and higher latency, and if you follow the ease-of-programming argument through, the cache is likely to be coherent. Cell already has a coherent cache shared by all the cores; it's very high latency and slow. Making it bigger and shared by more cores will give it a much higher latency.

As for going for a highly threaded programming model, you're not going to make anything easier that way; multithreading is notoriously difficult for anything other than "embarrassingly parallel" problems. If Cell is difficult to develop for (an opinion that has never been universally held), the multithreading is likely the biggest cause of the headaches.

You seem to have completely missed the point of Cell: it's designed for high throughput, and you hide latency by moving data in big chunks into the LSs, which is a very, very efficient way of accessing memory. It's not designed for legacy code or legacy methods of programming; you need to write code specifically for it.

If you want to write high speed code on *any* CPU you end up programming it the same way as Cell anyway: you treat the cache like an LS and try to read in as much as possible in one go.

Anyway, if you need cache and/or threads, these have both been done on Cell: you can run 8 software threads on an SPE and you can also use the LS as a cache. I think it makes a lot more sense to keep the existing architecture and maybe speed these operations up a bit.
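(By "use the LS as a cache" I mean a software-managed cache living in the LS. In its simplest form it is just a direct-mapped lookup like the sketch below; the names here are mine, not the SDK's software-cache API, and on a real SPU dma_fetch() would be an mfc_get plus a tag wait.)

```c
#include <stdint.h>

#define LINE_BYTES 128
#define NUM_LINES  256                   /* 32 KB of LS spent on cached lines */

static uint8_t  lines[NUM_LINES][LINE_BYTES] __attribute__((aligned(128)));
static uint64_t tags[NUM_LINES];         /* effective address held by each line */
static int      valid[NUM_LINES];

extern void dma_fetch(void *ls, uint64_t ea, unsigned bytes);   /* hypothetical */

/* return an LS pointer to the byte at effective address 'ea' */
void *swcache_lookup(uint64_t ea)
{
    uint64_t line_ea = ea & ~(uint64_t)(LINE_BYTES - 1);
    unsigned idx     = (unsigned)((line_ea / LINE_BYTES) % NUM_LINES);

    if (!valid[idx] || tags[idx] != line_ea) {   /* miss: pull the line in */
        dma_fetch(lines[idx], line_ea, LINE_BYTES);
        tags[idx]  = line_ea;
        valid[idx] = 1;
    }
    return &lines[idx][ea - line_ea];
}
```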


Back On topic...
I'm wondering if Sony is planning on going for a 16-SPE Cell? The über Cell IBM is planning sounds like it could be around 300 mm² even at 32nm; maybe Sony doesn't want to go that big. That said, 22nm should be appearing in 2012, so who knows.
 
SPEs only see data in the LS. The PPU's caches are not accessible directly by SPEs, so I feel that comparison is invalid. Data must be explicitly pushed into and out of an LS whether the data originates from main memory or the PPU's cache.

Also consider that the cache itself can run at an async rate to the rest of the chip. There is no reason to assume it *must* have a higher latency than an LS. The cache also need not be a single pool of transistors. It can be a collection of single- or multi-ported pools underlying a unified cache which is exposed to code. This would blunt the effect of contention somewhat and also offer optimization opportunities. A cache would also remove the need for the LS to be part of a context, which indirectly improves latency as fewer bits need to be moved about the system to facilitate a switch. A cache is helpful when top performance isn't needed but the ability to generate working code quickly is.

Having a cache does not obviate the difficulties of multi-threaded programming, but it certainly can make things simpler.
 
Having a cache does not obviate the difficulties of multi-threaded programming, but it certainly can make things simpler.
Sometimes it makes things easier, sometimes harder. Having a unified cache for all 'threads' running can introduce a high amount of cache thrashing, giving you less performance than with just one thread, and sometimes it's not that obvious to figure that out, because it's very dependent on the addresses of your data (the first generation of P4s with HT suffered very badly from that).
Having separate caches per 'thread' helps to avoid that issue, but it introduces redundancy, because each cache sometimes needs to store the same data that could otherwise be shared. While you can't measure that performance cost exactly, you can estimate about 10% more speed on general code when doubling the cache size.
So I also think it can help, and it can also hurt, but the bad thing is that you can barely track down how well it is working, especially in a multithreaded environment. You can run the same code on the same data several times and get different performance, just because of different locations of your data.
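If you want to see that address dependence for yourself, a toy test along these lines shows it on most set-associative caches; the way size, line size and buffer count are assumptions, not P4 or Cell figures, and the only difference between the two runs is the spacing of the buffers:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define WAY_SIZE (4 * 1024)   /* assumed: cache size / associativity */
#define LINE     64           /* assumed cache-line size             */
#define NBUF     16           /* more streams than typical ways      */
#define ITERS    (1L << 24)

static double run(size_t stride)
{
    char *mem = malloc(NBUF * stride);
    memset(mem, 1, NBUF * stride);        /* touch the pages for real */

    volatile char sink = 0;
    clock_t t0 = clock();
    for (long i = 0; i < ITERS; ++i)
        sink += mem[(size_t)(i % NBUF) * stride];   /* round-robin over the buffers */
    (void)sink;

    free(mem);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void)
{
    /* buffers spaced exactly one way-size apart all map to the same cache
       sets and evict each other; one extra line of padding spreads them out */
    printf("power-of-two spacing: %.3fs\n", run(WAY_SIZE));
    printf("padded spacing:       %.3fs\n", run(WAY_SIZE + LINE));
    return 0;
}
```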
 
Sometimes it makes things easier, sometimes harder. Having a unified cache for all 'threads' running can introduce a high amount of cache thrashing, giving you less performance than with just one thread, and sometimes it's not that obvious to figure that out, because it's very dependent on the addresses of your data (the first generation of P4s with HT suffered very badly from that).
Having separate caches per 'thread' helps to avoid that issue, but it introduces redundancy, because each cache sometimes needs to store the same data that could otherwise be shared. While you can't measure that performance cost exactly, you can estimate about 10% more speed on general code when doubling the cache size.
So I also think it can help, and it can also hurt, but the bad thing is that you can barely track down how well it is working, especially in a multithreaded environment. You can run the same code on the same data several times and get different performance, just because of different locations of your data.

Understood. I agree. I only mean to qualify my thoughts. When top performance is less of a requirement than turn-around time... a cache can definitely help. I do not mean to suggest there is a free lunch, not at all.
 
Erm, what you are suggesting is going to have a *higher* latency and is likely to be considerably more difficult to program.

Going to cache means more silicon, more power and higher latency, and if you follow the ease-of-programming argument through, the cache is likely to be coherent. Cell already has a coherent cache shared by all the cores; it's very high latency and slow. Making it bigger and shared by more cores will give it a much higher latency.

As for going for a highly threaded programming model, you're not going to make anything easier that way; multithreading is notoriously difficult for anything other than "embarrassingly parallel" problems. If Cell is difficult to develop for (an opinion that has never been universally held), the multithreading is likely the biggest cause of the headaches.

You seem to have completely missed the point of Cell: it's designed for high throughput, and you hide latency by moving data in big chunks into the LSs, which is a very, very efficient way of accessing memory. It's not designed for legacy code or legacy methods of programming; you need to write code specifically for it.

If you want to write high speed code on *any* CPU you end up programming it the same way as Cell anyway: you treat the cache like an LS and try to read in as much as possible in one go.

Anyway, if you need cache and/or threads, these have both been done on Cell: you can run 8 software threads on an SPE and you can also use the LS as a cache. I think it makes a lot more sense to keep the existing architecture and maybe speed these operations up a bit.


Back On topic...
I'm wondering if Sony is planning on going for a 16-SPE Cell? The über Cell IBM is planning sounds like it could be around 300 mm² even at 32nm; maybe Sony doesn't want to go that big. That said, 22nm should be appearing in 2012, so who knows.

I think there is some confusion in the discussion about what kind of latency problem the Cell has or doesn't have. I think nAo is referring to the latency of the LS (6 cycles), which is higher than the latency of an ordinary first-level cache (typically 4 cycles). Panajev is discussing a hypothetical modified SPE with direct access to main memory, where a cache would help cut the latency and replace the function of the DMA-driven LS (or let the programmer choose either, to keep BC).

Anyway, I agree with ADEX that going cache for the SPEs means more silicon, and the main strength of the SPE is its small footprint. Concerning the latency of the LS memory, I only have layman knowledge of CPU design, but I wonder if it's not possible to decrease the latency when you are moving to a considerably smaller process, given that you keep the LS memory the same size.
The reason for this theory of mine is that I have seen the first level caches grow from 4 kB -> 8 kB -> 16 kB -> 32 kB without increased latency.

I remember the comment from one of the Cell developers where he explained that they chose to increase the LS from 128 kB to 256 kB to give the programmers more flexibility. The trade-off may have been that they introduced a higher latency, but at the time they valued the increased flexibility more highly.

It would be interesting to hear some developers' views on the LS size. Would there be benefits to a doubled LS, or is it better to keep the SPEs lean and mean (with improved instructions and decreased latency to the LS, say) and spend the silicon real estate on a few more SPEs instead of a larger LS?
 
Threading is supposed to allow you to hide latency
This is what I said. Seems I was not very clear.
Cache introduces non-deterministic memory access latency, which you can try to hide with prefetching and hardware threading.
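The prefetching half of that, on an access pattern a HW prefetcher can't guess, looks roughly like this; __builtin_prefetch is a standard GCC builtin (on the PPU it should end up as a dcbt touch), and the look-ahead distance is a tuning guess, not a Cell figure:

```c
#define PREFETCH_AHEAD 8   /* how many iterations ahead to touch; pure guesswork */

void gather_scale(const float *table, const int *idx, float *out, long n)
{
    for (long i = 0; i < n; ++i) {
        /* start loading the element we will need in a few iterations */
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&table[idx[i + PREFETCH_AHEAD]]);
        out[i] = table[idx[i]] * 2.0f;
    }
}
```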

If you want to say that doubling the LS...
Where did I ask for LS doubling? The LS size is fine.

Latency hiding will not become useless with a 16-32 SPE system even if you double the LS; on the contrary, it might very well become more and more critical.
You'll need a lot of B/W just to keep cache coherency in such a system.
In any case, the CBEA allows a cache to be placed between the SPE and memory.

but you are not killing the developers who prefer the cache model and the HW DMA/prefetching it delivers.
CELL is a niche processor. Nehalem is better for them. It can easily beat the Niagara and other "massive multithreading" monsters.
 
Should be, although I'd expect it to be based on a future iteration much like the roadmap defines.
 