*spin off* RAM & Cache Streaming implications

My question really boils down to this: why use a software-based cache and waste resources and performance on that, when you can have really fast, optimized fixed-function hardware for that purpose alone?
It boils down to transistor budget. With automatic caches, Cell would probably have had half the SPUs at best, greatly lowering the peak performance.

If you are important enough, you can afford to use HW that is inconvenient to use; developers simply have little to no choice and have to work with it :)
 
I partially agree with you, but we have to imagine 2013/2014 and not stay fixed on 2011 thinking...
Waiting longer doesn't mean you get wider buses for the same cash.

There are two kinds of costs in electronics -- the kind that halves its price every year and a half, and the kind that stays constant regardless of time. More signal lines and more memory *chips* are static costs, whereas faster chips and larger chips are the kind of costs that go down with time. For a console that will presumably be produced for 5+ years, this is very significant -- if you pick a wide bus with slow chips, your console will never be cheaper to produce than it is today. Because of this, they will have a narrowish bus that connects to the fastest and largest chips available at launch.
 
It boils down to transistor budget. With automatic caches, Cell would probably have had half the SPUs at best, greatly lowering the peak performance.
With automated caches (instead of manual ones), the transistor budget for memory for each core would stay the same (cache memory itself would not take more transistors than the SPU work memory). We are just talking about adding automatic cache logic here (no extra memory). Do you believe that just the cache logic would take as many transistors as the CPU pipelines, execution/logic units, on-chip (cache) memory and all other core logic combined? Halving the number of cores would pretty much mean that. I am not a HW engineer, but that sounds like a lot to me.
 
We've had this discussion before, and the extra silicon automatic caching adds isn't a huge amount (10%? 20%?). However, I believe that it adds latency to L2 cache in a way LS doesn't, and LS is exceptionally quick at feeding the processor for 256 KB. Someone may correct me on that though. ;)
 
We've had this discussion before, and the extra silicon automatic caching adds isn't a huge amount (10%? 20%?). However, I believe that it adds latency to L2 cache in a way LS doesn't, and LS is exceptionally quick at feeding the processor for 256 KB. Someone may correct me on that though. ;)

L2 access isn't normally started until a core knows it missed its L1 cache.

Local store has:

Lower latency than L2.
Higher latency than L1.
Higher average latency than an L1/L2 combo.

Cheers
 
... but combine the LS with an absolutely massive register file (compared to x86) and somewhat smart manual preloading, and I'm quite sure it's still faster than an L1/L2 combo. At least as long as you have the data in LS. If I'm not mistaken, the register file itself is around 2 kB in size.
 
... but combine the LS with an absolutely massive register file (compared to x86) and somewhat smart manual preloading, and I'm quite sure it's still faster than an L1/L2 combo. At least as long as you have the data in LS. If I'm not mistaken, the register file itself is around 2 kB in size.

You can preload data if you know your data access patterns.

Often you don't (in fact, in many cases you can't know in advance).

Cheers
 
From what I understand, loop unrolling works much better on the SPU than on a regular x86 CPU simply due to the huge register file, and that helps quite a bit with prefetching data into registers. Yes, x86 CPUs have rename registers, but they aren't as effective when you try to unroll loops 4+ times, I believe.

I'd say LS has its good and bad sides. It's definitely not as good as automatic caches for programmer efficiency, but with some work you can get at least as much, if not more, performance out of it.
 
From what I understand, loop unrolling works much better on the SPU than on a regular x86 CPU simply due to the huge register file, and that helps quite a bit with prefetching data into registers. Yes, x86 CPUs have rename registers, but they aren't as effective when you try to unroll loops 4+ times, I believe.

There are three reasons for unrolling loops:
1. To remove false dependencies on registers.
2. To interleave multiple iterations of the loop to increase distance between data dependent instructions.
3. To reduce loop overhead relative to work done.

Out-of-order execution handles 1.) and 2.)

You need loop unrolling to get any kind of performance from an SPU. The LS has a load-to-use latency of 6 cycles; that's 12 instruction slots you have to fill with something useful while you wait for data.
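For illustration, a minimal sketch (plain C, hypothetical function and array names) of the kind of 4x unrolling this implies -- four independent iterations interleaved so the scheduler has useful work to issue during the load-to-use window:

```c
/* Hypothetical example: add two float arrays.  The 4x unrolled loop keeps
   four independent load/add chains in flight so the compiler/scheduler can
   fill the load-to-use latency with useful work. */
void add_arrays_unrolled(float *dst, const float *a, const float *b, int n)
{
    int i;
    /* n is assumed to be a multiple of 4 for brevity */
    for (i = 0; i < n; i += 4) {
        float a0 = a[i + 0], b0 = b[i + 0];
        float a1 = a[i + 1], b1 = b[i + 1];
        float a2 = a[i + 2], b2 = b[i + 2];
        float a3 = a[i + 3], b3 = b[i + 3];
        dst[i + 0] = a0 + b0;
        dst[i + 1] = a1 + b1;
        dst[i + 2] = a2 + b2;
        dst[i + 3] = a3 + b3;
    }
}
```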

Loop unrolling is a crutch, and has been since the mid 90s.

Cheers
 
... but combine the LS with an absolutely massive register file (compared to x86) and somewhat smart manual preloading, and I'm quite sure it's still faster than an L1/L2 combo. At least as long as you have the data in LS.
You can manually prefetch data to L1/L2 caches too on any recent multicore chip. If you repeatedly access a small (let's say 256 KB) structure, all your reads come directly from L1/L2 every time. And if your access pattern is predictable, your cache hints pretty much guarantee that your reads always come from L1. Adding cache hints is labor-intensive, but you only need to add them to your performance-critical code sections (less than 1% of your total code). And the profiling tools of current consoles give you very good insight into where to add them.
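As a concrete illustration (not from the post itself -- a minimal sketch in plain C using GCC's __builtin_prefetch, with hypothetical names and a made-up prefetch distance), a cache hint a fixed distance ahead of a predictable streaming read looks roughly like this:

```c
#include <stddef.h>

/* Hypothetical streaming loop: touch elements in order and hint the cache
   a fixed distance ahead.  PREFETCH_DIST is a tuning knob, not a number
   taken from the post. */
#define PREFETCH_DIST 16

void scale_stream(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        if (i + PREFETCH_DIST < n)
            /* args: address, 0 = prefetch for read, 1 = low temporal locality */
            __builtin_prefetch(&src[i + PREFETCH_DIST], 0, 1);
        dst[i] = src[i] * 2.0f;
    }
}
```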
 
Yeah, caches support basically all the same features as local stores (prefetching vs DMA), but they give you a "soft edge" of performance and still provide a usable model when you have coherent but *unpredictable* data-dependent accesses. They also don't force you to enumerate all the data paths for non-performance-sensitive reads/writes, which is pretty convenient.

Said another way, it's fairly easy to emulate a local store in software (using a cache), but it's excessively expensive to do the opposite.
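A rough sketch of that emulation (plain C, hypothetical names and sizes): a small staging buffer that stays cache-resident, which you explicitly copy data into, work on, and copy back, much like a DMA in/out pair:

```c
#include <string.h>

/* Hypothetical "software local store": a small static buffer that, once
   touched, tends to stay resident in cache for the duration of a job.
   n must be <= LS_SIZE. */
#define LS_SIZE (64 * 1024)
static unsigned char ls[LS_SIZE];

void process_block(unsigned char *dst, const unsigned char *src, size_t n)
{
    memcpy(ls, src, n);           /* "DMA in": the copy pulls data into cache */

    for (size_t i = 0; i < n; ++i)
        ls[i] ^= 0xFF;            /* placeholder transform, done "in LS" */

    memcpy(dst, ls, n);           /* "DMA out" */
}
```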

I have the same opinion as sebbbi... I used to think an explicit local store was a big win, but I've worked on enough varied architectures since then that I'm not so sure. It seems like the constraints it puts on the problems you can implement efficiently are a bit too crippling, and the benefits are minimal.
 
Local store did have some advantages over x86 caches for processing data streams circa the PS3's release. The biggest issue with x86 caches until SSE2 was the inability to avoid polluting them with writes: if you're dealing with streaming data, you need to be able to queue writes of less than one cache line without forcing a cache-line read. SSE2 added instructions to do that.

But in general, with prefetching, caches can do everything a local store can do, with the advantage that you only pay the cost of moving the data you touch. They don't remove the need for appropriate data design, as has been discussed previously.
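For reference, the SSE2 additions in question are the non-temporal stores (e.g. MOVNTDQ); a minimal sketch (plain C intrinsics, hypothetical function name) of streaming writes that bypass the cache:

```c
#include <stddef.h>
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Copy 16-byte-aligned data with non-temporal stores so the destination
   is write-combined instead of pulling cache lines in.  n is a count of
   16-byte __m128i blocks. */
void stream_copy(__m128i *dst, const __m128i *src, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        __m128i v = _mm_load_si128(&src[i]);   /* normal (cached) load */
        _mm_stream_si128(&dst[i], v);          /* non-temporal store (MOVNTDQ) */
    }
    _mm_sfence();   /* order the streamed writes before later stores/loads */
}
```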
 
How much more expensive, transistor-wise, is a local store vs a similarly sized cache? I mean both the cache itself and anything that has to be added to the CPU. For simplicity, let's assume it's not a shared cache.
 
As a reflection, we have to use creativity and common sense, as far as possible, to imagine what the scene will be in 2013 and 2014, with hardware ready for production / taped out at least 6 months before the launch date (the Xbox's NV2A and the Xbox 360's R500/C1 had their final specs, clocks, etc. completed only six months before release). Clearly beta SDKs will come long before that, but the final specs are certainly subject to last-minute (6 months out) changes.
The only specs subject to change 6 months before launch are clock speed and disabling of features due to a bug. Significant features are locked down well before that. Especially for a console.

Edit: the amount of memory can be increased late. I'm referring to chip features.
 
The only specs subject to change 6 months before launch are clock speed and disabling of features due to a bug. Significant features are locked down well before that. Especially for a console.

I'll go further than that: CPUs and GPUs are probably locked 2 years before launch, and certainly once any external dev has access to a devkit.
Yes, clock rates can change, and occasionally features never materialize because of hardware bugs, but before any dev sees the hardware, someone has to write the OS/firmware/drivers for it.

The only dramatic hardware change I can think of after dev kits were released was the N64, and it only affected the early "Dream Team" teams; rumor has it that the original design had substantially faster graphics.
 
Actually, I was referring more to the circular use of the ring interconnect bus connecting everything.

Criticism of the local stores is not new; a hybrid approach is probably called for where they can work both ways.
 
Waiting longer doesn't mean you get wider buses for the same cash.

There are two kinds of costs in electronics -- the kind that halves its price every year and a half, and the kind that stays constant regardless of time. More signal lines and more memory *chips* are static costs, whereas faster chips and larger chips are the kind of costs that go down with time. For a console that will presumably be produced for 5+ years, this is very significant -- if you pick a wide bus with slow chips, your console will never be cheaper to produce than it is today. Because of this, they will have a narrowish bus that connects to the fastest and largest chips available at launch.
I still disagree with that. Nvidia has a product that has seen ~three shrinks: the old 8800 GTX. The chip still ships nowadays with few changes; it started with a 384-bit-wide bus at a high price and sells today at ~$100, and despite there being really few changes, they moved from a 384-bit bus to a 256-bit one, and from 90nm to 65nm to 55nm. It's an interesting "case study", but with Nvidia's renaming frenzy I really have a tough time gathering proper information on die sizes, and I wonder if they moved from the 384-bit bus to the 256-bit one in one jump or if they passed through a 320-bit one (if not with this exact product, then with one of the same family).

Anyway, the point is that in a corner case you could say it's the narrow bus that doesn't scale down. For example, AMD GPUs use 64-bit-wide memory controllers, so you have two of them on a 128-bit-wide bus; so far so good. I would not be surprised if the configuration is the same in Xenos, i.e. two 64-bit-wide memory controllers, so if MS wanted to use a 64-bit bus they would need GDDR3 clocked @1.4GHz. Either it's not available or it's too costly, but they did not deem that choice worthwhile (are they not concerned by cost? I don't think so).

On the other hand, if with their next system they start with a 256-bit bus and today's cheapest GDDR5 clocked @900MHz, then say two or three years down the road they could move to a 192-bit bus and use GDDR5 clocked @1.2GHz, which will be super cheap by then and offer overall the same characteristics.
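To put rough numbers on that (a back-of-the-envelope sketch assuming the usual GDDR5 convention of 4 data transfers per pin per quoted clock; hypothetical helper, not official figures):

```c
#include <stdio.h>

/* GB/s = (bus width in bytes) * clock in MHz * 4 transfers per clock / 1000 */
static double gddr5_bw_gbs(int bus_width_bits, double clock_mhz)
{
    return (bus_width_bits / 8.0) * clock_mhz * 4.0 / 1000.0;
}

int main(void)
{
    printf("256-bit @  900 MHz: %.1f GB/s\n", gddr5_bw_gbs(256,  900.0)); /* 115.2 */
    printf("192-bit @ 1200 MHz: %.1f GB/s\n", gddr5_bw_gbs(192, 1200.0)); /* 115.2 */
    return 0;
}
```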
Then there is the matter of the number of RBEs tied to a given memory controller, but if they plan ahead it may not be a problem, depending on the granularity at which AMD can disable RBEs on a given design. The HD 6870 has 32 RBEs, 8 per memory controller; on a 192-bit bus that would be 24 RBEs. If one is planning ahead, they could disable 2 RBEs per memory controller in the first revision (coarse-grained redundancy that may help with yields too).
Anyway, the point is that narrow buses don't scale: to shrink them you need the bandwidth per memory controller to go up by a 2x factor, which is unlikely to happen.

Overall, if we were to go by this absolute belief that a wider bus is not an economical option, the 360 would be stuck on a 64-bit bus like the Xbox (or, a bit sarcastically, the GTS 250 would not sell at $100); moving from a 64-bit bus to a 128-bit one was also a 2x increase with associated costs. The point is that manufacturers need that extra bandwidth; actually MS went further and added an EDRAM chip (even more cost) to meet ~720p rendering bandwidth requirements (and to have a single pool of RAM, which comes in handy). Nowadays it's another matter: the cheapest GDDR5 will buy you 100+ GB/s worth of bandwidth on a 256-bit bus, and modern GPUs do really well at 720p and above (and with AA) with that bandwidth. In a few years the cheapest GDDR5 will offer the same bandwidth on a 192-bit bus, bringing cost down (it needs only GDDR5 @1200MHz). And one would never have to deal with EDRAM, an extra chip and its associated costs. At the same time, given enough bandwidth (and they would have enough), developers will be way happier with a single pool of RAM for real, with no limit and no arbitrarily set size for, say, their G-buffer.
 
Waiting longer doesn't mean you get wider buses for the same cash.

There are two kinds of costs in electronics -- the kind that halves its price every year and a half, and the kind that stays constant regardless of time. More signal lines and more memory *chips* are static costs, whereas faster chips and larger chips are the kind of costs that go down with time. For a console that will presumably be produced for 5+ years, this is very significant -- if you pick a wide bus with slow chips, your console will never be cheaper to produce than it is today. Because of this, they will have a narrowish bus that connects to the fastest and largest chips available at launch.

Forgive me, but I think you have not read the iSuppli reports, because you're saying that some costs and memory chips do not fall, while the facts and data show the opposite: manufacturing processes allow more, smaller chips and continually better yield rates per wafer, reducing costs, and the data shown by iSuppli prove that even the 256MB XDR modules that nobody uses have been in the range of less than US$10 since mid-2009 (today much less).

About the use of GDDR5 and others... We must remember that with their bargaining power, Sony and MS do not buy 1000 units but millions, and often under conditions far more favorable for reducing costs than AMD or Nvidia get, besides the fact that they are able to subcontract or partner with other companies (GlobalFoundries, TSMC, NEC, Toshiba, etc.) to produce their chips while still paying for intellectual property rights (AMD etc.).


In particular, I don't see it as an unlikely scenario that a console could have 4 GB of XDR2 RAM with high bandwidth (256+ GB/sec); in Sony's case it would practically be the maintenance, improvement and updating of a contract that has already existed for more than 12 years (since the PS2).
 
I'll go further than that: CPUs and GPUs are probably locked 2 years before launch, and certainly once any external dev has access to a devkit.
Yes, clock rates can change, and occasionally features never materialize because of hardware bugs, but before any dev sees the hardware, someone has to write the OS/firmware/drivers for it.

The only dramatic hardware change I can think of after dev kits were released was the N64, and it only affected the early "Dream Team" teams; rumor has it that the original design had substantially faster graphics.


Very interesting, because I had heard from a friend who worked on the X360 that they received the final SDK only in June/July 2005, less than six months before the console's release, but that's another story.

And the information you bring us here is very interesting, because it puts us in a position to ask whether developers already have clear information about what to expect from the next generation (probably many are already working with a Radeon HD 5870 and a Phenom II quad or Core i7 simulating next-gen console specs), or whether, like Crytek, similarly to what Epic did in the past (the Xbox 360 story about going from 256 to 512 MB), they may be directly or indirectly influencing what specs these next-gen consoles will have.
 
Very interesting, because I had heard from a friend who worked on the X360 that they received the final SDK only in June/July 2005, less than six months before the console's release, but that's another story.
Your friend's timeline is probably accurate, but the SDK is software plus hardware, and the hardware (silicon) features, bugs aside, were locked down much earlier.
 