RAM latency in PS3, Xbox 360

So XDR is cheaper than DDR, in the long run. It's not about being technically superior. And the DDR is there because... I thought it was cheaper. And it's not like RSX couldn't be given an XDR interface. Two pools of 256 MB RAM (or one 512 MB pool) would surely be easier to manage and cheaper to source.

I'm confused! :? :LOL:
 
Shifty Geezer said:
So XDR is cheaper than DDR, in the long run. It's not about being technically superior. And the DDR is there because... I thought it was cheaper. And it's not like RSX couldn't be given an XDR interface. Two pools of 256 MB RAM (or one 512 MB pool) would surely be easier to manage and cheaper to source.

I'm confused! :? :LOL:


Or maybe they weren't sure which ones would become cheaper in the long run, so they just put half and half just to be sure! ;)
 
You may well be right. I remember discussions here on which console costs more to produce, and the talk was 'XDR's cheaper because Sony makes it themselves (and now we're told it's simpler too)' and 'DDR's cheaper because it doesn't have RAMBUS's markup and there's loads of sources competing'. I reckon Sony were watching, got as confused as I am, and whacked in a bit of both to be on the safe side. It's the only logical explanation :p
 
nAo said:
AlgebraicRing said:
What's the cost for copying memory from main memory to an SPE's local memory?
Please define 'cost'.
I'm just worried about the situation where all 7 SPE's need to fetch or write main memory. If simultaneous reads/writes are not possible then I've got to wait 500*7 or 3500 cycles for the last SPE to get memory... Or am I thinking about the situation wrong?
CELL mem controller can handle 128 simultaneous memory transactions, in order to better hide/reduce memory latencies, as more memory pages can be opened at the same time.

I don't really know what I am thinking of in terms of cost. I would need to look at the programming model more closely. But essentially, if I wanted to send a task to be computed by an SPE, what needs to be done to package up the main memory data (and code???) and send it to the SPE to be processed? And if the data is more than 256KB, is my SPE going to sit idle while waiting for another 500-cycle mem fetch? Can I optimize the situation by pre-fetching 128KB at a time, so that while the SPE is working on the first half of its local memory, I can be populating the second half of its local memory?

Let me give an overview of where I am coming from. I am researching for my professor whether or not we should port his language, SequenceL, to a multicore processor, specifically the Cell processor. The language is functional, so of course parallelization is easier at the compiler level, but the language's semantics are set up to make the parallelism explicit to the programmer's eye. My task is to determine whether or not creating a Cell-specific port would be worthwhile (i.e. that the Cell architecture would highlight any benefits or advantages to using the language). My hunch is that the structure of the language would make it very easy to create a compiler/scheduler for the Cell processor.

I need to present on the possibility of implementing SequenceL on a multicore architecture by next Friday, just a 10-15 minute presentation. I think the Cell is a perfect match because it gives the most parallel SIMD punch for the buck, and SequenceL is all about vectorization. I would just like to be more informed about the specifics of Cell, though. How well can I keep the SPEs churning with data processing? When are there going to be forced or required idle times? Etc.

Got any papers you could point me to about the memory access questions? I've read that the SPEs have 128 registers. But I would love to learn more about the memory controller. Got a link? :)
 
Ideally you want the open-source IBM docs, but they're not out yet. AFAIK the programmer has full control over memory access and schedules memory fetches in advance of needing them. This I think overcomes the latency, so the moment you need the data (if you've set it up efficiently) it's already on its way to the SPE. As I understand it, the requirements of the architecture have been pretty well thought out, and although implementation is more complex (or at least different) from throwing instructions at a Pentium/PPC, when done right there's no major disadvantage.

Incidentally I think the same goes for XeCPU, so by structuring your caching you can hide latency. I don't think writing for the different platforms is going to be too different.
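
Something like the usual software-prefetch pattern, I mean. A rough sketch in plain C, using GCC's __builtin_prefetch as a stand-in (the 360 toolchain exposes the same idea through its own dcbt-style intrinsic; the loop, the stride and the function name here are just made up for illustration):

/* Walk a big array, requesting data a few cache lines ahead so the
   miss is (hopefully) already resolved by the time we touch it. */
float sum_prefetched(const float *data, int n)
{
    float acc = 0.0f;
    for (int i = 0; i < n; i++) {
        /* Prefetch ~256 bytes (a couple of cache lines) ahead. The right
           distance depends on the real miss latency, and prefetching
           past the end of the array is harmless. */
        __builtin_prefetch(&data[i + 64], 0, 0);
        acc += data[i];
    }
    return acc;
}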
 
AlgebraicRing said:
Can I optimize the situation by pre-fetching 128KB at a time, so that while the SPE is working on the first half of its local memory, I can be populating the second half of its local memory?
You can have 16 outstanding memory requests on each SPE, so to use an analogy to your example, local memory can be split into 16 banks, all being loaded while you work on... the 17th :p

Transactions can also be started from SPE and PPE sides, so you don't actually need any overhead on PPE to give SPE work. Don't know what the DMA setup and tag overhead is yet, that's a question I'd like to see answered myself also.
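
For what it's worth, double-buffering along those lines would look roughly like the sketch below, written against the MFC intrinsic names that have been floating around (mfc_get takes a local store address, an effective address, a size and a tag). Treat the exact names, signatures and the process() function as assumptions until the public docs show up:

#include <spu_mfcio.h>

#define CHUNK (16 * 1024)   /* 16KB, the maximum size of a single MFC transfer */

static volatile char buf[2][CHUNK] __attribute__((aligned(128)));

extern void process(char *data, unsigned size);   /* whatever work the SPE does */

/* Stream 'total' bytes from main memory (assumed a multiple of CHUNK):
   while the SPE chews on one buffer, the DMA engine fills the other. */
void stream(unsigned long long ea, unsigned total)
{
    unsigned cur = 0;

    mfc_get(buf[0], ea, CHUNK, 0, 0, 0);                  /* first fetch on tag 0 */

    for (unsigned off = CHUNK; off < total; off += CHUNK) {
        unsigned next = cur ^ 1;
        mfc_get(buf[next], ea + off, CHUNK, next, 0, 0);  /* start the next fetch on the other tag */

        mfc_write_tag_mask(1 << cur);                     /* wait only for the buffer we need now */
        mfc_read_tag_status_all();

        process((char *)buf[cur], CHUNK);
        cur = next;
    }

    mfc_write_tag_mask(1 << cur);                         /* drain the last buffer */
    mfc_read_tag_status_all();
    process((char *)buf[cur], CHUNK);
}

With 16 tags per SPE you can obviously go deeper than two buffers if the working set allows it.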
 
london-boy said:
Shifty Geezer said:
So XDR is cheaper than DDR, in the long run. It's not about being technically superior. And the DDR is there because... I thought it was cheaper. And it's not like RSX couldn't be given an XDR interface. Two pools of 256 MB RAM (or one 512 MB pool) would surely be easier to manage and cheaper to source.

I'm confused! :? :LOL:


Or maybe they weren't sure which ones would become cheaper in the long run, so they just put half and half just to be sure! ;)


Who knows, I doubt they even know. Rambus has said they wanted a premium for their RAM; it's also not in mass production and will basically be made for the PS3, as I see no other products slated to use it. So not only will it start off higher because of the premium, it will also come down in price more slowly with only one product using it.

GDDR RAM, on the other hand, is widely used and has been in mass production for years. It first shows up in high-end graphics cards and gets a premium price when it's first introduced at a new speed, but in a year or two that RAM is in the low-end parts selling at very low prices.

MS and Sony don't have to worry about the RAM supply drying up, as they can always use faster RAM, and it doesn't look like GDDR RAM will be phased out for another two years or so; even then it will still be at the high end.

So it's hard to say which one will be cheaper in the end, but for the first few years I believe it will be GDDR RAM, not XDR.
 
Okay, so can anyone give a definite answer on RAM latency for the PS3? I've seen a value of 150 clocks, but that was a totally unofficial source, so I have my doubts...
 
version said:
"Toshiba’s XDR memory chips are configured as 4Mb word x 8 banks x 16 bits, are available with 40ns, 50ns and 60ns cycle time and 27ns or 35ns latency and have 1.8V VDD."

http://www.xbitlabs.com/news/memory/display/20031225163917.html

:) That sentence, and any figures derived from it, have very little to do with the real-world CPU latency...

I'd suggest re-reading this very thread for some very real numbers...
 
Gubbi said:
Cycle time is just wrong. 40ns equates to a 25MHz cycle. And cycle time is higher than latency :rolleyes:

Cheers
Gubbi

You don't understand how DRAM works...

40ns cycle time is very aggressive. Most likely they are using 50ns or 60ns cycle time parts.

Cycle time generally refers to random access requiring a full precharge-RAS-CAS cycle for each access, while latency generally refers to either having the bank closed and clean (requiring only a RAS-CAS) or the page open (requiring only a CAS).
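
To put rough numbers on those Toshiba figures against a ~3.2GHz core clock (ignoring everything the memory controller, arbitration and cache-miss handling add on top):

40ns cycle time x 3.2 clocks/ns = 128 CPU clocks for a full random access
27ns latency    x 3.2 clocks/ns ~= 86 CPU clocks best case inside the chip

which is how you get from the chip-level numbers to the several-hundred-cycle cache-miss figures quoted elsewhere in this thread.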

Aaron Spink
speaking for myself inc.
 
DeanoC said:
I'd suggest re-reading this very thread for some very real numbers...

All I've found was this from ERP:
They're both comparable, GDDR is faster, but both are in the 500+ cycle range for a cache miss.

So, am I right to assume that both Cell and XCPU have a memory latency of ~500 clock cycles?
And is this really such a big drawback, because of the lack of OOE?
 
I believe that the 40ns refers to tRC, which IIRC is the amount of time it takes to open a different page (row) from the one currently open in a given bank of the memory chip. AFAIK this is usually some multiple (10+) of the DRAM clock period... I have no idea what the latency number corresponds to, but maybe it's best-case latency inside the chip (no bank conflicts, page miss, or R/W turnaround)?
 
Laa-Yosh said:
DeanoC said:
I'd suggest re-reading this very thread for some very real numbers...

All I've found was this from ERP:
They're both comparable, GDDR is faster, but both are in the 500+ cycle range for a cache miss.

So, am I right to assume that both Cell and XCPU have a memory latency of ~500 clock cycles?
And is this really such a big drawback, because of the lack of OOE?

OOOE wouldn't help you with a 500-cycle latency.

OOOE is more about hiding instruction latency and the latency from the L1 and L2 caches, which are still pretty significant.
 
ERP - but OOOE also issues independent loads/stores close to the memory instruction that missed the caches. If these nearby memory ops also miss the cache, and can be serviced in parallel to the initial cache miss, won't the total stall time be significantly reduced compared to the in-order machine? Or is this a rare/insignificant effect on most codes?
 
psurge said:
ERP - but OOOE also issues independent loads/stores close to the memory instruction that missed the caches. If these nearby memory ops also miss the cache, and can be serviced in parallel to the initial cache miss, won't the total stall time be significantly reduced compared to the in-order machine? Or is this a rare/insignificant effect on most codes?

Sure, it might save you 20 cycles out of 500...
The point is to hide instruction latencies and cache HIT latencies: even an L1 cache at 3+GHz isn't 0-cycle latency, and L2 is well into double figures.

Think about it: why do you think bigger caches make such a big difference on the OOO Pentiums?
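
Rough numbers, assuming a ~100-entry out-of-order window and ~3-wide issue (ballpark for desktop cores, not official figures for anything): covering a 500-cycle miss would take on the order of 1500 independent instructions in flight, but the window fills after roughly 100/3 ≈ 33 cycles and then the machine stalls just like the in-order one. At best you overlap a few tens of cycles, or a second miss that happens to fall inside the window (hence the ~20 out of 500 above). The latencies an OOO window genuinely covers are the 2-4 cycle L1 hits and the 10-20+ cycle L2 hits.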
 