PS3 XDR bandwidth?

That was largely due to the RIMM memory layout, it might be added.

Actually it had more to do with the way RDRAM was an afterthought on Intel chipsets. They just tacked it onto a regular SDR controller, adding extra latency serializing and deserializing data and commands, and on top of that they weren't able to take advantage of some of the more advanced things in RDRAM (like many more open pages).

Intel's 850 chipset had main memory latency above 300ns; at the same time Alpha EV7 had <100ns (but with the memory controller on die) with the same RDRAM memory chips.

Cheers
 
ShootMyMonkey, if you have time, how do I interpret your data points, say, if an SPU initiates a 32K DMA (from XDR or GDDR3) to its Local Store? How about outbound/write traffic?

e.g., upon a DMA request, how long (or how many CPU cycles) before the first byte appears? Subsequently, how many bits/bytes appear per clock?

EDIT: Oh wait, I forgot about the NDA :(

Is it correct to assume that:

+ We have to wait 100s of SPU cycles for the first byte to show up

+ 4 * 8-bits every DRAM clock [400MHz]. i.e., 4 bytes every 8 SPU cycles when the data arrives

+ A subsequent fetch of 32K can be pipelined at the next DRAM cycle

?
 
+ We have to wait 100s of SPU cycles for the first byte to show up

+ 4 * 8-bits every DRAM clock [400MHz]. i.e., 4 bytes every 8 SPU cycles when the data arrives

+ A subsequent fetch of 32K can be pipelined at the next DRAM cycle
4 * 8 bits? Where does the 4 come from? There are 4 DRAM devices all right, but each of them is an x16 XDR DRAM, so it's 16 bits wide. 4 bytes per controller channel, but the memory controller has two channels. The rest of it is basically true.

Even relying on the things we all know about Cell and PS3, it's relatively safe to say that a request isn't necessarily immediate as a single SPE may not be the only device making a request at a given moment, or it may have to wait on something like a cache writeback. That particular data point that Rambus lists has more to do with how frequently you can make requests. It's basically 1 DRAM cycle, and since the XDR effective bus speed equals the clock speed of the Cell, that means 8 cycles at the CPU. How soon you actually get back the results of your request is a whole other matter from how frequently a new request can be issued.

And requesting from GDDR might as well take a few million cycles as the GPU is basically lord and master of that pool.
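
To make that concrete, here is a back-of-the-envelope sketch of the request-issue arithmetic, assuming the commonly quoted public figures (400 MHz XDR DRAM clock, 8x data rate per pin, 3.2 GHz Cell clock); none of this comes from an NDA'd spec:

[code]
/* Sketch of the request-issue rate described above.
   Assumed figures: 400 MHz XDR DRAM clock, 8x data rate, 3.2 GHz Cell. */
#include <stdio.h>

int main(void)
{
    const double dram_clock_hz  = 400e6;  /* XDR DRAM core clock (assumed) */
    const double data_rate_mult = 8.0;    /* XDR: 8 bits per pin per DRAM clock */
    const double cell_clock_hz  = 3.2e9;  /* Cell/SPU clock (assumed) */

    double effective_bus_hz       = dram_clock_hz * data_rate_mult;  /* 3.2 GHz */
    double cpu_cycles_per_request = cell_clock_hz / dram_clock_hz;   /* 8 */

    printf("Effective XDR bus rate: %.1f GHz\n", effective_bus_hz / 1e9);
    printf("A new request every %.0f Cell cycles (1 DRAM cycle)\n",
           cpu_cycles_per_request);
    return 0;
}
[/code]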
 
4 * 8 bits? Where does the 4 come from? There are 4 DRAM devices all right, but each of them is an x16 XDR DRAM, so it's 16 bits wide. 4 bytes per controller channel, but the memory controller has two channels. The rest of it is basically true.

So it's on-paper 8 bytes every 8 SPU cycles (instead of 4 bytes every 8 cycles)? I used 4 * 8 bits (instead of 16 bits) because of the line:
"In XDR's case, it transfers 8 bits per cycle relative to the DRAM. ..."
This seems to match the 1 byte per node per cycle transfer rate of EIB.

Even relying on the things we all know about Cell and PS3, it's relatively safe to say that a request isn't necessarily immediate as a single SPE may not be the only device making a request at a given moment, or it may have to wait on something like a cache writeback. That particular data point that Rambus lists has more to do with how frequently you can make requests. It's basically 1 DRAM cycle, and since the XDR effective bus speed equals the clock speed of the Cell, that means 8 cycles at the CPU. How soon you actually get back the results of your request is a whole other matter from how frequently a new request can be issued.

Assuming an SPU scenario (no cache), will the second "chunk" (8 bytes) always arrive 8 cycles after the first chunk? (since it's part of the 32K DMA request in my example)

For a "normal/separate" memory requests, the second chunk may come later because of another interleaving memory access. Is this accurate ?

And requesting from GDDR might as well take a few million cycles as the GPU is basically lord and master of that pool.

Yap... I remember the exceptionally low GDDR3 read bandwidth.
 
They just tacked it onto a regular SDR controller
Are you sure? That doesn't seem possible, considering DRDRAM doesn't deal with RAS/CAS addressing etc; that's done locally on-die of each RAM chip from what I understand.

adding extra latency serializing and deserializing data and commands
No way to avoid that with a serial-like bus...

on top of that they weren't able to take advantage of some of the more advanced things in RDRAM (like many more open pages).
From what I read at the time, they DID do that, but had to limit the number of open pages due to DRDRAM power draw. Since each chip could dissipate as much as 4W apiece, a full RIMM with lots of open pages would have destroyed itself without active cooling. And who can count on that being generally available in PCs?

at the same time Alpha EV7 had <100ns (but with the memory controller on die) with the same RDRAM memory chips.
While I don't doubt EV7 would have been faster at accessing memory, I can imagine it was due to somewhat different reasons than just plain Intel incompetence... ;) Just for starters, EV7 was a newer and more evolved product than the i850 chipset, aimed at a higher-end piece of the market, likely with more I/O buffers etc. and other advanced features, simply because the R&D budget was bigger and the general price point of the end result was (a lot!) higher. The i850 was a consumer chipset, very cost sensitive!

EV7 was also HIGHLY multichanneled (8, as I recall - which allows some pretty advanced interleaving), and likely only allowed one RIMM per channel, while i850 had only 2 channels with 2 RIMMs apiece, meaning twice as long a signal path and twice the max number of devices. And both the device count and the bus length affect latency from what I understand.
 
Yap... I remember the exceptionally low GDDR3 read bandwidth.
That only applies when using Cell to read GDDR3 memory directly. If you instead have RSX send you the data you want, it transfers at full speed (which may be up to max interface speed; 30+ GB/s as I recall, but more likely limited by GDDR read speed).
 
So it's on-paper 8 bytes every 8 SPU cycles (instead of 4 bytes every 8 cycles)? I used 4 * 8 bits (instead of 16 bits) because of the line:
"In XDR's case, it transfers 8 bits per cycle relative to the DRAM. ..."
This seems to match the 1 byte per node per cycle transfer rate of EIB.
Perhaps that needed more clarity. What I meant was 8 bits per cycle per pin relative to the DRAM (technically not just one pin because it's differential signaling, but you get the idea). An x16 DRAM has a bus 16 "pins" wide. The 8 bits per DRAM clock cycle is along a single pair of wire traces, and since you've got 16 pairs at the DRAM, you get 128 bits per DRAM clock cycle per device. It's the same with GDDR transferring 4 bits per DRAM cycle per pin, but having essentially 128 pins.

Either one can only do 1 bit per clock cycle per pin if you're measuring the clock speed in terms of the *effective* transfer rate (which is why a 4-bit or 8-bit fetch inside the DRAM is not really a contiguous series of bits belonging to the same real byte).

So with an x16 DRAM, you get 8 bits * 16 traces at 400 MHz (or 128 bits) per DRAM clock cycle. With a 64-bit wide controller, you get 64 bits every cycle at 3.2 GHz (or 64 bytes per DRAM cycle at 400 MHz). Either way you figure it, it comes out to the theoretical 25.6 GB/sec of the XDR bus.
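
For anyone who wants to check the arithmetic, a small sketch that reproduces those numbers; the 4 x16 devices, 8 bits per pin per DRAM clock and 400 MHz DRAM clock are the usual public PS3 figures, taken here as assumptions:

[code]
/* Sanity check of the XDR figures above (assumed public PS3 numbers). */
#include <stdio.h>

int main(void)
{
    const double dram_clock_hz            = 400e6;
    const int bits_per_pin_per_dram_clock = 8;   /* XDR data rate per pin */
    const int pins_per_device             = 16;  /* x16 XDR DRAM */
    const int devices                     = 4;   /* 4 devices = 64-bit controller */

    /* Per device: 8 bits * 16 pins = 128 bits per DRAM clock */
    double bits_per_device = (double)bits_per_pin_per_dram_clock * pins_per_device;

    /* Whole pool: 64 bytes per DRAM clock at 400 MHz = 25.6 GB/s */
    double bytes_per_dram_clock = bits_per_device * devices / 8.0;
    double bandwidth_gb_s       = bytes_per_dram_clock * dram_clock_hz / 1e9;

    printf("%.0f bits per device per DRAM clock\n", bits_per_device);
    printf("%.0f bytes per DRAM clock, %.1f GB/s total\n",
           bytes_per_dram_clock, bandwidth_gb_s);
    return 0;
}
[/code]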
 
Ok... so it's 64 bytes per 8 SPU cycles on-paper. Thanks for the explanation. How about the subsequent accesses in the same 32K DMA fetch? Are they guaranteed to come during the next DRAM cycle (i.e., another 8 SPU cycles)? Or can other activities get in the way?

EDIT: Ok, I checked EIB's transfer rate again... it averages 8 bytes per SPU core per cycle (or rather 16 bytes every 2 SPU cycles). So things seem to match up now.
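
A quick cross-check of that "matching up", again taking 3.2 GHz for the Cell clock, 16 bytes every 2 cycles per EIB port, and the 64 bytes per DRAM clock from above as assumptions rather than official numbers:

[code]
/* EIB port rate vs. XDR controller rate, with the assumed figures above. */
#include <stdio.h>

int main(void)
{
    const double cell_hz = 3.2e9;

    double eib_port_gb_s = 16.0 / 2.0 * cell_hz / 1e9; /* 8 B/cycle = 25.6 GB/s */
    double xdr_gb_s      = 64.0 / 8.0 * cell_hz / 1e9; /* 8 B/cycle = 25.6 GB/s */

    printf("EIB per port: %.1f GB/s, XDR controller: %.1f GB/s\n",
           eib_port_gb_s, xdr_gb_s);
    return 0;
}
[/code]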
 
EV7 was also HIGHLY multichanneled (8, as I recall - which allows some pretty advanced interleaving), and likely only allowed one RIMM per channel, while i850 had only 2 channels with 2 RIMMs apiece, meaning twice as long a signal path and twice the max number of devices. And both the device count and the bus length affect latency from what I understand.

Yes, but one extra hop on the chained bus cannot possibly account for 200ns. I blame Intel incompetence :D I'm guessing that the i820, i840 and i850 chipsets all just had that horrible memory hub integrated into the chipset, and the core of the chipset still thought it was talking to regular DRAM.

Note: I am guessing here, but how else do you explain the 100-150 ns extra latency compared to the regular SDR chipsets?

Cheers
 
Yes, but one extra hop on the chained bus cannot possibly account for 200ns.
No, not alone; I wouldn't expect that big a diff from that by itself. Yet EV7 has an on-board memory controller and 4x the memory channels. It's overall newer, so it can afford to do things better and throw more resources at the problem.

I blame Intel incompetence :D
Well, SOME of it is. Or maybe it's just a matter of priorities and general evolution in the semiconductor industry.
 
Guden Oden said:
PS2 never used RIMM-style DRDRAM, it had two memory channels with just one memory device per channel.
Which also resulted in lower latency than DDR/SDR of the time IIRC.
 
I think the most important advantage of XDR in the PS3 is the FlexIO system it is a part of, but I'm not an expert on this. I would guess though that FlexIO and XDR together allow for very efficient data transfers between the SPEs, PPE and other components, and this is the main reason for XDR being on the system.

To add further support, hasn't the PS3 been designed from the ground up to improve the memory accesses/transfers which impede latest-generation processors? A cache miss can result in hundreds of lost cycles waiting for the correct data, and even Itanium processors with large & fast caches spend as much as 80% of their time waiting.

The PS3 has been designed to eliminate this. The CELL's solution for more efficient memory access has also been benchmarked exceeding current processors by almost 2 orders of magnitude. The CELL is able to transfer data in the 100s of GB/s whereas conventional processors are only in the 10s of GB/s.
CELL Broadband Architecture From 20,000 Feet. Dr P Hofstee, Architect, IBM

That white paper makes for a very interesting read.
 
Please, do yourself a favor and never quote white papers on this site. :) They're often nothing but biased PR at best, biased PR drivel at worst.
 
The CELL is able to transfer data in the 100s of GB/s whereas conventional processors are only in the 10s of GB/s.
Where's the memory that feeds 100s of GB/s? :p

Hugely fast internal memory doesn't solve the BW limits of fetching data from RAM, which is the issue with cache misses.
 
Please, do yourself a favor and never quote white papers on this site. They're often nothing but biased PR at best, biased PR drivel at worst.

If you don't measure up against an industry expert (IBM architect) then fobbing off white papers as biased PR is about the only comeback you're capable of.

Where's the memory that feeds 100s of GB/s?

Well said! Only time will reveal limitations, if any.
 
If you don't measure up against an industry expert (IBM architect) then fobbing off white papers as biased PR is about the only comeback you're capable of.
Uhm, I don't know how old you are, or how long you've been interested in the technical side of games, 3D graphics and hardware, but most people learn pretty quickly in my experience that white papers only tell the positive side of whatever product they're 'analyzing' while glossing over or outright skipping any deficiencies it may have.

It is a well-known fact that white papers are as much, if not more, a tool of a company's PR department as of the technical writer listed at the top as the creator. For example, if you were to look up the white papers for the Matrox Mystique (remember that one?), you'd find that they describe the product's complete lack of alpha blending as a positive thing.

Also, you might find that the best approach when joining a new board is not to immediately start insulting other members.

Well said! Only time will reveal limitations, if any.
Perhaps you misunderstood Shifty's comment. He just pointed out a limitation - and incidentally the same kind of limitation Cell and all other processors share, i.e. the performance discrepancy between processors and the memory chips they're attached to...
 
If you don't measure up against an industry expert (IBM architect) then fobbing off white papers as biased PR is about the only comeback you're capable of.

It is an IBM paper trying to peddle an IBM product. If you do not view that with scepticism, you are in the wrong field.

IBM "removed" the memory wall by forcing programs to execute out of local store and on data that is in local store. They did nothing to improve the fundamental latency of the main memory system, quite the contrary.

Relying on developers to transform latency bound problems into bandwidth bound ones is a losing strategy for a CPU architecture in the long run.
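
For illustration, here's a minimal sketch of what that transformation typically looks like on an SPU: double-buffered DMA from main memory into local store, so the MFC fetches the next chunk while the SPU computes on the current one. The buffer size, the process() routine and the effective address are hypothetical, used only for the example; the MFC intrinsics are the ones the Cell SDK provides in spu_mfcio.h.

[code]
/* Minimal double-buffered DMA sketch for an SPU (Cell SDK intrinsics).
   Assumes 'total' is a multiple of CHUNK; process() is a hypothetical
   compute routine standing in for real work. */
#include <spu_mfcio.h>

#define CHUNK 16384  /* 16 KB: the maximum size of a single MFC DMA */

static volatile char buf[2][CHUNK] __attribute__((aligned(128)));

extern void process(volatile char *data, unsigned size);  /* hypothetical */

void stream_from_main_memory(unsigned long long ea, unsigned long long total)
{
    int cur = 0;

    /* Kick off the first transfer (tag 0). */
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);

    for (unsigned long long off = 0; off < total; off += CHUNK) {
        int nxt = cur ^ 1;

        /* Start fetching the next chunk before waiting on the current one. */
        if (off + CHUNK < total)
            mfc_get(buf[nxt], ea + off + CHUNK, CHUNK, nxt, 0, 0);

        /* Block only on the buffer we are about to use. */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();

        process(buf[cur], CHUNK);  /* compute overlaps the in-flight DMA */
        cur = nxt;
    }
}
[/code]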

This does not mean that it does not make sense in the PS3. It just does not make much sense in most other applications/markets.

Cheers
 
Where's the memory that feeds 100s of GB/s? :p

Surely that comes from all the different devices that can communicate with each other over this bus?

Hugely fast internal memory doesn't solve the BW limits of fetching data from RAM, which is the issue with cache misses.

Which is why there are so many different memory areas imho (the local caches, the XDR memory, the GDDR3 memory, etc.). The clever way is to set this up so that they can all efficiently talk to each other.
 
Surely that comes from all the different devices that can communicate with each other over this bus?

As you said, this is a bus, and it's shared (that's the purpose of a bus). On this bus the maximum you will ever get is 64 bits * 3.2 GHz / 8 = 25.6 GB/s.

Edit: You could even have 100000000000000 GB/s of internal transfer in the Cell and it would not help you. You'll be limited to the bus speed - 25.6 GB/s.
 
Ok, let me see if I have this right: current-gen XDR can scale up to ~102 GB/sec bandwidth, correct? PS3 has 1/4th of that.


XDR2 memory scales up to 200 GB/sec, or starts there, but either way it can hit or will be able to hit 200 GB/sec. That's the least I'd expect for PS4's main memory. It's about 8 times more bandwidth, about the same increase as from PS2's main bandwidth (3.2 GB/sec) to PS3's.

Hopefully, XDR2 has even lower latency than XDR. 200 GB/sec (or more) of low-latency memory would help towards making the PS4 an absolutely killer machine.
 