PS3 XDR bandwidth?

Discussion in 'Console Technology' started by patroclus02, Aug 23, 2006.

  1. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,661
    Likes Received:
    1,114
    Actually it had more to do with the way RDRAM was an afterthought on Intel chipsets. They just tacked it onto a regular SDR controller, adding extra latency serializing and deserializing data and commands, and on top of that they weren't able to take advantage of some of the more advanced things in RDRAM (like many more open pages).

    Intel's 850 chipset had main memory latency above 300ns, at the same time Alpha EV7 had <100ns (but with the memory controller on die) with the same RDRAM memory chips.

    Cheers
     
  2. patsu

    Legend

    Joined:
    Jun 25, 2005
    Messages:
    27,709
    Likes Received:
    145
    ShootMyMonkey, if you have time, how do I interpret your data points say if an SPU initiates a DMA of 32K (XDR and GDDR3) to its Local Store ? How about outbound/write traffic ?

    e.g., upon a DMA request, how long (or how many CPU cycles) before the first byte appears? Subsequently, how many bits/bytes appear per clock?

    EDIT: Oh wait, I forgot about the NDA :(

    Is it correct to assume that:

    + We have to wait 100s of SPU cycles for the first byte to show up

    + 4 * 8 bits every DRAM clock [400MHz], i.e., 4 bytes every 8 SPU cycles when the data arrives

    + A subsequent fetch of 32K can be pipelined at the next DRAM cycle

    ?
     
    #22 patsu, Aug 24, 2006
    Last edited by a moderator: Aug 24, 2006
  3. ShootMyMonkey

    Veteran

    Joined:
    Mar 21, 2005
    Messages:
    1,177
    Likes Received:
    72
    4 * 8 bits? Where does the 4 come from? There are 4 DRAM devices all right, but each of them is an x16 XDR DRAM, so it's 16 bits wide. 4 bytes per controller channel, but the memory controller has two channels. The rest of it is basically true.

    Even relying on the things we all know about Cell and PS3, it's relatively safe to say that a request isn't necessarily immediate as a single SPE may not be the only device making a request at a given moment, or it may have to wait on something like a cache writeback. That particular data point that Rambus lists has more to do with how frequently you can make requests. It's basically 1 DRAM cycle, and since the XDR effective bus speed equals the clock speed of the Cell, that means 8 cycles at the CPU. How soon you actually get back the results of your request is a whole other matter from how frequently a new request can be issued.

    And requesting from GDDR might as well take a few million cycles as the GPU is basically lord and master of that pool.
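
    The "1 DRAM cycle = 8 CPU cycles" relation above can be sanity-checked with a couple of lines of arithmetic (a sketch using the published PS3 clocks; variable names are just illustrative):

    ```python
    # Request issue-rate sketch: XDR DRAM core clock vs. Cell core clock.
    cell_clock_hz = 3.2e9      # Cell/SPU core clock
    xdr_dram_clock_hz = 400e6  # XDR DRAM core clock (octal data rate -> 3.2 Gbps/pin)

    # One new request slot per DRAM cycle, expressed in CPU cycles:
    cpu_cycles_per_request_slot = cell_clock_hz / xdr_dram_clock_hz
    print(cpu_cycles_per_request_slot)  # 8.0
    ```

    Note this is the issue rate only; as the post says, how soon the data actually comes back is a separate question.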
     
  4. patsu

    Legend

    Joined:
    Jun 25, 2005
    Messages:
    27,709
    Likes Received:
    145
    So on paper it's 8 bytes every 8 SPU cycles (instead of 4 bytes every 8 cycles)? I used 4 * 8 bits (instead of 16 bits) because of the line:
    "In XDR's case, it transfers 8 bits per cycle relative to the DRAM. ..."
    This seems to match the 1 byte per node per cycle transfer rate of EIB.

    Assuming an SPU scenario (no cache), will the second "chunk" (8 bytes) always arrive 8 cycles after the first chunk? (since it's part of the 32K DMA request in my example)

    For "normal"/separate memory requests, the second chunk may come later because of another interleaved memory access. Is this accurate?

    Yap... I remember the exceptionally low GDDR3 read bandwidth.
     
    #24 patsu, Aug 24, 2006
    Last edited by a moderator: Aug 24, 2006
  5. Guden Oden

    Guden Oden Senior Member
    Legend

    Joined:
    Dec 20, 2003
    Messages:
    6,201
    Likes Received:
    91
    Are you sure? That doesn't seem possible, considering DRDRAM doesn't deal with RAS/CAS addressing etc; that's done locally on-die of each RAM chip from what I understand.

    No way to avoid that with a serial-like bus...

    From what I read at the time, they DID do that, but had to limit the number of open pages due to DRDRAM power draw. Since each chip could dissipate as much as 4W apiece, a full RIMM would have destroyed itself without active cooling. And who can count on that being generally available in PCs?

    While I don't doubt EV7 would have been faster at accessing memory, I can imagine it was due to somewhat other reasons than just plain Intel incompetence... ;) Just for starters, EV7 was a newer and more evolved product than the i850 chipset, aimed at a higher-end piece of the market, likely with more I/O buffers and other advanced features, simply because the R&D budget was bigger and the general price point of the end result was (a lot!) higher. The i850 was a consumer chipset, very cost sensitive!

    EV7 was also HIGHLY multichanneled (8 channels, as I recall, which allows some pretty advanced interleaving), and likely only allowed one RIMM per channel, while the i850 had only 2 channels with 2 RIMMs apiece, meaning twice as long a signal path and twice the maximum number of devices. Both the device count and the bus length affect latency, from what I understand.
     
  6. Guden Oden

    Guden Oden Senior Member
    Legend

    Joined:
    Dec 20, 2003
    Messages:
    6,201
    Likes Received:
    91
    That only applies when using Cell to read GDDR3 memory directly. If you instead have RSX send you the data you want, it transfers at full speed (which may be up to max interface speed; 30+ GB/s as I recall, but more likely limited by GDDR read speed).
     
  7. ShootMyMonkey

    Veteran

    Joined:
    Mar 21, 2005
    Messages:
    1,177
    Likes Received:
    72
    Perhaps that needed more clarity. What I meant was 8 bits per cycle per pin relative to the DRAM (technically not just one pin, because it's differential signaling, but you get the idea). An x16 DRAM has a 16-"pin"-wide bus. The 8 bits per DRAM clock cycle is along a single pair of wire traces, and since you've got 16 pairs at the DRAM, you get 128 bits per DRAM cycle. It's the same with GDDR transferring 4 bits per DRAM cycle per pin, but having essentially 128 pins.

    Either one can only do 1 bit per clock cycle per pin if you're measuring the clock speed in terms of the *effective* transfer rate (which is why a 4-bit or 8-bit fetch inside the DRAM is not really a contiguous series of bits belonging to the same real byte).

    So with an x16 DRAM, you get 8bits * 16 traces at 400 MHz (or 128 bits) per DRAM clock cycle. With a 64-bit wide controller, you get 64 bits every cycle at 3.2 GHz (or 64 bytes per DRAM cycle at 400 MHz). Either way you figure it, it comes out to the theoretical 25.6 GB/sec of the XDR bus.
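
    The two ways of counting described above can be checked against each other (a sketch of the arithmetic only; the device/channel breakdown is as stated in the posts above):

    ```python
    # Two equivalent routes to the PS3 XDR peak of 25.6 GB/s.

    # Device view: 4 x16 XDR DRAMs, each pin moving 8 bits per 400 MHz DRAM clock.
    devices = 4
    pins_per_device = 16
    bits_per_pin_per_dram_clock = 8
    dram_clock_hz = 400e6
    bw_device_view = (devices * pins_per_device *
                      bits_per_pin_per_dram_clock * dram_clock_hz) / 8  # bytes/s

    # Controller view: a 64-bit interface at the 3.2 GHz effective transfer rate.
    bus_width_bits = 64
    effective_rate_hz = 3.2e9
    bw_controller_view = (bus_width_bits / 8) * effective_rate_hz  # bytes/s

    print(bw_device_view / 1e9, bw_controller_view / 1e9)  # 25.6 25.6
    ```

    Either way you count it, the figures agree with the 25.6 GB/s theoretical peak quoted in the post.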
     
  8. patsu

    Legend

    Joined:
    Jun 25, 2005
    Messages:
    27,709
    Likes Received:
    145
    Ok... so it's 64 bytes per 8 SPU cycles on paper. Thanks for the explanation. How about the subsequent accesses in the same 32K DMA fetch? Are they guaranteed to come during the next DRAM cycle (i.e., another 8 SPU cycles)? Or can other activities get in the way?

    EDIT: Ok, I checked EIB's transfer rate again... it averages 8 bytes per SPU core per cycle (or rather 16 bytes every 2 SPU cycles). So things seem to match up now.
     
    #28 patsu, Aug 24, 2006
    Last edited by a moderator: Aug 24, 2006
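    The "things match up" observation can be checked with the commonly published EIB figures (a sketch; the half-clock and 16-byte ring width are the usual Cell numbers, not stated in this thread):

    ```python
    # EIB single-transfer rate check, using commonly published Cell figures.
    cell_clock_hz = 3.2e9
    eib_clock_hz = cell_clock_hz / 2  # EIB runs at half the core clock
    ring_width_bytes = 16             # each EIB ring is 16 bytes wide

    # One transfer: 16 bytes per EIB cycle == 16 bytes every 2 SPU cycles.
    per_transfer_bw = ring_width_bytes * eib_clock_hz
    print(per_transfer_bw / 1e9)  # 25.6 -- a single EIB transfer matches the XDR peak
    ```

    So a single EIB transfer can indeed keep pace with the full XDR bus, which is why the rates line up.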
  9. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,661
    Likes Received:
    1,114
    Yes, but one extra hop on the chained bus cannot possibly account for 200ns. I blame Intel incompetence :D I'm guessing that the i820, i840 and i850 chipsets all just had that horrible memory hub integrated into the chipset, and the core of the chipset still thought it was talking to regular DRAM.

    Note: I am guessing here, but how else do you explain the 100-150 ns extra latency compared to the regular SDR chipsets?

    Cheers
     
  10. Guden Oden

    Guden Oden Senior Member
    Legend

    Joined:
    Dec 20, 2003
    Messages:
    6,201
    Likes Received:
    91
    No, not alone; I wouldn't expect that big a diff. Yet EV7 has an on-board memory controller and 4x the memory channels. It's overall newer, so it can afford to do things better and throw more resources at the problem.

    Well, SOME of it is. Or maybe it's just a matter of priorities and general evolution in the semiconductor industry.
     
  11. Fafalada

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    2,773
    Likes Received:
    49
    Which also resulted in lower latency than DDR/SDR of the time, IIRC.
     
  12. flec04

    Newcomer

    Joined:
    Aug 26, 2006
    Messages:
    17
    Likes Received:
    0
    To add further support: hasn't the PS3 been designed from the ground up to improve the memory accesses/transfers that impede latest-generation processors? A cache miss can result in hundreds of lost cycles waiting for the correct data, and even Itanium processors with large & fast caches spend as much as 80% of their time waiting.

    The PS3 has been designed to eliminate this. The CELL's solution for more efficient memory access has also been benchmarked exceeding current processors by almost two orders of magnitude. The CELL is able to transfer data in the 100s of GB/s, whereas conventional processors are only in the 10s of GB/s.
    CELL Broadband Architecture From 20,000 Feet. Dr P Hofstee, Architect, IBM

    That white paper makes for a very interesting read.
     
  13. Guden Oden

    Guden Oden Senior Member
    Legend

    Joined:
    Dec 20, 2003
    Messages:
    6,201
    Likes Received:
    91
    Please, do yourself a favor and never quote white papers on this site. :) They're often nothing but biased PR at best, biased PR drivel at worst.
     
  14. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    44,106
    Likes Received:
    16,898
    Location:
    Under my bridge
    Where's the memory that feeds 100s of GB/s? :p

    Hugely fast internal memory doesn't solve the BW limits of fetching data from RAM, which is the issue with cache misses.
     
  15. flec04

    Newcomer

    Joined:
    Aug 26, 2006
    Messages:
    17
    Likes Received:
    0
    If you can't measure up against an industry expert (an IBM architect), then fobbing off white papers as biased PR is about the only comeback you're capable of.

    Well said! Only time will reveal limitations, if any.
     
  16. Guden Oden

    Guden Oden Senior Member
    Legend

    Joined:
    Dec 20, 2003
    Messages:
    6,201
    Likes Received:
    91
    Uhm, I don't know how old you are, or how long you've been interested in the technical side of games, 3D graphics and hardware, but most people learn pretty quick in my experience that white papers only tell the positive side of whatever product they're 'analyzing' while glossing over or outright skipping any deficiencies it may have.

    It is a well-known fact that white papers are as much, if not more, a tool of a company's PR department as of the technical writer listed at the top as the creator. For example, if you were to look up the white papers for the Matrox Mystique (remember that one?), you'd find that they describe the product's complete lack of alpha blending as a positive thing.

    Also, you might find that the best approach when joining a new board is not to immediately start insulting other members.

    Perhaps you misunderstood Shifty's comment. He just pointed out a limitation - and incidentally the same kind of limitation Cell and all other processors share, ie the performance discrepancy between processors and the memory chips they're attached to...
     
  17. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,661
    Likes Received:
    1,114
    It is an IBM paper trying to peddle an IBM product. If you do not view that with scepticism, you are in the wrong field.

    IBM "removed" the memory wall by forcing programs to execute out of local store and on data that is in local store. They did nothing to improve the fundamental latency of the main memory system, quite the contrary.

    Relying on developers to transform latency bound problems into bandwidth bound ones is a losing strategy for a CPU architecture in the long run.

    This does not mean that it does not make sense in the PS3. It just does not make much sense in most other applications/markets.

    Cheers
     
  18. Arwin

    Arwin Now Officially a Top 10 Poster
    Moderator Legend

    Joined:
    May 17, 2006
    Messages:
    18,762
    Likes Received:
    2,639
    Location:
    Maastricht, The Netherlands
    Surely that comes from all the different devices that can communicate with each other over this bus?

    Which is why there are so many different memory areas imho (the local caches, the XDR memory, the GDDR3 memory, etc.). The clever way is to set this up so that they can all efficiently talk to each other.
     
  19. russo121

    Regular

    Joined:
    Aug 27, 2003
    Messages:
    283
    Likes Received:
    4
    As you said, this is a bus, and it's shared (that's the purpose of a bus). On this bus the maximum you will ever get is 64 bits * 3.2 GHz / 8 = 25.6 GB/s.

    Edit: You could even have 100000000000000 GB/s of internal transfer in the Cell and it would not help you. You'll be limited to the bus speed: 25.6 GB/s.
     
    #39 russo121, Aug 27, 2006
    Last edited by a moderator: Aug 27, 2006
  20. Megadrive1988

    Veteran

    Joined:
    May 30, 2002
    Messages:
    4,723
    Likes Received:
    242
    Ok, let me see if I have this right: current-gen XDR can scale up to ~102 GB/sec of bandwidth, correct? PS3 has 1/4 of that.


    XDR2 memory scales up to 200 GB/sec, or starts there; either way, it can hit or will be able to hit 200 GB/sec. That's the least I'd expect for PS4's main memory. It's about 8 times more bandwidth, about the same increase as from PS2's main bandwidth (3.2 GB/sec) to PS3's.

    Hopefully, XDR2 has even lower latency than XDR. 200 GB/sec (or more) of low-latency memory would help towards making the PS4 an absolutely killer machine.
     
    #40 Megadrive1988, Aug 27, 2006
    Last edited by a moderator: Aug 27, 2006
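    The scaling arithmetic in the post above checks out, taking its figures at face value (a sketch; the 102.4 and 200 GB/s numbers are the post's claims about the XDR/XDR2 specs, not independently verified here):

    ```python
    # Bandwidth-scaling arithmetic from the post, taken at face value.
    ps2_main_bw = 3.2    # GB/s (PS2 main memory)
    ps3_main_bw = 25.6   # GB/s (PS3 XDR)
    xdr_peak_bw = 102.4  # GB/s (claimed XDR spec ceiling; PS3 uses a quarter)
    xdr2_bw = 200.0      # GB/s (hoped-for XDR2 figure)

    print(ps3_main_bw / ps2_main_bw)  # 8.0  -- the PS2 -> PS3 jump
    print(ps3_main_bw / xdr_peak_bw)  # 0.25 -- PS3 uses 1/4 of XDR's ceiling
    print(xdr2_bw / ps3_main_bw)      # ~7.8 -- the roughly 8x jump hoped for PS4
    ```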