RSX access to XDR latency issues again.

SPM

I found this interesting link relating to GDDR3 vs XDR issues.

http://www.edn.com/article/CA629315.html?ref=nbra

Although servers and supercomputers clearly benefit from high-capacity-memory systems, designers of these and other products must determine an optimal means to obtain added capacity. Adding devices to memory subsystems such as those in servers or graphics cards is conceptually straightforward regardless of the memory technology. DDR, DDR2, and GDDR devices are capable of multidrop topologies with certain limitations.

Multidrop topologies in a memory system are those in which each link of the data bus connects to more than one DRAM device. For DDR2 systems, you can connect as many as four devices on each data link. Because GDDR-family devices usually have higher peak data rates, signal-integrity issues typically prevent more than two connections per link, and even then only if the devices reside in close proximity, such as back to back on opposite sides of a pc board.

XDR offers a slightly different approach to scaling capacities. Although the address and command bus is a multidrop configuration, it can connect to 36 devices in sequence on a channel. One reason the XDR address channel supports more devices than DDR links is that DDR multidrop connections are usually stub topologies instead of sequential connections. Stubs generate reflections, which degrade signal quality; with sequential connections, you can electrically compensate for the added capacitive loading of each device along the channel, thereby minimizing impedance discontinuities and their resulting reflections.

Each data link in an XDR system, however, is routed point to point; that is, each data link connects to only one port on the DRAM and one port on the host controller. However, XDR DRAM devices are programmable in width; for example, you can program a ×16 DRAM to act like a ×8, ×4, or ×2 device. Low-capacity systems program each DRAM wider, with more links connecting to each device. Adding capacity merely involves programming the devices to be narrower and connecting fewer data links to each device.

A 32-bit XDR interface can support as little as 64 Mbytes and as much as 1 Gbyte of memory.
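To make that scaling concrete (my own illustrative sketch, not from the article; the 64-MByte per-device figure is just an assumed density):

```python
# Illustrative sketch: capacity scaling on a 32-bit XDR interface with
# width-programmable devices. The 64-MByte (512-Mbit) per-device density
# is an assumed example value, not a figure from the article.
DATA_LINKS = 32                 # data links on the host interface
DEVICE_MBYTES = 64              # assumed capacity per DRAM device

for width in (16, 8, 4, 2):     # programmed width of each device
    devices = DATA_LINKS // width          # point-to-point: links / width
    capacity = devices * DEVICE_MBYTES     # total MBytes on the channel
    print(f"x{width:<2} parts: {devices:>2} devices -> {capacity:>4} MBytes")
```

With those assumed parts, programming the devices down to x2 reaches the 1-GByte figure; the 64-MByte lower bound would presumably correspond to fewer or smaller-density devices at the widest setting.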

However, peak bandwidth is not the only parameter to consider when optimizing for bandwidth. Remember that efficiency refers to the percentage of a memory system's total aggregate bandwidth that a controller can actually use. Fewer banks and a higher tRC (row-cycle time) in a DRAM device yield more frequent bank conflicts. Bank conflicts drastically reduce the efficiency of a memory system by forcing potentially long periods of inactivity on the data bus. Write-to-read and read-to-write turnarounds also require long periods of inactivity on the data bus.
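As a rough back-of-the-envelope model of the bank-conflict effect (my own simplification, not the article's analysis; the tRC, burst-time, and bank-count numbers are made-up examples):

```python
# Rough model: for a random access stream, the chance that the next request
# lands in the bank still busy from the previous one is ~1/banks, and each
# such conflict idles the data bus for roughly (tRC - burst_time).
# All numbers below are illustrative assumptions, not datasheet values.
def efficiency(banks, trc_ns, burst_ns):
    p_conflict = 1.0 / banks                      # uniform random accesses
    stall_ns = max(trc_ns - burst_ns, 0.0)        # idle time per conflict
    avg_per_access = burst_ns + p_conflict * stall_ns
    return burst_ns / avg_per_access              # fraction of time the bus is busy

for banks in (4, 8, 16):
    print(f"{banks:>2} banks: ~{efficiency(banks, trc_ns=40.0, burst_ns=5.0):.0%} data-bus utilisation")
```

More banks (or a shorter tRC) means fewer conflicts and a busier data bus, which is the article's point.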

Memory systems can experience reduced efficiency even with the data bus active 100% of the time. To keep internal pipelines full, DRAM devices implement a feature called prefetch, which allows the DRAM core to run slower than the DRAM interface. The prefetch of a DRAM technology essentially determines how much data transfers for any given transaction, commonly referred to as access granularity.

In the above example, GDDR3 implements a prefetch of four, and a single transaction would therefore yield 32 bytes of data in a configuration that allows for fine access granularity. Graphics processors work largely with units called triangles, and, as graphics-processor generations mature, each triangle decreases in bytes, because smaller triangles yield more realistic rendered images. A transfer of 32 bytes may be more than necessary to access the triangles needed for a process. For example, if only one 4-byte triangle were necessary from memory, 28 bytes of the corresponding access would go to waste. Even though the bus is active during the entire transfer, the efficiency of the transfer significantly reduces. New memory technologies, such as XDR2, are emerging to further enhance bandwidth efficiency for various applications. Figure 1 shows the effective triangle transfer rate versus triangle size for GDDR3 at 1.6 GHz and XDR2 at 8 GHz. Designers optimizing their memory system for bandwidth must consider both peak bandwidth and efficiency to get the best performance out of their host processor.
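To put rough numbers on the access-granularity point (a sketch, not from the article; only the 32-byte GDDR3 granule comes from the text above, the other granule and payload sizes are hypothetical):

```python
# Access granularity vs. useful payload: every fetch moves a whole granule,
# so small payloads waste most of the transfer. The 32-byte granule matches
# the article's GDDR3 example (prefetch of 4); the others are hypothetical.
import math

def transfer_efficiency(payload_bytes, granule_bytes):
    moved = math.ceil(payload_bytes / granule_bytes) * granule_bytes
    return payload_bytes / moved

for payload in (4, 8, 16, 32):            # e.g. bytes per triangle
    for granule in (32, 16, 8):           # access granularity of the memory system
        eff = transfer_efficiency(payload, granule)
        print(f"{payload:>2}-byte payload, {granule:>2}-byte granule: {eff:.0%} of the burst is useful")
    print()
```

A 4-byte payload out of a 32-byte granule is the 12.5% case described above: 28 of the 32 bytes moved are wasted even though the bus never goes idle.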

It seems that XDR's characteristics are very different from GDDR3's, and that it can work much more efficiently than DDR when going through a number of buffered links (as in Cell's ring bus). In fact, XDR seems to have been selected for Cell for this very reason. I am wondering whether the assumption that latency issues prevent RSX from using the XDR RAM (via the Flex i/o interface and Cell) for texturing in the PS3 is really valid, given that it seems to be based on GDDR3 latency behaviour rather than on XDR and the Flex i/o interface.

The Flex i/o interface is designed to connect two Cell chips together without any additional logic (or more Cell chips with additional logic), so it should be similar in nature to the XDR data paths rather than to the massively parallel GDDR3 interface. Why step out to GDDR3 width and then narrow back down to XDR/Cell ring-bus data width again just to connect Cells together?
 
Teasy said:
What is XDR's sustained latency?

From what I can understand of the article, it looks like the latency involved in the RSX accessing XDR should be similar to the latency involved in the PPE or an SPE accessing XDR. This must be small, otherwise the performance of Cell would be unacceptable.

The problem is that I think people have been assuming the RSX's link to XDR via Flex i/o and the Cell ring bus has the same characteristics as GDDR3. According to the article, XDR has very different characteristics. GDDR and DDR have multi-drop connections, and reflection of the signal at the connection points means you have to wait for the signal to settle down before reading or writing the state of the connections (hence high latency). If you connected GDDR3 interfaces in sequence, the latency would be unacceptable.

According to the article, XDR is always connected very differently - point to point, with a sequence of up to 36 links (as in the ring bus link through Cell), rather than in a multi-drop configuration as with conventional busses. Reflections are eliminated by electronic compensation, hence very low latency and high data rates are possible compared to DDR. So if Flex i/o, and the connections to other Cells and to RSX, use the same type of bus at the same speed to transfer data as XDR (and why shouldn't they? - it would be stupid to widen the bus and lower the data rate to GDDR3 style and then change it back again), the whole topology works like a single XDR bus.
 
Latency (measured in cycles) generally goes up with increasing clock speed, though this may be fully, or almost fully mitigated by the clock speed increase. I don't know if anyone's done any conclusive studies on what pattern there may be between the two.
 
When you say fastest variant, you mean fastest RAM type, and not fastest latency? Because 1.25 ns is obviously the fastest RAM as regards latency, but not necessarily the latency of the fastest RAM, which seems to be the confusion.
 
No, when you ramp up the clock, that leads to increased latency. So the highest clocked (overall fastest) version has the highest latency. Think of it like with DDR-->DDR2, about the same situation.
 
_xxx_ said:
No, when you ramp up the clock, that leads to increased latency. So the highest clocked (overall fastest) version has the highest latency. Think of it like with DDR-->DDR2, about the same situation.
When you say 'no', you mean 'yes'? Because what you've described is what I said ;) That the fastest RAM has the higher latency, and by 'fastest variant' you were talking about the fastest RAM type (clock) and not the fastest latency.
 
LOL, yes... :LOL:

What I meant is that the increase in the clock speed is the cause for higher latency, that's all.
 
_xxx_ said:
No, when you ramp up the clock, that leads to increased latency. So the highest clocked (overall fastest) version has the highest latency. Think of it like with DDR-->DDR2, about the same situation.

DDR and DDR2 modules have the same latency measured in time. Measured in (transfer) cycles, DDR2 obviously has much higher latency because of the much shorter cycle time.

The 1.25 to 3.3ns request packet latency compares to the CAS latency of DDR modules, which can be had as low as 1.5 cycles for DDR-400, or 3.75ns (boutique GEIL modules).

This is pretty far from being the total latency though, since you have latency going through the memory controller twice: first to send the command (address) and second to gather the data (XDR is a serial interface). Then you have the latency when your access misses an open page (but XDR is much better in this regard, with many open pages).

Considering a typical CAS latency of 5-7ns for affordable DDR modules, and how even on a K8 with its on-die memory controller you still see 50+ ns latency (and that's with constant-stride access), the 1.25ns figure doesn't say much.
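To put that in perspective with crude numbers (the ~50 ns figure is the K8 ballpark just mentioned, and the comparison is only illustrative):

```python
# The quoted 1.25-3.33 ns is only the on-device request-packet latency.
# Set against a ~50 ns end-to-end figure (the K8 ballpark mentioned above),
# it is a small slice of what a load actually costs.
TOTAL_NS = 50.0
for device_ns in (1.25, 3.33):
    print(f"device request latency {device_ns:.2f} ns = "
          f"{device_ns / TOTAL_NS:.0%} of a ~{TOTAL_NS:.0f} ns total")
```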

Cheers
 
Reading through what little information Rambus actually has on its website, it does appear that PS3's memory/bus architecture is end-to-end Rambus-designed.

The Flex i/o processor bus is a trademark of Rambus. http://www.rambus.com/products/flexio/index.aspx

The way the components are connected together with point to point links in series rather than tapping off a common parallel bus looks like standard Rambus XDR/Flex i/o interconnect topology, and one of RSX's 128 bit GDDR3 interfaces has been replaced by a Rambus Flex i/o interface for the connection to Cell.

Now, Sony may be capable of hashing up the PS3 memory bus design, but Rambus?
 
3.33 for the fastest variant.
Presumably then, we should be expecting 2.5 in the PS3? That sounds unusual, in spite of the fact that it does divide evenly into the DRAM clock. 3.2 GHz transfer rate implies a DRAM clock of 400 MHz, and 2.5 ns would be 1 cycle at 400 MHz.

Also, given that the fastest clock should be 500 MHz (as the top rated XDR I know of is 4 GHz effective -- "XDR2" notwithstanding), again 3.33 doesn't add up. As it so happens, the slowest rated XDR is 2.4 GHz effective for a 300 MHz DRAM clock, which makes sense with a 3.33 ns latency. So I have to question whether this figure is supposed to be a "total" latency or just a latency between packet requests or something (i.e. that the whole thing is fully pipelined).
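Working the arithmetic through (assuming XDR's eight data transfers per DRAM clock, which is what the 3.2 GHz to 400 MHz relationship above implies):

```python
# XDR moves 8 bits per pin per DRAM clock (octal data rate), so the DRAM
# clock is the effective transfer rate divided by 8, and one DRAM clock
# period in ns follows directly.
for transfer_ghz in (2.4, 3.2, 4.0):
    dram_mhz = transfer_ghz * 1000 / 8        # e.g. 3.2 GHz -> 400 MHz
    period_ns = 1000 / dram_mhz               # one DRAM clock in ns
    print(f"{transfer_ghz:.1f} GHz effective -> {dram_mhz:.0f} MHz DRAM clock "
          f"({period_ns:.2f} ns per clock)")
```

Which is why 3.33 ns lines up with the slowest 2.4 GHz grade rather than the fastest.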

DDR and DDR2 modules have the same latency measured in time. Measured in (transfer) cycles, DDR2 obviously has much higher latency because of the much shorter cycle time.
I'm not sure which speed DDR2 you're comparing against what speed DDR, but DDR2 has worse latency in absolute time as well, since the DRAM itself is clocked at half the speed of a DDR DRAM for the same bus speed (and since the latency in clock cycles is not exactly half, it works out as a net loss). In general, I think a DDR2 DIMM at a bus speed around 566 should have the same absolute latency as a DDR DIMM at 400.
 
Gubbi said:
Since you have latency going through the memory controller twice, first to send the command (address) and second to gather the data (XDR is a serial interface).

But I thought they referred to that exactly?
 
_xxx_ said:
But I thought they referred to that exactly?

No, they refer to the handling of a request packet in an XDR module: from when a request packet appears on the input of the XDR module (ie. get data at <this> address) until data starts coming out. The request packet latency is equivalent to the CAS latency in the DDR world.

Total latency is composed of:
1. Generate the request packet in the CPU's memory controller.
2. Transmit the request packet to the XDR module (serial interface).
3. Wait for the XDR module to start output (that's what's specified above, the 1.25-3.33ns figure)
4. Gather data from XDR module (serial interface), forward critical word first.

On top of that you have the latency introduced by first missing the caches (in the case of the PPE) or firing up a DMA request (in the case of an SPE), which is much more expensive.

If you have multiple modules on an XDR channel, extra latency is incurred transmitting from one module to the next. This increases total latency slightly, but bandwidth remains the same.
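A sketch of how those pieces might stack up (every number below is a placeholder assumption purely to show the structure, not a measured or datasheet value):

```python
# Rough structure of a full XDR read, per the breakdown above. Every value
# is a placeholder assumption to illustrate how the pieces add up; none of
# these are measured or datasheet figures.
read_latency_ns = {
    "generate request in memory controller": 5.0,
    "transmit request packet (serial)":      2.0,
    "XDR request-packet latency":            2.5,   # the 1.25-3.33 ns figure
    "gather data, critical word first":      2.0,
    "extra hop per additional module":       1.0,   # only if modules are chained
}
total = sum(read_latency_ns.values())
for step, ns in read_latency_ns.items():
    print(f"{step:<42} {ns:>5.1f} ns")
print(f"{'total (before cache-miss / DMA overhead)':<42} {total:>5.1f} ns")
```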

Cheers
 
ShootMyMonkey said:
I'm not sure which speed DDR2 you're comparing against what speed DDR, but DDR2 has worse latency in absolute time as well, since the DRAM itself is clocked at half the speed of a DDR DRAM for the same bus speed (and since the latency in clock cycles is not exactly half, it works out as a net loss). In general, I think a DDR2 DIMM at a bus speed around 566 should have the same absolute latency as a DDR DIMM at 400.

Erh, you lost me.

I was thinking of DDR-400 vs DDR2-800. Both have a command rate of 200MHz, and both can be had with similar CAS latency measured in time: 1.5 cycles for DDR, 3 cycles for DDR2, both = 3.75ns.

The internal DRAM array is the same, so latency from that would be the same. But we were talking about request latency, so assume an open bank hit.

Since the critical word is transmitted first, DDR2 has no real latency advantage from the lower transmission time (ie. DDR-400 gets the first word as fast as DDR2-800).
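Making the arithmetic explicit (using the transfer-cycle convention above; the CAS figures are the ones quoted):

```python
# CAS latency in absolute time under the transfer-cycle convention used
# above: cycles are counted at the data-transfer rate, so the period is
# 1 / (transfers per second).
modules = {
    "DDR-400,  CL 1.5 (transfer cycles)": (400e6, 1.5),
    "DDR2-800, CL 3.0 (transfer cycles)": (800e6, 3.0),
}
for name, (transfer_rate, cas_cycles) in modules.items():
    period_ns = 1e9 / transfer_rate
    print(f"{name}: {cas_cycles} x {period_ns:.2f} ns = {cas_cycles * period_ns:.2f} ns")
```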

Cheers
 
Ok, I'm a little confused by the numbers, but I guess a good question would be: why not use all XDR memory? I would suspect timings with the GPU, but I would think NV could make a memory interface to take advantage of it (Flex I/O?). Is it mainly due to XDR cost, or Nvidia not having the time to create a memory controller that would make the most of what XDR could offer? (Or possibly NV coming into the project later?) Wouldn't the PS3 have a more streamlined structure using a unified memory structure, reducing overall latency, so there would be no need for requests from the Cell's (cache) XDR memory pool and then from the current GPU GDDR3 memory pool?

I bring this up because I thought I read that Cell was capable of assisting the GPU under certain conditions (I believe cache and/or XDR) but then had to tap into the GDDR3 memory to do so (or vice versa). Wouldn't this only add to the latency and possible stalls from using two different memory pools? Or does the Flex I/O help minimize these conditions?

This article has brought up more questions for me, but I think I'm over-analyzing how the memory is being used. :)
 
jpr27 said:
Ok, I'm a little confused by the numbers, but I guess a good question would be: why not use all XDR memory? I would suspect timings with the GPU, but I would think NV could make a memory interface to take advantage of it (Flex I/O?). Is it mainly due to XDR cost, or Nvidia not having the time to create a memory controller that would make the most of what XDR could offer? (Or possibly NV coming into the project later?) Wouldn't the PS3 have a more streamlined structure using a unified memory structure, reducing overall latency, so there would be no need for requests from the Cell's (cache) XDR memory pool and then from the current GPU GDDR3 memory pool?

I bring this up because I thought I read that Cell was capable of assisting the GPU under certain conditions (I believe cache and/or XDR) but then had to tap into the GDDR3 memory to do so (or vice versa). Wouldn't this only add to the latency and possible stalls from using two different memory pools? Or does the Flex I/O help minimize these conditions?

This article has brought up more questions for me, but I think I'm over-analyzing how the memory is being used. :)

I think it is because XDR is more expensive. GDDR3 cannot connect to more than 2 devices per link (which is presumably why Xenon accesses GDDR3 via Xenos on the Xbox 360, and why Cell does so via RSX on the PS3). XDR is required for Cell because it connects to the PPE and 7 SPEs, as well as to the RSX or another Cell chip.
 