Surprised no one has mentioned this

elimc

Newcomer
What kind of performance decrease will we see when comparing DDRI and DDRII at the same clock speed? With the memory array running at half speed and the burst length being increased to four, one would think that DDRII would be slower on a clock vs. clock comparison. So if the NV30 does use DDRII at 1GHz on a 256-bit wide bus, the performance increase over the Radeon 9700 might not be as large as one might think.

Also, I've heard that minimum burst length and initial access latency are fairly important for graphics card memory. DDRII would make these worse, I would think?
 
elimc said:
What kind of performance decrease will we see when comparing DDRI and DDRII at the same clock speed? With the memory array running at half speed and the burst length being increased to four, one would think that DDRII would be slower on a clock vs. clock comparison. So if the NV30 does use DDRII at 1GHz on a 256-bit wide bus, the performance increase over the Radeon 9700 might not be as large as one might think.

Also, I've heard that minimum burst length and initial access latency are fairly important for graphics card memory. DDRII would make these worse, I would think?

No, from what I understand it will have a CAS latency of 1, which is faster than most DDRI specs.
 
Some in-depth specs of DDR II... ATI has also included DDR II support in the R300, so I assume both companies have a reason for this.

[Attached image: ddr22.gif (DDR II specifications)]
 
Multiplexing the bus by 4 as done in DDR-II will increase first word latency due to the need to transfer data across different clock domains within the RAM chip - I would expect this issue to add about 1 cycle to the CAS latency (which is already typically about 4 or 5 for the DDR-I's used on high-end graphics cards now). Also, the increase of minimum burst length from 2 to 4 will eat up some of the performance advantage that a crossbar memory controller gives.

On the other side, DDR-II supports fast read<->write turnarounds (1 cycle, as opposed to 1 cycle + 1 CAS latency for DDR-I), which will to some extent offset the other disadvantages of DDR-II.

I would expect that at any given clock speed, DDR-II will be slightly (<5%) slower than DDR-I for typical GPU access patterns and moderately faster for more CPU-like access patterns. Also, I'd expect DDR-II to max out at about 50% higher clock speeds than DDR-I.
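
To make the trade-off a bit more concrete, here is a rough back-of-the-envelope model in Python. The CAS latencies, minimum burst lengths and turnaround penalties are just the assumptions stated above, not datasheet values, and the access streams are made up, so treat the output as illustration only:

```python
# Rough model of the latency/burst/turnaround trade-off described above.
# All numbers are assumptions taken from the post, not real specs:
#   DDR-I : CAS 5, minimum burst 2 words, read<->write turnaround = 1 + CAS
#   DDR-II: CAS 6 (one extra cycle), minimum burst 4 words, turnaround = 1
# A DDR bus moves 2 words per clock, so a burst of B words takes B/2 clocks.

def cycles(stream, cas, min_burst, turnaround, words_per_access):
    total = cas                              # one initial access latency
    prev = None
    for op in stream:                        # op is 'r' or 'w'
        burst = max(words_per_access, min_burst)
        total += burst // 2                  # data clocks for this burst
        if prev is not None and op != prev:
            total += turnaround              # bus direction change
        prev = op
    return total

stream = ['r'] * 6 + ['w'] * 2               # mostly reads, one turnaround

for words in (2, 8):                         # small GPU-style vs cache-line-style
    d1 = cycles(stream, cas=5, min_burst=2, turnaround=1 + 5, words_per_access=words)
    d2 = cycles(stream, cas=6, min_burst=4, turnaround=1, words_per_access=words)
    print(f"{words}-word accesses: DDR-I {d1} clocks, DDR-II {d2} clocks")
```

With small accesses the forced 4-word bursts cost DDR-II more than the cheap turnarounds win back; with large, cache-line-sized accesses the situation reverses.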
 
Excellent, arjan.

arjan de lumens said:
Also, I'd expect DDR-II to max out at about 50% higher clock speeds than DDR-I.
This is really the main point to take home. Clock-for-clock comparisons are not too interesting when the changes made to DDR-II are largely there to ensure that it can scale higher in frequency.

But could you please go into your comments about the crossbar controller? I can't really see how it would make a crossbar controller specifically more inefficient, and aren't you further assuming that the implementation is unchanged? More interestingly, could you please go more in depth on your assumptions behind "typical GPU access patterns", even though we are discussing angels on the head of a pin?


Entropy
 
The main point of having a crossbar memory controller is that it lets you perform large numbers of small memory accesses rather than few large ones. By doing lots of small transfers rather than few big ones, you can avoid accessing memory you don't really need to access, making more efficient use of the bandwidth that is available.

Now, if we assume that the memory controller configuration is kept unchanged other than going from DDR-I to DDR-II, the minimum-size block of memory that you can access doubles, corresponding to the doubling of the minimum burst length. The result: you end up wasting about 2 to 2.5 times as much bandwidth accessing data you don't need. If you start out with an N-way crossbar for DDR-I memory, you will need at least a 2N-way crossbar for the DDR-II memory to counteract this effect. Otherwise you lose some efficiency and thus performance.
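
Rough numbers, purely for illustration (the 256-bit bus, the controller counts and the 16-byte useful access size are assumptions, not figures for any actual chip):

```python
# Minimum access granularity = per-controller bus width x minimum burst length.
# Shows why doubling the burst length asks for twice as many controllers.

BUS_BITS = 256

def min_transfer_bytes(n_controllers, burst_len):
    """Smallest chunk one controller can move: width x minimum burst."""
    width_bits = BUS_BITS // n_controllers
    return width_bits * burst_len // 8

def wasted_fraction(useful_bytes, n_controllers, burst_len):
    granule = min_transfer_bytes(n_controllers, burst_len)
    fetched = -(-useful_bytes // granule) * granule   # round up to whole granules
    return (fetched - useful_bytes) / fetched

useful = 16  # e.g. a small compressed Z/colour block

for label, n, burst in [("DDR-I,  4-way crossbar", 4, 2),
                        ("DDR-II, 4-way crossbar", 4, 4),
                        ("DDR-II, 8-way crossbar", 8, 4)]:
    g = min_transfer_bytes(n, burst)
    w = wasted_fraction(useful, n, burst)
    print(f"{label}: granule {g:2d} bytes, wasted {w:.0%} of a {useful}-byte access")
```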

As for GPU vs CPU access patterns: GPUs tend to perform lots of rather different accesses to frame/Z/P-buffers, textures, vertex arrays, etc., with many of these accesses benefiting greatly from the ability to access small memory blocks (especially if you do frame or Z buffer compression); whereas when a CPU accesses memory, it is almost always in nice, large, cache-line-size blocks.
 
arjan de lumens said:
As for GPU vs CPU access patterns: GPUs tend to perform lots of rather different accesses to frame/Z/P-buffers, textures, vertex arrays, etc., with many of these accesses benefiting greatly from the ability to access small memory blocks (especially if you do frame or Z buffer compression); whereas when a CPU accesses memory, it is almost always in nice, large, cache-line-size blocks.

Of course almost all CPU accesses to main memory are cache lines (in fact L2 cache lines, which in the P4 are rather large: 128 bytes). Only the L2 reads main memory; the CPU reads the L1, and the L1 reads the L2 (unless you think that disabling the cache will make the program run faster ;) ). Of course, writes to uncached lines can go to main memory with write-through, but I would assume those are rare.

Which makes me wonder how the caches in a GPU work and what data is cached. Accesses from the GPU caches should affect the granularity of video memory accesses. With small lines, the overhead of the tag table (compared with the cache size) would be rather large.
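
Quick arithmetic behind that last remark, with made-up sizes (a 32 KB direct-mapped cache over a 256 MB address space; real GPU caches are obviously organized differently):

```python
# Tag storage relative to data storage for different line sizes.

CACHE_BYTES = 32 * 1024
ADDR_BITS = 28          # enough to address 256 MB of video memory

for line_bytes in (8, 16, 32, 64, 128):
    lines = CACHE_BYTES // line_bytes
    index_bits = lines.bit_length() - 1          # log2(number of lines)
    offset_bits = line_bytes.bit_length() - 1    # log2(line size)
    tag_bits = ADDR_BITS - index_bits - offset_bits
    overhead = tag_bits * lines / (CACHE_BYTES * 8)
    print(f"{line_bytes:3d}-byte lines: {tag_bits} tag bits/line, "
          f"tag storage = {overhead:.1%} of data storage")
```

With 8-byte lines the tags eat around a fifth as much storage as the data itself; with 128-byte lines the overhead is closer to one percent.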
 
arjan de lumens said:
The main point of having a crossbar memory controller is that it lets you perform large numbers of small memory accesses rather than few large ones. By doing lots of small transfers rather than few big ones, you can avoid accessing memory you don't really need to access, making more efficient use of the bandwidth that is available.

You are correct about multiple memory controllers, but what most people get caught up in is nvidia's marketing. You don't need a crossbar controller for this, just multiple memory controllers. Both the 9700 and Parhelia use 4 memory controllers as well, although theirs are 64 bits each instead of 32 bits. The advantage of 32 bits is that bursts per controller are longer, so page breaks can be hidden more effectively.
 
RoOoBo said:
Which makes me wonder how the caches in a GPU work and what data is cached. Accesses from the GPU caches should affect the granularity of video memory accesses. With small lines, the overhead of the tag table (compared with the cache size) would be rather large.

You could set up caches such that each cache line as a whole is rather large, but the line is split up into several sub-lines, each with only a valid bit to indicate whether the data in the sub-line is valid according to the cache line tag, and then only fetch sub-lines rather than whole lines. IIRC, the AMD K6 processor did something like this for its L1 caches (64-byte lines, 32-byte sub-lines), and I would not be surprised if GF4 and R300 use similar schemes to avoid fetching unneeded sub-lines as well for their caches.
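
A toy model of such a sectored cache might look roughly like this (the 64/32-byte sizes and the direct-mapped organization are just for illustration, not how any actual GPU cache is built):

```python
# Sectored cache sketch: one tag per 64-byte line, one valid bit per
# 32-byte sub-line, and only the sub-line that misses is fetched.

LINE_BYTES, SUB_BYTES = 64, 32
SUBS_PER_LINE = LINE_BYTES // SUB_BYTES

class SectoredCache:
    def __init__(self, n_lines, fetch_subline):
        self.n_lines = n_lines
        self.tags = [None] * n_lines          # one tag per full line
        self.valid = [[False] * SUBS_PER_LINE for _ in range(n_lines)]
        self.fetch = fetch_subline            # callback: fetch 32 bytes from DRAM

    def read(self, addr):
        line_addr = addr // LINE_BYTES
        index = line_addr % self.n_lines      # direct-mapped for simplicity
        sub = (addr % LINE_BYTES) // SUB_BYTES
        if self.tags[index] != line_addr:     # line miss: new tag, all sub-lines invalid
            self.tags[index] = line_addr
            self.valid[index] = [False] * SUBS_PER_LINE
        if not self.valid[index][sub]:        # sub-line miss: fetch only 32 bytes
            self.fetch(line_addr * LINE_BYTES + sub * SUB_BYTES)
            self.valid[index][sub] = True

fetched = []
cache = SectoredCache(n_lines=128, fetch_subline=fetched.append)
for a in (0, 8, 40, 72):                      # touches 2 lines, 3 sub-lines
    cache.read(a)
print(fetched)                                # -> [0, 32, 64], not whole 64-byte lines
```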
 
3dcgi said:
You are correct about multiple memory controllers, but what most people get caught up in is nvidia's marketing. You don't need a crossbar controller for this, just multiple memory controllers. Both the 9700 and Parhelia use 4 memory controllers as well. Although theirs are 64 bits each instead of 32 bits. The advantage of 32 bits is there will be longer bursts per controller so page breaks can be hidden more effectively.

A "crossbar switch" is a switch where all inputs can be connected to all outputs, like this. For a "crossbar memory controller", you will have as "inputs" all the units that can access memory (framebuffer cache, texture cache, etc) and as "outputs" all the actual raw DRAM controllers themselves. The only way you can have multiple memory controllers without having a "crossbar memory controller" is when each memory controller is dedicated to one or a limited number of tasks (like, say, one controller for framebuffer data, another one for texture data, etc. like in e.g. the Wildcat III series of cards).

Also, in the case of Parhelia, the documentation available indicates that it has only 1 big memory controller - it has 4 sets of memory address/control pins, but these are always run in lockstep. Otherwise it would have had multiple 64 or 128 bit buses internally instead of the single fat 512-bit bus it has today.
 