Recent Radeon X1K Memory Controller Improvements in OpenGL with AA

Bouncing Zabaglione Bros. said:
The new memory controller has just become ATI's (no longer secret) superweapon. The potential for strongly improving performance in all sorts of places has just become pretty significant.

It only has so much headroom though, so I wouldn't expect wonders. I assume it'll be some 10% more at most in the end.
 
_xxx_ said:
It only has so much headroom though, so I wouldn't expect wonders. I assume it'll be some 10% more at most in the end.

Anything more is obviously for the better. I'd urge anybody who considers testing any of those improvements to compare the standard available timedemos with custom ones, to rule out any weird possibilities.
 
Our "Turkey Baster" timedemo is very custom, and becuase it was designed to show up 512MB board performances it specifically goes through nearly an entire level to pick up areas of swapping of textures.
 
Ailuros said:
Anything more is obviously for the better. I'd urge anybody who considers testing any of those improvements to compare the standard available timedemos with custom ones, to rule out any weird possibilities.

Sure. Just wanted to say, people should not expect another 30% jump or two :)
 
_xxx_ said:
It only has so much headroom though, so I wouldn't expect wonders. I assume it'll be some 10% more at most in the end.
It's already shown more than that on some games on ATI's first attempt. Given the memory bottleneck we've been seeing for a while, it could be quite an advantage if the memory controller can monitor itself and change its operation on the fly to maximise bandwidth use, as Sireric discussed. A memory controller that can reconfigure itself depending on what kind of game or what kind of scene it's rendering could be a significant improvement if it means your chip is more highly utilised.

I wouldn't turn down an extra 10-20 percent performance just from a smart memory controller.
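As a very rough mental model of what a self-monitoring controller could do (purely illustrative, with invented thresholds and policy names; not a description of ATI's actual hardware or driver interface):

```python
# Purely illustrative toy, NOT ATI's actual design: a controller that samples
# its own traffic over a window and switches arbitration policy accordingly.

def pick_policy(stats):
    """stats: counters gathered over the last sampling window."""
    requests = max(stats["requests"], 1)
    row_hit_rate = stats["row_hits"] / requests
    rw_switch_rate = stats["rw_switches"] / requests

    if row_hit_rate > 0.6:
        return "open-page"            # lots of locality: keep rows open
    if rw_switch_rate > 0.3:
        return "batch-reads-writes"   # group reads/writes to cut turnaround stalls
    return "close-page"               # scattered traffic: precharge early

# Example window: 70% row hits -> favour the open-page policy.
print(pick_policy({"requests": 1000, "row_hits": 700, "rw_switches": 50}))
```

The per-game "programmability" would then just amount to exposing thresholds like these to the driver, which is broadly what per-application tuning implies.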
 
Dave Baumann said:
Our "Turkey Baster" timedemo is very custom, and becuase it was designed to show up 512MB board performances it specifically goes through nearly an entire level to pick up areas of swapping of textures.

I still want to see tests from as many different levels as possible, because I've fallen on my nose more than once lately by not checking enough areas in a game. That case was actually TAA related, but it does me no good if performance is fine in the majority of maps yet proves lackluster in a minority of instances. NV claimed a 10-15% performance drop for TAA; if the application/map is highly CPU bound, yes of course, otherwise....
 
TAA is naturally going to be very dependent on the number of alpha textures used - scenes that are primarily built of opaque textures are going to behave just like MSAA, while in scenes that have lots of alphas (such as a forest area in Far Cry, for instance) you're basically performing the same as SSAA. A feature like this is going to be very variable in performance according to the composition of the scene; MSAA much less so.
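A crude way to picture that variability (a toy cost model with made-up weights, not measured numbers): treat opaque coverage as paying roughly MSAA cost and alpha-tested coverage as paying roughly SSAA cost, and blend by the scene's alpha fraction.

```python
# Toy TAA cost model, illustrative only: opaque coverage pays ~MSAA cost,
# alpha-tested coverage pays ~SSAA cost; blend by the scene's alpha fraction.

def relative_taa_cost(alpha_fraction, msaa_cost=1.0, ssaa_cost=4.0):
    return (1 - alpha_fraction) * msaa_cost + alpha_fraction * ssaa_cost

for alpha in (0.0, 0.1, 0.5):
    print(f"{alpha:.0%} alpha content -> {relative_taa_cost(alpha):.1f}x the MSAA-only cost")
```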
 
Jawed said:
R520's 32-bit channels make the minimum memory access half the size of previous GPUs'. But so far we have no clear description of the specific benefits that affords, particularly as there is (seemingly) a general push within R520 to "bulk up" memory accesses to make the best use of the latency-tolerant clients (texture pipes, RBEs, Hierarchical-Z ...). There must be certain kinds of tasks that thrive on these very small memory accesses, but what are they?

Halving the minimum access is one possibility. But if you've built up the rest of the system to handle higher latencies, why not keep the same sized accesses? Those would now keep a smaller channel occupied twice as long. You don't mind a little longer latency (because of upstream design changes), you keep the data bus moving data a higher percentage of the time (reducing command overhead), and you can be doing something for twice as many blocks on the chip at the same time.

DRAM cores increase in speed a lot slower than the busses that feed them. If you look at various spec sheets you'll see that absolute access times have not really changed (or even gotten worse), while clock speeds (and thus bandwidth) have increased quite a bit. A 250 MHz DDR part with a CL of 3 starts giving you data back in 12 ns. A top-end 800 MHz GDDR3 part requires a CL of 11, giving you data back in 13.75 ns. If you make just one small access, you're wasting a lot of cycles on your data bus. Instead you've got to work harder to find larger blocks of consecutive requests to keep those high bandwidth data pins humming.
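For concreteness, here is the arithmetic behind those two figures and the bus-occupancy point; it is a simplified view that treats each access as isolated, ignoring how pipelined back-to-back requests can hide CAS latency.

```python
# Re-deriving the latency figures above: time to first data = CL / clock.
# The "isolated access" utilisation below is a simplification.

parts = {
    "250 MHz DDR,   CL 3":  (250e6, 3),
    "800 MHz GDDR3, CL 11": (800e6, 11),
}

BURST4_DATA_CLOCKS = 2  # a burst of 4 occupies the DDR data pins for 2 clocks

for name, (clock_hz, cl) in parts.items():
    first_data_ns = cl / clock_hz * 1e9
    util = BURST4_DATA_CLOCKS / (cl + BURST4_DATA_CLOCKS)
    print(f"{name}: first data after {first_data_ns:.2f} ns, "
          f"an isolated burst of 4 uses the bus {util:.0%} of the time")
```

That prints 12.00 ns versus 13.75 ns, and shows the higher-clocked part spending a much smaller fraction of its bus cycles transferring data if accesses are not grouped.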
 
I have been benchmarking the X1800XL and X1600XT in Doom III (v1.3) and Riddick (v1.1) with and without the OGL fix. The X1600XT does not really seem to react to the fix. I don't know if there are some registry bugs from changing from the X1800XL to the X1600XT without "re-ghosting" the rig or something like that, but the X1600 card did not show any difference. Can somebody confirm that?

The X1800XL reacted as reported; 2xAA shows no difference or a slight decrease in performance (1-2 FPS), and the same goes for 6xAA at 1024x768, 1280x1024 and 1600x1200. With 4xAA I saw boosts ranging from 9% to 13% in Riddick and 9% to 15% in Doom III.
 
CK said:
I have been benchmarking the X1800XL and X1600XT in Doom III (v1.3) and Riddick (v1.1) with and without the OGL fix. The X1600XT does not really seem to react to the fix. I don't know if there are some registry bugs from changing from the X1800XL to the X1600XT without "re-ghosting" the rig or something like that, but the X1600 card did not show any difference. Can somebody confirm that?

The X1800XL reacted as reported; 2xAA shows no difference or a slight decrease in performance (1-2 FPS), and the same goes for 6xAA at 1024x768, 1280x1024 and 1600x1200. With 4xAA I saw boosts ranging from 9% to 13% in Riddick and 9% to 15% in Doom III.
Quick note:
The update is mainly for X1800* in OGL at 4xAA. We will follow up with other AA modes and with X1600/X1300 in the future, but we can't do everything at once. All of them require different tuning to optimize performance.
 
sireric said:
Quick note:
The update is mainly for X1800* in OGL at 4xAA. We will follow up with other AA modes and with X1600/X1300 in the future, but we can't do everything at once. All of them require different tuning to optimize performance.

Will these optimisations be controlled via Cat. AI on a per-game basis, or will they also work if Cat. AI is disabled?
 
CK said:
I have been benchmarking the X1800XL and X1600XT in Doom III (v1.3) and Riddick (v1.1) with and without the OGL fix. The X1600XT does not really seem to react to the fix. I don't know if there are some registry bugs from changing from the X1800XL to the X1600XT without "re-ghosting" the rig or something like that, but the X1600 card did not show any difference. Can somebody confirm that?
Yes.
 
BobbleHead said:
DRAM cores increase in speed a lot slower than the busses that feed them. If you look at various spec sheets you'll see that absolute access times have not really changed (or even gotten worse), while clock speeds (and thus bandwidth) have increased quite a bit. A 250 MHz DDR part with a CL of 3 starts giving you data back in 12 ns. A top-end 800 MHz GDDR3 part requires a CL of 11, giving you data back in 13.75 ns. If you make just one small access, you're wasting a lot of cycles on your data bus. Instead you've got to work harder to find larger blocks of consecutive requests to keep those high bandwidth data pins humming.
Yup, and that's one reason it's difficult to compare per-transistor efficiency between architectures. G70 wasn't designed for sky-high core/memory speeds, so it won't need as many FIFOs as R520 does. I remember doing some work on R300 at ATI, and lowering both the core and memory clocks a bit resulted in significantly higher efficiency.
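The FIFO remark follows from a standard latency-hiding rule of thumb: you need roughly enough in-flight entries to cover the memory round-trip, so the higher the core clock (the more clocks a given latency spans), the deeper the FIFOs. A back-of-the-envelope sketch with assumed numbers - the ~200 ns latency is invented for illustration, and the clocks are only roughly G70/R520-class:

```python
# Rule-of-thumb FIFO sizing for latency hiding: entries ~= memory latency
# (in core clocks) * requests issued per clock. All numbers are invented
# for illustration; they are not measured G70/R520 figures.

def fifo_depth(latency_ns, core_clock_mhz, requests_per_clock=1):
    latency_clocks = latency_ns * core_clock_mhz / 1000.0  # ns * MHz / 1000 = clocks
    return round(latency_clocks * requests_per_clock)

print(fifo_depth(200, 430))   # ~G70-class core clock  -> ~86 entries
print(fifo_depth(200, 625))   # ~R520-class core clock -> ~125 entries
```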
 
BobbleHead said:
some stuff about latencies and memory

The reason is that DDR is already designed to work at full bandwidth with a given burst rate (if you ignore the overhead of loading a new row, changing pages, etc.), and GDDR3 supports only one burst mode: 4. Burst 4 means that four 32-bit elements per memory chip are read/written every 2 cycles. GDDR2 supported burst modes 4 and 8, but I doubt any GPU used 8. For 64-bit independent buses (2 memory chips per bus) and burst 4 the minimum access is 32 bytes, and for burst 8 the minimum access is 64 bytes (which is what I'm currently using in the simulator, just to get even worse bandwidth usage ;) ).

R520 implements 32-bit independent buses (1 memory chip per bus), so the minimum access is 16 bytes, and given the GDDR3 burst restriction an individual access can't be any longer than that. Of course the GPU stages can request or send more than 16 bytes to the memory controllers, but those accesses are split into multiple 16-byte accesses to the memory chips.
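The granularity arithmetic in those two paragraphs reduces to a one-liner: minimum access = chips per channel × chip width × burst length ÷ 8. Checking the cases mentioned above:

```python
# Minimum access per channel = chips_per_channel * chip_width_bits * burst_length / 8.

def min_access_bytes(chips_per_channel, chip_width_bits, burst_length):
    return chips_per_channel * chip_width_bits * burst_length // 8

print(min_access_bytes(2, 32, 4))  # 64-bit channel, burst 4 -> 32 bytes
print(min_access_bytes(2, 32, 8))  # 64-bit channel, burst 8 -> 64 bytes (simulator setting)
print(min_access_bytes(1, 32, 4))  # R520-style 32-bit channel, burst 4 -> 16 bytes
```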
 
RoOoBo said:
The reason is that DDR is already designed to work at full bandwidth with a given burst rate (if you ignore the overhead of loading a new row, changing pages, etc.), and GDDR3 supports only one burst mode: 4. Burst 4 means that four 32-bit elements per memory chip are read/written every 2 cycles. GDDR2 supported burst modes 4 and 8, but I doubt any GPU used 8. For 64-bit independent buses (2 memory chips per bus) and burst 4 the minimum access is 32 bytes, and for burst 8 the minimum access is 64 bytes (which is what I'm currently using in the simulator, just to get even worse bandwidth usage ;) ).

R520 implements 32-bit independent buses (1 memory chip per bus), so the minimum access is 16 bytes, and given the GDDR3 burst restriction an individual access can't be any longer than that. Of course the GPU stages can request or send more than 16 bytes to the memory controllers, but those accesses are split into multiple 16-byte accesses to the memory chips.

Initial GDDR3 parts only required support for a burst of 4 (8 was optional), but the newer revisions all generally support both 4 and 8. That really is not important though, since a burst-of-8 read from address 0 looks exactly the same as back-to-back burst-of-4 reads from 0 and 4. As you said, it just appears as multiple accesses. Upstream blocks on the chip do not need to know anything about that. If you design something to always work on blocks of 64 bytes (because you know that is more efficient), the lowest-level hardware that talks to the DRAM chip can decide whether to do that as 2x burst of 8 or 4x burst of 4.

More troublesome is your first sentence. You cannot ignore the substantial time required to open and close pages. As an example, the 800 MHz 512 Mbit Samsung part has a row cycle time of 35 clocks. More importantly, it has a four-activate window of 40 clocks. If you were to read only a single burst of 4 each time you opened a page, you would be wasting 80% of your bandwidth: 2 cycles per read * 4 banks read from / 40 clocks = 0.20. Make that 2 bursts of 4 (or 1 burst of 8), and you are still tossing away 60%. Even doubling up to 2 bursts of 8 only gets you to 80% utilization, and only for that short period of time. Add in losses due to read/write switching, refresh, and not always having something available for a different bank to do, and utilization over a longer time scale drops much lower.

You need a lot of reads or writes that can be sent out on consecutive cycles to approach any kind of useful utilization. Make much longer bursts to get stretches where you are using 100% to offset the inevitable waste that happens at other times.
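Those 20/40/80% figures fall straight out of the four-activate window: at most four pages can be opened per 40 clocks, so peak data-bus utilisation is (data clocks per activate × 4) ÷ 40. In script form:

```python
# Peak data-bus utilisation under a 40-clock four-activate window, as in the
# Samsung example above: at most 4 activates per window, each burst of 4
# occupies 2 clocks of data-bus time.

T_FAW_CLOCKS = 40
CLOCKS_PER_BURST_OF_4 = 2

for bursts_per_page in (1, 2, 4):   # 1xB4, 2xB4 (= 1xB8), 4xB4 (= 2xB8)
    data_clocks = 4 * bursts_per_page * CLOCKS_PER_BURST_OF_4
    print(f"{bursts_per_page} burst(s) of 4 per page opened: "
          f"{data_clocks / T_FAW_CLOCKS:.0%} peak utilisation")
# -> 20%, 40%, 80%, before refresh and read/write turnaround drag it lower.
```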
 
WRT this patch, I'd suggest looking at the pre-patched 4x vs 6x scores. 6x is actually faster than 4x on the XT, suggesting that something may not quite have been right with the 4x memory mappings in the first place.
 
Headstone said:
It is very strange that the X1600 sees no benefit while the X1800 and X1300 see large gains. Does anyone have a good reason for this?

I think it's most likely because the X1300 is more like the X1800 than it is like the X1600, and the improvements were done mostly with the X1800 in mind.
 