Future solution to memory bandwidth

Is there any reason why they can't make a higher-clocked, more serial RAM? Isn't that the idea behind XDR/Rambus, or am I way out of line?
 
Ailuros said:
Unless you mean something entirely different, I've seen "SW tiling" on GPUs for ages now.
No, we're probably thinking about the same thing. I'm just taking it in a different direction.

I would think that something Xenos-like would be quite beneficial to PC developers, if they were all used to the mindset of fitting the framebuffer into cache. Granted, we're talking about several tiles to hit the uber high resolutions that PCs can do, but if your renderer is set up to scale that way, the benefits are tremendous.
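To put rough numbers on "several tiles" (a back-of-the-envelope sketch; the 10 MB figure is Xenos' eDRAM size, while the resolution, AA level, and per-sample sizes are just illustrative assumptions):

```python
import math

# Rough tile-count estimate for a Xenos-style eDRAM renderer at a PC resolution.
EDRAM_BYTES = 10 * 1024 * 1024          # Xenos has 10 MB of eDRAM
width, height = 1600, 1200              # example PC resolution (assumption)
samples = 4                             # 4x multisampling (assumption)
bytes_per_sample = 4 + 4                # 32-bit color + 32-bit depth/stencil

framebuffer_bytes = width * height * samples * bytes_per_sample
tiles = math.ceil(framebuffer_bytes / EDRAM_BYTES)

print(f"Framebuffer: {framebuffer_bytes / 2**20:.1f} MB -> {tiles} tiles")
# Prints roughly: Framebuffer: 58.6 MB -> 6 tiles
```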

And the perennial programmer favorite, access to the framebuffer within the pixel shader, would be much more feasible if the current tile was in cache.

Or maybe GPUs will be taken in a more programmable stream direction. That way, data is kept on chip and external bandwidth is reserved for other things. Then again, every stream has to be small enough to fit in cache, so I'm back at tiling again.

It's not that tiling is the answer to everything. It's just that when I thought about the question, the answers I came up with all pretty much demanded that software be made aware of some GPU memory which cannot be exceeded. 'Twas just my initial response.
 
Inane_Dork said:
No, we're probably thinking about the same thing. I'm just taking it in a different direction.

I would think that something Xenos-like would be quite beneficial to PC developers, if they were all used to the mindset of fitting the framebuffer into cache. Granted, we're talking about several tiles to hit the uber high resolutions that PCs can do, but if your renderer is set up to scale that way, the benefits are tremendous.

We already have that... TurboCache! :LOL:
 
Inane_Dork said:
No, we're probably thinking about the same thing. I'm just taking it in a different direction.

I would think that something Xenos-like would be quite beneficial to PC developers, if they were all used to the mindset of fitting the framebuffer into cache. Granted, we're talking about several tiles to hit the uber high resolutions that PCs can do, but if your renderer is set up to scale that way, the benefits are tremendous.

And the perennial programmer favorite, access to the framebuffer within the pixel shader, would be much more feasible if the current tile was in cache.

Or maybe GPUs will be taken in a more programmable stream direction. That way, data is kept on chip and external bandwidth is reserved for other things. Then again, every stream has to be small enough to fit in cache, so I'm back at tiling again.

It's not that tiling is the answer to everything. It's just that when I thought about the question, the answers I came up with all pretty much demanded that software be made aware of some GPU memory which cannot be exceeded. 'Twas just my initial response.


http://www.beyond3d.com/articles/xenos/index.php?p=05#tiled

Both directions are covered on that page. Read it carefully and re-think.
 
Is there any reason why they can't make a higher-clocked, more serial RAM? Isn't that the idea behind XDR/Rambus, or am I way out of line?
That was the idea. XDR and Rambus DRAMs are not produced in high enough volume to be cheap. The cost difference there probably counters any advantage in cost that you'd get from having the simpler board design.

As far as actually raising the clock itself, the fundamental problem with that is simply the capacitance of those wire traces on your circuit boards. It's not easy to swing voltages that fast when you've got lots of capacitance. XDR manages it by using a very small voltage swing (only 0.2V), and using a differential signaling scheme to be a little more noise-resistant as 0.2V is not a lot.

Second of all, RAM itself can't be clocked super high. XDR at 3.2 GHz signaling means that the DRAM clock is 400 MHz. That's not an easy clock to reach considering that the larger the DRAM, the slower it is, the longer the multiplexor delays, the more there is to refresh and so on.
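For what it's worth, the arithmetic behind those figures looks like this (XDR signals at eight bits per clock per pin pair; the x16 device width is an assumption for the example):

```python
# XDR signaling arithmetic: octal data rate means 8 bits per pin pair per DRAM clock.
dram_clock_hz = 400e6                              # DRAM core/command clock
bits_per_clock = 8                                 # octal data rate
data_rate = dram_clock_hz * bits_per_clock         # per differential pair
device_width = 16                                  # a x16 XDR device (assumption)

print(f"Per-pair data rate: {data_rate / 1e9:.1f} Gbps")                         # 3.2 Gbps
print(f"Per-device bandwidth: {data_rate * device_width / 8 / 1e9:.1f} GB/s")    # 6.4 GB/s
```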
 
Inane_Dork said:
No, we're probably thinking about the same thing. I'm just taking it in a different direction.

I would think that something Xenos-like would be quite beneficial to PC developers, if they were all used to the mindset of fitting the framebuffer into cache. Granted, we're talking about several tiles to hit the uber high resolutions that PCs can do, but if your renderer is set up to scale that way, the benefits are tremendous.
You can't tile efficiently in software, because tiling sits between the calculation of screen-space vertex positions (currently done in the vertex shader) and the computation of pixels. As such, the only way to tile efficiently is in hardware. But I don't think it's really necessary.

Consider ATI's performance hit from FSAA as a quick example. Simply being very careful about what you do with available bandwidth can really improve things quite a lot.

Additionally, as we move into the future, pixel shaders are naturally going to get longer. So framebuffer bandwidth demands are going to decrease in relation to fillrate demands. And the same goes for texture bandwidth, since the ALU to TEX operation ratio is just going to increase.
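A toy model of that trend (all numbers are made up purely for illustration; the point is only that framebuffer traffic per pixel is roughly fixed while cycles per pixel grow):

```python
# As shaders get longer, framebuffer bandwidth needed per unit of shading work drops.
framebuffer_bytes_per_pixel = 8      # e.g. 4 B color write + 4 B depth (illustrative)

for shader_cycles in (4, 16, 64, 256):
    bytes_per_cycle = framebuffer_bytes_per_pixel / shader_cycles
    print(f"{shader_cycles:>3} cycles/pixel -> {bytes_per_cycle:.3f} framebuffer bytes per shader cycle")
```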

Of course, since memory bandwidth naturally scales more slowly than processing power, it is still conceivable that at some point we'll need something pretty different to keep improving performance (e.g. eDRAM, on-package DRAM, or TBDR). But since silicon process technologies themselves don't have all that far to go (there's unlikely to be much improvement past about 30nm), we may never hit that wall. Not until we move away from silicon-based processors, anyway.
 
ERK said:
This is something I've always been very curious about. For instance, would it be possible to get any kind of reasonable quality by moving to significantly higher resolution textures, but with lossier compression?

I often get annoyed with the smeary magnified look.

Well, going from RGBA8 to DXT already improves both quality and performance at the same storage space. At minification the RGBA8 texture will naturally look better, even though the difference is often hard to see, but at magnification a higher-res DXT texture will certainly look better.
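Putting rough numbers on the equal-storage comparison (RGBA8 is 32 bits per texel and DXT1 is 4; the 1024x1024 base size is just an example):

```python
import math

# At an equal memory budget, how much higher-resolution can a DXT1 texture be than RGBA8?
rgba8_bpp, dxt1_bpp = 32, 4                   # bits per texel
base = 1024                                   # example RGBA8 texture: 1024x1024

scale = math.sqrt(rgba8_bpp / dxt1_bpp)       # linear resolution scale at equal storage
rgba8_mb = base * base * rgba8_bpp / 8 / 2**20
dxt1_mb = (2 * base) ** 2 * dxt1_bpp / 8 / 2**20

print(f"Equal storage fits a DXT1 texture about {scale:.1f}x the linear resolution")
print(f"{base}x{base} RGBA8 = {rgba8_mb:.0f} MB; {2*base}x{2*base} DXT1 = {dxt1_mb:.0f} MB")
```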

ERK said:
How close to the limits of compression are we now? Seems like if there were performance to be mined here it would have been done already.

Well, we have DXT for color, which works fine already and doesn't necessarily need to compress better. 4bpp is already quite small. Improving quality is probably a more important concern. There's some promising research in this area:
http://graphics.cs.lth.se/research/papers/ipackman2005/
We also have 3Dc to cover normal maps, and now with single channel 3Dc we have the luminance texture cases covered as well. What's left on the texturing side IMO would be some kind of HDR compression.
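For reference, a quick footprint comparison of the formats mentioned (bit rates as commonly quoted for these formats; the 2048x2048 size is just an example):

```python
# Storage for a 2048x2048 texture under the formats discussed above.
formats_bpp = {
    "RGBA8": 32,
    "DXT1": 4,
    "DXT5": 8,
    "3Dc (two-channel)": 8,
    "3Dc (single-channel)": 4,
}
w = h = 2048
for name, bpp in formats_bpp.items():
    mb = w * h * bpp / 8 / 2**20
    print(f"{name:<22} {bpp:>2} bpp  {mb:5.1f} MB")
```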
 
Chalnoth said:
we'll need something pretty different to keep improving performance (e.g. eDRAM, on-package DRAM, or TBDR)

I sincerely doubt the usefulness of eDRAM in the PC space. What I can see instead is some new memory technology pushing things forward.

Would XDR be sufficient for the dual-core setup discussed in the G71 thread? Or some beefed-up version of that?

From the Samsung XDR page:

What are the Advantages of XDR DRAM?
Highest Frequency Memory
4.0/3.2/2.4Gbps speed with max. 8.0GB/s sustained bandwidth
More head room for expandability
Highly Effective Memory Bandwidth
Large number of banks (8 banks)
Efficient operation for different bank-set (Even/Odd)
Zero refresh overhead
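That 8.0 GB/s sustained figure appears to be simply the top data rate times a x16 device width (the x16 width is an assumption based on typical XDR parts):

```python
# Where the "8.0 GB/s sustained" number comes from, assuming a x16 device.
data_rate_gbps = 4.0      # top speed grade from the list above
device_width_bits = 16
print(f"{data_rate_gbps} Gbps x {device_width_bits} bits / 8 = "
      f"{data_rate_gbps * device_width_bits / 8:.1f} GB/s per device")
```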
 
Chalnoth said:
Additionally, as we move into the future, pixel shaders are naturally going to get longer. So framebuffer bandwidth demands are going to decrease in relation to fillrate demands. And the same goes for texture bandwidth, since the ALU to TEX operation ratio is just going to increase.
But the counter argument is that rendering to cubemaps and shadowing etc. are all techniques that will increase in pervasiveness. There will no longer be just a backbuffer chewing up ROP/memory-bandwidth.

Jawed
 
Inane_Dork said:
I've already read it several times. I don't know what you're trying to convince me of, but I would really appreciate a simple laying out of why my ideas are infeasible.

It's okay, I don't bite. :p Just say it.

From said page for tiling on IMRs.

The net result here is that geometry needs to be recalculated multiple times for each of the buffers.

For one, a TBDR can remove that kind of redundancy, and with D3D10 and conditional rendering I believe even more so. That said, framebuffer consumption on a TBDR (besides the very low bandwidth requirements) is minuscule for any form of FSAA compared to other architectures, and even more so if any form of AA is combined with float HDR.

Theoretically at least the advantage of floating point framebuffers on tile based deferred renderers can be quite large.
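To illustrate the footprint argument with example numbers (resolution, AA level and FP16 format are arbitrary choices; this ignores any framebuffer compression and assumes the TBDR keeps samples and depth in its on-chip tile buffer, writing out only resolved color):

```python
# Example footprint of a 4xAA FP16 render target on an IMR versus a TBDR.
width, height = 1600, 1200
samples = 4
color_bytes = 8          # FP16 RGBA
depth_bytes = 4          # 24-bit depth + 8-bit stencil

imr_bytes = width * height * samples * (color_bytes + depth_bytes)
tbdr_resolved_bytes = width * height * color_bytes   # only resolved color leaves the chip

print(f"IMR multisampled surface: {imr_bytes / 2**20:6.1f} MB")
print(f"TBDR resolved output:     {tbdr_resolved_bytes / 2**20:6.1f} MB")
```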
 
ShootMyMonkey said:
That was the idea. XDR and Rambus DRAMs are not produced in high enough volume to be cheap. The cost difference there probably counters any advantage in cost that you'd get from having the simpler board design.

As far as actually raising the clock itself, the fundamental problem with that is simply the capacitance of those wire traces on your circuit boards. It's not easy to swing voltages that fast when you've got lots of capacitance. XDR manages it by using a very small voltage swing (only 0.2V), and using a differential signaling scheme to be a little more noise-resistant as 0.2V is not a lot.

Second of all, RAM itself can't be clocked super high. XDR at 3.2 GHz signaling means that the DRAM clock is 400 MHz. That's not an easy clock to reach considering that the larger the DRAM, the slower it is, the longer the multiplexor delays, the more there is to refresh and so on.

Yeah, I knew XDR was expensive. I was suggesting that maybe a consortium could design next-gen graphics RAM around the basic idea of a more serial design. Though any promising design in that direction probably has elements patented by Rambus, which would make it expensive or difficult to produce royalty-free.

And I didn't know all of that about capacitance and XDR's design; very interesting stuff.
 
What would a 512-bit bus do to pin counts and board complexity? What kind of difference did we see going from 128-bit to 256-bit?
 
Jawed said:
But the counter argument is that rendering to cubemaps and shadowing etc. are all techniques that will increase in pervasiveness. There will no longer be just a backbuffer chewing up ROP/memory-bandwidth.
Well, rendering to cubemaps, in general, isn't going to be any different than rendering to the framebuffer, so that's not a concern.

Rendering shadowmaps is, of course, but this is where z-buffer compression comes in handy. It should be possible to compress a shadowmap in the same way that the z-buffer is compressed, dramatically reducing the bandwidth requirements.
 
Chalnoth said:
Well, rendering to cubemaps, in general, isn't going to be any different than rendering to the framebuffer, so that's not a concern.
It's a concern because it's an additional workload on the ROPs/bandwidth.

Rendering shadowmaps is, of course, but this is where z-buffer compression comes in handy. It should be possible to compress a shadowmap in the same way that the z-buffer is compressed, dramatically reducing the bandwidth requirements.
I think the compression is relatively limited where there's high geometric complexity, which makes for a particular problem when generating self-shadowing.

I suspect texture space lighting allows the shadowing engine to partition the scene into "pockets" of self-shadowing, where high geometric complexity prevails, but over the remainder of the scene the more general case of traditional shadow mapping (but with multiple light sources for interiors) can make best use of Z compression as you suggest.

But, again, each self-shadowing object/character will only add to the ROP/bandwidth load.

Jawed
 
hughJ said:
What would a 512-bit bus do to pin counts and board complexity? What kind of difference did we see going from 128-bit to 256-bit?

The biggest problem with a true 512-bit bus would be the pad space on the GPU die. The transistors keep shrinking, but not the wires that connect them to the rest of the card.
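On the bandwidth side at least, the scaling with bus width is straightforward (the 1.6 Gbps per-pin rate is just an illustrative GDDR3-class figure):

```python
# Peak bandwidth scales linearly with bus width at a fixed per-pin data rate.
data_rate_gbps_per_pin = 1.6      # e.g. 800 MHz GDDR3-class signaling (illustrative)
for bus_width in (128, 256, 512):
    gb_per_s = bus_width * data_rate_gbps_per_pin / 8
    print(f"{bus_width:>3}-bit bus: {gb_per_s:5.1f} GB/s")
```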

ShootMyMonkey said:
That was the idea. XDR and Rambus DRAMs are not produced in high enough volume to be cheap. The cost difference there probably counters any advantage in cost that you'd get from having the simpler board design.

As far as actually raising the clock itself, the fundamental problem with that is simply the capacitance of those wire traces on your circuit boards. It's not easy to swing voltages that fast when you've got lots of capacitance. XDR manages it by using a very small voltage swing (only 0.2V), and using a differential signaling scheme to be a little more noise-resistant as 0.2V is not a lot.

Second of all, RAM itself can't be clocked super high. XDR at 3.2 GHz signaling means that the DRAM clock is 400 MHz. That's not an easy clock to reach considering that the larger the DRAM, the slower it is, the longer the multiplexor delays, the more there is to refresh and so on.

XDR's RAM core might not be clocked much faster than what we have already, but it is higher bandwidth per pin. It might be possible to actually route enough lines to get more bandwidth to the GPU, but I don't know. One XDR channel might be easy, but what happens when you want 8 or 16 of them in close proximity?
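A rough count of how many devices that takes (assuming x16 XDR parts at 3.2 Gbps, i.e. 6.4 GB/s each; the target figures are arbitrary):

```python
import math

# How many x16 XDR devices (at 3.2 Gbps -> 6.4 GB/s each) to hit a given bandwidth?
per_device_gbs = 3.2 * 16 / 8      # 6.4 GB/s
for target_gbs in (32, 64, 128):
    devices = math.ceil(target_gbs / per_device_gbs)
    print(f"{target_gbs:>3} GB/s target -> {devices} devices")
```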

I've always wondered how well serial-interface memory would work on video cards, but the only one I know of that used Rambus was the Cirrus Logic Laguna series, and those sucked. I don't think that was the memory's fault though :D
 
Chalnoth said:
You can't tile efficiently in software, because tiling sits between the calculation of screen-space vertex positions (currently done in the vertex shader) and the computation of pixels. As such, the only way to tile efficiently is in hardware. But I don't think it's really necessary.
Very efficient software tiling, no. But likely efficient enough that, in a bandwidth-constrained situation, recomputing part of the scene is a win in order to fit inside cache. It would basically boil down to frustum culling and maybe some tile selection algorithm, but if you were really pressed for bandwidth, it could well be a win on the whole.
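A minimal sketch of what that software path could look like (all helper names are hypothetical; the idea is just to scissor to each tile and re-issue only the draw calls whose screen-space bounds overlap it):

```python
# Minimal sketch of software tiling: split the screen into tiles, and for each
# tile re-submit only the draw calls whose screen-space bounds overlap it.
# All helper callables (screen_bounds, set_scissor, draw) are hypothetical.

TILE_W, TILE_H = 512, 512

def overlaps(bounds, tile_rect):
    (x0, y0, x1, y1), (tx0, ty0, tx1, ty1) = bounds, tile_rect
    return x0 < tx1 and x1 > tx0 and y0 < ty1 and y1 > ty0

def render_tiled(draw_calls, width, height, set_scissor, draw, screen_bounds):
    for ty in range(0, height, TILE_H):
        for tx in range(0, width, TILE_W):
            tile = (tx, ty, min(tx + TILE_W, width), min(ty + TILE_H, height))
            set_scissor(*tile)                        # keep all output inside the tile
            for dc in draw_calls:
                if overlaps(screen_bounds(dc), tile): # coarse CPU-side cull
                    draw(dc)                          # geometry is re-processed per tile
```

In practice the tile size would presumably be chosen so that color plus Z for one tile fits in whatever on-chip memory is available, which is the trade-off being discussed here.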

Consider ATI's performance hit from FSAA as a quick example. Simply being very careful about what you do with available bandwidth can really improve things quite a lot.

Additionally, as we move into the future, pixel shaders are naturally going to get longer. So framebuffer bandwidth demands are going to decrease in relation to fillrate demands. And the same goes for texture bandwidth, since the ALU to TEX operation ratio is just going to increase.
Absolutely. I don't see that bandwidth is going to become the big bottleneck in real time graphics, but that's the question put forth in this thread.



Ailuros said:
From said page for tiling on IMRs.
Already known. It's a trade-off. But, like I said above, in the situation presumed in this thread, it would probably be advantageous. You trade off a resource that's not getting maxed (shader processing) for a resource that is (bandwidth). And we're talking about vertex shaders here which, to date, have not been terribly large.
 
Inane_Dork said:
Very efficient software tiling, no. But likely efficient enough that, in a bandwidth-constrained situation, recomputing part of the scene is a win in order to fit inside cache. It would basically boil down to frustum culling and maybe some tile selection algorithm, but if you were really pressed for bandwidth, it could well be a win on the whole.
But in this situation you'll be using significantly more bus and vertex bandwidth. So it may not be an overall bandwidth win after all.
 
Inane_Dork said:
Already known. It's a trade-off. But, like I said above, in the situation presumed in this thread, it would probably be advantageous. You trade off a resource that's not getting maxed (shader processing) for a resource that is (bandwidth). And we're talking about vertex shaders here which, to date, have not been terribly large.

On a GPU with separate PS/VS units, in a theoretical scene where it would hit a very long vertex shader at the same time as a quite short pixel shader, I'm not so sure it would be advantageous.
 
From:

http://microsoft.sitestream.com/PDC05/PRS/PRS311_files/Botto_files/PRS311_Balaz.ppt

[Three slide images from the linked PDC05 presentation]

Jawed
 