Future solution to memory bandwidth

ERK

I hope this topic might make for interesting discussion...

I am basically soliciting your opinions on the following. What is the most likely way that significantly higher memory bandwidths can be achieved to feed future TFLOP-class GPUs, which will presumably be needed as we progress towards real-time, motion-picture-quality graphics? (Well, one or two more generations down the line, for the sake of discussion.)

Potential paths forward for your consideration:
1) move to 512-bit bus widths. This would have to come with a significant increase in die size, packaging and board costs.
2) rely on the continuing improvements in RAM speeds, such as with GDDR4. But is there a limit, and will this be enough?
3) EDRAM. Local bandwidths are very fast, but cost may be high and capacity is perhaps restricted.
4) single-board SLI or multi-core, where each core/chip has an independent memory interface, and frames are tiled/composited. Board costs will be high here, as well, IMO.

Which of these is the way to go? Or is there another better way not listed?
What say the experts?

ERK
 
Most of my ideas boil down to this: I think we'll eventually be tiling. Not hardware or driver tiling, but software tiling.
 
Inane_Dork said:
Most of my ideas boil down to this: I think we'll eventually be tiling. Not hardware or driver tiling, but software tiling.

Unless you mean something entirely different, I can see "SW tiling" on GPUs for ages now.
 
EDRAM functioning as some sort of cache; if it gets big enough to hold the entire working set of data for a frame, any bandwidth problems should be gone for a long time. I think it will take a few years still for EDRAM to reach that kind of density.
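
To put rough numbers on "the entire working set of data for a frame" (back-of-envelope only; the resolution, formats and AA level are my assumptions):

#include <cstdio>

int main() {
    // Assumed frame setup, purely for illustration.
    const long width = 1600, height = 1200;
    const long samples = 4;        // 4x multisampling
    const long colourBytes = 8;    // FP16 RGBA = 8 bytes per sample
    const long depthBytes = 4;     // 32-bit depth/stencil per sample

    const long bytes = width * height * samples * (colourBytes + depthBytes);
    printf("Colour + Z working set: %.1f MB\n", bytes / (1024.0 * 1024.0));
    // ~88 MB for this setup, before any textures -- which is why the
    // ~10 MB of eDRAM on Xenos only ever holds a tile of the framebuffer,
    // not the whole thing.
    return 0;
}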

Other than that, you have the old spectre that is Tile-based Rendering, which is amazing on paper but somehow never seems to make it out of the low end - which I would like to believe is merely due to lack of resources among us who make tilers.
 
Ailuros said:
Unless you mean something entirely different, I can see "SW tiling" on GPUs for ages now.
Yeah, once you've got a geometry shader that can work through geometry to determine occlusion, before any vertex shading takes place, you've got the makings of a fully deferred renderer, haven't you?

Jawed
 
Occluding redundant data isn't necessarily deferring anything.

I don't know how IHVs will lay out their coming GPUs, but why not have the VS before the GS?
 
Ailuros said:
Occluding redundant data isn't necessarily deferring anything.
I've always taken "deferred" to refer to the fact that the scene is multi-passed - so first the scene is read in to identify what's visible, then the scene is rendered (vertex shaded, rasterised and pixel shaded) using only the visible bits of primitives.

I don't know how IHVs will lay out their coming GPUs, but why not have the VS before the GS?
I'm thinking purely hierarchically: objects consist of primitives, which consist of pixels.

Though the WGF2 diagram does indeed show VS before GS :???:

Jawed
 
ERK said:
4) single-board SLI or multi-core, where each core/chip has an independent memory interface, and frames are tiled/composited. Board costs will be high here, as well, IMO.

OK, tiling: I'm confused about the meaning of this here. I assume that it means that the programmer can explicitly 'divide' a single frame to run on each GPU/local memory, whereas in current SLI/Crossfire solutions it's left up to the drivers, regardless of using AFR, SFR, ATI's 'scissor' mode etc... Correct, or not?
 
I suppose what I'm forgetting is that you have to transform into screenspace before you can test occlusions. So a sort of pre-VS pass, requiring animation/skinning/tessellation to be complete before occlusion can be tested.

Sigh. Messier than I was thinking.

Depends on the level of detail at which you want to defer, I suppose - per "pixel" or per "poly" or "per object". Hmm.

Jawed
 
VS after GS makes more sense IMHO. The volume of data grows at each successive stage. GPUs are optimized for high VS throughput, so it makes sense that the VS stage should be where most of the vertex data is. GS implies that one can work with compressed, high-level geometry representations, and I think the proper place to work with GS inputs is on the CPU.

If you had VS before GS, I think you'd need VS->GS->VS->PS. Now, you could claim that the GS could perform VS inside of it on each vertex it is generating, but I think this violates separation of concerns, since it would tend to produce large GS's that contain several VSes with branching.

But I also think that doing all the VS inside the GS may take away some opportunities for pipelining, because it's more likely that separate VS units could be operating on all of those batches of output vertices using different thread, context, and dispatching heuristics than the overall GS logic for dealing with primitives.

Seems to me that if you're doing GS, you're probably working with reduced geometry which is expanded by the GS, therefore, the load on the CPU to deal with manipulating GS primitives probably isn't that much to worry about.
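
To make the ordering argument concrete, here's a toy CPU-side sketch (the stage functions are stand-ins, not any real API):

#include <vector>

struct Vertex   { float pos[4]; };
struct Triangle { Vertex v[3]; };

// Stand-in stages, purely illustrative.
Vertex vertex_shade(const Vertex& v) { return v; }                            // expensive per-vertex work
std::vector<Triangle> geometry_shade(const Triangle& t) { return { t, t }; }  // amplification: 1 tri -> 2

// Ordering A (GS then VS): the VS runs on the expanded stream,
// which is where most of the vertex data lives anyway.
void gs_then_vs(const Triangle& in) {
    for (Triangle t : geometry_shade(in))
        for (Vertex& v : t.v)
            v = vertex_shade(v);
}

// Ordering B (VS then GS): any vertices the GS *creates* still need
// vertex-style work, so effectively you get VS -> GS -> VS -> PS,
// or the GS has to swallow a copy of the VS internally.
void vs_then_gs(Triangle in) {
    for (Vertex& v : in.v)
        v = vertex_shade(v);                    // first VS pass
    for (Triangle t : geometry_shade(in))
        for (Vertex& v : t.v)
            v = vertex_shade(v);                // second VS pass, on generated vertices
}

int main() {
    Triangle t{};
    gs_then_vs(t);
    vs_then_gs(t);
    return 0;
}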
 
Doesn't conditional rendering in D3D10 skip data even before the vertex setup?

But I also think that doing all the VS inside the GS may take away some opportunities for pipelining, because it's more likely that separate VS units could be operating on all of those batches of output vertices using different thread, context, and dispatching heuristics than the overall GS logic for dealing with primitives.

Sounds even more reasonable if of course we're talking about non-USC cores. In a USC I'd be very surprised if the ALUs wouldn't be capable of PS/VS/GS.
 
Yes, the more I think about this, a deferred renderer has traditionally been thought of as a purely VS/PS device (seemingly - I'm a noob, really), so the concept of a GS has always been safely removed.

Jawed
 
Jawed said:
Yes, the more I think about this, a deferred renderer has traditionally been thought of as a purely VS/PS device (seemingly - I'm a noob, really), so the concept of a GS has always been safely removed.

Jawed

With each post you're confusing me more. A D3D10 TBDR would obviously also have a GS; this has nothing to do with the differences between an immediate mode and a deferred renderer.

Vastly oversimplified: an IMR processes data as it flows in, while a DR collects/defers data before it processes it.
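
In toy-loop form (nothing vendor-specific, just the control flow; the tile binning is a placeholder):

#include <vector>

struct Triangle { float data[9]; };
struct Tile     { std::vector<Triangle> bin; };

void shade(const Triangle&) { /* rasterise + pixel shade */ }

// Immediate-mode renderer: shade each triangle as it arrives, reading
// and writing the full-size framebuffer in external memory as you go.
void render_immediate(const std::vector<Triangle>& scene) {
    for (const Triangle& t : scene)
        shade(t);
}

// Deferred/tile-based renderer: first collect the whole scene and bin
// triangles per screen tile, then shade one tile at a time out of a
// small on-chip buffer. Whether a GS exists up front is orthogonal.
void render_deferred(const std::vector<Triangle>& scene, std::vector<Tile>& tiles) {
    for (const Triangle& t : scene)
        tiles[0].bin.push_back(t);     // placeholder: real code picks the tile(s) t touches
    for (const Tile& tile : tiles)
        for (const Triangle& t : tile.bin)
            shade(t);
}

int main() {
    std::vector<Triangle> scene(3);
    std::vector<Tile> tiles(4);
    render_immediate(scene);
    render_deferred(scene, tiles);
    return 0;
}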
 
Ken2012 said:
OK, tiling: I'm confused about the meaning of this here. I assume that it means that the programmer can explicitly 'divide' a single frame to run on each GPU/local memory, whereas in current SLI/Crossfire solutions it's left up to the drivers, regardless of using AFR, SFR, ATI's 'scissor' mode etc... Correct, or not?

The way I listed it is probably most like CrossFire's supertiling, but I didn't mean to restrict it to a specific method.
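
For what it's worth, supertiling boils down to a fixed checkerboard assignment of screen tiles to chips; something like this (tile size, resolution and GPU count are just picked for illustration):

#include <cstdio>

int main() {
    const int tileSize = 32;                 // 32x32-pixel supertiles (illustrative)
    const int gpuCount = 2;
    const int width = 1024, height = 768;

    // Checkerboard assignment: each tile has exactly one owner, so each
    // chip only touches its own tiles in its own local memory.
    for (int ty = 0; ty < height / tileSize; ++ty)
        for (int tx = 0; tx < width / tileSize; ++tx) {
            const int owner = (tx + ty) % gpuCount;
            if (ty == 0 && tx < 4)           // print the first few as a sanity check
                printf("tile (%d,%d) -> GPU %d\n", tx, ty, owner);
        }
    return 0;
}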
 
Part of DX10 is the multi-passing of geometry (vertices). That's why stream-out is there (that's my understanding).
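
Conceptually, something like this (a CPU-side sketch of the idea, not real D3D10 calls):

#include <vector>

struct Vertex { float pos[4]; };

Vertex expensive_vertex_work(const Vertex& v) { return v; }   // stand-in for VS/GS

// Pass 1: run the geometry through VS/GS once and capture the results to
// a buffer instead of (or as well as) rasterising -- the stream-out idea.
std::vector<Vertex> capture_pass(const std::vector<Vertex>& scene) {
    std::vector<Vertex> captured;
    captured.reserve(scene.size());
    for (const Vertex& v : scene)
        captured.push_back(expensive_vertex_work(v));
    return captured;
}

// Passes 2..n: feed the captured buffer back in as the vertex stream, so
// later passes over the same geometry skip the expensive work entirely.
void reuse_pass(const std::vector<Vertex>& captured) {
    for (const Vertex& v : captured) { (void)v; /* rasterise / shade here */ }
}

int main() {
    std::vector<Vertex> scene(3);
    reuse_pass(capture_pass(scene));
    return 0;
}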

Jawed
 
Humus said:
5) Better compression of textures and geometry is another option.
This is something I've always been very curious about. For instance, would it be possible to get any kind of reasonable quality by moving to significantly higher resolution textures, but with lossier compression?

I often get annoyed with the smeary magnified look.

How close to the limits of compression are we now? Seems like if there were performance to be mined here it would have been done already.
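
Rough numbers, for what it's worth (the formats and sizes below are just picked to show the trade-off):

#include <cstdio>

int main() {
    // Footprint of a single 2D texture at various bit rates.
    struct Fmt { const char* name; double bitsPerTexel; };
    const Fmt fmts[] = {
        { "uncompressed RGBA8",       32.0 },
        { "DXT1 (8:1 vs RGBA8)",       4.0 },
        { "hypothetical lossier 16:1", 2.0 },
    };

    const long sizes[] = { 1024, 2048, 4096 };
    for (long s : sizes)
        for (const Fmt& f : fmts) {
            const double mb = s * s * f.bitsPerTexel / 8.0 / (1024.0 * 1024.0);
            printf("%ldx%ld %-26s %7.1f MB\n", s, s, f.name, mb);
        }
    // e.g. a 4096x4096 texture: 64 MB raw, 8 MB in DXT1, 4 MB at 16:1 --
    // so a lossier format could buy a resolution step for the same memory.
    return 0;
}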
 
Well, all memory will be on the chip once we've moved to high-k dielectrics, low clock speeds and 3D chips.

But before that we will have optical connectors on the RAM package: there will be a high-speed module, say a SiGe chip, with a large number of connections to the actual RAM chips. The high-speed chip will also carry an optical package, so basically we will have a serialiser chip that connects to the GPU via plastic fibre cable(s).
 
1) move to 512-bit bus widths. This would have to come with a significant increase in die size, packaging and board costs.
Hmmm... I have to wonder how well serial bus architectures would work out here. The pin counts would be lower, but the point is that they can probably scale much farther. Optical interconnects were also mentioned earlier in the thread, which is nice and all and you can still have your really wide bus without a lot of traces, but it's also expensive.

2) rely on the continuing improvements in RAM speeds, such as with GDDR4. But is there a limit, and will this be enough?
You can keep doing that, but eventually latency will catch up to you. GDDR3 can scale higher on clock than GDDR2, but it's also higher latency, and GDDR4 will probably be no different. If you can keep hiding the latency, that'll be great, but I think pixel/vertex processing volume will probably level off at some point in the long run. Maybe with raytracing hardware, there will be more readbacks of the scene data, but we'll see.
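
Just to frame options 1 and 2 with numbers (the data rates here are picked purely for illustration):

#include <cstdio>

// Peak memory bandwidth = bus width (in bytes) * effective data rate.
double bandwidth_gb_s(int busBits, double dataRateGTs) {
    return (busBits / 8.0) * dataRateGTs;   // GB/s
}

int main() {
    // Illustrative data points only.
    printf("256-bit @ 1.6 GT/s (GDDR3-ish)  : %6.1f GB/s\n", bandwidth_gb_s(256, 1.6));
    printf("512-bit @ 1.6 GT/s              : %6.1f GB/s\n", bandwidth_gb_s(512, 1.6));
    printf("256-bit @ 3.2 GT/s (faster RAM) : %6.1f GB/s\n", bandwidth_gb_s(256, 3.2));
    // Doubling the bus and doubling the data rate buy the same peak figure;
    // the difference is pins/board cost vs latency and signal integrity.
    return 0;
}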

3) EDRAM. Local bandwidths are very fast, but cost may be high and capacity is perhaps restricted.
This is especially problematic if everybody's going to try and push FP32 HDR framebuffers or something similar that just eats bandwidth and capacity like a gigantic Pacman.

4) single-board SLI or multi-core, where each core/chip has an independent memory interface, and frames are tiled/composited. Board costs will be high here, as well, IMO.
I really don't think this kind of idea will go anywhere, and at best will win a very wee market segment. GPUs as they are will simply continue to get bigger and fatter down the same lines that they currently are, and so there's probably not much to gain by making a multi-chip board.
 