Future solution to memory bandwidth

Chalnoth said:
But in this situation you'll have significantly higher bus and vertex bandwidth. So it may not be an overall bandwidth win after all.
I was under the impression that render targets consumed well over half the bandwidth on a current PC card. If so, rendering to cache is more likely to be a win than not, even when quadrupling vertex bandwidth.

Bus speed, however, might be an issue. I had not thought of that.
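A quick back-of-the-envelope sketch of that argument, with entirely made-up traffic shares (the 60/35/5 split and the 4x geometry resubmission factor below are assumptions, not measurements):

Code:
#include <cstdio>

// If render-target traffic dominates external bandwidth, keeping it in an
// on-chip cache can be a net win even when vertex traffic quadruples.
int main() {
    const double render_target = 0.60; // colour + Z read/write (assumed share)
    const double texture       = 0.35; // assumed share
    const double vertex        = 0.05; // assumed share

    // Tiled/cached rendering: RT traffic stays on chip, but geometry is
    // resubmitted once per tile it touches (here: 4x on average).
    const double tiled = texture + 4.0 * vertex;

    std::printf("current: %.2f  tiled: %.2f\n",
                render_target + texture + vertex, tiled);
    // -> current: 1.00  tiled: 0.55, a win under these assumptions
}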



Ailuros said:
On a GPU with separate PS/VS units, in a theoretical scene that hits a very long vertex shader at the same time as a quite short pixel shader, I'm not so sure it would be advantageous.
This is 3-5 years down the line. I kinda doubt there will be separate hardware shaders. And I also doubt that there will be numerous cases with long vertex shaders and short pixel shaders. Shadow map generation is probably the closest match I can think of for this situation.

Sure, there will be corner cases and instances where caching the render target is slower than the current system. If the situation were really such a slam dunk, we would be there by now.
 
What about ditching rasterizing for visibility calcs and using raytracing? Does anybody know of any good reasons why not? At least if the KD-tree construction problem is solved.
AFAICS, raytracing-specific tree traversal units could be added to current or next-generation GPUs for a gradual change. The HW requirements shouldn't be too hefty. Looking at the RPU paper from last SIGGRAPH, I'd estimate the approximate HW complexity per unit at around the same as R580 ALUs: 1.5M transistors apiece. If I didn't make any mistakes in my back-of-the-envelope calculations, around 10-20 of those should be enough to provide good framerates at 1600x1200.
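For what it's worth, here's the kind of back-of-the-envelope maths behind that sort of estimate; everything except the resolution is an assumption (rays per pixel, and especially the per-unit throughput, which extrapolates the FPGA-clocked RPU prototype to ASIC speeds):

Code:
#include <cstdio>

int main() {
    const double pixels         = 1600.0 * 1200.0;
    const double fps            = 60.0;
    const double rays_per_pixel = 3.0;   // primary + shadow/secondary (assumed)
    const double rays_per_sec   = pixels * fps * rays_per_pixel; // ~346M rays/s

    // Hypothetical sustained rate per traversal unit at GPU clocks.
    const double unit_throughput = 25e6; // rays/s per unit (assumed)

    std::printf("units needed: ~%.0f\n", rays_per_sec / unit_throughput); // ~14
}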
 
Chalnoth said:
Rendering shadowmaps is, of course, but this is where z-buffer compression comes in handy. It should be possible to compress a shadowmap in the same way that the z-buffer is compressed, dramatically reducing the bandwidth requirements.
Difficult, because the TMU (or rather, a texel prefetcher, however you want to call it) would need access to the compression flag table that indicates which tiles are compressed and which aren't.
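To illustrate the extra indirection (all structures below are invented for the sketch; nothing here matches real hardware): every texel fetch from a compressed depth buffer has to consult the per-tile flag table first and then pick a decode path, instead of doing a plain linear read.

Code:
#include <cstdint>
#include <vector>

constexpr int kTileSize = 8; // 8x8 depth tiles assumed

struct DepthSurface {
    int width, tiles_per_row;
    std::vector<uint8_t> tile_compressed; // 1 = tile stored as a plane equation
    std::vector<float>   raw;             // uncompressed texels
    std::vector<float>   planes;          // per-tile {z0, dz/dx, dz/dy}
};

float FetchShadowTexel(const DepthSurface& s, int x, int y) {
    // This flag lookup is exactly the state the TMU normally has no access to.
    const int tile = (y / kTileSize) * s.tiles_per_row + (x / kTileSize);
    if (s.tile_compressed[tile]) {
        // Compressed tile: reconstruct depth from the plane equation.
        const float* p = &s.planes[tile * 3];
        return p[0] + p[1] * (x % kTileSize) + p[2] * (y % kTileSize);
    }
    return s.raw[y * s.width + x]; // plain uncompressed read
}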
 
Xmas said:
Difficult, because the TMU (or rather, a texel prefetcher, however you want to call it) would need access to the compression flag table that indicates which tiles are compressed and which aren't.
There is another form of compression available: ATI1N, for single-channel textures.

Jawed
 
Inane_Dork said:
This is 3-5 years down the line. I kinda doubt there will be separate hardware shaders. And I also doubt that there will be numerous cases with long vertex shaders and short pixel shaders. Shadow map generation is probably the closest match I can think of for this situation.

Sure, there will be corner cases and instances where caching the render target is slower than the current system. If the situation were really such a slam dunk, we would be there by now.

What is it then I'm probably misinterpreting here?

Lemme check... let's see... SM2.0... SM2.0... (it's hard to find around here, since we still have a lot of 1.x shaders :)... OK, I think it's something in the range of 15-20 instructions. But, as I said before, the shader that really hammered high-end GPUs was the one with 8 texture fetches (and only 25 instructions!).

As for vertex shaders... well, we went overboard a bit; it turns out that some shaders now have >100 instructions (water surface geometry deformation, complex reflections with Fresnel and similar high-tech mumbo-jumbo :).

http://www.beyond3d.com/interviews/dean/index.php?p=05

To be frank, I don't have the slightest idea what really goes on in SS2, and the downside is that most sites that use it disable any form of compression, which taxes any GPU quite a bit and not necessarily for an important reason. However, I'm still scratching my head over those kinds of results:

http://www.xbitlabs.com/articles/video/display/radeon-x1900xtx_25.html

Theoretically, ATI's VS throughput with complex vertex shaders is a lot higher than on GeForces.
 
Ailuros said:
What is it then I'm probably misinterpreting here?
Well, I dunno. :p

I think that, overall, rendering in smaller chunks and caching those chunks on chip is a win when you're strapped for VRAM bandwidth. I grouped that under "tiling" because that made sense to me, but maybe it has too loaded a definition. The smaller chunks could be portions of render targets or a bundle of streams or whatever.

Every system has its strengths and weaknesses. I think efforts like tiling fit bandwidth constrained scenarios better than the current system does.
 
Jawed said:
There is another form of compression available: ATI1N, for single-channel textures.

Jawed
And how is that related to using the compressed Z-buffer as a texture? It's read-only, lossy, and low precision and therefore completely useless for shadow maps.
 
Ailuros said:
For one, a TBDR can remove that kind of redundancy, and with D3D10 and conditional rendering I believe even more so.
Conditional rendering builds on top of occlusion queries, which are much harder to implement efficiently in TBDRs than in IMRs.
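For reference, this is roughly what the D3D10 usage looks like from the API side (a minimal sketch; the cheap proxy and the expensive draws are left out). The point is that the expensive draw is predicated on the result of an occlusion test that an IMR knows almost immediately, while a TBDR doesn't know it until the scene's tiles are actually processed.

Code:
#include <d3d10.h>

void DrawWithPredication(ID3D10Device* device) {
    D3D10_QUERY_DESC desc = {};
    desc.Query = D3D10_QUERY_OCCLUSION_PREDICATE;

    ID3D10Predicate* predicate = nullptr;
    device->CreatePredicate(&desc, &predicate);

    predicate->Begin();
    // ... draw a cheap proxy (e.g. a bounding box) here ...
    predicate->End();

    // Subsequent draws are discarded if the proxy left no samples visible.
    device->SetPredication(predicate, FALSE);
    // ... expensive draw calls here ...
    device->SetPredication(nullptr, FALSE);

    predicate->Release();
}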
 
Xmas said:
And how is that related to using the compressed Z-buffer as a texture?
I wasn't suggesting it was!

It's read-only, lossy, and low precision and therefore completely useless for shadow maps.
My mistake, being single-channel it seemed useful :oops:

Jawed
 
Question.

Would it be feasible (or desirable, considering cost) to have a secondary Xenos-like chip that can supply the main GPU with things like cubemaps and shadowmaps, rendered of course in embedded memory, thus preserving main video RAM bandwidth?

Of course, this would probably need to be specifically programmed for, so it'd fit better on a console (cost would be a concern, of course).
 
I find it strange that in 3 pages there is no mention of QDR. Micron has made QDR static RAM for some years now, and there was R&D being done on QDR SDRAM. This would lower the amount of traces compared to a 512-bit memory bus.
 
{Sniping}Waste said:
I find it strange that in 3 pages there is no mention of QDR. Micron has made QDR static RAM for some years now, and there was R&D being done on QDR SDRAM. This would lower the amount of traces compared to a 512-bit memory bus.
"QDR" in Micron's technology was achieved by having one read and one write port. Each port was DDR, in total achieveing 2 read + 2 write = 4 transfers per clock cycle. It had no bandwidth-per-pin advantage over DDR; its advantage was solely that it eliminated bus turnaround times.

"True" QDR, as implemented in the Pentium4 front-side-bus, requires more clock lines for a given amount of bandwidth than DDR and is therefore not considered particularly attractive. Besides, XDR already goes beyond QDR with 8 (!) transfers per clock cycle.
 
arjan de lumens said:
Conditional rendering builds on top of occlusion queries, which are much harder to implement efficiently in TBDRs than in IMRs.

It's a D3D10 requirement though, isn't it? If so, then there's no way around it for those kinds of architectures.
 
Inane_Dork said:
Well, I dunno. :p

I think that, overall, rendering in smaller chunks and caching those chunks on chip is a win when you're strapped for VRAM bandwidth. I grouped that under "tiling" because that made sense to me, but maybe it has too loaded a definition. The smaller chunks could be portions of render targets or a bundle of streams or whatever.

Every system has its strengths and weaknesses. I think efforts like tiling fit bandwidth constrained scenarios better than the current system does.

Radeons have had tiled back buffers for eons, as most of us here know.

I have the impression that you're suggesting rather the "viewports"/macro-tiles Xenos uses in order to fit 4xMSAA at 720p into its 10MB of eDRAM. Here the developers will obviously decide when and how they'll split into those "viewports", or else fall back to 2x or 1xAA after all.

I've already read in another thread suggestions for eDRAM on PC GPUs with no more eDRAM than Xenos has, but even more than 3 "viewports" overall. From a certain point on, it can work more against you than show actual benefits. Resubmitting geometry to the N-th degree sounds rather like pure nonsense.

It should be bleedingly obvious that I have nothing against tiling or Tilers in general; au contraire.

If we'd need larger caches (and not something close to a framebuffer like on Xenos), I'd rather believe Demirug's thoughts about Z-RAM to be better suited for such tasks.
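The arithmetic behind Xenos's 3 "viewports" is straightforward, assuming 32-bit colour plus 32-bit Z/stencil per sample:

Code:
#include <cstdio>

int main() {
    const double pixels  = 1280.0 * 720.0;             // 720p
    const double samples = 4.0;                        // 4xMSAA
    const double bytes   = pixels * samples * (4 + 4); // colour + Z per sample
    const double edram   = 10.0 * 1024 * 1024;         // 10MB of eDRAM

    std::printf("footprint: %.1f MB -> %d tiles\n",
                bytes / (1024 * 1024),
                (int)((bytes + edram - 1) / edram)); // ~28.1 MB -> 3 tiles
}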
 
The extent to which you have to resubmit geometry really depends on the culling. I think that by the time we have pixel-level geometry almost all of the time, and have solved all the aliasing problems involved, performing efficient hierarchical culling down to tile level won't be a problem anymore either, for practical tile sizes.
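A minimal sketch of what that hierarchical culling could look like (all types invented for illustration): a node's projected bounds are tested against a tile, so whole subtrees are skipped rather than resubmitted, and per-tile cost scales with overlap instead of total scene size.

Code:
#include <vector>

struct Rect { int x0, y0, x1, y1; };

struct Node {
    Rect bounds;                 // projected screen-space bounds
    std::vector<Node> children;  // leaves would carry draw ranges
};

bool Overlaps(const Rect& a, const Rect& b) {
    return a.x0 < b.x1 && b.x0 < a.x1 && a.y0 < b.y1 && b.y0 < a.y1;
}

// Returns how many leaves a given tile actually has to process.
int CullToTile(const Node& n, const Rect& tile) {
    if (!Overlaps(n.bounds, tile)) return 0; // skip the whole subtree
    if (n.children.empty()) return 1;        // leaf: submit its geometry
    int count = 0;
    for (const Node& c : n.children) count += CullToTile(c, tile);
    return count;
}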
 
Ailuros said:
5.) TBDR (who on earth expected me to say otherwise? :D )

Didn't TBDR die a long time ago? I thought TBDR would not scale well with increasing triangle counts.
 