Future solution to memory bandwidth

Could anyone in the know please elaborate on my question about the usability of (second gen?) XDR in the upcoming GFX-gen?
 
Xmas said:
How can you early-exit the Vertex Shader for culled primitives when you don't store the primitives?
You can exit once the position information is calculated. A smart shader compiler in such an architecture would just perform all operations required to calculate the position first.

So while early-exit might help a TBDR in that it could actually report those primitives that were culled during the first pass so that you don't even have to calculate the position for the second pass, you're not going to end up doing less vertex work total.
 
the maddman said:
Quick question: How much memory bandwidth would it save if we didn't have to refresh the DRAM? Or does GDDR do that internally?
AFAIK, about 5% - that is, under normal operation, a DRAM is unavailable for ordinary memory accesses due to refresh about 5% of the time. Hiding refreshes from the outside world is possible, but adds considerably to the cost and complexity of the DRAM chip and is not normally done unless hard real-time requirements are more important to you than ordinary price/performance considerations. In a multi-banked DRAM, it should be possible to refresh one bank while performing normal accesses to other banks, thus reducing the bandwidth loss; I don't know if any current RAM type actually does that.
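For the curious, the sort of percentage quoted here falls straight out of two datasheet figures: how often a refresh command has to be issued, and how long each one keeps the device busy. A back-of-envelope sketch in Python, with assumed generic-DRAM timings (not taken from any GDDR datasheet):

Code:
# Rough refresh-overhead estimate. All timing values are illustrative assumptions
# for a generic DDR-era device, not figures for any particular graphics DRAM.
T_REFW = 64e-3       # refresh window: every row must be refreshed within 64 ms
N_REFRESH = 8192     # auto-refresh commands issued per window
t_refi = T_REFW / N_REFRESH              # average refresh interval, ~7.8 us

for t_rfc in (70e-9, 200e-9, 300e-9):    # assumed per-refresh busy times (tRFC)
    overhead = t_rfc / t_refi            # fraction of time the device is unavailable
    print(f"tRFC = {t_rfc*1e9:.0f} ns -> {overhead*100:.1f}% bandwidth lost to refresh")
# Prints roughly 0.9%, 2.6% and 3.8% - i.e. the low single digits quoted in the thread.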
 
Chalnoth said:
You can exit once the position information is calculated. A smart shader compiler in such an architecture would just perform all operations required to calculate the position first.
That sounds completely busted. Once you have computed the positions of ALL the vertices in a polygon, you have enough information to perform culling - a single vertex does NOT contain enough information to determine its own cull status (especially if it is backface culling you wish to do). At that point, you have either stashed away the vertex shader execution state somewhere, in which case you can resume executing the remaining part (=high hardware complexity), or you haven't, in which case you have to re-run the entire shader from scratch (=hardware to actually run the shader twice). You cannot do this kind of early-exit based on compiler tricks alone without some sort of hardware support to back it up.
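To make the cull-status point concrete: the standard backface test is a signed-area (winding) check on the whole projected triangle, so no amount of compiler reordering lets a lone vertex decide its own fate. A minimal sketch; the winding convention and data layout are my own assumptions, not anything from the posts above:

Code:
# A backface test needs the screen-space positions of all three vertices of a
# triangle, i.e. the position part of the vertex shader must already have run
# for every one of them before any culling decision can be made.

def signed_area_2d(p0, p1, p2):
    # Twice the signed area of the projected triangle; positive means
    # counter-clockwise winding (assumed here to be front-facing).
    return (p1[0] - p0[0]) * (p2[1] - p0[1]) - (p2[0] - p0[0]) * (p1[1] - p0[1])

def is_backfacing(p0, p1, p2):
    return signed_area_2d(p0, p1, p2) <= 0.0

# A lone transformed vertex such as (120.0, 45.0) cannot be classified at all:
print(is_backfacing((0, 0), (10, 0), (0, 10)))   # False - CCW, front-facing
print(is_backfacing((0, 0), (0, 10), (10, 0)))   # True  - CW, back-facing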
 
Mintmaster said:
For the iterator storage in the second method, remember that plain old bump mapping requires three vec3's for normal/tangent, one for light, one for eye, and a vec2 for texcoords. That's the bare minimum, and once you throw in multiple lights, halfway vectors for specular lighting, attenuation factors, and more textures like lightmaps and shadowmaps, then the DX9 limit is not enough. If you remember the PS3.0 patch that FarCry had, one of the big reasons for it was the 10 iterators that helped to avoid multipassing (though that's a pretty lame reason for calling it a PS3.0 shader). Even though PS2.0 already had 8, they could use those extra 2. So even without going into extravagant algorithms, developers will easily need >100 bytes per vertex between the VS and PS.
OK - I wasn't really aware that requirements were that severe in modern games.
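A quick tally of the bare-minimum case Mintmaster lists, assuming 32-bit float components throughout (a back-of-envelope of my own, not figures from the post):

Code:
# Bare-minimum interpolants for tangent-space bump mapping with one light,
# assuming float32 components.
FLOAT = 4  # bytes per component

interpolants = {
    "tangent":   3,  # vec3 \
    "binormal":  3,  # vec3  > tangent-space basis ("three vec3's for normal/tangent")
    "normal":    3,  # vec3 /
    "light_vec": 3,  # vec3
    "eye_vec":   3,  # vec3
    "texcoord":  2,  # vec2
}

bytes_per_vertex = sum(n * FLOAT for n in interpolants.values())
print(bytes_per_vertex)  # 68 bytes already, before extra lights, half vectors,
                         # attenuation, lightmap/shadowmap coordinates, etc.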
Regarding compression, you're going to need block-based access to keep your memory efficiency high anyway, so it's not really a con for compression.
It's not so much the block-based nature of the memory accesses I am worried about, but the need to perform read-modify-write in situations where you could make the traffic write-only if compression were not present. Let's say you have a compressed block of color data. Now, just to be evil, I will write a single - opaque - pixel into that block. Without compression, I would just write it and be done with it, using byte enables to avoid disturbing surrounding pixels. With compression, on the other hand, I cannot know where in the compressed block I am supposed to write the pixel; to find that out, I must read the block and decompress it. After it has been decompressed, I write my evil pixel into it, and presumably the block gets re-compressed before being shipped back off-chip.
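A toy illustration of that read-modify-write point, using a made-up stand-in codec rather than any real GPU block format:

Code:
# Uncompressed: one masked write at a fixed, known offset - no read needed.
# Compressed: the block has to be fetched and decompressed before the pixel's
# place inside it is even known, i.e. a forced read-modify-write.
# One byte per pixel and an identity "codec" keep the sketch trivially runnable.

def write_pixel_uncompressed(mem, block_base, pixel_index, value):
    mem[block_base + pixel_index] = value                  # write-only traffic

def write_pixel_compressed(mem, block_base, block_len, pixel_index, value,
                           decompress, compress):
    raw = bytes(mem[block_base:block_base + block_len])    # read (the extra traffic)
    pixels = decompress(raw)                               # locate the pixel
    pixels[pixel_index] = value                            # modify
    packed = compress(pixels)                              # re-pack
    mem[block_base:block_base + len(packed)] = packed      # write back

mem = bytearray(32)
write_pixel_uncompressed(mem, 0, 3, 0xFF)
write_pixel_compressed(mem, 16, 16, 3, 0xFF,
                       decompress=bytearray, compress=bytes)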
Okay, I guess you've convinced me that TBDR is more feasible than I originally thought. If you have a fast enough vertex shader, then the first method should minimize the memory costs, which were my biggest concern. I don't think the cons you mentioned are that big. At worst you could resolve the final framebuffer early for lock/readback, and occlusion queries could be solved by keeping the Z-buffer in external memory and/or using a Hi-Z buffer (say, 1/16th resolution) like ATI.
The main issue with framebuffer reads/locks and occlusion queries is that when used, they tend to break functional parallelism between vertex and pixel processing in TBDRs, in addition to triggering lengthy rendering processes and forcing you to wait until those processes are completed. They aren't unimplementable, just painfully slow. You can to some extent optimize framebuffer reads and occlusion queries (but not general locks) by only forcing rendering of the necessary sections of the framebuffer, but this gets very complex rather quickly and still interferes with ordinary rendering.
So what exactly is keeping ImgTec from releasing a high end TBDR? Resources? Risk? Graphics enthusiasts are a well informed bunch, and if you put out a superior product for a lower cost, then it will sell.
I don't know, I would guess at resources. ImgTec seemed to have something going for them with the Kyro line (Kyro1 was rather lacklustre at the time of its release, but Kyro2 was quite impressive), but ST still dropped them like a hot potato, indicating that something somewhere went very, very wrong - of course, unless someone from ImgTec or ST comes forth and spills the beans, we will probably never know what that might have been. It appears to be rather clear, though, that ImgTec doesn't have the resources to enter a full-scale war against the NV/ATI desktop hegemony without someone else very large backing them up. (While they've enjoyed some success with MBX in the mobile segment, there isn't really that much money around to collect there anymore.)
 
Chalnoth said:
You can exit once the position information is calculated. A smart shader compiler in such an architecture would just perform all operations required to calculate the position first.
I thought my emphasis on "vertex" and "primitive" was clear... well, as arjan already explained, you cannot cull a vertex until you have the position of all other vertices that vertex is connected to. And vertices are usually part of multiple triangles (six is a common number for a rectangular grid). In fact, since post-transform vertex cache is limited, vertices regularly get transformed twice or more.
A deferred renderer has the potential to transform every vertex exactly once, and do all the rest of the VS only for vertices that actually belong to visible triangles.
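A minimal sketch of the flow Xmas describes, with hypothetical function names and data structures; the only point is that the position pass runs exactly once per vertex, while the remaining attribute work runs only for vertices still referenced after culling:

Code:
def deferred_vertex_shading(vertices, triangles, shade_position, shade_attributes,
                            is_visible):
    # Phase 1: position-only part of the vertex shader, exactly once per vertex.
    positions = [shade_position(v) for v in vertices]

    # Binning/culling works on whole primitives (backface, frustum, HSR, ...).
    visible = [t for t in triangles
               if is_visible(positions[t[0]], positions[t[1]], positions[t[2]])]

    # Phase 2: the rest of the vertex shader, only for vertices that survived.
    needed = {i for t in visible for i in t}
    attributes = {i: shade_attributes(vertices[i]) for i in needed}
    return positions, visible, attributes

# Toy usage with stand-in shaders:
verts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
tris = [(0, 1, 2), (1, 3, 2)]
deferred_vertex_shading(verts, tris,
                        shade_position=lambda v: v,
                        shade_attributes=lambda v: {"uv": v},
                        is_visible=lambda a, b, c: True)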

So while early-exit might help a TBDR in that it could actually report those primitives that were culled during the first pass so that you don't even have to calculate the position for the second pass, you're not going to end up doing less vertex work total.
Calculate the position for the second pass? I'm not sure if I understand you correctly here, but why would you want to calculate anything from the first pass again (except if the calculation is so simple that it takes less time than loading from memory)?

arjan de lumens said:
It's not so much the block-based nature of the memory accesses I am worried about, but the need to perform read-modify-write in situations where you could make the traffic write-only if compression were not present. Let's say you have a compressed block of color data. Now, just to be evil, I will write a single - opaque - pixel into that block. Without compression, I would just write it and be done with it, using byte enables to avoid disturbing surrounding pixels. With compression, on the other hand, I cannot know where in the compressed block I am supposed to write the pixel; to find that out, I must read the block and decompress it. After it has been decompressed, I write my evil pixel into it, and presumably the block gets re-compressed before being shipped back off-chip.
For a simple "on/off" block compression for multisampling, you always know where a pixel value is supposed to be written to.
 
Xmas said:
I thought my emphasis on "vertex" and "primitive" was clear... well, as arjan already explained, you cannot cull a vertex until you have the position of all other vertices that vertex is connected to. And vertices are usually part of multiple triangles (six is a common number for a rectangular grid). In fact, since post-transform vertex cache is limited, vertices regularly get transformed twice or more.
A deferred renderer has the potential to transform every vertex exactly once, and do all the rest of the VS only for vertices that actually belong to visible triangles.
Hmm, yes, I guess it would be a bit hard to implement this in hardware. So a TBDR does have some advantage there.

Calculate the position for the second pass? I'm not sure if I understand you correctly here, but why would you want to calculate anything from the first pass again (except if the calculation is so simple that it takes less time than loading from memory)?
If everything were stored, there would be no point in doing a second pass at all. This is certainly one possible way of doing things (and the way it's done in IMG Tech's stuff).
 
Chalnoth said:
If everything were stored, there would be no point in doing a second pass at all.
Not everything, just live temp registers at the point where position data is available. Though I guess most of the time there are none because few vertex shaders do some fancy position calculation.
 
I don't know, I would guess at resources. ImgTec seemed to have something going for them with the Kyro line (Kyro1 was rather lacklustre at the time of its release, but Kyro2 was quite impressive), but ST still dropped them like a hot potato, indicating that something somewhere went very, very wrong - of course, unless someone from ImgTec or ST comes forth and spills the beans, we will probably never know what that might have been. It appears to be rather clear, though, that ImgTec doesn't have the resources to enter a full-scale war against the NV/ATI desktop hegemony without someone else very large backing them up. (While they've enjoyed some success with MBX in the mobile segment, there isn't really that much money around to collect there anymore.)

A large semiconductor manufacturer invests relatively small amounts yet expects to capture the graphics market overnight. IMHO of course. Did anyone tell ST Micro's upper management that in order to succeed, it's going to take a couple of years in the red?

Besides that, I recall ST Micro's official excuse (since they had in fact licensed Series3/4/5, but released only Series3 GPUs): some sort of financial crisis, in which the departments that made the lowest revenue were shut down. I also recall an ex-employee from ST's graphics department stating that the resources diverted to it were so sparse that it was virtually impossible to expect anything serious to come out of there.

MBX might be what one would call their main cash cow, but it's not their only source of revenue. The DAB market in collaboration with Frontier Silicon (which IMG owns stock in, afaik) is also doing well so far. What they need is a really phat(tm) contract with low-cost, yet high-volume products.
 
Well, from a business point of view, is it really worth risking all of that cash, if you can be profitable now instead?
 
Chalnoth said:
Well, from a business point of view, is it really worth risking all of that cash, if you can be profitable now instead?

No it isn't, but no company could ever expect to turn a profit in a new market from day one. I tend more to believe that ST Micro's initial decision to enter the graphics market wasn't very well thought out from the beginning. I read a comment back then that ST had spare fab capacity at TSMC and wanted to take advantage of it; exaggeration or not, if true that's obviously pure nonsense.
 
arjan, I'm aware of what you originally were referring to about z-and colour compression, and read-modify-write. I'm just saying that even if you didn't have compression, you'd probably do nearly the same thing. The memory controller will load a block of pixels simply because it has a minimal cost over loading a single pixel's data. There are read and write caches even for an IMR, and I doubt issuing single pixel writes, for example, will save much effective bandwidth due to the nature of GDDR3.
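A quick access-granularity check behind that remark; the bus width and burst length here are assumptions for a single GDDR3-class channel, not quotes from a datasheet:

Code:
# Minimum access granularity = channel width x burst length.
channel_width_bits = 32   # assumed: one 32-bit GDDR3 device/channel
burst_length = 4          # assumed: GDDR3-class burst of 4 (8 on later parts)

min_access_bytes = channel_width_bits // 8 * burst_length
print(min_access_bytes)   # 16 bytes per access, vs. 4 bytes for one 32-bit pixel -
                          # a lone pixel write wastes most of the burst anyway.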

arjan de lumens said:
The main issue with framebuffer reads/locks and occlusion queries is that when used, they tend to break functional parallelism between vertex and pixel processing in TBDRs, in addition to triggering lengthy rendering processes and forcing you to wait until those processes are completed. They aren't unimplementable, just painfully slow.
I'd be very surprised if very many modern games do framebuffer locks/reads, so even if it is really slow, the cases where it shows up should be few and far between, and it shouldn't affect sales. Regarding parallelism, a unified shader pipeline would help, would it not? It's interesting how this discussion is making all the parts of Xenos seem to fit together very well.

arjan de lumens said:
I don't know, I would guess at resources. ImgTec seemed to have something going for them with the Kyro line (Kyro1 was rather lacklustre at the time of its release, but Kyro2 was quite impressive), but ST still dropped them like a hot potato, indicating that something somewhere went very, very wrong - of course, unless someone from ImgTec or ST comes forth and spills the beans, we will probably never know what that might have been.
I always figured it would be just too hard to get hardware vertex processing into the architecture, given its conspicuous absence in the Kyro at a time when that was all the rage. Discussing it with you, though, makes me less sure.
 
Xmas said:
For a simple "on/off" block compression for multisampling, you always know where a pixel value is supposed to be written to.
Huh? How can the block be laid out in memory so that when I write that pixel (with NO knowledge of the prior contents of the block; I want to avoid reading it), it is guaranteed not to damage the rest of the block?
 
Mintmaster said:
I always figured it would be just too hard to get hardware vertex processing into the architecture, given its conspicuous absence in the Kyro at a time when that was all the rage. Discussing it with you, though, makes me less sure.

The Naomi arcade systems developed by PVR, and the vaporware Series4, had HW T&L units. In fact, for whatever reason, either ST Micro alone or ST and IMG together decided that a higher-clocked KYRO at 175MHz (KYRO2) would make more sense from a price/performance perspective than the originally planned KYRO3@166MHz (4 pipelines, DDR, T&L). KYRO3 was postponed for a die shrink and then targeted at a 250MHz clockspeed (which never made it either).

IMO a stupid idea; despite the possibly higher price of a K3@166MHz/4 pipes vs. a K2@175MHz/2 pipes, the former would obviously have had much higher performance and longevity than the latter.
 
Xmas said:
Not everything, just live temp registers at the point where position data is available. Though I guess most of the time there are none because few vertex shaders do some fancy position calculation.
I think you're mixing some things up. The two-pass method, and arjan's proposed weeding out of culled triangles during the second pass, only apply to the first method (as per my post above). That method doesn't store anything, but I guess you've just created a new proposal that is sort of a hybrid.

arjan de lumens said:
That sounds completely busted. Once you have computed the positions of ALL the vertices in a polygon, you have enough information to perform culling - a single vertex does NOT contain enough information to determine its own cull status (especially if it is backface culling you wish to do). At that point, you have either stashed away the vertex shader execution state somewhere, in which case you can resume executing the remaining part (=high hardware complexity), or you haven't, in which case you have to re-run the entire shader from scratch (=hardware to actually run the shader twice). You cannot do this kind of early-exit based on compiler tricks alone without some sort of hardware support to back it up.
Obviously there's hardware support to do this. If they have to re-run the shader, then that's not a big deal. It's almost exactly like a cache miss. This should only happen during culled/unculled transitions, such as along the silhouettes of an object, or where an object crosses the view frustum or a clip plane. Remember that the hardware works on batches (not the same size as pixel batches, but 6-8 is very possible), and all will be executing the same instruction. How exactly the hardware handles culled/unculled primitives sharing vertices between batches is up to the architect.

ATI optimization document, page 38: "Export computed vertex position as early as possible"
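A rough sketch of the batch behaviour described above; the batch size and control flow are assumptions for illustration, not a description of any actual chip:

Code:
# Each batch first runs only the position portion of the shader. If every
# triangle touching the batch turns out to be culled, the batch exits early;
# otherwise the full shader is (re)run for the batch - the "cache miss" case,
# expected mostly at culled/unculled transitions such as silhouettes.

BATCH_SIZE = 8  # assumed; the post suggests something like 6-8

def shade_in_batches(vertices, run_position_part, run_full_shader,
                     batch_is_fully_culled):
    results = []
    for start in range(0, len(vertices), BATCH_SIZE):
        batch = vertices[start:start + BATCH_SIZE]
        positions = [run_position_part(v) for v in batch]     # cheap first portion
        if batch_is_fully_culled(positions):
            results.extend(positions)                         # early exit: positions only
        else:
            results.extend(run_full_shader(v) for v in batch) # full re-run from scratch
    return results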
 
Mintmaster said:
arjan, I'm aware of what you originally were referring to about z-and colour compression, and read-modify-write. I'm just saying that even if you didn't have compression, you'd probably do nearly the same thing. The memory controller will load a block of pixels simply because it has a minimal cost over loading a single pixel's data. There are read and write caches even for an IMR, and I doubt issuing single pixel writes, for example, will save much effective bandwidth due to the nature of GDDR3.
My point with the evil-pixel example was that compression would seemingly force me to do a full block read of the color buffer, whereas without compression I would not need to do any color reads at all, only a write. While I have been told a couple of times that this is not the case, I have to ask: how do you lay out the compressed data in memory so that I can safely write the evil pixel into it without needing to examine/decompress the compressed data?
I'd be very surprised if very many modern games do framebuffer locks/reads, so even if it is really slow, the cases where it shows up should be few and far between, and it shouldn't affect sales. Regarding parallelism, a unified shader pipeline would help, would it not? It's interesting how this discussion is making all the parts of Xenos seem to fit together very well.
OK. Full framebuffer locks seem to be massively unpopular with Nvidia, ATI and the DirectX guys alike, so it's probably "only" a legacy issue; games that really need it are presumably so old that the speed hit, though massive, would be tolerable for an otherwise high-end TBDR, and full-frame reads (without the lock, such as e.g. glReadPixels() under OpenGL) aren't particularly useful for anything other than snapshots and conformance tests, so performance should be mostly a non-issue here too. At that point, you are left with smaller reads (which are annoying, but manageable) and occlusion queries/conditional rendering (which need additional smart tricks beyond those of basic TBDR if you want them to help rather than harm performance).

The extremely high vertex performance that a unified shader is capable of would help quite a bit in this case; if you have just flushed your scene buffer to provide a framebuffer read, then you can max out 100% of your pipelines on vertex processing until you have collected the full next scene - which is much better than having the vertex shader working at 100% while the pixel shader pipes sit completely idle.
I always figured it would be just too hard to get hardware vertex processing into the architecture, given its conspicuous absence in the Kyro at a time when that was all the rage. Discussing it with you, though, makes me less sure.
IIRC, PowerVR released the ELAN T&L chip in pretty much the same time-frame; I have no idea why that technology was not included in the Kyro series.

There are other issues that always puzzled me a bit about Kyro as well, in particular the lack of cube mapping. It had a very complete DirectX6 feature set, with no apparent DirectX7-class features - as if it was targeted at a release ~1 year earlier than its actual release date, or something odd like that.
 
Ailuros said:
IMO a stupid idea; despite the possibly higher price of a K3@166MHz/4 pipes vs. a K2@175MHz/2 pipes, the former would obviously have had much higher performance and longevity than the latter.
While stupid in hindsight, there are a couple of issues to consider here:
  • The Kyro architecture was not proven at that point; the Kyro1 flopped badly before the release of Kyro2.
  • A Kyro3 specced like that in that timeframe would have been a quite expensive high-end chip, presumably even more expensive than the Geforce2 Ultra, meaning that it would not be able to sell into the high-volume budget market that the Geforce2MX was so successful in. The final Kyro2 did eventually end up selling into that market, although one could well argue that it was hurt in the market by not having a flagship product above it.
 
the maddman said:
Quick question: How much memory bandwidth would it save if we didn't have to refresh the DRAM? Or does GDDR do that internally?

Refresh overheads are generally in the couple of percent range. Not really a big deal.

Aaron Spink
speaking for myself inc.
 
arjan de lumens said:
AFAIK, about 5% - that is, under normal operation, a DRAM is unavailable for ordinary memory accesses due to refresh about 5% of the time. Hiding refreshes from the outside world is possible, but adds considerably to the cost and complexity of the DRAM chip and is not normally done unless hard real-time requirements are more important to you than ordinary price/performance considerations. In a multi-banked DRAM, it should be possible to refresh one bank while performing normal accesses to other banks, thus reducing the bandwidth loss; I don't know if any current RAM type actually does that.

Most modern DRAM technologies support both multi-bank refresh and off-bank refresh. A refresh is just like a PRE-RAS sequence for all purposes. DRAM refresh overhead is in the range of 2-3% at most, and if it becomes a problem, there are numerous ways to lessen it. The net effect of refresh is less than read-write turnaround losses.

Aaron Spink
speaking for myself inc.
 
arjan de lumens said:
While stupid in hindsight, there are a couple of issues to consider here:
  • The Kyro architecture was not proven at that point; the Kyro1 flopped badly before the release of Kyro2.
  • A Kyro3 specced like that in that timeframe would have been a quite expensive high-end chip, presumably even more expensive than the Geforce2 Ultra, meaning that it would not be able to sell into the high-volume budget market that the Geforce2MX was so successful in. The final Kyro2 did eventually end up selling into that market, although one could well argue that it was hurt in the market by not having a flagship product above it.

K2 was released in early 2001. Not only had the GF2U's price dropped by that point, but it also had much higher-specced RAM. I wouldn't have expected anything beyond 166/166MHz (DDR); more something like mainstream pricing than high end.
 