Xmas said:
> How can you early-exit the Vertex Shader for culled primitives when you don't store the primitives?

You can exit once the position information is calculated. A smart shader compiler in such an architecture would just perform all operations required to calculate the position first.
the maddman said:
> Quick question: How much memory bandwidth would it save if we didn't have to refresh the DRAM? Or does GDDR do that internally?

AFAIK, about 5% - that is, under normal operation, a DRAM is unavailable for ordinary memory accesses due to refresh about 5% of the time. Hiding refreshes from the outside world is possible, but adds considerably to the cost and complexity of the DRAM chip and is not normally done unless hard real-time requirements are more important to you than ordinary price/performance considerations. In a multi-banked DRAM, it should be possible to refresh one bank while performing normal accesses to other banks, thus reducing the bandwidth loss; I don't know if any current RAM type actually does that.
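A figure in that ballpark can be sanity-checked with a back-of-the-envelope calculation. The numbers below are purely illustrative (not taken from any specific DRAM datasheet): a device that must refresh 8192 rows within a 64 ms retention period issues a refresh roughly every 7.8 µs, and if each refresh locks the device out for about 390 ns, roughly 5% of all time is lost.

```python
# Back-of-the-envelope DRAM refresh overhead.
# All numbers are illustrative, not from any specific datasheet.
rows = 8192                # rows to refresh per retention period
retention = 64e-3          # every row must be refreshed within 64 ms
t_rfc = 390e-9             # time the device is busy per refresh command

t_refi = retention / rows  # average interval between refresh commands
overhead = t_rfc / t_refi  # fraction of time unavailable for accesses

print(f"refresh every {t_refi*1e6:.2f} us, overhead = {overhead:.1%}")
# refresh every 7.81 us, overhead = 5.0%
```

With a shorter per-refresh busy time (as on many real parts), the same formula gives an overhead well under 5%, which is why hiding refresh behind accesses to other banks is attractive.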
Chalnoth said:
> You can exit once the position information is calculated. A smart shader compiler in such an architecture would just perform all operations required to calculate the position first.

That sounds completely busted. Only once you have computed the positions of ALL the vertices in a polygon do you have enough information to perform culling - a single vertex does NOT contain enough information to determine its own cull status (especially if it is backface culling you wish to do). At that point, you have either stashed away the vertex shader execution state somewhere, in which case you can resume executing the remaining part (=high hardware complexity), or you haven't, in which case you have to re-run the entire shader from scratch (=hardware to actually run the shader twice). You cannot do this kind of early-exit based on compiler tricks alone, without some sort of hardware support to back it up.
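The point that cull status is a property of the triangle, not of any single vertex, follows directly from how backface culling is computed: it is the sign of the signed area of the screen-space triangle, which needs all three positions. A minimal sketch:

```python
def is_backfacing(p0, p1, p2):
    """Backface test on screen-space (x, y) positions.

    The signed area needs all three vertices; no single vertex
    carries enough information to decide the triangle's fate.
    """
    ax, ay = p1[0] - p0[0], p1[1] - p0[1]
    bx, by = p2[0] - p0[0], p2[1] - p0[1]
    signed_area = ax * by - ay * bx   # z of the 2D cross product
    return signed_area <= 0           # convention: CCW = front-facing

# A front-facing (counter-clockwise) triangle...
print(is_backfacing((0, 0), (1, 0), (0, 1)))  # False
# ...becomes back-facing if you swap two vertices (clockwise winding).
print(is_backfacing((0, 0), (0, 1), (1, 0)))  # True
```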
Mintmaster said:
> For the iterator storage in the second method, remember that plain old bump mapping requires three vec3's for normal/tangent, one for light, one for eye, and a vec2 for texcoords. That's the bare minimum, and once you throw in multiple lights, halfway vectors for specular lighting, attenuation factors, and more textures like lightmaps and shadowmaps, then the DX9 limit is not enough. If you remember the PS3.0 patch that FarCry had, one of the big reasons for it was the 10 iterators that helped to avoid multipassing (though that's a pretty lame reason for calling it a PS3.0 shader). Even though PS2.0 already had 8, they could use those extra 2. So even without going into extravagant algorithms, developers will easily need >100 bytes per vertex between the VS and PS.

OK - I wasn't really aware that requirements were that severe in modern games.
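The "bare minimum" adds up quickly even before extras: five vec3 interpolants plus a vec2, at four bytes per float, is already 68 bytes per vertex. A quick tally (the minimum counts come from the post above; the "extras" line is just one illustrative combination):

```python
FLOAT_BYTES = 4

# Bare-minimum bump mapping, as tallied above: three vec3s for the
# normal/tangent frame, one vec3 each for the light and eye vectors,
# and a vec2 for texture coordinates.
bare_minimum = (3 * 3 + 3 + 3) * FLOAT_BYTES + 2 * FLOAT_BYTES
print(bare_minimum)  # 68 bytes per vertex

# Illustrative extras: a second light + a halfway vector (2 x vec3),
# an attenuation factor, and a vec4 of shadow-map coordinates.
extras = (2 * 3 + 1 + 4) * FLOAT_BYTES
print(bare_minimum + extras)  # 112 bytes - past the 100-byte mark
```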
> Regarding compression, you're going to need block-based access to keep your memory efficiency high anyway, so it's not really a con for compression.

It's not so much the block-based nature of the memory accesses I am worried about, but the need to perform read-modify-write in situations where the traffic could be write-only if compression were not present. Let's say you have a compressed block of color data. Now, just to be evil, I will write a single - opaque - pixel into that block. Without compression, I would just write it and be done with it, using byte enables to avoid disturbing surrounding pixels. With compression, on the other hand, I cannot just know where in the compressed block I am supposed to write the pixel; to find that out, I must read the block and decompress it. After it has been decompressed, I write my evil pixel into it, and presumably the block gets re-compressed before being shipped back off-chip.
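The asymmetry described here can be sketched in a few lines: without compression a lone pixel write is write-only traffic, while with a compressed block the same write turns into read + decompress + modify + recompress + write. This is a toy model (an "all one colour" block as the stand-in compressor, made up for illustration); only the transaction counts matter:

```python
# Toy framebuffer: blocks of 16 pixels, compressed as ("solid", colour)
# when every pixel matches, else ("raw", [16 colours]). A stand-in for
# a real colour-compression scheme; only the traffic pattern matters.
BLOCK = 16
bus_reads = bus_writes = 0

def write_pixel_uncompressed(mem, i, colour):
    """No compression: byte-enabled write, no read needed."""
    global bus_writes
    mem[i] = colour
    bus_writes += 1

def write_pixel_compressed(block, i, colour):
    """Compressed block: must read+decompress before the write lands."""
    global bus_reads, bus_writes
    kind, data = block               # read the whole block...
    bus_reads += 1
    pixels = [data] * BLOCK if kind == "solid" else list(data)
    pixels[i] = colour               # ...modify one pixel...
    bus_writes += 1                  # ...write the block back.
    return ("raw", pixels)

write_pixel_uncompressed([0] * BLOCK, 5, 0xFF00FF)
write_pixel_compressed(("solid", 0x000000), 5, 0xFF00FF)
print(bus_reads, bus_writes)  # 1 2: the compressed path added a read
```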
> Okay, I guess you've convinced me that TBDR is more feasible than I originally thought. If you have a fast enough vertex shader, then the first method should minimize the memory costs, which were my biggest concern. I don't think the cons you mentioned are that big. At worst you could resolve the final framebuffer early for lock/readback, and occlusion queries could be solved by keeping the Z-buffer in external memory and/or using a Hi-Z buffer (say, 1/16th resolution) like ATI.

The main issue with framebuffer reads/locks and occlusion queries is that when used, they tend to break functional parallelism between vertex and pixel processing in TBDRs, in addition to triggering lengthy rendering processes and forcing you to wait until those processes are completed. They aren't unimplementable, just painfully slow. You can to some extent optimize framebuffer reads and occlusion queries (but not general locks) by only forcing rendering of the necessary sections of the framebuffer, but this gets very complex rather quickly and still interferes with ordinary rendering.
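The Hi-Z buffer idea mentioned above boils down to storing one conservative depth per tile and testing against that before touching the full-resolution Z-buffer. A minimal sketch of the principle, assuming a "smaller passes" depth test (the tile size and the one-row buffer are made up for illustration):

```python
# Coarse Hi-Z: one conservative (maximum) depth per 4-pixel tile.
# With a less-than depth test, a fragment deeper than the tile's max
# cannot possibly pass, so the full-resolution Z read is skipped.
TILE = 4

def build_hiz(zbuffer, width):
    """Max depth per TILE-wide span of a one-row toy Z-buffer."""
    return [max(zbuffer[i:i + TILE]) for i in range(0, width, TILE)]

def coarse_test(hiz, x, z):
    """True if the fragment might be visible (must do the full test)."""
    return z < hiz[x // TILE]

zbuf = [0.2, 0.3, 0.25, 0.4,  0.9, 0.8, 0.95, 0.85]
hiz = build_hiz(zbuf, len(zbuf))
print(hiz)                        # [0.4, 0.95]
print(coarse_test(hiz, 1, 0.7))   # False: definitely occluded
print(coarse_test(hiz, 1, 0.1))   # True: needs the real Z test
```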
> So what exactly is keeping ImgTec from releasing a high end TBDR? Resources? Risk? Graphics enthusiasts are a well informed bunch, and if you put out a superior product for a lower cost, then it will sell.

I don't know, I would guess at resources. ImgTec seemed to have something going for them with the Kyro line (Kyro1 was rather lacklustre at the time of its release, but Kyro2 was quite impressive), but ST still dropped them like a hot potato, indicating that something somewhere went very, very wrong - of course, unless someone from ImgTec or ST comes forth and spills the beans, we will probably never know what that might have been. It appears to be rather clear, though, that ImgTec doesn't have the resources to enter a full-scale war against the NV/ATI desktop hegemony without someone else very large backing them up. (While they've enjoyed some success with MBX in the mobile segment, there isn't really that much money around to collect there anymore.)
Chalnoth said:
> You can exit once the position information is calculated. A smart shader compiler in such an architecture would just perform all operations required to calculate the position first.

I thought my emphasis on "vertex" and "primitive" was clear... well, as arjan already explained, you cannot cull a vertex until you have the positions of all the other vertices that vertex is connected to. And vertices are usually part of multiple triangles (six is a common number for a rectangular grid). In fact, since the post-transform vertex cache is limited, vertices regularly get transformed twice or more.
> So while early-exit might help a TBDR in that it could actually report those primitives that were culled during the first pass so that you don't even have to calculate the position for the second pass, you're not going to end up doing less vertex work total.

Calculate the position for the second pass? I'm not sure I understand you correctly here, but why would you want to calculate anything from the first pass again (unless the calculation is so simple that redoing it takes less time than loading the result from memory)?
arjan de lumens said:
> It's not so much the block-based nature of the memory accesses I am worried about, but the need to perform read-modify-write in situations where you could make the traffic write-only if compression was not present. Let's say you have a compressed block of color data. Now, just to be evil, I will write a single - opaque - pixel into that block. Without compression, I would just write it and get it done with, using byte enables to avoid disturbing surrounding pixels. With compression, on the other hand, I cannot just know where in the compressed block I am supposed to write the pixel; to find that out, I must read the block and decompress it. After it has been decompressed, I write my evil pixel into it, and presumably the block gets re-compressed again before being shipped back off-chip.

For a simple "on/off" block compression for multisampling, you always know where a pixel value is supposed to be written to.
Xmas said:
> I thought my emphasis on "vertex" and "primitive" was clear... well, as arjan already explained, you cannot cull a vertex until you have the position of all other vertices that vertex is connected to. And vertices are usually part of multiple triangles (six is a common number for a rectangular grid). In fact, since post-transform vertex cache is limited, vertices regularly get transformed twice or more.

Hmm, yes, I guess it would be a bit hard to implement this in hardware. So a TBDR does have some advantage there.
A deferred renderer has the potential to transform every vertex exactly once, and do all the rest of the VS only for vertices that actually belong to visible triangles.
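The "transformed twice or more" effect is easy to demonstrate: an immediate-mode renderer's post-transform cache is a small FIFO, so shared vertices of a large indexed mesh fall out of it and get re-shaded, whereas the ideal deferred renderer described above shades each unique vertex once. A small simulation (the FIFO size and the grid mesh are arbitrary choices for illustration):

```python
from collections import deque

def shader_runs(indices, cache_size):
    """Count vertex-shader invocations with a FIFO post-transform cache."""
    cache, runs = deque(maxlen=cache_size), 0
    for idx in indices:
        if idx not in cache:   # cache miss: vertex is (re)transformed
            cache.append(idx)
            runs += 1
    return runs

# Indexed triangle list over a rectangular grid of (N+1) x (N+1)
# vertices: the classic case where shared vertices are revisited
# row after row, long after the cache has evicted them.
N = 16
indices = []
for y in range(N):
    for x in range(N):
        v = y * (N + 1) + x
        indices += [v, v + 1, v + N + 1,  v + 1, v + N + 2, v + N + 1]

unique = len(set(indices))
print(unique, shader_runs(indices, cache_size=16))
# 289 unique vertices, but noticeably more shader runs with the FIFO.
```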
> Calculate the position for the second pass? I'm not sure if I understand you correctly here, but why would you want to calculate anything from the first pass again (except if the calculation is so simple that it takes less time than loading from memory)?

If everything were stored, there would be no point in doing a second pass at all. This is certainly one possible way of doing things (and the way it's done in IMG Tech's stuff).
Chalnoth said:
> If everything were stored, there would be no point in doing a second pass at all.

Not everything, just the live temp registers at the point where position data is available. Though I guess most of the time there are none, because few vertex shaders do any fancy position calculation.
Chalnoth said:
> Well, from a business point of view, is it really worth risking all of that cash, if you can be profitable now instead?
arjan de lumens said:
> The main issue with framebuffer reads/locks and occlusion queries is that when used, they tend to break functional parallelism between vertex and pixel processing in TBDRs, in addition to triggering lengthy rendering processes and forcing you to wait until those processes are completed. They aren't unimplementable, just painfully slow.

I'd be very surprised if very many modern games do framebuffer locks/reads, so even if it is really slow, the cases where it shows up should be few and far between, and it shouldn't affect sales. Regarding parallelism, a unified shader pipeline would help, would it not? It's interesting how this discussion is making all the parts of Xenos seem to fit together very well.
arjan de lumens said:
> I don't know, I would guess at resources. ImgTec seemed to have something going for them with the Kyro line (Kyro1 was rather lacklustre at the time of its release, but Kyro2 was quite impressive), but ST still dropped them like a hot potato, indicating that something somewhere went very, very wrong - of course, unless someone from ImgTec or ST comes forth and spills the beans, we will probably never know what that might have been.

I always figured it would be just too hard to get hardware vertex processing into the architecture, given its conspicuous absence in the Kyro at a time when that was all the rage. Discussing it with you, though, makes me less sure.
Xmas said:
> For a simple "on/off" block compression for multisampling, you always know where a pixel value is supposed to be written to.

Huh? How can the block be laid out in memory so that when I write that pixel (with NO knowledge of the prior contents of the block; I want to avoid reading it), it is guaranteed not to damage the rest of the block?
Xmas said:
> Not everything, just live temp registers at the point where position data is available. Though I guess most of the time there are none because few vertex shaders do some fancy position calculation.

I think you're mixing some things up. The two-pass method, and arjan's proposed weeding out of culled triangles during the second pass, only apply to the first method (as per my post above). That method doesn't store anything, but I guess you've just created a new proposal that is sort of a hybrid.
arjan de lumens said:
> That sounds completely busted. Once you have computed the position status for ALL the vertices in a polygon, then you have enough information to perform culling - a single vertex does NOT contain enough information to determine its own cull status (especially if it is backface culling you wish to do). At that point, you have either stashed away the vertex shader execution state somewhere, in which case you can resume executing the remaining part (=high hardware complexity) or you haven't, in which case you have to re-run the entire shader from scratch (=hardware to actually run the shader twice). You cannot do this kind of early-exit based on compiler tricks alone without some sort of hardware support to back it up.

Obviously there's hardware support to do this. If they have to re-run the shader, then that's not a big deal. It's almost exactly like a cache miss. This should only happen during culled/unculled transitions, such as along the silhouettes of an object, or where an object crosses the view frustum or a clip plane. Remember that the hardware works on batches (not the same size as pixel batches, but 6-8 is very possible), and all will be executing the same instruction. How exactly the hardware handles culled/unculled primitives sharing vertices between batches is up to the architect.
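The "almost exactly like a cache miss" argument can be sketched as a two-stage flow: run every vertex only up to the point where its position is known, cull per triangle, then run the whole shader from scratch only for vertices of surviving triangles. A toy model (the function names and the split into two "halves" are illustrative, not any real hardware):

```python
# Toy model of "early-exit after position, re-run on demand".
# position_part / full_shader stand in for the two ways of running a
# vertex shader; nothing here models any real GPU.
runs = {"position": 0, "full": 0}

def position_part(v):
    runs["position"] += 1
    return v  # pretend: clip-space position only

def full_shader(v):
    runs["full"] += 1
    return v  # pretend: position plus all other outputs

def render(vertices, triangles, is_culled):
    # First, compute positions only, for every vertex.
    positions = {v: position_part(v) for v in vertices}
    # Cull per triangle - which needs all three positions.
    survivors = [t for t in triangles if not is_culled(t, positions)]
    # Re-run the whole shader, from scratch, for surviving vertices.
    needed = {v for t in survivors for v in t}
    return {v: full_shader(v) for v in needed}

verts = range(8)
tris = [(0, 1, 2), (2, 3, 4), (4, 5, 6), (5, 6, 7)]
# Pretend the middle two triangles are back-facing:
render(verts, tris, lambda t, p: t in {(2, 3, 4), (4, 5, 6)})
print(runs)  # position part ran 8 times; full shader only 6 times
```

The cost of re-running the position part for surviving vertices is the price paid for not stashing execution state, which is the trade-off being debated here.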
Mintmaster said:
> arjan, I'm aware of what you originally were referring to about Z- and colour compression, and read-modify-write. I'm just saying that even if you didn't have compression, you'd probably do nearly the same thing. The memory controller will load a block of pixels simply because it has a minimal cost over loading a single pixel's data. There are read and write caches even for an IMR, and I doubt issuing single pixel writes, for example, will save much effective bandwidth due to the nature of GDDR3.

My point with the evil-pixel example was that compression would seemingly force me to do a full block read of the color buffer, whereas without compression I would not need to do any color reads at all, only a write. While I have been told a couple of times that this is not the case, I have to ask: how do you lay out the compressed data in memory so that I can safely write the evil pixel into it without needing to examine/decompress the compressed data?
> I'd be very surprised if very many modern games do framebuffer locks/reads, so even if it is really slow, the cases where it shows up should be few and far between, and it shouldn't affect sales. Regarding parallelism, a unified shader pipeline would help, would it not? It's interesting how this discussion is making all the parts of Xenos seem to fit together very well.

OK. Full framebuffer locks seem to be massively unpopular with Nvidia, ATI and the DirectX guys alike, so it's probably "only" a legacy issue; games that really need it are presumably so old that the speed hit, though massive, would be tolerable for an otherwise high-end TBDR, and full-frame reads (without the lock, such as e.g. glReadPixels() under OpenGL) aren't particularly useful for anything other than snapshots and conformance tests, so performance should be mostly a non-issue here too. At that point, you are left with smaller reads (which are annoying, but manageable) and occlusion queries/conditional rendering (which need additional smart tricks beyond just those of basic TBDR, if you want them to help rather than harm performance).
> I always figured it would be just too hard to get hardware vertex processing into the architecture, given its conspicuous absence in the Kyro at a time when that was all the rage. Discussing it with you, though, makes me less sure.

IIRC, PowerVR released the ELAN T&L chip in pretty much the same time-frame; I have no idea why that technology was not included in the Kyro series.
Ailuros said:
> IMO a stupid idea; despite the possible higher price of a K3@166MHz/4 pipes vs. K2@175MHz/2 pipes, the first would obviously have had much higher performance and longevity than the latter.

While stupid in hindsight, there are a couple of issues to consider here:
- The Kyro architecture was not proven at that point; the Kyro1 flopped badly before the release of Kyro2.
- A Kyro3 specced like that in that timeframe would be a quite expensive high-end chip, presumably costing even more than the Geforce2 Ultra, meaning that it would not be able to sell into the high-volume budget market that the Geforce2MX was so successful in. The Kyro2 did eventually end up selling into that market, although one could well argue that it was hurt there by not having a flagship product above it.