HSR vs Tile-Based rendering?

Simon:

Well, in terms of complexity, yes - though I would have thought a large number of calculations would be involved, simply because that trivial task would have to be done for every single vertex - or am I wrong? (wouldn't be the first time ;) )

Anyway, perhaps you could describe how metagence would help a graphics core better than I :)
 
Dave B(TotalVR) said:
Simon:

Well, in terms of complexity, yes - though I would have thought a large number of calculations would be involved, simply because that trivial task would have to be done for every single vertex - or am I wrong? (wouldn't be the first time ;) )
It would have to be done for every vertex but *2.0 is (mostly) "add 1 to the 8 bit integer exponent". It's utterly insignificant.
Anyway, perhaps you could describe how metagence would help a graphics core better than I :)
Well, to be brutally frank, it's not needed for all the examples you gave. Hyperthreading a la Metagence is really great for complex software which isn't what's going on in most parts of a graphics chip <shrug>.
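To put Simon's point in code (a rough sketch only; it just relies on the IEEE-754 single-precision layout, where the 8-bit exponent sits in bits 23-30 - not how any particular chip does it):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Multiply a normal, finite, non-zero float by 2.0 by bumping its 8-bit
       exponent field. Illustrative only. */
    float times_two(float x)
    {
        uint32_t bits;
        memcpy(&bits, &x, sizeof bits);   /* reinterpret the float's bit pattern */
        bits += 1u << 23;                 /* exponent occupies bits 23..30 */
        memcpy(&x, &bits, sizeof x);
        return x;
    }

    int main(void)
    {
        printf("%f\n", times_two(3.25f)); /* prints 6.500000 */
        return 0;
    }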
 
I don't think it's as simple as that. It probably was in the case of the NV2x and NV3x, but probably not on the NV4x or R3xx. Still, this upscaling should have been a pretty trivial change to the triangle setup engine (you're just doing more interpolations for Z data than you are for the rest).
 
Well, as Simon says (lol), upscaling by 2x is just a one-bit shift of the co-ords of each vertex, and that definitely isn't a complex operation. In that case it must be something else that takes up the real estate, because there is definitely a lot of space reserved for MSAA.
 
Dave B(TotalVR) said:
Well, as Simon says (lol), upscaling by 2x is just a one-bit shift of the co-ords of each vertex, and that definitely isn't a complex operation. In that case it must be something else that takes up the real estate, because there is definitely a lot of space reserved for MSAA.
Such an "upscaling" operation would be part of the viewport transform and, as such, cost nothing at all.

But there's much more to MSAA than that. You need a triangle setup that is sample position aware, so that it generates a "quad thread" for every quad with at least one sample covered. You need a wider coverage mask (with quad based rendering, you have a 4bit mask already), and the datapaths for it. You may need centroid logic. You need more Z-interpolators. You need multiple ROPs, and/or the ability to process additional samples of a pixel in subsequent cycles. You will want to have framebuffer compression, with different compression schemes for different numbers of samples. You might want to support alpha-to-coverage, etc.
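As a toy sketch of the "sample position aware" part (the sample offsets and the plain floating-point edge test are made up for illustration; real setup engines do this incrementally in fixed point):

    #include <stdint.h>

    /* Edge function: a*x + b*y + c >= 0 means the point is on the inside. */
    typedef struct { float a, b, c; } Edge;

    int inside(const Edge e[3], float x, float y)
    {
        for (int i = 0; i < 3; ++i)
            if (e[i].a * x + e[i].b * y + e[i].c < 0.0f)
                return 0;
        return 1;
    }

    /* Build a 4-bit coverage mask for one pixel from four (made-up) sample
       positions. A quad gets a thread if any of its four pixels' masks is non-zero. */
    uint8_t coverage_mask(const Edge e[3], float px, float py)
    {
        const float ox[4] = { 0.375f, 0.875f, 0.125f, 0.625f };
        const float oy[4] = { 0.125f, 0.375f, 0.625f, 0.875f };
        uint8_t mask = 0;
        for (int s = 0; s < 4; ++s)
            if (inside(e, px + ox[s], py + oy[s]))
                mask |= (uint8_t)(1u << s);
        return mask;
    }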
 
Well yes, but we digress. I was suggesting Metagence could be of great use in a GPU - should any future GPU contain the technology - lol.

For the record, I have never received any info official or rumour that Series 5 will contain Metagence. I just reckon it will ;)
 
Chalnoth said:
I don't see why alpha blending need be handled any differently from texture fetches, in terms of latency hiding. That is, I would expect that an architecture could simply use the latency hiding that is used for texture fetches to also hide the latency for alpha blends.
It's the same type of latency hiding technique, but it's a different set of buffering, so it costs extra area. And no, you can't combine the two, as they're required in completely different parts of the pipeline.

Before you start throwing around the latest buzzwords of "dynamically allocated shader pipes": this only allows you to push shader performance where it's required; you still need pretty much the same buffering to cover all cases.
Not really. If you design the pipelines in an optimal fashion, then there's no need for much of any vertex->pixel cache. The main problem is that you'd want the pipelines to be able to switch quickly between vertex and pixel processing. For example: have a pipeline do all vertex processing for a single triangle, then do all pixel shading for that triangle, then start on the next triangle. If an architecture like this could be designed to handle two states at once efficiently, there would be no problem with load balancing.
Hehe, sorry but I have to snigger at you trying to tell me how to design this type of thing, anyway...

You're ignoring the fact that the pipelines themselves are very deep, so the switching involves allowing multiple instances (one instance being either a vertex or a pixel) to be active within the pipeline. This requires you to have storage for each active instance, which means you still have the buffer; some of it has just migrated into the pipeline itself.
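In back-of-the-envelope numbers (all figures invented, purely to show where the "migrated buffer" area comes from):

    #include <stdio.h>

    int main(void)
    {
        const int pipeline_depth    = 200; /* assumed stages, each able to hold a live instance */
        const int regs_per_instance = 32;  /* assumed temporaries per vertex or pixel */
        const int bytes_per_reg     = 16;  /* one 4-component FP32 register */

        long bytes = (long)pipeline_depth * regs_per_instance * bytes_per_reg;
        printf("in-flight state: %ld KB\n", bytes / 1024); /* 100 KB with these numbers */
        return 0;
    }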

...
Anyway, I'm not going to go into the problems any more. I'm sure you can think of a number of other problems with this solution,

Lol, oh really? [massive effort to avoid becoming extremely sarcastic involved here!]

but the questions remain: Can this become more efficient than dedicated vertex and pixel pipelines? Is the additional transistor cost worth it? What additional programming possibilities could this add for developers, and would they make the change more worthwhile?

Actually, a unified shading engine makes a lot of sense from the load balancing standpoint. It doesn't increase programming possibilities in itself, it just gives you the performance where you need it.

John.
 
JohnH said:
It's the same type of latency hiding technique, but it's a different set of buffering, so it costs extra area. And no, you can't combine the two, as they're required in completely different parts of the pipeline.

Let's hope this won't be true for too much longer, though; if this persists into the next DX it would be an outrage IMO ... the present setup is too fugly to be justified by the extent to which it simplifies the hardware, it is legacy cruft.

I wouldn't be surprised if S3 did combine it in their new architecture (given that they said it would offer the framebuffer as an input in the shader, this suggests they combined alpha blending with the pixel shader).

Actually, a unified shading engine makes a lot of sense from the load balancing standpoint. It doesn't increase programming possibilities in itself, it just gives you the performance where you need it.

In a perfect world the extra flexibility which the hardware needs for unified shaders would be exposed :)
 
JohnH said:
Chalnoth said:
I don't see why alpha blending need be handled any differently from texture fetches, in terms of latency hiding. That is, I would expect that an architecture could simply use the latency hiding that is used for texture fetches to also hide the latency for alpha blends.
It's the same type of latency hiding technique, but it's a different set of buffering, so it costs extra area. And no, you can't combine the two, as they're required in completely different parts of the pipeline.
This doesn't make sense to me. They're fundamentally the same thing. I mean, you have a pixel input that leads to a pixel output. There may be some complications that arise in accessing the framebuffer as a read/write buffer, but other than that, I don't see why you would need to have another latency-hiding buffer.

I mean, the way I see it, what you should do is this:
1. Request data fetch from external memory.
2. Process other things until that data is available.
3. Process the pixel's final color.
4. Send pixel to output buffer, where it waits until it can be written to memory.

I don't see why you'd want to:
1. Request texture data.
2. Process other things.
3. Process rest of pixel data.
4. Request framebuffer data.
5. Process other things.
6. Process blend.
7. Send pixel to output buffer.

I mean, sure, the actual functional unit that calculates the blend will probably be a specialized unit that can act before (or after, depending) other pixel processing, but I see no reason why one must have a different buffer here.
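Chalnoth's single-buffer view, sketched as a hypothetical structure (deliberately naive; the replies below explain why the hardware people don't treat the two kinds of read as interchangeable):

    #include <stdbool.h>
    #include <stdint.h>

    /* One in-flight pixel parked while a memory read is outstanding. Whether the
       read is a texture fetch or a destination-colour fetch is just a tag here. */
    typedef enum { READ_TEXTURE, READ_FRAMEBUFFER } ReadKind;

    typedef struct {
        uint32_t pixel_id;
        ReadKind kind;
        bool     data_ready;  /* set when the memory controller returns the data */
    } PendingRead;

    #define MAX_IN_FLIGHT 256
    PendingRead pending[MAX_IN_FLIGHT];
    int pending_count;

    /* Park a pixel until its read returns; other pixels keep being issued. */
    bool issue_read(uint32_t pixel_id, ReadKind kind)
    {
        if (pending_count == MAX_IN_FLIGHT)
            return false;  /* out of latency-hiding capacity: the pipeline stalls */
        pending[pending_count++] = (PendingRead){ pixel_id, kind, false };
        return true;
    }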
 
The difference is that you know it's impossible for a shader to alter the contents of a texture. Screen pixels are completely different.

At the very least, rendering using screen reads will take a severe performance hit if there are overlaps, as multiple pixels with overlaps cannot simultaneously be live in the pipeline. At the moment the pixel lifetime is exactly known.

Even avoiding this pathological case, there are potential cost implications - having a 'color cache' large enough to latency-compensate the entire pixel pipeline could be an awful lot of area.
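The overlap hazard can be pictured as a read-after-write scoreboard (a toy sketch with invented sizes, not a description of any real design):

    #include <stdbool.h>
    #include <stdint.h>

    /* A pixel that reads the framebuffer must not enter the pipeline while another
       pixel at the same (x, y) is still live, or it could read a stale colour. */
    #define MAX_LIVE 1024
    uint32_t live_addr[MAX_LIVE];  /* packed (y << 16) | x of in-flight pixels */
    int live_count;

    bool can_issue(uint16_t x, uint16_t y)
    {
        uint32_t addr = ((uint32_t)y << 16) | x;
        for (int i = 0; i < live_count; ++i)
            if (live_addr[i] == addr)
                return false;  /* overlap: stall until the older pixel retires */
        return true;
    }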
 
Right, so you would have to have additional logic to make sure there are no overlaps. Any overlaps that do occur in close proximity would need special handling (though they should be rare, I would think), but I don't see why that would require a different buffer.
 
Chalnoth:

I think what you are talking about would require a large amount of 'cache management' (something my bank account could do with), which would be expensive on a GPU, as space is at a premium. Cache memory isn't simply a pool of RAM that buffers reads and writes - it is in an ideal world, but in the real world it is divided into sets. Say you have 128 KB of cache that may be divided into 8 blocks, so you have 8x 16 KB blocks, each of which you can assign to a particular process that requires caching - i.e. you're texturing this polygon which has 3 texture layers, so you give 3 of the cache sets to that pipe; the next pipe is working on something else, so you give it these other cache sets, and so on.

Sure, you could give some sets away for alpha blending and such, but by doing so you would be taking away not just from your texture cache amount, but from the functionality of your texture cache too, because it would physically be capable of caching fewer textures/threads/whatever.
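Dave's example, sketched (the 128 KB / 8-set split and the ownership policy are his illustration, not any real chip's):

    /* 128 KB of cache split into 8 sets of 16 KB, each handed to whoever needs it. */
    enum { NUM_SETS = 8, SET_SIZE_KB = 16 };

    typedef enum { UNUSED, TEXTURE_LAYER, BLEND_READS } SetUse;

    SetUse set_owner[NUM_SETS];

    /* Every set given to blend reads is one fewer set available for texture layers. */
    int sets_left_for_textures(void)
    {
        int n = 0;
        for (int i = 0; i < NUM_SETS; ++i)
            if (set_owner[i] == UNUSED || set_owner[i] == TEXTURE_LAYER)
                ++n;
        return n;
    }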

Disclaimer: I could be talking rubbish here - how about somebody like John or Simon back me up and/or correct me ;)
 
Chalnoth/Mfa, OK, I see what you were getting at now: if you have no separate alpha blend unit and perform this in the shader, then you could just treat the current FB as a texture source. The obvious downside of this is that your peak translucent fillrate would end up as half of your opaque rate; of course this may not matter, as peak fill isn't relevant with a complex shader.

Turning this back to the original discussion, this plays very nicely into the hands of a TBDR, as it has the current FB value stored locally and accessible at any point in the shader.
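The TBDR case in sketch form (hypothetical tile size and a plain "over" blend, just to show that the destination read is a local array access rather than an external fetch):

    /* On-chip tile buffer: the current framebuffer value for every pixel in the
       tile is already local to the chip. The tile size here is an assumption. */
    #define TILE_W 32
    #define TILE_H 16

    typedef struct { float r, g, b, a; } Colour;
    Colour tile[TILE_H][TILE_W];

    /* A shader-style alpha blend against the locally held destination colour. */
    Colour blend_over(Colour src, int x, int y)
    {
        Colour dst = tile[y][x];  /* on-chip read - no memory latency to hide */
        Colour out = {
            src.r * src.a + dst.r * (1.0f - src.a),
            src.g * src.a + dst.g * (1.0f - src.a),
            src.b * src.a + dst.b * (1.0f - src.a),
            1.0f
        };
        tile[y][x] = out;
        return out;
    }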

John.
 
Dave B(TotalVR) said:
Chalnoth:

I think what you are talking about would require a large amount of 'cache management' (something my bank account could do with), which would be expensive on a GPU, as space is at a premium. Cache memory isn't simply a pool of RAM that buffers reads and writes - it is in an ideal world, but in the real world it is divided into sets. Say you have 128 KB of cache that may be divided into 8 blocks, so you have 8x 16 KB blocks, each of which you can assign to a particular process that requires caching - i.e. you're texturing this polygon which has 3 texture layers, so you give 3 of the cache sets to that pipe; the next pipe is working on something else, so you give it these other cache sets, and so on.

Sure, you could give some sets away for alpha blending and such, but by doing so you would be taking away not just from your texture cache amount, but from the functionality of your texture cache too, because it would physically be capable of caching fewer textures/threads/whatever.
I don't think so. I was thinking more about timing, not where the data requests were stored. Then again, remember that texture cache also has to deal with point sampling, and I don't see much difference between point sampling and caching framebuffer reads.

JohnH said:
Chalnoth/Mfa, OK, I see what you were getting at now: if you have no separate alpha blend unit and perform this in the shader, then you could just treat the current FB as a texture source. The obvious downside of this is that your peak translucent fillrate would end up as half of your opaque rate; of course this may not matter, as peak fill isn't relevant with a complex shader.
No, I don't think so. Not that the two would be completely separate, but I don't see any reason why, with sharing of the latency hiding, you'd need to do the alpha blending in the shader. You could just have some extra pipeline stages dedicated to the alpha blend operation, which every pixel goes through (with those that don't actually use the alpha blend either taking a parallel "stalling" route, or going through the alpha blend hardware with blending values set so as not to change the pixel value). Or you could even use the alpha blend unit as part of the latency hiding, if you decide that you can assume the reads for alpha blending will take more or less latency than the reads for other operations that need latency hiding.
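A minimal sketch of the "every pixel goes through the blend stage" idea (illustrative only): with factors (1, 0) the blend is an identity, so opaque pixels pass through unchanged.

    typedef struct { float r, g, b, a; } Colour4;

    /* Fixed blend stage shared by all pixels; opaque pixels use src_f = 1, dst_f = 0. */
    Colour4 blend_stage(Colour4 src, Colour4 dst, float src_f, float dst_f)
    {
        Colour4 out = {
            src.r * src_f + dst.r * dst_f,
            src.g * src_f + dst.g * dst_f,
            src.b * src_f + dst.b * dst_f,
            src.a * src_f + dst.a * dst_f
        };
        return out;  /* blend_stage(src, dst, 1.0f, 0.0f) returns src unchanged */
    }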
 
No, I don't think so. Not that the two would be completely separate, but I don't see any reason why, with sharing of the latency hiding, you'd need to do the alpha blending in the shader. You could just have some extra pipeline stages dedicated to the alpha blend operation, which every pixel goes through (with those that don't actually use the alpha blend either taking a parallel "stalling" route, or going through the alpha blend hardware with blending values set so as not to change the pixel value). Or you could even use the alpha blend unit as part of the latency hiding, if you decide that you can assume the reads for alpha blending will take more or less latency than the reads for other operations that need latency hiding.

Chalnoth, you're completely failing to understand how HW works.
When alpha blending with the contents of the frame buffer you need the current contents of the frame buffer. This has to be fetched from memory, and that fetch has a huge latency, so you must provide buffering to hide that latency. This is additional buffering; it can't be the same buffering as something else, so it costs additional area. End of story.
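John's area argument in rough numbers (every figure below is an assumption for illustration): to keep N pixels per clock flowing while a framebuffer read is outstanding, you need roughly latency x rate entries of buffering, on top of whatever already hides texture-fetch latency.

    #include <stdio.h>

    int main(void)
    {
        const int latency_cycles   = 300; /* assumed external memory read latency */
        const int pixels_per_clock = 8;   /* assumed pixel rate */
        const int bytes_per_entry  = 16;  /* assumed state carried per parked pixel */

        long entries = (long)latency_cycles * pixels_per_clock;
        printf("%ld entries, %ld KB of buffering\n",
               entries, entries * bytes_per_entry / 1024); /* 2400 entries, 37 KB */
        return 0;
    }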

John.
 
It should be no different than any other memory fetch in terms of latency. In fact, I would tend to think that a framebuffer read for blending would get priority, as if you minimize the latency of the blending read, you'll minimize the chance you'll have to stall the pipeline due to overwriting.
 
Excuse the minor off-topic, but if it weren't for Chalnoth this thread wouldn't even be half as interesting. It's a rare occasion that you get this much out of the PVR folks :oops:

*runs*
 
This is a very informative thread, and thanks to the guys from PowerVR for contributing.

Chalnoth, you are very knowledgeable, but when will you finally concede that TBDR has plenty of advantages vs IMR? If nVidia jumped ship, what would you do?? :eek:

Most of the disadvantages and problems with the approach have long since been overcome and solved, whether in patents or by other means.

Bandwidth will always be an issue as precisions rise and pipelines/fill-rates increase (maybe not quite so much with a future emphasis on shaders), and although overdraw optimisations have been updated in modern IMRs, none have quite the efficiency of TBDR. The ability to eliminate redundant complex shader work will also have its performance bonuses - and hopefully silence a few critics if such a new architecture is brought to the PC. :oops:

Speaking of latencies... a couple of (probably trivial) questions to the PVR guys - I'm interested in what sort of latency improvements were possible by synchronising the core clock and memory on Kyro and older architectures. Would this save internal caching space by not having to buffer quite so much data being fed to the core, or is it just an easier implementation? Does a TBDR architecture need synchronisation, or is this just an optimal design characteristic, given that most IMRs have asynchronous core and memory rates? If this is getting too close to NDA'd or private information I would understand.
 