nvidia and memexport..

nAo

I've just found this new patent from nvidia: Shader pixel storage in a graphics memory
Not read it yet but from the abstract it seems interesting:


Circuits, apparatus, and methods that enable a shader to read and write data from and to a memory location during a single pass through a graphics pipeline. Some embodiments of the present invention provide an increase in the number of buffers available to a shader. These buffers may be read/write (input/output) or read only (input) buffers. Another provides pixel store and pixel load commands that may be used as instructions in a shader program or program portion, and may appear at positions other than the end of the shader program or program portion. Other embodiments provide a data path between a shader and a graphics memory, typically through a frame buffer interface. This data path simplifies the timing of the above store (write) and load (read) commands
 
It's too bad ditching blending in favour of this is not a marketable feature.
Maybe we wouldn't have to wait for this to appear in future if it was :p
 
I didn't think the NV30 supported any read/write buffers other than for blending or the z/stencil-buffer (and blending supported only on traditional 16-bit and 32-bit buffers).
 
Jawed said:
I think that a D3D10 GPU can't be implemented without some form of stream out.
Well D3D10 has the fixed capabilities, so yeah - it must have stream-out in order to be worth anything :smile:

I don't personally like reading those "legalese" patents - but it almost sounds as if it could be some sort of programmable OM unit... which would absolutely rock in ways I can't even begin to imagine. Although, the software doesn't allow for that yet, and it does break a lot of basic principles that exist today - so maybe I'm just being optimistic.

Jack
 
Jawed said:
What's OM?
OM = Output Merger

It's one of the refined fixed-function components in the D3D10 pipeline. It's effectively the part that deals with frame buffer blending and compositing.

A programmable OM would be one where we could write our own custom blending operations rather than rely on the stock add/subtract/modulate type operators.
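
To make the distinction concrete, here is a minimal Python sketch. It is not D3D10 code: the names `FIXED_BLEND_OPS`, `fixed_function_blend`, `programmable_blend` and `screen` are hypothetical, made up for illustration, and a real output merger works per channel with configurable blend factors.

```python
from typing import Callable

# Hypothetical menu of stock operators, mimicking a fixed-function blender.
FIXED_BLEND_OPS = {
    "add": lambda src, dst: src + dst,
    "subtract": lambda src, dst: src - dst,
    "modulate": lambda src, dst: src * dst,
}


def clamp01(x: float) -> float:
    """Clamp a channel value to the [0, 1] range of a normalized buffer."""
    return max(0.0, min(1.0, x))


def fixed_function_blend(op: str, src: float, dst: float) -> float:
    """Fixed-function path: the caller can only pick an operator by name."""
    return clamp01(FIXED_BLEND_OPS[op](src, dst))


def programmable_blend(fn: Callable[[float, float], float],
                       src: float, dst: float) -> float:
    """Programmable path: the caller supplies arbitrary blend code."""
    return clamp01(fn(src, dst))


def screen(src: float, dst: float) -> float:
    """A custom operator that is not in the stock menu above."""
    return 1.0 - (1.0 - src) * (1.0 - dst)
```

With this, `fixed_function_blend("modulate", 0.5, 0.5)` is limited to what the table offers, while `programmable_blend(screen, 0.5, 0.5)` runs whatever function is passed in.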

Jack
 
JHoxley said:
A programmable OM would be one where we could write our own custom blending operations rather than rely on the stock add/subtract/modulate type operators.
I think what would be better is to simply make the input from the buffer available in the pixel shader. This would require appropriate blocking of execution of the shader for some pixels to preserve order, but shouldn't be that difficult to implement.
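
A rough sketch of the kind of blocking I mean, in Python. This is only a toy model under a big assumption: fragments are issued in "waves", and a wave fully retires before the next one starts. The function name `schedule_waves` is hypothetical and nothing here is taken from the patent or from real hardware.

```python
def schedule_waves(fragments):
    """Group fragments into waves that may run in parallel.

    fragments: list of pixel coordinates, in submission order.
    Returns a list of waves; each wave is a list of fragment indices.
    A fragment is pushed to a later wave if an earlier fragment for the
    same pixel has not retired yet, so per-pixel order is preserved.
    """
    waves = []
    next_free_wave = {}  # pixel -> first wave its next fragment may join
    for idx, pixel in enumerate(fragments):
        w = next_free_wave.get(pixel, 0)
        if w == len(waves):
            waves.append([])      # open a new wave for the blocked fragment
        waves[w].append(idx)
        next_free_wave[pixel] = w + 1
    return waves
```

If no two fragments touch the same pixel, everything lands in a single wave and nothing is blocked. Only fragments that overlap an earlier one on the same pixel get delayed.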
 
JHoxley said:
OM = Output Merger

It's one of the refined fixed-function components in the D3D10 pipeline. It's effectively the part that deals with frame buffer blending and compositing.

A programmable OM would be one where we could write our own custom blending operations rather than rely on the stock add/subtract/modulate type operators.

Jack
Ah, so programmable OM might be why David Kirk discounted anti-alias support for floating-point backbuffers - because with programmable OM, presumably, AA would be just another function of the OM pipeline.

Jawed
 
Jawed said:
Ah, so programmable OM might be why David Kirk discounted anti-alias support for floating-point backbuffers
More likely it was discounted because of a lack of support in current or near-future nVidia hardware.
 
I have the feeling that such a feature might considerably complicate the cache ("buffer interface" is the name used in the patent). And to be honest, I'm not quite sure why. So here are a couple of questions concerning the old classic way of doing blending:

  1. In order: fragments are produced "in order", and the hardware ensures that if one does a classic alpha blend while drawing two overlapping triangles (A and then B), then the order of submission of the triangles determines the result (i.e. A will be drawn before B). How does the hardware ensure this? Are the fragments tagged with some ID and sorted at the very end of the pipeline, or is it a strict order which does not even require IDs?
  2. Blend cost: it's a known fact that alpha blending is a slow process. Obviously, memory must be read and then written, and the extra "read" operation is the costly part. But, considering the high number of fragments coming out of the pipeline, it must be very easy to prefetch each destination value before the fragment enters the last "blendOp" brick in the pipeline. Also, if two fragments (of the same pixel) are too close together, it might be a good idea to space them apart so that the second one effectively gets the result of the first one. Am I right here, is this what really happens in a GPU?
    In case the answer is yes: read latency is not the issue here. So why the hell is this operation slow? Is it because the read operations tend to saturate the bandwidth, meaning that the next write operation has to wait?

And finally, concerning this patent: the ordering of the fragments must be ensured in the shader itself. Is that something harder to achieve?
 
purpledog said:
How does the hardware ensure this? Are the fragments tagged with some ID and sorted at the very end of the pipeline, or is it a strict order which does not even require IDs?
Or a combination of both. Really, it depends on the architecture.

purpledog said:
So why the hell is this operation slow?
If triangles never overlap, or if they overlap far enough apart in time that caches have been flushed between them, then you effectively need twice the memory bandwidth to perform the blend.

Since ROPs are mostly already designed to saturate memory bandwidth on writes alone, doubling the bandwidth needed makes blending run at half speed.

Prefetching won't help. Prefetching only helps latency at the cost of bandwidth. If you already have no bandwidth to spare, there's no real point in prefetching.
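
As a back-of-the-envelope illustration, here is the arithmetic in Python. The function name `peak_fill_rate` and the numbers in the usage note are made up for the example. The model assumes colour traffic is the only consumer of bandwidth and ignores caches, compression and z traffic.

```python
def peak_fill_rate(bandwidth_bytes_per_s: float,
                   bytes_per_pixel: int,
                   blending: bool) -> float:
    """Upper bound on pixels per second when memory bandwidth is the limit."""
    # Blending reads the destination before writing it back,
    # so each pixel costs twice the memory traffic of a plain write.
    traffic_per_pixel = bytes_per_pixel * (2 if blending else 1)
    return bandwidth_bytes_per_s / traffic_per_pixel
```

For a hypothetical 32 GB/s part writing 4-byte pixels, `peak_fill_rate(32e9, 4, False)` comes to 8 Gpixels/s, and turning blending on halves that to 4 Gpixels/s. Prefetching the destination changes when the read happens, not how many bytes it costs.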
 
purpledog said:
I have the feeling that such a feature might considerably complicate the cache ("buffer interface" is the name used in the patent). And to be honest, I'm not quite sure why. So here are a couple of questions concerning the old classic way of doing blending:

  1. In order: fragments are produced "in order", and the hardware ensures that if one does a classic alpha blend while drawing two overlapping triangles (A and then B), then the order of submission of the triangles determines the result (i.e. A will be drawn before B). How does the hardware ensure this? Are the fragments tagged with some ID and sorted at the very end of the pipeline, or is it a strict order which does not even require IDs?
  2. Blend cost: it's a known fact that alpha blending is a slow process. Obviously, memory must be read and then written, and the extra "read" operation is the costly part. But, considering the high number of fragments coming out of the pipeline, it must be very easy to prefetch each destination value before the fragment enters the last "blendOp" brick in the pipeline. Also, if two fragments (of the same pixel) are too close together, it might be a good idea to space them apart so that the second one effectively gets the result of the first one. Am I right here, is this what really happens in a GPU?
    In case the answer is yes: read latency is not the issue here. So why the hell is this operation slow? Is it because the read operations tend to saturate the bandwidth, meaning that the next write operation has to wait?

And finally, concerning this patent: the ordering of the fragments must be ensured in the shader itself. Is that something harder to achieve?


The operation is slow because read-modify-write gets absolutely the worst possible performance from your RAM. Latency isn't the issue; it's the number of swaps from read to write that hurts.
 
ERP said:
The operation is slow because read-modify-write gets absolutely the worst possible performance from your RAM. Latency isn't the issue; it's the number of swaps from read to write that hurts.

OK, so bandwidth is the bottleneck, is that what you're saying?
Still, is this new patent as slow as classic blending then? Or does doing this operation earlier in the pipeline make it worse for some reason?
I still suspect some very tricky sync issue, but I can't figure it out.
 
ERP said:
The operation is slow because read-modify-write gets absolutely the worst possible performance from your RAM. Latency isn't the issue; it's the number of swaps from read to write that hurts.
It's not that hard to buffer up a DRAM page worth of work, thus eliminating the read-write turnaround penalty.
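
A toy way to see the difference is to count bus direction switches for the two access patterns. This is a sketch only: the function names are hypothetical, and it counts turnarounds on a single channel without modelling real DRAM timings or page boundaries.

```python
def bus_turnarounds(ops):
    """Count direction switches in a sequence of 'R'/'W' operations."""
    return sum(1 for a, b in zip(ops, ops[1:]) if a != b)


def interleaved_rmw(n):
    """Naive blending: read one pixel, write it back, repeat n times."""
    return ["R", "W"] * n


def batched_rmw(n, batch):
    """Buffered blending: read a batch of pixels, then write the batch."""
    ops = []
    for start in range(0, n, batch):
        count = min(batch, n - start)
        ops += ["R"] * count + ["W"] * count
    return ops
```

For 8 pixels, the interleaved pattern switches direction 15 times, batches of 4 switch 3 times, and a single batch of 8 switches once. The number of bytes moved is the same in all three cases.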
 
Bob said:
It's not that hard to buffer up a DRAM page worth of work, thus eliminating the read-write turnaround penalty.
Not that I'm trying to refute this, but just bear in mind that you'd need to do this for each memory chip, since data is likely to be striped across each to maximize bandwidth.
 
I haven't read the patent, but if it's about arbitrary frame buffer access in a pixel shader, then a big problem is parallelism between triangles.
There are a lot of pixels in flight at once in the PS nowadays. So many that you need to have pixels from more than one triangle in flight to keep the PS units busy. If the triangles overlap there's no big problem, since the calculations for each triangle's pixels are independent of each other. So you can just run them in parallel.

Frame buffer blends disturb this nice optimization. A pixel from one triangle must be completely blended before the blend for the same pixel in the next triangle starts - this part is serial. That's why it's pulled out of the PS and put into a pretty fixed-function unit that is only capable of a few quick calcs. With just quick operations in the ROPs you can have a lot fewer pixels in flight there.

There can still be a lot of pixels waiting in in/out buffers to the ROPs. That's no problem since it's not in the serial path.

So you have complex operations that run in parallel in the PS, feeding the simple operations in the much more serial ROPs. (Serial per pixel, but of course parallel for different pixel positions.)
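
The split can be sketched in a few lines of Python. This is a toy model, not how any particular GPU does it: `render` and `over` are hypothetical names, the "shading" is a placeholder, and the reordering is done with a plain sort on submission index.

```python
def over(src_rgb, src_a, dst_rgb):
    """Classic alpha blend: src * a + dst * (1 - a), per channel."""
    return tuple(s * src_a + d * (1.0 - src_a)
                 for s, d in zip(src_rgb, dst_rgb))


def render(fragments, completion_order):
    """fragments: list of (pixel, rgb, alpha) in submission order.
    completion_order: the order in which shading happens to finish.
    """
    # "PS" stage: fragments are independent, so any completion order works.
    shaded = {}
    for idx in completion_order:
        pixel, rgb, alpha = fragments[idx]
        shaded[idx] = (pixel, rgb, alpha)  # stand-in for expensive shading

    # "ROP" stage: blends are applied strictly in submission order.
    framebuffer = {}
    for idx in sorted(shaded):
        pixel, rgb, alpha = shaded[idx]
        dst = framebuffer.get(pixel, (0.0, 0.0, 0.0))
        framebuffer[pixel] = over(rgb, alpha, dst)
    return framebuffer
```

Shading two overlapping fragments in order `[0, 1]` or `[1, 0]` gives the same frame buffer, because the blend step always runs in submission order. Swapping the submission order itself changes the result, which is why that step has to stay serial per pixel.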
 
Basic said:
I haven't read the patent, but if it's about arbitrary frame buffer access in a pixel shader, then a big problem is parallelism between triangles.
I don't think it's arbitrary, but rather just a generalization of blending. I haven't read the patent, either, though. Just a little bit of skimming.
 