On this test on the XL it appears to make no difference in 6x FSAA and is slightly detrimental to performance in 2x.
Dave Baumann said:On this test on the XL it appears to make no difference in 6x FSAA and is slightly detrimental to performance in 2x.
One feature that hasn't got much attention so far is the Color Buffer Cache (buffer and cache?).
sireric said:The MC requires the clients to have lots of latency tolerance so that it can establish a huge number of outstanding requests and pick and choose the best ones to maximize memory bandwidth (massive simplification).
sireric said:I'm not saying replace the gfx APIs -- Just trying to limit the proliferation of new ones. What if the physics API doesn't allow for all physical phenomena to be done? Do you create a new API for that? What if signal processing wants to be done and you only have collision hooks?
At the end, I fear the same thing regarding low level of detail. But I fear the extreme work in having lots of new specialized APIs too. I'd like a reasonably low level API that allows more "to the metal" performance, but that abstracts some of the quirks of programming a given architecture. I don't really know the answer either. It's a new place where we are continuing to explore, but we are listening and talking to that community.
Chalnoth said:Well, the problem is: how do you maintain a low-level API along with vendor-agnostic interfaces?
Edit:
Actually, now that I think about it, it might be quite nice to have a low-level API as an intermediate step between a higher-level API or language and the hardware. This would allow the implementation of compilers and APIs without having to go through the graphics pipeline and also without IHVs having to write drivers for the specific APIs.
Nite_Hawk said:We want vendor agnostic high level APIs and vendor specific low level APIs. Cross platform too.
Nite_Hawk
'Cache' inside graphics chips has three uses, not all of which may be familiar if you're only used to the term 'cache' in the way that CPUs use it.
Nite_Hawk said:Why exactly is the color buffer cache needed?
To be honest I am not that fond of low level APIs; I would prefer a solution (a gcc-like compiler, like ATI said in one of its presentations) that was built on top of OpenGL (or Direct3D, but OGL is not bound to a certain platform), so that it could use not only ATI cards, but nV cards as well. Of course you have the problem that a) graphics APIs are not really designed to do general programming stuff, so you are bound to miss certain general-purpose functions that must be created somehow b) API built on top of API equals lost speed and efficiency.
krychek said:Hehe, yep I was thinking along the same lines (if I understood you right). The other APIs would be implemented on top of the vendor specific low level API. Also for GPGPU, you could have libraries implemented on top of the low level API that are specific to a certain domain. But if you really want just the performance and features, just code to the low level API.
I'm hypothesising that RBE is just another "latency-tolerant" client of the MC.
Nite_Hawk said:I'm still trying to process everything in your post... Why exactly is the color buffer cache needed? Couldn't the RBE directly request blocks from the MC and remove the CBC layer?
I'm suggesting that each CBC is owned by a single RBE. And each RBE solely owns screen-tiles. The net effect being that a given pixel is always the property of a single RBE. This increases parallelism in the GPU, without creating intra-GPU dependencies.
I think the missing piece of information for me is how often the same data gets requested from the CBC over again. If we have 4 RBEs, does each of them only contact one CBC? If so I could see why this is important...
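The tile-ownership idea hypothesised above can be sketched in a few lines. This is only an illustration: the tile size, the RBE count, and the interleaving function `owning_rbe` are assumptions made up for the example, not documented R520 behaviour.

```python
# Hypothetical parameters: not taken from any ATI documentation.
TILE_W = 16    # assumed tile width in pixels
TILE_H = 16    # assumed tile height in pixels
NUM_RBES = 4   # assumed number of RBEs, each with its own private CBC


def owning_rbe(x, y):
    """Return the index of the RBE (and CBC) that owns pixel (x, y).

    Tiles are interleaved diagonally across the RBEs, so a given pixel
    always maps to exactly one RBE and no two RBEs share a tile.
    """
    tile_x = x // TILE_W
    tile_y = y // TILE_H
    return (tile_x + tile_y) % NUM_RBES
```

Because the mapping depends only on the pixel coordinates, two RBEs never need to agree on the contents of the same tile, which is the "no intra-GPU dependencies" property described above.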
Dio said:'Cache' inside graphics chips has three uses, not all of which may be familiar if you're only used to the term 'cache' in the way that CPUs use it.
1. Cache avoids going to memory when an item of data is frequently accessed in a short period of time. This is important in some places in graphics chips (vertex accesses, texture filtering, small triangles) but it's not always the main raison d'etre. It's what I thought cache was until I had the other two uses explained to me...
2. Caching to increase burst lengths - speculative reads and holding write data for more items to come into the same cache line and so improve memory efficiency. This is absolutely essential for graphics as most data items are much smaller than the amounts that it is efficient to get from memory in one go.
3. Caching for efficient pipelining; when you find out you are going to need a particular cache line further down the pipe you can issue the request to fill that cache line immediately, and then later in the pipeline the data is already present in the cache by the time it's needed to be used.
So while the colour buffer cache may not always need much of function 1, function 2 is absolutely essential and function 3 allows for more efficient design.
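Function 2 (caching to increase burst lengths) can be illustrated with a toy write-combining cache. This is a minimal sketch under assumed numbers: the 64-byte line, the 4-byte pixel, and the class name `WriteCombiningCache` are hypothetical and not taken from any real chip.

```python
LINE_BYTES = 64   # assumed size of one efficient memory burst
PIXEL_BYTES = 4   # assumed size of one colour write


class WriteCombiningCache:
    """Toy model: hold small writes until flush, then emit one burst per line."""

    def __init__(self):
        self.lines = {}   # line base address -> {offset within line: value}
        self.bursts = 0   # memory transactions issued so far

    def write(self, addr, value):
        # Group the write with any others that fall in the same cache line.
        line_addr = addr - (addr % LINE_BYTES)
        self.lines.setdefault(line_addr, {})[addr - line_addr] = value

    def flush(self):
        # Each dirty line goes to memory as a single burst.
        self.bursts += len(self.lines)
        self.lines.clear()


def bursts_without_cache(addresses):
    """Without combining, every small write is its own memory transaction."""
    return len(addresses)
```

With these assumed sizes, sixteen adjacent pixel writes cost sixteen transactions when sent individually but a single burst once they have been collected into one line:

```python
cache = WriteCombiningCache()
addresses = [i * PIXEL_BYTES for i in range(16)]
for a in addresses:
    cache.write(a, 0xFFFFFFFF)
cache.flush()
# cache.bursts is now 1, versus bursts_without_cache(addresses) == 16
```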
Jawed said:I'm hypothesising that RBE is just another "latency-tolerant" client of the MC.
In order to be latency-tolerant it, presumably, runs render tasks in a disjoint fashion. One way to do this is to implement a really long pipeline, so that by the time the data-sensitive portion of the RBE task occurs, the data is all in place (latency has run its course).
But in typical RBE tasks, I don't know what you'd fill the pipeline with. It's pixel colour data with an address in the back-buffer. There's not very much you can do with that data until you have access to the relevant portion of back-buffer.
So an alternative to this is to "batch-up" RBE tasks. Instead of working on a single quad of pixels (e.g. writing four pixels' colour values into the back-buffer), it makes sense to work in blocks. A block might be 8 pixels. Or 64 pixels. I don't know. Increasing the size makes each access to memory more efficient. But it also costs more in terms of on-die CBC space.
Then you also have to bear in mind that the shader/texture pipelines produce colour-writes out of order (or at least I presume they do, since the threads are themselves able to execute out of order).
So the RBE has to take care to write order-sensitive pixels in the correct order.
So the RBE would seem to have to be able to re-order incoming tasks, and block them up into memory-efficient packets.
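The re-ordering and blocking described above could look roughly like this. It is a speculative sketch that follows the hypothesis in this post, not a description of the real hardware: the block size, the sequence numbers, and the function `pack_writes` are all assumptions, and it models plain overwrites only (no blending).

```python
BLOCK_PIXELS = 8   # assumed number of pixels per memory-efficient block


def pack_writes(writes):
    """Pack out-of-order colour writes into per-block packets.

    `writes` is a list of (seq, pixel_addr, colour) tuples, where `seq` is a
    hypothetical submission-order number. Writes may arrive in any order;
    they are replayed in `seq` order so that, for each pixel, the write that
    was submitted last is the one that survives.
    Returns {block_index: {pixel_addr: colour}}.
    """
    blocks = {}
    for seq, addr, colour in sorted(writes, key=lambda w: w[0]):
        block = addr // BLOCK_PIXELS
        blocks.setdefault(block, {})[addr] = colour
    return blocks
```

For example, if the write with `seq` 2 to pixel 3 arrives before the write with `seq` 0 to the same pixel, the packed block still ends up holding the later colour, and the two pixels in different blocks end up in separate packets.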
I should point out that the "Xenos" AA EDRAM patent covers similar ground.
http://patft.uspto.gov/netacgi/nph-P...number=6873323
But the concepts I'm discussing aren't the main focus of that patent.
Instead, notice the use of Packing and Unpacking units. It's a tenuous link, but I think something similar is prolly happening in relation to CBC.
Kombatant said:To be honest I am not that fond of low level APIs; I would prefer a solution (a gcc-like compiler, like ATI said in one of its presentations) that was built on top of OpenGL (or Direct3D, but OGL is not bound to a certain platform), so that it could use not only ATI cards, but nV cards as well. Of course you have the problem that a) graphics APIs are not really designed to do general programming stuff, so you are bound to miss certain general-purpose functions that must be created somehow b) API built on top of API equals lost speed and efficiency.
Jawed said:Since ATI's older GPUs couldn't execute pixel shader threads out of order, packing colour writes in the RBE would have been easier, I guess.
The packing would match up well with the triangle walk, I expect. Though prolly not perfectly (since triangles aren't memory-tile sized).
I doubt the CBC is entirely new in R520 - but I expect the scope of RBE operation has evolved so much in R520 that CBC has had to grow significantly, both in terms of size and functionality. I dare say in much the same way that texture caches have evolved.
Jawed