Bob said:
I could prolly spend a few hours on this, and I always run the risk of speaking way out of turn, being a normal person rather than a GPU engineer. But anyway...
- How do you maintain coherent DRAM access streams?
How do you maintain them now? Texture reads are based on a tiled organisation of textures in memory, as far as I can tell, and textures are pre-fetched.
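As a concrete illustration of why tiling helps (a minimal sketch; the tile size, texel size and function name are all assumptions of mine, not taken from any real GPU):

```cpp
// Toy tiled-texture address calculation, assuming 4x4-texel tiles of
// 4-byte texels. Neighbouring texels in x AND y land in the same tile,
// so a bilinear footprint usually hits one DRAM burst instead of four
// widely separated rows; that's what keeps the access stream coherent.
#include <cstdint>

uint64_t tiled_offset(uint32_t x, uint32_t y, uint32_t width_in_tiles) {
    const uint32_t kTile = 4, kTexelBytes = 4;
    uint64_t tile    = (y / kTile) * width_in_tiles + (x / kTile); // which tile
    uint64_t in_tile = (y % kTile) * kTile + (x % kTile);          // texel inside it
    return (tile * kTile * kTile + in_tile) * kTexelBytes;
}
```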
As I see it, a unified architecture could make multiple accesses to the same textures that are non-contiguous in time, so a degree of set-associativity would need to be introduced to the caches, where I presume current GPUs have no need for it. So that's definitely an added cost in a unified GPU.
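Something like the following toy cache is what I have in mind (sizes, names and the FIFO replacement policy are all assumed, not Xenos specifics). With ways = 1 it degenerates to the direct-mapped streaming case; raising ways is the added set-associativity:

```cpp
// Toy set-associative texture cache with 64-byte lines. Two batches
// touching the same texture at different times stop evicting each
// other's lines once there is more than one way per set.
#include <cstdint>
#include <vector>

struct TextureCache {
    static constexpr uint64_t kLineBytes = 64;
    uint64_t sets;
    int ways;
    std::vector<std::vector<uint64_t>> tags;   // tags[set][way], ~0ull = empty

    TextureCache(uint64_t s, int w)
        : sets(s), ways(w), tags(s, std::vector<uint64_t>(w, ~0ull)) {}

    bool lookup(uint64_t addr) {               // true on hit
        uint64_t line = addr / kLineBytes;
        auto&    set  = tags[line % sets];
        uint64_t tag  = line / sets;
        for (uint64_t t : set)
            if (t == tag) return true;
        for (int w = ways - 1; w > 0; --w)     // miss: shift ways down,
            set[w] = set[w - 1];               // dropping the oldest (FIFO;
        set[0] = tag;                          // real hardware might use LRU)
        return false;
    }
};
```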
- If you allow batches to execute out of order, you need to buffer up the results: although you can run shaders out of order, you can't run the ROPs out of order. How big is the buffering? How efficient is it?
Since every fragment is accompanied by one or more z values, I don't accept your argument that fragments have to be submitted to the ROPs "in order". A fragment is only rendered if its z allows it; as I understand it, there is always a final z test at pixel render, because fragments that pass early-z rejection are not 100%-guaranteed visible.
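For plain depth-tested, opaque fragments that argument is easy to demonstrate (alpha blending is a separate story). A minimal sketch of the late z test, with all names assumed:

```cpp
// The final ("late") z test at the ROP, assuming a less-than compare
// and opaque fragments. Because each fragment carries its own z, the
// surviving colour/depth pair is the same whatever order fragments
// arrive in, so ROP submission order need not match generation order.
#include <cfloat>

struct Pixel { float z = FLT_MAX; unsigned colour = 0; };

void rop_write(Pixel& p, float frag_z, unsigned frag_colour) {
    if (frag_z < p.z) {          // keep the nearest fragment; early-z
        p.z      = frag_z;       // upstream was only a conservative
        p.colour = frag_colour;  // filter, so we re-test here
    }
}
```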
On the other hand, ROPs already have to batch up fragments to make the most efficient DRAM accesses, so there's already a degree of buffering in the ROPs. Sure, if a ROP happens to receive, say, 16 fragments for contiguous pixels simultaneously, then the buffering is trivial.
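Roughly what I mean by that buffering, sketched with an assumed 16-pixel burst width (the flush policy and every name here are mine, purely illustrative):

```cpp
// Toy ROP write coalescer: fragments landing in the same 16-pixel span
// are buffered until the span fills, so one contiguous DRAM burst
// replaces up to 16 scattered writes. A real ROP would also flush on
// a timeout; here we only flush when a new span arrives or one fills.
#include <cstdint>
#include <cstdio>

struct RopCoalescer {
    static constexpr uint64_t kSpan = 16;
    uint64_t span  = ~0ull;          // which 16-pixel span is being built
    uint16_t valid = 0;              // bitmask of pixels received so far
    uint32_t colour[kSpan] = {};

    void accept(uint64_t pixel, uint32_t c) {
        if (pixel / kSpan != span) { flush(); span = pixel / kSpan; }
        colour[pixel % kSpan] = c;
        valid |= uint16_t(1u << (pixel % kSpan));
        if (valid == 0xFFFF) flush();        // full span: one burst write
    }
    void flush() {
        if (valid) printf("burst: span %llu, mask %04x\n",
                          (unsigned long long)span, valid);
        valid = 0;
    }
};
```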
- How do you resolve resource starvation?
By making that a parameter of the scheduler. The scheduler knows exactly how long non-TMU operations take, so it can (trivially) predict when the ALU pipeline is going to need new batches, ahead of the pipeline actually emptying.
TMU pipelines are a bit trickier, due to the varying degrees of requested filtering and the latency of DRAM accesses. But it should be reasonably easy to put a short buffer on the front of the TMU pipelines; perhaps the TMU can warn the scheduler when it's about to start its last iteration.
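The ALU half of that look-ahead would be something like this (the issue latency, batch lengths and all names are assumed for illustration; the variable-latency TMU side is what the short buffer covers instead):

```cpp
// The scheduler issues the next batch kIssueLatency cycles BEFORE the
// ALU pipe drains: since ALU batch durations are fixed and known, the
// replacement batch arrives just as the pipe would otherwise empty.
#include <queue>

struct Batch { int id; int alu_cycles; };    // duration known up front

struct Scheduler {
    static constexpr int kIssueLatency = 4;  // cycles to deliver a batch (assumed)
    std::queue<Batch> ready;
    int alu_cycles_left = 0;                 // work remaining in the ALU pipe

    void tick() {
        if (alu_cycles_left <= kIssueLatency && !ready.empty()) {
            alu_cycles_left += ready.front().alu_cycles;  // issue early
            ready.pop();
        }
        if (alu_cycles_left > 0) --alu_cycles_left;
    }
};
```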
- How do you guarantee forward progress?
By designing the scheduler to prioritise the oldest batch when there's a tie between "available" batches. Also by sizing the inter-stage buffers in a reasonably typical fashion and keeping an eye on their "percentage full" indicators.
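The tie-break itself is almost a one-liner (a sketch, with assumed names): among ready batches, always pick the oldest, so nothing waits forever:

```cpp
// Oldest-first selection among ready batches: every batch eventually
// becomes the oldest ready one, which guarantees forward progress.
#include <vector>

struct Batch { unsigned age = 0; bool ready = false; };

Batch* pick_next(std::vector<Batch>& batches) {
    Batch* oldest = nullptr;
    for (auto& b : batches)
        if (b.ready && (!oldest || b.age > oldest->age))  // larger age = older
            oldest = &b;
    return oldest;  // nullptr if nothing is ready this cycle
}
```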
- If you have a shader that can generate triangles, how do you guarantee that rasterization/early-Z happens in order of generation, across the whole chip?
By single-threading the triangle setup engine (erm, the interpolation/rasterisation engine is prolly a better name for it), which is how GPUs currently work (apparently).
If the GPU is designed to perform tiled/predicated triangle rendering, then it will need either an internal or an external vertex/triangle buffer. Additionally, if you can queue triangles per tile, then you won't get competing rasterisation/early-z-test threads corrupting z or rasterising otherwise useless fragments.
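Per-tile queueing in the sense I mean (tile size, screen size and names all assumed): triangles are binned to every tile their bounding box touches, and within a tile they are rasterised strictly in arrival order:

```cpp
// Toy per-tile triangle binning for 32x32-pixel tiles. Each tile's
// queue preserves submission order, so a single rasteriser/early-z
// thread per tile sees triangles in order of generation, and no two
// threads ever fight over the same tile's z values.
#include <algorithm>
#include <vector>

constexpr int kTile = 32, kTilesX = 40, kTilesY = 30;   // 1280x960, assumed

struct Tri { float minx, miny, maxx, maxy; int id; };
std::vector<int> tile_queue[kTilesY][kTilesX];          // triangle ids, in order

void bin_triangle(const Tri& t) {
    int x0 = std::max(0, int(t.minx) / kTile);
    int y0 = std::max(0, int(t.miny) / kTile);
    int x1 = std::min(kTilesX - 1, int(t.maxx) / kTile);
    int y1 = std::min(kTilesY - 1, int(t.maxy) / kTile);
    for (int y = y0; y <= y1; ++y)
        for (int x = x0; x <= x1; ++x)
            tile_queue[y][x].push_back(t.id);           // order preserved
}
```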
- How do you build efficient register files, when you need fine-grained allocation?
I presume that the execution pipeline gets longer in order to allow for slower register fetching, so you don't necessarily need to tackle fetch latency at source. Since there's no branching within a pipeline that only executes one instruction per batch (as Xenos seemingly does), there's no risk in making the pipeline longer; all branching happens when the batch is out of context.
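A toy model of why the deeper pipeline is nearly free here (the batch width and depths are assumed numbers): the elements of a batch are mutually independent for the one in-flight instruction, so extra stages only add fill/drain time:

```cpp
// A batch of W independent elements flows through a D-stage pipeline,
// one element per cycle. Deepening the pipeline (e.g. to absorb slow
// register fetches) costs only fill/drain cycles, never hazard stalls,
// and even that cost vanishes when batches issue back-to-back.
#include <cstdio>

int main() {
    const int W = 64;                        // elements per batch (assumed)
    for (int D : {8, 16, 24}) {              // pipeline depths to compare
        int cycles  = W + D - 1;             // fill + drain, one instruction
        double util = 100.0 * W / cycles;
        printf("depth %2d: %2d cycles, %.1f%% utilisation "
               "(100%% with back-to-back batches)\n", D, cycles, util);
    }
}
```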
- Is resource allocation done in hardware or in software? Either case, how is the allocation done? Is there a cost for reallocation, and if so, what is it?
I'm unclear what kinds of resources you're referring to that haven't already been tackled. Do you mean cache lines? ALU or TMU pipeline prioritisation?
In Xenos resource allocation is a GPU-state-driven hardware function, as far as I can tell.
- Do you keep triangles generated by one shader pipe totally inside that shader pipe (i.e. no transmission of attributes), or do you add large busses to funnel attributes through? Either way has its own set of issues.
I guess you're referring to a geometry shader and the attributes generated by the shader, as "output values to be treated as constants by the next stage".
My understanding is that these attributes are part of the batch's shader state for the next stage. It's irrelevant which pipes work on these triangles, as the batch shader state is universally accessible by all pipelines.
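In other words, something like a globally readable table indexed by batch (all sizes and names here are my own assumptions): any pipe reads attributes from the shared state rather than from a neighbouring pipe, so no inter-pipe attribute busses are needed:

```cpp
// Batch shader state held in one shared table. Whichever pipe picks a
// batch's triangles up reads the same attributes from the same place.
#include <array>

constexpr int kMaxBatches = 128, kMaxAttribs = 16;          // assumed sizes

struct BatchState {
    std::array<std::array<float, 4>, kMaxAttribs> attribs;  // the "constants"
};

BatchState g_batch_state[kMaxBatches];  // visible to every shader pipe

inline const std::array<float, 4>& read_attrib(int batch, int slot) {
    return g_batch_state[batch].attribs[slot];
}
```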
- How do you share vertex work between shader pipes? Do you need vertex batches to be complete primitives?
I can't say anything about this, as my understanding of vertex shading is pretty minimal. I'm not sure why vertex work would be shared between shader pipes concurrently whilst performing vertex shading.
---
I'm basing my answers on what I've read about Xenos. Some of it is pure presumption, no doubt about it :smile:
Jawed