Joe DeFuria
Legend
John Reynolds said:
That's ok, you're on my sh*t list today.
j/k
DemoCoder said:
BTW, now that I think about it, has anyone ever explained why the R300 has a limit of 4th-order dependent texture fetches? Is it just an API/driver limitation? Does this mean the pipeline is register-combiner-ish?

It's nothing to do with being 'register-combiner-ish'. It was a tradeoff, to do with the complex equations of multipass vs. multitexture.
DemoCoder said:
You have to do a lot more to get on my list. It's very exclusive.
Demalion, do you have any links or info as to why you think the R300's vertex units are already close to VS 3.0? Seems to me like there is a lot missing.
I don't think adding multiple FP ALUs to each pipe is overly complicated.
Each pipe already has 3 processing units (scalar, vector, and texture addressing) which the device driver has to schedule for co-issue. The pipeline already has to contend with data being routed to different units on the R300. Adding a second vector unit and having the driver re-arrange instructions for co-issue shouldn't be a big deal, since it must already do that anyway for the R300.
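That co-issue pass is easy to sketch. Here's a toy, driver-style version that pairs each vector op with an independent scalar op so both units stay busy; the `(unit, dst, srcs)` instruction format and register names are invented for illustration, not ATI's actual encoding:

```python
# Toy sketch of driver-side co-issue scheduling. The (unit, dst, srcs)
# instruction format and register names are invented, purely to
# illustrate the pairing idea, not ATI's actual encoding.

def co_issue(instructions):
    """Greedily pack (vector, scalar) pairs into single issue slots."""
    slots = []
    pending_scalars = []
    for op in instructions:
        unit, dst, srcs = op
        if unit == "scalar":
            pending_scalars.append(op)
        else:
            # vector op: try to pair it with an independent scalar op,
            # i.e. one that neither reads our dst nor writes our srcs
            paired = None
            for s in pending_scalars:
                if dst not in s[2] and s[1] not in srcs:
                    paired = s
                    break
            if paired:
                pending_scalars.remove(paired)
            slots.append((op, paired))
    # leftover scalar ops each get their own slot
    slots.extend((s, None) for s in pending_scalars)
    return slots

prog = [
    ("vector", "r0", ["v0", "c0"]),  # vec4 mul
    ("scalar", "r1", ["v1"]),        # e.g. a reciprocal
    ("vector", "r2", ["r0", "c1"]),  # depends on r0, not r1
]
for slot in co_issue(prog):
    print(slot)
```

Three instructions collapse into two issue slots here; a second vector unit would just mean wider slots and the same kind of dependence checking.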
Perhaps this sharing/pooling business is upside down. Perhaps there are no dedicated vertex ALUs; instead, each pixel pipeline has 2 ALUs. Whenever vertices are ready to be processed, one of the pixel ALUs on each pipe is "borrowed" temporarily to process the vertex.
If that's the case, I would suspect that there are two kinds of ALUs on the pixel pipe: a PS3.0-capable ALU, and a combined VS3.0/PS3.0-capable ALU (one that can do both).
The reason I suspect that the ALUs are located in the pixel pipelines simply has to do with chip locality. Since pixels are processed at a much higher rate than vertices, wouldn't it really make sense to have the additional ALUs sit as close to the pixel pipes as possible, just for clock timing? It is more likely that the vertex processor will always be waiting for the rasterizer to finish, not the other way around.
demalion said:
If you disagree with my evaluation of "quite close", just mention why.

DemoCoder said:
Lack of dynamic flow control. No gradient. No subroutines. I count these as big items.
Could you explain in detail how the F-Buffer is supposed to solve the branch penalty?

Well, aside from not seeing the relevance to my discussion of re-tasking the VS 3.0 units for pixel processing, the problem here seems to be that you object to my saying "quite close". The one of those that wasn't in my mental list was "no subroutines", and yet I still think of VS 2.0 model functionality with texture fetches, centroid sampling, and state buffers (V Buffer/F Buffer) as being "quite close" to PS 3.0. Probably because I'm focusing on what it adds to PS 2.0, and what that allows, rather than on what is missing from PS 3.0.
demalion said:
Why does my commentary require these questions that don't seem to relate well to it?
What I said was that the hardware used to provide the VS 3.0 implementation while avoiding branch penalties could be used for PS 3.0 implementation as facilitated by F Buffer and V Buffer mechanisms to schedule re-tasking of such units.
DemoCoder said:
demalion said:
Why does my commentary require these questions that don't seem to relate well to it?
Why do you fear explaining your speculations?
Jeez, I'm only asking questions because your writing style is so difficult to understand and vague; sometimes it seems as if you just throw a bunch of terms together with some vague association and hope others will "get" the connections.
demalion said:
What I said was that the hardware used to provide the VS 3.0 implementation while avoiding branch penalties could be used for PS 3.0 implementation as facilitated by F Buffer and V Buffer mechanisms to schedule re-tasking of such units.
#1 What is the hardware used to avoid branch penalties? Can you explain it?
#2 How are the F-Buffer and V-Buffer used to "schedule re-tasking of such units"? I'm aware of how the F-Buffer works and what it's used for; I'm just not clear exactly how this is supposed to work in the context you are talking about.
*sigh* So perhaps you could explain it, instead of hand-waving. (For example, can you give simple pseudo-code or an algorithm, or at least a step-by-step explanation?)
demalion said:
The F Buffer stores the processing state such that a new "pass" can occur without re-constructing state by re-performing vertex processing and reading a computation result back as a substitute.
With states stored, scheduling (by some sort of "scheduler") the re-tasking of processing units is facilitated, made easier and more feasible, and seems more likely for the R420. Solutions (which I also haven't designed) directed towards hiding the latency of storing and recovering the state to be processed seem indicated in a design that utilizes such a system for overcoming instruction count limitations.

Yes, yes, wonderful for alleviating the huge overhead of multipass, but...
You haven't fully explained how this is supposed to be an overall performance win.
#1 The overhead of swapping the state will increase latency.
#2 You've got, what, oodles of pixels in flight in the pipeline, at various stages of using an ALU, and you want to pause a shader mid-execution and swap out its state to free up the ALU for use by the vertex engine? What happens to the other hundred ALU ops in flight, queued up?
Context switching can either increase or decrease performance. Unless you're blocked waiting for I/O, or your unit is idle because you had to insert some NOPs, it will often decrease overall performance.
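The tradeoff can be put in back-of-envelope terms: a switch only wins when the stall it hides is bigger than the save-plus-restore cost. A toy cost model (all cycle counts invented for illustration):

```python
# Toy cost model for the context-switch tradeoff: swapping state out
# only pays off when the stall you hide exceeds the swap overhead.
# All cycle counts are invented for illustration.

def switch_is_win(stall_cycles, save_cost, restore_cost, other_work):
    """True if swapping lets us overlap more useful work than the
    save/restore overhead costs us."""
    overhead = save_cost + restore_cost
    hidden = min(stall_cycles, other_work)  # work we can overlap with the stall
    return hidden > overhead

# Blocked on a long texture fetch with other work available: worth it.
print(switch_is_win(stall_cycles=200, save_cost=20, restore_cost=20, other_work=150))
# Mid-shader with no stall to hide: pure loss.
print(switch_is_win(stall_cycles=0, save_cost=20, restore_cost=20, other_work=150))
```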
If you're in the middle of executing a pixel shader, can you explain how swapping your state out to the F-Buffer will increase the performance of this shader?
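For the record, the multipass saving itself isn't in dispute. A toy model of what the F-Buffer buys there (the per-pass limit and the representation of shader ops as plain functions are invented): split an over-long shader into passes and carry the live per-fragment state forward in a buffer, instead of re-running geometry and rasterization each pass.

```python
# Toy model of F-Buffer-style multipass: split an over-long shader into
# passes and buffer intermediate per-fragment state between them, so
# geometry is never reprocessed. The limit and the representation of
# "ops" as plain Python functions are invented for illustration.

MAX_OPS_PER_PASS = 4  # invented per-pass instruction limit

def run_shader_multipass(fragments, ops):
    """Apply `ops` (each a state -> state function) to every fragment,
    at most MAX_OPS_PER_PASS ops per pass, buffering state between passes."""
    fbuffer = list(fragments)  # pass 0 input: the rasterized fragments
    for start in range(0, len(ops), MAX_OPS_PER_PASS):
        chunk = ops[start:start + MAX_OPS_PER_PASS]
        next_fbuffer = []
        for state in fbuffer:  # fragments replayed in FIFO order
            for op in chunk:
                state = op(state)
            next_fbuffer.append(state)
        fbuffer = next_fbuffer  # spilled state feeds the next pass
    return fbuffer

# A 6-op "shader" runs as 2 passes, with no geometry reprocessed between.
ops = [lambda s, k=k: s + k for k in range(6)]
print(run_shader_multipass([0.0, 10.0], ops))
```

But note what it does and doesn't buy: it removes the re-rasterization cost of multipass, while the ALU work and the buffer traffic remain.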
Now, I could see how a "V-Buffer" might work to increase performance. If your triangle setup is stalled waiting for some triangles to finish, you could interrupt a vertex shader in progress, save its state to the V-Buffer, lend those ALUs to the pixel pipe, and then continue where you left off.
But this only seems to make sense for very long vertex shaders; otherwise, you could wait for the vertex to complete and avoid having to "save state" in the middle of a shader.
Then there is the difficulty of when to "take back" the ALUs you lent.
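Even a toy version of that lend/take-back policy shows how many knobs are involved. A sketch, with a completely invented heuristic (lend up to half the pool, and only when the vertex backlog is deep enough):

```python
# Speculative toy of the "borrowed ALU" scheme: a pool of ALUs normally
# serves pixel work, but gets partially lent to vertex work when the
# vertex backlog is deep. The pool size and threshold are invented.

from collections import deque

def schedule(pixel_jobs, vertex_jobs, num_alus=8, borrow_threshold=2):
    pixels = deque(pixel_jobs)
    vertices = deque(vertex_jobs)
    done = []
    while pixels or vertices:
        # lend ALUs only when the vertex backlog justifies the switch
        if len(vertices) >= borrow_threshold:
            lent = min(len(vertices), num_alus // 2)
        else:
            lent = 0
        for _ in range(lent):
            done.append(("vertex", vertices.popleft()))
        # remaining ALUs do pixel work; if pixels run dry they soak up
        # leftover vertex work instead of idling
        for _ in range(num_alus - lent):
            if pixels:
                done.append(("pixel", pixels.popleft()))
            elif vertices:
                done.append(("vertex", vertices.popleft()))
    return done
```

Every number here (pool size, threshold, the half-pool cap) is a policy decision the hardware or driver would have to get right, which is exactly the risk being argued.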
It all seems overly complicated and risky compared to just doubling up the FP units, making the VLIW instruction words longer, and having the driver simply co-issue two vector ops at once.
Now, it's possible there's some other trick at work and I'm wrong, and the F-Buffer will enable some mega performance boost not related to multi-pass savings, but I don't know what it is, and I'm asking you to explain it, instead of asserting it without details.