ATi's roadmap for 2004

DemoCoder said:
BTW, now that I think about it, has anyone ever explained why the R300 has a limit of 4th-order dependent texture fetches? Is it just an API/driver limitation? Does this mean the pipeline is register-combiner-ish?
It's nothing to do with being 'register-combiner-ish'. It was a tradeoff, to do with the complex equations of multipass-vs-multitexture.
 
DemoCoder said:
You have to do a lot more to get on my list. It's very exclusive.

Demalion, do you have any links or info as to why you think the R300's vertex units are already close to VS3.0? Seems to me like there is a lot missing.

"...to the full PS 3.0 spec, if coupled with the R300's centroid sampling capable TMUs and pixel processing frontend" is actually what I said. I'm not sure of the point of your question...do you expect ATI won't have VS 3.0 functionality? That's all my discussion assumes. If you disagree with my evaluation of "quite close", just mention why.

I don't think adding multiple FP ALUs to each pipe is overly complicated.

Hmm? Of course that (as stated here) isn't complicated; it just invites your earlier comment about a high peak but low effective processing rate, so I think they'll try to do something else.

Each pipe already has 3 processing units (scalar, vector, and texture addressing) which the device driver has to schedule for co-issue. The pipeline already has to contend with data being routed to different units on the R300. Adding a second vector unit and having the driver re-arrange instructions for co-issue shouldn't be a big deal, since it must already do that anyway for the R300.
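(To make the co-issue point concrete, here is a minimal sketch of driver-side instruction packing. This is purely illustrative Python; the instruction encoding, register names, and unit split are assumptions for the example, not ATI's actual ISA. It also ignores ordering constraints among the scalar ops themselves.)

```python
def co_issue(vector_ops, scalar_ops):
    """Greedily pair each vector op with an independent scalar op for co-issue.

    A pairing is only safe if neither op in the slot consumes the other's
    result in the same cycle; that is the only hazard checked here.
    """
    slots = []
    scalars = list(scalar_ops)
    for v in vector_ops:
        pick = None
        for i, s in enumerate(scalars):
            # Safe only if neither op reads a register the other writes.
            if not (s["reads"] & v["writes"]) and not (v["reads"] & s["writes"]):
                pick = scalars.pop(i)
                break
        slots.append((v["name"], pick["name"] if pick else None))
    # Leftover scalar ops get their own slots (the vector unit idles).
    slots.extend((None, s["name"]) for s in scalars)
    return slots
```

For example, a reciprocal that depends on a vector multiply's result gets deferred one slot, while an independent scalar op fills the gap.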

I think you misunderstood my wording...I meant, within the context of juggling "uber" pipelines between vertex and pixel processing, that allocating processing for the same pixels complicates latency hiding...it introduces issues for maintaining pipelining that discrete pixel processing with the F Buffer seems like it would not. The solution for dynamic branching for the "based on R3xx" that the R420 is supposed to be is uncharted, so I do see how it could be useful to help to resolve some "misprediction" penalty in some way.

On to the interesting bits...

Perhaps this sharing/pooling business is upside down. Perhaps there is no dedicated vertex ALUs, but instead, each pixel pipeline has 2 ALUs. Whenever vertices are ready to be processed, then one of the pixel ALUs on each pipe is "borrowed" temporarily to process the vertex.

This corresponds to the perspective you state later, except that I don't see how you get around the problem of a single pixel stalling while the rest of a quad processes. It seems more likely that it would need to occur on a quad basis, and that this would also point more towards batching than the "on demand" picture you paint (the "TBDR-alike solution" I alluded to).

If that's the case, I would suspect then that there are two kinds of ALUs on the pixel pipe. A PS3.0 capable ALU, and a combined VS3.0/PS3.0 capable ALU (can do both)

Your labelling seems to indicate that they have independent scheduling capabilities? This sounds like a transistor-expensive way to do it, as well as a bit overly complicated, unless there simply isn't any alternative. Or if you mean something else, could you clarify?

The reason I suspect that the ALUs are located in the pixel pipelines simply has to do with chip locality. Since pixels are processed at a much higher rate than vertices, wouldn't it really make sense to have the additional ALUs sit as close to the pixel pipes as possible, just for clock timing? It is more likely that the vertex processor will always be waiting for the rasterizer to finish, not the other way around.

Well, I don't think that matters much when the solution is aimed towards processing, and it doesn't seem a significant latency concern at all given the other latencies the F-Buffer and V-Buffer are designed for. And I'm not sure where your expectation of locality arises from in any case.
 
demalion said:
If you disagree with my evaluation of "quite close", just mention why.

Lack of dynamic flow control. No Gradient. No subroutines. I count these as big items.

Could you explain in detail how F-Buffer is supposed to solve the branch penalty?
 
DemoCoder said:
demalion said:
If you disagree with my evaluation of "quite close", just mention why.

Lack of dynamic flow control. No Gradient. No subroutines. I count these as big items.
Well, aside from not seeing the relevance to my discussion of re-tasking the VS 3.0 units for pixel processing, the problem here seems to be that you have a problem with my saying "quite close". The only one of those that wasn't in my mental list was "No subroutines", and yet I still think of VS 2.0 model functionality with texture fetches and centroid sampling and state buffers (V Buffer/F Buffer) as being "quite close" to PS 3.0. Probably because I'm focusing on what it adds to PS 2.0 rather than what is missing from PS 3.0, and what it allows.

Now, this leaves you perfectly free to consider them "not quite close", with quite a valid basis. The only relevance I see to my point is regarding some sort of assumption that this means a significant transistor count increase over VS 2.0 for the VS 3.0 capable units. If that is your point, and you were just waiting to tell me, where are you going with it that couldn't have been more usefully directed at the place where I discussed transistor count, so we wouldn't have to do this run-around:
"What I'm wondering, while out here on this hypothetical branch of thought, is what's the worst branching case that might happen in shading, what's the best, and what kind of solutions would be best suited to dealing with managing each acceptably? Would any such solutions lend themselves to having only 4 "uber" pipelines instead of 8, because transistors would be better spent on implementing the solution usefully?"

:?:

Perhaps I'm just confused by expecting this to relate to something else besides the perspective for "quite"?

Could you explain in detail how F-Buffer is supposed to solve the branch penalty?

Are you going to offer an explanation in detail of how it is going to be done without it? :-?

But to try and answer:

What I said was that the hardware used to provide the VS 3.0 implementation while avoiding branch penalties could be used for PS 3.0 implementation as facilitated by F Buffer and V Buffer mechanisms to schedule re-tasking of such units. This doesn't propose the F Buffer as the solution itself, but as a mechanism that would allow "the solution" to be applied to processing while working to avoid stalls, and would offer other opportunities for benefit to processing throughput besides. I also stated that the processing units being "doubled" in a pipeline did seem reasonable as part of a solution to reduce "misprediction penalty" because of the "uncharted" nature of "the solution".

I'm not sure where you're going here:

The F Buffer and V Buffer weren't just made up out of thin air, nor the assertion that the R420 is going to offer PS and VS 3.0, nor that the R420 isn't viewed by ATI as being as revolutionary as they'd wanted (i.e., the R500) and being based on the R3xx.

If there is going to be a solution to branch penalties in the R420, it is going to be one on a chip that there is quite a bit of indication will have F Buffers and V Buffers, will be trying to achieve PS and VS 3.0 functionality, and which has been asserted to be "relatively close" (safer?) in some aspects to the R3xx (when having the R500 in mind, apparently). Why does my commentary require these questions that don't seem to relate well to it?
 
demalion said:
Why does my commentary require these questions that don't seem to relate well to it?

Why do you fear explaining your speculations? Jeez, I'm only asking questions because your writing style is so difficult to understand and vague, and sometimes it seems as if you just throw a bunch of terms together with some vague association and hope others will "get" the connections.

What I said was that the hardware used to provide the VS 3.0 implementation while avoiding branch penalties could be used for PS 3.0 implementation as facilitated by F Buffer and V Buffer mechanisms to schedule re-tasking of such units.


#1 What is the hardware used to avoid branch penalties? Can you explain it?

#2 How is F-Buffer and V-Buffer used to "schedule re-tasking of such units"? I'm aware of how F-Buffer works and what it's used for, I'm just not clear exactly how this is supposed to work in the context you are talking about. So perhaps you could explain it, instead of hand-waving. (for example, can you give a simple pseudo-code or algorithm, or at least a step-by-step explanation?)
 
DemoCoder said:
demalion said:
Why does my commentary require these questions that don't seem to relate well to it?

Why do you fear explaining your speculations?

What fear are you talking about? Your questions don't relate well to discussing my speculation.

So far we've discussed "quite" with no apparent relevance to anything else, and the idea that I'm saying the F Buffer "is" the branch penalty solution which doesn't even seem to be in my commentary. We haven't discussed the actual commentary, and hence the question.

Jeez, I'm only asking questions because your writing style is so difficult to understand and vague, and sometimes it seems as if you just throw a bunch of terms together with some vague association and hope others will "get" the connections.

I suppose it would be useful for me to make some assertions about you and your thinking process because of how you come up with your questions?

What I said was that the hardware used to provide the VS 3.0 implementation while avoiding branch penalties could be used for PS 3.0 implementation as facilitated by F Buffer and V Buffer mechanisms to schedule re-tasking of such units.


#1 What is the hardware used to avoid branch penalties? Can you explain it?

Eh? You're asking me to be a hardware designer as a response to my discussion. Why don't you answer when I turn the question around and ask you for details? Heck, how many questions of mine have you answered, and what is the ratio to the ones you've simply skipped?

You're a very lopsided conversationalist.

#2 How is F-Buffer and V-Buffer used to "schedule re-tasking of such units"? I'm aware of how F-Buffer works and what it's used for, I'm just not clear exactly how this is supposed to work in the context you are talking about.

Ah, relevant question.

The F Buffer stores the processing state such that a new "pass" can occur without re-constructing state by re-performing vertex processing and reading a computation result back as a substitute.

With states stored, scheduling (by some sort of "scheduler") the re-tasking of processing units is facilitated, made easier, made more feasible, seems more likely for the R420. Solutions (that I also haven't designed) directed towards hiding the latency of storing and recovering the state to be processed seem indicated in a design that utilizes such a system for overcoming instruction count limitations. Such solutions seem applicable to the concept of efficiently re-tasking processing units from vertex processing to pixel processing and back, as predicted by the scheduler, at least in quads and in batches, in contrast to my understanding of your PS 3.0 ALU + PS/VS 3.0 ALU discussion (which I mention because I've yet to have my questions answered).
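(A toy model of the mechanism being described: a long shader split into passes, where each fragment's intermediate registers are carried forward in rasterization order. Everything here is illustrative; the real F Buffer holds state in hardware FIFOs, not Python dicts, and the pass-splitting is done by the driver, not by the application.)

```python
def run_long_shader(fragments, passes):
    """Run a shader split into passes, carrying per-fragment state in a FIFO.

    'fbuffer' holds each fragment's intermediate registers in rasterization
    order, so a later pass resumes from stored state instead of re-running
    vertex processing and reading a framebuffer color back as a substitute.
    """
    fbuffer = [dict(f) for f in fragments]  # pass-0 inputs: interpolants
    for shader_pass in passes:
        # Each pass consumes the stored state FIFO and emits the next one.
        fbuffer = [shader_pass(dict(state)) for state in fbuffer]
    return fbuffer
```

The point the sketch illustrates is that between passes the state sits in a buffer rather than in the pipeline, which is exactly the window in which a scheduler could re-task the units that would otherwise process it.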

Relating to question #1, where in my discussion do I assert that I have a design for the scheduler, or depend on having one to make my point?

So perhaps you could explain it, instead of hand-waving. (for example, can you give a simple pseudo-code or algorithm, or at least a step-by-step explanation?)
*sigh*

Perhaps you should ask more useful questions instead of accusing people of "hand-waving" while they endure your personality while wading through questions that are not. And/or consider how your attributions to other people might appear to apply to your own behavior from someone else's perspective.
 
demalion said:
The F Buffer stores the processing state such that a new "pass" can occur without re-constructing state by re-performing vertex processing and reading a computation result back as a substitute.
Yes, yes, wonderful for alleviating the huge overhead of multipass, but...

With states stored, scheduling (by some sort of "scheduler") the re-tasking of processing units is facilitated, made easier, made more feasible, seems more likely for the R420. Solutions (that I also haven't designed) directed towards hiding the latency of storing and recovering the state to be processed seem indicated in a design that utilizes such a system for overcoming instruction count limitations.

You haven't fully explained how this is supposed to be an overall performance win.

#1 the overhead of swapping the state will increase latency.
#2 You've got, what, oodles of pixels in flight in the pipeline, at various stages of using an ALU, and you want to pause a shader mid-execution, swap out its state, to free up the ALU for use by the vertex engine? What happens to the other hundred ALU ops in flight queued up?

Context switching can either increase performance, or decrease performance. Unless you're blocked waiting for I/O or your unit is idle because you had to insert some NOPs, often it will decrease overall performance. If you're in the middle of executing a pixel shader, can you explain how swapping your state out to the F-Buffer will increase the performance of this shader?
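(The trade-off in that paragraph reduces to a one-line break-even test. All cycle counts below are made up for illustration; nothing here is a claimed R420 figure.)

```python
def retask_wins(idle_cycles, save_cost, restore_cost):
    """Retasking an ALU only pays off if the stall it would otherwise sit
    through is longer than the round trip of saving and restoring state."""
    return idle_cycles > save_cost + restore_cost
```

So a 20-cycle stall comfortably absorbs a 6-cycle save plus 6-cycle restore, while interrupting a shader that is not blocked at all (zero idle cycles) is a pure loss, which is the case being objected to here.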

Now, I could see how a "V-Buffer" might work to increase performance. If your triangle setup is stalled waiting for some triangles to finish, you could interrupt a vertex shader in progress, save its state to V-Buffer, lend those ALUs to the pixel pipe, and then continue where you left off. But this only seems to make sense for very long vertex shaders, otherwise, you could wait for completion of a vertex, and avoid having to "save state" in the middle of a shader. Then there is the difficulty of when to "take back" the ALUs you lent.

It all seems overly complicated and risky compared to just doubling up FP units, making the VLIW instruction words longer, and having the driver simply co-issue two vector ops at once. Now, it's possible there's some other trick at work and I'm wrong, and the F-Buffer will enable some mega performance boost not related to multi-pass savings, but I don't know what it is, and I'm asking you to explain it, instead of asserting it without details.
 
DemoCoder said:
demalion said:
The F Buffer stores the processing state such that a new "pass" can occur without re-constructing state by re-performing vertex processing and reading a computation result back as a substitute.
Yes, yes, wonderful for alleviating the huge overhead of multipass, but...

With states stored, scheduling (by some sort of "scheduler") the re-tasking of processing units is facilitated, made easier, made more feasible, seems more likely for the R420. Solutions (that I also haven't designed) directed towards hiding the latency of storing and recovering the state to be processed seem indicated in a design that utilizes such a system for overcoming instruction count limitations.

You haven't fully explained how this is supposed to be an overall performance win.

#1 the overhead of swapping the state will increase latency.

Yes, which is why I mentioned the idea of solutions directed towards hiding latency in that text.

#2 You've got, what, oodles of pixels in flight in the pipeline, at various stages of using an ALU, and you want to pause a shader mid-execution, swap out its state, to free up the ALU for use by the vertex engine? What happens to the other hundred ALU ops in flight queued up?

No, this pausing is not occurring arbitrarily...where are you getting that? It is occurring when it will offer benefit.

How many pixels will be calculated for a triangle? How many clock cycles per pixel? How many triangles can be queued? This seems to leave quite a lot of room for quite a significant benefit of assigning a quad (or maybe even two) of processing pipelines where even an "unhidden" latency of state storing/recovery is minor in comparison, though I still don't see why another state set couldn't have been loaded and put in the pipeline in anticipation of this.
The highest performance case for this idea that seems feasible (AFAICS) is that this tasking would occur on a per quad basis and that there are two "uber" quad sets...one could be biased towards vertex processing to avoid unexpected "empty bucket" cases due to, for example, "uber buffer" type usages.
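(Putting illustrative numbers on the claim that an unhidden state swap is minor next to a quad's pixel workload. All values are assumed for the example; none come from ATI.)

```python
def swap_overhead(pixels, cycles_per_pixel, quad_width, swap_cycles):
    """Fraction of a quad's time lost to one unhidden state save/restore."""
    work = pixels * cycles_per_pixel / quad_width  # cycles spent on the triangle
    return swap_cycles / (work + swap_cycles)
```

For a 400-pixel triangle at 8 cycles per pixel on a 4-wide quad, the quad has 800 cycles of work, so even a fully unhidden 20-cycle swap costs about 2.4%, and preloading the next state set, as suggested above, would hide most of that.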

Also, for the case of dynamic branching and branching penalties: What I propose (as a theory for the transistor budget, and working towards the optimistic 8 uber/8 pixel pipe case) is that only the "uber" pipes are capable of handling dynamic branching effectively, and that the solution for pixel processing is to schedule their retasking for it. As far as your suggestion for processing being "doubled up", I say that makes sense to me specifically in conjunction with this idea in order to direct computation resources at a "misprediction penalty reduction". Scheduled beforehand by taking the state, storing it, and queuing it for them.

Context switching can either increase performance, or decrease performance. Unless you're blocked waiting for I/O or your unit is idle because you had to insert some NOPs, often it will decrease overall performance.

Well, the idea of the vertex units being retasked when waiting for some place to put output has been discussed several times already and seems fairly straightforward. You just wanted it mentioned in this thread?

If you're in the middle of executing a pixel shader, can you explain how swapping your state out to the F-Buffer will increase the performance of this shader?

Eh?! Where am I saying swapping state is increasing performance?

Now, I could see how a "V-Buffer" might work to increase performance. If your triangle setup is stalled waiting for some triangles to finish, you could interrupt a vertex shader in progress, save its state to V-Buffer, lend those ALUs to the pixel pipe, and then continue where you left off.

Sure. And my discussion for the F Buffer is based on 1) state being stored there between "passes" anyways 2) this providing opportunity for "uber" pipes to apply enhanced functionality and/or performance when this occurs 3) a move towards intermixing the data sources pixel and vertex processing units will be processing.

But this only seems to make sense for very long vertex shaders, otherwise, you could wait for completion of a vertex, and avoid having to "save state" in the middle of a shader.

Sure. I'm probing handling the shader model specifications fully and how state management can apply to this, that's why I asked for your input on best and worst cases for branching.

Then there is the difficulty of when to "take back" the ALUs you lent.

Well, that's why I've been curious about what can be done with expanding state management beyond "fast multipass". The overly simplified answer is "when it avoids stalling". The interesting answer is, if this happens to pan out, what ATI implemented to schedule this. Well, among other interesting answers, such as the answers for what at least two other IHVs are doing.

It all seems overly complicated and risky compared to just doubling up FP units, making the VLIW instruction words longer, and having the driver simply co-issue two vector ops at once.

"just" doubling up FP units and switching them in and out for each pixel pipeline? I've asked my questions about that already...that seems to net less benefit (similar peak, less typical sustained) and be quite risky or unworkable.

To me it makes more sense for a different architecture where risks for high transistor overhead branching solutions and state management might have already been taken, where the option of being "undoubled up" could be a means to allow more generally useful throughput by increased "parallelism".

Now, it's possible there's some other trick at work and I'm wrong, and the F-Buffer will enable some mega performance boost not related to multi-pass savings, but I don't know what it is, and I'm asking you to explain it, instead of asserting it without details.

As you go ahead and assert "just doubling up" without details or any response to my questions about it? I say again, "lopsided".
 