ATi's roadmap for 2004

tEd said:

to have 8 advanced pixel pipelines ATI calls "extreme pipelines"

Dave, do you work at ATI now? ;)
yes, that quotation seems very similar to what DaveBaumann mentioned in this thread:

http://www.beyond3d.com/forum/viewtopic.php?t=9611
R420's pipelines might be somewhat extreme.


.... also, the quotation seems similar to what MuFu stated in this thread:

http://www.rage3d.com/board/showthread.php?s=&threadid=33731776&perpage=30&pagenumber=2
What's certain is that they're 8 UBER-PIPELINES of some sort.
 
Heh - well, if you read some of my posts carefully you'll get the news two months before other sites start to catch on!
 
8) Sure. Actually, you should just post the new information, because 5489 posts is a bit much to search through to find out what's new, what's rumor, what's a correction, what's just your thoughts, what's teasing, etc... :devilish: ;) :LOL:
 
What it looks like to me is a great deal of general indication that pipelines will be re-taskable to either pixel or vertex shading in the upcoming generation, lending some validation to long-standing speculation. If you consider "uber buffers" along with the various commentary, all sorts of interesting thoughts are sparked.

Currently, there seems to be very strong indication of this from at least 2 IHVs, and a general perception of feasibility, AFAICS, for a 3rd as well. I hope all 3 pan out soon.
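
Purely to make the re-tasking idea concrete, here's a toy sketch assuming one shared pool of shader units that a scheduler points at vertex or pixel batches depending on which queue is backing up. Everything in it (unit count, scheduling policy, names) is my own invention for illustration, not anything an IHV has described:

```python
# Toy model of "re-taskable" shader pipelines: one pool of ALUs, assigned
# batch-by-batch to vertex or pixel work depending on demand.
# All names, counts, and the policy are made up for illustration.

class ShaderALU:
    def run(self, batch):
        kind, items = batch
        # In real hardware this would be running a VS or PS program;
        # here we only count the work.
        return f"processed {len(items)} {kind}"

class RetaskingScheduler:
    def __init__(self, num_alus=8):
        self.alus = [ShaderALU() for _ in range(num_alus)]

    def dispatch(self, vertex_batches, pixel_batches):
        results = []
        for alu in self.alus:
            # Naive policy: keep vertex work ahead of the rasteriser,
            # otherwise spend the unit on pixels.
            if len(vertex_batches) > len(pixel_batches):
                results.append(alu.run(("vertices", vertex_batches.pop())))
            elif pixel_batches:
                results.append(alu.run(("pixels", pixel_batches.pop())))
        return results

# Example: vertex-heavy start of a frame vs. pixel-heavy fill.
sched = RetaskingScheduler()
print(sched.dispatch([list(range(16))] * 6, [list(range(4))] * 2))
```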
 
DaveBaumann said:
Heh - well, if you read some of my posts carefully you'll get the news two months before other sites start to catch on!

Well, taken all together, and combined with my existing perception of the R420 being based on enhanced F/V-Buffer handling, the various commentary leads me to picture the R420 as having 8 pixel pipelines, some number of vertex pipelines (8 seems possible for the transistor budget, but perhaps too optimistic?), and some ability to re-task them.

Whether all pipelines are capable of branch handling, or whether only the "8" that look like they could be evolved VS 2.0 pipelines are, and what peaks are actually reached for pixel and vertex processing (at PS 2.0 and PS 3.0 levels), would still remain big questions even if the speculation above were true in its optimistic form. But the various ambiguities around "16" and "8" pixel pipelines would sure make a lot of sense.

But all this amounts to is another opportunity for Wavey to peg the tease-o-meter. :-?
 
Maybe it means that there is a pool of TMUs and they can be dynamically allocated between pipeline TMU duty, and performing some other operations (non-TMU). e.g. it's an 8x2 design, but can function like 16x1 with single FP power, or 8x1 with double the FP power?
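
To put rough numbers on those configurations: the texel and FP rates just shift around depending on how the same units are grouped. The counts below are placeholders for illustration, not anything leaked:

```python
# Back-of-the-envelope peak rates for the three hypothetical configurations
# described above.  Unit counts are illustrative assumptions, not R420 specs.

modes = {
    #                      pipes  tex units/pipe  FP ALUs/pipe
    "8x2 mode":            (8,    2,              1),
    "16x1 mode":           (16,   1,              1),
    "8x1 double-FP mode":  (8,    1,              2),
}

for name, (pipes, tex, fp) in modes.items():
    print(f"{name:20s}: {pipes:2d} pixels/clk, "
          f"{pipes * tex:2d} texels/clk, {pipes * fp:2d} FP vector ops/clk")
```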
 
DemoCoder said:
Maybe it means that there is a pool of TMUs and they can be dynamically allocated between pipeline TMU duty, and performing some other operations (non-TMU). e.g. it's an 8x2 design, but can function like 16x1 with single FP power, or 8x1 with double the FP power?

I doubt that's possible. I had a similar thought some time ago, but it seems that it isn't that simple.

Pool of TMUs is an interesting thought though. TMU array?
 
Yeah, but the other ideas don't seem feasible either. Even if you allow the 4 vertex units to be lent to the pixel shaders, that's 0.5 per pipeline (if 8 pipelines), and I think balancing the workload on the vertex units with the shader workload would be quite a difficult problem. Process 3 vertices, then hand them to the PS? When do they get handed back to the vertex processor to do the next batch? Doesn't seem so straightforward either.
It's doable, but not easy.
 
Consider the F-Buffer, which is an implementation of pixel-processing state management and storage to avoid re-processing vertices (it sort of strikes me as making an IMR more closely suited to pixel/vertex processing re-tasking, the way a TBDR is...hmm). A rough sketch of the F-Buffer idea follows below.

Consider the V-Buffer, which was mentioned by someone from ATI in passing as something like "F-Buffers for vertex processing".

Consider effort being expended to add transistors focused specifically on the performance enhancement, management, and scheduling of these buffers, and what that might allow something awfully close to the existing basic processing pipelines of the R3xx to accomplish (for example, PS/VS 3.0).

Consider how long we've had a description of the R420 that corresponds to that last bit.
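
For reference, here's a minimal sketch of what the F-Buffer amounts to: a shader too long for one pass gets split into passes, and each fragment's intermediate values sit in a rasterization-order FIFO between passes instead of being recomputed (the V-Buffer would presumably be the analogous trick for vertices). This is just the concept from the F-Buffer paper, not ATI's implementation:

```python
# Minimal sketch of the F-Buffer concept: split a long fragment program into
# passes and keep intermediate per-fragment values in a rasterization-order
# FIFO between passes.  Concept only; not ATI's implementation.

from collections import deque

def run_multipass(fragments, passes):
    """fragments: initial per-fragment inputs, in rasterization order.
       passes: list of functions, each one stage of the split shader."""
    fbuffer = deque(fragments)
    for shade_pass in passes:
        next_fbuffer = deque()
        while fbuffer:
            value = fbuffer.popleft()                # order preserved
            next_fbuffer.append(shade_pass(value))   # intermediate stored, not redone
        fbuffer = next_fbuffer
    return list(fbuffer)

# Example: a "shader" split into three short passes.
passes = [lambda v: v * 2.0, lambda v: v + 0.5, lambda v: min(v, 1.0)]
print(run_multipass([0.1, 0.4, 0.9], passes))
```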

...


As for it not being easy: that's why, even though I think 8 vertex or "uber" (or maybe some more exciting marketing-speak term) pipelines in addition to 8 pixel pipelines seem feasible within the expected transistor budget, I don't think a peak of 16 pipelines applied to all types of processing is guaranteed even if that expectation is true.

It seems likely that not all pipelines are created equal for the R420 (because of transistor budget...I don't see ATI pulling off both the 8-unit case and full PS 3.0 functionality for the 8 basic pixel pipelines that seem a given). It also seems to me that the vertex processing units of the R300 are quite close to the full PS 3.0 spec, if coupled with the R300's centroid-sampling-capable TMUs and pixel processing frontend.

Among the possibilities, the one that seems likely to me is no peak processing with full PS 3.0/VS 3.0 utilization. Specifically, what seems likely to me is peak parallel processing of 16 for base PS 2.0, 8 for VS 3.0/VS 2.0, and some sort of choke for PS 3.0. Why I say a "choke", and not simply "8", for PS 3.0 parallelism is that buffer management and preservation of state by the buffer systems I mentioned seem to have some possibilities for avoiding stalls and, in combination with there being lots of other instructions besides flow control, for allowing a return to the theoretical 16-parallel-pixel peak (pardon...*deep breath*...the spittle).

I don't see the hypothetical "uber" pipelines being used for adding processing depth, because this seems to complicate the problem of maintaining pipelining and managing buffers. I do see them doing something akin to that for managing branching situations, however, simply because of the transistor budget and the existing VS 2.0 featureset and implementation in the R3xx. I'm not sure about the TMU speculation...I don't see multiple-texture-unit-per-pipe functionality as being a useful focus with an emphasis on processing, but OTOH, the directions shading might be taking, and the idea of "uber" buffer implementation fitting into this, might mean it is still necessary for hiding latency.

What I'm wondering, while out here on this hypothetical branch of thought, is what's the worst branching case that might happen in shading, what's the best, and what kind of solutions would be best suited to managing each acceptably? Would any such solutions lend themselves to having only 4 "uber" pipelines instead of 8, because transistors would be better spent on implementing the solution usefully? The bandwidth relationship we've been led to expect seems to fit this more than the case for 16, really, but maybe ATI reads something else into the future.
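
To make "worst case" and "best case" slightly more concrete: if pixels get shaded in SIMD groups, a branch that stays coherent within a group costs one path, while a divergent one effectively costs both. The group size and path lengths below are pulled out of thin air:

```python
# Toy branching cost in a SIMD group of pixels: coherent groups execute one
# path, divergent groups effectively execute both.  Numbers are arbitrary.

GROUP = 16                      # pixels shaded in lockstep (assumed)
path_a, path_b = 20, 60         # instruction counts for the two branch sides

def group_cost(pixels_taking_a):
    if pixels_taking_a in (0, GROUP):        # coherent: only one path runs
        return path_a if pixels_taking_a else path_b
    return path_a + path_b                   # divergent: both paths run

for taken in (GROUP, 0, GROUP // 2):
    print(f"{taken:2d}/{GROUP} pixels take path A -> {group_cost(taken)} instruction slots")
```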
 
I will chortle if the uber pipelines' extremeness only consists of being able to do z-only at twice the rate of color ops. ;)
 
RussSchultz said:
I will chortle if the uber pipelines' extremeness only consists of being able to do z-only at twice the rate of color ops. ;)

The thought had crossed my mind too. :LOL:

That doesn't really fit in with the early ideas WRT Loki though - I'd be more inclined to believe it's some 16-pipe, 200mil+ transistor behemoth based on those.
 
RussSchultz said:
I will chortle if the uber pipelines' extremeness only consists of being able to do z-only at twice the rate of color ops. ;)

Yeah, but as long as they don't claim 16 uber pipelines (for a part that is 8x1 / 16x0) unlike some other company ;), it will only be a chortle, not thoroughly disappointing.
 
You have to do a lot more to get on my list. It's very exclusive.

Demalion, do you have any links or info as to why you think the R300's vertex units are already close to VS3.0? Seems to me like there is a lot missing.


I don't think adding multiple FP ALUs to each pipe is overly complicated. Each pipe already has 3 processing units (scalar, vector, and texture addressing) which the device driver has to schedule for co-issue. The pipeline already has to contend with data being routed to different units on the R300. Adding a second vector unit and having the driver re-arrange instructions for co-issue shouldn't be a big deal, since it must already do that anyway for the R300.
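
As a toy version of that scheduling, assuming a hypothetical pipe with the existing scalar and texture-address units plus a second vector ALU, the driver's job is basically slot-filling like this (data dependencies are ignored here, which a real driver obviously can't do):

```python
# Toy co-issue packer for a hypothetical pipe: scalar unit, texture-address
# unit, and two vector ALUs (the second one being the speculated addition).
# Greedy, in-order per unit, and it ignores data dependencies entirely.

SLOTS_PER_CYCLE = {"scalar": 1, "vector": 2, "tex_addr": 1}   # assumed widths

def pack_coissue(instructions):
    """instructions: list of (asm_text, unit) pairs."""
    cycles = []   # each cycle is {unit: [asm_text, ...]}
    for asm, unit in instructions:
        for cycle in cycles:
            if len(cycle.get(unit, [])) < SLOTS_PER_CYCLE[unit]:
                cycle.setdefault(unit, []).append(asm)
                break
        else:
            cycles.append({unit: [asm]})
    return cycles

program = [("rsq r0.w, r0.w",     "scalar"),
           ("dp3 r1, v0, c0",     "vector"),
           ("texld r2, t0",       "tex_addr"),
           ("mul r3, r1, c1",     "vector"),
           ("mad r4, r2, r3, c2", "vector")]
for i, cycle in enumerate(pack_coissue(program)):
    print(f"cycle {i}: {cycle}")
```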

Perhaps this sharing/pooling business is upside down. Perhaps there are no dedicated vertex ALUs, but instead each pixel pipeline has 2 ALUs. Whenever vertices are ready to be processed, one of the pixel ALUs on each pipe is "borrowed" temporarily to process them.

If that's the case, I would suspect that there are two kinds of ALUs on the pixel pipe: a PS3.0-capable ALU, and a combined VS3.0/PS3.0-capable ALU (one that can do both).

The reason I suspect that the ALUs are located in the pixel pipelines simply has to do with chip locality. Since pixels are processed at a much higher rate than vertices, wouldn't it really make sense to have the additional ALUs sit as close to the pixel pipes as possible, just for clock timing? It is more likely that the vertex processor will always be waiting for the rasterizer to finish, not the other way around.
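
Sketching that layout as stated: each pipe owns a PS-only ALU plus a combined VS/PS ALU, and the combined one gets lent out whenever vertices are waiting. Again, this is only an illustration of the conjecture above, with made-up names and counts:

```python
# Sketch of the "borrowed ALU" conjecture: per pipe, one PS-only ALU and one
# combined VS/PS ALU that is lent to vertex work when vertices are queued.
# Illustration of the conjecture only; names and counts are made up.

class PixelPipe:
    def __init__(self, idx):
        self.idx = idx

    def run_cycle(self, vertex_fifo, pixel_fifo):
        done = []
        if pixel_fifo:
            done.append(("PS-ALU",    pixel_fifo.pop(0)))    # dedicated pixel ALU
        if vertex_fifo:
            done.append(("VS/PS-ALU", vertex_fifo.pop(0)))   # borrowed for a vertex
        elif pixel_fifo:
            done.append(("VS/PS-ALU", pixel_fifo.pop(0)))    # otherwise more pixel work
        return done

pipes = [PixelPipe(i) for i in range(8)]
vertices = [f"v{i}" for i in range(3)]      # vertex demand is usually lower...
pixels   = [f"p{i}" for i in range(40)]     # ...than pixel demand, as noted above
for pipe in pipes:
    print(pipe.idx, pipe.run_cycle(vertices, pixels))
```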
 
BTW, now that I think about it, has anyone ever explained why the R300 has a limit of 4th-order dependent texture fetches? Is it just an API/driver limitation? Does this mean the pipeline is register-combiner-ish?
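
For what's actually being counted there: a dependent read is a fetch whose coordinates come from an earlier fetch's result rather than from an interpolated texcoord, and the chain of such reads is what hits the 4-level limit being asked about. A little sketch of counting the depth (register and sampler names are just placeholders):

```python
# Counting dependent texture read depth: a fetch whose coordinates come from
# an earlier fetch's result adds a level to the chain.  The R300/PS 2.0 limit
# discussed above is 4 such levels.  Names below are placeholders.

def dependency_depth(fetches):
    """fetches: (dest_register, coord_source) pairs; coord_source is either an
       interpolated texcoord ('t*') or the dest register of an earlier fetch."""
    depth = {}
    for dest, src in fetches:
        depth[dest] = 0 if src.startswith("t") else depth[src] + 1
    return max(depth.values())

shader = [("r0", "t0"),   # coords from an interpolator -> not dependent
          ("r1", "r0"),   # coords from r0's result     -> 1st-order dependent
          ("r2", "r1"),   #                             -> 2nd-order
          ("r3", "r2"),   #                             -> 3rd-order
          ("r4", "r3")]   #                             -> 4th-order (the limit in question)
print("dependent read depth:", dependency_depth(shader))
```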
 