R400/R500 guessing game


T2k
28-Jan-2003, 06:46
IIRC, somewhere around the 9700-launch there was a notice like "Orton was more excited about the next chip R400 than the R300" - so, do we know anything about it? At least some rumours?
:?:

EDIT: Renamed a bit... ;)

Mulciber
28-Jan-2003, 06:58
I think a lot of people are expecting it to have the ability to render up to 16 pixels per clock, but I am very doubtful of this, just my opinion though. I am guessing they will move to a full 128bit pipe through the fb though, and probably support fp16 and fp32 just like nVidia.

As for other guesses...who knows? Maybe the holy grail we all seek, such as deferred rendering, with the use of embedded memory? hehehe

I think the nv35 will be a lot easier to guess, since it's just a refresh. My guess is they'll move to a 256bit bus, a .13u low-k dielectric process, and up the core to about 700mhz. Maybe double the vertex units as well.

megadrive0088
28-Jan-2003, 08:41
my guesses:

NV35
*8 pipes
*2 TMUs each (ala GF256 to GF2)
*256-bit bus
*a larger "sea of vertex math engines" than NV30
* VS/PS 3.0+
*700-750 Mhz core

R400
* 16 pipes
* 1 TMU/TEV unit per pipe that can do 4 texture per clock!
* 8 vertex shader units or equivalent of it
*256-bit bus
*GDDR3
*small amount embedded memory
*500-700 Mhz core

hkultala
28-Jan-2003, 10:26
my guesses:

NV35
*8 pipes
*2 TMUs each (ala GF256 to GF2)


not so likely.
and GF256 could do trilinear / ( cycle * TMU ),
GF2 needs to use both of its TMUs to get a trilinearly filtered texel,
so the improvement was not so big.


*256-bit bus


IMHO this is the most likely.
This alone could make quite big performance impact,
we may not see other major things.


* VS/PS 3.0+


not likely, NV35 is just refresh of NV30,
and AFAIK VS3.0's texture lookup seems "too radical stuff"


*700-750 Mhz core


GF FX is already clocked very high,
I don't expect clock speeds over 600 MHz.


R400
* 16 pipes
* 1 TMU/TEV unit per pipe that can do 4 texture per clock!


4 different textures? 4 texture samples?

16 pipes with 4 different textures/(pipe*cycle), even at 4 samples (bilinear), would IMHO be too much.
And we are moving to a situation where pixel shader FLOPS matter more,
not texturing speed, so I don't see this kind of "texturing monster" reasonable.

16* 1 * 8(trilinear) might be possible/reasonable.
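
To make the comparison concrete, here is a minimal sketch of the texel-sample arithmetic, assuming 4 samples per bilinear lookup and 8 per trilinear lookup as in the post above. The function name is made up for illustration, and the configurations are the thread's guesses, not known specs.

```python
# Samples fetched per texture lookup for each filter mode.
BILINEAR = 4
TRILINEAR = 8

def texel_samples_per_clock(pipes, textures_per_pipe, samples_per_texture):
    """Texel samples the chip would have to fetch and filter every clock."""
    return pipes * textures_per_pipe * samples_per_texture

# megadrive0088's guess: 16 pipes, 4 bilinear textures per pipe per clock.
texturing_monster = texel_samples_per_clock(16, 4, BILINEAR)

# The alternative suggested above: 16 pipes, 1 trilinear texture per pipe per clock.
single_trilinear = texel_samples_per_clock(16, 1, TRILINEAR)

print(texturing_monster, single_trilinear)
```

Under these assumptions the "texturing monster" needs twice the sample throughput of the 16 * 1 * 8 layout.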


* 8 vertex shader units or equivalent of it


sounds reasonable.


*256-bit bus
*GDDR3
*small amount embedded memory


I expect them to have one answer to the bandwidth problem,
not that many.

A small amount of integrated framebuffer memory with a traditional IMR would not help at all. A large amount would help.
A small amount of integrated framebuffer memory with a tile-based IMR would help.
But then the buffer could maybe be made small enough to be built from SRAM, not eDRAM.

Anyway, if they were using internal framebuffers in either way, external memory bandwidth requirements would be greatly reduced, so they would not need the
>40 GB/s memory system you are suggesting.
(And if they did not need it, they would not use it, as 256-bit GDDR-3 will be expensive; but if it's needed, they will use it, even at higher cost.)

Their push towards GDDR-3 suggests they will not use eDRAM,
unless they run their GDDR-3 on only a 128-bit bus.
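
For rough numbers on the bandwidth argument: peak memory bandwidth is the bus width in bytes times the effective transfer rate. The sketch below assumes double-data-rate memory (two transfers per clock). The 600 MHz clock is purely illustrative, not a known R400 or GDDR-3 figure.

```python
def peak_bandwidth_gb_s(bus_bits, mem_clock_mhz, transfers_per_clock=2):
    """Peak bandwidth in GB/s (1 GB = 1e9 bytes), assuming DDR signalling."""
    bytes_per_transfer = bus_bits / 8
    return bytes_per_transfer * mem_clock_mhz * 1e6 * transfers_per_clock / 1e9

def clock_needed_mhz(bus_bits, target_gb_s, transfers_per_clock=2):
    """Memory clock (MHz) needed to reach a target peak bandwidth."""
    bytes_per_transfer = bus_bits / 8
    return target_gb_s * 1e9 / (bytes_per_transfer * transfers_per_clock) / 1e6

wide = peak_bandwidth_gb_s(256, 600)    # 256-bit bus, illustrative 600 MHz clock
narrow = peak_bandwidth_gb_s(128, 600)  # same memory on a 128-bit bus
needed = clock_needed_mhz(256, 40)      # clock at which a 256-bit bus reaches 40 GB/s

print(f"256-bit: {wide:.1f} GB/s, 128-bit: {narrow:.1f} GB/s, 40 GB/s needs {needed:.0f} MHz")
```

Halving the bus width halves the peak figure at the same clock, which is why a 128-bit GDDR-3 bus would only make sense if on-chip memory absorbed most of the framebuffer traffic.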

Dave Baumann
28-Jan-2003, 10:47
Small amount of integrated framebuffer memory with traditional IMR would not help at all. Large number would help.
Small number of integrated framebuffer memory with tile-based IMR would help.

The likes of R300, and possibly NV30, already have a small amount of on board memory to hold the Hierarchical Z-buffer levels. This could be increased so that more levels are added.

T2k
28-Jan-2003, 14:50
Hmmm... a large embedded memory... sounds reasonable - Flipper... ;)

Tagrineth
28-Jan-2003, 14:55
Hmmm... a large embedded memory... sounds reasonable - Flipper... ;)

Yes, as in embedded memory not for the frame buffer, but as maybe an intelligent texture cache to assist one-cycle Trilinear? That would be awesome, ne? :)

And R300 already uses a tile-based (albeit not deferred) frame buffer, and it already uses on-chip cache for that...

Mintmaster
28-Jan-2003, 15:11
I know most of R400's specs as of last summer, but who knows what changed. I think R300 was originally planned to be 8x2, but they changed that, seemingly for the better. I'm back studying now instead of at ATI.

Anyways, a lot of you guys are quite a bit off. I seem to remember Hellbinder saying a couple of things that were fairly close.

Well, can't say anything, or I'll get in trouble. :wink:

I'll tell you that I like the architecture a lot, and can see why Orton said what he did.

T2k
28-Jan-2003, 17:17
I know most of R400's specs as of last summer, but who knows what changed. I think R300 was originally planned to be 8x2, but they changed that, seemingly for the better. I'm back studying now instead of at ATI.


So, is it 8x2?


Anyways, a lot of you guys are quite a bit off. I seem to remember Hellbinder saying a couple of things that were fairly close.

Well, can't say anything, or I'll get in trouble. :wink:


OK, just tell us what he said and how close that is! ;)
I don't remember his prophecies...


I'll tell you that I like the architecture a lot, and can see why Orton said what he did.

:?:

MuFu
28-Jan-2003, 17:36
Last I heard it was this crazy hybrid, adaptive architecture - essentially 16x1 though. The source I always had my doubts about so I'm not sure on this one at all. :?

MuFu.

T2k
28-Jan-2003, 17:38
Wow...

How do you mean "hybrid" and "adaptive"?

MuFu
28-Jan-2003, 17:45
Absolutely no idea, sorry - I am pretty much quoting there. Sounds far-fetched, doesn't it? Like I said - not sure about the source at all (was almost a year ago as well). Could well have been some f*cker spotting the opportunity to get a gullible ATi fanboi worked up over nothing. :lol:

Hmm... I really have crap-all idea about architectures so it's tricky to make head or tail of the comments. I asked in a PM whether it was an 8x2 or 16x1 architecture - "16 rendering pipelines" apparently. Throws the idea of TMUs out of the window a little I think...

MuFu.

Dave Baumann
28-Jan-2003, 17:58
Well, as the onus moves away from texturing to shading pipelines, I've been wondering about the possibility of architectures having fewer texture units than actual pixel (fragment) units....

Mintmaster
28-Jan-2003, 18:09
Well, as the onus moves away from texturing to shading pipelines, I've been wondering about the possibility of architectures having fewer texture units than actual pixel (fragment) units....

I don't really expect this to happen, but rather the number of ops per pipe per clock would increase. Handling more pixels at a time isn't really that useful, and can complicate the rasterizing method.

Well, there is the Doom3 scenario of Z-only and stencil-only operations, but that's not because of the textures->shaders transition.

MuFu
28-Jan-2003, 18:16
What about localised "double-pumping" of the architecture? i.e. instead of a loopback which is essentially a "stall" relative to the global clock, the texture units operate at twice the frequency. One for the analogue guys for sure - but perhaps it's a way of alleviating situations where PS/VS units are waiting on textured output in order to fully implement a shader routine. I remember reading a while back about how the move to *x1 rendering arrays made sense in terms of PS/VS being the way forward instead of multitexturing, but paradoxically left the shading units cycling redundantly in some situations.

Again, I must apologise for my mediocre understanding of the way these things work - way too caught up in circuit theory/maths right now. Urgh! Hate maths... :?

MuFu.

MuFu
29-Jan-2003, 01:35
Screw that, how about R400 being totally async? :lol:

MuFu

Nagorak
29-Jan-2003, 02:02
Their push towards GDDR-3 suggest they will not use eDRAM,
unless they will run their GDDR-3 at only 128-bit bus.

I thought Nvidia swore off GDDR-3?

Luminescent
29-Jan-2003, 02:13
Yeah Mufu, it would be really cool if the individual vertex and fragment pipes could be partitioned and organized in different ways for different jobs, with the developer being able to code to the metal/have the flexibility exposed.

Hellbinder
29-Jan-2003, 02:31
*cough* I... May... Have ... Some idea.... *cough*

8)

After all, you guys all seem to have forgotten that I was dead on with the existence and release timeframe of the R350 back before November.... I *may* also have some sort of *pipedream* about the R350 as well, of which I *may* have dropped some subtle hints at a certain website over the past couple of months... ;)

Thus far in the last year, my only big strikeout was the 4x4 NV30 fiasco. But in that case that was truly *speculation* :P

Chalnoth
29-Jan-2003, 03:16
my guesses:

NV35
*8 pipes
*2 TMUs each (ala GF256 to GF2)
I don't think so. I think it's been posted more than a few times that multiple TMU's per pipe are relatively useless today. I think that the fragment processor/texture filtering units won't change significantly. Mostly all that will happen are performance enhancements.
*256-bit bus
More probable, but I still wouldn't count on it.
*a larger "sea of vertex math engines" than NV30
I very seriously doubt this. Its fixed-function power is already incredible (Which, with either driver optimization, or once a high-poly optimized VP benchmark is released, should translate well to high VP performance), and games today just are not vertex processing limited.
* VS/PS 3.0+
Would be very nice, but again, I wouldn't really count on it. I'd put this around the same probability as the 256-bit bus.
*700-750 Mhz core
I consider this somewhat unlikely. A core speed in that range with low-K dielectrics would probably be similar in heat to the current FX Ultra. I have a feeling that nVidia will not release another similar cooling solution anytime soon.

R400
* 16 pipes
I seriously doubt it. The die processes just aren't there yet.
* 1 TMU/TEV unit per pipe that can do 4 texture per clock!
Um, definitely not.
* 8 vertex shader units or equivalent of it
Again, same reason: die process just not there.
*256-bit bus
Almost a given, but it would be nice if it can be shown between now and then that a 256-bit bus is not necessary.
*GDDR3
Highly-likely, according to current rumors, but what does this translate to in terms of performance? Some nice memory bandwidth-bound benches would be helpful here.
*small amount embedded memory
Seriously doubt it, assuming you're talking about embedded DRAM. Embedded DRAM will still cause a major hit in core speed. Not until external memory bandwidth becomes a serious limitation will embedded DRAM happen.
*500-700 Mhz core
Probably closer to 500MHz. Die process.

I think the main thing about the R400 is that it is almost assured to have around PS/VS 3.0+. I doubt it will have much more. And yes, I really do feel that the die processes will have finally caught up with ATI with this next processor.

But, at the same time, nVidia should not be counting on this right now. They should be looking at the best possible video chip they feel ATI can produce by this fall. Then they should increase that number by 50% and target that performance (ASAP...may be impossible to target that performance this fall).

Hellbinder
29-Jan-2003, 04:09
I think the main thing about the R400 is that it is almost assured to have around PS/VS 3.0+. I doubt it will have much more. And yes, I really do feel that the die processes will have finally caught up with ATI with this next processor.


R400 is at .13u

R400 does not resemble any graphics processor to come before it.

-R300 is a F15E Strike Eagle
-R400 is a F117A Stealth Fighter

Nagorak
29-Jan-2003, 04:33
R400 is at .13u

R400 does not resemble any graphics processor to come before it.

-R300 is a F15E Strike Eagle
-R400 is a F117A Stealth Fighter

But the question remains: which of those would win in a dog fight? :lol:

antlers
29-Jan-2003, 04:35
R400 is at .13u

R400 does not resemble any graphics processor to come before it.

-R300 is a F15E Strike Eagle
-R400 is a F117A Stealth Fighter

I hope that analogy doesn't go too far. F15's kill F117A in dogfights.

Chalnoth
29-Jan-2003, 05:06
I hope that analogy doesn't go too far. F15's kill F117A in dogfights.
Hehe :) Only if it can find it. But the "F" moniker on the F117A really seems to be a misnomer (I have heard that it was supposed to 'throw off' any who found out the name back when it was still classified, though I suppose it may have more to do with the size of the aircraft than its capabilities), as the F117A has no air-to-air weaponry.

Regardless, it really does look as if Hellbinder is referring to the R400 as a deferred renderer. I hope not, as I still don't like the idea of fully-deferred rendering.

CMKRNL
29-Jan-2003, 05:08
Well if my info is correct Hellbinder, you're going to strike out again on R400 8)

JF_Aidan_Pryde
29-Jan-2003, 05:10
All this mysterious 'sourcing', gets a bit tiring doesn't it? :D

speng
29-Jan-2003, 05:20
hrm, CMKRNL or HellBinder....

I wonder who's more right.

Typedef Enum
29-Jan-2003, 06:04
CMKRNL,

Lets hear it... :)

Hellbinder
29-Jan-2003, 06:49
Well if my info is correct Hellbinder, you're going to strike out again on R400


I seriously, seriously doubt it. In fact.. I can absolutely guarantee it.

Unless you have some inside info that the R400 is not an F117A???
Or are you so sure that you even know what I mean by that? :wink:

And don't suppose that I am referring to what Chalnoth is suggesting. That was him talking, not me.

MistaPi
29-Jan-2003, 07:01
Hmmm... a large embedded memory... sounds reasonable - Flipper... ;)

Yes, as in embedded memory not for the frame buffer, but as maybe an intelligent texture cache to assist one-cycle Trilinear? That would be awesome, ne? :)

Yes.. The embedded memory for a frame/z buffer would have to be humongous. I mean, 1600x1200*32 with 8xFSAA :shock:
I don't see embedded memory for frame/z, like in Flipper, being an option for PC graphics cards in the foreseeable future.
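
A back-of-the-envelope sketch of that footprint, assuming every FSAA sample stores its own uncompressed 32-bit color and 32-bit Z/stencil value. Real hardware may compress or share samples, so treat this as an upper-bound estimate with made-up names.

```python
def framebuffer_bytes(width, height, samples=1, color_bytes=4, depth_bytes=4):
    """Uncompressed color + Z storage when each AA sample keeps its own values."""
    return width * height * samples * (color_bytes + depth_bytes)

size = framebuffer_bytes(1600, 1200, samples=8)
size_mib = size / (1024 * 1024)

print(f"{size} bytes = {size_mib:.1f} MiB")
```

Under these assumptions the buffers alone come to well over 100 MiB, which is as much as the entire external memory on high-end cards of the time.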

Chalnoth
29-Jan-2003, 07:16
Hmmm... a large embedded memory... sounds reasonable - Flipper... ;)
Yes, as in embedded memory not for the frame buffer, but as maybe an intelligent texture cache to assist one-cycle Trilinear? That would be awesome, ne? :)
Well, Embedded memory would be best-used as a large cache. I don't think it would help for basic texture filtering, but would be a tremendous help for intensive dependent texture reads. In more general circumstances, it would be even more useful as a frame buffer/z-buffer cache.

Hellbinder
29-Jan-2003, 08:09
Come to think of it...

Perhaps I should have used the term.. *Fly by Wire*...

Yes, that's much better.

Grall
29-Jan-2003, 12:41
OMG!

R400 is teh antigravity engine from Roswell!

:)

Dave Baumann
29-Jan-2003, 12:56
One thing to remember about R400 is that it's probably taping out about now. I think we'll gradually start hearing a lot more about this soon...

MuFu
29-Jan-2003, 13:28
I have a feeling it will be a fairly conventional 16x1 IMR with a 256-bit GDDR-3-capable controller, clocked at ~600/600MHz.

The "special" part will be the VS/PS unit(s). I think that is where the big leap forward will be compared to current designs.

MuFu.

Mariner
29-Jan-2003, 13:38
I have a feeling it will be a fairly conventional 16x1 IMR with a 256-bit GDDR-3-capable controller, clocked at ~600/600MHz.

If that were to be the case, it would be a pretty considerable improvement over the R300. :shock:

We're talking about a fourfold increase in fillrate and virtually double the bandwidth without even considering the advances in the shaders. How many transistors would that be - 170m+, perhaps? :shock:
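
As a sanity check on "fourfold" and "virtually double", here is a small sketch. The R300 figures (8 pipes at 325 MHz, 256-bit memory at 310 MHz DDR, as on the Radeon 9700 Pro) are an assumption filled in from the retail card's specs, and the R400 figures are MuFu's guess from the quoted post.

```python
# Assumed Radeon 9700 Pro (R300) figures; not stated in this thread.
R300 = {"pipes": 8, "core_mhz": 325, "bus_bits": 256, "mem_mhz": 310}
# MuFu's guess: 16x1, 256-bit bus, ~600/600 MHz.
R400_GUESS = {"pipes": 16, "core_mhz": 600, "bus_bits": 256, "mem_mhz": 600}

def fillrate_mpix(chip):
    """Peak pixel fillrate in Mpixels/s: one pixel per pipe per clock."""
    return chip["pipes"] * chip["core_mhz"]

def bandwidth_gb_s(chip):
    """Peak memory bandwidth in GB/s, assuming two transfers per clock."""
    return chip["bus_bits"] / 8 * chip["mem_mhz"] * 2 / 1000

fill_ratio = fillrate_mpix(R400_GUESS) / fillrate_mpix(R300)
bw_ratio = bandwidth_gb_s(R400_GUESS) / bandwidth_gb_s(R300)

print(f"fillrate x{fill_ratio:.2f}, bandwidth x{bw_ratio:.2f}")
```

With those inputs the fillrate ratio lands a little under four and the bandwidth ratio a little under two, so the rough claims hold.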

JF_Aidan_Pryde
29-Jan-2003, 13:42
Since when weren't old cards 'fly by wire' in the first place?

My source tells me the R400 will feature a 'look down shoot down' cockpit package. Pulse-Doppler radar that can track 16 targets at once and engage 4. It will also be more manoeuvrable than the NVIDIA S-37. 3D thrust-vectoring engines are also incorporated. :D

Arun
29-Jan-2003, 13:47
Well if my info is correct Hellbinder, you're going to strike out again on R400 8)

What I wonder is if MuFu's source saying "hybrid" and "adaptive" is right, even partially.
An "adaptive" architecture could pretty much mean the GPU can allocate power to either vertices or pixels. That has been widely speculated before.
A "hybrid" architecture, however... What the heck is that?

"hybrid" would mean it got a little of multiple worlds, and united it into one to make something which is potentially better than both. The best example would be a semi-deferred architecture.

Actually, there's one strange thing which might be called that. And it might give a lot of performance advantages, even more in such an "adaptive" case.
In current cases, front-to-back ordering saves rasterization work. However, you don't save *any* T&L/VS work! And in an architecture where those would basically be united, it would hurt both to do that.

So, the idea would be to determine X, Y and Z for the vertices. You do everything as usual, but don't do any useless color/texture/... work. You send that to Triangle Setup, and only determine which pixels/subpixels are inside the triangle. Then you do Hierarchical Z and Early Z.

If every pixel fails Early Z, you simply stop right there. If at least one succeeds, however, you send information BACK to the VS and compute all other things.

Basically, you potentially saved a LOT of work. It might still cost quite a lot if you use complex bone skinning ( you've got to do it anyway to determine X/Y/Z ), but it's still a lot less. And this can be called a semi-deferred architecture.

This approach is even better in a case where PS/VS is united. PS is generally more optimized towards several dozens instructions. VS, on the other hand, is a lot better with much more than that. So, since VS programs would be executed in two different places, it would be less instructions at once. And that means both become slightly more similar ( this isn't a major factor, however, and it would be excellent in traditional architectures too )
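
The flow described above can be sketched as a toy simulation. Everything here is hypothetical: the `Tri` type, the one-dimensional pixel ids, and the constant depth per triangle are simplifications chosen to count shader invocations, not a model of any real chip.

```python
from dataclasses import dataclass

@dataclass
class Tri:
    pixels: frozenset  # hypothetical: ids of the pixels the triangle covers
    depth: float       # constant depth for the whole triangle, smaller = closer

def render(tris):
    """Return (position_runs, attribute_runs) for a list of triangles."""
    zbuf = {}
    position_runs = 0
    attribute_runs = 0
    for tri in tris:
        position_runs += 3  # position-only vertex work is always needed
        # Early Z: keep only pixels where this triangle is closer than what is stored.
        visible = [p for p in tri.pixels if tri.depth < zbuf.get(p, float("inf"))]
        if not visible:
            continue        # every pixel failed, so skip color/texture vertex work
        attribute_runs += 3  # go BACK to the vertex unit for the remaining outputs
        for p in visible:
            zbuf[p] = tri.depth
    return position_runs, attribute_runs

screen = frozenset(range(4))
front_to_back = [Tri(screen, 0.2), Tri(screen, 0.5), Tri(screen, 0.8)]
back_to_front = list(reversed(front_to_back))

print(render(front_to_back), render(back_to_front))
```

In this toy case the saving only appears when the triangles arrive front to back. Drawn back to front, every triangle passes the depth test when it is rendered, so all the attribute work runs anyway.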

Wait a second... That's brilliant! If ATI doesn't use it for the R400, I think I'm gonna steal the idea and try to sell it to nVidia :D But then again, I'm sure one of you people will prove to me it's stupid. Ah well...


Uttar

demalion
29-Jan-2003, 13:48
GDDR-3 capable, heh. In case of any surprises from nVidia, of course.

That doesn't mean they will definitely use high speed RAM. It might not even clock that high (I consider above 500 MHz "high" for the R400). I'm not sure ATI would be banking on 600 MHz...I think the 500 MHz of the GF FX is an aberration to correct for deficient (compared to the R300) design, and was an aggressive target even for the "Black Diamond" process. I don't see that changing with chips with even higher transistor counts than the FX (though ATI seems to be able to design for lower power usage at a given complexity and clock speed).

This does not mean that I think that ATI can't achieve 600 MHz, however.

Though, of course, MuFu is in a better position to know (is that 600 MHz guesswork, or based on info?).

demalion
29-Jan-2003, 13:51
Uttar, how is what you describe different than the Z buffer first pass as in Doom and as S3 claims for the DeltaChrome? Perhaps I am missing something.

MuFu
29-Jan-2003, 14:00
Though, of course, MuFu is in a better position to know (is that 600 MHz guesswork, or based on info?).

I am not in a better position to know - I'll try and get myself into one, but a few people here could just as easily do that with a little effort. My guess would be 500MHz, but I added 100MHz based on the fact that I "guessed" 250MHz for R300. ;) :lol:

Can twist the words "adaptive" and "hybrid" all you want but they are just something I heard off-hand and probably don't amount to much. At the time I was fishing for the dirt on R300, lol.

A PS/VS unit that dynamically "partitions" a combined PS/VS pipe based on demand - is that even possible?! Surely it would have to be coded for. Hmm... I initially thought "hybrid" meant 16x1/8x2 - just a hunch, but since this was *ages* ago it probably refers to mixed-mode rendering of the sort that we see in current parts (i.e. extensive occlusion in the pipe but still essentially IMR).

Stop the hybrid/adaptive talk now! I still think the shading pipeline is where all the big advances are...

MuFu.

MuFu
29-Jan-2003, 14:07
P.S. All those in favour of clockless graphics rendering architectures say aye!

Fuz
29-Jan-2003, 14:37
Yes, cockless for me please!

Aye!

CMKRNL
29-Jan-2003, 14:38
Actually, hybrid and adaptive are probably good descriptions of both R400 and NV40, so I wouldn't dismiss it so quickly MuFu :wink:

As for the F117A comment, I have no idea what that means or what it's referring to.

Ascended Saiyan
29-Jan-2003, 14:39
P.S. All those in favour of clockless graphics rendering architectures say aye!

Is that even possible? :?


Anyway, I thought that is what the adaptive/hybrid talk pertained to in the first place: if the R400 has an integrated shader, resources can basically be 'intelligently' allocated where they're most needed, whether that be towards triangle setup or pixel shading.

MuFu
29-Jan-2003, 14:46
Actually, hybrid and adaptive are probably good descriptions of both R400 and NV40, so I wouldn't dismiss it so quickly MuFu :wink:

As for the F117A comment, I have no idea what that means or what it's referring to.

LOL. I love you, man. :)

MuFu.

Maverick
29-Jan-2003, 15:11
Actually, I'd say that R300 was most like the F117A; It sure as hell didn't show up on nVidia's radar... :wink:

elroy
29-Jan-2003, 15:11
Hmm, I remember hybrid being a word used to describe Fusion (the 3dfx part after Rampage). Gigapixel tech in NV40 maybe?

Gunhead
29-Jan-2003, 15:24
F117A? Why not F22 instead? Apples to apples...

F22 has the ability to supercruise naturally so unlike competitive designs it doesn't come with an afterburner :lol:

Okay, all analogies break down at some point...

***

Then, any ideas on R500? What should we expect for transistor count and MHz at 0.09 micron? And moreover, if R400 is DX10 [and BTW what else is new there but VS/PS 3.0?], is R500 DX11, and what features/functionality is that likely to bring? And "Universal Shader 4.0" won't really reveal it to me :lol:

Rampant, completely untamed speculation welcomed.

Humus
29-Jan-2003, 16:00
... cockless ...

= Female?

Basic
29-Jan-2003, 16:05
F117A? Is that those stealth-planes that look like a failed origami-experiment?

Does that mean that we should look at French sites for any leaks. I remember that USA were kind of miffed at the French sometime around the gulf war, since the French had radars that could pick up the stealth planes.


demalion:
What uttar said is different and rather orthogonal to a Z-first pass.
It won't remove any hidden pixels (above what R300 already does). But it will reduce VP calcs. Color/lighting calculations in the VP will only be done for polys that are visible (at the time they are rendered).
It should be possible to combine it with a Z-first pass.

I'm not sure the idea is worth the extra hardware complexity though.

Arun
29-Jan-2003, 16:22
Uttar, how is what you describe different than the Z buffer first pass as in Doom and as S3 claims for the DeltaChrome? Perhaps I am missing something.

It's different from what S3 claims.

The S3 method goal, AFAIK, is to minimize color/Z writes and use the least fillrate possible. It's also quite useful for developers because not ordering front-to-back is not as important anymore, if you develop with that GPU in mind ( which is unlikely... )

The goal of my method is to reduce VS usage. And you absolutely need to order front-to-back for it to be of any use. It's significantly more complex to implement, too.

In which cases can it give significant advantages? Look at the 3DMark 8 Lights test. The performance drop, compared to 1 Light, is huge. Using my idea, united with front-to-back ordering, the 8 Lights test performance should barely be any slower than the 1 Light test performance ( 10% hit maximum ) - with the GFFX, it's currently a 70% hit...

Basic: IMO, this idea isn't worth the extra complexity in current hardware. However, in a case where everything is done with the same units and where a cache is used to hold the last few programs which were used, such a system would not be as complex as it seems. The problem would be getting developer support; it would be a lot more efficient if they would separate everything with that in mind. The drivers could do it automatically, but it might never be as efficient.
So, if that whole "all units are the same" thing is true for the R400, my idea could be in it... But then it isn't mine anymore, now, is it? :D


Uttar

Dio
29-Jan-2003, 16:26
There's quite a lot of theory around about clockless / asynchronous architectures, mostly in the low-power field, although not too many practical examples.

The AMULET group at Manchester CS have some interesting things on this. They've been building asynchronous ARM cores for years.

http://www.cs.man.ac.uk/amulet/

Dave Baumann
29-Jan-2003, 16:31
The goal of my method is to reduce VS usage. And you absolutely need to order front-to-back for it to be of any use. It's significantly more complex to implement, too.

Vertex processing is cheap though.

Hellbinder
29-Jan-2003, 16:36
Actually, hybrid and adaptive are probably good descriptions of both R400 and NV40, so I wouldn't dismiss it so quickly MuFu

As for the F117A comment, I have no idea what that means or what it's referring to.


ok, how bout these.....

R300 = Drag racer
R400 = Formula One

R300 = Brute Force
R400 = Intelligent Design

R300 = Culmination of generations since R100
R400 = Instigation of new Generational Relationships for future products

R300 = Pack the crap in there till we kick their ass
R400 = Step back, re-evaluate, get Mr. Peabody involved

R300 = More parts = more fun for U
R400 = Do we still need all these parts???

Arun
29-Jan-2003, 16:51
The goal of my method is to reduce VS usage. And you absolutely need to order front-to-back for it to be of any use. It's significantly more complex to implement, too.

Vertex processing is cheap though.

It's cheap for two reasons:
1. So many transistors are going in it. Saving transistors by implementing a smarter architecture is always good.
2. Current games don't do too complex vertex shading.

Take the following shader:
http://www.cgshaders.org/shaders/show.php?id=43

With the current methods, *everything* in the shader got to be done even if none of the triangles defined by the vertex is drawn. With the method I described, you could only do the following line in such a case:
OUT.HPosition = mul(WorldViewProj, IN.Position);

Nice save, eh? :)

One of the biggest disadvantage, even in an architecture which got a pool of calculators for both VS/PS, is that you need even more cache.
You've got to cache the vertex X/Y/Z at first. But then, once other variables have been calculated, you've also got to cache that.
So I'd guesstimate a two times bigger vertex cache might be required to get optimal performance out of this method.
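
A minimal sketch of that split, using plain lists for the matrix and vectors. The linked shader is not reproduced here; `remaining_outputs` is a made-up stand-in (a single diffuse term) for whatever else a real shader computes, and only the `mul(WorldViewProj, IN.Position)` step mirrors the line quoted above.

```python
def mul(matrix, vec):
    """Matrix times column vector, like Cg's mul(matrix, vector)."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def position_only(world_view_proj, position):
    """The part that must always run: clip-space position for setup and Z tests."""
    return mul(world_view_proj, position)

def remaining_outputs(normal, light_dir):
    """Hypothetical 'everything else': run only if some pixel survives early Z."""
    n_dot_l = sum(n * l for n, l in zip(normal, light_dir))
    return {"diffuse": max(0.0, n_dot_l)}

# A translation by (2, 3, 4) standing in for WorldViewProj.
wvp = [[1, 0, 0, 2],
       [0, 1, 0, 3],
       [0, 0, 1, 4],
       [0, 0, 0, 1]]

hpos = position_only(wvp, [1, 1, 1, 1])
shade = remaining_outputs([0, 0, 1], [0, 0, 1])
print(hpos, shade)
```

The position result has to be held until the depth test is resolved, and the deferred outputs need their own storage afterwards, which is where the larger vertex cache estimate comes from.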


Uttar

demalion
29-Jan-2003, 16:52
F117A? Is that those stealth-planes that look like a failed origami-experiment?

Does that mean that we should look at French sites for any leaks. I remember that USA were kind of miffed at the French sometime around the gulf war, since the French had radars that could pick up the stealth planes.


demalion:
What uttar said is different and rather orthogonal to a Z-first pass.
It won't remove any hidden pixels (above what R300 already do). But it will reduce VP calcs. Color/lightning calculations in the VP will only be done for polys that are visible (at the time they are rendered).
It should be possible to combine it with a Z-first pass.

Yes, but I thought lighting wasn't done for the first Z pass? I thought transformation alone was, just like his methodology. If you aren't skipping transformation vertex processor calcs, what are you skipping?

I thought triangle rejection was also already achieved by Hierarchical Z?

Feel free to educate me on what I'm missing.

Arun
29-Jan-2003, 16:56
Yes, but I thought lighting wasn't done for the first Z pass? I thought transformation alone was, just like his methodology.

Hmm, I think you're right on that. S3 First Z pass probably doesn't do lighting.

But I don't believe it can decide not to do Lighting in the second pass based on the results of the first...


Uttar

Dave Baumann
29-Jan-2003, 16:59
1. So many transistors are going in it. Saving transistors by implementing a smarter architecture is always good.

Actually, that's what I'm getting at. Vertex processing is cheap because relatively few transistors go into producing powerful vertex shaders as opposed to producing powerful fragment shaders. It's far cheaper to scale up the number of vertex shading processors - why do you think all 4 are still in the 9500? Because the majority of the die is the fragment shader pipes; the 4 vertex shaders are small comparatively.

Arun
29-Jan-2003, 17:13
Actually, that's what I'm getting at. Vertex processing is cheap because relatively few transistors go into producing powerful vertex shaders as opposed to producing powerful fragment shaders. It's far cheaper to scale up the number of vertex shading processors - why do you think all 4 are still in the 9500? Because the majority of the die is the fragment shader pipes; the 4 vertex shaders are small comparatively.

You sure of that?

I know fragment is more costly than vertex. But there is really more than one reason for that.
First of all, in most cases, there are fewer vertex pipes than fragment pipes.
Secondly, you often include things such as Early Z / Hierarchical Z in the fragment pipe count. Those things take a fairly significant number of transistors, and their main use is to save fillrate. So it makes sense to include them there.
Thirdly, fragment include TMUs.

IMO, ATI also didn't remove the VS because it's easier to reduce texture quality than vertex quality, so it simplifies programmers' jobs and all they've got to do is design around the "R300", not around all the derivatives ( but that could be completely wrong, feel free to say it and I'll simply stop saying that )

And anyway, let's suppose the VS take 20% of the die space on the R300. That would be 20M transistors. Imagine if you could reduce that number by 50%, but you'd have to add 5M transistors to implement my method. You're at 15M, which is 25% better. Saving 5M transistors and getting as good performance is really not bad IMHO. Of course, those figures are just guesses.
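
The arithmetic in that guess, spelled out as a tiny sketch. All three inputs are the post's own hypothetical figures (in millions of transistors), not measured data, and the function name is made up.

```python
def net_vs_transistors(vs_budget_m, reduction, overhead_m):
    """Vertex-shader transistor count after shrinking it and adding the new logic."""
    return vs_budget_m * (1 - reduction) + overhead_m

before = 20.0                                   # guessed VS share of the R300 die
after = net_vs_transistors(before, 0.5, 5.0)    # halve the VS, add 5M of control logic
saving = 1 - after / before

print(f"{after:.0f}M transistors, {saving:.0%} saved")
```

The saving disappears if the extra logic costs as much as the vertex units it replaces, so the estimate is only as good as the 5M overhead guess.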


Uttar

Sabastian
29-Jan-2003, 17:14
Oh you gotta love the Weebly Wobbly... net. ;) R350 in April, RV350 sooner likely and the R400 for after July.

"When Mr. Edinger spoke in the morning the ATI's future plans were brought to the words Next Generation. But that idyl was destroyed by the next speech of a guy from CP Technology who presented the roadmap showing R350, RV350, R400 and their approximate implementation time. When the information was displayed the Vice President was taken aback at such straightforwardness. Nevertheless, I can inform you that the R350 based cards will appear in April this year, RV350 ones will come even a little earlier. Of course, this is just a plan and it can be corrected any time. The R400 is due to be released after July. "

http://www.digit-life.com/articles2/ati-jan2k3/index.html

Sabastian
29-Jan-2003, 17:17
crap( I deleted a stupid question.)

Fuz
29-Jan-2003, 17:36
... cockless ...

= Female?

Yes.

Hellbinder
29-Jan-2003, 17:55
April in what country ;)

Basic
29-Jan-2003, 18:29
demalion:
As Uttar said. In the second pass of the Z-first method, you'll still have to run the full VP for all polys, including the hidden ones. If you had run an occlusion query on each poly in the Z-pass, you could use it to skip polys pre-VP in the second pass. But that would of course only work for polys that are hidden at the time they are rendered in the first pass.

Maybe R300 can cull whole triangles before setup by comparing vertices to the HZ-buffer, but it would still have run through the whole VP.


I'm still doubtfull that it would be worth the complexity though.

Heathen
29-Jan-2003, 18:31
. I remember that USA were kind of miffed at the French sometime around the gulf war, since the French had radars that could pick up the stealth planes.

Naah, that was the british Type 42 destroyers, I've seen the traces

demalion
29-Jan-2003, 18:39
Yes, but I thought lighting wasn't done for the first Z pass? I thought transformation alone was, just like his methodology.

Hmm, I think you're right on that. S3 First Z pass probably doesn't do lighting.

But I don't believe it can decide not to do Lighting in the second pass based on the results of the first...


Uttar

Hmm...ok, correct my errors...I have in italics the step I think you are saying isn't done:

Z Pass...

You get to a triangle.

You transform the vertices of a triangle.

The hardware checks the depth boundary range of the triangle against tile information in the Hierarchical Z cache, and skips it if coverage is determined.

You rasterize the triangle per pixel to a zbuffer for per pixel depth testing, with the Early Z feature saving bandwidth between the core and memory to find out whether each pixel of the triangle is occluded.

You keep that z buffer around.

Color Pass...

You go through again, this time to perform lighting.

You get to a triangle.

You transform the triangle.

The hardware checks the depth boundary range of the triangle against tile information in the Hierarchical Z cache, and skips it if coverage is determined.

...
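
The two passes listed above can be sketched as a toy software model. This is only an illustration under simplifying assumptions: triangles are reduced to lists of (pixel, depth) fragments, smaller depth means nearer, and Hierarchical Z is left out entirely. The function names are made up for the example.

```python
# Toy model: each "triangle" is a list of (pixel, depth) fragments.
triangles = [
    [(0, 0.9), (1, 0.9)],  # far triangle, submitted first
    [(0, 0.5), (1, 0.5)],  # middle triangle
    [(0, 0.1)],            # near triangle, submitted last
]

FAR = 1.0  # depth of the cleared z buffer


def shade_single_pass(tris):
    """One pass with an ordinary depth test; returns shaded fragment count."""
    zbuf, shaded = {}, 0
    for tri in tris:
        for pixel, z in tri:
            if z < zbuf.get(pixel, FAR):
                zbuf[pixel] = z
                shaded += 1
    return shaded


def shade_z_first(tris):
    """Depth-only pass first, then a color pass that shades only survivors."""
    zbuf = {}
    for tri in tris:  # Z pass: no lighting, just depth
        for pixel, z in tri:
            zbuf[pixel] = min(z, zbuf.get(pixel, FAR))
    shaded = 0
    for tri in tris:  # color pass: the z buffer is kept around
        for pixel, z in tri:
            if z <= zbuf[pixel]:
                shaded += 1
    return shaded


print(shade_single_pass(triangles), shade_z_first(triangles))
```

With back-to-front submission the single pass shades every fragment, while the Z-first version shades one fragment per pixel. Note that both loops in `shade_z_first` still visit every triangle, which is the vertex-side cost being argued about in this thread.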

In your scenario, you are either covering the case where the Hierarchical Z cache resolution is inadequate to discard a triangle that really should be, or saying the Hierarchical Z doesn't do what I propose when a transformation takes place, and are proposing it be changed to do this.

Also,

In the case of a Hierarchical Z mechanism as I described, you either preclude a vertex shader that can change vertex coordinates (so the "transform the triangle" part cannot be affected by a vertex shader) or one that can perform a change in vertex coordinates dependent on a lighting calculation (at least if this technique is performed universally by the hardware or driver).

Finally, assuming my understanding of Hierarchical Z is valid, and you want to be able to do the above with vertices in a vertex shader, my understanding is that a normal is calculated for each vertex transformation.

If not, that seems a trivial (correct me if I'm wrong) savings to shoot for (is it something the GF FX VP array would directly benefit from?).

If so, won't the lighting calculations be performed per pixel in any case since a texture lookup, fogging application, etc, would have to be performed as well? In which case won't the applicable code be rejected by Early Z (presuming it rejects shader code, since it seems silly for it not to) in any case? Finally, if both of these are not true, isn't this problem going to be solved by the "universal shader" concept and/or improving Early Z?

Again, this is AFAICS.

EDIT: clarity.

psurge
30-Jan-2003, 00:03
Uttar... that is vaguely similar to an idea i had for a fully deferred approach in another thread - i like it :).

I propose a slight modification :


1. Compute positions for a batch of N vertices at a time. As you are computing the positions compute min,max of screenspace x,y,z.

2. Check the generated screenspace bounding rectangle against the HZ pyramid. Rasterization is skipped, you have exactly one z value to check against the HZ pyramid... should be a relatively low latency op.

3. If bounding box is rejected, go to 1. Otherwise goto 4.

4. Finish VS computation for the N vertices, send the appropriate tris to the rasterizer...
----

Your solution is better at saving geometry work, but the above would make cache management and task scheduling (i.e. implementation) easier AFAICS.

Also, it would be trivial to add support for user defined bounding boxes. (meaning that even step 1 could potentially be skipped). Perhaps a bounding box could even be output as the result of a program (similar to a VS program). This could be used to avoid generating geometry when displacement-mapping and/or procedural geometry is being used.
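
A rough sketch of steps 1 to 3 above, with some loud assumptions: the HZ "pyramid" is flattened to a single level of tiles, each tile stores the farthest depth still visible in it, smaller z is nearer, and screen coordinates are non-negative. `TILE`, `batch_bounds` and `batch_occluded` are hypothetical names, not anything from real hardware or drivers.

```python
TILE = 8    # hypothetical tile size in pixels
FAR = 1.0   # depth assumed for tiles nothing has been drawn into


def batch_bounds(positions):
    """Screen-space bounding rectangle plus nearest z for a vertex batch."""
    xs, ys, zs = zip(*positions)
    return min(xs), max(xs), min(ys), max(ys), min(zs)


def batch_occluded(positions, tile_max_z):
    """True if the batch's nearest z is behind every tile its rectangle touches.

    tile_max_z maps (tile_x, tile_y) to the farthest visible depth in that tile.
    """
    x0, x1, y0, y1, z_near = batch_bounds(positions)
    for ty in range(int(y0) // TILE, int(y1) // TILE + 1):
        for tx in range(int(x0) // TILE, int(x1) // TILE + 1):
            if z_near <= tile_max_z.get((tx, ty), FAR):
                return False  # might be visible here: finish the VS work
    return True               # rejected: go to the next batch


hz = {(0, 0): 0.3, (1, 0): 0.3}
hidden = [(2, 2, 0.6), (12, 5, 0.7), (6, 7, 0.8)]
nearer = [(2, 2, 0.2), (12, 5, 0.7), (6, 7, 0.8)]
print(batch_occluded(hidden, hz), batch_occluded(nearer, hz))
```

Only one z value (the batch's nearest) is compared per tile, which is the "exactly one z value" point in step 2. The test is conservative: it can fail to reject a hidden batch, but it should never reject a visible one.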

Jima13
30-Jan-2003, 01:07
OMG!

R400 is teh antigravity engine from Roswell!

:)

AURORA?

gets my vote in the 'name game' :)

http://accelerationresearch.tripod.com/

demalion
30-Jan-2003, 01:21
Uttar... that is vaguely similar to an idea i had for a fully deferred approach in another thread - i like it :).

I propose a slight modification :


1. Compute positions for a batch of N vertices at a time. As you are computing the positions compute min,max of screenspace x,y,z.

2. Check the generated screenspace bounding rectangle against the HZ pyramid. Rasterization is skipped, you have exactly one z value to check against the HZ pyramid... should be a relatively low latency op.

Ok, so Hierarchical Z doesn't do this currently then? I see.

3. If bounding box is rejected, go to 1. Otherwise goto 4.

4. Finish VS computation for the N vertices, send the appropriate tris to the rasterizer...

The question I asked in the above post...doesn't this preclude some cases of changing vertex position? Is that ok? What I mean is if you have a vertex program that might move a vertex position, it might end up outside of your above bounding box, but that program would never be run due to being culled above. Can lighting calculations be presumed to be completely separate and independent of calculations based on vertex normals (i.e., lighting type calculations)?

What I was thinking was that this would have to be a check against Hierarchical Z that occurred as part of the 'vertex move' instruction "microcode" to allow full flexibility...am I completely wrong? The latency for this type of check shouldn't be worse than a texture cache miss (i.e., it would occur completely on chip) should it?

----

Your solution is better at saving geometry work, but the above would make cache management and task scheduling (i.e. implementation) easier AFAICS.

Also, it would be trivial to add support for user defined bounding boxes. (meaning that even step 1 could potentially be skipped). Perhaps a bounding box could even be output as the result of a program (similar to a VS program). This could be used to avoid generating geometry when displacement-mapping and/or procedural geometry is being used.

Doesn't geometry creation present new problems for this if implemented in the vertex shader? Wouldn't the bounding box check have to occur for every vertex creation and movement instruction to prevent impeding functionality?

Would there be (or is there) an established order for types of instructions?

Dio
30-Jan-2003, 09:43
AURORA? gets my vote in the 'name game' :)
Ahhh.... so long ago... :)

Of course, nowadays it's virtually impossible to trademark any real word in the computer field, which is why all the companies largely go for made-up high-tech-sounding names.

g__day
30-Jan-2003, 10:25
Does anyone have a view - the R350 is supposed to be 0.13 micron - but will it also be low-K dielectric? Wouldn't that be the ultimate insult to NVidia, a quieter, faster competitor...

Chalnoth
30-Jan-2003, 10:27
The R350 is to be .15 micron.

Evildeus
30-Jan-2003, 11:16
Does anyone have a view - the R350 is supposed to be 0.13 micron - but will it also be low-K dielectric? Wouldn't that be the ultimate insult to NVidia, a quieter, faster competitor...

RV350 is supposed to be 0.13

Arun
30-Jan-2003, 17:16
Hey, I just think I had an excellent idea on how to make my idea easier to implement, and more efficient from a die size POV. In fact, it even makes more sense in a "hybrid" and "adaptive" architecture.

The current architecture is:

Vertex Shading-> Triangle Setup-> Hierarchical Z & Early Z-> PS for each pixel not rejected by Early Z

VS computes position, normal, lighting, ...

My idea is:

Vertex Position-> Triangle Setup for X/Y/Z -> Hierarchical Z & Early Z -> Vertex Lighting -> Triangle Setup for Color/Texture/... -> PS

Now, let me explain that.

First, you define Vertex Position. This is a *specialized* unit, which takes very few transistors, but it *is* programmable.
An interesting optimization I'd like to add is that you only read whatever is required to be read to determine that. Thus, you could save a lot of memory bandwidth too, which is another great advantage of this idea.

Next, you send the vertices to a cache which has fields for each of the vertex attributes. That includes color, even though it isn't computed yet. There's a flag to say what is computed already and what isn't.
It also has a flag which says whether the vertex is currently being processed in Triangle Setup/Z Check/Vertex Lighting. Why? Because it's important NOT to do Vertex Shading twice for the same vertex! More on that later.

If all information has already been computed, it is known that you can skip steps and go directly to Triangle Setup which would compute everything at once. Hierarchical Z / Early Z is then done after that.

Otherwise, the different vertices of a triangle, which can be found in that cache, are sent to Triangle Setup, which only determines the X/Y/Z values of each pixel. So it's actually faster than current Triangle Setup.
The pixels of the triangle are then sent to Hierarchical Z / Early Z. They *won't* be sent there again. So doing Early Z also is essential.

Now, since they won't be sent again, you have a small cache which uses one bit per (sub)pixel to say whether each one is visible or not. It stores visibility in the SAME order in which it was computed, because that order will be the same the next time you send vertices to Triangle Setup. If no pixel is visible, such a cache is of course not used and all following steps are ignored.

If any pixel is visible, you read the vertex information ( of all vertices of the triangle ) in the cache again and also read from memory every other information required to do Vertex Lighting. You then send all that info to Vertex Lighting.
That determines texture/normal/color/... and, once it's computed, it sends it to the cache where the vertex is already being stored. You also change a flag to say it isn't being computed anymore, and the other one to say all information on the vertex is available. That means that if another triangle asks for that same vertex again, you don't have to do Vertex Lighting anymore.

Then, once all the vertices of the triangle are processed, you send the vertices back into Triangle Setup. Triangle Setup then determines the Color/Texture/Normal/... value of every *visible* pixel based on the final value of the three vertices. As I said, you already know whether each pixel is visible, since you have that in a cache.

Each of those pixels is then sent to the PS, in the same traditional way as before.
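
The flow described above can be sketched in software. This is a minimal model under stated assumptions, not a hardware design: processing is sequential (so the "currently being processed" flag is omitted), the Hierarchical Z / Early Z stage is replaced by a caller-supplied `visible_fn`, and `CachedVertex`, `run`, `position_fn` and `lighting_fn` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class CachedVertex:
    position: tuple            # filled in by the Vertex Position stage
    attributes: object = None  # filled in later by Vertex Lighting
    lit: bool = False          # "all information is available" flag


def run(vertices, triangles, position_fn, lighting_fn, visible_fn):
    """Returns (triangles sent on to the PS, number of Vertex Lighting runs)."""
    cache, lighting_calls, drawn = {}, 0, []
    for tri in triangles:
        for i in tri:                      # Vertex Position, once per vertex
            if i not in cache:
                cache[i] = CachedVertex(position_fn(vertices[i]))
        if not visible_fn([cache[i].position for i in tri]):
            continue                       # nothing visible: skip all later steps
        for i in tri:                      # Vertex Lighting, only where needed
            entry = cache[i]
            if not entry.lit:
                entry.attributes = lighting_fn(vertices[i])
                entry.lit = True
                lighting_calls += 1
        drawn.append(tri)                  # second Triangle Setup, then PS
    return drawn, lighting_calls


verts = [(0, 0, 0.2), (1, 0, 0.2), (0, 1, 0.2), (1, 1, 0.2),
         (5, 5, 0.9), (6, 5, 0.9), (5, 6, 0.9)]
tris = [(0, 1, 2), (1, 2, 3), (4, 5, 6)]
drawn, calls = run(
    verts, tris,
    position_fn=lambda v: v,
    lighting_fn=lambda v: "lit",
    visible_fn=lambda ps: min(p[2] for p in ps) < 0.5,  # stand-in for HZ/Early Z
)
print(drawn, calls)
```

In this toy scene four of the seven vertices get lit, each exactly once, and the hidden triangle never reaches Vertex Lighting.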

I'm going to speak of the problems/disadvantages ( and yes, there are a hefty load of them ) of this method at the end of this post, but let me first comment a little more on its advantages.

Did you notice what was done in Vertex Lighting? Now compare that to what's done in PS in both architectures. It's *exactly* the same thing! The position cannot be changed, but everything else can.
That means you could use the *EXACT* same units! Which could be a very optimal way to do it. Work which needs high quality is done in the PS, work which can be done with lower quality is done in Vertex Lighting.

As I said before, another great advantage is that you're saving memory bandwidth for vertices which are not visible on the screen. Not much, agreed, but it is an important factor in today's architectures and high-speed memory costs a lot. So that means my idea doesn't only help with a "cheap" problem, as Wavey said; it also helps with an expensive one.

The reason Vertex Position is a specialized unit is because it should not be able to do many things, thus making it much cheaper. Using complex units with loads of useless native instructions would simply be wasting it.

As for the disadvantages... First of all, much bigger vertex caches. The system for cache I gave might not be optimal, so it might be interesting to think a little more about it. The way flags work might not be so great too.

The most important problem, however, is... what to do if a vertex requires color/texture/normal/... computations to calculate its position? In the current architecture, that creates no problem at all. In this one, however...
What I propose, in that case, is to do Vertex Lighting *first* , but only doing the parts of the program which are required to do Vertex Position. Then, you use Vertex Position and determine if it's required to compute the remaining information.
That destroys a part of the benefits of the method. The worst case scenario, however, is when:
You use Vertex Lighting-specific features first, then use the results to do Vertex Position, then need to do some more Vertex Lighting based on Vertex Position results. That's the worst-case scenario, and my idea would actually run SLOWER in those cases because of the important number of input / output.

Is this idea complex to implement? Yes. Is it worth it? Maybe. The problem is that it has a horrible worst-case scenario, which destroys all benefits. If current games / games in development used such a worst-case scenario, the card would be significantly slower in them.
The best way to use such a technology, thus, is to implement it on a console, where no game is written yet and everything can be done with the hardware limitations / advantages in mind.
If it gets great feedback there, PC developers might try to push the technology into the PC and make it a success. But really, trying to put it directly into the PC might give bad results, unless no game currently has a similar worst-case scenario. And I've really got no clue about that...


Any feedback?


Uttar

demalion
30-Jan-2003, 17:45
The same one I had before:

If moving vertices based on lighting calculations needs to be arbitrarily possible, why not put the effort into a low latency triangle rejection engine based on Hierarchical Z? That way, this could be executed as part of the "move vertex" op.

This would seem to facilitate programmable tessellation/vertex creation as well (as the same check could be integrated into such ops).


My thought is that move vertex operations could set a "rejected bit" in the vertex data register, such that some operands are not processed if all vertex values have the bit set. Does not seem expensive, and does not require the special conditions mentioned AFAICS. This functionality is part of the Hierarchical Z triangle rejection change and would be propagated throughout all applicable instructions.


I don't think Early Z check would work too well for this universally (on an IMR), unless it was improved such that the area covered by the "Early Z cache" (if there is one, which seems reasonable) gets larger (assuming it is very small at the moment) and the vertex organization batches vertices that tended to be within the same area (which, of course makes it less of an "IMR" ;) ). Which might very well be a priority for a "universal shader" architecture.

1) Hey, if I'm completely off base, I wish someone would tell me where I'm going wrong so I could learn something. :(

2) I wish Kristof and Simon F would make any type of comment, cuz all sorts of interesting clues could fall out of their pockets. ;)

Arun
30-Jan-2003, 18:49
My thought is that move vertex operations could set a "rejected bit" in the vertex data register, such that some operands are not processed if all vertex values have the bit set. Does not seem expensive, and does not require the special conditions mentioned AFAICS


I don't quite understand what you mean there...

Correct me if I understood you wrong: You're saying that once the final Vertex Position is determined, you'd use Hierarchical Z to determine if that Vertex is visible? And if it isn't, you'd stop Vertex Shading for that vertex and go to the next?

Well, if that's what you meant... A vertex can be invisible and a triangle including it can be partly visible. For example, imagine a triangle where one of the vertices is invisible and the two others are visible. Part of the triangle is still visible, isn't it? And that "invisible" vertex color/normal/texture/... may have to be used to determine part of the pixels info inside the triangle.

Did I understand you correctly, or did you mean something completely different?


Uttar

demalion
30-Jan-2003, 19:04
My thought is that move vertex operations could set a "rejected bit" in the vertex data register, such that some operands are not processed if all vertex values have the bit set. Does not seem expensive, and does not require the special conditions mentioned AFAICS


I don't quite understand what you mean there...

Correct me if I understood you wrong: You're saying that once the final Vertex Position is determined, you'd use Hierarchical Z to determine if that Vertex is visible? And if it isn't, you'd stop Vertex Shading for that vertex and go to the next?

Well, if that's what you meant... A vertex can be invisible and a triangle including it can be partly visible.

When I said "all vertex values" I didn't mean "all components of a vertex value" (doesn't make sense to set a bit for each component), I meant "all vertices associated with a visible element", i.e, a triangle, or perhaps more accurately, all triangles associated with the vertex. I suppose mentioning something like a triangle would have helped getting that across, eh? ;)

For example, imagine a triangle where one of the vertices is invisible and the two others are visible. Part of the triangle is still visible, isn't it? And that "invisible" vertex color/normal/texture/... may have to be used to determine part of the pixels info inside the triangle.

Exactly. Further, with geometry creation, you couldn't create geometry based on lighting calculations (since they might not be performed on a vertex that might affect a visible vertex), but you could create based on transformation data from occluded vertices and still move a vertex based on lighting type calculations.

Did I understand you correctly, or did you mean something completely different?

You understood me to a reasonable degree given what I stated and I meant something completely different. ;)

psurge
31-Jan-2003, 02:05
demalion,

I think I understand your concern. As it stands a vertex program computes vertex position and attributes.

I think what Uttar is referring to is this:

Given a bog standard vertex program P, remove all instructions which do not affect the computed output position. Running this trimmed program (call it P') gives you enough info to do z-culling for a triangle.

If the triangle is not culled, then you need to go back and execute the full vertex program P, giving the vertex attributes necessary for pixel shading.

Note that P and P' always give the same position. Whatever calculations are necessary to determine position are included in P'. So you only save work if a significant amount of time is spent computing vertex attributes...

Note that this kind of thing can be completely transparent to the programmer - given a vertex program P it should be possible to generate P'.
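
Generating P' from P is essentially a backward slice on the position output. A minimal sketch, assuming a made-up instruction format of (destination, opcode, sources) and ignoring write masks, flow control and address registers; the register names only imitate DX-style assembly and are not taken from any real shader:

```python
# Hypothetical mini vertex program P: (destination, opcode, sources)
P = [
    ("r0",   "m4x4", ["v_pos", "c_mvp"]),      # transform position
    ("r1",   "m3x3", ["v_normal", "c_world"]), # transform normal
    ("r2",   "dp3",  ["r1", "c_light"]),       # diffuse term
    ("oD0",  "mul",  ["r2", "c_diffuse"]),     # output color
    ("oT0",  "mov",  ["v_uv"]),                # output texcoord
    ("oPos", "mov",  ["r0"]),                  # output position
]


def slice_for(program, output):
    """Keep only the instructions that the given output register depends on."""
    live, kept = {output}, []
    for dest, op, srcs in reversed(program):
        if dest in live:
            live.discard(dest)   # this write satisfies the pending read
            live.update(srcs)    # its inputs become live in turn
            kept.append((dest, op, srcs))
    kept.reverse()
    return kept


P_prime = slice_for(P, "oPos")
for instr in P_prime:
    print(instr)
```

Here P' shrinks to the transform and the position write. If the position depended on a lighting result, those lighting instructions would stay in the slice automatically, so P and P' still agree on position and the saving simply gets smaller.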

Does that make sense?

Regards,
Serge

demalion
31-Jan-2003, 04:44
My statement is backwards from the way I should be stating it, then, I think.

Instead of "not being able to move vertices based on a lighting type calculation if you want to skip any calculations", it is "you can't skip any lighting type calculations that are the basis for moving a vertex".

The methodology seems like a lot of "paperwork" though.

I think what I'm proposing (restated again related to my understanding of your comments) requires less "paperwork" and offers more performance saving in the case of tessellation related processing. The method you describe of re-executing strikes me as too prone to performance wasting...I think.

What I'm proposing is that all calculations that affect a vertex position perform the Hierarchical Z check (bounding calculation check, not per pixel) on the resultant position, and set the "rejection" bit then.

Also, any operation based on vertex attributes is not placed on the processing stack if its operands all have the rejection bit set. Basically, writing P' on the fly.
I suppose I should open up a vertex shader language specification for this next part, but I am also assuming this would carry over into vertex registers as well, where registers have rejection bits set if their contents are based on the result of a calculation that resulted in rejection. This would allow the destination register for operations, if it is associated with a vector attribute besides position, to also have its rejection bit checked for discard (so you could perform an operation on a discarded vertex to change its position, but not waste time changing an attribute...this is one of the scheduling concerns I mention below).
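
One possible reading of the rejection-bit idea, sketched as a tiny interpreter. This is an interpretation rather than a description of any real vertex shader hardware: instructions are (destination, sources) pairs, only vertex-derived operands are listed (constants are assumed never to carry a reject bit and are left out), and `run_with_reject_bits` is a hypothetical name.

```python
def run_with_reject_bits(program, rejected):
    """Skip operations whose listed operands all carry the reject bit.

    program:  list of (dest, sources) pairs, vertex-derived sources only
    rejected: register names whose reject bit is already set
    Returns (executed destinations, skipped destinations).
    """
    rejected = set(rejected)
    executed, skipped = [], []
    for dest, srcs in program:
        if srcs and all(s in rejected for s in srcs):
            rejected.add(dest)      # result inherits the reject bit
            skipped.append(dest)
        else:
            rejected.discard(dest)  # computed from at least one live operand
            executed.append(dest)
    return executed, skipped


program = [
    ("r1",  ["v_normal"]),  # attribute work on the rejected vertex
    ("r2",  ["r1"]),
    ("oD0", ["r2"]),
    ("r3",  ["v_other"]),   # work on a vertex that was not rejected
]
executed, skipped = run_with_reject_bits(program, {"v_normal"})
print(executed, skipped)
```

The chain hanging off the rejected input is dropped as it is encountered, which is the "writing P' on the fly" effect. The sketch does not cover the harder cases raised in the thread, such as a position being changed after its attributes were skipped.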

I had also considered that this might have to be implemented as a pre-opcode, with the GPU "low level" code having the order of "specify destination" then "operation and operands". Also, I had considered other scheduling concerns such as when a rejected vector attribute is changed, but then its position is changed afterwards (in which case the GPU should be warned not to reject it), but I assumed that to be within the scope of the driver's shader interpreters.

...

Depending on the scheduling concerns and some details about assumptions that can be made for vertex shading (details I don't know), I think using such checks per instruction might both be simpler (assuming that performing it quickly does not require too much in the way of "wiring" compared to "going back and filtering"), and I think cover more operations as for increasing vertex program functionality (even into geometry creation, I think) as well.

---

Also, based on your comments and re-reading Uttar's post after flipping my sentence around at the beginning:

I still think the Hierarchical Z based rejection system inline with applicable instructions could potentially save more performance. Remember, the check I'm specifying is "bounding" based, not a per pixel check. I think the difference in effectiveness of determining triangle occlusion here should be a very small percentage, but also I think that it should be able to reject more calculations in cases of tessellation and possibly geometry creation.

But what struck me in regards to my earlier comments about how Early Z check would have to be improved for a unified shader (at least as I envision the meaning of the term), is that in a unified shader language, what Uttar describes, as I understand better from your wording, can be implemented as part of the Early Z check, in addition to the methodology I illustrate (even assuming my concerns about this per pixel method's drawbacks are valid). The Early Z check "cache" seems to me to be available to be checked (with high efficiency if a first Z pass has been done), since with equivalent instruction sets between pixel and vertex shading, it seems to me Early Z would benefit from a simple expedient like a "wait stack" of vertex lighting instructions (just like your P' generation sequence, but already necessary for intermixed shading) that could be dumped if all pixels are occluded. Uttar's comment on the similarity of pixel shading and vertex lighting calculations seems apt here.

More expensive and intrusive into the shading language architecture, except when you consider that this would be a fundamental change in the mechanism that might already be needed to implement an Early Z check for "unified shading".

The remaining concern (assuming I haven't gone wrong above) is how many of these operations and operands it is practical to store. Extending this to per pixel calculations based on vertex normals, which would make this problem much worse, raises the question of how that limit would affect performance for the worst case where a triangle is largely occluded early in rasterization, and how to gracefully flush the wait stack in that worst case. It would have to be on chip for maintaining high performance, I think, or else it might be much like trading bandwidth for calculations and I'm not sure how to balance that (amongst other things I'm not sure of ;) ).

Getting a bit late, I'll go back and edit if any questions of clarity crop up.

hkultala
31-Jan-2003, 10:21
F117A? Why not F22 instead? Apples to apples...

F22 has the ability to supercruise naturally so unlike competitive designs it doesn't come with an afterburner :lol:


off-topic, but

F22 can supercruise at about 1.5 mach without afterburner.
But for take-offs, high acceleration and higher speeds than 1.5 mach it has afterburner ( Its maximum speed is something like 2.2 mach )

Dio
31-Jan-2003, 10:31
OT ALERT : 1.2 in supercruise and 1.6 or so in full burner, according to F22-ADF and various 'Military Hardware For Boys Who Haven't Grown Up' magazines.

AFAIK 2.0 was the original ATF requirement, but it was reduced to make the airframe a lot cheaper (not so much heat resistant material).

Of course this could all be military FUD...

Normal (!) service will now be resumed.

psurge
31-Jan-2003, 22:09
demalion

your post was very hard to follow for me, i'm not sure if i understood correctly - so a couple questions:

- you mention culling vertices... what you need to do is cull triangles. I don't understand you here...

- given that you are culling an area, the number of cycles for (even a rough) visibility check is at least somewhat dependent on the area being checked... so:

- while you perform the visibility check, what is the VS unit doing?

- all these flags for vertex attributes and VS registers... sounds like a nightmare to keep track of. AFAICS your approach would require saving not only vertex position and attributes in cache, but also the complete state of the VS at the point of visibility checking. This is already 32 16byte registers, plus loop registers, predicate registers, flags etc... Also, how are you detecting on the fly whether or not an instruction is in the dependency chain of the 'write vertex position' instruction? How do you go back and execute instructions which are not in this dependency chain, but do depend on the state of the VS prior to the 'write vertex position' instruction (this state is not easily recoverable without re-executing the shader up to the desired instruction)?

Ideally, a vertex program would be split into 2 parts, one solely related to position (P'), the other solely related to other attributes and lighting (P'')... For compatibility with the current API, my idea is to derive P' from P, and set P'' to P.

I am not at all an expert on HW design (I'm a software guy), but to me your idea (which I admittedly do not fully understand) limits the generality/scalability of the processing units involved. It involves a complicated scheme of flags for registers and cache elements and ties the execution units to the HZ unit.

3dcgi
31-Jan-2003, 23:25
I still think the Hierarchical Z based rejection system inline with applicable instructions could potentially save more performance. Remember, the check I'm specifying is "bounding" based, not a per pixel check. I think the difference in effectiveness of determining triangle occlusion here should be a very small percentage, but also I think that it should be able to reject more calculations in cases of tessellation and possibly geometry creation.

demalion. If you are trying to reject vertices then as psurge said that won't work. You need to reject on a triangle basis. A bounding box around the triangle will work because it is conservative. This might be what you are trying to say.

The performance benefit/penalty associated with rejecting, based on the triangle bounding box, before setup is a big unknown that cannot be guessed. The results will vary greatly depending on the type of scene being rendered.

Is this what you were trying to describe?

demalion
01-Feb-2003, 01:27
demalion

your post was very hard to follow for me, i'm not sure if i understood correctly - so a couple questions:

OK, I assume you are addressing my Hierarchical Z scheme and not my 2nd description based on Uttar's scheme and your comments. Also, remember my mention that I assume some scheduling would be set up by the driver when sending the program.

- you mention culling vertices... what you need to do is cull triangles. I don't understand you here...

A "rejected" triangle is "culled". A "rejected" vertex has a bit set so that a triangle can be later determined to have all its vertices "rejected", and therefore be "rejected/culled" itself. I'd expect such a mechanism is already in effect, actually, but I'm not a hardware designer either. :P

I do note that this depends on vertex coherency, i.e., being able to tell what triangle you are processing at the moment (or else you can't draw the line to check for occlusion against the Hierarchical Z data). This seems necessary to be in place already, but it is one of those vertex shader details I don't know...

The effect is to cull triangles and vertex lighting calculations effectively, the details discussed are for how to hopefully do that effectively even when vertices are moved by a vertex shader, and hopefully even when geometry is created in a shader.

- given that you are culling an area, the number of cycles for (even a rough) visibility check is at least somewhat dependent on the area being checked... so:

I'm presuming (in my Hierarchical Z rejection description) that the check is not significantly affected by the size of the area being checked, namely I'm presuming that Hierarchical Z structure is structured to facilitate rapid rejection of a mathematically defined area. I'd think the hardware would already be designed to facilitate this. I don't know the current latency, and the transistor cost/clock cycle tradeoff for making this latency low enough for what I propose, but in the absence of that I'm going to go ahead and propose it. ;)

- while you perform the visibility check, what is the VS unit doing?

Well, if my scheme indeed cannot perform the check quickly, I guess it would stall any dependent operations. :(

How long would such a stall be compared to the number of cycles an operation would take? Another possibility is having multiple Z check units (they seem like they should be considerably cheaper than vertex processors) and front loading upcoming non-dependent Z checks before the operation is put in the pipeline (again, driver scheduling provides an opportunity for this). Depending on how the latency situation looks compared to the latency of executing the other operations prior, the latency seems like it could be hidden.

Really, an answer to the delay required for this is what will determine feasibility.

- all these flags for vertex attributes and VS registers... sounds like a nightmare to keep track of. AFAICS your approach would require saving not only vertex position and attributes in cache, but also the complete state of the VS at the point of visibility checking.

There is no unique data per triangle? It would just take 3 bits more there to hold visibility for the triangle (an AND test of all 3 is the "cull" status). I also don't see why saving the complete state of the VS would be required...the point is to store the rejected status only, not the results used to achieve it, and then referencing those bits later when performing any calculation that would not result in a change of visibility status so the calculation can be skipped.
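
The "3 bits per triangle, AND test" can be written down directly. This is just the test as stated in the paragraph above, with a hypothetical function name; whether per-vertex bits are a sufficient basis for culling a whole triangle is exactly what is being disputed in this thread.

```python
def triangle_culled(reject_bits):
    """reject_bits: three booleans, one per vertex of the triangle."""
    a, b, c = reject_bits
    return a and b and c  # culled only when every vertex was rejected


print(triangle_culled((True, True, True)))   # all three rejected
print(triangle_culled((True, False, True)))  # one vertex still visible
```

A triangle with even one non-rejected vertex is kept, which covers the "one invisible vertex, two visible" example raised earlier.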

This is already 32 16byte registers, plus loop registers, predicate registers, flags etc... Also, how are you detecting on the fly whether or not an instruction is in the dependency chain of the 'write vertex position' instruction?

I don't, I'm seeking to discard all operations that depend on a vertex that was moved but was still rejected (the culled triangle remained occluded after the op) by propagating the rejected status in the result register of the "write vertex position instruction" as part of the operation discard step. The "reject based on reject status of operands" should be a simple AND operation of the operands' reject bits for some operations, and a simple AND of the triangle's individual reject bits for other operations.
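As a rough illustration of the reject-bit propagation described above (a minimal sketch only; `Reg`, `run_op` and `triangle_culled` are hypothetical names, not anything from an actual driver or GPU):

```python
from dataclasses import dataclass


@dataclass
class Reg:
    """A register value plus a hypothetical one-bit 'rejected' flag."""
    value: float
    rejected: bool = False


def run_op(op, *operands):
    """Skip the operation when every operand is flagged rejected.

    Returns (result_register, executed). When skipped, the rejected
    flag is propagated into the result register instead of a value.
    """
    if all(r.rejected for r in operands):
        return Reg(0.0, rejected=True), False
    return Reg(op(*(r.value for r in operands))), True


def triangle_culled(vertex_reject_bits):
    """AND of the three per-vertex reject bits gives the 'cull' status."""
    return all(vertex_reject_bits)
```

An operation with even one non-rejected operand still executes, so the skip only pays off when whole dependency chains hang off rejected data.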

The driver scheduling work I'm assuming would be able to handle invalid dependency cases (i.e., for example the "warn not to reject" method I mentioned before if the following instructions and destinations for instructions cannot be determined to be totally dependent on the data in the triangle).

How do you go back and execute instructions which are not in this dependency chain, but do depend on the state of the VS prior to the 'write vertex position' instruction (this state is not easily recoverable without re-executing the shader up to the desired instruction)?

The driver would warn in this case. Given the simplicity of the language spec, and that JIT translation and scheduling (from the standardized SL to the GPU's opcodes) is the function of the driver already, I am considering this a reasonable expectation.

...

Hopefully it is clear enough to address what you think won't work?



Ideally, a vertex program would be split into 2 parts, one solely related to position (P'), the other solely related to other attributes and lighting (P'')... For compatibility with the current API, my idea is to derive P' from P, and set P'' to P.

Do you mean derive P' from P, and set P'' to P dependent on the result of running P'? :-?

I'm trying to avoid running through the code more than once. The scheduling management I'm specifying depends on logic dependency, and should be manageable driver side as part of its existing workload (I think), but I see the ones you are proposing as dependent on data being processed by the GPU, and so can't be done beforehand.

I am not at all an expert on HW design (I'm a software guy), but to me your idea (which I admittedly do not fully understand) limits the generality/scalability of the processing units involved. It involves a complicated scheme of flags for registers and cache elements and ties the execution units to the HZ unit.

I fail to see how the added flags are excessive or complicated (to think of yes :-?, to implement, well, I'm not a hardware guy), but yes the execution unit dependency on Hierarchical Z is a major factor.

3dcgi
01-Feb-2003, 19:46
A "rejected" triangle is "culled". A "rejected" vertex has a bit set so that a triangle can be later determined to have all its vertices "rejected", and therefore be "rejected/culled" itself.

Actually, you can't do this. It is possible for all vertices to fail a z test, but part of the triangle can still be visible.

demalion
01-Feb-2003, 20:01
A "rejected" triangle is "culled". A "rejected" vertex has a bit set so that a triangle can be later determined to have all its vertices "rejected", and therefore be "rejected/culled" itself.

Actually, you can't do this. It is possible for all vertices to fail a z test, but part of the triangle can still be visible.

Hmm...I don't get that at all. By what basis can a triangle be culled then?

Can you provide an illustration of an example case for my education? :-?

Oh wait, did you see my comment:

I do note that this depends on vertex coherency, i.e., being able to tell what triangle you are processing at the moment (or else you can't draw the line to check for occlusion against the Hierarchical Z data). This seems necessary to be in place already, but it is one of those vertex shader details I don't know...

:?:

Vince
01-Feb-2003, 20:56
Why do you think all 4 are still in 9500? Because the majority of the die is the fragment shader pipes, the 4 vertex shaders are small comparatively.

Why do we still have and think in terms of discrete "VS" & "F/PS" units and other computation groups? It's all the same fundamentally. How much of a contemporary DX9-based processor is in use at any given time? How many active gates, etc? There is probably a lot of redundancy - think intelligently.

psurge
02-Feb-2003, 18:25
demalion, if you want me to understand your idea, i think you are going to have to go through an example, step by step... sorry :(

just to make sure we are on the same page :

- To "reject" a vertex V, you have to check that all the triangles which have V as one of their vertices are completely occluded.

.___.___.
| /| /|
|/ |/ |
.___*___.
| /| /|
|/ |/ |
.___.___.


I.e., for the vertex denoted by * above, 6 triangles would need to be checked.

- To check if a triangle is occluded, you need to know the screenspace position of all three of its vertices. In the example above, this means you would need to know 9 screenspace positions before being able to determine that * is "rejected".

- Say the vertex shader is operating on a stream of indexes referring to vertices. You have no guarantee that a given vertex V won't be referenced again later in the index stream. So, the VS doesn't know which triangles require data for V.

- V could be a vertex for both a completely occluded and a partially visible triangle, in which case vertex attributes must be computed for it.


The goal is to perform vertex attribute computations only if they are required.
What is being proposed is this:
-first do the bare minimum for each vertex: compute screenspace position. (I refer to the program which computes these as P' in my previous posts)
-From these positions, you get screenspace triangles which can be tested for visibility. If a triangle is visible, compute vertex attributes for all 3 of its vertices, if they haven't already been computed previously. (I refer to the program which computes these as P'' in my previous posts)
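The two-step scheme above can be sketched roughly as follows. This is only an illustration under stated assumptions: `position_program` stands in for P', `attribute_program` for P'', and `is_visible` is a placeholder for whatever occlusion test (e.g. a Hierarchical Z lookup) the hardware would provide; none of these names come from a real API.

```python
def process_mesh(vertices, triangles, position_program,
                 attribute_program, is_visible):
    """Run P' on every vertex, then P'' only where a visible triangle needs it."""
    # Pass 1 (P'): the bare minimum, a screen-space position per vertex.
    positions = [position_program(v) for v in vertices]

    attributes = {}   # vertex index -> attributes, filled lazily
    visible = []
    for tri in triangles:
        # Visibility is decided per triangle from its three positions.
        if is_visible([positions[i] for i in tri]):
            visible.append(tri)
            for i in tri:
                # Pass 2 (P''): compute attributes once per vertex, on demand.
                if i not in attributes:
                    attributes[i] = attribute_program(vertices[i])
    return positions, attributes, visible
```

A vertex shared between an occluded and a visible triangle still gets its attributes computed, because the decision is made per triangle and not per vertex.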

3dcgi
02-Feb-2003, 19:01
A "rejected" triangle is "culled". A "rejected" vertex has a bit set so that a triangle can be later determined to have all its vertices "rejected", and therefore be "rejected/culled" itself.

Actually, you can't do this. It is possible for all vertices to fail a z test, but part of the triangle can still be visible.

Hmm...I don't get that at all. By what basis can a triangle be culled then?

Can you provide an illustration of an example case for my education? :-?


Ok. Here we'll see if I understand you correctly and vice versa. The triangle is behind the two rectangles, but I've shown its outline for clarity. None of the triangle's vertices are visible so if that is the criterion for rejecting a triangle then this triangle would not be drawn. However, as you can see it should be drawn.

http://www.3dcgi.com/misc/forumimages/visibility.gif
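
For anyone who can't load the image, the same situation can be reproduced with a toy depth buffer. This is a minimal sketch with made-up numbers (a 9x9 buffer, two occluding bands at z=0.2, a triangle at z=0.5), assuming a "smaller z is closer" depth test:

```python
FAR = 1.0  # cleared depth value


def make_depth_buffer(w, h, occluders):
    """Fill a w*h depth buffer with rectangles given as (x0, y0, x1, y1, z)."""
    buf = [[FAR] * w for _ in range(h)]
    for (x0, y0, x1, y1, z) in occluders:
        for y in range(y0, y1 + 1):
            for x in range(x0, x1 + 1):
                buf[y][x] = min(buf[y][x], z)
    return buf


def edge(a, b, p):
    """Signed area test: which side of edge a->b the point p lies on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def inside(tri, p):
    a, b, c = tri
    e = (edge(a, b, p), edge(b, c, p), edge(c, a, p))
    return all(v >= 0 for v in e) or all(v <= 0 for v in e)


def vertices_all_occluded(tri, z, buf):
    """True if every vertex fails the z test on its own pixel."""
    return all(z >= buf[y][x] for (x, y) in tri)


def visible_pixels(tri, z, buf):
    """Pixels covered by the triangle that pass the z test."""
    return [(x, y) for y in range(len(buf)) for x in range(len(buf[0]))
            if inside(tri, (x, y)) and z < buf[y][x]]


# Two occluding bands cover the top and bottom rows, leaving a gap between.
buf = make_depth_buffer(9, 9, [(0, 0, 8, 2, 0.2), (0, 6, 8, 8, 0.2)])
tri = ((1, 1), (7, 1), (4, 7))
```

Here `vertices_all_occluded(tri, 0.5, buf)` is true, yet `visible_pixels(tri, 0.5, buf)` is not empty, since the middle of the triangle shows through the gap.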

By the way demalion, you must have creativity coming out of your pores this week. You seem to have a long post full of ideas in every thread I read.

demalion
02-Feb-2003, 20:56
3dcgi,
My provided quote after the bit you quoted recognized that (I think...), but psurge brought up another complexity. (BTW, I know it sounds scary, but the short answer to your comment at the end is "it's my meds". :lol: I've become more used to the side effects. )

---

psurge,

demalion, if you want me to understand your idea, i think you are going to have to go through an example, step by step... sorry :(

just to make sure we are on the same page :

- To "reject" a vertex V, you have to check that all the triangles which have V as one of their vertices are completely occluded.

.___.___.
| /| /|
|/ |/ |
.___*___.
| /| /|
|/ |/ |
.___.___.


I.e., for the vertex denoted by * above, 6 triangles would need to be checked.

- To check if a triangle is occluded, you need to know the screenspace position of all three of its vertices. In the example above, this means you would need to know 9 screenspace positions before being able to determine that * is "rejected".

- Say the vertex shader is operating on a stream of indexes referring to vertices. You have no guarantee that a given vertex V won't be referenced again later in the index stream. So, the VS doesn't know which triangles require data for V.

- V could be a vertex for both a completely occluded and a partially visible triangle, in which case vertex attributes must be computed for it.

Hmm...6 connected triangles? That would require enough vertex processors to schedule all connected vertex transforms at the same time. That is too many for the current number of vertex processors on a chip without assuming a lot more scheduling than might be practical.
EDIT: Hmm... based on Kristof's article indicating SIMD for each vertex processor, I'm not sure if this is necessarily the case....but it might still be less confusing architecturally to go with the split transform/lighting processors.

If necessary, an alternative might be that the pipeline is broken up into the "transform processor" and "lighting processor".

Assuming not just connected vertex coherency (in a triangle), but connected triangle coherency (in all triangles associated with a vertex which it seems I failed to completely consider):

Alternative one (full vertex processor), if 6 is the maximum amount of possible triangles, then 6 vertex processors would be required if there is enough latency in the pipeline that a transform for a 7th could have been done by the time the occlusion query is being executed (seems possible with cascading, but might be too complicated a way of looking at this).
EDIT: Not sure how SIMD for transforming operations fits into assumptions already made in vertex processor design regarding this.

Alternative two (split transform and lighting), P' is executed in the transform processors, P'' is then executed in the lighting processor (no repeated instructions). By having enough transform units for P' to have enough data to determine occlusion, you could still calculate P' and P'' on the fly. Stalls will still result if the transform processor is waiting on dependent calculations (the scheduling concerns I mentioned still mostly apply), but occlusion (procedural) checking latency would be hidden effectively I believe.
EDIT: Based on the SIMD concept, could few transform processors serve more lighting processors? In any case, occlusion detection could become the work of scheduling data output from one to the input of the other.

Scheduling seems vital to all of the above, but I have no idea where the scheduling requirements that might be required to make the above feasible stand in relation to what drivers/execution units on the GPU already have to do.

The goal is to perform vertex attribute computations only if they are required.

Let me see if I can get a better idea of the computations from Kristof's article (which has been percolating in my brain for the past few weeks and has helped crystallize my current understanding, as incomplete as it is)...

Hmm...the transform processor could be easily SIMD, but seeing it in Kristof's article as SIMD in each vertex processor again puts a new spin on that for me. I'm going back and adding EDITs for the thoughts and questions it spawned (providing more opportunities to see where I'm going wrong, or, against all odds, right ;) )


What is being proposed is this:
-first do the bare minimum for each vertex: compute screenspace position. (I refer to the program which computes these as P' in my previous posts)

OK, my idea in all forms (I think) depends on doing this first.

-From these positions, you get screenspace triangles which can be tested for visibility. If a triangle is visible, compute vertex attributes for all 3 of its vertices, if they haven't already been computed previously. (I refer to the program which computes these as P'' in my previous posts)

By storing "rejected" bits for vertices, disregarding this bit for operations that might move a vertex, then by instruction scheduling, and flagging when not to set rejection bits due to dependency (either driver side or in the GPU vertex instruction scheduler), I'm proposing that if the triangle visibility determination I describe does not add significant latency, or adds a latency that can be masked, then P' and P'' (if I understand them correctly) can be written on the fly (and, in any case, this doesn't require a per pixel visibility check for determining execution as originally proposed).

Hmm...mighty long sentence. :-? Anyways, I hope I'm making more sense...

Arun
02-Feb-2003, 21:08
You know, it's really sad nobody has seen the interest in saving Vertex processing power before nVidia released the NV20, which makes it real hard to implement.

With the NV1x, T&L was indeed separated as a Transform unit and a Lighting unit. This can clearly be seen in old nVidia patents.

However, with the NV2x & NV3x, both are done in the VS, thus making it a lot harder to implement.

If such an idea might have been implemented before such as with a Kyro using T&L, we might already have this whole idea implemented in current GPUs and switching to VS3.0. would be a lot easier.

Oh well... It's still better to have it a day or another than never! Let's hope we do get it in the R400 or NV40 or R500 or something...


Uttar

psurge
03-Feb-2003, 18:05
Uttar,

Look at this PPT for a related way of saving T&L work:
http://gamma.cs.unc.edu/SIG02_COURSE/greene.ppt

Ned Greene works at NVidia BTW...

Serge

Arun
03-Feb-2003, 20:47
Interesting info, psurge. I'll have to look a little more at it later.

BTW, just for the record, here is my current R400 guess:

- 16 pipelines, shared between PS & VS ( R300, if you count VS & PS pipelines, is 12 pipelines - but remember R400 ones are significantly more advanced and many things can be done with one instruction instead of several on the R300 )
- Between 8 and 12 TMUs pipe, no idea on exact number ( useless to put one per pipeline )
- PS3.0. & VS3.0. ( slightly above specs, but not as much as GFFX was over PS2.0. and VS2.0. )
- 400Mhz+ core, pretty much similarly clocked memory
- Between IMR and TBDR: requires front-to-back, but even more effective at saving work in such a case.
- Truly non traditional FSAA sampling patterns ( got a theory, but it could be completely incorrect )


Uttar

psurge
03-Feb-2003, 21:12
Uttar, since VS3.0 includes the ability to lookup textures (i.e. a displacement map), I think it still makes sense to have 1 "TMU" per pipe.

Honestly I would be surprised if 16 load balanced uber-pipelines are possible in .13...

T2k
06-Feb-2003, 02:01
Honestly I would be surprised if 16 load balanced uber-pipelines are possible in .13...

Why?
I think there's plenty of room to grow on 130 nanos...

jvd
06-Feb-2003, 02:08
Uttar, since VS3.0 includes the ability to lookup textures (i.e. a displacement map), I think it still makes sense to have 1 "TMU" per pipe.

Honestly I would be surprised if 16 load balanced uber-pipelines are possible in .13...


Of course many people didn't believe the r300 was possible on the .15micron at those clock speeds and that happened.

psurge
06-Feb-2003, 03:07
Because:

- each "pipe" must support dynamic branching.
- if VS/PS programs can run on any pipe,
the 24bit fp units of the r300 must be switched to full 32bit FP
- AFAIK instruction predication is not yet supported by ATI
(pls don't flame if this is wrong, i can't afford any of these next gen cards)
- for true displacement mapping, I was under the impression that the texture sampling unit had to perform trilinear filtering of floating point textures.
- load balancing logic
- handling the fact that the output of any pipe could be a pixel or a vertex means all 16 pipes have to somehow be connected to the rasterization/HZ/framebuffer compression logic...

(these 2 are maybe's)
- each pipe must have access to the constants
available to all the programs it needs to switch between
- each pipe needs more instruction memory (to support VS/PS3.0 instruction count limits... maybe also for faster program switching)

Who knows though, maybe it will all fit into around 150million transistors. All I said was I would be surprised.

Serge

Arun
06-Feb-2003, 06:19
16 TMUs is out of the question.
The waste you'd get in the VS would be simply ridiculous! Few VS programs will use textures.
Thus, I think a pool of TMUs would make a lot more sense. It's harder to implement, but more efficient.

As I said, I don't think 16 true pipelines are possible. I'm suggesting 16 pipelines total, compared to the R300's 12 ( -> +33% )
And maybe +50% TMUs ( -> 12 TMUs? )

Uttar

Hellbinder
06-Feb-2003, 06:35
I think it has 32 TMU's and *1* pipeline running at 3400 Mhz.

No really...

jvd
06-Feb-2003, 20:19
come on man ... Don't you know they just took 54 rage chips , clocked them to 2000mhz each and put them all on one card ?

martrox
06-Feb-2003, 20:28
No...it's got a roomful of monkeys......... singing The Porpoise Song!

Kaizer
06-Feb-2003, 20:32
No...it's got a roomful of monkeys......... singing The Porpoise Song!

And a RSWOC (Render Subject WithOut Clothes [RSN is a Matrox patent]) chip and an improbability generator (the teacup thing).


With regards
Kjetil

Luminescent
06-Feb-2003, 20:46
I believe it is more likely for the R400 to sport an array of vertex/pixel processors which may be subdivided and labeled as "pipelines", than to sport the more conventional grouped units seen today. Something more along the likes of the NV30's vertex shader and the p10, processor pools/arrays would be more elegant than mere vliw pipelines. They could be scheduled and pipelined in any fashion. Maybe they would offer more precision by being combined (2 32-bit units for a 64-bit calculation).

martrox
06-Feb-2003, 21:21
Hmmmm......an improbability generator ...........a cup of tea...........Anna Nicole Smith......... AGGGH!!!!!!!!

<runs from room, arms waving>
screaming "DANGER WILL ROBINSON, DANGER" and proceeds to pluck his own eyes out........

Jeez, just what the heck did I eat to cause that nightmare..........

T2k
08-Feb-2003, 20:06
Anna Nicole Smith


She's a low woman. Simply low level...

09-Feb-2003, 01:04
The jump which the R400 will make compared to the R200-R300 isn't that great......... the next huge leap will not arrive till R500 so till then there isn't really that much to make a fuss about......

The biggest diff will be R400=0.13 micron instead of 0.15 micron

:roll:

T2k
10-Feb-2003, 02:56
The jump which the R400 will make compared to the R200-R300 isn't that great......... the next huge leap will not arrive till R500 so till then there isn't really that much to make a fuss about......

The biggest diff will be R400=0.13 micron instead of 0.15 micron

:roll:

It totally contradicts Orton's statement about R400...

Hellbinder
10-Feb-2003, 03:00
The jump which the R400 will make compared to the R200-R300 isn't that great......... the next huge leap will not arrive till R500 so till then there isn't really that much to make a fuss about......

The biggest diff will be R400=0.13 micron instead of 0.15 micron


hahahaha.. thats a good one..

Arun
10-Feb-2003, 19:12
The jump which the R400 will make compared to the R200-R300 isn't that great......... the next huge leap will not arrive till R500 so till then there isn't really that much to make a fuss about......

The biggest diff will be R400=0.13 micron instead of 0.15 micron


hahahaha.. thats a good one..

Hehe, I gotta agree it sure is a good one :)

Something I've got to agree on, however, is that the R500 will be a bigger move than either R300 or R400.
That's because 0.09 is nearly 50% smaller than 0.13 (in area terms) - we haven't seen such a reduction in years!
So, whatever companies do with 0.09, it'll be a big performance boost for sure even if there's no architectural change.
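
To put rough numbers on the shrink (a back-of-the-envelope sketch that assumes ideal scaling, i.e. that die area goes with the square of the drawn feature size, which real processes only approximate):

```python
def linear_shrink(old_um, new_um):
    """Fractional reduction in feature size when moving between processes."""
    return 1 - new_um / old_um


def area_shrink(old_um, new_um):
    """Fractional reduction in area for the same design, ideal scaling assumed."""
    return 1 - (new_um / old_um) ** 2


for old, new in [(0.15, 0.13), (0.13, 0.11), (0.13, 0.09)]:
    print(f"{old} -> {new}: "
          f"linear {linear_shrink(old, new):.0%}, "
          f"area {area_shrink(old, new):.0%}")
```

Going from 0.13 to 0.09 is about a 31% linear shrink, so the "nearly 50%" figure only matches if you count area (about 52%).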


Uttar

Bigus Dickus
10-Feb-2003, 22:33
What makes you so positive that graphics IHV's won't go to a .11u process before they move to .09u?

After all, ATi and nVidia both stopped at .15u in their process refinements, while Intel and AMD both jumped straight from .18u to .13u. Just because AMD and Intel plan .09u to be next wouldn't seem to lock in the conclusion that ATi and nVidia will make the same jump.

;)

Deflection
10-Feb-2003, 22:57
After all, ATi and nVidia both stopped at .15u in their process refinements, while Intel and AMD both jumped straight from .18u to .13u. Just because AMD and Intel plan .09u to be next wouldn't seem to lock in the conclusion that ATi and nVidia will make the same jump.

I'm not sure that is such a good comparison. Intel and AMD have their own fabs and constantly refine their process smaller and smaller. It's not quite so discontinuous as it appears. Their drawn process size differs greatly from the actual over time. You'll occasionally hear speculation about what their current Leff's are. Probably, at this stage of maturity their .13 processes would be better described as .11 already. I doubt ATI and Nvidia get this advantage using TSMC.

Bigus Dickus
11-Feb-2003, 03:10
I'm not sure that is such a good comparison. Intel and AMD have their own fabs and constantly refine their process smaller and smaller. It's not quite so discontinuous as it appears. Their drawn process size differs greatly from the actual over time. You'll occasionally hear speculation about what their current Leff's are. Probably, at this stage of maturity their .13 processes would be better described as .11 already. I doubt ATI and Nvidia get this advantage using TSMC.

Point well taken, but that doesn't change my hypothesis. Rather, it strengthens it. If AMD and Intel are actually using process tweaks that give them something closer to an effective .11u (or .15u, as many claimed AMD's .18u copper process was essentially), then that supports my hypothesis that ATi and nVidia won't jump straight to .09u. Why would they, if no one else really does, and historically they have taken an "actual" in-between step whereas Intel and AMD have only taken the "effective" in-between steps?

Deflection
11-Feb-2003, 05:22
Point well taken, but that doesn't change my hypothesis. Rather, it strengthens it.

I agree. I can't remember the last time ATi or Nvidia didn't use that half generation step as long as it was ready.

Mulciber
11-Feb-2003, 10:51
I'm not sure that is such a good comparison. Intel and AMD have their own fabs and constantly refine their process smaller and smaller. It's not quite so discontinuous as it appears. Their drawn process size differs greatly from the actual over time. You'll occasionally hear speculation about what their current Leff's are. Probably, at this stage of maturity their .13 processes would be better described as .11 already. I doubt ATI and Nvidia get this advantage using TSMC.

Point well taken, but that doesn't change my hypothesis. Rather, it strengthens it. If AMD and Intel are actually using process tweaks that give them something closer to an effective .11u (or .15u, as many claimed AMD's .18u copper process was essentially), then that supports my hypothesis that ATi and nVidia won't jump straight to .09u. Why would they, if no one else really does, and historically they have taken an "actual" in-between step whereas Intel and AMD have only taken the "effective" in-between steps?

I thought I specifically read somewhere an interview with someone from nVidia, and when asked about the move to .09 micron he replied that it would be more like .11.

Those exact numbers.

laGadU
11-Feb-2003, 12:29
what about those news about the partnership with cadence for the .09 process to the R500 ?

MuFu
11-Feb-2003, 15:31
what about those news about the partnership with cadence for the .09 process to the R500 ?

ATi and nVidia have very different ways of going about PCB design. I am not really at liberty to talk about it, but basically nVidia have used Cadence software for quite a while (Specctra/SpecctraQuest etc), making impedance and delay (->skew) matching of the entire system a lot easier, whereas ATi use only basic PCB tools and rely on pretty tight tolerances from their ASIC teams. Once the ASICs start hitting 600MHz+ (as will happen on 0.09u, possibly even on 0.13u) they really are going to need access to the full suite of tools to keep everything ticking over nicely.

MuFu.

demalion
11-Feb-2003, 18:09
Why are the ATI cards smaller, then? R300 with 256-bit bus compared to, say, a GF 4 Ti...where would these design tools manifest themselves?
I realize smaller doesn't necessarily mean better for a card design, but getting more out of less seems that way to me...or are ATI's PCBs just outrageously more expensive to manufacture?

MuFu
13-Feb-2003, 15:12
I think it's to do with engineering headroom. nVidia like to cover their asses on a per-team basis quite a bit more than ATi and there seems to be more of a "better safe than sorry" ethos because of it. Discrete reg stages are therefore bigger, taking up board space and there are extra ground/signal layers to maintain SSTL integrity. The tools just allow them to build in this safety margin by more accurately matching impedances etc for operation at high clockspeeds. Jen-Hsun is a pretty scary guy so nobody really wants to answer to him when targets aren't met. :? :lol:

Regarding the Ti4600 board/9700 Pro boards - the former is an 8-layer PCB like the R300-942 and despite being larger, only has a 128-bit memory bus (as you say). It was designed quite a bit earlier though and even at speeds of 400/800MHz+ is not a limiting factor. It is quite obvious from the clockspeeds being reached by 9700NPs flashed with 9700 Pro BIOS's that the mem interface itself is holding things back at clockspeeds of more than 325MHz. The PCB design teams probably got an early projection for R300 at-speed (~300MHz) and then worked for a target of 325MHz or so, only verifying PCB delay and matching impedance. There was talk of ten-layer R300 boards - that was more of a QC issue than anything else, with board manufacturers not being too confident that they could produce the 8 layer board reliably for a given cost.

At nVidia the PCB teams get delay/skew times from the ASIC guys and work to match the entire path from the get-go so the resulting boards have quite a bit of headroom. They had gone through three revisions of the NV30 PCB before the ASIC had even come back from the fab (!). I guess with the larger margins for error nVidia have more flexibility when it comes to AIB manufacturers.

MuFu.

MuFu
13-Feb-2003, 15:12
BTW, at ATi they have a policy of sticking to compact designs. The board marketing manager insists on the same form factor for all the PCBs in a range - makes designing the AIW cards a bit of a nightmare but pretty cool from an eng. elegance POV. I guess they are just less bothered about it at nVidia.

The 9700/9700 Pro board is only about $5, incidentally - not that expensive now that it is being produced in volume. I am told that it can quite easily qual at 350MHz+ with some resistor pack tweaks, but of course will require more power for termination because of it.

demalion
13-Feb-2003, 16:48
Well, as long as ATI can execute, I think I have a general fondness for that board marketing manager! Oh, and for the board engineers that actually implement it, I guess, but I'm thinking ATI engineers in general don't need any more ego feeding right now. :P

Nagorak
13-Feb-2003, 23:05
It is quite obvious from the clockspeeds being reached by 9700NPs flashed with 9700 Pro BIOS's that the mem interface itself is holding things back at clockspeeds of more than 325MHz. The PCB design teams probably got an early projection for R300 at-speed (~300MHz) and then worked for a target of 325MHz or so, only verifying PCB delay and matching impedance.

What are you basing this off of? Don't you think it could have something to do with the fact the R300 was more or less designed for DDR1, which ends at around 400 MHz (now) and it was slower at release? Are you saying the R300 architecture is incapable of using faster ram efficiently? I really have a hard time imagining that this could be the case...

stevem
13-Feb-2003, 23:26
...It is quite obvious from the clockspeeds being reached by 9700NPs flashed with 9700 Pro BIOS's that the mem interface itself is holding things back at clockspeeds of more than 325MHz...

How is this possible for endusers to qualify given the disparity of RAM used on these boards?

OTOH my 9700 NP & 9500 pro boards are identical, which I know is peculiar for a 9500 pro...

MuFu
14-Feb-2003, 00:34
Are you saying the R300 architecture is incapable of using faster ram efficiently?

No, nothing to do with efficiency. Stability. There is probably some kind of internal limit/TL problem that causes the ~350MHz ceiling. You'll have to take my word that one of the big areas of optimisation/tweaking in the R300 to R350 transition involves the memory interface/controller. Like I said earlier, the PCB will probably be fine at 350MHz+ - it'll just use more power.

MuFu.

MuFu
14-Feb-2003, 00:36
...It is quite obvious from the clockspeeds being reached by 9700NPs flashed with 9700 Pro BIOS's that the mem interface itself is holding things back at clockspeeds of more than 325MHz...

How is this possible for endusers to qualify given the disparity of RAM used on these boards?

Well that's the key thing - similar overclocks *despite* the rating disparity.

OTOH my 9700 NP & 9500 pro boards are identical, which I know is peculiar for a 9500 pro...

Yeah - must be a very early one.

MuFu.

stevem
14-Feb-2003, 01:52
Well that's the key thing - similar overclocks *despite* the rating disparity.

I know, however, that presupposes that BIOS timings, etc, aren't to blame. You're not really testing the same GPU+PCB vs diff RAM. You're testing same GPU+PCB+BIOS vs diff RAM. I don't necessarily disagree with your conjecture, mind you.

T2k
17-Feb-2003, 07:18
What is the possible timeframe for tapeout? March-April?

Nebuchadnezzar
17-Feb-2003, 10:10
What is the possible timeframe for tapeout? March-April?

Optimistic : End of Feb.
Conservative: End of March.

T2k
19-Feb-2003, 07:02
What is the possible timeframe for tapeout? March-April?

Optimistic : End of Feb.
Conservative: End of March.

Sounds good. :)

So, we should prepare for the first 'leaked' rumours within weeks? :D
I hope so...

T2k
22-Feb-2003, 05:16
Hmmm... everything is far too silent... :wink:

edwpang
23-Feb-2003, 16:34
ok, how bout these.....


R300 = Brute Force
R400 = Intelligent Design
...
R300 = Culmination of generations since R100
R400 = Instigation of new Generational Relationships for future products
...
R300 = More parts = more fun for U
R400 = Do we still need all these parts???

Interesting, I like these comments.

WaltC
23-Feb-2003, 19:45
I believe it is more likely for the R400 to sport an array of vertex/pixel processors which may be subdivided and labeled as "pipelines", than to sport the more conventional grouped units seen today. Something more along the likes of the NV30's vertex shader and the p10, processor pools/arrays would be more elegant than mere vliw pipelines. They could be scheduled and pipelined in any fashion. Maybe they would offer more precision by being combined (2 32-bit units for a 64-bit calculation).

I don't really see anything wrong with "conventionally grouped" pipelines...;) I would also hope that they don't go hog wild with .13 and grow the circuitry disproportionately just because they think they can. I'd like to see them concentrate on maintaining a decent thermal profile while attaining a much higher clockspeed--which I think they ought to be able to do--provided they KISS. I would think they'd want to eschew elegance if it's going to mean a slew of complex logic conflagrations down the line. I'm also not sure if it's a good thing to look at "nv30's vertex shader" discretely, for that kind of thing. Yea, it's possible you could get what you want on the first go--also possible you'd get a great big mess, instead...;)

Nagorak
24-Feb-2003, 00:41
I don't really see anything wrong with "conventionally grouped" pipelines...;) I would also hope that they don't go hog wild with .13 and grow the circuitry disproportionately just because they think they can. I'd like to see them concentrate on maintaining a decent thermal profile while attaining a much higher clockspeed--which I think they ought to be able to do--provided they KISS. I would think they'd want to eschew elegance if it's going to mean a slew of complex logic conflagrations down the line. I'm also not sure if it's a good thing to look at "nv30's vertex shader" discretely, for that kind of thing. Yea, it's possible you could get what you want on the first go--also possible you'd get a great big mess, instead...;)

The mobile market is a big part of ATi's focus, so I don't imagine they'll deviate much from their previous methodology. So, rest easy. ;)

T2k
04-Mar-2003, 03:51
bumpy-bumpy

:)

Hmmm... I still believe R400 will be out around August...

Pete
04-Mar-2003, 04:17
With the R350 sporting 30GB/s of bandwidth, ATi might just be wiser leaving it for late Nov/early Dec. 30GB/s!

T2k
14-Mar-2003, 14:40
So, now we know R350/RV350... is there anything new about R400?

Joe DeFuria
14-Mar-2003, 14:50
So, now we know R350/RV350... is there anything new about R400?

Well, we have not heard anything about R400 taping out, which probably means it hasn't. We usually get tipped off around here when these things happen. ;)

Assuming that's the case, and given the fact that this should be another brand new architecture, I don't really see any way it can be in production this summer.

At best, I would guess we might see ATI do something similar to what nvidia did with the FX. Launch it late this year, and possibly get it out late 4Q. However, my guess would be R400 is pushed back to the Spring '04 cycle (which effectively is what happened to the NV30).

Assuming that's the case (no real R400 this year), it will be interesting to see what ATI does for the fall cycle. If ATI is true to their convictions of keeping "the flag" captured at the high end, we may see another R3xx refresh this summer. I'm not expecting another chip, but perhaps that's when we may see something like a 450Mhz+ R350 paired with 450 Mhz DDR-II.

We know that ATI will launch some 256 MB R350 in April. If the clocks are no different than the 128 MB 9800 Pro (which I suspect they will be the same), then my guess is they are going to launch a 450/450 or similar R350 this summer....around the time that the NV35 starts to ship...
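
For reference, a rough sketch of the peak-bandwidth arithmetic behind clock guesses like these. It assumes a 256-bit DDR memory interface (two transfers per clock) and decimal gigabytes; the function name is made up for illustration, and the 340 MHz figure is taken here as the stock 9800 Pro memory clock.

```python
def peak_bandwidth_gb_s(bus_width_bits, mem_clock_mhz, transfers_per_clock=2):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 10**9 bytes).

    Assumption: DDR-style memory, so two transfers per memory clock.
    """
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * mem_clock_mhz * 1e6 * transfers_per_clock / 1e9


# Assumed stock clock versus the speculated 450 MHz part, both on a 256-bit bus.
for clock_mhz in (340, 450):
    bandwidth = peak_bandwidth_gb_s(256, clock_mhz)
    print(f"256-bit @ {clock_mhz} MHz DDR: {bandwidth:.1f} GB/s")
```

Under those assumptions a 450 MHz memory clock lands just under 29 GB/s, which is roughly where the "30GB/s" figure quoted earlier in the thread would come from.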

demonic
14-Mar-2003, 16:36
We know that ATI will launch some 256 MB R350 in April. If the clocks are no different than the 128 MB 9800 Pro (which I suspect they will be the same), then my guess is they are going to launch a 450/450 or similar R350 this summer....around the time that the NV35 starts to ship...

But didn't we see FIC do a press release with a 420? core and 460 ddr-ii ram?

From most of the reviews the core is clocking up very well... so wouldn't it be a case of FIC or another vendor speed binning and getting premium chips?

From ATi I would rather see something past the 500Mhz range for the fall product.. After all, we all know that ATi's architecture is much better than Nv's at the moment.. :D

Joe DeFuria
14-Mar-2003, 16:46
But didn't we see FIC do a press release with a 420? core and 460 ddr-ii ram?

Yes we did see it, although it was premature and AFAIK, it could have been based on old / incorrect information. Now that the R350 is official, is that FIC PR with those same specs available?

From most of the reviews the core is clocking up very well... so wouldn't it be a case of FIC or another vendor speed binning and getting premium chips?

It is certainly possible.

From ATi I would rather see something past the 500Mhz range for the fall product.. After all, we all know that ATi's architecture is much better than Nv's at the moment.. :D

Well, yeah, I'd rather see a 750 Mhz+ product. ;) However, it is in ATI's best interests for their high-end cards to be "just fast enough" to be considered "faster" than nVidia's latest offerings.

In other words, a 420/460 R350 would really be overkill (for ATI, not us!) considering the competition. That could hurt the margins on the 9800/9800 Pro depending on the pricing of the 9800 Pro/Ultra/SuperDuper FIC type card. ;)

IMO, ATI is just better off waiting for nVidia to show its "NV35 card", and then respond to it with an "Ultra" R350, once they know more or less exactly what the NV35 is capable of.

Arun
14-Mar-2003, 16:53
Okay, so we had a lot of 3DFX/nVidia comparisons with the VSA-100 / NV30 similarities.

But is it just me or does the R400 look an awful lot like Rampage? :D

The Rampage was a fundamentally different architecture, compared to its predecessors and everything else on the market.
The Rampage got delayed
3DFX made several intermediary architectures ( V3, V4 & V5 ) in order to remain competitive

The R400 is a fundamentally different architecture, compared to its predecessors and everything else on the market.
The R400 got delayed ( although it seems it won't be delayed anywhere near as much, and I'd be *very* surprised if it was delayed by more than 6 months )
ATI is probably going to release a 0.13/DDR2 R350 in order to remain competitive.


The only difference is that ATI still has a lot more financial resources than 3DFX had back in the day.
If the R400 in 2004 could be what Rampage would have been in 2001... Wow...


Uttar

antlers
14-Mar-2003, 17:29
I always thought the R400 was more like Fear

Joe DeFuria
14-Mar-2003, 17:33
But is it just me or does the R400 look an awful lot like Rampage? :D

Nnnnnnnoooooooooo!!

:(

Please let Rampage rest in peace. I thought we had gotten over that once the NV30 was launched and all the "NV30 will be like Rampage" stuff subsided...

TheMightyPuck
14-Mar-2003, 18:00
I thought R400 was going to have to wait for low-k dielectric process feasibility. Next year.

Arun
14-Mar-2003, 18:19
Nnnnnnnoooooooooo!!

:(

Please let Rampage rest in peace. I thought we had gotten over that once the NV30 was launched and all the "NV30 will be like Rampage" stuff subsided...

Sorry :(
Please don't think I'm an inhumane madman who laughs at Rampage's fate. It's a really sad thing, and it could have been great if it had been released in early 2001.

What I mostly see in the R400 are Rampage's qualities. I probably should have not said it...


Uttar

T2k
15-Mar-2003, 00:37
But is it just me or does the R400 look an awful lot like Rampage? :D

Nnnnnnnoooooooooo!!

:(

Please let Rampage rest in peace. I thought we had gotten over that once the NV30 was launched and all the "NV30 will be like Rampage" stuff subsided...

Yes, exactly.

It's gettin' borin'... 'if the 3dfx'... 'then the 3dfx'... 'NV30 has Rampage-like this and that' (has nothing from that)... stop it, pls.

OFF
C'mon: Rampage today couldn't be a player anywhere. Forget it. (And honestly: I don't give a flying frog about what Rampage could do... never came out, that's all. Carpe diem! ;))
Yeah, it's only a personal note...

Onde Pik
15-Mar-2003, 01:52
I always thought the R400 was more like Fear

Naa the obvious name is "Rage Fury Frenzy 10000 PRO"

Personally I always thought Rage Fury was the coolest ever Vcard name.

RussSchultz
15-Mar-2003, 03:59
Personally I always thought Rage Fury was the coolest ever Vcard name.

Department of the Redundancy Department.

Chris123234
16-Mar-2003, 01:30
Department of the Redundancy Department.

lmao

Onde Pik
16-Mar-2003, 14:25
Personally I always thought Rage Fury was the coolest ever Vcard name.

Department of the Redundancy Department.

Actually it isn't redundant at all.

Chris123234
16-Mar-2003, 16:40
rage n.-

1-A. Violent, explosive anger.
---B. A fit of anger.
2-Furious intensity, as of a storm or disease.

fury n.-

1-Violent anger; rage.
2-Violent, uncontrolled action; turbulence.

Actually it's totally redundant

Back to the Department of Redundancy Department for Rage Fury.

noko
16-Mar-2003, 17:09
No, nothing to do with efficiency. Stability. There is probably some kind of internal limit/TL problem that causes the ~350MHz ceiling. You'll have to take my word that one of the big areas of optimisation/tweaking in the R300 to R350 transition involves the memory interface/controller. Like I said earlier, the PCB will probably be fine at 350MHz+ - it'll just use more power.

The ~350mhz ceiling, I think, is based more on RAM specs than on the actual memory controller design. I can go significantly above 350mhz, but with tweaking and obviously out-of-spec voltages :).

If I understand right, all the pipelines are dedicated to shading one triangle at a time, and if there is overfill, as in fewer pixels left to be shaded than the number of pipelines, the results of the surplus pipelines are dumped. To me that means wasted pipeline availability. How about having the ability to shade more than one triangle at a time and keep the pipelines actually shading triangles as much as possible?
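
A toy model of the point above, just to put numbers on it. It assumes 8 pixel pipelines and that each clock issues one pixel per pipeline; real hardware works on quads and tiles, so treat this as a sketch of the idea rather than how any actual chip schedules work. The function names and the triangle sizes are invented for illustration.

```python
import math


def utilization_one_triangle_at_a_time(pixels_per_triangle, pipelines=8):
    """Fraction of issued pipeline slots doing useful work when every
    clock is filled from a single triangle only (leftover slots are wasted)."""
    useful = sum(pixels_per_triangle)
    issued = sum(math.ceil(n / pipelines) * pipelines for n in pixels_per_triangle)
    return useful / issued if issued else 0.0


def utilization_packed(pixels_per_triangle, pipelines=8):
    """Same measure when pixels from different triangles may share a clock."""
    useful = sum(pixels_per_triangle)
    issued = math.ceil(useful / pipelines) * pipelines
    return useful / issued if issued else 0.0


# Hypothetical small triangles, given as visible pixel counts.
triangles = [3, 5, 10, 2]
print(f"one triangle at a time: {utilization_one_triangle_at_a_time(triangles):.0%}")
print(f"packed across triangles: {utilization_packed(triangles):.0%}")
```

In this model the waste only matters when triangles are small relative to the pipeline count; a triangle covering hundreds of pixels loses at most one partly filled clock.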

MuFu
16-Mar-2003, 19:07
No, nothing to do with efficiency. Stability. There is probably some kind of internal limit/TL problem that causes the ~350MHz ceiling. You'll have to take my word that one of the big areas of optimisation/tweaking in the R300 to R350 transition involves the memory interface/controller. Like I said earlier, the PCB will probably be fine at 350MHz+ - it'll just use more power.

The ~350mhz ceiling, I think, is based more on RAM specs than on the actual memory controller design. I can go significantly above 350mhz, but with tweaking and obviously out-of-spec voltages :)

Sure - bumping VDD/VDDQ will help. Actually the limit on the R300/942 board seems to be more like 340-345MHz. I disagree completely - it is most definitely independent of memory limitations.

Consider the fact that the 9800 Pro uses exactly the same memory as the 9700 Pro yet typically overclocks to 370-400MHz.

MuFu.

noko
17-Mar-2003, 00:03
Actually bumping VDDQ hinders performance in my case but VDD was a true blessing :D. The Tachyon uses a different voltage controller for VDDQ which doesn't require the back plate found with the typical Radeon 9700 Pros. Basically the Tachyon doubled up and used the same type of voltage controller for both VDDQ and VDD. So you say it is the same type of memory? hummmm. Anyways 3.2 volts for VDD allows me to go to 370mhz without any problems on my Raddy. Anything over 3.16v for VDDQ makes overclockability go the other way, with bad artifacts. So I think many folks upped the voltage of VDDQ and had terrible results, not knowing that this value could be detrimental if raised too high.

I may still be able to get a higher stable memory overclock once I install a higher resistor into my circuit for VDDQ in combination with the potentiometer, since 3.16v is the lowest it goes and 2.8v is the stock voltage. I haven't played around with VDDQ voltages between 2.8 and 3.16v. So maybe I will be able to get above 370 mhz. I hope so.

noko
17-Mar-2003, 03:58
Mufu,

Just saw your thread over at Rage3d.com about bios and memory speeds, pretty interesting. To fill you in a little bit here is the circuit I built which controls Vcore, Vmem I/O and core or VDDQ and VDD.

http://bellsouthpwp.net/n/o/noko/Proposalimages0005.jpg


I can control the voltages in a given range externally since I have the circuits mounted in a floppy drive bay.


http://bellsouthpwp.net/n/o/noko/Proposalimages0002.jpg


Here is an example of Vcore being controlled in real time, using TGM (Tachyon Graphics Monitor) to measure the values. The initial 1.5v is the stock value; the first jump is when I switch the circuit into operation with one of the DIP switches. The increase in voltage after that is me adjusting the potentiometer up to a value of 1.9v (smoking :) ). I can do the same thing with Vmem. The dip at around 1.9v isn't a voltage transient; it is where I acknowledged the TGM alarm.


http://bellsouthpwp.net/n/o/noko/tgmvoltage.gif


The dynamic way of adjusting voltages has allowed me to fine tune for overclocking, which wouldn't be possible if I guessed at a given value. Plus if VDDQ is set too high it will indeed degrade overclockability.

Sorry everyone for getting off topic here :oops:.

Althornin
17-Mar-2003, 04:35
that's sweet!
i want one :)

MuFu
17-Mar-2003, 15:09
Heh, yeah that is sweet. :D

So you say it is the same type of memory? hummmm.

No, I'm saying the mem interface limitation is largely independent of BGA type. Testament to this is the fact that users with Hynix 3.6ns RAM are reaching very similar O/C's to those with Samsung 2.86ns BGA after flashing the Pro BIOS (330-350Mhz) and the fact that the 9800 Pro uses the same memory as the 9700 Pro yet is capable of higher memory clockspeeds.
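
To put the two speed grades in perspective, here is a small sketch converting the rated cycle time into a rated clock (frequency in MHz = 1000 / cycle time in ns). The helper name is made up; the only inputs are the 3.6ns and 2.86ns ratings mentioned above.

```python
def rated_clock_mhz(cycle_time_ns):
    """Rated memory clock in MHz for a given cycle time in nanoseconds."""
    return 1000.0 / cycle_time_ns


for label, cycle_ns in (("Hynix 3.6ns", 3.6), ("Samsung 2.86ns", 2.86)):
    print(f"{label}: rated for about {rated_clock_mhz(cycle_ns):.1f} MHz")
```

So the 3.6ns parts are rated for roughly 278 MHz and the 2.86ns parts for roughly 350 MHz. If both really do top out at 330-350MHz on this board, the slower-rated chips are running well past their rating while the faster ones gain nothing, which fits the argument that the limit sits somewhere other than the RAM itself.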

Anyway... on with the R400 speculation!!!

16 unified PS/VS pipes, 4 double-pumped TMUs (i.e. 16x0.25, hehe), 512-bit memory bus, dedicated programmable tessellation hardware, integrated Stargate, tea & coffee maker, free hooker etc etc...

MuFu.

Reverend
17-Mar-2003, 15:58
C'mon: Rampage today couldn't be a player anywhere...
Sorry to bring back "3dfx" into the thread but... What makes you say that? Back in those days (circa mid-2000), it was "DX9" (as "DX9" could be at the time)... what aspects of Rampage do you know for certain?

martrox
17-Mar-2003, 16:06
16 unified PS/VS pipes, 4 double-pumped TMUs (i.e. 16x0.25, hehe), 512-bit memory bus, dedicated programmable tessellation hardware, integrated Stargate, tea & coffee maker, free hooker etc etc...

MuFu.

Umm.... I'd have to say the hooker isn't "free", unless you mean she's "free" in the same context that nVidia was talking "free" FSAA with the NV3x's..... :wink:

MuFu
17-Mar-2003, 16:26
The difference being you can always pay extra for a hooker whereas nV's FSAA is a lost cause... :lol:

MuFu.

martrox
17-Mar-2003, 16:32
BA DUMP BUMP! :lol:

T2k
17-Mar-2003, 16:45
C'mon: Rampage today couldn't be a player anywhere...
Sorry to bring back "3dfx" into the thread but... What makes you say that? Back in those days (circa mid-2000), it was "DX9" (as "DX9" could be at the time)... what aspects of Rampage do you know for certain?

Let me ask you: w/ all my respect, Rev, do you seriously think Rampage could make a considerable hit on the market (I mean, in terms of performance) in THESE DAYS against R9700/9800 or GFFX? :shock:

Rampage would have been a big BANG in 2000, probably in 1st half of 2001 too - but it never came out.
Now it's a 2-year-old story, Rev. Design changed, playground changed, everything changed.
RIP.

T2k
21-Mar-2003, 17:01
The difference being you can always pay extra for a hooker whereas nV's FSAA is a lost cause... :lol:

MuFu.

:D LOL

Ilfirin
21-Mar-2003, 17:04
integrated Stargate

:lol:

Arun
21-Mar-2003, 20:23
Let me ask you: w/ all my respect, Rev, do you seriously think Rampage could make a considerable hit on the market (I mean, in terms of performance) in THESE DAYS against R9700/9800 or GFFX? :shock:

Rampage would have been a big BANG in 2000, probably in 1st half of 2001 too - but it never came out.
Now it's a 2-year-old story, Rev. Design changed, playground changed, everything changed.
RIP.

A 16-chip Rampage would beat the hell out of a GFFX :D
And anyway, the Rampage had better FSAA quality than the GFFX will ever have :P

More seriously however, the Rampage would have no chance against a GFFX, performance-wise.

I think it was leaked by someone that Rampage performance, with non final silicon & drivers, was only slightly slower than a Ti4600 in most cases!

So, I'd guess shipping chips/drivers would be on par with a Ti4600.
It would have been a winner even if it was released in H2 2001, not only in 2000 or in H1 2001, if that information is correct.


Uttar

Mulciber
21-Mar-2003, 22:29
I think the Rampage gets more and more powerful every time someone mentions it

Pete
21-Mar-2003, 22:46
It becomes more powerful the more worshippers it has. :D

All hail Rampage!

demalion
22-Mar-2003, 05:22
Wavey taunted me and all I got was this lousy sig
...

I'm just glad I don't drink milk often...

Ailuros
23-Mar-2003, 12:40
Back in those days (circa mid-2000), it was "DX9" (as "DX9" could be at the time)... what aspects of Rampage do you know for certain?

I don't recall anything in the Spectre and/or Fear rumoured specs that indicated anything higher than PS/VS1.1, with probably some sort of pre-sampled displacement mapping (my own guestimate) in the latter.

Add to that Overbright Lighting, effectively 13bits/RGBA. Anything I'm missing? Mojo was of course a totally different story.

And anyway, the Rampage had better FSAA quality than the GFFX will ever have .

For 4x sample MSAA for sure, due to its capability for 4xRGMS. I doubt it would have been "fillrate free" though, at least on Spectre.

Anyway does anyone have a clue more or less how their adaptive anisotropic algorithm worked? If there are any relations to the FX "balanced" mode, I don't think I'm that much interested after all... :roll:

Mulciber
23-Mar-2003, 13:53
Wavey taunted me and all I got was this lousy sig
...

I'm just glad I don't drink milk often...

At least someone thought it was funny :lol:

Arun
25-Mar-2003, 15:50
It becomes more powerful the more worshippers it has. :D

All hail Rampage!

Hehe, I can already imagine an army crying "All hail Rampage!" in the streets of Baghdad... That would be funny! :)

The following is a joke:
Anyone wanna petition with me to nVidia, so that they'd compare the NV40 specs to Rampage instead of current competition?
Oh, sure, it would humiliate the Rampage, but the goal is to get the real specs...


Uttar

duncan36
25-Mar-2003, 20:04
Well, I'm sure the Rampage was good on paper; on paper I'm sure Rampage and Fear totally destroyed Nvidia and would make 3dfx billions!
Intense self-deception is the beginning of the end for companies required to be innovative. I hope the GF FX debacle is enough to snap Nvidia out of their own brand of self-deception. Actually, scratch that: I hope they trip up one more time with the NV35. I think they still deserve one more.

T2k
28-Apr-2003, 19:53
Did we get some new update on R400?

Neutrality
28-Apr-2003, 22:02
:?:

Where did the MuFu thread about R390, R420 and R500 go?

Said too much?


-Neutrality-

MuFu
28-Apr-2003, 22:29
Yes. :oops:

MuFu.

Natoma
28-Apr-2003, 22:38
Yes. :oops:

MuFu.

If you're not under NDA how could they force you to take it down?

demalion
28-Apr-2003, 22:47
Well, some people don't have to be "forced" to do things, and might do them out of courtesy, and I'd place MuFu and the moderators here in that camp. Whether MuFu or someone else removed the thread, I'd guess courtesy..."being asked" instead of "being forced". Hmm...maybe not even "being asked" would be required, though I think it likely the case this time.

Luminescent
29-Apr-2003, 03:31
Well, anyways, thank you MuFu for trying to give us some inside info. 8)

Neutrality
29-Apr-2003, 03:56
Well, anyways, thank you MuFu for trying to give us some inside info. 8)

Indeed, it's always nice to get some inside info that no one else has. 8)

-Neutrality-

Slides
29-Apr-2003, 07:13
Inside info is good for the soul. :|

Deflection
29-Apr-2003, 07:42
Well, anyways, thank you MuFu for trying to give us some inside info. 8)

Indeed, it's always nice to get some inside info that no one else has. 8)

-Neutrality-

Hmm, I saw the post briefly and it didn't seem to say much more than is commonly speculated on. I guess one of the code names could be seen as surprising, but why bother removing the post if it's this well known around the industry already and the poster didn't break an NDA?

Neutrality
29-Apr-2003, 08:29
Well, anyways, thank you MuFu for trying to give us some inside info. 8)

Indeed, it's always nice to get some inside info that no one else has. 8)

-Neutrality-

Hmm, I saw the post briefly and it didn't seem to say much more than is commonly speculated on. I guess one of the code names could be seen as surprising, but why bother removing the post if it's this well known around the industry already and the poster didn't break an NDA?

Well, this was sort of a confirmation of some of the things people have been speculating about, but also some new info.

-Neutrality-

Lezmaka
29-Apr-2003, 09:00
Well, part of the post definitely wasn't something most people knew, and there are quite a few people over at Rage3D who still believe the R400 is coming out this year. And even if it was well known, maybe it's because someone asked him to, or he decided to take it down because the person who gave him the info got mad, etc.

elroy
29-Apr-2003, 15:09
This wasn't the only site where the info was posted. I won't post a link, cos it might be removed if I did, but it isn't very hard to figure out where else it could be.......

indio
29-Apr-2003, 15:58
maybe it was taken down because it had inaccurate information...

T2k
29-Apr-2003, 17:10
maybe it was taken down because it had inaccurate information...

:shock:
By that logic half of the internet should've been shut down...

MuFu
30-Apr-2003, 00:42
So anyway... this is now the R500 speculation thread. :lol:

Your thoughts? They must be shooting for full DX10-compliancy (or near as dammit) now, huh?

MuFu.

Testiculus Giganticus
30-Apr-2003, 01:09
Mmm, this is getting out of hand :) Imaginary delirium in progress, send the thought police to bring speculation to a stop :lol: Guys, R500 already? I'm starting to feel like a mammoth among you people :shock:

Lezmaka
30-Apr-2003, 01:29
So anyway... this is now the R500 speculation thread. :lol:

Your thoughts? They must be shooting for full DX10-compliancy (or near as dammit) now, huh?

MuFu.

So if R390/420/blah won't have ps/vs 3.0, and if R500 is shooting for DX10 level stuff, would that mean they're gonna skip ps/vs 3.0? I realize R500 would have support for it, but I mean a card that is designed for it.

Ailuros
30-Apr-2003, 04:03
PS/VS3.0 is part of dx9.0.

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/directx9_c/directx/graphics/reference/Shaders/VertexShader3_0/VertexTextures.asp

Lezmaka
30-Apr-2003, 04:14
PS/VS3.0 is part of dx9.0.

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/directx9_c/directx/graphics/reference/Shaders/VertexShader3_0/VertexTextures.asp

I know that, but no card supports it. I was asking if ATI is gonna skip right over PS/VS 3.0 and go to DX10 level?

Ailuros
30-Apr-2003, 04:18
Assuming there will be an R4xx part, why wouldn't it have PS/VS3.0?

Excuse the minor confusion here, but I obviously must have missed a few aspects of the rumour mill.

Lezmaka
30-Apr-2003, 04:37
In a thread that has been removed, someone stated that R390 is the same thing as R420 and that the feature set will be largely unchanged from R350.

elroy
30-Apr-2003, 07:22
And that they will go straight from R390 to R500. There's gotta be something in between though, imo. They must be aiming R500 for DX10 (unless they have changed that as well?), but DX10 isn't coming out until Longhorn, which is expected in 2005. So what are ATI going to have for at least a year to refresh their lineup?

DegustatoR
30-Apr-2003, 08:46
9 month product cycle, right?

R350 + 9 months = winter 2003-2004 - R390(420)?
R390 + 9 months = autumn 2004 - winter 2004-2005

Longhorn (and DX10) is expected in the end of 2004 or in early 2005. So, it makes sense...
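
The 9-month arithmetic above, as a quick sketch. It assumes R350 is counted from its March 2003 announcement and that the cycle is exactly nine calendar months; both are simplifications, and the helper name is invented for illustration.

```python
def add_months(year, month, delta):
    """Return (year, month) shifted forward by `delta` calendar months."""
    index = year * 12 + (month - 1) + delta
    return index // 12, index % 12 + 1


CYCLE_MONTHS = 9  # assumed product cycle length

r350 = (2003, 3)  # assumed starting point: R350 announced March 2003
r390 = add_months(*r350, CYCLE_MONTHS)
next_part = add_months(*r390, CYCLE_MONTHS)

print("R350:       %d-%02d" % r350)
print("R390/R420?: %d-%02d" % r390)
print("next part?: %d-%02d" % next_part)
```

That gives December 2003 and then September 2004, which lines up with the winter 2003-2004 and autumn 2004 windows guessed above.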

Though there are some troubles. First, I don't think that there are any DX10 specs that are close to final. How are you going to design a chip for that?

Second, there is NVIDIA... :) which will release hi-end chips on a different schedule:

NV35 in May 2003
NV40 (w/shaders 3.0) in winter 2003-2004 (so it will compete w/R390?..)
NV45 in summer 2004, I think (what's ATI gonna do about it?..)
NV50 around spring 2005...

It looks to me like ATI wants to make R400(500) the development platform for DX10 (again) and release it to market before NV50 (again), like it did with R300 vs NV30...

Arun
30-Apr-2003, 12:55
It looks to me like ATI wants to make R400(500) the development platform for DX10 (again) and release it to market before NV50 (again), like it did with R300 vs NV30...

Maybe, but if nVidia gets lucky, they might have a good 2004 year if the NV4x is a good architecture.
Let's hope it is :)


Uttar

T2k
30-Apr-2003, 17:35
So anyway... this is now the R500 speculation thread. :lol:

Your thoughts? They must be shooting for full DX10-compliancy (or near as dammit) now, huh?

MuFu.

Hmm, I doubt it.
AFAIK DX10 doesn't exist, even on scratch-level but the Marlborough-team is busy on something... ;)

EDIT: MuFu, topic renamed. :D

MuFu
30-Apr-2003, 18:51
Hmm, I doubt it.
AFAIK DX10 doesn't exist, even on scratch-level...

Well I assume in a year and a half it might be somewhat more than that. :)

MuFu.

T2k
30-Apr-2003, 19:16
Hmm, I doubt it.
AFAIK DX10 doesn't exist, even on scratch-level...

Well I assume in a year and a half it might be somewhat more than that. :)

MuFu.

OK but wouldn't that be too late for R500? It's already in design, isn't it?

Arun
30-Apr-2003, 19:28
Hmm, I doubt it.
AFAIK DX10 doesn't exist, even on scratch-level...

Well I assume in a year and a half it might be somewhat more than that. :)

MuFu.

OK but wouldn't that be too late for R500? It's already in design, isn't it?

Didn't the EETimes article mention a 2005 launch?


Uttar

T2k
30-Apr-2003, 19:55
Hmm, I doubt it.
AFAIK DX10 doesn't exist, even on scratch-level...

Well I assume in a year and a half it might be somewhat more than that. :)

MuFu.

OK but wouldn't that be too late for R500? It's already in design, isn't it?

Didn't the EETimes article mention a 2005 launch?


Uttar

They did, but IMHO it's a guess:
Graphics companies have plenty to tweak. "We create chips that contain more logic than a Pentium but must sell for an order of magnitude less. I don't know if this will be sustainable going forward but we are charging ahead," said Bob Feldstein, vice president of engineering for ATI's Marlborough, Mass., team, which is designing its R500 core, probably aimed at release in about 2005.

Two things, based on this article:

they're designing R500 - it's a fact (?)
aimed for 2005 - it's an estimation

MuFu
30-Apr-2003, 23:30
OK but wouldn't that be too late for R500? It's already in design, isn't it?

Yes, effectively since late '01 now. Better be good. :D

MuFu.

elroy
01-May-2003, 02:26
Geez, since late '01? It usually takes about 2 years to go from initial planning to market, doesn't it? Therefore, it could quite easily be on the market early next year (bloody hell!!!!). I thought it was supposed to be 0.09 um though, so the question is, who's going to manufacture it? Intel is only going to 0.09 um at the end of this year with Prescott, and they seem so far ahead in terms of fabbing compared to everyone else. So I'm assuming that 0.09 um won't be ready until the middle of next year, at a guess. Unless Intel is going to fab for ATi....... Thoughts?

asicnewbie
01-May-2003, 06:40
> Intel is only going to 0.09 um at the end of this year with Prescott,
> and they seem so far ahead in terms of fabbing compared to
> everyone else.

Well, that's not entirely accurate. All the major merchant-foundries (IBM, UMC, TSMC, TI, LSI, etc.) have demonstrated 'working silicon' at the 90nm node. IBM and UMC are pretty close to ramping up production 90nm silicon of Xilinx's FPGAs.

...

http://www.xilinx.com/xlnx/xil_prodcat_product.jsp?title=cost_device

Xilinx's page has a series of article-links regarding their push to 90nm technology. Each article has a slightly different bit of useful info.


[11/2002] Today's news represents a milestone achievement for the manufacturing collaboration between IBM and Xilinx commenced in March of this year. The agreement marked the first time IBM would manufacture high-volume parts for a foundry customer using its most advanced processes, which are normally used in high-end microprocessors, custom chips and memory products. IBM is currently manufacturing Xilinx's flagship Virtex-II Pro semiconductor products using a 130nm process on 200mm wafers at IBM facilities in Burlington, Vt. and on 300mm wafers at facilities in East Fishkill, N.Y.



[11/2002] “130nm will have to be done on 300mm because that is what the most up-to-date equipment is designed for. The people who say they can stay with 8in [200mm] will find that out,” said Roelandts.

“If you want to do 130nm with good yields, you have to do it on 300mm. On the hybrid [150/180nm] technology, the yield is now better [than 200mm]. For pure 150nm, it is at parity and continues to improve. Over the long term, it will be better.”



[12/2002] Using FPGAs to prove a new process technology is a first for IBM. SRAMs were once favored because they are based on an array of redundant cell structures, which helps in identifying defects. PowerPC microprocessors have also been used to spearhead process development because they fetch a high price and usually need the latest process technology to stay competitive in performance.


UMC is preparing to manufacture the line of Xilinx field programmable gate arrays (FPGAs) at its 300mm fabrication facility and has produced an FPGA test chip. At 90nm, Xilinx engineers can pack more transistors, layers, interconnect and product features into a single chip, reducing die size by 50 to 80 percent, compared to any competing FPGA solution. UMC's L90 process integrates nine layers of high-speed copper interconnect, 1.2V high performance transistors and low-k dielectric material into a single manufacturing process.


According to Xilinx, both IBM's and UMC's (300mm) 90nm lines will produce some of the same products, implying Xilinx targeted two foundries for at least one of its low-end product-lines. Xilinx's high-end FPGAs will remain with IBM.

But as NVidia learned, going from 'working silicon' to 'successful customer tape-out' could be a slam-dunk or a never-ending nightmare, depending on the customer's design. In terms of physical-design, FPGAs are so far removed from HDL-design (as in 'completely custom logic'), that these tech-demos are completely irrelevant to the casual foundry customer.

I doubt Intel will fab such a complex part for ATI. Recently, Intel opened a 'design services' division, which some industry analysts viewed as a potential baby-step into the merchant foundry arena. But 3-4 months Intel closed the group, and the next day IBM once again (loudly) trumpeted their own foundry capabilities and services. Intel's CPU-business keeps its fabs fully occupied, so I doubt they'd have any (worthwhile) excess to sell. They would make less money selling foundry capacity versus churning on Pentium4s and (gasp) Itaniums. IBM wanted the PowerPC to dominate the world... and *coincidentally*, they have excess (advanced) foundry capacity to sell. (Ok, that's kind of a trollish remark.)

http://www.eetimes.com/story/OEG20030428S0038 - Sony sets investment strategy for 65 nm, 300 mm. A paragraph hints that Sony's new fab could be used to build a lot of ASICs that it's now buying from outside suppliers.

rubank
01-May-2003, 10:41
What about NEC
http://www.necel.com/en/process/ux6.html
any special reason they´re not an option?

IIRC the Matrox G4xx line was fabbed at NEC.

(and what to think of their eDram tech?)

asicnewbie
02-May-2003, 06:14
> What about NEC
> http://www.necel.com/en/process/ux6.html
> any special reason they´re not an option?

Good question! I'm not really a foundry expert, I just regurgitate tidbits of info I hear from my coworkers. (Hopefully this practice will make me an ASICnewbie and not an asicNEWBIE.:D) Frankly, I'm in no better a position to judge one foundry's merit against another. I think you can rank the final wafer price based on the level of design-services available at the given foundry.

On the one-end of the design-services spectrum, you have TSMC/UMC/SMIC/Charter, with the overall lowest pricing per wafer.

At the other-end, there's IBM/LSI/TI/NEC/Toshiba (and others I'm sure), with extensive in-house design/finishing-services, and a correspondingly higher cost per die/area.

That's just a rough rough guideline. In truth, the industry is evolving, and it's not a dichotomy like I presented above. From what my coworkers say, the Japanese fabs (NEC, Toshiba, Fujitsu, etc.) as a group have tended to be more expensive than TSMC/UMC.

My newbie-analysis on the ATI/NEC pairing breaks down like this:
NEC's main advantages (more extensive in-house design-services, packaging/testing) are less helpful to 'elite'-fabless companies like ATI/NVidia. The offered design-services aren't needed, because ATI/NVidia *already* have large in-house back-end design teams to deal with the physical-aspect of ASIC-design (the gate-placement, interconnect routing, design-for-test methodology, packaging/substrate design.) If ATI wanted to trim engineering staff, then I guess the foundry design-services would let them dismiss their own back-end design team (no need for redundancy.) The other design-services (like preverified IP libraries) again aren't significant factors, because ATI/NVidia end up having to design all their own I/O cells and analog blocks (RAMDACs, TMDS, DDR I/O, AGP I/O.) Some of the foundry library I/O-cells (AGP, DDR) could potentially replace the equivalent I/Os on ATI's GPUs, but stuff like RAMDACs and dot-clock generators is very application specific, and it's better for ATI to design that themselves than to rely on someone else.

BUT...NEC is fabbing the Gamecube graphics ASIC for Nintendo, right? The implication is that ArtX (now part of ATI) has a good working relationship with NEC. And the ArtX engineering team has already been integrated very well into ATI's engineering teams. Thus, if ATI were to hypothetically team up with NEC, this pre-existing business-relationship could be a key reason. From an engineering perspective, switching foundries is normally a HUGE risk, and for ATI to pick NEC would mitigate much of this risk and really give ATI a kickstart (compared to NVidia+IBM) in terms of ATI taping-out a design with NEC. (Since ATI is already familiar with NEC's design-flow methodology.)

ATI and Intel just seems unlikely to me. For one thing, Intel is a direct competitor in the PC-graphics market (albeit in the low-end, integrated-graphics market.) Yet I recall some joint agreement between the two companies for (low-end) integrated PC-graphics. So on the one hand, I see a potential conflict of interest between the foundry (Intel) and customer (ATI.) On the other hand, maybe the agreement includes an understanding that Intel won't suddenly decide to jump into the GPU-market (like when they acquired Real3D.)

This above point is VERY important. LSI used to be the preferred foundry for fibre-channel (and gigabit ethernet) companies, because they were first to market with a (CMOS) gigabit transceiver core. In the olden days, the transceiver was a separate IC, on the expensive GaAs (Gallium Arsenide) process. As CMOS reached submicron, the top designers could target CMOS, and LSI demonstrated a working/verified core for their standard-cell ASIC products. Emulex, Qlogic, and others flocked to LSI for this competitive edge. (Adaptec had an analog design team, and developed their own CMOS transceiver.) Then, after fabbing and testing the production ASICs for its customers, LSI turned around and released its own SCSI and Fibre-channel host adapters, entering into direct competition with its customers. Emulex has already decided to switch foundries, and rumor is that Intel will eventually fab all of Emulex's future ASICs, starting at the 90nm node. (There's a press release on Emulex's website, but it only confirms a joint venture for a single SATA product-line based on an Intel-supplied RISC CPU.) Qlogic is actively looking for a different foundry.

When TSMC keeps touting their 'pure-play foundry' model, that's exactly the scenario they're promising to avoid. UMC has non-trivial investments and ownerships in subsidiary chip companies. (For example, Mediatek. Mediatek basically killed Oak Technologies in the CD-RW controller market. UMC also dabbled in x86 compatible CPUs, made semiconductor memories, and chipsets at one time.) I think IBM, NEC, Toshiba are all classified as integrated device mfgs (IDMs), which opens the possibility (however remote) that any of them could enter a customer's target market.

Personally, I don't understand the rationality behind this fear. Apparently, it's understood (and accepted) that a pure-play foundry will fab chips for two customers in competition with each other. Yet for the foundry itself to be a competitor is frowned upon. My coworker says it's a 'perception of conflict of interest': the worry that a foundry somehow has access to the customer's sensitive design-properties through its handling of the customer's ASICs. I think it's an irrational fear, because there hasn't been a recent documented case of design-theft at a foundry.

Sorry I've gone off on a tangent...just trying to show that the art of choosing a fab goes beyond mere price/performance. There are many other factors, both political and technical, logical and irrational, that enter the decision process.

Nebuchadnezzar
02-May-2003, 11:54
Doesn't IBM fab Flipper?

Gubbi
02-May-2003, 11:58
Nope, NEC does.

Cheers
Gubbi

rubank
02-May-2003, 14:23
asicnewbie,

you´ve made some good points. Thx for the enlightenment.

Mariner
02-May-2003, 15:33
I believe IBM fabs Gecko, the CPU of the GameCube?

indio
02-May-2003, 17:17
That's Gieco :wink:

Nebuchadnezzar
02-May-2003, 17:25
That's Gieco :wink:

No, it's Gecko.

jb
02-May-2003, 17:51
That's Gieco :wink:

LOL good one..love those ads.

Pete
03-May-2003, 03:05
Geico?

demalion
03-May-2003, 03:11
OK, it is a US commercial for an insurance company named "Geico". The ad's "running gag" is a computer generated gecko who gets mistakenly called by people trying to reach the insurance company (gecko and the way the company says "Geico" sound similar).

A pretty decent little joke here, though I doubt it is the first time it has occurred to someone in a GameCube discussion. If you'd seen the commercials, you'd probably have chuckled as well, or at least smiled.

Pete
03-May-2003, 09:18
Only thing I was smiling at is how they fumbled the name. :D

T2k
13-May-2003, 06:21
Interesting. Thanks for the info. :)

T2k
13-May-2003, 06:23
MuFu, is Loci the next? I mean, mentioned by Dave?

elroy
13-May-2003, 13:00
I don't think so T2k (although I'm not in the loop like these other guys). I think R360 (if we use the Inquirer name) is the chip that Dave is talking about, and should be out in around 2 months time. Loci is also known as R390 or R420. This should be out around Comdex time. It is supposed to go head to head with NV40, if my memory serves me correctly.

Typ55
14-May-2003, 02:35
I didn't read the whole thread, but I can't understand why some guys think this test was unfair just because the FX5900 didn't have to use ARB2. Isn't ARB2 everything that ATI can do? I heard that these are the features the R350 has, and that they later became the ARB2 standard. The FX5900 just has a specific path because it has more features than ATI's card. A specific ATI path wouldn't increase Radeon performance, because ARB2 is effectively an ATI path.

Is that right?

MuFu
14-May-2003, 03:14
It's only "unfair" as far as the 5800 Ultra goes - we can pretty much assume that's how it'll perform with the final build. It'll use the "NV30" path by default, which compromises precision but is probably the most balanced choice for the NV30 in terms of performance/IQ.

Although in these comparisons the NV35 uses the "NV30" path as well, its architectural advances compared with the NV30 allow it to run the ARB2 path at the same speed (or thereabouts). We can therefore assume that these results are a good indication of how the NV35 will perform with the final build using the ARB2 path.

ARB2 (not to be confused with OGL 2.0) is vendor-independent, but was particularly suited to the capabilities of the R300/R350. Now nVidia have "caught" up in this respect and I assume a proprietary path for either card would be of little benefit.

MuFu.

Doomtrooper
14-May-2003, 03:28
On a NV 35 ..yes

On all the other FX line-up..no.

And as you are aware the money is not on the high end.

Ailuros
14-May-2003, 04:36
ARB2 (not to be confused with OGL 2.0) is vendor-independent, but was particularly suited to the capabilities of the R300/R350. Now nVidia have "caught" up in this respect and I assume a proprietary path for either card would be of little benefit.

I´m confused now...I know the R3xx´s use FP24; so what kind of precision is NV35 using after all?

T2k
01-Jun-2003, 21:01
ARB2 (not to be confused with OGL 2.0) is vendor-independent, but was particularly suited to the capabilities of the R300/R350. Now nVidia have "caught" up in this respect and I assume a proprietary path for either card would be of little benefit.

I´m confused now...I know the R3xx´s use FP24; so what kind of precision is NV35 using after all?

As of today, we see the point... ;)

BTW, MuFu pointed out something interesting in DH's Terry M. interview:

Zardon: Any plans for revamped color controls? I.e., in game gamma controls explained and working, new and easier to understand standard gamma adjustments with mouse over or "What's This?" explanations on how they are best used, maybe some pre-set sliders or functions like "Boost saturation", "Smooth Bright Gradients", "Smooth Dark Gradients" that users could use more intuitively.

Terry: Sure, we have provided a new color control recently. We don’t have any further improvements to that page in our roadmap but of course if that’s what users want - then we will add something to the roadmap. There will be radical changes coming in roughly a year to all our control panel pages, so we may wait until then to revamp everything.


Hmm... if it signals R500, where is R400 in the timetable? End of this year?

AlphaWolf
01-Jun-2003, 21:31
ARB2 (not to be confused with OGL 2.0) is vendor-independent, but was particularly suited to the capabilities of the R300/R350. Now nVidia have "caught" up in this respect and I assume a proprietary path for either card would be of little benefit.

I´m confused now...I know the R3xx´s use FP24; so what kind of precision is NV35 using after all?

As of today, we see the point... ;)

BTW, MuFu pointed out something interesting in DH's Terry M. interview:

Zardon: Any plans for revamped color controls? I.e., in game gamma controls explained and working, new and easier to understand standard gamma adjustments with mouse over or "What's This?" explanations on how they are best used, maybe some pre-set sliders or functions like "Boost saturation", "Smooth Bright Gradients", "Smooth Dark Gradients" that users could use more intuitively.

Terry: Sure, we have provided a new color control recently. We don’t have any further improvements to that page in our roadmap but of course if that’s what users want - then we will add something to the roadmap. There will be radical changes coming in roughly a year to all our control panel pages, so we may wait until then to revamp everything.


Hmm... if it signals R500, where is R400 in the timetable? End of this year?

My money is on year end for R400, but I am not convinced that R500 is the reason for the revamp next summer.

Chalnoth
01-Jun-2003, 22:55
I´m confused now...I know the R3xx´s use FP24; so what kind of precision is NV35 using after all?
Apparently the NV35 has dramatically improved FP16 performance, up to near FX12 performance levels (according to Uttar's benches), so the NV35 should use a mix of FP16 and FP32.
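
To put rough numbers on the precision question, here is a minimal sketch comparing the formats mentioned in this thread. It assumes the commonly cited layouts (10 mantissa bits for FP16, 16 for FP24, 23 for FP32), and it uses Python's IEEE half-precision packing as a stand-in for the hardware's FP16; the names `MANTISSA_BITS`, `epsilon` and `round_to_fp16` are made up for illustration.

```python
import struct

# Assumed mantissa (fraction) widths for the formats discussed in the thread.
MANTISSA_BITS = {"FP16": 10, "FP24": 16, "FP32": 23}


def epsilon(fmt):
    """Gap between 1.0 and the next representable value in the given format."""
    return 2.0 ** -MANTISSA_BITS[fmt]


def round_to_fp16(x):
    """Round a Python float to IEEE half precision and convert it back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


if __name__ == "__main__":
    for name in ("FP16", "FP24", "FP32"):
        print(name, "step near 1.0:", epsilon(name))
    # A small offset that FP16 cannot hold next to 1.0 is simply rounded away.
    print("1 + 2^-12 in FP16:", round_to_fp16(1.0 + 2.0 ** -12))
```

The step near 1.0 is what shows up as banding in long shader chains, which is why mixing FP16 and FP32 per instruction matters more than the headline format.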

Of course, the problem is that ARB_fragment_program still doesn't allow instruction-level control over precision, but instead has a global precision hint. It will be very hard for drivers to properly decide which precision to use...that is a job best left to the shader writer.
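
For anyone who hasn't looked at the extension, a rough sketch of what "global precision hint" means in practice: the hint is a single OPTION line at the top of the program text, so every instruction below it is covered by the same request. The helper name `build_program` and the tiny texture-modulate shader are only illustrative, not taken from any real engine.

```python
# Hypothetical helper that emits a small ARB_fragment_program as text.
# The two OPTION strings are the precision hints defined by the extension;
# everything else here is an illustrative assumption.
HINTS = {
    "fastest": "OPTION ARB_precision_hint_fastest;",
    "nicest": "OPTION ARB_precision_hint_nicest;",
}


def build_program(hint=None):
    """Return program text; the hint, if any, applies to the whole program."""
    lines = ["!!ARBfp1.0"]
    if hint is not None:
        lines.append(HINTS[hint])
    lines += [
        "TEMP texel;",
        "TEX texel, fragment.texcoord[0], texture[0], 2D;",
        "MUL result.color, texel, fragment.color;",
        "END",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_program("fastest"))
```

Note there is no way in this text format to say "do the TEX at full precision but the MUL at reduced precision"; the driver gets one hint and has to guess the rest.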

rwolf
03-Jun-2003, 05:23
Geez, since late '01? It usually takes about 2 years to go from initial planning to market, doesn't it? Therefore, it could quite easily be on the market early next year (bloody hell!!!!). I thought it was supposed to be 0.09 um though, so the question is, who's going to manufacture it? Intel is only going to 0.09 um at the end of this year with Prescott, and they seem so far ahead in terms of fabbing compared to everyone else. So I'm assuming that 0.09 um won't be ready until the middle of next year, at a guess. Unless Intel is going to fab for ATi....... Thoughts?

This should answer your question...

http://www.xbitlabs.com/news/other/display/20030602053212.html

T2k
10-Jun-2003, 21:55
http://www.maximumpc.com/reprints/reprint_2003-06-07.html

"What will nVidia and ATI release in their next refresh cycles?

We could tell you shocking details, but then we’d have to kill you. ATI’s next product release should be the R400, which will include a substantial number of new features and silicon on the chip."

Wow... what a prophecy... Dodona... ;)

Josiah
11-Jun-2003, 07:24
isn't the 360 next?

T2k
13-Aug-2003, 15:42
So... no new info regarding R400/420/whatever-it-is-by-the-end-of-this-year? :)

Xbit reports ATI has confirmed it's on 130 nano:
http://www.xbitlabs.com/news/video/display/20030812141326.html