Why did ATI ditch the R300 architecture?

neliz said:
X2 was about the shadows, what about X3?
Space seems ripe for HDR. :cool:

Edit: Hubert, surely staring at the sun from Earth in the daytime is nothing compared to staring at the sun from the bleak blackness of space? :)
 
You are not going to see a difference between SM2 and SM3 unless the developers add effects that take advantage of it.

You'd be hard pressed to come up with something in SM3 that just won't run on SM2. More likely the graphics will have some extra effects added, like foam and not just waves breaking on the beach, etc. Not because SM2 hardware can't do it, but because it's not likely to be powerful enough to run that many shaders at once. So developers simply have levels of what they allow to be shown based on the spec you support. It could be forced on SM2 hardware 99% of the time.
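Something like the old D3D9 effect-file pattern is what I have in mind — a rough, hypothetical sketch (all names and tiers invented), where the app validates techniques at load time and picks the best one the card accepts:

// Hypothetical .fx sketch: one water effect authored at three tiers.
// SM2 cards just get fewer layered effects, not a different game.
float4 psWaves(float2 uv : TEXCOORD0) : COLOR          { return float4(0.0, 0.15, 0.30, 1.0); } // base waves (placeholder)
float4 psWavesFoam(float2 uv : TEXCOORD0) : COLOR      { return float4(0.1, 0.25, 0.40, 1.0); } // + foam (placeholder)
float4 psWavesFoamSpray(float2 uv : TEXCOORD0) : COLOR { return float4(0.2, 0.35, 0.50, 1.0); } // + spray (placeholder)

technique OceanLow    { pass P0 { PixelShader = compile ps_1_1 psWaves(); } }
technique OceanMedium { pass P0 { PixelShader = compile ps_2_0 psWavesFoam(); } }
technique OceanHigh   { pass P0 { PixelShader = compile ps_3_0 psWavesFoamSpray(); } }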

At least that’s my take.
 
neliz said:
X2 was about the shadows, what about X3? X2 was heavily optimized for nV40's architecture, the shadowing etc.
I did not notice any shadow casting in X3: Reunion. It is pretty much about using shiny, reflective metallic surfaces for everything; sectors have a single light source (their star), and weapon fire, engine and beacon lights do not affect surfaces. Apparently there is a lack of dynamic lighting, so it pretty much reproduces the harsh single-light-source contrast one would experience orbiting Earth. I was wondering about it myself; I did not play X2, but I've seen pictures, and ships cast shadows on themselves.

There is also a glow effect, which gives lights a halo. I am sure it is not HDR; it's probably a PS effect, and it can be turned on and off regardless of the shader model used.
Don't get me wrong, the lack of shadows is not something you'd notice or miss.

The game looks very good, mostly because the artwork is just very good, with models and textures being very detailed. Oh, and of course everything is bump mapped (normal mapped?).
 
Pete said:
Space seems ripe for HDR. :cool:
Elaborate, please.

Edit: Oops, it took me a while to realize your smiley was an actual link.

PS. I can see HDR being used in space, too. Why shouldn't we have dynamic range in space? Lights are lights, and lacking an atmosphere, maybe the bloom effects would be missing, too. :)
 
Hubert said:
PS. I can see HDR being used in space, too. Why shouldn't we have dynamic range in space? Lights are lights, and lacking an atmosphere, maybe the bloom effects would be missing, too. :)
Well, bloom effects come about because the frame is still output on a low dynamic range device. So they're there in an attempt to approximate the effect of bright lights blinding you. Of course it won't look exactly like it does in real life, because it can't actually be that bright. But game developers are going to do the best they can until/unless we get some real HDR displays.
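The usual trick is a bright-pass that gets blurred and added back over the tone-mapped frame. A rough, hypothetical HLSL sketch of just the bright-pass (names invented; the blur and recombine passes are omitted):

// Hypothetical bright-pass: extract the energy the display can't show
// (everything above 1.0); blurred and added back later, it makes bright
// lights appear to bleed and blind you.
sampler2D sceneMap;   // assumption: the scene was rendered to an FP16 HDR target

float4 psBrightPass(float2 uv : TEXCOORD0) : COLOR
{
    float3 c = tex2D(sceneMap, uv).rgb;
    float3 excess = max(c - 1.0, 0.0);  // the part above displayable white
    return float4(excess, 1.0);
}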
 
Hubert said:
PS. I wonder if SM 2.0b and SM 3.0 are not both included for the High setting. That would make SM 2.0a the Medium and SM 1.1 the Low one.
High is PS2.0 as far as I can tell (PS2.0b on my X800XT), with medium PS1.4 and low PS1.1.

I think PS1.1 looks better, overall, than PS1.4. PS1.4 suffers from "effects" that look half-arsed; in fact they look like broken effects.

Jawed
 
Mintmaster said:
R300 shone not only versus NV30, but was a huge improvement over the previous gen. It didn't even have twice the transistors of R200, but had like 3 times the practical shading power, in FP24 to boot, amazing AA, better AF, and the list goes on.
R300 was a huge improvement over the previous gen because R200 was rather mediocre. Now R520 is succeeding the R300 architecture, which was very good in its time. Obviously, that makes an equally big step almost impossible.
And of course you can make improvements like going to MSAA with FB compression or PS2.0 only once. R300 doubled memory bandwidth over R200. It seems to me that this wasn't an option for R520, and that has nothing to do with R520's shader architecture. R520 gives you PS3.0 with good dynamic branching, FP32, better AF, AVIVO, and more. That's a big step as well.

R520 had double the transistors of R420 and was maybe 30% faster overall.
That's probably about how much faster R300 was than NV25 when AA/AF was disabled. The real difference only became apparent when you used the main features of R300: PS2.0, AA, and AF.
If you do the same with R520, it can shine as well. And while DX9 wasn't even there when R300 came out, we already have some games taking advantage of PS3.0 now.

You may think a cheap PS3.0 wouldn't sell well, but remember that it would perform much faster. That sells a lot more than claiming you have the most advanced design. Look at NV15 vs R100, or NV25 vs. R200. Performance sells way more than technology for the mass market.
The thing is, I do not believe it would perform much faster. Since you said I shouldn't take your suggestions too literally, it seems to me the only thing you are proposing is to throw out the "good branching" and put in "cheap branching" and "more pipelines" instead. I don't believe this would be enough for even going from 16 to 24 (if you're suggesting just adding ALUs, then it would be better to take R580 for comparison). And it does nothing to bandwidth whatsoever, besides increasing demand.
 
You guys are not paying attention to my points, so this'll be my last post on the topic.

SugarCoat said:
yes and almost 2/3 of it was used getting the chip up to date with SM3.0 and the memory controller, which in itself was a long-term investment more than anything.
Again, the memory controller is 8-10% of the die space. It looks like more, but measure it. Getting the chip up to date with PS3.0 didn't take much die space. PS3.0 with fast dynamic branching did, though. I gave proof straight from sireric, who knows more about R5xx than anyone on any forum.

Xmas said:
R300 was a huge improvement over the previous gen because R200 was rather mediocre.
Was the Geforce4 Ti4600 mediocre? The jump to R300 was still enormous if you use that as a reference, so your point is moot.
R300 doubled memory bandwidth over R200.
Look at the 9500PRO benchmarks. Pixel shading was almost as fast as the 9700, and gaming wasn't far behind either.
R520 gives you PS3.0 with good dynamic branching, FP32, better AF, AVIVO, and more. That's a big step as well.

That's probably about how much faster R300 was than NV25 when AA/AF was disabled. The real difference only became apparent when you used the main features of R300: PS2.0, AA, and AF.
If you do the same with R520, it can shine as well. And while DX9 wasn't even there when R300 came out, we already have some games taking advantage of PS3.0 now.
Your arguments here are very weak. Of all the things you mentioned about R520, only dynamic branching is the big die space eater, and the topic of this thread is pixel shader size. And none of those points sell as well as performance. R300 made a huge leap in PS1.x performance as well, and AA was part of mainstream benchmarking for a while. The games using so-called SM3.0 right now are pretty much only using FP blending, so they're taking far less advantage of R520's hardware than PS1.x games did of R300's.

I love R520 (and R580 even more), and both architecturally and technologically it is just as big a leap as R300 was. From the point of view of economics and competitiveness, though, it's not even close to the leap that R300 made.

The thing is, I do not believe it would perform much faster. Since you said I shouldn't take your suggestions too literally, it seems to me the only thing you are proposing is to throw out the "good branching" and put in "cheap branching" and "more pipelines" instead. I don't believe this would be enough for even going from 16 to 24 (if you're suggesting just adding ALUs, then it would be better to take R580 for comparison). And it does nothing to bandwidth whatsoever, besides increasing demand.
16 to 24 only? R300 shader pipes, including a ROP and texture unit, were ~5M transistors (see R300/RV410->R420). R520->R580 shows FP32 shader units (w/o a texture unit) are ~2M transistors. ATI could easily have produced a 32-shader, 32-texture unit, 16-ROP part with under 300 million transistors. Beef up the mini-ALU, or add another stage, and it would easily outperform R580 too when dynamic branching isn't involved. Again you mention bandwidth, but like I said, the memory controller is not huge at all.
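Rough arithmetic, using the commonly quoted transistor counts (my figures, not from this thread: RV410 ≈ 120M, R420 ≈ 160M, R520 ≈ 321M, R580 ≈ 384M):

R420 vs. RV410: (160M - 120M) / 8 extra pipes ≈ 5M per pipe (ALU + TMU + ROP)
R580 vs. R520: (384M - 321M) / 32 extra units ≈ 2M per FP32 shader unit (no TMU)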



My central point, supported by ample evidence, is this: ATI made a big die space commitment to make dynamic branching fast. That's the sole reason ATI is behind NVidia in performance per clock per transistor.
 
Just as NVIDIA paid for the mistakes of NV30, the knowledge gained allowed them to get a performant FP32 chip out later, namely NV40. In the same way, ATi's hiccup with R520 (it being delayed) sets the tone for future ATi graphics cards.
When we go to unified shaders and Vista, ATi will have the foundation for good dynamic branching and NVIDIA will be playing catch-up.

As to economics, Mintmaster, I think ATi are going to suffer not because the R580 is so huge in die area, but because they are the underdogs now: NVIDIA is seen as the market leader (for the 6800-7800 series), so ATi have to cut their margins compared to NVIDIA to compete.

If ATi can manage to execute well in the next cycle then they will be on the road to grabbing back marketshare and increasing their margins.
 
Mintmaster said:
Was the Geforce4 Ti4600 mediocre? The jump to R300 was still enormous if you use that as a reference, so your point is moot.
It became enormous when the R300's strengths were used. It wasn't enormous right from the start. But I repeat myself...

Look at the 9500PRO benchmarks. Pixel shading was almost as fast as the 9700, and gaming wasn't far behind either.
Actually the 9500Pro supports my point. It was the same chip, but significantly slower because it frequently hit the bandwidth limit.

Your arguments here are very weak. Of all the things you mentioned about R520, only dynamic branching is the big die space eater, and the topic of this thread is pixel shader size. And none of those points sell as well as performance. R300 made a huge leap in PS1.x performance as well, and AA was part of mainstream benchmarking for a while. The games using so-called SM3.0 right now are pretty much only using FP blending, so they're taking far less advantage of R520's hardware than PS1.x games did of R300's.
And a big die space eater, relatively speaking, for R300 was FP24, which was completely useless until games using PS2.0 came out.
But anyway, that's not the point. It's not about what R300 did back then, but what the architecture/philosophy/whatever would do now, compared to R520.

Which games that only use FP16 blending are you referring to?


16 to 24 only? R300 shader pipes, including a ROP and texture unit, were ~5M transistors (see R300/RV410->R420). R520->R580 shows FP32 shader units (w/o a texture unit) are ~2M transistors. ATI could easily have produced a 32-shader, 32-texture unit, 16-ROP part with under 300 million transistors. Beef up the mini-ALU, or add another stage, and it would easily outperform R580 too when dynamic branching isn't involved. Again you mention bandwidth, but like I said, the memory controller is not huge at all.
What does the memory controller have to do with it? What's the point of having 32 TMUs when you simply can't feed them? There is a reason why R580 doesn't have more TMUs either.
I certainly do not believe replacing the branching with a cheaper version would have saved enough space to basically double the PS architecture like you suggest and still use up 20M less transistors.

My central point, supported by ample evidence, is this: ATI made a big die space commitment to make dynamic branching fast. That's the sole reason ATI is behind NVidia in performance per clock per transistor.
It certainly is an important part, but most likely not the sole reason.
 
Xmas, AA was the GF4's strength prior to R300. R300's strengths were everything except multitexture rate (which wasn't a weakness anyway), including PS 1.1, which used to be GF4's territory.

The 9500Pro does not prove your point - look here, here, here, etc. It's ~10% behind the 256-bit 9700 on average, and miles ahead of previous gen. R300 gave immediate benefits to all games when given similar bandwidth and clock speed to NV25 & R200. Likewise with G70 vs. NV40. Not so with R520 vs. R420 (see X1800XL vs. X800XT).

I doubt FP24 was a big die space eater in R300. It had double of almost everything compared to R200 and then some, yet was less than double its size.

For games I'm talking about nearly every HDR game. The reason developers didn't support HDR on R3xx/R4xx/NV3x is lack of FP blending, not lack of SM3.0. Almost nothing out there is really using R5xx's small batch size, and won't for a while, IMO.

Regarding texture bandwidth, it's not as big a factor as you, Jawed, and others think. I've seen actual data gathered from hardware, and Chalnoth is right in that there's plenty of room left for improvement in texturing speed. Why else would the 7800GT fare so well against the 6800U?

If you don't believe me about being able to fit 32 pipes (without ROPs, remember) in R520's size, then I can't do anything to convince you. It's not simply "replacing the branching with a cheaper version". You can engineer the shader pipeline and dispatcher in an entirely different way without this design objective, as sireric said.
 
I hesitate to ask this question, lest I reveal my profound ignorance of shader programming... But anyways.

Regarding dynamic branching. Would it be feasible (or even desirable) for ATI's driver to compile shaders that use static branching into internal code that uses dynamic branching instead? Perhaps just using a heuristic such as the length of code within the branch to determine/guess whether it might benefit from using dynamic branching. Or am I completely misunderstanding how all this stuff works? If this were actually possible and do-able, then that might provide additional justification for the die-space used to implement fast dynamic branching.
 
Well, the only difference between static and dynamic branching is whether the condition is stored in constant registers or in registers that can be changed during execution. How exactly the hardware implements static branching is entirely up to the shader compiler: is it cheaper for the hardware to upload a new shader when the condition changes? Or is it cheaper to just do dynamic branching?

I'm sure this is a question that has been investigated quite extensively by both IHVs.
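To make the distinction concrete, a hypothetical HLSL sketch (names invented); the only thing that differs is where the condition comes from:

// "Static": the condition is a boolean constant register, fixed for the
// whole draw call, so the driver could just compile two shader variants.
// "Dynamic": the condition is computed per pixel, so real flow-control
// hardware (or heavy predication) is needed.
sampler2D baseMap;
sampler2D detailMap;
bool useDetail;                         // set once per draw call by the app

float4 psStatic(float2 uv : TEXCOORD0) : COLOR
{
    float4 c = tex2D(baseMap, uv);
    if (useDetail)                      // static: constant-register condition
        c *= tex2D(detailMap, uv * 8);
    return c;
}

float4 psDynamic(float2 uv : TEXCOORD0) : COLOR
{
    float4 c = tex2D(baseMap, uv);
    if (c.a > 0.5)                      // dynamic: per-pixel condition
        c *= tex2D(detailMap, uv * 8);
    return c;
}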
 
I believe I understand. So, it sounds like the shader compiler could conceivably compile a model 2.0 shader to utilize the model 3.0 hardware resources (i.e. dynamic branching) if it determined that it would ultimately run faster that way. Or, to paraphrase, fast hardware dynamic branching could benefit shader code that does not explicitly use it.
 
Mintmaster, maybe you are right and I am just too optimistic about how fast developers will pick up PS3.0. It has been around for a long time now, and NVidia has created more than a tiny installed base, and often this is the decisive factor for when features get implemented. Also, I don't think it's that hard for them to go through all the (longer) shaders and find the most obvious opportunities to skip some code. I haven't written many shaders, but almost all of them that had more than four lines could have used dynamic branching.

btw, R200 had 8 TMUs and, AFAIK, 8 ALUs, but they were arranged in one quad-pipeline. The same is true for NV2x.

PurplePigeon said:
I believe I understand. So, it sounds like the shader compiler could conceivably compile a model 2.0 shader to utilize the model 3.0 hardware resources (i.e. dynamic branching) if it determined that it would ultimately run faster that way. Or, to paraphrase, fast hardware dynamic branching could benefit shader code that does not explicitly use it.
There is no static branching in PS2.0, only in VS2.0.
When there are no branches in the shader code, it is almost impossible for the compiler to insert any in a meaningful way. Sure, it could check if one of the operands of a mul is zero and not calculate the other in that case. But it has no way of knowing how frequently that happens, if ever. So the compiler could end up inserting one if for every multiplication, and instead of speeding up it could slow down everything.

OTOH, if flow control is implemented in a way that allows e.g. one "free" if(x==0) per cycle, this could indeed be useful. It increases code size, however.

There is one thing that could be turned into a dynamic branch easily: alpha test.
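For instance (a hypothetical sketch, not how any particular driver actually does it):

// Alpha test recast as a dynamic branch: pixels that fail the test are
// killed up front, and hardware with fast dynamic branching can then
// skip the lighting work below instead of computing and discarding it.
sampler2D baseMap;
sampler2D normalMap;

float4 psFoliage(float2 uv : TEXCOORD0, float3 lightDir : TEXCOORD1) : COLOR
{
    float4 base = tex2D(baseMap, uv);
    if (base.a < 0.5)                   // the alpha test, as a branch
        clip(-1);                       // kill this pixel
    float3 n = normalize(tex2D(normalMap, uv).xyz * 2 - 1);
    float ndl = saturate(dot(n, normalize(lightDir)));
    return float4(base.rgb * ndl, base.a);
}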
 
Mintmaster said:
Xmas, AA was the GF4's strength prior to R300. R300's strengths were everything except multitexture rate (which wasn't a weakness anyway), including PS 1.1, which used to be GF4's territory.

The 9500Pro does not prove your point - look here, here, here, etc. It's ~10% behind the 256-bit 9700 on average, and miles ahead of previous gen. R300 gave immediate benefits to all games when given similar bandwidth and clock speed to NV25 & R200. Likewise with G70 vs. NV40. Not so with R520 vs. R420 (see X1800XL vs. X800XT).

Since this whole debate is somewhat based on could-haves or might-have-beens... assume for a second that R200 had had X more efficiency, multisampling AA support and what not. The gap between R200 and R300 would automatically have been smaller.

If you focus exclusively on ATI's roadmaps and design decisions (and, from the other side, on what NVIDIA managed over the past years), it might be easier to keep things apart.

Was NV4x a quantum leap in performance over NV3x because of added units and/or SM3.0 support, or because it didn't inherit NV3x's weaknesses? If I draw a parallel between R2xx and R3xx, I don't think the answer is that much different in the end.

ATI had, in my mind, a very clear vision for DX9.0 including SM2.0 and SM3.0 from the very beginning. They did have to abort "R400" development at some stage for whatever reason, whereby a significant amount of its elements probably made it into today's console and PC designs. If I assume that that old R400 design was in fact a USC, then I have R420 as some sort of "gap-filler" based on former R3xx basics, a SM3.0 USC for the console, and a PC SM3.0 solution with separate PS/VS units.

If all that hadn't occurred and ATI had had a SM3.0 design to follow R300, I don't see a single reason why it would have ended up any slower than R420; the only other major difference would have been dramatically larger chip complexity, and we'd also have seen an SM3.0 refresh of a SM3.0 refresh until the D3D10 GPUs. Quantum leaps in performance in such a case between a 2004 and a 2005 GPU, irrespective of compliance? I don't think so.
 
erm .. I thought Dave said the R400 became the R500 .. and the R500 tech is now the Xenos .. and part of that tech is in the R580.

The R420 tech comes from the R300.

At least that's what I understood Dave as saying.

US
 
Unknown Soldier said:
erm .. I thought Dave said the R400 became the R500 .. and the R500 tech is now the Xenos .. and part of that tech is in the R580.

The R420 tech comes from the R300.

At least that's what I understood Dave as saying.

US

I don't think I implied or said anywhere that R400 design elements were lost, rather the contrary.
 
Xmas said:
OTOH, if flow control is implemented in a way that allows e.g. one "free" if(x==0) per cycle, this could indeed be useful. It increases code size, however.
Actually, the more I think about it, the more sense it makes. If ATI's flow control unit can execute one FC op per clock and is capable of jz/jnz, that could potentially bring huge savings at no runtime cost, just more work for the compiler.

Any sequence like
a = simpleFunc(x);
b = incrediblyComplexFunc(y);
c = a * b;

can be rewritten as
a = simpleFunc(x);
if(a) b = incrediblyComplexFunc(y);
c = a * b;

There is one thing that could be turned into a dynamic branch easily: alpha test.
And kill, of course.
 
Xmas said:
Any sequence like
a = simpleFunc(x);
b = incrediblyComplexFunc(y);
c = a * b;

can be rewritten as
a = simpleFunc(x);
if(a) b = incrediblyComplexFunc(y);
c = a * b;
You may want to initialize 'b', in case its previous value was inf or nan.
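Spelled out, something like:

a = simpleFunc(x);
b = 0;                                  // otherwise 0 * (stale inf/nan in b) = nan
if (a != 0) b = incrediblyComplexFunc(y);
c = a * b;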

Also, have you tried sequences like this on NV4x/G7x? You might be surprised...
 