JC's latest plan file update
Geeforcer
11-Feb-2002, 22:44
http://www.shacknews.com/finger/?fid=johnc@idsoftware.com
Text:
February 11, 2002
-----------------
Last month I wrote the Radeon 8500 support for Doom. The bottom line is that
it will be a fine card for the game, but the details are sort of interesting.
I had a pre-production board before Siggraph last year, and we were discussing
the possibility of letting ATI show a Doom demo behind closed doors on it. We
were all very busy at the time, but I took a shot at bringing up support over
a weekend. I hadn't coded any of the support for the custom ATI extensions
yet, but I ran the game using only standard OpenGL calls (this is not a
supported path, because without bump mapping everything looks horrible) to see
how it would do. It didn't even draw the console correctly, because they had
driver bugs with texGen. I thought the odds were very long against having all
the new, untested extensions working properly, so I pushed off working on it
until they had revved the drivers a few more times.
My judgment was colored by the experience of bringing up Doom on the original
Radeon card a year earlier, which involved chasing a lot of driver bugs. Note
that ATI was very responsive, working closely with me on it, and we were able
to get everything resolved, but I still had no expectation that things would
work correctly the first time.
Nvidia's OpenGL drivers are my "gold standard", and it has been quite a while
since I have had to report a problem to them, and even their brand new
extensions work as documented the first time I try them. When I have a
problem on an Nvidia, I assume that it is my fault. With anyone else's
drivers, I assume it is their fault. This has turned out correct almost all
the time. I have heard more anecdotal reports of instability on some systems
with Nvidia drivers recently, but I track stability separately from
correctness, because it can be influenced by so many outside factors.
ATI had been patiently pestering me about support for a few months, so last
month I finally took another stab at it. The standard OpenGL path worked
flawlessly, so I set about taking advantage of all the 8500 specific features.
As expected, I did run into more driver bugs, but ATI got me fixes rapidly,
and we soon had everything working properly. It is interesting to contrast
the Nvidia and ATI functionality:
The vertex program extensions provide almost the same functionality. The ATI
hardware is a little bit more capable, but not in any way that I care about.
The ATI extension interface is massively more painful to use than the text
parsing interface from nvidia. On the plus side, the ATI vertex programs are
invariant with the normal OpenGL vertex processing, which allowed me to reuse
a bunch of code. The Nvidia vertex programs can't be used in multipass
algorithms with standard OpenGL passes, because they generate tiny differences
in depth values, forcing you to implement EVERYTHING with vertex programs.
Nvidia is planning on making this optional in the future, at a slight speed
cost.
I have mixed feelings about the vertex object / vertex array range extensions.
ATI's extension seems more "right" in that it automatically handles
synchronization by default, and could be implemented as a wire protocol, but
there are advantages to the VAR extension being simply a hint. It is easy to
have a VAR program just fall back to normal virtual memory by not setting the
hint and using malloc, but ATI's extension requires different function calls
for using vertex objects and normal vertex arrays.
The fragment level processing is clearly way better on the 8500 than on the
Nvidia products, including the latest GF4. You have six individual textures,
but you can access the textures twice, giving up to eleven possible texture
accesses in a single pass, and the dependent texture operation is much more
sensible. This wound up being a perfect fit for Doom, because the standard
path could be implemented with six unique textures, but required one texture
(a normalization cube map) to be accessed twice. The vast majority of Doom
light / surface interaction rendering will be a single pass on the 8500, in
contrast to two or three passes, depending on the number of color components
in a light, for GF3/GF4 (*note GF4 bitching later on).
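The pass counts above can be roughed out with a little arithmetic. This is only a minimal sketch, not how the engine decides anything: it assumes a pass is limited purely by the number of texture accesses it can make, and the per-pass limits of 4 (GF3/GF4) and 2 (NV10 path) are assumptions taken from the texture unit counts mentioned elsewhere in this update. The function name `min_passes` is made up for illustration.

```python
import math


def min_passes(accesses_needed: int, accesses_per_pass: int) -> int:
    """Lower bound on passes when a pass is limited only by texture accesses."""
    if accesses_per_pass <= 0:
        raise ValueError("accesses_per_pass must be positive")
    return math.ceil(accesses_needed / accesses_per_pass)


# From the text: six unique textures, one of them (the normalization
# cube map) accessed twice, so seven accesses per light/surface interaction.
DOOM_ACCESSES = 6 + 1

# Assumed per-pass limits: 11 is the figure quoted for the 8500;
# 4 and 2 are texture unit counts, used here as a crude stand-in.
LIMITS = {"Radeon 8500": 11, "GF3/GF4": 4, "NV10 path": 2}

for card, limit in LIMITS.items():
    print(card, "needs at least", min_passes(DOOM_ACCESSES, limit), "pass(es)")
```

This only gives a floor. The two-or-three figure quoted for GF3/GF4 also depends on the number of color components in a light, which this model ignores.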
Initial performance testing was interesting. I set up three extreme cases to
exercise different characteristics:
A test of the non-textured stencil shadow speed showed a GF3 about 20% faster
than the 8500. I believe that Nvidia has a slightly higher performance memory
architecture.
A test of light interaction speed initially had the 8500 significantly slower
than the GF3, which was shocking due to the difference in pass count. ATI
identified some driver issues, and the speed came around so that the 8500 was
faster in all combinations of texture attributes, in some cases 30+% more.
This was about what I expected, given the large savings in memory traffic by
doing everything in a single pass.
A high polygon count scene that was more representative of real game graphics
under heavy load gave a surprising result. I was expecting ATI to clobber
Nvidia here due to the much lower triangle count and MUCH lower state change
functional overhead from the single pass interaction rendering, but they came
out slower. ATI has identified an issue that is likely causing the unexpected
performance, but it may not be something that can be worked around on current
hardware.
I can set up scenes and parameters where either card can win, but I think that
current Nvidia cards are still a somewhat safer bet for consistent performance
and quality.
On the topic of current Nvidia cards:
Do not buy a GeForce4-MX for Doom.
Nvidia has really made a mess of the naming conventions here. I always
thought it was bad enough that GF2 was just a speed bumped GF1, while GF3 had
significant architectural improvements over GF2. I expected GF4 to be the
speed bumped GF3, but calling the NV17 GF4-MX really sucks.
GF4-MX will still run Doom properly, but it will be using the NV10 codepath
with only two texture units and no vertex shaders. A GF3 or 8500 will be
much better performers. The GF4-MX may still be the card of choice for many
people depending on pricing, especially considering that many games won't use
four textures and vertex programs, but damn, I wish they had named it
something else.
As usual, there will be better cards available from both Nvidia and ATI by the
time we ship the game.
John Reynolds
11-Feb-2002, 22:47
That high poly bug I keep seeing 8500 users complain about online is scarily starting to sound like a hardware, and not driver, level problem.
bystander
11-Feb-2002, 22:51
That's a great endorsement for the GF4mx!
Geeforcer
11-Feb-2002, 22:52
Hardly unexpected.
Edit: His comments on GF4MX.
[ This Message was edited by: Geeforcer on 2002-02-11 23:53 ]
Dave Baumann
12-Feb-2002, 00:10
John - if it's the same 'high poly bug' that's been talked about then it seems that Croteam have run into it as well; they have a workaround and ATi say they will be updating it in the next driver.
Johnny Rotten
12-Feb-2002, 00:23
I believe we're talking about 2 distinct things here. The SS:SE slowdown bug was a driver problem with regards to particular texture uploads. This is apparently a different issue than the VulpineGL problem and (apparently) Doom3 issue.
At this point though only ATI knows for sure.
Dave Baumann
12-Feb-2002, 00:30
Yeah - just re-read that one and it seems to be a texture thrashing issue rather than anything else.
Doomtrooper
12-Feb-2002, 01:46
This poly issue concerns me, although all my games play excellently... I still can't fathom how the 8500 could lose to my old Radeon Vivo in test 6 in Glexcess.
I plan on getting a 128 8500 as soon as they're available, the first thing I'm gonna do is load up the leaked Firegl drivers that detect chip revision... if it shows a different chip revision (mine is currently A13) I will get suspicious.
The thing that concerns me the most is Croteam and the Designer of Glexcess have contacted ATI (Glexcess Coder over a month ago) about this issue and even with the many dev leaks the problem is still there. The Firegl drivers show better results than the 8500's, but not near what it should be putting out.
This also may be the reason why the 128 meg Radeon 8500 cards are showing a 1000 point 3Dmark increase and 20 fps more RTCW...
Doomtrooper, the numbers in the ATI 128MB presentation slides are just a little bit confusing. ATI was using the pre-release drivers for the 64MB 8500 and the latest betas for the 128MB 8500. I wasn't at the presentation but Mrbread was. I wonder if he can confirm this?
[ This Message was edited by: ben6 on 2002-02-12 02:51 ]
Sharkfood
12-Feb-2002, 01:52
The GLExcess bug is also similar in nature (but also hindered by the J.C. Doom3 bug to a much smaller degree) to the bug plaguing Tribes2 and SS:SE.
What J.C. is talking about isn't nearly as insidious as the bug users are encountering on the 8500. The "low framerate" bug currently can cause normal 100 fps situations to sputter along at 20-30 fps with occasional drop outs to 1 fps in its more extreme conditions.
There are different layers of issues with the OGL drivers at this point in time and they are additive with each other. Something with fat texturing, uses sphere maps and high polygon using OGL's T&L pipes can hit conditions that halt framerate constantly. Disable TCL and framerates double. Reenable TCL and reduce texture detail and framerates double. Reenable both and reduce some other setting (like Z-buffer depth or hyperz, or disable vol. fog) and framerates double. Remove all but a singular sphere map and framerates double, etc.etc. There are also quite the isolated issues with fastzmask clears and hw page flips stumbling flat and introducing wait states. Off and on they fix and rebreak much.
As I've said on other forums, ATI is just now arriving at what I'd consider "V1.0" of their drivers. I believe they will be able to absorb a good, healthy percentage of the troubling performance issues but with some left over when all is said and done. They really need to focus on regular, timely driver updates and stop applying specific "duct-tape" fixes for a singular source. This is what's causing lots of break/fix/rebreak when the horseblinders are on with specific issues.
Just my $0.02,
-Shark
This is a huge endorsement for the ATI RADEON 8500 from JC IMO. Just wish that 'niggly' little issue (maybe it is not so little after all - that's the way I read it) wasn't present.
LeStoffer
12-Feb-2002, 06:27
8:50 pm addendum: Mark Kilgard at Nvidia said that the current drivers already
support the vertex program option to be invariant with the fixed function path,
and that it turned out to be one instruction FASTER, not slower.
Anyway, it's great to see that ATI is making progress and that their advanced hardware (PS 1.4) really gives a clear advantage in real life games. I wonder, however, if Carmack has used a GF4 and whether it gives any improvements over GF3 (besides its faster clock speed)?
Regards, LeStoffer
anyone know where to get the archive on his .plan file ? anyone keeping them ?
I wonder, however, if Carmack has used a GF4 and whether it gives any improvements over GF3 (besides its faster clock speed)?
Carmack indicates that he's used the GF4 a couple of times in the .plan file.
The fragment level processing is clearly way better on the 8500 than on the
Nvidia products, including the latest GF4.
...
The vast majority of Doom
light / surface interaction rendering will be a single pass on the 8500, in
contrast to two or three passes, depending on the number of color components
in a light, for GF3/GF4.
He seems to be indicating the GF4ti in these cases, as he lumps the geforce3 and 4 together in that last quotation and later talks specifically about the geforce4MX as a separate entity. However, I don't think he's had a lot of time to play around with the additional functionality of the GF4. Perhaps he's had one kicking around, but he still compares the 8500 to the GF3 here:
A test of the non-textured stencil shadow speed showed a GF3 about 20% faster
than the 8500.
Had he been using the Geforce4 for a great deal of time and become accustomed to it, I believe he would have used that in the comparison. Also, he made no comments comparing the second vertex shaders of the 8500 and the Geforce4. Not conclusive proof, I know, but it was a comparison I expected him to make. Especially since he was complaining about nVidia's vertex shader implementation. This suggests (to me, anyway), that he's a) using DX7 shader instructions so the fixed-function nature of the 8500's second shader isn't bothering him, b) only using the DX8 shader on the 8500, or c) hasn't had time to optimise for the second shader on the GF4 and thus can't compare the 8500's parallel shaders to the GF4's parallel shaders.
I may be reading too much into this, but some thoughts.
EDIT: Rather than respond to my own post...
Muted, the text of the .plan is in the first message of the thread. I believe Blue's News also has this.
[ This Message was edited by: tkopp on 2002-02-12 08:28 ]
nggalai
12-Feb-2002, 07:34
muted,
try www.gamefinger.com.
I found the .plan update to be rather interesting indeed. I was slightly surprised to read JC bash the GF4-MX naming scheme this openly, but nonetheless very satisfied about it. (*note GF4 bitching later on) LOL
ta,
.rb
Originally posted by tkopp:
Especially since he was complaining about nVidia's vertex shader implementation. This suggests (to me, anyway), that he's a) using DX7 shader instructions so the fixed-function nature of the 8500's second shader isn't bothering him, b) only using the DX8 shader on the 8500, or c) hasn't had time to optimise for the second shader on the GF4 and thus can't compare the 8500's parallel shaders to the GF4's parallel shaders.
Doom 3, like all of John Carmack's games, is written in OpenGL. In this context, any and all regard for DirectX should be thrown out the window.
Since there isn't an official vertex program or fragment shader extension in OpenGL (as opposed to vs1.0/1.1 and ps1.0-1.4 in DX8.1), vendors expose their hardware in different ways.
ATI's vertex program extension requires developers to call entry points for every operation they wish to perform. Although I don't have the spec on hand (and I'm too lazy to look it up in the repository), if you wanted to perform a simple transformation shader, you would have to call a function like glVertexShaderBinaryOpEXT() with the various enumerants for the input and output registers (with masks) for every instruction in the shader. With NVIDIA's implementation, you just pass in a string that consists of the shader assembly (very similar to vs1.0/1.1; however, some instruction and register names have been subtly changed) and it will compile the vertex program for you.
Any reasonable programmer would just write a parser for vertex shaders that automagically does the right thing for generating shaders on ATI cards, but it's ugly and shouldn't be required. It also means that the drivers have more entry points, which means more places where something could go wrong.
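To make the "write a parser" idea concrete, here is a minimal sketch of what such a translator could look like. It is an illustration only: the input is written in the style of NVIDIA's text interface, comments and swizzle validation are not handled, and the emitted `shader_opN(...)` strings are hypothetical placeholders for whatever per-instruction entry points the ATI-style extension actually exposes, not real function names.

```python
from typing import List, NamedTuple


class Instr(NamedTuple):
    opcode: str
    dst: str
    srcs: List[str]


def parse_vertex_program(text: str) -> List[Instr]:
    """Split a text vertex program into (opcode, destination, sources) records."""
    body = text.strip()
    if body.startswith("!!VP1.0"):          # strip the program header
        body = body[len("!!VP1.0"):]
    body = body.strip()
    if body.endswith("END"):                # strip the terminator
        body = body[:-len("END")]
    instrs = []
    for stmt in body.split(";"):            # one instruction per ';'
        stmt = stmt.strip()
        if not stmt:
            continue
        opcode, operands = stmt.split(None, 1)
        parts = [p.strip() for p in operands.split(",")]
        instrs.append(Instr(opcode.upper(), parts[0], parts[1:]))
    return instrs


def emit_calls(instrs: List[Instr]) -> List[str]:
    """Render one hypothetical per-instruction call for each parsed record."""
    return [
        f"shader_op{len(i.srcs)}({i.opcode}, {i.dst}, {', '.join(i.srcs)})"
        for i in instrs
    ]


SAMPLE = """!!VP1.0
DP4 o[HPOS].x, c[0], v[OPOS];
DP4 o[HPOS].y, c[1], v[OPOS];
DP4 o[HPOS].z, c[2], v[OPOS];
DP4 o[HPOS].w, c[3], v[OPOS];
MOV o[COL0], v[COL0];
END"""

for call in emit_calls(parse_vertex_program(SAMPLE)):
    print(call)
```

The point of the design is that the application keeps a single text source per shader and only the last step differs per vendor, at the cost of maintaining the translator.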
However, on NVIDIA's implementation, if you were to render a sphere using the fixed function pipeline, and then render the *exact same* sphere in a second pass using a vertex shader (e.g., if you wanted to do bump mapping setup in the second pass), the depth values for each fragment are slightly different, so multipass algorithms break. In order to circumvent this, you need to emulate fixed-function processing in vertex shaders every pass. This hurts performance (since vertex shaders are slower than fixed function T&L), and shouldn't really happen. The Radeon 8500 doesn't have this problem.
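A toy illustration of why tiny depth differences matter. It is not a model of any real hardware: it uses Python doubles rather than GPU precision, assumes the later pass uses an equality-style depth test, and the two "paths" are just the same dot product summed in a different order. The function names are invented for the example.

```python
def depth_path_a(row, v):
    """Depth as a dot product, summed left to right."""
    return (row[0] * v[0] + row[1] * v[1]) + row[2] * v[2]


def depth_path_b(row, v):
    """Same dot product, but the last two terms are added first."""
    return row[0] * v[0] + (row[1] * v[1] + row[2] * v[2])


def passes_equal_depth_test(stored: float, incoming: float) -> bool:
    """An equality-style depth test: only exact matches are drawn."""
    return incoming == stored


row = (1.0, 1.0, 1.0)        # illustrative matrix row
vertex = (0.1, 0.2, 0.3)     # illustrative vertex position

stored = depth_path_a(row, vertex)        # first pass writes depth via path A

same_path = depth_path_a(row, vertex)     # second pass, same math
other_path = depth_path_b(row, vertex)    # second pass, reordered math

print("difference:", abs(stored - other_path))
print("same path drawn: ", passes_equal_depth_test(stored, same_path))
print("other path drawn:", passes_equal_depth_test(stored, other_path))
```

Both paths are mathematically identical, yet the reordered one misses the depth test, which is the failure mode that forces every pass onto the same path unless the hardware guarantees invariance.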
There is no way for an application programmer to take control of the second vertex shader on either the Radeon 8500 or the GeForce 4Ti, nor should they have to. This happens automatically, every time you send vertices down the pipeline. There is no way for an application programmer to disable use of the R200's second vertex shader.
On a related note -- where was it confirmed that the R200 had two vertex shaders?
tkopp:
I think Carmack was comparing the GF3 and R8500 together in this .plan update for a reason, rather than GF4 vs. 8500. GF3 and 8500 are both nearly the same generation, whereas the GF4 is a whole cycle ahead, so it wouldn't be fair to compare something that was released much further ahead. If Carmack did compare the GF4 vs. the 8500 and showed that the R8500 performed worse in every case, then you'd find a lot of ATI card owners crying "doom", so to speak. A lot of developers avoid pissing off the manufacturers, but with JC, I imagine it's worse to piss off the ATI/NVIDIA zealots. :grin:
Curious question to all here, but I've heard word that ATI has done what NVIDIA is doing, and is implementing all the code for every card in a single driver update. Sharkfood mentions here that the drivers are like at "v1.0" state at this point. Would this mean that the R250/R300 drivers will be much better and refined as a result, since it'll use code based off a previous generation, or do they start from scratch as if the card is a wholly new generation (ala NVIDIA)?
Edit: Should've known that word would be censored. I guess zealot is okay? :wink:
[ This Message was edited by: Matt on 2002-02-12 10:36 ]
Dave Baumann
12-Feb-2002, 11:01
On a related note -- where was it confirmed that the R200 had two vertex shaders?
In reference to ATi's 3Dmark2001 Vertex throughput, an ATi employee posted at Rage3D that this was due to the Dual ‘Engines’.
Curious question to all here, but I've heard word that ATI has done what NVIDIA is doing, and is implementing all the code for every card in a single driver update.
NVIDIA claims patents on that. The question is whether ATi’s unified drivers operate off single code paths for all cards or just roll all the drivers into a single package. I’d hope the former.
Edit: Should've known that word would be censored. I guess zealot is okay?
Yes, that is edited, I’m still deciding on Zealot :wink:
On a related note -- where was it confirmed that the R200 had two vertex shaders?
I'm wondering the same thing myself now having carefully reread ATi's notes about the Charisma II engine. Specifically they say that there are two pipelines - one fixed function and one programmable. Well that's how MS in DirectX describe the differences between Dx7 and Dx8; the former uses a fixed function TCL pipeline whereas Dx8 uses a programmable one (vertex shaders). To my mind that's not dual or parallel vertex shaders.
On the subject of the GF4's VS units, they do operate in parallel, yes? How does that work in principle though? Surely only one vertex can be operated on per pass? Does it mean that while one vertex is being TCL'd, another can be immediately worked upon by the other ALU? Can the two units perform different instructions separately or just the same one? How does the PS unit cope then, if you've got two VS units firing vertices at it virtually at once?
On 2002-02-12 12:09, Neeyik wrote:
On the subject of the GF4's VS units, they do operate in parallel, yes?
AFAIK, yes, they do.
Does it mean that while one vertex is being TCL'd, another can be immediately worked upon by the other ALU?
Each VS pipe works on 3 vertices at a time; in this way the gf4 should be crunching 6 vertices on the fly at the same time.
Can the two units perform different instructions separately or just the same one?
The 2 units should be independent, so it's possible to perform different instructions at the same time, but, I believe, with the same vertex shader (if the rasterizer can work on only one primitive at a time, it wouldn't make much sense to have 2 vertex shaders concurrently working on different shaders)
How does the PS unit cope then, if you've got two VS units firing vertices at it virtually at once?
Probably there is some kind of fifo/buffer between VS and PS pipelines.
ciao,
Marco
"I wonder, however, if Carmack has used a GF4 and whether it gives any improvements over GF3 (besides its faster clock speed)?"
I'd think he is perfectly familiar with nv25, but it probably doesn't offer anything new that is usable for his engine.
Dave Baumann
12-Feb-2002, 14:49
CBrennan[ATI] @ Rage3D (http://www.rage3d.com/board/showthread.php?s=&threadid=33592808&highlight=two+ engines)
"The 8500 can complete 2x the vertex shader instructions/clock cycle that a GF3 can complete due to the extra vertex engine in the 8500. The XBox's vertex engine should therefore be on par with the 8500. This is why the 8500 stomps the GF3 in synthetic vertex tests."
GF3 uses a T&L program to emulate fixed function T&L; if silicon requirements for fixed function T&L and Vertex Shaders are similar then I'd guess that it makes more sense to implement two Vertex Shaders and have the emulation program run over one.
Thanks for the input chaps...
Dave - did you mean the Radeon 8500 when you said "GF3 uses a T&L program to emulate fixed function T&L; if silicon requirements for fixed function T&L and Vertex Shaders are similar then I'd guess that it makes more sense to implement two Vertex Shaders and have the emulation program run over one."
What still puzzles me now though is that if you've got dual VS units hammering through 6 vertices per pass then, unless the PS unit is mightily quick, the buffer between the separate shaders is going to be more of a bottleneck rather than an aid. Clearly the GF4 isn't slow, which must mean that its pixel shader is very fast indeed, or am I totally missing the point here? NVIDIA have obviously ramped up the performance of their PS unit in the GF4; I wonder now by just how much?
LeStoffer
12-Feb-2002, 16:03
On 2002-02-12 16:12, Neeyik wrote:
What still puzzles me now though is that if you've got dual VS units hammering through 6 vertices per pass then, unless the PS unit is mightily quick, the buffer between the separate shaders is going to be more of a bottleneck rather than an aid. Clearly the GF4 isn't slow, which must mean that its pixel shader is very fast indeed, or am I totally missing the point here? NVIDIA have obviously ramped up the performance of their PS unit in the GF4; I wonder now by just how much?
I have been wondering about the same thing. http://www.extremetech.com had an interview with nVidia's chief scientist, David Kirk, during their GF4 preview.
Some goodies:
We asked David Kirk to talk a bit about where they found GeForce3 was bottlenecked, and what design changes that inspired:
"Most of the bottlenecks have to do with FIFO sizes and stall-on-use for data paths. For example, in the pixel shading unit where in one cycle a texture is being looked up, in the next cycle it's being operated on, and in the next cycle it's being used for another lookup. If there is not the ability to write the register and use it at the same time, then you stall and have to wait a cycle. So we added extra data paths, and increased the number of ports on register files. In the memory controller we added more buffering so that things like blits, filtering operations and dual texture read and blend-together and write out again operations have enough buffering to schedule reads in time so that data is there when you need it so that the computational part of the pipeline isn't stalled."
We also asked Dave Kirk about some of the different areas of design focus: "We put a lot of focus on dual texturing as a performance case, as well as full-speed trilinear, which required a widening of data paths and more buffers in order to deliver the data at full speed. It's kind of like a water flow diagram: if all the pipes in the path aren't the right diameter, you go as slow as the slowest case."
Regards, LeStoffer
[ This Message was edited by: LeStoffer on 2002-02-12 17:06 ]
Dave Baumann
12-Feb-2002, 16:56
Dave - did you mean the Radeon 8500 when you said
No. What I was saying is that David Kirk has explained the reason why GF3 can’t use fixed function T&L as well as the vertex shader is because they use a vertex program over the Vertex Shader to emulate fixed function, meaning that GF3 only has one vertex unit in total, not a Vertex Shader and fixed function T&L.
ATi’s official documentation states that they can run fixed function and Vertex Shaders at the same time, with the explanation that they have two engines (one fixed function and one vertex shader). The point I’m getting at is that if a Vertex Shader can run a T&L emulation program (as Kirk explained for GF3) and the silicon requirements are similar why bother implementing two different engines? There would be more mileage in implementing two vertex shaders, but running the emulation program over one of them when fixed function T&L is needed. When you add the statements from the ATI PR documentation and what CBrennan is saying that’s the conclusion I come to (assuming what CBrennan is saying is actually the truth).
The vertex shaders don't emulate the fixed function pipeline on NV20/NV25; however, they share the same FPU.
Emulating the fixed function path in a vertex shader is slower than including a legacy fixed function path; however, the FPU for both is exactly the same, which is why fixed function T&L and vertex shader T&L don't operate in parallel.
Clearly the GF4 isn't slow, which must mean that its pixel shader is very fast indeed, or am I totally missing the point here? NVIDIA have obviously ramped up the performance of their PS unit in the GF4; I wonder now by just how much?
I think the thinking here is two-fold.
First, as soon as you've got a complicated vertex shader (multiple bones per vertex w/ bump-mapping setup, for example) you were probably going to be geometry limited on NV20. Since a lot of next-generation engines will use complicated (> 40 instruction) vertex programs, it makes sense to beef up the vertex engine, since it was the bottleneck before.
Second, for simpler vertex programs, the triangle count will be *so* high that it's safe to assume the average triangle size will be minuscule, or overdraw will be high enough that a large number of pixels will never get sent down the shading path due to Z occlusion. Shading will probably still be the bottleneck in this case, but this is more of a theoretical benchmark case, rather than a practical usage case.
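The two cases can be put into a back-of-the-envelope model. Everything numeric here is a made-up assumption for illustration (clock speed, one instruction per clock per vertex unit, pixel rate, scene size); none of it is a measured figure for NV20 or NV25. The helper names are hypothetical too.

```python
def vertex_rate(clock_hz: float, vertex_units: int, instrs_per_vertex: int) -> float:
    """Vertices per second, assuming one instruction per clock per unit."""
    return clock_hz * vertex_units / instrs_per_vertex


def limiting_stage(vertices: int, pixels: int, v_rate: float, p_rate: float) -> str:
    """Whichever stage takes longer for the frame is treated as the bottleneck."""
    vertex_time = vertices / v_rate
    pixel_time = pixels / p_rate
    return "vertex" if vertex_time > pixel_time else "pixel"


CLOCK_HZ = 300e6      # hypothetical core clock
PIXEL_RATE = 1e9      # hypothetical shaded pixels per second
VERTICES = 100_000    # hypothetical scene
PIXELS = 2_000_000    # hypothetical shaded pixels per frame

complex_one_unit = vertex_rate(CLOCK_HZ, 1, 40)   # 40-instruction program
complex_two_units = vertex_rate(CLOCK_HZ, 2, 40)  # same program, two units
simple_one_unit = vertex_rate(CLOCK_HZ, 1, 4)     # 4-instruction program

print("complex program, 1 unit:", limiting_stage(VERTICES, PIXELS, complex_one_unit, PIXEL_RATE))
print("complex program, 2 units:", limiting_stage(VERTICES, PIXELS, complex_two_units, PIXEL_RATE))
print("simple program, 1 unit:", limiting_stage(VERTICES, PIXELS, simple_one_unit, PIXEL_RATE))
```

Under these assumed numbers the long vertex program is geometry limited even with a second unit, while the short one shifts the bottleneck to shading, which matches the shape of the argument above but says nothing about real-world magnitudes.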
Dave Baumann
12-Feb-2002, 18:23
The vertex shaders don't emulate the fixed function pipeline on NV20/NV25; however, they share the same FPU.
Mmmm, that's not the impression I got from David's answer (http://www.voodooextreme.com/3dpulpit/Articles/dkirk_oct2001.html):
"David : The DX7 hardwired pipeline is implemented on GF3 using a vertex program. That is why users cannot run both at the same time."
The use of the words Vertex program seemed to imply something else.
"
The vertex shaders don't emulate the fixed function pipeline on NV20/NV25; however, they share the same FPU.
Mmmm, that's not the impression I got from David's answer:
"
OK this is my understanding....
You're both sort of right.
The fixed function unit shares the Vertex shader ALU and it does some of the work on the vertex shader sequence, however the fixed function unit has access to dedicated lighting hardware, so it's not exactly simply "emulated".
"The fixed function unit shares the Vertex shader ALU and it does some of the work on the vertex shader sequence, however the fixed function unit has access to dedicated lighting hardware, so it's not exactly simply "emulated".
Yes, I second that. In heavy lighting scenarios, the Geforce3 performance in dx7 apps is just "too fast" to be "emulated" over the standard vertex shader instruction path.
We had the discussion about the R200 implementation over on the old forums, where arjan de lumens brought up that the R200 vertex shader might be superscalar and may only have some ALU parts doubled, e.g. the Dot3 unit: http://www.beyond3d.com/messageview.cfm?start=21&catid=3&threadid=1631
I think this makes sense.