Does ATi follow OPEN 3d API standards better than Nvidia?

To take this further, consider dependent reads: you have to potentially absorb the latency through the texture cache, out to memory (including arbitration time), back through the texture cache and through the filtering hardware. This adds up to hundreds of cycles of latency which, if not absorbed, will cause a stall. So how big are your temp register files? Very big.

This makes it reasonable to trade off the number of temps against latency buffering. Of course, why NV's chips suffer so badly when using such a small number of registers seems a bit strange, as this would imply that they barely have enough buffering to hide their ALU latency, let alone the latency of dependent reads (although there are other schemes that can be used to help with those).
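To put rough numbers on that trade-off (my own back-of-the-envelope sketch, not anything from John's post - the latency and register counts are just assumptions): to keep issuing one pixel per clock across N cycles of latency, you need roughly N pixel contexts in flight, and each context must hold its live temporaries.

Code:
# Rough sketch with assumed numbers: to hide N cycles of latency you need
# about N pixel contexts in flight, each holding its live temporaries.
def temp_storage_bits(latency_cycles, live_temps, bits_per_temp=128):
    """Storage needed per pipeline to stay busy across `latency_cycles`."""
    contexts_in_flight = latency_cycles        # one new pixel issued per clock
    return contexts_in_flight * live_temps * bits_per_temp

# e.g. ~200 cycles of dependent-read latency, 4 live FP32 vec4 temps per pixel:
print(temp_storage_bits(200, 4))               # 102400 bits, about 12.5 KB per pipeline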

On the number of ports: if you're supporting parallel vector and scalar ops, then per clock you need up to 6 read and 2 write ports; 8 ports is still costly.

John.
 
Dave H said:
I appreciate some of the points you're making, but I don't see why the register requirements need to be nearly what you're proposing. Why have the full random-access register file at every pipeline stage? A standard datapath would keep register access confined to one well-defined pipeline stage (or perhaps spread it across two or three on a deeply pipelined design). Intermediate values are of course propagated down the pipeline as needed, but in latches, not 10-ported registers; it's really a separate issue that has nothing to do with the size of the visible register set. The way you describe it, it seems like each of your 10 pipeline stages is an EX phase.
You misunderstand. With 1 pipeline and 10 pipeline stages, you work on a *different* pixel every clock cycle, cycling through 10 pixels until you have completed your pixel shader program for all of them. To do this, you need to keep the complete state information for all 10 pixels somewhere while processing them, and that implies one instance of the register file for each of the 10 pixels. (This is very different from a standard, single-threaded MPU, in that the MPU only needs to hold state for 1 thread => 1 instance of each register. Think of the 10-step pipeline as running 1 'thread' per pixel and constantly swapping between 10 'threads'.) There need not be more than one pipeline step that does the actual register access; this one stage must then choose between the 10 register files, picking the one belonging to the pixel in that pipeline step at that time.
And 10 ports!! A modern MPU datapath is just as superscalar as a single pixel shader pipe, but typically uses just a dual read-port and single write-port design, IIRC.
You're confusing register files and cache SRAMs. They are completely different beasts. SRAMs usually only have one or two ports, whereas MPU register files usually have a lot more. For example, the Athlon, which is a 3-way superscalar design, has an integer register file with 17 ports and a cache with 2 ports.
I suppose a pixel shader might tend to have denser operand locality, but 10 ports! Even if it's theoretically possible that you could be asked to co-issue three FMACs using the same register for all three source operands and also as one of the destinations... do you really think the hardware is designed to accommodate this ridiculous instruction bundle? No way. 3 or 4 ports should be more than enough to keep register structural hazards to a minimum.
If you have 3 FMAC-capable units, you need 12 ports: for each FMAC, you need 3 read ports and 1 write port. Multiply that by 3 and you get 12 ports.
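Spelling that counting out (my own trivial sketch of the arithmetic above, nothing more):

Code:
# Worst-case port count for N independent FMAC units sharing one register file:
# every FMAC reads 3 source operands and writes 1 result in the same cycle.
def fmac_ports(units, reads_per_fmac=3, writes_per_fmac=1):
    return units * reads_per_fmac, units * writes_per_fmac

reads, writes = fmac_ports(3)
print(reads, writes, reads + writes)   # 9 read + 3 write = 12 ports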
 
Umm, why would you work on ten pixel shaders at once per pipeline? In the other areas of the pipeline, the pixels in flight aren't ready to be shaded yet. It is only near the end of the pipeline that they reach the pixel shader stage. It seems to me that the pixel shader stage should only be active on one pixel at a time, and if a shader takes 100 cycles to complete, the other 9 pixels are waiting stalled.

Through the pixel shader, there should only be one pixel in flight, the rest of them waiting until the shader completes. I fail to see how pipelining the shaders themselves will help since, presumably, the shader ALUs will be 100% busy and should complete most of their operations in a single cycle. You can't work on VLIW shader instruction 1 for pixel 1 in cycle 1, and then VLIW shader instruction 1 for pixel 2 in cycle 2, since pixel 1 in cycle 2 will be blocked from doing any shader operations. You haven't gained anything.

The only time it makes sense is maybe on the FDIV/POW/SINCOS/etc unit which may take more than 1 cycle.
 
DemoCoder said:
Umm, why would you work on ten pixel shaders at once per pipeline? In the other areas of the pipeline, the pixels in flight aren't ready to be shaded yet. It is only near the end of the pipeline that they reach the pixel shader stage. It seems to me that the pixel shader stage should only be active on one pixel at a time, and if a shader takes 100 cycles to complete, the other 9 pixels are waiting stalled.
Such an approach would ruin shader performance beyond belief. Consider for example a pixel shader instruction that does a texture lookup. The texture lookup operation consists of the following sub-operations:
  • A max-value and division stage (for cube-mapping, you must divide the 2 smallest texture coordinates by the largest one)
  • Mipmap level calculation (includes at least a logarithm)
  • Texture coordinate scaling and clamping
  • Texture cache lookup
  • Bi/trilinear interpolation
  • Fixed-point to floating-point conversion (if texture map is fixed-point and PS registers are floating-point)
Do you really believe that this can be done with a *latency* of 1 clock cycle at 500 MHz on current 0.13 micron processes?? To me, 10 clocks sounds like a more reasonable estimate (but still very low). If you cannot swap execution between multiple different pixels each clock cycle, that would mean that a pixel shader program that does just 1 texture lookup will stall the entire pipeline for 10 clock cycles, so effective single-texturing fillrate will be approximately (pipelines * clock speed / 10). This doesn't even take into account the ~20-cycle wait you would get from a texture cache miss.
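To put a number on that (my own illustrative arithmetic using the figures in the post above - 4 pipelines, 500 MHz and a 10-cycle lookup latency are assumptions, not measurements):

Code:
# Effective single-texturing fillrate if the pipeline stalls for the full
# texture-lookup latency, versus interleaving a different pixel each clock.
pipelines    = 4
clock_mhz    = 500
latency_clk  = 10    # assumed texture-lookup latency

stalled_mpix     = pipelines * clock_mhz / latency_clk   # 200 Mpixels/s
interleaved_mpix = pipelines * clock_mhz                 # 2000 Mpixels/s
print(stalled_mpix, interleaved_mpix)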
Through the pixel shader, there should only be one pixel in flight, the rest of them waiting until the shader completes. I fail to see how pipelining the shaders themselves will help since presumably, the shader ALUs will be 100% busy and should complete most of their operations in a single cycle.
Which operations? At 0.13-micron, even doing an FP16 add in 1 cycle at 500 MHz is hard enough to require custom design (=very time-consuming, expensive, error-prone, risky), and you can just forget about FP multiplies, MADs, dot-products, texture lookups etc.
You can't work on VLIW shader instruction 1 for pixel 1 in cycle 1, and then VLIW shader instruction 1 for pixel 2 in cycle 2, since pixel 1 in cycle 2 will be blocked from doing any shader operations.
I don't see the problem - if pixel 1 is blocked until VLIW shader instruction 1 has been run for pixels 2, 3, 4, ..., 10 in cycles 2 to 10 (so that instruction 2 can be run for pixel 1 only in cycle 11), you still get to maintain a throughput of 1 VLIW shader instruction per clock cycle, just as if you had processed 1 pixel at a time with a 1-cycle instruction latency for all instructions. And you would get the benefit of 10 pipeline stages for any operation that you wish to have a 1-cycle throughput for (which may be excessive for an FP add, but quite nice for dot-products, RCPs and texture lookups).
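A tiny simulation of that round-robin scheme (my own sketch; the pixel count, latency and program length are just the numbers used above):

Code:
# Round-robin over 10 pixels: each clock issues one VLIW instruction for the
# next pixel in the ring, so a 10-clock instruction latency is fully hidden
# and throughput stays at 1 instruction per clock.
PIXELS, PROGRAM_LEN = 10, 4

issued = []
for clock in range(PIXELS * PROGRAM_LEN):
    pixel = clock % PIXELS        # which pixel gets this issue slot
    instr = clock // PIXELS       # which instruction of the program it runs
    issued.append((clock, pixel, instr))

# 40 instructions issued in 40 clocks; pixel 0 runs instruction 1 at clock 10,
# exactly when its instruction 0 (started at clock 0) has finished.
print(len(issued), issued[:3], issued[10])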
You haven't gained anything.
You have gained a little bit less than 10x the clock speed, and/or reduced the stalls due to instruction latencies by a factor of 10.
 
DemoCoder said:
Umm, why would you work on ten pixel shaders at once per pipeline? In the other areas of the pipeline, the pixels in flight aren't ready to be shaded yet. It is only near the end of the pipeline that they reach the pixel shader stage. It seems to me that the pixel shader stage should only be active on one pixel at a time, and if a shader takes 100 cycles to complete, the other 9 pixels are waiting stalled.

Through the pixel shader, there should only be one pixel in flight, the rest of them waiting until the shader completes. I fail to see how pipelining the shaders themselves will help since, presumably, the shader ALUs will be 100% busy and should complete most of their operations in a single cycle. You can't work on VLIW shader instruction 1 for pixel 1 in cycle 1, and then VLIW shader instruction 1 for pixel 2 in cycle 2, since pixel 1 in cycle 2 will be blocked from doing any shader operations. You haven't gained anything.

The only time it makes sense is maybe on the FDIV/POW/SINCOS/etc unit which may take more than 1 cycle.

You're wrong there ;) There is not just one unit per pipeline. Each unit works on a different pixel. If that weren't the case, why would you put more than one arithmetic unit or more than one texturing unit per pipeline??? It would be useless.

Going deeper, every unit is in reality itself a pipeline. A mul, a mad… aren't done in one cycle. So in the micro-pipeline of the unit, you must also have a pixel in every stage.

The benefit of a pipeline is its ability to work on different stages of the process at the same time, with a different element in each stage. This hides the latency of the calculations.

Before the pixel pipeline there is a small buffer. The same work is done on many pixels, which is why pipelining is so efficient in a GPU.

EDIT : sorry, I hadn't seen arjan de lumens's answer
 
Well, if the FP units can't achieve single-cycle performance, then they only need to store the values temporarily; I still fail to see why they need a complete register file per in-flight pixel. Since the order in which the pixel "threads" are handled is presumably deterministic (until we get branching), couldn't you arrange for the values of inactive pixel "threads" (e.g. those in mid-execution on an FP unit) to be propagated through temporaries or buffers and arrive "just in time" for the beginning of the next instruction execution?

Swapping out the complete register file just seems like the quick and dirty way of doing it. After all, you only need to store the varying values (texture coordinates) and any registers that were modified (copy-on-write semantics).

Imagine doing multitasking on a CPU in which, instead of swapping the registers to the stack every time a task was about to be switched out, you swapped the registers into a cheaper on-chip buffer that was timed to empty precisely when the task is swapped back in. Yes, you still incur the swap, but the storage cost is much lower (on-chip vs. memory).
 
DemoCoder said:
Well, if the FP units can't achieve single-cycle performance, then they only need to store the values temporarily; I still fail to see why they need a complete register file per in-flight pixel. Since the order in which the pixel "threads" are handled is presumably deterministic (until we get branching), couldn't you arrange for the values of inactive pixel "threads" (e.g. those in mid-execution on an FP unit) to be propagated through temporaries or buffers and arrive "just in time" for the beginning of the next instruction execution?

Swapping out the complete register file just seems like the quick and dirty way of doing it. After all, you only need to store those varying values (texture coordinates) and any registers that were modified (copy-on-write semantics)
You need to store the complete set of registers to handle the (possibly uncommon) case where the shader program actually writes to all the registers. But you're right that passing along a copy of all the registers alongside each pixel as it travels down the pipeline could be quite a bit less expensive (approximately 50-70% or so, as far as I can see; more for larger register file port counts) than naively implementing a bank of 10+ independent register files at the ends of the pipeline.
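One way to picture that (a hypothetical sketch of my own, not a description of any shipping design): each pixel carries its private register state with it from latch to latch, so only the one stage that actually executes instructions needs ported register-file access.

Code:
# Hypothetical sketch: per-pixel shader state travels with the pixel through
# the pipeline; between stages it is just copied latch-to-latch.
from dataclasses import dataclass, field

@dataclass
class PixelContext:
    screen_xy: tuple
    temps: list = field(default_factory=lambda: [[0.0] * 4 for _ in range(8)])  # 8 vec4 temps

def advance(pipeline):
    """Shift every pixel context one stage down the pipeline each clock."""
    return [None] + pipeline[:-1]

pipe = [None] * 10                 # 10 pipeline stages, one context slot each
pipe[0] = PixelContext((12, 34))   # a new pixel enters at the top
pipe = advance(pipe)               # one clock later it sits in stage 1
print(pipe[1].screen_xy)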
Imagine doing multitasking on a CPU in which, instead of swapping the registers to the stack every time a task was about to be switched out, you swapped the registers into a cheaper on-chip buffer that was timed to empty precisely when the task is swapped back in. Yes, you still incur the swap, but the storage cost is much lower (on-chip vs. memory).
I have heard about processors with such a feature, but can't recall offhand which processors those might have been... The 68k used to have something like that for its Stack Pointer register.
 
This is really interesting - seeing the analysis of why the fp32 performance hit occurs. But if the hit is there, and it's large (say 50%) for a fully rendered scene, the outcome is inevitable: limit your fp32 usage per scene if you want speed.

If the bottleneck scales linearly, then a scene that is 100% fp16 and renders at 100 fps would achieve roughly the following fps at higher fp32 content:


fp32 % - fps
------------
10 - 95
20 - 90
30 - 85
40 - 80
50 - 75
60 - 70
70 - 65
80 - 60
90 - 55
100 - 50
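For what it's worth, here is the linear model behind that table spelled out (my own sketch; the 100 fps fp16 baseline and the 50% fp32 penalty are just the figures assumed above):

Code:
# Linear interpolation between 100 fps (all fp16) and 50 fps (all fp32),
# exactly as assumed in the table above - treat it purely as a rough model.
def estimated_fps(fp32_fraction, fps_fp16=100.0, fps_fp32=50.0):
    return fps_fp16 - (fps_fp16 - fps_fp32) * fp32_fraction

for pct in range(10, 101, 10):
    print(f"{pct:3d}% fp32 -> {estimated_fps(pct / 100):.0f} fps")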

Does that logic seem reasonable, and must game developers keep this sort of throughput trade-off at the top of their minds? Nothing is ever free; it is just another variable to deal with. I seem to remember NVidia saying to only use fp32 for complex lighting and mirrored or refractive surfaces.

Hope the analysis continues and that the main thrust of the thread is eventually returned to and answered by someone here!!!
 
arjan de lumens said:
One possible reason for the register scarcity may be that pixel shader register files are very expensive in hardware: if you have 4 rendering pipelines, each with 10 pipeline steps for pixel shader operation, you need 40 physical instances of the entire pixel shader register file for correct operation.

Um, not in any architecture that I've ever seen, and I've seen quite a few. The number of pipelines shouldn't affect the number of registers you need at all unless you are designing an OoO machine. In an in-order machine you only need enough registers to cover the architectural register set.

My honest guess is that they have a screwed-up register file organization: instead of each register file being Y deep * (X elements * Z width) with A ports, they did something like Y*X deep * Z width with A*X ports. That would get very expensive very fast, so you would want fewer fully ported files. It looks like an optimization for the non-common case.

Also, with 3 arithmetic units per pipeline, you need perhaps 10 or so ports into the register file or else you will starve the arithmetic units. So you end up with 40 register files, each with 32 128-bit registers and 10 ports each. That's something like 25-30 million transistors in total, and a routing nightmare. On register files alone.

If the Nvidia architects actually did something so stupid, then they should be shot. Seriously. That is just bad design. Realistically you are looking at 1 RF per pipe with X entries of 4x32 wide, with something like 4 read ports (3 for an ALU op, 1 for a tex op) and 2 write ports (1 for an ALU op and 1 for a tex op). Let's be honest, a pixel shader pipe is just a fairly simple execution pipeline, not so different from something like an SSE2 pipe on a standard x86 processor with the addition of a writable control store. Not really that complex at all.
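To illustrate the two organizations being contrasted here (my own crude cost model, counting only bit cells and assuming cell area grows roughly with the square of the port count - none of these figures come from any vendor):

Code:
# Very rough area proxy: a multi-ported storage cell needs roughly one
# wordline and one bitline pair per port, so cell area ~ ports^2.
def rf_area_proxy(files, regs_per_file, bits_per_reg, ports_per_file):
    return files * regs_per_file * bits_per_reg * ports_per_file ** 2

# One modestly-ported file per pipe (4 read + 2 write = 6 ports), as above:
a = rf_area_proxy(files=4, regs_per_file=32, bits_per_reg=128, ports_per_file=6)

# All pixel contexts merged into deeper files with the port count multiplied
# up (the "Y*X depth with A*X ports" case), e.g. 10 contexts and 10 ports:
b = rf_area_proxy(files=4, regs_per_file=32 * 10, bits_per_reg=128, ports_per_file=10)

print(round(b / a, 1))   # ~28x larger under this crude model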

Aaron Spink
speaking for myself Inc
 
aaronspink said:
arjan de lumens said:
One possible reason for the register scarcity may be that pixel shader register files are very expensive in hardware: if you have 4 rendering pipelines, each with 10 pipeline steps for pixel shader operation, you need 40 physical instances of the entire pixel shader register file for correct operation.

Um, not in any architecture that I've ever seen, and I've seen quite a few. The number of pipelines shouldn't affect the number of registers you need at all unless you are designing an OoO machine. In an inorder machine you only need enough registers to cover the architectural register set.
Here is where GPUs become different from everything else. In GPU architecture discussions, a 'pipeline' typically refers to a unit that processes 1 pixel per clock, regardless of the actual number of parallel execution units it contains. (Each 'pipeline' roughly corresponds to one core in a CMP processor architecture; in NV30/35 and R300/350, it appears that each 'pipeline' as such is actually 3-way superscalar.) At any time, each 'pipeline' works on a separate pixel and therefore needs its own pixel shader register file(s). Also, within each 'pipeline', you will want to swap execution between different pixels every clock cycle to avoid stalls (sort of like in the Cray MTA processor, which I suspect you have a passing knowledge of), which obviously requires 1 register file per pipeline per pixel in flight. So there you have the reason to have 40 instances of the register file.
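The arithmetic behind the 40-instance figure, spelled out (my own sketch using the numbers quoted in this thread):

Code:
# 4 'pipelines', each interleaving 10 pixels in flight, each pixel needing its
# own copy of the architectural pixel shader registers (say 32 x 128-bit).
pipelines, pixels_in_flight, regs, bits = 4, 10, 32, 128

register_file_instances = pipelines * pixels_in_flight     # 40
total_bytes = register_file_instances * regs * bits // 8   # 20480 bytes
print(register_file_instances, total_bytes)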
 
Can no one answer the question raised?

Plenty of interesting stuff on where the limits arise from - and I really do thank you for that.

But back to the question about the implications of these limitations for game developers:

1. Is it "code to OPEN standards for ATi and proprietary for NVidia, or face a 50% performance hit" if you use too much high colour precision and run into NV3x register limits at fp32?

2. How much of a scene is it reasonable to request fp32 colour precision for? Should we expect it to be used only very sparingly, or is it fair to expect liberal use of high colour precision broadly throughout modern games?

/sits_back - waits to see where the conversation goes this time!
 
The situation would be solved by better compilers and shading languages. It's not an either/or.

The current open standards are not expressive enough to capture the full semantics that programmers would like to write (at least ARB_fragment_program).

On the other hand, the OGL2.0 shading language includes an INT datatype which could represent Nvidia's FX12. By default, an integer will probably be converted to a floating-point register, but on NV3x it could fit into an FP16 or FX12 register.

Programmers could declare their intent by fetching textures into INTs and performing calculations on INTs as a hint that they want only 16-bit per component precision. The only downside is that it won't have floating point's range.

That's why the "half" float type exists in some HLSLs. Microsoft also added a "double" (FP64).
 
Hi g__day,

If you look at how the DX9 standards are formed in the first place, then much of the answer should be apparent. Ideally, Microsoft should form a standard which is powerful, easy and sensible to program for. But since the hardware is outside of MS's control, the spec must be based on hardware from the IHVs. It's well known that Microsoft likes to rotate its favoured members among the hardware makers, which has had some impact on the success and failure of various chip makers since the first iterations of DirectX. When DX8 needed to be nailed down, there were shader specs from 3dfx, NVIDIA and ATI. NVIDIA was chosen for that generation. DX9 came and ATI's R300 was chosen. (rotation habit)

So for NVIDIA, it isn't really that they aren't following the spec so much as that the spec was not based on their hardware. And there's no way you can just tell your hardware to work word-for-word to the spec without committing performance suicide. So they offer alternatives, which is perfectly logical.

-James
 
JF_Aidan_Pryde,

Take that a bit further: when DX9 was being specified, NVidia was a Goliath to ATi's David - a brave choice for MS.

It doesn't explain the huge OpenGL ARB2 vs NV30 code path performance delta in Doom 3 - whereas register-constrained NV3x hardware seems an increasingly realistic explanation.

Surely not everyone designed 3D API standards that were deliberately anti-NVidia. I could believe everyone took a proposed NVidia spec as gospel and then designed the API to that level - then once the silicon showed a problem it's too late to change and you have to look for a lot of workarounds. Or did game developers hear "full cinematic rendering - Final Fantasy-quality graphics in real time" and say "great, whack it into all our games" - and NVidia just went "oh, shit!"?

Is NVidia really saying that the state of the art today doesn't require more than 10% high colour precision in any scene because the top hardware simply can't deliver it? Design your games for the mass market anyway, so severely limit high colour precision usage in any frame?

Which is fine until ATi says - Well, actually we can run everything at high colour precision - so go for your lives - NVidia must have some problem /shrug.

Is NVidia really trying to say requesting high colour precision everywhere in a scene is poor planning because either 1) it is - and/or 2) NVidia takes a huge speed hit when you insist on it?
 
Sigh, :rolleyes:

Please do not bring The Inquirer into what up to now has been an intelligent debate. We may be light on the number of experienced people wishing to add their view - and I did ask Dave to comment - but I greatly appreciate all that people have expressed to date.

Suggesting we add material from the toilet bowl of technical analysis doesn't help my cause, mate!

http://discuss.futuremark.com/forum/showflat.pl?Cat=&Board=miscgeneral&Number=2561569

<SwedBear> Do you have any comment as to the claim at the Inquirer that PS 2.0 support is broken in Nvidia drivers? (http://www.theinquirer.net/?article=10602)

<Kardon> How often do you guys do these community Q&A sessions?

<NV_Derek> We do not comment on anything that gets posted on the Inquirer , but I will tell you that this is not true
 
I actually preferred it when the debate was still about CPU-like pipelining and/or parallelism, because it's highly interesting.

Anyway:

Aidan,

When DX8 needed to be nailed down, there were shader specs from 3dfx, NVIDIA and ATI. NVIDIA was chosen for that generation. DX9 came and ATI's R300 was chosen. (rotation habit).

Usually all IHVs try to contribute to a future DX runtime version. There, Microsoft (the way I see it) tried/tries to satisfy as many IHVs' requests as possible. DX8/DX8.1 turned into a complete mess, with the average layman still hardly able to keep track of which shader versions belong to which VGA. Where exactly you see the rotation habit is beyond me. DX8.1 was added to allow NVIDIA to include PS1.3 and ATI PS1.4.

With DX9.0, IMHO Microsoft not only wanted to give the API a longer lifespan to make things easier for developers and give the constant update requests from IHVs a rest, but also to target it at its future LONGHORN OS.

PS/VS 2.0 (<= 96 instructions)
PS/VS 2.0 extended (96-512 instructions)
PS/VS 3.0 (512 to ~32k instructions)

Where exactly do you see any favours? ATI just tried to design an architecture that could deliver the highest possible performance with the highest possible quality, and apparently succeeded, whereas IMHO NVIDIA rather underestimated the importance of high-accuracy buffers. In a nutshell: how often have we heard answers to questions about functionality X, Y or Z from NV since the introduction of the NV3x line, where the usual answer is "yes, but currently not exposed in drivers"? I guess Microsoft is responsible for that too? Unless someone is trying to tell me that M$ writes drivers for ATI as well, for whom said functionality has been exposed since day one.

Furthermore, if you have the impression that only those two contributed to the shaping of DX9.0, you're simply wrong.

So for NVIDIA, it isn't really that they aren't following the spec so much as that the spec was not based on their hardware. And there's no way you can just tell your hardware to work word-for-word to the spec without committing performance suicide. So they offer alternatives, which is perfectly logical.

I still fail to see how DX9.0 was supposedly tailored to the R3xx while actually putting NV3x at a serious disadvantage. The responsibility lies exclusively in NVIDIA's design decisions/architectural layout, whereas ATI managed to deliver a more efficient design with better quality trade-offs. Yes, they may be limited to FP24, but I'm not sure that higher accuracy is absolutely needed for the early, partial DX9.0 implementations. Full DX9.0 games/applications are of course a completely different story.

***edit: if I didn't make it clear enough, all of the above is just IMHO. It is of course open to corrections :)
 
Hi Ail,
My ideas on rotations were mainly from Brian Hook (ex-id, ex-3dfx), from the old fools.com investment forums. He told long stories of how MS would try to level the playing field by changing their favourite IHVs every cycle. This is apparently the reason why low-key players like Bitboys got things like EMBM into the spec. Of course the spec is based on contributions from many IHVs, but the 'emphasis' isn't as clear cut. ;)
 