Questions on current 3D hardware

krychek

Newcomer
I'm an intermediate OpenGL hobby programmer, and current 3D technology has become too much for me to grasp easily. I keep hearing about lots of things, but they are not clear to me. Here are some of the questions that are nagging me. I'll be very thankful if these questions are answered.

1) Architecture-wise, how are the vertex and fragment (pixel) parts of the GPU different? Is it true that the vertex part is deeply pipelined whereas the fragment part is parallelized?

2) Where are the FIFO queues used in the GPU?

3) What are read-modify-write hazards? Why do they happen?

4) I'm very confused about the current generation of texture units. How are they arranged? Four pipelines with one unit each, four pipelines with two units each... how do these differ? And how do they support more textures than there are texture units in a pipeline?

5) Vertex cache: How are triangle lists more cache friendly than triangle strips?

Thanks a lot.
 
krychek said:
1) Architecture-wise, how are the vertex and fragment (pixel) parts of the GPU different? Is it true that the vertex part is deeply pipelined whereas the fragment part is parallelized?

2) Where are the FIFO queues used in the GPU?

3) What are read-modify-write hazards? Why do they happen?

4) I'm very confused about the current generation of texture units. How are they arranged? Four pipelines with one unit each, four pipelines with two units each... how do these differ? And how do they support more textures than there are texture units in a pipeline?

5) Vertex cache: How are triangle lists more cache friendly than triangle strips?

Some quick comments :

1) In terms of arithmetic capabilities they are not really all that different. The key difference is that the amount of pixel processing required is much higher than the amount of vertex processing: there are substantially more pixels in a scene than vertices. Another issue is that the pixel side has to cope with a lot more latency than the vertex shader side, mainly because of texture access (especially dependent reads and random/noise-based arithmetic reads). If a texel fetch misses the cache, an external memory access is required, and due to buffering and other memory operations this can lead to latencies of hundreds of clock cycles. Current vertex shaders do not suffer from such long latencies. Hence more parallelism is required in the pixel shader (more pixels processed at the same time) because more latency has to be absorbed.
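
To put rough numbers on that (made up for illustration, not from any particular chip): if a miss costs on the order of 200 clocks and you want to keep issuing pixels every clock, the number of pixels in flight has to be on the order of latency times throughput.

[code]
// Made-up numbers, purely to show why latency forces many pixels in flight.
const int miss_latency_clocks = 200; // hypothetical external memory latency
const int pixels_per_clock    = 4;   // hypothetical pixel throughput

// To sustain 4 pixels/clock while some pixels wait ~200 clocks on memory,
// the hardware must track roughly 200 * 4 = 800 pixels in various stages.
const int pixels_in_flight = miss_latency_clocks * pixels_per_clock;
[/code]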

2) There are a lot of places where a FIFO can sit... input buffers, and most likely also between the vertex and pixel shader units (which avoids one unit immediately stalling the other).
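
A minimal sketch of why such a buffer helps, in generic C++ with no particular hardware assumed: the producing unit only stalls when the FIFO is full, and the consuming unit only idles when it is empty.

[code]
#include <cstddef>

// Fixed-depth FIFO between two pipeline units: the producer only stalls
// when it is full, the consumer only idles when it is empty.
template <typename T, std::size_t N>
class Fifo {
    T buf[N];
    std::size_t head = 0, tail = 0, count = 0;
public:
    bool push(const T& v) {            // e.g. vertex unit output
        if (count == N) return false;  // full: producer must stall
        buf[tail] = v;
        tail = (tail + 1) % N;
        ++count;
        return true;
    }
    bool pop(T& v) {                   // e.g. pixel unit input
        if (count == 0) return false;  // empty: consumer idles
        v = buf[head];
        head = (head + 1) % N;
        --count;
        return true;
    }
};
[/code]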

3) Not sure what you mean... a key issue with shaders is that a target register has not yet been written by instruction 1 when instruction 2, which requires the result of instruction 1 as an input, enters processing. Because the result is still within the pipeline you get a stall, since the input for the next instruction is not available yet. If you don't detect this, the wrong input value is used by instruction 2. This is something the hardware or the driver's compiler needs to catch and handle, not the developer.
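
Written out in plain scalar C++ terms (a sketch of the dependency, not real shader code):

[code]
// Read-after-write dependency: in a pipelined unit the second "instruction"
// cannot start until the first has actually written r0. Without hazard
// detection it would read a stale value of r0 instead.
float dependent_chain(float a, float b, float c) {
    float r0 = a * b; // "instruction 1" writes r0
    return r0 + c;    // "instruction 2" reads r0 as an input
}
[/code]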

4) Forget about pipelines and texture units... all that matters now is how many instructions you can do per clock, and of those instructions how many can be texture instructions and how many arithmetic instructions. The back-end might still be pipelined with some pixel pipe width limitations. Supporting more textures than there are texture instructions available is easy through loopback, which is like automated multipass.

5) Indexed triangle lists can use the same vertex a lot of times. Imagine a grid: a center vertex is connected to six other vertices, meaning that using an index, this vertex can be transformed exactly once to build six triangles. With strips you would transform that vertex at least twice. In effect, using an index list you can point at the same vertex as often as you want; with strips you cannot (and often there is the overhead of degenerate triangles).
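
To make that concrete, a sketch using a hypothetical 3x3 grid (winding order ignored for clarity):

[code]
// 3x3 grid, vertex indices 0..8, centre vertex is 4:
//   0--1--2
//   | /| /|
//   3--4--5
//   | /| /|
//   6--7--8
// With the usual diagonal split, vertex 4 appears in 6 of the 8 triangles.
// An indexed draw transforms it once; the post-transform vertex cache can
// serve the other five references.
const unsigned short indices[] = {
    0, 1, 3,   1, 4, 3,   // top-left quad
    1, 2, 4,   2, 5, 4,   // top-right quad
    3, 4, 6,   4, 7, 6,   // bottom-left quad
    4, 5, 7,   5, 8, 7,   // bottom-right quad
};
// e.g. glDrawElements(GL_TRIANGLES, 24, GL_UNSIGNED_SHORT, indices);
[/code]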

K-
 
Thanks for your detailed replies. Although you asked me to forget about pipelines and texture units, I'm curious how the hardware implements them.

Regarding the texture units... if the hardware can do two texture instructions per cycle and, say, I'm using four different textures, then doesn't the hardware have to switch the textures on the second (auto)pass? But texture switching is supposed to be costly, so how can the texture(s) be switched on a per-fragment basis without a major performance hit? So there must be some other scheme for this, where the maximum number of textures that can be bound at any one time is greater than the number of texture instructions that can be done in one cycle. What is this scheme?

I guess the F-buffer scheme of ATI is used to multipass when more textures are used than the hardware supports, and that doesn't have to switch textures on a per-fragment basis.

Is the number of texture interpolators equal to the max tex instructions per cycle?

Why is texture switching considered costly? Because the texel cache has to be flushed? What exactly happens when a texture is bound (assuming the texture is already in video memory)?

Thanks again.
 
The hardware can store a certain number of texture parameters, like address in local video memory, format, etc. With PS2.0 this number has gone up to 16 different textures, so the hardware is aware of up to 16 different textures at the same time. If you change the binding of a texture to a "sampler", this info needs to be updated, and this has to be done in such a way that it does not conflict with operations still in progress. This might mean that the hardware needs to flush the pipeline. After all, say sampler 0 is set to texture 1, you are processing some polygons using this setting, and then you change sampler 0 to texture 2. You need to sync up, since you cannot just immediately change this info: there might still be pixels/polygons in progress that need sampler 0 to be set to texture 1 and not 2. Hence changing a texture binding might require a pixel pipe flush, which takes time and might create stalls, slowing things down.
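
In OpenGL terms the risky state change is simply a rebind between draw calls. A sketch, where the texture handles and draw functions are hypothetical and the actual cost is entirely driver- and hardware-specific:

[code]
// Hypothetical GLuint handles, already uploaded to video memory.
glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_2D, texture1);
drawFirstBatch();   // polygons sampling texture 1 via unit 0

// Rebinding unit 0 is what may force the hardware to drain pixels that
// are still in flight and still expect the old binding.
glBindTexture(GL_TEXTURE_2D, texture2);
drawSecondBatch();  // polygons sampling texture 2 via unit 0
[/code]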

Another thing to remember is that higher numbers of textures bound to samplers are not so much an issue; the trick becomes handling all of this in the cache. If you are accessing 16 different textures to produce each pixel, then obviously the cache is going to run into some kind of problem, since it would need to store sections of 16 different textures... if they don't fit, then textures might be flushed and refetched for every pixel, spelling disaster for performance.

K-
 
Kristof said:
4) Forget about pipelines and texture units... all that matters now is how many instructions you can do per clock, and of those instructions how many can be texture instructions and how many arithmetic instructions. The back-end might still be pipelined with some pixel pipe width limitations. Supporting more textures than there are texture instructions available is easy through loopback, which is like automated multipass.

You wouldn't be hinting at Series 5 being non-traditional like the NV30, would you, Kristof? ;)


Uttar
 
PowerVR have supported "Loopback" for much longer than the IMR vendors, as it's very easy to do on a TBR architecture.
 
Uttar said:
You wouldn't be hinting at Series 5 being non-traditional like the NV30, would you, Kristof? ;)

I thought NV30 was very traditional :?:
And I don't hint... I tease...

K-
 
krychek said:
1) Architecture-wise, how are the vertex and fragment (pixel) parts of the GPU different? Is it true that the vertex part is deeply pipelined whereas the fragment part is parallelized?
It doesn't really matter. The pixel part is more complex, mostly because it has to handle the huge latencies involved in texture lookups to AGP. As we move on and get texture fetches in the vertex unit, both of them will look a lot more alike.
2) Where are the FIFO queues used in the GPU?
Just about anywhere you need latency compensation - whenever you know you will have to wait for some result.
3) What are read-modify-write hazards? Why do they happen?
I'm not sure what you mean by this.
4) I'm very confused about the current generation of texture units. How are they arranged? Four pipelines with one unit each, four pipelines with two units each... how do these differ? And how do they support more textures than there are texture units in a pipeline?
Ahhh, my favourite drum to bang. They shouldn't be viewed as 'pipelines' and 'texture units' so much any more. What is '4 pipelines' may be '1 pipeline pushing 4 pixels per clock'. It could even be '1 pipeline pushing 8 pixels every 2 clocks'. Or some esoteric architecture.

As regards texture lookups, there's no reason a texture unit can't be sent multiple requests by a single pixel. Things can travel through the same pipe stages many times.

As an analogy: a CPU matrix-multiply makes use of the multiplication unit many times, even though it only has one multiplication unit.
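
Spelling that analogy out in ordinary C++ (nothing GPU-specific assumed):

[code]
// 4x4 matrix times vector: sixteen multiplies all flow through the CPU's
// single multiplication unit over successive cycles -- one piece of
// hardware, used many times per result.
void mat4MulVec4(const float m[4][4], const float v[4], float out[4]) {
    for (int row = 0; row < 4; ++row) {
        out[row] = 0.0f;
        for (int col = 0; col < 4; ++col)
            out[row] += m[row][col] * v[col]; // same multiplier, reused
    }
}
[/code]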
5) Vertex cache: How are triangle lists more cache friendly than triangle strips?
They can be. But they might not be. Cache-friendliness is defined by 'how much do things get reused' and 'how much memory do you read, but not use, because the lines don't align with the data'.
 
Is the number of texture interpolators equal to the max tex instructions per cycle?
Probably not, but it depends on the piece of hardware in question.
Why is texture switching considered costly? Because the texel cache has to be flushed? What exactly happens when a texture is bound (assuming the texture is already in video memory)?
There are many possible reasons. Flushing is a major issue, but there's no reason you would need to flush unless the texture data has been changed (upload, subimage, etc.): standard cache eviction rules (LRU) would clean out the old data anyway.

Binding a new texture involves updating registers which are a long way down the pipeline. If for some reason that can't happen immediately, the pipeline will have to be allowed to drain until it can - and a pipeline full of bubbles is just plain wasted performance.
 
Kristof and Dio, thanks for your replies.

I have one last question. When you say 4 pixels per clock, does this mean something like 4 pixels per clock with 1 texture, 2 pixels per clock with 2 textures, and so on? Doesn't a clock cycle always take the same time? And do we get the same performance (speed) with nearest filtering as with anisotropic filtering?

Sorry for being such a newbie about current cards. I have a very old card (forget about shaders, this one doesn't even have hardware T&L :)).
 
It depends on the architecture. A reasonable assumption now is that bilinear filtering costs no performance. Trilinear and anisotropic may, depending on the situation and the hardware.

Some cards might have free trilinear; some might have free anisotropic - at least under some circumstances.

The idea nowadays is to get as much effort out of each transistor as possible - so the more the cards evolve, the less 'hard wired' things will be.

In general what you say is true, that if it's X ppc with 1 texture, then it will be X/2 ppc with 2.
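
With made-up numbers, just to show the arithmetic (not any particular card):

[code]
// Hypothetical part: 300 MHz core, 4 pixels per clock with 1 texture each.
const double clock_hz       = 300e6;
const double pixels_per_clk = 4.0;
const int    textures_used  = 2;  // shader samples 2 textures per pixel

// Each pixel now needs two trips past the texture stage, so throughput
// halves: 300e6 * 4 / 2 = 600 Mpixels/s, versus 1200 Mpixels/s with one.
const double fillrate = clock_hz * pixels_per_clk / textures_used;
[/code]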
 
krychek said:
I guess the F-buffer scheme of ATI is used to multipass when more textures are used than the hardware supports, and that doesn't have to switch textures on a per-fragment basis.

I could be wrong about this (and probably am), but I thought the F-buffer was employed to avoid having to loop shader instruction chains in a multipass mode. The hardware could always do multipass loopback on instructions to avoid a limit, but at the expense of performance. IMO, it's a marketing bullet, since nominally complex instruction chains of even 100 can tax the hardware.
 
WaltC said:
I could be wrong about this (and probably am), but I thought the F-buffer was employed to avoid having to loop shader instruction chains in a multipass mode. The hardware could always do multipass loopback on instructions to avoid a limit, but at the expense of performance.

Well... it's kinda interesting for speeding up offline rendering, where speed matters (but not as in real-time) and shaders can be very long indeed.

WaltC said:
IMO, it's a marketing bullet, since nominally complex instruction chains of even 100 can tax the hardware.

Not only *can* they tax the hardware, they in fact do (having written shaders with ~50-60 arithmetic instructions for the Shader Competition)... ;)
 
WaltC said:
krychek said:
I guess the F-buffer scheme of ATI is used to multipass when more textures are used than the hardware supports, and that doesn't have to switch textures on a per-fragment basis.

I could be wrong about this (and probably am), but I thought the F-buffer was employed to avoid having to loop shader instruction chains in a multipass mode. The hardware could always do multipass loopback on instructions to avoid a limit, but at the expense of performance. IMO, it's a marketing bullet, since nominally complex instruction chains of even 100 can tax the hardware.

No, you are right. I mixed it up with the F-buffer as introduced by the Stanford paper. ATI's F-buffer seems to handle only long shaders. As mentioned in this thread, the F-buffer seems to be used only when the instruction count exceeds 160.

Damn. When will they implement the ability to use a large number of textures? But for current hardware, using the F-buffer only when the maximum instruction count is exceeded is probably sensible, as that many textures are rarely used, nor could they be stored and switched with acceptable performance.

Wow, this forum has many good threads. It would be nice if a list of good threads were maintained.
 
Kristof said:
Current vertex shaders do not suffer from such long latencies. Hence more parallellism is required in the pixel shader (more pixels processed at the same time) because more latency has to be absorbed.
That doesn't make much sense to me. Parallelism shouldn't have any effect on latency problems, unless the hardware is smart enough to dynamically allocate resources when one pipeline is held up (which I doubt... my intuition tells me that it would be very complex to pull off properly).

Pipelining would be the most straightforward way to hide latency (more pixels being processed in-flight per pipeline).
 
Only the word 'parallelism' is somewhat wrong - instead it should be 'number of things in flight'. The rest of his statement is quite right.
 