How does a modern GPU work?

agent_x007

Newcomer
First of all: hello everyone :)

I know it's a silly (and quite complex) question, but I want to see if I understand the idea of 3D rendering in the context of GPU architecture correctly :)
A simple way to put it: I want to join this:
pipeline.png


With these:
GT200fullblock.png

GK110Block.png


I do know (more or less) how Deferred Rendering works (and since it's quite common in games these days, I want to base my description on it).

So, here's how I think a modern GPU works:

DX10 (as handled by GT200, aka GTX 2xx):
We start at the VS (Vertex Shader), which works on points in space (delivered by the CPU over PCIe).
The VS transforms them (move, rotate, etc.) any way we want, and they are combined to form vertices and primitives/objects.
The Vertex Shader stage uses the "Streaming Processors" (SPs) in GT200, combined with cache and VRAM to move data around.
The next stage, the Geometry Shader (GS), does a similar thing to the VS, but on a larger scale (whole primitives/objects), plus it can create new vertices (the VS can't make new geometry).
Like the VS, the GS is handled by the SPs.
After we complete all transformations, we rasterise the image (i.e. convert from the 3D space of vertices to the 2D space of pixels: triangles in, pixels out) using the Rasteriser in the GPU.
Next, we have to "put wallpaper" on our new pixels to figure out what colour they should have (using the GPU's TMUs), and after that we pass them on to the Pixel Shader (PS), which can do interesting stuff to them and takes care of lighting the entire scene (the Deferred Rendering "thing"). The PS is again done by the SP units in the GPU.
An important thing to note is that the SPs operate in blocks, and if they are in the same SM they can't do two different things (like PS and VS) at the same time.
After all that, all that's left is to blend our pixels into something more useful than raw numbers, so we feed them to the ROP unit(s), which give us an image frame as output.
Once we have it, we send it through the RAMDAC(s) (a RAMDAC translates a frame into something the monitor can understand) to the monitor(s).
I didn't mention culling and clipping, since they take place in almost every stage (culling/clipping reduces the workload for that stage and those after it).
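
To make the ordering concrete, here's a tiny C++ sketch of that logical flow as I understand it. Every name here is made up and each stage is just a stub, so it only shows the order of stages, not real hardware or driver code:

```cpp
#include <array>
#include <vector>

struct Vertex { float pos[4]; };
struct Pixel  { int x, y; float rgba[4]; };
using Triangle = std::array<Vertex, 3>;

// Stubs standing in for the pipeline stages described above.
Vertex vertexShader(Vertex v) { return v; }                              // runs on SPs
std::vector<Triangle> geometryShader(const Triangle& t) { return {t}; } // SPs; may emit 0..N triangles
std::vector<Pixel> rasterize(const Triangle&) { return {}; }            // fixed-function unit
Pixel pixelShader(Pixel p) { return p; }                                // TMU sampling + SP shading
void rop(std::vector<Pixel>& frame, const Pixel& p) { frame.push_back(p); } // blend/output

std::vector<Pixel> drawTriangles(std::vector<Vertex> verts) {
    std::vector<Pixel> frame;
    for (auto& v : verts) v = vertexShader(v);                 // VS: transform points
    for (size_t i = 0; i + 2 < verts.size(); i += 3) {
        Triangle in{{verts[i], verts[i + 1], verts[i + 2]}};
        for (const Triangle& t : geometryShader(in))           // GS: whole primitives
            for (const Pixel& p : rasterize(t))                // 3D -> 2D, pixels out
                rop(frame, pixelShader(p));                    // PS, then ROP blend
    }
    return frame;   // the finished frame heads to the RAMDAC/display
}
```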

In DX11, after the Vertex Shader and before the Geometry Shader we have a Tessellation stage (which consists of the Hull Shader, the Tessellator unit and the Domain Shader). It can create enormous amounts of new geometry REALLY FAST (the Tessellator is fixed-function hardware, like T&L in days past or fast video decoding today :) ).
The Hull Shader takes care of controlling the other stages, the Tessellator... tessellates, and the Domain Shader combines the data gathered from all previous stages (including the Vertex Shader) and prepares it for the Geometry Shader.
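
Again as a rough, invented C++ sketch (the patch evaluation is left as a stub, and the tessellation factor and triangle-domain point grid are just illustrative):

```cpp
#include <vector>

struct ControlPoint { float pos[3]; };
struct Patch        { std::vector<ControlPoint> points; int tessFactor; };
struct UV           { float u, v; };   // parametric location on the patch
struct Vertex       { float pos[4]; };

// Hull Shader (programmable): decides how finely each patch is split.
Patch hullShader(Patch p) { p.tessFactor = 16; return p; }

// Tessellator (fixed function): cheaply emits a flood of sample points.
std::vector<UV> tessellate(const Patch& p) {
    std::vector<UV> uvs;
    const int n = p.tessFactor;
    for (int i = 0; i <= n; ++i)              // triangle-domain grid:
        for (int j = 0; j <= n - i; ++j)      // O(n^2) new points per patch
            uvs.push_back({float(i) / n, float(j) / n});
    return uvs;
}

// Domain Shader (programmable): turns each (u,v) into a real vertex
// using the patch data, ready for the Geometry Shader.
Vertex domainShader(const Patch&, UV) { return {}; }
```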
VS, HS, DS, GS and PS are all handled by CUDA Cores (a new marketing term, since the SPs can't handle the HS and DS stages, i.e. GT200 is not capable of DX11).
CUDA Cores (CCs) are present in Fermi-, Kepler- and Maxwell-based cards.
CCs have the same limitation the SPs had (they work in groups, and the CCs of the same group can't handle different tasks).

The other stages are pretty much the same as the ones in DX10 (although they do have more capabilities than the old versions).
From a GPU perspective, it's worth noting that all DX10 GPUs have only one Rasteriser (the "thing" that changes 3D to 2D), while DX11-based ones can have 4 or even 5 of them working in parallel.

OK, that's it (I THINK I got it right).
I don't need ALU- or register-level detail here; I just want to know if my thinking (and understanding of it all) is correct.

PS. One other thing:
I know Immediate/Direct Rendering (DR) differs from Deferred Rendering in the lighting stage: in DR it takes place early (i.e. in the Vertex Shader).
But are there any other things of this type, or other stuff, that make the GPU handle it differently?

Thanks for all responses.
 
Here comes the noob question

If it creates 3 vertices, that's a triangle, and that's geometry, isn't it?

The vertex shader has no concept of geometry to begin with. The input is a vertex and associated parameters, and the output is a vertex and associated parameters, so it's 1:1. So the vertex shader doesn't really create vertices either; it's more like it changes them.

The vertices could end up being used as part of one triangle, multiple triangles (a vertex shared as a corner of adjacent triangles), no triangles (if it was clipped or culled), or higher-level geometry that's converted into triangles later.
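
A sketch of that contract in C++ (the field names are purely illustrative): the signature alone shows why a vertex shader can change a vertex but never add or remove one.

```cpp
// Illustrative only: the vertex shader's contract is one vertex in,
// one vertex out. It never sees a triangle or a neighboring vertex.
struct VSInput  { float position[3]; float normal[3]; float uv[2]; };
struct VSOutput { float clipPos[4];  float normal[3]; float uv[2]; };

VSOutput vertexShader(const VSInput& v) {
    VSOutput out{};
    // out.clipPos = worldViewProj * float4(v.position, 1);  // conceptually
    return out;
}
```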
 
A common misconception is that the vertex shader stage runs first and transforms all the vertices. This is not how it goes.

Assuming indexed triangle rendering (the most common one):
The primitive assembler reads 3 indices from the index buffer. The GPU checks the "post-transform vertex cache" (also known as the parameter cache) for each index, and runs the vertex shader only for those vertices that were not in the cache. When all 3 vertices of a triangle are transformed, the transformed vertex positions go to a fixed-function unit that first determines the rough coverage of the triangle (*) and then the fine (2x2 quad) coverage of the triangle (*). This results in a variable number of pixel shader instances (grouped in 2x2 quads). The GPU starts to execute these pixel shader instances immediately, while still continuing the primitive assembler work and the vertex shader work (and generating new pixel shader instances based on triangle coverage) at the same time. In the end the pixel shader output is tested against the triangle coverage (not all pixels in a 2x2 tile are inside the triangle), and against depth and stencil. Pixels that fail the tests are rejected.
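
A rough C++ sketch of that assembly loop (the cache here is an unbounded map purely for readability; real post-transform caches are small FIFO-like buffers, and every name below is invented):

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Vertex { float pos[4]; };

// Stub standing in for "fetch vertex attributes + run the vertex shader".
Vertex runVertexShader(uint32_t index) { return {}; }

void assembleIndexedTriangles(const std::vector<uint32_t>& indexBuffer) {
    std::unordered_map<uint32_t, Vertex> postTransformCache; // "parameter cache"
    for (size_t i = 0; i + 2 < indexBuffer.size(); i += 3) {
        Vertex tri[3];
        for (int k = 0; k < 3; ++k) {                 // 3 indices per triangle
            const uint32_t idx = indexBuffer[i + k];
            auto hit = postTransformCache.find(idx);
            if (hit == postTransformCache.end())      // miss: shade the vertex
                hit = postTransformCache.emplace(idx, runVertexShader(idx)).first;
            tri[k] = hit->second;                     // hit: reuse shared work
        }
        // tri[] now goes to fixed-function coverage setup, which spawns
        // pixel-shader quads while this loop keeps feeding it triangles.
        (void)tri;
    }
}
```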

Both vertices and pixels are actually grouped into waves/warps before execution (32 elements on NVIDIA hardware, 64 elements on AMD hardware, 8/16/32 on Intel hardware). This adds a little bit of latency to the pipeline, but simplifies the GPU execution a lot, since the later stages don't need to process/bookkeep single vertices/pixels.
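
To put numbers on that grouping, a back-of-the-envelope example in C++ (the triangle's coverage is made up):

```cpp
#include <cstdio>

int main() {
    // Hypothetical triangle covering 150 2x2 quads = 600 pixel-shader lanes.
    const int lanesNeeded = 150 * 4;
    for (int waveSize : {32, 64}) {      // NVIDIA warp / AMD wave sizes
        const int waves = (lanesNeeded + waveSize - 1) / waveSize; // round up
        std::printf("wave size %2d: %2d waves, %d lanes idle\n",
                    waveSize, waves, waves * waveSize - lanesNeeded);
    }
    return 0;
}
```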

(*) In these stages the pixel shader instances might be culled by hierarchical depth buffering and by early depth/stencil test.
 
(*) In these stages the pixel shader instances might be culled by hierarchical depth buffering and by early depth/stencil test.
Right! I was wondering why the depth/stencil checks come after pixel shading work has already been kicked off! There's early-Z and such to cut down on the unnecessary workload.

Anyway, why is it that the depth check (not counting early-Z stuff) is located at pretty much the very end of the pipe? Is it just easier to build the GPU that way, or is it maybe just some historical reason, i.e. a carry-over from the classical Silicon Graphics rendering pipe or somesuch ("it's always been that way")?

Cheers! :D
 
Anyway, why is it that the depth check (not counting early-Z stuff) is located at pretty much the very end of the pipe? Is it just easier to build the GPU that way, or is it maybe just some historical reason, i.e. a carry-over from the classical Silicon Graphics rendering pipe or somesuch

That is the OpenGL (well probably even the preceding IrisGL**) model of the SGI machines***. It permits using the texture to control visibility of the polygon with a fixed pipeline processing order.

Of course, it is a pain from a performance perspective since most of the scene doesn't require alpha testing.


**But I don't have time to go and check through the ancient documentation.
***Possibly other early workstation systems as well (e.g. Apollo, Sun, HP), but someone else would need to check.
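
A minimal C++ sketch of the ordering constraint being described (all names are invented, and the alpha test is modeled as the shader returning false):

```cpp
// Why texture-controlled visibility forces a late depth test: the shader
// can kill the pixel based on a texture fetch, so depth can't be finally
// resolved before shading.
struct Fragment { float u, v, depth; };

float sampleAlpha(float u, float v) { return 1.0f; }  // stub TMU fetch

bool pixelShaderSurvives(const Fragment& f) {
    return sampleAlpha(f.u, f.v) > 0.5f;   // alpha test / "discard"
}

void lateDepthPipeline(Fragment f, float& depthBufferValue) {
    if (!pixelShaderSurvives(f)) return;     // must shade first...
    if (f.depth >= depthBufferValue) return; // ...only then test depth
    depthBufferValue = f.depth;              // write depth
    // write color
}
```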
 
A common misconception is that the vertex shader stage runs first and transforms all the vertices. This is not how it goes.
It might, though, and it is roughly what happens on tile-based renderers (where geometry work from one render pass overlaps raster work from the previous render pass).
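
As a very rough C++ sketch of that two-phase flow, with an invented tile size and helper (this is the shape of the idea, not any particular GPU):

```cpp
#include <vector>

struct Vertex   { float pos[4]; };
struct Triangle { Vertex v[3]; };

constexpr int kTileCount = 40 * 30;   // e.g. 32x32-pixel tiles on a 1280x960 target

// Stub: which tiles does this (already transformed) triangle touch?
std::vector<int> tilesOverlapping(const Triangle&) { return {0}; }

std::vector<Triangle> bins[kTileCount];

// Phase 1: transform and bin ALL geometry for the render pass up front.
void binGeometry(const std::vector<Triangle>& transformedTris) {
    for (const Triangle& t : transformedTris)
        for (int tile : tilesOverlapping(t))
            bins[tile].push_back(t);
}

// Phase 2: rasterize/shade each tile from its bin, in fast on-chip memory.
// This phase for one pass can overlap phase 1 of the next pass.
void shadeTile(int tile) { /* raster + pixel shading for bins[tile] */ }
```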
 
I found additional slides that best show what I meant:
0vTvRS6.png

kcvuSor.png

IA = Input Assembly,
SO = Stream Out,
"GEOM" is basicly a prototype of Polymorph Engine from later NV GPUs (which combines all fixed function units from above slide, except for ROPs and Texture blocks).
FB is divided on slide, however it's not divided in real GPU (split was done for "neatness" I think).

Source of the slides: LINK (.pdf, slides 52 and 53)
 