Various questions about GPUs.


Hi everybody,

I have various questions about GPUs. I'm not going to ask how things work in detail, but I would like to clear up some dark spots.

First, this must be a pretty simple one, but it's still obscure to me. I think it's related to the command processor and the thread dispatcher. I've read and re-read AlexV's latest reviews but it's sadly still unclear to me, so I'm going to try to explain what I don't get. It's about code and data, and where they originate and are handled.
The CPU/host sends commands and data to the GPU. The CP handles those commands and generates work/tasks for the thread dispatcher to dispatch and schedule.
First, it's unclear to me what the CPU really sends to the GPU, and how/where. Does the CPU send both data and orders/commands to the GPU, or only orders, with the GPU then reading the data from RAM?
For example, does the CPU send commands (i.e. what to do, say "manipulate these vertices") to the GPU and the GPU then reads the vertex data the CPU placed in RAM, or does the CPU send everything together directly to the GPU?
Then it gets even messier for me once we pass vertex processing. Say vertex processing is done for some batches; the CP will be informed via the thread dispatcher, right? It then has "a result", but how does it know what to do with it? I mean, by looking at the pipeline it's obvious what it will do, but I can't figure out how.
Going further: the GPU has figured out what to do and moves from vertices to primitives, generating tasks of x primitives for the "ALUs" to handle; the thread dispatcher does its work, OK, so far so good. "Same player, shoot again": how do the "ALUs" know what to do with those primitives? I have an idea but it's vague (or worse...). I've read multiple times that for GPUs "code is data"; does that mean that the GPU will go read "commands" from RAM? At this stage the CPU no longer has anything to do with it, right? So this code (the shaders) has been put there by the developers. Is that the basic idea? In that case, how does the GPU know which data/commands (as they are memory objects) to load and execute? Is it due to some values carried initially with a vertex or a batch of vertices?
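To make my vague idea concrete, here is how I currently picture the "code is data" part, as a purely hypothetical sketch: the opcodes and packet layout below are made up (not AMD's real PM4 packets or any real format), they only illustrate the idea of a command buffer in memory whose packets point at the vertex data and at the compiled shader code.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Made-up command opcodes, purely for illustration.
enum class Opcode : uint32_t { SetVertexShader, SetPixelShader, SetVertexBuffer, Draw };

// A made-up packet: it carries GPU addresses, not the data or code itself.
struct Packet {
    Opcode   op;
    uint64_t address;  // where the shader code or vertex data sits in memory
    uint32_t size;     // byte size or vertex count, depending on the opcode
};

int main() {
    // CPU/driver side: write a command stream into memory and "ring a doorbell".
    std::vector<Packet> commandBuffer = {
        { Opcode::SetVertexShader, 0x10000, 512  },  // shader microcode lives at 0x10000
        { Opcode::SetPixelShader,  0x20000, 1024 },
        { Opcode::SetVertexBuffer, 0x80000, 4096 },  // vertex data lives at 0x80000
        { Opcode::Draw,            0,       128  },  // kick off 128 vertices
    };

    // GPU side: the command processor walks the stream and fetches the shader
    // code and vertex data from memory by itself; nothing is "pushed" per vertex.
    for (const Packet& p : commandBuffer) {
        switch (p.op) {
        case Opcode::SetVertexShader: std::printf("bind vertex shader at 0x%llx\n", (unsigned long long)p.address); break;
        case Opcode::SetPixelShader:  std::printf("bind pixel shader at 0x%llx\n",  (unsigned long long)p.address); break;
        case Opcode::SetVertexBuffer: std::printf("vertex data at 0x%llx\n",        (unsigned long long)p.address); break;
        case Opcode::Draw:            std::printf("launch %u vertices\n", p.size);  break;
        }
    }
    return 0;
}
```

Is that roughly the right mental model, or am I off again?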

Actually my question would be the same at each stage of the pipeline: how does the GPU know that it has to move on to the next part of the pipeline, and where does it fetch the code/commands from (even if, for it, they're somehow just more data)?

I would really appreciate it if those things were made clearer to me, as I feel like I have growing misconceptions about what is really going on in a GPU.

There's also something else that bothers me. There is a lot of talk about how many cores are in a given GPU, how to make them more "clever", etc. I tried to understand what one would call a core and why, and stuck with my super limited knowledge I ended up thinking that, from my POV, a GPU is still one core, as the command processor is still the only "clever" (in a CPU way) part of a GPU.
Following my logic, I think that before moving to something like Larrabee, GPUs first have to become "multi-core", or at least be able to work in a multi-core fashion. Right now two GPUs work together far less efficiently than, say, two single-core CPUs did back in the day. Larrabee is Larrabee (a bunch of CPUs augmented with SIMD), but I feel like current GPUs still have quite some road to travel to get there (if getting there really is that important a goal in the short term). Intel made this choice, but that doesn't mean there isn't another middle ground between current CPUs and GPUs, no?
Still following my logic, I wondered whether the GPU, viewed as a single core, has passed its optimal size and should have become truly "multi-core", i.e. the host/CPU sees multiple GPUs. From my POV/understanding, the command processor and thread dispatcher are critical parts in making the GPU more flexible, and while in volume they handle more than the CPU attached to the Larrabee SIMD (or is it the other way around :clown: ), I have the "feeling" (it's not based on facts, I mean I don't know) that the logic in a Larrabee core is more potent, in that it can do more things.
So when I say "the GPU has passed its optimal size", I mean that you could have a more "potent" command processor and thread dispatcher (not bigger in volume, not keeping track of that many threads, etc.) handling a reduced number of SIMDs. For example, it looks like for ATI the building block of the GPU is a SIMD array and its matching texture units; for Nvidia it seems (to me) to be the texture processor cluster (to which a given number of SIMD arrays, 2?, are attached). My idea (I don't state this in a pompous manner like I know better, it's more that I want to understand where my reasoning is messed up) is: would it make sense to build a tinier but fully formed GPU as the building block?
For example, in the same die area as a Cypress you could end up with "five cores". That's still a lot less logic overhead than in a Larrabee, and as those "cores" could end up sharing data, it would still be easier to address the "communication problems" than in a "sea of cores" design, especially if GPU manufacturers move to fully coherent caches supporting both read and write operations, no?

So basically those are my questions, roughly: "the code and data paths in a GPU, and the critical role of the command processor, as I feel it is the key to my first questions", and "why don't GPUs really go multi-core first, when everybody expects them to become a 'sea of cores' in the near future?"

Thanks in advance for your answers :)
 
Thanks a lot for those links, they were indeed useful :)
Is this your own blog?

So OK, it sounds like I was off on quite a few things. The CPU is completely in charge of giving the GPU its "orders/commands". So now the picture is way clearer: the GPU has its own ways of keeping track of what's going on, but every operation is CPU-driven.
I find the part about "state changes" really interesting, as the author presents it in a really comprehensible way.
I still have questions, though; I will use the ATI RV740 as an example. What is the maximum number of "states" the GPU can run simultaneously?
Is it 8, i.e. one per SIMD, or 16, since a SIMD can keep two "threads" active? Basically, what I'm asking is whether the two active threads have to run the same instructions.
But that's just part of my concern. Whether it's 8 or 16, it's still few compared to the number of threads the GPU keeps in flight, so the number of "state changes" has to be huge. The author says that "state changes" are expensive and advises keeping them minimal, but from my understanding they have to happen quite often. I think I still have a misunderstanding about what the author calls a "state change" and how it relates to "threads". I have an idea: basically you have to store instructions and constants somewhere, i.e. in buffers, so in a limited memory space; so while the hardware can switch threads quite often and has evolved to make the process less costly, you are still limited by that storage.
Say the CP and thread dispatcher may not care whether 3 "tasks" result in X threads or 10 "tasks" result in the same X threads, but the buffers, on the other hand...
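To check I understand what one "state" actually bundles together, here is a made-up illustration (the struct and its field names are invented for the sake of the example, they don't come from a real driver):

```cpp
#include <cstdint>
#include <cstdio>

// A hypothetical "render state" bundle. A state change would mean swapping
// (parts of) something like this, not switching between the thousands of
// threads the SIMDs juggle on their own.
struct RenderState {
    uint64_t vertexShaderAddr;   // which vertex program to run
    uint64_t pixelShaderAddr;    // which pixel program to run
    uint64_t constantBufferAddr; // shader constants (transforms, colors, ...)
    uint32_t blendMode;          // how shaded pixels combine with the render target
    uint32_t depthTestMode;      // depth/stencil configuration
    // ... textures, samplers, render target formats, etc.
};

int main() {
    std::printf("one state bundle here is %zu bytes\n", sizeof(RenderState));
    return 0;
}
```

If that's roughly right, then one bound state covers every vertex and pixel launched while it's active, so thousands of threads wouldn't need thousands of state changes?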

Extra insight would be welcome :)

EDIT
I don't know how you (members) manage to handle the terminology... damn, that's tough.
I could replace "task" with "shader" and "thread" with "task" and I'm not sure it would be any more accurate...
As in: so many shaders resulting in so many tasks working on batches of a given size, etc. :(
 
Section 2.3 of the 6xx/7xx 3D programming guide might help a bit:

http://www.x.org/docs/AMD/R6xx_R7xx_3D.pdf

The programming sequence (batch sizes etc. are for the 4770) is basically as follows (there's a rough code sketch after the list):

- set up state including shader programs
- pass one or more buffers of vertices into the GPU
- GPU groups the vertices into batches of 64 and assigns batches to each idle SIMD
- all of the vertices on a SIMD run the same shader instructions
- as each batch of vertices comes out of the vertex shader the vertices get assembled into primitives
- primitives get exploded out into quads, quads get grouped into batches of 64 pixels (16 quads, I guess)
- batches get assigned to each free SIMD, competing for SIMD time with vertex batches
- all of the pixels on a SIMD run the same shader instructions
- as batches of pixels come out they go through DB/CB and out to video RAM
- NOW you change state (maybe a new shader program or something)
- repeat, drawing more stuff
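
Written out as driver-side pseudocode, it looks something like this. The function names and types are just illustrative stand-ins for "write the right packets into the command buffer", not a real API:

```cpp
#include <cstdio>
#include <cstddef>

struct ShaderProg   { const char* name; };        // compiled shader microcode
struct VertexBuffer { std::size_t vertexCount; }; // vertex data already in memory

// Stand-in for "set up state including shader programs".
void setState(const ShaderProg& vs, const ShaderProg& ps) {
    std::printf("state change: VS=%s PS=%s\n", vs.name, ps.name);
}

// Stand-in for "pass a buffer of vertices into the GPU". From here the
// hardware does the rest: it groups vertices into batches of 64, assigns
// batches to idle SIMDs, runs the vertex shader, assembles primitives,
// rasterizes them into quads, groups 16 quads into 64-pixel batches, runs
// the pixel shader, and writes the results out through the DB/CB.
void draw(const VertexBuffer& vb) {
    std::printf("draw %zu vertices\n", vb.vertexCount);
}

int main() {
    ShaderProg vs{"basic_vs"}, ps{"opaque_ps"};

    setState(vs, ps);               // set up state including shader programs
    draw(VertexBuffer{30000});      // one or more buffers of vertices...
    draw(VertexBuffer{12000});      // ...all processed under the same state

    ShaderProg waterPs{"water_ps"};
    setState(vs, waterPs);          // NOW change state (a new shader program)
    draw(VertexBuffer{8000});       // repeat, drawing more stuff
    return 0;
}
```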

State changes are only required when you want to change the way the GPU processes incoming vertex information. You could easily process thousands of vertices and millions of pixels between state changes. Juggling threads and batches is done by the hardware and does not require state changes.
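
That also shows why the "keep state changes minimal" advice from the blog post is mostly an application/driver-side concern. As a toy illustration (the DrawCall type and state IDs are made up), sorting draw calls by the state they need collapses many redundant state changes into a few:

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

struct DrawCall {
    int stateId;      // which state bundle (shaders, blend mode, ...) the draw needs
    int vertexCount;
};

// Count how many state changes a stream of draws would cause, optionally
// after sorting the draws so that ones sharing a state run back to back.
int countStateChanges(std::vector<DrawCall> calls, bool sortByState) {
    if (sortByState)
        std::sort(calls.begin(), calls.end(),
                  [](const DrawCall& a, const DrawCall& b) { return a.stateId < b.stateId; });
    int changes = 0, current = -1;
    for (const DrawCall& c : calls) {
        if (c.stateId != current) { ++changes; current = c.stateId; }
        // ...submit the draw: thousands of vertices run under the one bound state
    }
    return changes;
}

int main() {
    std::vector<DrawCall> calls = { {0, 5000}, {1, 2000}, {0, 7000}, {1, 3000}, {0, 1000} };
    std::printf("interleaved: %d state changes\n", countStateChanges(calls, false)); // 5
    std::printf("sorted:      %d state changes\n", countStateChanges(calls, true));  // 2
    return 0;
}
```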

I'm not sure about pairs of batches - that's all invisible to the driver programmer anyways.
 