Can you give me some insight into this? I'm at a loss... I thought all the responsibility was shifted, but since the application knows things about what it's doing, it can handle said responsibilities more efficiently depending on implementation.
that's a back and forth in history.
in GL1.0 the process was quite simple
1. you set your drawing settings (e.g. flat shading, smooth,...)
2. you set your texture
3. you draw by telling it what primitive type (e.g. GL_QUADS) and then push vertex by vertex to the rasterizing device.
the settings and data you set are usually not in the format the GPU wants (e.g. rasterizers used fixed-point vertices, not floats; texels could be R8G8B8A8 or A8R8G8B8 or R4G4B4A4 or...), so every time you did that, the driver had to convert it. that's why in the old days everyone was sorting drawcalls by texture switches (even nowadays some have that mindset without knowing the historical reason).
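to make that conversion work concrete, here's a toy sketch in python. it is not real driver code: the 16.16 fixed-point layout and all function names are my assumptions for illustration, the real formats depended on the chip.

```python
# Toy model of per-call driver conversion. The 16.16 fixed-point format and
# all function names are illustrative assumptions, not any real driver's API.

def float_to_fixed_16_16(x: float) -> int:
    """Convert a float coordinate to 16.16 fixed point (rounded to nearest)."""
    return int(round(x * 65536))

def rgba8888_to_argb8888(texel: int) -> int:
    """Reorder a 32-bit texel from R8G8B8A8 to A8R8G8B8."""
    r = (texel >> 24) & 0xFF
    g = (texel >> 16) & 0xFF
    b = (texel >> 8) & 0xFF
    a = texel & 0xFF
    return (a << 24) | (r << 16) | (g << 8) | b

def rgba8888_to_rgba4444(texel: int) -> int:
    """Squeeze a 32-bit R8G8B8A8 texel into 16-bit R4G4B4A4 (drops low bits)."""
    r = (texel >> 28) & 0xF
    g = (texel >> 20) & 0xF
    b = (texel >> 12) & 0xF
    a = (texel >> 4) & 0xF
    return (r << 12) | (g << 8) | (b << 4) | a

# The app hands over floats; the (hypothetical) hardware wants fixed point.
vertices = [(0.5, -1.25), (2.0, 0.0)]
fixed = [(float_to_fixed_16_16(x), float_to_fixed_16_16(y)) for x, y in vertices]
```

none of this is expensive for one vertex or one texel, but in the GL1.0 model it runs again for all of them on every draw.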
GL1.1 solution: display lists
with display lists you
1. start display list recording
2.1. you set your drawing settings (e.g. flat shading, smooth,...)
2.2. you set your texture
2.3. you draw by telling it what primitive type (e.g. GL_QUADS) and then push vertex by vertex to the rasterizing device.
3. stop recording
4. replay that recording as many times as you want, the driver will not do any conversion
problems:
1. how do you change something? you need to record the display list again.
2. hardware appeared that supported an awesome new feature: multitexturing. but opengl was overwriting the previous "thing" you had set, thus you always had just one texture.
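the record/replay idea and the first problem can be sketched with a toy driver in python. the class and method names are hypothetical, this is a model of the behaviour and not the GL API:

```python
# Toy display list: conversion happens once at record time, replay is cheap.
# Class and method names are hypothetical; this is not the real GL API.

class ToyDriver:
    def __init__(self):
        self.conversions = 0   # how often the driver had to convert data
        self.submitted = 0     # how many vertices reached the "rasterizer"

    def convert(self, vertex):
        """Pretend conversion from app format (float) to hardware format."""
        self.conversions += 1
        return tuple(int(round(c * 65536)) for c in vertex)

    def draw_immediate(self, vertices):
        """GL1.0 style: every vertex is converted on every draw."""
        for v in vertices:
            self.convert(v)
            self.submitted += 1

    def record(self, vertices):
        """Display list style: convert once, keep the converted data."""
        return [self.convert(v) for v in vertices]

    def replay(self, display_list):
        """Replaying needs no conversion at all."""
        self.submitted += len(display_list)

quad = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

immediate = ToyDriver()
for _ in range(100):
    immediate.draw_immediate(quad)

listed = ToyDriver()
dl = listed.record(quad)
for _ in range(100):
    listed.replay(dl)

# Changing a single vertex means recording the whole list again.
quad[2] = (2.0, 2.0)
dl = listed.record(quad)
listed.replay(dl)
```

in this model the immediate path converts 400 vertices for 100 draws, the display list path converts 4, plus another 4 for the re-record after the change.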
GL1.2 solution: texture objects, vertex arrays (I think those were actually there earlier, but slower)
1. create texture objects
2. you set your drawing settings (e.g. flat shading, smooth,...)
3. you set your texture objects (barely any driver work)
4. you draw by pointing at a vertex array in memory and telling GL how many primitives to draw
problem: we are at Riva128 and Voodoo Graphics times now. those chips were actually way faster than CPUs thanks to very smart pipelining and dedicated memory. dedicated memory is fast, but moving data to it was super slow (I think that was still ISA or EISA time?), thus you really became limited by the number of vertices you could copy to the rasterizer chips. with the GeForce256, TnL hit consumers and the situation became even more unbalanced.
from now on mostly extensions took over
quick solution: VAS (vertex array storage), don't kill me if I'm calling it wrong, that's like ~1999 I think
you can tell GL that the array you point at will not be altered until you say so; that way the rasterizer can keep it in its memory and just redraw. TnL was taking care of the transforms, thus the CPU was not involved at all.
problem: but you still had to copy.
now we got memory handles for everything: vertices (VBO), render targets (RBO), uniforms/constants (UBO, I think that was in GL 3.1)
at this time nobody maintained display lists anymore, because they became overly complicated for the driver to track. display lists allowed some data to be static (e.g. textures) but some data to be dynamic (whatever you set outside that was not recorded, thus overwritten inside the display list). with VBO, RBO... it went beyond the specs. I think the last attempt was by nvidia, which supported PBuffers (kind of a predecessor of framebuffer objects) for a short time, but the driver guys said this became insanity.
from now on the API was pushing all commands the good old opengl 1.0 way; the driver recorded the commands into some buffer and pushed it to the driver thread.
but why a driver thread if all data is on the GPU and we just push commands? well, the GPU guys figured that everything you add to a GPU that isn't used all the time is a waste. hence let's remove everything static and make it dynamic (aka emulate it in shaders).
I think PowerVR is the pioneer of this (I don't know exactly to what extent they've gone, but if you look at their GL extensions, you'll get quite some hints). As an example: transparency. That's not needed for most objects, and if you need it, the shader could do it just as well, right? ok, but OpenGL has dozens of settings, how do we know which combination to create? we cannot compile all 100 different permutations.... well, let's do that in the driver when it's needed.
problem: there are tons of settings that can change every drawcall: texture formats, framebuffer formats, blend settings, vertex layouts, shader, sampler...... and all of those trigger a new permutation of those super flexible units.
well, the set of permutations you really need in a game is small, because there are 100 trees that render the exact same way. but every game has a different way to render its trees, thus the driver needs to evaluate all settings on the first drawcall, and the consecutive drawcalls need to at least check all settings for a possible change... an insane amount of work nowadays... that's why it takes a lot of CPU time.
and it's not just the average cost, but the unpredictable cost that makes this solution bad. if a few more objects/drawcalls appear in one frame, the CPU will spend way more time preparing those drawcalls than the GPU needs to execute 'em.
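a rough python sketch of that permutation bookkeeping, assuming a driver that keys a cache on the full set of settings. the state fields and the cache itself are made up for illustration; real drivers are far more involved:

```python
# Toy model of a driver that builds a pipeline permutation per state combo.
# The state fields and the cache are illustrative assumptions only.

class PermutationCache:
    def __init__(self):
        self.cache = {}      # state key -> "compiled" permutation
        self.compiles = 0    # expensive: happens on first use of a combo
        self.checks = 0      # cheaper, but paid on every single drawcall

    def draw(self, state: dict):
        # Every drawcall has to look at all settings to spot a change.
        key = tuple(sorted(state.items()))
        self.checks += 1
        if key not in self.cache:
            self.compiles += 1
            self.cache[key] = f"permutation#{len(self.cache)}"
        return self.cache[key]

tree = {
    "blend": "off",
    "texture_format": "RGBA8",
    "vertex_layout": "pos_uv",
    "shader": "foliage",
}
glass = dict(tree, blend="alpha", shader="glass")

driver = PermutationCache()
for _ in range(100):      # 100 trees that render the exact same way
    driver.draw(tree)
driver.draw(glass)        # one new object in this frame -> new permutation
```

the 100 trees cost one compile but 100 checks, and the single new object triggers a second compile in whatever frame it happens to show up in. that's the unpredictable part.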
what was the GL 1.1 solution for "the driver does it every drawcall, but the data doesn't change between frames"? ah, yes: display lists... or let's call those command lists or command buffers now.
as you can see in the presentation, the world is divided into those display lists like back then. once you see a new one, or an existing one needs to be modified, a new display list is recorded. for all the other frames the CPU just tells the API to replay the list...
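as a toy model of that scheme in python (all names are hypothetical and unrelated to the actual Vulkan/DX12 APIs; the chunking into "forest" and "village" is just my example):

```python
# Toy model: the world is split into chunks, each with its own command list.
# Only new or modified chunks get re-recorded; the rest are just replayed.
# All names here are hypothetical and unrelated to the Vulkan/DX12 APIs.

class ChunkedRenderer:
    def __init__(self):
        self.lists = {}       # chunk name -> recorded command list
        self.records = 0      # CPU-heavy recordings
        self.replays = 0      # cheap replays

    def render_frame(self, chunks: dict, dirty: set):
        for name, objects in chunks.items():
            if name not in self.lists or name in dirty:
                self.lists[name] = [("draw", obj) for obj in objects]
                self.records += 1
            self.replays += 1   # every chunk is replayed each frame

world = {"forest": ["tree"] * 100, "village": ["house"] * 10}
renderer = ChunkedRenderer()

renderer.render_frame(world, dirty=set())        # frame 1: record everything
renderer.render_frame(world, dirty=set())        # frame 2: replay only
world["village"].append("cart")
renderer.render_frame(world, dirty={"village"})  # frame 3: re-record one chunk
```

over three frames that is three recordings and six replays in this model; how much you win depends entirely on how rarely a chunk goes dirty.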
the problems with Vulkan and DX12 will obviously be the same as back then in GL1.0:
"1. how do you change something? you need to record the display list again"
my prediction for DX13 and emm... (Mantle...Vulkan...) Magma is a programmable command processor (which is just like moving the GPU back to the CPU and doing it GL1.0 style).
The PCP will allow you to evaluate a scene on the GPU and push data in a flexible way to the GPU backend....
I hope everybody is sleeping well by now