Welcome to the forums Christophe, thanks for posting!
Regarding your numbers, I'm curious about a few methodology things:
1) Which state changes are you including here, or just raw draw calls similar to the asteroids example in the OpenGL superbible?
2) How many vertex attributes are you pulling?
3) I'm assuming this is using non-indexed primitives? Have you tried comparing with NVIDIA's bindless IB/VB multi-draw indirect extension? It's not really okay long term to stop using indexed primitives.
4) Have you tried with bindless textures in the mix as well to try and get a more representative idea with "real" shaders?
5) For the multi-draw-indirect cases, have you tried generating the relevant draw buffers on the GPU and submitting them right after vs. generating them on the CPU? The latter opens the door for the driver to play games and if you really want to compare the GPU discard side of things, you should try and avoid that (see the sketch after this list).
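To make the multi-draw-indirect comparison concrete, here's a minimal sketch of the standard GL_ARB_multi_draw_indirect path, with a comment on where the GPU-generated variant from question 5 differs. The function and variable names are mine, not from Christophe's test, and an already-configured VAO with index/vertex buffers is assumed.

```cpp
#include <glad/glad.h>  // or any other loader exposing GL 4.3 entry points
#include <vector>

struct DrawElementsIndirectCommand {
    GLuint count;          // indices for this draw
    GLuint instanceCount;  // usually 1
    GLuint firstIndex;
    GLuint baseVertex;
    GLuint baseInstance;
};

void submitDraws(GLuint indirectBuffer,
                 const std::vector<DrawElementsIndirectCommand>& cmds)
{
    // CPU-generated variant: upload the command array, then submit everything in one call.
    glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirectBuffer);
    glBufferData(GL_DRAW_INDIRECT_BUFFER,
                 cmds.size() * sizeof(DrawElementsIndirectCommand),
                 cmds.data(), GL_STREAM_DRAW);
    glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT,
                                nullptr,                   // offset into GL_DRAW_INDIRECT_BUFFER
                                (GLsizei)cmds.size(), 0);  // stride 0 = tightly packed

    // GPU-generated variant (the comparison in question 5): have a compute shader write the
    // same structs into indirectBuffer, call glMemoryBarrier(GL_COMMAND_BARRIER_BIT), and then
    // issue the same glMultiDrawElementsIndirect, so the driver never sees per-draw data on the CPU.
}
```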
The reason for the last one in particular is that there is a cost that scales with how many resources (well, "allocations") you reference from a command buffer, bindless or not. It may or may not be feasible to have hundreds of thousands of "resident" bindless textures currently for that reason. I wouldn't necessarily assume that CPU performance would be unaffected in this case.
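For reference, a rough sketch of the ARB_bindless_texture residency pattern I mean (the helper name is mine, and texture/sampler creation is omitted); the point is just that every resident handle is another allocation the driver has to track against your command buffers.

```cpp
#include <glad/glad.h>  // assumes a loader exposing the GL_ARB_bindless_texture entry points

// Hypothetical helper: get a bindless handle for a texture and pin it resident.
GLuint64 makeTextureResident(GLuint texture)
{
    GLuint64 handle = glGetTextureHandleARB(texture);  // handle baked from the texture + its sampler state
    glMakeTextureHandleResidentARB(handle);            // driver now has to track this allocation as referenced
    return handle;                                     // shaders consume it via a uniform/UBO/SSBO
}
```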
Mantle applications likely work around this by just grouping things into larger allocations, but that is not currently possible in DX/GL on Windows beyond simple cases like buffers or texture arrays. Long term the OS itself needs to improve here, so it's not so much an API problem, but if you want to do an apples-to-apples comparison it's something to consider.
The other question I had is whether you did any profiling of CPU usage in the various cases, particularly the "tight loop" ones. NVIDIA tends to launch several driver threads for the purposes of offloading work from what they assume is usually a loaded down main game thread. From the point of view of discrete cards, CPU time is "free" and the more you can use it to avoid bottlenecking the GPU, the better you look in comparative benchmarks.
Intel, on the other hand, has made a very deliberate decision not to do this offloading, for several reasons. First and foremost, on power-constrained SoCs (i.e. most of them these days), additional driver complexity lowers the performance of the entire platform, GPU included. Thus optimizations that can be accomplished in the application should really be done there, and only where appropriate. If having an entire HW thread dedicated to just submitting API commands makes sense in a given game, that can easily be done by the application. For this reason, game developers have requested that drivers stop spawning these additional heavy threads, as they tend to oversubscribe machines, particularly on the low end where the performance is needed most.
AMD will likely run into a similar situation and opt for a "thin" driver when their SoCs become more significantly power-constrained (or on the ones that are). Mantle is chasing similar goals and thus may see good improvements on power-constrained parts.
Thus I claim it's worth understanding the overall *performance efficiency* - both CPU and GPU - of these cases along with their raw performance, as ultimately that will determine the actual realized performance on future SoCs.
Dynamic indexing/fetching is always more expensive than what you can do if you statically know things ahead of time. Take the case of the input assembler... fixed function hardware can take advantage of special caches and data-paths to do the various AoS->SoA conversions that normally take place. More critically, the fetching of data can be properly pipelined such that when a vertex shader is launched, the vertex data is already available with no stalls. If you pull the data from the shader instead, you have to stall and hide that latency. Now of course GPUs are already pretty good at hiding latency, but it costs registers/on-chip storage and other hardware thread resources. This is an inescapable trade-off that comes up in lots of similar situations (CPU prefetching, etc).
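To make the trade-off above concrete, here's a rough sketch (GLSL embedded as C++ string literals, with bindings and layout invented for the example, and assuming vertex-stage SSBO support) of the same position data consumed through the fixed-function input assembler versus pulled explicitly by the shader:

```cpp
// Fixed-function path: the IA fetches and converts the attribute before the VS starts.
const char* fixedFunctionVS = R"(
    #version 430
    layout(location = 0) in vec3 position;
    void main() { gl_Position = vec4(position, 1.0); }
)";

// Vertex-pulling path: the shader issues the loads itself, so their latency has to be
// hidden by the usual means (more threads in flight, registers held live longer).
const char* vertexPullingVS = R"(
    #version 430
    layout(std430, binding = 0) buffer Positions { float data[]; };
    void main() {
        vec3 p = vec3(data[3 * gl_VertexID + 0],
                      data[3 * gl_VertexID + 1],
                      data[3 * gl_VertexID + 2]);
        gl_Position = vec4(p, 1.0);
    }
)";
```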
Now in the case of the IA, GCN has chosen to simplify the design by just using vertex pulling internally and relying on the latency hiding that is already in place for stuff like texture fetches. This is a reasonable trade-off, but it's not "free", and indeed the more power-constrained stuff gets, the more fixed-function hardware tends to win. It will be interesting to see how competitive GCN is in Kaveri at the lower end of the power spectrum.
Don't get me wrong, for the case of IA, I have no firm opinion on whether we need fixed-function hardware in the long run. In a lot of cases it's worth just paying a small area/power cost to simplify hardware and software design, but it all depends on what that cost ends up being relative to common workloads. It's just not entirely fair to say that there's "no compromises" made - at least from a hardware point of view - in making these things more dynamic.
Good discussion, keep it up guys!
"Shader cross compilation by defining a standard shader IL valid for HLSL and GLSL. We need it to be able to fully take advantage of all the OpenGL and Direct3D APIs."

So when do we give up on agreement and start asm.glsl? Honestly, while it's a practical concern, that's sort of a tangential discussion in the context of this thread.
"We should be able to address the memory and reinterpret_cast the data into whatever vertex format is associated with a specific draw."

What's wrong with uint buffers or "byte address buffers" (assuming there's an equivalent in GL)? Obviously byte gathers are going to pay a hit on GPU hardware (some may not even support it natively and have to insert a pile of bit logic into the shader), but you can already do whatever you want with uint buffers. "asfloat" and the like are basically reinterpret casts in registers.
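In GL terms I think that boils down to something like the sketch below: a raw uint SSBO plus uintBitsToFloat (GLSL's analogue of HLSL's asfloat) reinterpreting whatever packed layout a particular draw uses. The stride/offset uniforms are hypothetical, purely to show the per-draw "format" being applied in the shader.

```cpp
// Raw "byte address buffer"-style fetch in GLSL (as a C++ string literal); the per-draw
// format description (stride/offset uniforms) is made up for illustration.
const char* rawFetchVS = R"(
    #version 430
    layout(std430, binding = 0) buffer RawVertices { uint words[]; };
    uniform uint strideInWords;    // per-draw vertex layout
    uniform uint positionOffset;   // in words

    vec3 loadPosition(uint vertexIndex) {
        uint base = vertexIndex * strideInWords + positionOffset;
        // reinterpret-cast in registers, no extra memory traffic
        return vec3(uintBitsToFloat(words[base + 0u]),
                    uintBitsToFloat(words[base + 1u]),
                    uintBitsToFloat(words[base + 2u]));
    }

    void main() { gl_Position = vec4(loadPosition(uint(gl_VertexID)), 1.0); }
)";
```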
"All in all, that stuff that I call programmable vertex pulling is putting ourselves in a position with no compromise and only wins."

Slow down a bit on that... saying "no compromises" isn't entirely true if you're talking about the whole platform.
"All this is nice for IHVs to understand how to design future GPUs."

Are you not at AMD anymore? From Aras's tweet a while back, are you at Unity now?