Many draw calls with pulling, bindless, multi-draw indirect, etc.

Welcome to the forums Christophe, thanks for posting!

Shader cross compilation by defining a standard shader IL valid for both HLSL and GLSL. We need it to be able to fully take advantage of all the OpenGL and Direct3D APIs.
So when do we give up on agreement and start asm.glsl? :p Honestly while it's a practical concern, that's sort of a tangential discussion in the context of this thread.

Regarding your numbers, I'm curious about a few methodology things:

1) Which state changes are you including here, or is this just raw draw calls similar to the asteroids example in the OpenGL SuperBible?
2) How many vertex attributes are you pulling?
3) I'm assuming this is using non-indexed primitives? Have you tried comparing with NVIDIA's bindless IB/VB multi-draw indirect extension? It's not really okay long term to stop using indexed primitives :)
4) Have you tried with bindless textures in the mix as well to try and get a more representative idea with "real" shaders?
5) For the multi-draw-indirect cases, have you tried generating the relevant draw buffers on the GPU and submitting them right after vs. generating them on the CPU? The latter opens the door for the driver to play games and if you really want to compare the GPU discard side of things, you should try and avoid that.

The reason for the last one in particular is that there is an O(n) cost in the number of resources (well, "allocations") you reference from a command buffer, bindless or not. It may or may not be feasible to have hundreds of thousands of "resident" bindless textures currently for that reason. I wouldn't necessarily assume that CPU performance would be unaffected in this case.

Mantle applications likely work around this by just grouping things into larger allocations, but that is not currently possible in DX/GL on Windows beyond simple cases like buffers or texture arrays. Long term the OS itself needs to improve here so it's not so much an API problem, but if you want to do an apples-to-apples comparison it's something to consider.

The other question I had is whether you did any profiling of CPU usage in the various cases, particularly the "tight loop" ones. NVIDIA tends to launch several driver threads for the purposes of offloading work from what they assume is usually a loaded down main game thread. From the point of view of discrete cards, CPU time is "free" and the more you can use it to avoid bottlenecking the GPU, the better you look in comparative benchmarks.

Intel on the other hand has made a very deliberate decision not to do this offloading for several reasons. First and foremost, on power-constrained SoCs (i.e. most chips these days), additional driver complexity lowers the performance of the entire platform, GPU included. Thus optimizations that can be accomplished in the application should really be done there, but only as appropriate. If having an entire HW thread dedicated to just submitting API commands makes sense in a given game, that can easily be done by the application. For this reason, game developers have requested that drivers stop spawning these additional heavy threads as they tend to oversubscribe machines, particularly on the low end where the performance is needed the most.

AMD will likely run into a similar situation and opt for a "thin" driver when their SoCs become more significantly power-constrained (or on the ones that are). Mantle is chasing similar goals and thus may see good improvements on power-constrained parts.

Thus I claim it's worth understanding the overall *performance efficiency* - both CPU and GPU - of these cases along with their raw performance, as ultimately that will determine the actual realized performance on future SoCs.

We should be able to address the memory and reinterpret_cast the data into whatever vertex format is associated with a specific draw.
What's wrong with uint buffers or "byte address buffers" (assuming there's an equivalent in GL)? Obviously byte gathers are going to pay a hit on GPU hardware (some may not even support it natively and have to insert a pile of bit logic into the shader), but you can already do whatever you want with uint buffers. "asfloat" and the like are basically reinterpret casts in registers.
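For example, a minimal HLSL sketch of that kind of manual vertex pull (the interleaved position + UV layout, the 20-byte stride and the register bindings are just illustrative assumptions):

```hlsl
// Pull an interleaved position + UV vertex straight out of a raw buffer and
// reinterpret the loaded uints in registers with asfloat().
ByteAddressBuffer VertexData : register(t0);

cbuffer DrawConstants : register(b0)
{
    float4x4 g_viewProj;
    uint     g_vertexStride;   // assumed 20 bytes: float3 position + float2 uv
};

struct VSOut
{
    float4 pos : SV_Position;
    float2 uv  : TEXCOORD0;
};

VSOut PullVertexVS(uint vertexId : SV_VertexID)
{
    uint base = vertexId * g_vertexStride;

    float3 position = asfloat(VertexData.Load3(base));      // bytes 0-11
    float2 uv       = asfloat(VertexData.Load2(base + 12)); // bytes 12-19

    VSOut o;
    o.pos = mul(float4(position, 1.0f), g_viewProj);
    o.uv  = uv;
    return o;
}
```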

All in all, this stuff that I call programmable vertex pulling puts us in a position with no compromises and only wins.
Slow down a bit on that... saying "no compromises" isn't entirely true if you're talking about the whole platform.

Dynamic indexing/fetching is always more expensive than what you can do if you statically know things ahead of time. Take the case of the input assembler... fixed function hardware can take advantage of special caches and data-paths to do the various AoS->SoA conversions that normally take place. More critically, the fetching of data can be properly pipelined such that when a vertex shader is launched, the vertex data is already available with no stalls. If you pull the data from the shader instead, you have to stall and hide that latency. Now of course GPUs are already pretty good at hiding latency, but it costs registers/on-chip storage and other hardware thread resources. This is an inescapable trade-off that comes up in lots of similar situations (CPU prefetching, etc).

Now in the case of the IA, GCN has chosen to simplify the design by just using vertex pulling internally and relying on the latency hiding that is already in place for stuff like texture fetches. This is a reasonable trade-off, but it's not "free", and indeed the more power-constrained stuff gets, the more fixed-function hardware tends to win. It will be interesting to see how competitive GCN is in Kaveri at the lower end of the power spectrum.

Don't get me wrong, for the case of IA, I have no firm opinion on whether we need fixed-function hardware in the long run. In a lot of cases it's worth just paying a small area/power cost to simplify hardware and software design, but it all depends on what that cost ends up being relative to common workloads. It's just not entirely fair to say that there's "no compromises" made - at least from a hardware point of view - in making these things more dynamic.

All this is nice for IHVs to understand how to design future GPUs.
Are you not at AMD anymore? From Aras's tweet a while back, are you at Unity now?

Good discussion, keep it up guys!
 
So really the future is to just have "state buffers" to go along with our vertex and index buffers? Interesting, but shader changes are still a big deal.

I don't feel like massive numbers of draw calls are *that* interesting, especially in synthetic tests. At the end of the day on a well-designed console performing a draw call is about as expensive as writing 40-50 bytes to memory somewhere. Avoiding state changes is significantly more interesting... with a few tweaks we'll perhaps be able to do it on every relevant platform soon.
 
I don't know about you guys, but the thought of traversing complex data structures using HLSL or GLSL doesn't exactly fill my heart with joy.
I don't have many problems traversing (reading) complex structures on the GPU. For logarithmic searches (trees, etc.) you have to do it in a branchless way, but that's what you have to do on CPUs these days as well (CPU branch predictors are very bad at random 50/50 split branches). For hash lookups, I use cuckoo hashing (it's constant time and doesn't need a single branch). HLSL supports functions, and once you have a good library for your data structures, using them becomes easy. Unfortunately HLSL doesn't support function/class templates, so you need to do some reinterpret casts (asfloat/asint) and define data containers by value size (not by value type). CUDA and C++ AMP, on the other hand, do support templates, making it easier to write generic code.
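A minimal sketch of what the lookup side can look like (the table size, hash constants and uint2 key/value packing here are illustrative assumptions, not our actual code):

```hlsl
// Cuckoo hash lookup: every key lives in one of exactly two slots, so two
// loads and two conditional selects are enough - no loops, no divergent branches.
StructuredBuffer<uint2> HashTable : register(t0);   // x = key, y = value

static const uint TABLE_SIZE = 1 << 16;             // power of two (assumed)

uint Hash1(uint key) { return (key * 2654435761u) & (TABLE_SIZE - 1); }
uint Hash2(uint key) { return ((key ^ 0x9e3779b9u) * 40503u) & (TABLE_SIZE - 1); }

// Returns the value stored for 'key', or 'defaultValue' if it is not present.
uint CuckooLookup(uint key, uint defaultValue)
{
    uint2 slotA = HashTable[Hash1(key)];
    uint2 slotB = HashTable[Hash2(key)];

    uint result = defaultValue;
    result = (slotA.x == key) ? slotA.y : result;   // compiles to a select
    result = (slotB.x == key) ? slotB.y : result;
    return result;
}
```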

Generating complex data structures on a GPU however can be very difficult, because you need to build them in a parallel way. Fine-grained synchronization kills your performance. But I am often surprised by how easily GPUs can generate some complex structures. A Bloom filter (http://en.wikipedia.org/wiki/Bloom_filter), for example, can be built in parallel with atomic OR operations.
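As a minimal sketch (the filter size, hash functions and dispatch layout are illustrative assumptions):

```hlsl
// Parallel Bloom filter construction: every thread sets its key's bits with
// atomic ORs, so insertion order doesn't matter at all.
RWByteAddressBuffer BloomBits : register(u0);   // FILTER_BITS / 8 bytes, cleared to zero

static const uint FILTER_BITS = 1 << 20;

uint BloomHash(uint key, uint salt)
{
    return ((key ^ salt) * 2654435761u) & (FILTER_BITS - 1);
}

void BloomInsert(uint key)
{
    // Two hash functions for illustration; a real filter picks k from the
    // desired false-positive rate.
    [unroll]
    for (uint i = 0; i < 2; ++i)
    {
        uint bit      = BloomHash(key, i * 0x9e3779b9u);
        uint byteAddr = (bit >> 5) * 4;          // owning 32-bit word, as a byte offset
        uint mask     = 1u << (bit & 31);
        BloomBits.InterlockedOr(byteAddr, mask);
    }
}

[numthreads(64, 1, 1)]
void InsertKeysCS(uint3 id : SV_DispatchThreadID)
{
    BloomInsert(id.x);   // stand-in for reading a real key stream
}
```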
I am expecting one or two orders of magnitude higher scene complexity/increased efficiency from moving to the MultiDrawIndirect development mind frame on current GPU architectures, away from the XBox360/PS3 development mind frame.
Object counts (and draw distance) will certainly increase when GPU pulling starts to be used more widely. We will also see more shadow casting lights.
Mantle is only trying to optimize the XBox360/PS3 development mind frame. I am happy for whoever wants to live in the past, but I have a future to build. :p
I am quite sure Mantle supports everything in OpenGL 4.4 and beyond. If it allows you direct access to the ring/command buffers you could likely add your own loops and branches there (allowing you to do many nice things without any CPU intervention).

Of course many developers will use Mantle to reduce draw call cost of their CPU based rendering architectures. But that's a good thing. I prefer to have as many options as possible. GPUs are not yet flexible enough to do everything themselves. For our purposes fully GPU-driven rendering is enough (the compromises are worth it for us and we have good workarounds), but that's not the case for everyone. Some rendering approaches just need more state changes and more front end flexibility.
Ultimately, we should be able to render a frame with fewer than 10 draw calls: API calls become much lower-frequency updates, we think in terms of bandwidth instead, and GPU-based resource indexing replaces CPU-based resource switching. The indirect draw buffer becomes a memory-packed representation of the series of draws that is constantly updated to match the evolution of the scene frame after frame. A cached version of that buffer can even be maintained on a separate thread, so that the rendering thread really doesn't have much left to do anymore.
I like keeping the whole scene data on the GPU side. Usually less than 1% of the objects change per frame, so the update buffer is quite small, thus the traffic from CPU -> GPU is reduced dramatically compared to the "good" old times (= CPU updating the matrix of every object every frame, filling constant buffers per object and pushing a huge number of draw calls that are basically the same every frame).
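A minimal sketch of that update path (buffer names and the ObjectUpdate layout are illustrative assumptions): the CPU uploads only the objects that actually changed, and a tiny compute pass scatters them into the persistent GPU-side scene buffer.

```hlsl
struct ObjectUpdate
{
    uint     objectId;      // index into the persistent scene array
    float4x4 localToWorld;  // new transform for that object
};

StructuredBuffer<ObjectUpdate> Updates     : register(t0); // ~1% of the scene per frame
RWStructuredBuffer<float4x4>   SceneXforms : register(u0); // persistent, whole scene

cbuffer UpdateConstants : register(b0)
{
    uint g_updateCount;
};

[numthreads(64, 1, 1)]
void ApplyUpdatesCS(uint3 id : SV_DispatchThreadID)
{
    if (id.x < g_updateCount)
    {
        ObjectUpdate u = Updates[id.x];
        SceneXforms[u.objectId] = u.localToWorld;   // scatter into resident scene data
    }
}
```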
VAOs can get ready to be dropped in that mind frame, replaced by programmable vertex fetching. Shader storage buffers are not flexible enough for my taste here. We should be able to address the memory and reinterpret_cast the data into whatever vertex format is associated with a specific draw.
On DirectX you have byte address buffers, and asfloat/asint/etc (reinterpret cast operations of HLSL). In addition to these we have created our own formats for things like 8 bit floats, as bandwidth is the most usual bottleneck in compute shaders. Modern GPUs are almost never ALU bound, so you can spend some ALU cycles to optimize your data layout (bit packing magic + more exotic encoding/decoding).
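For example, a minimal packing sketch (the 10:10:10 normal encoding is my own illustrative choice, not a claim about any particular engine's format; f32tof16/f16tof32 are the standard SM5 intrinsics):

```hlsl
// Trade a few ALU ops for bandwidth: pack a normal into one uint.
uint PackNormal(float3 n)
{
    // Map [-1,1] to [0,1023] per component and pack into 30 bits.
    uint3 q = (uint3)round(saturate(n * 0.5f + 0.5f) * 1023.0f);
    return q.x | (q.y << 10) | (q.z << 20);
}

float3 UnpackNormal(uint packed)
{
    uint3 q = uint3(packed, packed >> 10, packed >> 20) & 1023u;
    return (float3)q / 1023.0f * 2.0f - 1.0f;
}

// Two floats squeezed into one uint via the built-in half conversions.
uint   PackHalf2(float2 v)  { return f32tof16(v.x) | (f32tof16(v.y) << 16); }
float2 UnpackHalf2(uint p)  { return float2(f16tof32(p), f16tof32(p >> 16)); }
```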
All in all, this stuff that I call programmable vertex pulling puts us in a position with no compromises and only wins. There are so many opportunities to take advantage of in that mind frame.
"No compromises" is not true yet. OpenGL 4.4 (and NV_bindless_multi_draw_indirect) gets us closer, but neither is widely available (cannot be yet used in a shippable product you want to sell to masses). And neither solves the issue of shader permutations (select shader by GPU).
Thus I claim it's worth understanding the overall *performance efficiency* - both CPU and GPU - of these cases along with their raw performance, as ultimately that will determine the actual realized performance on future SoCs.
Console development has had the same mind set for years. If your rendering takes lots of CPU time, you have less CPU time for your physics, AI and game logic. Every cycle counts (as console CPUs tend to be quite slow compared to modern PC CPUs). More efficient graphics API is a big win, but moving some rendering setup steps to GPU is also a big win. If a "draw call" basically becomes a single insert operation to an append buffer, that shouldn't eat much power on laptop/tablet SoCs either :)
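In practice that looks roughly like the sketch below: one compute thread per object, and the whole "draw call" is the Append() at the end. The argument struct mirrors the D3D11 indexed indirect draw arguments; the bounds test and buffer names are illustrative assumptions.

```hlsl
struct ObjectData
{
    float4 boundingSphere;   // xyz = center, w = radius
    uint   indexCount;
    uint   startIndex;
    uint   baseVertex;
    uint   pad;
};

struct DrawIndexedArgs       // matches DrawIndexedInstancedIndirect layout
{
    uint indexCountPerInstance;
    uint instanceCount;
    uint startIndexLocation;
    int  baseVertexLocation;
    uint startInstanceLocation;
};

StructuredBuffer<ObjectData>            Objects  : register(t0);
AppendStructuredBuffer<DrawIndexedArgs> DrawArgs : register(u0);

cbuffer CullConstants : register(b0)
{
    float4 g_frustumPlanes[6];   // plane normals point into the frustum
    uint   g_objectCount;
};

[numthreads(64, 1, 1)]
void CullCS(uint3 id : SV_DispatchThreadID)
{
    if (id.x >= g_objectCount)
        return;

    ObjectData obj = Objects[id.x];

    bool visible = true;
    [unroll]
    for (uint p = 0; p < 6; ++p)
        visible = visible && (dot(g_frustumPlanes[p].xyz, obj.boundingSphere.xyz)
                              + g_frustumPlanes[p].w > -obj.boundingSphere.w);

    if (visible)
    {
        DrawIndexedArgs args;
        args.indexCountPerInstance = obj.indexCount;
        args.instanceCount         = 1;
        args.startIndexLocation    = obj.startIndex;
        args.baseVertexLocation    = (int)obj.baseVertex;
        args.startInstanceLocation = id.x;   // smuggle the object id to the vertex shader
        DrawArgs.Append(args);               // this append *is* the draw call
    }
}
```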
Now in the case of the IA, GCN has chosen to simplify the design by just using vertex pulling internally and relying on the latency hiding that is already in place for stuff like texture fetches. This is a reasonable trade-off, but it's not "free", and indeed the more power-constrained stuff gets, the more fixed-function hardware tends to win. It will be interesting to see how competitive GCN is in Kaveri at the lower end of the power spectrum.
IA is a special case also in GCN (the index buffers are special, check the SI instruction set PDF, it's publicly available).

I dislike the fixed-function units in general. It was a very good move to add a programmable (general purpose) scalar unit to GCN. It doesn't use much die space (one unit shared between four SIMDs), but allows using similar optimization techniques for dynamic buffers as you could use for constant buffers (do address calculations and data fetching once per wave, not for each thread). And of course the same scalar unit is used for all math related to branching (since branches are wave wide). I hated the "constant waterfalling" problem that plagued last generation GPUs. It was one of the biggest problems that limited more generic dynamic accessing of arrays etc. Now dynamic accesses (dynamic address calculation + fetch) use exactly the same mechanisms (and caches) as constants. This makes many things possible that weren't possible before (because of performance issues). The HLSL programming model however doesn't yet support everything that the GPU supports.

If Mantle supports writing shaders in GCN microcode ("pretty please" to Mantle guys reading this if not :)), it would be the best API for GPU-driven rendering (because scalar unit abuse allows more efficient data pulling / address calculating when you know what you are doing).
At the end of the day on a well-designed console performing a draw call is about as expensive as writing 40-50 bytes to memory somewhere.
But writing just 4 bytes (object id) to an append buffer on the GPU is even more efficient. Especially on PC, where you need to move the draw call data generated by the CPU across the PCIe bus (and translate it from a cross-platform format to one the GPU understands).
 
What about streaming geo? Seems like a multidraw indirect requires one giant index buffer that you index into per-object with the start index location. But maintaining such an index buffer in a world where geo can stream in or out at any given time sounds like a big pain.... and constructing a new giant index buffer every frame (with compute, I suppose) sounds like a waste of resources.
 
Avoiding state changes is significantly more interesting... with a few tweaks we'll perhaps be able to do it on every relevant platform soon.
I'm just not sure how much "state" I really care about changing at a high frequency to be honest. If/when binding is out of the picture, do you really need to switch shaders, blend/raster/depth state or anything else millions of times per frame? I think not, and that would suck in other places in the GPU pipeline anyways.

Given how reasonably decent GPUs are at uniform/static control flow these days, the only really overriding constraint that forces shader changes is register pressure (I'm ignoring the API annoyance of uber-shaders, etc. right now as that's "just software"). There are really only a few buckets you need to capture those differences, too. Arguably GPU hardware could move to be more dynamic on that front as well by relying more on close L1$'s and stacks than on gigantic, statically-allocated register files, but that comes with some non-trivial hardware cost that isn't necessarily justified considering we already have CPUs that do that design pretty well.

I like keeping the whole scene data on the GPU side.
This distinction starts to fall away a bit in the future on a lot of platforms though, which will be an overall boon to efficiency. It's much better to be able to decide where to run code based on power efficiency than based on separate memory spaces. Data locality will of course always be a relevant concern, but for (potentially multi-buffered) frame update stuff, it's not usually going to still be sitting in GPU caches from the previous frame.

More efficient graphics API is a big win, but moving some rendering setup steps to GPU is also a big win. If a "draw call" basically becomes a single insert operation to an append buffer, that shouldn't eat much power on laptop/tablet SoCs either :)
Sure, but my point is that it goes the other way too. Making the CPU really cheap by making the GPU do something in a less power-efficient way is not going to be a long-term win :) Thus we need to shift focus from "moving things on and off processor X" to doing various operations on the most appropriate processor.

IA is a special case also in GCN (the index buffers are special, check the SI instruction set PDF, it's publicly available).
Sure, I should have been more clear on my example being about vertex buffers in particular. For index buffers you still need some sort of pointer in the command stream, as they are consumed before you hit any of the programmable stages. The NV bindless solution seems pretty good conceptually for this.

I dislike the fixed-function units in general.
So do all of us software developers, but it's always a trade-off :)

If Mantle supports writing shaders in GCN microcode ("pretty please" to Mantle guys reading this if not :)), it would be the best API for GPU-driven rendering (because scalar unit abuse allows more efficient data pulling / address calculating when you know what you are doing).
While cool, I think that's extremely unlikely given what they have said. Personally I had hoped for it being a more specific "GCN API" for this reason, but they have explicitly said they want to keep it portable. Baking in stuff like SIMD size, scalar unit, etc. is not something they are likely to do. Time to revive CTM? :p

What about streaming geo? Seems like a multidraw indirect requires one giant index buffer that you index into per-object with the start index location. But maintaining such an index buffer in a world where geo can stream in or out at any given time sounds like a big pain.... and constructing a new giant index buffer every frame (with compute, I suppose) sounds like a waste of resources.
See the referenced NVIDIA extension to this end - you give it index and vertex buffer bindings in your indirect buffer along with the draw commands. Pretty much the ideal solution, although I'd argue that handling streaming by sub-allocating out of a bigger CPU buffer and copying stuff in and out is usually not going to be that much overhead either. At this stage it's all virtual memory anyways, so having a big contiguous chunk of it isn't a huge deal assuming 64-bit GPU addressing.
 
See the referenced NVIDIA extension to this end - you give it index and vertex buffer bindings in your indirect buffer along with the draw commands. Pretty much the ideal solution, although I'd argue that handling streaming by sub-allocating out of a bigger CPU buffer and copying stuff in and out is usually not going to be that much overhead either. At this stage it's all virtual memory anyways, so having a big contiguous chunk of it isn't a huge deal assuming 64-bit GPU addressing.

At least in terms of AAA games, I feel that GCN will drift towards being sort of both a min and a max spec for the next few years... so the nvidia extension might not end up being very helpful unless nvidia steam machines become a huge thing. But as you say the overhead of buffer management may not actually be significant.

There are a lot of concerns, though. Synthetic tests are cool but they seem so far removed from an actual game situation and actual game concerns that it's unclear sometimes how to apply the things they teach us.
 
Dynamic indexing/fetching is always more expensive than what you can do if you statically know things ahead of time. Take the case of the input assembler... fixed function hardware can take advantage of special caches and data-paths to do the various AoS->SoA conversions that normally take place. More critically, the fetching of data can be properly pipelined such that when a vertex shader is launched, the vertex data is already available with no stalls. If you pull the data from the shader instead, you have to stall and hide that latency. Now of course GPUs are already pretty good at hiding latency, but it costs registers/on-chip storage and other hardware thread resources. This is an inescapable trade-off that comes up in lots of similar situations (CPU prefetching, etc).

I would think doing input assembly in shader would hide this latency better since you can overlap that latency with work, at least partially.

Fetching it all in IA means you stall until everything is available and there is no work to overlap it.

EDIT: What am I missing?
 
Sure, but my point is that it goes the other way too. Making the CPU really cheap by making the GPU do something in a less power-efficient way is not going to be a long-term win :) Thus we need to shift focus from "moving things on and off processor X" to doing various operations on the most appropriate processor.

If we are talking about systems with discrete GPUs, then it seems to me that the tradeoff would almost always favor, sooner or later, going to the GPU. They have a lot more parallelism, can now (or in the near future will) run multiple different kernels in parallel to hide latency even with low parallelism algorithms and have massive memory bandwidth.

If anything is amenable to parallelization and vectorization, it'll end up migrating to the GPU. If it is fundamentally serial like scripting, then there is of course no hope. That leaves a class of algorithms that parallelize well but vectorize poorly which might stay on CPU, but it's hard to see what they might be.
 
What about streaming geo? Seems like a multidraw indirect requires one giant index buffer that you index into per-object with the start index location. But maintaining such an index buffer in a world where geo can stream in or out at any given time sounds like a big pain.... and constructing a new giant index buffer every frame (with compute, I suppose) sounds like a waste of resources.
Modern GPUs have virtual memory. You just modify the page mappings (of your geometry structures) and stream things page by page, just like virtual texturing does. Allocating/freeing lots of (variable size) GPU resources (DirectX textures/objects for example) has never been the recommended way to stream things. That fragments your memory, and requires the GPU to swap things in/out.
This distinction starts to fall away a bit in the future on a lot of platforms though, which will be an overall boon to efficiency. It's much better to be able to decide where to run code based on power efficiency than based on separate memory spaces. Data locality will of course always be a relevant concern, but for (potentially multi-buffered) frame update stuff, it's not usually going to still be sitting in GPU caches from the previous frame.
Unified memory is great, however Steam Survey still shows that over 90% of gaming PCs have discrete GPUs. If you want to sell your game on PC, you better design your graphics engine in a way that works properly on these GPUs. For academic research & hobby projects (and for first party console development) you can of course forget about discrete GPUs and enjoy more freedom.
Thus we need to shift focus from "moving things on and off processor X" to doing various operations on the most appropriate processor.
Fortunately scene management, animation, culling, object/hierarchy matrix setup and most other heavy duty graphics engine tasks are more power efficient on the GPU. These steps are easy to run in parallel and/or contain lots of floating point math. Many developers have started to process these things in a "GPU-like" manner on the CPU as well because it is faster (DICE's vectorized flat culling pipeline for example beats older, more complex octree-based algorithms by 3x according to their paper). The GPU obviously runs these linear floating point heavy algorithms even faster and with better power efficiency, since it doesn't have heavy OoO/renaming machinery (that doesn't help at all in algorithms like these, but still consumes quite a lot of power).
Personally I had hoped for it being a more specific "GCN API" for this reason, but they have explicitly said they want to keep it portable. Baking in stuff like SIMD size, scalar unit, etc. is not something they are likely to do. Time to revive CTM? :p
I have been a console programmer for too long... for me it feels WRONG if I know the hardware could do better but the API doesn't allow me to do it :)
At least in terms of AAA games, I feel that GCN will drift towards being sort of both a min and a max spec for the next few years... so the nvidia extension might not end up being very helpful unless nvidia steam machines become a huge thing.
That's true. Most PC games are console ports. Also we don't yet know which AMD and Intel chips are going to support OpenGL 4.4. I would guess that only 7000-series and 2xx-series Radeons support it (there's still a lot of 5000 and 6000 series cards in use, since these support full DX11 feature set). Intel doesn't yet even support 4.3 on any iGPUs (and has not announced any plans beyond 4.2). So it might not be commercially viable to build your engine on top of these features (unless you have fallback plans that don't kill your performance).
I would think doing input assembly in shader would hide this latency better since you can overlap that latency with work, at least partially.

Fetching it all in IA means you stall until everything is available and there is no work to overlap it.

EDIT: What am I missing?
I have been thinking about it too. Usually hiding memory latency by ALU instructions has been a good idea. Maybe the post-transform cache would be hard to implement using memory buffers available to shaders (LDS/GDS), or the cache logic would be too hard to distribute around the chip.
 
At least in terms of AAA games, I feel that GCN will drift towards being sort of both a min and a max spec for the next few years... so the nvidia extension might not end up being very helpful unless nvidia steam machines become a huge thing.
Sure, but the point is about future hardware and software directions, not what game developers can do today. I'm just referencing examples of various design points and in some cases - such as bindless IA stuff - we even already have proof that it works well.

Unified memory is great, however Steam Survey still shows that over 90% of gaming PCs have discrete GPUs. If you want to sell your game on PC, you better design your graphics engine in a way that works properly on these GPUs.
That's not totally clear with a full 36% not shown. Still, I agree that the majority of Steam users are on discrete for now, although Unity is heavily skewed in the other direction.

That said, I'm not primarily interested in the situation today so much as 3+ years down the road. While I'm a high end discrete GPU guy through and through, there's little doubt in my mind that the majority of systems will be SoCs soon (or already are depending on who's counting). The economics that make that the best option in mobile, laptops and consoles apply to PCs too, despite the small percentage of folks who will continue to buy halo stuff.

Fortunately scene management, animation, culling, object/hierarchy matrix setup and most other heavy duty graphics engine tasks are more power efficient on the GPU.
I disagree with this, at least at the level of generality that you have specified it. I've written a fair amount of this stuff both for CPUs and GPUs and even if we had much more sophisticated GPU scheduling available, they are still ill-suited to do stuff with even moderate branch divergence, pointer chasing, trees/recursion, etc. They *can* do it now, but they are not as power-efficient as equivalent CPU implementations.

The problem is that GPUs don't really have a large lead over CPUs even in straightforward FLOPS efficiency, particularly if you compare lower-clocked parts. Running at 3-4Ghz is obviously not going to be as power efficient, but as all CPUs are increasingly power-constrained they will have much wider dynamic clock and power ranges and good use of parallelism will cause them to run at more efficient (i.e. lower) frequencies.

While the OOE logic does use some power, I'm not convinced it is as much as people think. In fact, I'm starting to get similar vibes to when people were convinced that x86 could never be competitive in power-efficiency... and we all know how that turned out :) It's also worth noting that naive arguments about SIMD widths don't hold up. Indeed in Haswell the physical SIMD size in the CPU is actually wider than the GPU :D And both are quite power-efficient when compared to competing CPUs and GPUs, even normalizing for process.

I'm a GPU guy so I tend to be the one helping people move stuff to the GPU, optimize, etc. there, but having also written my share of ISPC code too I can tell you that the only place where GPUs have a large efficiency advantage is when you can make use of the fixed-function hardware in some way... i.e. usually texture samplers or similar.

Anyways I don't want to get too far astray here, but suffice it to say I don't buy the "most parallel stuff is more power efficient on the GPU" argument. Indeed I still think a key point in this discussion is whether the power-overhead of pull-mode stuff on the GPU is significant compared to a more push-oriented model. Obviously right now it's way better since the APIs have far too much overhead, but it might be an interesting comparison with Mantle or similar.

That's true. Most PC games are console ports.
I would have agreed with you a couple years ago but not today. Using the Steam stats again, how many of the games that people play could be called console ports? Even if you call Skyrim a console port (which is questionable...), you're still talking maybe 1-2 of the top 10 games and an even smaller percentage of the top 20. If you weight by players, it's even smaller. And that doesn't include other massively popular PC games like LoL, WoT, everything from Blizzard, etc.

These days the PC industry is its own thing. Even the "big" console games are minor players compared to a wealth of hugely-popular PC exclusives. Thus it would be stretching it to say that console considerations are particularly relevant to the games that PC gamers are playing these days.

As I mentioned I'm less interested in the immediate term than a few years down the road anyways. If GPU models evolve significantly in that time I'm not convinced that the "majority" of PC games are going to be hamstrung by console technology considerations judging by the current trends.

Fetching it all in IA means you stall until everything is available and there is no work to overlap it.
I'm not sure I get your point here... you always "stall" until data is available if that is the bottleneck, it's just a question of whether you're burning hardware resources (registers, HW threads) during that time or whether those resources are available to do other things. And even if the IA is a significant bottleneck for some reason (and it shouldn't be since the pipeline is usually balanced to achieve a certain throughput through all of geometry stages), it's still better for power if you have completely idle execution units vs. stalled ones.
 
While the OOE logic does use some power, I'm not convinced it is as much as people think. In fact, I'm starting to get similar vibes to when people were convinced that x86 could never be competitive in power-efficiency... and we all know how that turned out :) It's also worth noting that naive arguments about SIMD widths don't hold up. Indeed in Haswell the physical SIMD size in the CPU is actually wider than the GPU :D And both are quite power-efficient when compared to competing CPUs and GPUs, even normalizing for process.

There was an NVIDIA presentation that broke down microarchitecture power usage by component, which had OoOE+renaming+branch prediction pegged at a very high level, as in causing power usage to go up by an entire order of magnitude(!) http://techtalks.tv/talks/54110/#

The underlying problem is that data movement is very expensive in terms of power, and OoOE requires a lot of it (internal to a single core, but still significant).

OoOE also has a problem in that it usually can only cover small stalls (cached memory transactions, branches). Beyond that, it's unlikely you'll be able to find non-dependent instructions, no matter how wide your OoOE window is. Thing is, these can be covered fairly well by compile-time instruction scheduling (this is the real benefit of JITing from LLVM or similar (HLSL uses its own virtual ISA) - instruction scheduling requires a lot of knowledge about the specific microarchitecture), so you're really not going to see a tremendous increase in performance.

Also keep in mind that you have SMT on GPUs (and also recent CPUs) which can in fact cover very long stalls, but only works if you have extra threads. This shrinks the window for OoOE benefits.

The question then is if a larger number of in order processors can, with more frequent stalls, have more total performance than a lower number of cores with fewer stalls. No doubt this depends on the problem! (Or perhaps on the algorithm...)

I'm a GPU guy so I tend to be the one helping people move stuff to the GPU, optimize, etc. there, but having also written my share of ISPC code too I can tell you that the only place where GPUs have a large efficiency advantage is when you can make use of the fixed-function hardware in some way... i.e. usually texture samplers or similar.

I'll admit that I'm biased a bit towards GPUs, most likely because the problems I've worked on (fractals, mostly) have in fact had massive performance and power usage benefits on GPUs, no doubt because they're embarrassingly parallel AND have a high compute/memory ratio AND have low divergence (though in the case of flame fractals, you have to rejigger the algorithm to get this - the naive one actually has very high divergence). I suppose they do benefit from fixed-function hardware in that they make quite a bit of use of the transcendental (sin, exp, log...) hardware.

Anyways I don't want to get too far astray here, but suffice it to say I don't buy the "most parallel stuff is more power efficient on the GPU" argument. Indeed I still think a key point in this discussion is whether the power-overhead of pull-mode stuff on the GPU is significant compared to a more push-oriented model. Obviously right now it's way better since the APIs have far too much overhead, but it might be an interesting comparison with Mantle or similar.

"Most parallel stuff is more power efficient on the GPU" is a pretty good rule of thumb, as GPUs are increasing becoming "CPUs with microarchitecture and ISAs optimized for parallel processing" :p

I don't really see where a pull-mode for context changes is that expensive (in terms of power). You're only transferring a few bytes. The problem is more about latency (if it stalls between every context change, you have a problem) than anything.
 
That said, I'm not primarily interested in the situation today so much as 3+ years down the road. While I'm a high end discrete GPU guy through and through, there's little doubt in my mind that the majority of systems will be SoCs soon (or already are depending on who's counting). The economics that make that the best option in mobile, laptops and consoles apply to PCs too, despite the small percentage of folks who will continue to buy halo stuff.
As a game developer I must unfortunately focus on the technologies that are available right now (and in the next 2 to 3 years). I would be happy to forget about discrete cards (as that would let us do some things better on unified memory consoles), but only if that would not hurt our PC customers. You have to remember that people who buy the "halo" stuff are also the most vocal ones :)
I disagree with this, at least at the level of generality that you have specified it. I've written a fair amount of this stuff both for CPUs and GPUs and even if we had much more sophisticated GPU scheduling available, they are still ill-suited to do stuff with even moderate branch divergence, pointer chasing, trees/recursion, etc. They *can* do it now, but they are not as power-efficient as equivalent CPU implementations.
Trees are easy, if the tree is balanced (logarithmic maximum guarantee). Balanced trees are easy to traverse without any branches (and thus no thread divergence). Pointer chasing is ok, as long as the threads near each other follow similar paths (you get a good L1/L2 hit rate). On a CPU you don't want to pointer chase either (unless your data is mostly in the caches). Branch divergence kills performance on GPU, so you need to use some "sort by branch"-method to get rid of divergent branches. Sometimes this is straightforward (and efficient), sometimes not.
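To make the branchless part concrete, a minimal sketch of a search over a sorted key array (an implicit balanced tree); the power-of-two element count is an assumption for simplicity:

```hlsl
StructuredBuffer<uint> SortedKeys : register(t0);

// Returns the largest index i with SortedKeys[i] <= key.
// Assumes countPow2 is a power of two and SortedKeys[0] <= key.
// Each step is one load plus one conditional select, so as long as countPow2
// is uniform, every thread in the wave runs the same instruction stream.
uint FindLargestLessEqual(uint key, uint countPow2)
{
    uint i = 0;
    for (uint step = countPow2 >> 1; step > 0; step >>= 1)
    {
        uint probe = i + step;
        i = (SortedKeys[probe] <= key) ? probe : i;
    }
    return i;
}
```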

I agree with you that many algorithm ports to GPU are not reaching the same power efficiency as the CPU implementations. GPU ports usually increase memory traffic (by multipassing), because that's an easy way to get more performance (on a device with huge BW). Sorting for example is less efficient on the GPU, because GPU sorting algorithms tend to move more data around. For example, the fastest GPU radix sorters do 4 bits per pass, while the fastest CPU radix sorters do 8, so the GPU sorter needs twice as many passes and thus twice as many memory moves. Of course the GPU still beats CPU in sorting performance, but that's because it has much more bandwidth to burn (over the required 2x to reach parity).
While the OOE logic does use some power, I'm not convinced it is as much as people think. In fact, I'm starting to get similar vibes to when people were convinced that x86 could never be competitive in power-efficiency... and we all know how that turned out :) It's also worth noting that naive arguments about SIMD widths don't hold up. Indeed in Haswell the physical SIMD size in the CPU is actually wider than the GPU :D And both are quite power-efficient when compared to competing CPUs and GPUs, even normalizing for process.
Double wide SIMD is more power efficient in an OoO architecture, as wider SIMD requires same amount of OoO machinery work as a narrower one, but doubles the arithmetic throughput. AVX 512 will again "halve" the OoO machinery "cost" per FLOP. Sandy "halved" it with 8 wide AVX, and Haswell "halved" it with FMA(float) & AVX2(int).

Narrower SIMDs on GPUs aren't that bad a decision (compared to heavy OoO machines). GCN doesn't have that wide SIMDs either (16 wide hardware). Different architectures have different scaling.
I'm a GPU guy so I tend to be the one helping people move stuff to the GPU, optimize, etc. there, but having also written my share of ISPC code too I can tell you that the only place where GPUs have a large efficiency advantage is when you can make use of the fixed-function hardware in some way... i.e. usually texture samplers or similar.
I have been porting a lot of graphics engine stuff to GPU as well, and for our graphics engine processing steps, the GPU seems to do a good job. Sometimes you need to spend a bit of extra bandwidth (when sorting for example), but that's not always the case. Porting most of our game logic (or our Flash based UI code) to GPU would of course be very time consuming, and would likely result in very bad and inefficient code. But that's why I believe moving graphics setup / culling / animation code to GPU is even more important. It frees more CPU cores to the game logic. And game logic really needs CPU to perform well. On consoles the processing resources are always limited (especially on CPU side). Good balance between CPU and GPU is the key to success.

I also want to point out that your CPU of choice (in these comparisons) is Haswell (8 wide AVX2 + FMA + gather). Not all CPUs are that efficient in processing heavy floating point graphics setup code ("refresh 10000 matrices by slerping animation key frame quaternions and positions"). Quad core (8 thread) Haswell is likely able to process all the required (per frame) tasks like this fast enough, but some (very important) CPUs still have 4 wide SIMDs and no FMA (and much lower frequency), and would seriously choke on setting up huge dynamic game scenes (200k+ visible objects + 60 fps).
 
There was an NVIDIA presentation that broke down microarchitecture power usage by component, which had OoOE+renaming+branch prediction pegged at a very high level, as in causing power usage to go up by an entire order of magnitude(!) http://techtalks.tv/talks/54110/#
Interesting, thanks for the link! I'll try and find some time to watch that.

OoOE also has a problem in that it usually can only cover small stalls (cached memory transactions, branches).
Sure, but ultimately uncached stuff is going to dominate the power budget regardless of the microarchitecture. Currently this is actually part of the reason why CPUs tend to do pretty well: large caches. Of course GPUs are increasingly adding caches and so on, but they spend a large portion of those transistors on register files already for the purposes of latency hiding. You can't have your cake and eat it too in this situation... you always pay the cost of unpredictable memory access somewhere. Prefetching (HW or SW) and GPU-style "stall on use" are really the same solution, even though they look very different to the application developer.

I'll admit that I'm biased a bit towards GPUs
And to be clear, so am I no doubt due to my job, but I think it's important to try and be a bit balanced. To that end I have been surprised recently on how well CPUs actually do on problems that one would think are particularly GPU-suited. There's still no replacing texture samplers or the way that GPUs are able to pipeline data on-chip for graphics, but on the pure compute front, it's closer than I would have thought.

It's also sort of stunning how much waste occurs on GPUs in terms of computation, often due to the quest for more parallelism. The tiled deferred stuff I did a few years ago was a bit of a wake up call on how incredibly brute force the GPU solution so far has been. Obviously there are language issues there, but there's also no escaping the fact that the typical wisdom that you might as well do a brute force GPU solution vs. a more efficient CPU one is increasingly not true due to power.

The problem is more about latency (if it stalls between every context change, you have a problem) than anything.
Right it's not the memory traffic that is the concern, it's the additional latency hiding and use of execution units for tasks vs. fixed function (the latter of which is generally more power efficient). "Dynamic" (pull) is always more expensive than static, it's just a question of by how much. It's probably pretty small, but it's not that easy to measure at the moment without having either a significantly power-constrained GPU (even more than something like the 290X) or much better hardware counters for power.

As a game developer I must unfortunately focus on the technologies that are available right now (and in the next 2 to 3 years).
No argument, I'm just saying that my angle in this thread is not so much to change game developer short term behavior and more about understanding medium/long-term API degrees of freedom.

Of course the GPU still beats CPU in sorting performance, but that's because it has much more bandwidth to burn (over the required 2x to reach parity).
Right and that's the rub... while a paper may happily conclude that they moved X algorithm to a GTX 680 and got a 2x performance improvement, the reality is that you just de-optimised it significantly in terms of power... With increasingly power-efficient CPUs (sub-45W mobile stuff) the comparison gets even more brutal.

The rubber really meets the road on stuff like the 4350U, which I've been playing with a bit recently. Either the CPU or the GPU is easily able to completely consume the entire 15W TDP (which includes chipset stuff and FIVR power too). The turbo ranges are huge (~2x on CPU, ~5x on GPU) so you really run into the power efficiency situation head-on. For that reason, it's a pretty interesting processor to play around with in terms of testing power efficiency :) High end mobile chips are obviously another good place since phones often get chips these days that they have no hope of ever running at full speed for any length of time. Ultimately these heavily-power-constrained chips will creep up the TDP scale and in a few years it's going to be almost everything that operates that way. Thus it's of critical importance to understand the power efficiency situation when looking a few years out at stuff like this.

But that's why I believe moving graphics setup / culling / animation code to GPU is even more important. It frees more CPU cores to the game logic. And game logic really needs CPU to perform well. On consoles the processing resources are always limited (especially on CPU side). Good balance between CPU and GPU is the key to success.
Yeah and I'm not saying that a chunk of that stuff doesn't indeed belong on the GPU. I'm just saying that in the future - probably by the next console generation - you're going to be in a situation similar to the 4350U, where running a workload in the most power-efficient place will always be the best, since you can never spin up the entire chip at once anyways. The concept of "leaving CPU available for X" won't really be relevant on those chips, since the CPU portion could easily consume the entire TDP if desired.

I also want to point out that your CPU of choice (in these comparisons) is Haswell (8 wide AVX2 + FMA + gather). Not all CPUs are that efficient in processing heavy floating point graphics setup code ("refresh 10000 matrices by slerping animation key frame quaternions and positions").
No argument. Again, I'm using it as a proxy for future chips, not making arguments about what you should do today :)

I don't want to get too far astray in this CPU/GPU discussion, but at the same time it is not really that off-topic per se. It is a fairly fundamental question related to these programming model decisions: if we assume there will be some amount of queuing work and other nasty stuff at the front of the conceptual graphics pipeline, is that more appropriately suited to the GPU execution units or to a CPU? Obviously such a CPU could be integrated into the GPU frontend itself but that distinction is not particularly relevant for the architectural discussion (it's roughly the same solution as Mantle w/ simple, dense instruction streams).
 
Right and that's the rub... while a paper may happily conclude that they moved X algorithm to a GTX 680 and got a 2x performance improvement, the reality is that you just de-optimised it significantly in terms of power...
I definitely wouldn't move anything from CPU to GPU if I only got a 2x gain (by using the whole GPU). GPU is obviously required for graphics rendering as well, so it's a precious resource for games (not something that should be wasted by moving badly suited algorithms to it). So far I have only moved things to GPU that actually speed up the total frame rendering time (for example improved culling = less triangles to process = faster rendering, also GPU setup/animation = less GPU stalls because each draw call does more and requires less data modifications). I never would use more than maybe 1-2 milliseconds (max) per frame to setup/cull things. These steps used to take two whole CPU cores in the past (one of those cores being mostly rendering API overhead), so it is definitely more power efficient this way.
 
Sure, but ultimately uncached stuff is going to dominate the power budget regardless of the microarchitecture. Currently this is actually part of the reason why CPUs tend to do pretty well: large caches. Of course GPUs are increasingly adding caches and so on, but they spend a large portion of those transistors on register files already for the purposes of latency hiding. You can't have your cake and eat it too in this situation... you always pay the cost of unpredictable memory access somewhere. Prefetching (HW or SW) and GPU-style "stall on use" are really the same solution, even though they look very different to the application developer.

I believe that on a high end CPU or GPU, the RAM consumes something on the order of 20W, so far from dominating power usage. Mobile might be different.

The big difference between on-chip and off-chip bandwidth costs is that while off-chip costs more power per byte, on-chip has far higher bandwidth and usage. If you count up the data transferred around the inside of a single OoOE core, you can get some awfully high numbers - something on the order of 100s of bytes per clock cycle when you add up all the various buffer lookups for register renaming, RoB, branch prediction, issue buffer, PRF to ARF, and a few more buffers that I've no doubt forgotten, all multiplied by the superscalar width and possibly SIMD width, depending on the buffer in question... Just simple bypassing requires register width * bypass locations in the pipeline * superscalar width bytes.

This means that the OoOE hardware is pushing around TB/s of in core bandwidth, regardless of which instructions are actually being executed, around 2 orders of magnitude greater than the maximum bandwidth to global memory.

There's also a significant difference between prefetching and stall on miss - with prefetching, you have to guess which bits of memory are going to be used shortly, and if you guess wrong, you just wasted all that power and bandwidth for nothing. With stall on miss, you always know the memory will be used.
 
Does it make sense to count internal bandwidth? There are billions of "wires" inside a multi-billion transistor chip, why do you count only those wires that transfer signals a human would interpret as "hundred of bytes" of architectural significance? If I were in marketing I would count the intra-cycle wires connecting transistors of f.ex. an adder chain: :runaway: EXABYTES per second :runaway:
But seriously, where is power dissipated inside a chip? I just know enough to be wrong with big words: there's static leakage current/transistor, dynamic power for a switching transistor, what else?
 
I believe that on a high end CPU or GPU, the RAM consumes something on the order of 20W, so far from dominating power usage.
20W is a significant portion of the TDP of most SoCs :)

Does it make sense to count internal bandwidth? There are billions of "wires" inside a multi-billion transistor chip, why do you count only those wires that transfer signals a human would interpret as "hundred of bytes" of architectural significance?
Right I don't think there's anything to be gained from thinking about "bandwidth" internally... I think the only relevant note there is that stuff gets more expensive power-wise the further it gets away. So for instance, hitting DRAM is going to cost a lot more than L3$, which in turn is going to cost more power than L1$.

There's also a significant difference between prefetching and stall on miss - with prefetching, you have to guess which bits of memory are going to be used shortly, and if you guess wrong, you just wasted all that power and bandwidth for nothing. With stall on miss, you always know the memory will be used.
It's just a compiler transform really - you can get "stall on use" semantics by replacing loads with prefetch + load, lifting the prefetch as high as possible and optionally putting fiber switches (i.e. software "threads") between. There's no escaping the trade-off between how early you can prefetch/load before you need the data. If you can apply algorithmic knowledge to that it's always better than letting the hardware/software try and hide the latency "late".
 
But seriously, where is power dissipated inside a chip? I just know enough to be wrong with big words: there's static leakage current/transistor, dynamic power for a switching transistor, what else?

Dynamic power use comes from current moving across the transistors as they switch from off to on (and vice versa), which isn't instantaneous. With something like CMOS, the transistors are arranged so that the only time current moves across the chip (ignoring leakage) is during this transition (the rest of the time, you can think of the state of the chip as a bunch of tiny capacitors between the transistors, which correspond to "on" and "off").

Static current/leakage is just what it sounds like. The gaps between the wires (using this term broadly to mean everything connecting the transistors) aren't a perfect insulator, so a bit of current can leak across, sorta like a short circuit. As your wires/transistors get smaller, so do the spaces between them, which lets all the more current leak across.

I'm going to go out on a limb here (I don't really know all that much about this) and propose that the high power consumption of communication comes from two sources: First, wires resist current more and more as they get thinner, so in order to pass a signal through a wire to your destination somewhere else on the chip, you have to amplify the signal, which costs extra power. Second, the very presence of these wires increases leakage, since it offers a shortcut for leaking current as it moves across the chip from Vin to Vout. Think of these communication lines as a lightning rod, which offers a path of lower resistance for the current.

Does it make sense to count internal bandwidth? There are billions of "wires" inside a multi-billion transistor chip, why do you count only those wires that transfer signals a human would interpret as "hundred of bytes" of architectural significance? If I were in marketing I would count the intra-cycle wires connecting transistors of f.ex. an adder chain: :runaway: EXABYTES per second :runaway:

Right I don't think there's anything to be gained from thinking about "bandwidth" internally... I think the only relevant note there is that stuff gets more expensive power-wise the further it gets away. So for instance, hitting DRAM is going to cost a lot more than L3$, which in turn is going to cost more power than L1$.

Internal bandwidth absolutely counts when you're talking in terms of power consumption. It costs power even if it doesn't directly contribute to computation. Note that I'm not talking about bandwidth between two transistors in an adder, but between different functional units and buffers inside the core, which have much the same characteristics as, say, the L1$ as far as distance of data communication goes.

The numbers I've heard put transferring a word of data 1mm across a die at the same cost as a double precision multiplication, so even accessing the L1$ isn't cheap (a modern CPU core is probably 5mm across). Throw in half a dozen buffer accesses needed for OoOE (each of which costs similar to an L1$ hit) and you can really have a runaway in overhead power use.

20W is a significant portion of the TDP of most SoCs

I'm talking in the context of 200W discrete GPUs and 75W high end CPUs here:smile: For mobile (laptops on down) you can cut this down quite a bit by clocking down memory, using LPDDR, etc.

It's just a compiler transform really - you can get "stall on use" semantics by replacing loads with prefetch + load, lifting the prefetch as high as possible and optionally putting fiber switches (i.e. software "threads") between. There's no escaping the trade-off between how early you can prefetch/load before you need the data. If you can apply algorithmic knowledge to that it's always better than letting the hardware/software try and hide the latency "late".

Obviously it makes sense to move memory accesses as far forward in the code as you can, so that you cover as much of any given stall as possible. I was talking about speculative prefetching, which is in fact used in many CPUs.
 