Not sure if you saw, but I mentioned this in my post as well. The issue is that this is the only example people have given of cases where you're completely not utilizing the ALU array... rendering depth-only. So there's a nice one-time boost we can get during shadow map rendering, but it's not a long-term performance amplifier per se. Also, the more power-constrained a GPU is, the less this will help.
That was just an easy example, because shadow map rendering is triangle and ROP bound, while the compute step is likely bandwidth and ALU bound. But this is not the only case where it helps to run two kernels (or a kernel + rendering) simultaneously. For example, if you have two kernels, one ALU bound and one BW bound, manually allocating the GPU resources between the two could bring a performance boost, since the bandwidth (and L2) are shared across the whole GPU: the BW-hungry part can use more than half of the bandwidth (let's say 65%), while the ALU-hungry part only uses the remaining 35%. Now if you allocate the CU (ALU) resources 50/50 between these two tasks, the BW-heavy task finishes in 1.3x time, and the ALU-heavy task, having completed 0.65x of its work at half rate by that point, finishes the remaining 0.35x at full rate, so it ends at 1.3x + 0.35x = 1.65x time. Compared to running the two back to back (2.0x), that's a total time saving of 0.35x (17.5%). Of course a 50/50 split isn't even the best fit here, since you would want to allocate more CUs to the ALU-heavy task. But this is of course pure speculation, since we don't know yet how fine-grained the control over GPU resource allocation between concurrent tasks is on Mantle.
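To spell out the arithmetic (assuming both kernels take one unit of time when run alone on the whole GPU, so running them back to back costs 2.0 units, and taking the 1.3x slowdown of the BW-heavy kernel as a given):

```latex
\begin{align*}
t_{\mathrm{BW}}  &= 1.3 && \text{(BW-heavy kernel slowed to $1.3\times$ by its 65\% bandwidth share)}\\
t_{\mathrm{ALU}} &= 1.3 + (1 - 0.5 \cdot 1.3) = 1.65 && \text{(half rate on half the CUs until $t = 1.3$, then full rate)}\\
\text{saving}    &= \frac{2.0 - 1.65}{2.0} = 0.175 = 17.5\%
\end{align*}
```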
I really wasn't too happy with the answers given in the Q&A. I'm going to write it off to people not having thought a lot about it and being somewhat conditioned in their engine design thinking by the way APIs have worked up to this point, but they really missed the point of why people are bringing up bindless in this context.
Bindless textures basically remove the last piece of state that changes at high frequency in engines and thus "breaks batches". sebbbi has mentioned this before, but if you want to, you can render at least an entire pass to a set of render targets with a single draw call using bindless. Thus the overhead of draw calls becomes largely irrelevant... even DX today has quite acceptable overhead if you're only talking about tens of draw calls per frame.
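To make that concrete, here is a minimal shader-side sketch (my own, in Shader Model 5.1-style descriptor array syntax; the same idea maps to GL_ARB_bindless_texture or raw GCN descriptors). The texture is selected by an index stored in per-instance data rather than by a binding change, so nothing on the CPU has to change between objects:

```hlsl
// Hypothetical sketch: per-object texture selection via bindless indexing.
// Names (InstanceData, g_textures, ...) are illustrative, not from the thread.

struct InstanceData
{
    float4x4 world;
    uint     albedoIndex; // which texture this instance samples
};

StructuredBuffer<InstanceData> g_instances : register(t0);
Texture2D                      g_textures[] : register(t1); // unbounded "bindless" texture table
SamplerState                   g_sampler    : register(s0);

struct PSInput
{
    float4 position : SV_Position;
    float2 uv       : TEXCOORD0;
    nointerpolation uint instanceId : INSTANCEID; // forwarded from SV_InstanceID by the vertex shader
};

float4 PSMain(PSInput input) : SV_Target
{
    // No state change between objects: the resource is chosen by data, so an
    // entire pass worth of objects can share one draw call. (If the index can
    // differ between threads of a wave/warp, the coherency caveats discussed
    // later in the thread apply.)
    InstanceData inst = g_instances[input.instanceId];
    return g_textures[inst.albedoIndex].Sample(g_sampler, input.uv);
}
```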
(...)
I agree with you. Bindless is an enabler of GPU-driven rendering, not a minor performance-boosting feature. Yes, you can use it in a "wrong" way and get a minor 20% boost for your traditional CPU-driven pipeline. And that's fine if you want to submit draw calls with the CPU. But submitting (and rebuilding) 100k draw calls per frame on the CPU is just a huge waste of CPU resources. You need multiple CPU threads just to push draw calls. Same stuff every frame, again and again (only changing the data set by a few percent every frame).
A GPU-driven pipeline on the other hand doesn't need more than a single indirect draw call (or a fixed number of them) to render a whole scene (sidestepping the whole draw call cost issue entirely). Tiled resources ("virtual textures") and bindless resources are absolutely necessary here, because the CPU doesn't know anything about the rendered scene (the GPU does the whole scene management). Thus the CPU cannot change any resources: vertex/index buffers, constant buffers or textures. The GPU needs to pull this data on its own. Without bindless resources and virtual paging ("virtual texturing" is an outdated term, since you can use a similar technique for meshes and constant data as well), GPU-driven rendering is pretty much impossible. Software-based virtual texturing (and virtual paging in general) is of course usable on all platforms, and that's a good solution for many cases (only textures need filtering across page borders, and that requires some shader trickery).
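As a rough sketch (mine, with made-up names, assuming all visible geometry is drawn from shared merged vertex/index buffers), the GPU side can boil down to a culling compute shader that fills the argument buffer for one DrawIndexedInstancedIndirect() call; the CPU only clears the instance count and issues that fixed indirect draw, with no per-object work:

```hlsl
// Hypothetical GPU-driven culling pass: the compute shader compacts the list
// of visible instances and bumps the InstanceCount field of the indirect
// argument buffer. The CPU writes the constant argument fields once, clears
// InstanceCount each frame, and calls DrawIndexedInstancedIndirect().

struct InstanceData
{
    float4x4 world;
    float4   boundingSphere; // xyz = center (world space), w = radius
};

StructuredBuffer<InstanceData> g_instances        : register(t0);
RWStructuredBuffer<uint>       g_visibleInstances : register(u0); // compacted visible instance IDs
RWByteAddressBuffer            g_drawArgs         : register(u1); // { IndexCountPerInstance, InstanceCount, StartIndexLocation, BaseVertexLocation, StartInstanceLocation }

cbuffer CullConstants : register(b0)
{
    float4 g_frustumPlanes[6];
    uint   g_instanceCount;
};

bool SphereVisible(float4 sphere)
{
    [unroll]
    for (uint i = 0; i < 6; ++i)
    {
        if (dot(g_frustumPlanes[i].xyz, sphere.xyz) + g_frustumPlanes[i].w < -sphere.w)
            return false; // fully outside this frustum plane
    }
    return true;
}

[numthreads(64, 1, 1)]
void CullCS(uint id : SV_DispatchThreadID)
{
    if (id >= g_instanceCount)
        return;

    if (SphereVisible(g_instances[id].boundingSphere))
    {
        // Reserve a slot by incrementing InstanceCount (byte offset 4 in the args buffer).
        uint slot;
        g_drawArgs.InterlockedAdd(4, 1, slot);
        g_visibleInstances[slot] = id;
    }
    // The vertex shader then uses SV_InstanceID to read g_visibleInstances and
    // pulls its transform, material index, etc. itself (bindless / virtual paging).
}
```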
Bindless has some advantages over hardware PRT as well. The biggest advantage is the flexible data format. Every resource descriptor can point to data in a different format (different BC formats, floats, integers, packed 11-11-10, etc). PRT on the other hand has a fixed format for the whole resource: if some of your texture pages needed better quality (uncompressed or a different BC mode), that is not possible. Also, bindless resource descriptors can point to differently sized resources (with different mip counts, etc), and there's proper border wrapping/clamping for the filtering as well. These are quite handy features.
I think the question panel gave some incorrect information about the bindless texture GPU cost. Bindless textures cost basically no GPU performance at all; it's not a trade-off when you use them correctly. This is because modern hardware has scalar units in addition to SIMD units, and scalar units can be used to fetch wave/warp-coherent data, such as constants. The scalar units can also do any kind of wave/warp-coherent calculation, such as multiplying two constants together, performing calculations that lead to branching decisions (branching has wave/warp granularity), or calculating addresses for scalar fetches... For the scalar unit it doesn't matter whether a resource descriptor is fetched from a hard-coded address (CPU-side binding) or a calculated address (bindless). It's just one extra scalar ALU operation + one extra scalar fetch per wave/warp (= 64 threads on AMD / 32 threads on Nvidia). I have never seen a scalar-unit-bound shader in my life... so all of this is very likely hidden completely = free.
Bindless would cost performance in a case where you have wave/warp coherency problems. A single GPU data fetch instruction fetches data for all (64 or 32) threads in the wave/warp at once from a single resource. If you wanted each thread to fetch data from a different resource, the execution would need to be serialized (a similar penalty to branch divergence, I suppose). I don't know if this is entirely true for the most recent Nvidia hardware, since the OpenGL bindless prototype performance results I investigated several years ago were running on Fermi. That prototype showed a huge performance drop for incoherent bindless resource access patterns. However, if you can guarantee that each thread in the same wave/warp fetches data from the same resource (for example the resource index comes from a constant buffer, or you have manual control over it in a compute shader), you should have (very near to) zero GPU performance cost. If wave/warp granularity is not guaranteed, the shader compiler seems to do some black magic to cover for you (but at a big cost). I have to say I don't have a clue how this works on Intel hardware (modern AMD and Nvidia hardware seem quite similar in this regard).
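To illustrate the two cases (a sketch in SM 5.1-style syntax, which postdates this discussion; names are made up):

```hlsl
// Coherent vs. divergent bindless indexing, as described above.

Texture2D    g_textures[] : register(t0);
SamplerState g_sampler    : register(s0);

cbuffer PerDraw : register(b0)
{
    uint g_materialTexture; // the same value for every thread in the wave/warp
};

// Coherent case: the index is wave/warp-uniform, so the descriptor can be
// fetched once per wave with a scalar load -> essentially free.
float4 SampleCoherent(float2 uv)
{
    return g_textures[g_materialTexture].Sample(g_sampler, uv);
}

// Divergent case: each thread may pick a different texture. The compiler has
// to handle the divergence for you (e.g. by iterating over the distinct
// descriptors present in the wave), which is where the big cost comes from.
float4 SampleDivergent(float2 uv, uint perThreadIndex)
{
    return g_textures[NonUniformResourceIndex(perThreadIndex)].Sample(g_sampler, uv);
}
```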
100k draw calls on Mantle seems like a huge number. The interesting question is: how many CPU cores do you need to dedicate fully to rendering in order to submit this many draw calls? And in the long run, how does the performance compare to a fully GPU-driven pipeline that completely sidesteps the draw call overhead issue? (The bog-standard PC DirectX 11 API is enough for HUGE object counts.)