That madman has finally revealed his motives for his blog, and he is now suckering his followers into donating money to his PayPal account.
Let's get back on topic here.
I'm hoping perhaps someone senior can assist with this.
The GCP is responsible for running a number of contexts; when a context hits a snag and can no longer process instructions, instead of stalling it switches over to the next context. It does this round-robin, much like Hyper-Threading does, but at the hardware level. I'm reading about 1-2 cycles of latency on NVIDIA hardware, and I'd imagine AMD is similar. Ideally the aim is to have your contexts loaded up with work so you just blaze through it. You also have the ACE queues if all your graphics contexts are stalled, so you can begin to fit in some compute work as well.
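To make the idea concrete, here's a toy sketch of that round-robin behavior. All the names and data structures are my own invention for illustration; this is not how any real command processor is implemented.

```python
# Toy round-robin scheduler: cycle through contexts, skipping any that are
# stalled (e.g. waiting on memory) instead of waiting on them.

def run_round_robin(contexts):
    """Each context is a dict with a 'work' list and a 'stalled' flag.
    Returns the order in which work items were executed."""
    executed = []
    while any(ctx["work"] for ctx in contexts):
        progressed = False
        for ctx in contexts:
            if not ctx["work"]:
                continue
            if ctx["stalled"]:
                # Instead of waiting, switch to the next context
                # (the ~1-2 cycle switch latency mentioned above).
                continue
            executed.append(ctx["work"].pop(0))
            progressed = True
        if not progressed:
            # Every context that still has work is stalled:
            # the whole unit has effectively stalled.
            break
    return executed

contexts = [
    {"work": ["draw_a1", "draw_a2"], "stalled": False},
    {"work": ["draw_b1"], "stalled": True},   # waiting on memory
    {"work": ["draw_c1"], "stalled": False},
]
print(run_round_robin(contexts))  # ['draw_a1', 'draw_c1', 'draw_a2']
```

Note the last branch: if every remaining context is blocked, the scheduler has nothing to switch to, which is the effective-stall case discussed below.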
So why the 2nd GCP for the Xbox One? I think at this point it's actually more a symptom than a boost.
The Xbox pulls memory from two pools, ESRAM and DDR3. One has amazing latency, the other is OK; one has superior bandwidth, the other doesn't. You're likely going to run into a lot of situations where you have to wait on DDR3. And since you don't have a lot of bandwidth, even though the data is coming back at a steady rate, you're ultimately not getting enough work done, so you're likely to keep switching contexts. If you're switching contexts and you blaze through 7 of them all waiting on DDR3, and you've got nothing else in queue, you've effectively stalled. So I think they added the extra contexts to remedy the DDR3 situation. Multi-threaded rendering helps keep all the queues full, giving the GCP more time to properly schedule the work that needs to be done.
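A crude back-of-the-envelope way to see why more contexts would help: if you pretend each context is independently waiting on DDR3 with some probability p at any instant, then the chance the GCP finds *every* context stalled at once is p to the power of the context count. The probability and independence assumption are entirely made up here, just to show the shape of the argument.

```python
# Toy model: chance that ALL n contexts are simultaneously waiting on DDR3,
# assuming each is independently stalled with probability p (made-up numbers).

def full_stall_probability(p_stalled, n_contexts):
    return p_stalled ** n_contexts

p = 0.5  # hypothetical chance a given context is waiting on DDR3
for n in (8, 16):
    print(n, full_stall_probability(p, n))
# 8 contexts  -> 0.00390625     (~0.4% chance of an effective stall)
# 16 contexts -> 1.52587890625e-05
```

Under this toy model, doubling the number of contexts squares the full-stall probability down, which fits the idea that the second GCP is there to paper over DDR3 waits rather than to add raw throughput.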
I think this is pretty high level and still needs to be verified, but at this point the rabbit hole goes deeper: it may be the reason why you'd actually prefer smaller draw calls, and more of them. It comes down to ALU utilization/occupancy. Looking at the specifics, you can't just assign whatever job you want. I was reading Timothy Lottes' blog post about GCN occupancy.
http://timothylottes.blogspot.ca/2014/03/gcn-and-wavefront-occupancy.html
I won't quote too much; truthfully, it's a short post:
Shaders can easily get issue limited when the number of wavefronts becomes small. Without at least 3 wavefronts per SIMD unit, the device cannot tri-issue a {vector, scalar, and memory} group of operations. This will be further limited by the number of wavefronts which are blocked during execution (say because they are waiting on memory). Abusing instruction level parallelism and increasing register pressure at the expense of occupancy can result in low ALU utilization.
He reviews that:
- Each SIMD unit has the capacity of 1 to 10 wavefronts.
- Once launched, wavefronts do not migrate across SIMD units.
- A CU can decode and issue 5 instructions/clk for one SIMD unit.
- It takes 4 clocks to issue across all four SIMD units.
- The 5 instructions need to come from different wavefronts.
- The 5 instructions need to be of different types.
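My own simplification of those rules, as a toy function rather than a real GCN model: per issue opportunity, one SIMD can issue at most 5 instructions, each from a different wavefront and each of a different type. So the co-issue count is capped by however many distinct instruction types the ready wavefronts are offering.

```python
# Toy reading of the issue rules above (my simplification, not a real
# GCN simulator): at most 5 instructions per issue, each from a different
# wavefront, each of a different type.

def issuable(wavefront_next_ops):
    """wavefront_next_ops: the next instruction type of each resident,
    non-blocked wavefront on one SIMD. Returns how many can co-issue."""
    distinct_types = set(wavefront_next_ops)
    return min(len(wavefront_next_ops), len(distinct_types), 5)

# 2 wavefronts both about to do a vector op: only 1 can issue.
print(issuable(["vector", "vector"]))            # 1
# 3 wavefronts with mixed types: vector + scalar + memory tri-issue.
print(issuable(["vector", "scalar", "memory"]))  # 3
```

That first case is the failure mode I'm wondering about below: lots of wavefronts all wanting the same instruction type at the same time.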
My understanding here isn't fully clear, since I'm not sure what happens at the SIMD level if I render 100K boxes with the same shader. Does it behave as quoted above? I'm not sure how many wavefronts get submitted.
Would having different draw calls diversify the types of instructions that need to be completed, and therefore possibly increase ALU utilization?