DirectX 12: Its future in the console gaming space (specifically the XB1)

No, but that is just about exactly the theoretical power difference.



I agree. They probably didn't have much time to work on it.
GNM is still a different API than DirectX, as we saw with the developer error around AF on PS4; there is a steep learning curve there that couldn't have been climbed that early. They've always had BF3 and BF4 on DirectX, and even though it wasn't low level, it's a testament to how long developers have been learning the ins and outs of DirectX and making it perform.

I'm hoping to see the next iteration of the engine drop cross-gen and go for the gold. Wait for Star Wars Battlefront in 17 days!
 
Yeah.
Anand's looks like this:

260-series should be 1 gfx + 7 or 8 compute (depending on whether the gfx actually eats a slot or not)
290-series & 285 should be 1 gfx + 63 or 64 compute (depending on whether the gfx actually eats a slot or not)
And I'm fairly certain that GCN 1.0's ACEs could each do 2 queues, not 1, so it should be 1 gfx + 3 or 4 compute (depending on whether the gfx actually eats a slot or not)

edit:
AnandTech's article has been fixed :)
Because I like getting @mosen fired up.
From @Ryan Smith -at- Anandtech
From a feature perspective it’s important to note that the ACEs and graphics command processors are different from each other in a small but important way. Only the graphics command processors have access to the full GPU – not just the shaders, but the fixed function units like the geometry units and ROPs – while the ACEs only get shader access. Ostensibly the ACEs are for compute tasks and the command processor is for graphics tasks, however with compute shaders blurring the line between graphics and compute, the ACEs can be used to execute compute shaders as well now that software exists to make use of it.

GPU Queue Engine Support

GPU                                     Graphics/Mixed Mode       Pure Compute Mode
AMD GCN 1.2 (285)                       1 Graphics + 8 Compute    8 Compute
AMD GCN 1.1 (290 Series)                1 Graphics + 8 Compute    8 Compute
AMD GCN 1.1 (260 Series)                1 Graphics + 2 Compute    2 Compute
AMD GCN 1.0 (7000/200 Series)           1 Graphics + 2 Compute    2 Compute
NVIDIA Maxwell 2 (900 Series)           1 Graphics + 31 Compute   32 Compute
NVIDIA Maxwell 1 (750 Series)           1 Graphics                32 Compute
NVIDIA Kepler GK110 (780/Titan)         1 Graphics                32 Compute
NVIDIA Kepler GK10x (600/700 Series)    1 Graphics                1 Compute
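To make the GCP/ACE split that Ryan describes a bit more concrete, here's a minimal sketch of how D3D12 exposes it on PC: DIRECT queues feed the graphics command processor (graphics, compute and copy work), while COMPUTE queues only feed the compute path. The helper and variable names below are my own illustration, not anything from the article.

#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Sketch: one queue per front end. The DIRECT queue maps to the graphics
// command processor (full GPU access); the COMPUTE queue maps to the
// ACE/compute-only path described in the quote above.
void CreateQueues(ID3D12Device* device,
                  ComPtr<ID3D12CommandQueue>& gfxQueue,
                  ComPtr<ID3D12CommandQueue>& computeQueue)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};

    desc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;   // graphics + compute + copy
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&gfxQueue));

    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;  // compute + copy only
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&computeQueue));
}

Work submitted to the compute queue can then overlap with work on the direct queue, which is the async compute scenario the table is counting queues for.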

So guys, what happens now that there are 2 graphics command processors on Xbox One and 2 compute engines, lol.
(Let's not leave PS4 out here; it's also got a system-reserved GCP in one of their diagrams.)

Any thoughts on what it will do/does yet with the new knowledge? Any educated guesses?
 
The most obvious answer is yes, and that obvious answer should not be ignored; more often than not, the obvious answer is the correct one.

It's unfortunately a little complicated, however. We've read the leaked SDK.

There are 8 hardware graphics contexts on Xbox One, and one context is already system reserved.
 
On console?
Yes, for Xbox One, though I was mostly thinking of PC and forgot this was a console thread.

Well, I'm a guy that likes to fail fast; it looks like PS4 has really had both a major API and a hardware advantage. I'm surprised Driveclub and The Order: 1886 didn't leverage this. They may have, though; it's just not listed on the slide.

[Image: Async_Games_575px.png]
As you theorized, that's not an exhaustive list.
 
Really? I didn't think they would need to reserve an ACE for the system. Do you have a source I can look at?

Sorry, it is not reserved; the system can use the ACEs. It's been a long time since I read the article, and my memory is rusty on the subject.

http://www.vgleaks.com/orbis-gpu-compute-queues-and-pipelines

PS4 GPU has a total of 2 rings and 64 queues on 10 pipelines

– Graphics (GFX) ring and pipeline

  • Same as R10xx
  • Graphics and compute
  • For game


– High Priority Graphics (HP3D) ring and pipeline

  • New for Liverpool
  • Same as GFX pipeline except no compute capabilities
  • For exclusive use by VShell


– 8 Compute-only pipelines

  • Each pipeline has 8 queues, for a total of 64
  • Replaces the 2 compute-only queues and pipelines on R10XX
  • Can be used by both game and VShell (likely assigned on a pipeline basis: 1 for VShell, and 7 for the game)
  • Queues can be allocated by game system or by middleware type
  • Allows rendering and compute loads to be processed in parallel
  • Liverpool compute-only pipelines do not have Constant Update Engines (present in R10XX cards)
 
The most obvious answer is yes, and that obvious answer should not be ignored; more often than not, the obvious answer is the correct one.

It's unfortunately a little complicated, however. We've read the leaked SDK.

There are 8 hardware graphics contexts on Xbox One, and one context is already system reserved.

As a possible counterpoint, GPUs with one GCP support 8 hardware contexts as well. It may be a name collision, but it may also mean that the contexts mentioned in the leak don't imply anything out of the ordinary.

case CHIP_BONAIRE:
...
rdev->config.cik.max_hw_contexts = 8;
...
case CHIP_HAWAII:
...
rdev->config.cik.max_hw_contexts = 8;
...
case CHIP_KAVERI:
...
rdev->config.cik.max_hw_contexts = 8;
...
case CHIP_KABINI:
case CHIP_MULLINS:
...
rdev->config.cik.max_hw_contexts = 8;

http://cgit.freedesktop.org/~airlied/linux/tree/drivers/gpu/drm/radeon/cik.c?h=drm-next
 
Agreed and expected. That brings us to a weird place: debating whether this is a naming convention issue and figuring out what the contexts actually do. I'm assuming a context is a queue.
If one is reserved for system usage and there are 8 contexts per GCP, I don't see the point of the second GCP, if the purpose of a GCP is to control the full GPU.
Then a second one can only be
a) redundancy
b) actually used in the operation of the GPU.
A long time ago I theorized that the 2nd GCP would be used for improved context switching (earlier in this thread, much earlier). I think this statement from Ryan might apply here:

Latency hiding in turn can become easier with multiple work queues. The additional queues give you additional pools of threads to pick from, and if the GPU is presented with a situation where it absolutely can’t hide latency from the graphics queue and must stall, the compute queues could be used to fill that execution bubble. Similarly, if there flat-out aren’t enough threads from the graphics queue to fill out the GPU, then this presents another opportunistic scenario to execute threads from a compute task to keep the GPU better occupied. Compared to a purely serial system this also helps to mitigate some of the overhead that comes from context switching.
So I think the second GCP is just like adding more threads for the GPU to work on: should you decide to send a lot of smaller graphics items, you have a lot more queues to pull from. I think this is just going to boil down to better graphics pipeline scheduling. As in, you fill 16 contexts up and the GPU has an overarching view of how to schedule all of it efficiently. Maybe they put this in to cope with the multi-threaded renderer: perhaps when a lot of different jobs come in simultaneously, without enough queues to hold them all, scheduling efficiency drops. I still hold onto the idea that they could be leveraging a dual-clutch type of idea: when you switch contexts you pull from the other GCP, and you keep flipping back and forth, hiding the latency of context switching (which at fastest I think must be 2 cycles: one to copy the registers out, and one to copy in the new set of registers).

I don't think this is much different from PS4 providing 8 ACEs, which is overkill, but the idea might be the same: just providing additional flexibility if, for some reason, you have a lot of fine-grained jobs (for X1, you have 15 graphics queues to work from and 16 compute queues).
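To put the multi-threaded renderer point in code terms, here's a rough sketch (my own illustration, not anything from the SDK leak) of the usual D3D12 pattern where several threads each record their own command list and the whole batch is submitted in one go, so the GPU front end has a deep pool of work to schedule from:

#include <d3d12.h>
#include <wrl/client.h>
#include <thread>
#include <vector>
using Microsoft::WRL::ComPtr;

// Each worker records its slice of the frame into its own command list;
// one ExecuteCommandLists call then hands the whole batch to the front end.
void RecordFrameInParallel(ID3D12Device* device, ID3D12CommandQueue* gfxQueue,
                           ID3D12PipelineState* pso, int workerCount)
{
    std::vector<ComPtr<ID3D12CommandAllocator>>    allocators(workerCount);
    std::vector<ComPtr<ID3D12GraphicsCommandList>> lists(workerCount);

    for (int i = 0; i < workerCount; ++i) {
        device->CreateCommandAllocator(D3D12_COMMAND_LIST_TYPE_DIRECT,
                                       IID_PPV_ARGS(&allocators[i]));
        device->CreateCommandList(0, D3D12_COMMAND_LIST_TYPE_DIRECT,
                                  allocators[i].Get(), pso, IID_PPV_ARGS(&lists[i]));
    }

    std::vector<std::thread> workers;
    for (int i = 0; i < workerCount; ++i) {
        workers.emplace_back([&lists, i] {
            // ... each worker records its own share of the draw calls here ...
            lists[i]->Close();
        });
    }
    for (auto& t : workers) t.join();

    std::vector<ID3D12CommandList*> batch;
    for (auto& l : lists) batch.push_back(l.Get());
    gfxQueue->ExecuteCommandLists(static_cast<UINT>(batch.size()), batch.data());
}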

I thought PS4 ran with a 14/4 model, so does ISS run all 18 CUs?
Yes. It's not a 14/4 model. You can't divide CUs like that.
 
XB1 = 12 CUs at 853 MHz. PS4 = 18 CUs at 800 MHz. The differential is PS4 being ~40.7% faster. A 56% difference in pixels is 38% larger than that theoretical hardware difference.

Yes, but when you factor in the difference in memory bandwidth, I think you would get that 56% difference in pixels. That's just hardware.
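For reference, a quick back-of-envelope check of the numbers in the exchange above, assuming the usual GCN figures of 64 FP32 lanes per CU and 2 ops per clock (FMA); those constants are my assumptions, not from the posts:

#include <cstdio>

int main()
{
    // 12 CUs * 64 lanes * 2 ops * 853 MHz vs 18 CUs * 64 lanes * 2 ops * 800 MHz
    const double xb1 = 12.0 * 64 * 2 * 853e6;   // ~1.31 TFLOPS
    const double ps4 = 18.0 * 64 * 2 * 800e6;   // ~1.84 TFLOPS
    const double hwGap    = ps4 / xb1 - 1.0;    // ~0.407 -> PS4 ~40.7% faster
    const double pixelGap = 0.56;               // the 56% pixel difference cited above

    std::printf("XB1 %.2f TF, PS4 %.2f TF, hw gap %.1f%%\n",
                xb1 / 1e12, ps4 / 1e12, hwGap * 100.0);
    std::printf("pixel gap vs hw gap: %.0f%% larger\n",
                (pixelGap / hwGap - 1.0) * 100.0);
    return 0;
}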
 
That madman has finally revealed his motives for his blog, and he is now suckering his followers into donating money to his PayPal account.
Let's get back on topic here ;)

I'm hoping someone senior can assist with this.
The GCP is responsible for a number of contexts to run; when a context hits a snag and can no longer process instructions, instead of stalling it will switch over to the next context. It does this round-robin, much like Hyper-Threading does, but at a hardware level. I'm reading about 1-2 cycles of latency on Nvidia hardware, and I can imagine it being similar for AMD. I think ideally the aim is to have your contexts loaded up with work, and you just blaze through it. You have the ACEs' queues as well if all your graphics contexts are stalled, so you can then begin to fit in some compute work too.

So why the 2nd GCP for Xbox One? I think this is actually more a symptom than a boost now.
Xbox One pulls memory from two pools, eSRAM and DDR3. One has amazing latency, the other is OK; one has superior bandwidth, the other doesn't. You're likely going to run into a lot of situations where you have to wait on DDR3. And since you don't have a lot of bandwidth, even though the information is coming back at a steady rate, ultimately you're not getting enough work done, so you're likely to keep switching contexts. If you're switching contexts and you blaze through 7 of them all waiting on DDR3, and you've got nothing left in queue, you've effectively stalled. So I think, to remedy the DDR3 situation, they've added additional contexts. Having multi-threaded rendering helps keep all the queues full, giving the GCP more opportunity to properly schedule the work that needs to be done.
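A toy way to see why extra contexts would help here (this is a probability sketch, not a description of the real hardware, and the 0.7 figure is an arbitrary assumption): if each context is independently waiting on DDR3 some fraction p of the time, the chance that the front end has nothing runnable at all shrinks geometrically with the number of contexts.

#include <cmath>
#include <cstdio>

int main()
{
    const double p = 0.7;  // assumed fraction of time a single context is memory-bound
    const int counts[] = {4, 8, 16};
    for (int contexts : counts) {
        // With independent contexts, the GCP only fully stalls when *every*
        // context is waiting on memory at the same time.
        std::printf("%2d contexts -> P(all stalled) = %.4f\n",
                    contexts, std::pow(p, contexts));
    }
    return 0;
}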

I think this is pretty high-level stuff that needs to be verified, but at this point I think this is where the rabbit hole goes deeper: the reason why you may actually prefer smaller draw calls, and more of them. It comes down to ALU utilization/occupancy. Looking at the specifics of it, you can't just assign whatever job you want. I was reading Timothy Lottes' blog post about GCN occupancy: http://timothylottes.blogspot.ca/2014/03/gcn-and-wavefront-occupancy.html
I won't quote too much; truthfully it's a short post:

Shaders can easily get issue limited when the number of wavefronts becomes small. Without at least 3 wavefronts per SIMD unit, the device cannot tri-issue a {vector, scalar, and memory} group of operations. This will be further limited by the number of wavefronts which are blocked during execution (say because they are waiting on memory). Abusing instruction level parallelism and increasing register pressure at the expense of occupancy can result in low ALU utilization.

He reviews that
  • Each SIMD unit has the capacity of 1 to 10 wavefronts.
  • Once launched, wavefronts do not migrate across SIMD units.
  • CU can decode and issue 5 instructions/clk for one SIMD unit.
  • It takes 4 clocks to issue across all four SIMD units.
  • The 5 instructions need to come from different wavefronts.
  • The 5 instructions need to be of different types.
My understanding here isn't fully clear, since I'm not sure what happens at the SIMD level if I render 100K boxes with the same shader. Does it do what's quoted above? I'm not sure how many wavefronts are submitted.
Would having different draw calls diversify the types of instructions that need to be completed, and therefore possibly increase ALU utilization?
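On the occupancy side of Lottes' point, here's a rough calculator for the register-pressure limit alone, using the commonly cited GCN figures of a 256-VGPR budget per SIMD lane and a hard cap of 10 wavefronts per SIMD (those constants are my assumptions, not from his post, and LDS or other limits are ignored):

#include <algorithm>
#include <cstdio>

// Wavefronts per SIMD as limited by vector register usage alone.
int WavesPerSimd(int vgprsPerThread)
{
    const int byRegisters = 256 / vgprsPerThread;  // register-file limit
    return std::min(10, byRegisters);              // architectural cap
}

int main()
{
    const int samples[] = {24, 48, 84, 128};
    for (int vgprs : samples) {
        std::printf("%3d VGPRs/thread -> %d wavefronts per SIMD\n",
                    vgprs, WavesPerSimd(vgprs));
    }
    // Per the quote above, below ~3 wavefronts per SIMD the CU can no longer
    // tri-issue vector/scalar/memory ops, so ALU utilization drops.
    return 0;
}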
 