DirectX 12: its future in the console gaming space (specifically the XB1)

Doing some math indicates that at 853 MHz with 12 CUs, the bandwidth required to remove the bandwidth bottleneck for the CUs is approximately ~240 GB/s. This happens to be roughly the total combined bandwidth of the Xbox One (192 GB/s + 67 GB/s). Looking at the PS4, we have a max theoretical of 176 GB/s, and as such, bandwidth will become a bottleneck for all 18 CUs.
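If anyone wants to redo that estimate, here's a quick C sketch of the linear scaling. The paper's reference GPU clock is my assumption (the paper may not support simple linear scaling at all):

#include <stdio.h>

/* Back-of-envelope sketch (mine, not from the paper): scale the paper's
 * 700 GB/s no-bottleneck figure for 32 CUs linearly with CU count and
 * clock. PAPER_CLK_MHZ is an ASSUMED reference clock, not a known value. */
int main(void) {
    const double PAPER_BW_GBS  = 700.0;   /* no-bottleneck BW for 32 CUs */
    const double PAPER_CUS     = 32.0;
    const double PAPER_CLK_MHZ = 1000.0;  /* assumption */

    double xb1 = PAPER_BW_GBS * (12.0 / PAPER_CUS) * (853.0 / PAPER_CLK_MHZ);
    double ps4 = PAPER_BW_GBS * (18.0 / PAPER_CUS) * (800.0 / PAPER_CLK_MHZ);

    printf("XB1, 12 CUs @ 853 MHz: ~%.0f GB/s\n", xb1); /* ~224 */
    printf("PS4, 18 CUs @ 800 MHz: ~%.0f GB/s\n", ps4); /* ~315 */
    return 0;
}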

- eSRAM max theoretical speed is not 192 GB/s!

The ESRAM is capable of simultaneous reading and writing, but every 8 cycles it can only do one of them (either a read or a write).
This allows for 15 cycles of read+write plus 1 cycle of either reading or writing.

On a 1024-bit wide bus we get: (1024/8) bytes × 853 MHz ≈ 109 GB/s. This is the guaranteed theoretical bandwidth of the eSRAM.

But theoretically we can double this on 15 out of every 16 cycles.

109 × 2 × 15/16 ≈ 204 GB/s. This is the max theoretical speed of the eSRAM, not 192 GB/s.
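Here's the same arithmetic as a quick C sketch, covering both the pre- and post-overclock clocks. The 15-of-16 model is this thread's simplification of the real banking rules, not an official formula:

#include <stdio.h>

/* One-way bandwidth of a 1024-bit bus: bits -> bytes, MHz -> GB/s. */
static double one_way_gbs(double clk_mhz) {
    return (1024.0 / 8.0) * clk_mhz / 1000.0;
}

int main(void) {
    double pre  = one_way_gbs(800.0);  /* ~102.4 GB/s, before the overclock */
    double post = one_way_gbs(853.0);  /* ~109.2 GB/s, after */

    /* Dual read+write on 15 of every 16 operation slots. */
    printf("guaranteed @853 MHz: %.0f GB/s\n", post);                 /* ~109 */
    printf("peak @800 MHz: %.0f GB/s\n", pre  * 2.0 * (15.0 / 16.0)); /* ~192 */
    printf("peak @853 MHz: %.0f GB/s\n", post * 2.0 * (15.0 / 16.0)); /* ~205; quoted as 204 */
    return 0;
}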

Real-life scenarios have so far achieved a maximum of 141 GB/s, and that on very specific operations (alpha transparency blending, FP16 ×4).

The 192 GB/s figure, with real-life speeds of 133 GB/s, dates from before the overclock.


- 700 GB/s seems to be a theoretical scenario that covers all cases; I assume this would include heavy texture streaming too. How do you plan to get 204+68 GB/s on the Xbox One with only 32 MB of fast RAM? Main memory is only 68 GB/s, so no data contained in it can be read at faster speeds, and there are 8 GB of it!


- I just had a very, very, very fast look at the document you linked. And on page 4, as you state, they do say that 700 GB/s is the bandwidth they would require for no bottlenecks.

Since it was only a quick look I might have read it wrong, but that point seems to refer to a baseline heterogeneous system, as shown in the document's Figure 1.

And Figure 1 shows a GPU accessing the RAM over a non-coherent bus. Not the PS4's case!

The rest of the document then goes on to discuss the advantages of heterogeneous system coherence (as in the PS4's case).

The conclusion states that this kind of system not only gets performance enhancements, but also a 95 to 99% reduction in bandwidth to the directory.

The PS4 has full heterogeneous system coherence. On the other hand, as far as I know, the Xbox One's CPU cannot access the ESRAM.

So it seems to me that the document you refer to suggests the PS4 is better suited than the Xbox One, and not the other way around!

Please feel free to correct me. I just had an extremely fast look at the document.
 
I explained myself wrong... The ESRAM can do 7 cycles of read+write and 1 cycle of either reading or writing.

So we have 7×2 + 1×1 = 15 operations (not cycles) out of 16 possible.
 
IIRC, the ESRAM can read and write simultaneously on 7 of every 8 cycles. On the 8th cycle it must do nothing. It can do 8 reads in a row, or 8 writes, but doing both it must rest on the 8th cycle. That is how 192 is obtained.

As for the article: yes, it is about fully heterogeneous systems, but it is about shifting bottlenecks; in this case, the bandwidth required for bandwidth to no longer be the bottleneck (for 32 CUs). 700 GB/s was the amount they found.

As for your assertion that the Xbox One cannot achieve 272 GB/s: it can simultaneously modify existing data in the eSRAM while reading from and writing to data in DDR3. They are separate pools. The idea is to keep the workload in eSRAM; data is moved in and out of eSRAM as required.
 


Microsoft claims 204 GB/s.

Also, check the esram astrophysics *spin off* thread, and this post.

XBox_One_SoC_diagram.jpg


The chart only refers to coherence on the CPU caches. The PS4 also has it on the GPU caches.

And yes... the GPU can access both memories at once. But it can read/write the ESRAM at 204 GB/s, and the DDR3 at 68 GB/s. Nothing is read or written at 272 GB/s.

If you have two Xerox machines that can each copy 1 page per second, you get two pages every second. If instead you have a single machine capable of 2 pages per second, you also get two pages per second; but it can produce one page in half a second, which the other configuration cannot!

So adding both buses does not equal one single bus at 272 GB/s, since the speed at which data is delivered is not the same. Only the total amount is.
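A toy C illustration of the copier point, using a hypothetical 16 MB block (the sizes and the imaginary single 272 GB/s bus are purely illustrative):

#include <stdio.h>

/* Time to move one block from each pool. Two pools move the same total
 * per second as one 272 GB/s bus, but a single transfer that lives in
 * one pool can only go as fast as that pool's own rate. */
int main(void) {
    const double size_gb = 16.0 / 1024.0;  /* a 16 MB block */

    printf("from eSRAM (204 GB/s): %.3f ms\n", size_gb / 204.0 * 1000.0); /* ~0.077 */
    printf("from DDR3 (68 GB/s):   %.3f ms\n", size_gb / 68.0  * 1000.0); /* ~0.230 */
    printf("single 272 GB/s bus:   %.3f ms\n", size_gb / 272.0 * 1000.0); /* ~0.057 */
    return 0;
}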
 
It doesn't need to equal a single 272 GB/s bus to potentially offer performance advantages over a single, lower-BW bus. Plus, you may avoid some costly contention issues by using two pools.

Incidentally, I wonder if the BW hit (absolute rather than proportional) caused by the CPU accessing main memory would be lower on the Bone than on the PS4, due to its DDR3 having lower-latency access than the PS4's GDDR5. The Bone's CPU jumping ahead of the GPU to access memory could cause less disruption, as the operation may complete faster.

Making better use of the esram's BW may be important in realising the gains made possible by DX12. The updated version of PIX, with streamlined esram profiling, would seem to be well timed in this regard.
 
My question about the "removal of any bottlenecks" would be... are we talking about the removal of extreme edge cases? I mean... where do the diminishing returns set in? It's not as if doubling BW will double CU performance. What's the effective read/write performance at the supposed 700 GB/s?

It's like saying that only a car with 700 bhp has "no bottlenecks" for the tyres: in each and every application you can accelerate at full throttle. Yet it doesn't go round the track faster than most other cars, as the tyres cannot sustain the load of the engine (here, the engine is BW and the tyres are the GPU... strange, as it's the other way around, but there's no real measure for tyres, I guess) every time it gets fired up. It just leads to minute gains for massive cost.
 
ESRAM is a strange beast.

Let's compare it with the PS4's GDDR5:

Full read only

ESRAM Bandwidth - 109 GB/s
GDDR5 bandwidth - 176 GB/s

Full Write only

ESRAM Bandwidth - 109 GB/s
GDDR5 bandwidth - 176 GB/s

Read and Write Cycle

ESRAM Bandwidth - 204 GB/s
GDDR5 bandwidth - 176 GB/s

So, it's obvious that while 176 GB/s is a sustained speed, the ESRAM has variations and only 109 GB/s is guaranteed.

Besides, people cannot just look at the ESRAM as a speed booster for the DDR3. They must also look at the DDR3 as a speed decreaser for the ESRAM.

Take the case of the GTX 970 in comparison.

GTX 970 - 3.5 GB faster memory (196 GB/s) + 512 MB slower memory (28 GB/s) = 224 GB/s
XBox One - 8 GB slower memory (68 GB/s) + 32 MB faster memory (204 GB/s) = 272 GB/s
PS4 - 8 GB faster memory (176 GB/s)

Fast memory use only

GTX 970 - max performance
Xbox One - not available!
PS4 - Max performance (one pool only)

Slow memory use only

GTX 970 - Slow performance
Xbox One - Slow performance
PS4 - Max performance (one pool only)

Mixed use

GTX 970 - Decrease in performance
Xbox One - Increase in performance
PS4 - Max performance (one pool only)

Now, put two and two together (Xbox and 970) and we can see that the performance increase on the Xbox One never reaches the level of a single memory pool at ESRAM speed. DDR3 reads and writes, just like on the 970, should slow performance down. The result is better than using DDR3 only, but worse than using one large pool of fast memory. That is why Microsoft's solution is not a standard!

In the Xbox's defense, the ESRAM has an API, so its usage can be optimized and results should improve!

Also, future usage of tiled resources, available since DX 11.2, should help decrease memory and bandwidth usage!
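To put rough numbers on the "speed decreaser" point, here's a deliberately pessimistic toy model (mine, not from any spec) that serializes accesses between the fast and slow pools. Concurrent access to both pools, which the Xbox One is designed for, would land somewhere between this and the straight 272 GB/s sum:

#include <stdio.h>

/* Blended rate when a fraction f of traffic must hit the slow pool,
 * assuming accesses are serialized: a time-weighted harmonic mix,
 * not a simple average. */
static double blended_gbs(double fast, double slow, double f) {
    return 1.0 / (f / slow + (1.0 - f) / fast);
}

int main(void) {
    printf("XB1, 25%% of traffic on DDR3:  ~%.0f GB/s\n",
           blended_gbs(204.0, 68.0, 0.25));  /* ~136 */
    printf("970, 25%% on the slow segment: ~%.0f GB/s\n",
           blended_gbs(196.0, 28.0, 0.25));  /* ~78 */
    printf("PS4, single pool:              176 GB/s\n");
    return 0;
}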
 
off-chip vs on-chip

And the CPU causing contention in the memory pool is, per the Microsoft leak, a serious issue; a Sony dev slide seemed to indicate the same is true on the PS4.

It would be very interesting to see how those numbers change if we let the CPU loose on the memory at the same time.

It's just a shame about these pesky NDAs, which mean we will probably never know exactly how these things stack up.
 
My question about the "removal of any bottlenecks" would be... are we talking about the removal of extreme edge cases? I mean... where do the diminishing returns set in? It's not as if doubling BW will double CU performance. What's the effective read/write performance at the supposed 700 GB/s?

It's like saying that only a car with 700 bhp has "no bottlenecks" for the tyres: in each and every application you can accelerate at full throttle. Yet it doesn't go round the track faster than most other cars, as the tyres cannot sustain the load of the engine (here, the engine is BW and the tyres are the GPU... strange, as it's the other way around, but there's no real measure for tyres, I guess) every time it gets fired up. It just leads to minute gains for massive cost.
The article was topically about GPGPU on heterogeneous systems. As they begin to move towards HBM, they determined that the GPU side would get bigger, so eventually the bandwidth would need to increase as well. In this case, 700 GB/s was found to be enough for all workloads for 32 CUs; or, in the simplest-case scenario, reading 350 GB/s while writing out 350 GB/s. I'm unsure if asymmetrical workloads are considered, i.e. reading 600 GB/s and, say, performing a reduce and only writing out 100 GB/s.

Bandwidth is only one measure; the article does make a good point about how idle the ALUs are. That is why, in the past, running some GPU tests could burn your chip: those algorithms never gave the chip a chance to rest, and the stock cooler was not adequate for sustained 100% usage.

Linking back to my OP, this was the concern. We are moving to an era where CPU load is reduced and it can focus on other things, while GPU load is greatly increased. Which means both now have room to grow. This will lead to greater demands on the PSU and on cooling.
 
ESRAM is a strange beast.

Now, put two and two together (Xbox and 970) and we can see that the performance increase on the Xbox One never reaches the level of a single memory pool at ESRAM speed. DDR3 reads and writes, just like on the 970, should slow performance down. The result is better than using DDR3 only, but worse than using one large pool of fast memory. That is why Microsoft's solution is not a standard!
Also, future usage of tiled resources, available since DX 11.2, should help decrease memory and bandwidth usage!
I'm hesitant to respond to this, because it's more complex than you make it out to be. Some of the senior members will be able to provide better insight into how specific memory access patterns would ultimately increase bandwidth.

The esram is designed to hold only 2/3 of the back buffer, and textures are streamed into esram as it needs them; it does not need to stream the textures back out. In fact, it's almost useful to look at the ESRAM as nearly a black hole, with the exception that the render target will eventually come back.

The remaining space in ESRAM is used for scratchpad work, which is where the majority of the bandwidth should be consumed.

Bandwidth is a unit of transportation. I've never considered it a measure of speed. Speed is a measure of velocity and distance; bandwidth is an aggregation of data moved over a set interval of time. You would agree that electricity runs at nearly the speed of light, so why do we get 56K on a phone line yet so much more on an OC48? It's because OC carries multiple frequencies, all sending different data at the same time. The data isn't flowing faster than light; there is just more light than can be processed. In telecom we aggregate smaller data streams into larger ones, and those larger ones go cross-country over large optical pipes, or through satellite, or whatever the case may be, but we are still bound by the same speed, which is the speed of the energy.

edit: that example might lead us OT, so let's make it simple. I have 8 GB of GDDR on the PS4. I do the following:
int hello = 8;
int world = hello;

If I have 4 GB of GDDR and do the same thing, will the 8 GB version execute that same statement in half the time?

No.

So bandwidth is not speed; bandwidth is about how much data can fit. When you do the following statement:
hello = world + hi + var2 + var3 + object12 * happyworld;
then we're pulling lots of data from different locations all at the same time, which a wide enough bus can fetch together. If I don't have a wide enough bus, I will need to spend additional clock cycles gathering that data before I can begin processing.

The GPU is an aggregator of bandwidth; whatever lanes are free for it to pull and push on are what it can work with. When you made the example of two copy machines doing 1 page per second each vs a single machine doing 2 pages per second, that was a comparison of processing speed.

You haven't sold me on why it all needs to come from a single pool, or why it needs to be 8 GB, yet.

Final edit: this article shows the difference and importance of bandwidth vs memory size:
http://www.tomshardware.com/reviews/graphics-ram-4870,2428-2.html

As you can see, 512 MB performs very well as long as the game fits within the VRAM's size.
 
Earlier in this thread a lot of rumours came up about multi-GPU setups, and DirectX 12 moving to SFR instead of AFR.
And then earlier we had a discussion as to what the second GCP would do on the Xbox One.

Here's a theory: SFR uses both GCPs on the Xbox One. One GCP manages the render target in esram, the other manages the render target in DRAM. This way the different memory pools, which operate at vastly different speeds, can each have their own scheduler, making the overall process easier/more straightforward for the API?
IjpglaI.jpg
 
Bandwidth is a unit of transportation.

Not so sure about that... You can't just ignore that transportation implies speed!

As a real-world example: a 30-ton truck can carry in one trip what a 15-ton truck would need two trips for.

The same is true here! Just as a simple example:

On a 100 GB/s bandwidth reading a 10 GB texture will take 1/10 of a second.

On a 200 GB/s bandwidth reading the same texture will take 1/20 of a second.

Assuming no other hardware limitations, on the 100 GB/s bandwidth you could read it 10 times per second if nothing else was required; on the 200 GB/s one, 20 times.

Performance is tied to it!
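The same arithmetic in a few lines of C, for the record:

#include <stdio.h>

/* Transfer time = size / bandwidth; reads per second = bandwidth / size. */
int main(void) {
    const double tex_gb = 10.0;  /* hypothetical 10 GB of texture data */
    printf("@100 GB/s: %.2f s per read, %.0f reads/s\n", tex_gb / 100.0, 100.0 / tex_gb);
    printf("@200 GB/s: %.2f s per read, %.0f reads/s\n", tex_gb / 200.0, 200.0 / tex_gb);
    return 0;
}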

BTW, the article you cite about memory and bandwidth is from 2009. In 2009 no more than 512 MB was needed, so extra memory gave hardly any gains. The same article today would compare something like 3 to 4, or 4 to 8 GB.

As you said, if the memory is large enough to hold all the assets, more memory will not give any gains; just bandwidth. But increase the assets you use, and thus the stuff you place in RAM, and you require more of it! By today's standards 512 MB is very little and will impact performance, since data would have to be streamed from main memory to the VRAM. Top games today require 3 GB of VRAM for max details.

So, a memory pool of 8 GB of fast RAM is not really needed. But it is certainly better than 32 MB of it!

And in this regard DX12 will not improve a thing! It won't magically place data in the ESRAM or increase its size! It can, at best, optimize its usage. Regardless, 32 MB is a very small and limited pool. Even Intel used 128 MB of it in its Iris Pro parts.
 
Not so sure about that... You can't just ignore that transportation implies speed!

As a real-world example: a 30-ton truck can carry in one trip what a 15-ton truck would need two trips for.

The same is true here! Just as a simple example:

On a 100 GB/s bandwidth reading a 10 GB texture will take 1/10 of a second.

On a 200 GB/s bandwidth reading the same texture will take 1/20 of a second.
Without any idea of latency or speed, this is untrue; you are just assuming it. If I had a 100 GB-wide bus that only moved data once per second, it would still be considered 100 GB/s, and it would take one full second for the data to arrive.

Clock speed makes things arrive faster; bus width makes more things arrive at the same time.

And I never said that DX12 would increase bandwidth; the bandwidth is already there on the Xbox One. DX12 means more saturation of the GPU, so the CUs are kept well fed with work. While I'm unsure of the access pattern that will be required to maximize the bandwidth on the Xbox One, it has the capability of moving that much data. My argument is that if the PS4 was really designed with DX12/Mantle-style APIs in mind, wouldn't they have designed it with more bandwidth so that the PS4 could perform even more work? That 176 GB/s is going to be chewed up faster and faster as the CUs continue to do more work without breaks. They'll need to be fed even more, not less.
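For reference, both consoles' main-memory peaks fall straight out of width × rate (using the commonly cited bus widths and transfer rates):

#include <stdio.h>

/* Peak bandwidth = bus width in bytes x effective transfer rate. */
int main(void) {
    /* XB1: DDR3-2133 on a 256-bit bus */
    printf("XB1 DDR3:  %.1f GB/s\n", (256.0 / 8.0) * 2133e6 / 1e9); /* ~68.3 */
    /* PS4: GDDR5 at 5.5 GT/s on a 256-bit bus */
    printf("PS4 GDDR5: %.1f GB/s\n", (256.0 / 8.0) * 5500e6 / 1e9); /* 176.0 */
    return 0;
}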
 
IIRC, the ESRAM can read and write simultaneously on 7 of every 8 cycles. On the 8th cycle it must do nothing. It can do 8 reads in a row, or 8 writes, but doing both it must rest on the 8th cycle. That is how 192 is obtained.
The ESRAM can dual-issue reads and writes for 7 out of 8 cycles. The 8th cycle cannot issue a write, but a read is possible.
The 192 would be the number offered before the upclock to 853 MHz; 204 is the current peak.
This is subject to access patterns matching some unknown set of banking limitations. The real-world utilization figures show this is not readily achieved.

ESRAM is a strange beast.

Read and Write Cycle

ESRAM Bandwidth - 204 GB/s
GDDR5 bandwidth - 176 GB/s

So, it's obvious that while 176 GB/s is a sustained speed, the ESRAM has variations and only 109 GB/s is guaranteed.

The two scenarios in this figure cannot be equivalent.
GDDR5 cannot simultaneously read and write, and any mix that isn't very long stretches of pure reads followed by pure writes is begging to have that 176 GB/s figure chopped, in bad cases by 10x or worse.
DRAM is also subject to a number of banking restrictions, and the penalties are likely far worse for it than for the ESRAM.
The sustained bandwidth figures are somewhere around 140, per some of Sony's own slides, and the figure drops further under CPU contention.

Comparing the two systems is a very complex proposition, particularly since both are not what would be considered mature.
 
Without any idea of latency or speed, this is untrue; you are just assuming it. If I had a 100 GB-wide bus that only moved data once per second, it would still be considered 100 GB/s, and it would take one full second for the data to arrive.

Clock speed makes things arrive faster; bus width makes more things arrive at the same time.

And I never said that DX12 would increase bandwidth; the bandwidth is already there on the Xbox One. DX12 means more saturation of the GPU, so the CUs are kept well fed with work. While I'm unsure of the access pattern that will be required to maximize the bandwidth on the Xbox One, it has the capability of moving that much data. My argument is that if the PS4 was really designed with DX12/Mantle-style APIs in mind, wouldn't they have designed it with more bandwidth so that the PS4 could perform even more work? That 176 GB/s is going to be chewed up faster and faster as the CUs continue to do more work without breaks. They'll need to be fed even more, not less.

PS4 was originally going to be 192 GB/s. Perhaps that was when they were using 2 Gbit chips... or perhaps the controller in the AMD APUs wasn't quite up to the job at the desired yields?
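For what it's worth, 192 GB/s on the same 256-bit bus works out to 6.0 Gbps chips versus the 5.5 Gbps parts that shipped. A quick check:

#include <stdio.h>

/* The rumoured 192 GB/s vs the shipped 176 GB/s on the same 256-bit bus. */
int main(void) {
    printf("6.0 Gbps x 32 bytes: %.0f GB/s\n", (256.0 / 8.0) * 6.0); /* 192 */
    printf("5.5 Gbps x 32 bytes: %.0f GB/s\n", (256.0 / 8.0) * 5.5); /* 176 */
    return 0;
}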
 
That's an interesting theory. A counterpoint is that Sony were/are expecting GPU utilisation to shoot up thanks to compute. If AMD provided accurate data on high-utilisation, it should have been factored into the design. Of course, mistakes happen and there's the possibility that Sony didn't really design with suitable future-proofing for changes in software and hardware utilisation.
Actually, @iroboto might have a point there, because Phil Spencer said they were fully aware of DirectX 12 when they built the Xbox One, making it effectively the first DirectX 12 device on the market:

http://n4g.com/news/1659528/phil-spencer-we-knew-about-dx12-when-we-built-xbox-one
 
PS4 was originally going to be 192 GB/s. Perhaps that was when they were using 2 Gbit chips... or perhaps the controller in the AMD APUs wasn't quite up to the job at the desired yields?

The APU seems like it would have been up to the task; AMD was able to handle speeds in that range elsewhere. The speed grades for the highest-density GDDR5 chips in clamshell mode may have been where it came down to capacity vs bus speed.
 
PS4 was originally going to be 192 GB/s. Perhaps that was when they were using 2 Gbit chips... or perhaps the controller in the AMD APUs wasn't quite up to the job at the desired yields?
Or power/heat. Or killer bees. Or ninjas. It could have been anything, really ;)
 
Perhaps - but then there is this slide that likely MJP won't talk about ;)
yq4pkk2.jpg


fixed, moved the GDC slide to imgur. Redtechgaming didn't link the direct link I made.

edit: There appears to be some form of improved multi-threaded command buffer generation - the question is how much improved.
Maybe Naughty Dog knows about that; that's what they said some months ago when comparing GNM to DirectX:

http://wccftech.com/naughty-dog-devs-talk-directx-12-xbox-ps4-upsurge-30-xbox-performance-gain/

Christian Gyrling @cgyrling

Low-level access to the GPU really makes you understand why using DirectX is slow and why it is really just in your way. #ps4 #performance
5:53 AM - 16 May 2014

Christian Gyrling @cgyrling
Being able to unmap/remap memory pages from its virtual address space while maintaining its contents is absolutely amazingly useful. #ps4

7:40 AM - 1 May 2014
 