Xbox One (Durango) Technical hardware investigation

pjbliverpool · Sep 25, 2013

Gubbi said:
The read to write ratio is something like 2-4:1 for normal graphics workloads. If we assume we use all of the ESRAM read bandwidth (i.e. >95%) and 60% of the DDR3 bandwidth for GPU reads, we get around 45-50GB/s of write traffic to ESRAM with a 3:1 ratio.

That's 150GB/s (100 read, 50 write) of ESRAM bandwidth with aggregate system bandwidth being well over 200GB/s. Lower average latency too.
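For anyone who wants to check the arithmetic, here is a rough Python sketch of the estimate above. The 102 GB/s and 68 GB/s peaks are the figures quoted elsewhere in this thread; the function name and the assumption that every GPU write lands in ESRAM are illustrative, not confirmed details of the hardware.

```python
# Peak figures quoted in this thread (GB/s); the utilization fractions are
# the assumptions stated in the post above.
ESRAM_READ_PEAK = 102.0
DDR3_PEAK = 68.0
ESRAM_READ_UTIL = 0.95
DDR3_GPU_READ_UTIL = 0.60


def esram_traffic(read_to_write_ratio: float) -> tuple[float, float]:
    """Return (ESRAM reads, ESRAM writes) in GB/s for a given read:write ratio."""
    esram_reads = ESRAM_READ_PEAK * ESRAM_READ_UTIL
    ddr3_reads = DDR3_PEAK * DDR3_GPU_READ_UTIL
    # Assumption: every GPU write lands in ESRAM.
    writes = (esram_reads + ddr3_reads) / read_to_write_ratio
    return esram_reads, writes


for ratio in (2.0, 3.0, 4.0):
    reads, writes = esram_traffic(ratio)
    print(f"{ratio:.0f}:1 -> reads {reads:.1f}, writes {writes:.1f}, "
          f"total {reads + writes:.1f} GB/s")
```

At 3:1 this gives roughly 97 GB/s of reads and 46 GB/s of writes, which is where the rounded "100 read, 50 write" comes from.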

Cheers

Gubbi said:
The first one. There'd be no point in having a 1024 bit write bus if it is utilized significantly less than 50% most of the time. They'd be better off with 1280 bit read and 512/768 bit write buses.

Cheers

Thanks, this is a clear and logical explanation as to why the ESRAM should be able to hit 150GB/s on average. Assuming, of course, there are no other caveats of which we are not aware, this quite frankly makes the Microsoft statement kinda moot, since all they are saying is that they've measured a throughput (peak, average, whatever) which you can already show must logically be achievable on average from such a design.

Now why didn't you tell us this a month ago? j/k

Gubbi · Sep 25, 2013

SlimJim said:
This happens more often than you think: people are working on different specs and they communicate through email. Imagine if a MS engineer sent a message: but it went like this:

It's possible that the MS Hotmail servers picked the bolded part up, and marked it as spam. So nobody ever got the email, so that is why they didn't know.

That's not too farfetched. MS is a really big company, and MS Hotmail is not so good at recognising spam.

Laa-Yosh said:
This thread is making me laugh and cry at the same time.

Seconded.

Cheers

DrJay24 · Sep 25, 2013

taisui said:
I'll save the rest since I simply can't respond to conspiracy theories that essentially look away from the evidence being presented.

I guess you are considering MS public PR as evidence? I would say right now we don't have any real evidence of the effective bandwidth and how it is used in real games.

I guess we have some info about frame rates and frame buffer resolutions, but that would be hard to use as data to figure out memory bandwidth.

pjbliverpool · Sep 25, 2013

taisui said:
I'll save the rest since I simply can't respond to conspiracy theories that essentially look away from the evidence being presented.

Fair play. So if another vendor came out tomorrow with a statement "we've measured 170GB/s in a real game" then you would also accept that as being the average utilization for said system. Glad we understand each other.

blakjedi · Sep 25, 2013

DrJay24 said:
I guess you are considering MS public PR as evidence? I would say right now we don't have any real evidence of the effective bandwidth and how it is used in real games.

I guess we have some info about frame rates and frame buffer resolutions, but that would be hard to use as data to figure out memory bandwidth.

I think Gubbi answered your questions in the last two pages.

DrJay24 · Sep 25, 2013

blakjedi said:
I think Gubbi answered your questions in the last two pages.

We clearly have different definitions of evidence.

Bigus Dickus · Sep 25, 2013

dobwal said:
Unless GDDR5 can't deliver 256 bits per access during burst mode, 176 GB/s is a real-world number. Like I said, you don't need to travel 60 miles over an hour to reach a speed of 60 mph. 176 GB/s is a rate.

Then by your definition, the XB1 has real-world bandwidth to the GPU of some 270 ~ 290 GB/s (depending on whether the combined ESRAM rate is 204 or 218; I still don't think that's clear), so long as some code can be written to achieve that burst over a short time, and there is no indication thus far that this isn't possible.

Now, I think your definition is silly. There is a reason we use terms like peak or theoretical: we expect average, or even maximum short-term, rates in real game code to be somewhat less.

We don't know how much less for either system. People can pontificate and speculate, and make up numbers to suit their agenda, but we simply don't know. For either system. And probably won't know for quite some time. MS has given some extra information suggesting perhaps 200 GB/s is more representative, but the information is ambiguous because we don't know representative of WHAT and under which conditions.

So we have peak numbers to compare, which doesn't help much except to say that both consoles seem to have pretty high bandwidth and games should be interesting.

As for the 99% of slow RAM talk, it's just as much handwaving as thinking the XB1 will achieve sustained 290 GB/s. So what if the faster pool is only 1% of the total memory size, if the data structures being written to and from it fit within its size, and the GPU winds up using this pool for more than half the total GPU memory accesses? See, I can make up numbers too.

We will have to wait for quite a while before making more informed comparisons. Release of tech details probably won't even be enough. I think measured performance in realistic code will ultimately shed some light on how these two approaches compare.

dobwal · Sep 25, 2013

Doesn't 133 GB/s work out to quite a few 1080p frames being alpha blended per second? Like somewhere in the neighborhood of ridiculous?
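To put a rough number on that, here is a back-of-the-envelope sketch in Python. It assumes a 32-bit RGBA target and that each blend costs one read plus one write of every destination pixel; it ignores source texture reads, compression, and caching, so treat it as an order-of-magnitude estimate only.

```python
WIDTH, HEIGHT = 1920, 1080
BYTES_PER_PIXEL = 4        # assumption: 32-bit RGBA8 render target
BANDWIDTH_BYTES = 133e9    # 133 GB/s taken as 133 * 10**9 bytes/s


def blends_per_second(bandwidth: float, width: int, height: int, bpp: int) -> float:
    """Full-screen alpha blends per second that a given bandwidth can sustain."""
    # Assumption: one blend reads the destination pixel and writes it back.
    bytes_per_blend = width * height * bpp * 2
    return bandwidth / bytes_per_blend


per_second = blends_per_second(BANDWIDTH_BYTES, WIDTH, HEIGHT, BYTES_PER_PIXEL)
print(f"{per_second:.0f} full-screen blends/s, "
      f"{per_second / 60:.0f} per frame at 60 fps")
```

Under those assumptions it comes out at roughly 8,000 full-screen 1080p blends per second, or a bit over 130 layers per frame at 60 fps.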

dobwal · Sep 25, 2013

Bigus Dickus said:
Then by your definition, the XB1 has real-world bandwidth to the GPU of some 270 ~ 290 GB/s (depending on whether the combined ESRAM rate is 204 or 218; I still don't think that's clear), so long as some code can be written to achieve that burst over a short time, and there is no indication thus far that this isn't possible.

Now, I think your definition is silly. There is a reason we use terms like peak or theoretical: we expect average, or even maximum short-term, rates in real game code to be somewhat less.

We don't know how much less for either system. People can pontificate and speculate, and make up numbers to suit their agenda, but we simply don't know. For either system. And probably won't know for quite some time. MS has given some extra information suggesting perhaps 200 GB/s is more representative, but the information is ambiguous because we don't know representative of WHAT and under which conditions.

So we have peak numbers to compare, which doesn't help much except to say that both consoles seem to have pretty high bandwidth and games should be interesting.

As for the 99% of slow RAM talk, it's just as much handwaving as thinking the XB1 will achieve sustained 290 GB/s. So what if the faster pool is only 1% of the total memory size, if the data structures being written to and from it fit within its size, and the GPU winds up using this pool for more than half the total GPU memory accesses? See, I can make up numbers too.

We will have to wait for quite a while before making more informed comparisons. Release of tech details probably won't even be enough. I think measured performance in realistic code will ultimately shed some light on how these two approaches compare.

What's the max mph of your car? Does the fact that a ton of factors keep you from regularly approaching that speed mean that the max mph of your car is something other than what you hit when you can actually mash the pedal to the metal and reach that max speed, if only for a few seconds?

It's a max rate, that's all it is. It doesn't mean anything in and of itself, especially when you are comparing the bandwidth of two different systems.

You might think I hold those peak numbers out there to mean one system is better than the other. But I don't. I understand that max or peak bandwidth can happen, but in and of itself it does not provide a clear look at how robustly the memory system of either console performs, especially in comparison to each other.

I don't read the peak numbers for these consoles' memory the way I would when comparing discrete GPUs. There you don't need to know the average or sustainable rate, because it is usually proportional to the peak bandwidth, especially when comparing cards with the same type of DRAM.

Like when MS throws out 133 GB/s for alpha blending. I find it more natural to look at that as a PR number, calculated by blending a 720p or 1080p frame, dividing by the time it takes to complete said frame, and then extrapolating out to produce a 133 GB/s number. Otherwise, what is MS doing, alpha blending an 8K frame at 30 fps?

MrFox · Sep 25, 2013

My brain likes the simplest explanations, and so far what I can understand about the memory system is what the engineers have said. Does anybody disagree with the following?

A. 102GB/s read to the 32MB pool.
B. 102GB/s write to the 32MB pool.
C. 68GB/s read/write to the 8GB pool.

- All three paths are operating in parallel.
- They have no contention between each other except ESRAM bank conflict during read/write
- The latencies of these paths are unknown.

I think this is what we know. If one of the paths saturates, your code can't go any faster. Trying to add numbers together doesn't simplify things, it removes important data, and as a side effect it makes the internets explode.
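A toy model of that point, sketched in Python. It assumes the three paths run fully in parallel at the peak rates listed above and ignores ESRAM bank conflicts and latency; the traffic mixes are made-up numbers chosen only to show how the slowest path sets the pace.

```python
PEAK_GBPS = {"esram_read": 102.0, "esram_write": 102.0, "ddr3": 68.0}


def transfer_time_s(traffic_gb: dict[str, float]) -> float:
    """Time to move the given GB over each path when all paths run in parallel."""
    # The path that takes longest determines when the work is finished.
    return max(traffic_gb[path] / PEAK_GBPS[path] for path in traffic_gb)


def effective_gbps(traffic_gb: dict[str, float]) -> float:
    """Total data moved divided by the time the slowest path needs."""
    return sum(traffic_gb.values()) / transfer_time_s(traffic_gb)


# Hypothetical traffic mixes (GB moved over each path).
balanced = {"esram_read": 1.02, "esram_write": 1.02, "ddr3": 0.68}
ddr3_bound = {"esram_read": 0.10, "esram_write": 0.10, "ddr3": 0.68}

print(f"balanced:   {effective_gbps(balanced):.0f} GB/s")
print(f"DDR3-bound: {effective_gbps(ddr3_bound):.0f} GB/s")
```

Only a mix that loads every path in proportion to its peak reaches the summed figure; once DDR3 saturates, the effective rate in this model drops to well under half of it.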

Rangers · Sep 25, 2013

DrJay24 said:
All bandwidth numbers are peak, so are AMD's they publish with the card specs, so that point is moot. Thinking that MS has 200GB/s available for 12CUs is kind of silly, they may see 200GB/s aggregate at times but in general their available bandwidth is much lower, remember the eSRAM is only 0.04% of the RAM.

.6%...You keep throwing around the incorrect .04% and others also accept it.

1% of 5000MB = 50MB, so 32MB ≈ .6% (a graspable walkthrough rather than the precise math, 32/5000)

Bit of a moot point but yeah.
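The precise version of that walkthrough, as a small Python sketch. The 5000MB figure is the one used in the post above; the 8GB total is the pool size quoted elsewhere in this thread, taken here as 8192MB.

```python
ESRAM_MB = 32


def share_percent(pool_mb: float, total_mb: float) -> float:
    """Size of one pool as a percentage of a total."""
    return 100.0 * pool_mb / total_mb


print(f"vs 5000MB: {share_percent(ESRAM_MB, 5000):.2f}%")
print(f"vs 8192MB: {share_percent(ESRAM_MB, 8192):.2f}%")
```

Either way the answer is a few tenths of a percent, an order of magnitude above 0.04%.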

Rangers · Sep 25, 2013

zupallinere said:
So from your perspective the NOT XB1 only has 110 GB/s, which is nearly the 102 GB/s of the original ESRAM bandwidth spec. Nicely done. The tables have turned and the XB1 is the bandwidth MONSTER !!!

We also have X1 Balance (tm) at 200 GB/s and 12 CU meaning NOT XB1s will limp along starving for bandwidth and falling by the wayside.

More seriously I would step back and think about how many developers over how many years have been able to access the bandwidth of GDDR5 memory. Is every game on every GDDR5 based GPU throwing away 25% of the bandwidth for all these years ??

Implying that DDR3+ESRAM is somehow a large fraction less capable of reaching its peak than other memory setups, which is basically what goes on all the time, is annoying and typical posturing.

And yes it does appear X1 is a bandwidth monster.

X1 has its weaknesses but we should also give due credit to its apparent strengths.

warb · Sep 25, 2013

MrFox said:
My brain likes the simplest explanations, and so far what I can understand about the memory system is what the engineers have said. Does anybody disagree with the following?

A. 102GB/s read to the 32MB pool.
B. 102GB/s write to the 32MB pool.
C. 68GB/s read/write to the 8GB pool.

- All three paths are operating in parallel.
- They have no contention between each other except ESRAM bank conflict during read/write
- The latencies of these paths are unknown.

I think this is what we know. If one of the paths saturates, your code can't go any faster. Trying to add numbers together doesn't simplify things, it removes important data, and as a side effect it makes the internets explode.

Unless most games do end up with ~200GB/s system bandwidth in this setup. They likely put some thought into it.

Shifty Geezer · Sep 25, 2013

warb said:
Unless most games do end up with ~200GB/s system bandwidth in this setup. They likely put some thought into it.

Aaaaaaa!

It is not possible to understand the flow of data in a system by a single metric (unless that system has a single memory pool). Your aggregate number is true and yet pointless, and there's zero sense in trying to condense understanding of the BW into this single value.

Sometimes the code will run at the fastest aggregate speed of the total RAM. Sometimes it could be bottlenecked by the slowest singular pipe. Mostly it'll be hitting shifting limits as data moves around the different pools. All games will have access to ~200 GB/s (actually 272 GB/s as total peak available BW) but the amount of data flowing through the system could be very different. The most important thing is that devs will try to maximise dataflow within budgets and development targets, which is why they want to know bus speeds. Bus speeds aren't really for informing the masses about the potential of the consoles!

3dilettante · Sep 25, 2013

Rangers said:
Implying that DDR3+ESRAM is somehow a large fraction less capable of reaching its peak than other memory setups, which is basically what goes on all the time, is annoying and typical posturing.

From the Orbis thread, benchmarks using ROP blending operations on discrete GPUs have a 91% utilization rate, versus the blending numbers leaked for the eSRAM: 133-140 GB/s out of 204 GB/s.
It may come down to different controller priorities, or the static split might not fit the necessary mix for the dynamic demands of those benches.
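As a quick sanity check on that comparison, a short Python sketch of the utilization arithmetic. The 204 GB/s combined peak and the 133-140 GB/s blending figures are the ones quoted above; nothing here says anything about why the gap exists.

```python
ESRAM_COMBINED_PEAK = 204.0  # GB/s, combined read + write peak quoted above


def utilization(achieved_gbps: float, peak_gbps: float) -> float:
    """Fraction of peak bandwidth actually achieved."""
    return achieved_gbps / peak_gbps


for achieved in (133.0, 140.0):
    share = utilization(achieved, ESRAM_COMBINED_PEAK)
    print(f"{achieved:.0f} / {ESRAM_COMBINED_PEAK:.0f} = {share:.0%}")
```

That puts the leaked blending figures at roughly two thirds of peak, against the 91% quoted for discrete GPUs.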

XpiderMX · Sep 25, 2013

3dilettante said:
From the Orbis thread, benchmarks using ROP blending operations on discrete GPUs have a 91% utilization rate, versus the blending numbers leaked for the eSRAM: 133-140 GB/s out of 204 GB/s.
It may come down to different controller priorities, or the static split might not fit the necessary mix for the dynamic demands of those benches.

But numbers are 140-150 GB/s no?

Silent_Buddha · Sep 25, 2013

MrFox said:
My brain likes the simplest explanations, and so far what I can understand about the memory system is what the engineers have said. Does anybody disagree with the following?

A. 102GB/s read to the 32MB pool.
B. 102GB/s write to the 32MB pool.
C. 68GB/s read/write to the 8GB pool.

- All three paths are operating in parallel.
- They have no contention between each other except ESRAM bank conflict during read/write
- The latencies of these paths are unknown.

For the most part yes. So, every frame you'd have the following...

1. Read from main memory.
2. Read-modify-write to eSRAM, quite likely multiple times for each read and write to main memory.
3. Write to main memory.

Yes, that is greatly simplified. For a traditional GPU memory pool, all of the read-modify-write traffic would go back to main memory unless it manages to fit within the on-chip GPU caches, thus eating into your main memory bandwidth and triggering read-write-read performance penalties.

As well, that does not take into account that some data is likely to persist in eSRAM between frames.
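The difference can be put in numbers with a deliberately crude Python model. It assumes a buffer that fits entirely in the fast pool, no help from on-chip caches, and a fixed number of read-modify-write passes; the function name, the 8MB buffer, and the pass count are hypothetical values for illustration.

```python
def main_memory_traffic_mb(buffer_mb: float, rmw_passes: int, use_esram: bool) -> float:
    """Main-memory traffic for one buffer that is modified rmw_passes times."""
    if use_esram:
        # One read in from main memory, one write back out; the
        # intermediate read-modify-write passes stay in the fast pool.
        return 2 * buffer_mb
    # Every pass reads the buffer from main memory and writes it back.
    return 2 * buffer_mb * rmw_passes


with_esram = main_memory_traffic_mb(8.0, rmw_passes=4, use_esram=True)
single_pool = main_memory_traffic_mb(8.0, rmw_passes=4, use_esram=False)
print(f"with eSRAM: {with_esram:.0f} MB, single pool: {single_pool:.0f} MB")
```

In this model the main-memory traffic with the scratchpad stays constant as passes are added, while the single-pool traffic grows linearly with them.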

And to certain people that are likely reading this... Note, this isn't to say one is better or faster than the other. Only pointing out that the situation is far more complex than X has more bandwidth than Y.

Regards,
SB

3dilettante · Sep 25, 2013

XpiderMX said:
But numbers are 140-150 GB/s no?

I don't recall them saying the new numbers were for the same tests when they gave the 150 bound.

DrJay24 · Sep 25, 2013

Rangers said:
Implying that DDR3+ESRAM is somehow a large fraction less capable of reaching its peak than other memory setups, which is basically what goes on all the time, is annoying and typical posturing.

And yes it does appear X1 is a bandwidth monster.

X1 has its weaknesses but we should also give due credit to its apparent strengths.

You are trying to have it both ways. An extra 2 CUs don't yield the ~16% increase because the XB1 is "balanced", yet it has an abundance of memory bandwidth. Where is the bottleneck then?

The number of CUs scales with memory bandwidth on dedicated cards for a reason, yet the XB1 can't seem to make it scale. Why?

oldschoolnerd · Sep 25, 2013

DrJay24 said:
You are trying to have it both ways. An extra 2 CUs don't yield the ~16% increase because the XB1 is "balanced", yet it has an abundance of memory bandwidth. Where is the bottleneck then?

The number of CUs scales with memory bandwidth on dedicated cards for a reason, yet the XB1 can't seem to make it scale. Why?

How about this. The on-die ESRAM has such low latency, because of its physical proximity to the CUs, that it can feed those cores up to really high utilisation. So you have managed to use all the bandwidth with 12 CUs. Adding extra CUs isn't going to help much. Raising the clock speed even by 6% gives you a linear performance increase across the board for free. Maybe.
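For reference, the two percentages being traded off, as a small Python sketch. The 12 and 14 CU counts come from the posts above; the 800 MHz and 853 MHz GPU clocks are not stated in this thread and are assumed here to illustrate the "6%" figure. These are theoretical ratios only and say nothing about how real code scales.

```python
def relative_gain(new: float, old: float) -> float:
    """Fractional increase going from old to new."""
    return new / old - 1.0


cu_gain = relative_gain(14, 12)        # two extra CUs on top of 12
clock_gain = relative_gain(853, 800)   # assumed clock bump, MHz

print(f"extra CUs:  +{cu_gain:.1%} theoretical ALU throughput")
print(f"clock bump: +{clock_gain:.1%} on everything clocked with the GPU")
```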
