Xbox One (Durango) Technical hardware investigation

You are trying to have it both ways. An extra 2 CUs don't yield the ~16% increase because the XB1 is "balanced", yet it has an abundance of memory bandwidth. Where is the bottleneck, then?

The number of CUs scales with memory bandwidth on dedicated cards for a reason, yet the XB1 can't seem to make it scale. Why?

CPU?

And also possibly dozens of other possible parameters per MS tech talk.

Also, I mean I was just looking up 7970 Fire Strike benches in relation to today's AMD event. What do you see there? Not perfect scaling. The 7970 GE has 3.58X the FLOPS/CUs of the 7770 but only 2.54X the Fire Strike score. An easy example of poor CU scaling. And it's not like I was looking...
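
A quick back-of-the-envelope on those quoted ratios (taking both numbers as given above rather than re-benchmarked):

```python
# Rough scaling check using the figures quoted above (not re-measured)
flops_ratio = 3.58   # 7970 GE vs 7770, FLOPS/CU ratio as quoted
score_ratio = 2.54   # Fire Strike score ratio as quoted

efficiency = score_ratio / flops_ratio
print(f"scaling efficiency: {efficiency:.0%}")  # ~71%, i.e. far from linear
```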
 
Trying to imply that DDR3+ESRAM utilization is somehow some large fraction less capable of reaching its peak than other memory setups, which is basically what goes on all the time, is annoying and typical posturing.

And yes it does appear X1 is a bandwidth monster.

X1 has its weaknesses, but we should also give due credit to its apparent strengths.

I am implying nothing. I am having a bit of fun with some suggesting that the not-XB1 is bandwidth starved at 14/12 CU. If 12 CU @ 200 GB/s is balanced, then what else should I take from that?

I have no issue with XB1 doing what it does, it's great. What is being implied in posts is that 200+ GB/s is being hit on a regular basis on par with whatever fraction of peak BW ( 130 GB/s or somesuch ) a GDDR5 system gets. MS engineers aren't saying that but it is certainly being alluded to in this thread.
 
Aaaaaaa! :runaway: It is not possible to understand the flow of data in a system by a single metric (unless that system has a single memory pool). Your aggregate number is true and yet pointless, and there's zero sense in trying to condense understanding of the BW into this single value.

Sometimes the code will run at the fastest aggregate speed of the total RAM. Sometimes it could be bottlenecked by the slowest singular pipe. Mostly it'll be hitting shifting limits as data moves around the different pools. All games will have access to ~200 GB/s (actually 272 GB/s as total peak available BW) but the amount of data flowing through the system could be very different. The most important thing is that devs will try to maximise dataflow within budgets and development targets, which is why they want to know bus speeds. Bus speeds aren't really for informing the masses about the potential of the consoles!
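
For what it's worth, the 272 GB/s aggregate is just the two pools' peaks added together; a minimal sketch using the commonly quoted figures:

```python
# Aggregate peak BW, only reachable if both pools are saturated simultaneously
ddr3_peak  = 68    # GB/s, 8 GB DDR3 pool
esram_peak = 204   # GB/s, 32 MB ESRAM with simultaneous read/write
print(ddr3_peak + esram_peak)  # 272 GB/s total theoretical peak
```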
:LOL: Sorry I made you scream there Shifty.

MS say they've profiled a lot of games. If few multiplatform titles end up more bandwidth limited on X1 and this memory configuration isn't a nightmare to work with, then, um, they've done well.
 
How about this. The on-die esram is so low latency, because of physical proximity to the CUs, that it is able to feed those cores up to really high utilisation. So you have managed to use all the bandwidth with 12 CUs. Adding extra CUs isn't going to help much. Raising the clock even by 6% gives you a linear performance increase across the board for free. Maybe.

If the code is not shader bound (which it probably isn't), then adding CUs would not have helped much compared to just upclocking.
 
If the code is not shader bound (which it probably isn't), then adding CUs would not have helped much compared to just upclocking.

And even then, you have 6% faster in everything GPU related or 16.7% faster (assuming perfect scaling, which doesn't exist) in one aspect (out of many) of the GPU.

I'd say that 9 times out of 10, being 6% faster in the entirety of the GPU would end up in faster graphics rendering than being 16.7% faster in one important but limited aspect of the GPU, since then you'll speed up rendering when the GPU is bottlenecked by something other than the CUs rather than only when you are bottlenecked by the CUs.
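
To make that concrete, here's a toy Amdahl-style model (my own assumptions, not measured data): treat a frame as partly ALU/CU-bound and partly bound elsewhere on the GPU, and assume perfect scaling within each part.

```python
# Toy model: +6% clock speeds up everything, +16.7% CUs only the ALU-bound part
def frame_speedup(alu_fraction, alu_gain, other_gain):
    new_time = alu_fraction / (1 + alu_gain) + (1 - alu_fraction) / (1 + other_gain)
    return 1.0 / new_time

for alu_frac in (0.3, 0.5, 0.7):
    upclock  = frame_speedup(alu_frac, 0.06, 0.06)
    more_cus = frame_speedup(alu_frac, 0.167, 0.0)
    print(f"{alu_frac:.0%} ALU-bound: upclock {upclock:.3f}x, +2 CUs {more_cus:.3f}x")
```

Under those assumptions the crossover sits at roughly a 40% ALU-bound frame: below that, the across-the-board upclock wins; above it, the extra CUs would (assuming they scaled perfectly, which the post above already doubts).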

People seem to get overly fixated on X thing being so important that nothing else matters, when that isn't even remotely the case when rendering graphics. Yes, shaders and compute are important. No, they aren't the only important thing when rendering out each frame of a game's graphics.

Regards,
SB
 
What's the max mph on your car? Does the fact that there exists a ton of factors that keep you from regularly approaching that speed mean that the max mph of your car is somehow other than those times you can actually mash the pedal to the metal and hit that max speed, if only for a few seconds?

It's probably my poor reading comprehension, but I'm getting mixed messages from your posts. Peak BW on one console is real world because, well, bus width x clock speed math says it is. But peak on the other is a theoretical, elusive phantom that will rarely if ever be approached because of a host of limiting factors? Again, I think I've misunderstood your posts.
You might think I hold those peak numbers out there to mean one system is better than the other. But I am not.

OK, we are definitely in agreement then. I think all logic says the PS4 should have a higher utilization of available peak bandwidth. Since MS designed a memory system with a ~100GB/s higher theoretical peak, it seems they expected this too. Whether their design is on the whole better or not is anyone's guess, and different answers are probably all valid depending on whether emphasis is put on performance, die size, heat/power envelope, lifetime cost reductions...
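
For reference, the "~100GB/s higher theoretical peak" is roughly the gap between the usual headline figures (just a rounding of numbers already in this thread, plus the well-known GDDR5 spec, not a new claim):

```python
# Rounded headline peaks, GB/s
ps4_gddr5 = 176          # 256-bit GDDR5 @ 5.5 Gbps
xb1_total = 68 + 204     # DDR3 + ESRAM simultaneous read/write peak
print(xb1_total - ps4_gddr5)  # ~96 GB/s difference
```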
 
And even then, you have 6% faster in everything GPU related or 16.7% faster (assuming perfect scaling, which doesn't exist) in one aspect (out of many) of the GPU.

I'd say that 9 times out of 10, being 6% faster in the entirety of the GPU would end up in faster graphics rendering than being 16.7% faster in one important but limited aspect of the GPU, since then you'll speed up rendering when the GPU is bottlenecked by something other than the CUs rather than only when you are bottlenecked by the CUs.

People seem to get overly fixated on X thing being so important that nothing else matters, when that isn't even remotely the case when rendering graphics. Yes, shaders and compute are important. No, they aren't the only important thing when rendering out each frame of a game's graphics.

Regards,
SB

Can you disable CUs on an AMD card in software?

Or I guess you could just get different cards with 10-12 CUs. But you'd have to watch for the dual setup engine difference between Bonaire/Cape Verde, hmm...

Basically, test out an upclock vs +2 CUs on PC. Would be an interesting test. I'd love to see DF do it but it's probably too arcane even for them.
 
For the most part yes.
Which part do you disagree with?
Are there any known facts that are missing here?
A hypothetical about how a coder would exploit these available memory resources is not an additional fact. Read-modify-write is covered by line #4, and its performance cannot be known because of line #6.
A. 102GB read to the 32MB pool.
B. 102GB write to the 32MB pool.
C. 68GB read/write to the 8GB pool.

- All three paths are operating in parallel.
- They have no contention between each other except ESRAM bank conflicts during read/write.
- The latencies of these paths are unknown.
 
Which part do you disagree with?
Are there any known facts that are missing here?
A hypothetical about how a coder would exploit these available memory resources is not an additional fact. Read-modify-write is covered by line #4, and its performance cannot be known because of line #6.

The part I left out about the internet exploding. :) Oh and that it removes data. It can't remove data if that data was never known in the first place.

And I think it can't be stressed enough that there's likely to be stuff written to/from eSRAM constantly. I know that wasn't addressed in your post to any great extent, but I threw that in there as some seem to be assuming that the eSRAM won't be getting much traffic, and hence its bandwidth will not be utilized much and hence, it can't be added to main memory bandwidth when comparing to a more standard implementation.

As well, if I understand things correctly, read-modify-writes to eSRAM won't incur the same performance penalties as read-modify-writes to main memory. Hence, comparisons to a standard memory architecture break down even further when trying to compare Xbox One's graphics memory system to a standard memory system.

Regards,
SB
 
It's probably my poor reading comprehension, but I'm getting mixed messages from your posts. Peak BW on one console is real world because, well, bus width x clock speed math says it is. But peak on the other is a theoretical, elusive phantom that will rarely if ever be approached because of a host of limiting factors? Again, I think I've misunderstood your posts.


OK, we are definitely in agreement then. I think all logic says the PS4 should have a higher utilization of available peak bandwidth. Since MS designed a memory system with a ~100GB/s higher theoretical peak, it seems they expected this too. Whether their design is on the whole better or not is anyone's guess, and different answers are probably all valid depending on whether emphasis is put on performance, die size, heat/power envelope, lifetime cost reductions...

Oh no. I am not stating one peak is real while the other is imaginary. I am stating both are real but are only possible under specific circumstances and are not sustainable especially over a full second.

And that the numbers in and of themselves tell you nothing about how much bandwidth will readily be exploitable over 30 or 60 frames.
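
Just to put the per-second figures into per-frame terms (an upper bound only, not a claim about what's actually achievable):

```python
# Theoretical ceiling on bytes moved per frame at the aggregate peak
peak_bw = 272  # GB/s, total theoretical aggregate from earlier in the thread
for fps in (30, 60):
    print(f"{fps} fps: at most ~{peak_bw / fps:.1f} GB moved per frame")
```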
 
As well, if I understand things correctly, read-modify-writes to eSRAM won't incur the same performance penalties as read-modify-writes to main memory. Hence, comparisons to a standard memory architecture break down even further when trying to compare Xbox One's graphics memory system to a standard memory system.
Yes, I agree it's most probably faster, but since we don't know the latency, it's still conjecture. We are speculating. From what I understand there's something finicky about the algorithm chosen for interleaving reads and writes which impacts either latency or bandwidth. I don't know a lot about GCN, but it's supposed to be quite impervious to latency while maximizing bandwidth. If that behaviour is part of the memory controller, MS would need to have placed the ESRAM closer or connected it differently. Nobody has talked about it yet. The latency figures will decide the comparison, and we don't have them.
 
Yes, I agree it's most probably faster, but since we don't know the latency, it's still conjecture. We are speculating. From what I understand there's something finicky about the algorithm chosen for interleaving reads and writes which impacts either latency or bandwidth. I don't know a lot about GCN, but it's supposed to be quite impervious to latency while maximizing bandwidth. If that behaviour is part of the memory controller, MS would need to have placed the ESRAM closer or connected it differently. Nobody has talked about it yet. The latency figures will decide the comparison, and we don't have them.
There are still many unknowns and some absence of information, although Digital Foundry is working on it. The people who worked on the console have been very open to questions (we don't know what DF staff asked anyway, so we can speculate to infinity and beyond), except for one thing.
[image: 14bufzd.jpg]

(thanks to Javisoft for the picture)
 
Yes, I agree it's most probably faster, but since we don't know the latency, it's still conjecture. We are speculating. From what I understand there's something finicky about the algorithm chosen for interleaving reads and writes which impacts either latency or bandwidth. I don't know a lot about GCN, but it's supposed to be quite impervious to latency while maximizing bandwidth. If that behaviour is part of the memory controller, MS would need to have placed the ESRAM closer or connected it differently. Nobody has talked about it yet. The latency figures will decide the comparison, and we don't have them.

What? Are you suggesting, then, that using 32MB of eDRAM would be faster and cheaper?

The issue with the simultaneous read/write is that writes don't sustain full speed in this mode, so the peak is 204GBps instead of 218. There is a latency incurred by the arbitration during simultaneous read/write, but this is probably measured in a handful of cycles. This is basic SRAM design.
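
One way to reconcile the 204 vs 218 figures, assuming the write path simply loses roughly one cycle in eight to the arbitration (my guess at the mechanism, not a confirmed description):

```python
# Hypothetical breakdown of the ESRAM peak figures
bytes_per_cycle = 128                 # 1024-bit path per direction
clock_ghz = 0.853                     # post-upclock ESRAM clock
one_way = bytes_per_cycle * clock_ghz # ~109 GB/s read-only or write-only
both_ways = 2 * one_way               # ~218 GB/s if both directions ran flat out
write_duty = 7 / 8                    # assume writes stall ~1 cycle in 8
print(one_way + one_way * write_duty) # ~204 GB/s quoted simultaneous peak
```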
 
What? Are you suggesting, then, that using 32MB of eDRAM would be faster and cheaper?

The issue with the simultaneous read/write is that writes don't sustain full speed in this mode, so the peak is 204GBps instead of 218. There is a latency incurred by the arbitration during simultaneous read/write, but this is probably measured in a handful of cycles. This is basic SRAM design.
link?
 
link for what?

That SRAM has significantly lower latency than DRAM?
Yes, the latency of the esram in this context. I can't find it; it seems to be very dependent on the memory controller. It was said earlier that GCN has a high latency to memory, but that has nothing to do with GDDR latency; it's by design, because the architecture trades latency for maximum bandwidth usage.

So if someone told us the esram latency (or simply hinted at it), or if they explained where or how they connected the esram compared to a normal memory controller of a GCN architecture... I'm looking for a link to that info.
 
I don't recall them saying the new numbers were for the same tests when they gave the 150 bound.

133GB/s was at 800MHz. Should be ~142GB/s now, which is right in line with what Charlie had heard for real game usage and what Baker says in the DF article, also for real game usage.
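
That ~142 figure is just the earlier profiled number scaled with the clock bump:

```python
# Scaling the profiled bandwidth figure from 800 MHz to 853 MHz
measured_at_800 = 133               # GB/s, real-game figure quoted at 800 MHz
print(measured_at_800 * 853 / 800)  # ~141.8 GB/s
```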
 
133GB/s was at 800MHz. Should be ~142GB/s now, which is right in line with what Charlie had heard for real game usage and what Baker says in the DF article, also for real game usage.

For a specific subset of a frame in real game usage, sure, but this isn't what you're going to get on average over the entire frame.
 
This from the digital foundry vs microsoft article,

"Sony was actually agreeing with us. They said that their system was balanced for 14 CUs. They used that term: balance. Balance is so important in terms of your actual efficient design. Their additional four CUs are very beneficial for their additional GPGPU work. We've actually taken a very different tack on that. The experiments we did showed that we had headroom on CUs as well. In terms of balance, we did index more in terms of CUs than needed so we have CU overhead. There is room for our titles to grow over time in terms of CU utilisation."

What do you guys think they mean by having headroom on CUs as well?

Are they implying that they aren't using all the ALUs on each CU for graphics, thus having extra for compute?
Or are they saying they have set aside a certain number of CUs from the 12 just for compute?
 