Xbox One (Durango) Technical hardware investigation

You benchmark them...?

Right, but that won't work in every situation now, will it? Further, the benchmarks being used arguably could also fall into the same category of "not truly conveying useful performance metrics". Go look at any number of benchmarks posted here or elsewhere about particular CPUs, hell, even Jaguars, and you will find people picking apart the methodology and software used.

For instance, like Digital Foundry's "Next Gen" GPU comparison using off the shelf AMD parts that had their clock rates played with.

What I'm saying is, if we lament that clock speeds and flops aren't legitimate indicators of real performance, then simply saying "well, benchmarks" could invite the same level of scrutiny.

I was inquiring whether there is some kind of metric outside of flops or clock speed that could prove useful. Naturally we must accept the 'human' element in this situation and the intricacies of these architectures.
 
Benchmarks. There are a number of benchmarks used by comparison sites that provide a much better insight into the real performance of a processor.
And notably, specific benchmarks for different tasks and workloads, rather than a single metric used by forum warriors to denote 'better'. ;) For charts and infographics though, a single number, especially one that can be represented colourfully, is much more palatable for the masses.
 
Back in your day of ASM, I'm sure the cool kids were comparing CPUs by clock speeds.

Ha! I think my clock speed was about 7MHz in those days... and the shit that could be done with that... these devs are spoilt! But no, I think the cool kids were getting laid...

Back to the subject at hand ... that GPGPU stuff seems quite challenging. You have a limited set of instructions aimed at gfx tasks, and what, 700 or 800 "cores" to do the work? You have to parallelise the work into even chunks that have no inter-dependencies, get all the data needed lined up ready to go before you start, then run your code, gather all the results, and repeat for the next lot. Maybe today's devs aren't so spoilt after all!

Seriously though, if someone said "Can you code up all this single-threaded audio manipulation code into GPGPU... we need the CPU for something else", well, rather you than me... you're not going to knock that out in a week, are you?!
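To illustrate the shape of that work, here's a toy split / run / gather sketch in plain Python, with numpy standing in for the GPU dispatch. The names are invented and there's nothing GPU-specific here; it just shows the pattern, not real GPGPU code:

```python
# Toy sketch of the split / run / gather pattern described above.
import numpy as np

def fake_kernel(chunk):
    # Stand-in for a compute shader: pure function of its own chunk,
    # with no dependency on any other chunk.
    return chunk * 2.0 + 1.0

def dispatch(data, num_chunks):
    # 1. Stage the data up front and split it into even, independent chunks.
    chunks = np.array_split(data, num_chunks)
    # 2. "Run" every chunk independently (on a GPU these would be
    #    thousands of threads in flight at once).
    results = [fake_kernel(c) for c in chunks]
    # 3. Gather the results back into one buffer, then repeat for the next batch.
    return np.concatenate(results)

if __name__ == "__main__":
    samples = np.random.rand(1_000_000).astype(np.float32)
    out = dispatch(samples, num_chunks=768)   # ~700-800 "cores", per the post
    print(out.shape)
```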
 
And notably, specific benchmarks for different tasks and workloads, rather than a single metric used by forum warriors to denote 'better'. ;) For charts and infographics though, a single number, especially one that can be represented colourfully, is much more palatable for the masses.

That's basically what I'm getting at. It's obvious as to why certain numbers have been found desirable over others.

But, unfortunately, people rarely have anything beyond short attention spans. But I'd argue that this extends to 'benchmarks' too. Not all benchmarks are created equal and there are plenty of bad ones out there, performed by people who don't quite understand what's going on.

But it's in writing and has pretty charts so it must be true.

I realize that benchmarks are a far more adequate gauge of real-world performance than flops or clock, but even for me the technical jargon I see used to criticise even respected groups can cause confusion (let alone for others). Again, like DF. Lots of people saying they screwed up that entire comparison article.

For laymen and near-laymen that causes some consternation.
 
If MS is correct and each 2 CUs over 12 is worth less than a 7% upclock, couldn't we say an 18 CU XBO at 800 MHz would be less than 14% more GPU than 12 CU at 853 MHz?

Because they already claim to be the equivalent of >14 CUs after the upclock, and each additional 2 CUs should be worth <7% by their reckoning.
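Rough arithmetic behind that, taking MS's claim at face value (the 7%-per-2-CU figure is their claim, not a measured number):

```python
# Assumption (MS's claim, not verified): 12 CUs at 853 MHz is worth more than
# 14 CUs at 800 MHz, and each additional pair of CUs is worth less than a 7% upclock.
per_pair_gain = 0.07   # claimed upper bound per extra 2 CUs
extra_pairs = 2        # 18 CUs = the ">14 CU equivalent" plus 2 more pairs

upper_bound = extra_pairs * per_pair_gain
print(f"18 CU @ 800 MHz < {upper_bound:.0%} more GPU than 12 CU @ 853 MHz")
# -> 18 CU @ 800 MHz < 14% more GPU than 12 CU @ 853 MHz
```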
 
I would like to see some evidence, or at least some of the more educated in GPU tech on this forum come up with an analysis of why that might be.
What is it that makes using more than 12 CUs lose efficiency like that?
 
They're saying that each additional 2 CUs is not as good as a 7% upclock for their overall design, not in general.
 
I think you need to take into account the full context ... meaning adding additional CUs against DDR3/eSRAM bandwidth.

You have an entirely different context when adding CUs with GDDR5 bandwidth.
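As a back-of-the-envelope on why that context matters, here's a bytes-per-FLOP sketch. The bandwidth figures are the commonly quoted ones (~68 GB/s DDR3 plus the 32 MB eSRAM for XB1, ~176 GB/s GDDR5 for PS4), not something from this thread, so treat them as illustrative:

```python
def bytes_per_flop(bandwidth_gbs, cus, clock_mhz, flops_per_cu_per_clock=128):
    # Peak GFLOPS for a GCN-style part: CUs * clock * 128 FLOPs per CU per clock.
    gflops = cus * clock_mhz * flops_per_cu_per_clock / 1000.0
    return bandwidth_gbs / gflops

# Main-memory bandwidth per FLOP (the eSRAM only covers a 32 MB working set):
print(bytes_per_flop(68.0, 12, 853))    # XB1, DDR3 only: ~0.05 B/FLOP
print(bytes_per_flop(176.0, 18, 800))   # PS4, GDDR5:     ~0.10 B/FLOP
# Adding CUs without adding main-memory bandwidth pushes this ratio down,
# which is the "different context" being described above.
```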
 
The simulations would have been for whatever target software they consider representative, and then what the scaling behavior would be for the specific hardware implementation of the SoC.

This includes other possible bottlenecks, like ROPs, memory, and resources like the GDS and graphics front end.

Let's say there is a workload that is ALU or TEX limited 20% of the time, but is limited by other parts of the GPU like the ROPs or eSRAM for the other 80%.

Upping CU throughput gives a 17% boost, but only to the 20% of the time the workload is ALU or TEX limited.
That means the overall improvement is less than 4%.
A global improvement of 6% would mean more.

A different hardware balance and test set could change the outcome.
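A quick sketch of that weighted-speedup reasoning (essentially Amdahl's law), plugging in the example numbers above:

```python
def overall_gain(limited_fraction, local_speedup):
    # Only the fraction of frame time that is ALU/TEX limited sees the speedup;
    # the rest is bound elsewhere (ROPs, eSRAM, front end, ...).
    new_time = limited_fraction / local_speedup + (1.0 - limited_fraction)
    return 1.0 / new_time - 1.0

# 14 CUs vs 12 is ~17% more ALU/TEX throughput, but only 20% of the time
# is ALU/TEX limited in this hypothetical workload:
print(f"{overall_gain(0.20, 14 / 12):.1%}")    # ~2.9%, i.e. "less than 4%"

# A ~6.6% upclock lifts (almost) the whole GPU:
print(f"{overall_gain(1.00, 853 / 800):.1%}")  # ~6.6%
```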
 
That's basically what I'm getting at. It's obvious as to why certain numbers have been found desirable over others.

But, unfortunately, people rarely have anything beyond short attention spans. But I'd argue that this extends to 'benchmarks' too. Not all benchmarks are created equal and there are plenty of bad ones out there, performed by people who don't quite understand what's going on.

But it's in writing and has pretty charts so it must be true.

I realize that benchmarks are a far more adequate gauge of real-world performance than flops or clock, but even for me the technical jargon I see used to criticise even respected groups can cause confusion (let alone for others). Again, like DF. Lots of people saying they screwed up that entire comparison article.

For laymen and near-laymen that causes some consternation.
So here's a comparison: the 360 CPU and the Jaguar have the same theoretical FLOPS. In real-world code, algorithms complete, on average, 6-8x faster on the Jaguar. Why? Out-of-order execution, multiple-issue micro-ops, better branch prediction, a shorter pipeline, fewer stalls. There is no single number that would be able to tell you how well the two processors would do against each other. If you're running highly optimised vector loops, they might be about the same; if you're running branch-heavy AI or physics, the Jaguar might be as much as 10 or 15 times as fast.

Now if you're comparing identical architectures, you might have more luck using just clock speed, but change one architecture even a little, like Ivy Bridge to Haswell, and for exactly the same frequency and FLOPS rating, the Haswell will be about 6% faster.
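That "theoretical FLOPS" is just a paper number: cores × clock × FLOPs issued per cycle. A minimal sketch, with placeholder inputs rather than exact figures for either chip:

```python
def peak_gflops(cores, clock_ghz, flops_per_cycle):
    # Paper peak: nothing here knows about out-of-order execution, branch
    # prediction, pipeline length or stalls.
    return cores * clock_ghz * flops_per_cycle

# Illustrative placeholders only, not the real specs of either CPU:
old_in_order  = peak_gflops(cores=3, clock_ghz=3.2, flops_per_cycle=8)
modern_jaguar = peak_gflops(cores=8, clock_ghz=1.6, flops_per_cycle=8)
print(old_in_order, modern_jaguar)  # similar ballpark on paper

# ...which is exactly why real-world code can still run several times faster
# on one of them despite a similar peak FLOPS number.
```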
 
The simulations would have been for whatever target software they consider representative, and then what the scaling behavior would be for the specific hardware implementation of the SoC.

This includes other possible bottlenecks, like ROPs, memory, and resources like the GDS and graphics front end.

Let's say there is a workload that is ALU or TEX limited 20% of the time, but is limited by other parts of the GPU like the ROPs or eSRAM for the other 80%.

Upping CU throughput gives a 17% boost, but only to the 20% of the time the workload is ALU or TEX limited.
That means the overall improvement is less than 4%.
A global improvement of 6% would mean more.

A different hardware balance and test set could change the outcome.


Yeah, that's my one caveat. Test on shader-limited code and the results may (should?) be different.

Just like I already posted, it seems kind of a no-brainer that on code made for 12 CUs, an upclock is more effective than 2 more CUs.

I wonder if they profiled any multiplats in this (I assume not).
 
Doesn't the upclock help with triangle setup rate and fillrate, while the additional CUs wouldn't have helped in this regard?
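As a rough illustration, those fixed-function rates scale with clock and not with CU count. A small sketch, assuming the commonly cited 16 ROPs and 2 primitives per clock for the XB1 GPU (an assumption here, not something confirmed in this thread):

```python
ROPS, PRIMS_PER_CLOCK = 16, 2   # assumed figures, see note above

def fillrate_gpix(clock_mhz):   # pixel fillrate scales with clock...
    return ROPS * clock_mhz / 1000.0

def tri_rate_mtris(clock_mhz):  # ...and so does triangle setup.
    return PRIMS_PER_CLOCK * clock_mhz

print(fillrate_gpix(800), tri_rate_mtris(800))  # 12.8 Gpix/s, 1600 Mtris/s
print(fillrate_gpix(853), tri_rate_mtris(853))  # ~13.6 Gpix/s, 1706 Mtris/s
# Adding 2 CUs would have left both of these numbers exactly where they were.
```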
 
I would like to see some evidence, or at least some of the more educated in GPU tech on this forum come up with an analysis of why that might be.
What is it that makes using more than 12 CUs lose efficiency like that?

Not 100% on this, but I think that apart from bandwidth issues, there can be dependencies between the parallel threads. This would prevent linear scaling. The more you parallelise the work, the more chance there is of a dependency cropping up and stalling the core. Possibly?

It would explain why MS got more from the 6% upclock than from enabling the extra 2 CUs.
 
Adding cores never scales linearly. For their statement to be true on a GFLOPS basis (which it probably isn't), a scaling efficiency of 95.6% per core added would provide 12 CUs at 853 MHz with the same horsepower as 14 CUs at 800 MHz. At that (bogus) scaling factor, 18 CUs at 800 MHz would only be 7.4% faster than 12 CUs at 853 MHz.

The real scaling factor is guaranteed to be higher than that, but it will still be <1, and the scaling factor probably goes down with increasing cores, since the bottlenecks become even more important. You're going to have a lot more contention for a 256-bit memory bus at 12 cores than at 2. What this means is that in real terms, PS4 does not have, and never had, a 50% GPU advantage. What the actual advantage is, we may never know.
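Redoing that arithmetic: the "efficiency per added CU" below is just the compounding factor that makes 14 CUs at 800 MHz match 12 CUs at 853 MHz on paper, i.e. the bogus assumption flagged above:

```python
BASE_CUS = 12

def effective_throughput(cus, clock_mhz, eff_per_added_cu):
    # Each CU beyond the 12th is discounted by a compounding efficiency factor.
    return cus * clock_mhz * eff_per_added_cu ** (cus - BASE_CUS)

# Solve 14 * 800 * e^2 == 12 * 853 for e:
eff = ((12 * 853) / (14 * 800)) ** 0.5
print(f"implied efficiency per added CU: {eff:.3f}")            # ~0.956

ratio = effective_throughput(18, 800, eff) / effective_throughput(12, 853, eff)
print(f"18 CU @ 800 MHz vs 12 CU @ 853 MHz: +{ratio - 1:.1%}")  # ~+7.4%
```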
 
I don't know why doing both wouldn't make EVEN MORE SENSE. Both, in this case, means additional CUs and a higher clock speed.
 
I asked earlier if active CUs draw more power than inactive CUs. If they draw more power when active, then the combination of both may have put them over the edge for heat/power.
 
This seems strange to me. Do devs write code for a specific number of CUs?

They do now if they're working on PS4 or XBO titles. Since the platforms are closed and the environment is more controlled, it makes targeting the 12 CUs the ideal situation.
 
http://semiaccurate.com/2013/09/20/amd-livestream-gpu14/

"Things look vastly different on the gaming console side. AMD gained a clean sweep in Wii U with R700 shaders, PlayStation 4 with Sea Islands shaders, and Xbox One with Southern Islands shaders, along with Jaguar x86-64 CPU cores for the PlayStation 4 APU and Xbox One SoC, that means most of the AAA cross-platform titles will have the same starting line and a unified feature set on graphics capabilities. [Author's note: except for features enabled the eSRAM on Xbox One which some devs states it's "a pain to use the eSRAM".]"


So Microsoft was lying when they said both PS4 and Xbox One use Sea Islands GPUs?
 