Xbox One (Durango) Technical hardware investigation

taisui · Aug 30, 2013

ProjectNatalFan said:
SemiAccurate would be very surprised if it was 128b wide, wires are cheap, power saving areas not. Why is this important? Unless Microsoft's XBox One architects are masochists who enjoy doing needless and annoying work, they would not have reinvented the wheel and put an arbitrarily clockable asynchronous interface between the NB and the CPU cores/L2s. Added complexity, lowered performance, and a die penalty for absolutely no useful upside is not a good architectural decision. That means the XBox One's 8 Jaguar cores are clocked at ~1.9GHz, something that wasn't announced at Hot Chips.

thanks if anyone can decode this mastery madness

He's saying that because the coherent bus bandwidth is 30 GB/s, working backwards from a 256b (32B) width like the DDR3 interface, the bus would run at 938 MHz. Multiply that by a 2X multiplier and you get ~1.9 GHz on the CPU.

However, I don't see why you couldn't use a 1.5X frequency multiplier, or just have it be async.
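To make the back-of-the-envelope math concrete, here is a minimal sketch. It assumes the 30 GB/s figure uses decimal units and that the link moves exactly 32 bytes per clock; the CPU:bus ratios are just the two hypothetical multipliers discussed above, not known values.

```python
# Back-of-the-envelope check of the SemiAccurate reasoning.
# Assumptions: 30 GB/s is decimal (10^9 bytes/s) and the link moves 32 bytes per clock.
COHERENT_BW = 30e9      # bytes per second
BUS_WIDTH_BYTES = 32    # 256-bit link, the width assumed in the argument

def bus_clock_hz(bandwidth_bps: float, width_bytes: int) -> float:
    """Clock needed to move `width_bytes` per cycle at the given bandwidth."""
    return bandwidth_bps / width_bytes

bus = bus_clock_hz(COHERENT_BW, BUS_WIDTH_BYTES)
print(f"bus clock: {bus / 1e6:.1f} MHz")

# CPU clock implied by hypothetical synchronous CPU:bus ratios.
for ratio in (2.0, 1.5):
    print(f"{ratio}x -> CPU at {bus * ratio / 1e9:.3f} GHz")
```

With a 2:1 ratio this lands at 1.875 GHz, while 1.5:1 gives about 1.41 GHz. If the interface is asynchronous there is no fixed ratio at all, so the bus clock says nothing about the CPU clock.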

ProjectNatalFan · Aug 30, 2013

taisui said:
He's saying that because the coherent bus bandwidth is 30 GB/s, working backwards from a 256b (32B) width like the DDR3 interface, the bus would run at 938 MHz. Multiply that by a 2X multiplier and you get ~1.9 GHz on the CPU.

However, I don't see why you couldn't use a 1.5X frequency multiplier, or just have it be async.

Did they ever find out if it is sync or async? And when you say working backwards, how does the math apply there? And do you know what the B stands for?

ProjectNatalFan · Aug 30, 2013

eastmen said:
I dunno if anyone is biased, and I don't think here is the right place to talk about it. I just would like to see the theory of why it's wrong, and not just hear it's wrong and that's it. I'm not a huge tech head like the guys on here, and I do like to read what everyone comes up with and why, even if I understand little of it. It's one of the only ways for me to learn more.

The theory of why it's wrong, from the VGleaks docs, is that both PS4 and XBO have the same coherent bandwidth figure of 30 GB/s, and the clock in the original doc is 1.6 GHz. Now you are saying you have an insider who claims the XBO has a more powerful CPU, correct?

taisui · Aug 30, 2013

ProjectNatalFan said:
did they ever find out if it is sync or async? and when you say working backwards, how does the math apply there? and do you know what the B stands for?

B = byte, b = bit, 1 byte = 8 bits; they are units of data.
So 30 GB/s / 32 B = 938 MHz (937.5 to be exact).

He's guessing it's synchronous because it's a simpler design.

ProjectNatalFan · Aug 30, 2013

taisui said:
B = byte, b = bit, 1 byte = 8 bits; they are units of data.
So 30 GB/s / 32 B = 938 MHz (937.5 to be exact).

He's guessing it's synchronous because it's a simpler design.

Awesome, but if it were a 128b/16B design, then it would clock to the 1.6, correct?

eastmen · Aug 30, 2013

ProjectNatalFan said:
The theory of why it's wrong, from the VGleaks docs, is that both PS4 and XBO have the same coherent bandwidth figure of 30 GB/s, and the clock in the original doc is 1.6 GHz. Now you are saying you have an insider who claims the XBO has a more powerful CPU, correct?

Same person who told me about a possible small GPU upclock of about 75 MHz. He told me that the CPU in the Xbox One is faster but never told me by how much. I assumed it was 1.8 GHz and he never corrected me when I asked him the speed. So I dunno.

taisui · Aug 30, 2013

ProjectNatalFan said:
Awesome, but if it were a 128b/16B design, then it would clock to the 1.6, correct?

Then you'd have a bus running at nearly 1.9 GHz, which seems unlikely, and the CPU, if synchronous, would still be running at 1.9 GHz.
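A quick way to see why halving the width doesn't get you back to 1.6 GHz: halving the bytes per clock doubles the clock needed for the same 30 GB/s. This sketch assumes the same 30 GB/s figure and only compares the two widths discussed; the helper names are illustrative.

```python
def bits_to_bytes(bits: int) -> int:
    """Bus widths are quoted in bits (b), bandwidth in bytes (B); 8 b = 1 B."""
    if bits % 8:
        raise ValueError("bus width must be a whole number of bytes")
    return bits // 8

def implied_clock_ghz(bandwidth_gbps: float, width_bits: int) -> float:
    """Clock (GHz) a bus of `width_bits` needs to sustain `bandwidth_gbps` GB/s."""
    return bandwidth_gbps / bits_to_bytes(width_bits)

for width in (256, 128):
    print(f"{width}-bit ({bits_to_bytes(width)} B): {implied_clock_ghz(30, width):.3f} GHz")
```

The 128-bit case needs 1.875 GHz on the bus itself, which is the "nearly 1.9 GHz" figure above.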

McHuj · Aug 30, 2013

ProjectNatalFan said:
Well, you seem knowledgeable. Do you think it's true? Or do you think the VGleaks are correct?

VGleaks also said the GPU was 800 MHz, and that was true until it wasn't anymore. If MS was able to upclock the GPU, the possibility exists that it could happen on the CPU side as well.

NotTarts · Aug 30, 2013

Arwin said:
The only problem with that theory is that there was some article on how the optimum heat/power ratio for that chip was at 1.6 GHz, where upping it to 2.0 GHz would mean a huge loss of efficiency on that front. At least if I remember correctly. And 1.9 GHz is very close to 2.0...?

I'm guessing this is based upon Anand's remarks about the consoles' ideal clock speed? The heat concerns are unsubstantiated for several reasons:

1. TDP != power consumption, especially when you're isolating the CPU from an SoC. Anand showed this himself in his Kabini review when he recorded the entire laptop drawing 11.5W under CPU load (the A4-5000 is rated at 15W).
2. The 15W -> 25W jump includes a GPU clock boost of 20%.
3. The Opteron X1150 (CPU-only Jaguar) shows otherwise. A 100% increase in CPU speed (1 GHz -> 2 GHz) results in only a ~89% increase in TDP (9W -> 17W).
4. Jaguar consumes so little power that even if a 25% bump to 2 GHz in the XB1 resulted in a 50% increase in power, you're looking at an extra 10W at most.
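The percentages in the list above can be checked directly. This is a small sketch using only the figures quoted in the post; the 20 W CPU budget is a hypothetical upper bound implied by point 4 ("an extra 10W at most" after a 50% increase), not a measured value.

```python
def pct_increase(before: float, after: float) -> float:
    """Percentage increase going from `before` to `after`."""
    return (after / before - 1.0) * 100.0

x1150_clock = pct_increase(1.0, 2.0)    # Opteron X1150: 1 GHz -> 2 GHz
x1150_tdp = pct_increase(9.0, 17.0)     # Opteron X1150: 9 W -> 17 W
tdp_jump = pct_increase(15.0, 25.0)     # 15 W -> 25 W (includes a GPU clock boost)
xb1_clock = pct_increase(1.6, 2.0)      # hypothetical XB1 bump: 1.6 GHz -> 2.0 GHz

# Point 4: hypothetical ~20 W CPU budget; a 50% power increase then adds 10 W.
HYPOTHETICAL_CPU_WATTS = 20.0
extra_watts = HYPOTHETICAL_CPU_WATTS * 0.50

print(f"X1150: +{x1150_clock:.0f}% clock for +{x1150_tdp:.1f}% TDP")
print(f"15 W -> 25 W: +{tdp_jump:.1f}%")
print(f"1.6 -> 2.0 GHz: +{xb1_clock:.0f}% clock, extra power <= {extra_watts:.0f} W")
```

The 15 W -> 25 W jump works out to roughly the "66%" figure that circulated, but as point 2 notes, it isn't a CPU-only comparison.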

Scott_Arm · Aug 30, 2013

They're using the 30 GB/s coherent link to calculate the clock rate of the CPU (30 GB/s / 32-byte bus width = 938 MHz), but if the clock rate of that link were tied to the CPU clock, then its bandwidth would be higher than 30 GB/s, which is the same bandwidth they had listed on VGleaks at 800 MHz. Unless you believe the VGleaks stats were wrong about the CPU clock from the start, I don't see how you can calculate the CPU clock in that manner.
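The argument can be put in numbers. This sketch assumes the 30 GB/s figure in the leaked documents corresponds to the 1.6 GHz CPU clock listed there, and that (hypothetically) the link clock scales linearly with the CPU clock.

```python
VGLEAKS_CPU_GHZ = 1.6          # CPU clock in the leaked documents
VGLEAKS_COHERENT_GBPS = 30.0   # coherent bandwidth in the same documents

def scaled_bandwidth(new_cpu_ghz: float) -> float:
    """Coherent bandwidth if the link clock (hypothetically) tracked the CPU clock."""
    return VGLEAKS_COHERENT_GBPS * new_cpu_ghz / VGLEAKS_CPU_GHZ

print(f"{scaled_bandwidth(1.9):.1f} GB/s if the CPU ran at 1.9 GHz")
```

A 1.9 GHz CPU with a tied link would imply roughly 35.6 GB/s, so an unchanged 30 GB/s figure doesn't fit that reading.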

taisui · Aug 30, 2013

ProjectNatalFan said:
Well, you seem knowledgeable. Do you think it's true? Or do you think the VGleaks are correct?

It's an interesting guess, but the bus speed seems weird, so I don't know. I'm far more interested in knowing more about the eSRAM and the 2x2 gfx/compute cmd processors, which I am guessing would help with CU utilization, but I can't seem to find anything about them.

3dilettante · Aug 30, 2013

I have doubts the numbers given are precise enough to give a full picture.
One thing I notice from the VGleaks Durango diagram is that the Northbridge-to-CPU links have 20.8 GB/s.
There are a number of variables you can play with: bus width, whether the two links are separate, clock speed, etc.
What feasible combination of bit widths and clocks leads to 20.8 GB/s? Neither 1.9 GHz nor half of it gives that number for any integer byte width.

The first seemingly reasonable numbers I could arrive at are 16B and 1.3 GHz, or, possibly less reasonably, 13B and 1.6 GHz.
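The search described above is easy to brute-force. This sketch divides the 20.8 GB/s per-link figure by each integer byte width from 8 to 32 and keeps widths whose required clock lands on a round 10 MHz step; that "roundness" filter is only a heuristic for plausible-looking clocks, not a claim about how the chip is actually built.

```python
TARGET_GBPS = 20.8  # per-link figure from the VGleaks Durango diagram

def clock_for_width(width_bytes: int, target_gbps: float = TARGET_GBPS) -> float:
    """Clock in GHz a link `width_bytes` wide needs to reach target_gbps."""
    return target_gbps / width_bytes

def tidy_candidates(lo: int = 8, hi: int = 32) -> list[tuple[int, float]]:
    """Integer byte widths whose required clock is a whole number of 10 MHz steps."""
    out = []
    for width in range(lo, hi + 1):
        ghz = clock_for_width(width)
        steps = ghz * 100  # clock expressed in 10 MHz units
        if abs(steps - round(steps)) < 1e-6:
            out.append((width, round(ghz, 2)))
    return out

# 1.9 GHz or half of it implies a non-integer number of bytes per clock.
for ghz in (1.9, 0.95):
    print(f"{ghz} GHz -> {TARGET_GBPS / ghz:.2f} bytes per clock")

for width, ghz in tidy_candidates():
    print(f"{width:2d} B at {ghz} GHz")
```

The candidates include 16 B at 1.3 GHz and 13 B at 1.6 GHz, alongside other combinations such as 32 B at 0.65 GHz.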

Trying to get a good match for that bus with reasonable bus widths for coherent traffic, or for that matter the IO block's bus, makes me suspect that the various blocks have interfaces that run at different ratios. That doesn't say much either way about the CPU clocks, though.

3dilettante · Aug 30, 2013

I think the Microsoft presentation and the leaked documents on Durango have gaps in what they discuss that leave room for interpretation.

There is a link or links between the CPU clusters and the Northbridge. Vgleaks shows 20.8 in both directions from a Jaguar cluster, with both clusters getting their own pair of links.
Adding those up already exceeds memory bandwidth and the bandwidth of the Northbridge's coherent traffic.
That could mean that there is an ambiguity in the diagram as to whether those directions should be listed separately, although the Semiaccurate article also brings up the idea that the interface between the two L2s has been buffed up, so their higher numbers may assume inter-module sharing.
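Summing the links shows the mismatch. This sketch assumes the commonly cited 256-bit DDR3-2133 main memory (about 68.3 GB/s), which is not stated in this thread, and reads the diagram both ways: each direction listed separately, and one combined figure per cluster.

```python
LINK_GBPS = 20.8             # per direction, per Jaguar cluster (VGleaks diagram)
CLUSTERS = 2
DIRECTIONS = 2
COHERENT_LIMIT_GBPS = 30.0
DDR3_GBPS = 2133e6 * 32 / 1e9  # assumed 256-bit DDR3-2133, about 68.3 GB/s

total_four_links = LINK_GBPS * CLUSTERS * DIRECTIONS  # directions counted separately
total_two_links = LINK_GBPS * CLUSTERS                # one figure per cluster

print(f"all four links: {total_four_links:.1f} GB/s vs DDR3 {DDR3_GBPS:.1f} GB/s")
print(f"one per cluster: {total_two_links:.1f} GB/s vs coherent {COHERENT_LIMIT_GBPS} GB/s")
```

Either reading exceeds the 30 GB/s coherent path, which is consistent with the links not all being usable for coherent traffic at once.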

If the request queue for coherent traffic is similar to existing AMD chips, there is a central crossbar and queue that all ordered memory clients plug into, which is the likely source of the 30 GB/s limit.
Coherent traffic from the CPUs needs to go through that juncture.
Write-combining traffic that doesn't go through the caches might supply write bandwidth that goes beyond the initial 30 GB/s.

SlimJim · Aug 30, 2013

3dilettante said:
I think the Microsoft presentation and the leaked documents on Durango have gaps in what they discuss that leave room for interpretation.

There is a link or links between the CPU clusters and the Northbridge. Vgleaks shows 20.8 in both directions from a Jaguar cluster, with both clusters getting their own pair of links.
Adding those up already exceeds memory bandwidth and the bandwidth of the Northbridge's coherent traffic.
That could mean that there is an ambiguity in the diagram as to whether those directions should be listed separately, although the Semiaccurate article also brings up the idea that the interface between the two L2s has been buffed up, so their higher numbers may assume inter-module sharing.

If the request queue for coherent traffic is similar to existing AMD chips, there is a central crossbar and queue that all ordered memory clients plug into, which is the likely source of the 30 GB/s limit.
Coherent traffic from the CPUs needs to go through that juncture.
Write-combining traffic that doesn't go through the caches might supply write bandwidth that goes beyond the initial 30 GB/s.

Thanks for the insight!
I really hope that somebody can get his hands on real documentation soon.
I have the PS2 Linux kit, which came with all the developer documentation; it has specs for every single part and bus, and even real-life usage scenarios, so it's not limited to theoretical specs.

Dominik D · Aug 30, 2013

Scott_Arm said:
30 GB/s / 32 byte bus width = 938 MHz

I think it'd actually be more: 30 * 1024^3 / 32 ≈ 1006 MHz. But regardless of the math, the memory interface frequency has to play nicely with the memory frequency, but doesn't have to have any "sane" relationship with the CPU speed. The last two pages of this thread are tea-leaf reading at its finest.
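The gap between 938 MHz and ~1006 MHz comes entirely from how "GB" is read. This sketch computes both, assuming the same 30 GB/s figure and 32-byte width; neither the leaks nor the presentation say which convention was used.

```python
BW_GB = 30
WIDTH_BYTES = 32

decimal_mhz = BW_GB * 1000**3 / WIDTH_BYTES / 1e6  # GB as 10^9 bytes
binary_mhz = BW_GB * 1024**3 / WIDTH_BYTES / 1e6   # "GB" read as GiB, 2^30 bytes

print(f"decimal: {decimal_mhz:.1f} MHz, binary: {binary_mhz:.1f} MHz")
```

A roughly 7% ambiguity in the unit alone is another reason not to read a precise CPU clock out of a rounded bandwidth figure.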

3dilettante · Aug 30, 2013

Another point against assuming a simple relationship between peak CPU clock and the northbridge is that both the GPU and CPUs are capable of ramping their clocks up and down, we've seen them tweak the GPU clock without propagating that change through the chip, and northbridge clocks have ways of being kept at least partly decoupled.

At the high level of the leaks so far, I think we can see several likely coarse clock domains, not counting the local domains within them. Given what has been disclosed about the complexity of other modern cores and their power management, raising the specter of world-ending complexity just because the APU doesn't use a 1:2 clock ratio sounds a little excessive.

Lalaland · Aug 30, 2013

Cyan said:
Another great reply, thanks. Do you mean that you can use it independently? Let's say you program a game from scratch (not from scratchpad, sorry for the bad joke) and don't want to use the DDR3 memory at all. If your game fits in the eSRAM, could you use it on its own to run the game without DDR3?

Thanks, glad I could help, but your question has run into the limits of my knowledge!

Whether a game could be run entirely from the ESRAM depends on how access to this pool is handled and whether there is coherency between the GPU and CPU caches (L1 and L2). On the first point, I've seen some back and forth about whether the GPU has exclusive access to the ESRAM, or whether the CPU can access it but doing so forces a GPU cache flush (which ties into the second point, and is not good, as it essentially stalls the GPU). The bigger obstacle is that it's only 32 MB, and most of the techniques that play to ESRAM's strengths store the framebuffer there, leaving little room for the rest of your code.
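To give a feel for how quickly 32 MB fills up, here is a sketch with a hypothetical 1080p layout: three 32-bit color/G-buffer targets plus a 32-bit depth/stencil buffer, all uncompressed. Real engines use very different formats and layouts; this is only an illustration of the scale involved.

```python
ESRAM_BYTES = 32 * 1024**2  # 32 MB of ESRAM

def target_bytes(width: int, height: int, bytes_per_pixel: int) -> int:
    """Size of one uncompressed render target."""
    return width * height * bytes_per_pixel

# Hypothetical 1080p layout: three 32-bit color/G-buffer targets + 32-bit depth.
layout_bpp = [4, 4, 4, 4]
used = sum(target_bytes(1920, 1080, bpp) for bpp in layout_bpp)
left = ESRAM_BYTES - used

print(f"render targets: {used / 1024**2:.2f} MB of {ESRAM_BYTES / 1024**2:.0f} MB")
print(f"left for everything else: {left / 1024:.0f} KB")
```

Under these assumptions the render targets alone take about 31.6 MB, leaving only a few hundred KB.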

I look forward to the more knowledgeable folk on this board correcting me soon!

Jwm · Aug 30, 2013

Here is part 2 from SA - http://semiaccurate.com/2013/08/30/a-deep-dive-in-to-microsofts-xbox-one-gpu-and-on-die-memory/

Deleted member 7537 · Aug 30, 2013

https://twitter.com/digitalfoundry/status/337517435810881536

Digital Foundry
@digitalfoundry
Interesting AMD Jaguar stat from @anandshimpi: 2GHz clock speed requires 66% TDP increase compared to 1.6GHz, ruling out next-gen console.

NotTarts · Aug 30, 2013

jayco said:
https://twitter.com/digitalfoundry/status/337517435810881536

NotTarts said:
I'm guessing this is based upon Anand's remarks about the consoles' ideal clock speed? The heat concerns are unsubstantiated for several reasons:

1. TDP != power consumption, especially when you're isolating the CPU from an SoC. Anand showed this himself in his Kabini review when he recorded the entire laptop drawing 11.5W under CPU load (the A4-5000 is rated at 15W).
2. The 15W -> 25W jump includes a GPU clock boost of 20%.
3. The Opteron X1150 (CPU-only Jaguar) shows otherwise. A 100% increase in CPU speed (1 GHz -> 2 GHz) results in only a ~89% increase in TDP (9W -> 17W).
4. Jaguar consumes so little power that even if a 25% bump to 2 GHz in the XB1 resulted in a 50% increase in power, you're looking at an extra 10W at most.

Anand also wrote that before he even tested Jaguar. I'd guess that the decision to limit the CPU to 1.6 GHz is for yield reasons, not due to concerns about Jaguar's power efficiency at 2 GHz.
