I made note of this in the Xbox SDK thread, as there was a bit of documentation for vector memory operations that gave a few latency numbers (seemingly in GPU cycles).
A vector L1 miss can take 50+ clock cycles to service, an L2 miss that goes to DRAM can take 200+ cycles, and an L2 miss that goes to the ESRAM can take 75+ cycles.
I am interpreting the cycle counts as being per individual miss event, meaning the L1 miss latency would be additive with whatever secondary miss is encountered in the next level of the hierarchy. This seems more consistent than expecting the ESRAM to be twice as fast as the L1-L2 request latency, which would imply bypassing a significant portion of the GPU memory pipeline when going to ESRAM.
This seems to put a miss to DRAM at 250+ cycles and a miss to ESRAM at 125+ cycles, or roughly half the latency. In a different portion of the documentation, there is an expectation of a texture access realistically taking more than 100 cycles and possibly around 400 if there is a miss, so the "+" on those numbers may be very significant.
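To make that additive reading concrete, here is a quick back-of-envelope sketch in Python using only the minimum figures quoted above (the "+" means these are floors, not typical values):

```python
# Additive reading of the SDK's GPU-side latency figures (all GPU cycles).
# These are documented minimums; real texture accesses apparently run past
# 100 cycles and up toward ~400 on a miss.
L1_MISS  = 50   # vector L1 miss service time
L2_DRAM  = 200  # L2 miss serviced from DRAM
L2_ESRAM = 75   # L2 miss serviced from ESRAM

dram_total  = L1_MISS + L2_DRAM    # 250+ GPU cycles to DRAM
esram_total = L1_MISS + L2_ESRAM   # 125+ GPU cycles to ESRAM

print(f"miss to DRAM : {dram_total}+ cycles")
print(f"miss to ESRAM: {esram_total}+ cycles (~{esram_total / dram_total:.0%} of the DRAM path)")
```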
For a somewhat weak comparison, this other post contains latency values for the CPU memory subsystem:
https://forum.beyond3d.com/threads/...-news-and-rumours.53602/page-208#post-1700821
In short:
Orbis L1 to L2 to DRAM: 3 to 26 (local L2) / 190 (remote) to 220+ cycles.
Durango L1 to L2 to DRAM: 3 to 17 (local) / 120 (remote)* to 140-160 cycles.
*Reviewing the SDK seems to clarify some numbers that may have made my earlier interpretation pessimistic: a remote L2 hit that misses in the remote L1s is 100 cycles, and if the data is in a remote L1 there is an additional hop.
This leaves more cycles between remote L2 servicing and DRAM servicing: with the full miss at 140-160 cycles and the remote check around 100, roughly 40-60 cycles are taken up by the DRAM access itself. (A similar shift for Orbis would give a similar DRAM cycle contribution.)
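A sketch of that Durango-side arithmetic, with all values in CPU cycles; the remote-L2 figure is the ~100 cycles noted above, and the DRAM portion is simply the remainder:

```python
# Rough decomposition of the Durango CPU hierarchy (all CPU cycles).
L1_HIT        = 3            # local L1 hit
L2_LOCAL_HIT  = 17           # local L2 hit
L2_REMOTE_HIT = 100          # line found via the other module's L2
DRAM_MISS     = (140, 160)   # full miss that goes out to DRAM

# What is left after the remote check is attributed to the DRAM access itself.
dram_portion = tuple(total - L2_REMOTE_HIT for total in DRAM_MISS)

print(f"L1 hit            : {L1_HIT} cycles")
print(f"local L2 hit      : {L2_LOCAL_HIT} cycles")
print(f"remote L2 hit     : {L2_REMOTE_HIT} cycles")
print(f"DRAM access itself: ~{dram_portion[0]}-{dram_portion[1]} cycles")
```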
Since these figures are referenced in CPU cycles, and the CPU clock is roughly twice the GPU clock, each of these cycles is about half as long as the cycles used for the GPU latencies.
Keeping with Durango, a CPU miss to DRAM would take ~80 GPU cycles, or roughly the same latency seen for a GPU L2 miss serviced from the ESRAM. Since the ESRAM is not in the CPU's domain, any data the CPU needs from it would take longer to get than going to DRAM.
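A minimal conversion sketch, assuming the ~2:1 CPU:GPU clock ratio above (the 75-cycle figure is the GPU's L2-to-ESRAM service time from the SDK numbers):

```python
# Put the Durango CPU DRAM miss and the GPU's L2-to-ESRAM miss on the same
# time base by assuming the CPU clock runs at ~2x the GPU clock.
CPU_CYCLES_PER_GPU_CYCLE = 2
cpu_dram_miss   = (140, 160)   # CPU miss to DRAM, in CPU cycles
gpu_l2_to_esram = 75           # GPU L2 miss serviced from ESRAM, in GPU cycles

lo, hi = (c / CPU_CYCLES_PER_GPU_CYCLE for c in cpu_dram_miss)
print(f"CPU miss to DRAM   : ~{lo:.0f}-{hi:.0f} GPU cycles")
print(f"GPU L2 miss, ESRAM : {gpu_l2_to_esram}+ GPU cycles")
```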
For the GPU, the ESRAM is markedly faster. The more uniform access of the ESRAM and its on-die connections likely mean that hundreds of cycles bound up in queuing and bus traversal can be trimmed down, although at 125 GPU cycles (250 CPU cycles) the ESRAM path would still be glacial relative to what the CPUs would prefer.
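The same clock-ratio assumption, run the other way, shows why even the faster path looks slow from the CPU's side:

```python
# Express the GPU's full ESRAM path in CPU-length cycles and compare it to
# the CPU's own DRAM miss, again assuming a ~2:1 CPU:GPU clock ratio.
CPU_CYCLES_PER_GPU_CYCLE = 2
gpu_esram_path = 125           # GPU L1 miss + L2 miss to ESRAM, GPU cycles
cpu_dram_miss  = (140, 160)    # Durango CPU miss to DRAM, CPU cycles

print(f"GPU path to ESRAM: {gpu_esram_path * CPU_CYCLES_PER_GPU_CYCLE}+ CPU cycles")
print(f"CPU path to DRAM : {cpu_dram_miss[0]}-{cpu_dram_miss[1]} CPU cycles")
```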
It is not clear what numbers would be seen for the ROP path, which may benefit more readily from the 125 or so cycles saved.