Xbox One (Durango) Technical hardware investigation

Other than AMD or Nvidia, which GPU vendor could supply the Durango GPU? (PowerVR?)

Well, I don't know. I got the impression from the article that they didn't want to state anything that wasn't in the leaked document, thus protecting their source. Hence the lack of a vendor with regard to the GPU.
 
The CPU leak is boring/depressing in its lack of any special customizations, but it does give a couple of interesting numbers. I'm glad to see Jaguar doesn't worsen L2 latency over Bobcat despite having a 4x larger cache shared between 4 cores, but I guess this part was already known. Full L1 miss + L2 hit latency was 20 cycles (3 + 17) on Bobcat, so I assume that's the real cost here too (although it doesn't say 3 cycles for an L1 miss, just a hit).

I guess the most useful piece of data is the L2 miss cost. Not sure if that's in addition to the L2 hit cycles or not, so the latency could be anywhere from around 90 ns to 112 ns. Still mediocre, but much better than the Xbox 360's ~160 ns. Very curious what Orbis's will look like.
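For a rough sanity check on those figures, here's the cycles-to-nanoseconds arithmetic. The ~1.6 GHz clock is my assumption based on the rumored Jaguar clocks, not something stated in the leak:

Code:
# Convert between core cycles and wall-clock latency, assuming a ~1.6 GHz clock.
def cycles_to_ns(cycles, clock_ghz=1.6):
    return cycles / clock_ghz

def ns_to_cycles(ns, clock_ghz=1.6):
    return ns * clock_ghz

# L1 miss + L2 hit: 3 + 17 cycles (Bobcat figures quoted above)
print(cycles_to_ns(3 + 17))                   # ~12.5 ns

# The 90-112 ns memory latency range corresponds to roughly:
print(ns_to_cycles(90), ns_to_cycles(112))    # ~144 to ~179 core cycles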

This claim really surprised me: "On average, an x86 instruction is converted to 1.7 micro-operations." This is very far off from any data I've seen before, which was more like <1.2 µops/instruction, and that was for a uarch with worse µops than Jaguar. I'm sure this number came from somewhere, but I wonder if it was really for normal code.

But probably the most interesting thing is DeadmeatGA in the comments still insisting that Durango can't be using Jaguar, with the most cockamamie rationalizations imaginable.. lolz..
 
The CPU leak is boring/depressing in its lack of any special customizations, but it does give a couple of interesting numbers.
Indeed, I wish we had seen those rumored dual FMA units :(

Anyway, I think that sadly AMD had no time for this; it seems they finalized the design just in time for the CPU to make it into this next generation of consoles.
 
To the question of why MS would hire SoC engineers and not modify the CPU: even if they have input on Durango, the CPU is just one component of an SoC, so having multiple engineers with SoC experience doesn't necessitate customizing the CPU. Integrating the various components into a functional and efficient whole is itself a complex task, as is the physical design and implementation, which we don't see on block diagrams. There are vast numbers of SoCs that skip the headache of customizing the CPU as well.

I suppose this wouldn't rule out custom microcode, if even that is considered worthwhile.
I am leery of talk about customizing Jaguar, because of the potential time to market impact.
The current gen went in order because of the time to market impact of designing and validating (especially validating) an OoO core.
While Jaguar being an iteration of an OoO lineage saves the time hit of implementing OoO execution for the first time, customizations that involve changing blocks or replumbing the pipeline will re-introduce a design and validation delay unless you don't feel like bug-testing a highly pipelined self-scheduling superscalar processor.
It's sort of like asking why a designer might shy away from taking a 777 and "just" adding an extra engine.
 
^ Well, I don't know why people would jump to CPU modifications anyway. They can't trust AMD to do those? I think it will be pretty clear the SoC designers would be pretty busy with the move engines alone.
 
This claim really surprised me: "On average, an x86 instruction is converted to 1.7 micro-operations." This is very far off from any data I've seen before, which was more like <1.2 µops/instruction, and that was for a uarch with worse µops than Jaguar. I'm sure this number came from somewhere, but I wonder if it was really for normal code.
You are right. Who knows what instruction mix this number comes from. If we take Bobcat as a baseline, we get this:

[Bobcat Hot Chips slide: decoder statistics — roughly 89% of x86 instructions decode to 1 µop, 10% to 2 µops, and ~1% go to microcode]


That means the average number of µops per x86 instruction works out to:
0.89*1 + 0.10*2 + 0.01*x = 1.09 + 0.01*x

There is no way x can be large enough to arrive at 1.7 µops. So your <1.2 µops figure appears to be about right.
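Just to make that concrete, a quick sketch of the weighted average; the microcode µop counts I plug in below are made-up illustrations, not measured values:

Code:
# Average µops per x86 instruction from the Bobcat decode mix:
# 89% -> 1 µop, 10% -> 2 µops, 1% -> microcoded (x µops on average).
def avg_uops(x):
    return 0.89 * 1 + 0.10 * 2 + 0.01 * x

print(avg_uops(4))             # 1.13 -- even a generous microcode cost barely moves the average
print(avg_uops(10))            # 1.19
print((1.7 - 1.09) / 0.01)     # 61 -- microcoded instructions would need ~61 µops each to reach 1.7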

One thing to consider might be the 256-bit AVX vector instructions, which are always decoded into 2 µops. So for code using these heavily, it may come close to 1.7 µops. But I don't know if that is the scenario they used to come up with this number.
Edit: Maybe they just took a list of all supported instructions and calculated the average, irrespective of how often they are used in usual code. :LOL:
 
I hear that the name of the document is "Graphics on Durango".

Yeah, I have seen the doc and, to be honest, the vgleaks info goes into a lot more detail than what is in the doc. The only new thing in the doc is a page that compares the raw specs of the 360 and Durango. The funny thing is, according to the spec comparison, although Xenos has 22.4 GB/s to the RAM, it seems that typically the GPU only has 16 GB/s of bandwidth to itself. That's really about the most interesting new info there, alongside the other raw spec comparisons.
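As an aside, a quick sketch of where that 22.4 GB/s figure comes from; the doc only gives the ~16 GB/s "typical" number, so the interpretation of the remainder as CPU traffic is my own reading:

Code:
# Xbox 360 main memory: 700 MHz GDDR3 (1.4 GT/s effective) on a 128-bit bus.
data_rate_gt_s = 1.4
bus_width_bytes = 128 / 8

peak_bw = data_rate_gt_s * bus_width_bytes
print(peak_bw)          # 22.4 GB/s peak
print(16 / peak_bw)     # ~0.71 -> only about 71% of peak typically left for the GPU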
 
Edit: Maybe they just took a list of all supported instructions and calculated the average, irrespective of how often they are used in usual code. :LOL:

On further reflection I'm actually leaning towards this. The follow-up of "and many common x64 instructions are converted to 1 micro-operation" seems to fit with that. Such a number would be pretty arbitrary since it depends on how they classify different instructions vs. the same instruction with different operands; there's no real standard way to do that. Not that this metric is useful in the first place, beyond being an architectural curiosity.
 
I think they originally named them macro-ops. Bobcat/Jaguar use them, too. The "fused" macro-ops for instructions with memory operands, for instance, are split up later in the pipeline. I guess they just labelled the macro-ops as µops in that diagram because that is what the decoders actually spit out.

Edit: got ninja'd
 
I think they originally named them macro-ops. Bobcat/Jaguar use them, too. The "fused" macro-ops for instructions with memory operands, for instance, are split up later in the pipeline. I guess they just labelled the macro-ops as µops in that diagram because that is what the decoders actually spit out.

Edit: got ninja'd

Yeah, I didn't have a lot of confidence in that second thing in the post and removed it :p I think the Bobcat diagram is referring to the µops that the output COPs are composed of, and if that's the case a dynamic proportion is the only thing that really makes sense. They are calling them COPs in the papers, but good luck getting marketing slides to match engineering terminology. If they're like the classic macro-ops they should be able to handle three µops for full RMW support (IIRC).

I have to go read the Bobcat and Jaguar papers again; my memory is fuzzy on this. I can't remember if a Bobcat decoder can output one or two COPs. On K7 a decoder could put out only one macro-op; on K8 they made them capable of putting out two (fast-path doubles), but I think that required multiple decoders to do it. At the very least it's now this way on Bulldozer. So I'm wondering if Jaguar needs to use both its decoders, or a single decoder over two cycles, to output its equivalent of fast-path doubles, and if this is how AVX instructions are implemented. If they need 2x the decoder resources of 128-bit instructions then there's little reason to use them. I imagine they will, because this is how SSE2 was implemented on K8 and this is how AVX is implemented on Bulldozer.
 
FEB 2012.

Thanks. So with the VGleaks article we could assume their info is at least slightly more recent.

As we haven't heard of any more Durango developer meetings (where I think this document originates) we have to assume for now little/nothing has changed spec wise.
 
Thanks. So with the VGleaks article we could assume their info is at least slightly more recent.

As we haven't heard of any more Durango developer meetings (where I think this document originates) we have to assume for now little/nothing has changed spec wise.

I don't know how recent theirs is, but the doc in question is more of a key-points summary, with vgleaks' going into more detail. Theirs and most others are from SuperDaE; I don't know how recent his info is. Personally I don't know if anything has changed; if it has, it will probably be minor stuff. Either way we will know better when the console is announced.
 
Yeah, I have seen the doc and, to be honest, the vgleaks info goes into a lot more detail than what is in the doc. The only new thing in the doc is a page that compares the raw specs of the 360 and Durango. The funny thing is, according to the spec comparison, although Xenos has 22.4 GB/s to the RAM, it seems that typically the GPU only has 16 GB/s of bandwidth to itself. That's really about the most interesting new info there, alongside the other raw spec comparisons.

Is that the same one that was being talked about, where it's said the Durango GPU is 100% efficient?
 
It's difficult to square these specs with the 7970 in the alpha kits.

Actually, the doc made no reference to the exact card in the alpha kit, only that the GPU will be "Near-final GPU architecture, emulates final Durango performance", while the beta kit will have "Durango GPU architecture". Also, the alpha kit has a split memory architecture, no ESRAM, no Move Engines, and only one display plane. The beta kit will have unified memory, ESRAM, Move Engines, and 2 hardware display planes.
 