Xbox One (Durango) Technical hardware investigation

Extremetech is wrong, though. Nothing changed there. Each GCN-based GPU has a 64-way associative L1; it consists of just 4 sets (that's probably what they mixed up). The older VLIW architectures even had a fully associative L1 (which meant 128-way associativity and just a single set comprising the full 8 kB). Latency is high anyway, so maximizing the hit rate is what matters here.
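For anyone who wants to sanity-check that cache math, the relationship is sets = cache size / (line size × ways). A quick sketch, assuming the commonly cited 16 kB GCN vector L1 and 64-byte lines (the 8 kB VLIW figure is from the post above; treat the sizes as assumptions, not vendor-confirmed specs):

```c
/* sets = cache_size / (line_size * associativity) */
#include <stdio.h>

static unsigned sets(unsigned cache_bytes, unsigned line_bytes, unsigned ways)
{
    return cache_bytes / (line_bytes * ways);
}

int main(void)
{
    /* GCN: 16 kB L1, 64 B lines, 64-way -> 4 sets */
    printf("GCN L1 sets:  %u\n", sets(16 * 1024, 64, 64));
    /* VLIW: 8 kB L1, 64 B lines, 128-way -> 1 set (fully associative) */
    printf("VLIW L1 sets: %u\n", sets(8 * 1024, 64, 128));
    return 0;
}
```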

Well, there goes that theory, so outside of the SRAM and associated DMA's, it looks like a vanilla GCN GPU then?
 
I would imagine that includes the video compression/decompression unit, the audio DSP, zlib and the like.

Yup, exactly what Orbis has. Also some hardware dedicated to whatever these display planes are.

(please note, I mention Orbis not to draw a comparison, but to show that similar architectures would likely result in similar fixed function blocks)
 
Well, there goes that theory, so outside of the SRAM and associated DMA's, it looks like a vanilla GCN GPU then?
The L2 associativity is supposed to be 8-way in Durango and 16-way for a usual GCN GPU. But I actually don't know if that may already be part of the adaptation to the eSRAM and the memory controllers in the SoC. Or maybe it even differs between the GCN models (i.e. Tahiti has 16-way, Cape Verde just 8-way), I don't know. In any case, it's nothing major, as it is basically already outside of the core GCN architecture.
 
Every off-the-shelf GPU has fixed-function elements, for example the ROPs.

So they are calling "accelerators" things that are already in all GPUs? That doesn't make any sense to me.
Of course there will be audio accelerators and video accelerators; those are the blocks from the leaked diagram. But aegies and others say that this is accurate but incomplete. I wonder when we will have a complete idea of the whole structure. Anyway, Durango seems like a complex, long-thought-out system to me.
 
So they are calling "accelerators" things that are already in all GPUs? That doesn't make any sense to me.
Of course there will be audio accelerators and video accelerators; those are the blocks from the leaked diagram. But aegies and others say that this is accurate but incomplete. I wonder when we will have a complete idea of the whole structure. Anyway, Durango seems like a complex, long-thought-out system to me.

Vgleaks hasn't finished dumping their supposed knowledge about Durango.

According to this poll, they have more to reveal on the audio blocks, display planes, Kinect, memory architecture, video compression and more.
 
Yup, exactly what Orbis has. Also some hardware dedicated to whatever these display planes are.

AFAICT display planes are different 'output buckets' that can be used for anything from multiple outputs to overlays...

Given what we suspect about Durango, "overlay" is a very good bet. For example:
- 1 display plane for apps.
- 1 display plane for the game.
- 1 display plane for video (the HDMI input and/or any other source).

Then some magic pixie dust to overlay them in the correct order and resize them before sending the result to the HDMI out (e.g. it might allow your TV program to run with a Twitter feed down the left, or maybe a game installation window in the top right, etc., and when gaming it might show you the current program along with a video chat window).
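To make the overlay idea concrete, here is a minimal back-to-front compositing sketch. Everything in it (the plane_t struct, the fixed order, straight-alpha blending) is an assumption for illustration, not the actual Durango display hardware:

```c
#include <stdint.h>
#include <stdio.h>

typedef struct {
    const uint32_t *pixels;  /* packed 8-bit channels, alpha in the top byte */
    int enabled;
} plane_t;

/* Blend src over dst using the source alpha (straight alpha). */
static uint32_t blend_over(uint32_t dst, uint32_t src)
{
    uint32_t a = (src >> 24) & 0xFF;
    uint32_t out = 0;
    for (int shift = 0; shift < 24; shift += 8) {
        uint32_t d = (dst >> shift) & 0xFF;
        uint32_t s = (src >> shift) & 0xFF;
        out |= (((s * a + d * (255 - a)) / 255) & 0xFF) << shift;
    }
    return out | (0xFFu << 24);   /* final output is opaque */
}

/* Composite planes back-to-front, e.g. app UI, then game, then video. */
static void compose(uint32_t *out, const plane_t *planes, int n, int npixels)
{
    for (int i = 0; i < npixels; i++) {
        uint32_t px = 0xFF000000;            /* opaque black background */
        for (int p = 0; p < n; p++)
            if (planes[p].enabled)
                px = blend_over(px, planes[p].pixels[i]);
        out[i] = px;
    }
}

int main(void)
{
    uint32_t game  = 0xFF0000FFu;            /* opaque game pixel */
    uint32_t video = 0x80FFFFFFu;            /* half-transparent overlay */
    plane_t planes[2] = { { &game, 1 }, { &video, 1 } };
    uint32_t out;
    compose(&out, planes, 2, 1);
    printf("composited pixel: 0x%08X\n", out);
    return 0;
}
```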
 
The DMEs sound cool I guess.

The DMEs may be able to speed up data transfer between the embedded RAM pool and the DDR3 RAM pool but then again I can't see where this is an advantage when the competitor uses a single RAM pool with almost three times the bandwidth. Sounds like damage control to me.
 
The DMEs may be able to speed up data transfer between the embedded RAM pool and the DDR3 RAM pool but then again I can't see where this is an advantage when the competitor uses a single RAM pool with almost three times the bandwidth. Sounds like damage control to me.

Damage control is a quick response to an accident. A planned custom functional block is not a quick response. It's cost reduction, not damage control.
 
The DMEs may be able to speed up data transfer between the embedded RAM pool and the DDR3 RAM pool but then again I can't see where this is an advantage when the competitor uses a single RAM pool with almost three times the bandwidth. Sounds like damage control to me.

Seems to me that you misunderstand what the DMEs are here for.

The advantage of the move engines lies in the fact that they can operate in parallel with computation. During times when the GPU is compute bound, move engine operations are effectively free. Even while the GPU is bandwidth bound, move engine operations may still be free if they use different pathways. For example, a move engine copy from RAM to RAM would not be impacted by a shader that only accesses ESRAM.

This is what the 4 Orbis CUs could be used for, if they have another path to RAM; of course, Orbis doesn't have a separate RAM block (eSRAM or whatever).
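The overlap described in that VGleaks quote is basically fire-and-forget DMA with a sync point at the place of use. A toy sketch, where the dme_* names are hypothetical stand-ins and the "async" copy is stubbed with memcpy (real hardware would run it in parallel with the shader work):

```c
#include <stddef.h>
#include <string.h>
#include <stdio.h>

typedef struct { int done; } dme_fence;

static dme_fence dme_copy_async(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);       /* stand-in: hardware would do this async */
    return (dme_fence){ .done = 1 };
}

static void dme_wait(const dme_fence *f) { (void)f; /* spin on real HW */ }

static void run_esram_only_pass(void)
{
    /* Shader work that touches only ESRAM would not contend with the
     * RAM-to-RAM copy above, per the quoted description. */
    puts("esram pass running while copy is in flight");
}

int main(void)
{
    char src[256] = "scene data", dst[256];
    dme_fence f = dme_copy_async(dst, src, sizeof src);  /* kick the copy */
    run_esram_only_pass();                               /* overlap work  */
    dme_wait(&f);                                        /* sync at use   */
    printf("copied: %s\n", dst);
    return 0;
}
```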
 
Seems to me that you misunderstand what the DMEs are here for.



This is what the 4 Orbis CUs could be used for, if they have another path to RAM; of course, Orbis doesn't have a separate RAM block (eSRAM or whatever).

You really sound like Mistercteam now. The DMEs have been explained, and no forumer has said what you are saying. The DMEs are there to compensate for the lack of bandwidth of the DDR3.

http://www.neogaf.com/forum/showpost.php?p=47375749&postcount=127
 
You really sound like Mistercteam now. The DMEs have been explained, and no forumer has said what you are saying. The DMEs are there to compensate for the lack of bandwidth of the DDR3.

http://www.neogaf.com/forum/showpost.php?p=47375749&postcount=127

sky*sony*, before you take offence at another member, think twice.

And NO, the DMEs are there to compute functions that on other GPUs are done by shaders; plus, those units can operate when the GPU is stalled on the compute side or the bandwidth side.
Be less rude and learn to read.
 
The DMEs may be able to speed up data transfer between the embedded RAM pool and the DDR3 RAM pool but then again I can't see where this is an advantage when the competitor uses a single RAM pool with almost three times the bandwidth. Sounds like damage control to me.

The DMEs are what ensure that the GPU doesn't become bandwidth limited, by facilitating tiling and otherwise making sure data is placed where the GPU can access it quickly enough to keep the compute units from being starved for data. They also allow useful work to be performed with any leftover bandwidth when the GPU and CPU aren't using all of the bandwidth of both pools.

This fits with my suspicion that the custom hardware in Durango is intended to improve utilization of the available system resources, not provide more.
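On the tiling point: one concrete job attributed to the move engines is converting between linear and tiled texture layouts. A minimal sketch using Morton (Z-order) addressing as a stand-in tiling scheme, since the real Durango tile format isn't public:

```c
#include <stdint.h>
#include <stdio.h>

/* Interleave the low 16 bits of x and y into a Z-order index. */
static uint32_t morton2d(uint32_t x, uint32_t y)
{
    uint32_t out = 0;
    for (int b = 0; b < 16; b++) {
        out |= ((x >> b) & 1u) << (2 * b);
        out |= ((y >> b) & 1u) << (2 * b + 1);
    }
    return out;
}

/* Copy a linear (row-major) image into a Z-order tiled layout.
 * Assumes a square, power-of-two image so the Morton index stays in range. */
static void tile_image(uint32_t *tiled, const uint32_t *linear,
                       uint32_t width, uint32_t height)
{
    for (uint32_t y = 0; y < height; y++)
        for (uint32_t x = 0; x < width; x++)
            tiled[morton2d(x, y)] = linear[y * width + x];
}

int main(void)
{
    enum { N = 4 };
    uint32_t linear[N * N], tiled[N * N];
    for (int i = 0; i < N * N; i++) linear[i] = (uint32_t)i;
    tile_image(tiled, linear, N, N);
    printf("linear (1,0) = %u lands at tiled index %u\n",
           linear[1], morton2d(1, 0));
    return 0;
}
```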
 
The advantage of the move engines lies in the fact that they can operate in parallel with computation. During times when the GPU is compute bound, move engine operations are effectively free. Even while the GPU is bandwidth bound, move engine operations may still be free if they use different pathways. For example, a move engine copy from RAM to RAM would not be impacted by a shader that only accesses ESRAM.

This is what the 4 Orbis CUs could be used for, if they have another path to RAM; of course, Orbis doesn't have a separate RAM block (eSRAM or whatever).

Are the rumored 4 CUs different, and more flexible/efficient, than the 14 CUs? If so, they could be used independently to prep the data for the other 14 to churn through. That may help lower the bandwidth and compute load, depending on how the devs use them. It's similar in idea to the SPUs. I guess they can also throw all 18 at the same thing under "ideal" conditions.

The DMEs are something else altogether. But I am curious how they compare to the 2 DMA units in an equivalent AMD GPU (besides the JPEG compression and the number of units).

I also think that, ideally, MS would want their GPU to be closer to GCN2 to be more efficient, but I haven't read up enough on it yet. ^_^
 
VGLeaks says
When used for their designed purpose, however, they can offload work from the rest of the system and obtain useful results at minimal cost.

as the tasks the DMEs handle are often folded into the shaders' workload on classic GPUs
 
The DMEs may be able to speed up data transfer between the embedded RAM pool and the DDR3 RAM pool but then again I can't see where this is an advantage when the competitor uses a single RAM pool with almost three times the bandwidth. Sounds like damage control to me.

The primary use I can see is efficient virtual texturing. The ability to get 4 Mpixels' worth of JPEG turned into properly swizzled and prepared textures per frame, without having to use any of the primary processing for it, is nothing to scoff at for modern virtual texturing engines.
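Back-of-the-envelope on that figure (only the 4 Mpixels/frame is from the post; frame rate and output format are my assumptions):

```c
#include <stdio.h>

int main(void)
{
    const double mpixels_per_frame = 4.0;   /* from the post above  */
    const double fps = 30.0;                /* assumed frame rate   */
    const double bytes_per_pixel = 4.0;     /* assumed RGBA8 output */

    double decoded_gb_per_s =
        mpixels_per_frame * 1e6 * fps * bytes_per_pixel / 1e9;
    printf("decoded texel output: %.2f GB/s\n", decoded_gb_per_s);
    /* ~0.48 GB/s at 30 fps: a steady stream the shader cores never have
     * to spend bandwidth or ALU on if the move engines handle it. */
    return 0;
}
```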
 
All this makes me wonder how the GPU sees the two pools of RAM (main RAM and the scratchpad). I was assuming (and some others were too) that the move engines' role was to let the GPU see those two pools of RAM (or fake it).

The data provided by VGleaks seem to imply that the GPU by itself is able to deal with those two pools of RAM without having to resort to the DMEs.

The data are a bit confusing to me, as they use "shader" for the GPU; it is unclear how much bandwidth the "shader cores" (the SIMDs/CUs) or the ROPs have to the scratchpad memory.


I have to say I'm a bit lost: can the CPU access the scratchpad, and would that even help in some way?
Instead of pre-fetching data using the CPU cores, you could set the DMEs to gather (or scatter) data from main RAM. I would think that the latency for the CPU to read or write from that pool of memory would be (too) high, but if the CPU could pre-fetch from the scratchpad it might "work".

I think of those big data structures used, for example, by Epic in UE4: could they be kept compressed in RAM, with the relevant parts loaded into the scratchpad memory (still compressed), and then, on request from the CPU, streamed and decompressed (on the fly) into the CPU cache?
(I wonder the same about virtual texturing, or the data structures used by Kinect.)
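As a toy model of that keep-it-compressed idea, zlib's one-shot compress()/uncompress() can stand in for the rumored hardware (de)compressor (link with -lz; the buffer sizes and data are illustrative):

```c
#include <stdio.h>
#include <string.h>
#include <zlib.h>

#define PAYLOAD 4096

int main(void)
{
    unsigned char raw[PAYLOAD];          /* the bulk data structure  */
    memset(raw, 7, sizeof raw);          /* highly compressible toy  */

    /* Stage it compressed, as the scratchpad-resident copy would be.
     * PAYLOAD + 128 comfortably exceeds compressBound(PAYLOAD). */
    unsigned char staged[PAYLOAD + 128];
    uLongf staged_len = sizeof staged;
    if (compress(staged, &staged_len, raw, PAYLOAD) != Z_OK)
        return 1;

    /* On demand, decompress straight into a CPU-side working buffer. */
    unsigned char working[PAYLOAD];
    uLongf out_len = sizeof working;
    if (uncompress(working, &out_len, staged, staged_len) != Z_OK)
        return 1;

    printf("staged %lu bytes for %d raw (%.1fx)\n",
           (unsigned long)staged_len, PAYLOAD, (double)PAYLOAD / staged_len);
    return 0;
}
```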

If it is used by the CPU (too), the 25.6 GB/s of bandwidth is less of an issue, because no matter how the system is put together (1 or 2 chips) I don't expect the CPU to be able to consume that much more bandwidth.

Another idea: could the CPU set up commands for the GPU in the scratchpad, with the data compressed on the fly and then decompressed on the fly by the GPU when it reads them?

Another thing is tessellation. I remember reading that previous AMD GPUs had to dump data to RAM (depending on the level of geometry amplification). Could it be a win if the GPU could dump that data (compressed on the fly) into the scratchpad?

It would be interesting if the scratchpad were used not "overall" by the whole system (and mostly the GPU) as a monolithic place to "render" (/deal with bandwidth-intensive operations like blending), but instead in many different ways, as a buffer, by the units within the system (I mean a general-purpose scratchpad memory).

What compression ratio can we expect? Or rather, "how big" could the 32 MB of memory effectively be made?

EDIT

Another question that pops into my head is the amount of RAM and the number of CPU cores supposedly reserved to run the "OS"; it sounds like quite a lot to me.
I wonder if part of that reservation could be used to run the "API" (/system-level driver) Edge spoke about: could a core or more (as well as some memory) be used to manage the DMEs, and to track which "system" in your program (/game) uses the scratchpad, and to what extent?
I have no idea here, just wondering (though my wording is unclear) whether, out of these resources (a lot, going by the rumors), a lot is used to make the scratchpad (holding, more often than not, only compressed content) act pretty transparently as what could be a big L3 for the whole system (CPU and GPU). It could stream stuff in advance into the reserved RAM (be it virtual textures, data structures holding voxels, Kinect data, or whatnot) and then try to put that stuff into the scratchpad in time for either the GPU or the CPU to use.
 