Xbox One (Durango) Technical hardware investigation

AFAICS the Data Move engines aren't going to make up any computational difference; they might let software better exploit the 32MB scratchpad.
If that scratchpad is low latency then it could make a large difference to the efficiency of the system.
IMO, and from what I've been told, most shaders spend more time waiting on memory than they do computing values. If that pool of memory is similar to L2 cache performance, you'd be looking at a cache miss dropping from 300+ GPU cycles to 10-20. Hiding 10-20 cycles is a lot easier than hiding 300.

IF the ESRAM pool is low latency then I think the Durango architecture is interesting.
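To put rough numbers on the latency-hiding point, here is a back-of-the-envelope sketch; the ALU-work figure and the simple "one wavefront covers another's stall" model are assumptions for illustration, not Durango specifics:

```cpp
#include <cstdio>

// Back-of-the-envelope latency-hiding estimate. The figures are illustrative
// assumptions: a shader does some amount of ALU work between dependent memory
// accesses, and we ask how many wavefronts a SIMD needs in flight so the ALUs
// stay busy while one wavefront waits on memory.
int wavefronts_needed(int miss_latency_cycles, int alu_cycles_between_loads) {
    // While one wavefront stalls for 'miss_latency_cycles', other wavefronts
    // each contribute 'alu_cycles_between_loads' of useful work.
    return 1 + (miss_latency_cycles + alu_cycles_between_loads - 1)
                 / alu_cycles_between_loads;
}

int main() {
    const int alu_work = 20; // assumed ALU cycles between dependent loads
    std::printf("~%d wavefronts to hide a 300-cycle miss\n",
                wavefronts_needed(300, alu_work));
    std::printf("~%d wavefronts to hide a 20-cycle access\n",
                wavefronts_needed(20, alu_work));
    return 0;
}
```

On those assumptions a 300-cycle miss wants on the order of 16 wavefronts in flight to stay busy, while a 20-cycle access is covered by two, which is the "easier to hide" point in concrete terms.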
 
It would depend on how the input is configured.
It "could" certainly be limited to 1 prim/clk, or it could be more; these things aren't off-the-shelf parts jammed onto a piece of silicon. If MS thought they needed more, I can't imagine that level of customization would be out of order.
May I ask you something? It is related to those "rumored" move engines.

I spoke about it but it got lost in the noise.

In the 360, the GPU (I mean the GPU part within Xenos) can't "see" what is in the eDRAM, right?
Still, I guess that the GPU can initiate a copy of data either from the RAM to the eDRAM or the other way around.

With regard to those move engines, I thought it could be that AMD could not really make "it work together". By that I mean they could not really have the GPU "perceive" (sorry for the lack of technical wording) the main RAM (as I think the GPU acts as the north-bridge à la 360) and the scratchpad equally.
A result of that could be that (contrary to my hypothesis) the shader cores can't really read or write within that second pool of RAM as they would from the main RAM.

If the system acts à la 360, it means that the system will have to dump things into RAM if the GPU (as in the shader cores) is to do something with what has been rendered, right?
It worked in the 360, and I don't think they could do better at that time, but overall it looks a bit wasteful and not that flexible, especially if the scratchpad is on the GPU.
As MSFT uses SRAM it has to be on the GPU die; if it were an external chip it would most likely be plain RAM, Wide IO, the whole thing on an interposer. It doesn't make sense to go with only 32MB if that is not part of the GPU, imo. If it is off-chip, eDRAM makes more sense; going off-chip kills any latency advantage the SRAM has.

So, as I stated earlier, I wonder if those DMA engines are just a "technicality" to compensate for the GPU not being able to "see" the scratchpad on an equal footing with the main RAM.

Could it be something like that:
The scratchpad memory is given an address range within the main RAM's address space (whereas it obviously isn't physically there), so when the GPU wants to touch the scratchpad it uses that range. Obviously there is nothing in the main RAM at that range (it would be just a trick). The DMA engine somehow intercepts all the requests that touch that address range and proceeds to move the data as requested:
it could move all the data together, like what happens in the 360, but it could also stream data to the shader cores.

As I see it, it could be a trick/technicality to make the whole thing "work" more or less transparently to both the GPU and the software.
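A minimal sketch of that aliasing idea, purely hypothetical; the aperture base, the MoveEngine type, and every address figure below are made up for illustration:

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical illustration only. The idea: the 32MB scratchpad is exposed
// through an aperture in the ordinary physical address map, and a move/DMA
// engine services transfers between that aperture and main RAM, so software
// can treat a scratchpad surface much like any other allocation.
constexpr uint64_t kApertureBase = 0x80000000ull;       // made-up aperture base
constexpr uint64_t kApertureSize = 32ull * 1024 * 1024; // 32MB scratchpad

struct MoveEngine {
    // Stand-in for kicking off a hardware copy; here it just logs the request.
    void copy(uint64_t dst, uint64_t src, uint64_t bytes) const {
        std::printf("DMA copy 0x%llx -> 0x%llx (%llu bytes)\n",
                    (unsigned long long)src, (unsigned long long)dst,
                    (unsigned long long)bytes);
    }
};

bool in_scratchpad(uint64_t addr) {
    return addr >= kApertureBase && addr < kApertureBase + kApertureSize;
}

int main() {
    MoveEngine dme;
    uint64_t render_target = kApertureBase;  // surface "lives" in the aperture
    uint64_t texture_copy  = 0x20000000ull;  // made-up main-RAM destination
    // Before the shader cores consume the rendered data from main RAM
    // (the 360-style flow described above), the move engine resolves it out.
    if (in_scratchpad(render_target))
        dme.copy(texture_copy, render_target, 4ull * 1024 * 1024);
    return 0;
}
```

Whether Durango actually works like that is exactly the open question; the sketch only shows why an aperture plus a DMA engine could make the scratchpad look roughly transparent to software.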

What do you think? Sounds sane?
 
As I said above I think the architecture is only interesting IF the ESRAM pool can be used as input to the GPU and if it's low latency.
It's my understanding that at least the first part of that is possible.
I've been told what the data move engines do, and they're useful, but in themselves they are not game changing IMO.
 
As I said above I think the architecture is only interesting IF the ESRAM pool can be used as input to the GPU and if it's low latency.
It's my understanding that at least the first part of that is possible.
I've been told what the data move engines do, and they're useful, but in themselves they are not game changing IMO.

Why wouldn't it be low latency?
 
Is 8GB (or 5GB) memory and 50GB max game size considered a reasonable ratio?
I'm not seeing how that amount of memory can help large open world games if these games get limited by storage space.

They're not really limited by storage space; it's more the rate at which they can stream data.
Very few titles this gen got close to filling 50GB and for most of those it was multiple copies of audio and video for localizations.
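A rough illustration of why the streaming rate is the real ceiling; the transfer rate and memory figure below are assumptions for the arithmetic, not confirmed specs:

```cpp
#include <cstdio>

int main() {
    // Assumed, illustrative figures: a 5GB game-visible pool refilled from a
    // HDD sustaining ~60 MB/s of streaming reads.
    const double ram_mb       = 5.0 * 1024.0;
    const double hdd_mb_per_s = 60.0;
    std::printf("Refilling %.0f MB at %.0f MB/s takes ~%.0f seconds\n",
                ram_mb, hdd_mb_per_s, ram_mb / hdd_mb_per_s);
    return 0;
}
```

On those numbers, even a 5GB working set takes around a minute and a half to turn over completely, so how much sits on the disc matters far less than how fast it can be pulled in.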
 
10-20 cycles is more along the lines of an L2 hit for a CPU. If that's the barrier for making a scratchpad interesting for a GPU, then the scratchpad most likely cannot physically ever be interesting.

That, and we'd need some benchmarking of cache latencies for the GPU.
At least for the prior VLIW GPUs, an L1 cache hit was measured to be over a hundred cycles, which makes the latency of an on-die hit noise.
 
As I said above I think the architecture is only interesting IF the ESRAM pool can be used as input to the GPU and if it's low latency.
It's my understanding that at least the first part of that is possible.
I've been told what the data move engines do, and they're useful, but in themselves they are not game changing IMO.

If there are four move engines, does each have its own function, or do they all do the same thing in parallel?
If the ESRAM is low latency and it saves so many cycles, I can see this machine behaving like a 680, in fact. The question then will be why something like that isn't included in a future 680 to convert it into a 980 ;).

Another thing I suspect is that the GPU could have 768 shaders, but divided into more than 12 CUs...
 
If it's eDRAM with an SRAM name, like 1T-SRAM.

It would not be low latency if it is, for example, off-chip like the smart eDRAM in Xenos.

OK, I assumed it would be on-chip due to its size. 64MB to 96MB was thrown around for a Xenos-like implementation with all the AA and other effects thrown into that buffer. It seems like if they went down to 32MB it would be so that it could be more capable. Perhaps not.
 
As I said above I think the architecture is only interesting IF the ESRAM pool can be used as input to the GPU and if it's low latency.
It's my understanding that at least the first part of that is possible.
I've been told what the data move engines do, and they're useful, but in themselves they are not game changing IMO.
OK so I can't really comment on my "brain farts" :LOL:

Though going by your wording, it sounds to me like they are a "technicality" to make things happen more than anything else.
I don't mean irrelevant: if they enable the GPU to render to either the main RAM or the scratchpad, or to read/write from either of those memory pools, it is a huge win. Still, it is just about "making it happen", if the wording makes sense.
 
10-20 cycles is more along the lines of an L2 hit for a CPU. If that's the barrier for making a scratchpad interesting for a GPU, then the scratchpad most likely cannot physically ever be interesting.

That, and we'd need some benchmarking of cache latencies for the GPU.
At least for the prior VLIW GPUs, an L1 cache hit was measured to be over a hundred cycles, which makes the latency of an on-die hit noise.

Why is latency so high on a GPU? Over a hundred cycles for an L1 hit seems huge! My A64 @ 800 MHz had a main memory latency of about 120 cycles, so those GPU figures seem surprising.
 
I doubt it would be called a "Data Move Engine" if it's as fancy as some kind of HW scheduler.

Imho it's just a DMA engine, and because of second-hand rumours of people crying "secret sauce", together with people like Aegis on NeoGaf (who even admits he's not a HW engineer and has no clue about what he's seeing in the docs in his possession), people are trying to read too much into what are essentially quick-fix bandwidth-saving features to mitigate the slow main memory bandwidth.

In short, MS wanted 8GB of main RAM. DDR3 was their only option (perhaps they evaluated HMC and stacking with TSVs/interposers but realised it wouldn't be ready in time). They went with DDR3, but needed a high-bandwidth scratchpad and memexport engine in order to ensure their main components weren't bandwidth starved. Joe Public and Joe Gaming Journalist catch wind after months of being drip-fed false rumours of MS's nextbox being a "beast" by overexcited devs and internet trolls, and now start trying to read too much into, and rationalise, some "secret sauce" and magic voodoo out of what is effectively a relatively low-cost/low-performance console design. Acert93 and Interference are both right.
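For a sense of the bandwidth being worked around (the bus width and data rate below are assumptions based on common DDR3 configurations, not confirmed Durango figures):

```cpp
#include <cstdio>

int main() {
    // Assumed, illustrative DDR3 configuration.
    const double bus_bits      = 256.0;   // assumed memory bus width
    const double data_rate_mts = 2133.0;  // assumed DDR3-2133 transfer rate

    // Peak bandwidth = bus width in bytes * transfers per second.
    const double gb_per_s = bus_bits / 8.0 * data_rate_mts * 1e6 / 1e9;
    std::printf("Peak DDR3 bandwidth: ~%.0f GB/s\n", gb_per_s);
    return 0;
}
```

Roughly 68 GB/s on those assumptions, which is modest next to a wide GDDR5 setup; hence the appeal of a fast on-die pool plus engines to shuttle data into and out of it.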

If the fast memory pool isn't treated as a cache, I would lean towards it being an (at least partly) software-controlled scratchpad. Filling and committing data to and from the primary address space would be handled by the CPU/GPU threads using it.

The scratchpad has some unknown granularity and banking. If not kept coherent, it could be composed of a number of large banks, possibly at the granularity of a memory page.
DMA engine(s) could offload most of the grunt work of arbitrating access and moving data back and forth to this pool of memory.
In this scenario, the SRAM/eDRAM pool is on a parallel portion of the non-coherent bus used by the GPU.

The data engines would have ancestry in the DMA engines discrete GPUs have had for quite a while, or the DMA engines used in a number of server and HPC designs. It would save time and power compared to having a CPU or GPU send commands and read scratchpad memory back into coherent memory space before writing it back out, then read in new data and export it back to the scratchpad.
A DMA engine for the GPU, a DMA engine for the CPUs, and maybe a DMA engine for everything else.

It's not known how many independent accesses the memory pool can support in that scenario. Even if it's not three, accelerating and offloading all the little moves and access negotiations implied in managing such a large memory space might make the scratchpad more easily utilized.
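A minimal sketch of the offload pattern being described, assuming a hypothetical dma_submit()/dma_wait() interface standing in for whatever the real engines expose:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

// Hypothetical DMA-engine interface; names and semantics are assumptions used
// to illustrate the pattern, not a real API. The stubs just do a synchronous
// memcpy so the sketch runs; a real engine would perform the copy in hardware.
struct DmaTicket { bool done; };

DmaTicket dma_submit(void* dst, const void* src, size_t bytes) {
    std::memcpy(dst, src, bytes);   // stand-in for kicking off an async copy
    return DmaTicket{true};
}
void dma_wait(DmaTicket) {}         // stand-in for waiting on completion

// Double-buffered processing of a large dataset through a small fast pool:
// while compute works on the tile already resident in the scratchpad, the
// move engine streams the next tile in from main RAM, so CPU/GPU threads
// never spend their own cycles on the bulk copies.
// (Assumes total_bytes is a multiple of tile_bytes.)
void process_stream(uint8_t* main_ram, size_t total_bytes,
                    uint8_t* scratch_a, uint8_t* scratch_b, size_t tile_bytes,
                    void (*compute)(uint8_t* tile, size_t bytes)) {
    uint8_t* buffers[2] = { scratch_a, scratch_b };
    DmaTicket pending = dma_submit(buffers[0], main_ram, tile_bytes);

    for (size_t off = 0, i = 0; off < total_bytes; off += tile_bytes, ++i) {
        dma_wait(pending);                     // tile i is now in the scratchpad
        size_t next = off + tile_bytes;
        if (next < total_bytes)                // prefetch tile i+1 into the other buffer
            pending = dma_submit(buffers[(i + 1) % 2], main_ram + next, tile_bytes);
        compute(buffers[i % 2], tile_bytes);   // work out of the fast pool
    }
}

static void touch_tile(uint8_t* tile, size_t bytes) {
    uint32_t sum = 0;                          // placeholder "work": just read the tile
    for (size_t i = 0; i < bytes; ++i) sum += tile[i];
    std::printf("processed %zu bytes (checksum %u)\n", bytes, sum);
}

int main() {
    std::vector<uint8_t> main_ram(4 * 4096, 1);  // pretend DRAM contents
    std::vector<uint8_t> a(4096), b(4096);       // two scratchpad tiles
    process_stream(main_ram.data(), main_ram.size(), a.data(), b.data(), 4096, touch_tile);
    return 0;
}
```

The only point is that the bulk copies and completion tracking move off the CPU/GPU threads; the granularity, banking, and coherence questions above are exactly what this sketch glosses over.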

These are the type of answers I wanted to arrive at, but lacked the knowledge to get there.
 
Why is latency so high on a GPU? Over a hundred cycles for an L1 hit seems huge! My A64 @ 800 MHz had a main memory latency of about 120 cycles, so those GPU figures seem surprising.

The whole GPU execution and memory pipeline is structured around the near-worst case scenario of having to go out to graphics memory. It sustains a very large number of in-flight accesses and has very weak consistency if any consistency is to be had at all.
It's geared for high latency tolerance and very high throughput, which penalizes minimum and average latency.

This may have changed somewhat with GCN's introduction of a read/write memory pipeline, but I haven't seen this benchmarked. SGEMM and DGEMM code testing done a while back on this forum showed very high cache hit times (edit: for prior generations).
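For reference, cache and memory latencies like the ones quoted above are usually measured with a dependent pointer chase, where each load's address comes from the previous load so nothing can overlap. A CPU-side sketch of the technique (the GPU measurements referenced above do the same thing in shader code):

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

int main() {
    // Working-set size chooses which level of the hierarchy gets measured;
    // 16K entries * 4 bytes = 64KB here, adjust to target L1/L2/DRAM.
    const size_t n = 1 << 14;
    std::vector<uint32_t> next(n);
    std::iota(next.begin(), next.end(), 0u);

    // Sattolo's algorithm: build one random cycle over all n slots so the
    // chase walks the whole working set in a cache-unfriendly order.
    std::mt19937 rng{42};
    for (size_t i = n - 1; i > 0; --i) {
        std::uniform_int_distribution<size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(rng)]);
    }

    // Serially dependent loads: each iteration needs the previous result, so
    // the average time per step approximates load-to-use latency for this set.
    const size_t steps = 10000000;
    uint32_t idx = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < steps; ++i)
        idx = next[idx];
    auto t1 = std::chrono::steady_clock::now();

    const double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
    std::printf("~%.2f ns per dependent load (idx=%u)\n", ns, idx);
    return 0;
}
```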
 
Actually, no, it's the other way :)

The completely unique texture coverage has allowed them to not use lightmaps at all and to bake all the lighting into the color texture. So the lighting has the same fidelity as the textures, which is extremely high compared to any other game.
Games using lightmaps have a second set of UVs with a much lower texel density, so the quality of the lightmaps is pretty low. Most levels in games use something like a single 1024*1024 texture for all the surfaces (but they try to optimize coverage as much as they can).
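To make the density gap concrete (the surface area and the virtual-texture density below are assumed numbers, just to get an order of magnitude):

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // Assumed, illustrative numbers: a level with ~10,000 m^2 of lit surface,
    // a single 1024x1024 lightmap shared across it, versus virtual texturing
    // at roughly one texel per centimetre for the color data.
    const double surface_m2      = 10000.0;
    const double lightmap_texels = 1024.0 * 1024.0;
    const double megatex_per_m2  = 100.0 * 100.0;  // ~1 texel/cm -> 10,000 texels/m^2

    const double lightmap_per_m2 = lightmap_texels / surface_m2;
    std::printf("Shared lightmap: ~%.0f texels/m^2 (~%.1f x %.1f per metre)\n",
                lightmap_per_m2, std::sqrt(lightmap_per_m2), std::sqrt(lightmap_per_m2));
    std::printf("Baked into color: ~%.0f texels/m^2, ~%.0fx denser\n",
                megatex_per_m2, megatex_per_m2 / lightmap_per_m2);
    return 0;
}
```

On those assumptions the baked-into-color approach carries roughly a hundred times the lighting detail per square metre of a single shared lightmap.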

They did however also bake the specular highlights in some places, to conserve disk space, which wasn't so good.

Also, Carmack decided that 60fps is more important than adding any kind of dynamic lighting on top - this is more of a question of taste though. A lot of people did like the responsive controls and gunplay that it lead to, and which is so rare nowadays.

Interesting, thanks for that. So dynamic lighting is possible but they just didn't implement it. I'm kind of inclined to agree with his reasoning behind not doing it for this gen.

The rumored RAGE HI-REZ texture pack was around 160 gigabytes.

Yeah I remembered that when it was first rumored, just couldn't find it.
 
Actually, no, it's the other way :)

The completely unique texture coverage has allowed them to not use lightmaps at all and to bake all the lighting into the color texture. So the lighting has the same fidelity as the textures, which is extremely high compared to any other game.
Games using lightmaps have a second set of UVs with a much lower texel density, so the quality of the lightmaps is pretty low. Most levels in games use something like a single 1024*1024 texture for all the surfaces (but they try to optimize coverage as much as they can).

They did however also bake the specular highlights in some places, to conserve disk space, which wasn't so good.

Also, Carmack decided that 60fps is more important than adding any kind of dynamic lighting on top - this is more of a question of taste though. A lot of people did like the responsive controls and gunplay that it lead to, and which is so rare nowadays.

I thought it looked fantastic on my PC. I hope the PRTs in GCN cause more developers to employ similar techniques next gen.
 
Rage on PC was 25 GB. Crysis 2 was 9 GB on PC. Battlefield 3 was 15 or 10 GB on PC (not sure why the disc version is more). Sounds like 50 GB is lots.

Max Payne 3, I believe, is also a bit north of 26 GB. Some of the MMOs on there can also hit even larger sizes.

Regards,
SB
 