Xbox One (Durango) Technical hardware investigation

Status
Not open for further replies.
So basically Durango's memory bandwidth is fine and won't be bottlenecked at all? I thought the GPU would benefit from going GDDR5/192 GB/s. I guess it's fine because the GPU is 1.2 TFLOPS; if it was 1.6, would it be kind of starved?

We really don't know enough yet to draw any clear conclusions. With regard to RAM types, we have to assume that MS wanted volume (8 GB), so that ruled out GDDR5. And the bandwidth for Orbis is 176 GB/s.
 
Well, we could trust that rumor that says that depth and color blocks are more than standard back ends and offer free 4xMSAA and FP16 HDR.

Out of the blue everything would change and the available BW would even seem too much.

ERP, stop the fabs a little more!.

Single-cycle 4x MSAA and FP16 are what standard AMD back ends have provided since RV770. It's always bandwidth permitting, though, not "free". That's just another example of the tipsters not really understanding the specs.

I think it's very likely all the "special sauce" people were bragging about comes from the parts of the documentation for features Durango has above what the 360 could do. For example, color and z compression. They didn't bother with the 360 since the ROPs had so much internal bandwidth to the embedded memory in that design, but they make a point of explaining its presence and benefits in Durango.

Likewise the embedded memory is now more flexible so they explain the beefed up DMAs they've included to help manage the shifting of data between pools. Since MS renamed every other functional unit in the diagram, we don't have any reason to believe they are much more sophisticated than the DMAs GCN has always relied on, though now there are twice as many. Maybe the small pool of ESRAM means you have to move data around more often, or sharing with the CPU requires more attention there.

Everything else seems pretty bog standard GCN. I guess the audio chip could still be pretty cool.
 
I know it was 8 cycles with the VLIW architectures, but is this documented or measured somewhere for GCN?

Honestly, now that I think of it, I actually do not know. I mostly work on CUDA at the moment, which means I have only written a tiny amount of code for GCN. I thought that was how it worked, but I honestly can't remember where I learned that, or if I'm just mixing up GCN and older stuff in my head. Once I get home from work I'm going to look it up, or even do a little test.
 
Indeed. The way I see it, and from what Al was saying on NeoGAF, the hardware to provide free 4xMSAA is there, but the bandwidth has always been the problem. By including the eSRAM, they provide the bandwidth and framebuffer needed to take advantage of the 4xMSAA hardware that is already in place. The choice of eSRAM is peculiar though, since for the same budget they could have gotten something like 4 times the size in eDRAM, so the advantages of SRAM are probably very key here, i.e. low latency, no data refresh, etc.
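
To put rough numbers on the framebuffer side of this, here is a minimal sizing sketch. The 1080p and 720p resolutions, the 8 bytes per sample for an FP16 RGBA target, and the 4-byte depth/stencil sample are assumptions for illustration, not anything from the leak, and the figures ignore compression and any tiling or padding.

```python
# Rough, uncompressed render-target sizes versus a 32 MB ESRAM pool.
# Resolutions and per-sample sizes are assumptions for illustration only.

ESRAM_BYTES = 32 * 1024 * 1024  # 32 MB, per the rumored spec


def render_target_bytes(width, height, bytes_per_sample, samples):
    """Uncompressed size of a multisampled surface, ignoring padding."""
    return width * height * bytes_per_sample * samples


# FP16 RGBA is 4 channels * 2 bytes = 8 bytes per sample.
color_1080p = render_target_bytes(1920, 1080, 8, 4)
color_720p = render_target_bytes(1280, 720, 8, 4)
# Assumed 32-bit depth/stencil sample.
depth_720p = render_target_bytes(1280, 720, 4, 4)

for name, size in [
    ("1080p FP16 color, 4xMSAA", color_1080p),
    ("720p FP16 color, 4xMSAA", color_720p),
    ("720p color + depth, 4xMSAA", color_720p + depth_720p),
]:
    verdict = "fits" if size <= ESRAM_BYTES else "does not fit"
    print(f"{name}: {size / 2**20:.1f} MiB, {verdict} in 32 MB")
```

Under those assumptions only the 720p color target fits on its own, so capacity would matter alongside bandwidth.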
 
I think we've been through this too many times on this forum. You should check posts by Hornet, myself and others in the "Predict a Next Gen..." thread (now locked).

Durango's ESRAM will be 1T-SRAM, which is more closely related to eDRAM than actual SRAM (6T, or 6 transistors per cell). 32MB of (6T) SRAM would be ridiculous and infeasible, thus the only reasonable option is 1T-SRAM, aka the same type used on the GameCube.

It'll be denser than eDRAM, available for manufacturing on a 28nm process, and possible to put on the same die as the other components.
 
Yeah, this is a rumor thread, and until official specs come out you can't say for sure. What you guys said is just speculation, just as much as what I said. And from what you are saying, it is still denser than eDRAM. In fact, what I am saying is more probable than Orbis shipping with 8 GB of GDDR5 RAM, which I think is unrealistic, but as long as these are still rumors the speculation is acceptable.
 
Think of it like a PC with 68 GB/s main memory and 102 GB/s VRAM.

Well, except that very few GPUs have only 32MB of VRAM ;). In that respect, isn't it more of a large and fast GPU cache memory/scratchpad?

Also, I wonder if the GPU is the 'boss' of the DDR3 again, just like it was in the 360, if I remember correctly?

At any rate, the 32MB of VRAM, however small, certainly still adds a lot of bandwidth potential to the system.
 
More wood (there are three pages of info):

http://www.vgleaks.com/durango-gpu/

Clock rate: 800 MHz

Compute
  Shader cores: 12
  Instruction issue rate: 12 SCs * 4 SIMDs * 16 threads/clock = 768 ops/clock
  FLOPs: 768 ops/clock * (1 mul + 1 add) * 800 MHz = 1.2 TFLOPS
  Interpolation: ( 768 ops/clock / 2 ops ) * 800 MHz = 307.2 Gfloat/sec

Geometry
  Triangle rate: 2 tri/clock * 800 MHz = 1.6 Gtri/sec
  Vertex rate: 2 vert/clock * 800 MHz = 1.6 Gvert/sec
  Vertex/buffer fetch rate (4 bytes): 4 elements/clock * 12 SCs * 800 MHz = 38.4 Gelements/sec
  Vertex/buffer data rate from cache: 38.4 Gelements/sec * 4 bytes = 153.6 GB/sec

Memory
  Peak throughput from main RAM: 68 GB/sec
  Peak throughput from ESRAM: 128 bytes/clock * 800 MHz = 102.4 GB/sec
  ESRAM size: 32 MB
  GSM size: 64 KB
  LSM size: 12 SCs * 64 KB = 768 KB
  L2 cache size: 4 x 128 KB = 512 KB (shared)

Texture
  Bilinear fetch rate (4 bytes): 4 fetches/clock * 12 SCs * 800 MHz = 38.4 Gtexels/sec
  Bilinear data rate from cache: 38.4 Gtexels/sec * 4 bytes = 153.6 GB/sec
  L1 cache size: 16 KB/SC * 12 SCs = 192 KB (nonshared)

Output
  Color/depth blocks: 4
  Pixel clear rate: 1 8×8 tile/clock * 4 DBs * 800 MHz = 204.8 Gpixel/sec
  Pixel hierarchical Z cull rate: 1 8×8 tile/clock * 4 DBs * 800 MHz = 204.8 Gpixel/sec
  Sample Z cull rate: 16 /clock * 4 DBs * 800 MHz = 51.2 Gsample/sec
  Pixel emit rate: 4 /clock * 4 DBs * 800 MHz = 12.8 Gpixel/sec
  Pixel resolve rate: 4 /clock * 4 DBs * 800 MHz = 12.8 Gpixel/sec
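
For anyone who wants to check the arithmetic in that list, here is a small sketch that recomputes the derived figures from the base numbers in the leak. The variable names are mine and purely illustrative; the inputs are the ones listed above.

```python
# Recompute the derived peak rates in the leaked spec list.
# All inputs come from the list itself; names are illustrative only.

CLOCK_HZ = 800_000_000
SHADER_CORES = 12
SIMDS_PER_SC = 4
THREADS_PER_SIMD_PER_CLOCK = 16
DEPTH_BLOCKS = 4

ops_per_clock = SHADER_CORES * SIMDS_PER_SC * THREADS_PER_SIMD_PER_CLOCK
flops = ops_per_clock * 2 * CLOCK_HZ            # 1 mul + 1 add per op
interp_rate = ops_per_clock // 2 * CLOCK_HZ     # floats/sec
fetch_rate = 4 * SHADER_CORES * CLOCK_HZ        # 4-byte elements or texels/sec
cache_data_rate = fetch_rate * 4                # bytes/sec
esram_bandwidth = 128 * CLOCK_HZ                # bytes/sec
tile_rate = 8 * 8 * DEPTH_BLOCKS * CLOCK_HZ     # clear / hi-Z cull, pixels/sec
sample_z_rate = 16 * DEPTH_BLOCKS * CLOCK_HZ    # samples/sec
pixel_emit_rate = 4 * DEPTH_BLOCKS * CLOCK_HZ   # pixels/sec

print(f"ops/clock:        {ops_per_clock}")
print(f"FLOPS:            {flops / 1e12:.4f} T")
print(f"interpolation:    {interp_rate / 1e9:.1f} Gfloat/s")
print(f"fetch rate:       {fetch_rate / 1e9:.1f} G/s")
print(f"cache data rate:  {cache_data_rate / 1e9:.1f} GB/s")
print(f"ESRAM bandwidth:  {esram_bandwidth / 1e9:.1f} GB/s")
print(f"clear/hi-Z rate:  {tile_rate / 1e9:.1f} Gpixel/s")
print(f"sample Z rate:    {sample_z_rate / 1e9:.1f} Gsample/s")
print(f"pixel emit rate:  {pixel_emit_rate / 1e9:.1f} Gpixel/s")
```

The "1.2 TFLOPS" figure is 1.2288 TFLOPS before rounding; the other numbers match the list as printed.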
 
Interesting tidbits about memory:

Virtual Addressing

All GPU memory accesses on Durango use virtual addresses, and therefore pass through a translation table before being resolved to physical addresses. This layer of indirection solves the problem of resource memory fragmentation in hardware—a single resource can now occupy several noncontiguous pages of physical memory without penalty.

Virtual addresses can target pages in main RAM or ESRAM, or can be unmapped. Shader reads and writes to unmapped pages return well-defined results, including optional error codes, rather than crashing the GPU. This facility is important for support of tiled resources, which are only partially resident in physical memory.
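
The behaviour described there can be pictured with a toy model. This is only a sketch of the idea, not how the hardware does it: the 4 KB page size, the `ToyPageTable` class, and the `(value, ok)` return convention are all hypothetical choices made for illustration.

```python
# Toy model of virtual addressing with two physical pools and unmapped pages.
# PAGE_SIZE, class and function names are hypothetical, for illustration only.

PAGE_SIZE = 4096  # assumed page size; the quoted text does not state one


class ToyPageTable:
    def __init__(self):
        # virtual page number -> (pool name, physical page number)
        self.entries = {}

    def map_page(self, vpage, pool, ppage):
        self.entries[vpage] = (pool, ppage)

    def translate(self, vaddr):
        """Return (pool, physical address), or None if the page is unmapped."""
        vpage, offset = divmod(vaddr, PAGE_SIZE)
        entry = self.entries.get(vpage)
        if entry is None:
            return None
        pool, ppage = entry
        return pool, ppage * PAGE_SIZE + offset


def read_byte(table, memory, vaddr):
    """Return (value, ok); an unmapped read yields (0, False), not a fault."""
    location = table.translate(vaddr)
    if location is None:
        return 0, False
    pool, paddr = location
    return memory[pool][paddr], True


memory = {
    "main": bytearray(8 * PAGE_SIZE),
    "esram": bytearray(4 * PAGE_SIZE),
}
table = ToyPageTable()
# One resource spread over noncontiguous pages in different pools.
table.map_page(0, "main", 5)
table.map_page(1, "esram", 2)
# Virtual page 2 is deliberately left unmapped (a non-resident tile).

memory["main"][5 * PAGE_SIZE + 10] = 42
memory["esram"][2 * PAGE_SIZE + 1] = 7

print(read_byte(table, memory, 10))             # page 0 -> main RAM
print(read_byte(table, memory, PAGE_SIZE + 1))  # page 1 -> ESRAM
print(read_byte(table, memory, 2 * PAGE_SIZE))  # page 2 -> unmapped
```

The point of the sketch is that a contiguous virtual range can be backed by scattered pages in either pool, and a miss comes back as a defined value plus a status flag.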

ESRAM

Durango has no video memory (VRAM) in the traditional sense, but the GPU does contain 32 MB of fast embedded SRAM (ESRAM). ESRAM on Durango is free from many of the restrictions that affect EDRAM on Xbox 360. Durango supports the following scenarios:

Texturing from ESRAM
Rendering to surfaces in main RAM
Read back from render targets without performing a resolve (in certain cases)


The difference in throughput between ESRAM and main RAM is moderate: 102.4 GB/sec versus 68 GB/sec. The advantages of ESRAM are lower latency and lack of contention from other memory clients—for instance the CPU, I/O, and display output. Low latency is particularly important for sustaining peak performance of the color blocks (CBs) and depth blocks (DBs).
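
As a rough way to read those two peak figures, the sketch below turns them into a per-frame byte budget. The 30 and 60 fps targets and the 1 GB = 10^9 bytes convention are assumptions, and real sustained throughput would be lower than peak.

```python
# Peak bytes each pool could move per frame, from the quoted throughputs.
# Frame rates and the GB = 1e9 bytes convention are assumptions.

MAIN_RAM_GB_PER_SEC = 68.0
ESRAM_GB_PER_SEC = 102.4


def bytes_per_frame(gb_per_sec, fps):
    """Peak bytes transferable in one frame at the given frame rate."""
    return gb_per_sec * 1e9 / fps


ratio = ESRAM_GB_PER_SEC / MAIN_RAM_GB_PER_SEC
print(f"ESRAM / main RAM peak ratio: {ratio:.2f}x")

for fps in (30, 60):
    main = bytes_per_frame(MAIN_RAM_GB_PER_SEC, fps)
    esram = bytes_per_frame(ESRAM_GB_PER_SEC, fps)
    print(f"{fps} fps: main RAM {main / 1e9:.2f} GB/frame, "
          f"ESRAM {esram / 1e9:.2f} GB/frame")
```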
 
I wonder if these are standard features already available in Cape Verde/Pitcairn/Tahiti.

Unlike some earlier GPUs (including the Xbox 360 GPU), Durango leaves texture and buffer data in native compressed form in the L2 and L1 caches. [...]
To see how this policy affects cache efficiency, consider an sRGB BC1 texture—perhaps the most commonly encountered texture type in games. BC1 is a 4-bit per texel format; on Durango, this texture occupies 4 bits per texel in the L1 cache. On Xbox 360, the same texture is decompressed and gamma corrected before it reaches the cache, and therefore occupies 8 bytes per texel, or 16 times the Durango footprint. For this reason, the Durango L1 cache behaves like a much larger cache when compared against previous architectures.
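
The 16x figure is easy to check. The sketch below counts how many BC1 texels would fit in a 16 KB L1 (the per-shader-core size from the spec list) in compressed versus decompressed form. It is a capacity-only estimate: it assumes the whole cache holds texel data and ignores line granularity and tags.

```python
# Texel capacity of a 16 KB L1 for BC1 data kept compressed vs decompressed.
# Capacity-only estimate; ignores cache line granularity and tag overhead.

L1_BYTES = 16 * 1024               # per shader core, from the spec list
BC1_BITS_PER_TEXEL = 4             # compressed footprint (Durango, per the text)
DECOMPRESSED_BITS_PER_TEXEL = 64   # 8 bytes per texel (Xbox 360, per the text)


def texels_in_cache(cache_bytes, bits_per_texel):
    """Number of whole texels that fit in the given cache size."""
    return cache_bytes * 8 // bits_per_texel


compressed_texels = texels_in_cache(L1_BYTES, BC1_BITS_PER_TEXEL)
decompressed_texels = texels_in_cache(L1_BYTES, DECOMPRESSED_BITS_PER_TEXEL)

print(f"compressed:   {compressed_texels} texels")
print(f"decompressed: {decompressed_texels} texels")
print(f"ratio:        {compressed_texels // decompressed_texels}x")
```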

The Durango GPU supports 2x, 4x, and 8x MSAA levels. It also implements a modified type of MSAA known as compressed AA. [...]
Traditionally, coverage samples and surface samples match up one to one. [...]
Under compressed AA, there can be more coverage samples than surface samples. [...]
Compressed AA combines most of the quality benefits of high MSAA levels with the relaxed space requirements of lower MSAA levels.
 
In my eyes it would have been a much better solution for everyone if they had just listed the differences between the Durango GPU and a common GCN GPU instead of this text overkill! :yep2:
 
All of those are standard features of GCN. Coverage sampling was actually featured way back in the HD6900 series.

I see. Other than ESRAM, it looks like a pretty standard GCN GPU. The only good thing I can think of about this design is that it is likely going to be cheap, unless they bundle some expensive peripheral in every box. This SoC should be less than 250 mm^2, with a TDP of less than 120 W. Considering the motherboard layout will also be simpler than a launch Xbox 360's, they could launch at $300 without breaking the bank.
 
So what exactly differentiates a Shader Core from a common GCN Compute Unit?

Looking at the specs, it doesn't seem like there is a difference. It's just listed with a different name on the block diagrams, wherever those came from. Seems like it's a GCN GPU, just a lower-end one than is in Orbis.
 