Xbox One (Durango) Technical hardware investigation

Status
Not open for further replies.
At least we know the GPU won't be bandwidth-starved. It also seems very straightforward for devs to work with.
 
Well, according to ERP, developers are encouraged to use DDR3 for the framebuffer... I lean more toward the theory that the ESRAM is a big L3 cache for GPGPU ops.

Do GPGPU ops really depend on latency so much that the normal GPU caches can't handle them effectively? For pure bandwidth reasons, the difference between 68 GB/s and 102 GB/s doesn't sound all that relevant to me.
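As a sanity check on the two bandwidth figures being thrown around, here is a rough back-of-envelope sketch. The bus widths and clocks are the rumored Durango specs (256-bit DDR3-2133, a 1024-bit ESRAM port at the 800 MHz GPU clock), not confirmed numbers:

```python
# Back-of-envelope check on the 68 GB/s and 102 GB/s figures quoted above.
# Bus widths and clocks are the *rumored* Durango specs, not confirmed.

def bandwidth_gbs(bus_bits, transfers_per_sec):
    """Peak bandwidth in GB/s for a given bus width and transfer rate."""
    return bus_bits / 8 * transfers_per_sec / 1e9

# DDR3-2133 on a 256-bit bus
ddr3 = bandwidth_gbs(256, 2133e6)      # ~68.3 GB/s

# ESRAM: assumed 1024-bit port at the 800 MHz GPU clock
esram = bandwidth_gbs(1024, 800e6)     # ~102.4 GB/s

print(f"DDR3:  {ddr3:.1f} GB/s")
print(f"ESRAM: {esram:.1f} GB/s")
```

Both figures fall straight out of width-times-rate, which is why they are quoted as peak numbers; sustained bandwidth would be lower.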
 
Do we even know whether the 12 CUs could benefit from more than 16 ROPs? It has only two more CUs than the 7770, which also has 16 ROPs but runs a ~20% faster core clock, and I assume that was a pretty balanced design.
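To put the ROP question in context, here is a rough fillrate comparison, again using the rumored clocks (Durango at 800 MHz, the HD 7770 at 1000 MHz) and assuming one pixel per ROP per clock; all figures are illustrative peaks:

```python
# Rough fillrate math for the 16-ROP question. Clocks are the rumored
# figures (Durango 800 MHz, HD 7770 1000 MHz); peak rates only.

def fillrate_gpix(rops, clock_hz):
    """Peak pixel fillrate in Gpixels/s (one pixel per ROP per clock)."""
    return rops * clock_hz / 1e9

durango = fillrate_gpix(16, 800e6)    # 12.8 Gpix/s
hd7770  = fillrate_gpix(16, 1000e6)   # 16.0 Gpix/s

# Bandwidth needed just to *write* those pixels at 4 bytes each
# (32-bit color, no blending or depth traffic) -- a lower bound only.
write_gbs = durango * 4               # ~51.2 GB/s

print(f"Durango {durango} Gpix/s, 7770 {hd7770} Gpix/s, writes {write_gbs} GB/s")
```

Even this minimal write traffic is a sizable fraction of the DDR3 bandwidth, which is one argument for putting render targets in the ESRAM.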
 
Do GPGPU ops really depend on latency so much that the normal GPU caches can't handle them effectively? For pure bandwidth reasons, the difference between 68 GB/s and 102 GB/s doesn't sound all that relevant to me.

Depends on what you want to do. If you want some fancy visual physics effects, say a flag fluttering in the wind, you can do it with off-the-shelf graphics hardware: Nvidia PhysX on a GeForce card, for example. Non-visual GPGPU algorithms, on the other hand, such as AI, pathfinding, or driving physics, are extremely latency-sensitive, since the GPU and CPU have to work on them together. AMD claims that such algorithms become practical with HSA.
 
Depends on what you want to do. If you want some fancy visual physics effects, say a flag fluttering in the wind, you can do it with off-the-shelf graphics hardware: Nvidia PhysX on a GeForce card, for example. Non-visual GPGPU algorithms, on the other hand, such as AI, pathfinding, or driving physics, are extremely latency-sensitive, since the GPU and CPU have to work on them together. AMD claims that such algorithms become practical with HSA.

Well, since the GPU works with virtual addresses, that would seem to indicate it could be coherent with the CPU.

I know it's been asked, but can someone run through the disclosure and tell us what's actually different from GCN, besides the ESRAM?
 

Yeah, he said something similar on twitter:

they're accurate as far as i know. just somewhat incomplete. i'm just saying my info is dated feb 2012.
https://twitter.com/aegies/status/298472641935851522

edit: at this point I'm ready to believe what we've seen 100% and just wait for a disclosure on the other blocks in the system (display planes, DMEs, etc.)
 
We don't know how much silicon the other parts eat up. It could still be a big die. For instance, 6T SRAM would be huge.

Yes, but doesn't the vgleaks material claim the ESRAM is 1T? And do GPGPU ops really depend so much on latency that the normal GPU caches are so inefficient that this ESRAM would make a relevant difference? Something here isn't obvious to me.
 
It has nothing to do with cache; the copy overhead kills the latency. For non-visual GPGPU algorithms you have to copy data from CPU to GPU and back again all the time, and the copying takes longer than the computation itself, making it useless for developers. Visual GPGPU algorithms work fine on discrete GPUs because the data never has to be sent back to the CPU (CPU -> GPU -> screen). HSA allows the CPU and GPU to work on tasks without copying the data back and forth.
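The copy-overhead argument can be made concrete with a toy latency model. All the numbers below (per-transfer setup cost, effective bus bandwidth, job size) are illustrative assumptions, not measurements of any real hardware:

```python
# Toy latency model for the copy-overhead argument above. All numbers
# are illustrative assumptions, not measurements of real hardware.

PCIE_LATENCY_S = 10e-6   # assumed fixed cost per transfer (driver + DMA setup)
PCIE_BW_BPS    = 8e9     # assumed effective bus bandwidth, bytes/s

def discrete_gpu_roundtrip(bytes_each_way, compute_s):
    """CPU -> GPU copy, compute on the GPU, GPU -> CPU copy back."""
    copy = 2 * (PCIE_LATENCY_S + bytes_each_way / PCIE_BW_BPS)
    return copy + compute_s

def shared_memory(compute_s):
    """HSA-style shared address space: no copies, just the compute."""
    return compute_s

# A small pathfinding-sized job: 64 KB each way, 5 us of GPU compute.
job = discrete_gpu_roundtrip(64 * 1024, 5e-6)
print(f"discrete: {job * 1e6:.1f} us, shared: {shared_memory(5e-6) * 1e6:.1f} us")
```

With these assumptions the two copies cost several times the computation itself, which is exactly the situation described above where offloading stops being worth it.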
 
from superDaE's twitter (image attachment not preserved)
 
I asked this in the other thread but didn't get a reply:

What is the function of ROPs in relation to rendering and framerate? Also, how many texturing units does Durango have?
 
I think we've been through this too many times on this forum. You should check posts by Hornet, myself and others in the "Predict a Next Gen..." thread (now locked).

Durango's ESRAM will be 1T-SRAM, which is more closely related to eDRAM than to actual SRAM (6T, i.e. six transistors per cell). 32 MB of 6T SRAM would be ridiculous and infeasible, so the only reasonable option is 1T-SRAM, the same type used in the GameCube.

It will be denser than eDRAM, available for manufacturing on a 28 nm process, and able to sit on the same die as the other components.
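A quick transistor count shows why 32 MB of 6T SRAM is considered off the table. This counts the cell array only and ignores decoders, sense amps, and redundancy, so the real totals would be higher still:

```python
# Why 32 MB of 6T SRAM is considered infeasible: transistor count for
# the cell array alone (decoders, sense amps, redundancy not included).

def cell_transistors(megabytes, transistors_per_cell):
    """Transistors in the raw cell array for the given capacity."""
    bits = megabytes * 1024 * 1024 * 8
    return bits * transistors_per_cell

sram_6t = cell_transistors(32, 6)   # ~1.61 billion transistors
sram_1t = cell_transistors(32, 1)   # ~0.27 billion

print(f"6T: {sram_6t / 1e9:.2f}B transistors, 1T: {sram_1t / 1e9:.2f}B")
```

A 6T array alone would rival the transistor budget of a whole mid-range GPU, which is the core of the "1T is the only reasonable option" argument above.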

AFAIK, for smaller nodes (<= 40 nm) 1T-SRAM = eDRAM. Just ask MoSys or TSMC ;-)

eDRAM is nice, but you have to deal with refreshes and (more importantly) leakage issues, not to forget the extra process masks (4-6?) --> higher cost per die.
 
Could the ESRAM make Durango perform very efficiently at tessellation? Nvidia once labelled tessellation the future of gaming. It is supposed to allow better animation, better lighting, and tremendously higher polygon counts, while reducing the overall memory bandwidth and footprint.

http://www.hardwaresecrets.com/datasheets/03_TessellationDeepDive.pdf


Compression: Using tessellation allows us to reduce our memory footprint and bandwidth consumption. This is true both for on-disk storage and for system and video memory usage, thus reducing the overall game distribution size and improving loading time. The memory savings are especially relevant to console developers with scarce memory resources.

Bandwidth is improved because, instead of transferring all of the vertex data for a high-polygon mesh over the PCI-E bus, we only supply the coarse mesh to the GPU. At render time, the GPU will need to fetch only this mesh data, yielding higher utilization of vertex cache and fetch performance. The tessellator directly generates new data which is immediately consumed by the GPU, without additional storage in memory.
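The saving the quoted passage describes can be sketched numerically: only the coarse mesh crosses the bus, while the amplified geometry is generated and consumed on the GPU. The vertex size and tessellation factor below are arbitrary example values:

```python
# Illustration of the memory saving described above: only the coarse
# mesh crosses the bus; tessellated vertices are generated on the GPU.
# Vertex size and tessellation factor are arbitrary example values.

VERTEX_BYTES = 32          # e.g. position + normal + UV per vertex

def mesh_bytes(vertices):
    """Storage for a mesh's vertex data, in bytes."""
    return vertices * VERTEX_BYTES

coarse      = mesh_bytes(10_000)                 # sent over the bus
tess_factor = 16                                 # geometry amplified ~16x
tessellated = mesh_bytes(10_000 * tess_factor)   # never stored in memory

print(f"coarse: {coarse / 1024:.1f} KB vs expanded: {tessellated / 1024:.1f} KB")
```

The factor-of-16 gap between what is stored/transferred and what is actually rendered is the compression win the PDF is talking about.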


Has anyone thought about TBDR? It's supported in Direct3D 11.1 now. Doesn't it use a lot of local memory to reduce read/modify/writes to main memory, saving bandwidth in the process? Could GCN cores be programmed to be TBDR-like?
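For a feel of why tiling saves external bandwidth, here is a rough sizing of one tile's on-chip storage. The tile dimensions and per-pixel formats are typical assumed values, not Durango specifics:

```python
# Rough sizing of a TBDR tile buffer, as mentioned above. Tile size and
# per-pixel formats are typical assumed values, not Durango specifics.

def tile_bytes(width, height, color_bpp, depth_bpp, samples=1):
    """On-chip storage for one tile's color + depth, in bytes."""
    return width * height * samples * (color_bpp + depth_bpp) // 8

# A 32x32 tile with 32-bit color and 32-bit depth, no MSAA:
one_tile = tile_bytes(32, 32, 32, 32)   # 8 KB on chip

# All read-modify-write blending for the tile happens in this on-chip
# buffer; only the final resolved tile is written to main memory once.
print(f"{one_tile} bytes per tile")
```

Because every overdraw and blend hits the small on-chip buffer instead of DRAM, external traffic collapses to roughly one write per pixel, which is the bandwidth saving the question is getting at.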
 