Xbox One (Durango) Technical hardware investigation

AlphaWolf · Feb 15, 2013

Averagejoe said:
Now that you are adding Jaguar cores to the test why not ESRAM as well for the 680GTX.?

Maybe you should change the argument to the 7770 can match the 680GTX,if i pair the 680GTX with a pentiun 4 3.2GHZ,and 1 GB of system ram,if your point is trying to prove that one much weaker GPU can be more efficient than another maybe that is a better way.

Oh and i chose a 680GTX with 2GB of ram imagine one with even more ram.

I'm not adding anything to the test, I'm trying to show you that your test is useless and offers little information anyone here did not already know. The part in durango is not a 7770 (with one GB of ram) and it doesn't have an Intel i7 cpu. It will at least probably run some things at 1920x1080 so I eliminated one variable for you. And I doubt a 680 with more ram would fare much better at 1080, remember it's a pc, it has another pool of ram already the 2GB on the 680 are exclusively for graphics. Bumping the 7770 to 2GB might show some improvement, but still tells us little about durango.

DopeyFish · Feb 15, 2013

inefficient said:
We are talking about companies who routinely build and sell boards near $1000 just to have bragging rights for thier flagship designs.

I am pretty sure cost is not the reason why they do not use edram on thier high end.

For the low end to midrange , you might be able to make that argument about cost effectiveness.

Again

If the improvements lead to a gain of 30-40% in shader performance and a gain of 300% to compute performance compared to vanilla parts... Is it still worth it? You're still talking about well over 2 billion transistors dedicated to increasing efficiency... When they could get better performance dedicating those transistors to other things- brute force real world performance on higher end chips outclass this arrangement - and this is before we start talking about the fillrate- higher resolutions and multi monitor support

These decisions are for several reasons

One is long term cost
Another is power and thermal. Extracting similar performance of a higher end part without the heat and power drain would be HUGE.

It's something to be looked in to, for sure. When processes shrink a couple times more, maybe it could be worth it to have those parts on board.

Problem is... A lot of this stuff as far as we are aware needs developer hands on to get the benefit which they'd need to change.

This is about smarter utilization of budget if it all works as expected

scently · Feb 15, 2013

DopeyFish said:
Again

If the improvements lead to a gain of 30-40% in shader performance and a gain of 300% to compute performance compared to vanilla parts... Is it still worth it? You're still talking about well over 2 billion transistors dedicated to increasing efficiency... When they could get better performance dedicating those transistors to other things- brute force real world performance on higher end chips outclass this arrangement - and this is before we start talking about the fillrate- higher resolutions and multi monitor support

These decisions are for several reasons

One is long term cost
Another is power and thermal. Extracting similar performance of a higher end part without the heat and power drain would be HUGE.

It's something to be looked in to, for sure. When processes shrink a couple times more, maybe it could be worth it to have those parts on board.

Problem is... A lot of this stuff as far as we are aware needs developer hands on to get the benefit which they'd need to change.

This is about smarter utilization of budget if it all works as expected

I think power and thermal issue is really key here. The transistor budget allocated to the 360 edram setup is really indicative that power and thermal are really important in designing a console. For the same transistor budget, they could have added atleast one third more processing power to the 360, but instead they choose to dedicate it to implementing the edram. Sure, the esram is there to alleviate the bandwidth setup but I think the choice of eSRAM and not edram or 1t-sram tells us that it serves other benefit ie. to increase efficiency and maintain throughput through lower latency. And as bkilian just pointed out, the reason nvidia's gpu flops are more capable than amd's is that their caches have a lot lower latency. ERP already insinuated that from those he has spoken to, the esram setup on the durango is not just for framebuffer, but can be used as a cache for the gpu.

DopeyFish · Feb 15, 2013

scently said:
I think power and thermal issue is really key here. The transistor budget allocated to the 360 edram setup is really indicative that power and thermal are really important in designing a console. For the same transistor budget, they could have added atleast one third more processing power to the 360, but instead they choose to dedicate it to implementing the edram. Sure, the esram is there to alleviate the bandwidth setup but I think the choice of eSRAM and not edram or 1t-sram tells us that it serves other benefit ie. to increase efficiency and maintain throughput through lower latency. And as bkilian just pointed out, the reason nvidia's gpu flops are more capable than amd's is that their caches have a lot lower latency. ERP already insinuated that from those he has spoken to, the esram setup on the durango is not just for framebuffer, but can be used as a cache for the gpu.

Well it's now starting to make sense why they'd be telling devs to push the framebuffer to ram... Because the performance increase would be greater that way as opposed to having graphical operations in ram and pushing framebuffer to ESrAM (thanks to latency alone)

3dcgi · Feb 15, 2013

bkilian said:
Nvidia _does_ add low latency SRAM to it's products, that article a couple of pages back showed that the only card with sub 20 cycle memory times in it's caches was the NVidia. AMD had a 300 cycle minimum, even in it's cache.

Does anyone have the link handy? These numbers aren't correct. 300 cycles for a cache hit is crazy. Cache misses are often serviced in that amount of time.

Pete · Feb 15, 2013

http://www.sisoftware.net/?d=qa&f=gpu_mem_latency

(From this post.)

AstoundingHolmes · Feb 15, 2013

vgleaks said:
The difference in throughput between ESRAM and main RAM is moderate: 102.4 GB/sec versus 68 GB/sec. The advantages of ESRAM are lower latency and lack of contention from other memory clients—for instance the CPU, I/O, and display output. Low latency is particularly important for sustaining peak performance of the color blocks (CBs) and depth blocks (DBs).

The vgleaks article specifically points to efficiency gains in color and depth, but doesn't mention anything specific about shader efficiency gains. I would think if the gains were substantial, it would point them out as well. Perhaps we're all focusing on the wrong thing? Could the color and depth blocks be the bottleneck in the system, and improving their efficiency result in the rest of the GPU running more efficiently?

3dilettante · Feb 15, 2013

Pete said:
http://www.sisoftware.net/?d=qa&f=gpu_mem_latency

(From this post.)

The article stopped short of the latest generation GPUs, missing at the very least just how much GCN's memory pipeline resembles the prior generations.
The 6850 and Llano measurements are strange. On this forum, RV770 was profiled to have L1 latencies a little more than half what Sandra is reporting. AMD's memory pipeline could have gotten a lot longer latency since then, but nothing has come up concerning an increase like that.

Gipsel · Feb 15, 2013

3dilettante said:
The article stopped short of the latest generation GPUs, missing at the very least just how much GCN's memory pipeline resembles the prior generations.
The 6850 and Llano measurements are strange. On this forum, RV770 was profiled to have L1 latencies a little more than half what Sandra is reporting. AMD's memory pipeline could have gotten a lot longer latency since then, but nothing has come up concerning an increase like that.

I would say they messed up some of the measurements (3 cache levels for Fermi?). I would liike to see the code they used. And I guess I don't have to mention that latencies up to a few hundred cycles are often irrelevant during usual graphics work. That lower latency caches would increase the utilization of the GPUs by a factor of two or three is simply ridiculous.

/Me chiming in as the SiSoft link appeared again in the thread.

3dilettante · Feb 15, 2013

Gipsel said:
I would say they messed up some of the measurements (3 cache levels for Fermi?). I would liike to see the code they used. And I guess I don't have to mention that latencies up to a few hundred cycles are often irrelevant during usual graphics work. That lower latency caches would increase the utilization of the GPUs by a factor of two or three is simply ridiculous.

/Me chiming in as the SiSoft link appeared again in the thread.

The Fermi measurements may be stumbling into some kind of banking conflict with the 16 L/S units and the 64 KB pool that is used as L1 and SM.

SGEMM and DGEMM kernels have been tested on AMD GPUs since RV770, but I haven't run across commentary about a latency increase that big. It seems large enough to be noticed by someone running a compute kernel, if the cycle numbers are accurate.

Gipsel · Feb 15, 2013

3dilettante said:
The Fermi measurements may be stumbling into some kind of banking conflict with the 16 L/S units and the 64 KB pool that is used as L1 and SM.

SGEMM and DGEMM kernels have been tested on AMD GPUs since RV770, but I haven't run across commentary about a latency increase that big. It seems large enough to be noticed by someone running a compute kernel, if the cycle numbers are accurate.

This is going offtopic, but I really think they didn't know enough about the different architectures to draft directed low level tests measuring what they claim to measure. And the use of OpenCL isn't going to help that without carefully checking the output of the compiler.

Edit:
Just to prove that point, B3D ran an article about the Fermi architecture and did some measurements on a Cypress (should perform the same way as Barts) for comparison. For instance they measured the maximum aggregate throughput achievable on the shared memory. They got ~1064 GB/s on a HD5870, which equates about 1250 Bytes or 312 dwords per clock. That is awfully close to 320 dwords/clock which makes a nice 16 dwords per clock and CU (theoretical maximum is actually twice that). They used uint2 accesses (so that comes down to 8 accesses per clock, the goal was to maximize bandwidth use, not measuring latency) and the shared memory and register usage obviously allowed up to 16 Wavefronts per CU. Even assuming the CU had to cycle through all 16 available wavefronts before completing the access (which isn't true, afaik Evergreen/Northern Islands doesn't switch out wavefronts [besides the usual interleaving of 2 wavefronts] as long as an LDS operation is pending), this gives us an upper bound of 128 cycles of the LDS latency (it is very probably quite a bit lower). How the Sisoft guys managed to come up with 164 cycles with a kernel supposedly measuring latency (and not throughgput), is a bit beyond me. In fact, the Evergreen architecture supports LDS direct reads, where an LDS read can be directly used as an input operand (for this the programmer has to ensure, that there are no bank conflicts). It means the LDS access is synchronized to the ALU pipeline and the actual latency of the LDS is probably just 8 cycles or so (should be manageable at sub 1 GHz frequencies). The usual indexed operation adds a bit to it as one has to use additional ALU instructions to handle the input/output queues so one may end up at 16 cycles latency or something in that range (that actually explains the results in that B3D Fermi article

). But nowhere near 160 cycles. Actually, AMD stated that the GDS has just a few tens of cycles latency.
Edit2: Just found a presentation where AMD explicitly states that indexed LDS ops have a latency of a single VLIW instruction on Cypress.

Lucid_Dreamer · Feb 15, 2013

AstoundingHolmes said:
The vgleaks article specifically points to efficiency gains in color and depth, but doesn't mention anything specific about shader efficiency gains. I would think if the gains were substantial, it would point them out as well. Perhaps we're all focusing on the wrong thing? Could the color and depth blocks be the bottleneck in the system, and improving their efficiency result in the rest of the GPU running more efficiently?

That quote from vgleaks says, " The advantages of ESRAM are lower latency and lack of contention from other memory clients—for instance the CPU...". But, doesn't the CPU access ESRAM/eDRAM, too?

anexanhume · Feb 15, 2013

Lucid_Dreamer said:
That quote from vgleaks says, " The advantages of ESRAM are lower latency and lack of contention from other memory clients—for instance the CPU...". But, doesn't the CPU access ESRAM/eDRAM, too?

No, GPU is the sole reader/writer of the ESRAM.

french toast · Feb 15, 2013

anexanhume said:
No, GPU is the sole reader/writer of the ESRAM.

Do you feel that 4mb of L2 cache is enough for 8 jaguar~like cores? Considering there is no L4 cache scenario?

anexanhume · Feb 15, 2013

french toast said:
Do you feel that 4mb of L2 cache is enough for 8 jaguar~like cores? Considering there is no L4 cache scenario?

I'm not qualified to answer that. My understanding is that games aren't that cache sensitive, but the game I remember reading that about was quake, so my info is a little dated

dobwal · Feb 16, 2013

Maybe this patent is relevant.

http://www.google.com/patents/US20120272011

A method for refining multithread software executed on a processor chip of a computer system. The envisaged processor chip has at least one processor core and a memory cache coupled to the processor core and configured to cache at least some data read from memory. The method includes, in logic distinct from the processor core and coupled to the memory cache, observing a sequence of operations of the memory cache and encoding a sequenced data stream that traces the sequence of operations observed.

The first figure is a gaming console along with a dev kit.

astrograd · Feb 16, 2013

dobwal said:
Maybe this patent is relevant.

http://www.google.com/patents/US20120272011

A method for refining multithread software executed on a processor chip of a computer system. The envisaged processor chip has at least one processor core and a memory cache coupled to the processor core and configured to cache at least some data read from memory. The method includes, in logic distinct from the processor core and coupled to the memory cache, observing a sequence of operations of the memory cache and encoding a sequenced data stream that traces the sequence of operations observed.

The first figure is a gaming console along with a dev kit.

Yeah mining their patents might be a good idea. I found all sorts of neat info on their display planes and Kinect 2.0 and other stuff doing as much. I found these...might be good reads. Not sure.

1) Reduction of memory latencies using fine grained parallelism and FIFO data structures: http://patft.uspto.gov/netacgi/nph-...&f=G&l=50&d=PALL&RefSrch=yes&Query=PN/8239866

2) PROCESSOR CACHE TRACING: http://appft1.uspto.gov/netacgi/nph...srchnum.html&r=1&f=G&l=50&s1=20120272011.PGNR
3) LAYERED TEXTURE COMPRESSION ARCHITECTURE: http://www.faqs.org/patents/app/20090315905

3dcgi · Feb 16, 2013

Pete said:
http://www.sisoftware.net/?d=qa&f=gpu_mem_latency

(From this post.)

Thanks. As others said those tests are showing weird results. For some reason they're getting uncached reads.

liquidboy · Feb 17, 2013

astrograd said:
Yeah mining their patents might be a good idea. I found all sorts of neat info on their display planes and Kinect 2.0 and other stuff doing as much. I found these...might be good reads. Not sure.

1) Reduction of memory latencies using fine grained parallelism and FIFO data structures: http://patft.uspto.gov/netacgi/nph-...&f=G&l=50&d=PALL&RefSrch=yes&Query=PN/8239866

2) PROCESSOR CACHE TRACING: http://appft1.uspto.gov/netacgi/nph...srchnum.html&r=1&f=G&l=50&s1=20120272011.PGNR
3) LAYERED TEXTURE COMPRESSION ARCHITECTURE: http://www.faqs.org/patents/app/20090315905

I've been a big follower of MS and its people for a long time, especially the patent side of things ...

I've got a nice collection going over here : https://skydrive.live.com/redir?resid=1E3F9E1E2F8BC994!179

I'm a dev that uses there stack every day and code for .NET/Win32/DirectX/Silverlight/WPF/WinRT/WinJS as well as HTML/CSS/JS ...

I've written apps/games for WP8/WIN8/XBOX as well as for there older OS's

I follow their stuff closely from a developers perspective and I know that MS have so much cool stuff going on where they are creating a closer synergy between HW/OS & SW than ever before ... A lot of people just don't get the transformation going on in MS..

Xbox is just the first real place this is going to show off in with new designs/chips etc.. because they truly own everything from end to end.. BUT its the first step of many where they are going to bring there bottom HW designs to the entire ecosystem!!

All the haters are going to hate BUT MS are really really innovating at the moment..

p.s. I really love the "WEDGE LIGHT" stuff they're doing , hint hint illumiroom

scently · Feb 17, 2013

liquidboy said:
I've been a big follower of MS and its people for a long time, especially the patent side of things ...

I've got a nice collection going over here : https://skydrive.live.com/redir?resid=1E3F9E1E2F8BC994!179

I'm a dev that uses there stack every day and code for .NET/Win32/DirectX/Silverlight/WPF/WinRT/WinJS as well as HTML/CSS/JS ...

I've written apps/games for WP8/WIN8/XBOX as well as for there older OS's

I follow their stuff closely from a developers perspective and I know that MS have so much cool stuff going on where they are creating a closer synergy between HW/OS & SW than ever before ... A lot of people just don't get the transformation going on in MS..

Xbox is just the first real place this is going to show off in with new designs/chips etc.. because they truly own everything from end to end.. BUT its the first step of many where they are going to bring there bottom HW designs to the entire ecosystem!!

All the haters are going to hate BUT MS are really really innovating at the moment..

p.s. I really love the "WEDGE LIGHT" stuff they're doing , hint hint illumiroom

You said the durango cpu flops is actually more than double, can you state the exact amount?

Xbox One (Durango) Technical hardware investigation

AlphaWolf

Specious Misanthrope

DopeyFish

scently

DopeyFish

3dcgi

Pete

Moderate Nuisance

AstoundingHolmes

3dilettante

Gipsel

3dilettante

Gipsel

Lucid_Dreamer

anexanhume

french toast

anexanhume

dobwal

astrograd

3dcgi

liquidboy

scently

Similar threads