Xbox One (Durango) Technical hardware investigation

Do you know if this 1024 bit wide (4x256 bit wide) L2 interface fits with the GCN assumption?

I ask since someone suggested to me that GCN & PS4 have 64-bit wide L2 interfaces, not 256-bit. (And further suggested that this is a hint of a more advanced version.) Is this incorrect?

http://www.vgleaks.com/durango-gpu/cache/

The 1024-bit L2 interface to a memory controller for something such as the eSRAM fits perfectly with the GCN assumption, yes. What it doesn't show is the DDR3 MC interface to the cache.

Whereas the PS4 has 64-bit wide MCs to the cache, because that's the width of the GDDR5 memory controller channels (I think).
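To put rough numbers on that (the clocks below are the ones from the vgleaks-era rumours, i.e. 800 MHz GPU, DDR3-2133 and 5.5 Gbps GDDR5, so treat them as assumptions rather than confirmed specs):

Code:
# Rough peak-bandwidth arithmetic: GB/s = (bits / 8) * transfers-per-second / 1e9.
# All figures are taken from the vgleaks-era rumours and are assumptions, not confirmed specs.

def peak_gbps(bus_width_bits, transfers_per_sec):
    """Peak bandwidth in GB/s for a bus of the given width and transfer rate."""
    return bus_width_bits / 8 * transfers_per_sec / 1e9

# Durango eSRAM: 4 x 256-bit = 1024-bit interface, one transfer per 800 MHz GPU clock.
esram = peak_gbps(1024, 800e6)          # ~102.4 GB/s

# Durango main memory: 256-bit DDR3-2133.
ddr3 = peak_gbps(256, 2133e6)           # ~68.3 GB/s

# PS4 (Orbis): 256-bit GDDR5 at 5.5 Gbps per pin, i.e. 8 x 32-bit channels.
gddr5 = peak_gbps(256, 5500e6)          # ~176 GB/s

print(f"eSRAM ~{esram:.1f} GB/s, DDR3 ~{ddr3:.1f} GB/s, GDDR5 ~{gddr5:.1f} GB/s")

In other words, a GDDR5 channel gets its bandwidth from a very high per-pin transfer rate, while on-chip SRAM running at the GPU clock has to be wide to reach a comparable peak.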
 
The 8 Steamroller cores seem reasonably off the table, but how can you be so sure about GCN2?

GCN showed up in January 2012, and it is nearly two full years from then to November 2013.

Can one really make the assertion that it is GCN 1.0 and not 1.5 or 2.0, or heavily customized? They had two more years to work on it.



I wouldn't be shocked either way. GCN 1.0 is not unreasonable but I do not think the other options can be so easily dismissed without sources as opposed to assumptions.

The VGleaks specs look like GCN 1. The details provided for a CU are pretty much the same. I would expect GCN 2 to have some more significant differences. They might have some customization, but I doubt it's the next generation of the architecture.
 
Please help me to interpret this whitepaper (I think I have it right but):

http://www.amd.com/us/Documents/GCN_Architecture_whitepaper.pdf

Is it saying 64 bit L2 interfaces in GCN?

AMD GCN Whitepaper said:
Equally important, the cache hierarchy was designed to integrate with x86 microprocessors. The GCN virtual memory system can support 4KB pages, which
is the natural mapping granularity for the x86 address space - paving the way for a shared address space in the future. In fact, the IOMMU used for DMA
transfers can already translate requests into the x86 address space to help move data to the GPU and this functionality will grow over time. Additionally, the
caches in GCN use 64B lines, which is the same size as x86 processors. This sets the stage for heterogeneous systems to transparently share data between
the GPU and CPU through the traditional caching system, without explicit programmer control.

The memory controllers tie the GPU together and provide data to nearly every part of the system. The command processor, ACEs, L2 cache, RBEs, DMA
Engines, PCI Express™, video accelerators, Crossfire interconnects and display controllers all have access to local graphics memory. Each memory controller
is 64-bits wide and composed of two independent 32-bit GDDR5 memory channels. For lower cost products, GDDR5 can be replaced with DDR3 memory as well. For large memory footprints, the GDDR5 controller can operate in clamshell mode to use two DRAMs per channel, doubling the capacity.



So 4x the width on Xbox One? Is this a big impact? It seems like a big difference.

Why have 4x the width with 2/3 the ALUs???



The 1024-bit L2 interface to a memory controller for something such as the eSRAM fits perfectly with the GCN assumption, yes.

Whereas the PS4 has 64-bit wide MCs to the cache, because that's the width of the GDDR5 memory controller channels (I think).

But isn't 1024 bits way wider than any existing GCN? Isn't the 7970 only 384 bits?

I am trying to understand why it would have 1024/384 = 2.7x the width for 768/2048 = 1/2.7x the ALUs.

That is a ratio of 2.7^2 ≈ 7. That just seems very odd to me: 7x the width per ALU. And to much lower-latency on-chip memory, too.



Maybe it is nothing, but 4x the interface width and 7x the width per ALU seems like heavy overkill for a supposedly weak GPU. What am I missing? Is it just because of DDR3, or is it possibly more than that?
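Writing that comparison out explicitly (the 384-bit and 2048-ALU figures are Tahiti's, the 1024-bit and 768-ALU figures are from the leak):

Code:
# The ratio the post is describing, written out.
durango_l2_width, tahiti_bus_width = 1024, 384      # bits
durango_alus, tahiti_alus = 768, 2048               # shader ALUs

width_ratio = durango_l2_width / tahiti_bus_width   # ~2.67x wider interface
alu_ratio = tahiti_alus / durango_alus              # ~2.67x fewer ALUs

# Interface width available per ALU, relative to Tahiti:
per_alu = width_ratio * alu_ratio                   # ~7.1x
print(f"{width_ratio:.2f}x width, 1/{alu_ratio:.2f} the ALUs -> ~{per_alu:.1f}x width per ALU")

Bear in mind this compares an on-chip SRAM interface against an off-chip DRAM bus, so it is not quite apples to apples.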
 

No other existing GCN card uses a large amount of eSRAM like the XBONE does, so don't expect to find an example of it in the GCN whitepaper. They replaced the DDR3/GDDR5 MCs with one for the eSRAM and one for the DDR3; this is where the 4x (256-bit) interface comes from.

The rest is bog-standard GCN as per the vgleaks article.
 

Does anyone know if future AMD cards are being designed with eSRAM?

I just wonder since Xenos did seem to take a bit of a lead (UMA, unified shaders) with features that later showed up in R600.

Or does it look like the eSRAM is just a one-off customization like the eDRAM with Xenos?
 
Is it saying 64 bit L2 interfaces in GCN?

"Like the L1 data cache, the L2 is virtually addressed, so no TLBs are required at all. The L2 cache is 16 way associative, with 64B cache lines and LRU
replacement. It is a write-back and write allocate design, so it absorbs all the write misses from the L1 data cache. Each L2 slice is 64-128KB and can send a 64B cache line to the L1 caches."

The memory controllers are 64-bit, each having two 32-bit channels.
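As a rough cross-check on how those numbers add up (the one-slice-per-32-bit-channel mapping and the retail-card totals are my own reading, not something stated in the thread):

Code:
# Rough L2 sizing from the whitepaper numbers (64-128 KB per slice; one slice per
# 32-bit memory channel is my assumption about the mapping).
def l2_total_kb(bus_width_bits, slice_kb=64):
    channels = bus_width_bits // 32    # two 32-bit channels per 64-bit memory controller
    return channels * slice_kb

print(l2_total_kb(384))    # 384-bit Tahiti-class bus with 64 KB slices -> 768 KB
print(l2_total_kb(256))    # 256-bit bus -> 512 KB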
 
Does anyone know if future AMD cards are being designed with eSRAM?

I just wonder since Xenos did seem to take a bit of a lead (UMA, unified shaders) with features that later showed up in R600.

Or does it look like the eSRAM is just a one-off customization like the eDRAM with Xenos?

I doubt it since it takes a significant software effort to utilize eSRAM. I don't think it's feasible in the PC space, unless it ends up being implemented as an L3 with complete transparency to the software. I think long term, the use of embedded on-die RAMs in GPUs will be replaced with stacked memories/memories on interposer. You'll get the bandwidth and power savings, but much, much more memory.
 
I doubt it since it takes a significant software effort to utilize eSRAM.

Is it something that Direct X 11.? could take care of?

Does the developer need to know?

Or is it something that needs more information to use and control than what Direct X API could handle "automatically"?



Just wondering if AMD and MS can handle it in an API or whether it requires case-by-case hand-holding.
 

To get a decent amount of performance it probably requires hand-holding. It's not a cache, and as such there is nothing automatic about it; some software needs to handle it, and without knowing ahead of time what would benefit from staying in there and what would benefit from being evicted, you cannot really use it decently.
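As a toy illustration of what that hand-holding amounts to (the resource names and traffic estimates below are invented for illustration and have nothing to do with the real XDK), the decision boils down to picking which surfaces are worth the limited 32MB:

Code:
# Hypothetical, hand-rolled placement heuristic: given per-resource size and an
# estimate of how much DRAM traffic each one generates per frame, greedily fill
# the 32 MB eSRAM with the resources that save the most bandwidth per byte.
ESRAM_BYTES = 32 * 1024 * 1024

# (name, size in bytes, estimated bytes of traffic per frame) - all invented numbers.
resources = [
    ("gbuffer_albedo",  1920 * 1080 * 4, 600e6),
    ("gbuffer_normals", 1920 * 1080 * 4, 550e6),
    ("depth",           1920 * 1080 * 4, 700e6),
    ("hdr_target",      1920 * 1080 * 8, 400e6),
    ("shadow_map",      2048 * 2048 * 4, 250e6),
]

def place_in_esram(resources, capacity=ESRAM_BYTES):
    """Greedy placement: keep the resources with the highest traffic per byte."""
    placed, used = [], 0
    for name, size, traffic in sorted(resources, key=lambda r: r[2] / r[1], reverse=True):
        if used + size <= capacity:
            placed.append(name)
            used += size
    return placed, used

placed, used = place_in_esram(resources)
print(placed, f"{used / 2**20:.1f} MiB used")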
 

Ok.

I was wondering if the Direct X API, with a frame's worth of information, could make decisions on what to keep in memory and what to evict for that frame (based upon some sort of examination or analysis of what it is being asked to render for that frame).

I really need to rely on others here as my understanding of the Direct X API is extremely low.

But I was wondering if passing a frame's worth of instructions allows the API to go [software based] "looking forward in time" [throughout the frame instructions from start to finish before starting to render] in a way that a GPU (hardware) or cache cannot. Can Direct X "scan", analyze or process the frame's worth of data and make much more intelligent choices than a GPU could clock by clock, with only a relatively tiny bit of logic and knowledge based upon the past?

Can this be done at the input assembler? [I know nothing about the Direct X API so I am sorry if it turns out that this does not make sense.]

http://www.pcgameshardware.com/&menu=browser&mode=article&image_id=978082&article_id=682669&page=1

http://www.pcgameshardware.com/&menu=browser&mode=article&image_id=978084&article_id=682669&page=1
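A crude sketch of that "look ahead over a whole frame" idea, purely hypothetical and not how Direct X actually works: if the runtime had the full frame recorded, it could tally per-resource traffic before rendering starts and feed the totals into a placement heuristic like the one sketched a few posts up.

Code:
# Hypothetical "whole frame known in advance" analysis: walk a recorded command
# list once, count how many bytes each resource is touched for, and hand the
# totals to a placement decision before the frame is actually rendered.
from collections import defaultdict

# A fake recorded frame: (command, resource name, bytes touched). Invented data.
frame_commands = [
    ("draw",    "gbuffer_albedo", 3_000_000),
    ("draw",    "depth",          4_000_000),
    ("draw",    "gbuffer_albedo", 2_500_000),
    ("resolve", "hdr_target",     8_000_000),
    ("draw",    "shadow_map",     1_500_000),
]

def analyse_frame(commands):
    """Per-resource traffic totals for one recorded frame."""
    traffic = defaultdict(int)
    for _cmd, resource, nbytes in commands:
        traffic[resource] += nbytes
    return dict(traffic)

print(analyse_frame(frame_commands))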
 
I doubt it since it takes a significant software effort to utilize eSRAM. I don't think it's feasible in the PC space, unless it ends up being implemented as an L3 with complete transparency to the software. I think long term, the use of embedded on-die RAMs in GPUs will be replaced with stacked memories/memories on interposer. You'll get the bandwidth and power savings, but much, much more memory.

Is there any actual proof of it requiring significant effort when not used as L3, or is it just an assumption?
As L3 it doesn't need effort and works fine, as proven by Intel, though.
 
The VGleaks specs look like GCN 1. The details provided for a CU are pretty much the same. I would expect GCN 2 to have some more significant differences. They might have some customization, but I doubt it's the next generation of the architecture.

Is Volcanic Islands/Hawaii GCN (1.0) or is it a significantly evolved/new GCN such as GCN 2.0 (nearing 2 full years later)?

http://www.xbitlabs.com/news/graphi...D_Hawaii_Graphics_Processor_in_September.html



I ask since the possibilities are mainly:

1. GCN (1.0) <<< This seems to be the most popular and considered most likely
2. GCN (1.0) + Customization <<< 2nd most popular/likely
3. GCN (1.5) (And maybe Customization)
4. GCN (2.0) (And maybe Customization) <<< Not likely, especially if VI/Hawaii is not GCN 2.0

So if VI/Hawaii is not GCN 2.0 then that would be a solid data point that could take it off the table.

But if it is GCN 2.0, then the Xenos & R600 timing could be a reasonable argument supporting a VI/Hawaii-type architecture in Xbox One.

[However, if the VGleaks specs are accurate and it is GCN 1.0, then that is a reasonable argument against. And they seem to be pretty accurate, at least meshing well with PS4 and the less detailed/less complete information on Xbox One.]
 

The fact that it lacks some of what we already know will be in GCN 1.1 really lends credence to it being GCN 1.0, or GCN 1.0 + customisation, and that's it.
 
I doubt it since it takes a significant software effort to utilize eSRAM. I don't think it's feasible in the PC space, unless it ends up being implemented as an L3 with complete transparency to the software. I think long term, the use of embedded on-die RAMs in GPUs will be replaced with stacked memories/memories on interposer. You'll get the bandwidth and power savings, but much, much more memory.
I wonder if the two are mutually exclusive. The interesting thing to me is that we are getting really close to being able to fit enough cache on a chip that it could relieve one constraint that still dictates how GPUs are designed; my understanding of it is "texturing-related latencies".

With tiled resources you can fit all the textures you need for a frame within 16MB at really good quality, which means it could fit within an on-chip cache pretty soon. That means that part of the texturing-induced latencies could be significantly reduced. I wonder if it could have an impact on how GPUs are designed / less reliance on multi-threading to keep the ALUs busy.
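A back-of-envelope version of that claim (every per-pixel factor below is my own rough assumption, purely to show the order of magnitude):

Code:
# Order-of-magnitude estimate of the unique texture data actually sampled in one
# frame. All the factors here are my own rough assumptions, purely illustrative.
pixels          = 1920 * 1080   # output resolution
texel_per_pixel = 1.5           # unique texels sampled per pixel (mip-mapped, some overdraw)
layers          = 3             # e.g. albedo + normal + roughness
bytes_per_texel = 1.0           # block-compressed formats are roughly 0.5-1 byte per texel

working_set = pixels * texel_per_pixel * layers * bytes_per_texel
print(f"~{working_set / 2**20:.0f} MiB of unique texels touched per frame")

Even with more pessimistic factors it stays in the tens of MB, which is why a frame's worth of texture data fitting in an on-chip pool is not a crazy idea.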
 
More command queues; the XBONE has the default setup of GCN 1.0 in that regard.

So more like:

Cerny said:
An overhauled GPU

According to Cerny, the GPU powering the PS4 is an ATI Radeon with “a large number of modifications.” From the GPU’s perspective, the large RAM pool doesn’t count as innovative. The PS4 has a unified pool of 8GB of RAM, but AMD’s Graphics Core Next GPU architecture (hereafter abbreviated GCN) already ships with 6GB of GDDR5 aboard workstation cards. The biggest change to the graphics processor is Sony’s modification to the command processor, described as follows:

The original AMD GCN architecture allowed for one source of graphics commands, and two sources of compute commands. For PS4, we’ve worked with AMD to increase the limit to 64 sources of compute commands — the idea is if you have some asynchronous compute you want to perform, you put commands in one of these 64 queues, and then there are multiple levels of arbitration in the hardware to determine what runs, how it runs, and when it runs, alongside the graphics that’s in the system.

http://www.extremetech.com/gaming/1...avily-modified-radeon-supercharged-apu-design

So the Xbox One has the older and unmodified (two) command queues?

Do you have a link to an article about that I could read?

I see the two "Graphics Commands" blocks in vgleaks. Is that it?

http://www.vgleaks.com/durango-gpu-2/
 

So the Xbox One has the older and unmodified (two) command queues?

Do you have a link to an article about that I could read?

I see the two "Compute Command" blocks in vgleaks. Is that it?

http://www.vgleaks.com/durango-gpu-2/

Yes, the two compute command blocks in vgleaks are the two ACEs. If you want information on the ACEs, look at the whitepaper; I cannot give you an article which specifically outlines that it only has two, because the only source at the moment is vgleaks.
 
Is there any actual proof of it requiring significant effort when not used as L3, or is it just an assumption?
As L3 it doesn't need effort and works fine, as proven by Intel, though.

It is an assumption on my part. If it's not transparent to the programmer like a cache, then it has to be accessed by an API or intrinsic of some kind (maybe a smart compiler could use it, but I doubt it). I think if MS decided to release DX12 and embedded RAMs and their associated APIs were a main feature of DX12, then we would see embedded RAM being used, as all DX12 cards would have to have embedded memory.

But unless it's part of a standard, I don't think any developer would spend time on a feature that has minimal market penetration (in the PC space).
 
Based on that:
Important differences between xx and yy GPUs

Multi queue compute
Lets multiple user-level queues of compute workloads be bound to the device and processed simultaneously. Hardware supports up to eight compute pipelines with up to eight queues bound to each pipeline.

System unified addressing
Allows GPU access to process coherent address space.
From the hardware.fr review of Bonaire and what the leaks said about Durango, I would think that Durango is based on GCN 1.1 (which is not AMD nomenclature).

Yes, the two compute command blocks in vgleaks are the two ACEs. If you want information on the ACEs, look at the whitepaper; I cannot give you an article which specifically outlines that it only has two, because the only source at the moment is vgleaks.
Sorry, I may have missed it, but I see only one "compute command block" on that diagram.
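Going back to the "Multi queue compute" item quoted above: eight compute pipelines with up to eight queues each works out to the same 64 compute command sources Cerny described, versus the two that stock GCN 1.0 exposes (two ACEs with one queue each, which is my reading of the Cerny quote). Spelled out:

Code:
# Compute command sources implied by the two descriptions quoted in this thread.
# The "one queue per ACE" reading of GCN 1.0 is an assumption based on Cerny's
# "two sources of compute commands" remark.
configs = {
    "GCN 1.0 (Tahiti-style)": {"compute_pipelines": 2, "queues_per_pipeline": 1},
    "Sea Islands / PS4 claim": {"compute_pipelines": 8, "queues_per_pipeline": 8},
}

for name, cfg in configs.items():
    total = cfg["compute_pipelines"] * cfg["queues_per_pipeline"]
    print(f"{name}: {total} compute command sources")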
 