What are the true structures of the next-gen APUs? *spawn

BeyondTed

Newcomer
From repeatedly looking at the vgleaks documents (both Gen 8 consoles) I can't help but think they are too "high level" and derived from documentation fairly far removed from the sort of documentation that an AMD or Intel would use to describe the design. And I do not just mean the Visio or PowerPoint-like (redrawn) "quality".

For example, I would not expect a diagram of two CPU modules as neither console is architected that way (as far as I know):

http://www.vgleaks.com/durango-cpu-overview/



Instead I would expect to see a diagram of either:

1. Two APU blocks and a Distinct GPU block integrated on the die/SoC (and with the other added blocks)

or

2. One APU and a Distinct GPU block integrated on the die (and with the other added blocks)

or

3. Two APU blocks (or one expanded APU block) with an expanded number of CU in the one or two APU blocks, no Distinct GPU block (and with the other added blocks)



[Note that I am talking about ONE piece of silicon. One SoC with various BLOCKS or MODULES integrated inside that one silicon die/SoC.]



For example, if based upon Kabini (Not GCN 1.0) then scenario 1 (above) would look like:

1. Two Kabini Blocks (2x[4 Jaguar + 2CU]) + Distinct GPU Block

That would be further described as a design with 8 Jaguar cores and 4 CU (2 * 2 CU in the two APU) and 12CU in a distinct GPU block (Xbox One) or 14CU in a distinct GPU block (PS4).
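To make those counts concrete, here is a tiny sketch (Python, purely illustrative; the per-block figures are just the ones assumed above, i.e. 4 Jaguar cores + 2 CU per Kabini-style block and a 12 CU or 14 CU distinct GPU block):

```python
# Totals implied by scenario 1 above, using the figures assumed in this
# post: one Kabini-style APU block = 4 Jaguar cores + 2 CUs, plus a
# distinct GPU block of 12 CUs (Xbox One) or 14 CUs (PS4).
KABINI_BLOCK_CORES = 4
KABINI_BLOCK_CUS = 2

def scenario_1_totals(gpu_block_cus, apu_blocks=2):
    return {
        "jaguar_cores": apu_blocks * KABINI_BLOCK_CORES,
        "apu_cus": apu_blocks * KABINI_BLOCK_CUS,
        "gpu_cus": gpu_block_cus,
        "total_cus": apu_blocks * KABINI_BLOCK_CUS + gpu_block_cus,
    }

print("Xbox One:", scenario_1_totals(12))  # 8 cores, 4 + 12 = 16 CUs
print("PS4:     ", scenario_1_totals(14))  # 8 cores, 4 + 14 = 18 CUs
```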



For example, if based upon Kabini (Not GCN 1.0) then scenario 2 (above) would look like:

2. One "expanded" Kabini Block (1x[8 Jaguar + ?CU]) + Distinct GPU Block

That would be further described as a design with 8 Jaguar cores and ? CU (? CU in the one APU block) and 12CU in a distinct GPU block (Xbox One) or 14CU in a distinct GPU block (PS4).



For example, if based upon Kabini (Not GCN 1.0) then scenario 3 (above) would look like:

3. Two Kabini Blocks (2x[4 Jaguar + Expanded/Greater Than 2CU Each])




What is my point? There are a couple:

1. Why does the vgleaks show CPU and GPU separately as opposed to a Trinity/Llano/Richland AMD style APU block diagram? Do not both consoles contain two small Jaguar *APU* blocks as two of the various modules integrated into the SoC? True APU blocks, with whatever generation level of advantages come from being an APU, and not AMD APU blocks stripped of their CUs? (I think the stripped-down option can be easily rejected.) With one of the two being clearly described as based upon off the shelf components I expect to find two modules inside that one: Each with 4 jaguar plus attached CU (not sure if it is Kabini generation or not).

2. Perhaps this is why there was mention of 4CU+14CU. Is the same question valid for the Xbox One also? Is it one big APU? Is it one or two APU blocks? Is it one or two APU blocks plus a discrete GPU block? [I think two instances of a 4 Jaguar + 2CU block are more likely than one 8 Jaguar plus expanded ? CU block.]

3. Perhaps this is why questions were asked about what could be run on 4CU. (That was the wrong question IMHO; the right question is: can the 14CU do everything the 4CU are capable of? I suspect the answer is no.) I think the 4CU inside the APU blocks would be more capable than the 14CU outside, not the other way around. The 4CU would have some HSA/hUMA advantages (varying depending on the version: Kabini or ?) over the CUs inside the distinct GPU (non-APU block).

4. If 2 & 3 are at all correct and IF they also describe the Xbox One structure (which is why this is on topic and is NOT actually a comparison question or PS4 question) then is the Xbox One a 2x Kabini (or 2x predecessor or successor) PLUS a distinct GPU block? If so, is the distinct GPU block 12CU? That would seem to fit with the vgleaks description of a GPU, as opposed to describing an APU.

http://www.vgleaks.com/durango-gpu-2/



In a way the purpose of this post is to ask the more knowledgeable sources the following question:

When the full block diagram structure is fleshed out accurately at the level of accuracy you would expect from AMD (if that ever happens) what do we expect to see? Two *APU* blocks plus a distinct GPU block? Or not?



I think it matters for a variety of reasons. One would be that some CU in the design have APU-related capabilities/advantages, for example. Another would be that some CU are attached to one set of ACEs (+ Command Processor + Geometry Engine(s) + Rasterizer + Global Data Share + Cache + Etc + Etc) and the others are attached to another set:

http://images.anandtech.com/doci/6837/Bonaire.jpg



What do people think? I think it is fair to expect that the documentation (even if 100% accurate) at vgleaks may never have been intended to describe the level of detail being discussed. Given that it is an unknown source, we don't know the level of expertise of the "technical writer" or "dev" or whoever else did the leak.
 
For example, I would not expect a diagram of two CPU modules as neither console is architected that way (as far as I know):
How can you know? There's only one document that describes the chip topologies, and it says 2x 4-core CPUs connected to CUs. ;)

What is my point? There are a couple:

1. Why does the vgleaks show CPU and GPU separately as opposed to a Trinity/Llano/Richland AMD style APU block diagram?
Because they have discrete CPUs? These aren't mobile parts with a handful of CUs, but new SoCs for consoles. Seems very sensible to me to have the 'CPU' in one area and the 'GPU' in another for simplicity, rather than CPU+CU+more CU.

When the full block diagram structure is fleshed out accurately at the level of accuracy you would expect from AMD (if that ever happens) what do we expect to see? Two *APU* blocks plus a distinct GPU block? Or not?
I'd expect to see two groups of four Jaguar cores, and a block of CUs.

What do people think? I think it is fair to expect that the documentation (even if 100% accurate) at vgleaks may never have been intended to describe the level of detail being discussed.
That is true, as the system architecture documentation is designed to describe it to software engineers, not hardware engineers (hence MS messing about with AMD's component names). If the placement of the CUs is immaterial to their usage, they can be abstracted out of the diagrams. However, if the placement of the CUs makes a difference in how they are used, devs would need to be told this - use these CUs for some stuff, these other ones for something else. The docs aren't saying that for XB1. This isn't the thread to discuss PS4. ;)
 
From repeatedly looking at the vgleaks documents (both Gen 8 consoles) I can't help but think they are too "high level" and derived from documentation fairly far removed from the sort of documentation that an AMD or Intel would use to describe the design. And I do not just mean the Visio or PowerPoint-like (redrawn) "quality".
..

The documentation that vgleaks is based on is what was given to the actual developers, as well as some white papers and other bits and pieces (such as conference documentation). It describes the system in part (for some of the leaks, like the GPU leaks). It also has a system overview diagram (which is the memory diagram; it shows us all the discrete parts of the system).

If there were any other CUs or bits that the XBONE had, they would be in the leak; they are not mentioned, therefore they simply do not exist.
 
The documentation that vgleaks is based on is what was given to the actual developers, as well as some white papers and other bits and pieces (such as conference documentation). It describes the system in part (for some of the leaks, like the GPU leaks). It also has a system overview diagram (which is the memory diagram; it shows us all the discrete parts of the system).

If there were any other CUs or bits that the XBONE had, they would be in the leak; they are not mentioned, therefore they simply do not exist.

So perhaps my first post was too long.

So I will start again with a much smaller and simpler question (regarding the PS4) and then address the Xbox One later, if that still makes sense.



Over in VGLeaks there is this comment about the PS4 CU:


VGLeak on PS4 Hardware said:
About 14 + 4 balance:
- 4 additional CUs (410 Gflops) “extra” ALU as resource for compute
- Minor boost if used for rendering

http://www.vgleaks.com/world-exclusive-orbis-unveiled-2/



That particular comment strikes me as interesting because two "off the shelf" 4 Jaguar + 2CU "APU" blocks combined give 4CU, which would in some way be distinct from the other 14CU.

So my first question is:

Does the "true" or "more detailed/more correct" structure of the PS4 APU consist of two APU blocks integrated together with a 14 CU GPU block?

So two Kabini APUs or similar combined with a 14 CU GPU into one silicon die/SoC.

I say Kabini due to the "off the shelf" comments and the fact that I am not aware of any other "off the shelf" APU containing Jaguar.
 
For example, I would not expect a diagram of two CPU modules as neither console is architected that way (as far as I know):
Jaguar comes in 4 core modules, which would heavily steer the CPU design for the consoles in that direction once they chose it.
Making an 8-core design would be a significant proposition. The quad-core L2 interface is already the size of a Jaguar core, for example.
That would be a large adder in design time, cost, and probably a significant increase in L2 latency.

The VGleaks page goes on to show the latency numbers that reinforce how little was done to screw with the quad-core Jaguar modules. Remote L2 hits have the same ns latency as a memory read on a desktop APU. The memory latency for Durango, which was probably hurt by the remote snoop, is 100ns, or double that of the fastest desktop CPUs.

It makes sense from the perspective that AMD had already gone through years of design and validation for a complex coherent cache interface, and to mess with that would involve an effort that would require redoing a lot of that on a larger scale.

For example, if based upon Kabini (Not GCN 1.0) then scenario 1 (above) would look like:

1. Two Kabini Blocks (2x[4 Jaguar + 2CU]) + Distinct GPU Block
This is unnecessary. The CPU side and GPU side reside on opposite sides of an interface. AMD's goal is a modular IP set. They don't need to modify the CPU portion and GPU portion in lockstep.
Some earlier AMD statements about modularity went so far as to separate the core IP and cache modules.
The L2 interface in Jaguar could allow that, though for these designs I don't think they wanted to mess with that.

1. Why does the vgleaks show CPU and GPU separately as opposed to a Trinity/Llano/Richland AMD style APU block diagram?
Possibly AMD didn't draw the diagram?
At the very least, actual pictures of the APU dies show that the CPU and GPU portions are very distinct physically.
The CPUs and GPU are still treated very distinctly.

With one of the two being clearly described as based upon off the shelf components I expect to find two modules inside that one: Each with 4 jaguar plus attached CU (not sure if it is Kabini generation or not).
AMD's IP offerings are not that inflexible, and splitting the CUs in a dual APU setup would compromise the design by duplicating too much hardware, overcomplicating the GPU interfaces to memory and the CPUs, and compromising that big crossbar setup the CUs share for their big bandwidth numbers.
It would also expose two GPUs to the software.

2. Perhaps this is why there was mention of 4CU+14CU.
The PS4's architect said that was some general guidance for when developers wanted to get the most bang for their buck when it came to top graphics and compute performance.
An actual physical separation did not make sense.
I once theorized there could be a soft division due to system services reserving some of the CU capacity, but that wasn't the case going by Cerny's interview.

The CUs are uniform and identical, per Sony's statements.

When the full block diagram structure is fleshed out accurately at the level of accuracy you would expect from AMD (if that ever happens) what do we expect to see? Two *APU* blocks plus a distinct GPU block? Or not?
One GPU with uniform compute resources, and most likely two 4-core Jaguar modules for both consoles.
 
This is unnecessary. The CPU side and GPU side reside on opposite sides of an interface. AMD's goal is a modular IP set. They don't need to modify the CPU portion and GPU portion in lockstep.
Some earlier AMD statements about modularity went so far as to separate the core IP and cache modules.
The L2 interface in Jaguar could allow that, though for these designs I don't think they wanted to mess with that.

...

The CUs are uniform and identical, per Sony's statements.


One GPU with uniform compute resources, and most likely two 4-core Jaguar modules for both consoles.

OK. So what you are saying is that, since they are on opposite sides of the Unified North Bridge, you can easily copy the 4x Jaguar block and, independently, keep just one CU block on the other side and simply up the count from 4 to 18 CU.

So yes, they don't want to mess with the "blocks", but the blocks in question are at the level of the "core module", not at the level of the "Kabini" or "APU block".

So if that is the case, all 18 CU have the same "class" of APU advantages, likely similar to Kabini or GCN 1.1, if that is a real designator.
 
OK. So what you are saying is that, since they are on opposite sides of the Unified North Bridge, you can easily copy the 4x Jaguar block and, independently, keep just one CU block on the other side and simply up the count from 4 to 18 CU.
It's comparatively easy versus an architecture where the GPU and CPU portions were more closely tied to one another.
The UNB itself is likely another module or series of modules. There are different bandwidth numbers and links in the VGleaks documents for Durango and Orbis, which points to differences in the uncore clocks or what hooks were used.
So long as each individual block is wired to use its side of the interface links, it keeps overall design complexity from exploding every time a CPU block is added or a CU count is changed.

The GPU itself obviously was a target for customization, and it would have sub-modules.
The division is not as stark as the CPU and GPU sides, but AMD's recent imprecise approach to labeling graphics IP generations shows that there is some mixing and matching going on.
There are portions of the GPU that can more readily change without overly affecting the rest of the system.

GCN decouples the CU array from the rest of the architecture, and there is a low-bandwidth hub that miscellaneous elements hang from and that different clients can be readily hooked into.
For areas where there is tighter interoperability, it would take more work and incur disadvantages.
Uniformity of the CUs simplifies matters in terms of scheduling, defect management, and design effort.
 
Layperson here but,

14+4

Onion aka Fusion Compute Link is for snooping the L1/L2 cache of the cpu.

Onion+ is for bypassing the gpu caches.

From vgleaks, both buses are in the PS4 and allow accesses between the CPU and GPU. Both buses share the 10 GB/s of bandwidth in each direction.

But I read that bus snooping doesn't scale well.

Couldn't Onion or Onion+ represent a bottleneck that gives rise to the "14+4" figure?
 
That would depend more on what kind of code is being run. The massive disparity between main memory bandwidth and the Onion bus would indicate that a minority of off-GPU accesses should be coherent.

In terms of cache line moves, a single CU is capable of saturating the coherent link at least twice over.

Data moves at a 64-byte cache line granularity, and the CU's memory pipeline can kick off one memory operation per cycle.
64 bytes at 800 MHz is a little over 50 GB/s. With pure read or pure write traffic, half of the bus is off-limits, pushing the disparity to a factor of five.
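As a rough sanity check on that arithmetic, a small sketch (Python; the 64-byte line, 800 MHz clock, and the ~10 GB/s-per-direction Onion figure quoted earlier in the thread are the inputs, everything else is back-of-the-envelope):

```python
# Back-of-the-envelope version of the numbers above: 64-byte cache lines,
# one memory op per cycle at 800 MHz, versus an Onion bus quoted earlier
# in the thread at roughly 10 GB/s in each direction (~20 GB/s total).
LINE_BYTES = 64
GPU_CLOCK_HZ = 800e6
ONION_PER_DIRECTION = 10e9               # bytes/s, from the vgleaks figure
ONION_TOTAL = 2 * ONION_PER_DIRECTION

cu_line_traffic = LINE_BYTES * GPU_CLOCK_HZ   # ~51.2 GB/s from a single CU

print(f"Single CU line traffic: {cu_line_traffic / 1e9:.1f} GB/s")
print(f"vs. both directions:    {cu_line_traffic / ONION_TOTAL:.1f}x")          # ~2.6x ("at least twice over")
print(f"vs. one direction only: {cu_line_traffic / ONION_PER_DIRECTION:.1f}x")  # ~5.1x ("factor of five")
```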

The closer limit for the Onion bus is what the Jaguar module can take in a cycle.
 
That would depend more on what kind of code is being run. The massive disparity between main memory bandwidth and the Onion bus would indicate that a minority of off-GPU accesses should be coherent.

In terms of cache line moves, a single CU is capable of saturating the coherent link at least twice over.

Data moves at a 64-byte cache line granularity, and the CU's memory pipeline can kick off one memory operation per cycle.
64 bytes at 800 MHz is a little over 50 GB/s. With pure read or pure write traffic, half of the bus is off-limits, pushing the disparity to a factor of five.

The closer limit for the Onion bus is what the Jaguar module can take in a cycle.


What of maintaining coherency? Given that the PS4 GPU is mostly going to perform traditional workloads, do you need to maintain coherency between the GPU and the caches in the CPU? Wouldn't you want to restrict the number of cache controllers involved? From my reading, any write to shared data by the CPU would be broadcast (write invalidate) over Onion, and any cache controller involved would check for the invalidated address in its cache. But couldn't that potentially introduce contention across the GPU caches?
 
What of maintaining coherency? Given that the PS4 GPU is mostly going to perform traditional workloads, do you need to maintain coherency between the GPU and the caches in the CPU?
Coherent traffic is handled by the Onion bus, with throughput that is on the same order as what the CPU can handle.
If by traditional workload you mean rendering, there's the full-bandwidth non-coherent Garlic bus.

The coherent bus would be reserved for a minority of the traffic, including work that cannot tolerate the full latency of the GPU memory subsystem, and command/synchronization traffic. It may be possible to do the bulk of the compute traffic over the Garlic bus, and then set a completion flag over the Onion bus once done.
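As a purely illustrative toy model of that split (Python; a plain list stands in for a Garlic-visible buffer and a threading.Event stands in for the coherent flag, so none of this is a real PS4 or AMD API):

```python
# Toy model of "bulk results over the non-coherent (Garlic) path, one
# small completion flag over the coherent (Onion) path". A list stands in
# for the bulk buffer and a threading.Event for the coherent flag; this
# is an analogy only, not any real console API.
import threading

results = [0] * 1024          # stand-in for a buffer written via Garlic
done = threading.Event()      # stand-in for the single coherent flag

def gpu_job():
    # "Bulk" writes: high bandwidth, nothing needs to snoop these.
    for i in range(len(results)):
        results[i] = i * i
    # One coherent signal once everything has been written.
    done.set()

def cpu_consumer():
    # The CPU only watches one flag, so coherent traffic stays tiny.
    done.wait()
    print("sum of results:", sum(results))

worker = threading.Thread(target=gpu_job)
worker.start()
cpu_consumer()
worker.join()
```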

Wouldn't you want to restrict the number of cache controllers involved?
The primary coherence endpoint for Jaguar is the L2. The L2 interface in particular is responsible for managing it.
The primary coherence structure for the GPU is the L2. Possibly the interface or memory subsystem that tracks its misses would be what tracks what accesses need to be coherent.

The Jaguar cores and CUs don't snoop each other; it's all L2 misses, with the Onion bus and (I believe) the IOMMU as the intermediaries.
That's many fewer clients than there are cores or CUs.


From my reading, any write to shared data by the CPU would be broadcast (write invalidate) over Onion, and any cache controller involved would check for the invalidated address in its cache. But couldn't that potentially introduce contention across the GPU caches?
AMD has promised probe filters and the like for fully HSA-enabled APUs. I'm not sure what would be present for Orbis or Durango.
As far as the GPU caches go, the GPU memory subsystem is capable of handling way more contention than the L2 interface will permit out of the Jaguar modules. The separate coherent and non-coherent bus setup isn't there because the GPU can't handle the traffic.
 
Is it just me, or does BeyondTed sound a lot like astrograd?

As I've said before, the 14+4 thing has nothing to do with differences at the hardware level and is just a suggestion that was made to devs on how they might like to split up their graphics and compute work on the GPU.
 
Is it just me, or does BeyondTed sound a lot like astrograd?

As I've said before, the 14+4 thing has nothing to do with differences at the hardware level and is just a suggestion that was made to devs on how they might like to split up their graphics and compute work on the GPU.

I think we've had enough of the 14+4 indeed. Also, as Cerny stated, and if I understood correctly, they've upped the compute queues to 64 so compute jobs could be deferred until the silicon becomes available, to make sure they could use the CUs to their fullest. If there were some fixed CUs for specific tasks, would they have needed so many queues?
 
As I've said before, the 14+4 thing has nothing to do with differences at the hardware level and is just a suggestion that was made to devs on how they might like to split up their graphics and compute work on the GPU.

3dilettante seemed to understand the question and answered it.

It didn't seem obvious to me that "off the shelf" would mean splitting up/modifying the blocks inside Kabini at the core/UNB/CU level, as opposed to leaving the whole module alone and adding the GPU.

Cerny did say the GPU was integrated inside the SoC, and so I assumed, given that statement and the "off the shelf" comments, that a GPU and *the whole* Kabini were put together on the one die. The 4CU + 14CU just seemed to fit that like a glove, since 2x Kabini gives 4CU.
 
As I've said before, the 14+4 thing has nothing to do with differences at the hardware level and is just a suggestion that was made to devs on how they might like to split up their graphics and compute work on the GPU.

Except the part where the 4 CUs have slightly different hardware.
 
They don't.

If they don't then my question is answered.

If they do then that gets back to the other things I have heard leading to the questions. (And then other questions follow.)

But if it is ONE block of 18CU and all 18CU are exactly the same then there is no question.
 