(Alleged) Orbis vs Durango memory subsystem comparison

McHuj · Mar 26, 2013

Vgleaks put up a new article on the PS4 architecture evolution. It includes an interesting picture of the memory subsystem that shows the breakdown of the memory buses.

http://www.vgleaks.com/playstation-4-architecture-evolution-over-time/

PS4:

Durango:

(If these are accurate) One thing that jumps out at me is that the CPU on Durango will have much higher bandwidth available. The coherent access between GPU and CPU is much faster on Durango as well.

I would assume that what Sony has for the CPU is sufficient (unless it's supposed to be 20 GB/sec per module), so I wonder why MS would need 2x the bandwidth. Perhaps if the GPU can execute entirely out of the ESRAM (with some streaming consuming the 25.6 GB/sec of the DMAs), the CPUs can saturate the rest of the bandwidth from main memory (20.8 + 20.8 + 25.6 = 67.2 GB/sec).

Seems to me that this would be a better model for full utilization of all resources.
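The sum above can be checked with a few lines of Python. This is only a sketch: the figures are the ones quoted from the leaked diagram, the reading of the two 20.8 GB/sec terms as one link per CPU module is an assumption, and the names are made up for illustration.

```python
# Figures as quoted in the post, in GB/sec. The leak itself is unconfirmed.
cpu_module_links = [20.8, 20.8]  # assumed: one link per CPU module
dma_streaming = 25.6             # DMA engines streaming into ESRAM


def total_main_memory_demand(cpu_links, dma):
    """Sum the consumers the post imagines sharing main memory."""
    return round(sum(cpu_links) + dma, 1)


total = total_main_memory_demand(cpu_module_links, dma_streaming)
print(f"{total} GB/sec")
```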

function · Mar 26, 2013

Well if that's true then we know some of where MS spent all those years doing expensive customisation: memory and interconnects.

If MS have engineered in 50% more CPU <--> GPU BW then they must be anticipating that the CPU will be feeding the GPU lots of data (or possibly receiving it). But double the CPU <--> main ram BW? That's a mighty big difference...

... 256 bit vector units? :runaway:

Shifty Geezer · Mar 26, 2013

I'm hesitant to allow this thread as we're not comparing yet, but let's give it a go and only close it if the fanbots start trolling or talking business.

At first I see this as Orbis being BW limited on the CPU. But then I compare it to PS3. PS3 had 8 CPU cores with lots more float power on a more limited bus. Perhaps 20 GBps is all that's needed for the CPU and the limit is a sensible cost saving?

function said:
But double the CPU <--> main ram BW? That's a mighty big difference...

I think it's 3x. There are no terminal arrows on the Onion bus (really, Onion and Garlic??) at the CPU. Onion appears to be a CPU passthrough bus with optional access to the CPU's L1 and L2 caches. Snooping the caches at the L1 level sounds like very close CPU/GPU integration to me.

Arksine · Mar 26, 2013

McHuj said:
Vgleaks put up a new article on the PS4 architecture evolution. It includes an interesting picture of the memory subsystem that shows the breakdown of the memory buses.

http://www.vgleaks.com/playstation-4-architecture-evolution-over-time/

PS4:

Durango:

(If these are accurate) One thing that jumps out at me is that the CPU on Durango will have much higher bandwidth available. The coherent access between GPU and CPU is much faster on Durango as well.

I would assume that what Sony has for the CPU is sufficient (unless it's supposed to be 20 GB/sec per module), so I wonder why MS would need 2x the bandwidth. Perhaps if the GPU can execute entirely out of the ESRAM (with some streaming consuming the 25.6 GB/sec of the DMAs), the CPUs can saturate the rest of the bandwidth from main memory (20.8 + 20.8 + 25.6 = 67.2 GB/sec).

Seems to me that this would be a better model for full utilization of all resources.

There are some major differences. It appears that there is a dedicated bus between the GPU and CPU in Orbis. Durango goes through the North Bridge to reach the CPU L2, and it shares that bandwidth with everything else in the North Bridge. Latency between the CPU and GPU will probably be lower in Orbis, particularly since the Orbis GPU can bypass its caches with Onion+.

So it comes down to whether you are bandwidth limited at 10GB/s between the CPU and GPU, or at 20GB/s between the CPU and main memory. I kind of doubt it; it's not like these CPUs are powerhouses. Not even Intel's fastest desktop processors can saturate 20GB/s between the CPU and RAM.
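For a rough sense of scale, theoretical peak DRAM bandwidth is the transfer rate times the bus width times the channel count. A minimal sketch, assuming dual-channel DDR3-1600 as a desktop configuration of the time (that configuration is an assumption, not something taken from the diagrams, and the function name is made up):

```python
def peak_bandwidth_gbs(transfers_mt_s, bus_bits=64, channels=1):
    """Theoretical peak in GB/s: MT/s * bytes per transfer * channels."""
    bytes_per_transfer = bus_bits / 8
    return transfers_mt_s * bytes_per_transfer * channels / 1000


# Assumed example: two 64-bit channels of DDR3-1600.
desktop_peak = peak_bandwidth_gbs(1600, bus_bits=64, channels=2)
print(f"{desktop_peak:.1f} GB/s")
```

This is a ceiling on paper only; sustained bandwidth measured from a CPU core is lower.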

Silent_Buddha · Mar 26, 2013

Arksine said:
There are some major differences. It appears that there is a dedicated bus between the GPU and CPU in Orbis. Durango goes through the North Bridge to reach the CPU L2, and it shares that bandwidth with everything else in the North Bridge. Latency between the CPU and GPU will probably be lower in Orbis, particularly since the Orbis GPU can bypass its caches with Onion+.

So it comes down to whether you are bandwidth limited at 10GB/s between the CPU and GPU, or at 20GB/s between the CPU and main memory. I kind of doubt it; it's not like these CPUs are powerhouses. Not even Intel's fastest desktop processors can saturate 20GB/s between the CPU and RAM.

It's a simplified diagram for Orbis. It is highly unlikely that memory traffic won't be passing through the "northbridge" when going from CPU core to main memory or GPU core, as the "northbridge" is part of the CPU module.

In the Durango diagram the "CPU module" is basically the 2 CPU modules + Northbridge while the "GPU module" is the GPU + GPU memory system. Orbis will be the same with the "GPU memory system" handling memory accesses for the "GPU".

Regards,
SB

3dilettante · Mar 26, 2013

I'm not certain the overall uncore organization and bus arrangement between the GPU, CPU, and system memory are entirely different between Orbis and Durango.

The Onion and Garlic links are part of the uncore and would be the arrow between the GPU and northbridge and the arrow between the GPU and DDR3 in the Durango diagram.

There could be a variation in the bus speeds and widths, however.

Arksine · Mar 26, 2013

3dilettante said:
I'm not certain the overall uncore organization and bus arrangement between the GPU, CPU, and system memory are entirely different between Orbis and Durango.

The Onion and Garlic links are part of the uncore and would be the arrow between the GPU and northbridge and the arrow between the GPU and DDR3 in the Durango diagram.

There could be a variation in the bus speeds and widths, however.

After researching it further, it appears you are correct. What is interesting is that it's using the same terminology AMD used two years ago in their Llano Fusion devices. The only difference is the addition of Onion+.

http://amddevcentral.com/afds/assets/presentations/1004_final.pdf

3dilettante · Mar 26, 2013

Onion+ might be something not available to Llano's VLIW5 GPU, which AMD's presentation was based on. GCN's read/write cache structure is different, and that actually creates the need and means to bypass them, although they'd still go out on the same Onion bus.

What questions I do have are whether Durango actually gives a full 20 GB/s port to each CPU cluster, or if the Orbis diagram omitted a similar change.
The Orbis diagram tracks somewhat with the numbers put forward for Llano, whereas Durango's Onion bus numbers might correspond to something stronger or more recent, while Orbis has a significantly larger Garlic bus.

I hope there's some full disclosure on the architectures at some point. Part of my uncertainty is that I don't know how much of the Vgleaks data is being relayed by people who know what any of the words they put out mean. That we keep getting "different" diagrams for the same things makes me think not much.

Narishma · Mar 26, 2013

Indeed, it looks like Onion+ is the one thing they added compared to your typical PC APU; the rest looks pretty much the same, including the same names (Onion and Garlic).

Here's a detailed analysis of Llano's architecture: http://www.realworldtech.com/fusion-llano/2/

czekon · Mar 27, 2013

I don't understand much of this, so looking at those diagrams, does Durango have the potential to be an effective system compared to Orbis? I don't mean more powerful, but effective at handling tasks. Is it a clever design? Or could it produce more bottlenecks than it tries to fix?

fellix · Mar 27, 2013

Arksine said:
So it comes down to whether you are bandwidth limited at 10GB/s between the CPU and GPU, or at 20GB/s between the CPU and main memory. I kind of doubt it; it's not like these CPUs are powerhouses. Not even Intel's fastest desktop processors can saturate 20GB/s between the CPU and RAM.

Since SNB, a single thread can pretty much max out the available bandwidth in burst reads/writes. Of course, Intel's memory pipeline is quite a bit more sophisticated than anything AMD can offer, methinks.

itsmydamnation · Mar 27, 2013

fellix said:
Since SNB, a single thread can pretty much max out the available bandwidth in burst reads/writes. Of course, Intel's memory pipeline is quite a bit more sophisticated than anything AMD can offer, methinks.

I hear this a lot, but I can't really replicate it on my ESXi rig with an 8350. Intel can get higher bandwidth to a single core, but the second I fire up a different memory bandwidth test on another guest VM and tie it to another module, I can get very close to max theoretical bandwidth. On latency, though, Intel is quite a bit faster than AMD.
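A crude single-thread copy test along these lines can be sketched in Python. This is not the tool used above (none is named), and interpreter overhead means it will understate what a tuned native benchmark reports; the function name and buffer sizes are arbitrary choices for illustration.

```python
import time


def copy_bandwidth_gbs(size_bytes=64 * 1024 * 1024, repeats=5):
    """Time a large buffer copy and return the best observed GB/s."""
    src = bytearray(size_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst = bytes(src)  # one full read pass plus one full write pass
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    assert len(dst) == size_bytes
    # Count bytes read plus bytes written; guard against a zero timer reading.
    return 2 * size_bytes / max(best, 1e-9) / 1e9


print(f"{copy_bandwidth_gbs():.1f} GB/s (single thread, copy)")
```

Running one copy of this per core or per VM, pinned to different modules, is the kind of setup the post describes for approaching the aggregate limit.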

onQ · Mar 27, 2013

Remember the rumor from Blacken00100 at the arstechnica forums?

Seems like the person who was giving him the info mixed things up.

Blacken00100 said:
So, a couple of random things I've learned:

-It's not stock x86; there are eight very wide vector engines and some other changes. It's not going to be completely trivial to retarget to it, but it should shut up the morons who were hyperventilating at "OMG! 1.6 JIGGAHURTZ!".

-The memory structure is unified, but weird; it's not like the GPU can just grab arbitrary memory like some people were thinking (rather, it can, but it's slow). They're incorporating another type of shader that can basically read from a ring buffer (supplied in a streaming fashion by the CPU) and write to an output buffer. I don't have all the details, but it seems interesting.

-As near as I'm aware, there's no OpenGL or GLES support on it at all; it's a lower-level library at present. I expect (no proof) this will change because I expect that they'll be trying to make a play for indie games, much as I'm pretty sure Microsoft will be, and trying to get indie developers to go turbo-nerd on low-level GPU programming does not strike me as a winner.
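The ring-buffer arrangement described in the quote (CPU streams data in, a shader reads it and writes to an output buffer) can be pictured with a toy producer/consumer loop. This is a sketch of the general data flow only: the class, the capacity, and the doubling "shader" step are all invented for illustration and say nothing about the actual hardware.

```python
class RingBuffer:
    """Fixed-capacity FIFO using wrap-around head and tail indices."""

    def __init__(self, capacity):
        self._slots = [None] * capacity
        self._head = 0   # next slot to read
        self._tail = 0   # next slot to write
        self._count = 0

    def __len__(self):
        return self._count

    def push(self, item):
        """Store item; return False if the buffer is full."""
        if self._count == len(self._slots):
            return False
        self._slots[self._tail] = item
        self._tail = (self._tail + 1) % len(self._slots)
        self._count += 1
        return True

    def pop(self):
        """Remove and return the oldest item; raise if empty."""
        if self._count == 0:
            raise IndexError("pop from empty ring buffer")
        item = self._slots[self._head]
        self._head = (self._head + 1) % len(self._slots)
        self._count -= 1
        return item


def stream_through(inputs, capacity=4):
    """Producer fills the ring, consumer drains it into an output list."""
    ring = RingBuffer(capacity)
    pending = list(inputs)
    output = []
    while pending or len(ring):
        # "CPU" side: push until the ring is full or the input runs out.
        while pending and ring.push(pending[0]):
            pending.pop(0)
        # "Shader" side: consume everything available (stand-in work: x2).
        while len(ring):
            output.append(ring.pop() * 2)
    return output


print(stream_through(range(10)))
```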

patsu · Mar 27, 2013

3dilettante said:
Onion+ might be something not available to Llano's VLIW5 GPU, which AMD's presentation was based on. GCN's read/write cache structure is different, and that actually creates the need and means to bypass them, although they'd still go out on the same Onion bus.

What questions I do have are whether Durango actually gives a full 20 GB/s port to each CPU cluster, or if the Orbis diagram omitted a similar change.
The Orbis diagram tracks somewhat with the numbers put forward for Llano, whereas Durango's Onion bus numbers might correspond to something stronger or more recent, while Orbis has a significantly larger Garlic bus.

I hope there's some full disclosure on the architectures at some point. Part of my uncertainty is that I don't know how much of the Vgleaks data is being relayed by people who know what any of the words they put out mean. That we keep getting "different" diagrams for the same things makes me think not much.

Are Onion and Onion+ accesses mutually exclusive? Can the CPU use both at the same time?

Cyan · Jan 23, 2016

The Park developer has said that the Xbox One and PS4 memory limitations are not optimum for developers.

http://gamingbolt.com/ps4xbox-one-8gb-memory-is-not-optimal-situation-for-devs-the-park-dev

Funcom creative director Joel Bylos, who is working on bringing The Park to current gen platforms, is a bit iffy on whether it will last for the next 5 odd years. “Hmmm, that’s a difficult question. It will last because it has to. Is it an optimal situation to put developers in? Absolutely not.

“And VR is coming – that’s going to come with hefty requirements. Who knows what will happen in the end? There are rumors that the PS4 VR has a separate box with unknown hardware inside. Maybe they’ve added a little extra hardware to help with the requirements.”

doom456 · Jan 25, 2016

Cyan said:
The Park developer has said that the Xbox One and PS4 memory limitations are not optimum for developers.

http://gamingbolt.com/ps4xbox-one-8gb-memory-is-not-optimal-situation-for-devs-the-park-dev

The same was said about last-gen systems and older consoles, yet when another dev gets a game like this working, or something close, they seem to go all quiet or change their tune. Heck, the game doesn't even look that demanding; this is just a cop-out.

tuna · Jan 25, 2016

What would the 'optimum' be then?

Deleted member 11852 · Jan 25, 2016

tuna said:
What would the 'optimum' be then?

Exactly. Optimum would be the exact amount of memory you need at any instant, and this can change thousands of times during the generation of a frame, let alone frame to frame. I fully expect Sony and Microsoft to employ MagicMemory(tm) in the next generation of consoles. It would be lazy not to! :yep2:
