AMD Kaveri APU features the Onion+ bus like the PlayStation 4

patsu · Oct 9, 2013

Coherent access may not need CPU or GPU power boost at all. It's a communication thing.
e.g. One may batch their work on the same CPU to reduce external "sharing".

temesgen · Oct 9, 2013

patsu said:
Coherent access may not need CPU or GPU power boost at all. It's a communication thing.
e.g. One may batch their work on the same CPU to reduce external "sharing".

But there is some overhead; how does that impact things? Are we talking about bandwidth, and couldn't there be a multiplier on latency due to both processors working on the same thing?

patsu · Oct 9, 2013

I'm just saying it is sometimes possible to reduce the sharing frequency by batching up your work, so you sync them (only) at controlled intervals.
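As a rough illustration of the batching idea above, here is a minimal Python sketch. It is only an analogy: a lock-protected counter stands in for state shared between processors, and `SharedCounter`, `worker_unbatched`, `worker_batched` and `run` are hypothetical names made up for this example, not anything from AMD or the console SDKs. The point is just that the number of synchronizations drops when each worker accumulates privately and publishes at a fixed interval.

```python
import threading


class SharedCounter:
    """Shared state guarded by a lock; also counts how often it is synchronized."""

    def __init__(self):
        self.value = 0
        self.sync_count = 0
        self._lock = threading.Lock()

    def add(self, amount):
        with self._lock:
            self.value += amount
            self.sync_count += 1


def worker_unbatched(counter, items):
    for x in items:
        counter.add(x)  # one synchronization per item


def worker_batched(counter, items, batch_size):
    local = 0
    pending = 0
    for x in items:
        local += x  # private work, nothing shared yet
        pending += 1
        if pending == batch_size:
            counter.add(local)  # sync only at a controlled interval
            local = 0
            pending = 0
    if pending:
        counter.add(local)  # publish any leftover partial batch


def run(worker, n_threads=4, n_items=1000, **kwargs):
    """Run the worker on several threads; return (final value, number of syncs)."""
    counter = SharedCounter()
    threads = [
        threading.Thread(target=worker, args=(counter, range(n_items)), kwargs=kwargs)
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value, counter.sync_count


if __name__ == "__main__":
    print("unbatched:", run(worker_unbatched))
    print("batched:  ", run(worker_batched, batch_size=100))
```

Both variants end with the same total; the batched one simply touches the shared state far less often, which is the trade being described (less sharing traffic in exchange for the other side seeing staler data between syncs).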

The fact that you need a GPU to do your work may be because that type of work is more suitable for a GPU. Sometimes, having a bigger CPU may or may not help unless it's a specialized one.

dobwal · Oct 9, 2013

Correct me if I am mistaken, because I am going off memory, but the ability of an AMD APU's gpu to read and write into the cacheable memory of the cpu has existed since Llano.

And 30 GB/s of coherent access between the gpu and cpu is not cache related. Durango's performance drops to 15 GB/s when it comes to actually hitting the cpu caches.

I wonder if 30 GB/s of coherent access is MS's way of describing AMD's zero copy path. You are just logically transferring control of the memory buffer between the gpu and cpu, leading to fast transfer rates.
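To make the "logically transferring control of the buffer" idea concrete, here is a small host-side analogy in Python. It is not AMD's zero copy path and involves no GPU; `produce` and `consume` are hypothetical names for this sketch. It only shows the general pattern: the producer hands over a view of the same storage instead of copying the bytes.

```python
def produce(buf: bytearray) -> memoryview:
    """'CPU' side: fill the buffer in place, then hand over a view of it (no copy)."""
    for i in range(len(buf)):
        buf[i] = i % 256
    return memoryview(buf)


def consume(view: memoryview) -> int:
    """'GPU' side: read the shared storage through the view."""
    return sum(view)


if __name__ == "__main__":
    buf = bytearray(1024)
    view = produce(buf)
    print("same storage:", view.obj is buf)
    print("checksum:", consume(view))
```

Because both sides refer to the same bytes, the "transfer" costs nothing in itself; what remains is agreeing on who may touch the buffer when, which is where the coherency questions in this thread come in.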

taisui · Oct 9, 2013

dobwal said:
And 30 GB/s of coherent access between the gpu and cpu is not cache related. Durango's performance drops to 15 GB/s when it comes to actually hitting the cpu caches.

I read about this in the leak and I'm not sure what to make of it...isn't a cache hit better than a miss? why would this incur a performance drop?

dobwal · Oct 9, 2013

taisui said:
I read about this in the leak and I'm not sure what to make of it...isn't a cache hit better than a miss? why would this incur a performance drop?

My guess is that cache hits and cache misses use different pathways to the cpu memory.

Wouldn't you pin any portion of the cpu cache that needed to maintain coherency with the gpu? Instead of flushing that data to the cacheable portion of system memory, could you flush the data using the write-combining buffer to uncached system memory, which the gpu could access over Garlic?

3dilettante · Oct 9, 2013

dobwal said:
Wouldn't you pin any portion of the cpu cache that needed to maintain coherency with the gpu?

The Onion bus has over 10 GB/s of bandwidth, and there's no restriction that traffic over it fall within a 2MB window.
At any rate, AMD's x86 cache architecture doesn't pin things.

dobwal said:
Instead of flushing that data to the cacheable portion of system memory, could you flush the data using the write-combining buffer to uncached system memory, which the gpu could access over Garlic?

If you want bandwidth, the page table entry would be assigned the necessary attributes for Garlic.
If you meant taking data going over Onion and putting it in the non-coherent write combining buffers, that would then require just throwing away the APU when it comes out of the factory.

dobwal · Oct 9, 2013

3dilettante said:
If you meant taking data going over Onion and putting it in the non-coherent write combining buffers, that would then require just throwing away the APU when it comes out of the factory.

No, I meant data that is being flushed from the cache to main memory. Under normal circumstances, wouldn't servicing cache misses be slower?

In the AMD zero copy presentation, Llano's gpu reads and writes to cacheable memory were limited to 4.5-5.5 GB/s. AMD presented a data path figure for gpu accesses to cacheable memory which went over the UNB to L2, back to the UNB, then to system memory.

Meanwhile, access to uncacheable memory was 6-12 GB/s, and some midrange Llano zero copy transfer rates hit 15 GB/s.
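For a feel of what those rates mean in practice, here is a back-of-the-envelope sketch. The rates are the ones quoted in this post; the 64 MiB buffer size is an arbitrary assumption, 1 GB is taken as 1e9 bytes, and `transfer_time_ms` and `table` are hypothetical helper names. It assumes the rate is sustained for the whole transfer, which real hardware will not guarantee.

```python
def transfer_time_ms(size_bytes: int, rate_gb_per_s: float) -> float:
    """Milliseconds to move size_bytes at a sustained rate (1 GB = 1e9 bytes)."""
    return size_bytes * 1000 / (rate_gb_per_s * 1e9)


# Rates quoted in the post (GB/s); labels are just for this example.
QUOTED_RATES = {
    "cacheable, low": 4.5,
    "cacheable, high": 5.5,
    "uncacheable, low": 6.0,
    "uncacheable, high": 12.0,
    "zero copy, peak": 15.0,
}

# Assumed buffer size for illustration only.
BUFFER_BYTES = 64 * 1024 * 1024


def table(size_bytes: int = BUFFER_BYTES) -> dict:
    """Transfer time in ms for each quoted rate."""
    return {name: transfer_time_ms(size_bytes, rate) for name, rate in QUOTED_RATES.items()}


if __name__ == "__main__":
    for name, ms in table().items():
        print(f"{name:>18}: {ms:6.2f} ms")
```

At these rates the gap between the slowest cacheable figure and the peak zero copy figure is a bit more than 3x for the same buffer, which is why the choice of path matters for bulk data.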

I was wondering if it was possible that coherent data that initially resided in the cpu cache but got flushed to main memory could be moved in a way that allows the gpu faster access to that data.

Basically, would the increase in bandwidth indicate that cache misses aren't going over Onion? Could the data end up in zero copy buffers where transfer rates are much better? I mentioned the WC buffers because that seems like the only way the cpu can write to non-cacheable memory, but could DMAs be employed to move data into the buffers?

3dilettante · Oct 9, 2013

dobwal said:
No, I meant data that is being flushed from the cache to main memory. Under normal circumstances, wouldn't servicing cache misses be slower?

I'm afraid I'm not following you on this.

dobwal said:
In the AMD zero copy presentation, Llano's gpu reads and writes to cacheable memory were limited to 4.5-5.5 GB/s. AMD presented a data path figure for gpu accesses to cacheable memory which went over the UNB to L2, back to the UNB, then to system memory.

Not quite. The arrow goes over the cores and L2, which makes sense since Llano's L1s are exclusive and would need to be snooped as well.

dobwal said:
Meanwhile, access to uncacheable memory was 6-12 GB/s, and some midrange Llano zero copy transfer rates hit 15 GB/s.

Assuming we're talking about GPU accesses, that's Garlic.

dobwal said:
I was wondering if it was possible that coherent data that initially resided in the cpu cache but got flushed to main memory could be moved in a way that allows the gpu faster access to that data.

The CPU has no idea what data the GPU might want, and in the interests of latency CPU cache writeback data isn't going to hang around on chip for very long.
The problem isn't that it's not fast enough. The problem is that writeback is too quick, and you can't buffer enough data to keep a wide enough data window for a GPU access that might be hundreds of cycles out. That would be a (large) cache.

Betanumerical · Oct 10, 2013

taisui said:
AFAIK, that type of access to CPU require the use of Onion+ bus.

It doesn't require the use of either the Onion or Onion+ bus; the Onion/Onion+ bus is used to make coherent communication between the GPU and the CPU more efficient.

If you want more bandwidth and coherency, there are ways to achieve it, as has been mentioned multiple times in this thread. Which bus you use is highly dependent on what you want to do, how tightly coupled the data between the CPU and the GPU is, and how much bandwidth you need.

dobwal said:
I was wondering if it was possible that coherent data that initially resided in the cpu cache but got flushed to main memory could be moved in a way that allows the gpu faster access to that data.

I was thinking about this too, depending on the amount of data it might be worthwhile to flush the cache lines that are holding the data you want in the CPU (if possible).

onQ · Oct 10, 2013

Could we be looking at the PS4 SoC & not even know it?

“We are looking at an architecture where the bulk of processing will still sit on the main board, with CPU and graphics added to by more digital signal processing and some configurable logic.”

adev · Oct 10, 2013

onQ said:
Could we be looking at the PS4 SoC & not even know it?

“We are looking at an architecture where the bulk of processing will still sit on the main board, with CPU and graphics added to by more digital signal processing and some configurable logic.”

Aren't we fairly confident that the systems are just AMD Fusion based?

Would anyone be surprised by that?

onQ · Oct 10, 2013

adev said:
Aren't we fairly confident that the systems are just AMD Fusion based?

Would anyone be surprised by that?

I'm talking about the DSP & Fixed Function Accelerator helping with processing like in the interview with Sony's Ex CTO a year ago.

pMax · Oct 10, 2013

3dilettante said:
The Onion bus has over 10 GB/s of bandwidth, and there's no restriction that traffic over it fall within a 2MB window.

Would you mind commenting more on this? Why are you referring to a 2MB window? I am missing something...

onQ · Oct 10, 2013

I really think that's the PS4 SoC

Fixed Function Accelerator = Vector Co-processor

DSP = Speech Recognition Processor ?

Image signal processing = Head & Hand tracking processor for PS4 Eye

& the rest CPU , GPU & Video \ Audio encoding & decoding.

Rangers · Oct 10, 2013

onQ said:
Fixed Function Accelerator = Vector Co-processor

Huh? Wouldn't that be quite the shocker? Why hasn't Sony spoken about this?

Seems like the newest incarnation of secret sauce.

BoardBonobo · Oct 10, 2013

Rangers said:
Huh? Wouldn't that be quite the shocker? Why hasn't Sony spoken about this?

Seems like the newest incarnation of secret sauce.

They could always be keeping their full deck hidden. After all, they are under no compulsion at all to try one-upmanship with MS over hardware. They've already got that in the bag.

Perhaps they are leaving stuff out so that they can gain another broadside of positive PR if they combat the 'let the games speak for themselves' challenge from MS.

There's only 4 weeks to go before it all becomes clear anyway!

Shifty Geezer · Oct 10, 2013

BoardBonobo said:
They could always be keeping their full deck hidden. After all, they are under no compulsion at all to try one-upmanship with MS over hardware. They've already got that in the bag.

When your enemy is down, that's when you deal the killing blow. You don't give them a chance to recover. Sony took the performance advantage line and ran with it. MS has dealt a couple of come-back blows. If Sony have something to retaliate with, it behoves them to use that. There's something to be gained from telling the world that your hardware has a capable DSP that'll add to the experience. There's nothing to be gained from withholding that info.

I guess the Image Signal Processor is simply the two-layer blend and scale. The audio processor is the audio decode+mix block. The DSP doesn't fit anything I know of, but it can't be anything substantial. If it is, that's very weird that none of the leaks have spoken of it and none of the Sony engineers have spoken of it.

adev · Oct 10, 2013

The only thing I can relate it to is the zlib block.

dumbo11 · Oct 10, 2013

adev said:
The only thing I can relate it to is the zlib block.

AFAIR the zlib block is supposed to be on the secondary chip/southbridge (which is a very interesting idea... transparent disk/blu-ray compression or compression of network traffic?).

AMD did indicate that they integrated some form of 'Sony IP' into the Liverpool APU, but the most logical assumption is that this is something related to audio/video record/playback.

Anyway, at this point "the proof of the pudding is in the eating". If either console had game-related 'secret sauce' then we should be seeing it in demos, and there is no obvious sign of anything major for either console.
