AMD Kaveri APU features the Onion+ bus like the PlayStation 4

Coherent access may not need CPU or GPU power boost at all. It's a communication thing.
e.g. One may batch their work on the same CPU to reduce external "sharing".
 
Coherent access may not need CPU or GPU power boost at all. It's a communication thing.
e.g. One may batch their work on the same CPU to reduce external "sharing".

But there is some overhead; how does that impact things? Are we talking about bandwidth, and couldn't latency be multiplied when both processors are working on the same data?
 
I'm just saying it is sometimes possible to reduce the sharing frequency by batching up your work, so you sync them only at controlled intervals.
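
To make the batching idea concrete, here is a minimal Python sketch that only counts sync points. It models nothing about Onion, Garlic or real hardware, and `count_syncs` and its parameters are names I made up for illustration. The assumption is that every hand-off of a batch from one processor to the other costs exactly one sync.

```python
def count_syncs(num_items, batch_size):
    """Count sync points when work items are handed over in batches.

    Hypothetical model: one sync per full batch, plus one for any
    leftover partial batch at the end.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    syncs = 0
    pending = 0
    for _ in range(num_items):
        pending += 1
        if pending == batch_size:
            syncs += 1      # hand the full batch over, sync once
            pending = 0
    if pending:
        syncs += 1          # flush the final partial batch
    return syncs


unbatched = count_syncs(1000, 1)    # sync after every item
batched = count_syncs(1000, 64)     # sync only at batch boundaries
print(unbatched, batched)
```

The work done is the same in both cases; only the number of times the two sides have to agree on shared state changes.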

The fact that you need a GPU to do your work may be because that type of work is more suitable for a GPU. Sometimes, having a bigger CPU may or may not help unless it's a specialized one.
 
Correct me if I am mistaken, because I am going from memory, but the ability of an AMD APU's GPU to read and write the CPU's cacheable memory has existed since Llano.

And 30 GB/s of coherent access between the GPU and CPU is not cache related. Durango's performance drops to 15 GB/s when it comes to actually hitting the CPU caches.

I wonder if 30 GB/s of coherent access is MS's way of describing AMD's zero copy path. You are just logically transferring control of the memory buffer between the GPU and CPU, leading to fast transfer rates.
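
As a toy illustration of "logically transferring control" rather than copying, here is a small Python sketch. It is only an analogy: `SharedBuffer`, `owner` and `hand_over` are hypothetical names, and nothing here reflects how AMD's zero copy path or the Durango hardware is actually implemented.

```python
class SharedBuffer:
    """Toy stand-in for memory both processors can address (hypothetical)."""

    def __init__(self, size):
        self.data = bytearray(size)
        self.owner = "cpu"

    def hand_over(self, new_owner):
        # Only the ownership tag changes; no bytes are moved or duplicated.
        self.owner = new_owner
        return memoryview(self.data)


buf = SharedBuffer(16)
buf.data[0] = 42               # "CPU" writes into the buffer
view = buf.hand_over("gpu")    # control passes to the "GPU" side
view[1] = 7                    # "GPU" writes through the same storage
print(buf.owner, view[0], buf.data[1])
```

The hand-off costs the same regardless of buffer size, which is the point of the analogy: the apparent transfer rate can look very high when nothing is physically copied.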
 
And 30 GB/s of coherent access between the GPU and CPU is not cache related. Durango's performance drops to 15 GB/s when it comes to actually hitting the CPU caches.

I read about this in the leak and I'm not sure what to make of it... isn't a cache hit better than a miss? Why would this incur a performance drop?
 
I read about this in the leak and I'm not sure what to make of it... isn't a cache hit better than a miss? Why would this incur a performance drop?

My guess is that cache hits and cache misses use different pathways to the CPU memory.

Wouldn't you pin any portion of the CPU cache that needed to maintain coherency with the GPU? Instead of flushing that data to the cacheable portion of system memory, could you flush the data through the write-combining buffer to uncached system memory, which the GPU could access over Garlic?
 
Wouldn't you pin any portion of the CPU cache that needed to maintain coherency with the GPU?
The Onion bus has over 10 GB/s of bandwidth, and there's no restriction that traffic over it fall within a 2MB window.
At any rate, AMD's x86 cache architecture doesn't pin things.

Instead of flushing that data to the cacheable portion of system memory, could you flush the data through the write-combining buffer to uncached system memory, which the GPU could access over Garlic?
If you want bandwidth, the page table entry would be assigned the necessary attributes for Garlic.
If you meant taking data going over Onion and putting it in the non-coherent write combining buffers, that would then require just throwing away the APU when it comes out of the factory.
 
If you meant taking data going over Onion and putting it in the non-coherent write combining buffers, that would then require just throwing away the APU when it comes out of the factory.

No, I meant data that is being flushed from the cache to main memory. Under normal circumstances wouldn't servicing cache misses be slower?

In the AMD zero copy presentation, Llano's GPU reads and writes to cacheable memory were limited to 4.5-5.5 GB/s. AMD presented a data path figure for GPU accesses to cacheable memory which went over the UNB to L2, back to the UNB, then to system memory.

Access to uncacheable memory, meanwhile, was 6-12 GB/s, and some midrange Llano zero copy transfer rates hit 15 GB/s.
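
To put rough numbers on those figures, here is a quick Python sketch that turns a sustained bandwidth into a transfer time. The rates are the ones quoted above from the presentation; the 64 MB buffer size is purely a hypothetical example, and `transfer_ms` is just a helper name for this sketch. It assumes 1 GB = 1e9 bytes and ignores latency and setup cost.

```python
def transfer_ms(size_bytes, gb_per_s):
    """Milliseconds to move size_bytes at a sustained rate of gb_per_s."""
    return size_bytes / (gb_per_s * 1e9) * 1e3


BUFFER_BYTES = 64 * 10**6  # hypothetical 64 MB buffer, not from the presentation

# Rates quoted in the post above (GB/s).
RATES_GB_S = {
    "cacheable, low": 4.5,
    "cacheable, high": 5.5,
    "uncacheable, low": 6.0,
    "uncacheable, high": 12.0,
}

for label, rate in RATES_GB_S.items():
    print(f"{label:>18}: {transfer_ms(BUFFER_BYTES, rate):6.2f} ms")
```

Under these assumptions, the best uncacheable rate moves the same buffer in well under half the time of the slowest cacheable rate.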

I was wondering if it was possible that coherent data that initially resided in the cpu cache but got flushed to main memory could be moved in a way that allows the gpu faster access to that data.

Basically, would the increase in bandwidth indicate that cache misses aren't going over Onion? Could the data end up in zero copy buffers, where transfer rates are much better? I mentioned the WC buffers because that seems like the only way the CPU can write to non-cacheable memory, but could DMA be employed to move data into the buffers?
 
No, I meant data that is being flushed from the cache to main memory. Under normal circumstances wouldn't servicing cache misses be slower?
I'm afraid I'm not following you on this.

In the AMD zero copy presentation, Llano's GPU reads and writes to cacheable memory were limited to 4.5-5.5 GB/s. AMD presented a data path figure for GPU accesses to cacheable memory which went over the UNB to L2, back to the UNB, then to system memory.
Not quite. The arrow goes over the cores and L2, which makes sense since Llano's L1s are exclusive and would need to be snooped as well.

Access to uncacheable memory, meanwhile, was 6-12 GB/s, and some midrange Llano zero copy transfer rates hit 15 GB/s.
Assuming we're talking about GPU accesses, that's Garlic.

I was wondering if it was possible that coherent data that initially resided in the cpu cache but got flushed to main memory could be moved in a way that allows the gpu faster access to that data.
The CPU has no idea what data the GPU might want, and in the interests of latency CPU cache writeback data isn't going to hang around on chip for very long.
The problem isn't that it's not fast enough. The problem is that writeback is too quick, and you can't buffer enough data to keep a wide enough data window for a GPU access that might be hundreds of cycles out. That would be a (large) cache.
 
AFAIK, that type of access to the CPU requires the use of the Onion+ bus.

It doesn't require the use of either the Onion or Onion+ bus; the Onion/Onion+ bus is used to make coherent communication between the GPU and the CPU more efficient.

If you want more bandwidth and coherency, there are ways to achieve it, as has been mentioned multiple times in this thread. Which bus you use is highly dependent on what you want to do, how tightly coupled the data between the CPU and the GPU is, and how much bandwidth you need.

I was wondering if it was possible that coherent data that initially resided in the cpu cache but got flushed to main memory could be moved in a way that allows the gpu faster access to that data.

I was thinking about this too, depending on the amount of data it might be worthwhile to flush the cache lines that are holding the data you want in the CPU (if possible).
 
Could we be looking at the PS4 SoC & not even know it?

“We are looking at an architecture where the bulk of processing will still sit on the main board, with CPU and graphics added to by more digital signal processing and some configurable logic.”

[Image: HSA+ARCHITECTURE++.jpg]
 
Aren't we fairly confident that the systems are just AMD Fusion based?

Would anyone be surprised by that?

I'm talking about the DSP & Fixed Function Accelerator helping with processing like in the interview with Sony's Ex CTO a year ago.
 
I really think that's the PS4 SoC

Fixed Function Accelerator = Vector Co-processor

DSP = Speech Recognition Processor ?

Image signal processing = Head & Hand tracking processor for PS4 Eye


& the rest: CPU, GPU & video/audio encoding & decoding.
 
Huh? Wouldn't that be quite the shocker? Why hasn't Sony spoken about this?

Seems like the newest incarnation of secret sauce.

They could always be keeping their full deck hidden. After all, they are under no compulsion at all to try one-upmanship with MS over hardware. They've already got that in the bag.

Perhaps they are leaving stuff out so that they can gain another broadside of positive PR if they combat the 'let the games speak for themselves' challenge from MS.

There's only 4 weeks to go before it all becomes clear anyway!
 
They could always be keeping their full deck hidden. After all, they are under no compulsion at all to try one-upmanship with MS over hardware. They've already got that in the bag.
:???: When your enemy is down, that's when you deal the killing blow. You don't give them a chance to recover. Sony took the performance advantage line and ran with it. MS has dealt a couple of come-back blows. If Sony have something to retaliate with, it behoves them to use that. There's something to be gained from telling the world that your hardware has a capable DSP that'll add to the experience. There's nothing to be gained from withholding that info.

I guess the Image Signal Processor is simply the two-layer blend and scale. The audio processor is the audio decode+mix block. The DSP doesn't fit anything I know of, but it can't be anything substantial. If it is, that's very weird that none of the leaks have spoken of it and none of the Sony engineers have spoken of it.
 
The only thing I can relate it to is the zlib block.

AFAIR the zlib block is supposed to be on the secondary chip/southbridge (which is a very interesting idea... transparent disk/blu-ray compression or compression of network traffic?).

AMD did indicate that they integrated some form of 'Sony IP' into the Liverpool APU, but the most logical assumption is that this is something related to audio/video record/playback.

Anyway, at this point "the proof of the pudding is in the eating". If either console had game-related 'secret sauce' then we should be seeing it in demos, and there is no obvious sign of anything major for either console.
 