AMD Kaveri APU features the Onion + bus like the PlayStation 4

It just annoys me when wrong information is presented as fact repeatedly, even after being corrected. I could've been nicer, but it's difficult for me when the other person is not making sense.

Well, considering we haven't proved what I have been saying wrong (apart from the point about whether it can coherently write), I still don't understand the hostility. Can you present information that shows it is wrong? What I've been saying makes sense to me, and it seems to make sense to others (to at least some degree), or else I would (should) have been called on it. If you don't understand something, ask how it works and someone will explain; there is no need for hostility here. ;)
 

What Beta is saying makes perfect sense because it's exactly how GCN works: for data modified by the GPU to be visible to the CPU, the entire GPU L2 cache has to be written back, which stalls all GPU memory access for a few thousand cycles.
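To make that ordering concrete, here is a minimal host-side sketch of the handoff on a stock GCN part. Every function in it is a stand-in invented for illustration (there is no public API like this); the only point is the sequence and where the expensive step sits.

Code:
#include <cstdio>

// Minimal sketch of making GPU-written data visible to the CPU on stock GCN.
// All functions are placeholders, not real driver calls.

struct Buffer { float* cpuPtr; };              // CPU-visible mapping of a shared buffer

void dispatchCompute(Buffer&)   { std::puts("GPU: compute writes results into the GPU L2"); }
void waitForComputeComplete()   { std::puts("GPU: shader finished (data may still sit in L2)"); }
void flushGpuL2Writeback()      { std::puts("GPU: full L2 writeback -- stalls GPU memory access for thousands of cycles"); }
void consumeOnCpu(const float*) { std::puts("CPU: now safe to read the results"); }

int main()
{
    float data[256] = {};
    Buffer shared{data};

    dispatchCompute(shared);
    waitForComputeComplete();
    flushGpuL2Writeback();      // the step the PS4 changes are meant to avoid or shrink
    consumeOnCpu(shared.cpuPtr);
}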

We know the PS4 has been modified to alter this scenario; we don't know if the Xbox One has been altered in some way.

VGLeaks does have a single throwaway line to that effect: "The GPU also has a coherent read/write path to the CPU’s L2 caches and to DRAM." http://www.vgleaks.com/durango-memory-system-overview/

Does that mean that the Xbox One GPU can write directly to the CPU L2 cache if the same virtual address is resident in both CPU and GPU L2? Does this avoid a full GPU L2 flush? Is it just wrong (I'm not sure it is, I've seen this mentioned in 2 places ;) )?

I doubt we'll know for sure until someone who actually knows more than just paper specs for these systems comes forward.
 
For a comment which is on topic: Onion+ can bypass the GPU's L1 and L2. You'd better be very careful that you haven't already loaded data from the same address via Onion for something else ;)
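To spell out that hazard with a toy model: the GPU cache is reduced to a simple map here, and both access paths are invented for the example; nothing below is a real API, it just shows what mixing a cached path and a bypass path to the same address can do.

Code:
#include <cstdio>
#include <unordered_map>

// Toy model of the aliasing hazard: one path ("Onion") goes through the GPU
// caches, the other ("Onion+") bypasses them and touches memory directly.

static int memory[16] = {};                    // backing memory
static std::unordered_map<int, int> gpuCache;  // address -> value held in the GPU cache

int readCached(int addr)               // cached path: fills and then hits the GPU cache
{
    auto it = gpuCache.find(addr);
    if (it == gpuCache.end()) it = gpuCache.emplace(addr, memory[addr]).first;
    return it->second;
}

void writeBypass(int addr, int value)  // bypass path: straight to memory, cache untouched
{
    memory[addr] = value;
}

int main()
{
    memory[3] = 1;
    std::printf("cached read before bypass write: %d\n", readCached(3));  // 1, line now cached
    writeBypass(3, 2);                                                    // memory updated behind the cache's back
    std::printf("cached read after bypass write : %d\n", readCached(3));  // still 1 -- stale
    std::printf("actual memory contents         : %d\n", memory[3]);      // 2
}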

Any devs claiming that this system is "easier" to develop for can't have done anything complicated yet.
 
It just annoys me when wrong information is presented as fact repeatedly, even after being corrected. I could've been nicer, but it's difficult for me when the other person is not making sense.

Again, not trying to pick on you, but there have been several people, including a mod, who said the questions were not trolling but legitimate points of discussion.

This site has been a little more hostile than usual since January, but if you go back and look historically we've had lots of passionate debate without the personal attacks. It's great to have strong opinions, but we can do so without attacking each other's motives. It would be nice if we could get back to pre-1/13 discourse.
 
Unable to find it now, but I know that I've read or heard an interview with John Sell where he specifically mentioned a coherent GPGPU scenario where the CPU is the consumer, so writing is definitely possible. I don't, however, know the extent of cache flushing, bypass, or fencing required to make that data visible, or how it compares to the PS4. It seems they've given these scenarios a fair amount of thought and provided quite a bit of HyperTransport bandwidth to support them; so, to your point, it would stand to reason that they have something similar, so why not mention it.

The idea of streaming procedural geometry and texture data from the CPU to the GPU actually informed much of the X360 design. They enhanced the VMX units significantly, provided an L1 bypass, a circular L2 read buffer, and supporting instructions and data paths. Color me very surprised if the XB1 can't similarly manage GPU-fed scenarios.
 
Can someone outline the actual benefits of coherent access? I mean specifics: what sort of issues are addressed, or what effects are made possible, due to coherent access?
 

My understanding:

The advantage is that the GPU and CPU can see the same kind of data, as opposed to the approach in the past, where the CPU packed data into a format that the GPU understands (say, a texture) and then unpacked the result after the GPU did the processing. Not sure if anything is "made possible" this way, since it's just computation, but perhaps things would run faster now.
 

In essence, the CPU and GPU can work together on tasks/computations by knowing where the other is in the process and taking over as necessary, with fewer steps and less overhead.
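A minimal sketch of what that kind of handoff can look like over coherent shared memory. Here "the GPU" is just a second CPU thread and all the names are invented for illustration; the point is only that the producer publishes into the same structure the consumer reads, with a flag instead of a pack/copy/unpack round trip.

Code:
#include <atomic>
#include <array>
#include <cstddef>
#include <cstdio>
#include <thread>

// Producer/consumer handoff through shared, coherent memory.

struct Shared {
    std::array<float, 256> results{};
    std::atomic<bool> ready{false};
};

int main()
{
    Shared shared;

    std::thread producer([&shared] {           // stand-in for a GPU compute job
        for (std::size_t i = 0; i < shared.results.size(); ++i)
            shared.results[i] = static_cast<float>(i) * 0.5f;
        shared.ready.store(true, std::memory_order_release);    // "I'm done, data is visible"
    });

    while (!shared.ready.load(std::memory_order_acquire)) { }   // consumer watches the flag

    std::printf("first results: %f %f\n", shared.results[0], shared.results[1]);
    producer.join();
}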
 
Well, considering we haven't proved what I have been saying wrong (apart from the point about whether it can coherently write), I still don't understand the hostility. Can you present information that shows it is wrong? What I've been saying makes sense to me, and it seems to make sense to others (to at least some degree), or else I would (should) have been called on it. If you don't understand something, ask how it works and someone will explain; there is no need for hostility here. ;)

I have not claimed that the X1 is anything but uni-directionally coherent, because I have no evidence. Yet you made this distinction about how the PS4 is fully coherent but the X1 is coherent read-only; I merely called you out on an inconsistency that I noticed, because AFAIK on the PS4 you either flush or bypass the GPU cache, which means it's only coherent at the system memory level from GPU to CPU, like the X1.

And hence the distinction you suggested is not logical to me, and since you are the one who made the original claim, you should prove it.

From where you stand, it seems to me that you are saying I can't prove that the PS4 is not fully coherent because I don't have evidence saying that it's not?

Why would the GPU need to flush/bypass the cache if it's fully coherent at the cache level? Is this even logical?
 

The only reason I made the distinction about being read-only coherent is that it originally seemed like it didn't have a coherent write bus to DRAM, although it now seems that it does. I made the distinction between the two because, at the time, it was evident that the PS4 did have a coherent write bus. I stopped talking about that, and about the differences between the cache methodologies, at most a page ago, I think.

There is an important distinction to be made between the two different cache policies the GPUs implement and the cache-bypass system that the PS4 seems to incorporate, which is why I brought them up. To say that they are the same because they both support the same things would not be incorrect, but from a practical standpoint (if the XBONE follows the standard GCN way) it seems like there is a large penalty for doing coherent writes on the XBONE (a stall of thousands of cycles).

Both GPUs need to flush in order to make the data in their caches visible to the CPU, it would seem; the difference lies, as my previous paragraph mentions, in the way they implement these flushes and the associated penalties of using them.

Neither is cache coherent, but that does not mean that the way they maintain coherency between the CPU and GPU is the same; it is similar in practice, but the details seem to be majorly different.
 
VGLeaks does have a single throwaway line to that effect: "The GPU also has a coherent read/write path to the CPU’s L2 caches and to DRAM." http://www.vgleaks.com/durango-memory-system-overview/
That is written too ambiguously to be definitive, and my first reaction is to assume it means reads can forward data, not that writes broadcast.

The read path can probe both the CPU L2s and DRAM.
Saying the write path can do more than invalidate resident lines in the CPU caches is a significant change to the cache protocol. What I have read indicates it is using the same protocol as other AMD chips, which would make it write-invalidate.
I would prefer a more explicit statement that that part of the protocol had changed.

Does that mean that the Xbox One GPU can write directly to the CPU L2 cache if the same virtual address is resident in both CPU and GPU L2? Does this avoid a full GPU L2 flush? Is it just wrong (I'm not sure it is, I've seen this mentioned in 2 places ;) )?
The CPU L2s should be physically indexed and tagged. The clients that can access them have already run their accesses through a TLB, and in the case of aliased virtual addresses it would be ambiguous which one had originally been translated if you wanted to do the work in the opposite direction.
 
The only reason I made the distinction about being read-only coherent is that it originally seemed like it didn't have a coherent write bus to DRAM, although it now seems that it does. I made the distinction between the two because, at the time, it was evident that the PS4 did have a coherent write bus. I stopped talking about that, and about the differences between the cache methodologies, at most a page ago, I think.

There is an important distinction to be made between the two different cache policies the GPUs implement and the cache-bypass system that the PS4 seems to incorporate, which is why I brought them up. To say that they are the same because they both support the same things would not be incorrect, but from a practical standpoint (if the XBONE follows the standard GCN way) it seems like there is a large penalty for doing coherent writes on the XBONE (a stall of thousands of cycles).

I think this is a complete straw man, because this is my comment on Cerny's interview:

He's saying the same thing: that even if compute is cacheable in L2, you still need to flush to memory, though there's some optimization there for the graphics so that it doesn't flush entirely.
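The optimization being referred to, as Cerny has described it publicly, is that compute accesses can be tagged "volatile" in the GPU L2, so only those lines need to be written back or invalidated rather than the whole cache. A toy software model of why that matters follows; it is purely illustrative, not how the hardware is actually organized.

Code:
#include <cstdio>
#include <vector>

// Toy comparison of "write back everything" versus "write back only the lines
// tagged as volatile (compute) data".

struct Line { bool dirty; bool taggedVolatile; };

int writeBack(std::vector<Line>& l2, bool onlyVolatile)
{
    int written = 0;
    for (auto& line : l2) {
        if (line.dirty && (!onlyVolatile || line.taggedVolatile)) {
            line.dirty = false;     // pretend the line was written to memory
            ++written;
        }
    }
    return written;
}

int main()
{
    // 1000 dirty lines, of which only 10 hold compute ("volatile") data.
    std::vector<Line> l2(1000, {true, false});
    for (int i = 0; i < 10; ++i) l2[i].taggedVolatile = true;

    std::printf("full flush writes back      %d lines\n", writeBack(l2, false));
    for (auto& line : l2) line.dirty = true;     // reset for the comparison
    std::printf("selective flush writes back %d lines\n", writeBack(l2, true));
}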
 

And? It shows that they are different, and the difference goes further than just the GFX and compute split in memory, far further. I don't even know your point right now; I agree that there seems to be a coherent read path in the XBONE, but the difference between the two seems to be the cache management and process. So what are you trying to say exactly, or are you merely posting to point out that I was wrong in past posts, which I've already changed my thoughts on?
 
In essence, the CPU and GPU can work together on tasks/computations by knowing where the other is in the process and taking over as necessary, with fewer steps and less overhead.

The point is: will the PS4's 10 GB/s peak theoretical bandwidth for the coherent Onion+ bus be enough for this? Considering also that when Onion+ is in use, it will leave only 10 GB/s of bandwidth for Onion?

If we look at the numbers in the example of X1 memory usage on VGLeaks (surely taken from MS docs), it seems to me that PS4 developers will have a hell of a lot of issues there.

Onion+'s shared bandwidth, in combination with the relatively weak CPU, could be "the mother of all PS4 bottlenecks".
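To put rough numbers on that concern, here is the per-frame budget that bandwidth works out to. The 10 GB/s figure is the one quoted in this thread, not a confirmed spec.

Code:
#include <cstdio>

// Back-of-the-envelope: how many bytes of coherent traffic fit in one frame.

int main()
{
    const double coherentBandwidthGBs = 10.0;      // figure quoted above, assumed, not confirmed
    const double frameRates[] = {30.0, 60.0};

    for (double fps : frameRates) {
        const double bytesPerFrame = coherentBandwidthGBs * 1e9 / fps;
        std::printf("%.0f fps: ~%.0f MB of coherent traffic per frame\n",
                    fps, bytesPerFrame / 1e6);
    }
}

That is roughly 333 MB per frame at 30 fps and about 167 MB at 60 fps; whether that is tight depends entirely on how much data actually has to cross coherently each frame.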
 

There's nothing stopping you from doing the cache flush over the Garlic bus and doing further things to make sure the data is correct, giving the CPU access to it that way.
 

So you can confirm it will not be a big issue?

ERP, S.G., what do you think about it?
 

I'm not saying it won't be an issue, nor am I saying it will be an issue; I'm saying that what you do (for both consoles) depends on a wide range of factors, which include the amount of bandwidth you need and how tight the coupling is between the CPU and the GPU for the shared data. It would be interesting to know what kind of latencies the L2-snooping buses have; this may affect what type of access you want as well. [My guess would be that accessing the L2 via this bus would be somewhere in the realm of 50-100 cycles alone, but this could be way off the mark.]
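For scale, converting that guess into wall-clock time, assuming those are CPU cycles at a Jaguar-class clock of roughly 1.6 GHz (the clock and the cycle counts are both assumptions):

Code:
#include <cstdio>

// Turn the guessed snoop-access latency into nanoseconds.

int main()
{
    const double cpuClockGHz = 1.6;            // assumed Jaguar-class clock
    const int cycleGuesses[] = {50, 100};      // the range guessed in the post above
    for (int cycles : cycleGuesses)
        std::printf("%3d cycles ~= %.1f ns\n", cycles, cycles / cpuClockGHz);
}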
 
Do we even have experience of sharing data between GPU and CPU in this fashion? It's a new design and architecture, not really targeted by developers before, so I'm not sure anyone can predict what the problems and advantages will be, unless developers have been targeting HSA chips specifically before. We need new algorithms to take best advantage of CPU and GPU cooperation, and then worry about data formats, and then worry about managing the current dataflow solutions (well, those two go hand in hand, data format and flow). It'd be a pretty amazing programmer who could look at the XB1, Liverpool, and Kaveri CPU<>GPU buses and see what advantages there are and what bottlenecks there are this early on! I suppose some will have existing experience of how CPU and GPU interact and can look at the possibilities in relation to that experience.
 

S.G., I have to say that, looking at the Onion and Onion+ bus bandwidths, it seems very clear to me that Sony does not believe that much in this CPU-GPU cooperation.

This seems to me another key point on which the two console manufacturers have followed two different paths/philosophies.
 

They don't believe in GPU-CPU cooperation, yet they added buses and modified caches to make it more efficient? I don't really follow the logic there. If they didn't believe in it, they wouldn't have any coherent bandwidth at all. The two consoles are merely achieving the same thing in different ways.
 