Xbox One (Durango) Technical hardware investigation

taisui · Oct 6, 2013

Betanumerical said:
Because theres a large difference between coherent read and coherent write, one does not imply the other.

do you have proof that its anything but?
or you just feel that there's more to the story?

adev · Oct 6, 2013

Betanumerical said:
Because theres a large difference between coherent read and coherent write, one does not imply the other.

I'm not sure what you're trying to get at. Both the CPU and GPU can read changed memory coherently. The CPU has write combining support which programmers need to be wary of .

http://fgiesen.wordpress.com/2013/01/29/write-combining-is-not-your-friend/

zupallinere · Oct 6, 2013

Didn't VGleaks docs talk about Durango having 2 types of coherency ?

There are two types of coherency in the Durango memory system:

Fully hardware coherent
I/O coherent
The two CPU modules are fully coherent. The term fully coherent means that the CPUs do not need to explicitly flush in order for the latest copy of modified data to be available (except when using Write Combined access).

The rest of the Durango infrastructure (the GPU and I/O devices such as, Audio and the Kinect Sensor) is I/O coherent. The term I/O coherent means that those clients can access data in the CPU caches, but that their own caches cannot be probed.

The first of thing thought of when I saw READ coherency was I/O coherency, mixed in with whatever other types of coherency.

BRiT · Oct 6, 2013

*ahem* Be mindful of the topic. This is not a versus thread.

Betanumerical · Oct 6, 2013

taisui said:
do you have proof that its anything but?
or you just feel that there's more to the story?

Proof that coherent read doesn't equal coherent write?.

Well for one it could be that, as I have previously mentioned, coherent read requires you to snoop the CPU's cached values in some way, whilst coherent write requires you to either flush the entire cache of the GPU or to bypass the cache entirely.

adev · Oct 6, 2013

Betanumerical said:
Proof that coherent read doesn't equal coherent write?.

Well for one it could be that, as I have previously mentioned, coherent read requires you to snoop the CPU's cached values in some way, whilst coherent write requires you to either flush the entire cache of the GPU or to bypass the cache entirely.

This is accurate AFAIK: http://www.vgleaks.com/durango-memory-system-overview/

TrungGap · Oct 6, 2013

Strange said:
You're just saying the same thing, I acknowledge it's a package and I didn't suggest that the eSRAM is a band-aid solution, but it, however IS there to provide the bandwidth that the system will lack when DDR3/DDR4 was chosen. If you think this is a "band-aid" solution then you will find many engineering designs to be full of band-aids.

Esram is band-aid as much as the gddr5 is the band aid to a less well designed system. It's like saying we need a four lanes freeway everywhere. What if we could isolate the traffic so the we build a four lanes freeway where it's needed and regural two lanes street where we don't...It's like saying TBR is a bandaid to faster memory. It's designed to solve the same problem differently.

In my world, the term band-aid fixed something that is flawed...whether its a design flaw or implementation flaw, and it's not meant to be permanent as part of the design paradigm. Esram is evolution of edram of the 360, so not sure how one can conclude it to be a bandaid.

adev · Oct 6, 2013

Betanumerical said:
Proof that coherent read doesn't equal coherent write?.

Well for one it could be that, as I have previously mentioned, coherent read requires you to snoop the CPU's cached values in some way, whilst coherent write requires you to either flush the entire cache of the GPU or to bypass the cache entirely.

adev said:
This is accurate AFAIK: http://www.vgleaks.com/durango-memory-system-overview/

The parts of this specifically relating to your comments are that the CPU does not snoop the GPU cache.

It actually works by having the GPU invalidate CPU cache lines on write and any reads from those invalidated cache line have to wait for the data to be flushed from the GPU cache.

The GPU can snoop the CPUs cache.

Can you elaborate on why you would believe that reads can be coherent if writes aren't? Wouldn't writes being non-coherent automatically make any reads to altered locations also be non-coherent? MS couldn't claim hardware coherency if they didn't support coherent writes from the GPU.

Strange · Oct 6, 2013

TrungGap said:
Esram is band-aid as much as the gddr5 is the band aid to a less well designed system. It's like saying we need a four lanes freeway everywhere. What if we could isolate the traffic so the we build a four lanes freeway where it's needed and regural two lanes street where we don't...It's like saying TBR is a bandaid to faster memory. It's designed to solve the same problem differently.

In my world, the term band-aid fixed something that is flawed...whether its a design flaw or implementation flaw, and it's not meant to be permanent as part of the design paradigm. Esram is evolution of edram of the 360, so not sure how one can conclude it to be a bandaid.

I wasn't the one to suggest the eSRAM is a band-aid. I consider the two memory types to be a package to deliver what they wanted.

Betanumerical · Oct 6, 2013

adev said:
The parts of this specifically relating to your comments are that the CPU does not snoop the GPU cache.

It actually works by having the GPU invalidate CPU cache lines on write and any reads from those invalidated cache line have to wait for the data to be flushed from the GPU cache.

The GPU can snoop the CPUs cache.

Can you elaborate on why you would believe that reads can be coherent if writes aren't? Wouldn't writes being non-coherent automatically make any reads to altered locations also be non-coherent? MS couldn't claim hardware coherency if they didn't support coherent writes from the GPU.

Thats what im talking about, that you have to invalidate the L2 cache to get coherent writes, and according to later vgleaks articles its not singular lines, its the entire cache. Unless im reading this wrong.

astrograd · Oct 6, 2013

TrungGap said:
Esram is band-aid as much as the gddr5 is the band aid to a less well designed system. It's like saying we need a four lanes freeway everywhere. What if we could isolate the traffic so the we build a four lanes freeway where it's needed and regural two lanes street where we don't...It's like saying TBR is a bandaid to faster memory. It's designed to solve the same problem differently.

In my world, the term band-aid fixed something that is flawed...whether its a design flaw or implementation flaw, and it's not meant to be permanent as part of the design paradigm. Esram is evolution of edram of the 360, so not sure how one can conclude it to be a bandaid.

Thanks, this is precisely what I was getting at. Ppl tend to perpetuate a particular context for why it's there as if it is to cover up something flawed in a design when it was a fundamental basis for the design as is. In tech discussions when ppl start viewing the implementations of a deliberate design as a problem to be solved in and of itself the context of the design becomes muddled and can misguide ppl's interpretation of simple facts.

Strange,

You're only pointing out the advantages (and ignoring some of the consequences/relevency of the advantages) of one system while ignoring the disadvantages.

Disadvantages such as? Again, if you view the eSRAM with the underlying context being its motivating purpose for existing is to alleviate main RAM bandwidth constraints, you run the risk of concluding that it must be some exotic design that devs will struggle with. If you view it (appropriately) as an evolution of the 360's eDRAM, you are more likely to view it as a much more flexible version of that memory architecture, which was viewed by many as among the easiest to develop for ever.

My point is that you can have two radically different, polar opposite interpretations of the technical design depending on which pov you come at it with and doing so from the misinformed viewpoint (which is the most "common" and accepted/perpetuated narrative, unfortunately) will lead to the unrealistic expectations about ease of using that eSRAM.

Why not list the advantages of the other design?
There is a reason why people are giving thumbs up to one console for having 8GB of GDDR5.

It's likely got as much to do with the PS3's design as a contrast to draw upon as anything else tbh. There's a difference between being 'most improved' and generally 'best'. I didn't discuss PS4's design here because...it's an X1 thread. In case you're curious, I think their design is also very solid. Less interesting though as it's a much more straightforward approach to solving a much smaller set of problems than MS faced (i.e. MS is more ambitious in their design, and more nuanced solutions are required as a result imho).

I find it a worthy exercise to keep the prevailing internet narrative from spreading if it's utterly and totally false, which is the case regarding the context of why the eSRAM is there and how the memory architecture's decisions were made. FWIW, I agree that it *shouldn't* matter, but as discussions are upheld by human beings who are susceptible to the influence of such contextual narratives, it's worth noting MS's expressed thoughts on the topic.

Strange · Oct 6, 2013

If you can't think of any disadvantages of the eSRAM+DDR3 package over some other package you really need to think harder.

astrograd said:
Again, if you view the eSRAM with the underlying context being its motivating purpose for existing is to alleviate main RAM bandwidth constraints, you run the risk of concluding that it must be some exotic design that devs will struggle with

In which way or fashion is the eSRAM NOT designed to alleviate bandwidth constraints? :???:

If 68GB/s from the DDR3 was enough for the system do you think the eSRAM would be there?

I'm not is saying it's an exotic design, but it is surely by most standards more complicated and most will agree that it will take more effort to code for to bring the performance up to the same levels.
Whether the increased effort would ultimately matter or not I'm not going to take a stance on. Devs will tell us in time, and we can see the results after a year or two.

However, I do think there is a lot of merit to the "KISS principle", and I generally view complication as a disadvantage.

I have a feeling you're trying to argue if a glass is half empty or half full instead of simply stating that the glass contains 50% water by volume.

I also don't see what's so ambitious in MS's design, and how Sony had much smaller set of problems to solve.
While Sony doesn't have to cope with Kinect (but does have PSEye) and HDMI in, MS doesn't have remote play to worry about.

When you're giving credit to both sides on the same matter it ultimately doesn't matter and people just don't bother to mention it as it doesn't really make a point.

adev · Oct 6, 2013

Betanumerical said:
Thats what im talking about, thawhich types of tasks benefitt you have to invalidate the L2 cache to get coherent writes, and according to later vgleaks articles its not singular lines, its the entire cache. Unless im reading this wrong.

I'd have to double check the specifics again but it was my understanding that the GPU could invalidate individual lines in the CPU cache giving coherent writes and can snoop the CPU cache for coherent reads.

The GPU has to flush it's entire cache when writing coherent memory but that's separate to how the CPU is notified of the changes.

I think there will be a lot of work being done to figure out which type of jobs benefit from coherency with less bandwidth.

Betanumerical · Oct 6, 2013

adev said:
I'd have to double check the specifics again but it was my understanding that the GPU could invalidate individual lines in the CPU cache giving coherent writes and can snoop the CPU cache for coherent reads.

The GPU has to flush it's entire cache when writing coherent memory but that's separate to how the CPU is notified of the changes.

I think there will be a lot of work being done to figure out which type of jobs benefit from coherency with less bandwidth.

Oh where good then

I was just talking wrt to the GPU writing coherent memory itself. So your saying that when the flush happens the CPU sees the changes by invalidating lines on the CPUs cache?.

adev · Oct 6, 2013

Betanumerical said:
Oh where good then I was just talking wrt to the GPU writing coherent memory itself. So your saying that when the flush happens the CPU sees the changes by invalidating lines on the CPUs cache?.

That's my understanding, the GPU invalidates CPU cache lines on write. Presumably the CPU would then stall untill the GPU cache write back is complete.

Betanumerical · Oct 6, 2013

adev said:
That's my understanding, the GPU invalidates CPU cache lines on write. Presumably the CPU would then stall untill the GPU cache write back is complete.

Is the GPU also effectively stalled whilst it is writing back its cache?.

adev · Oct 6, 2013

Betanumerical said:
Is the GPU also effectively stalled whilst it is writing back its cache?.

I don't know and I'm trying to avoid speculating

I'll see if I can find out exactly how that works.

3dilettante · Oct 6, 2013

adev said:
That's my understanding, the GPU invalidates CPU cache lines on write. Presumably the CPU would then stall untill the GPU cache write back is complete.

It doesn't look like the CPU side is aware of what the GPU's cache hierarchy is doing, so it won't be stalling for what could be a long-latency operation.
Coherent traffic over the Onion bus inserts itself into the request queue that orders CPU coherent traffic.
Until that happens, the CPU doesn't know what the GPU is doing, and the request queue is the mechanism for coherence after traffic gets to the end of the Onion bus.

adev · Oct 6, 2013

3dilettante said:
It doesn't look like the CPU side is aware of what the GPU's cache hierarchy is doing, so it won't be stalling for what could be a long-latency operation.
Coherent traffic over the Onion bus inserts itself into the request queue that orders CPU coherent traffic.
Until that happens, the CPU doesn't know what the GPU is doing, and the request queue is the mechanism for coherence after traffic gets to the end of the Onion bus.

The CPU isn't aware. The GPU invalidates the CPU cache so if it does that and starts a write, wouldn't a CPU read have to wait for that to complete before it can begin?

3dilettante · Oct 6, 2013

adev said:
The CPU isn't aware. The GPU invalidates the CPU cache so if it does that and starts a write, wouldn't a CPU read have to wait for that to complete before it can begin?

I think the implementation is more that the GPU sends a write over the Onion bus, and the interface logic itself or in conjunction with the request queue handles inserting it into the order of requests in the coherent hierarchy.
A broadcast of invalidations would happen once the write gets to this point, and any future coherent traffic is going to get on the queue as well. All other attempts to use that cache line are going to hit the queue, until it writes to memory and main memory becomes the final arbiter.

The virtually guaranteed trip to memory is an assumption currently built into the heterogeneous coherence scheme, as far as I can tell from the descriptions of the consoles and Kaveri.

Something like read/modify/write atomic isn't quite covered in this scheme, however. I'm not sure how synchronization is handled there, besides the likelihood that it is painful.

edit: There is the possibility of having hardware for that purpose separate from the GPU. Dedicated atomic units are nothing new within it, and AMD seems to have patents that go in the direction of having dedicated hardware that can process atomic ops.
In that case, the atomic operation could succeed or fail and send a return value back over the bus without stalling the CPUs more than any other sort of cache contention would do.

Xbox One (Durango) Technical hardware investigation

taisui

adev

zupallinere

BRiT

(>• •)>⌐■-■ (⌐■-■)

Betanumerical

adev

TrungGap

adev

Strange

Betanumerical

astrograd

Strange

adev

Betanumerical

adev

Betanumerical

adev

3dilettante

adev

3dilettante

Similar threads