Xbox One (Durango) Technical hardware investigation

Right. Doesn't this have implications for how the ESRAM is used?

Yes, it is like a giant L3. It will make latency-dependent operations much, much faster. But as I said, increasing the L2 from, for example, 512 KB to 2 MB could achieve similar results. The GPU L2 will have much lower latency as well.
 
Interestingly enough, the max read bandwidth is 170 GB/s, while the max write bandwidth for the GPU is 102 GB/s.
Yes, which suggests to me the same bus structure as the eSRAM (identical BW). Write is going to be less than read as you combine multiple buffer samples into single textures.
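
A rough way to see why the peak read figure can exceed the peak write figure: resolve-style operations read several buffer samples for every texel they write. A minimal sketch of that accounting in Python, assuming a 4x MSAA resolve of a 1080p RGBA8 target (both the sample count and the format are illustrative assumptions, not figures from the leak):

```python
# Read/write asymmetry of a resolve: many samples read per texel written.
WIDTH, HEIGHT = 1920, 1080
BYTES_PER_PIXEL = 4   # assumed RGBA8 target
SAMPLES = 4           # assumed MSAA level

bytes_written = WIDTH * HEIGHT * BYTES_PER_PIXEL   # one resolved texel per pixel
bytes_read = bytes_written * SAMPLES               # every sample is read once

print(f"read  : {bytes_read / 2**20:.1f} MiB per resolve")
print(f"write : {bytes_written / 2**20:.1f} MiB per resolve")
print(f"read:write ratio = {bytes_read // bytes_written}:1")
```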

I'm confused why they talk about CPU latencies but not GPU latencies, though. That's supposed to be the whole point of using SRAM over eDRAM, yet there's no word on a GPU RAM latency advantage.
 

The strange thing is that in the prior leak they talked about ESRAM latency being very important for color and depth block performance, when the logical thing would be to put the frame buffer (as recommended by MS) in the DDR3 RAM and use the ESRAM for low-latency compute.

Anyway, don't you think this design is a bespoke HSA system? I mean, maybe this is why MS is not a member of the HSA Foundation. This chip seems to have many of the HSA boxes ticked, but in its own way.
 
Don't get me wrong, the primary purpose of the fast memory pool is to increase the overall bandwidth; they had to add a fast pool the moment they decided on DDR3. Having said that, they selected a low-latency solution because they saw value in it.

Yes, GPUs have caches; the question becomes how effective they are. It's hard to quantify without running a lot of tests on a lot of existing titles.

The system supports rendering to either memory pool. I suspect any real renderer would render to both within a frame. There is the issue of how much data copying you end up doing; the DMEs are there for a reason, but it all eats bandwidth.
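
To put a rough number on "it all eats bandwidth": a copy between the pools costs the buffer's full size in read bandwidth on the source bus and again in write bandwidth on the destination bus, every frame it happens. A minimal sketch, assuming a single 1080p RGBA8 render target moved once per frame at 60 fps (both assumptions are illustrative, not from the leak):

```python
# Estimated per-second cost of moving one render target between pools each frame.
WIDTH, HEIGHT = 1920, 1080
BYTES_PER_PIXEL = 4   # assumed RGBA8 target
FPS = 60

target_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL
per_second = target_bytes * FPS

# The copy consumes this much read bandwidth on the source pool
# and the same amount of write bandwidth on the destination pool.
print(f"one copy per frame costs ~{per_second / 1e9:.2f} GB/s on each pool")
```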

PRTs make it feasible to know pretty much exactly which parts of which textures were actually used in the last frame, but you still have to use that knowledge effectively.
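
For context on what that knowledge looks like: once you know which UVs and mip levels a frame actually sampled, mapping them to tile indices is simple arithmetic. A minimal, API-agnostic sketch with assumed texture and tile sizes (not any real PRT interface):

```python
# Map sampled (u, v, mip) coordinates to the texture tiles they touch.
TEX_W, TEX_H = 4096, 4096   # assumed texture dimensions
TILE_W, TILE_H = 128, 128   # assumed tile size in texels

def touched_tile(u, v, mip):
    """Return (mip, tile_x, tile_y) for a normalized UV sample at a given mip level."""
    w = max(TEX_W >> mip, 1)
    h = max(TEX_H >> mip, 1)
    x = min(int(u * w), w - 1)
    y = min(int(v * h), h - 1)
    return (mip, x // TILE_W, y // TILE_H)

# Hypothetical samples gathered from the previous frame.
samples = [(0.10, 0.20, 0), (0.11, 0.21, 0), (0.90, 0.90, 2)]
resident = {touched_tile(u, v, m) for (u, v, m) in samples}
print(resident)   # only these tiles were actually needed last frame
```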

As I've said before, the split memory, bandwidth, and ROP count would still be the things that concern me most in the design, but I wouldn't judge anything without actually using it.

My guess is the quantity of memory was important early in the design, which dictated DDR3, which in turn dictated the fast memory pool; statistical data and manufacturing complexity probably indicated using SRAM instead of eDRAM.

Sony matching the 8GB is a big deal IMO.
 
Yes, GPUs have caches; the question becomes how effective they are. It's hard to quantify without running a lot of tests on a lot of existing titles.

Given a CPU core with 512 KB of L2, if you run a kernel on it and send it to a CU, wouldn't it save trips and latency to have an L2 of similar size attached to that CU?
For the new programming paradigm these consoles will bring (compute-heavy engines), I think cache increases will be important.
 
But why is it being read from the DMEs section? Shouldn't it be in the GPU itself?

The likely explanation is that the DMEs are hanging off of the hub that AMD GPUs have for low-bandwidth consumers. The hub is designed to allow random widgets to be added or removed from a design without requiring a rearchitecting of the high-bandwidth cache path.

The GPU command front end apparently either does not have high read capability, or it is limited from taking too much bandwidth. Given the vast disparity between the size of a command and what it can make the whole GPU write out, it doesn't need the primary memory pipeline used by the CUs. Since it doesn't use much, it can sit off to the side of the big bandwidth consumers.
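
The disparity being referred to is easy to quantify: a draw packet is tens of bytes, while the pixels it can cause the GPU to write out run to megabytes. A rough illustration; the packet size and target format below are assumptions:

```python
# Rough ratio between command-fetch traffic and the output a command can generate.
DRAW_PACKET_BYTES = 32                            # assumed size of one draw packet
WIDTH, HEIGHT, BYTES_PER_PIXEL = 1920, 1080, 4    # assumed full-screen RGBA8 write

output_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL
print(f"output/command ratio ~ {output_bytes // DRAW_PACKET_BYTES:,}x")
# A front end consuming tiny packets has no need for the CUs' high-bandwidth path.
```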
 
As I've said before, the split memory, bandwidth, and ROP count would still be the things that concern me most in the design, but I wouldn't judge anything without actually using it.

I hope you can help me answer this: how are the 16 ROPs going to be a problem? From what I gather, at 12.8 Gpixels/s it has more than enough fillrate for a 1080p image. So what benefit would having more ROPs serve in this case? Sorry if it should be obvious.
 
My guess is the quantity of memory was important early in the design, which dictated DDR3, which in turn dictated the fast memory pool; statistical data and manufacturing complexity probably indicated using SRAM instead of eDRAM.

I totally agree with that; it's like a cascade of choices, and they are coherent in the end (maybe not the best, but it makes sense). About SRAM instead of eDRAM, I don't think there's any other reason than being able to get it with the same process and on the same die. Focusing on low latency is again a consequence of that choice in the first place (not the other way around). With eDRAM you don't have access to the best process (or not as cheaply) because of the capacitors. SRAM is just more SRAM on the die, since the GPU already has that kind in small quantities.
 
I hope you can help me answer this: how are the 16 ROPs going to be a problem? From what I gather, at 12.8 Gpixels/s it has more than enough fillrate for a 1080p image. So what benefit would having more ROPs serve in this case? Sorry if it should be obvious.

TBH I've not spent any time measuring bottlenecks on existing titles, so I'm speculating like everyone else; I'm just basing the ROP count comment on what I see in PC parts targeting similar resolutions.
More ROPs obviously only help you in ROP-limited situations, and it's hard to know how common that is with 16 ROPs.
Firstly, with small triangles you get only about 1/4 of the performance because of the quad-pixel organization, so fill can be an issue in high-polygon-density, low-shader-complexity scenarios.
Z throughput might be an issue when rendering shadow maps, because you can probably exceed the performance of the RAM solution thanks to compression and the on-chip caches.
My guess is in many of those cases you'd render to eSRAM and not main memory, and the improved efficiency might make up some of the difference.
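
Putting numbers on both points above: at an assumed 800 MHz GPU clock, 16 ROPs give the 12.8 Gpixels/s figure mentioned earlier, which dwarfs what a single 1080p60 pass needs, but the quad-pixel penalty on small triangles can cut the effective rate to roughly a quarter. A sketch of that arithmetic (the clock, frame rate, and worst-case quad efficiency are assumptions):

```python
# Fillrate headroom at 1080p60 versus ROP throughput, with a quad-occupancy penalty.
ROPS = 16
CLOCK_HZ = 800e6                  # assumed GPU clock
peak_fill = ROPS * CLOCK_HZ       # pixels per second -> 12.8 Gpix/s

frame_pixels = 1920 * 1080
needed = frame_pixels * 60        # one write per pixel, one pass, no overdraw

print(f"peak fill            : {peak_fill / 1e9:.1f} Gpix/s")
print(f"1080p60, single pass : {needed / 1e9:.3f} Gpix/s ({peak_fill / needed:.0f}x headroom)")

# Small triangles: a quad covering only one pixel still occupies a full 2x2 footprint.
quad_efficiency = 0.25            # worst case from the quad-pixel organization
print(f"small-triangle fill  : {peak_fill * quad_efficiency / 1e9:.1f} Gpix/s effective")
```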
 
Richard Leadbetter says, I quote: "16 ROPs is sufficient to maintain 1080p".
Now I can't tell for sure if he says this based on the many analyses/comparisons he has made, or just because "in theory 16 ROPs are enough to maintain 1080p", but maybe his words can help the discussion in some way.
 
Looks like the max bandwidth when combining ESRAM and DRAM is 136.4 GB/s and not 170.

Those numbers strike me as wrong. I'm pretty sure the aggregate is 68 GB/s read/write, or any combination thereof, for the DDR3, not bidirectional, going by their other numbers, and the same for the ESRAM; thus it would be limited to 68 GB/s.

In other words, it's 68 GB/s read or write, or some aggregate of the two, not 68 GB/s read and 68 GB/s write at the same time. I think what they are trying to say is that it reads at 68 GB/s from the DDR3/ESRAM and writes at 68 GB/s to the ESRAM/DDR3, but that doesn't mean you have 136 GB/s; you still have 68 GB/s through the whole pipe, just with some cycles of latency involved. Thus copying to itself in fact halves the usable bandwidth.
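
The halving argument can be made concrete: a copy moves every byte once as a read and once as a write, so if both ends sit on the same 68.2 GB/s bus the two directions share it, whereas a DDR3-to-ESRAM copy puts each direction on its own bus and runs at the slower peak. A minimal sketch using the peak figures quoted in this thread:

```python
# Effective copy rate: intra-pool copies share one bus for reads and writes,
# cross-pool copies use a separate bus per direction.
DDR3_PEAK = 68.2     # GB/s, shared by reads and writes on that bus
ESRAM_PEAK = 102.0   # GB/s, figure quoted earlier in the thread

# Copy within DDR3: every byte crosses the 68.2 GB/s bus twice (read + write).
intra_ddr3_copy = DDR3_PEAK / 2

# Copy DDR3 -> ESRAM: reads use the DDR3 bus, writes use the ESRAM bus,
# so the copy is limited by the slower of the two peaks.
cross_copy = min(DDR3_PEAK, ESRAM_PEAK)

print(f"DDR3 -> DDR3 copy  : ~{intra_ddr3_copy:.1f} GB/s of data moved")
print(f"DDR3 -> ESRAM copy : ~{cross_copy:.1f} GB/s of data moved")
```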
 
The figures are for memory transfers. A transfer will involve a read and a write. In any transfer where DRAM is the source or destination, the total bandwidth of the transfer will be a maximum of 2 × 68.2 GB/s.
 
That strikes me as using "bandwidth" wrong. Bandwidth is the amount of data flowing through the pipe, not the amount of data going into the pipe plus the amount going out of it. So it's bound by the slowest speed, which is 68.2 GB/s. Hence it doesn't matter if it reads and writes at 68.2; that's the maximum amount of data that can be moved in a second, not 136.4. Hence if you copy a block of data to itself in memory, in reality you are halving the available bandwidth.
 
Yeah, that's what they mean. The 136.4 GB/s figure is only meaningful as a portion of the total potential aggregate bandwidth for the entire system, i.e. if you are reading at 68 GB/s from DDR3 and writing at the same speed to the ESRAM, you only have 33.6 GB/s of bandwidth left to use, and that's only in the ESRAM.
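
One way to see where those numbers come from: treat 170 GB/s as the system-wide aggregate, charge a DDR3-to-ESRAM transfer 68.2 GB/s on each side, and 33.6 GB/s is what remains. A minimal sketch of that bookkeeping using the figures quoted in this thread:

```python
# Aggregate-bandwidth bookkeeping behind the 136.4 and 33.6 GB/s figures.
SYSTEM_AGGREGATE = 170.0   # GB/s, all buses added together
DDR3_PEAK = 68.2           # GB/s

transfer_cost = 2 * DDR3_PEAK             # read on DDR3 + write on ESRAM
remaining = SYSTEM_AGGREGATE - transfer_cost

print(f"bandwidth charged to the transfer : {transfer_cost:.1f} GB/s")   # 136.4
print(f"left over (all of it on the ESRAM): {remaining:.1f} GB/s")       # 33.6
```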
 
No, what I'm talking about is peak bandwidth: if you copy one area of memory to another on the same device, you can only copy at half the peak bandwidth rate. If you copy from one bus to another, you use the slower pipe's peak bandwidth on both buses.

And you only have 33.6 GB/s of the ESRAM's bandwidth left to use, but you could transfer that directly to the GPU, which is why that metric is flawed if you are thinking of 170 GB/s as the aggregate bandwidth.
 
But they're talking about the BW used / usable by particular operations, not the total amount of data transferred from one pool to another.

If you copy from one bus to another, you use the slower pipe's peak bandwidth on both buses.

That's exactly what their figures are trying to show, along with the text just under the table those figures are in, which explicitly states this. ;)
 
All I'm saying is you shouldn't add bandwidth on different buses like that; it bothers me. It's 68.2 GB/s on both buses, and 68.2 of the 102 usable on the ESRAM. Adding the aggregate bandwidth of all the buses together and then trying to do a percentage is meaningless and more likely to confuse the issue at hand than to explain it to most people. I'm willing to bet the non-technically-inclined will believe that 136 GB/s is the transfer rate of the data, not the amount of bandwidth being used on all buses added together by the transfer.
 
Don't forget that this is from leaked developer docs, though. I'm sure the intended audience knows what it means. :)
 