ROP/Bandwidth Consumption & Low-Level GCN Optimizations a la Humus GDC 2014

But it also states "ROP/BW" for the XB1, so it includes BW as well. I see it as plain as day from the screenshot.

So help me understand how one (XB1) is "ROP/BW" bound at RGBA16, yet able to do RGBA32F within its given spec, while the PS4 cannot, given it's only "BW" bound at RGBA16.

That's all I'm asking...

I wouldn't say that RGBA16 is bound on the eSRAM. Given the 16 ROPs, they need 109 GB/s of BW, and that's exactly what it provides. It's the theoretically optimal point for that format and a 16 ROP setup.
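For reference, the arithmetic behind that figure (assuming the 853 MHz GPU clock and 8 bytes per RGBA16 pixel):

rops = 16
clock_ghz = 0.853                 # post-upclock GPU clock, in GHz
bytes_per_pixel = 8               # RGBA16: 4 channels * 2 bytes
write_bw = rops * clock_ghz * bytes_per_pixel
print(write_bw)                   # ~109.2 GB/s, i.e. the eSRAM's per-direction figure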
 
Blending doubles the bandwidth usage (read-modify-write). RGBA16F with blending is bandwidth bound on all GCN cards (even 290X). Even if you don't sample any textures or read any vertex data.
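A quick back-of-the-envelope for the 290X case (assuming 64 ROPs at ~1 GHz, ~320 GB/s of GDDR5, and 8 bytes per RGBA16F pixel):

rops, clock_ghz, bytes_per_pixel = 64, 1.0, 8   # R9 290X ROP count and clock, RGBA16F pixel size
mem_bw = 320.0                                  # GB/s of GDDR5 bandwidth
demand = rops * clock_ghz * bytes_per_pixel * 2 # x2: blending reads the destination, then writes
print(demand, demand / 2, mem_bw)               # 1024 GB/s at full rate, 512 GB/s even at half rate,
                                                # versus 320 GB/s of memory bandwidth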

hm... I was under the impression that the ROPs were half-rate for RGBA16F blending.
 
Aren't these numbers a little misleading since they omit z-buffer reads/writes? There's a minimum of 3 bytes per pixel there.
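As a very rough upper bound, ignoring Hi-Z and depth compression (which cut this down a lot in practice):

rops, clock_ghz, depth_bytes = 16, 0.853, 4     # assuming a 32-bit depth/stencil format
z_traffic = rops * clock_ghz * depth_bytes * 2  # read for the depth test plus write on pass
print(z_traffic)                                # up to ~109 GB/s of extra traffic in the worst case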
 
Hi everyone! Amazing talk at GDC 2014. Just one thing: in "Digital Foundry: the complete Xbox One architects interview" I think they said that after every 8 write cycles to eSRAM a dead cycle is inserted, during which you cannot write. That would leave roughly 95 GB/s of bandwidth actually available for writing, so if we use RGBA16F we would be limited by bandwidth too.
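Roughly:

esram_write_bw = 109.0            # GB/s per direction at 853 MHz
usable = esram_write_bw * 7 / 8   # one dead cycle in every eight write cycles
print(usable)                     # ~95.4 GB/s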

Bye.
 
I think that was in reference to when both reading and writing simultaneously. When only one of them is occurring, you should be able to get the full bandwidth. Maybe I'm wrong, but that was my interpretation.
I've always wondered if that is actually a bug in the chip that they were not able to fix.
 
Hi everyone! I think in theory it's an inherent problem of the eSRAM technology itself. In the interview they said:

The same discussion with ESRAM as well - the 204GB/s number that was presented at Hot Chips is taking known limitations of the logic around the ESRAM into account. You can't sustain writes for absolutely every single cycle. The writes is known to insert a bubble [a dead cycle] occasionally... One out of every eight cycles is a bubble, so that's how you get the combined 204GB/s as the raw peak that we can really achieve over the ESRAM.
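That arithmetic also lines up with the 204 GB/s figure, assuming the bubble only affects the write port:

per_direction = 109.0                   # GB/s each way, the figure from earlier in the thread
effective_write = per_direction * 7 / 8 # one bubble in every eight write cycles
print(per_direction + effective_write)  # ~204.4 GB/s, matching the quoted combined peak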

It's explained in this paper:

https://www.ece.cmu.edu/~ece548/localcpy/sramop.pdf

Bye.
 
Aren't these numbers a little misleading since they omit z-buffer reads/writes? There's a minimum of 3 bytes per pixel there.

I may have missed something, but he seems to be making the case for using compute shaders instead of the ROPs because of the limited ROP write-out in certain cases, not the usual back-buffer/g-buffer/scene.

Writing through a UAV bypasses the ROPs and goes straight to memory. This solution obviously does not apply to all sorts of rendering; for one, we are skipping the entire graphics pipeline, which we still depend on for most normal rendering. However, in the cases where it applies it can certainly result in a substantial performance increase. Cases where we are initializing textures to something other than a constant color, or simple post-effects, are where this would be useful.
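A rough sketch of where this could win, assuming (as questioned above) that RGBA16F export through the ROPs is half rate:

rops, clock_ghz = 16, 0.853
rop_fill = rops * clock_ghz * 8 * 0.5   # GB/s of RGBA16F written if ROP export is half rate
uav_fill = 109.0                        # GB/s the eSRAM write path can take regardless of the ROPs
print(rop_fill, uav_fill)               # ~54.6 vs 109: a compute shader writing through a UAV
                                        # is not capped by the ROP export rate

If FP16 export turns out to be full rate, the two paths come out about even for a plain fill.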
 
Hi everyone! Amazing talk at GDC 2014. Just one thing: in "Digital Foundry: the complete Xbox One architects interview" I think they said that after every 8 write cycles to eSRAM a dead cycle is inserted, during which you cannot write. That would leave roughly 95 GB/s of bandwidth actually available for writing, so if we use RGBA16F we would be limited by bandwidth too.

Bye.

Dead cycles on writes only happen with simultaneous read/write access.
 
Am I interpreting this the wrong way, or does using RGBA16F with blending make the memory setup of the xbone really shine? It seems to me that in that scenario the eSRAM alone would provide almost all the needed bandwidth and the main RAM would be free for other consumers.
 
Are you referring to the slides in the OP? I may have missed where blending was mentioned for those examples.

The eSRAM is structured to take the full FP16 write throughput of the ROPs.
Blending, however, is one of the test cases that was disclosed early on for Durango's eSRAM bandwidth, and it's not the best case.
Pre-clock bump, the eSRAM's bandwidth was measured at ~133 GB/s for that case, which is quite a way below the theoretical peak.
Possibly this is due to access conflicts, but a more direct reason could be that the eSRAM's static, roughly 50:50 split of read and write capability does not match the mix of accesses a blending test of that sort actually wants.
 
The slide, plus assuming blending would require simultaneous read/writes.

Without blending the eSRAM provides just about enough BW, though the requirement isn't all that high to begin with; with blending the requirement goes through the roof, but the eSRAM is (at least theoretically) able to cope with part of that increase.
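Putting rough numbers on that (same 109 GB/s per direction and the write bubble as above):

blend_demand = 109.0 + 109.0            # GB/s: destination reads plus writes at full RGBA16F ROP rate
esram_supply = 109.0 + 109.0 * 7 / 8    # GB/s: full-rate reads plus bubbled writes
print(blend_demand, esram_supply)       # 218 GB/s wanted vs ~204 GB/s available
# so the eSRAM covers most, but not quite all, of the worst-case blend traffic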

But what you said about the read/write split makes sense. I assumed the figures given before weren't the peak because they wouldn't be blending all the time, but what you said seems more likely.
 