ROP/Bandwidth Consumption & Low-Level GCN Optimizations a la Humus GDC 2014

But it also states "ROP/BW" for the XB1, so it includes BW as well. I see it as plain as day from the screenshot.

So help me understand how one (XB1) is "ROP/BW" bound at RGBA16, yet able to do RGBA32F within its given spec, while the PS4 cannot, given it's only "BW" bound at RGBA16.

That's all I'm asking...

I wouldn't say that RGBA16 is bound on the eSRAM. Given the 16 ROPs, they need 109 GB/s of BW, and that's exactly what it provides. It's the theoretically optimal point for that format and a 16 ROP setup.
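For reference, the arithmetic behind that figure (assuming the 853 MHz GPU clock and 8 bytes per RGBA16 pixel):

rops = 16
clock_ghz = 0.853                 # post-upclock GPU clock, in GHz
bytes_per_pixel = 8               # RGBA16: 4 channels * 2 bytes
write_bw = rops * clock_ghz * bytes_per_pixel
print(write_bw)                   # ~109.2 GB/s, i.e. the eSRAM's per-direction figure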
 
Blending doubles the bandwidth usage (read-modify-write). RGBA16F with blending is bandwidth bound on all GCN cards (even 290X). Even if you don't sample any textures or read any vertex data.
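A quick back-of-the-envelope for the 290X case (assuming 64 ROPs at ~1 GHz, ~320 GB/s of GDDR5, and 8 bytes per RGBA16F pixel):

rops, clock_ghz, bytes_per_pixel = 64, 1.0, 8   # R9 290X ROP count and clock, RGBA16F pixel size
mem_bw = 320.0                                  # GB/s of GDDR5 bandwidth
demand = rops * clock_ghz * bytes_per_pixel * 2 # x2: blending reads the destination, then writes
print(demand, demand / 2, mem_bw)               # 1024 GB/s at full rate, 512 GB/s even at half rate,
                                                # versus 320 GB/s of memory bandwidth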

hm... I was under the impression that the ROPs were half-rate for RGBA16F blending.
 
Aren't these numbers a little misleading since they omit z-buffer reads/writes? There's a minimum of 3 bytes per pixel there.
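As a very rough upper bound, ignoring Hi-Z and depth compression (which cut this down a lot in practice):

rops, clock_ghz, depth_bytes = 16, 0.853, 4     # assuming a 32-bit depth/stencil format
z_traffic = rops * clock_ghz * depth_bytes * 2  # read for the depth test plus write on pass
print(z_traffic)                                # up to ~109 GB/s of extra traffic in the worst case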
 
Hi everyone! Amazing talk at GDC 2014. Just one thing: in "Digital Foundry: the complete Xbox One architects interview" I think they said that after every 8 write cycles to eSRAM a dead cycle is inserted, during which you cannot write. That would leave roughly 95 GB/s of bandwidth actually available for writing, so if we use RGBA16F we would be limited by bandwidth too.
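Roughly:

esram_write_bw = 109.0            # GB/s per direction at 853 MHz
usable = esram_write_bw * 7 / 8   # one dead cycle in every eight write cycles
print(usable)                     # ~95.4 GB/s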

Bye.
 
I think that was in reference to when both reading and writing simultaneously. When only one of them is occurring, you should be able to get the full bandwidth. Maybe I'm wrong, but that was my interpretation.
I've always wondered if that is actually a bug in the chip that they were not able to fix.
 
Hi everyone! I think in theory it's an inherent problem of the eSRAM technology itself. In the interview they said:

The same discussion with ESRAM as well - the 204GB/s number that was presented at Hot Chips is taking known limitations of the logic around the ESRAM into account. You can't sustain writes for absolutely every single cycle. The writes is known to insert a bubble [a dead cycle] occasionally... One out of every eight cycles is a bubble, so that's how you get the combined 204GB/s as the raw peak that we can really achieve over the ESRAM.
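That arithmetic also lines up with the 204 GB/s figure, assuming the bubble only affects the write port:

per_direction = 109.0                   # GB/s each way, the figure from earlier in the thread
effective_write = per_direction * 7 / 8 # one bubble in every eight write cycles
print(per_direction + effective_write)  # ~204.4 GB/s, matching the quoted combined peak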

It's explained in this paper:

https://www.ece.cmu.edu/~ece548/localcpy/sramop.pdf

Bye.
 
Aren't these numbers a little misleading since they omit z-buffer reads/writes? There's a minimum of 3 bytes per pixel there.

I may have missed something, but he seems to be making the case for using compute shaders instead of the ROPs because of the limited ROP write-out in certain cases, not the usual back-buffer/g-buffer/scene.

Writing through a UAV bypasses the ROPs and goes straight to memory. This solution obviously does not apply to all sorts of rendering; for one, we are skipping the entire graphics pipeline, which we still depend on for most normal rendering. However, in the cases where it applies it can certainly result in a substantial performance increase. Cases where we are initializing textures to something other than a constant color, or simple post-effects, are where this would be useful.
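A rough sketch of where this could win, assuming (as questioned above) that RGBA16F export through the ROPs is half rate:

rops, clock_ghz = 16, 0.853
rop_fill = rops * clock_ghz * 8 * 0.5   # GB/s of RGBA16F written if ROP export is half rate
uav_fill = 109.0                        # GB/s the eSRAM write path can take regardless of the ROPs
print(rop_fill, uav_fill)               # ~54.6 vs 109: a compute shader writing through a UAV
                                        # is not capped by the ROP export rate

If FP16 export turns out to be full rate, the two paths come out about even for a plain fill.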
 
Hi everyone! Amazing talk at GDC 2014. Just one thing: in "Digital Foundry: the complete Xbox One architects interview" I think they said that after every 8 write cycles to eSRAM a dead cycle is inserted, during which you cannot write. That would leave roughly 95 GB/s of bandwidth actually available for writing, so if we use RGBA16F we would be limited by bandwidth too.

Bye.

Dead cycles on writes only happen with simultaneous read/write access.
 
Am I interpreting this the wrong way, or does using RGBA16F with blending make the memory setup of the xbone really shine? It seems to me that in that scenario the eSRAM alone would provide almost all the needed bandwidth and the main RAM would be free for other consumers.
 
Are you referring to the slides in the OP? I may have missed where blending was mentioned for those examples.

The eSRAM is structured to take the full FP16 write throughput of the ROPs.
Blending, however, is one of the test cases that was disclosed early on for Durango's eSRAM bandwidth, and it's not the best case.
Pre-clock bump, the eSRAM's bandwidth was measured at ~133 GB/s for that case, which is quite a way below the theoretical peak.
Possibly this is due to access conflicts, but a more direct reason could be that the eSRAM's static, roughly 50:50 split of read and write capability does not match the mix of accesses a blending test of that sort actually wants.
 
The slide, plus assuming blending would require simultaneous read/writes.

Without blending the eSRAM provides just about enough BW, though the requirement isn't all that high to begin with; with blending the requirement goes through the roof, but the eSRAM is (at least theoretically) able to cope with part of that increase.
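Putting rough numbers on that (same 109 GB/s per direction and the write bubble as above):

blend_demand = 109.0 + 109.0            # GB/s: destination reads plus writes at full RGBA16F ROP rate
esram_supply = 109.0 + 109.0 * 7 / 8    # GB/s: full-rate reads plus bubbled writes
print(blend_demand, esram_supply)       # 218 GB/s wanted vs ~204 GB/s available
# so the eSRAM covers most, but not quite all, of the worst-case blend traffic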

But what you said about the read/write split makes sense. I assumed the figures given before weren't the peak because they wouldn't be blending all the time, but what you said seems more likely.
 