Understanding XB1's internal memory bandwidth *spawn

Is the fire in this video a sample of particle effects?
http://www.youtube.com/watch?v=eL3FIP3C1FQ
Yes. Fire, smoke and sparks are all particle effects.

Particle effects do consume a lot of render target bandwidth, because of heavy overdraw and alpha blending (read+write instead of just write). You can even bottleneck Tahiti's massive GDDR5 BW (288 GB/s) by just alpha blending single colored (untextured) particles to a HDR (4x16f) target. Render target bandwidth is very important for particle rendering. This was clearly visible in current generation console games. Many cross platform PS3 ports had half resolution particles compared to Xbox 360 versions. Xbox 360 GPU had EDRAM to provide the backbuffer bandwidth needed for full resolution particle rendering (alpha blending and overdraw didn't cost any main memory bandwidth). On the other hand Cell SPUs had their own (high bandwidth) internal work memories, and many advanced PS3 games used SPUs to help with graphics processing (tiled deferred lighting and post AA were common targets).
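To put a rough number on that, here is a minimal back-of-the-envelope sketch; the coverage, overdraw and frame rate figures below are made-up illustrative assumptions, not measurements:

```python
# Rough cost of alpha-blended particles into a 4x16f (64-bit) HDR target.
# Blending reads the destination and writes it back: 2x traffic per layer.

width, height = 1920, 1080   # full-resolution render target
bytes_per_pixel = 8          # 4 channels x fp16
coverage = 0.5               # fraction of the screen covered (assumed)
overdraw = 20                # blended layers per covered pixel (assumed)
fps = 60

traffic = width * height * coverage * overdraw * bytes_per_pixel * 2  # bytes/frame
print(traffic * fps / 1e9, "GB/s")  # ~19.9 GB/s for this single effect
```

And that is just one effect's share of the frame budget; overdraw scales it linearly.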
 
One thing about the 5GB talk: someone at Microsoft already commented in the press that games can use more than 5GB. I'm pretty sure MS already stated that they could go up to a maximum of 7.5GB, which would mean some resident apps would have to be suspended if a game required that much memory.

It irks me to see people keep saying there's 3GB reserved for the OS when MS has already stated otherwise. The 3GB was a rumor posted before the Xbox One reveal, and it isn't totally correct.
Nope, I'm pretty sure no one from Microsoft said any such thing. You can't just suspend the system VM; it's running services used by the Game VM, like the final audio mixer, Kinect, and a whole host of other things.
 
Particle effects do consume a lot of render target bandwidth, because of heavy overdraw and alpha blending (read+write instead of just write). You can even bottleneck Tahiti's massive GDDR5 BW (288 GB/s) by just alpha blending single colored (untextured) particles to a HDR (4x16f) target. Render target bandwidth is very important for particle rendering.
Yes. That's why I think that for that workload, bandwidth/throughput is the most important thing (if it weren't, one couldn't max out massive GDDR5 interfaces). Any insight into how the lower latency of the eSRAM helps on top of that? MS has put some emphasis on it exactly in connection with blending, which doesn't sound like the scenario that would profit most from it.
 
The reality is a bit more complicated than that. DRAM is heavily optimized for localized or linear accesses, with writes and reads not being mixed together. Internally, the DRAM is heavily subdivided into banks, runs slower than its interface, and can't keep everything at the ready at all times. It also incurs a penalty whenever it has to switch between reads and writes.

The memory subsystem tries very hard to schedule accesses so that they hit as few bank and turnaround penalties as possible, but this isn't simple to do with other constraints like latency and balancing service to multiple clients.

Ideally, the eSRAM could dispense with all of this, and gladly take any mix that works within the bounds of its read and write ports.
However, the peak numbers and articles on the subject suggest that for various reasons there are at least some banking and timing considerations that make the ideal unreachable. The physical speed of the SRAM and the lack of an external bus probably mean that the perceived latency hierarchy is "flatter" than it would be if you were spamming a GDDR bus with reads and writes with poor locality.

This is where I assume the hinted-at advantages the eSRAM has for certain operations come in: where the access pattern starts interspersing reads with writes, or where there is poor locality.
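As a minimal sketch of that turnaround effect (the burst length and penalty below are invented illustrative figures, not real GDDR5 timings; only the shape of the curve matters):

```python
# Toy model: fraction of peak DRAM bandwidth kept when the controller must
# switch between reads and writes, paying a fixed turnaround penalty each time.

def bus_efficiency(run_length, burst_cycles=4, turnaround_cycles=10):
    """Efficiency if the direction flips after every `run_length` bursts."""
    useful = run_length * burst_cycles
    return useful / (useful + turnaround_cycles)

for run in (1, 4, 16, 64):
    print(run, round(bus_efficiency(run), 2))
# 1 -> 0.29, 4 -> 0.62, 16 -> 0.86, 64 -> 0.96: long same-direction runs win.
```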

Thanks for the explanation. I'd guessed things may be more complicated than my simplified model and was hoping someone with the requisite knowledge would explain it to me.

However, I'm going to be a pain in the ass and continue down this simplification road just a little longer in an attempt to help my non-programmer's mind get to grips with things :smile:

So would it be fair to say that you do still need a 50/50 read/write split to achieve maximum bandwidth utilization of the XB1 eSRAM, but that in such a use case GDDR5 would likely be less efficient because of the switching penalties you mentioned?

And so the workloads (viewed in terms of memory reads/writes) that achieve maximum utilization of the eSRAM and of GDDR5 would be pretty much opposite, with GDDR5 focused on sustaining all reads or all writes for as long as possible, and eSRAM focused on a perfect balance of the two at all times?
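Just to pin down what I mean, here's a toy model of a dual-ported memory: one read port and one write port at the 109 GB/s per-direction figure discussed in this thread (the real eSRAM banking is surely messier, per the post above):

```python
# What total bandwidth can a dual-ported memory (one 109 GB/s read port plus
# one 109 GB/s write port) sustain for a workload with a given read fraction?

PORT = 109.0  # GB/s per direction (the XB1 eSRAM per-direction figure)

def achievable(read_fraction):
    f = read_fraction
    if f in (0.0, 1.0):
        return PORT                        # only one port ever has work
    return min(PORT / f, PORT / (1 - f))   # whichever port saturates first

for f in (0.0, 0.25, 0.5, 1.0):
    print(f, round(achievable(f), 1))
# 0.5 -> 218.0 GB/s at a perfect 50/50 split; all-read or all-write -> 109.0
```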

If that's correct (please just blast my theory out of the water if it's not), then would you care to venture an estimate of how a typical game's workload may best fit either model at a high level (I understand that at a low level there would be a mix), or could it vary greatly from one model to the other depending on the game engine?

I guess, as sebbbi says above, situations with heavy alpha blending or overdraw are likely to be more efficient on eSRAM than on similarly specified GDDR5. Could it be that the XB1 may even be able to outperform a high-end GPU like Tahiti in these situations?
 
There is no way you can calculate or justify the utilization numbers you provide. There's zero basis to think one or the other device is at 90% or 60% or anything. Zero. You can use whatever numbers you want, but I just see no basis for your utilization decisions. Either you don't think the architecture of the XB1 is reasonably efficient enough to extract 90% utilization, or you think it's so exotic that developers couldn't or wouldn't expend the energy to do so.

It's actually neither, as I never gave an expectation of the XB1's utilization rate. The numbers I gave were purely hypothetical examples to illustrate a basic principle (that utilization rate matters).

Thanks to the above posts from 3dilettante and sebbbi, it seems clear that GDDR5 utilization is more susceptible to read/write ratios than I was expecting, and so it's going to be the workloads that determine which memory system is more efficient overall.

And there are, of course, still unanswered questions about how the eSRAM works which may further impact its real-world utilization potential.
 
I saw it mentioned previously that Penello had stated the 204 number was incorrect and that it was actually 218, which makes more sense for a dual-ported architecture at 2x the 109. While it "makes sense", I'm not sure how it fits with the pre-upclock numbers of 102 and 192.

Any thoughts as to which is more likely? Do we have the more rational 2x rate as he suggested recently, and MS made some 7/8 math mistake TWICE, or are we still going to be left with some mystery astrophysics? Any bets? What's the line on this one...
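For reference, here's the arithmetic that would make both pairs of numbers self-consistent, on the speculative reading that the eSRAM moves 128 bytes per cycle in one direction every cycle and can add the opposite direction on only 7 of every 8 cycles:

```python
# Speculative "7/8" reading: one direction every cycle, the other on 7 of 8.

def esram_bw(mhz):
    one_way = mhz * 1e6 * 128 / 1e9          # GB/s in a single direction
    return one_way, one_way * (1 + 7 / 8)    # (minimum, claimed peak)

print(esram_bw(800))   # ~ (102.4, 192.0): the pre-upclock 102/192 pair
print(esram_bw(853))   # ~ (109.2, 204.7): the 109/204 pair, not 218
print(853e6 * 128 * 2 / 1e9)   # 218.4: the straight dual-ported 2x figure
```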

I think Penello screwed up. I think it's 204.
 
Could it be that XB1 may even be able to outperform a high end GPU like Tahiti in these situations?
You can look up a pixel blending fill rate test for Tahiti and find out. If Tahiti achieves > 16 pixels/clock the answer is no. 3dmark has fill rate tests, but I don't remember if blending is enabled.
 
Is the fire in this video a sample of particle effects?

http://www.youtube.com/watch?v=eL3FIP3C1FQ

Looking at the XB1 and PS4 threads, do we really care?

- Developers doing cross-platform games will aim for the "mean average" in terms of IQ. The console with more processing power will suffer. Also, on PS3 and X360 there was no noticeable difference.

- If you like exclusives, the guys with the more powerful console will have nicer-looking games.

I don't buy console "S" or "X" because it can do xx GFLOPS or yy GB/s.

Games are all that matters...
 
You can look up a pixel blending fill rate test for Tahiti and find out. If Tahiti achieves > 16 pixels/clock the answer is no. 3dmark has fill rate tests, but I don't remember if blending is enabled.
Hardware.fr did fill rate tests with and without blending. With all the 32-bit color formats Tahiti does the full 32 pixels/clock, while with 64-bit (4xfp16) bandwidth limitations also kick in. The HD 7970 achieves 14.5 GPixel/s, while they measured 16.4 GPixel/s for the GHz Edition. That means they are already using 232 GB/s and 262 GB/s of memory bandwidth (out of the total 264 GB/s or 288 GB/s available from the 384-bit GDDR5 interface). Both numbers are above the maximum the XB1's eSRAM could conceivably attain. So no, just with blending the XB1 is not going to beat a Tahiti.
And if I understood sebbbi right, he did some tests with tiled access to the render target for particle rendering and could get twice the performance by using the ROP caches, which means the ROPs themselves could be even faster (they could even do full-speed 4xfp16 blending; the XB1 docs even mention that explicitly). They are strangled by external bandwidth in some cases, even on Tahiti. And even on the XB1 this could still sometimes be the case, as in the documentation leaked by vgleaks this tiled or localized access is explicitly mentioned as a means of circumventing the eSRAM's bandwidth limitation when blending to a 64-bit render target.
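For anyone checking the conversion: each 4xfp16 blended pixel moves 8 bytes in and 8 bytes out, so the measured fill rates translate directly into bandwidth:

```python
# Bandwidth implied by Hardware.fr's 4xfp16 blending fill rates:
# each blended pixel reads 8 bytes and writes 8 bytes.

BYTES_PER_BLEND = 8 * 2   # 64-bit pixel, read + write

for name, gpix in (("HD 7970", 14.5), ("HD 7970 GHz", 16.4)):
    print(name, gpix * BYTES_PER_BLEND, "GB/s")
# HD 7970     -> 232.0 GB/s (of 264 GB/s available)
# HD 7970 GHz -> 262.4 GB/s (of 288 GB/s available)
```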
 
He said that after he quoted 204 and everyone questioned it, he went back to the office and was told it was 218 (straight-up double).

Anyway, we have to wait and see what the engineer says if/when he gets to talk about it.
 
if I understood sebbbi right, he did some tests with tiled access to the render target for particle rendering and could get twice the performance by using the ROP caches, which means the ROPs themselves could be even faster (they could even do full-speed 4xfp16 blending; the XB1 docs even mention that explicitly)
I did those fill rate tests on Tahiti (Radeon 7970 PC hardware), as I mentioned in my earlier post. My tests agree fully with Hardware.fr results. I would definitely NOT post any Xbox One results (or leak any unannounced technical details about it).

Tahiti can do full speed 4x16f blending, as long as around half of the render target data comes from the ROP caches. Extrapolating from my test results, Tahiti would need around 500 GB/s of bandwidth to reach full fill rate when pixels are not hitting the ROP caches. Theoretical numbers (from Wikipedia): 1000 MHz * 32 pixels/clock * 8 bytes/pixel * 2 (read+write) = 512 GB/s.
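In code form, the same theoretical calculation:

```python
# Bandwidth needed for full-rate 4xfp16 blending: every ROP reading and
# writing 8 bytes per pixel, every clock.

def blend_bw_needed(mhz, rops, bytes_per_pixel=8):
    return mhz * 1e6 * rops * bytes_per_pixel * 2 / 1e9   # GB/s, read + write

print(blend_bw_needed(1000, 32))   # Tahiti: 512.0 GB/s
```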
 
I did those fill rate tests on Tahiti (Radeon 7970 PC hardware), as I mentioned in my earlier post. My tests agree fully with Hardware.fr results. I would definitely NOT post any Xbox One results (or leak any unannounced technical details about it).
I was not implying this at all. It was a reference to your sentence that even Tahiti's ROPs can be bottlenecked by bandwidth, and to your earlier description of your tests. I fully understood that you did it on a Tahiti. But as pjbliverpool and 3dcgi wanted to compare Tahiti's ROP capabilities with the XB1's, I was just connecting it to the bits from vgleaks where full-speed 4xfp16 blending is mentioned, and which also state that localized access (as in your tests on Tahiti) helps to sustain the maximum throughput. It all adds up to a coherent picture.
 
No. You misunderstood.

So how does that work? Because IIRC, a Qimonda doc cited on the GDDR5 Wikipedia page says reads and writes are parallel:
SGRAM can access two memory pages at once, which simulates a dual-ported RAM. There is a write data mask which saves bandwidth in a typical read-modify-write.
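If I follow the doc, here's a minimal sketch of the write-data-mask saving it describes (the burst size is an illustrative figure, not a real GDDR5 parameter):

```python
# Why a write data mask saves bandwidth: a partial update can be sent as one
# write with unwanted byte lanes masked off, instead of a read-modify-write.

BURST = 32  # bytes per access (illustrative figure)

rmw_traffic    = BURST * 2   # read the whole burst, merge, write it back
masked_traffic = BURST       # one masked write, no read needed

print(rmw_traffic, masked_traffic)   # 64 vs 32 bytes on the bus
```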
 
I did those fill rate tests on Tahiti (Radeon 7970 PC hardware), as I mentioned in my earlier post. My tests agree fully with Hardware.fr results. I would definitely NOT post any Xbox One results (or leak any unannounced technical details about it).

Tahiti can do full speed 4x16f blending, as long as around half of the render target data comes from the ROP caches. Extrapolating from my test results, Tahiti would need around 500 GB/s of bandwidth to reach full fill rate when pixels are not hitting the ROP caches. Theoretical numbers (from Wikipedia): 1000 MHz * 32 pixels/clock * 8 bytes/pixel * 2 (read+write) = 512 GB/s.

Interesting. So the Xbox One GPU, with only 16 ROPs, would theoretically need only ~218 GB/s of bandwidth (read+write), which just happens to match the bandwidth provided by the eSRAM.
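A quick sanity check of that figure, assuming the post-upclock 853 MHz GPU clock:

```python
# 16 ROPs x 8 bytes/pixel x 2 (read + write) at 853 MHz:
print(853e6 * 16 * 8 * 2 / 1e9)   # 218.4 GB/s, matching the eSRAM figure
```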

Regards,
SB
 
Interesting. So the Xbox One GPU, with only 16 ROPs, would theoretically need only ~218 GB/s of bandwidth (read+write), which just happens to match the bandwidth provided by the eSRAM.
But in that case it should be really easy to measure, without making you jump through hoops to achieve it. I mean, it can't be too hard to carry out a traditional fill rate test (no localized accesses, just screen-filling quads writing each pixel once) with blending and a 64-bit render target, right?

And I think with MSAA (and bad or no compression) one would need even more bandwidth.
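To spell out what such a test should report under each of the candidate peak figures (a sanity-check model only; it assumes fill rate is capped by either the ROPs or raw bandwidth, whichever bites first):

```python
# Expected result of a traditional fill rate test (64-bit target, blending on,
# each pixel touched once): min of the ROP limit and the bandwidth limit.

def expected_fillrate(mhz, rops, bw_gbs, bytes_per_pixel=8):
    rop_limit = mhz * 1e6 * rops / 1e9         # GPixel/s
    bw_limit = bw_gbs / (bytes_per_pixel * 2)  # blending: read + write traffic
    return min(rop_limit, bw_limit)

for bw in (109, 204, 218.4):
    print(bw, round(expected_fillrate(853, 16, bw), 2))
# 109 -> 6.81, 204 -> 12.75, 218.4 -> 13.65 (the full ROP rate of ~13.6)
```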
 
But in that case it should be really easy to measure, without making you jump through hoops to achieve it. I mean, it can't be too hard to carry out a traditional fill rate test (no localized accesses, just screen-filling quads writing each pixel once) with blending and a 64-bit render target, right?

And I think with MSAA (and bad or no compression) one would need even more bandwidth.

Would your first paragraph represent a real-world scenario in a game? I'd assume it would line up more with the achievable peak, which is lower than the theoretical peak, since simultaneous reads and writes presumably incur some real penalties that preclude any system from attaining its theoretical maximum. Hence why developers are told the peak is less than the theoretical 218 GB/s?

Regards,
SB
 
Slightly tangential question about the eSRAM... how does a dual-ported design compare to a single-ported one of twice the bus width in terms of transistor counts, IO pins, suitability for future process node shrinks, etc.?

I'm just wondering, if MS thought they needed ~200 GB/s of bandwidth, what factors may have nudged them toward the current design rather than a 2048-bit bus for the eSRAM.
 
Because you have access to the same technical fellows that Penello has, or do you have something more technical?

The Hot Chips slides say 204. *shrug* http://images.thisisxbox.com/2013/08/XBO_diagram_WM.jpg

Which fits with the former 7/8 multiplier throughout.

I'd trust that for now over some off-the-cuff Penello post on GAF. Btw, I don't think he went and asked any "technical fellow"; IIRC he was just going with the crowd and possibly made a mistake. He must have thought, "well, it must be 2x, that makes sense." And it was in reply to GAFers stating it should be 218, based on the erroneous 2x109 idea. I should look up the post but don't have time right now.

If they're saying 218 now, it's new.
 