RSX: memory bottlenecks?


Jaws said:
In addition to 6 threads across 6 SIMD fragment quad units, there are further 8 MIMD vertex units, running another 8 threads capable of per vertex dynamic branching. The G70 architecture would then have 14 threads of execution...

So in simplistic terms, the G70 would have 14 processors, with the 8 vertex and 6 fragment quads tuned differently for dynamic branching performance. This seems to be mirrored with CELL, with 1 MIMD PPE and 7 SIMD SPUs, tuned differently for dynamic branching performance.

Incidentally, Xenos makes a tradeoff with worse vertex dynamic branching performance than MIMD VS units, with its 3 SIMD unified processors, i.e. vertex batches are 64 vertices instead of 1... but XeCPU has 3 MIMD PPEs, which would be better for vertex dynamic branching...

Edit: typos...

CELL is MIMD and SIMD, no? Between SPEs it can be MIMD; inside an SPE it can be SIMD.
 
ihamoitc2005 said:
CELL is MIMD and SIMD, no? Between SPEs it can be MIMD; inside an SPE it can be SIMD.

Yep, I've already mentioned that a cluster of SIMD processors can act like an MIMD processor. CELL can be broken down into 2 processor types, an MIMD PPE and 7 SIMD SPUs, so CELL itself would be an MIMD/SIMD hybrid...the same with G70...
 
DeanoC said:
As ERP says the reason not to use NAO32 is blending. The colour space is okay for small lerps (like in an AA downsample) but gives visual errors for more extreme lerps (like blends).
Ah, thanks for clearing this up. I asked you about this in the other really long thread, and didn't get a reply. So alpha blending takes a back seat in your method. For this reason I couldn't quite understand your excitement about NAO32, but I guess if it suits your needs, then that's great.

So the actual amount of HDR alpha we have is very very small.
I assume that what you're saying is that you can do most of your blending in the final RGB space framebuffer, after all tonemapping, blooming, streaking, etc., right? This is a solution that I suggested for HDR graphics on R3xx/R4xx products. The problem, of course, is that you can't get HDR sources to be full brightness through tinted windows, and HDR reflections on alpha blended objects can't undergo post-processing. I guess that's not a deal-breaker, though.

BTW, Deano, I was inspired by your talk about NAO32 two months ago, and came up with a method to have HDR in a 32-bit framebuffer that does allow for blending, and I think I solved all the corner cases in theory (of course, 32-bits can't work miracles, so I'll have to see it in practice before making any real claims). I'll write a demo some day, but right now my 9700 is being fixed at ATI.
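For concreteness, here is a minimal Python sketch of a LogLuv-style packing along the same lines as NAO32: 16-bit log2 luminance plus 8-bit CIE u'v' chromaticity. The log range, bit split, and quantization scales here are my assumptions for illustration, not the exact shipping format:

```python
import math

# Linear Rec.709 RGB -> CIE XYZ matrix
M = [(0.4124, 0.3576, 0.1805),
     (0.2126, 0.7152, 0.0722),
     (0.0193, 0.1192, 0.9505)]

def encode(r, g, b):
    """Pack linear HDR RGB into (16-bit log luminance, 8-bit u', 8-bit v')."""
    x, y, z = (m[0]*r + m[1]*g + m[2]*b for m in M)
    d = x + 15*y + 3*z
    up, vp = 4*x/d, 9*y/d                          # CIE 1976 u', v' (both < 0.7)
    le = round((math.log2(y) + 16) / 32 * 65535)   # log2(Y) in [-16, 16] -> 16 bits
    return le, round(up * 255), round(vp * 255)    # 16 + 8 + 8 = 32 bits

def decode(le, u8, v8):
    """Unpack back to linear RGB; small errors come from 8-bit chromaticity."""
    y = 2 ** (le / 65535 * 32 - 16)
    up, vp = u8 / 255, v8 / 255
    x = y * 9*up / (4*vp)
    z = y * (12 - 3*up - 20*vp) / (4*vp)
    # CIE XYZ -> linear Rec.709 RGB (inverse of M)
    r =  3.2406*x - 1.5372*y - 0.4986*z
    g = -0.9689*x + 1.8758*y + 0.0415*z
    b =  0.0557*x - 0.2040*y + 1.0570*z
    return r, g, b
```

The point of the layout is that luminance, where the eye is most sensitive and where HDR range lives, gets 16 log-spaced bits, while chromaticity, which stays in [0, 1), survives 8-bit quantization with only small errors.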
 
london-boy said:
In the end nAo's way of doing HDR+MSAA will probably be the most popular way of doing that (or slight variations, of course).
Shifty Geezer said:
Saying NAO32 is a workaround for real HDR is like saying a digital photograph saved in an HSL format is a workaround for saving it in RGB and isn't really a digital photo.
Let me ask you guys something. Do you know what it is that makes all those so-called "SM3.0" games unplayable on ATI's R3xx and R4xx hardware in HDR mode?

FP BLENDING.

There are very, very few (if any) shaders in use today that can't be done in PS2.0 with little or no changes. The problem is that games use alpha blending all the time for muzzle flash, plasma guns, fire, smoke, dust, and tons of other special effects. The very fact that there are so few games that took advantage of FP rendertargets in ATI's previous hardware tells you how limiting it is not to have hardware FP alpha blending.

NAO32 is a workaround for real HDR. You don't get the full functionality of the hardware with it. Having said that, there's nothing wrong with using a workaround. Realtime graphics is all about doing hacks and approximations that suit your needs. The issues I mentioned in my previous post may not be noticeable.

However, don't expect this to be an elixir for everyone's bandwidth woes. It has a serious limitation which often cannot be ignored.
 
Hrm, wow, can't believe I missed that entire discussion about NAO32. Quite interesting stuff.

But personally, I don't think there's any reason we should remain with 32-bit integer framebuffers moving into the future. Put simply, the amount of processing that goes into calculating each pixel is going to increase dramatically over the next couple of years, so the performance hit for moving to higher-precision framebuffers is going to become almost nonexistent.

I suppose, of course, that it is important for consoles, as the hardware is static. But for PC's, I don't think it's a very useful technique, as it limits developer freedom. With the amount of increased work that games are going to require moving into the future, it's going to be very important to move towards techniques that have fewer drawbacks, so that artists and programmers aren't spending as much of their time worrying about such things.
 
Chalnoth said:
Hrm, wow, can't believe I missed that entire discussion about NAO32. Quite interesting stuff.

But personally, I don't think there's any reason we should remain with 32-bit integer framebuffers moving into the future. Put simply, the amount of processing that goes into calculating each pixel is going to increase dramatically over the next couple of years, so the performance hit for moving to higher-precision framebuffers is going to become almost nonexistent.
I disagree. Ignoring blending issues, the question to ask is: what advantage does FP16 HDR have for opaque HDR? What are the disadvantages?
IMHO the main advantage is simplicity; the disadvantage is speed.

If I had to predict a trend (it's not actually my prediction, as it's been the rule for graphics for the last 20 years...), computational power will go up faster than bandwidth. So unless we hit a resolution wall, saving bandwidth should always be your main concern.
So why waste bandwidth on opaque geometry (usually the majority of a scene)? The blending stuff comes later and you can do a single pass into FP16 if you want (I'm assuming that in future API, we will have the control of the hardware MSAA like we do on a console, so you won't have to lose any information when you convert from NAO32 to FP16).
Bandwidth will be/is the bottleneck in so many cases. By saving bandwidth, we can scale ALUs faster than not. It's possible to imagine a future video card with ratios of tens of ALUs per ROP. 5:1 is already here (in a few cases) on one chip I know of... For the foreseeable future FP16 will always take at least twice the ROP power of INT8, halving your already overstretched ROP rate.

Nobody yet has shown me a convincing argument for why FP16 HDR with opaque geometry should be preferred over NAO32 HDR. Is there some advantage I'm missing (for opaque geometry)?

I suspect that an 'ideal' rendering system would be
Opaque -> NAO32 MSAA
Convert NAO32 -> FP16
Alpha -> FP16 MSAA
Tone map -> ARGB8

Actually more 'ideal' would be to have hardware support for NAO32 Alpha.

Of course the issues with MSAA, gamma and tonemapping are another problem.
 
Shared-exponent 32-bit formats already support hw blending to some degree (multiplicative blending modes), but if a custom HDR format's quality is not good enough for some app, it doesn't make much sense to 'natively' support blending anyway.
Regarding NAO32 and blending, on some hw it might even be faster to blend with an ad-hoc shader on NAO32 than to use hw blending on an FP16 RT.
The extra hassle of supporting all the blending modes you need in a game might be partially avoided by writing some uber blending shaders or Cg functions which support 'complex' blending equations; one could then select the desired blending mode by setting some pixel shader constants.
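As a toy illustration of that uber-shader idea, here is the blend equation chosen by a constant rather than by fixed-function state, modelled in Python on a single colour channel. The mode names and the set of equations are illustrative assumptions, not nAo's actual shader:

```python
# Hedged sketch of an "uber blending shader": one function whose blend
# equation is selected by a constant, standing in for a pixel shader
# that reads the destination colour itself (via ping-ponged RTs).
def uber_blend(src, dst, alpha, mode):
    # src/dst are linear colour values already decoded from the packed
    # format (e.g. NAO32 -> linear RGB) before blending.
    if mode == "lerp":   # classic alpha blend (smoke, glass)
        return src * alpha + dst * (1 - alpha)
    if mode == "add":    # additive blend (fire, glows)
        return dst + src * alpha
    if mode == "mul":    # multiplicative blend (tinting)
        return dst * src
    raise ValueError(mode)
```

In a real shader the `mode` selection would be a constant-driven branch or a set of weights, so one compiled shader covers every blend equation the game needs.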
 
DeanoC said:
I disagree. Ignoring blending issues, the question to ask is: what advantage does FP16 HDR have for opaque HDR? What are the disadvantages?
IMHO the main advantage is simplicity; the disadvantage is speed.
Well, I don't think you can really ignore blending issues (or any other framebuffer adding/averaging), as that's really going to be the primary concern with any sort of nonlinear packing algorithm.

Of course, if you could get your hands on some hardware that would give access to the value of the framebuffer within the shader, then you could get around blending issues entirely. But we'll have to see whether or not IHV's go the extra mile for that tasty feature.

If I had to predict a trend (it's not actually my prediction, as it's been the rule for graphics for the last 20 years...), computational power will go up faster than bandwidth. So unless we hit a resolution wall, saving bandwidth should always be your main concern.
Right, but consider a second trend, that towards longer shaders. With shader length increasing in general, there's going to be a shift towards texture memory bandwidth (especially for HDR textures). So it would be more fruitful to use this sort of packing for texture data, moving into the future, than for framebuffer data. That is, if you could get texture filtering to work properly.

I suspect that an 'ideal' rendering system would be
Opaque -> NAO32 MSAA
Convert NAO32 -> FP16
Alpha -> FP16 MSAA
Tone map -> ARGB8
I'm not so sure. Bear in mind that you might be converting the entire scene to FP16 even if only 10% of it is covered by transparent objects. I think it'll typically be more efficient to deal with just one framebuffer all the way to the tonemapping pass.
 
nAo said:
Regarding NAO32 and blending, on some hw it might even be faster to blend with an ad-hoc shader on NAO32 than to use hw blending on an FP16 RT.
Wouldn't that require access to the data in the framebuffer to do properly? And can't that only be done through ping-ponging at the moment? Or are you just talking about "faking" the blending through some approximation?
 
Chalnoth said:
Wouldn't that require access to the data in the framebuffer to do properly? And can't that only be done through ping-ponging at the moment? Or are you just talking about "faking" the blending through some approximation?
I'm talking about ping-ponging, but I was simply referring to real-world fillrate while blending with a funky color space or with FP16.
 
Chalnoth said:
Right, but consider a second trend, that towards longer shaders. With shader length increasing in general, there's going to be a shift towards texture memory bandwidth (especially for HDR textures).
If shader lengths increase in general, the relative cost associated with a new color space encoding/decoding becomes even smaller...
Bear in mind that you might be converting the entire scene to FP16 even if only 10% of it is covered by transparent objects.
Well... in some cases you can be more efficient than that; dunno about the general case, I'll have to think about it...
 
nAo said:
If shader lengths increase in general, the relative cost associated with a new color space encoding/decoding becomes even smaller...
Oh, right. I'm not arguing that encoding/decoding is a problem at all. I mean, this is just performance, isn't it?

No, I'm more worried about the nonlinearity of the format. Anyway, as I said, I think it's a superb format for console gaming (that's what it was first designed for, right?). But since the PC is continually evolving, I think that such a technique will quickly become more trouble than it's worth on the PC.
 
DeanoC said:
Nobody yet has shown me a convincing argument for why FP16 HDR with opaque geometry should be preferred over NAO32 HDR. Is there some advantage I'm missing (for opaque geometry)?

I suspect that an 'ideal' rendering system would be
Opaque -> NAO32 MSAA
Convert NAO32 -> FP16
Alpha -> FP16 MSAA
Tone map -> ARGB8
Well, now that you've added "opaque" to your claim, I don't think I can dispute it. You can even use RGBA with an exp() or lookup texture on the alpha to make this sort of scheme very fast instead of going to the CIE color space. The problem is blending, and ping-pong is a bit of a headache. A lot of nice effects are simple alpha blends, and if you want them to be compatible with HDR light sources, then there's not too much you can do.

The reason I'm not convinced that opaque rendering nets much of a speed up with NAO32 is that with a Z-only pass or half-assed object ordering you can mostly get rid of overdraw anyway for opaque surfaces. In the end you're not saving much bandwidth, if any, by rendering to NAO32 and then converting to FP16. The FP16 colour buffer is what, 8MB extra over NAO32 at 1080p? We're talking about 0.5 GB/s saving at 60fps. Include real overdraw, compressed AA (not sure if RSX has this with FP16), memory inefficiency, and maybe you're up to 10% of RSX's GDDR3 bandwidth.
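For the record, the arithmetic behind those figures works out as follows. This is a sketch assuming FP16 = 8 bytes/pixel, NAO32 = 4 bytes/pixel, and a single colour write per pixel:

```python
# Back-of-envelope check of the 1080p figures: extra framebuffer size
# and extra write bandwidth of FP16 over NAO32 at 60 fps.
w, h, fps = 1920, 1080, 60
extra_bytes = w * h * (8 - 4)        # FP16 (8 B/px) over NAO32 (4 B/px)
size_mb = extra_bytes / 2**20        # ~7.9 MB extra colour buffer
bw_gbs = extra_bytes * fps / 1e9     # ~0.5 GB/s at one write per pixel
```

Overdraw, reads, AA samples, and memory inefficiency then scale that 0.5 GB/s baseline up toward the ~10% figure quoted above.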

There are two big bandwidth eaters, neither of which is addressed by the NAO32 pass that's first in your system. One is FP16 post processing, like DOF, motion blur, gaussian blurs for bloom (even at 1/4 resolution), etc. The other is alpha blending. For animated smoke, fog, dust, fire, etc., nothing beats a shitload of layers. Grass, branches/bushes, and fur are other very important examples (I still love that Tomohide Fur demo. Why isn't anyone using this effect?). It's just too hard to get a nicely animated, transparent, volumetric effect any other way.

NAO32 just seems to be a solution to the insignificant part of the problem. Unless, of course, RSX has no FP16 MSAA support, in which case the format is useful.
 
maybe you're up to 10% of RSX's GDDR3 bandwidth.

You say this like 10% is something trivially small, but I think that even if the only gain were a 10% reduction in the memory bandwidth needed for frame-buffer accesses, it could be a worthy idea, not to mention that you also save on frame-buffer space.

Still, G70 does not support MSAA on FP render targets, and talking about the G7x line Kirk did not exactly express his love for shoe-horning FP + MSAA support into the ROPs, so I doubt that we would see such a change in RSX. In that case NAO32 for opaque geometry also means not having to fall back to FSAA, which would also place a bigger load on the "texture bandwidth + texture operations" side as well as a higher pixel shader cost.
 
Mintmaster said:
You can even use RGBA with an exp() or lookup texture on the alpha to make this sort of scheme very fast instead of going to the CIE color space.
I think shared exponent formats are simple, but imho the quality is simply not there (in the general case) compared to FP16: 8 or 9 exponent bits are not enough if you want a high contrast ratio, while at the same time in an RGB color space you don't want to devote less than 7 or 8 bits to the color mantissas if you don't like color banding. (I don't ;) )
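To put rough numbers on the banding argument, the worst-case relative quantization step of an m-bit mantissa can be compared directly. A sketch; the implicit-bit handling follows the usual IEEE convention, and treating ~0.4% steps as visible banding is a rule of thumb, not a measurement:

```python
# Worst-case relative step of an m-bit mantissa: just above a power of
# two the stored mantissa is smallest, so the relative step is largest.
def worst_rel_step(mantissa_bits, implicit_one=False):
    effective = mantissa_bits + (1 if implicit_one else 0)
    return 1.0 / (2 ** effective)

rgbe_step = worst_rel_step(8)          # 8-bit shared-exponent mantissa: ~0.4%
fp16_step = worst_rel_step(10, True)   # FP16: 10 explicit + 1 implicit bit: ~0.05%
```

An 8x coarser step on smooth gradients is where the shared-exponent banding nAo mentions comes from.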

The reason I'm not convinced that opaque rendering nets much of a speed up with NAO32 is that with a Z-only pass or half-assed object ordering you can mostly get rid of overdraw anyway for opaque surfaces. In the end you're not saving much bandwidth, if any, by rendering to NAO32 and then converting to FP16. The FP16 colour buffer is what, 8MB extra over NAO32 at 1080p? We're talking about 0.5 GB/s saving at 60fps. Include real overdraw, compressed AA (not sure if RSX has this with FP16), memory inefficiency, and maybe you're up to 10% of RSX's GDDR3 bandwidth.
In a system where you need 5% of your per-frame bandwidth just to blend a full-screen 720p FP16 RT, halving your bandwidth requirements for a specific rendering pass (from 10% to 5%) is very important.
Furthermore, if at some point in the future we get more advanced pixel shaders that let us fetch the current back buffer pixel color, so that we can implement our own blending modes in a shader without the need to juggle RTs, switching to a 4-bytes-per-pixel format could theoretically double our fillrate in some situations.

There are two big bandwidth eaters, neither of which is addressed by the NAO32 pass that's first in your system. One is FP16 post processing, like DOF, motion blur, gaussian blurs for bloom (even at 1/4 resolution), etc.
Umh... I'm not sure why all these effects must be performed in an RGB FP16 'environment'.
And there is another big bandwidth eater that you should mention... AA resolve :)

The other is alpha blending. For animated smoke, fog, dust, fire, etc., nothing beats a shitload of layers. Grass, branches/bushes, and fur are other very important examples (I still love that Tomohide Fur demo. Why isn't anyone using this effect?). It's just too hard to get a nicely animated, transparent, volumetric effect any other way.
Yep... that's true, but with so many layers the additional constant cost of setting up buffer ping-ponging gets even more negligible.
 
nAo, I guess you're making some good points. I was looking at your format with respect to Deano's "ideal rendering system". I was also thinking that there's really no need for more than one pass on opaque geometry, but I forgot about dynamic reflection maps. Also, going by Deano's system, I didn't think you could do any post-processing in your format because you'd want the alpha blended stuff on there first, but if you're ping-ponging, then there's no reason not to.

Regarding MSAA resolve, I'm surprised it's that hungry. It's only once per frame, and I figured compression helped a lot. Not that this matters much, though, because if you want MSAA then you can't use FP16 anyway from what I'm hearing. Like I said before, MSAA is definitely a reason not to use FP16 (rather neat that you don't get many artifacts in your colour space).

Yep... that's true, but with so many layers the additional constant cost of setting up buffer ping-ponging gets even more negligible.
Can you really ping-pong that fast? If I had 1000 possibly overlapping alpha-blended particles, could you switch render targets that fast?

Also, doesn't ping-ponging require extra writes? Here's how I thought it worked:
1. Switch render targets
2. Read from the main buffer and write to the temporary one while doing your pixel shader blending.
3. Switch back to the original render target
4. Copy from the temporary buffer to the main one.
(Of course steps 2 and 4 can be appropriately interchanged)

That's 2 reads and 2 writes, which is no BW savings at all over FP16. This seems like it's the only way to get overlapping transparency right, unless they're commutative operations like multiplication and addition, and even then getting bandwidth savings seems tough. LERPs, unfortunately, are really important for most of the things I mentioned in my previous post, and I can't see how you'd get any benefit there.
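Tallying the four steps above per blended pixel makes the equal-traffic point explicit. A sketch assuming NAO32 = 4 bytes/pixel, FP16 = 8 bytes/pixel, and no framebuffer compression:

```python
# Per-pixel framebuffer traffic, in bytes, for one blended layer.
nao32, fp16 = 4, 8

# Ping-pong into a temporary RT: read main + write temp (step 2),
# then read temp + write main again (step 4).
pingpong_traffic = nao32 + nao32 + nao32 + nao32   # 16 bytes

# Hardware FP16 alpha blend: read destination + write destination.
hw_blend_traffic = fp16 + fp16                     # 16 bytes
```

So with the copy-back included, ping-ponged NAO32 blending moves exactly as many bytes per layer as a native FP16 blend, which is the complaint in the paragraph above.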

Are you sure ping-ponging to a 32-bit format is any faster than alpha blending in a 64-bit format? Doesn't make sense to me.
 
Panajev2001a said:
You say this like 10% is something trivially small, but I think that even if the only gain was to reduce by 10% the memory bandwidth needed for frame-buffer accesses that it could be a worthy idea, not to mention that you also do save on frame-buffer space.
Well, remember, that was absolute worst case. Consider FP16 versus NAO32 at 720p without MSAA. The former requires a 3.7MB larger framebuffer, so 80% memory efficiency and 2x real overdraw (i.e. not rejected by early Z test) at 60fps gives you 2.5% of RSX's bandwidth. G70 has achieved 100% utilization in 3DMark's fillrate tests, and 80% is pretty normal.
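The 720p arithmetic works out as stated if you take RSX's GDDR3 at 22.4 GB/s. A sketch; the 22.4 GB/s figure and the write-only colour traffic for the opaque pass are assumptions:

```python
# Checking the 720p numbers: 2x real overdraw, 80% memory efficiency,
# 60 fps, FP16 (8 B/px) versus NAO32 (4 B/px).
w, h, fps = 1280, 720, 60
extra = w * h * (8 - 4)           # extra bytes per frame for FP16
size_mb = extra / 1e6             # ~3.7 MB larger framebuffer
saved = extra * 2 / 0.8 * fps     # overdraw and efficiency, bytes/sec
fraction = saved / 22.4e9         # ~2.5% of a 22.4 GB/s bus
```

That ~2.5% is the "absolute worst case" fraction quoted in the paragraph above.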

Again, hardware MSAA capability is a very good reason to use NAO32. I'm just saying that it doesn't look like much of a performance benefit.

BTW, looking at these numbers, you gotta wonder why we aren't seeing better graphics on today's PC games. At 1280x1024 and 60fps, the X1900XTX can execute 400 vector instructions and 400 scalar instructions (not even including the mini-ALU!), fetch 130 texture samples, and has 630 bytes of data access for every friggin pixel on the screen. Holy crap.
 
Part of the problem is developing for the least common denominator, and part of the problem is that offline renders look good mostly for the same reason that film looks better than video: good director control of lighting setup + massive tweaking. Even before RM renderers got ray tracing, radiosity, and photon mapping, they still produced stunning results in the right hands.

With real-time rendering, there are obvious hurdles to overcome, but online realtime in-game cinematics should look *MUCH BETTER* than they do now. (I guess MGS4 is an example of how they are progressing)
 
Mintmaster said:
Can you really ping pong that fast? If I had a 1000 possibly overlapping alpha blended particles, can you switch render targets that fast?
As a small aside, I would expect DX10 and console APIs to be able to handle more ping-ponging than DX9 on PC. The CPU overhead involved with DX9 makes some things like this undesirable.
 