Larrabee at Siggraph

It doesn't matter.
You have an extra variable to play with on consoles. Rendering the screen in tiles (not necessarily as small as the Larrabee tiles) will reduce the storage needs. How doesn't that matter?
With the binning technique in the paper, I think Crysis would need over 500MB of bin space.
Just so I can follow the math, how many parameters per vertex and how many vertices per frame are that number made up from?
 
I really hope Larrabee is awesome, but I have my doubts about a lot of what Intel does that doesn't come out of Israel (where the Core2Duo was born).

The latest being Atom, which, according to the new reviews, the Via Nano completely trounces.

While the Nano does have higher performance, it isn't anywhere near the same power envelope as Atom. It's like comparing a Phenom or Core2Quad with a Nano: the Nano's performance will look ridiculous.
 
- Scatter/gather can access only one cache block per cycle, so it is variable latency based on the number of cache blocks touched.

This is going to be true for pretty much any hardware that implements scatter/gather: the 4870, G100, or any vector supercomputer through the ages that doesn't have word-width SRAM main memory.
 
Negligible? With the binning technique in the paper, I think Crysis would need over 500MB of bin space. At 30 fps, that's 30 GB/s of bin BW alone. In the future, games could easily need this much or more.

So you are assuming that Crysis is using >10 million polygons per frame?

500 MB / 12 bytes per vertex (3 coords, 32-bit float each) / 3 vertices per poly = 13.89M polygons per frame.

I think you really should check your math or correct my assumptions here.
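To make the arithmetic behind these two posts explicit, here is a quick sketch (my own framing) of the minimal-storage assumption being questioned, plus the bandwidth figure quoted earlier:

```python
# Back-of-the-envelope check of the 500 MB bin-space figure, assuming the
# minimal layout from the post above: 3 coords per vertex, 32-bit floats,
# 3 vertices per polygon, no sharing/indexing (all assumptions, not the
# paper's actual storage scheme).
BYTES_PER_VERTEX = 3 * 4            # 3 coords * 4 bytes (32-bit float)
VERTS_PER_POLY = 3                  # non-indexed triangle list
BIN_SPACE_BYTES = 500 * 10**6       # the disputed 500 MB figure

polys_per_frame = BIN_SPACE_BYTES / (BYTES_PER_VERTEX * VERTS_PER_POLY)
print(f"{polys_per_frame / 1e6:.2f}M polygons per frame")  # 13.89M

# At 30 fps, writing the bins out and reading them back once gives the
# quoted 30 GB/s:
bin_bw = BIN_SPACE_BYTES * 30 * 2   # write once + read once per frame
print(f"{bin_bw / 1e9:.0f} GB/s of bin bandwidth")  # 30 GB/s
```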
 
BTW, binning-wise, what you really care about is the number of vertices and triangles per pass, not per frame, as you might be clever and re-use memory from previous passes.
 
So you are assuming that Crysis is using >10 million polygons per frame?

500 MB / 12 bytes per vertex (3 coords, 32-bit float each) / 3 vertices per poly = 13.89M polygons per frame.

I think you really should check your math or correct my assumptions here.
Umh..it doesn't really work like that, as you want to write out an indexed list of primitives, not a pure list. I also believe Mintmaster is assuming that all attributes per vertex are stored (up to 32 float4 on DX10) in main memory.
 
So you are assuming that Crysis is using >10 million polygons per frame?

500 MB / 12 bytes per vertex (3 coords, 32-bit float each) / 3 vertices per poly = 13.89M polygons per frame.

I think you really should check your math or correct my assumptions here.

Just to clarify something: I was watching a developer interview where they were moving the camera around the intensive outdoor scenes, and he said they push around a million polys per frame.
 
Umh..it doesn't really work like that, as you want to write out an indexed list of primitives, not a pure list. I also believe Mintmaster is assuming that all attributes per vertex are stored (up to 32 float4 on DX10) in main memory.

Which may be, but if you get to the point where you have that many vertices, you are going to have significantly increased overdraw. From the graphs provided, and as pointed out in the paper, it appears that in the cases where immediate-mode renderers experience large bandwidth requirements, the binned mode shows its greatest advantages.

Also, binning is fairly resolution-independent in its bandwidth requirements, whereas immediate-mode memory bandwidth increases significantly with resolution.
 
Just to clarify something: I was watching a developer interview where they were moving the camera around the intensive outdoor scenes, and he said they push around a million polys per frame.

So combined with nAo's comment, that would put it at a max of ~300 MB per pass or so.
 
Regarding the storage required by coverage masks, I wouldn't be overly concerned about it, as I'd expect non-pathological scenes with many triangles to be mostly composed of many extremely small primitives (if they cover any screen-space sample at all..)
 
So combined with nAo's comment, that would put it at a max of ~300 MB per pass or so.

So in Crysis' case, is it as simple as saying:

If a given scene required 700MB of memory and used, say, 100GB/s of bandwidth,

Intel's approach would require 1000MB of memory but only need perhaps 40GB/s of bandwidth?

Or have I completely misunderstood this?
 
The amount of memory that you are willing to devote to the binning stage is ultimately connected to performance. It's always possible to store just an ID per primitive and transform all geometry twice (one might want to only partially evaluate vertex and geometry shaders to come up with a final position per vertex at the binning stage..), and many different implementations that trade memory for speed are possible.
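As a rough illustration of that trade-off, here is a hedged sketch with made-up byte counts (the 80-byte attribute figure and 4-byte ID are my assumptions, not numbers from the paper):

```python
# Sketch of the memory/speed trade-off described above: stream out fully
# shaded vertices per bin, or store only a primitive ID and retransform the
# geometry during the rasterization pass (paying extra shading work instead
# of memory).
def full_attribute_bytes(prims, bytes_per_vertex=80, verts_per_prim=3):
    """Stream out every shaded vertex attribute at binning time."""
    return prims * verts_per_prim * bytes_per_vertex

def id_only_bytes(prims, bytes_per_id=4):
    """Store just an ID; positions are re-derived later at extra cost."""
    return prims * bytes_per_id

prims = 1_000_000
print(full_attribute_bytes(prims) / 1e6, "MB with full attributes")  # 240.0
print(id_only_bytes(prims) / 1e6, "MB with IDs only")                # 4.0
```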
 
The amount of memory that you are willing to devote to the binning stage is ultimately connected to performance. It's always possible to store just an ID per primitive and transform all geometry twice (one might want to only partially evaluate vertex and geometry shaders to come up with a final position per vertex at the binning stage..), and many different implementations that trade memory for speed are possible.

I see that as being very interesting for the console space. Large memory is cheaper than large bandwidth, and you would save on needing to spend up to 400M transistors on your 32MB of eDRAM for next gen.
 
Once you hit a limit you can also simply start rendering the most heavily populated tiles ... sure, you won't be able to completely render them, but you can just save the Z-buffer when you are done and finish them up later. IMG had a name for that method, but I can't immediately recall it.
 
You have an extra variable to play with on consoles. Rendering the screen in tiles (not necessarily as small as the Larrabee tiles) will reduce the storage needs. How doesn't that matter?
You mean the reduction in the framebuffer? Sure, but I'm of the philosophy that large framebuffers aren't needed. 1080p, 4xAA, 32 bpp (FP10 or, even better, shared exponent) is enough, IMO, so that's just 64 MB.
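That 64 MB figure checks out under those assumptions, if one also assumes a 32-bit depth value per sample (the depth size is my addition, not stated in the post):

```python
# Checking the "just 64 MB" framebuffer figure: 1080p, 4x MSAA, 32 bpp
# colour (FP10 / shared exponent) plus an assumed 32-bit Z per sample.
width, height, samples = 1920, 1080, 4
bytes_per_sample = 4 + 4  # 32-bit colour + 32-bit depth (assumption)
fb_bytes = width * height * samples * bytes_per_sample
print(f"{fb_bytes / 2**20:.1f} MiB")  # 63.3 MiB, i.e. roughly 64 MB
```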

Besides, the cost is fixed, and framebuffer reduction can be achieved on any platform with object level binning into large tiles. The problem with binned rendering is the unpredictability.

Yes, you can dump tiles before binning is finished, but then in the most poly-heavy situations (AFAIK the biggest culprit for framerate dips) you not only have a large rendering load, but you lose efficiency too.

Just so I can follow the math, how many parameters per vertex and how many vertices per frame are that number made up from?
So you are assuming that Crysis is using >10 million polygons per frame?
Remember, I'm talking about the way Intel is doing it, and I did acknowledge that there are ways of reducing it.

I think 50-100 bytes per vertex (some vertex shaders create many iterators) and 60 bytes per primitive is reasonable (in addition to coverage info, they need a 3x3 matrix per primitive for interpolation and a float3 for 1/z). Is 4M polys per frame that unreasonable of an assumption? Even when running at 20-30 fps we see scaling that doesn't depend only on pixel count.
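Spelling that estimate out with concrete values (75 bytes/vertex is my midpoint of the quoted 50-100 range, and "no vertex sharing" is a worst-case assumption of mine):

```python
# Mintmaster's estimate, worked through: 50-100 bytes per shaded vertex,
# ~60 bytes per primitive (3x3 interpolation matrix = 36 bytes, float3 for
# 1/z = 12 bytes, plus coverage and misc), 4M polys per frame.
bytes_per_vertex = 75      # midpoint of the quoted 50-100 range (my choice)
bytes_per_prim = 60        # per-primitive data quoted in the post
polys = 4_000_000
verts = polys * 3          # worst case: no vertex sharing between triangles

total = verts * bytes_per_vertex + polys * bytes_per_prim
print(f"{total / 1e6:.0f} MB of bin space")  # 1140 MB
```

Which, if anything, lands above the 500 MB figure being argued over.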
 
4 Mpolys per frame is reasonable, 4 Mpolys per pass is not (right now, perhaps in a couple of years..). What we care about with regards to the storage required by the binning stage is the number of primitives per pass.
 
Regarding the storage required by coverage masks, I wouldn't be overly concerned about it, as I'd expect non-pathological scenes with many triangles to be mostly composed of many extremely small primitives (if they cover any screen-space sample at all..)
Fair enough. I wasn't thinking of a very good storage scheme before, but I still think we're looking at a minimum of 9 bytes per primitive (4x4 region, 4xMSAA, 1 byte for XY) just for the coverage mask, and then you have other stuff to worry about too.
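For what it's worth, the 9-byte figure decomposes like this, assuming one coverage bit per sample and a packed one-byte XY offset (my reading of the post):

```python
# Decomposing the "minimum of 9 bytes per primitive" coverage estimate:
# a 4x4-pixel region with 4x MSAA, one bit per sample, plus one byte
# encoding the region's XY position within the tile.
region_w, region_h, msaa = 4, 4, 4
mask_bits = region_w * region_h * msaa   # 64 coverage bits
mask_bytes = mask_bits // 8              # 8 bytes of mask
xy_bytes = 1                             # packed 4-bit X + 4-bit Y (assumed)
print(mask_bytes + xy_bytes, "bytes per primitive")  # 9
```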
The amount of memory that you are willing to devote to the binning stage is ultimately connected to performance. It's always possible to store just an ID per primitive and transform all geometry twice (one might want to only partially evaluate vertex and geometry shaders to come up with a final position per vertex at the binning stage..), and many different implementations that trade memory for speed are possible.
True, which is why I mentioned it above. This especially makes sense for Larrabee due to its scalable triangle setup speed. Quasi-rasterizing for binning and then rasterizing again later isn't a big deal.
4 Mpolys per frame is reasonable, 4 Mpolys per pass is not (right now, perhaps in a couple of years..). What we care about with regards to the storage required by the binning stage is the number of primitives per pass.
Okay, maybe I overestimated it. Crysis just seems to stress setup, though, looking at the resolution scaling. Given the framerates, I think peak poly counts do get really high per pass.

I also forgot about clipped/culled/null triangles not getting binned.
 
I also forgot about clipped/culled/null triangles not getting binned.
Oh yeah, I completely forgot about rejected primitives as well!
BTW, on HS we had (worst-case scenarios) up to 4M polys per frame, split more or less evenly over 5 rendering passes (3 shadow maps + 1 z-pass + 1 color pass); factor in culling and I think we never rendered more than 0.5M visible polys per pass.
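Working those Heavenly Sword numbers through (the implied culling rate below is my arithmetic, not a figure from the post):

```python
# nAo's worst-case numbers: ~4M polys per frame split roughly evenly over
# 5 passes, with culling leaving no more than ~0.5M visible polys per pass.
polys_per_frame = 4_000_000
passes = 5                               # 3 shadow maps + 1 z-pass + 1 color pass
per_pass = polys_per_frame / passes      # ~0.8M submitted per pass
cull_rate = 1 - 0.5e6 / per_pass         # implied rejection rate (my inference)
print(f"{per_pass / 1e6:.1f}M submitted per pass, ~{cull_rate:.0%} culled")
```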

Still, their implementation (streaming out all shaded vertex data) doesn't sound completely optimal memory-wise; a two-pass vertex shading approach might save a lot of memory for a small performance hit.

Despite the problems generated by a TBR, do you think they really had any other option?
It seems to me that this is by far the most reasonable decision they could have made with regards to their rendering architecture.

I guess a lot of details are also missing from the paper for lack of space, as they don't say a word about pre- and post-transform (partially software) vertex caches or fragment rasterization order.
 
I guess a lot of details are also missing from the paper for lack of space, as they don't say a word about pre- and post-transform (partially software) vertex caches or fragment rasterization order.
It's all software, so wouldn't these kinds of things just be parameters? One game might benefit from a large vertex cache, another from a specialized rasterization order, ...
 
Yes, it's all software, but if your software doesn't even try to retrieve a previously transformed vertex, you won't have any (implicit) post-transform cache in place :)
 