IMR "Wall" Limits V PVR
PVR_Extremist
21-May-2002, 12:07
A number of years ago the general consensus was that IMRs would hit some kind of "technological wall" with respect to memory bandwidth and clock speed. This was an argument that PowerVR fans (including myself, truth be told) used to tout PowerVR technology as the eventual winner in this race.
Clearly, to date at least, this has not happened. Credit to nV and others who have employed various techniques to use their available bandwidth more efficiently (whilst also managing to get hold of faster, more capable RAM).
My question:
Is there still some kind of "technological wall" which will hamper IMR performance in the future?
My personal opinion is that nV and others will lean towards a hybrid config further optimising their bandwidth saving techniques but not fully going towards a tile based deferred rendering solution.
Also,
Are "other" companies actively pursuing tile based deferred rendering in "future projects"? If they are, how long do you think PowerVR will hold the advantage with respect to experience and capability in that area of technology?
Regards
Tino
mboeller
21-May-2002, 12:13
see below the thread "Intel's new graphics core" for answers.
PVR_Extremist
21-May-2002, 12:15
I saw that thread but in all honesty don't really consider Intel much of a competitor in the 3D market.
And it doesn't answer any of my other questions :roll:
mboeller
21-May-2002, 13:00
I saw that thread but in all honesty don't really consider Intel much of a competitor in the 3D market.
And it doesn't answer any of my other questions :roll:
Sorry, but I have no answers myself. I had hoped that you would see Intel as a "player" in the 3D-chipset arena. I'm not sure myself whether this new Intel chipset will help IMG, but it is better when two companies have a deferred rendering architecture instead of only one.
If Intel had used this core in the 850E chipset too, they would have been a force to reckon with, simply because the 4.2 GB/sec bandwidth would have been enough for a deferred renderer to shine (and maybe even outshine the nForce chipset / MX400 class).
I hope that Intel will use this graphics core in all new chipsets to improve their 3D performance. If that happens, IMG will have an easier time when new 3D features come up in new DirectX / OpenGL versions.
I think game developers also helped to avoid the "technology wall" by never developing a game that will not run well with the current IMR.
Imagine the games we could have if developers had good deferred render available.
The trick is that we get used to the limitations imposed by IMR.
Sooner or later they (hardware and software developers) will have to think seriously about deferred rendering.
We are approaching a level where polygons will become smaller than the size of pixels.. then even deferred rendering will become obsolete, surely?
Is there still an advantage to using deferred rendering when polygons are so small?
Rendering is not necessarily deferred with tiling, the scene is just rendered in tile order ... there's a difference.
PVR_Extremist
21-May-2002, 14:26
Sooner or later they (hardware and software developers) will have to think seriously about deferred rendering.
When? Why? Does this imply that the big 2 already have deferred rendering somewhere in their roadmap?
I don't know anything about roadmaps and this is just a gamer's guess (or hope).
In 2 or 3 years we will start to see some DX9 games (many levels of multipass), and the hardware developers will have a good 0.09 micron process available (which means DX9 for the masses). The main competition will be the $80 to $150 DX9 card, which is cost sensitive. To keep cost down while keeping performance high, deferred rendering will make sense. It will save bandwidth and fillrate.
I don't know anything about roadmaps and this is just a gamer's guess (or hope).
In 2 or 3 years we will start to see some DX9 games (many levels of multipass), and the hardware developers will have a good 0.09 micron process available (which means DX9 for the masses). The main competition will be the $80 to $150 DX9 card, which is cost sensitive. To keep cost down while keeping performance high, deferred rendering will make sense. It will save bandwidth and fillrate.
Well that's pretty much the line of reasoning that Tino refers to that PVR fans have used in the past. And obviously for one reason or another, it just has not come to pass.
PVR_Extremist
21-May-2002, 16:34
I don't know anything about roadmaps and this is just a gamer's guess (or hope).
In 2 or 3 years we will start to see some DX9 games (many levels of multipass), and the hardware developers will have a good 0.09 micron process available (which means DX9 for the masses). The main competition will be the $80 to $150 DX9 card, which is cost sensitive. To keep cost down while keeping performance high, deferred rendering will make sense. It will save bandwidth and fillrate.
Well that's pretty much the line of reasoning that Tino refers to that PVR fans have used in the past. And obviously for one reason or another, it just has not come to pass.
I was just going to say that :lol:
I can see Tile Based Deferred Rendering making more sense from a cost point of view. Indeed it doesn't cost a lot today. Then again, what we have available now from IMGTEC isn't a GF4ti4600 in performance either.
So my original questions still stand....
Not exactly the same reasoning.
What if Kyro II had DDR memory and 3 or 4 pipelines?
I think it could still be cheap and much faster than any other card in the same price range.
edited: what I am trying to say is "the same price range with much better performance" not "the same performance in the same price range".
Actually, Intel using a deferred + tile architecture is far more important than anything IMG could bring to the PC arena. One thing to note is that the i810 chipsets with their integrated core are in about 50% of desktops, IIRC. That came from a survey done not too long ago; it was discussed in the forums as well.
With that said, I think we'll see more of the same, with i845G. If this is the case then, guess where a lot of game makers will target their games?
Ok, hit me, but I think that IMR is the way to go. Not only can you get decently close speedwise with HyperZ like implementations but you can also get useful information in the form of the depth values for shadowmapping and other effects. IMR's may also give you occlusion query information.
Reverend
21-May-2002, 17:35
It's simple really - faster memory as well as more games appear to be CPU limited.
No one likes occlusion queries but software developers, and now that they have it I've only seen it put down on GDalgorithms :) (Demos are nice, but I prefer the impressions from game developers.) As long as you use immediate mode rendering in its present form it's the only way to do fine-grained occlusion culling, granted ... but it's a pretty damn sucky way; it would be much better if the hardware had access to bounding volume information itself, IMO.
Getting Z values for a shadow buffer is hardly a problem for a tiler, just a variation on rendering to a texture.
Not exactly the same reasoning.
What if Kyro II had DDR memory and 3 or 4 pipelines?
I think it could still be cheap and much faster than any other card in the same price range.
edited: what I am trying to say is "the same price range with much better performance" not "the same performance in the same price range".
Well the whole "What if?" line of arguments have also been done to death previously as well. "What if the Neon250 came out on time? What if memory prices didn't drop for the original 3Dfx Voodoo1? What if the Kyro had the benefit of mass production? Etc. etc." We all know that on paper deferred renderers have great benefits. The problem is that so far, no one has proved that one can come out at, "the same price range with much better performance" - speaking of the top end here where IMR's are supposedly hitting the bandwidth ceiling.
If people would stop trying to prove the negative, others could stop presenting "what if" scenarios.
If people would stop trying to prove the negative, others could stop presenting "what if" scenarios.
There's no 'proving of the negative' here because you can't logically prove a negative. We're waiting for proof of the assertion that IMRs will not be able to keep up with bandwidth demands and that deferred renderers will take over.
arjan de lumens
21-May-2002, 22:42
My question:
Is there still some kind of "technological wall" which will hamper IMR performance in the future?
Still memory bandwidth. So let's take a look at what the maximum possible memory bandwidth into a single-chip GPU might be. This bandwidth is determined mainly by 2 factors:
Width of memory bus, in number of pins
Datarate per pin of memory bus
For a Geforce4, you get 128 bits * 650 MHz = 10.4 GBytes/sec. But the ultimate maximum? The maximum bus width is obviously no less than 256 bits (P10, Parhelia). With flip chip packaging, pin counts of up to several thousand are possible, so I'd estimate that a 1024-bit (external!) bus is possible (though certainly expensive as hell). For the datarate, Rambus QRSL signalling permits a datarate of 1.6 GBit/s per pin. I'm not aware of any other scheme that offers comparable per-pin datarates.
This amounts to a (rather hypothetical) maximum of about 200 GB/s, barring cost and signal integrity issues. Which is about 20 times what Geforce4 has. So IMRs won't run into any hard limits anytime soon - about 6-7 years away, according to Moore's law. At which time eDRAM may have gotten cheap enough to take over for external memory solutions...
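The arithmetic behind these figures can be checked with a back-of-envelope sketch. The function name is made up for illustration, and the 18-month doubling period used to turn the roughly 20x gap into years is an assumption (one common reading of Moore's law), not something the post states.

```python
import math

def peak_bandwidth_gb_s(bus_width_bits, gbit_per_s_per_pin):
    """Peak bus bandwidth in GB/s: pins * per-pin data rate, 8 bits per byte."""
    return bus_width_bits * gbit_per_s_per_pin / 8

# Figures quoted in the post.
geforce4 = peak_bandwidth_gb_s(128, 0.65)      # 128-bit bus, 650 MHz effective
hypothetical = peak_bandwidth_gb_s(1024, 1.6)  # 1024-bit bus, 1.6 Gbit/s per pin

ratio = hypothetical / geforce4

# ASSUMPTION: bandwidth demand doubles every 18 months.
DOUBLING_PERIOD_YEARS = 1.5
years_until_limit = math.log2(ratio) * DOUBLING_PERIOD_YEARS

print(f"GeForce4:     {geforce4:.1f} GB/s")
print(f"Hypothetical: {hypothetical:.1f} GB/s ({ratio:.1f}x)")
print(f"Years until the gap closes: {years_until_limit:.1f}")
```

Under that doubling assumption the result lands inside the 6-7 year window quoted above; a slower or faster doubling period moves the estimate proportionally.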
Althornin
21-May-2002, 22:49
If people would stop trying to prove the negative, others could stop presenting "what if" scenarios.
There's no 'proving of the negative' here
Hence the word "trying" in MFA's post.... :)
DemoCoder
21-May-2002, 22:52
The DreamCast was a platform where every single developer knew they were coding for a deferred architecture and yet the games developed did not blow away games on IMR systems. You can't blame it on the developers.
The IMR systems to date have been bottlenecked in other areas such as fillrate and geometry performance. It's all well and nice that they don't need 500 Mhz DDR to work, but they skimped on the fillrate and T&L.
Developers can't very well push 20x overdraw/multipass with massive architecture if the CPU/T&L/fillrate of the unit can't handle it.
But of course, many people have been harping on this for years while all the naysayers bashed IMR vendors for boosting fillrate and wasting effort on T&L.
Althornin
21-May-2002, 22:54
The IMR systems to date have been bottlenecked in other areas such as fillrate and geometry performance. It's all well and nice that they don't need 500 Mhz DDR to work, but they skimped on the fillrate and T&L.
Don't you mean "the TBR/Deferred rendering systems to date...."?
And if so, then I mostly agree with what you have to say.
Entropy
22-May-2002, 00:41
My question:
Is there still some kind of "technological wall" which will hamper IMR performance in the future?
My personal opinion is that nV and others will lean towards a hybrid config further optimising their bandwidth saving techniques but not fully going towards a tile based deferred rendering solution.
To address the original question, there obviously don't seem to be any hard limits. The performance development has been quite predictable, with greater increases when the memory subsystem has gotten a factor of two architectural boost.
Generally speaking, graphics is well suited to parallel processing, which would seem to point us in the general direction of tilers, though not necessarily deferred renderers.
Looking at the trends of game graphics, we see
1. Increased polygon count
2. More complex environment = more overdraw
3. More work per pixel
1 would seem to favour IMRs, 2 would seem to favour DMRs and 3 could go either way with a theoretical favour for DMRs but with problems too.
(As usual, programmers will adapt to the limitation of the platforms available, so for DMRs to take over the market, they have first got to outperform IMRs on their own turf so to speak, as IMRs set the standard. But that is market dynamics, not technology.)
As has been pointed out, memory bandwidth would seem to be the factor that places the upper bound on IMR performance. (And for that matter DMRs, but at a slightly different point and for slightly different reasons. Data flow is _always_ limited by bandwidth. doh.) This will be the year when 256-bit DDR takes off, we can expect the usual clock ramps, and we have 4-bits-per-pulse tech waiting in the wings if necessary. Extrapolating, this should take us to a nominal 100GB/s within five years or so. Not too shabby, but not too exciting either, as it is only a factor of five after all, and the estimate is not pessimistic. However, that is time enough for GPUs to be able to carry sizeable amounts of memory on-chip, which is one way to reduce the dependence on off-chip memory bandwidth.
The problem for IMRs is that the bandwidth development is still pretty slow compared to the overall performance increases we could envision in that time frame. So we need to get smarter with how we use it, and indeed we are, using different techniques both to reduce unnecessary rendering, exemplified for instance by HyperZ, and to reduce unnecessary polygon load, with Matrox's depth-adaptive tessellation as the latest but certainly not last example. Peering deeply into the crystal ball in order to predict the farthest front of technology (five years or so), we should be able to expect doubled rendering performance every year during that period.
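To put numbers on that mismatch, here is a small sketch using only the two figures from the post: a factor of five in bandwidth over five years, and rendering performance doubling every year. Treating required bandwidth as proportional to rendering performance is a simplifying assumption made here for illustration.

```python
YEARS = 5

bandwidth_growth = 5.0        # "a factor of five" over the period (from the post)
render_growth = 2.0 ** YEARS  # performance doubling every year (from the post)

# Equivalent compound annual growth rate of raw bandwidth.
annual_bandwidth_growth = bandwidth_growth ** (1 / YEARS)

# ASSUMPTION: bandwidth demand scales linearly with rendering performance,
# so this is the factor that smarter bandwidth use would have to supply.
efficiency_gap = render_growth / bandwidth_growth

print(f"Bandwidth grows about {(annual_bandwidth_growth - 1) * 100:.0f}% per year")
print(f"Rendering performance grows {render_growth:.0f}x over {YEARS} years")
print(f"Gap to be closed by efficiency gains: {efficiency_gap:.1f}x")
```

The gap is what techniques such as HyperZ-style culling, compression and on-chip memory would have to cover if both projections held.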
(So extrapolating from the latest benchmarks/demos Comanche 4 and CodeCreatures, in five years we will be able to marvel at large numbers of nicely modelled static trees rather than either or. Oh joy.)
Deferred rendering is attractive due to the fundamental reasonableness of only rendering what is actually seen. But it doesn't remove all bottlenecks, and it introduces some extra work of its own. The real question is whether the bottleneck it removes is so much more limiting than the next bottleneck down the line + DMR overhead.... If not, extending and improving IMRs may be more practical in an application environment where IMR limitations are taken into account in graphics engine development and applications.
Entropy
Probably the major memory bandwidth advantages of tiling come from the locality of reference of the depth and frame buffer information more than from deferred rendering as such. Since the depth and frame buffer information for a tile can fit entirely on the chip, the z's and frame colors stay on chip for all the depth and color computations. This allows very high on-chip memory bandwidth to be used much like edram solutions, only without the large on-chip memory requirements.
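A deliberately crude traffic model illustrates the locality argument. Everything here is an assumption for illustration: 32-bit Z and color, an overdraw of 3, every fragment passing the depth test, and no Z compression or early rejection on the IMR side. It also ignores texture reads and the traffic a tiler spends writing and re-reading its binned scene data, so it only compares frame/depth buffer traffic.

```python
def imr_framebuffer_bytes(pixels, overdraw, bytes_z=4, bytes_color=4):
    """External traffic for a naive IMR: each fragment reads Z and, in the
    worst case where every fragment passes, writes Z and color back."""
    per_fragment = bytes_z + bytes_z + bytes_color  # Z read + Z write + color write
    return pixels * overdraw * per_fragment

def tiler_framebuffer_bytes(pixels, bytes_color=4):
    """External traffic for a tiler: Z and intermediate color stay in the
    on-chip tile buffer; only the finished color is written out once."""
    return pixels * bytes_color

pixels = 1024 * 768
overdraw = 3  # ASSUMPTION: illustrative average depth complexity

imr = imr_framebuffer_bytes(pixels, overdraw)
tiler = tiler_framebuffer_bytes(pixels)

print(f"IMR:   {imr / 2**20:.1f} MiB per frame")
print(f"Tiler: {tiler / 2**20:.1f} MiB per frame")
```

Real IMRs narrow this gap considerably with the techniques discussed elsewhere in the thread, which is why the model should be read as an upper bound on the advantage rather than a measurement.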
Deferred rendering is an added bonus for memory bandwidth since it primarily reduces texture bandwidth which is generally less intensive at the moment. In the future, it will eliminate wasted pixel shader computations which will become critically important.
However, by using a combination of compressed z's, hierarchical z buffering, and multiple z checks per pixel, combined with application driven deferred rendering (an unshaded pass followed immediately by a shaded pass), IMRs get almost all of the memory bandwidth savings of a deferred rendering tiler without any of its problems (API incompatibilities, etc.).
In the future, z queries will help reduce memory bandwidth even more (though they work equally well for both TBRs and IMRs)
The memory bandwidth "wall" is a bit illusory. There are many memory bandwidth technologies yet available. 256 bit buses are currently popular. Embedded RAM of one type or another is still a bit off but holds a lot of promise. MCM's open up many possibilities. Frequencies continue to climb. Better caching mechanisms, especially for geometry are on the horizon. On chip tessellation and displacement maps will also help in the geometry bandwidth department.
Chip designers forecast as best they can what the technology and cost structure of memories will be like when their design is built a couple of years out. Different 3d vendors take different memory approaches, but they all create a design that provides the memory bandwidth to meet their goals, using whatever they think is going to be the least expensive and best approach at the time of product launch. That's their job of course.
If any remember my posts of the past, they know that I like tilers. However, it is no coincidence that all of the major 3d hardware vendors at the high end including Nvidia, ATI, 3dlabs, and Matrox all use immediate mode renderers. Their engineers are all very aware of the tradeoffs between tiling architectures and IMRs and they have chosen IMR's for a reason. So it may seem that I prefer IMRs. I do not. I simply prefer whatever works best. Other than that I have no preferences either way. If TBRs really are the fastest solution and can produce the highest quality, high-end 3d graphics then they must demonstrate it with purchasable products the way IMRs have been doing for some time.
I think if TBR was so clearly the way of the future the way high precision color, programmability, and high quality AA are, then vendors would have pushed for the API changes to really support it long ago and would now all be using it. The fact that all the largest players in the market have not done this means their engineers feel there are better alternatives, and until there are purchasable products to demonstrate otherwise, they have not been proven wrong.
I for one would really like to see a fully maxed out TBR with all the pixel pipelines, external memory bandwidth (plenty of this is still needed of course), programmability, vertex shader performance, high quality AA, etc. needed to fully show what the approach is capable of. A real contender on the TBR side would be interesting to say the least.
High quality AA is the future? More like the past repeating itself :) (Warp5)
Jerry Cornelius
22-May-2002, 05:09
I think if TBR was so clearly the way of the future the way high precision color, programmability, and high quality AA are, then vendors would have pushed for the API changes to really support it long ago and would now all be using it.
This argument only holds so much water. Look at EAX and A3D, Beta and VHS, 4 stroke engines with pushrod valvetrains etc...
I think sometimes the first thing with its foot in the door gets all the marbles. Once it becomes accepted and understood, it's a high risk to depart from the norm and do something else, especially when you have to sell it.
I don't know what "the future of 3D rendering" is but it's a safe bet it's going to involve realtime shadows and lighting. Once ray tracing gets into the "picture", scene capturing will be unavoidable. Once you have that you may as well have a tile based deferred renderer.
I haven't given this much (any) thought, but I wouldn't be surprised if homogenous recursive descent rasterization introduces some added difficulties with deferred rendering.
I'm sure there is a deferred implementation that could work with unprojected, unclipped geometry; however, it probably wouldn't be very fun to implement in hardware (like I said -- I haven't given this any thought, so if you know of an algorithm to do this, I'd love to see it).
Most future engines will probably use a multipass technique like Doom III in order to avoid running costly shaders on occluded pixels -- render just Z in one pass, and then do all lighting and shading in subsequent passes. The big loss with this technique is geometry throughput; however, we're rapidly approaching a point where burning thousands/millions of (untextured, unlit) triangles to save fillrate is a given, since vertex throughput is so high (and cheaper to add than pixel throughput).
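The trade described here can be sketched for a single pixel. This is a toy model, not engine code: the function names are hypothetical, depths are plain floats where smaller means nearer, and "shading" is just a counter standing in for an expensive pixel shader.

```python
def shade_count_single_pass(fragment_depths):
    """Shade every fragment that passes the depth test, in submission order."""
    nearest = float("inf")
    shaded = 0
    for depth in fragment_depths:
        if depth < nearest:
            nearest = depth
            shaded += 1  # expensive shader runs; result may be overwritten later
    return shaded

def shade_count_z_prepass(fragment_depths):
    """Pass 1 lays down depth only; pass 2 shades fragments equal to the final Z."""
    nearest = min(fragment_depths)                            # depth-only pass
    return sum(1 for d in fragment_depths if d == nearest)    # shaded pass

# One pixel covered by four surfaces submitted back to front (the worst case).
back_to_front = [0.9, 0.7, 0.4, 0.2]

print(shade_count_single_pass(back_to_front))  # shader runs for every surface
print(shade_count_z_prepass(back_to_front))    # shader runs for the visible one only
```

The cost, as the post says, is that the geometry is submitted and transformed twice, which only pays off when vertex throughput is cheap relative to per-pixel shading.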
No one likes occlusion queries but software developers
I like having occlusion query capability; however, games use their visibility information for *so much* more than just rendering (i.e., AI, collision detection, physics, sound, etc.) that adding occlusion query capability to graphics cards isn't going to revolutionize game engines. If you're careful about pipeline stalls and flushes, it does improve performance.
Is there still some kind of "technological wall" which will hamper IMR performance in the future
Yes, but that "technological wall" is also shared by deferred renderers. Even with fancy Z-rejection circuitry, for any given scene, you will need to be able to fill log2(n)*resolution pixels (n is the average overdraw for the scene) every frame. In comparison, ray-tracing doesn't have this requirement. The common argument is that as depth complexity and resolution continue to increase, the added per-fragment cost of doing ray tracing is more than made up for by the fact that you only need to trace 1 ray/fragment. Multi-pass visibility algorithms like Doom III's help skirt this wall; however, many people have argued that ray-tracing hardware will be a necessity, since all Z-buffer hardware is subject to the same theoretical shortcomings.
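One way to see where a logarithmic figure can come from: if the n surfaces covering a pixel arrive in uniformly random depth order, the i-th fragment passes the depth test only when it is the nearest seen so far, which happens with probability 1/i. The expected number of fragments that pass is then the harmonic number H_n, which grows like ln(n). The sketch below checks this under that random-order assumption; it reproduces the logarithmic growth, not necessarily the exact log2(n) constant quoted in the post.

```python
import random
from fractions import Fraction

def expected_depth_passes(n):
    """Exact expectation H_n = 1 + 1/2 + ... + 1/n for random submission order."""
    return sum(Fraction(1, i) for i in range(1, n + 1))

def simulated_depth_passes(n, trials=20000, seed=0):
    """Monte Carlo estimate: count fragments that pass a less-than depth test."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        nearest = float("inf")
        for _ in range(n):
            depth = rng.random()
            if depth < nearest:
                nearest = depth
                total += 1
    return total / trials

for n in (2, 4, 8, 16):
    exact = float(expected_depth_passes(n))
    print(f"overdraw {n:2d}: exact {exact:.3f}, simulated {simulated_depth_passes(n):.3f}")
```

Sorted front-to-back submission does better than this and back-to-front does worse, so the random-order figure sits between the two extremes.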
Ok so assume 1 ray per fragment. IMG already achieves this with PowerVR (re: your comment on deferred renderers needing to fill log2(n)*res fragments per frame).
If you mean ray-tracing as in shadows, reflection, and refraction, then this means multiple rays per fragment AFAICS.
Deferred renderers still need to perform all the depth tests -- there isn't much that a deferred renderer offers over a multipass technique such as Doom III's.
And WRT shadows, reflections, etc -- each of those effects is an additional 1 ray/fragment (per layer of reflection/refraction), as opposed to an expected log2(n) using a Z-buffer renderer. There is a logarithmic advantage to using ray tracing (over both deferred and immediate mode renderers); however, the constant cost is so high that Z-buffering is still advantageous.
Ailuros
22-May-2002, 08:31
Dumb layman's question: both IMR and TBR approaches seem to have advantages as well as disadvantages. Why not attempt, within the realm of possibility, to combine both approaches' advantages into one architecture in the future (edram included when it becomes mainstream), trying to overcome as many of either side's disadvantages as possible?
From my rather simplistic viewpoint, I don't see vendors ignoring the advantages of deferred rendering so far; rather, they are making small steps in the above direction. Someone correct me please if I'm wrong.
mboeller
22-May-2002, 08:41
I'm sure there is a deferred implementation that could work with unprojected, unclipped geometry; however, it probably wouldn't be very fun to implement in hardware (like I said -- I haven't given this any thought, so if you know of an algorithm to do this, I'd love to see it).
You mean something like Fluid-Studio's REVi-3D-engine? I don't know if it is still in development, cause the info is gone, but IMHO they had this sort of 3D-engine in development.
Link : http://www.flipcode.com/cgi-bin/iotd.cgi?ShowImage=05-27-2000
We are rapidly reaching a point where we will start to want to fill a couple of shadow buffers for every frame, for which we will need all the transform power we can lay our hands on.
Remember, reality is 80 million polygons :) Only a minority of those go directly on screen.
Fluid-studio's method is an unknown as far as performance is concerned, but its creator does not present it as a fully general way of occlusion culling ... it does need preprocessing.
I've said this before, but I'll repeat it anyway ... raytracing does not make sense for first hits and shadow rays. Raytracing might only have to shade a pixel once, but it has to test rays against all the surfaces which are potentially visible for a pixel ... compared to a renderer a la Greene's hierarchical Z-buffer paper this will result in almost the same number of tests per pixel. With deferred shading thrown into the mix, all raytracing has over the Z-buffer (for first hits and shadow rays) is a slight storage advantage (because it can shade a pixel immediately) and its ability to subsample a scene ... which is only really useful if you are reusing samples from previous frames (otherwise subsampling == aliasing).
PVR_Extremist
22-May-2002, 10:12
We are rapidly reaching a point where we will start to want to fill a couple of shadow buffers for every frame, for which we will need all the transform power we can lay our hands on.
Remember, reality is 80 million polygons :) Only a minority of those go directly on screen.
80 Million Polygons? I thought reality was death and taxes :roll:
Ty:
Well the whole "What if?" line of arguments have also been done to death previously as well. "What if the Neon250 came out on time? What if memory prices didn't drop for the original 3Dfx Voodoo1? What if the Kyro had the benefit of mass production? Etc. etc." We all know that on paper deferred renderers have great benefits. The problem is that so far, no one has proved that one can come out at, "the same price range with much better performance" - speaking of the top end here where IMR's are supposedly hitting the bandwidth ceiling.
The same price range with much better performance is the key to establishing the deferred rendering idea.
The technology has already been proved by PowerVR. It works, and works very well.
gking,
Unless I'm missing something, you are comparing multiple depth tests per fragment against multiple ray-triangle intersections per fragment.
Basically, I don't see how ray-tracing a fragment is a constant time operation...
Regards,
Serge
The same price range with much better performance is the key to establishing the deferred rendering idea.
The technology has already been proved by PowerVR. It works, and works very well.
Well, not much has changed in this regard since the original 3Dfx came out. That has always been one of the advantages over IMR, yet still to this day (trying to get this topic back on track) DRs have not overtaken IMRs. This was one of the original questions that started this thread, "Is there still some kind of "technological wall" which will hamper IMR performance in the future?", which implies that DRs would surpass IMRs (because IMRs would have to rely on expensive memory, etc.). To this day, it still hasn't happened, nor does it appear to be happening anytime soon imo.
The wall is in front of you right now.
How to play Doom3 at 1024x768x32 at 60 fps with a $100 card now?
The technology wall is just avoided by game developers when they develop a new game. Sometimes a developer pushes a little and the wall appears.
The wall is in front of you right now.
How to play Doom3 at 1024x768x32 at 60 fps with a $100 card now?
The technology wall is just avoided by game developers when they develop a new game. Sometimes a developer pushes a little and the wall appears.
If the "wall" is here now, then you are saying that from this point on IMRs are going to be surpassed by DRs. No one mentioned playing Doom3 at that performance level with a $100 card as proof of the demise of IMR though. Or are you implying that a $100 DR card will be able to play Doom3 at that performance level? I'm not sure I understand the reference to it.
Recapped, the point of the thread was that a long time ago, there supposedly was this memory bandwidth "wall" that would cause IMRs to go away because they couldn't keep up with DRs. It turns out that is no more true today than it was back then which is why Tino asked his questions. In other words, memory and other bandwidth saving techniques have evolved to keep pace with the bandwidth requirements for IMRs and games. I'm not saying that it doesn't exist, I'm just saying that imo, IMRs haven't hit it yet. Maybe soon, maybe not, I don't know.
Althornin
23-May-2002, 02:14
Remember, reality is 80 million polygons :) Only a minority of those go directly on screen.
Do you think at some point we will go away from texturing polys?
To a world created entirely out of flat one color polygons (if you have enough of them, this is possible!!)
And would this (if it is the eventual end product) work very poorly on a TBR? because in that situation, geometry, not fillrate, would be king.
PRman usually works with lots of polys per pixel, bucket rendering does not seem to be a problem for them.
RussSchultz
23-May-2002, 13:21
PRman usually works with lots of polys per pixel, bucket rendering does not seem to be a problem for them.
Doom at one frame every 10 minutes wouldn't be too exciting. :)
Whether it's realtime or not, they still want it to be as fast as possible.
If the "wall" is here now, then you are saying that from this point on IMRs are going to be surpassed by DRs. No one mentioned playing Doom3 at that performance level with a $100 card as proof of the demise of IMR though. Or are you implying that a $100 DR card will be able to play Doom3 at that performance level? I'm not sure I understand the reference to it.
I don't want to use "if", but with a good 128-bit 250MHz DDR (8GB/s) DR card we could probably have an excellent framerate without a big price. Probably something around $150. How much does a Radeon 8500LE cost? Now imagine it using DR.
HyperZ and other techniques will probably give no more than an additional 20% of framerate.
Pacal,
here in the States we can get the OEM version for a touch over $100, but it is most likely lower clocked. At around $130 we get full retail versions of the LE.
I don't want to use "if", but with a good 128-bit 250MHz DDR (8GB/s) DR card we could probably have an excellent framerate without a big price. Probably something around $150. How much does a Radeon 8500LE cost? Now imagine it using DR.
HyperZ and other techniques will probably give no more than an additional 20% of framerate.
Heh, I know you don't want to use "If" but you basically did, "What IF we had a DR with ..." so really the question Tino asked hasn't been answered. We know about the great advantages DRs have but the question Tino asked is When?
Althornin
24-May-2002, 02:07
PRman usually works with lots of polys per pixel, bucket rendering does not seem to be a problem for them.
Right, but wouldn't the overhead costs for a deferred renderer become prohibitive?
I mean the sorting of polys (for the sole purpose of not having to draw them later) is important NOW, because we have relatively low geometry and lots of texture layers. But with enough polys, you don't need textures, so then, theoretically, wouldn't a deferred renderer kinda suck?
Tiling is only meant to localize framebuffer access to the point where you can finish rendering a pixel without needing to store any intermediary values in the external framebuffer ... handy if you want to have lots of storage per pixel. As long as it doesn't increase complexity and/or bandwidth for other parts of the pipeline it's a net win. At a high number of polygons per pixel, that basically boils down to the question of whether you will be able to tile polygons without actually running them through the vertex shader. If you can, then IMO with a big enough tile size it's a net win; if not, not.
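For readers unfamiliar with what "tiling polygons" involves, here is a minimal sketch of conservative bounding-box binning. It is not a description of any vendor's hardware: the tile size, function name and data layout are all assumptions for illustration. Note that it takes screen-space vertex positions as input, which is exactly why the question above matters: those positions normally only exist after the vertex shader has run.

```python
from collections import defaultdict

TILE_SIZE = 32  # ASSUMPTION: illustrative tile size in pixels

def bin_triangles(triangles, width, height, tile_size=TILE_SIZE):
    """Map each tile (tx, ty) to the indices of triangles whose screen-space
    bounding box touches it. Conservative: a tile may list a triangle that
    does not actually cover any of its pixels."""
    bins = defaultdict(list)
    max_tx = (width - 1) // tile_size
    max_ty = (height - 1) // tile_size
    for index, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        # Clamp the bounding box to the screen, then convert to tile coordinates.
        tx0 = max(0, int(min(xs)) // tile_size)
        ty0 = max(0, int(min(ys)) // tile_size)
        tx1 = min(max_tx, int(max(xs)) // tile_size)
        ty1 = min(max_ty, int(max(ys)) // tile_size)
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins[(tx, ty)].append(index)
    return bins

# Two screen-space triangles on a 64x64 target (2x2 tiles of 32 pixels).
triangles = [
    [(2, 2), (10, 2), (2, 10)],       # fits inside the top-left tile
    [(20, 20), (40, 20), (20, 40)],   # straddles all four tiles
]
bins = bin_triangles(triangles, 64, 64)
for tile in sorted(bins):
    print(tile, bins[tile])
```

With many polygons per pixel, each tile's list grows long and the binned data itself becomes a bandwidth cost, which is the overhead being weighed against the framebuffer savings.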
I just found this thread accidentally while looking for something else. I realise it's old but I just had to comment on this quote:
The DreamCast was a platform where every single developer knew they were coding for a deferred architecture and yet the games developed did not blow away games on IMR systems. You can't blame it on the developers.
Whether DC was better looking than the best IMR system at that time (it was the best looking console, arguably even after PS2 was released) isn't really important here. The point is that developers are programming for IMRs and that is limiting how effective TBR can be. To see this, simply look at some of the DC games: did you ever see anything even approaching Shenmue that could run at 30fps on a Neon 250 (basically the chip in Dreamcast)? Nope, because it could only be achieved when the devs made the game with a TBR in mind instead of an IMR. We all talk about how TBR can have performance problems in some games because devs make their games as if IMR is the only rendering technique out there. Well, what sort of performance problems would an IMR have if all games were made as if TBR was the only method of rendering out there? Big problems is the answer.
BoddoZerg
30-Aug-2002, 15:23
As said by many people in this thread, the popularity of TBR depends on people designing games for TBR, which is too high a barrier to entry if TBR is only moderately more cost-efficient than IMR. Thus, no matter how good it is, TBR will never become popular unless IMR hits the long-predicted "Wall". Judging by the Radeon9700's performance, this is not happening any time soon.
MrNiceGuy
30-Aug-2002, 15:57
The real advantage of a TBR may not be opaque depth complexity, which an early Z system can assist with, but blending complexity, of which early Z only eliminates some ( let's say half of blended pixels are occluded ).
Blending complexity goes up as developers start to use more trees ( blend the leaves b/c alpha test & MSAA don't look good ), grass, particle systems ( waterfalls, explosions, weather effects, blood sprays ).
It also goes up as developers do lighting that can't be expressed in a single pass. For instance, stencil shadow volumes require each light to be its own pass. Color shadow volumes don't, but not many people are using these yet.
Of course, particle systems, trees and stencil volumes tend to use very simple pixel shaders ( or no shader at all ), and IMR vendors are beginning to accelerate simple cases by fill rate - i.e. r300 only does 8 pix/clock with only 1 texture or z/stencil only.
I think IMRs will continue to dominate, partially because of history, and partially because so far, people have used tilers as a way of making things cheaper, rather than the way to make things that much faster. I think Gigapixel would have changed this.
As said by many people in this thread, the popularity of TBR depends on people designing games for TBR, which is too high a barrier to entry if TBR is only moderately more cost-efficient than IMR.
I don't think games need to be designed for TBR before TBR can become popular. All that's needed is a more up-to-date card than Kyro II was and the popularity will come. Certainly to get the most out of TBR games should be designed for it, but you certainly don't need to design games for TBR to already see a large performance advantage spec for spec. As for only moderately more cost efficient, well, Kyro II was a cheaper chip to produce than Geforce 2 MX and outperformed it easily; that's more than moderately more cost efficient IMO. As for the "wall".. that's a tough one, it'll take a lot of thought and possibly a crystal ball to even try to answer that.. we'll just have to wait and see.
Chalnoth
30-Aug-2002, 18:34
Well, after only reading the first post in this thread (yeah, could be a mistake, I know), I have to put my own two cents in.
Quite simply, I believe that deferred renderers will hit a technological wall long before IMRs.
The reason should be obvious. For deferred rendering, the TBR needs to cache the entire scene before rasterization. This brings along with it a host of problems.
First among these is memory usage. The deferred renderer must set aside an amount of memory that is large enough for the most complex scene that is ever rendered. This means a large expense in video memory that largely goes unused.
You also need double-buffering in the scene buffer for the sorting and rasterization to go on at the same time, making for even more memory usage.
Since buffer overruns are inevitable, those also present pretty major problems. I believe the Kyro series handles the issue by going ahead and writing a z-buffer in external video memory to combine two separately-rendered deferred frames. For optimal performance, this z-buffer must always be allocated, which results in more unused memory space.
And then comes bandwidth. As triangle counts increase and start approaching the micro polygon stage, not only will the Kyro need more memory than an immediate-mode renderer, it will also need quite a lot more memory bandwidth. The worst-case scenario here comes for long and thin polygons, such as those seen in highly-tessellated pillars or pipes.
Of course, you cannot forget pixel shaders. While vertex shading is, more or less, trivial to implement on a TBR, pixel shading is not. In particular, those fragment programs need to be stored along with everything else in the scene buffer, as well as the increasing amounts of data that are being passed from the vertex shaders to the pixel shaders.
Another way to look at this is simply that while an IMR will only have to store many things once, or perhaps not at all (just send them over AGP every frame), the deferred renderer will need to store extra copies of many things.
Granted, this doesn't mean that the idea of deferred rendering is totally useless, it's just that it starts to lose its luster as polycounts increase. Limited implementations that don't even attempt to always render in a deferred manner may be useful.
Dave Baumann
30-Aug-2002, 18:41
First among these is memory usage. The deferred renderer must set aside an amount of memory that is large enough for the most complex scene that is ever rendered. This means a large expense in video memory that largely goes unused.
No, there is no ‘must’. It is optimal for it to be cached, but that doesn’t mean it ‘must’ store it all.
You also need double-buffering in the scene buffer for the sorting and rasterization to go on at the same time, making for even more memory usage.
Not entirely you don’t.
Althornin
30-Aug-2002, 20:35
But chalnoth, being as a DR does not need the high speed exotic RAM that IMRs use, memory space on a DR is (or should be) quite cheap. So your argument about storage space is kinda moot.
Chalnoth
30-Aug-2002, 23:10
But chalnoth, being as a DR does not need the high speed exotic RAM that IMRs use, memory space on a DR is (or should be) quite cheap. So your argument about storage space is kinda moot.
No, it's not. Even though size becomes a problem before bandwidth, the scene buffer will suck up more bandwidth than the frame and z-buffers as polycounts continue to increase.
Chalnoth
30-Aug-2002, 23:14
No, there is no ‘must’. It is optimal for it to be cached, but that doesn’t mean it ‘must’ store it all.
As I said, a design that does not attempt to always store the entire scene every frame wouldn't be bad. The main reason is simply because if you do attempt to store all scene data every frame, if a few frames come along that exceed that buffer (high-action scenes), fillrate will take a nosedive.
Not entirely you don’t.
Why wouldn't you need double-buffering to ensure that both the binning/sorting portion of processing can be done at the same time as rasterization?
Jerry Cornelius
30-Aug-2002, 23:59
Scene storage is only as important as AGP (or whatever other interface there is) is fast. How much data can you send to a card 60 times a second? Because if you aren't sending it you are saving it in the precious video memory. What can you push across an AGP bus right now, 2 GB/sec? That's only around 60 MB per frame, totally saturated, at 30 fps. I'd be surprised if a game uses half of that in the next few years.
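For anyone wanting to check the arithmetic above, here is a rough sketch of the per-frame budget. The 2 GB/sec figure is the one quoted in the post; treating a gigabyte as 10^9 bytes and assuming the bus is fully saturated are simplifications, and the function name is made up for illustration.

```python
def per_frame_budget_mb(bus_gb_per_s: float, fps: float) -> float:
    """Upper bound on the data (in MB) a fully saturated bus can deliver per frame."""
    bytes_per_second = bus_gb_per_s * 1e9   # decimal gigabytes: an assumption
    return bytes_per_second / fps / 1e6     # bytes per frame, expressed in MB


for fps in (30, 60):
    budget = per_frame_budget_mb(2.0, fps)
    print(f"{fps} fps: {budget:.1f} MB per frame")
```

At 30 fps this comes out a little under 67 MB per frame, in line with the "around 60 MB" estimate, and the budget halves at 60 fps.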
As far as the wall goes, we are hitting it now. Has anyone noticed the price of video cards lately? And the emphasis is starting to lean towards features. You can render to a hardware tile and still be an IMR, as we've seen in some recent products, but without scene capturing you can't realize all the benefits, the biggest of which is that you only have to handle any given tile of memory once.
Chalnoth
31-Aug-2002, 00:07
Scene storage is only as important as AGP (or whatever other interface there is) is fast.
Not true in the least. There's a lot of scene data (pre-transform) that's already stored in video memory today, and more in the future. Not all data is transferred over AGP every frame. Additionally, data going into the transform pipeline is not necessarily the same as data coming out from it.
Additionally, data going into the transform pipeline is not necessarily the same as data coming out from it.
Yep.
For example N-patches can increase the data size considerably.
Jerry Cornelius
31-Aug-2002, 05:26
Not true in the least. There's a lot of scene data (pre-transform) that's already stored in video memory today, and more in the future. Not all data is transferred over AGP every frame.
Of course it is. If you're storing the geometry onboard already, then the extra space required when compared to an IMR is going to be a percentage of the video memory, not an order of magnitude.
I don't see HOS presenting a problem either. Since the bounds of the final geometry and the destination output buffer will be known before hand, there shouldn't be any need to generate an entire scene worth of parametric geometry up front.
Chalnoth
31-Aug-2002, 17:33
Of course it is. If you're storing the geometry onboard already, then the extra space required when compared to an IMR is going to be a percentage of the video memory, not an order of magnitude.
No, that's not true, because, as I said earlier, data in does not equal data out. Quick example: Unreal Tournament 2003. UT2k3 uses prefabs to reuse geometry data. There's an option to disable this "compression" and transform all geometry into world-space at level load time. Vogel said that this takes up to about 16MB in some levels.
Additionally, don't forget that not all data in the vertex buffers on the card may be visible, and there may be other data that will come straight from the CPU (though hopefully less now with more advanced vertex shaders). In other words, there are too many factors to simply state that scene data will be "a percentage of video memory."
Dave Baumann
31-Aug-2002, 19:41
Why wouldn't you need double-buffering to ensure that both the binning/sorting portion of processing can be done at the same time as rasterization?
Sorting and Rasterisation occur on the same frame of data on a tile-by-tile basis. Once an entire frame is captured a tile is sorted and that is sent to the rasteriser whilst the next tile is being sorted. So, in reality you only need to capture one frame of data - once tiles have been sorted in a frame that memory space can be reused for the incoming data from the next frame.
Chalnoth
31-Aug-2002, 19:51
Sorting and Rasterisation occur on the same frame of data on a tile-by-tile basis. Once an entire frame is captured a tile is sorted and that is sent to the rasteriser whilst the next tile is being sorted. So, in reality you only need to capture one frame of data - once tiles have been sorted in a frame that memory space can be reused for the incoming data from the next frame.
That doesn't make any sense to me. The incoming data from the next frame operates in an immediate-mode fashion. If the scene buffer data is stored on a per-tile basis, the memory freed will not necessarily be available for incoming data. If the scene buffer data is stored sequentially by incoming data, then it would be freed in a chaotic manner, and therefore pretty much unusable for incoming data.
Jerry Cornelius
31-Aug-2002, 19:52
IMO we aren't likely to see levels that make extensive use of complex primitives. It's been tried before and only goes so far.
I suppose that inherently there are some geometry compression techniques that are not going to work well with scene capturing. In the real world, I doubt this will have much impact anytime soon, if at all.
Dave Baumann
31-Aug-2002, 20:33
That doesn't make any sense to me.
That much is evident.
Look, you are not rendering one frame and sorting another.
The data coming in is captured and arranged in memory via pointer lists of tiles to triangles. Once you have a full scene the sorting begins on a tile-by-tile basis, and once one tile has been sorted rasterisation of that tile begins; so rasterisation is operating one tile behind the sorting. Now, once rasterisation has occurred the scene data for that tile is no longer needed, so its memory can be freed and reused for the incoming data on the next frame. So, in theory you don't need the full whack of memory for two whole scenes' worth, just one (and a bit, so you maintain efficiency from the swap from one frame to the next).
AFAIK they don't reuse the memory on a tile basis, but on groups of tiles. IIRC there is a patent available for the scheme.
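To make the "pointer lists of tiles to triangles" idea concrete, here is a minimal sketch. It is not how any shipping tiler does it: the 32-pixel tile size, the bounding-box overlap test, and the function names are all assumptions chosen for illustration, and real hardware frees memory per group of tiles rather than per tile.

```python
from collections import defaultdict

TILE = 32  # tile edge in pixels; an assumed value for illustration


def bin_triangles(triangles, width, height, tile=TILE):
    """Map (tile_x, tile_y) -> indices of triangles whose bounding box overlaps the tile."""
    bins = defaultdict(list)
    max_tx = (width - 1) // tile
    max_ty = (height - 1) // tile
    for index, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        # Clamp the bounding box to the screen and convert it to tile coordinates.
        tx0 = max(0, int(min(xs)) // tile)
        tx1 = min(max_tx, int(max(xs)) // tile)
        ty0 = max(0, int(min(ys)) // tile)
        ty1 = min(max_ty, int(max(ys)) // tile)
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins[(tx, ty)].append(index)
    return dict(bins)


def rasterise_and_free(bins):
    """Walk the tiles in order, dropping each tile's list once it has been consumed."""
    processed = []
    for tile_key in sorted(bins):
        tri_indices = bins.pop(tile_key)  # this tile's list is reusable from here on
        processed.append((tile_key, len(tri_indices)))
    return processed


triangles = [
    [(0, 0), (10, 0), (0, 10)],       # small triangle inside one tile
    [(20, 20), (70, 20), (20, 40)],   # wider triangle spanning several tiles
]
bins = bin_triangles(triangles, width=128, height=64)
for tile_key in sorted(bins):
    print(tile_key, bins[tile_key])
print(rasterise_and_free(bins))
```

Note that the second triangle is referenced from six tile lists even though it is a single primitive, which is the kind of duplication the storage and bandwidth arguments earlier in the thread are about.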
Chalnoth
31-Aug-2002, 20:43
All that means to me is that you still need to double-buffer the post-transform data, as well as hold a little bit of extra data for the tile being rendered and the next tile (or group of tiles...).
That is, I'm reading the procedure as:
Transform->Sort->Rasterize
The way you're describing it, it seems some caching is needed between each stage, and double buffering for optimal performance. I still do not see how Transform->Sort can double buffer anything less than the entire scene, as the scene data comes in in an immediate-mode fashion.
The double buffering from Sort->Rasterize is, more or less, trivial and should be handled by on-chip caches, as it's only for a single tile or a group of tiles.
I hope you didn't think I meant that two double-buffers of the whole scene were required...
Tessier
28-Nov-2002, 23:29
IMHO for a tile based deferred renderer the rendering pipeline is: transform->index->sort->rasterize.
And I think you need two buffers both for transformed geometry data and index data.
The T&L unit writes transformed "frame2" data into the first buffer.
The indexing unit uses the outputs of the T&L unit, and works in parallel with it on the same frame ("frame2").
At the same time the sorting unit works on another frame ("frame1"), using the already transformed and indexed data.
Sorting and rasterization work in parallel with each other, but not on the same tile (sorting is one tile ahead of rasterization).
In a tile based system transformation and sorting cannot work on the same frame, as you have to know which triangles fall into the tile you are sorting currently - and you have this information only when all triangles in the frame are transformed.
DemoCoder
28-Nov-2002, 23:53
I don't think the reason to go to a deferred rendering architecture is to increase bandwidth efficiency. In the future, I think the primary limit of IMR will be shader execution speed.
If you have a 100 to 1000 instruction shader, an architecture which can avoid executing long shaders for invisible pixels will be much faster.
Also, filling depth/stencil will also be a limit for unified lighting algorithms.
There is a sort of 2-pass deferred rendering technique you can do with IMRs. In the first pass, you draw the depth buffer and store, in another render target, the parameters you need to input to a pixel shader. In the second pass, you read the parameters from the previous pass and use them to select which pixel shader to run and with which parameters. The early-Z culling will take care of not executing the long shaders for hidden pixels.
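A toy model of that 2-pass idea, just to show where the savings come from. Everything here is a simplification: fragments are plain tuples, the "parameter render target" is a dictionary, the second pass simply shades each stored pixel once instead of relying on early-Z hardware, and the names are invented for the example.

```python
def forward_shade(fragments, shade):
    """Single pass: run the long shader for every fragment that passes the depth test."""
    depth, color, shader_runs = {}, {}, 0
    for x, y, z, params in fragments:
        if z < depth.get((x, y), float("inf")):
            depth[(x, y)] = z
            color[(x, y)] = shade(params)
            shader_runs += 1
    return color, shader_runs


def two_pass_shade(fragments, shade):
    """Pass 1 stores depth and shader inputs only; pass 2 shades each visible pixel once."""
    depth, param_target = {}, {}
    for x, y, z, params in fragments:          # pass 1: no long shader is executed
        if z < depth.get((x, y), float("inf")):
            depth[(x, y)] = z
            param_target[(x, y)] = params
    color = {pixel: shade(params) for pixel, params in param_target.items()}  # pass 2
    return color, len(param_target)


def expensive_shader(params):
    """Stand-in for a 100 to 1000 instruction pixel shader."""
    return params * 2


# Three surfaces covering the same pixel, submitted back to front (worst case for an IMR).
fragments = [(0, 0, 0.9, 10), (0, 0, 0.5, 20), (0, 0, 0.1, 30)]
forward_color, forward_runs = forward_shade(fragments, expensive_shader)
deferred_color, deferred_runs = two_pass_shade(fragments, expensive_shader)
print(forward_runs, deferred_runs, forward_color == deferred_color)
```

With back-to-front submission the single-pass version runs the long shader for every layer, while the two-pass version runs it once per covered pixel and produces the same image.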
IMHO for a tile based deferred renderer the rendering pipeline is: transform->index->sort->rasterize.
In a tile based system transformation and sorting cannot work on the same frame, as you have to know which triangles fall into the tile you are sorting currently - and you have this information only when all triangles in the frame are transformed.
Wouldn't it be cheaper to do transform->binary search & insert instead, i.e. no separate sorting procedure? The insert procedure itself is always a sorted insert?
darkblu
29-Nov-2002, 10:55
IMHO for a tile based deferred renderer the rendering pipeline is: transform->index->sort->rasterize.
In a tile based system transformation and sorting cannot work on the same frame, as you have to know which triangles fall into the tile you are sorting currently - and you have this information only when all triangles in the frame are transformed.
Wouldn't it be cheaper to do transform->binary search & insert instead, i.e. no separate sorting procedure? The insert procedure itself is always a sorted insert?
erm, what's that sorting for in the first place?
Tessier
29-Nov-2002, 11:05
Well, ok, not sorting, but z comparing.
Kristof
29-Nov-2002, 11:53
IMHO for a tile based deferred renderer the rendering pipeline is: transform->index->sort->rasterize.
In a tile based system transformation and sorting cannot work on the same frame, as you have to know which triangles fall into the tile you are sorting currently - and you have this information only when all triangles in the frame are transformed.
Wouldn't it be cheaper to do transform->binary search & insert instead, i.e. no separate sorting procedure? The insert procedure itself is always a sorted insert?
I am not really paying attention but this sort has to be per pixel correct, not sure how you want to do that, do you want to store things at the per pixel level ?
Well, the actual sequence is: V Shade, Clip, VP Trans, Cull, Tile/Bin (this bit is the memory consumer, it isn't a sort) and, once the whole scene has been gathered, Rasterise (the memory freer). There is no specific "Sorting" step; per pixel depth test is effectively applied in the same way as a conventional rasteriser, but a tile at a time (memory being freed by group of tiles).
The parameter store does not need to be double buffered as it's treated more like a FIFO, i.e. the tiler consumes memory blocks and the rasteriser frees them; both processes happen at the same time. The only time the front end stalls is when there is zero memory left when it requests it. However, in reality, if a scene can fit in the available parameter memory (free and in use) stalls are kept to a minimum. At worst these stalls are a bit like the stall you get in an IMR solution when the rasterisation of a modest size triangle stalls the whole geometry pipeline (this is why IMR peak rates are normally much higher than their real world sustained figures).
John
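The FIFO-style parameter store described above can be sketched as a pool of fixed-size blocks. This is only an illustration of the consume/free behaviour in the post: the class and method names, the block count, and the policy of freeing the oldest block first are assumptions, not a description of the actual hardware.

```python
from collections import deque


class ParameterStore:
    """Pool of parameter memory blocks shared by the tiler and the rasteriser."""

    def __init__(self, num_blocks):
        self.free = deque(range(num_blocks))  # blocks available to the tiler
        self.in_use = deque()                 # blocks holding binned scene data
        self.stalls = 0                       # times the front end found no free block

    def tiler_allocate(self):
        """Front end asks for a block; returns None (a stall) if none is free."""
        if not self.free:
            self.stalls += 1
            return None
        block = self.free.popleft()
        self.in_use.append(block)
        return block

    def rasteriser_release(self):
        """Back end has finished with the oldest block and hands it back to the pool."""
        if self.in_use:
            self.free.append(self.in_use.popleft())


store = ParameterStore(num_blocks=4)
blocks = [store.tiler_allocate() for _ in range(5)]  # the fifth request finds nothing free
store.rasteriser_release()                            # the oldest block is freed
reused = store.tiler_allocate()                       # and is immediately reusable
print(blocks, store.stalls, reused)
```

The front end only stalls on the request that finds the pool empty; as soon as the rasteriser hands a block back, the same memory serves the incoming data, which is why a second full-size buffer is not needed.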