Real-time raytracing Real Soon?

wco81

Legend
Interesting conclusion:

Based on these numbers and on the relative performance differences between the P4 and the latest processors from Intel and AMD, I think it's reasonable to guesstimate that a four-core Conroe or Athlon 64 system would get you to the point where you could do real-time ray tracing at 450M raysegs/s or higher. This would put software-based real-time ray tracing well in reach of a God Box in late 2007, and a Performance Gaming Box not too long thereafter. These numbers also make me wonder if you couldn't do real-time ray tracing on a PS3, or even on an Xbox 360 (less concurrency than the PS3, but higher per-thread performance).

http://arstechnica.com/news.ars/post/20060805-7430.html

Is Hannibal far out on a limb?
 
optimistically: reasonable realtime renditions (i.e. clever hacks and workarounds) of raytracing, radiosity, global illumination, etc. may happen in the first half of the next decade (2010-2014) for PCs and consoles. just my guess.

it's not going to happen in 2-3 years.
 
A problem with software rendering in this timeframe: if the CPU is maxed out doing the render, what's going to run the game code?
 
zed said:
paper on doing raytracing on Cell; as you can see it thrashes an Opteron and will most likely be at least twice as quick as the fastest CPU available

Probably, although they are rating an Opteron at about 8 GFLOPS, whereas Conroe at nearly 3GHz is almost 50 GFLOPS, and from skimming I am not sure if the Opteron they tested was dual or single core. Anyhow, it seems that on their tests Conroe would have lost by a bit, which is not totally surprising, as raytracing is also very reliant on fast memory accesses and FlexIO/XDR seem to be a strength here.
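As a rough sanity check on that ~50 GFLOPS figure (my own back-of-envelope, not a number from the paper), assume each of Conroe's two cores can retire one 4-wide SSE add and one 4-wide SSE multiply per cycle at the X6800's 2.93 GHz:

```latex
2~\text{cores} \times (4+4)~\tfrac{\text{flops}}{\text{cycle}} \times 2.93~\text{GHz} \approx 47~\text{GFLOPS (single precision)}
```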

i.e. Intel Core 2 Extreme X6800 costing $1000 (not bad for a $500-$600 machine)

But Conroe only costs Intel $40 to produce ;) Seriously, it is difficult to compare cost that way because the markets are really different on the business end and the consumer end (different access and uses). The E6400 is pretty much the same as the E6700 in everything but cost (same chip, different binning). So the Intel chips carry a significant amount of markup -- Intel adds significant markup for the retailers and the retailers add more on top of that for the consumer. The PS3, on the other hand, has little to no profit for the retailer and is sold at a loss by Sony. One could flip this around and compare the cost of a Cell blade or add-on card, which would be competing in a similar market and pricing structure, in which case you would be paying thousands upon thousands for Cell (like $7k for the add-on card) versus $300 for a basic Conroe.

Anyhow, the results are exciting. Jawed posted a nice post a while back on GI using sparse sampling, and it showed render times in the range of 120-180 seconds at ~1 megapixel display resolutions with 500k-1000k triangles on a single-core P4 2.8GHz. With Cell possibly reaching 100s of cores next gen and Intel aiming for the 32 range for 2009, there is some hope. Hopefully GPUs will become diversified enough that they could toss their 500+ ALUs at it in 5 years' time. I think that is where the real win could be: hundreds upon hundreds of simple cores with great bandwidth/latency to a larger pool of memory. That may be the big problem there, though: lots and lots of fast memory.
 
I still haven't seen how they get around the memory bottlenecks. The moment you get beyond one recursion, you can have totally random memory accessing. That means no caching or data prefetching but working straight at RAM speeds. What I've seen from the likes of SaarCOR are simplistic scenes that aren't going to suffer from as much RAM thrashing as a real scene where you're using Ray Tracing for things it's best at like realtime reflections. A collection of matte objects with one or two glass balls is not at all a useful indicator of realtime raytracing applications. It's all very well saying RT algorithms offer greater realism, but if you're not using the techniques that provide that realism, it actually looks worse than standard rasterized graphics. And if you're only rendering matte, textured objects, why raytrace at a few frames a second when your GPU can render at 60+?

For very complex scenes raytracing makes sense due to the scalability, plus it can render HOS without any greater complexity, but I don't see how they are going to get past the memory bottlenecks for realtime, large-scale ray tracing using the features that make it worth using. 5 fps for technical, scientific applications makes sense, but for games - not in a zillion years unless they've found a hack for the memory problem.

On an Ars point, what does Hannibal mean with the XB360 having higher per-thread performance? Is he suggesting an XeCPU core will outperform a SPU at a raytracing algorithm at the same clockspeed? Why would that be? Memory accessing problems in the SPU? I can't figure why he'd think that. :???:
 
Shifty Geezer said:
On an Ars point, what does Hannibal mean with the XB360 having higher per-thread performance? Is he suggesting an XeCPU core will outperform a SPU at a raytracing algorithm at the same clockspeed? Why would that be? Memory accessing problems in the SPU? I can't figure why he'd think that. :???:

If that were the case, why not use the PPU on CELL instead of the SPUs? I don't think it makes any sense. I think he's referring to the commonly cited "higher general purpose computing performance" of the 360 CPU cores (because it has 3(!!!) of them - which is total bs anyways :LOL: ...).

Even worse, the per-thread performance of the PPU on the PS3 CELL is supposedly higher, as reported by some devs (Crytek).
 

On an Ars point, what does Hannibal mean with the XB360 having higher per-thread performance? Is he suggesting an XeCPU core will outperform a SPU at a raytracing algorithm at the same clockspeed? Why would that be? Memory accessing problems in the SPU? I can't figure why he'd think that. :???:

I'll ask

Edit

Looks like there is already a discussion going about that on the forums, though Hannibal hasn't jumped into the discussion yet

http://episteme.arstechnica.com/eve/forums/a/tpc/f/174096756/m/226007140831/p/1
 
On an Ars point, what does Hannibal mean with the XB360 having higher per-thread performance? Is he suggesting an XeCPU core will outperform a SPU at a raytracing algorithm at the same clockspeed? Why would that be? Memory accessing problems in the SPU? I can't figure why he'd think that. :???:

Just a guess, but possibly he is assuming the higher theoretical peak flops per core (which is higher on Xenon cores) plus the (relatively) more robust branching and memory access. Maybe he has something else in mind though.

tbh Raytracing may be nice, but Global Illumination techniques are where it is at (the terms can sometimes be used interchangeably by some, but typically I have seen RT as meaning single rays with no bounces, where GI is all the associative properties of the rays and bouncing). Like Shifty said, a lot of the demo RT stuff is pretty simple and not very representative of the complexity any of us want, and dare I say some of the GPU rasterizing hacks that mimic a lot of effects can look nicer than straight RT. Kind of a cost/reward thing. Spending all your resources on RT may not necessarily give you a "win" in the end product if it eats up the graphics budget for other effects.
 
Just a guess, but possibly he is assuming the higher theoretical peak flops per core (which is higher on Xenon cores).
As an aside, can you quickly refresh me on why this is so? I thought the XeCPU and the SPU both had vec4 VMX units at 3.2 GHz. What extra maths wizzlet does XeCPU have?

As for branching, that's not something I associate with raytracing. There aren't many conditional calculations, and most calculations are applied to every ray regardless. It's this simplicity and elegance that makes RT a nice graphics rendering method, but not a fast one! Short of a materials conditional 'IF bump-map-applied THEN fetch normal map' I can't see where you'd have branching, and a lot of that should be able to be restructured into iterative steps, I think. Given the increased number of SPUs and the efficiency with which RT algorithms can fit into the LS, I'd have thought Cell would trump XeCPU in a big way, unless the memory fetching is much better on XeCPU and holding Cell back. Algorithmic textures would help alleviate that. Won't make for good games, but if you're just after 'realtime raytracing' it should be doable! A raytraced snooker game would be the ideal test-case, I think.
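For what it's worth, a trivial sketch of the kind of restructuring I mean (hypothetical C++, not from any real engine; the field and function names are made up): give every material a normal source and a blend weight, and the 'IF bump-map-applied' conditional turns into straight arithmetic in the inner loop.

```cpp
#include <cmath>
#include <cstdio>

struct Vec3 { float x, y, z; };

// Branchy version: a per-ray conditional on the material.
Vec3 surfaceNormalBranchy(bool hasBumpMap, Vec3 geomNormal, Vec3 mapNormal)
{
    return hasBumpMap ? mapNormal : geomNormal;
}

// Branchless version: every material carries a normal source and a weight.
// Flat materials just use weight 0.0 (and could point at a shared 1x1 map),
// so the inner loop always does the same fetch and blend, no conditional.
Vec3 surfaceNormalBranchless(float bumpWeight, Vec3 geomNormal, Vec3 mapNormal)
{
    Vec3 n;
    n.x = geomNormal.x + bumpWeight * (mapNormal.x - geomNormal.x);
    n.y = geomNormal.y + bumpWeight * (mapNormal.y - geomNormal.y);
    n.z = geomNormal.z + bumpWeight * (mapNormal.z - geomNormal.z);
    float len = std::sqrt(n.x * n.x + n.y * n.y + n.z * n.z);
    n.x /= len; n.y /= len; n.z /= len;
    return n;
}

int main()
{
    Vec3 geom{0.0f, 1.0f, 0.0f}, map{0.2f, 0.9f, 0.1f};
    Vec3 n = surfaceNormalBranchless(1.0f, geom, map);
    std::printf("%.3f %.3f %.3f\n", n.x, n.y, n.z);
}
```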
 
lol, I already solved the random access problem of raytracing. I posted it in another thread somewhere, but I don't think anyone cared. In any case, I've discussed it with some graduate students at my university, and they think it should work too. The only caveat is that you need a multi-core setup with massive interprocessor BW. The CELL should be perfect, and by my grossly inaccurate estimation, fairly close to the "interesting" level of performance.
 
As an aside, can you quickly refresh me on why this is so? I thought the XeCPU and the SPU both had vec4 VMX units at 3.2 GHz. What extra maths wizzlet does XeCPU have?

On paper, the PPE FPU contributes another 4 flops per cycle, but a lot of people think they shouldn't be counted... IIRC, the FPU and VMX can only alternately execute mathematical instructions, for example.

On a general note, I think per-thread performance would very much depend on your workload. A PPE isn't necessarily going to be better at all with some tasks, at least not because it's more "general" - as we saw before in some apps and benchmarks, the SPE can clearly outperform much fatter cores like the P4. Its memory model is a double-edged sword in this regard - for tasks that can effectively use it, it can seemingly be a big win.

edit - more than that, I'm really thinking of per-core performance, and I'm assuming that's what Hannibal meant to say. Because if we're considering per-thread performance with 6 threads sharing 3 cores..well...
 
lol, I already solved the random access problem of raytracing. I posted it in another thread somewhere, but I don't think anyone cared. In any case, I've discussed it with some graduate students at my university, and they think it should work too. The only caveat is that you need a multi-core setup with massive interprocessor BW. The CELL should be perfect, and by my grossly inaccurate estimation, fairly close to the "interesting" level of performance.

These
http://beyond3d.com/forum/showpost.php?p=791374&postcount=19
http://beyond3d.com/forum/showpost.php?p=791456&postcount=29
posts?

I have been thinking along the same lines, where each core (SPE) would hold a small volume of the scene geometry and stream incoming rays from main memory. The rays would either pass through the volume, generate new reflection or refraction rays, or die.
One problem that I'm stuck on is how to assign rays to a new volume element in a neat way.
Another problem is the selection of acceleration structures for the scene geometry. I have a gut feeling that making the correct selection is crucial for good performance.
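A rough single-machine sketch of the per-core loop I have in mind (plain C++ standing in for an SPE; all the struct and function names here are invented, and the intersection step is just a placeholder): each worker owns a small chunk of the scene, drains the rays queued for that chunk, and either kills them or forwards them to a neighbouring chunk's queue.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// One ray record as it travels between volume elements.
struct Ray {
    float ox, oy, oz;   // current origin
    float dx, dy, dz;   // direction
    uint32_t pixel;     // framebuffer pixel this ray contributes to
};

// A "volume element": a chunk of the scene small enough for one core's local store.
struct Volume {
    std::vector<Ray> inbox;                 // rays waiting to be traced in this chunk
    std::vector<int> neighbours;            // ids of adjacent chunks
    std::vector<std::vector<Ray>> outbox;   // one outgoing queue per neighbour
};

// Placeholder for the real work: intersect the ray against this chunk's geometry.
// Returns true if it hit something locally, false if it exits the chunk.
bool traceInChunk(const Volume&, Ray& r)
{
    r.ox += r.dx; r.oy += r.dy; r.oz += r.dz;   // pretend the ray marched through
    return false;
}

void processVolume(Volume& v)
{
    for (Ray r : v.inbox) {
        if (traceInChunk(v, r)) {
            // Hit local geometry: shade, spawn reflection/refraction rays, or let the ray die.
        } else if (!v.neighbours.empty()) {
            // Ray exits: hand it to a neighbouring chunk's queue. Picking the *right*
            // neighbour cheaply is the open problem mentioned above; this sketch
            // just dumps everything into the first one.
            v.outbox[0].push_back(r);
        }
    }
    v.inbox.clear();
}

int main()
{
    Volume v;
    v.neighbours = {1};
    v.outbox.resize(v.neighbours.size());
    v.inbox.push_back({0, 0, 0, 1, 0, 0, 42});
    processVolume(v);
    std::printf("forwarded %zu ray(s) to volume %d\n", v.outbox[0].size(), v.neighbours[0]);
}
```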
 
Again, as I've said many times before on this board: It's going to be excellent for shadows, but for anything else we're not there by a long shot.
The shadow buffer technique could be regarded as a very simple kind of specialised raytracing; it just doesn't take into account the slant of the surface the shadow is going to be projected upon.
 
Squeak said:
It's going to be excellent for shadows
Depends on your definition of excellent.
You only need a very simple subset of functionality for accurate shadows, and it ought to be considerably faster than a full-blown raytracer.
That said - it's a subject for a lot of debate whether even that simplified subset would be 'fast enough' relative to performance of existing realtime shadow solutions (on existing hardware, I'm not saying anything about future).
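To make 'very simple subset' concrete, here's a toy occlusion query (hypothetical C++, spheres only, nothing to do with any shipping shadow solution): a shadow ray needs no shading and no recursion, and it can stop at the first hit, which is exactly why it ought to be so much cheaper than a full tracer.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

struct Vec3 { float x, y, z; };
struct Sphere { Vec3 centre; float radius; };

static Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Shadow query: we only need *any* hit between the point and the light, not the
// closest one, so there is no shading, no recursion, and we bail out on the
// first occluder found.
bool inShadow(Vec3 point, Vec3 lightPos, const std::vector<Sphere>& occluders)
{
    Vec3 toLight = sub(lightPos, point);
    float distToLight = std::sqrt(dot(toLight, toLight));
    Vec3 dir = {toLight.x / distToLight, toLight.y / distToLight, toLight.z / distToLight};

    for (const Sphere& s : occluders) {
        Vec3 oc = sub(point, s.centre);
        float b = dot(oc, dir);
        float c = dot(oc, oc) - s.radius * s.radius;
        float disc = b * b - c;
        if (disc < 0.0f)
            continue;                            // ray misses this sphere
        float t = -b - std::sqrt(disc);
        if (t > 1e-4f && t < distToLight)
            return true;                         // any hit will do: early out
    }
    return false;
}

int main()
{
    std::vector<Sphere> occluders = {{{0, 1, 0}, 0.5f}};
    std::printf("%d\n", inShadow({0, 0, 0}, {0, 3, 0}, occluders));   // prints 1: the sphere blocks the light
}
```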

Shifty Geezer said:
but if you're just after 'realtime raytracing' it should be doable!
You can claim that just by including one of the variations of relief mapping (or whatever the heck people are calling it nowadays). Though now that you mention it, pool/snooker with raytraced balls would be funny, though I bet the developer would make them all look like glass just to show off RT effects.

Acert93 said:
Just a guess, but possibly he is assuming the higher marketing flops per core (which is higher on Xenon cores)
Fixed. :p
I suspect he was thinking more along the lines of running code with lots of random memory accesses and conditionals (which is common in many RT implementations).
 
The raytraced version of Quake looks like ass and it took 20 Athlons to render in real time. Raytracing on the CPU for the coming years seems like a silly idea. The GPU is making much better progress. Radiosity with about 10,000 elements can be done in real time according to GPU Gems 2.
 
These
http://beyond3d.com/forum/showpost.php?p=791374&postcount=19
http://beyond3d.com/forum/showpost.php?p=791456&postcount=29
posts?

I have been thinking along the same lines, where each core (SPE) would hold a small volume of the scene geometry and stream incoming rays from main memory. The rays would either pass through the volume, generate new reflection or refraction rays, or die.
One problem that I'm stuck on is how to assign rays to a new volume element in a neat way.
Another problem is the selection of acceleration structures for the scene geometry. I have a gut feeling that making the correct selection is crucial for good performance.

Well, for the purposes of real-time graphics you need to have an acceleration structure that's flexible enough to be updated in real time as the geometry changes. The key thing to remember is that you don't need perfection; a good approximation is enough. Thus, you must allow some overlap between volumes, and rays that fall in between are arbitrarily assigned to one volume or the other, but not both. Preferably, they are assigned to the "busier" volume. I'm studying abstract binary trees as a basis for my approach. In any case, the actual scene partitioning would be more granular than the volumes assigned to processors, which would be determined by the amount of scene data they contain (i.e. what fits in the processor cache minus ray data). There is much research and information in this area, so while difficult, I'm sure it can be solved "good enough".

The way I'm planning on doing ray assignment is by storing an adjacency map as part of a volume's state. Thus, when a ray passes from one volume into another, the processor simply streams it out into the cache of the new volume, without worrying about the rest of the scene. The management core (e.g. the PPE) would scan the outgoing stream and redirect rays destined for a volume that is already being worked on to the appropriate processor; otherwise they go to RAM. The ray caches would be stored in a set of equally sized data blocks in a memory pool, where each block is assigned to a volume as needed. The manager maintains a list of these data blocks sorted by how full they are. When the volume a processor is working on runs out of data, the processor just queries this list and picks up a new volume and ray data set to work on. In this way, for those volumes that contain an excess of rays, the load can be shared.

Of course, these adjacency maps will have to be updated over time as the scene geometry, and thus partitioning, changes. However, this can all be determined and setup between frames without much trouble. Temporal coherency should minimise this problem in any case.
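A minimal single-threaded sketch of that bookkeeping (all the names are invented, and a real version would obviously be concurrent and DMA-driven rather than a couple of std::vectors): fixed-size ray blocks get assigned to volumes on demand, and an idle worker always pulls the fullest block so overloaded volumes get serviced first.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Each block is a fixed-size slab of ray records assigned to one volume; the
// manager keeps track of how full each block is so an idle worker can always
// grab the busiest volume next.
struct RayBlock {
    int volumeId;     // which scene volume these rays are waiting to enter
    int rayCount;     // how many rays are currently buffered in this block
    int capacity;     // fixed block size (all blocks are the same size)
};

struct Manager {
    std::vector<RayBlock> blocks;

    // Route one outgoing ray: if some block already buffers rays for that
    // volume and has room, append there; otherwise claim a fresh block.
    void routeRay(int destVolume)
    {
        for (RayBlock& b : blocks)
            if (b.volumeId == destVolume && b.rayCount < b.capacity) { ++b.rayCount; return; }
        blocks.push_back({destVolume, 1, 4096});
    }

    // An idle worker asks for work: hand back the fullest block, so volumes
    // with an excess of rays get serviced (and load-shared) first.
    RayBlock nextWork()
    {
        auto it = std::max_element(blocks.begin(), blocks.end(),
            [](const RayBlock& a, const RayBlock& b) { return a.rayCount < b.rayCount; });
        RayBlock work = *it;
        blocks.erase(it);
        return work;
    }
};

int main()
{
    Manager m;
    m.routeRay(3); m.routeRay(3); m.routeRay(7);
    RayBlock w = m.nextWork();
    std::printf("next: volume %d with %d ray(s)\n", w.volumeId, w.rayCount);   // volume 3 with 2 rays
}
```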

If you want to collaborate on this project, I'm open to the idea, btw. Unfortunately, I don't have a CELL to work with right now, but I might be able to squeeze my university for one. In any case, just getting the thing written for x86 CPUs would be a start.
 
How are you going to handle texturing? For each ray cast, not only do you need to determine the geometry hit, but you also need to read the diffuse texture, normal, and other maps associated with the surface point sampled.
 
Well, the problem is certainly reduced with this system, since each volume will be associated with a limited set of textures, although I admit it's not perfect. There are two possible approaches I'm thinking of. One, you could use ray packets like in other systems to improve the coherency of texturing and the access of other surface parameters. Another way is to extend this idea of spatial coherency, what with the volumes, to the surface level. Basically, cast all the rays and store the hit location, replacing the ray's old point of origin. Then you look up the surface it hit, and add the ray to a list of rays that have hit that surface. There are clever ways of doing this that require no significant branching. Now you have reduced the problem to a single texture set.

Of course, that may not be good enough when you are dealing with high-resolution textures and large polygons, but there are a couple of options. First, you can tessellate the surface into smaller polygons. Second, you can have some ordering heuristic that tells you where to insert a ray into the surface's list of rays. For example, one could partition the surface (assuming it's flat) into a regular grid whose density is determined by the texture resolution. Then, you can determine the index into that grid that a ray would lie on, and order based on that. However, this necessitates adding significant branching into an inner loop, which doesn't seem like a great idea to me.

Combining both ideas, one could trace the rays into the original geometry, and then add them to these hit lists based on the surface partitioning. In effect, you tessellate the original surface into "virtual" polygons, which aren't involved in ray tracing but are involved in ray/surface interactions. With 2D surfaces, it's not that unlike rasterisation, actually. For higher-order surfaces, you may need to use a 3D instead of a 2D grid, but any partitioning is better than none. It avoids branchiness, because you only need to calculate the index into the array of lists to find the list for the partition block the ray hits in. The only drawback is that you need to create an array of head pointers for each surface that uses this feature (and not all will need it), but this shouldn't consume that much memory.
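As a concrete illustration of the grid idea (again a made-up, CPU-only sketch, with std::vector lists standing in for the per-surface arrays of head pointers): the cell index is pure arithmetic on the hit position, so the binning stays branch-light, and each cell's hit list keeps the texture work for that neighbourhood together.

```cpp
#include <cstdio>
#include <vector>

// The surface is treated as a flat rectangle split into gridW x gridH cells;
// each cell keeps a list of the rays that landed in it, so texture fetches
// for one cell stay local instead of hopping all over the map.
struct SurfaceBins {
    int gridW, gridH;
    float width, height;                     // surface extent in its own (u, v) space
    std::vector<std::vector<int>> cells;     // cells[cell] -> indices of rays that hit there

    SurfaceBins(int w, int h, float sw, float sh)
        : gridW(w), gridH(h), width(sw), height(sh), cells(w * h) {}

    // The only branches here are the clamps; the cell index itself is pure
    // arithmetic on the hit position, so this fits in an inner loop.
    void addHit(int rayIndex, float u, float v)
    {
        int cx = static_cast<int>(u / width  * gridW);
        int cy = static_cast<int>(v / height * gridH);
        if (cx < 0) cx = 0; if (cx >= gridW) cx = gridW - 1;
        if (cy < 0) cy = 0; if (cy >= gridH) cy = gridH - 1;
        cells[cy * gridW + cx].push_back(rayIndex);
    }
};

int main()
{
    SurfaceBins bins(4, 4, 1.0f, 1.0f);
    bins.addHit(0, 0.10f, 0.10f);   // both hits land in the same cell,
    bins.addHit(1, 0.15f, 0.05f);   // so they can be textured together
    std::printf("cell 0 holds %zu hit(s)\n", bins.cells[0].size());
}
```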
 