Intel and raytracing

Techno+

I came across this:

http://www.theinquirer.net/default.aspx?article=39101

They say they are running this version of Quake on 4 x quad-core machines, i.e. 16 cores, and the GPU just pushes out pixels.

It doesn't end there: there is another Intel raytracer in the works that will run on a quad core.

Do you think GPGPU would be suitable for RT? If not, how long do you think GPUs are going to live?
 
Ah Intel... why do you continue to show off stuff that isn't particularly new and still refuse to answer simple theoretical questions about why you believe rasterization will die, why raytracing will take over, and why x86 is the best architecture for it? (Can you tell that I'm still annoyed at their refusal to answer the previously posted questions?)

And I wish the Inquirer would stop eating up everything they say...
 
Charlie should go hide in a hole, because his complete lack of understanding of the dynamics and technical factors at play is so pitiful that I'm not even sure where to start.

And I think he needs a serious reality check regarding the performance of that solution too: http://www.idfun.de/temp/q4rt/benchmarks.html - notice that performance scales linearly with resolution, and that this is at 256x256. So, yeah, that's just 30-35x the performance you'd get at 1920x1200... And that's with a laughably small number of secondary rays, I'd suspect!
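To put a rough number on that (my own back-of-the-envelope toy, nothing more - it just assumes cost scales with the number of primary rays, i.e. with resolution, as the benchmark page suggests):

Code:
# Toy pixel-count comparison, assuming raytracing cost scales linearly
# with the number of primary rays (i.e. with resolution).
demo_pixels = 256 * 256        # 65,536 pixels in the Q4RT benchmark
desktop_pixels = 1920 * 1200   # 2,304,000 pixels at a typical desktop resolution

print(desktop_pixels / demo_pixels)   # ~35.2x more rays to trace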

If you want to do raytracing, at least do things right. Like, you know, these guys have: http://graphics.cs.uni-sb.de/Publications/2006/drpuasic_rt06_final.pdf

The most important thing to notice there is probably the chip layout on Page 5. Notice how much of the die the shader core takes. Next, look at Page 7's performance numbers compared to CELL, and read the following paragraph: "A comparison to a Cell implementation of ray tracing shows up to 2.5 times higher performance, despite the hardware complexity being similar (see Table 1), and the DRPU8 ASIC performing much more complex shading (including textures). This shows the efficiency of the DRPU architecture compared to general purpose designs."

As such, excluding the shader core, the perf/mm2 and perf/watt of that chip are quite impressive to say the least. That's fairly unsurprising, given that it's fixed-function, but it does highlight how important that is. There is nothing magical about CPUs that makes them so amazingly good at raytracing, just like there is nothing magical about GPUs making them horribly bad at raytracing if the implementation is made with their architecture in mind.

Ask yourself this question: if raytracing becomes important as an addition to rasterization for certain effects or, less likely, as a replacement for it... which chip is the most likely to integrate fixed-function units for raytracing? The CPU or the GPU? Of course, that might not matter so much if the two chips merge, but right now that isn't going to happen in the high-end for the next couple of years.

Also, Techno+, these two threads might be of some interest to you:
http://forum.beyond3d.com/showthread.php?t=40372
http://forum.beyond3d.com/showthread.php?t=36792

In the end, what you really want IMO are serial-optimized, throughput-optimized and latency-tolerant processors along with special-function units, all on one chip. NVIDIA, AMD and Intel all seem to have long-term projects with at least three of these four things on the same chip, but I haven't seen any indication of anyone wanting all four at the same time. That would definitely be quite interesting, if the programming paradigm was right.

Another factor to consider is that design and mask costs seem to imply that the best strategy, going forward, is to maximize your addressable market for every chip you create. Chips optimized for a specific niche might not make sense eventually, and this is going to get even more pronounced as you get to the 32nm/22nm nodes and below. In that context, integration might become an economic negative, unless it is for an extremely large market in the first place, or unless redundancy allows you to extend your target market.

There are many potential solutions to maximize RoI (Return on Investment) there while keeping integration levels high where it matters, but I'm already off-topic enough, so I'll just keep that for another rant... ;)
 
Algorithms

Without getting too much into the math, let's just say there are some scaling advantages to using raytracing over normal raster graphics. Where you would have to increase geometry and take a non-linear scaling hit with a rasteriser, raytracing will do the same job with a much more linear increase in complexity.

I just wish Charlie had a clue about algorithms; then he wouldn't eat up crap like that. In short, unoptimized rasterization and raytracing both have an algorithmic complexity of O(width_pixels * height_pixels * triangles). While unoptimized raytracing always has this complexity, rasterization usually doesn't exhibit this performance and is more like O(width_pixels * height_pixels * overdraw), where overdraw is usually fairly small, like <10.
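For anyone wondering where those big-O figures come from, here's a toy cost model (a sketch of my own with made-up scene numbers, not a benchmark of anything real):

Code:
def naive_raytrace_cost(width, height, triangles):
    # Brute-force raytracing: every pixel fires a primary ray that is
    # tested against every triangle in the scene.
    return width * height * triangles

def rasterize_cost(width, height, overdraw):
    # Rasterization: each pixel is only shaded roughly 'overdraw' times,
    # so the per-pixel work doesn't grow with the triangle count
    # (per-triangle setup is ignored here for simplicity).
    return width * height * overdraw

w, h = 1024, 768
print(naive_raytrace_cost(w, h, 100000))   # 78,643,200,000 intersection tests
print(rasterize_cost(w, h, 5))             # 3,932,160 pixel shades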

Of course you can optimize raytracing a lot and turn it into O(n log n + width_pixels * height_pixels * log n), where n is the number of triangles. But even that is still not a clear-cut win against rasterization.
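And the same toy model with an acceleration structure bolted on (again just my own illustration of the asymptotics, ignoring all the constants, which in practice matter a lot):

Code:
import math

def bvh_raytrace_cost(width, height, n_triangles):
    # With a BVH/kd-tree, construction is roughly O(n log n) and each ray
    # traverses about log2(n) nodes instead of testing all n triangles.
    build = n_triangles * math.log2(n_triangles)
    trace = width * height * math.log2(n_triangles)
    return build + trace

print(bvh_raytrace_cost(1024, 768, 100000))   # ~1.5e7, vs ~7.9e10 brute force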

And if I missed or messed up anything, could Simon or Nick, or some of the others here with more knowledge than me, chime in and correct me?
 
And if I missed or messed up anything, could Simon or Nick, or some of the others here with more knowledge than me, chime in and correct me?
You're pretty much right, I believe, although the point that most people miss is that rasterization can also have the same geometric complexity as raytracing if perfect LOD and occlusion culling systems are used. Now of course this isn't possible in practice, but neither is a perfect raytracing acceleration structure for dynamic scenes ("pretty good" is usually the best option). That said, it's arguably easier to get closer to the ideal complexity with raytracing than with rasterization, but as was previously stated, it's not a clear win either way.

Also GPUs do a decent job at raytracing, and the Cell does even better (at least at the intersection tests - not as much at shading). There is already a commercial product that uses a full-featured GPU raytracer (by these guys - look for DeltaGen/RealTrace), so it's certainly viable.
 
I e-mailed Charlie asking him about GPGPU RT and he said that it isn't suitable since GPU memory access is all over the place, unless new algorithms are made. Is he correct?

Question: I see that GPUs and CPUs (Cell, to be precise) are both good at specific parts of RT, so would the hybrid FPU I talked about in the Fusion thread be suitable, since it combines the best of both worlds?
 
I e-mailed Charlie asking him about GPGPU RT and he said that it isn't suitable since GPU memory access is all over the place, unless new algorithms are made. Is he correct?
I think it's coherency that's the problem. GPUs are stuck using large packets of rays (tens of them).

Cell and the DRPU both have 4-ray packets.
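To make the packet idea a bit more concrete, here's a quick numpy sketch of my own (not how Cell or the DRPU actually implement it): a 4-ray packet tested against a single triangle with a vectorized Moller-Trumbore test, so the triangle data is fetched once and the arithmetic maps naturally onto 4-wide SIMD:

Code:
import numpy as np

def intersect_packet(origins, dirs, v0, v1, v2, eps=1e-8):
    # Moller-Trumbore for a whole ray packet against one triangle.
    # origins, dirs: (N, 3) arrays; v0, v1, v2: (3,) triangle vertices.
    e1, e2 = v1 - v0, v2 - v0
    p = np.cross(dirs, e2)                          # (N, 3)
    det = p @ e1                                    # (N,)
    inv_det = np.where(np.abs(det) > eps, 1.0 / det, 0.0)
    tvec = origins - v0
    u = np.einsum('ij,ij->i', tvec, p) * inv_det
    q = np.cross(tvec, e1)
    v = np.einsum('ij,ij->i', dirs, q) * inv_det
    t = (q @ e2) * inv_det
    hit = (np.abs(det) > eps) & (u >= 0) & (v >= 0) & (u + v <= 1) & (t > eps)
    return hit, t

# A 4-ray packet aimed at a triangle lying in the z = 1 plane.
origins = np.zeros((4, 3))
dirs = np.array([[0.1, 0.1, 1.0],
                 [0.3, 0.3, 1.0],
                 [2.0, 2.0, 1.0],    # this one misses
                 [0.2, 0.1, 1.0]])
v0, v1, v2 = np.array([0., 0., 1.]), np.array([1., 0., 1.]), np.array([0., 1., 1.])
print(intersect_packet(origins, dirs, v0, v1, v2))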

Jawed
 
I e-mailed Charlie asking him about GPGPU RT and he said that it isn't suitable since GPU memory access is all over the place, unless new algorithms are made. Is he correct?

I'm not sure what Charlie is talking about here, but one issue I've heard discussed is a higher miss rate on texture caches for ray tracing, which might just be another way of saying that ray tracing has slightly more random and unpredictable memory access patterns than rasterization, which of course is what GPUs are designed for.
 
While a standard acceleration structure does have poor memory performance on GPUs, it can actually be cleverly organized to take advantage of the GPU design. That said, coherence is certainly a problem on GPUs, especially for small triangles.

On the Cell it's much easier to do dynamic load balancing and *all* of the bounces for a single ray on a single SPU. It's also much easier to pull in a big chunk of the data structure to work with at once, while on GPUs the lack of local memories and block memory movements causes some inefficiency.

That said a hybrid approach seems to have some promise in the short term, with Cell-like architectures doing the intersections, and GPU-like architectures doing the shading. PS3 seems like an ideal candidate for such an interaction.

Still, it makes a lot of sense to rasterize the primary rays... this is really only relevant for secondary rays, of which there are usually many fewer, excepting contrived "hall of mirrors"-type scenes.
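To put a rough figure on "many fewer" (totally made-up numbers, purely to illustrate the ray budget of such a hybrid):

Code:
# Hybrid budget sketch: rasterize primary visibility, trace rays only for
# the pixels that actually need a secondary effect.
pixels = 1920 * 1200
reflective_fraction = 0.15   # fraction of pixels needing a reflection ray (made up)
bounces = 1

primary_rays = pixels                                        # replaced by rasterization
secondary_rays = int(pixels * reflective_fraction * bounces) # still traced
print(primary_rays, secondary_rays)   # 2,304,000 vs 345,600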
 
As I thought, it's using Saarland's OpenRT, which was already running Quake 3 maps in realtime on a cluster of 16 or 20 Pentium 4 PCs some time ago, so I'm not surprised. Now I'd be interested to know how SaarCOR (raytracing hardware on an FPGA) fares against a quad-core CPU.. and I'm also wondering if we'll see raytracing hardware integrated into CPUs (or specialized units that take care of the ray intersection tests or whatever).
That'd be an interesting Fusion-like system! (I don't give much of a shit about Fusion as I don't see what it brings compared to the CPU+IGP we have today, though it should be nice for laptops, embedded systems and low-cost/low-power PCs.) Doesn't a raytracer need more random memory access, and thus wouldn't it feel right at home integrated into a CPU?
 
That said a hybrid approach seems to have some promise in the short term, with Cell-like architectures doing the intersections, and GPU-like architectures doing the shading. PS3 seems like an ideal candidate for such an interaction.
I don't like sounding like a broken record, but... :)
http://www-csl.csres.utexas.edu/use...ics_Arch_Tutorial_Micro2004_BillMarkParts.pdf - Page 85
The notion of having both throughput and latency-optimized cores on the same chip is interesting, IMO, since it's a much more generic vision of integration.

The big question, long-term, is whether you can actually merge these two kinds of cores and maintain high architectural efficiency. I don't think anyone has really figured out a good paradigm for that yet. I predict much win and gold for whoever does, though, whether it's a researcher in his basement or one of the Big Three.

I suspect the solution lies in abandoning the concept of a traditional register file. Something much more exotic than those two rather hacky ideas might hold a lot of promise IMO... The bad news is, that's the kind of thing that requires a lot of combined effort by both hardware and software engineers because it requires creating an instruction set that makes sense for both the hardware and the compiler. And those kinds of projects tend to be doomed to failure. There are exceptions, though, so who knows...
 
http://www.cs.utexas.edu/~trips/overview.html

Unlike traditional processor architectures that operate at the granularity of a single instruction, EDGE [Explicit Data Graph Execution] ISAs support large graphs of computation mapped to a flexible hardware substrate, with instructions in each graph communicating directly with other instructions, rather than going through a shared register file.

http://www.cs.utexas.edu/~trips/prototype.html

Each of the two processor cores can execute up to 16 out-of-order operations (integer or floating point) per cycle, from a window of up to 1,024 in-flight instructions. The processor core is composed of multiple copies of five different types of tiles interconnected via microarchitectural networks. Each core may be configured in a single threaded mode or in a 4-thread multithreaded mode in which instructions from multiple threads may execute simultaneously.
Jawed
 
On the Cell it's much easier to do dynamic load balancing and *all* of the bounces for a single ray on a single SPU. It's also much easier to pull in a big chunk of the data structure to work with at once, while on GPUs the lack of local memories and block memory movements causes some inefficiency.

With regard to Cell and the SPUs, would performance likely be better if there were a shared L2 to store the data structure in, versus the local stores? I only ask because it seems like the data would be replicated quite a bit among the local stores for neighbouring rays, whereas a larger shared L2 would be able to fit substantially more of the data structure in it, not only because it's larger but because there'd be less duplication of data.
 
and I'm also wondering if we'll see raytracing hardware integrated into CPUs (or specialized units that take care of the ray intersection tests or whatever).

I'm not certain it would be the most efficient way of doing things, but one could certainly include instructions in a CPU intended to compute barycentric coordinates quickly. These instructions could be used both in a ray tracer for intersection testing and in a rasterizer for scan conversion.
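To illustrate the overlap (a little 2D sketch of my own, not a proposal for an actual instruction): the same barycentric evaluation serves both as a rasterizer's coverage test / attribute interpolation and as the inside-triangle part of a ray/triangle hit test once the hit point has been projected into the triangle's plane.

Code:
def barycentric(p, a, b, c):
    # Barycentric coordinates (u, v, w) of point p in triangle (a, b, c),
    # all given as (x, y) tuples; u + v + w == 1.
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    det = (bx - ax) * (cy - ay) - (cx - ax) * (by - ay)   # twice the signed area
    v = ((px - ax) * (cy - ay) - (cx - ax) * (py - ay)) / det
    w = ((bx - ax) * (py - ay) - (px - ax) * (by - ay)) / det
    return 1.0 - v - w, v, w

a, b, c = (0.0, 0.0), (4.0, 0.0), (0.0, 4.0)
u, v, w = barycentric((1.0, 1.0), a, b, c)
covered = u >= 0 and v >= 0 and w >= 0   # rasterizer-style inside test / hit test
depth = u * 0.5 + v * 0.7 + w * 0.9      # interpolate a per-vertex attribute
print(covered, (u, v, w), depth)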

Just a thought...
 
Ah Intel... why do you continue to show off stuff that isn't particularly new and still refuse to answer simple theoretical questions about why you believe rasterization will die, why raytracing will take over, and why x86 is the best architecture for it? (Can you tell that I'm still annoyed at their refusal to answer the previously posted questions?)

And I wish the Inquirer would stop eating up everything they say...
That's what Intel said, that rasterization would die? :O First I've heard of that.

Btw, articles from a few days ago:

https://www.intel.com/content/www/u...tracing-on-cpu-gpu-with-embree.html#gs.3k0bwd

 