D. Kirk and Prof. Slusallek discuss real-time raytracing

Interesting. I have a professor that does experiments over at LHC.

But I have to say, he'd totally disagree with you on those projects being useless. Typically, what we actually get out of these high-powered experiments are side effects that become very useful. For example, it is high-energy physics experiments that have primarily driven the advancement of superconducting magnets, which are now widely used in medical imaging.

And our government also gets quite a lot out of funding such expensive projects. Specifically, it gets a lot of highly trained scientists who are used to solving very challenging problems. It can then offer them a big paycheck for a job on the "other side of the fence," working on military projects such as finding ways to ensure that our nuclear arsenal will still work without actually blowing bombs up.
 
Chalnoth said:
But I have to say, he'd totally disagree with you on those projects being useless. Typically, what we actually get out of these high-powered experiments are side effects that become very useful. For example, it is high-energy physics experiments that have primarily driven the advancement of superconducting magnets, which are now widely used in medical imaging.
You got me wrong! I'm not against research at the LHC :) In fact I believe projects like the LHC are a big opportunity to improve our knowledge of the universe and to improve our technology too.
But let me say the LHC is a HUGE project, and believe me, there's a lot of money spent on research projects that are completely fucked up from the start.
I worked on the ALICE experiment..and we had simulations of certain events that should (eheh..we don't know for sure!) happen in one rad-hard detector. A research project was started to compress the data collected by that detector..they even designed and fabbed a chip to compress that data. Too bad they knew from the start that the compression algorithm they selected didn't fit very well with the data they expected, and that the custom-designed hw would have been destroyed in performance by a common FPGA (even more so in the future, given the LHC is a project that should last 20+ years..), but they didn't stop after all.
They spent millions..a complete waste of money. Money that could have been spent on better research projects.

ciao,
Marco
 
nAo said:
davepermen said:
show me any gpu that beats the performance of saarcor at raytracing...
Don't be blind, look to the future.
Moreover, there's already custom hw designed to perform raytracing, and it isn't going to revolutionize realtime 3D rendering anytime soon..not in this universe

yeah, of course it's totally useless to have prototype hw that runs at ~100x lower raw speed (16x fewer pipelines, 5.5x lower clockrate) and can still do better at raytracing than any gpu out now.

this is a proof of concept. it proves that, if correctly designed, hw can perform very well at raytracing. we don't need a gf6 clocked at 3ghz with 64 pipelines one day to get simple raytracing. no, we can step back, look again, and design from scratch. the resulting new design is what's presented and called saarcor. a weak piece of hw (slow, only one pipe.. i mean, imagine a gf6 running like that :D), but the DESIGN of the hw is awesome, which is why it delivers outstanding performance compared to what it needs.

all this research is available to anyone interested, including ati, nvidia, intel, amd, etc. with the demonstration at siggraph 04 they will show that the hw is doable, and even though they don't have much raw power themselves, the hw performs really well. that can possibly be enough to motivate the big companies.

yeah, there will probably never be an ATI SaarCor chip, nor an NVIDIA SaarCor, or anything like that (i'd like to see an 8x opteron board with.. 4 opterons and 4 "saarcorons" :D oh, and of course lots of other cards in the pcie slots too :D), but it gives a full solution, and the vendors can then estimate how much of it they could fit into their hw at low additional cost.

the design of dx10 can, with a bit of work, get mapped onto the saarcor design. if nvidia/ati take care, they can get opengl2.0, dx10, and openrt1.0 all onto one chip, all performing fast, with minimal cost.

but they do need some motivation. a 90mhz chip that outperforms a gf6 at raytracing, that is motivation :D (and don't even try to imagine the marketing part! RAYTRACING! REALTIME! HARDWARE! boah.. they'd be able to push it to the limit, spit out zillions of demos showing ALL SORTS OF THINGS..).

all in all, every step helps. they've now done years of research, the result is positive, and they are able to show it's doable. this is an important step. a VERY important one.

let's see who takes the next steps. i can't wait.
 
davepermen said:
all in all, every step helps. they've now done years of research, the result is positive, and they are able to show it's doable. this is an important step. a VERY important one.
They needed a pencil and some sheets of paper to know it's doable. It's called computer science :). Once you know you can do it, but that it won't be profitable anytime soon (from an academic and economic point of view) and that it could never compete with other solutions (not with what could be, but with what CAN be), you'd better spend that money advancing RT in different ways, in my opinion.
let's see who takes the next steps. i can't wait.
World domination? :)
 
Re: HW raytracing

GeLeTo said:
1. You need an axis-aligned BSP of the whole scene. When something changes you have to rebuild the affected BSP nodes. This is very slow (especially when the changed geometry spans a root node) and cannot be done efficiently in hardware.
That's not fully true. While the hardware design presented is not compatible with arbitrarily changing geometry, it can handle hierarchical animation. This is done by subdividing the scene into objects (much like GL display lists) and then instantiating the objects. A separate BSP is built for each object, and then a top-level BSP is built over the bounding boxes of all objects. As this top-level BSP can easily be rebuilt per frame for reasonable numbers of objects (say a few thousand), you can do animations with it. Together with keyframe animation, you still can't do everything rasterisation can, but enough to implement, say, Quake.
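In code the two-level scheme looks roughly like this (a minimal C++ sketch; the type and function names are mine, not the SaarCOR/OpenRT interface):

Code:
// Minimal sketch of the two-level scheme described above. Type and function
// names are illustrative only, not the SaarCOR/OpenRT API.
#include <vector>

struct AABB { float mn[3], mx[3]; };

// Per-object BSP: built once at load time, never rebuilt for animation.
struct ObjectBSP { /* static tree over the object's triangles */ };

struct Instance {
    const ObjectBSP* bsp;          // shared, immutable geometry
    float            xform[16];    // animated (hierarchical) transform
    AABB             worldBounds;  // object bounds after the transform
};

// Top-level tree over instance bounding boxes. With only a few thousand
// instances this is cheap enough to rebuild every frame.
struct TopLevelBSP {
    std::vector<int> order;        // placeholder for real tree nodes
};

TopLevelBSP rebuildTopLevel(const std::vector<Instance>& instances) {
    TopLevelBSP top;
    for (int i = 0; i < (int)instances.size(); ++i)
        top.order.push_back(i);    // a real build would split on the bounds
    return top;
}

void traceFrame(std::vector<Instance>& instances) {
    TopLevelBSP top = rebuildTopLevel(instances);   // the only per-frame build
    // Traversal: walk 'top' to find candidate instances, transform the ray
    // into object space, then descend into that instance's per-object BSP.
    (void)top;
}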

GeLeTo said:
3. I don't see how this architecture can be less complex than the classic hardware implementation - they replace the very simple triangle rasterization+depth compare units with a bunch of parallel raytracing units that perform BSP traversal and triangle intersections.
Ok, that's a "why" question, which is usually a little difficult to answer. I'll try anyway. In computer graphics there is a term called output-sensitive. It means that you only calculate what you see. Rasterisation is not output-sensitive, except in a limited, Z-buffer-related way. To improve this, you build a scenegraph layer into your application. This leads to the situation that your visibility information is split: into a high-level part, like a BSP, that lives in the application, and a per-pixel part that's on your graphics card. So far no problem. But now you want to do effects. Due to the non-recursive pipeline approach of rasterisation, you have to apply multi-pass rendering (if your scene needs a reflection, you have to render the reflection map in advance). THE problem is that the control flow for this multi-pass rendering comes from your application, which only has limited visibility information. In other words: at the point in time when you have to decide whether you want to calculate a reflection map, and at what resolution, you don't know yet if it's going to be visible in the end. This results in huge overhead and very high memory bandwidth. Furthermore, the design can't be parallelized well in most stages of the pipeline (shading is an exception, though), so high throughput is a must. Both together lead to challenging chip design.
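To make that control-flow problem concrete, here is a hypothetical sketch of the app-driven multi-pass pattern (the function names are placeholders, not any real API):

Code:
// The application has to commit to the reflection pass (and pick its
// resolution) before it can know whether the mirror is even visible.
struct Framebuffer { int width, height; };

Framebuffer renderReflectionMap(int res) { return {res, res}; }  // stub: pass 1
void renderMainView(const Framebuffer&) {}                       // stub: pass 2

void renderFrame() {
    // Decided app-side with only coarse, scenegraph-level visibility info:
    // the mirror may end up occluded or cover three pixels, but the full
    // reflection pass (and its bandwidth) has already been paid for.
    Framebuffer reflection = renderReflectionMap(/*res=*/1024);
    renderMainView(reflection);
}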

In raytracing hardware, the control flow lies fully on the card, which is the right way to do it, because only there is all the visibility information available. The drawback is that you don't get a real pipeline but more of a loop, where a shader can generate new rays and then throw them back into the tracing core.
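In pseudo-C++ that loop looks roughly like this (all names invented purely for illustration; the real hardware obviously doesn't run C++):

Code:
// A shader can spawn new rays that go straight back into the traversal /
// intersection core, so all visibility decisions stay on the card.
#include <queue>

struct Ray   { float org[3], dir[3]; };
struct Hit   { int triangle; float t; bool valid; };
struct Color { float r, g, b; };

Hit   traverseAndIntersect(const Ray&) { return {0, 0.0f, false}; }          // stub tracing core
Color shade(const Ray&, const Hit&, std::queue<Ray>&) { return {0, 0, 0}; }  // stub shader; a real one
                                                                             // may push reflection/shadow rays
Color tracePixel(const Ray& primary) {
    std::queue<Ray> rays;
    rays.push(primary);
    Color result{0, 0, 0};
    while (!rays.empty()) {                      // the feedback loop
        Ray r = rays.front(); rays.pop();
        Hit h = traverseAndIntersect(r);
        if (!h.valid) continue;
        Color c = shade(r, h, rays);             // shading can generate new rays
        result.r += c.r; result.g += c.g; result.b += c.b;
    }
    return result;
}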

But you gain the fact that everything is parallelizable at will. Some raytracing pioneer (I don't remember his name) once said: it is embarrassingly parallel. So you don't even try to build a fast intersector or a fast shading unit; you keep it simple, simple and simple. Now you have a slow intersector and a slow shading unit, but they are tiny. And then you pack _lots_ of them onto a chip. This isn't a solution for general-purpose hardware, as you usually have lots of dependencies (so most programs don't run twice as fast on twice as many CPUs), but raytracing does, up to the point where you have one CPU per pixel on the screen. That gives a very simple and highly efficient chip: lots of identical, small and simple units. If you get twice the area on the die, you just pack on twice as many units. That's scalability :)
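As a toy illustration of why this parallelises so well (hypothetical names; trace() stands in for one slow-but-tiny pipeline):

Code:
// Every pixel's ray is independent, so N units give close to N-fold speedup
// until you run out of pixels.
#include <cstddef>
#include <thread>
#include <vector>

struct Color { float r, g, b; };
Color trace(int /*x*/, int /*y*/) { return {0.2f, 0.2f, 0.2f}; }  // stub: one pixel's ray

void renderParallel(int width, int height, int numUnits, std::vector<Color>& out) {
    out.resize(static_cast<std::size_t>(width) * height);
    std::vector<std::thread> units;
    for (int u = 0; u < numUnits; ++u)
        units.emplace_back([&, u] {
            // Each "unit" takes every numUnits-th pixel: no communication,
            // no dependencies, no writes to shared elements.
            for (int p = u; p < width * height; p += numUnits)
                out[p] = trace(p % width, p / width);
        });
    for (auto& t : units) t.join();
}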

GeLeTo said:
All other disadvantages of standard raytracing still apply - lots of context switching, needs the whole scene in memory at once, etc.
Here you are right. But on the other hand, raytracing hardware needs far less bandwidth than rasterisation hardware. So you have to spend a few gigs (1-2?) of RAM on your graphics card, but it's not as bad as it sounds, because you can take the cheapest RAM available instead of heading for SRAMs and 256-bit buses as nvidia and ati do.
 
Re: HW raytracing

morfiel said:
That's not fully true. While the hardware design presented is not compatible with arbitrarily changing geometry, it can handle hierarchical animation ... A separate BSP is built for each object, and then a top-level BSP is built over the bounding boxes of all objects...

That sounds fair. Still won't handle arbitrary deformations.

morfiel said:
Rasterisation... Furthermore, the design can't be parallelized well in most stages of the pipeline (shading is an exception, though), so high throughput is a must. Both together lead to challenging chip design.

In RayTracing... you gain the fact that everything is parallelizable at will. Some raytracing pioneer (I don't remember his name) once said: it is embarrassingly parallel. So you don't even try to build a fast intersector or a fast shading unit; you keep it simple, simple and simple. Now you have a slow intersector and a slow shading unit, but they are tiny. And then you pack _lots_ of them onto a chip.

But this parallelism has its Achilles heel - all these _lots_ of tiny units have to access the BSP/triangle data. And to keep them parallel, they have to access the SAME data, otherwise the memory becomes the bottleneck. For Quake-style geometry this is easy - lots of big polygons get intersected at the same time by lots of adjacent rays, and many of these rays pass through the same BSP sectors.

But not so with more complex scenes. Small triangles (each intersected by only a small number of rays) and big areas (where even coherent rays, once far from the viewpoint, will span different BSP sectors) will cause each of the many small units to request a different set of data, which will saturate the bandwidth and kill the performance.
 
Re: HW raytracing

GeLeTo said:
But not so with more complex scenes. Small triangles (each intersected by only a small number of rays) and big areas (where even coherent rays, once far from the viewpoint, will span different BSP sectors) will cause each of the many small units to request a different set of data, which will saturate the bandwidth and kill the performance.
They should group rays..and process them together.
What about non-polygonal primitives? We don't want to process 1+ billion triangle scenes..they should have a tessellated primitives cache.
 
Re: HW raytracing

nAo said:
They should group rays..and process them together.
That's what I am saying. But they can't group them effectively for complex scenes - read my post.
 
Re: HW raytracing

GeLeTo said:
That's what I am saying. But they can't group them effectively for complex scenes - read my post.
I read your post, but I don't fully agree with your remarks on complex scenes. Complexity measured just as the number of triangles in the scene doesn't by itself work against grouping rays. Even a single quad can disrupt ray coherency with the right BRDF :)
 
Re: HW raytracing

GeLeTo said:
But not so with more complex scenes. Small triangles (each intersected by only a small number of rays) and big areas (where even coherent rays, once far from the viewpoint, will span different BSP sectors) will cause each of the many small units to request a different set of data, which will saturate the bandwidth and kill the performance.

Subdivide your scene into large tiles and replace each with a lower-resolution tile if it is far away. Objects can be replaced in the distance too :rolleyes:
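Just as a sketch of that idea (the distances, tile structure and names here are arbitrary):

Code:
// Pick a coarser pre-built representation of a tile the further it is from
// the camera, so far-away geometry stops being a sea of sub-pixel triangles.
#include <cmath>

struct ObjectBSP;                       // per-tile geometry at some detail level

struct Tile {
    float            center[3];
    const ObjectBSP* lod[3];            // 0 = full detail, 2 = coarsest
};

const ObjectBSP* selectLOD(const Tile& t, const float cam[3]) {
    float dx = t.center[0] - cam[0], dy = t.center[1] - cam[1], dz = t.center[2] - cam[2];
    float d  = std::sqrt(dx * dx + dy * dy + dz * dz);
    if (d < 100.0f) return t.lod[0];    // near: full-resolution tile
    if (d < 400.0f) return t.lod[1];
    return t.lod[2];                    // far: coarse replacement
}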
 
for the complex scene claims: first, the billion-triangle-scene benches show it can scale very well.

second, have you seen their implementation of displacement mapping? exactly: subdivide to a texel-wide grid and render the small triangles displaced.

there are several situations where they have tons of small triangles, and it always performs well. that should be enough proof it's not an issue.
 
nAo said:
They needed a pencil and some sheets of paper to know it's doable. It's called computer science :). Once you know you can do it, but that it won't be profitable anytime soon (from an academic and economic point of view) and that it could never compete with other solutions (not with what could be, but with what CAN be), you'd better spend that money advancing RT in different ways, in my opinion.

it IS competing with other solutions. till now, it is even the BEST solution. there is NO proof that gpus of today or tomorrow are able to beat a saarcor chip. on the other hand, the saarcor chip can get tuned by chip developers if they want. saarcor shows real proof it's implementable, runnable, and fast (with cheap, small, non-advanced hw). if that's not a competition that at least psychologically hurts nvidia and ati..

yes, saarcor itself will never sell. but the technology in it, that can get sold, and reused in other hw.

how can you state saarcor cannot compete? for that you'd need one thing: another acceptable solution. show me one.
 
Re: HW raytracing

nAo said:
GeLeTo said:
That's what I am saying. But they can't group them effectively for complex scenes - read my post.
I read your post, but I don't fully agree with your remarks on complex scenes. Complexity measured just as the number of triangles in the scene doesn't by itself work against grouping rays...

I'll try to explain.
In a traditional rasteriser, to determine the visibility of a single pixel you only have to test against the value of that pixel in the Z-buffer. With reasonable overdraw you need only a few Z-buffer reads per pixel in the frame to determine visibility.

In a raytracer you have to test against the BSP first and then against the triangles referenced from the BSP leaf nodes that the first test has reached. Obviously these tests require a lot of reads from memory - many more per pixel than a traditional rasterizer. To overcome this, multiple rays are grouped and processed concurrently for as long as they require the same data - traverse the same BSP nodes and hit the same triangles.

What I am arguing is that this becomes impossible if the triangles and BSP sectors become too small and too many. You simply cannot group rays if they don't pass through the same BSP sectors and hit the same triangles.

Maybe an on-chip cache of recently fetched BSP nodes and triangles would help, but even that will not be effective enough.
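To illustrate the point, here is a simplified sketch of grouped (packet) traversal - the structures are made up, but it shows that the single node fetch is only amortised while the rays in the group agree on where to go:

Code:
// A packet of rays walks the BSP together, so one node fetch is shared by
// all rays in the packet. When the rays stop agreeing on which child to
// enter, both subtrees get visited and the sharing evaporates.
#include <vector>

struct Ray     { float org[3], dir[3]; };
struct BSPNode { int axis; float split; int child[2]; bool leaf; };

// Stub: which child does this ray want to visit? (A real test uses the ray
// origin/direction against the split plane.)
int preferredChild(const Ray&, const BSPNode&) { return 0; }

void traversePacket(const std::vector<BSPNode>& nodes,
                    const std::vector<Ray>& packet, int nodeIndex)
{
    const BSPNode& node = nodes[nodeIndex];   // ONE memory fetch for N rays
    if (node.leaf) { /* intersect the whole packet against the leaf's triangles */ return; }

    int wantChild0 = 0;
    for (const Ray& r : packet)
        if (preferredChild(r, node) == 0) ++wantChild0;

    // Coherent case: all rays agree, one subtree, the fetch is fully shared.
    // Incoherent case (tiny triangles, distant/diverging rays): the packet
    // effectively splits, both subtrees are visited, and per-ray bandwidth
    // approaches that of tracing every ray on its own.
    if (wantChild0 > 0)                  traversePacket(nodes, packet, node.child[0]);
    if (wantChild0 < (int)packet.size()) traversePacket(nodes, packet, node.child[1]);
}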
 
If you are just getting primary intersections for a given scene, rasterizers will beat saarcor easily under most circumstances relevant to gaming.
 
:oops: :oops: :oops: :oops:

very, very awe-inspiring realtime raytracing. to think that this was produced on a 90 MHz, 1-pipeline chip is just...incredible.

just think if we had a 2nd or 3rd generation saarcor chip with 200+ million transistors and a 400~500 MHz core clock, with all the improvements and fixes that come with a 2nd or 3rd generation chip. the leap in performance for raytracing we might see....it just boggles the mind.

I hope that some sort of raytracing unit is built into NV70 and R700 8)
 
Re: HW raytracing

GeLeTo said:
Maybe an on-chip cache of recently fetched BSP nodes and triangles would help, but even that will not be effective enough.

Obviously, you need an on-chip cache. Experiments have shown that for practical scenes a buffer of 4 KB of BSP-node cache plus 2 KB of triangle cache plus 6 KB of matrix cache is sufficient to reach cache hit rates of >99%.

The idea is again that of parallelism. Fast memory accesses are only important for high throughput. The saarcor design just lets multiple "tasks" run on the same pipeline in a round-robin manner (I think it's currently 16 tasks). Once a task fetches a BSP node from memory (into the caches), 15 other tasks are processed while the first one is waiting for its result. As there are no dependencies between the tasks, you can easily hide the memory latency.
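A toy model of that round-robin scheme (the 16-task figure is from the design; the latency and step counts below are invented):

Code:
// One pipeline, 16 independent ray tasks served in strict round-robin order.
// While one task waits for its node fetch, the other 15 keep the pipeline busy.
#include <array>
#include <cstdio>

struct Task {
    int stepsLeft = 64;   // BSP traversal steps this ray task still needs
    int readyAt   = 0;    // cycle at which its pending node fetch returns
};

int main() {
    constexpr int kTasks = 16, kMemLatency = 15;
    std::array<Task, kTasks> tasks{};
    int cycle = 0, busy = 0, finished = 0;

    for (int next = 0; finished < kTasks; next = (next + 1) % kTasks, ++cycle) {
        Task& t = tasks[next];
        if (t.stepsLeft == 0) continue;        // this task is already done
        if (t.readyAt > cycle) continue;       // its fetch is still in flight
        ++busy;                                // pipeline does useful work
        if (--t.stepsLeft == 0) { ++finished; continue; }
        t.readyAt = cycle + kMemLatency;       // issue the next node fetch;
                                               // the other tasks run meanwhile
    }
    std::printf("utilisation: %d useful / %d total cycles\n", busy, cycle);
}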
 
SecretFire said:
Given that a GeForce 6800 has 10-20x the floating point of the Opteron system you describe, you are a poor programmer if you cannot program it to run a ray tracer at least twice as fast on a GPU as on a CPU.

Maybe it's just the German way, but these guys seem kind of hostile.

What I'd like to see is a definition of "10-20x the floating point of the Opteron"...;) I'd really be interested in how that little number is reached.

Obviously, 3d gpus are not designed to ray trace, and so of course have no "raytracing acceleration" built in...;) But that won't stop (and never has) the PR guys from blurring the lines of clear factual distinction into an indecipherable mess just to promote their products in some mindless fashion. What Kirk needs to do is to demonstrate a box without a cpu, powered exclusively by an nV40 gpu running Windows or Linux, and running something like Lightwave, and doing ray-trace rendering 10x-20x faster than the lowly Opteron. Heh...;) When Kirk can do that, I'll sit up and pay attention! Otherwise, it seems like PR gobbledegook to me.
 
WaltC said:
What I'd like to see is a definition of "10-20x the floating point of the Opteron"...;) I'd really be interested in how that little number is reached.

Obviously, 3d gpus are not designed to ray trace, and so of course have no "raytracing acceleration" built in...;) But that won't stop (and never has) the PR guys from blurring the lines of clear factual distinction into an indecipherable mess just to promote their products in some mindless fashion. What Kirk needs to do is to demonstrate a box without a cpu, powered exclusively by an nV40 gpu running Windows or Linux, and running something like Lightwave, and doing ray-trace rendering 10x-20x faster than the lowly Opteron. Heh...;) When Kirk can do that, I'll sit up and pay attention! Otherwise, it seems like PR gobbledegook to me.

Thanks for this sensible, most thought-provoking post in an up-till-now very interesting thread.

BTW, "floating point rating" is easily defined. Look it up.
It has nothing to do with the capability of running Windows/Linux or Lightwave, though.

regards, alex
 
anyways, he's right, alex.

kirk just "cried" about how much better his gpu is, no matter the facts. his gpu is faster and has more parallel pipelines, but it can't provide a raytracing solution 90x faster than saarcor (which it would have to, because the whole chip runs about 90x faster in raw terms :D).

he can't even show a raytracing solution on his hw performing EQUAL to saarcor.

a pretty poor discussion partner, that kirk. only hyping his gf6.

i do understand him. but he lacks proof.
 
if developers use some raytracing as part of their engines for certain effects, would it be possible to place a couple of small, fast tracing units in silicon to speed up the calculations (a hybrid chip)? or does it require a completely new GPU design?
 