Larrabee delayed to 2011 ?

Voxilla · Sep 24, 2009

Scali said:
The reason for this is that it is getting ever more difficult to push graphics further in a way that is visually significant. We've now reached a point where trying to go for more resolution, higher AA/AF or better framerates seems to be a dead end. For most intents and purposes, 1920x1080 with 4xAA/16xAF and 60 fps is going to be enough for many years to come. And trying to push further requires incredible leaps in performance.

I'm looking here at a Blu-ray movie called 'Earth' on my 4MP screen.
And yes, it is only 1920x1080 at 30p, and it would be interesting to estimate the ?xAA/?xAF. But when I look at it I can't help thinking we are still light years away from rendering/simulating anything close to that.

To even remotely approach these visuals we will need all the horsepower and cleverness for decades to come.
For sure new rendering / physics approaches will be needed and I would include volume rendering as one of them :smile:

Scali · Sep 24, 2009

Voxilla said:
I'm looking here at a Blu-ray movie called 'Earth' on my 4MP screen.
And yes, it is only 1920x1080 at 30p, and it would be interesting to estimate the ?xAA/?xAF. But when I look at it I can't help thinking we are still light years away from rendering/simulating anything close to that.

To even remotely approach these visuals we will need all the horsepower and cleverness for decades to come.
For sure new rendering / physics approaches will be needed and I would include volume rendering as one of them :smile:

I still don't think you get what I'm saying.
What I'm saying is basically this:
Since G80 was introduced, 'nothing' happened. Most games still have DX9-level graphics, sometimes pimped up slightly with a handful of DX10 candy. Since G80, both nVidia and AMD have only been competing with more of the same, getting to ridiculously sized chips and incredibly power-hungry and noisy video cards (and sadly the HD5870 is no exception).

What we need isn't just faster, what we need is better. And that is going to require more efficient and more flexible GPU architectures. If you just stick to current DX10 architectures, you'd need to continue throwing incredible amounts of transistors at the problem, and that's just not going to work. We need architectures that work smarter, not harder.

compres · Sep 24, 2009

Andrew Lauritzen said:
I definitely have done work in both, but this isn't a question of the ISA's themselves at this point... I claim that no one in their right mind is going to write code at the assembly level for processors like these nowadays - intrinsics and their associated compilers are good enough, if not better than hand-coding the assembly nowadays, even for the inner-most kernels.

Thus the question purely comes down to the advantages of x86 (as you state, compatibility and tools) vs. the disadvantages at the *hardware* level. The implicit assertion appears to be that it's expensive at the hardware level to support x86, but I've yet to see someone actually back that up with numbers.

Anyways this is straying a bit off-topic...

I think it is quite on topic.

I know for a fact that they are also targeting the HPC sector with Larrabee, and in these applications people sometimes optimize the assembler code, especially with a new chip and immature compiler optimizations.

Voxilla · Sep 24, 2009

Scali said:
What we need isn't just faster, what we need is better. And that is going to require more efficient and more flexible GPU architectures. If you just stick to current DX10 architectures, you'd need to continue throwing incredible amounts of transistors at the problem, and that's just not going to work. We need architectures that work smarter, not harder.

As long as the problem remains parallelizable, throwing more logic at it is a way to solve it. Introducing more flexibility, as has been going on with DX 9, 10 and 11, enables programmers to come up with smarter ways to improve visuals...

EDIT: This happens to be my 100th post, I need to celebrate this with a drink

Scali · Sep 24, 2009

Voxilla said:
As long as the problem remains parallelizable, throwing more logic at it is a way to solve it. Introducing more flexibility, as has been going on with DX 9, 10 and 11, enables programmers to come up with smarter ways to improve visuals...

I don't see DX11 as a big step in flexibility.
The only thing it really adds is tessellation... but ironically enough it is again a fixed approach, not a flexible one.
Even on DX10 hardware you can program tessellation through Compute Shaders. So what I would want to see (and which will inevitably happen at some point) is that such programmability is efficient enough to not need these fixed approaches. Much like how pixel shaders made fixed-function shading obsolete.
Eventually I'd want the entire pipeline to be implemented through Compute Shaders... but the step from DX10 to DX11 just seems like a very minor one in terms of flexibility.

Nick · Sep 24, 2009

silent_guy said:
According to this page, the two spheres thingy is a procedural geometry demo. I don't know exactly what that means, but I suppose it's more interesting than just rendering two spheres...

For a sphere, procedural basically means intersecting a ray with x² + y² + z² = r². It's just a buzzword that is used to indicate that the scene has non-polygonal objects. Spheres have always been the simplest objects to ray-trace.
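
To make that concrete, here is a minimal sketch of such an intersection test. It assumes the sphere is centred at the origin (as in the equation above) and a non-zero ray direction; the function name and the choice of Python are mine, purely for illustration, and have nothing to do with how the actual demo is written.

```python
import math

def intersect_sphere(origin, direction, radius):
    """Nearest positive t where origin + t*direction hits
    x^2 + y^2 + z^2 = r^2, or None if the ray misses."""
    ox, oy, oz = origin
    dx, dy, dz = direction
    # Substituting the ray into the sphere equation gives a*t^2 + b*t + c = 0.
    a = dx * dx + dy * dy + dz * dz
    b = 2.0 * (ox * dx + oy * dy + oz * dz)
    c = ox * ox + oy * oy + oz * oz - radius * radius
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return None  # no real roots: the ray misses the sphere
    root = math.sqrt(disc)
    # Try the nearer root first; skip hits behind the ray origin.
    for t in ((-b - root) / (2.0 * a), (-b + root) / (2.0 * a)):
        if t > 0.0:
            return t
    return None
```

A ray starting at (0, 0, -5) and looking down +z hits a unit sphere at t = 4; there is no mesh and no tessellation involved, which is all "procedural" means here.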

Why this urge to stretch every minor fact into an argument for your dogmas? There's really no need for that and it makes a smart person sound silly.

It's not an urge, it's just an observation. You can't support efficient hybrid rendering without supporting efficient raytracing. Raytracing is always going to be computationally expensive, so if your scene is 90% rasterized and 10% raytraced you don't want 10% of the time going to rasterization and 90% to raytracing.
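
A quick back-of-the-envelope sketch of that split, under the simplifying assumption (mine) that frame time is just pixel share times a per-pixel cost, with rasterization costing 1 unit and raytracing costing `cost_ratio` units. The function name is hypothetical.

```python
def time_shares(raster_fraction, cost_ratio):
    """Return (raster share, raytrace share) of total frame time.

    raster_fraction: fraction of the scene that is rasterized (0..1)
    cost_ratio: assumed cost of a raytraced pixel relative to a rasterized one
    """
    traced_fraction = 1.0 - raster_fraction
    raster_time = raster_fraction * 1.0
    traced_time = traced_fraction * cost_ratio
    total = raster_time + traced_time
    return raster_time / total, traced_time / total
```

With `time_shares(0.9, 81.0)` the split comes out at roughly 10% rasterization and 90% raytracing, so under this toy model the lopsided case corresponds to raytraced pixels being about 81 times as expensive as rasterized ones.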

Unless you see a way to have efficient hybrid rendering with today's GPU architectures I really don't see why it would sound silly to suggest that GT300 might be a leap forward toward supporting efficient ray-tracing?

Voxilla · Sep 24, 2009

Scali said:
Eventually I'd want the entire pipeline to be implemented through Compute Shaders... but the step from DX10 to DX11 just seems like a very minor one in terms of flexibility.

You could probably already try to implement an entire pipeline in software with compute shaders. Sort polygons in bins, rasterize them...
It could work, and might result in Larrabee-like performance.
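
Roughly what I mean by binning and then rasterizing, as a CPU-side sketch only. This is plain Python rather than a compute shader; the tile size, the function names and the bounding-box binning are my own assumptions for illustration, not how any real driver or Larrabee's renderer does it.

```python
def bin_triangles(triangles, width, height, tile=16):
    """Map each screen tile (tx, ty) to the indices of triangles
    whose 2D bounding box touches that tile."""
    bins = {}
    for index, tri in enumerate(triangles):
        xs = [p[0] for p in tri]
        ys = [p[1] for p in tri]
        # Clamp the bounding box to the screen, then convert to tile coordinates.
        x0 = max(int(min(xs)), 0) // tile
        x1 = min(int(max(xs)), width - 1) // tile
        y0 = max(int(min(ys)), 0) // tile
        y1 = min(int(max(ys)), height - 1) // tile
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                bins.setdefault((tx, ty), []).append(index)
    return bins

def edge(a, b, p):
    """Signed area test: which side of edge a->b the point p lies on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def rasterize_tile(tri, tx, ty, width, height, tile=16):
    """Pixels of tile (tx, ty) whose centres are covered by tri (either winding)."""
    a, b, c = tri
    covered = []
    for y in range(ty * tile, min((ty + 1) * tile, height)):
        for x in range(tx * tile, min((tx + 1) * tile, width)):
            p = (x + 0.5, y + 0.5)
            w0, w1, w2 = edge(a, b, p), edge(b, c, p), edge(c, a, p)
            if (w0 >= 0 and w1 >= 0 and w2 >= 0) or (w0 <= 0 and w1 <= 0 and w2 <= 0):
                covered.append((x, y))
    return covered
```

Each tile can then be processed independently, which is where the parallelism would come from. Note that bounding-box binning is conservative: a tile can end up with a triangle in its bin that covers none of its pixels.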

If you want to do raytracing, this should be possible too with compute shaders...

Scali · Sep 24, 2009

Voxilla said:
You could probably already try to implement an entire pipeline in software with compute shaders. Sort polygons in bins, rasterize them...
It could work, and might result in Larrabee-like performance.

If you want to do raytracing, this should be possible too with compute shaders...

I think you already gave the answer (marked in bold above). Things are 'possible' with compute shaders today, they just aren't all that efficient. That's the point... What I want is the ability to implement the entire pipeline AND get good performance. And for that, we need more flexibility. Larrabee is going the right way in trading fixed-function hardware for more programmability and flexibility. It remains to be seen whether at this point you can get competitive performance out of this approach.
I suppose Intel wants to take the entire leap at once, while I was hoping for small steps at a time. Fixed tessellation in the pipeline just doesn't seem to be a step in the right direction.
Unified shaders were a nice step in the right direction, and from what I understood, AMD now performs texture filtering by reusing the ALU interpolators, that would also be a nice step. That's the sort of thing I want to see. Make smarter use of your hardware, don't just throw more hardware at it. There is still a lot to be gained by just re-using existing resources more efficiently.

Voxilla · Sep 24, 2009

Scali said:
Fixed tessellation in the pipeline just doesn't seem to be a step in the right direction.
Unified shaders were a nice step in the right direction, and from what I understood, AMD now performs texture filtering by reusing the ALU interpolators, that would also be a nice step. That's the sort of thing I want to see. Make smarter use of your hardware, don't just throw more hardware at it. There is still a lot to be gained by just re-using existing resources more efficiently.

The DX11 tessellation is not all that fixed-function if you delve into it. You probably know all of this already, but here is a recap anyway.
The only thing it does is produce a triangle mesh, and by that I mean just the mesh connectivity.
It doesn't even generate the vertices, as you need to compute them yourself in a second vertex shader called the domain shader.
Basically this shader is fed the parametric spline control points and the uv coordinates per 'to be calculated vertex'. You then have to write a shader to evaluate the parametric spline at the given uv, including calculating normals.
Generally this is a very inefficient way to calculate all tessellated vertices. It could have been done much more efficiently with fixed-function hardware...
The upside of the DX11 approach is that it needs very few transistors to implement. The downside is that it puts an extreme load on the shaders, and I'm wondering if they can calculate vertices fast enough to saturate the triangle rasterizer.
I have previously implemented a similar approach in DX8 for rendering the Utah teapot from a Bezier patch description entirely on the GPU, and I can only reach 200 M triangles a second on a GTX280.
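
To give an idea of the per-vertex work involved, here is a sketch of what such an evaluation amounts to for a bicubic Bezier patch: position plus normal at one (u, v). It is written in Python rather than HLSL, and the names and the 4x4 control grid layout are assumptions of mine for illustration, not the actual shader.

```python
import math

def bernstein3(t):
    """Cubic Bernstein basis values and their derivatives at t."""
    s = 1.0 - t
    basis = (s * s * s, 3.0 * s * s * t, 3.0 * s * t * t, t * t * t)
    deriv = (-3.0 * s * s, 3.0 * s * s - 6.0 * s * t,
             6.0 * s * t - 3.0 * t * t, 3.0 * t * t)
    return basis, deriv

def evaluate_patch(control, u, v):
    """control is a 4x4 grid of (x, y, z) points; returns (position, unit normal)."""
    bu, dbu = bernstein3(u)
    bv, dbv = bernstein3(v)
    pos = [0.0, 0.0, 0.0]
    du = [0.0, 0.0, 0.0]  # partial derivative with respect to u
    dv = [0.0, 0.0, 0.0]  # partial derivative with respect to v
    for i in range(4):
        for j in range(4):
            p = control[i][j]
            for k in range(3):
                pos[k] += bu[i] * bv[j] * p[k]
                du[k] += dbu[i] * bv[j] * p[k]
                dv[k] += bu[i] * dbv[j] * p[k]
    # The normal is the cross product of the two tangents.
    n = [du[1] * dv[2] - du[2] * dv[1],
         du[2] * dv[0] - du[0] * dv[2],
         du[0] * dv[1] - du[1] * dv[0]]
    length = math.sqrt(sum(c * c for c in n))
    if length > 0.0:
        n = [c / length for c in n]  # left as zero on degenerate patches
    return tuple(pos), tuple(n)
```

That is 16 control points and three weighted sums per generated vertex, repeated for every vertex the tessellator asks for, which is the shader load I am worried about.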

BTW texture filtering is still done in the texture units on the RV870, only texture coordinate interpolation is done by the shaders, see other thread.

rpg.314 · Sep 24, 2009

Nick said:
Unless you see a way to have efficient hybrid rendering with today's GPU architectures I really don't see why it would sound silly to suggest that GT300 might be a leap forward toward supporting efficient ray-tracing?

Just slap a ray sorting ip blob on the chip and let it deal with all the crap. Caustic would like that.

If only ~10% of your scene is raytraced, it may even make more sense to have that than to make an overall more flexible chip.

But yeah, GPU's need to become more flexible without giving up their present advantages of perf/mm2 perf/W. And those OptiX demos suck waaaay more than the lrb demo.

Anteru · Sep 24, 2009

rpg.314 said:
But yeah, GPU's need to become more flexible without giving up their present advantages of perf/mm2 perf/W. And those OptiX demos suck waaaay more than the lrb demo.

You didn't see the path-tracer rendering the Veyron from this year's SIGGRAPH, did you? This looked more impressive than the LRB demo IMAO, especially as it was running on current-gen hw (and I'm pretty sure it'll run much better on next-gen HW from nVidia).

rpg.314 · Sep 24, 2009

No I did not, could you post a link here?

Scali · Sep 24, 2009

rpg.314 said:
No I did not, could you post a link here?

I posted a link in this thread some time ago.

rpg.314 · Sep 24, 2009

You mean the one where a green car runs around in an urban environment, running on a bunch of Quadros? I have seen that one.

Scali · Sep 24, 2009

rpg.314 said:
You mean the one where a green car runs around in an urban environment, running on a bunch of Quadros? I have seen that one.

Well, that car was supposed to be a Bugatti Veyron (you can see the Bugatti logo in the pictures), it was rendered using some kind of interactive raytracer, and those pictures were taken at SIGGRAPH. So this is what Anteru was referring to.

rpg.314 · Sep 24, 2009

The demo of fractal intersection with a sphere running around was cool too. There was a lot of geometry changing below it, but they must have used some hacks to render it in real time.

pcchen · Sep 24, 2009

I think an interesting question here is, "is ray-tracing going to be more efficient than scanline rendering in the long run?"

At first glance, the answer seems obvious. Of course scanline rendering is faster. Even the movie industry agrees with this (much 3D CG is still rendered with scanline rendering; only those pixels requiring ray-tracing, such as glass or metal, are rendered with ray-tracing).

However, if we look beyond current hardware, it's getting less obvious. Currently, computation power (the raw GFLOPS number) is growing faster than memory bandwidth, which in turn is improving faster than memory latency. The disparity is getting very large very quickly. On the other hand, the requirement for image resolution is not growing as fast (i.e. a 30000x20000 image is not necessarily better than current 1920x1080 images).

As a result, triangles are getting smaller, because people want better 3D models with those nice tessellations, while resolution does not need to increase as much. This somehow makes scanline rendering less efficient, and growing memory latency certainly doesn't help here.

So... is ray-tracing going to be the best and the most efficient way to render a 3D scene? I don't know. But it certainly doesn't hurt to try.

Scali · Sep 24, 2009

I think raytracing still has some pretty obvious shortcomings that haven't been properly tackled yet.
All this "realtime raytracing" stuff is based around being able to make use of statically precomputed acceleration structures like kD-trees to get logarithmic ray-triangle test behaviour.
This all falls apart the moment you start skinning or otherwise animating your triangle meshes.
With offline rendering you can get away with it, since the rendering time is so long, that spending a few seconds on calculating the acceleration structures for an individual frame is acceptable. This however doesn't hold for realtime purposes.

Other obvious problems involve filtering. Generally raytracers just throw more rays at the problem to generate more samples when filtering. That is far less efficient than rasterizers, which make use of neighbouring pixels to apply smart filtering through local derivatives and such.
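
As an illustration of the 'local derivatives' point, here is a sketch of the usual textbook way a rasterizer can pick a mip level from a 2x2 pixel quad using finite differences. The function name, the quad layout and the square-texture assumption are mine; real hardware differs in the details.

```python
import math

def mip_level(quad_uv, texture_size):
    """Isotropic mip level estimate for a 2x2 pixel quad.

    quad_uv: [[top-left, top-right], [bottom-left, bottom-right]],
             each a (u, v) texture coordinate in [0, 1]
    texture_size: width and height of the (assumed square) texture in texels
    """
    (u00, v00), (u10, v10) = quad_uv[0]
    (u01, v01), _ = quad_uv[1]
    # Differences between neighbouring pixels approximate the derivatives,
    # scaled to texel units.
    dudx = (u10 - u00) * texture_size
    dvdx = (v10 - v00) * texture_size
    dudy = (u01 - u00) * texture_size
    dvdy = (v01 - v00) * texture_size
    # Texels covered per pixel step along the more stretched screen axis.
    rho = max(math.hypot(dudx, dvdx), math.hypot(dudy, dvdy))
    if rho <= 0.0:
        return 0.0
    return max(0.0, math.log2(rho))  # clamp magnification to the base level
```

The derivatives come almost for free because the neighbouring pixels are being shaded anyway. A raytracer shooting independent rays has no such neighbours unless it adds something like ray differentials or simply more samples.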

I think for those reasons, rasterizing/REYES will be a more interesting tool than raytracing for a long time to come. Raytracing is interesting mainly as a tool to locally solve indirect lighting conditions, but applied only where it matters.
I wonder if this is ever going to change at all.

3dilettante · Sep 24, 2009

Andrew Lauritzen said:
I see this comment thrown around a lot and superficially it seems to hold water... but in reality, I wonder whether people have the facts/numbers to actually back this up (particularly from the hardware point of view), or whether people are just making a lot of assumptions. I tend to give hardware designers the benefit of the doubt with respect to making good decisions, but hey I'm no hardware expert and maybe the people making these comments are. Still, if that's the case, I'd be interested in seeing the facts/logic backing up the assertion rather than more vacuous statements.

Charlie of Semiaccurate fame claims he spoke with an Atom engineer who stated that core was 15-20% larger because it was x86.

The relatively contemporaneous Intel P5 core and the Alpha EV4 showed a 3.1M to 1.68M transistor count disparity.
There are some notable design differences, but also some overlap in overall specification.
There will probably never be a completely apples to apples comparison because manufacturers have different design targets and different circumstances.

It should be noted that Core2 was a notable x86 milestone in that it went 4-wide.
RISC chips that wide had been around since the mid-1990s.

The decoder block in Nehalem is one of the largest partitions in the core.
AMD's K8 has a predecode block of 16 parallel predecoders, and that is before it gets to the actual decoders.

So long as there have been performance RISC designs with a concerted development effort, Intel desktop x86s have only ever approached parity with a process lead.

There are a range of flags and processor status registers that are arbitrarily set by various instructions, so many of those are renamed.
The load/store pipeline is more complex, the number of addressing modes is greater, and pipelines tend to have a few extra stages because of decoding.

The instruction caches for a given level of capacity are larger for performant x86 chips because they have predecode information in them. Some designers trade-off size by reducing the error-correcting capability of the L1 Icache versus the data side.

The cost is non-zero, and the multicore era is preventing the usual "bloat a core until x86 doesn't matter" process.
This is somewhat mitigated by the increasing dominance of L3 and non-core logic, as the proportions for this are relatively ISA-agnostic.

This is a piece from 2000, but there are numbers and reasons stated.
http://www.realworldtech.com/page.cfm?ArticleID=RWT021300000000&p=1

There is a comparison of an Athlon and Alpha core with a near 2x transistor disparity.

We may have another data point once the Cortex A9 chips come back in silicon form.

compres · Sep 24, 2009

First of all thank you for the detailed response.

3dilettante said:
Charlie of Semiaccurate fame claims he spoke with an Atom engineer who stated that core was 15-20% larger because it was x86.

My guess was at about 5%

The relatively contemporaneous Intel P5 core and the Alpha EV4 showed a 3.1M to 1.68M transistor count disparity.
There are some notable design differences, but also some overlap in overall specification.
There will probably never be a completely apples to apples comparison because manufacturers have different design targets and different circumstances.

Exactly my thoughts.

It should be noted that Core2 was a notable x86 milestone in that it went 4-wide.
RISC chips that wide had been around since the mid-1990s.

The decoder block in Nehalem is one of the largest partitions in the core.
AMD's K8 has a predecode block of 16 parallel predecoders, and that is before it gets to the actual decoders.

So long as there have been performance RISC designs with a concerted development effort, Intel desktop x86s have only ever approached parity with a process lead.

There are a range of flags and processor status registers that are arbitrarily set by various instructions, so many of those are renamed.
The load/store pipeline is more complex, the number of addressing modes is greater, and pipelines tend to have a few extra stages because of decoding.

The instruction caches for a given level of capacity are larger for performant x86 chips because they have predecode information in them. Some designers trade-off size by reducing the error-correcting capability of the L1 Icache versus the data side.

The cost is non-zero, and the multicore era is preventing the usual "bloat a core until x86 doesn't matter" process.
This is somewhat mitigated by the increasing dominance of L3 and non-core logic, as the proportions for this are relatively ISA-agnostic.

And would you agree that, in the case of Larrabee, using a RISC architecture would let them pack in more cores? That's why I believe this is even more relevant when talking about this chip than when comparing the Core i7 with other big RISC CPUs.

This is a piece from 2000, but there are numbers and reasons stated.
http://www.realworldtech.com/page.cfm?ArticleID=RWT021300000000&p=1

There is a comparison of an Athlon and Alpha core with a near 2x transistor disparity.

We may have another data point once the Cortex A9 chips come back in silicon form.

Nice, thanks again.
