NVIDIA confirms Next-Gen close to 1TFlop in 4Q07

So you think Larrabee isn't "it"? I don't disagree; I don't think it's what they're thinking of for the "traditional GPU market" either, but we'll see.
Arun,
I've seen you make the claim that Larrabee is not intended for gaming before, and (with all due respect) it's just not correct. I have it on good authority that there will be traditional gaming-oriented GPU products that come out of the Larrabee project. And sources aside, this is what seems to make the most sense.

I laid out Intel's Larrabee strategy in this article, which got a lot of positive feedback from people who actually know about Intel's Larrabee plans. I didn't see it discussed here, though, but maybe I missed that thread.
 
Larrabee

Holy forum delay, Batman! Sorry for the double post, but after a few minutes when my first post never showed up, I thought that either the forum ate it or that I did something wrong :)
 
Hey Hannibal,

First, sorry for the delay, you just activated the anti-spam system. It's really effective, just a tad overly aggressive on new posters! ;) You'll only have this problem under certain circumstances (including links, but not limited to that) for your first posts, and then it'll go away. We tend to be pretty quick to allow posts, anyway!

And I agree with you, perhaps in the future I should just link to my pieces too instead of trying to summarize, because after rereading some of the things I said on the subject I definitely summarized too much! :) What I meant (and that's what I said in this news piece) is that the *first* iteration of Larrabee will not be aimed at the gaming market at all, as far as I can see.

I tend to speak of Larrabee as the "chip", but you're probably right that it is the name of the core, and as such refers to the whole range of chips that will be manufactured based on the Larrabee architecture. And I agree that Intel is probably hoping that this includes gaming down the road; in which case, the fixed-function units aimed at the GPGPU market would presumably be replaced by traditional GPU units. I'd presume a delay of 9-12 months between the GPGPU iteration and the GPU iteration, but I could be wrong.

The big question there in my mind, however, is whether Intel wants to implement rasterization as a fixed-function process, or if they are still (blindly, imo) hoping for raytracing to become a viable alternative. We'll see. For their own sake, I hope that they are not naive enough to believe raytracing will be a viable *replacement* for rasterization in the 2010 timeframe. As an added technique for some effects, though, it could be quite interesting by then, I think.

Also, another very important factor is the perf/mm² of Larrabee for gaming tasks, if it wants to enter that market. In the GPGPU space, your margins are going to be ridiculously high anyway, so who cares if the chip costs you $100 or $300 to manufacture? Perf/watt will matter, but perf/mm² won't. In the gaming market, on the other hand, you cannot hope to compete with a less efficient architecture (unless you've got fab space to burn, but that's not a viable long-term strategy!)

If a Larrabee derivative for gaming can get within 35-40% of the perf/transistor of G100/R800 for gaming tasks, then that could get very interesting, because of Intel's fab advantage and their willingness to burn some cash to make the project successful initially. If they can't get that efficient, then it is largely irrelevant and both NVIDIA and AMD will have wasted hundreds of millions of dollars trying to compete with a threat that doesn't exist. Heh.
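
To make that concrete, here's a back-of-the-envelope sketch in C. The 65% and 1.8x figures are pure assumptions on my part, not real chip data; they're just there to show how a node advantage interacts with a perf/transistor deficit:

Code:
/* Back-of-the-envelope sketch with made-up numbers (nothing here is real
 * chip data): a process-node density advantage can offset a
 * perf/transistor deficit when comparing perf/mm^2. */
#include <stdio.h>

int main(void)
{
    /* Assumption: Larrabee gets "within ~35%" of the GPU's perf/transistor,
     * i.e. roughly 65% of it. */
    double lrb_perf_per_transistor = 0.65;   /* relative to a G100/R800-class GPU = 1.0 */

    /* Assumption: being one full node ahead buys ~1.8x transistors per mm^2. */
    double node_density_advantage = 1.8;

    double lrb_perf_per_mm2 = lrb_perf_per_transistor * node_density_advantage;

    printf("Hypothetical Larrabee perf/mm^2 relative to the GPU: %.2f\n",
           lrb_perf_per_mm2);   /* ~1.17 -> competitive despite the deficit */
    return 0;
}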

One major advantage that Larrabee has on its side is that Paul Otellini recently said that he wanted to move more products to "node n" rather than "node n+1" in the future. This is a tidbit I didn't see anyone mention, but I think it's very significant, because it does imply that discrete graphics chips would certainly always be on the cutting edge node, rather than the previous one.
 
"The cores will also have a super-wide 512-bit vector FPU that's capable of processing sixteen-element floating-point vectors (single precision), along with support for control flow instructions (loops and branches) and some scalar computations. "

How would you characterize this on an operands-per-second basis?
If that wording describes Larrabee's vector FPU, what would current GPU shader units be described as?

Does this mean the unit has 512-bit source and destination registers?

If that is the case, with 32-bit values packed in, a source and a destination register together amount to roughly 32 scalar operands as R600 would count them.
If Larrabee has 16 cores, that would be 512 fp32 operands processed per clock.

Looking at MAD throughput, an individual processor in a SIMD array can pull in 15 32-bit operands (5 ALUs, each reading 3 operands for a MAD). That processor is 1/16 of a SIMD array, and R600 has 4 such arrays:
15*16*4 = 960


That's 960 operands/clock today, versus 512 on a chip a process node or more removed in the future.

Is my math wrong, or am I misinterpreting the meaning behind the Intel FP unit's description?

Obviously, there are significant flexibility differences, but I wanted to start the debate with some numbers that are known for R600 and hopefully a clear idea of how Larrabee's terminology may map to current GPUs.
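
To keep the arithmetic straight, here's a minimal C sketch of the counting above; the 16-core figure and the 2-operand assumption are guesses on my part, not confirmed specs:

Code:
/* Minimal sketch of the counting in this post; the Larrabee numbers
 * (16 cores, 2-operand x86-style ops) are assumptions, not specs. */
#include <stdio.h>

int main(void)
{
    /* R600: a SIMD processor issues 5 MADs -> 5 * 3 = 15 operands,
     * 16 processors per SIMD array, 4 arrays per chip. */
    int r600 = 15 * 16 * 4;            /* 960 operands/clock */

    /* Larrabee as described: 512-bit registers = 16 fp32 lanes; with a
     * source and a destination register that's 2 * 16 = 32 operands per
     * core, times an assumed 16 cores. */
    int larrabee = (2 * 16) * 16;      /* 512 operands/clock */

    printf("R600:     %d operands/clock\n", r600);
    printf("Larrabee: %d operands/clock (assumed 16 cores)\n", larrabee);
    return 0;
}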
 
BTW, in your article, the quoted correspondence says this:
However, this is largely because we don't know many better ways to implement ray tracing than general-purpose hardware, not because of some intrinsic advantage of ray tracing over other methods on CPUs.
I sure as hell hope Intel engineers don't also think that's true, because this is definitely fixed-function hardware: http://graphics.cs.uni-sb.de/~woop/rpu/rpu.html

There was a more detailed paper on the RPU, and it was quite striking that the majority of the die was dedicated to shading (you know... that thing traditional GPUs do too!) and I did some basic calculations comparing the perf/mm² of the raytracing part of the chip against the performance they gave for CELL (which didn't do shading), and needless to say, it was substantially more efficient. And that's just a university project, without custom logic or dynamic gating or anything else you'd see in CELL.

That doesn't mean the ALU part of future chips won't be very important, though. It'll arguably still be the most important part. But there will also imo remain a significant place for fixed-function units, and perhaps it would be wise to consider load-balancing between fixed-function units and programmable ALUs in some cases to improve efficiency. There is a nice but quite vague NVIDIA patent by Erik Lindholm on the subject for TMUs, btw. It's not hard to imagine that being extended to other systems.
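
Just to illustrate the general load-balancing idea, here's a toy C sketch; this is my own made-up example and NOT what the Lindholm patent actually describes: route work to a fixed-function unit when one is free, fall back to the programmable ALUs otherwise.

Code:
/* Toy illustration only: load-balancing a texture request between a
 * fixed-function TMU and a programmable (shader-emulated) fallback path. */
#include <stdio.h>

#define NUM_TMUS 2

static int busy_tmus = 0;   /* how many fixed-function TMUs are occupied */

static void filter_on_tmu(int req)  { printf("request %d -> fixed-function TMU\n", req); }
static void filter_on_alus(int req) { printf("request %d -> programmable ALUs\n", req); }

static void issue_texture_request(int req)
{
    if (busy_tmus < NUM_TMUS) {
        busy_tmus++;               /* claim a fixed-function unit (never released in this toy) */
        filter_on_tmu(req);
    } else {
        filter_on_alus(req);       /* emulate the filtering in shader code instead */
    }
}

int main(void)
{
    for (int req = 0; req < 4; req++)
        issue_texture_request(req);
    return 0;
}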
 
I always wondered why Nvidia introduced a brand-new revision of the G80 core in such a low-volume part as the 8800 Ultra.
Now we know. Near-future GTS and Ultra owners could be up for a dose of added OC "pizzazz", although the same might not be said of future GTX purchasers...

It seems the A3 core revision might not make it to the middle-ground of the 8800 series.
 
What I meant (and that's what I said in this news piece) is that the *first* iteration of Larrabee will not be aimed at the gaming market at all, as far as I can see.
This is what I thought, too, but I've actually heard otherwise. I now believe that we'll see a gaming part first, and then the HPC stuff later on.

The big question there in my mind, however, is whether Intel wants to implement rasterization as a fixed-function process, or if they are still (blindly, imo) hoping for raytracing to become a viable opportunity.
I think the answer here is that they're going to do mostly raster with a bit of raytracing thrown in--the raster stuff will be done in fixed-function hardware, with the RT running on the general parts of the GPU.
 
I heard that one of the main reasons ATI agreed to the acquisition was to use AMD's on-die communications technologies.

ATI initiated the affair, not AMD.

Anyhow, as I saw it, AMD should have bought ATI way back in 2003, as anyone could see that processors were moving towards a parallel instruction/data pathway... The merger is LATE - hence all the goof-ups with R600.
 
... I'm sure they can improve efficiency a fair bit and maybe improve the ratios' balance a bit, but that won't give them a completely new architecture.

Now I have to ask: do you think that R600 is inherently an inefficient architecture, beyond the mismatched ALU-TEX ratio?


As for power consumption, is there anything in the architecture that makes it run so hot, à la Prescott? Or are we just looking at a very large chip whose circuits they didn't take much time to optimize because it was so late, and which was then fabbed on a somewhat leaky process?
 
I think the answer here is that they're going to do mostly raster with a bit of raytracing thrown in--the raster stuff will be done in fixed-function hardware, with the RT running on the general parts of the GPU.

You are kind of correct. When the AMD-ATI merger was announced, I read an article about CPU-GPU unification and the future of RT. The writer asked several experts from AMD, ATI, NVIDIA, game development studios and Intel for their thoughts on this. Anyway, one of them was Tim Sweeney, and he said that photo-realistic rendering should be done using a combination of rasterization and ray tracing. So I guess that what Tim said is going to be implemented in Intel gfx hardware, because both seem to love software graphics, so they share a vision.

What do you think is the possibility that Intel's GPU inherits some of Larrabee's design, but is based on hardware similar to NVIDIA's and ATI's (i.e. shaders, ROPs, TMUs, etc.)?

Edit: I found the link (I keep an archive full of interesting links), it was Tim Sweeney.

Link

"Ray tracing is superior for handling bounced light, reflection, and refraction. So, there are some places where you will definitely want to ray trace, and some cases where it would be a very inefficient choice. Certainly, future rendering algorithms will incorporate a mix of techniques from different areas to exploit their strengths in various cases without being universally penalized by one technique’s weakness."
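
For what it's worth, here's a toy C sketch of that kind of hybrid (nobody's actual pipeline, and the "passes" are placeholder stubs): rasterize primary visibility, then spawn rays only where rasterization struggles.

Code:
/* Toy hybrid sketch: rasterize primary visibility, then ray trace only the
 * effects rasterization handles poorly (reflection/refraction/bounces).
 * All functions are stubs standing in for the real passes. */
#include <stdbool.h>
#include <stdio.h>

typedef struct { float r, g, b; } Color;
typedef struct { bool needs_rays; Color albedo; } SurfaceHit;

/* Stub raster pass: pretend every 8th pixel lands on a mirror-like surface. */
static SurfaceHit rasterize_pixel(int x, int y)
{
    SurfaceHit h = { .needs_rays = ((x + y) % 8 == 0),
                     .albedo = { 0.5f, 0.5f, 0.5f } };
    return h;
}

/* Stub secondary-ray pass: the part that would run on the programmable cores. */
static Color trace_secondary_rays(const SurfaceHit *hit)
{
    Color c = { hit->albedo.r * 0.9f, hit->albedo.g * 0.9f, hit->albedo.b };
    return c;
}

static Color shade_directly(const SurfaceHit *hit) { return hit->albedo; }

int main(void)
{
    int ray_pixels = 0;
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++) {
            SurfaceHit hit = rasterize_pixel(x, y);        /* primary visibility: raster */
            Color c = hit.needs_rays ? trace_secondary_rays(&hit)
                                     : shade_directly(&hit);
            (void)c;
            if (hit.needs_rays)
                ray_pixels++;                              /* only these pay for rays */
        }
    }
    printf("%d of 256 pixels needed secondary rays\n", ray_pixels);
    return 0;
}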
 
This is what I thought, too, but I've actually heard otherwise. I now believe that we'll see a gaming part first, and then the HPC stuff later on.
That's very interesting. It doesn't quite square with one of my few datapoints, but it's far from impossible.
I think the answer here is that they're going to do mostly raster with a bit of raytracing thrown in--the raster stuff will be done in fixed-function hardware, with the RT running on the general parts of the GPU.
That's a reasonable architectural tradeoff for the 2010 timeframe, I guess. Ideally, you'd want to have a clean API for raytracing that could in the future be supported through specialized hardware rather than just asking developers to "implement it themselves" imo, but that's really a detail and I'm just nitpicking without sufficient information here.
Now I have to ask do you think that R600 is inherently an inefficient architecture, beyond the mismatched ALU-TEX ratio?
It certainly seems less efficient than G8x to me (at least in terms of ROPs and TMUs, I'd tend to believe the ALUs are pretty dense). I don't think it's catastrophically less efficient though; as long as they're willing to swallow lower margins than NVIDIA, I can't see why they wouldn't be able to compete fine. After all, RV610 and RV630 are pretty good chips. Regarding power consumption, RV6xx chips prove the architecture itself is fine there, and that it's the specific implementation of the R600 (process, maybe circuits, I don't know) that is a problem.

So unless G9x gets more efficient and ATI's refreshes get less efficient (errr?!), I don't personally think there's a massive problem for them. They just shouldn't expect their graphics division to be incredibly profitable, but that's hardly news.
 
It certainly seems less efficient than G8x to me (at least in terms of ROPs and TMUs, I'd tend to believe the ALUs are pretty dense).

So do you mean die space consumed by those parts between G80 and R600? Or, like I was saying, that they chose to include more ALUs when more ROPs and TMUs would have been more advantageous for today's workloads?

If it's the latter, they should be able to fix it in a refresh without much trouble, and that should put them back on a level playing field with any refresh that Nvidia produces.


Regarding power consumption, RV6xx chips prove the architecture itself is fine there, and that it's the specific implementation of the R600 (process, maybe circuits, I don't know) that is a problem.

OK, that's the impression I got as well, but I have seen far too many comments, not necessarily in this thread, with regard to R600 being a bad architecture because of its power consumption.
 
This is what I thought, too, but I've actually heard otherwise. I now believe that we'll see a gaming part first, and then the HPC stuff later on.

I think the answer here is that they're going to do mostly raster with a bit of raytracing thrown in--the raster stuff will be done in fixed-function hardware, with the RT running on the general parts of the GPU.

Heya, Hannibal. Glad you made it past the sharks in the moat. :smile:

Nice Apr. 26th article there. Missed it at the time. Presumably you've seen the Carmean presentation we put up on April 11?

What I'd really like to know is if your source who opined that Larrabee would be a good *hybrid* architecture for combining the best of rasterization (for primary rays) and ray tracing (for secondary rays) is a, shall we say, "source close to Intel"? Because the industry people we know all say that might work too, but when they read Intel researchers waxing poetical (See Takahashi's long blog that's linked in this thread) on ray tracing it surely doesn't sound like they are only contemplating secondary ray usage.

Also, are you hearing any rumbles that Intel has approached MS about including any ray tracing stuff in DX? Aren't they going to require some API support to really get this off the ground for gaming?

How does this fit with the idea of them going with discrete AIB cards, which everyone now expects?


Let us know if you want to keep chatting about this on the forum, and if so we'll break it off into its own "B3D grills Ars' Hannibal re Intel Larrabee" thread. :LOL:
 
So do you mean die space consumed by those parts between G80 and R600?
Well, I don't actually know how dense G80's ALUs are, so I'm not sure in relative terms. What I would *tend* to believe, however, and I could be wrong, is that ATI's ALUs are competitive in terms of perf/mm², while their ROPs and TMUs aren't.

The result of that is that their "balanced ratio" is not the same as NVIDIA's. Even for a single given game, the ALU, TMU and ROP requirements are going to vary a lot over time, including within a given frame. So imagine that for a given game, on average, you gain 50% performance from either doubling the number of TMUs or doubling the number of ALUs. If for one architecture the TMUs are cheaper than the ALUs, it makes more sense to have more of those, and the reverse for the other architecture.

So I'm not convinced ATI's architecture is, strictly speaking, 'unbalanced', because it really depends on how big/small each kind of unit is. I'm not convinced the ratio is 100% ideal, but I'd be surprised if, for the size of their units, it was completely wrong. How many ROPs you want also depends on how good your bandwidth-saving techniques are, although I'd be surprised if ATI was so bad in that respect. But it's hard to say, of course.
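
A silly worked example in C, with completely made-up unit areas, just to illustrate why the "right" ratio depends on how big each unit is:

Code:
/* Made-up unit areas, just to illustrate the point: if doubling either unit
 * type buys the same ~50% for a given game, the cheaper unit type (in mm^2)
 * is the one worth adding more of. */
#include <stdio.h>

static void compare(const char *arch, double alu_area, double tmu_area)
{
    const double gain = 0.5;  /* assumed +50% from doubling either unit type */
    /* Doubling a unit type costs roughly its current area over again. */
    printf("%s: speedup per extra mm^2 -> more ALUs %.4f, more TMUs %.4f\n",
           arch, gain / alu_area, gain / tmu_area);
}

int main(void)
{
    /* Hypothetical areas in mm^2, not real die measurements. */
    compare("Arch A (dense ALUs, big TMUs)", 40.0, 80.0);  /* favours adding ALUs */
    compare("Arch B (big ALUs, dense TMUs)", 80.0, 40.0);  /* favours adding TMUs */
    return 0;
}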
 
Any clues on how to characterize the vector units in Larrabee?

Is it like some kind of super-wide SSE with 512-bit registers?
Does that mean two source registers with one serving as the destination?

That's 2 512-bit operand loads per vector unit per clock, equal to 32 single-precision scalars. That leaves out any 3-operand operations, which wouldn't fit too well with x86 semantics. (Or would the semantics be broken?)

If we assume 16 cores per Larrabee chip, that's 512 operands per clock.
It would be 768 if the unit could source 3 operand registers (not likely if it keeps the same semantics).

R600 can pull in 15 operands per clock per SIMD processor with a MADD.
If we go with a 2-operand op, it's 10 operands per clock.
15*16*4 = 960 operands that can be processed per clock
10*16*4 = 640 operands

G80 has 128 SPs, each capable of a MADD, so that's
3*128 = 384 operands per main clock
2*128 = 256 operands per clock
These are roughly double-clocked shader units, so 768 or 512 operands per core clock.

Looking at max operand throughput in the non-SF shader processors:
(3-operand, 2-operand)
R600: 960, 640
G80: 384, 256 (x2)
Larrabee: 768?, 512 (likely)

The unknown is clock speeds, both for Larrabee and the GPUs that would be around by that point.

Pixel shader branch granularity (pixels):
R600: 64
G80: 32
Larrabee: 16 (according to Ars)
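
To make the terminology mapping concrete, here's a small C sketch that models a 16-lane fp32 MADD and tallies the operand counts above; the Larrabee figures (16 cores, 512-bit registers, 2-operand semantics) are assumptions, not confirmed specs:

Code:
/* Sketch of how I read these numbers (assumed figures, not vendor data):
 * model a 16-lane fp32 vector MADD and tally operands/clock per design. */
#include <stdio.h>

#define LANES 16  /* 512-bit register / 32-bit floats */

/* One 16-wide MADD: d = a*b + c. Three source vectors -> 3*16 = 48 operands;
 * with 2-operand x86-style semantics (destination doubles as a source) it's
 * 2*16 = 32. */
static void vec_madd(float d[LANES], const float a[LANES],
                     const float b[LANES], const float c[LANES])
{
    for (int i = 0; i < LANES; i++)
        d[i] = a[i] * b[i] + c[i];
}

int main(void)
{
    float a[LANES], b[LANES], c[LANES], d[LANES];
    for (int i = 0; i < LANES; i++) { a[i] = (float)i; b[i] = 2.0f; c[i] = 1.0f; }
    vec_madd(d, a, b, c);
    printf("lane 3: %g (expect 7)\n", d[3]);

    /* Operand tallies per clock, as in the table above. */
    printf("R600:     3-op %d, 2-op %d\n", 15 * 16 * 4, 10 * 16 * 4);            /* 960, 640 */
    printf("G80:      3-op %d, 2-op %d (x2 shader clock)\n", 3 * 128, 2 * 128);  /* 384, 256 */
    printf("Larrabee: 3-op %d?, 2-op %d (assuming 16 cores)\n",
           3 * LANES * 16, 2 * LANES * 16);                                      /* 768, 512 */
    return 0;
}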
 
A MUL multiplies two numbers.
A MADD is a multiply that adds the result of the multiplication to a third number.

A FLOP is any floating point operation, so a multiply and an addition, if done separately, count as two FLOPs.

However, a MADD is a combined instruction, so it counts as more than one operation, but it can't be used as flexibly as 2 separate floating point operations.

If a chip doesn't have a MADD unit, it must use a MUL and an ADD separately.
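
A tiny C illustration of the difference, if it helps:

Code:
/* Tiny illustration: a MADD does the work of a separate MUL and ADD
 * (2 FLOPs) in one instruction, but only in the fixed a*b + c shape. */
#include <stdio.h>

int main(void)
{
    float a = 2.0f, b = 3.0f, c = 4.0f;

    float mul  = a * b;       /* MUL:  1 FLOP */
    float sum  = mul + c;     /* ADD:  1 FLOP -- two separate, freely usable ops */
    float madd = a * b + c;   /* MADD: counted as 2 FLOPs, one combined operation
                               * (whether the compiler emits a fused op is another matter) */

    printf("separate MUL+ADD: %g, fused MADD: %g\n", sum, madd);  /* both 10 */
    return 0;
}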
 
Sorry, I didn't get it yet :/

Just look at the names; they are instructions, which is why they don't use the whole word or phrase to describe them:

MUL: means multiply
MADD: means multiply add

The MADD instruction exists because it is really common to do an ADD right after a MUL, and thus it is a nice optimization.

edit: Yeah, sorry, 3D beat me to it and explained it better too.
 