NVIDIA confirms Next-Gen close to 1TFlop in 4Q07

Shtal · Jun 6, 2007

Vincent said:
Due Launch Date : End of Nov

ATI's next generation part codenamed R700 is scheduled for first quarter of 2008.
http://www.fudzilla.com/index.php?option=com_content&task=view&id=211&Itemid=1

ATI will not be so far behind with R700 as a response to Nvidia's G92.

Domell · Jun 6, 2007

Shtal said:
ATI's next generation part codenamed R700 is scheduled for first quarter of 2008.
http://www.fudzilla.com/index.php?option=com_content&task=view&id=211&Itemid=1

ATI will not be so far behind with R700 as a response to Nvidia's G92.

Hmmm but as you see below Fudzilla said NVIDIA G100 is scheduled for 1Q 2008....
http://www.fudzilla.com/index.php?option=com_content&task=view&id=355&Itemid=1

I don't think ATI will release 3 high-end GPUs in 10 months.... R600 May 2007, R650 September/October 2007 and R700 about March 2008.... IMO we will see R700 in 2H 2008 together with G100....

Shtal · Jun 6, 2007

Domell said:
Hmmm but as you see below Fudzilla said NVIDIA G100 is scheduled for 1Q 2008....
http://www.fudzilla.com/index.php?option=com_content&task=view&id=355&Itemid=1

I don't think ATI will release 3 high-end GPUs in 10 months.... R600 May 2007, R650 September/October 2007 and R700 about March 2008.... IMO we will see R700 in 2H 2008 together with G100....

G92 and G100 have a very small time gap between them?? Well, possibly ATI may have some kind of Extreme edition of R700 as well.

Domell · Jun 6, 2007

Shtal said:
G92 and G100 have a very small time gap between them?? Well, possibly ATI may have some kind of Extreme edition of R700 as well.

But I think G100, and R700 as well, won't be released in Q1 2008.... At the earliest in Q2 2008, but more probably Q2/Q3 2008....

INKster · Jun 6, 2007

Michael Hara said recently that Nvidia will stick to the "high-end launch in the Fall, mainstream launch in the Spring" business model for the foreseeable future, so no G100 in Q1'08, or even Q2'08.
G92 is what you get, and it's not that bad either.

Shtal · Jun 6, 2007

Domell said:
But I think G100, and R700 as well, won't be released in Q1 2008.... At the earliest in Q2 2008, but more probably Q2/Q3 2008....

Probably ATI-R670 will fight Nvidia G92 then!

The latest batch of roadmaps tells of details about several new parts, for example the RV670 and R670, http://www.theinquirer.net/default.aspx?article=40068

AnarchX · Jun 6, 2007

Shtal said:
Probably ATI-R670 will fight Nvidia G92 then!

I don't think that R670 will fight against a single G92.

Domell · Jun 6, 2007

AnarchX said:
I don't think that R670 will fight against a single G92.

Why not??

INKster · Jun 6, 2007

Domell said:
Why not??

Probably because it's a dual-GPU card, with maybe two R650's working in Crossfire mode.
However, I wouldn't rule out a similar move by Nvidia, even though we know G92 is not another GX2-type of refresh product (due to known process changes, added FP64 support, etc).

trinibwoy · Jun 6, 2007

Geeforcer said:
25% smaller die size, 30% increase in transistor count, 28% decrease in process size. Does that add up?

65^2 is 52% of 90^2. How'd you get a 28% decrease?

_xxx_ · Jun 6, 2007

trinibwoy said:
65^2 is 52% of 90^2. How'd you get a 28% decrease?

90 - 28% = 65. Obviously miscalculated

Mintmaster · Jun 6, 2007

dnavas said:
Err, no, not that I know of. If one cluster doesn't do texturing, and another cluster needs a huge amount of texturing, the TMUs don't get sent to work for the other cluster, afaik.

True, but IMO that's not a particularly useful capability. There should be enough threads on each cluster that this situation is statistically rare. If you really wanted to take care of this corner case, it would make more sense to make the sequencer a bit more intelligent in choosing which batch to go after next.

Geeforcer said:
25% smaller die size, 30% increase in transistor count, 28% decrease in process size. Does that add up?

Probably. 90nm --> 65nm theoretically means 48% less area per transistor (though usually the gains aren't quite that high between processes). A 25% decrease in area and 30% increase in transistor count only needs a 42% decrease in transistor area, so it's very realistic.
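As a quick check of the arithmetic in this exchange, here is a minimal Python sketch. The function names are made up for illustration, and the 25% and 30% inputs are the rumoured figures quoted above, not confirmed specifications. It also assumes ideal scaling, where area per transistor goes with the square of the feature size.

```python
def ideal_area_ratio(old_nm, new_nm):
    """Area per transistor after an ideal shrink, relative to before."""
    return (new_nm / old_nm) ** 2


def required_area_ratio(die_area_ratio, transistor_count_ratio):
    """Area per transistor needed to reach the rumoured die, relative to before."""
    return die_area_ratio / transistor_count_ratio


# Linear shrink: this is where the "28%" figure comes from.
linear_shrink = 1 - 65 / 90

# Ideal area shrink from 90nm to 65nm (area scales with the square).
ideal = ideal_area_ratio(90, 65)

# Rumoured inputs: 25% smaller die, 30% more transistors.
needed = required_area_ratio(0.75, 1.30)

print(f"linear shrink:              {linear_shrink:.0%}")
print(f"ideal area per transistor:  {ideal:.0%} of the 90nm figure")
print(f"needed area per transistor: {needed:.0%} of the 90nm figure")
print("rumour fits within ideal scaling:", needed >= ideal)
```

The rumour is plausible whenever the needed ratio is no smaller than the ideal one, since that means the process does not have to shrink transistors by more than the theoretical best case.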

G92 looks pretty nuts to me. I thought ATI might have an advantage in clocking up its shaders when AMD came aboard, but now that NVidia beat them to that with G80 and will likely go even further with G92, I don't see ATI having much success against the latter.

Geeforcer · Jun 6, 2007

trinibwoy said:
65^2 is 52% of 90^2. How'd you get a 28% decrease?

Remind me to stay away from calculations at 3 am.

Jawed · Jun 6, 2007

Mintmaster said:
True, but IMO that's not a particularly useful capability. There should be enough threads on each cluster that this situation is statistically rare. If you really wanted to take care of this corner case, it would make more sense to make the sequencer a bit more intelligent in choosing which batch to go after next.

Strangely enough, what Davros suggests seems to be how R600 works. But it works like that all the time.

Hmm...

Jawed

dnavas · Jun 7, 2007

Jawed said:
Strangely enough, what Davros suggests seems to be how R600 works. But it works like that all the time.

"Davros"

Well, I'm not sure I'm in love with R600's approach, either. Should nearby pixels really be handled by disjoint TMUs? Does it make sense to *always* ship work across the chip? One could come up with a trivial predication case that would effectively underutilize R600's TMUs as well.

From a high-level perspective, I guess I think of this problem in a couple of different ways. Either worktypes are determined, and processing units (TMUs, SFUs, ALUs) assigned dynamically, or a kernel forks off requests to unit processing farms, which report back results (the individual 'farms' manage prioritization of incoming requests, etc.).

MintMaster is probably right, that multiple threads can almost certainly hide underutilization, but the above seems somewhat more flexible when it comes to handling DB. As long as #units/#sequencers(?) <= average(kernel_data_width), you don't have a DB problem. I'm sure there are much larger problems to deal with, though -- like shipping data all over a chip.... Something I wouldn't expect a higher-clocked chip to try to do. [And it is looking like the G92 is a MUL-enabled, 192proc, higher speed chip, if "2x theoretical" and "2.5-3x real" and "30% smaller die" are to be believed]

-Dave

Jawed · Jun 7, 2007

dnavas said:
"Davros"

Ah, sorry, Dave - hasty posting during an advert break syndrome

For the rest of this posting, just assume I've got one eye somewhere else

...

Well, I'm not sure I'm in love with R600's approach, either. Should nearby pixels really be handled by disjoint TMUs? Does it make sense to *always* ship work across the chip? One could come up with a trivial predication case that would effectively underutilize R600's TMUs as well.

I think one would need to do some serious simulation to understand this.

I can only think that once you've built latency tolerance, the two approaches (private TUs versus shared-distributed TUs) end up moving the same amount of data around the ring.

Hmm, except that texels in compressed form (which I presume they are, while they're in L2) would consume less ring bandwidth. When a TU produces a quad of texel results (or, perhaps, 4 quads of texel results as a burst in response to one batch) that are fully filtered and are destined for registers, surely they consume more bandwidth on the ring? Then again, texel-overhead relating to anisotropic filtering is saved, since those extra texels tend to stay in their "home" L2. Gah.

We don't know the rasterisation pattern in R600. Considering a batch of 64 pixels, for example, is it:

1111222233334444
1111222233334444
1111222233334444
1111222233334444

or:

1111111133333333
1111111133333333
2222222244444444
2222222244444444

etc.
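The two candidate layouts above can be generated with a small Python sketch, which makes it easier to try other tile shapes. It reads each digit as the index of whichever unit handles that pixel, which is an interpretation of the diagrams rather than anything known about R600. The function names and default sizes are made up for illustration.

```python
def striped(width=16, height=4, stripe=4):
    """Vertical stripes: the unit index changes every `stripe` pixels along x."""
    row = "".join(str(x // stripe + 1) for x in range(width))
    return [row for _ in range(height)]


def blocked(width=16, height=4, block_w=8, block_h=2):
    """Rectangular blocks, numbered down each column of blocks first."""
    blocks_down = height // block_h
    rows = []
    for y in range(height):
        rows.append("".join(
            str((x // block_w) * blocks_down + y // block_h + 1)
            for x in range(width)
        ))
    return rows


for name, layout in (("striped", striped()), ("blocked", blocked())):
    print(name)
    print("\n".join(layout))
    print()
```

Each unit covers 16 pixels in both layouts. What differs is the footprint (4x4 versus 8x2), and that footprint is what would affect texel locality.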

I remember a rasterisation patent document that implied rasterisation along the long axis of a triangle, so either width-wise or height-wise rasterisation is possible. What's the effect of that on texel locality? How big are the screen-space tiles within which rasterisation is constrained? What about that texture caching patent application I keep linking, the prefetching one?

I can't think what kind of trivial predication you're referring to that would waste R600's TUs. The "home" arbiter for the texture requests (for a batch) is forced to treat the 16 quads of texel results that it's waiting for as asynchronous events. Predication would de-select texture-fetches at the quad level, I guess, so the arbiter would only send out quad-fetches to "foreign" TUs as needed.

Brainfade...

Jawed

dnavas · Jun 7, 2007

Jawed said:
For the rest of this posting, just assume I've got one eye somewhere else ...

Fair enough. You should assume that I'm asleep while posting this, then

I can't think what kind of trivial predication you're referring to that would waste R600's TUs.

If quad X always goes to TMUx, then a predication mask that always masks (say) quad 2, will leave TMU2 without any work to do.
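A toy Python model of that case, assuming four quads per batch and four TMUs (both numbers are hypothetical, as are the function names). It compares a fixed quad-to-TMU mapping with a pooled scheme that hands each surviving request to the next TMU in turn. Neither is claimed to be how R600 actually schedules texture fetches.

```python
def static_load(num_tmus, masked_quads, batches):
    """Quad X always goes to TMU X; masked quads issue no fetch."""
    load = [0] * num_tmus
    for _ in range(batches):
        for quad in range(num_tmus):
            if quad not in masked_quads:
                load[quad] += 1
    return load


def pooled_load(num_tmus, masked_quads, batches):
    """Surviving fetches are dealt out round-robin across all TMUs."""
    load = [0] * num_tmus
    next_tmu = 0
    for _ in range(batches):
        for quad in range(num_tmus):
            if quad not in masked_quads:
                load[next_tmu] += 1
                next_tmu = (next_tmu + 1) % num_tmus
    return load


mask = {2}  # predication always kills quad 2
print("static:", static_load(4, mask, batches=8))
print("pooled:", pooled_load(4, mask, batches=8))
```

With the fixed mapping one TMU receives nothing, while the pooled scheme spreads the same 24 fetches evenly. The model ignores latency, caching and data movement cost, which are the harder questions raised in this thread.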

I'm not sure how a local TMU uses the ring at all -- local ALUs talk to local TMUs, I wouldn't expect that to be over the ring. As it is, ALUs are always talking to remote TMUs (how remote depends on which quad). Have I misunderstood something? [that's a stupid question] What have I misunderstood?

-Dave [->sleep]

Jawed · Jun 7, 2007

dnavas said:
If quad X always goes to TMUx, then a predication mask that always masks (say) quad 2, will leave TMU2 without any work to do.

Ah, OK, that's the kind of thing synthetics are for. Actually, that'd prolly make a really good synthetic for testing the performance of R600 texturing. Similar to dynamic branching tests that only use rectangular areas of coherence.

Which reminds me of a similar possibility with the way textures are defined and then fetched. It's possible to use a stride that will hit only one memory channel.
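The stride effect can be sketched in Python under a simple assumed address mapping: memory is interleaved across channels in fixed-size chunks, so the channel is (address // chunk) mod channel count. The 8 channels and 256-byte chunk are illustrative values only, and the function name is invented. R600's real mapping is not described in this thread.

```python
def channels_touched(stride_bytes, num_fetches,
                     num_channels=8, interleave_bytes=256):
    """Set of channels hit by fetches at address 0, stride, 2*stride, ..."""
    return {
        (i * stride_bytes // interleave_bytes) % num_channels
        for i in range(num_fetches)
    }


# A stride equal to the interleave chunk walks through every channel.
print(sorted(channels_touched(256, 64)))

# A stride equal to chunk size times channel count lands on one channel every time.
print(sorted(channels_touched(256 * 8, 64)))
```

Under this mapping, any stride that is a multiple of chunk size times channel count keeps hitting the same channel, which is the pathological case a synthetic test could target.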

I'm not sure how a local TMU uses the ring at all -- local ALUs talk to local TMUs, I wouldn't expect that to be over the ring.

No, but some of the texels could be in a foreign TMU's L2 already. Presuming that L2 is distributed - which I'm assuming is the case for the time being...

As it is, ALUs are always talking to remote TMUs (how remote depends on which quad). Have I misunderstood something? [that's a stupid question ] What have I misunderstood?

No, I don't think you misunderstood anything. I might draw a diagram of how I think it all hangs together at some point...

Ooh, hang on, there's this from Watch Impress

I wish AMD would just post the complete set of slides.

Anyway, that doesn't show the ring bus at all, so I prolly should still have a go at a more detailed diagram.

Jawed

mhouston · Jun 7, 2007

Eric Demers gave a talk about the R6XX processors at Stanford's CS448 and AMD actually let us post the slides. http://graphics.stanford.edu/cs448-07-spring/. The talk was not a completely deep technical dive as it was in some ways designed to inspire students aiming to become architects and talk about why some things were done.

Geo · Jun 7, 2007

Jawed said:
I wish AMD would just post the complete set of slides.

Anyway, that doesn't show the ring bus at all, so I prolly should still have a go at a more detailed diagram.

Jawed

We have Eric's architecture deep-dive from Tunis. We also have a long list of interview questions in to Eric. Hopefully these things get published together...
