NVIDIA confirms Next-Gen close to 1TFlop in 4Q07

ATI's next generation part codenamed R700 is scheduled for first quarter of 2008.
http://www.fudzilla.com/index.php?option=com_content&task=view&id=211&Itemid=1

ATI will not be far behind with R700 in responding to Nvidia's G92.

Hmmm, but as you see below, Fudzilla said NVIDIA G100 is scheduled for 1Q 2008....
http://www.fudzilla.com/index.php?option=com_content&task=view&id=355&Itemid=1

I don't think ATI will release 3 high-end GPUs in 10 months.... R600 May 2007, R650 September/October 2007 and R700 around March 2008.... IMO we will see R700 in 2H 2008 together with G100....
 

G92 and G100 have a very small time frame between them?? Well, possibly ATI may have some kind of Extreme edition of R700 as well.
 

But I think neither G100 nor R700 will be released in Q1 2008.... At the earliest in Q2 2008, but more probably Q2/Q3 2008....
 
Michael Hara said recently that Nvidia will stick to the "high-end launch in the Fall, mainstream launch in the Spring" business model for the foreseeable future, so no G100 in Q1'08, or even Q2'08.
G92 is what you get, and it's not that bad either.
 
Why not??

Probably because it's a dual-GPU card, with maybe two R650's working in Crossfire mode.
However, I wouldn't rule out a similar move by Nvidia, even though we know G92 is not another GX2-type of refresh product (due to known process changes, added FP64 support, etc).
 
Err, no, not that I know of. If one cluster doesn't do texturing, and another cluster needs a huge amount of texturing, the TMUs don't get sent to work for the other cluster, afaik.
True, but IMO that's not a particularly useful capability. There should be enough threads on each cluster that this situation is statistically rare. If you really wanted to take care of this corner case, it would make more sense to make the sequencer a bit more intelligent in choosing which batch to go after next.

25% smaller die size, 30% increase in transistor count, 28% decrease in process size. Does that add up?
Probably. 90nm --> 65nm theoretically means 48% less area per transistor (though usually the gains aren't quite that high between processes). A 25% decrease in area with a 30% increase in transistor count only needs a 42% decrease in per-transistor area, so it's very realistic.
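The arithmetic above is easy to sanity-check; a quick sketch (just the numbers from the post, nothing GPU-specific):

```python
# Ideal area scaling for a 90nm -> 65nm shrink: linear shrink squared.
ideal = (65 / 90) ** 2
print(f"ideal area per transistor: {ideal:.2f}x")  # ~0.52x, i.e. ~48% less

# What the rumoured numbers actually require:
area_ratio = 0.75        # 25% smaller die
transistor_ratio = 1.30  # 30% more transistors
required = area_ratio / transistor_ratio
print(f"required area per transistor: {required:.2f}x")  # ~0.58x, i.e. ~42% less

# The required shrink (42%) is less aggressive than the ideal (48%),
# which is why the rumoured combination is plausible.
assert required > ideal
```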

G92 looks pretty nuts to me. I thought ATI might have an advantage in clocking up its shaders when AMD came aboard, but now that NVidia beat them to that with G80 and will likely go even further with G92, I don't see ATI having much success against the latter.
 
True, but IMO that's not a particularly useful capability. There should be enough threads on each cluster that this situation is statistically rare. If you really wanted to take care of this corner case, it would make more sense to make the sequencer a bit more intelligent in choosing which batch to go after next.
Strangely enough, what Davros suggests seems to be how R600 works. But it works like that all the time.

Hmm...

Jawed
 
Strangely enough, what Davros suggests seems to be how R600 works. But it works like that all the time.

"Davros" :)

Well, I'm not sure I'm in love with R600's approach, either. Should nearby pixels really be handled by disjoint TMUs? Does it make sense to *always* ship work across the chip? One could come up with a trivial predication case that would effectively underutilize R600's TMUs as well.

From a high-level perspective, I guess I think of this problem in a couple of different ways. Either worktypes are determined, and processing units (TMUs, SFUs, ALUs) assigned dynamically, or a kernel forks off requests to unit processing farms, which report back results (the individual 'farms' manage prioritization of incoming requests, etc.).

MintMaster is probably right that multiple threads can almost certainly hide underutilization, but the above seems somewhat more flexible when it comes to handling DB. As long as #units/#sequencers(?) <= average(kernel_data_width), you don't have a DB problem. I'm sure there are much larger problems to deal with, though -- like shipping data all over a chip.... Something I wouldn't expect a higher-clocked chip to try to do. [And it is looking like the G92 is a MUL-enabled, 192proc, higher speed chip, if "2x theoretical" and "2.5-3x real" and "30% smaller die" are to be believed]

-Dave
 
Ah, sorry, Dave - hasty posting during an advert break syndrome :oops:

For the rest of this posting, just assume I've got one eye somewhere else :D ...

Well, I'm not sure I'm in love with R600's approach, either. Should nearby pixels really be handled by disjoint TMUs? Does it make sense to *always* ship work across the chip? One could come up with a trivial predication case that would effectively underutilize R600's TMUs as well.
I think one would need to do some serious simulation to understand this.

I can only think that once you've built latency tolerance, the two approaches (private TUs versus shared-distributed TUs) end up moving the same amount of data around the ring.

Hmm, except that texels in compressed form (which I presume they are, while they're in L2) would consume less ring bandwidth. When a TU produces a quad of texel results (or, perhaps, 4 quads of texel results as a burst in response to one batch) that are fully filtered and are destined for registers, surely they consume more bandwidth on the ring? Then again, texel-overhead relating to anisotropic filtering is saved, since those extra texels tend to stay in their "home" L2. Gah.

We don't know the rasterisation pattern in R600. Considering a batch of 64 pixels, for example, is it:

1111222233334444
1111222233334444
1111222233334444
1111222233334444

or:

1111111133333333
1111111133333333
2222222244444444
2222222244444444

etc.

I remember a rasterisation patent document that implied rasterisation along the long axis of a triangle, so either width-wise or height-wise rasterisation is possible. What's the effect of that on texel locality? How big are the screen-space tiles within which rasterisation is constrained? What about that texture caching patent application I keep linking, the prefetching one?

I can't think what kind of trivial predication you're referring to that would waste R600's TUs. The "home" arbiter for the texture requests (for a batch) is forced to treat the 16 quads of texel results that it's waiting for as asynchronous events. Predication would de-select texture-fetches at the quad level, I guess, so the arbiter would only send out quad-fetches to "foreign" TUs as needed.

Brainfade...

Jawed
 
For the rest of this posting, just assume I've got one eye somewhere else :D ...

Fair enough. You should assume that I'm asleep while posting this, then :)

I can't think what kind of trivial predication you're referring to that would waste R600's TUs.

If quad X always goes to TMUx, then a predication mask that always masks (say) quad 2, will leave TMU2 without any work to do.

I'm not sure how a local TMU uses the ring at all -- local ALUs talk to local TMUs, I wouldn't expect that to be over the ring. As it is, ALUs are always talking to remote TMUs (how remote depends on which quad). Have I misunderstood something? [that's a stupid question ;)] What have I misunderstood?

-Dave [->sleep]
 
If quad X always goes to TMUx, then a predication mask that always masks (say) quad 2, will leave TMU2 without any work to do.
Ah, OK, that's the kind of thing synthetics are for. Actually, that'd prolly make a really good synthetic for testing the performance of R600 texturing. Similar to dynamic branching tests that only use rectangular areas of coherence.
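That corner case is simple enough to model. A toy sketch of such a synthetic (the fixed quad-to-TMU mapping and the function below are hypothetical, purely to illustrate the starvation pattern, not actual R600 behaviour):

```python
# Toy model: quads statically mapped to TMUs (quad i always goes to TMU i).
# A predication mask that always kills one quad then starves its TMU
# while the others stay fully busy.
NUM_TMUS = 4

def tmu_work(batches, masked_quad):
    """Count texture requests seen by each TMU under a fixed quad->TMU mapping."""
    work = [0] * NUM_TMUS
    for _ in range(batches):
        for quad in range(NUM_TMUS):
            if quad == masked_quad:
                continue        # predicated out: no fetch ever issued
            work[quad] += 1     # static mapping: quad i -> TMU i
    return work

print(tmu_work(batches=100, masked_quad=2))  # [100, 100, 0, 100]
```

With a smarter (dynamic) assignment, the three live quads could be spread across all four TMUs instead, which is the whole point of the synthetic.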

Which reminds me of a similar possibility with the way textures are defined and then fetched. It's possible to use a stride that will hit only one memory channel.
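A minimal sketch of why such a stride degenerates (the channel count and interleave granularity here are illustrative assumptions, not R600 specifics):

```python
# Toy model: N memory channels interleaved every G bytes.
# Any access stride that is a multiple of N*G lands on the same channel forever.
CHANNELS = 8
GRANULE = 256  # bytes per channel before the interleave rotates

def channel(addr):
    """Which channel a byte address maps to under simple interleaving."""
    return (addr // GRANULE) % CHANNELS

stride = CHANNELS * GRANULE  # pathological stride: 2048 bytes
hits = {channel(i * stride) for i in range(64)}
print(hits)  # {0} -- all 64 accesses hit channel 0, 1/8th of peak bandwidth
```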

I'm not sure how a local TMU uses the ring at all -- local ALUs talk to local TMUs, I wouldn't expect that to be over the ring.
No, but some of the texels could be in a foreign TMU's L2 already. Presuming that L2 is distributed - which I'm assuming is the case for the time being...

As it is, ALUs are always talking to remote TMUs (how remote depends on which quad). Have I misunderstood something? [that's a stupid question ;)] What have I misunderstood?
No, I don't think you misunderstood anything. I might draw a diagram of how I think it all hangs together at some point...

Ooh, hang on, there's this from Watch Impress

[Image: kaigai03.jpg - AMD slide via Watch Impress]


I wish AMD would just post the complete set of slides.

Anyway, that doesn't show the ring bus at all, so I prolly should still have a go at a more detailed diagram.

Jawed
 
Eric Demers gave a talk about the R6XX processors at Stanford's CS448, and AMD actually let us post the slides: http://graphics.stanford.edu/cs448-07-spring/. The talk was not a complete deep technical dive, as it was in some ways designed to inspire students aiming to become architects and to talk about why some things were done.
 
I wish AMD would just post the complete set of slides.

Anyway, that doesn't show the ring bus at all, so I prolly should still have a go at a more detailed diagram.

Jawed

We have Eric's architecture deep-dive from Tunis. We also have a long list of interview questions for Eric. Hopefully these things get published together...
 