Old 06-Jun-2007, 09:55   #126
Shtal
Senior Member
 
Join Date: Jun 2005
Posts: 1,320
Default

Quote:
Originally Posted by Domell View Post
But I think G100, and R700 as well, won't be released in Q1 2008. Q2 2008 at the earliest, but more probably Q2/Q3 2008.
Probably ATI-R670 will fight Nvidia G92 then!

The latest batch of roadmaps gives details on several new parts, for example the RV670 and R670: http://www.theinquirer.net/default.aspx?article=40068
__________________
What is the meaning of life? - Why I'm here, I know my past, because I return to the past but I'm going forward to see my future, to find the truth, meaning of the existence and purpose.
Old 06-Jun-2007, 10:41   #127
AnarchX
Senior Member
 
Join Date: Apr 2007
Posts: 1,393
Default

Quote:
Originally Posted by Shtal View Post
Probably ATI-R670 will fight Nvidia G92 then!
I don't think that R670 will fight against one G92.
Old 06-Jun-2007, 11:36   #128
Domell
Member
 
Join Date: Oct 2004
Posts: 247
Default

Quote:
Originally Posted by AnarchX View Post
I don't think that R670 will fight against one G92.
Why not??
Old 06-Jun-2007, 11:41   #129
INKster
Senior Member
 
Join Date: Apr 2006
Location: Io, lava pit number 12
Posts: 2,108
Default

Quote:
Originally Posted by Domell View Post
Why not??
Probably because it's a dual-GPU card, with maybe two R650's working in Crossfire mode.
However, I wouldn't rule out a similar move by Nvidia, even though we know G92 is not another GX2-type of refresh product (due to known process changes, added FP64 support, etc).
Old 06-Jun-2007, 12:36   #130
trinibwoy
Meh
 
Join Date: Mar 2004
Location: New York
Posts: 9,809
Default

Quote:
Originally Posted by Geeforcer View Post
25% smaller die size, 30% increase in transistor count, 28% decrease in process size. Does that add up?
65^2 is 52% of 90^2. How'd you get a 28% decrease?
__________________
What the deuce!?
Old 06-Jun-2007, 15:46   #131
_xxx_
Naughty Boy!
 
Join Date: Aug 2004
Location: Stuttgart, Germany
Posts: 5,008
Default

Quote:
Originally Posted by trinibwoy View Post
65^2 is 52% of 90^2. How'd you get a 28% decrease?
90 - 28% = 65. That's the linear shrink, not the area, so obviously a miscalculation.
__________________
I have thought some of nature's journeymen had made men, and not made them well, they imitated humanity so abominably.
Old 06-Jun-2007, 18:38   #132
Mintmaster
Senior Member
 
Join Date: Mar 2002
Posts: 3,779
Default

Quote:
Originally Posted by dnavas View Post
Err, no, not that I know of. If one cluster doesn't do texturing, and another cluster needs a huge amount of texturing, the TMUs don't get sent to work for the other cluster, afaik.
True, but IMO that's not a particularly useful capability. There should be enough threads on each cluster that this situation is statistically rare. If you really wanted to take care of this corner case, it would make more sense to make the sequencer a bit more intelligent in choosing which batch to go after next.

Quote:
Originally Posted by Geeforcer View Post
25% smaller die size, 30% increase in transistor count, 28% decrease in process size. Does that add up?
Probably. 90nm --> 65nm theoretically means 48% less area per transistor (though usually the gains aren't quite that high between processes). A 25% decrease in area and 30% increase in transistor count only needs a 42% decrease in transistor area (0.75 / 1.30 ≈ 0.58), so it's very realistic.
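
For anyone checking the sums, here is a quick sketch of the arithmetic in this exchange. It assumes ideal scaling, where area per transistor goes with the square of the feature size (real processes usually fall short of that), and the variable names are purely illustrative.

```python
# Ideal scaling assumption: area per transistor ~ (feature size)^2.
old_nm, new_nm = 90.0, 65.0

linear_shrink = 1 - new_nm / old_nm        # the "28%" figure (linear, not area)
area_ratio = (new_nm / old_nm) ** 2        # 65^2 as a fraction of 90^2
ideal_area_saving = 1 - area_ratio         # theoretical area saved per transistor

# Rumoured figures quoted in the thread: 25% smaller die, 30% more transistors.
die_area_ratio = 1 - 0.25
transistor_ratio = 1 + 0.30

# Area each transistor may occupy, relative to the old chip.
required_area_per_transistor = die_area_ratio / transistor_ratio
required_saving = 1 - required_area_per_transistor

print(f"linear shrink:          {linear_shrink:.1%}")
print(f"65^2 / 90^2:            {area_ratio:.1%}")
print(f"ideal area saving:      {ideal_area_saving:.1%}")
print(f"saving the rumour needs: {required_saving:.1%}")
```

The required saving comes out below the ideal one, which is why the rumoured numbers are plausible even if the shrink is less than perfect.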

G92 looks pretty nuts to me. I thought ATI might have an advantage in clocking up its shaders when AMD came aboard, but now that NVidia beat them to that with G80 and will likely go even further with G92, I don't see ATI having much success against the latter.
Old 06-Jun-2007, 19:09   #133
Geeforcer
Harmlessly Evil
 
Join Date: Feb 2002
Posts: 2,027
Default

Quote:
Originally Posted by trinibwoy View Post
65^2 is 52% of 90^2. How'd you get a 28% decrease?
Remind me to stay away from calculations at 3 am.
__________________
"Complexity is easy; simplicity is difficult."
Old 06-Jun-2007, 21:31   #134
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,863
Default

Quote:
Originally Posted by Mintmaster View Post
True, but IMO that's not a particularly useful capability. There should be enough threads on each cluster that this situation is statistically rare. If you really wanted to take care of this corner case, it would make more sense to make the sequencer a bit more intelligent in choosing which batch to go after next.
Strangely enough, what Davros suggests seems to be how R600 works. But it works like that all the time.

Hmm...

Jawed
Old 07-Jun-2007, 00:01   #135
dnavas
Member
 
Join Date: Apr 2004
Posts: 325
Default

Quote:
Originally Posted by Jawed View Post
Strangely enough, what Davros suggests seems to be how R600 works. But it works like that all the time.
"Davros"

Well, I'm not sure I'm in love with R600's approach, either. Should nearby pixels really be handled by disjoint TMUs? Does it make sense to *always* ship work across the chip? One could come up with a trivial predication case that would effectively underutilize R600's TMUs as well.

From a high-level perspective, I guess I think of this problem in a couple of different ways. Either worktypes are determined, and processing units (TMUs, SFUs, ALUs) assigned dynamically, or a kernel forks off requests to unit processing farms, which report back results (the individual 'farms' manage prioritization of incoming requests, etc.).

MintMaster is probably right, that multiple threads can almost certainly hide underutilization, but the above seems somewhat more flexible when it comes to handling DB. As long as #units/#sequencers(?) <= average(kernel_data_width), you don't have a DB problem. I'm sure there are much larger problems to deal with, though -- like shipping data all over a chip.... Something I wouldn't expect a higher-clocked chip to try to do. [And it is looking like the G92 is a MUL-enabled, 192proc, higher speed chip, if "2x theoretical" and "2.5-3x real" and "30% smaller die" are to be believed]

-Dave
Old 07-Jun-2007, 01:51   #136
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,863
Default

Quote:
Originally Posted by dnavas View Post
"Davros"
Ah, sorry, Dave - hasty posting during an advert break syndrome

For the rest of this posting, just assume I've got one eye somewhere else ...

Quote:
Well, I'm not sure I'm in love with R600's approach, either. Should nearby pixels really be handled by disjoint TMUs? Does it make sense to *always* ship work across the chip? One could come up with a trivial predication case that would effectively underutilize R600's TMUs as well.
I think one would need to do some serious simulation to understand this.

I can only think that once you've built latency tolerance, the two approaches (private TUs versus shared-distributed TUs) end up moving the same amount of data around the ring.

Hmm, except that texels in compressed form (which I presume they are, while they're in L2) would consume less ring bandwidth. When a TU produces a quad of texel results (or, perhaps, 4 quads of texel results as a burst in response to one batch) that are fully filtered and are destined for registers, surely they consume more bandwidth on the ring? Then again, texel-overhead relating to anisotropic filtering is saved, since those extra texels tend to stay in their "home" L2. Gah.

We don't know the rasterisation pattern in R600. Considering a batch of 64 pixels, for example, is it:

1111222233334444
1111222233334444
1111222233334444
1111222233334444

or:

1111111133333333
1111111133333333
2222222244444444
2222222244444444

etc.
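
To make the two candidate layouts concrete, here is a small sketch that regenerates both grids and reports the footprint of each numbered group. The 16x4 drawing and the numbering are taken from the grids above; the function names are made up for illustration, and nothing here is known to match what R600 actually does.

```python
WIDTH, HEIGHT = 16, 4  # a 64-pixel batch drawn as 16x4, as in the grids above

def column_strips(x, y):
    # First layout: four vertical strips, each 4 pixels wide.
    return x // 4 + 1

def quadrant_blocks(x, y):
    # Second layout: 8x2 blocks, numbered down first and then across.
    return (x // 8) * 2 + (y // 2) + 1

def render(label_fn):
    # Draw the batch as rows of digits, one digit per pixel.
    return ["".join(str(label_fn(x, y)) for x in range(WIDTH))
            for y in range(HEIGHT)]

def extent(label_fn, group):
    # Width and height of the screen area covered by one numbered group.
    cells = [(x, y) for y in range(HEIGHT) for x in range(WIDTH)
             if label_fn(x, y) == group]
    xs = [x for x, _ in cells]
    ys = [y for _, y in cells]
    return max(xs) - min(xs) + 1, max(ys) - min(ys) + 1

for name, fn in (("strips", column_strips), ("blocks", quadrant_blocks)):
    print(name, "footprint of group 1:", extent(fn, 1))
    print("\n".join(render(fn)))
```

Both layouts give each group 16 pixels; the difference is the shape of the footprint (4x4 versus 8x2), which is what would matter for texel locality.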

I remember a rasterisation patent document that implied rasterisation along the long axis of a triangle, so either width-wise or height-wise rasterisation is possible. What's the effect of that on texel locality? How big are the screen-space tiles within which rasterisation is constrained? What about that texture caching patent application I keep linking, the prefetching one?

I can't think what kind of trivial predication you're referring to that would waste R600's TUs. The "home" arbiter for the texture requests (for a batch) is forced to treat the 16 quads of texel results that it's waiting for as asynchronous events. Predication would de-select texture-fetches at the quad level, I guess, so the arbiter would only send out quad-fetches to "foreign" TUs as needed.

Brainfade...

Jawed
Old 07-Jun-2007, 06:38   #137
dnavas
Member
 
Join Date: Apr 2004
Posts: 325
Default

Quote:
Originally Posted by Jawed View Post
For the rest of this posting, just assume I've got one eye somewhere else ...
Fair enough. You should assume that I'm asleep while posting this, then

Quote:
I can't think what kind of trivial predication you're referring to that would waste R600's TUs.
If quad X always goes to TMUx, then a predication mask that always masks (say) quad 2, will leave TMU2 without any work to do.
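
A toy model of that case, as a sketch only. It assumes four TMUs, 16 quads per batch, and a fixed mapping of quad index modulo 4 to TMU; all three are illustrative guesses rather than confirmed R600 behaviour.

```python
NUM_TMUS = 4          # assumption for illustration
QUADS_PER_BATCH = 16  # 64-pixel batch = 16 quads

def tmu_for_quad(quad):
    # Hypothetical static mapping: "quad X always goes to TMUx".
    return quad % NUM_TMUS

def fetches_per_tmu(batches, is_masked):
    # Count the texture fetches each TMU receives when masked quads fetch nothing.
    load = {tmu: 0 for tmu in range(NUM_TMUS)}
    for _ in range(batches):
        for quad in range(QUADS_PER_BATCH):
            if not is_masked(quad):
                load[tmu_for_quad(quad)] += 1
    return load

# Predication mask that always kills the quads mapped to TMU 2.
load = fetches_per_tmu(100, lambda quad: tmu_for_quad(quad) == 2)
print(load)
```

With a static mapping the masked TMU sits idle while the other three carry all the work, so peak texturing throughput drops by a quarter in this model.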

I'm not sure how a local TMU uses the ring at all -- local ALUs talk to local TMUs, I wouldn't expect that to be over the ring. As it is, ALUs are always talking to remote TMUs (how remote depends on which quad). Have I misunderstood something? [that's a stupid question] What have I misunderstood?

-Dave [->sleep]
Old 07-Jun-2007, 14:28   #138
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,863
Default

Quote:
Originally Posted by dnavas View Post
If quad X always goes to TMUx, then a predication mask that always masks (say) quad 2, will leave TMU2 without any work to do.
Ah, OK, that's the kind of thing synthetics are for. Actually, that'd prolly make a really good synthetic for testing the performance of R600 texturing. Similar to dynamic branching tests that only use rectangular areas of coherence.

Which reminds me of a similar possibility with the way textures are defined and then fetched. It's possible to use a stride that will hit only one memory channel.
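
The stride effect is easy to show with a generic interleaved-memory model. This is a sketch under assumed numbers: the channel count and interleave granularity below are hypothetical, not R600's real address mapping.

```python
NUM_CHANNELS = 8        # hypothetical channel count
INTERLEAVE_BYTES = 256  # hypothetical bytes per channel before moving to the next

def channel_of(address):
    # Simple round-robin interleave of consecutive blocks across channels.
    return (address // INTERLEAVE_BYTES) % NUM_CHANNELS

def channels_touched(stride, count=1024):
    # Set of channels hit by `count` accesses spaced `stride` bytes apart.
    return {channel_of(i * stride) for i in range(count)}

friendly = channels_touched(INTERLEAVE_BYTES)
pathological = channels_touched(NUM_CHANNELS * INTERLEAVE_BYTES)
print(len(friendly), "channels used with a friendly stride")
print(len(pathological), "channel used with the pathological stride")
```

Any stride that is a multiple of channels times interleave size lands every access on the same channel in this model, leaving the rest of the memory bandwidth unused.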

Quote:
I'm not sure how a local TMU uses the ring at all -- local ALUs talk to local TMUs, I wouldn't expect that to be over the ring.
No, but some of the texels could be in a foreign TMU's L2 already. Presuming that L2 is distributed - which I'm assuming is the case for the time being...

Quote:
As it is, ALUs are always talking to remote TMUs (how remote depends on which quad). Have I misunderstood something? [that's a stupid question] What have I misunderstood?
No, I don't think you misunderstood anything. I might draw a diagram of how I think it all hangs together at some point...

Ooh, hang on, there's this from Watch Impress:

[image not preserved]

I wish AMD would just post the complete set of slides.

Anyway, that doesn't show the ring bus at all, so I prolly should still have a go at a more detailed diagram.

Jawed
Old 07-Jun-2007, 15:28   #139
mhouston
A little of this and that
 
Join Date: Oct 2005
Location: Cupertino
Posts: 342
Default

Eric Demers gave a talk about the R6XX processors at Stanford's CS448 and AMD actually let us post the slides. http://graphics.stanford.edu/cs448-07-spring/. The talk was not a completely deep technical dive as it was in some ways designed to inspire students aiming to become architects and talk about why some things were done.
Old 07-Jun-2007, 15:44   #140
Geo
Mostly Harmless
 
Join Date: Apr 2002
Location: Uffda-land
Posts: 9,156
Default

Quote:
Originally Posted by Jawed View Post
I wish AMD would just post the complete set of slides.

Anyway, that doesn't show the ring bus at all, so I prolly should still have a go at a more detailed diagram.

Jawed
We have Eric's architecture deep-dive from Tunis. We also have a long list of interview questions in to Eric. Hopefully these things get published together...
__________________
"We'll thrash them --absolutely thrash them."--Richard Huddy on Larrabee
"Our multi-decade old 3D graphics rendering architecture that's based on a rasterization approach is no longer scalable and suitable for the demands of the future." --Pat Gelsinger, Intel
". . .its taking us longer than we would have liked to get a [Crossfire game] profiling system out there" --Terry Makedon, ATI, July 2006
"Christ, this is Beyond3D; just get rid of any f**ker talking about patterned chihuahuas! Can the dog write GLSL? No. Then it can f**k off." --Da Boss
Old 07-Jun-2007, 15:45   #141
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,863
Default

Thanks Mike, that's great. That'll keep me busy for a while!

There's an admittedly vague die picture for those who like pretty pix.

Jawed
Old 07-Jun-2007, 16:00   #142
3dilettante
Senior Member
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,093
Default

I'm kind of confused as to why GPU makers insist on die shots where the innards are obscured by the pinout. Don't they use the same kind of packaging scheme?

CPU makers seem to have no issue with showing higher-res, clearer die shots. They sometimes even go out of their way to draw borders around functional units.
__________________
Dreaming of a .065 micron etch-a-sketch.
Old 07-Jun-2007, 16:04   #143
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,863
Default

Xenos and R520 die shots are comparatively clear...

Jawed
Old 07-Jun-2007, 16:09   #144
3dilettante
Senior Member
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,093
Default

I haven't seen any hi-res die shots of those cores. I haven't looked too hard for R520 shots, but I don't recall seeing a good pic of Xenos either.

Is there a presentation or article I missed?
__________________
Dreaming of a .065 micron etch-a-sketch.
Old 07-Jun-2007, 16:25   #145
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,863
Default

There's nothing hi-res I'm afraid. The R520 review here has a die shot, I believe. Not available right now.

Xenos die shot is out there somewhere, can't think where. Mostly I have these things on disk, not URLs (as the latter have a habit of disappearing)...

Jawed
Old 07-Jun-2007, 17:50   #146
BrynS
Member
 
Join Date: Jul 2003
Posts: 406
Default

Apologies for continuing the OT, but P.29 from the Stanford presentation:
Quote:
Originally Posted by ATI Radeon HD 2000 Series Architecture Overview by Eric Demers
What comes next?
[...]
-Tons of tuning for current architecture!
[...]
Old 07-Jun-2007, 17:59   #147
nAo
Nutella Nutellae
 
Join Date: Feb 2002
Location: San Francisco
Posts: 4,297
Default

Not really surprising; it holds for any new architecture.
__________________
More samples, we need more samples! [Dean Calver]
The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way