Old 28-Oct-2006, 05:52   #2051
INKster
Senior Member
 
Join Date: Apr 2006
Location: Io, lava pit number 12
Posts: 2,108
Default

Quote:
Originally Posted by lopri View Post
Brilliant.
Well, currently SLI only works on a single display, and it's a true limitation.

There is another one.
Cards like the 7950 GX2, standard SLI, or Crossfire have shown us that it's the lack of a large installed base of ultra-high-resolution displays that limits both the visibility of the performance gains and the sales of multi-display-capable hardware (not just graphics cards, but also motherboards with compliant chipsets, often from the same brand as the GPU, obviously).
Nvidia never refrained from marketing the 7900/7950 GX2 products for the ultra-high-definition (UHD) crowd only.
What's the next step after that, knowing that the 30" form factor seems to have been the limit in the PC arena for some time now? Go multi. What else?


Of course, only a very narrow section of consumers can afford a 24, 27 or even 30 inch display in addition to the rest of the high-end hardware, and even those are still limited by the single-display issue.
So, it's reasonable to assume that, say, two side-by-side widescreen 20" LCDs with 1680 x 1050 native resolution each (3360 x 1050 combined) are much cheaper for the average high-end gamer than a single Apple/Dell 30" 2560 x 1600 LCD, and perhaps even more flexible (you can always game on one and use the other for something else, especially now in the multi-core CPU age).
Multi-display support would instantly remove a bottleneck for current multi-GPU setups and grow the appeal of the platform beyond just "the usual suspects".


Finally, the two-way SLI on 8800 GTX seems to suggest a kind of daisy chaining of graphics cards, one that would only make sense if more than two cards were present in a single system.
Since that would be costly, and the GTS doesn't have the two connectors, 3+ cards is almost a requirement when you need the power to drive one, two or more UHD displays while gaming with DX10-quality games, and for that you need the best there is (unless gaming is not important and you do like Apple, with 7300 GTs driving 30 inchers...).


What do you guys think ?

Last edited by INKster; 28-Oct-2006 at 06:07.
INKster is offline  
Old 28-Oct-2006, 13:04   #2052
Rangers
Regular
 
Join Date: Aug 2006
Posts: 9,269
Default

Quote:
Originally Posted by INKster View Post
Well, currently SLI only works on a single display, and it's a true limitation.

There is another one.
Cards like the 7950 GX2, standard SLI, or Crossfire have shown us that it's the lack of a large installed base of ultra-high-resolution displays that limits both the visibility of the performance gains and the sales of multi-display-capable hardware (not just graphics cards, but also motherboards with compliant chipsets, often from the same brand as the GPU, obviously).
Nvidia never refrained from marketing the 7900/7950 GX2 products for the ultra-high-definition (UHD) crowd only.
What's the next step after that, knowing that the 30" form factor seems to have been the limit in the PC arena for some time now? Go multi. What else?


Of course, only a very narrow section of consumers can afford a 24, 27 or even 30 inch display in addition to the rest of the high-end hardware, and even those are still limited by the single-display issue.
So, it's reasonable to assume that, say, two side-by-side widescreen 20" LCDs with 1680 x 1050 native resolution each (3360 x 1050 combined) are much cheaper for the average high-end gamer than a single Apple/Dell 30" 2560 x 1600 LCD, and perhaps even more flexible (you can always game on one and use the other for something else, especially now in the multi-core CPU age).
Multi-display support would instantly remove a bottleneck for current multi-GPU setups and grow the appeal of the platform beyond just "the usual suspects".


Finally, the two-way SLI on 8800 GTX seems to suggest a kind of daisy chaining of graphics cards, one that would only make sense if more than two cards were present in a single system.
Since that would be costly, and the GTS doesn't have the two connectors, 3+ cards is almost a requirement when you need the power to drive one, two or more UHD displays while gaming with DX10-quality games, and for that you need the best there is (unless gaming is not important and you do like Apple, with 7300 GTs driving 30 inchers...).


What do you guys think ?
It makes sense because it's obvious quad SLI by way of GX2 style is dead with these cards. They are too hot and power hungry to fit two on one PCB (even the G71, a remarkably mild chip for its power, had to be downclocked for GX2).

So the only way to get Nvidia's precious quad SLI is 4 physical cards.

Ugh.
Rangers is offline  
Old 28-Oct-2006, 14:04   #2053
Demirug
Senior Member
 
Join Date: Dec 2002
Posts: 1,326
Default

Looks like the German PC Games Hardware magazine (print, not online) had a different NDA than everyone else. The current issue (delivered today) includes an official G80 preview. Only tech stuff, no benchmarks.
__________________
GPU blog
Demirug is offline  
Old 28-Oct-2006, 14:15   #2054
LeStoffer
Senior Member
 
Join Date: Feb 2002
Location: Land of the 25% VAT
Posts: 1,247
Default

Quote:
Originally Posted by Demirug View Post
The current issue (delivered today) includes an official G80 preview. Only tech stuff no benchmarks.
Does anybody mind posting a short rundown for the "tech stuff" from PC Games Hardware then?
__________________
Best regards, LeStoffer

"We are all agreed that your theory is crazy. The question that divides us is whether it is crazy enough to have a chance of being correct." Niels Bohr
LeStoffer is offline  
Old 28-Oct-2006, 14:17   #2055
christoph
Member
 
Join Date: Feb 2002
Posts: 148
Default

from 3dcenter:

Quote:
All info is summarized from the PCGH special 'Geforce 8800', issue 12/2006, p. 54ff.

Here are a few facts:
- 681 million transistors
- new ROPs with FP32 + MSAA
- 128 stream processors at 1350 MHz (1 scalar per clock) = 518.4 GFLOP/s
- angle-independent AF (WUAF)
- better dynamic branching
- better early Z

Streaming processors:
Until now, GPUs (vector processors) could compute one RGBA value at a time. But since all four components are not always needed, the ALUs no longer have to be split 3:1 or 2:2. This increases efficiency. A scalar is in principle a single-channel vector. The shader compiler therefore has an easier job.
For cache-management reasons, the 128 streaming processors are grouped into blocks of 16 ALUs. Each of these ALUs can execute one MADD and one MUL per clock.
Comparing pure MADD throughput alone gives 345.6 GFLOP/s.

Texture units:
Eight bilinear texture units are integrated, but only 4 texture address processors per group. In total, 64 bilinear texels per clock can therefore be computed, but only 32 addressed. The remaining TMUs can sample additional data via a coordinate offset if at least one dimension matches the first (e.g. 2:1 bilinear AF).
The TMUs have their own address unit; the pixel shader no longer has to be used. -> TMU latency is mitigated.

ROPs:
There are 6 ROP clusters; each cluster is connected via 64 bits at 900 MHz. Hence the 384 bits.
The GTS has one cluster fewer. Four pixels with color and texture info per ROP.
Z and stencil ops have been tuned further.

AA:
There will be 4x/8x/16x MSAA.
New is an optional 'Coverage Sampling AA'. This is a compression technique which, however, is said not to be compatible in every case. For the 16 randomly distributed subpixels, coverage is stored as a Boolean value (true/false). The result is compressed to 4x or 8x MSAA.
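
The two throughput figures in the quote can be sanity-checked with a few lines of Python. This is only a sketch: it assumes the usual peak-rate convention that a MADD counts as 2 flops and a MUL as 1, which the excerpt does not spell out, and the function name is made up for illustration.

```python
# Back-of-the-envelope check of the quoted shader throughput figures.
# Assumption (not stated in the excerpt): MADD = 2 flops, MUL = 1 flop.

STREAM_PROCESSORS = 128
SHADER_CLOCK_MHZ = 1350
FLOPS_PER_MADD = 2
FLOPS_PER_MUL = 1


def peak_gflops(units: int, clock_mhz: float, flops_per_clock: int) -> float:
    """Peak rate in GFLOP/s for `units` ALUs that issue every clock."""
    return units * clock_mhz * flops_per_clock / 1000.0


madd_plus_mul = peak_gflops(
    STREAM_PROCESSORS, SHADER_CLOCK_MHZ, FLOPS_PER_MADD + FLOPS_PER_MUL
)
madd_only = peak_gflops(STREAM_PROCESSORS, SHADER_CLOCK_MHZ, FLOPS_PER_MADD)

print(f"MADD+MUL : {madd_plus_mul:.1f} GFLOP/s")
print(f"MADD only: {madd_only:.1f} GFLOP/s")
```

Under that convention the two results line up with the 518.4 and 345.6 GFLOP/s given in the quote.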

Last edited by christoph; 28-Oct-2006 at 14:20.
christoph is offline  
Old 28-Oct-2006, 15:28   #2056
Chalnoth
 
Join Date: May 2002
Location: New York, NY
Posts: 12,681
Default

Quote:
Originally Posted by Rangers View Post
It makes sense because it's obvious quad SLI by way of GX2 style is dead with these cards. They are too hot and power hungry to fit two on one PCB (even the G71, a remarkably mild chip for its power, had to be downclocked for GX2).

So the only way to get Nvidia's precious quad SLI is 4 physical cards.

Ugh.
Expect the power and heat to go down significantly once nVidia puts out its next set of high-end parts based upon this architecture, as they should be based upon a die shrink of the soon-to-be-released GPUs. The mid-high chip may well suit a GX2-style part (certainly not the highest-end chip).
Chalnoth is offline  
Old 28-Oct-2006, 15:33   #2057
trinibwoy
Meh
 
Join Date: Mar 2004
Location: New York
Posts: 9,981
Default

Cool.

128 scalar MADD+MUL - 518 GFLOPS
Groups of 16
4 Texture Address calculators per group - 32 total
8 TMUs per group - 64 total
6 ROP clusters (how many ROPs in a cluster? One?) / 4 samples/clock (minimum 4xAA ?) Double Z/stencil?
New Coverage Sampling AA not universally compatible
__________________
What the deuce!?

Last edited by trinibwoy; 28-Oct-2006 at 15:42.
trinibwoy is offline  
Old 28-Oct-2006, 16:17   #2058
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,952
Default

Quote:
Originally Posted by trinibwoy View Post
128 scalar MADD+MUL - 518 GFLOPS
LOL, back to NV40?!!! style "asymmetric" ALUs "per pipe".

My head hurts. I can't think what the best way of scheduling that is! Utilisation isn't necessarily very good there, at all. What the scalar gives, the dual-issue takes away. ARGH.

Quote:
Groups of 16
So, minimum of 16 granularity for fragment batches? I wonder about vertices, 16 too?

It'll be interesting to see the difference in performance that branching granularity of, say, 16 in G80 versus 64 in R600 makes...

Quote:
4 Texture Address calculators per group - 32 total
So, is that the Multifunction Interpolators?

Quote:
8 TMUs per group - 64 total
8 TMUs running at ~half the speed of 4 streaming processors (the bit that does texture address calculation) - I think this was the consensus already based on specs.

Quote:
6 ROP clusters (how many ROPs in a cluster? One?) / 4 samples/clock (minimum 4xAA ?) Double Z/stencil?
I dare say it would make sense that this is a single decoupled functional block. I would guess each "cluster" is 4 ROPs, so a 24 ROP design?

Quote:
New Coverage Sampling AA not universally compatible
Oh dear, how can that be? ARGH

64 TMUs seems like an awful lot, considering the bandwidth available is only 86GB/s. R580's 16 TMUs happily use 60GB/s. The only way to explain this, I guess, is if the 64 TMUs are designed for full-speed FP16 texture filtering, say, and half-speed FP32 texture filtering.
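
To put the bandwidth worry in numbers, here is a rough sketch. The 384-bit bus and 900 MHz memory clock come from the PCGH summary and the 575 MHz core clock from the texel-rate figure discussed in this thread. The double-data-rate factor of 2 and the 650 MHz R580 core clock are assumptions made for this sketch, and the helper names are invented for illustration.

```python
# Rough bandwidth-per-TMU comparison. Assumptions are marked as such.

def bandwidth_gb_s(bus_bits: int, mem_clock_mhz: float,
                   transfers_per_clock: int = 2) -> float:
    """Peak memory bandwidth in GB/s (assumes double-data-rate memory)."""
    return bus_bits / 8 * mem_clock_mhz * transfers_per_clock / 1000.0


def bytes_per_tmu_clock(bw_gb_s: float, tmus: int, core_clock_mhz: float) -> float:
    """Bytes of memory bandwidth available per TMU per core clock."""
    return bw_gb_s * 1000.0 / (tmus * core_clock_mhz)


g80_bw = bandwidth_gb_s(bus_bits=6 * 64, mem_clock_mhz=900)   # 6 ROP clusters x 64 bit
g80_bytes = bytes_per_tmu_clock(g80_bw, tmus=64, core_clock_mhz=575)

R580_BW_GB_S = 60.0          # figure used in the post above
R580_CORE_CLOCK_MHZ = 650    # assumption, not taken from this thread
r580_bytes = bytes_per_tmu_clock(R580_BW_GB_S, tmus=16,
                                 core_clock_mhz=R580_CORE_CLOCK_MHZ)

print(f"G80 : {g80_bw:.1f} GB/s, {g80_bytes:.2f} bytes per TMU per clock")
print(f"R580: {R580_BW_GB_S:.1f} GB/s, {r580_bytes:.2f} bytes per TMU per clock")
```

Under these assumptions G80 has well under half the bandwidth per TMU-clock that R580 has, which is the mismatch the post is pointing at.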

Jawed
Jawed is offline  
Old 28-Oct-2006, 16:38   #2059
Demirug
Senior Member
 
Join Date: Dec 2002
Posts: 1,326
Default

Quote:
Originally Posted by Jawed View Post
LOL, back to NV40?!!! style "asymmetric" ALUs "per pipe".

My head hurts. I can't think what the best way of scheduling that is! Utilisation isn't necessarily very good there, at all. What the scalar gives, the dual-issue takes away. ARGH.
Well you shouldn’t ignore the high clock rate of the stream processors and that they are no longer tied to the texture processing.

Quote:
Originally Posted by Jawed View Post
So, is that the Multifunction Interpolators?
No. That's the part that calculates the position and weight of the samples from the texture coordinates.
__________________
GPU blog
Demirug is offline  
Old 28-Oct-2006, 17:00   #2060
dnavas
Member
 
Join Date: Apr 2004
Posts: 349
Default

Quote:
Originally Posted by Jawed View Post
LOL, back to NV40?!!! style "asymmetric" ALUs "per pipe".

My head hurts. I can't think what the best way of scheduling that is! Utilisation isn't necessarily very good there, at all. What the scalar gives, the dual-issue takes away. ARGH.
The ALUs don't seem that impressive, indeed.

Quote:
So, is that the Multifunction Interpolators?
4 interpolators per group? Not sure. Are we sure that neither the MADD nor the MUL are more capable units?

Quote:
8 TMUs running at ~half the speed of 4 streaming processors (the bit that does texture address calculation) - I think this was the consensus already based on specs.
Modulo 64 vs 48 TMUs, which is a bit of a shock. This sounds like a texturing monster. One thing Trini doesn't share is that, despite the clock differences, apparently you don't get full use of the TMUs until the results of the addressing can be reused as in 2xAF (if I'm reading that translation correctly).

Quote:
I dare say it would make sense that this is a single decoupled functional block. I would guess each "cluster" is 4 ROPs, so a 24 ROP design?
Doesn't sound programmable. <sigh>

Quote:
64 TMUs seems like an awful lot
Yeah -- this looks like where the transistor budget went. As my interest was mainly in a beefy ALU section, I'm a bit disappointed, but, I daresay most gamers should be quite happy with this.
dnavas is offline  
Old 28-Oct-2006, 17:16   #2061
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,952
Default

Quote:
Originally Posted by dnavas View Post
The ALUs don't seem that impressive, indeed.
I'm tempted to take my silly code, above, and re-sequence it through a MAD+MUL+MI configuration! That should be fun. It'll be interesting to see how much faster it ends up...

Quote:
4 interpolators per group? Not sure. Are we sure that neither the MADD nor the MUL are more capable units?
Maybe NVidia put in the MUL to make the MI units (in special-function mode, e.g. SIN or EXP) run faster. I think SF processing may be dependent on "pre-processing", e.g. a MUL needs to be performed before the SF can be performed. I'm not sure. Or maybe it's a MAD that needs to be done first, and the MUL is there to pick up some slack.

Quote:
Modulo 64 vs 48 TMUs,
Don't understand what you're saying there.

Quote:
which is a bit of a shock. This sounds like a texturing monster. One thing Trini doesn't share is that, despite the clock differences, apparently you don't get full use of the TMUs until the results of the addressing can be reused as in 2xAF (if I'm reading that translation correctly).
I'm thoroughly confused by this whole texture address thing. I think it's best I just read and not talk about it.

Quote:
Yeah -- this looks like where the transistor budget went. As my interest was mainly in a beefy ALU section, I'm a bit disappointed, but, I daresay most gamers should be quite happy with this.
ALU:TEX ratio does seem strangely low. But the fp16 and fp32 texturing requirements of D3D10 skew that significantly. And it's not as if >500GFLOPs is anything to sneer at, R600 will prolly be in the same ballpark.

Not to forget performant dynamic branching.

So, now we just need to find out more detail about scheduling/load-balancing. Demirug chuckles, knowing that there's a whole load more subtlety in there.

I'm a bit miffed that no specific mention of VB/CB type functionality/performance or streamout performance/architecture has come up explicitly in any rumours.

Jawed
Jawed is offline  
Old 28-Oct-2006, 17:22   #2062
Demirug
Senior Member
 
Join Date: Dec 2002
Posts: 1,326
Default

Quote:
Originally Posted by Jawed View Post
ALU:TEX ratio does seem strangely low.

Jawed
Calculate again and don’t forget to consider different clock domains.
__________________
GPU blog
Demirug is offline  
Old 28-Oct-2006, 18:05   #2063
Prometheus
Junior Member
 
Join Date: Jul 2002
Location: Greece
Posts: 97
Icon Smile

I have a feeling the Beyond3D G80 review will be one of the longest (in pages) ever.

Last edited by Prometheus; 28-Oct-2006 at 18:07.
Prometheus is offline  
Old 28-Oct-2006, 18:08   #2064
dnavas
Member
 
Join Date: Apr 2004
Posts: 349
Default

Quote:
Originally Posted by Jawed View Post
I'm tempted to take my silly code, above, and re-sequence it through a MAD+MUL+MI configuration! That should be fun.
You've got a bit of a masochistic streak in you, don't you?

Quote:
It'll be interesting to see how much faster it ends up...
Now, on that I can agree

Quote:
Maybe NVidia put in the MUL to make the MI units (in special-function mode, e.g. SIN or EXP) run faster. I think SF processing may be dependent on "pre-processing", e.g. a MUL needs to be performed before the SF can be performed. I'm not sure. Or maybe it's a MAD that needs to be done first, and the MUL is there to pick up some slack.
Well, it seems most logical to put the MI in the lightweight MUL unit for utilization reasons. It may well be somewhere entirely different, though. I haven't got a clue.

Quote:
Don't understand what you're saying there.
The rumors had been 48 TMUs, not 64. At least, last time I was paying attention. So, I was expecting 48 total, not 64. Even with 48 I was growing concerned that that was going to be a fair number of transistors. At 64....

Quote:
I'm thoroughly confused by this whole texture address thing. I think it's best I just read and not talk about it.
Well, my assumption had been, basically, 48/48/32, where the group of 32 was broken up into TMU-ROP dedicated units. It's not clear that that's what exists here, though. The address units appear to be separated, and (presumably) running at core clock speeds. When aniso (?? "2:1 Bi-AF") is used, at least one dimension can be reused, and so it can feed all TMUs. (I'm not sure I understand that, but I don't have a good feel for aniso-math or what the addressing really does-- I had thought that 'correct' aniso required sqrt <shrug> I'm surprised there isn't a trilinear optimization -- aren't the two bilerps just a shift away from each other?) If addressing is insufficient, the TMUs can steal cycles from the shaders. Presumably this would not be required too frequently!

Quote:
ALU:TEX ratio does seem strangely low. But the fp16 and fp32 texturing requirements of D3D10 skew that significantly. And it's not as if >500GFLOPs is anything to sneer at, R600 will prolly be in the same ballpark.
1350 * 128 : 64 * 575 = 4.7:1, no? Of course, that isn't comparing Vec4:TEX.... And how to account for the "half"/MUL unit is unclear.
...And Demi is right that, for the most part, TMUs aren't intruding on shaders. However, there was ... something undeniably sexy about a teraflop of programmable logic.
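
For anyone who wants to redo the clock-domain arithmetic, a minimal sketch follows. It uses only the unit counts and clocks quoted in this thread; the function name is made up, and it counts one scalar op per stream processor per clock, ignoring the extra MUL.

```python
# ALU:TEX ratio with and without accounting for the separate clock domains.

def alu_tex_ratio(alus: int, alu_clock_mhz: float,
                  tmus: int, tmu_clock_mhz: float) -> float:
    """Scalar ALU ops per second divided by bilinear texels per second."""
    return (alus * alu_clock_mhz) / (tmus * tmu_clock_mhz)


naive_ratio = 128 / 64                           # unit counts only
clocked_ratio = alu_tex_ratio(128, 1350, 64, 575)

print(f"by unit count  : {naive_ratio:.1f}:1")
print(f"by clock domain: {clocked_ratio:.1f}:1")
```

The clock-adjusted figure rounds to the 4.7:1 above, more than twice what the raw unit counts suggest.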

Quote:
So, now we just need to find out more detail about scheduling/load-balancing. Demirug chuckles, knowing that there's a whole load more subtlety in there.
Well, if the patent is accurate, they appear to have come up with a degree of elegance there. To simplify, the "TCP" has a single, externally pokable workload unit acceptor. The TCP can take this and move it into one of the stream units. So, you iterate through your pixel and vertex inputs, shove the units into any TCPs not already full, while the TCP is simultaneously shifting that work out. Pixels are tiled, vastly simplifying dependency checking and increasing local cache re-use. I'm going to go back to the patent and print that puppy out, though, so I can flesh out the details. This is a vast simplification of what is really going on, and therefore, in reality, *wrong*. IIRC, there's a cache of "live work", a set of threads which accept work (not st.units), a unit that checks to make sure that all inputs are available, etc....

Quote:
I'm a bit miffed that no specific mention of VB/CB type functionality/performance or streamout performance/architecture has come up explicitly in any rumours.
I miss the prog.ROP. Looks like there's plenty of work still left to be done in the future.

-Dave
dnavas is offline  
Old 28-Oct-2006, 18:32   #2065
trinibwoy
Meh
 
Join Date: Mar 2004
Location: New York
Posts: 9,981
Default

Quote:
Originally Posted by dnavas View Post
Modulo 64 vs 48 TMUs, which is a bit of a shock. This sounds like a texturing monster. One thing Trini doesn't share is that, despite the clock differences, apparently you don't get full use of the TMUs until the results of the addressing can be reused as in 2xAF (if I'm reading that translation correctly).
Yep, they mention that half the TMUs will have to piggyback off of the texture address calcs. Will be interesting to see if this is workable on other things besides AF. And we've been expecting 64 TMUs for a couple of days now - 36.8 GT/s = 575 MHz * 64.
__________________
What the deuce!?

Last edited by trinibwoy; 28-Oct-2006 at 18:40.
trinibwoy is offline  
Old 28-Oct-2006, 18:39   #2066
trinibwoy
Meh
 
Join Date: Mar 2004
Location: New York
Posts: 9,981
Default

Quote:
Originally Posted by Demirug View Post
Calculate again and don’t forget to consider different clock domains.
That might work if you just look at ALU count, but if you look at pixel shader arithmetic to texturing capability it doesn't look that impressive.

G71: 249.6 GFLOPS, 15.6 GT/s ~ 16:1
G80: 518.4 GFLOPS, 36.8 GT/s ~ 14:1

Granted, G80 flops are probably much more useful flops.
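
The two ratios are easy to reproduce. A small sketch, using only the GFLOPS and GT/s figures quoted in this post plus the 64 TMUs at 575 MHz mentioned earlier in the thread; the helper name is invented for illustration.

```python
# Flops available per bilinear texel, from the figures quoted above.

def flops_per_texel(gflops: float, gtexels_per_s: float) -> float:
    """Peak shader flops divided by peak bilinear texel rate."""
    return gflops / gtexels_per_s


g80_texel_rate = 64 * 575 / 1000.0        # GT/s from 64 TMUs at 575 MHz

g71_ratio = flops_per_texel(249.6, 15.6)
g80_ratio = flops_per_texel(518.4, g80_texel_rate)

print(f"G80 texel rate: {g80_texel_rate:.1f} GT/s")
print(f"G71: {g71_ratio:.1f} flops per texel")
print(f"G80: {g80_ratio:.1f} flops per texel")
```

On these peak numbers the flops-per-texel ratio drops slightly from G71 to G80 even though both totals roughly double.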
__________________
What the deuce!?
trinibwoy is offline  
Old 28-Oct-2006, 18:39   #2067
Xmas
Rebel
 
Join Date: Feb 2002
Location: On the pursuit of happiness
Posts: 3,066
Default

Quote:
Originally Posted by Jawed View Post
LOL, back to NV40?!!! style "asymmetric" ALUs "per pipe".

My head hurts. I can't think what the best way of scheduling that is! Utilisation isn't necessarily very good there, at all. What the scalar gives, the dual-issue takes away. ARGH.
What would be so difficult about scheduling with such an arrangement?

As for utilization, it just depends on the types of operations a shader needs. If you have a shader pipeline with a single MADD, would you say utilization is good for a shader that does only ADDs, even though a large part of the ALU sits idle? ATI had an ADD+MADD arrangement for some years now, while NVidia seems to favor MUL performance. Both have their advantages, and some operations (i.e. LERP) can be done in different ways to suit each arrangement better. More MULs means better dot product performance. We also don't know how special functions might take advantage of these units.
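
As an illustration of the LERP point, here are two algebraically equivalent formulations, one mapped onto an ADD followed by a MADD and one onto a MUL followed by a MADD. This is a sketch only: the helper names are made up, the second form assumes `1 - t` is already available (for example as a constant), and whether a real shader compiler picks either form is not something this thread establishes.

```python
# Two ways to compute lerp(x, y, t) = x + t * (y - x), each expressed in
# terms of the operations one ALU arrangement would issue.

def madd(a: float, b: float, c: float) -> float:
    """Multiply-add, modelled as a single ALU operation: a * b + c."""
    return a * b + c


def lerp_add_madd(x: float, y: float, t: float) -> float:
    """ADD then MADD: d = y - x, result = t * d + x."""
    d = y - x                    # ADD unit
    return madd(t, d, x)         # MADD unit


def lerp_mul_madd(x: float, y: float, t: float, one_minus_t: float) -> float:
    """MUL then MADD: m = x * (1 - t), result = t * y + m.

    Assumes one_minus_t was computed elsewhere (hypothetical precomputed input).
    """
    m = x * one_minus_t          # MUL unit
    return madd(t, y, m)         # MADD unit


a = lerp_add_madd(2.0, 10.0, 0.25)
b = lerp_mul_madd(2.0, 10.0, 0.25, one_minus_t=0.75)
print(a, b)
```

Both functions return the same value for the same inputs; the only difference is which pair of units does the work.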

Quote:
64 TMUs seems like an awful lot, considering the bandwidth available is only 86GB/s. R580's 16 TMUs happily use 60GB/s. The only way to explain this, I guess, is if the 64 TMUs are designed for full-speed FP16 texture filtering, say, and half-speed FP32 texture filtering.
Instead of 64 "full" TMUs, it seems more like 8 quad-TMUs that can generate two bilinear samples per clock from the same texture, under certain circumstances.
Xmas is offline  
Old 28-Oct-2006, 18:41   #2068
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,952
Default

OK, "counting" ALUs.

128*(MAD+MUL) = 256

Assuming that the sequential scalar scheduling generates a 33% gain in "equivalent" performance (compared to traditional Vec4 ALUs), so that's equivalent to 341 ALUs.

Compared against the TMU clock, 1350MHz is 2.3x, so that's 784 ALUs per TMU clock. But there's 64 TMUs, so that's 12.25 ALUs per TMU.

So now you enter an argument about the effective number of scalar ops per fragment, averaged over the length of a typical shader. Say it's 3 (RGB). That gets us to a 4:1 ALU:TEX ratio, with high utilisation.

Is the utilisation higher than the nominal 6:1 we see in R580 (counting 48*(MAD+ADD) )? G80's per clock utilisation should be higher (due to serial scalar) but the MAD+MUL/MAD+ADD arrangement of both seems like a bit of a handicap in real code.

And that's not taking account of other differences between the architectures:
  • special function co-issue (part of Vec4 in R580, separate in G80)
  • dedicated texture address calculation in R580
Fingers-crossed I haven't slipped-up somewhere.
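
The counting above can be replayed in code. This is a sketch of the same back-of-the-envelope estimate, not a measurement: the 33% scalar gain and the 3 scalar ops per fragment are the post's own guesses, and the function name is invented. It is run once with the rounded 2.3x clock ratio used in the post and once with the unrounded 1350/575.

```python
# Replaying the "equivalent ALUs per TMU" estimate from the post above.

def equivalent_alus_per_tmu(stream_processors: int, ops_per_sp: int,
                            scalar_gain: float, clock_ratio: float,
                            tmus: int) -> float:
    """Scalar-equivalent ALU ops available per TMU per TMU clock."""
    equivalent_alus = stream_processors * ops_per_sp * scalar_gain
    return equivalent_alus * clock_ratio / tmus


SCALAR_GAIN = 4 / 3          # the post's assumed 33% gain from scalar issue
SCALAR_OPS_PER_FRAGMENT = 3  # the post's assumed average (RGB)

rounded = equivalent_alus_per_tmu(128, 2, SCALAR_GAIN, 2.3, 64)
unrounded = equivalent_alus_per_tmu(128, 2, SCALAR_GAIN, 1350 / 575, 64)

print(f"per TMU (2.3x clock ratio)  : {rounded:.2f}")
print(f"per TMU (exact clock ratio) : {unrounded:.2f}")
print(f"ALU:TEX at 3 ops per fragment: {rounded / SCALAR_OPS_PER_FRAGMENT:.2f}"
      f" to {unrounded / SCALAR_OPS_PER_FRAGMENT:.2f}")
```

Either way the result stays close to 12 scalar-equivalent ops per TMU and an ALU:TEX ratio of roughly 4:1, so the intermediate rounding in the post does not change the conclusion.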

Jawed

Last edited by Jawed; 28-Oct-2006 at 18:47. Reason: whoops wrong second ALU in R580
Jawed is offline  
Old 28-Oct-2006, 18:46   #2069
nAo
Nutella Nutellae
 
Join Date: Feb 2002
Location: San Francisco
Posts: 4,323
Default

I don't think it's like the NV4x/G7x ALU configuration, because in this case the ALU clock is so high (while being fabricated on the same process) that it makes me think the pipeline has to be shorter, so I vote for a MADD and a MUL unit working in parallel, not serially.
__________________
[twitter]
More samples, we need more samples! [Dean Calver]
First they ignore you, then they laugh at you, then they fight you, then you win. [Mahatma Gandhi]
The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way
nAo is offline  
Old 28-Oct-2006, 18:50   #2070
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,952
Default

Quote:
Originally Posted by nAo View Post
I don't think it's like the NV4x/G7x ALU configuration, because in this case the ALU clock is so high (while being fabricated on the same process) that it makes me think the pipeline has to be shorter, so I vote for a MADD and a MUL unit working in parallel, not serially.
I agree. I'm about to launch into making another pipeline pizza diagram, this time with MAD+MUL side by side...

Jawed
Jawed is offline  
Old 28-Oct-2006, 18:54   #2071
trinibwoy
Meh
 
Join Date: Mar 2004
Location: New York
Posts: 9,981
Default

Quote:
Originally Posted by Jawed View Post
dedicated texture address calculation in R580
Based on that 3dcenter excerpt, the texture address calc on G80 seems to be dedicated as well.
__________________
What the deuce!?
trinibwoy is offline  
Old 28-Oct-2006, 18:56   #2072
trinibwoy
Meh
 
Join Date: Mar 2004
Location: New York
Posts: 9,981
Default

Quote:
Originally Posted by nAo View Post
I don't think it's like the NV4x/G7x ALU configuration, because in this case the ALU clock is so high (while being fabricated on the same process) that it makes me think the pipeline has to be shorter, so I vote for a MADD and a MUL unit working in parallel, not serially.
Isn't a shorter pipeline usually indicative of a lower clock?
__________________
What the deuce!?
trinibwoy is offline  
Old 28-Oct-2006, 19:02   #2073
RoOoBo
Member
 
Join Date: Jun 2002
Posts: 305
Default

Quote:
Originally Posted by nAo View Post
I don't think it's like the NV4x/G7x ALU configuration, because in this case the ALU clock is so high (while being fabricated on the same process) that it makes me think the pipeline has to be shorter, so I vote for a MADD and a MUL unit working in parallel, not serially.
Well, it would depend on what's better for the way they schedule instructions from the same or different threads. Given enough threads and a fast enough thread switch mechanism, the latency of the pipeline wouldn't matter a bit. Until now ATI and NVidia seem to have liked the option of initiating two dependent instructions in a single cycle (that's the purpose of arranging the ALUs in cascade). In fact that made sense because relatively short shaders were still frequent until now (no chance of finding independent instructions to issue) and because there is a limited set of operation combinations (MADD + ADD or MADD + MUL) that seem to be relatively frequent in the kind of computations that are performed in the fragment shader.

But that may have changed for G80 and now it's better to use what is the normal implementation in general purpose processors: parallel ALUs AKA superscalar architectures.
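
The claim that enough threads make pipeline latency irrelevant can be shown with a toy model. This is a deliberately simplified sketch, not a description of G80: every instruction in a thread is assumed to depend on the previous one, the pipeline accepts one instruction per cycle, thread switching is free, and all names and numbers are made up for illustration.

```python
# Toy latency-hiding model: one issue slot per cycle, each thread may issue
# its next (dependent) instruction only after the previous one has left
# a pipeline that is `latency` cycles deep.

def issue_utilisation(threads: int, latency: int, cycles: int = 10_000) -> float:
    """Fraction of cycles in which some thread could issue an instruction."""
    ready_at = [0] * threads     # first cycle at which each thread may issue
    issued = 0
    for cycle in range(cycles):
        for t in range(threads):
            if ready_at[t] <= cycle:
                ready_at[t] = cycle + latency   # result available after `latency`
                issued += 1
                break                           # only one issue per cycle
    return issued / cycles


for n in (1, 4, 8, 16):
    print(f"{n:2d} threads, latency 8: utilisation {issue_utilisation(n, 8):.3f}")
```

In this model utilisation grows with the thread count until the number of threads matches the pipeline latency, after which the pipeline is always busy; a longer pipeline simply needs more threads in flight to reach that point.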
RoOoBo is offline  
Old 28-Oct-2006, 19:05   #2074
nAo
Nutella Nutellae
 
Join Date: Feb 2002
Location: San Francisco
Posts: 4,323
Default

Quote:
Originally Posted by trinibwoy View Post
Isn't a shorter pipeline usually indicative of a lower clock?
If in N stages they could execute 2 dependent MADDs (G7x), I'm sure that in M<N stages they can do an independent MADD and MUL at a higher clock (not taking the register bandwidth needed to feed these units into consideration at all).
__________________
[twitter]
More samples, we need more samples! [Dean Calver]
First they ignore you, then they laugh at you, then they fight you, then you win. [Mahatma Gandhi]
The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way
nAo is offline  
Old 28-Oct-2006, 19:06   #2075
RoOoBo
Member
 
Join Date: Jun 2002
Posts: 305
Default

Quote:
Originally Posted by trinibwoy View Post
Isn't a shorter pipeline usually indicative of a lower clock?
He means that if the pipeline is too long the processor can be stalled more frequently because of dependent instructions. Or that, to counter the longer pipeline, you need to increase the thread length (batch size) or implement a more expensive thread scheduler. The ALU pipeline of the G80, if it were arranged as in the NV40 or G70, would have to be quite a bit longer to support a 1.3 GHz clock.
RoOoBo is offline  

 
