radeonic2 said: 3dmark
I never meant to imply that you don't need anything faster than an X1800XL.
Ya, that's great and all, but how about you play some real games at 1600x1200 with 4x FSAA and 16x AF and tell us that we don't need anything faster than an X1800XL.
Turtle 1 said: I never meant to imply that you don't need anything faster than an X1800XL.
Vista compliant
I have a problem with screwing people over. Fact is, I was going to quit building PCs until after the Vista release.
When you sell someone a $5,000 water-cooled PC and it's outdated in a year, that's a bad thing.
So what I have done is start selling Vista-compliant PCs now. What the customer gets from me when Vista is released is an R600, already paid for, plus an HD DVD or Blu-ray optical drive (prepaid). (I remove the X1800XLs and the dual-layer DVD drives and replace them with the new parts.)
I also advised all customers to buy the Gateway 21" LCD or keep what they have; they all bought the Gateway LCD. (It's compliant.)
They all got 2 gigs of memory.
The only thing I am unsure of is the hard drives. I told customers what I knew, and the hard drives were a non-concern.
Yes, the customer has to buy his own copy of Vista. Since I lock the BIOS on these PCs, I will do the install free of charge.
Why do I lock the BIOS? These PCs come with a 5-year warranty and are already overclocked to their max (a stable OC), so to protect myself and the customers I lock the BIOS. I sell only to the five states around my own, so it's easy to take care of customers' needs.
Mintmaster said: You're not thinking about this from a compiler's viewpoint and the dependency tree. You don't need to have the whole function you're integrating in memory at once.
Right, but as I stated, the more you have in memory at once, the less work needs to be done.
In this case, of course, the instructions break up very nicely, and if you're smart about it, you don't need any more instructions with fewer registers. But, as you stated, you still need to do more work: more memory accesses. So if you're load-limited, computation time scales as 1/(1 + 1/n), where 2 + 1.25n is the number of registers used.
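Taken at face value, that scaling claim is easy to tabulate. A quick sketch under my own reading of it (n as the blocking/unrolling factor, register use 2 + 1.25n, relative time 1/(1 + 1/n); none of these interpretations are spelled out in the thread):

```python
# Sketch of the load-limited scaling claim above. Assumptions (mine):
# n is the blocking/unrolling factor, registers used = 2 + 1.25*n,
# and relative computation time = 1/(1 + 1/n).
def relative_time(n):
    return 1 / (1 + 1 / n)

def registers_used(n):
    return 2 + 1.25 * n

for n in (1, 2, 4, 8, 16):
    print(f"n={n:2d}  registers={registers_used(n):5.2f}  time={relative_time(n):.3f}")
```

Under this reading, doubling n buys steadily less as the relative time approaches 1, while register use grows linearly, which is the balance being argued about.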
Turtle 1 said: Well, the dust issue bothered me because of the time involved. It's in the customers' contract that I have to clean their PCs once a year (for 5 years).
I am presently working on a PC case that has an air cleaner installed in the lower part of the case with the PSU. It is the same type used in submarines. The only other fans will exhaust the clean, cool air through the rad.
Hopefully I will have the case completed and ready for manufacturing by Jan. 2006.
{Sniping}Waste said: Not a good idea. If you clean their systems 3 times a year then it will work, but any less than that and the dust will clog the air filter; there goes the airflow through the case, and overheating will start.
Yes, I saw that problem coming. So what I did was stop using my water-cooling system and case and switch to the Koolance PC3-725BL (or SL). That's an Exos 2 in a Lian Li V1000 case. I have no fear of this case starving for cool air. Also keep in mind the only fans in this case are the 2 120mm case fans, the 2 120mm rad fans, and 1 fan on the PC Power & Cooling SLI PSU.
method: integrate(f, a, x, b)
    given f, a, b, and the midpoint x = (a + b)/2, received as a parameter
    simpson = Simpson's rule estimate over [a, b]
    trap    = trapezoid rule estimate over [a, b]
    if (abs(simpson - trap) < epsilon)
        return simpson
    else
        return integrate(f, a, (a + x)/2, x) + integrate(f, x, (x + b)/2, b)
DemoCoder said: This is tree recursive, not tail recursive, but the only reason this method isn't tail recursive is the first call to integrate (the plus would be taken care of by passing in an accumulator argument). I think with a little more thought this method could probably be transformed into a tail-recursive method, and then the extra stack space wouldn't be needed.
I don't quite see how, not without doing lots of recalculation. Specifically, if you make it tail recursive, you will have to recalculate f(b) about half of the times it shows up, which would be (very, very roughly) a 50% increase in the amount of computation required (assuming the simpson and trap functions have computation times that are trivial compared to evaluating the function itself). Actual performance will, of course, depend on the integral being evaluated.
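For reference, here is a minimal runnable sketch of the recursive scheme above in Python (my own rendering, not anyone's posted code). It passes the endpoint and midpoint samples down the recursion so each f value is evaluated exactly once, which is precisely the caching that avoids the repeated f(b) evaluations being discussed:

```python
import math

def integrate(f, a, b, eps=1e-8, fa=None, fm=None, fb=None):
    """Adaptive quadrature on [a, b]: accept the Simpson estimate when it
    agrees with the trapezoid estimate, otherwise split at the midpoint.
    Samples fa, fm, fb are handed down so each point is evaluated once."""
    m = (a + b) / 2
    if fa is None: fa = f(a)
    if fm is None: fm = f(m)
    if fb is None: fb = f(b)
    simpson = (b - a) / 6 * (fa + 4 * fm + fb)
    trap = (b - a) / 2 * (fa + fb)
    if abs(simpson - trap) < eps:
        return simpson
    # Recurse on [a, m] and [m, b]; the midpoints of the halves are the
    # only new sample points. (A common refinement also halves eps here.)
    return (integrate(f, a, m, eps, fa, f((a + m) / 2), fm) +
            integrate(f, m, b, eps, fm, f((m + b) / 2), fb))

print(integrate(math.sin, 0.0, math.pi))  # ~2.0
```

A tail-recursive version would have to thread an accumulator plus an explicit worklist of pending right halves; the duplicated f evaluations creep in exactly when those pending samples aren't carried along like the fa/fm/fb arguments here.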
DemoCoder said: That was a typo. The context of my message is clear.
Well, it is now that you've changed your numbers. 61% is different from 100%. The only conclusion I could make is that you said 100% because of architectural improvements.
I'm not sure if anyone said it would be a blowout, and I certainly didn't. One time I said "I think R580 would beat a 32-pipe 90nm G7x", and another time I said this hypothetical chip couldn't come close to R580 specifically in games with math-heavy shaders, like FEAR (where the X1600XT beats the GS quite substantially when bugfixed). It's going to depend a lot on the game/benchmark. Thus, I don't think an R580 and a 32-pipe G7x clocked similarly will be a blowout, especially since it depends on a very heavy ALU workload.
Mintmaster said: Well, it is now that you've changed your numbers. 61% is different from 100%. The only conclusion I could make is that you said 100% because of architectural improvements.
I'm not sure if anyone said it would be a blowout, and I certainly didn't. One time I said "I think R580 would beat a 32-pipe 90nm G7x", and another time I said this hypothetical chip couldn't come close to R580 specifically in games with math-heavy shaders, like FEAR (where the X1600XT beats the GS quite substantially when bugfixed). It's going to depend a lot on the game/benchmark.
Bill said: I hesitate to even post this, but:
http://www.nextl3vel.org/board/index.php?showtopic=729
I thought R580 was 48 pipes?
And he's saying R520 is really 32? That would explain the large die size, but don't we have an X-ray of the die? What does it show? How many quads?
geo said: When I hit my fourth in two paragraphs, I just gave up.
X-rays! rofl
It's out, we have a die shot (something we never got with G70, btw), and still IT WILL NOT DIE. M'gawd.
DemoCoder said: Extra registers are useful for a time-space tradeoff if you want to eliminate common subexpressions. If you don't use the extra registers, you just have to recompute some values that get written over.
I know; that was basically the whole point of my response to Chalnoth. The task had redundant data loading. My point is that a large quantity of these subexpressions is nearly impossible to find in a real-time pixel shader.
If you look at Forth programs, for example, the stack for any given method never gets more than a few elements deep, especially in methods that leave only one value as a result.
My point exactly. You can't output more than 16 values, so it's extremely rare to find a use for so many registers.
The challenge is to find a balance. For the NV30, the limitation of 2 FP32 registers or 4 FP16 registers (with a huge penalty for exceeding it) was too harsh, and aggressive non-CSE would be needed, but then you are burning extra cycles. But once you get to around 8 registers, you won't need many more except in pathological cases.
So why are you telling me that PS3.0 is so great without dynamic branching? That's where this whole thing started. PS2.0 has 12 or 16 registers, which is plenty.
DemoCoder said: Are you sure the FEAR shaders are as math-heavy as you think they are? Has anyone ripped out any of the shaders and calculated math:texture instruction ratios?
Well, I guess I'm not sure and I'm just making assumptions. There are a few games here and there where the X1800XT can significantly beat the 7800GTX, but only a few where the X1600XT beats the 6800GS. I do know that at Digit-Life there are some shaders where the X1600XT is very nearly 3 times the speed of the X1300Pro, so I know such circumstances do exist.
DemoCoder said: Also, what benchmarks are you looking at? DriverHeaven, for example, has *fixed* FEAR benchmarks at http://www.driverheaven.net/reviews/X16_GS/fear.htm that don't show any substantial beatdown against the GS.
Well, first of all, people have been reporting a ~35% increase in that thread on these boards. Secondly, that DriverHeaven review only updated the 2xAA score, which has always had a higher hit on ATI cards than on NVIDIA's. The framerate went from 23fps to 34fps: nearly a 50% increase. If you add a conservative 30-40% to the no-AA and 4xAA scores, it'll be way ahead of the 6800GS.
Mintmaster said: So why are you telling me that PS3.0 is so great without dynamic branching? That's where this whole thing started. PS2.0 has 12 or 16 registers, which is plenty.
Looking now at this discussion between you and Chalnoth, I see you brought recursion up as an example, but that's about as branching-heavy as you can get.
It's well outside the context of my statements, because my claim is this: PS3.0 without dynamic branching offers almost nothing over PS2.0+ (i.e. ps_2_x).
(PS2.0+ does offer something over PS2.0, but just a tad, and NV3x was deficient in so many other areas that it couldn't take advantage of them. That's getting a bit off topic, though...)