Beyond3D Forum

The NEXT LAST R600 Rumours & Speculation Thread (http://forum.beyond3d.com/showthread.php?t=39173)

fellix 01-May-2007 14:20

Hmm, it seems to me that those HL2 shots were resized after the screen capture, which doesn't make for a precise evaluation. :roll:

nAo 01-May-2007 14:24

Yep, they have been resized, and yes, the second shot is a bit blurrier.

Jawed 01-May-2007 14:56

Quote:

Originally Posted by nAo (Post 979734)
Yep, they have been resized, and yes, the second shot is a bit blurrier.

Which would imply that in CFAA mode the RBEs are "dumb" sampling from neighbouring pixels regardless of whether the neighbouring pixels are edge pixels or not.

Which, being honest, is hardly surprising.

There's gonna be LOD fiddling, movies and riots, I guess.

Jawed

Jawed 01-May-2007 15:00

Quote:

Originally Posted by silent_guy (Post 979534)
Is it completely ridiculous to suggest that constants are loaded into registers before issuing ALU ops? I guess the performance drop would be catastrophic in some cases. :wink:

I didn't suggest that's how it worked. What I mean is that an operand for an instruction is either a register or a constant. It could, also, be a gather from memory, but...

Quote:

(Question for a 3D shader expert: How common are constant operands in real life 3D shaders?)
http://www.gamedev.net/community/for...ply_id=2521733

Constants are important in D3D10.

Quote:

Yes, that's the other big question: but with the vec4, isn't the only freedom a permutation of xyzw? (I haven't read the CTM docs...)
That reduces a 15-way fetch into a 6-way fetch (if the 1D supports MAD), if you're smart about the way you lay out registers in memory. If there are no restrictions at all in R600, you'd need 15 independent reads and 5 independent writes per shader unit. Insane.
I expect we're stuck until someone tests this. I've spent lots of time thinking about register fetch, but not for this configuration. Truthfully, no-one seems to have even explored the limits of R300 :sad:
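
A rough sketch of where the 15 and 6 come from, assuming five MAD-capable lanes per shader unit (four vector lanes plus one scalar) and three source operands per MAD. The lane layout and the function names are assumptions for illustration, not anything confirmed about R600:

```python
# Back-of-envelope operand-fetch count for one shader unit.
# Assumption: 4 vector lanes + 1 scalar lane, every lane issuing a MAD (a*b+c).

MAD_OPERANDS = 3   # a, b, c
LANES = 5          # x, y, z, w + 1 scalar (assumed layout)


def fetches_unrestricted(lanes=LANES, operands=MAD_OPERANDS):
    """Every lane may read any register: one independent read per operand per lane."""
    return lanes * operands


def fetches_vec4_plus_scalar(operands=MAD_OPERANDS):
    """The four vector lanes share one wide read per operand (swizzled after
    the fetch); the scalar lane needs its own read per operand."""
    vec4_reads = operands      # 3 wide reads feed x, y, z and w
    scalar_reads = operands    # 3 reads for the scalar MAD
    return vec4_reads + scalar_reads


def writes_unrestricted(lanes=LANES):
    """One independent result write per lane."""
    return lanes
```

Under these assumptions the unrestricted case needs 15 reads and 5 writes per unit, while the vec4-plus-scalar layout needs 6 reads.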

Quote:

Another question: in R600, the branch logic is in parallel with the ALU's. Is that a departure from R5xx or has it always been like that?
R5xx also has this dedicated unit. R420 and earlier couldn't perform dynamic branching in the pixel shader.

Quote:

Yes, I'm not sure there is that much additional value in allowing independent scalar operations, since the majority of ops in 3D are vec3 or vec4 based anyway and scalar ops are not that common.
It's probably a different story for GPGPU.
Yeah. Hopefully there'll be a nice whitepaper for the R600 architecture describing motivations.

Purely scalar code with every instruction dependent upon its predecessor will make R600 crawl. G80 will lap it up.
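
To put a rough number on that, here is a toy model of a 5-wide unit that can only co-issue ops from one thread when they are independent. The greedy packer and the 5-lane width are simplifying assumptions, not a description of the real R600 compiler or of G80's scheduler:

```python
def pack_bundles(deps, width):
    """Greedy packing of ops into issue bundles.

    deps[i] is the set of op indices that op i depends on. An op may only
    join a bundle once all of its dependencies sit in earlier bundles.
    """
    done = set()
    remaining = list(range(len(deps)))
    bundles = []
    while remaining:
        bundle = [i for i in remaining if deps[i] <= done][:width]
        if not bundle:
            raise ValueError("dependency cycle")
        bundles.append(bundle)
        done.update(bundle)
        remaining = [i for i in remaining if i not in bundle]
    return bundles


def utilisation(deps, width):
    """Fraction of issue slots that carry useful work."""
    bundles = pack_bundles(deps, width)
    return len(deps) / (len(bundles) * width)


# Ten scalar ops, each depending on its predecessor.
chain = [set()] + [{i - 1} for i in range(1, 10)]
# Ten scalar ops with no dependencies at all.
independent = [set() for _ in range(10)]

wide_on_chain = utilisation(chain, 5)
scalar_on_chain = utilisation(chain, 1)
wide_on_independent = utilisation(independent, 5)
```

In this model the dependent chain fills one lane in five on the 5-wide unit (20% utilisation), while a 1-wide scalar unit stays fully busy on the same chain.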

Jawed

Dooby 01-May-2007 15:03

I use Photoshop for about 80% of my day, and I'm *sorta* convinced that the first non-AA image has been sharpened as well as resized. Zoom in around dark lines, such as the powerline pole, and you can definitely see more of a lighter outline in the non-AA pic than in the AA'd pic. I'm aware the AA'd pic has a slight outline too, but in the non-AA pic it is much more pronounced. You usually get this when you apply a sharpen filter in Photoshop et al.

I would take these images with a pinch of salt when looking at texture quality.

That said, OMG the FSAA looks GORGEOUS!

Jawed 01-May-2007 15:06

Quote:

Originally Posted by silent_guy (Post 979617)
But the inevitable price to pay is that you give up something at the system level. And that's a bad thing. In this case, that price means more over-dimensioning to reduce freak corner cases, over-dimensioning to limit the latency penalty (2 opposite direction rings), less control over scheduling etc. On R580, that was manageable because it was only used for read return data. But if R600 also uses this to transport write data, it gets really ugly.

GPUs are latency tolerant. It seems like a completely different focus from CPUs and routers. GPUs spend their lives converting little packets into bigger packets, because throughput is more important than latency.

Jawed

EasyRaider 01-May-2007 15:07

Quote:

Originally Posted by Jawed (Post 979746)
Which would imply that in CFAA mode the RBEs are "dumb" sampling from neighbouring pixels regardless of whether the neighbouring pixels are edge pixels or not.

Which, being honest, is hardly surprising.

There's gonna be LOD fiddling, movies and riots, I guess.

Jawed

Considering all the nasty texture swimming and moire I have seen with my X1900, a bit of blur might be a good thing.

Silent_Buddha 01-May-2007 16:17

I'm actually more interested in the 12x and 24x CFAA settings.

I believe the 16x CFAA is 8x MS with a Wide Tent. Thus I'd expect it to be a bit blurrier.

I'm much more interested in seeing the 12x CFAA with its 8x MS and Narrow Tent.
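
For a feel of why the wider tent should blur more, here is a 1D sketch of a tent-weighted resolve. The sample positions and radii are made up for illustration and are not ATI's actual CFAA kernels:

```python
def tent_weight(distance, radius):
    """Linear falloff: 1 at the pixel centre, 0 at `radius` and beyond."""
    return max(0.0, 1.0 - abs(distance) / radius)


def resolve(samples, centre, radius):
    """Tent-weighted average of (position, value) samples around `centre`."""
    weighted = [(tent_weight(pos - centre, radius), val) for pos, val in samples]
    total = sum(w for w, _ in weighted)
    return sum(w * val for w, val in weighted) / total


# A one-pixel-wide white line (value 1.0) on black, two samples per pixel.
# Pixels are one unit wide, centred at -1, 0 and +1.
line = [(-1.25, 0.0), (-0.75, 0.0),
        (-0.25, 1.0), (0.25, 1.0),
        (0.75, 0.0), (1.25, 0.0)]

own_pixel_only = resolve(line, 0.0, 0.5)   # only the pixel's own samples count
narrow_tent = resolve(line, 0.0, 1.0)      # reaches the nearest neighbour samples
wide_tent = resolve(line, 0.0, 1.5)        # reaches all neighbour samples
```

In this toy case the line's brightness drops from 1.0 to 0.75 with the narrower radius and to about 0.56 with the wider one, which fits the intuition that a wide tent washes out fine detail.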

Also, I'd like to see the effect of Edge Resolve with the 24x setting.

Still, 16x CFAA does a wonderful job on those trees, wires, and fence. Hopefully 12x CFAA has a similar effect in those areas.

[Edit] Zoomed in on those shots and I'm really not liking how 16x with its Wide Tent blurs and washes out the colors. Someone please try to get a shot of the 12x setting. :)

Regards,
SB

silent_guy 01-May-2007 16:24

Quote:

Originally Posted by Farhan (Post 979638)
On the other hand, you can't use ALL those metal layers (usually top 2 reserved for power/ground IIRC?), then you have your clock distribution and what not. Also the higher metal layers are much less dense than the lower ones...

You have some freedom wrt what you can do for each layer, and you can mix. Clock networks do not have their own dedicated layers. The buffers are just placed first, before the rest of the cells, and usually the nets may have different spacing rules. But I don't know the exact details anymore.

Quote:

Wouldn't you just dump your data onto the stop closest to you and it will just go to its required stop? All the scheduling and whatnot should happen individually on each of the stops, no? Since the memory controllers are going to be on the stops, I think? What am I missing here?
Additional and unpredictable(!) latency. The latter happens when you have a ring stop wanting to insert data when there's also data on the ring trying to shift in the same place.
And there's the risk of running into deadlocks. An obvious way would be if one client backs up and data for this client arrives: it will stall the whole ring. But there are more insidious cases with multiple clients injecting at just the wrong time.
This is avoided by adding larger buffers and either conservative or complex scheduling or both, but it's not easy to get right. A ring is a great example of how a seemingly simple system can exhibit beautiful patterns of entropy. :wink:
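
A toy model of the insertion conflict, assuming a unidirectional slotted ring where every slot advances one stop per clock and a stop can only inject into an empty slot. It ignores back-pressure at the destination, so it shows the added wait but not the deadlock cases; all names are hypothetical:

```python
from collections import deque


def simulate_ring(n_stops, injections, cycles):
    """Simulate a one-way slotted ring.

    injections maps a clock number to a list of (src, dst) requests issued
    on that clock. Returns (src, dst, wait_clocks) for every packet that
    got onto the ring, in injection order.
    """
    slots = [None] * n_stops                    # slots[i]: dst of packet at stop i
    queues = [deque() for _ in range(n_stops)]  # per-stop pending (dst, issued)
    waits = []
    for t in range(cycles):
        # Every slot moves one stop forward.
        slots = [slots[-1]] + slots[:-1]
        # Packets that reached their destination leave the ring.
        for i in range(n_stops):
            if slots[i] == i:
                slots[i] = None
        # New requests arrive at their source stop.
        for src, dst in injections.get(t, []):
            queues[src].append((dst, t))
        # A stop may inject only if the slot in front of it is empty.
        for i in range(n_stops):
            if queues[i] and slots[i] is None:
                dst, issued = queues[i].popleft()
                slots[i] = dst
                waits.append((i, dst, t - issued))
    return waits


# Stop 0 sends to stop 3 on clock 0; stop 1 wants to send to stop 2 on clock 1.
contended = simulate_ring(4, {0: [(0, 3)], 1: [(1, 2)]}, 6)
# The same request from stop 1 with no other traffic on the ring.
alone = simulate_ring(4, {1: [(1, 2)]}, 6)
```

Here the request from stop 1 waits one extra clock purely because unrelated traffic from stop 0 happens to be passing, and how long it waits depends on what everyone else is doing.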

Quote:

Originally Posted by Jawed (Post 979755)
GPUs are latency tolerant. It seems like a completely different focus from CPUs and routers. GPUs spend their lives converting little packets into bigger packets, because throughput is more important than latency.

Well, yes, we all know that, don't we?
But that doesn't mean you have to ask for it. Each additional cycle of latency costs you an additional amount of buffering or an earlier breakdown in performance.

Switching from defense to offense: other than some non-system related implementation details, is there a single advantage of a ring over a crossbar?

(The latency in routers introduced by the switching fabric itself is pretty much irrelevant compared to the latency introduced by higher-level scheduling. And you don't have the closed loop of a requester having to wait for returning data. Throughput is much more important.)

Silent_Buddha 01-May-2007 16:33

Wasn't one of the reasons stated by ATI for choosing the ring bus over a crossbar that the complexity of a crossbar increases much more than that of a ring bus as width goes up? Or that transistor usage increases much more significantly with a crossbar than with a ring bus as bus width grows?

I'd imagine one reason NV could only do 384 bits was either the number of transistors needed or the complexity.

Then again, I'm just a layman so I might be getting all of this back arsewards.

Regards,
SB

Bjorn 01-May-2007 16:34

Quote:

Originally Posted by trinibwoy (Post 979678)
Heh, I'm sure they wish that was "killing performance over 8800GTS/GTX". It'll be interesting to see Nvidia's response - a 2900XT at $400 will be mighty attractive.

There is a drawback to being late to the game though: the competition's cards are available for quite a bit below the MSRP. The 8800 GTS 640 MB can be had for $360, the 320 MB for $260.

mao5 01-May-2007 16:36

Quote:

Originally Posted by vertex_shader (Post 979634)
That picture is from another area, and the game has very unbalanced frame rates. What driver is the guy using with the XT? :smile:

Are the X1950XTX pic and the R600XT pic in the same area?

vertex_shader 01-May-2007 16:43

Quote:

Originally Posted by mao5 (Post 979791)
Are the X1950XTX pic and the R600XT pic in the same area?

Almost.

mao5 01-May-2007 16:43

R600 XT counterattack to NV PureVideo HD, without ATI AVIVO.
http://vietnamglobalteam.org/images/smilies/BBP/50.gif

http://images.anandtech.com/graphs/g...0452/14537.png

http://www.chiphell.com/attachments/...3FgJgyMtsC.jpg

mao5 01-May-2007 16:47

Quote:

Originally Posted by vertex_shader (Post 979794)
Almost.

I remember you guys wanted to compare fps at the same spot. VS, you should look at the two pics again: one was taken on the road beside the airport garage, the other on the road with trees and boskage around. You call them almost the same?

Bjorn 01-May-2007 16:51

I'm guessing that this means that it'll support VC1 also which the 8600 doesn't. At least not to the extent that it does with h264.

All broadcast HD content (at least in Europe) will afaik be h264 though, so I don't really see this as a problem, but it sure doesn't hurt to have it either.

_xxx_ 01-May-2007 16:51

Quote:

Originally Posted by silent_guy (Post 979783)
Additional and unpredictable(!) latency. The latter happens when you have a ring stop wanting to insert data when there's also data on the ring trying to shift in the same place.
And there's the risk of running into deadlocks. An obvious way would be if one client backs up and data for this client arrives: it will stall the whole ring. But there are more insidious cases with multiple clients injecting at just the wrong time.
This is avoided by adding larger buffers and either conservative or complex scheduling or both, but it's not easy to get right. A ring is a great example of how a seemingly simple system can exhibit beautiful patterns of entropy. :wink:

But why the 512-bit bus then? ;)

The stalls are less of a problem, since you will surely have dedicated lines there. Wouldn't make sense otherwise.

pjbliverpool 01-May-2007 16:52

Quote:

Originally Posted by mao5 (Post 979795)
R600 XT counterattack to NV PureVideo HD, without ATI AVIVO.
http://vietnamglobalteam.org/images/smilies/BBP/50.gif

That tells us very little. For a start that graph is showing the max CPU use, not the average. Average is here:

http://images.anandtech.com/graphs/g...0452/14536.png

And the CPU use would depend greatly on the film used.

mao5 01-May-2007 16:59

Quote:

Originally Posted by pjbliverpool (Post 979805)
That tells us very little. For a start that graph is showing the max CPU use, not the average. Average is here:

http://images.anandtech.com/graphs/g...0452/14536.png

And the CPU use would depend greatly on the film used.

I don't think so. Without ATI AVIVO in 8.361, R600XT already shows such good VC1 performance. How about average CPU utilization with AVIVO HD in the official driver?

http://vietnamglobalteam.org/images/smilies/BBP/50.gif

mao5 01-May-2007 17:02

Vegas carnage will show up

http://vietnamglobalteam.org/images/smilies/BBP/50.gif

vertex_shader 01-May-2007 17:04

Quote:

Originally Posted by mao5 (Post 979800)
I remember you guys wanted to compare fps at the same spot. VS, you should look at the two pics again: one was taken on the road beside the airport garage, the other on the road with trees and boskage around. You call them almost the same?

A few meters' difference, but the 8800 GTX pictures you linked were made in a whole different area.

mao5 01-May-2007 17:06

Quote:

Originally Posted by vertex_shader (Post 979817)
A few meters' difference, but the 8800 GTX pictures you linked were made in a whole different area.

OK, so you admit they are really different scenes.

Jawed 01-May-2007 17:08

Quote:

Originally Posted by silent_guy (Post 979783)
Well, yes, we all know that, don't we?
But that doesn't mean you have to ask for it. Each additional cycle of latency costs you an additional amount of buffering or an earlier breakdown in performance.

Compared with a couple hundred clocks of worst-case latency on a DDR fetch, do you know how much worst-case latency R600's ring bus will add (compared with a crossbar)? I dunno. 10s of clocks?
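
As a back-of-envelope check on the buffering cost, Little's law says the data in flight is bandwidth times latency. The 64 bytes per clock (a 512-bit path), the 200-clock fetch and the 30 extra ring clocks below are placeholder figures, not measured R600 numbers:

```python
def bytes_in_flight(bytes_per_clock, latency_clocks):
    """Little's law: data that must be buffered to keep the pipe full."""
    return bytes_per_clock * latency_clocks


BYTES_PER_CLOCK = 512 // 8   # assumed 512-bit data path
DDR_LATENCY = 200            # "a couple hundred clocks" (placeholder)
RING_EXTRA = 30              # "10s of clocks" (placeholder)

baseline = bytes_in_flight(BYTES_PER_CLOCK, DDR_LATENCY)
with_ring = bytes_in_flight(BYTES_PER_CLOCK, DDR_LATENCY + RING_EXTRA)
extra_fraction = (with_ring - baseline) / baseline
```

With these placeholders the ring adds 15% to the buffering needed to cover latency; the real figure depends entirely on numbers nobody outside ATI has yet.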

Quote:

Switching from defense to offense: other than some non-system related implementation details, is there a single advantage of a ring over a crossbar?
I guess we'll just have to wait for a GPU designer to pipe up :razz:

Here's an overview of a GPU I drew up while speculating about R600:

http://forum.beyond3d.com/showpost.p...postcount=1447

(some of that is R600-specific speculation.)

Perhaps you'd like to compare that with a router and we can talk about where the meat of the scheduling/balancing problem is. In my opinion a GPU has so much scheduling to do that bus scheduling turns out to be a supporting role, not the centre of its universe.

I think the implementation factor, that you're brushing off, is a big deal. IBM went with a ring bus for Cell, hugely motivated by simplicity of implementation.

Jawed

silent_guy 01-May-2007 17:17

Quote:

Originally Posted by Jawed (Post 979822)
In my opinion a GPU has so much scheduling to do that bus scheduling turns out to be a supporting role, not the centre of its universe.

At the end of the day, the ring is nothing but a transport mechanism that doesn't really do anything. And that's my whole point: the hoopla about the ring being an important feature is completely unwarranted. :grin:

(I have to run... Rest will follow later)

mao5 01-May-2007 17:18

Quote:

Originally Posted by mao5 (Post 979816)

showtime!

Anandtech:
"The Benchmark

Our benchmark was suggested to us by Ubisoft and it's basically an average FPS of looking out of the window on the first helicopter ride over a cityscape in Mexico. "


CPU: Intel Core 2 Extreme X6800 (2.93GHz/4MB)
Motherboard: EVGA nForce 680i SLI
Intel BadAxe

http://images.anandtech.com/graphs/r...1206/13806.png

R600XT Tester:
CPU: Intel Core 2 E6600 (2.40GHz/4MB) R600XT( Default clock)

same scene:

http://www.chiphell.com/attachments/...uXWKWLMKSs.jpg

The Test Video Setting:
http://www.chiphell.com/attachments/...jKHQl2UZar.jpg

http://vietnamglobalteam.org/images/smilies/BBP/50.gif

