Beyond3D Forum (http://forum.beyond3d.com/index.php)
-   Pre-release GPU Speculation (http://forum.beyond3d.com/forumdisplay.php?f=51)
-   -   The NEXT LAST R600 Rumours & Speculation Thread (http://forum.beyond3d.com/showthread.php?t=39173)

PSU-failure 10-May-2007 19:40

Quote:

Originally Posted by Kaotik (Post 984398)
http://www.hardocp.com/image.html?im...NfMV8yX2wuanBn

What's up with this? Only a single 6-pin connector, yet the computer at least appears to be running, judging by the LED lighting :???:

Going back to the first rumours, the R600 was expected to work with either one 8-pin PEG power connector or two 6-pin ones...

Maybe that was true? :?:

Considering the drivers, don't forget that blaming these could simply be AMD's suggestion for those under NDA to not disclose anything.

leoneazzurro 10-May-2007 19:45

Quote:

Originally Posted by PSU-failure (Post 984521)
Going back to the first rumours, the R600 was expected to work with either one 8-pin PEG power connector or two 6-pin ones...

Maybe that was true? :?:

Considering the drivers, don't forget that blaming these could simply be AMD's suggestion for those under NDA to not disclose anything.

Or because slot and card are PCIE 2.0

Kaotik 10-May-2007 20:01

Quote:

Originally Posted by PSU-failure (Post 984521)
Going back to the first rumours, the R600 was expected to work with either one 8-pin PEG power connector or two 6-pin ones...

Maybe that was true? :?:

Considering the drivers, don't forget that blaming these could simply be AMD's suggestion for those under NDA to not disclose anything.

But in that pic, if you look closely, it has 1x 6pin connected, not 1x 8pin?

edit: How much power can the video card draw from the PCIe 2.0 slot?

Fornowagain 10-May-2007 20:12

Quote:

Originally Posted by Kaotik (Post 984533)
edit: How much power can the video card draw from the PCIe 2.0 slot?

150W on 2.0

FrameBuffer 10-May-2007 20:13

Quote:

Originally Posted by Kaotik (Post 984533)
But in that pic, if you look closely, it has 1x 6pin connected, not 1x 8pin?

edit: How much power can the video card draw from the PCIe 2.0 slot?

IIRC the R600 (HD 2900) is not a PCIe 2.0 part, whereas the RV6x0 parts are, and unless my memory fails me PCIe 2.0 allows double the power of PCIe 1.x (75W -> 150W) through the adoption of the 8-pin PCIe connector. EDIT: lol nm, got beaten to it already.
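
The figures being traded here can be put together as a simple budget. A rough sketch, assuming the commonly cited limits of 75 W from the x16 slot, 75 W per 6-pin plug and 150 W per 8-pin plug; the table and function names are made up for illustration:

```python
# Commonly cited per-source limits in watts (assumed values, not taken from the thread).
POWER_LIMITS_W = {"slot": 75, "6pin": 75, "8pin": 150}

def board_power_budget(connectors):
    """Return the maximum in-spec power for a card fed by the slot plus the given plugs."""
    return POWER_LIMITS_W["slot"] + sum(POWER_LIMITS_W[c] for c in connectors)

for plugs in (["6pin"], ["6pin", "6pin"], ["8pin"], ["6pin", "8pin"]):
    print(plugs, board_power_budget(plugs), "W")
```

Under those assumptions a card with only one 6-pin plug connected has 150 W available in spec, which is why the photo looks odd for a part rumoured to need more.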

_xxx_ 10-May-2007 20:27

Quote:

Originally Posted by Pressure (Post 984171)
Wow, slow down there. First off, the entire market does not consist of high-end solutions. In fact, that is where the minority of the money is made.

Oh, but that's what Joe will see on every mag title. While pretty much no one will put the low-end stuff there.

Galduta 10-May-2007 20:55

http://www.nextgpu.com/forum/index.php?topic=17.435

Gigabyte GA-965P-DS3
-Intel Core 2 Duo E6300@3.26Ghz 466Mhzx7
-Team Xtreem 2x1GB D9@933Mhz Cas 5-5-5-15 1:1
HD 2900 xt

The CD drivers are the old 8.361
---------------------------------------------------------------------------
F.E.A.R. benchmark, 1600x1200, AF 8x, no AA, no soft shadows

Min: 71
Avg: 111
Max: 217

---------------------------------------------------------------

1600x1200, AF 8x, AA 4x, soft shadows on *
Min: 21
Avg: 58
Max: 113

* This result may not be correct!

1600x1200, AF 8x, soft shadows on
Min: 32
Avg: 55
Max: 107

--------------------------------------------------------

One 8800 GTS at stock clocks
158.18 WHQL drivers:

1600x1200, AF 8x, soft shadows on
Min: 41
Avg: 65
Max: 177

1600x1200, AA 4x, AF 8x, soft shadows on
Min: 21
Avg: 38
Max: 69
---------------------------------------------------------------------------------------------------------------

More later maybe, or tomorrow ;). As far as I'm concerned, the source is 100% reliable.
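
To make the two cards easier to compare, the average frame rates above can be turned into relative leads. This is only a sketch over the numbers as posted (1600x1200, AF 8x, soft shadows on); the helper name is made up, and the 4x AA result for the HD 2900 XT is the one flagged as possibly wrong:

```python
# Average fps from the F.E.A.R. runs posted above (1600x1200, AF 8x, soft shadows on).
results = {
    "no AA": {"HD 2900 XT": 55, "8800 GTS": 65},
    "4x AA": {"HD 2900 XT": 58, "8800 GTS": 38},
}

def lead_percent(a, b):
    """How far a is ahead of b, in percent of b (negative means behind)."""
    return (a - b) / b * 100.0

for setting, fps in results.items():
    lead = lead_percent(fps["HD 2900 XT"], fps["8800 GTS"])
    print(f"{setting}: HD 2900 XT {lead:+.1f}% vs 8800 GTS")
```

This gives roughly -15.4% without AA and +52.6% with 4x AA. Note that the HD 2900 XT average is higher with 4x AA (58) than without it (55), which fits the poster's doubt about that run.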

_xxx_ 10-May-2007 20:59

Quote:

Originally Posted by dizietsma (Post 984256)
They look better to me as well. From the results it's rather uncanny how well it finds its own niche between the GTX and GTS, at least in DX9 current games. Perhaps it (more than?) matches the GTX in future games if it is that forward looking as well as people have suggested and history tends to show?

Considering the leaked slides, I'd try to describe it in the old terminology: look at it as an 80-pipe chip. There are 80 "fat" pipes going against nV's 128 "single" pipes, or "1.5x" pipes if you include the missing MUL (on average, ass-uming it's used half the time).

If the scheduling is good and the load balance is favorable for the R600 architecture (higher shader load for example, though that balance also depends on the batch size etc.), it'll gain perf compared to the GTX. The opposite case, high tex/filtering load and less shader load will give nV the advantage of (best case for nV) 128:80 or 16:10 - which also coincidentally matches the ALU-cluster sizes in the chips ;)

So the simplest factor affecting the performance will be if those "fat" pipes can be more often used for multiple ops than for single ops, as well as if the texturing is the bottleneck in the given situation.
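
The 128:80 argument can be written down as a small calculation. A minimal sketch, assuming the "80 fat pipes vs 128 single pipes" view from this post and ignoring clock speeds entirely; `ops_per_fat_pipe` is a hypothetical knob for how many independent ops a fat pipe manages to retire per clock:

```python
from fractions import Fraction

R600_FAT_PIPES = 80      # the "80-pipe" view of R600 used above
G80_SCALAR_PIPES = 128   # nV's "single" pipes in the best case for nV

def relative_alu_throughput(ops_per_fat_pipe):
    """R600 ALU ops per clock relative to G80, ignoring clock speeds (a simplification)."""
    return Fraction(R600_FAT_PIPES) * Fraction(ops_per_fat_pipe) / G80_SCALAR_PIPES

print(Fraction(G80_SCALAR_PIPES, R600_FAT_PIPES))  # 8/5, i.e. 16:10
for ops in (1, 2, 4):
    print(ops, float(relative_alu_throughput(ops)))
```

With one op per fat pipe the ratio is 5/8 in nV's favour; it tips the other way once the fat pipes average more than 1.6 ops per clock, which is the "multiple ops vs single ops" point made above.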

chavvdarrr 10-May-2007 21:17

Quote:

Originally Posted by _xxx_ (Post 984563)
Considering the leaked slides, I'd try to describe it in the old terminology:

Do you take into account that NV pipes are double pumped?

All in all, making the same mistake twice in a row (too little TMU power) can't be coincidence.

_xxx_ 10-May-2007 21:20

Quote:

Originally Posted by fellix (Post 984355)
I wonder, if this bicubic resampling is a "full" implementation, does it [the hardware] support programmable coefficients -- say for blurrier or sharper output? :wink:
It could be somewhat "locked" to a MIP level (AF?), gradually iterating to sharpen the output texel as the MIP level increases, to aid the AF resampling. D'oh!

I think that rather has something to do with their new hybrid AA algorithms, the "wide/narrow tent" stuff.

_xxx_ 10-May-2007 21:34

Quote:

Originally Posted by chavvdarrr (Post 984576)
Do you take into account that NV pipes are double pumped?

No, I think their simplicity makes up for most of that.
Well yeah, it was a mistake in the sense that they expected the market to rely much more on increased shader power, but obviously they were too early again. And by the time that begins to matter, these cards will be obsolete anyway IMO.

Silent_Buddha 10-May-2007 21:46

Quote:

Originally Posted by neliz (Post 984282)
Lol, doesn't this guy need to uninstall his ATI drivers before he can use his 8800? (Hint: ATI logo in the 8800 benchmark pictures.)

Nice catch Neliz, it looks like that guy probably has both the 8800 gts 320 and HD 2900 XT installed at the same time.

Then he probably just changes which is the Primary monitor (thus the one benched on) in Windows Display Manager.

Not sure how trustworthy his results will be then.

Regards,
SB

DemoCoder 10-May-2007 22:03

Quote:

Originally Posted by _xxx_ (Post 984578)
I think that rather has something to do with their new hybrid AA algorithms, the "wide/narrow tent" stuff.

Weren't people claiming CFAA was done in the ROPs or scan-out/resolve HW? If the tent filter is being done by a texture unit, I don't see the advantage. A shader-based MSAA resolve pass isn't that expensive fillrate/shader wise, and one could implement arbitrary filter kernels to one's heart's content without wasting silicon on fixed-function bicubic support.

Unknown Soldier 11-May-2007 00:19

2 Attachment(s)
Quote:

Originally Posted by Fornowagain (Post 984274)

Weird, My Gainward Bliss GTS 320 has higher SM2.0 and SM3.0 scores and at def. clocks for the GPU and CPU. My CPU is QX6600 so the score is higher.

3DMark06 - 9651
QX6600 - Default clock
Gainward Bliss GTS 320 - Default Clocks
Resolution is Def. - 1280x1024

---------

3DMark05 - 14080
QX6600 - Default clock
Gainward Bliss GTS 320 - Default Clocks
Resolution is Def. - 1024x768

Will test 3DMark05 at 1280x1024

US

Unknown Soldier 11-May-2007 00:39

1 Attachment(s)
3DMark05 - 13161
QX6600 - Default clock
Gainward Bliss GTS 320 - Default Clocks
Resolution is Def. - 1280x1024

US
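
For reference, the drop between the two 3DMark05 runs above (14080 at 1024x768, 13161 at 1280x1024) works out as follows. The helper name is just illustrative:

```python
def percent_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0

# 3DMark05 scores posted above: 1024x768 first, then 1280x1024.
print(f"{percent_change(14080, 13161):.1f}%")  # -6.5%
```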

Jawed 11-May-2007 03:50

Quote:

Originally Posted by Frank (Post 984440)
So, what do we think about the instruction scheduling?

If you haven't already, I guess now is a good time to look at the CTM guide:

http://ati.amd.com/companyinfo/resea..._CTM_Guide.pdf

since it at the very least provides inspiration and potential comparison!

Quote:

1. It uses a VLIW for all ops issued to an 5+1+1 ALU block for each clock.

Not sure how you get 5+1+1, since it's MAD/SF+MAD+MAD+MAD+MAD+BR - 1+4+1 if you like.

I assume they're really entirely separate ALUs that are simply clocked in parallel. The width being, erm 4, 8, 16, whatever pixels/primitives/vertices. So:
  • 16x MAD/SF
  • 16x MAD
  • 16x MAD
  • 16x MAD
  • 16x MAD
  • 16x BR
if 16-pixels per clock are processed by the ALU pipeline.
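
The lane layout listed above can be tallied with a few lines of code. This is a sketch of the guess in this post, not a description of the real hardware; the 16-wide figure is one of the widths floated above and the names are invented:

```python
# Slot layout as described above: one MAD/SF lane, four MAD lanes, one branch lane.
LANES = ["MAD/SF", "MAD", "MAD", "MAD", "MAD", "BR"]
PIXELS_PER_CLOCK = 16  # assumed width; the post also floats 4 and 8

def ops_per_clock(lanes=LANES, width=PIXELS_PER_CLOCK):
    """Count per-clock operations by lane type for one ALU pipeline."""
    counts = {}
    for lane in lanes:
        counts[lane] = counts.get(lane, 0) + width
    return counts

print(ops_per_clock())
```

At 16 pixels per clock that is 16 MAD/SF, 64 MAD and 16 BR operations per clock for one such pipeline.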

Quote:

2. It issues a single instruction word each clock for each ALU.

I'm going to assume that each instruction lasts 4 clocks, because that's what R5xx does. So the instruction decode and settling time for operand-fetch addressing and so on can be less than frenetic.

Quote:

3. It uses sequential instruction packing (ie. it issues a single instruction word for multiple sequential ops) for each ALU.
4. It issues one (or two) instruction word(s) each clock for each ALU block, but it has a small instruction cache and so can issue multiple ops in a single clock.
5. Like 4, but it issues blocks of instruction words for each batch.

I think I saw somewhere 512-instruction slots. So if a program is longer than that, then the instructions are paged-in as needed, I presume. It's possible for a single batch to run a clause of code that's hundreds of instructions in length - a clause being bounded by texturing instructions (or a branch). Obviously, a clause can straddle instruction pages.

So I would guess that each instruction of the clause is fed to the ALU pipeline as it's needed, from the instruction cache. I presume that an instruction page fault causes the batch to be switched out of the pipeline, until the page is ready.

Since R5xx has an alternating pipeline where batch instructions are sequenced as AAAABBBB, that seems like a reasonable starting point for R600. If that's the case, you can see immediately that two different instruction pages could be used to feed the ALU pipeline. One batch might be a vertex shader and the other a pixel shader.
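
The AAAABBBB pattern is easy to generate. A minimal sketch, assuming 4 clocks per instruction as guessed earlier in this post and two batches taking turns; the function is purely illustrative:

```python
def issue_order(batches, clocks_per_instruction=4, instructions=2):
    """Return the batch occupying the ALU pipeline on each clock, as a string.

    Each instruction is held for clocks_per_instruction clocks, and the
    pipeline moves on to the next batch after every instruction.
    """
    order = []
    for _ in range(instructions):
        for batch in batches:
            order.extend([batch] * clocks_per_instruction)
    return "".join(order)

print(issue_order(["A", "B"]))  # AAAABBBBAAAABBBB
```

Batch A and batch B could be fed from different instruction pages, since each only needs its next instruction every eight clocks in this scheme.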

R600's instruction scheduler might work by keeping all available batches on the same instruction page, whenever possible (subject to other resource hazards, queues filling up that sorta crap), so deferring instruction-page swaps until as late as possible. Essentially to minimise the number of swaps.
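
A toy model shows why grouping batches by instruction page would cut down on swaps. Everything here is hypothetical: the workload, the single resident page and the function names are assumptions made to keep the sketch small, not anything known about R600:

```python
def count_page_swaps(schedule):
    """Count how often the resident instruction page changes along a schedule.

    schedule is a list of (batch, page) pairs in issue order; the first page
    load is not counted as a swap.
    """
    swaps = 0
    resident = None
    for _batch, page in schedule:
        if page != resident:
            if resident is not None:
                swaps += 1
            resident = page
    return swaps

# Hypothetical workload: four ready batches spread over two instruction pages.
ready = [("b0", 0), ("b1", 1), ("b2", 0), ("b3", 1)]

round_robin = ready
page_grouped = sorted(ready, key=lambda item: item[1])  # stable sort keeps batch order

print(count_page_swaps(round_robin))   # 3
print(count_page_swaps(page_grouped))  # 1
```

A real scheduler would also have to respect the other resource hazards mentioned above, so the grouped order is a best case.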

R5xx should be doing some kind of instruction page handling. R4xx may do too, since it can support hundreds of instructions (erm, can't remember how many for SM2.0b... 512?). So, instruction-page handling is prolly quite normal these days. Dunno if it really amounts to much for us armchair types.

But it adds an extra dimension to the batch scheduling problem. Another dimension to consider is that R600 prolly supports multiple concurrent render contexts. Xenos supports 8 and a patent application for R600 explicitly refers to eight when discussing memory management.

Quote:

Further:
A. Constants are part of the instruction word.

I expect so, since R5xx supports this kind of inline constant.

Quote:

B. Constants are distributed separately.

R5xx also has a constant store.

Since a constant store is a key concept in D3D10, it's quite clear that R600 will have a beefed-up version. D3D10 constants can be huge (multi-KB in size), formed as 4096-element structures. It's a whole new ballgame! I wonder if constants and register file actually share memory in some fashion, rather than each having a dedicated pool. But the R600 diagram seems to imply a dedicated store ...

(G80 has a 64KB constant cache shared by all clusters. I don't know if it's monolithic or distributed.)

One of the uses for the constant store in R5xx appears to be to hold vertex attributes, each attribute interpolated for each rasterised fragment (I'm skating on thin ice, admittedly). So that might be vertex colour, vertex normal, texture coordinates etc. It's quite costly. So R600 could do the same, with these interpolated attributes held in the constant store.

Although, if you look at the Xenos functional diagram, you'll see a block called Shader Pipe Interpolators, which appears to be doing on-demand attribute interpolation (as instructions are issued). So, ahem, maybe that's how R600 will work...


But one of the recent patents seemed to describe how attribute interpolation could be done in parallel with rasterisation, so I'm confused. The R600 diagram contains an SPI block. I trip up on this stuff I'm afraid.

Hmm, just had a thought, maybe the SPI block is actually just a programmable unit that the Sequencer controls in addition to the ALUs and TUs. Since, per vertex, the count and types of attributes that need to be interpolated varies, the quantity of work (and therefore duration of program) varies. Hmm...

Quote:

Option 1 would fit the picture presented, but would use a significant amount of unneeded bandwidth from the ringbus. Those VLIWs would be quite wide, and they would require you to issue 6 separate instructions when one would do (like, with a conditional vec4 + clamp/modifier op). But it would ease the work of the compiler / scheduler.

Since each of the co-issue ALUs are almost certainly all separate (easiest to think of the BR pipe to see why), each of the six pipes always needs a dedicated instruction. e.g.:

SF+vec2+vec2+BR: RCP+MAD+MAD+ADD+ADD+LT

vec4+scalar+NOP: MAD+MAD+MAD+MAD+MAD+NOP
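
The two examples above amount to padding every bundle out to six slots. A small sketch of that idea, assuming the 1+4+1 layout from this post; it does not check which ops are legal in which slot, and the function name is invented:

```python
SLOTS = 6  # SF/MAD lane, four MAD lanes, branch lane (per the layout above)

def pack_bundle(alu_ops, branch_op=None):
    """Pad a list of up to five ALU ops, plus an optional branch op, to six slots."""
    if len(alu_ops) > SLOTS - 1:
        raise ValueError("at most five ALU ops fit in one bundle")
    padded = list(alu_ops) + ["NOP"] * (SLOTS - 1 - len(alu_ops))
    padded.append(branch_op if branch_op is not None else "NOP")
    return "+".join(padded)

print(pack_bundle(["RCP", "MAD", "MAD", "ADD", "ADD"], "LT"))  # RCP+MAD+MAD+ADD+ADD+LT
print(pack_bundle(["MAD"] * 5))                                # MAD+MAD+MAD+MAD+MAD+NOP
print(pack_bundle(["MUL"]))                                    # MUL+NOP+NOP+NOP+NOP+NOP
```

The last line shows the cost of the scheme: a lone scalar op still occupies a full six-slot word.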

Quote:

Option 2 seems the logical thing to do, but then why group those ALUs, and it would be effectively just as bad as 1 in bandwidth requirements. Effectively, 1 and 2 are the same.

I'm not sure what kind of bandwidth you're thinking of here, to be honest.

Quote:

Option 3 is what I expect the G80 to do. It offers interesting possibilities for the R600, but would be hard to schedule. And then why not go fully scalar? I expect a bit of this, but very limited.

Option 4 would make things the most efficient. It is like 3, in that instructions can take multiple clocks and/or do two things sequentially (like first calculating and then clamping/modifying/masking), and it would go well with a model that uses slots to run batches for each thread. I.e.: each thread has a fixed number of instruction slots (say, 8) and a minimum number of clocks (say, 4) for each batch to run. When done, it can switch to the next thread or continue the current one. A texture lookup would terminate the batch and leave a bubble if there were too few ops to fill the slots/clocks. But it would be very hard to implement, and would demand a very complex scheduler and compiler.

Since R300, texture operations automatically bound a clause of ALU instructions. So the scheduler (don't forget R300 has asynchronous texturing) issues a lump of instructions up to the point, known in advance, at which the texture operation is submitted.

So the only time a bubble should be incurred is when the shader unit has run out of batches to issue due to some horrid combination of texturing latency (e.g. with multiple levels of dependency) and/or dynamic branching.

So, normally, clauses of code will be fed into the ALU pipeline end-to-end, no bubbles.
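
The clause idea can be illustrated by splitting a flat instruction list at every texture or branch op. This is only a sketch of the concept; the opcode names are illustrative rather than real mnemonics, and keeping the boundary op at the end of its clause is an arbitrary choice:

```python
def split_into_clauses(instructions, boundary_ops=("TEX", "BR")):
    """Split a flat instruction list into clauses ended by a boundary op.

    The boundary op itself is kept as the last entry of its clause.
    """
    clauses, current = [], []
    for op in instructions:
        current.append(op)
        if op in boundary_ops:
            clauses.append(current)
            current = []
    if current:
        clauses.append(current)
    return clauses

# Hypothetical shader: ALU work interleaved with two texture fetches.
shader = ["MAD", "MAD", "TEX", "MUL", "ADD", "MAD", "TEX", "MAD"]
for clause in split_into_clauses(shader):
    print(clause)
```

Because the boundaries are known when the shader is compiled, the scheduler can line up the next batch's clause while the current one waits on its texture result.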

Quote:

Option 5 is probably the way to go. It is like 4, but much better to manage: you only have to move complete blocks around, that hold all you need to execute a single run. Reasonably easy to implement. Then again, you would not be able to do more than would fit inside a single block, and you would still move too much data around if you're only going to do 4 vec4 MULs or such in a single run.

Not sure why there'd be "too much data".

Quote:

Options A and B both have their strengths and weaknesses as well. It would depend on the option above used what would make the most sense. For option 5, I would put all the constant data needed in the instruction block. And most likely all the other data (registers, flags and texture fetches) as well. It also makes it much easier to buffer all that in a very fast local buffer inside or next to the ALU block.

The R600 diagram actually puts the constant cache alongside the instruction cache, which is a clear hint that they're running side-by-side as required by the shaders.

Quote:

So, 1 would fit the available data the best, 4 would probably be the most efficient, and 5 would be almost as good as 4, but much easier to implement and schedule. And in a sense, it is just about the opposite of how (I think) the G80 does it.

Hope you don't mind the fact I've just rambled on, rather than trying to construct a meaningful scenario that answers your questions.

Jawed

Rangers 11-May-2007 04:24

Quote:

Originally Posted by leoneazzurro (Post 984356)
Yes, what I see is that R600 is a card that on paper has way more power than the X1950XTX: 25-30% more shader power, texturing power unknown but likely to be at least 10-15% more (if they kept the same R580 units, but they say they are improved), beefed-up ROPs and the bandwidth to feed them.
So, if it performs on average on par with the X1950 XTX, it can only be that

1) ATI made some terrible mistake in designing R600, and there are bottlenecks and chip problems impairing the performance (i.e. trying to put too many features on it but missing some resource in one or more fundamental points), or
2) making a performance driver for R600 is really hard, with very big difficulties due to co-issue performance penalties, practically needing the driver to be optimized heavily for each game, or
3) both

And what makes me wonder were the "Preliminary" watermarks all around...

It doesn't perform like an X1950XTX..it performs more like 10-15% above that.

An 8800GTS is stronger than an X1950XTX, by a good deal in many cases. They're not equivalent..

I really just do not think R600 is shader bottlenecked..there's just no way that makes any sense. That's not where we should be looking..

BTW, for those saying ATI can rally with a "good mid-range", I hate to tell you this, but the 8800GTS, which R600 competes with, IS mid-range. They are $240, without Nvidia even trying to get the price down..

R600 IS mid range..

Further, they're only going to cut TMUs from here to keep an arbitrary ratio..so it's going to be impossible for them to have a great low-mid part. Competitive, maybe, since 8600 is no great shakes..but it's virtually impossible for it to be great.

Moloch 11-May-2007 04:59

Quote:

Originally Posted by Rangers (Post 984754)
It doesn't perform like an X1950XTX..it performs more like 10-15% above that.

An 8800GTS is stronger than an X1950XTX, by a good deal in many cases. They're not equivalent..

I really just do not think R600 is shader bottlenecked..there's just no way that makes any sense. That's not where we should be looking..

BTW, for those saying ATI can rally with a "good mid-range", I hate to tell you this, but the 8800GTS, which R600 competes with, IS mid-range. They are $240, without Nvidia even trying to get the price down..

R600 IS mid range..

Further, they're only going to cut TMUs from here to keep an arbitrary ratio..so it's going to be impossible for them to have a great low-mid part. Competitive, maybe, since 8600 is no great shakes..but it's virtually impossible for it to be great.

240???
http://www.newegg.com/Product/Produc...iption=8800GTS

Your continual ATI bashing is becoming tiresome btw.
You should have your name changed to "doomsayerdaamnit" or some such :wink:

BRiT 11-May-2007 05:51

Quote:

Originally Posted by radeonic2 (Post 984764)

The 8800 GTS 320 Meg cards can be had for $240 with specials. The 640 Meg cards cost a bit more.

silent_guy 11-May-2007 05:51

Quote:

Originally Posted by Frank (Post 984440)
Option 1 would fit the picture presented, but would use a significant amount of unneeded bandwidth from the ringbus. Those VLIWs would be quite wide, and they would require you to issue 6 separate instructions (inside the VLIW one) when a short one would do (like, with a conditional vec4 + clamp/modifier op). But it would ease the work of the compiler / scheduler.

I don't see how the local instruction storage organization has any influence on bandwidth? My guess is that the combined instruction words are either 64 or 128 bits wide, fetched as such from external memory and stored together as 1 VLIW or as separate words, one for each ALU/BR, depending on the implementation. Assuming a 4 cycle rotation, instruction fetch scheduling and distribution to the decoders shouldn't be in the critical path, so my bet would be that the instruction words are stored together, since that's more area efficient.
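
A back-of-the-envelope version of that guess: one combined word of 64 or 128 bits every 4 clocks. The 740 MHz core clock is borrowed from figures mentioned elsewhere in this thread and is an assumption, as are the function name and the choice to count a single ALU block:

```python
def fetch_bandwidth_gb_s(word_bits, clocks_per_word=4, clock_mhz=740):
    """Instruction fetch rate for one ALU block in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_clock = word_bits / 8 / clocks_per_word
    return bytes_per_clock * clock_mhz * 1e6 / 1e9

for bits in (64, 128):
    print(bits, "bit words:", round(fetch_bandwidth_gb_s(bits), 2), "GB/s")
```

That comes to about 1.48 GB/s and 2.96 GB/s respectively under these assumptions, and the rate depends on the word width and rotation, not on whether the words are stored together or separately.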

Quote:

Originally Posted by Jawed (Post 984747)
I think I saw somewhere 512-instruction slots. So if a program is longer than that, then the instructions are paged-in as needed, I presume. It's possible for a single batch to run a clause of code that's hundreds of instructions in length - a clause being bounded by texturing instructions (or a branch). Obviously, a clause can straddle instruction pages.

Why not an L1 instruction cache instead of a paging mechanism? Probably as easy to implement as a paging mechanism (though I haven't thought through the consequences of multiple threads), and with less chance of running into freak performance pitfalls due to page straddling?

Rangers 11-May-2007 06:05

Quote:

Originally Posted by radeonic2 (Post 984764)
240???
http://www.newegg.com/Product/Produc...iption=8800GTS

Your continual ATI bashing is becoming tiresome btw.
You should have your name changed to "doomsayerdaamnit" or some such :wink:

I'm not ATI bashing, I'm R600 bashing..and not even that until the last few days, as it begins to become clear that, well, this product is lacking. I was hoping for a great performing part up until very recently. And I still can't believe it draws 220+ watts for that. That's the real kick in the pants here.

And yeah, I fibbed a bit on the 8800GTS, but only a bit. The cheapest on newegg was $260 after rebate, $280 before, with an average price of around $300. ZZF might well have something cheaper though.

nelg 11-May-2007 06:24

200+ pages and the only conclusion I can draw is that Jawed has spent more time on the R600 than ATI.

tEd 11-May-2007 06:30

Quote:

Originally Posted by nelg (Post 984775)
200+ pages and the only conclusion I can draw is that Jawed has spent more time on the R600 than ATI.

:lol:

he's a machine.

Silent_Buddha 11-May-2007 06:35

Quote:

Originally Posted by Rangers (Post 984774)
I'm not ATI bashing, I'm R600 bashing..and not even that until the last few days, as it begins to become clear that, well, this product is lacking. I was hoping for a great performing part up until very recently. And I still can't believe it draws 220+ watts for that. That's the real kick in the pants here.

And yeah, I fibbed a bit on the 8800GTS, but only a bit. The cheapest on newegg was $260 after rebate, $280 before, with an average price of around $300. ZZF might well have something cheaper though.

You may not think you're ATI bashing, but your choice of words and the tone they convey sure make it seem like you are.

And apparently anyone that actually has a card and is under NDA and has actually tested the power draw... Well, none of them apparently have come even close to 220+ watts unless doing some very serious overclocking. From the little scraps that have come out it would seem that the GDDR3 version of the card draws around 175-190 watts when not overclocked. Although that would be an absolutely amazing piece of engineering if it's running at 220+ watts normally at 740 MHz and it can easily overclock another 100 MHz without drawing more than 5 more watts of power. ;)

And calling doom and gloom before anyone even remotely reputable has published a review?

By that same token, the 8800 GTX was an absolute and abject failure also, right? Since I seem to recall some "benches" (to be kind) that were floating around the net before it came out that didn't paint a rosy picture.

Am I saying R600 is going to be a huge success and kick the pants off the competition? Nope. Am I saying it's a failure before it's been properly reviewed based on some incredibly shoddy "benching" (to be kind) that's been posted? Nope.

Am I waiting until I can actually see a proper review done with proper drivers in a properly controlled environment before making a decision about what video card I will buy? You bet'cha.

Then again if neither company delivers a stable driver (stability first, performance second for me) in Vista 64, then neither will get my money.

That said, it's only 4 days until NDA expires. Think you can hold onto your britches and avoid the whole sky is falling routine until then? :grin: And if the R600 totally falls on its face and bursts into flames in some reviewer's hands, you can feel free to say you told me so and that the world is coming to an end.

Regards,
SB

Kaotik 11-May-2007 07:02

Quote:

Originally Posted by Rangers (Post 984774)
I'm not ATI bashing, I'm R600 bashing..and not even that until the last few days, as it begins to become clear that, well, this product is lacking. I was hoping for a great performing part up until very recently. And I still can't believe it draws 220+ watts for that. That's the real kick in the pants here.

And yeah, I fibbed a bit on the 8800GTS, but only a bit. The cheapest on newegg was $260 after rebate, $280 before, with an average price of around $300. ZZF might well have something cheaper though.

If it draws 220+ watts, how come Macci can play TDU fine on an A64 X2 6000+, 580X mobo, 2GB RAM, 4xHDD and R600 with a 430W Antec NeoHE PSU, and Sampsa has so far measured a max consumption of 298 watts for the whole system (though in that case the system specs are unknown)?
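
The argument is a simple subtraction: whatever the card draws has to fit inside the whole-system figure. A sketch using the two numbers from this post; whether the 298 W was read at the wall or is a DC figure is not stated, so `psu_efficiency` is a hypothetical parameter and 0.8 is only an example value:

```python
def remaining_for_rest_of_system(total_system_w, card_w, psu_efficiency=1.0):
    """Watts left for everything except the graphics card.

    psu_efficiency converts a wall-socket reading to DC load; 1.0 means the
    total is already a DC figure. Which one applies here is not stated.
    """
    return total_system_w * psu_efficiency - card_w

print(remaining_for_rest_of_system(298, 220))        # 78.0
print(remaining_for_rest_of_system(298, 220, 0.8))   # wall reading at an assumed 80% efficiency
```

Even in the generous case only 78 W would be left for the CPU, board, memory and drives, which is the point of the question.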


All times are GMT +1.