Huddy says "R600"
Well then, finally put that one to bed, R600 it is.
http://www.xbitlabs.com/news/video/display/20060525104243.html
trinibwoy
25-May-2006, 21:06
Mr. Huddy said that the Xbox 360 game console, which sports the ATI-developed Xenos graphics core with unified shader architecture and 48 shader processors, loses 20% to 25% performance in pixel-shader limited games when its graphics chip is configured as non-unified, e.g., 16 processors work strictly on vertex shaders, whereas 32 are assigned for pixel shaders.
I'm assuming that those numbers don't tell us much. Performance on a unified architecture after dedicating flexible units to specific tasks can't really be used as an accurate indicator of non-unified performance with a similar configuration (16/32).
TurnDragoZeroV2G
25-May-2006, 21:07
R600... heh. Needs to get me some of that. And Conroe, I'm thinking, at this point.
On a side note, if we can take the 20-25% losses on Xenos going from dynamic allocation to fixed 32/16 ratio to be accurate and realistic... then that's pretty impressive. Of course, he does have a job to do at the same time.
Well, I expect the next Nvidia chip to also be the fastest DX9 chip they've ever built. :lol:
Huddy using a comparison on Xenos of 32/16 (2/1) is interesting.
Are they back to claiming 16 shaders for R580 then? Because if not, 48/8 is obviously much higher than 2/1.
SugarCoat
25-May-2006, 21:15
R600 series, how many times do i have to say it!
This was very interesting (if true).
Mr. Huddy said that the Xbox 360 game console, which sports the ATI-developed Xenos graphics core with unified shader architecture and 48 shader processors, loses 20% to 25% performance in pixel-shader limited games when its graphics chip is configured as non-unified, e.g., 16 processors work strictly on vertex shaders, whereas 32 are assigned for pixel shaders.
R600 series, how many times do i have to say it!
And as soon as you finally admit you've been a paid ATI employee all along, then we'll take it as official!
I keed. :lol:
Release the damn thing already.
Getting seriously bored...
Jawed
JHoxley
26-May-2006, 00:08
Is R600 a D3D10 part or not?
My vague understanding was that Nvidia's G80 and ATI's R600 were the next-gen/D3D10 parts... but that linked article keeps referring to it as a DX9 part?
Cheers,
Jack
Dave Baumann
26-May-2006, 00:12
Yes, R600 is DX10 based (this article, I believe, was formed from an initial DX10 briefing that ATI are doing), however, the reality is that R600 and G80 will see far more DX9 code than they will DX10, so they have to be consummate DX9 performers as well as having DX10 capabilities.
Additionally, ATI says its first DirectX 10 graphics processor – code-named R600 – will have unified shader micro-architecture, which will allow to boost performance even further compared to currently existing micro-architectures and . The performance improvements are conditioned by a special built-in arbiter processor, which will “tailor” rendering of every frame across the 64 unified shader pipelines. Such an approach, according to ATI, allows to utilize all execution engines within the chip, while in traditional architectures – where pixel shaders and vertex shaders are calculated by dedicated units – some of the arithmetic processors may stand idle waiting for others to complete their tasks.
Hmm !!??? Where that 64 pipes comes from ??! Actual confirmation from ATI !?!?
Edit; Quote is from this article:
http://www.xbitlabs.com/news/video/display/20060525104034.html
Chalnoth
26-May-2006, 00:50
Hmm !!??? Where that 64 pipes comes from ??!
64 is about right for the R600. Bear in mind that the R580 has 48 pixel shader "pipelines" alone. These pipelines individually don't do as much as nVidia's pipelines on the G7x, however, and the number of texture units is probably only going to be 16-32.
It remains to be seen how well this will compare to nVidia's offering, about which we know nearly nothing (except that from their public statements, it seems unlikely to be a unified architecture).
The initial article from today says:
according to sources familiar with the plans of ATI Technologies as well as some media reports, ATI R600 will have 64 unified shader processors – an unprecedented number so far, 16 texture units – inline with today’s GPUs, clock-speed beyond 650MHz and support for high-speed GDDR4 memory controller.
So he may have just "assumed to be in evidence" his initial report.
I'm guessing 64 is the minimum we'd be looking at, or you might have scenarios where it couldn't hand R580 the licking you would want it to be able to do.
Could it be more than that? Dunno.
Chalnoth
26-May-2006, 00:58
I'm guessing 64 is the minimum we'd be looking at, or you might have scenarios where it couldn't hand R580 the licking you would want it to be able to do.
Could it be more than that? Dunno.
Right. It really depends upon how much logic they have to add just to implement the unification. If not much, and if they're producing it on a smaller process than the R580, then it could have significantly more pipelines.
If, however, it takes quite a bit more logic, and if they're still doing it on 90nm, or even just 80nm (which isn't much smaller), then 64 seems more likely (less is possible, but, as you said, the performance wouldn't be there, and thus such a situation seems unlikely).
Edit:
Since the current architecture seems to be 70% along the road to unification already, I actually suspect it won't take that many transistors. But just bear in mind that nVidia is currently sitting pretty with an architecture that only requires ~55% the die area of ATI's, and thus can add a whole lot more core logic before even reaching parity on die size. Anyway, I think it's going to be a very interesting fall (or perhaps early next year, if we're unlucky).
I'm holding out for 96 Xenos style pipes and 32 TMU pipes. Gotta keep that 3:1 ratio :razz: and have some use for 100GB/s of bandwidth...
Jawed
kyetech
26-May-2006, 01:03
But I guess it's no surprise there's not *that* many more pipes than the previous gen, seeing as each pipe needs to be wider to support an SM4 pipeline, no?
So really the budget is blown on redesign rather than additional pipes.
Chalnoth
26-May-2006, 01:06
I personally don't expect SM4 to require all that many more transistors over SM3. I think that the real reason we probably won't see more than 64 pipelines is that ATI's already got a big chip, and probably won't want to go any bigger for the next gen.
kyetech
26-May-2006, 01:10
I personally don't expect SM4 to require all that many more transistors over SM3. I think that the real reason we probably won't see more than 64 pipelines is that ATI's already got a big chip, and probably won't want to go any bigger for the next gen.
what about nvidia then?
Chalnoth
26-May-2006, 01:13
what about nvidia then?
Well, right. I expect them to have to spend quite a few more transistors to get branching performance up there. But they can grow their die size by a whole hell of a lot more, too (the GeForce 7900 GTX is just 55% the size of the Radeon X1900). So we'll have to wait and see.
It's true that the new integer stuff in SM4 makes the pipeline more complex, but compared to R580 I dare say a Xenos pipeline is simpler (vec4 MAD + scalar special function versus vec4 MAD or vec3 + scalar special function PLUS vec4 ADD, the "mini-ALU").
So I think it's better to start with a Xenos pipeline and add integer functionality to it.
Jawed
ShootMyMonkey
26-May-2006, 01:22
Mr. Huddy said that the Xbox 360 game console, which sports the ATI-developed Xenos graphics core with unified shader architecture and 48 shader processors, loses 20% to 25% performance in pixel-shader limited games when its graphics chip is configured as non-unified, e.g., 16 processors work strictly on vertex shaders, whereas 32 are assigned for pixel shaders.
I'm not sure this is really all that meaningful a statement. I mean, it's about a specific operational mode where 1/3 of the pipes are dedicated to vertex processing and the remaining 2/3 are dedicated to fragment processing. So... if you've got something that's pixel shader limited, and you reduce the maximum number of possible pixel shader units by 33%, you lose up to 25% of your performance... hmmmm... Nah, couldn't be.
Any examples of these particular pixel-shader limited games that gave them these results?
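For reference, the arithmetic behind that objection can be sketched in a few lines. This is only an idealised upper bound, assuming a workload that is 100% pixel-shader limited and scales perfectly with the number of units; the function name is hypothetical and nothing here comes from ATI.

```python
def fixed_split_loss(total_units, pixel_units):
    """Upper-bound fractional throughput loss for a purely pixel-shader-limited
    workload when only `pixel_units` of `total_units` may run pixel shaders.
    Assumes perfect scaling with unit count (an idealisation)."""
    return 1.0 - pixel_units / total_units


# Xenos figures from the article: 48 unified units, 32 pinned to pixel work.
loss = fixed_split_loss(48, 32)
print(f"Worst-case loss with a fixed 32/16 split: {loss:.1%}")  # 33.3%
```

Under those assumptions the ceiling is about 33%, so a quoted 20% to 25% loss is what you would expect if the test games were mostly, but not entirely, pixel-shader bound.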
Someone in IRC speculated that this 2/1 test was probably driven by the granularity with which they could mess with Xenos pipes, in groups of 16.
So, "when ps limited" is probably still valid, and that there is some significant advantage gained when that is true, is probably also still good. . . tho I think I'd take the actual percentages with a grain of salt.
Chalnoth
26-May-2006, 03:01
I'm not sure this is really all that meaningful a statement. I mean, it's about a specific operational mode where 1/3 of the pipes are dedicated to vertex processing and the remaining 2/3 are dedicated to fragment processing. So... if you've got something that's pixel shader limited, and you reduce the maximum number of possible pixel shader units by 33%, you lose up to 25% of your performance... hmmmm... Nah, couldn't be.
I'd be rather surprised if the R600 ever made use of operation modes with fixed vertex/pixel ratios.
Nelsieus
26-May-2006, 03:47
Yes, R600 is DX10 based (this article, I believe, was formed from an initial DX10 briefing that ATI are doing), however, the reality is that R600 and G80 will see far more DX9 code than they will DX10, so they have to be consummate DX9 performers as well as having DX10 capabilities.
Which is what most early Vista games will be. Primarily DX9 with DX10 features that gamers can enable provided they have a DX10 system. Or at least that's my understanding of it.
Chalnoth
26-May-2006, 04:39
Except these days I think there's less reason to have any real visual difference between a SM3 DX9 path and a SM4 path than there has been with feature changes in the past. Hell, even the move from SM2 to SM3 has the difference of floating point blending. The real difference should be in performance, and DX10 might well deliver 20%-30% improvement to start, from everything we've been hearing, and more once developers really start making use of the featureset.
Indeed conroe and R600 FTW :wink:
sonyps35
26-May-2006, 04:52
I'm not sure this is really all that meaningful a statement. I mean, it's about a specific operational mode where 1/3 of the pipes are dedicated to vertex processing and the remaining 2/3 are dedicated to fragment processing. So... if you've got something that's pixel shader limited, and you reduce the maximum number of possible pixel shader units by 33%, you lose up to 25% of your performance... hmmmm... Nah, couldn't be.
Exactly. As stated, it seems to mean nothing. Would have been nicer if he said unified gained 25% performance over non-unified, period, or something like that.
Release the damn thing already.
Getting seriously bored...
Jawed
You'll be waiting a long time. This sucker may not release til January. I expect it a good deal earlier, but sure not anytime soon.
The 16 texture units are a disappointment if true.
Also it's always confusing now, are they talking about 64 pipes or 64 Xenos style ALUs?
The latter would probably have even less shader power than R580! However, are pipes even possible in a unified architecture?
Mintmaster
26-May-2006, 05:27
Well, right. I expect them to have to spend quite a few more transistors to get branching performance up there.
I don't. NVidia will probably go for 40-48 G70 style pipes, trimmed a bit further so that less common and more expensive texturing modes (like projective textures, cube maps) run at half rate. Maybe 32 triple issue pipes is another possibility. Either way, IMO it will kill a 64-pipe R600 in DX9 games and early DX10 games that are effectively DX9.
NVidia does what's best for their business. Branching performance that will go mostly unused won't be a high priority for them. Remember all the non-branching RSX shaders from PS3 development (and cross-platform titles also) won't exactly speed up adoption of DB. At best I think NVidia will eliminate the latency for branching instructions, and there will still be only one thread per quad.
Ailuros
26-May-2006, 05:32
I don't. NVidia will probably go for 40-48 G70 style pipes, trimmed a bit further so that less common and more expensive texturing modes (like projective textures, cube maps) run at half rate. Maybe 32 triple issue pipes is another possibility. Either way, IMO it will kill a 64-pipe R600 in DX9 games and early DX10 games that are effectively DX9. NVidia does what's best for their business. Branching performance that will go mostly unused won't be a high priority for them.
I don't see G80 being another NV4x "refresh" but that's just me.
Chalnoth
26-May-2006, 07:06
NVidia does what's best for their business. Branching performance that will go mostly unused won't be a high priority for them. Remember all the non-branching RSX shaders from PS3 development (and cross-platform titles also) won't exactly speed up adoption of DB. At best I think NVidia will eliminate the latency for branching instructions, and there will still be only one thread per quad.
While I don't think that nVidia will bring their branching granularity up to the level of the R580, I think they'll make significant improvements. Also, I believe I had heard that the G7x was already multithreaded. But I realize I'd have to find a source on that....I'm just too tired at the moment.
dizietsma
26-May-2006, 07:57
Whatever it ends up with I get the feeling from reading the article that Ati will increase the ratio of shaders to texture units from 3:1 to 4:1.
Whatever it ends up with I get the feeling from reading the article that Ati will increase the ratio of shaders to texture units from 3:1 to 4:1.
Dunno about that.
Having said that, I would not mind more texture power (assuming more BW), but I would not want to reduce the ratio of ALU : TEX.
I don't see them having less than 16, and actually assuming GDDR4 finally makes its appearance, the quote above would suggest more than 16, however many shaders there are to go with it.
Chalnoth
26-May-2006, 10:15
Increasing the shader to texture ratio makes a whole lot of sense, since now the shaders will be working on a mix of vertex and pixel data, where the vertex data will almost always make use of zero texture instructions. Unless, that is, ATI comes to the conclusion that the shader to texture ratio of the R580 was overdone, even when looking forward.
Having said that, I would not mind more texture power (assuming more BW), but I would not want to reduce the ratio of ALU : TEX.
I don't see them having less than 16, and actually assuming GDDR4 finally makes its appearance, the quote above would suggest more than 16, however many shaders there are to go with it.
Yep, 16 TMUs, along with 16 point samplers (vertex fetch) just seems under-powered if you have ~100GB/s of bandwidth available.
Especially as render target sizes (screen resolution) aren't going anywhere significant - 1280x1024 and 1600x1200 are still where the vast majority of enthusiast gamers are topping out, with a few 1920x1080 front-runners (I can't see ATI planning for 2560x1600 gamers).
Plus, 16 extra TMUs would only demand ~16GB/s out of the extra bandwidth brought by GDDR4, leaving 25GB/s+ for render targets and streamout.
Jawed
caboosemoose
26-May-2006, 12:17
Especially as render target sizes (screen resolution) aren't going anywhere significant - 1280x1024 and 1600x1200 are still where the vast majority of enthusiast gamers are topping out, with a few 1920x1080 front-runners (I can't see ATI planning for 2560x1600 gamers).
Sorry, I'd have to disagree with that. 1,600 x 1,200 is approaching the status of a non-resolution in the real world and 1,920 x 1,080 is a non-resolution on the PC. At the high end, it's all about 1,680 x 1,050 and 1,920 x 1,200, with the prices on 20-inch widescreen panels having already gone through the floor and 23 and 24-inch panels rapidly becoming mainstream. By comparison, sales of 20 and 21-inch 4:3 panels are pretty insignificant - and that's reflected in the higher prices typically charged for these panels. By the time R600 is out, you'll easily be able to pick up a 23-24-inch monitor for £500 or less - which given the relative longevity of a monitor compared with a new high end video card looks very cheap.
I'd also predict that 2,560 x 1,600 will be more important than you think. Already you can pick up the Dell 3007 for about £1,000. That unit is selling like hot cakes and by the end of the year there's no doubting it will be more affordable. Plus, both BenQ and Samsung are doing their own 30-inchers soon (possibly other companies, too) so that means more competition and even lower prices. I think performance at 2,560 x 1,600 is something that ATI and NVIDIA will have to take seriously for flagship parts.
Dave Baumann
26-May-2006, 12:23
Hi res panels are already important in their thinking - ATI's on-chip HierZ is sufficient to span 2560x1600 resolutions with no loss of efficiency, while G71's ZCULL copes with a res or so below that (but in excess of 16x12).
My point is that screen resolutions are still topping out around 2MP, where 50GB/s is generally adequate. When GDDR4 brings 90-100GB/s (I hope it doesn't come in at the bottom of the range ~75GB/s) the bandwidth available will be more than adequate to afford an increased texturing rate, in addition to rendering pixels faster.
2560x1600 (4MP) currently requires alternate frame rendering for a safe 60fps, seemingly consuming 32 TMUs, 32 ROPs and ~93GB/s of bandwidth.
http://www.beyond3d.com/reviews/ati/r580/index.php?p=24a
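As a quick check on the "2MP" and "4MP" figures, here is a small sketch that just multiplies out the resolutions mentioned in this thread. The helper name is made up for illustration; the only inputs are the resolutions posted above.

```python
def megapixels(width, height):
    """Pixel count of a render target in millions of pixels."""
    return width * height / 1e6


# Resolutions mentioned in the thread.
for w, h in [(1280, 1024), (1600, 1200), (1920, 1200), (2560, 1600)]:
    print(f"{w}x{h}: {megapixels(w, h):.2f} MP")

# How much more work per frame 2560x1600 is than 1600x1200, pixel for pixel.
ratio = megapixels(2560, 1600) / megapixels(1600, 1200)
print(f"2560x1600 vs 1600x1200: {ratio:.2f}x the pixels")
```

So 1600x1200 is just under 2MP and 2560x1600 is a little over 4MP, i.e. roughly double the pixels per frame, which is in line with the point about needing two cards' worth of TMUs, ROPs and bandwidth at that resolution.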
---
Now we get talk of 50% faster clocks:
http://www.theinquirer.net/?article=31968
I suppose that's an alternative to more TMUs.
Jawed
trumphsiao
26-May-2006, 13:22
I heard Charter is in charge of G80 Wafer production instead of TSMC.
Now we get talk of 50% faster clocks:
http://www.theinquirer.net/?article=31968
I suppose that's an alternative to more TMUs.
Jawed
And G80 in summer or early autumn, and no R600 until Q4. I doubt the first pretty heavily, based on NV's own public statements. I suppose the second is possible, tho it would be disappointing.
I don't. NVidia will probably go for 40-48 G70 style pipes, trimmed a bit further so that less common and more expensive texturing modes (like projective textures, cube maps) run at half rate.
Those do already run at half rate, IIRC. At least for the address calculation, not necessarily for the filtering.
Yep, 16 TMUs, along with 16 point samplers (vertex fetch) just seems under-powered if you have ~100GB/s of bandwidth available.
Especially as render target sizes (screen resolution) aren't going anywhere significant - 1280x1024 and 1600x1200 are still where the vast majority of enthusiast gamers are topping out, with a few 1920x1080 front-runners (I can't see ATI planning for 2560x1600 gamers).
Plus, 16 extra TMUs would only demand ~16GB/s out of the extra bandwidth brought by GDDR4, leaving 25GB/s+ for render targets and streamout.
Bandwidth demands of the TMUs depend on their clock speed. And I expect the first cards with GDDR4 to have less than 90GB/s of bandwidth available.
I don't think increasing the resolution makes a huge difference one way or the other.
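To make the clock-speed dependence concrete, here is a rough sketch of how a TMU bandwidth estimate scales. The model is deliberately naive: every TMU issues one fetch per clock, and each fetch costs a fixed average number of bytes from memory after caching and compression. That bytes-per-fetch value is purely an illustrative assumption (1.5 is picked only because it lands near the ~16GB/s figure quoted above at 650MHz); it is not a number from ATI or from anyone in this thread.

```python
def tmu_bandwidth_gbs(tmus, clock_mhz, bytes_per_fetch):
    """Naive peak memory bandwidth (GB/s) consumed by texture fetches.

    Assumes one fetch per TMU per clock and a fixed average memory cost
    per fetch (`bytes_per_fetch`), which is a hypothetical parameter.
    """
    fetches_per_second = tmus * clock_mhz * 1e6
    return fetches_per_second * bytes_per_fetch / 1e9


# Same 16 extra TMUs at the rumoured 650MHz and at a speculative 1000MHz.
for clock in (650, 1000):
    print(f"16 TMUs @ {clock}MHz: {tmu_bandwidth_gbs(16, clock, 1.5):.1f} GB/s")
```

The point is only that the estimate is linear in clock: whatever the real per-fetch cost is, a 50% faster clock means 50% more texture bandwidth for the same TMU count.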
I don't think I'm buying the 1000mhz R600, at least as a mass edition. Maybe engineer-over-a-beer "yeah, we got one up to that".
And I don't think I'm buying 65nm either before the end of the year.
So what else would be in reach at 80nm besides 64/16? I'd think 96/32 would be too big. 80/24? That looks butt-ugly to "powers of 2" snobs, but maybe? 96/24 still too big?
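For anyone keeping score on the ratios, this sketch reduces each of the shader/texture-unit configurations floated above to its simplest ALU:TMU ratio. The configurations are the ones from this post; the helper name is hypothetical.

```python
from fractions import Fraction


def alu_tmu_ratio(alus, tmus):
    """Reduce an ALU/TMU configuration to its simplest ratio, e.g. '4:1'."""
    r = Fraction(alus, tmus)
    return f"{r.numerator}:{r.denominator}"


# Configurations speculated about in the thread.
for alus, tmus in [(64, 16), (96, 32), (80, 24), (96, 24)]:
    print(f"{alus}/{tmus} -> {alu_tmu_ratio(alus, tmus)}")
```

64/16 and 96/24 both come out at 4:1, 96/32 keeps R580's 3:1, and 80/24 is the odd one out at 10:3.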
trumphsiao
26-May-2006, 16:28
I don't think I'm buying the 1000mhz R600, at least as a mass edition. Maybe engineer-over-a-beer "yeah, we got one up to that".
And I don't think I'm buying 65nm either before the end of the year.
So what else would be in reach at 80nm besides 64/16? I'd think 96/32 would be too big. 80/24? That looks butt-ugly to "powers of 2" snobs, but maybe? 96/24 still too big?
My take on G80 and R600: G80 has 64 higher-clocked MIMD ALUs, while the lower-clocked R600 has a mammoth 128 SIMD ALUs. What they share is that both are "ALU:TMU = 4:1" DX10 designs.
My take on G80 and R600: G80 has 64 higher-clocked MIMD ALUs, while the lower-clocked R600 has a mammoth 128 SIMD ALUs. What they share is that both are "ALU:TMU = 4:1" DX10 designs.
If NV goes 4:1, then Dizzy is going to hurl himself from the roof of NV headquarters on to the hood of Jen-Hsun's car as it comes out of the parking garage. :lol:
trinibwoy
26-May-2006, 17:07
If NV goes 4:1, then Dizzy is going to hurl himself from the roof of NV headquarters on to the hood of Jen-Hsun's car as it comes out of the parking garage. :lol:
Dizzy = David Kirk ? :???:
Sorry. dizietsma (who sometimes refers to himself as 'Dizzy', which is where I picked it up). Loud critic of 3-1 (let alone 4-1), and I think I recall some rash promises from him of what he'd do if NV ever followed ATI down that path. :lol:
Tim Murray
26-May-2006, 18:35
The R580 interview (http://www.beyond3d.com/reviews/ati/r580/int/) is a nice source of information regarding "why is the ratio of ALU to TMU the way it is" and "what would you do given more bandwidth." Then there's the lovely Xenos article, which can answer all of your unified shader architecture questions.
Of course, what nobody's mentioned yet (except that one person ;) ) is that R600 will probably be the part that finally gets ATI into the workstation market. Considering how dependent workstation performance is upon VS, I don't know how G80 would be able to compete with R600. Couple that with the mythic OGL rewrite that was "going to take a few generations" about to hit the "few generations" mark, and R600 seems pretty likely to make a big splash there.
"gets ATI into the workstation market" is a bit harsh. . .they are doing okay-ish in the low end of that market, volume-wise, today. But suddenly having a part with 64 (or more) "VS" should certainly help quite a bit to get them into the mid/upper level of that market where the rich margins are.
Chalnoth
26-May-2006, 20:11
I don't think that the R600 will have anything that could inherently help them make further inroads to the workstation market. That market is mostly about the drivers, which have rather different requirements.
SugarCoat
26-May-2006, 21:01
---
Now we get talk of 50% faster clocks:
http://www.theinquirer.net/?article=31968
I suppose that's an alternative to more TMUs.
Jawed
Word of caution. They were all over the place with the R520 clockspeeds. From 500-800MHz if I remember correctly. A month or three and they'll probably say it's actually 800MHz or something.
trinibwoy
26-May-2006, 21:04
Sorry. dizietsma (who sometimes refers to himself as 'Dizzy', which is where I picked it up). Loud critic of 3-1 (let alone 4-1), and I think I recall some rash promises from him of what he'd do if NV ever followed ATI down that path. :lol:
Ah ok :smile:
Mintmaster
26-May-2006, 21:21
I don't see G80 being another NV4x "refresh" but that's just me.
I'm not saying it's going to be a refresh, I just think they'll stick with this dense ALU philosophy they have right now and forget about the so-called "ultrathreading", especially since they're not going unified. They'll still need lots of modifications for DX10. Maybe fast renderstate changes, integer support, bitwise ops, etc.
Also, I believe I had heard that the G7x was already multithreaded.
Yup, as I said, one thread per quad. ATI's strength in DB is from two sources: Granularity, which gives the really big gains; and "free" branching instructions. My guess is NVidia will adopt the latter, but not the former, since it costs too much compared to additional pipelines. I don't think granularity is a big issue for initial DB applications anyway, such as uber shaders, POM, and soft shadow maps. It's the really experimental stuff like ray tracing, GPGPU stuff, non-trivial physics, etc. that granularity is needed for, and those apps are well down the road.
Those do already run at half rate, IIRC. At least for the address calculation, not necessarily for the filtering.
Interesting. Maybe they could do 1/4 rate to save some more. Either way, I'm expecting NVidia to go for over 40 shader pipes with comparable performance to the current ones.
NVidia possibly made a tiny mistake this generation by keeping G71 small and reaping the profits. They won't do that next gen, and because they know they have a huge density advantage over ATI, they'll try to blow them out next round with a 300+ mm2 chip. I think 48 shader pipes at 80nm is doable.
Megadrive1988
26-May-2006, 21:57
"G80" is supposed to be the true next-gen chip from Nvidia, part of the NV5X family.
it's been 2 years since NV4x was released, it is time for a mostly fresh architecture.
in the older days of Nvidia, the G70/G71 would've simply been the NVx5 refresh
i.e. TNT2, NV15, NV25, NV35.
back on topic. the R600 will go through several refreshes/upgrades, I believe. there will be an R6XX refresh and the R7xx series will probably be, not a totally new architecture, but yet further refreshes/upgrades of the R600 technology, ala R350/R360, then R420/R480 then R520 and finally R580.
next totally new architecture from ATI beyond the R400/Xenos/R600 base technology? R800.
that's just purely my guess/speculation. yet, recall the 2004 interview with Dave Orton
http://www.beyond3d.com/interviews/daveorton/index.php?p=3
So the R600 family will mainly be centred primarily in the Valley and Orlando with a little bit from Marlborough, and
then the R800 would be more unified.
Tim Murray
26-May-2006, 22:27
Word of caution. They were all over the place with the R520 clockspeeds. From 500-800MHz if I remember correctly. A month or three and they'll probably say it's actually 800MHz or something.
True, but I doubt the transistor count will be anywhere close to the same kind of increase we saw from R480 to R520.
SugarCoat
27-May-2006, 00:16
True, but I doubt the transistor count will be anywhere close to the same kind of increase we saw from R480 to R520.
Considering unified pipes are smaller than the traditional (what is traditional anymore? ;)) I don't think they'll surpass the 400M mark by much if at all. One fact however is the cost of the part which will be quite a bit. It would be silly to blow yields for a ridiculous clock. If it's over 800MHz I'd be surprised.
What about Fast-14?
http://www.beyond3d.com/forum/showthread.php?t=8803
Jawed
Tim Murray
27-May-2006, 04:14
Well, if we're going into Silly-We've-Been-Speculating-For-Years-About-This-Land, then let me share a piece of inside info with you: G80 is based almost entirely on Gigapixel IP!
Well, if we're going into Silly-We've-Been-Speculating-For-Years-About-This-Land, then let me share a piece of inside info with you: G80 is based almost entirely on Gigapixel IP!
http://www.warp2search.net/modules.php?name=News&file=article&sid=18627
Ailuros
27-May-2006, 06:06
Well, if we're going into Silly-We've-Been-Speculating-For-Years-About-This-Land, then let me share a piece of inside info with you: G80 is based almost entirely on Gigapixel IP!
Which seems seriously dated for today's standards in the form it was up to 2000.
Ailuros
27-May-2006, 06:13
I'm not saying it's going to be a refresh, I just think they'll stick with this dense ALU philosophy they have right now and forget about the so-called "ultrathreading", especially since they're not going unified. They'll still need lots of modifications for DX10. Maybe fast renderstate changes, integer support, bitwise ops, etc.
Sensible speculation, but I have a totally different feeling right now. And yes it's just a feeling not based even on rumours.
Interesting. Maybe they could do 1/4 rate to save some more. Either way, I'm expecting NVidia to go for over 40 shader pipes with comparable performance to the current ones.
NVidia possibly made a tiny mistake this generation by keeping G71 small and reaping the profits. They won't do that next gen, and because they know they have a huge density advantage over ATI, they'll try to blow them out next round with a 300+ mm2 chip. I think 48 shader pipes at 80nm is doable.
"Shader pipes" = ALUs? (just to clarify that one).
Yes that sounds feasible, but depending how one really counts it's not that much more than on G7x; which might suggest significant architectural changes for the ALUs themselves and their relation to the rest of the chip's units.
To avoid misunderstandings how many TMUs do you expect?
Chalnoth
27-May-2006, 08:20
What about Fast-14?
For how long now have people been asking about this tech? My bet is that they're already using it. That, or they never will.
Considering unified pipes are smaller than the traditional (what is traditional anymore? ;)) I don't think they'll surpass the 400M mark by much if at all.
Why are unified pipes smaller than traditional ones?
Mintmaster
27-May-2006, 10:22
"Shader pipes" = ALUs? (just to clarify that one).
Like I said, I'm thinking they'll be somewhat comparable to G71's pipes, so not just ALUs. I'm guessing similar to the ALU + (ALU or TEX) kind of shader pipe they've got going right now.
To avoid misunderstandings how many TMUs do you expect?
I'm thinking 32 if they go triple issue or 40-48 if they stay dual issue. They'll be a bit more simplified from the current setup, maybe taking 4 cycles for FP16/I16/I32 filtering (I'm crossing my fingers for high precision integer filtering), initial AF setup, cube map calcs, etc. Maybe they won't provide enough register space to hit that peak most of the time in order to stay compact.
I know that sounds like overkill, but again this ties into my theory that they'll leverage the current setup that they've perfected. With this design, I don't see them saving much space by reducing the texture rate, since the address calcs are done so fast, though I may be underestimating the ease of fetching data from the texture cache.
Of course, I could be way off, and NVidia could decouple the TMUs like ATI did. Then they'd sort of be abandoning their current architecture, though, as your gut feeling dictates. My gut feeling says they won't go that route until they decide to go unified.
Mintmaster
27-May-2006, 10:26
Why are unified pipes smaller than traditional ones?
I think SugarCoat was comparing to NVidia's pipes and pre-R5xx (with TMU logic). We already know that ATI added a lot of math power in R580 with only 60M transistors. This, of course, assumes they didn't take anything out of R520 that was "overbuilt".
Chalnoth
27-May-2006, 10:50
They'll be a bit more simplified from the current setup, maybe taking 4 cycles for FP16/I16/I32 filtering (I'm crossing my fingers for high precision integer filtering), initial AF setup, cube map calcs, etc.
I don't think more than 2-cycle latency on FP16 makes sense.
Mintmaster
27-May-2006, 12:16
I don't think more than 2-cycle latency on FP16 makes sense.
At first I was thinking the same thing, but when you do things on a quad by quad basis you can slow things down quite a bit while still saving space. I was just throwing ideas around, really. Xenos does FP16 filtering by converting to 16.16 and doing integer filtering at 1/4 rate, I assume by cascading the filtering logic (the filtering weights remain 8-bit or less). Some clamping could happen since FP16 range is more than 2^32, but it's a decent compromise IMO.
I don't think more than 2-cycle latency on FP16 makes sense.
Why not? Considering that the most common case of texturing involves compressed textures and that G70 is completely bandwidth limited for 32bpt and above, I'd say it makes perfect sense.
Chalnoth
27-May-2006, 16:40
Well, if you just consider that the G7x already is an extremely efficient piece of hardware in terms of die space, why should nVidia simplify things further?
For how long now have people been asking about this tech? My bet is that they're already using it. That, or they never will.
IIRC it could only be used in a brand new architecture, so I would say it is R600 or never.
Chalnoth
27-May-2006, 18:19
Why? It's just an ALU layout library.
Kombatant
27-May-2006, 18:45
Well, if we're going into Silly-We've-Been-Speculating-For-Years-About-This-Land, then let me share a piece of inside info with you: G80 is based almost entirely on Gigapixel IP!
Actually G80 is really Rampage 2 - but shhh! don't tell anyone :lol:
Well, if you just consider that the G7x already is an extremely efficient piece of hardware in terms of die space, why should nVidia simplify things further?
To be able to add more units.
Extremely efficient compared to what? It's not efficient to have units that can rarely sustain their peak processing rates. Maybe the additional cost was low. Maybe it wasn't, but c&p design was cheap.
Chalnoth
27-May-2006, 19:34
Compared to their competitor.
EasyRaider
27-May-2006, 19:57
edit: never mind
Ailuros
27-May-2006, 22:59
Of course, I could be way off, and NVidia could decouple the TMUs like ATI did. Then they'd sort of be abandoning their current architecture, though, as your gut feeling dictates. My gut feeling says they won't go that route until they decide to go unified.
My gut feeling also says one step at a time ;)
Compared to their competitor.
That transistor advantage might be moot when they slap GDDR4 on the R580.
Chalnoth
28-May-2006, 12:36
That transistor advantage might be moot when they slap GDDR4 on the R580.
I don't see what that would have to do with anything. It's just a memory interface like any other.
Have we heard anything on ROPs for R600? Obviously they wouldn't be using the ones from Xenos. . . They "touched" the ones for R5xx, so one would think they'd want to leverage most of that work forward. . .but I'd think they're going to have to touch them again anyway for R600, certainly for de-coupling them since they won't be tied to quads anymore. Might anything else go in there as well since they have to revisit them anyway?
pjbliverpool
28-May-2006, 15:27
Have we heard anything on ROPs for R600? Obviously they wouldn't be using the ones from Xenos. . . They "touched" the ones for R5xx, so one would think they'd want to leverage most of that work forward. . .but I'd think they're going to have to touch them again anyway for R600, certainly for de-coupling them since they won't be tied to quads anymore. Might anything else go in there as well since they have to revisit them anyway?
4 samples per pixel in a single cycle like Xenos. It seems like this feature is of more use in the PC than a console anyway.
4 samples per pixel in a single cycle like Xenos. It seems like this feature is of more use in the PC than a console anyway.
Umh..are you sure? I think it makes more sense on Xenos (due to edram)
Have we heard anything on ROPs for R600? Obviously they wouldn't be using the ones from Xenos. . . They "touched" the ones for R5xx, so one would think they'd want to leverage most of that work forward. . .but I'd think they're going to have to touch them again anyway for R600, certainly for de-coupling them since they won't be tied to quads anymore. Might anything else go in there as well since they have to revisit them anyway?
The ROP "coupling" in R5xx and earlier is such an intrinsic aspect of rasterisation, scheduling, texture-caching, hierarchical-Z, colour-buffer-caching and Z-buffer-caching that it raises a lot of intrigue for me. It all hinges on screen-space tiling.
I suspect ATI will stick with screen-space tiling, which at least means that the last three items in that list could be neatly segregated, per tile.
And that's as far as I get - I've speculated loads about this stuff already, pondering how the width of the "arrays" in R600 might pan out (16-wide, or wider like in Xenos? or narrow, 1-quad) and then you can speculate about whether R600 is a monolithically scheduled unified architecture, i.e. 3:1 like Xenos, or whether it might consist of grouped arrays, e.g. 3:1 + 3:1 + 3:1 + 3:1 (which, as you can see, could easily retain the "coupled-ROP" architecture).
It's worth noting that Xenos is not wholly "unified" in terms of the content of its threads. It's not possible to schedule 96 threads of vertex shader work on Xenos: it's capped at 32 threads of vertex work and 64 threads of pixel work. (So when Xenos is wholly executing, say, vertex work, the pixel-thread specific queue and register file memories are actually unused.) This isn't usually a problem because with little texturing in vertex shaders, there's not much latency to hide (i.e. not much reason to swap-around vertex threads when texturing is required). Although I'm sure it wouldn't be hard to construct an example that makes Xenos crawl, with heavy vertex texturing. (Though it's worth noting that vertex fetch is, effectively, vertex texturing - but I presume that latency is partly hidden by Xenos's sequencer, fetching vertex data ahead of initiating vertex threads.)
This 1:2 ratio in vertex:pixel threads might lend itself to being spread around a 3:1 + 3:1 + 3:1 + 3:1 organisation of units. In effect, granularising threads further, making the cut-over periods between "all-vertex" or "all-pixel" and normal "mixed vertex/pixel" less painful.
I'm hiding GS as part of VS, by the way, just for the sake of simplicity.
Jawed
pjbliverpool
28-May-2006, 17:17
Umh..are you sure? I think it makes more sense on Xenos (due to edram)
I don't know about a technical point of view but from a performance point of view, all of these GPUs are going to be measured on their performance with 4xFSAA engaged (2x is barely touched these days).
So if there is a reasonable FSAA performance advantage to be had from making that relatively small change to the GPU then I would have thought ATI (or nvidia for that matter) would want to take it.
Unless such a change would not result in a performance advantage in a PC GPU?
Chalnoth
28-May-2006, 19:44
It's worth noting that Xenos is not wholly "unified" in terms of the content of its threads. It's not possible to schedule 96 threads of vertex shader work on Xenos: it's capped at 32 threads of vertex work and 64 threads of pixel work.
This makes a lot of sense, because vertex and pixel work still require different amounts of storage. So it's much easier to do it this way.
dizietsma
28-May-2006, 20:21
I don't think nvidia will go 4:1 ratio so the CEO is safe from me attacking him with a rubber chicken, however of course shaders is the way forward so the ratio will go up for nvidia.
I just see nvidia as very conservative nowadays, I think they are still haunted by the FX series. I think that explains why they will not go pure unified shaders straight away as well. They do big numbers though so I think we will see an 8/12 quad gpu.
Megadrive1988
28-May-2006, 22:15
Actually G80 is really Rampage 2 - but shhh! don't tell anyone :lol:
you mean Fear or Mojo, right? :lol:
It's worth noting that Xenos is not wholly "unified" in terms of the content of its threads. It's not possible to schedule 96 threads of vertex shader work on Xenos: it's capped at 32 threads of vertex work and 64 threads of pixel work. (So when Xenos is wholly executing, say, vertex work, the pixel-thread specific queue and register file memories are actually unused.) This isn't usually a problem because with little texturing in vertex shaders, there's not much latency to hide (i.e. not much reason to swap-around vertex threads when texturing is required). Although I'm sure it wouldn't be hard to construct an example that makes Xenos crawl, with heavy vertex texturing. (Though it's worth noting that vertex fetch is, effectively, vertex texturing - but I presume that latency is partly hidden by Xenos's sequencer, fetching vertex data ahead of initiating vertex threads.)
This makes a lot of sense, because vertex and pixel work still require different amounts of storage. So it's much easier to do it this way.
This is not my understanding of how it works. There is some control over capping resources used by vertex/pixel threads but it isn't at the thread level.
Might be a terminology thing. In M$ speak, I'm referring to caps on the number of vectors, i.e. 32 vertex vectors, each of 64 vertices and 64 pixel vectors, each of 64 pixels.
Jawed
Might be a terminology thing. In M$ speak, I'm referring to caps on the number of vectors, i.e. 32 vertex vectors, each of 64 vertices and 64 pixel vectors, each of 64 pixels.
Jawed
You're misinterpreting the numbers
Well, howsabout posting your interpretation?
Jawed
Threads are limited by the number of available registers, these come from a pool, more registers in the shader = less threads in flight, how the register pool is distributed, pixel / vertex is programmable.
But threads also do not represent anything other than the ability to hide latency, they don't dictate how the ALU's can be distributed. All 48 ALU's can be doing vertex work.
There is also a second level of granularity to threads that you're missing.
Dave Baumann
28-May-2006, 22:58
At the hardware level the reservation stations are able to handle 48 Vertex threads and 64 Pixel.
Threads are limited by the number of available registers, these come from a pool, more registers in the shader = less threads in flight, how the register pool is distributed, pixel / vertex is programmable.
Agreed - although nothing I've read says that the register file is common to both vertices and pixels. The only thing I've seen is along the lines "performance starts to fall off if you use more than 12 FP32s in a pixel shader".
I wouldn't be surprised if the register file is common to both, because it seems the storage for shader instructions is. But I don't remember a statement one way or the other.
But threads also do not represent anything other than the ability to hide latency, they don't dictate how the ALU's can be distributed. All 48 ALU's can be doing vertex work.
Agreed. I wasn't suggesting a limitation on the number of pipes being used. When all 48 ALUs are doing vertex shading, that's only 3 threads (vectors). There's, nominally, another 29 waiting in the wings to hide texturing latency (or latency due to additional vertex fetches), if need be. Obviously if the shaders' register counts are really high, then Xenos will be forced to reduce the number of threads it can support.
Jawed
The GPU shader processor consists of a number of logical units: 16 vertex fetch units, 16 texture fetch units, and 48 arithmetic logic units, or ALUs. The processor can work on up to 32 vertex vectors and 64 pixel vectors at a time. A total of up to 2,048 vertices and up to 4,096 pixels can be in flight at once. Shader resources are dynamically allocated based on the current load to different vertex shader vectors and pixel shader vectors.
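The in-flight totals quoted above follow directly from the vector caps. A minimal sketch of that arithmetic, assuming the 64-item vector width mentioned earlier in the thread (the function and constant names are made up for illustration):

```python
# Caps quoted in the passage above.
VERTEX_VECTORS = 32
PIXEL_VECTORS = 64
# Assumption: each vector ("thread") holds 64 vertices or pixels,
# as stated earlier in the thread.
ITEMS_PER_VECTOR = 64


def items_in_flight(vectors, items_per_vector=ITEMS_PER_VECTOR):
    """Total vertices or pixels that can be in flight at once."""
    return vectors * items_per_vector


print(items_in_flight(VERTEX_VECTORS))  # 2048
print(items_in_flight(PIXEL_VECTORS))   # 4096
```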
Jawed
Chalnoth
29-May-2006, 03:02
This is not my understanding of how it works. There is some control over capping resources used by vertex/pixel threads but it isn't at the thread level.
Possibly. But you probably also want to have separate queues for the vertex and pixel shaders, and it may be difficult to dynamically allocate those queues.
Mintmaster
29-May-2006, 09:36
Agreed. I wasn't suggesting a limitation on the number of pipes being used. When all 48 ALUs are doing vertex shading, that's only 3 threads (vectors). There's, nominally, another 29 waiting in the wings to hide texturing latency (or latency due to additional vertex fetches), if need be. Obviously if the shaders' register counts are really high, then Xenos will be forced to reduce the number of threads it can support.
I don't see how this leads you to believe this:
This isn't usually a problem because with little texturing in vertex shaders, there's not much latency to hide (i.e. not much reason to swap-around vertex threads when texturing is required). Although I'm sure it wouldn't be hard to construct an example that makes Xenos crawl, with heavy vertex texturing.
Even if your vertex shader had a bunch of texture fetches in a row, you have 2048 verts in flight, and at a theoretical 16 fetches per clock, that's 128 cycles to absorb latency. Even if it takes 200 cycles for a texture fetch, you're only going to drop 35% from the theoretical peak. Such a pathological example would run one or two orders of magnitude faster than on a non-unified design anyway.
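The estimate above can be reproduced with a very rough model, sketched here under the assumption that throughput is simply the ratio of available cover cycles to fetch latency (the function names are illustrative, and real hardware behaviour is more complicated):

```python
def cover_cycles(items_in_flight, fetches_per_clock):
    """Cycles of independent fetch work available to hide latency."""
    return items_in_flight / fetches_per_clock


def fraction_of_peak(cover, latency):
    """Crude model: if cover < latency, the units idle the rest of the time."""
    return min(1.0, cover / latency)


cover = cover_cycles(2048, 16)         # 2048 verts, 16 fetches per clock
frac = fraction_of_peak(cover, 200)    # assumed 200-cycle fetch latency

print(cover)                           # 128.0
print(f"{frac:.0%} of peak")           # 64% of peak
print(f"{1 - frac:.0%} drop")          # 36% drop
```

The 36% this prints is in line with the "35%" ballpark quoted in the post.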
If you also had lots of registers in the vertex shaders, then you'd cut the number of threads, hence vertices, in flight.
But yeah, still fast compared to a non-unified design.
I think the shader arrays actually execute 6 threads concurrently - execution of two threads in each shader array is alternated each clock cycle.
Jawed
Demirug
29-May-2006, 11:29
The pure ALU/FPU execution units should have a low latency. This means even with a high register count it should be possible to reach high efficiency as long as there are enough calculation instructions between the texture instructions.
I don't see the need to execute two alternating threads per unit. Every thread already contains multiple quads that can easily be pushed into the pipeline one by one.
Dave Baumann
29-May-2006, 12:01
I think the shader arrays actually execute 6 threads concurrently - execution of two threads in each shader array is alternated each clock cycle.
It's not alternate cycles, it's alternating every 4 cycles. That's why the batch size per "thread" is 64 Pixels / Verts (16 ALUs x 4 cycles).
The pure ALU/FPU execution units should have a low latency. This means even with a high register count it should be possible to reach high efficiency as long as there are enough calculation instructions between the texture instructions.
I don't see the need to execute two alternating threads per unit. Every thread already contains multiple quads that can easily be pushed into the pipeline one by one.
ALU latency is 8 cycles, which is why they switch between two threads every 4 cycles so that they can hide the ALU latency between instructions on a single thread. Technically they could have chosen to increase the batch size to 128 and just execute a single thread for 8 cycles, but I guess they were looking for high branching performance and small triangle efficiency.
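The batch-size trade-off described here is simple arithmetic. A small sketch, assuming three 16-wide shader arrays (48 ALUs in total, as quoted elsewhere in the thread); the helper names are hypothetical:

```python
import math

ALUS_PER_ARRAY = 16
ARRAYS = 3            # assumption: 48 ALUs arranged as 3 arrays of 16
ALU_LATENCY = 8       # cycles, per the post above


def batch_size(alus_per_array, cycles_per_slot):
    """Pixels or vertices processed per thread slot."""
    return alus_per_array * cycles_per_slot


def threads_to_hide_latency(alu_latency, cycles_per_slot):
    """Threads that must be interleaved so an ALU result is ready in time."""
    return math.ceil(alu_latency / cycles_per_slot)


print(batch_size(ALUS_PER_ARRAY, 4))                     # 64
print(threads_to_hide_latency(ALU_LATENCY, 4))           # 2
print(batch_size(ALUS_PER_ARRAY, 8))                     # 128 (the alternative)
print(ARRAYS * threads_to_hide_latency(ALU_LATENCY, 4))  # 6 threads executing
```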
And that's as far I get - I've speculated loads about this stuff already. . .
Pretend (ha!) I'm not too bright. . .how does all that wind back to what they might have done with the ROPs?
Even if your vertex shader had a bunch of texture fetches in a row, you have 2048 verts in flight, and at a theoretical 16 fetches per clock, that's 128 cycles to absorb latency. Even if it takes 200 cycles for a texture fetch, you're only going to drop 35% from the theoretical peak
Only if none of those texture fetches depend on each other. If there is a dependency chain, you need to add the full 200 cycles (using your numbers) of latency for each link in the longest chain.
If just one fetch depends on the result of one other fetch, that's 400 cycles of latency to cover. With only 128 cycles worth of work to absorb latency, you'll be running at 128/400 == 32% of the peak speed if nothing else slows you down even more.
Edit: Of course, if you can schedule ALU ops from the same thread and/or you can run other thread types at the same time, then you effectively have more latency coverage than just 128 cycles.
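The dependency-chain argument can be sketched the same way. This is only the simple model used in the post (latency adds up per link, nothing else overlaps), with illustrative names and the 200-cycle figure taken as an assumption:

```python
def chain_latency(latency_per_fetch, chain_length):
    """Total latency to cover when each fetch waits on the previous one."""
    return latency_per_fetch * chain_length


def fraction_of_peak(cover_cycles, total_latency):
    """Crude model: throughput is limited by how much latency is left exposed."""
    return min(1.0, cover_cycles / total_latency)


COVER = 128  # cycles of independent work, from 2048 verts / 16 fetches per clock

for links in (1, 2, 3):
    total = chain_latency(200, links)
    print(links, total, f"{fraction_of_peak(COVER, total):.0%}")
# 1 200 64%
# 2 400 32%
# 3 600 21%
```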
Pretend (ha!) I'm not too bright. . .how does all that wind back to what they might have done with the ROPs?
Well I did go on, for example suggesting that R600 might be constructed with a number of independent arrays. That's what the 3:1 + 3:1 + 3:1 + 3:1 thing was about.
If that were done, each group of 3 ALU arrays and 1 texture array (3 quads, 1 quad? - just like R580?) would have its own ROP. Pixels would be rasterised and despatched just like we see in R580. The ROPs would still be screen space tiled and the various buffers/caches for the ROPs would retain their current configuration (beefed-up for whatever D3D10 demands).
Vertices and primitives would be despatched to arbitrary arrays, presumably based upon workload. There wouldn't be any meaning attached to the array used to process vertices and primitives (I can't think of a parallel to "screen space tiling" that might apply to them).
---
I have my fingers crossed that R600 is a 32-ply beastie: 8 shader units, each consisting of 3:1 arrays, each with a ROP. So 32-1-3-2. If you include vertex fetch, then you might say it's 32-2-3-2. Hopefully it will have RV530-style fast ROPs.
Prolly time to start a thread about render target formats under D3D10. I'm not sure what improvements are coming...
---
If R600 is a D3D10 version of Xenos, with just a big, shared, block of ROPs (and no screen space tiling) then erm I dunno, no good ideas there...
Jawed
PeterAce
29-May-2006, 21:05
So you're thinking a batch size of 48.
My gut feeling (the current favorite B3D saying) is that it is 64 like Xenos.
Would this mean a config more like 6 (16-way) arrays, a batch being 4 cycles, making 64? But it could also still be something like 12 (8-way) arrays.
I'm not sure of the trade-offs between a few 'large arrays' or more 'smaller arrays'.
Anyone care to take a stab?
I like the multiple shader units configuration because not only does it directly support screen space tiling, it also makes for a nice even distribution of ring-bus clients.
Jawed
Dave Baumann
29-May-2006, 21:44
I'm not sure of the trade-offs between a few 'large arrays' or more 'smaller arrays'.
Larger batch/thread sizes vs more arbiters and sequencers (i.e. more control silicon).
Mintmaster
30-May-2006, 02:43
I have my fingers crossed that R600 is a 32-ply beastie: 8 shader units, each consisting of 3:1 arrays, each with a ROP. So 32-1-3-2. If you include vertex fetch, then you might say it's 32-2-3-2. Hopefully it will have RV530-style fast ROPs.
That would be freaking enormous. I was crossing my fingers for 16 ROPs, 96 shader pipes and 32 texture units, and would be happy with even 24 texture units. If it really is 64 shader pipes and 16 texture units, that's a bit of a disappointment in comparison to Xenos.
(BTW, I continue to use the term "shader pipes" simply for comparison to this generation's structure, i.e. NVidia has dual issue and ATI has their mini ALU. I haven't seen any evidence where NVidia's pipes perform closer to two ATI pipes than to one, especially when looking at R520 vs. G70 benchmarks and synthetic tests.)
That would be freaking enormous. I was crossing my fingers for 16 ROPs, 96 shader pipes and 32 texture units, and would be happy with even 24 texture units.
Interesting. That's where I am. Nice to see someone else there.
Think 80/24 makes any sense? Still > 3/1, but save a little silicon at 80nm.
Ailuros
30-May-2006, 05:28
Interesting. That's where I am. Nice to see someone else there.
Think 80/24 makes any sense? Still > 3/1, but save a little silicon at 80nm.
Why wouldn't 16 TMUs at anything =/>750MHz be not enough? What's the real theoretical goal with next generation GPGPUs: =/>20 GTexels fillrates or =/>400 GFLOPs of pure arithmetic horsepower?
Besides frankly there's nothing yet that can tell me whether 64 (replace the number with anything else you want) ALUs are not sufficient. Without knowing the characteristics of those units and some essential architectural tidbits, speculating around with numbers and determine what is sufficient and isn't, is a tad off base IMHO.
If those 64 ALUs would be dual-issue (and yes that's just a hypothetical example grabbed out of thin air), you'd reach 768 GFLOPs at a 750MHz frequency.
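The 768 GFLOPs figure works out if each ALU is taken to be a vec4 MADD unit (4 lanes x 2 flops), which is an assumption here since the post does not spell out the ALU width. A sketch of the arithmetic with hypothetical names:

```python
def peak_gflops(alus, clock_ghz, flops_per_alu_per_clock=8, issue_width=2):
    """Theoretical peak in GFLOPs.

    flops_per_alu_per_clock=8 assumes a vec4 MADD (4 lanes x multiply + add);
    issue_width=2 models the hypothetical dual issue from the post.
    """
    return alus * flops_per_alu_per_clock * issue_width * clock_ghz


print(peak_gflops(64, 0.75))                 # 768.0
print(peak_gflops(64, 0.75, issue_width=1))  # 384.0 (single issue, same ALUs)
```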
Why wouldn't 16 TMUs at anything =/>750MHz be not enough?
Are they going to make 750mhz at 80nm? I'm not so sure.
I'm going mainly on the fact that Sireric seemed to think more TMU would be lovely so long as there was more BW to feed it (seems there will with gddr4), and the ratio didn't decrease.
Edit: I'm beginning to wonder if there is a general clocks issue building in the gpu world. The data points are building up that way, and not just ATI but NV as well. But then I'm widely known to worry too much. :)
Mintmaster
30-May-2006, 05:48
Why wouldn't 16 TMUs at anything =/>750MHz be not enough? What's the real theoretical goal with next generation GPGPUs: =/>20 GTexels fillrates or =/>400 GFLOPs of pure arithmetic horsepower?
Hehe, well I want to do some stuff with spherical harmonics and 3D textures. :wink:
I figure l=2 is a minimum, and trilinear filtering of 3D textures brings me up to nearly 40 texture unit cycles per pixel. Lots of overdraw too.
On the math side I think it's mostly for non-graphics applications that I'd like that much power.
Mintmaster
30-May-2006, 05:57
Think 80/24 makes any sense? Still > 3/1, but save a little silicon at 80nm.
Makes some sense, and I thought about it, but given the rumours I doubt it. I hope they aren't right, because staying with 16 texture units is dangerous for ATI, IMO. Especially in the value sector, where reduced graphics settings could be less math heavy.
Ailuros
30-May-2006, 05:59
Hehe, well I want to do some stuff with spherical harmonics and 3D textures. :wink:
I figure l=2 is a minimum, and trilinear filtering of 3D textures brings me up to nearly 40 texture unit cycles per pixel. Lots of overdraw too.
On the math side I think its mostly for non-graphics applications that I'd like that much power.
How about some deferred texturing with single cycle trilinear TMUs then? :D (j/k)
I'm not sooo sure we'll see even 32 TMUs in any of the upcoming D3D10 GPUs, but as I said it's merely a gut feeling and not based on any reliable background info.
Keep in mind that ALUs will be more important as there will be more math based on the addition of geometry shaders, physics, and predicated rendering.
Additional transistors will be used for virtual memory and management of graphics objects.
Tim Murray
30-May-2006, 07:10
Makes some sense, and I thought about it, but given the rumours I doubt it. I hope they aren't right, because staying with 16 texture units is dangerous for ATI, IMO. Especially in the value sector, where reduced graphics settings could be less math heavy.
R600 texture units != Xenos texture units.
Mintmaster
30-May-2006, 07:27
How about some deferred texturing with single cycle trilinear TMUs then? :D (j/k)
Heh, it already is deferred. The overdraw is from alpha blending.
Basically I want to start seeing neighbourhood transfer of radiance. Even if it's pretty local, I think that's the missing link in realtime lighting that we need for realism. There are a lot of approximations in the method I'm thinking of, but I think it could work.
R600 texture units != Xenos texture units.
Well I was comparing to R5xx texture units, but even if changed from them, how different could they be? Maybe FP16/I32 filtering, more intelligent aniso walking, and better efficiency. Doubt 16 units is enough to effectively counter even a simple G71 x 1.5 w/ DX10.
Kombatant
30-May-2006, 08:30
I don't see a 3:1 ratio myself.
I have my fingers crossed that R600 is a 32-ply beastie: 8 shader units, each consisting of 3:1 arrays, each with a ROP. So 32-1-3-2. If you include vertex fetch, then you might say it's 32-2-3-2. Hopefully it will have RV530-style fast ROPs.
That would be freaking enormous. I was crossing my fingers for 16 ROPs, 96 shader pipes and 32 texture units, and would be happy with even 24 texture units. If it really is 64 shader pipes and 16 texture units, that's a bit of a disappointment in comparison to Xenos.
Erm, you're practically agreeing with me, 32 shader units equals:
32 bilinear TMUs
32 vertex fetch units (suppose we might as well start calling em VFUs)
96 ALUs
32 ROPs (hopefully RV530 fast)
So the only material difference between you and me is ROP count.
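The unit counts in that list fall out of the "8 shader units, each 3:1, each with a ROP" layout if every array is taken to be one quad (4 units) wide. That quad width is an assumption for this sketch, as are the names:

```python
QUAD = 4  # assumption: each array in a shader unit is one quad wide


def unit_counts(shader_units, alu_quads=3, tex_quads=1, rop_quads=1):
    """Totals for a chip built from identical 3:1 shader units."""
    return {
        "ALUs": shader_units * alu_quads * QUAD,
        "TMUs": shader_units * tex_quads * QUAD,
        "VFUs": shader_units * tex_quads * QUAD,  # one vertex fetch per TMU
        "ROPs": shader_units * rop_quads * QUAD,
    }


print(unit_counts(8))
# {'ALUs': 96, 'TMUs': 32, 'VFUs': 32, 'ROPs': 32}
```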
If you have an 8x32-bit GDDR3 interface, then perhaps it makes sense to use 8 quad ROPs, one per ring stop and one ring stop per 32-bit channel.
R580 has two 32-bit channels per ring stop, hence only 4 ring stops and 4 quad ROPs.
I'm not going to push too hard for 8 ring stops.
Jawed
I believe that they are going to the 32 TMU, 32 ROPS, 32 Z at the beginning with 2 ALU per pipe instead of 3. If they are going to launch it at the end of this year and Vista with DX10 isn't going to be ready until 2007, I don't see the possibility of a DirectX 10 GPU; I see more a DX9 GPU with 10 Vertex Shaders 3.0 (more enhanced than the R5x0 architecture) and 64 Pixel Shaders clocked at 700MHz and with all the ALUs being MADD capable.
Of course this is just speculation.
Dave Baumann
30-May-2006, 11:04
Given Eric's comments in the R580 interview, if core speeds scale at a similar ratio at memory, why would we expect any more of the elements that consume bandwidth? Take a read of the interview again, carefully, and bear in mind what items are likely to be on his mind when he's replying (given that R580 is a historic item to him at that point).
Also, note, just because GDDR4 is coming doesn't mean that this will translate into an immediate and massive leap in bandwidth. GDDR3 started at the high point of GDDR2's end - i.e. GDDR2's high point was 500MHz, but 500MHz GDDR3 was far more prevalent; it wouldn't surprise me to see GDDR4 coming in at the 900-1000MHz range initially.
chavvdarrr
30-May-2006, 11:18
Given Eric's comments in the R580 interview, if core speeds scale at a similar ratio at memory, why would we expect any more of the elements that consume bandwidth? Take a read of the interview again, carefully, and bear in mind what items are likely to be on his mind when he's replying (given that R580 is a historic item to him at that point).
So, your guess is 16 ROPs/TMU with 64 ALUs?
Not exactly what most of us expected.
it wouldn't surprise me to see GDDR4 coming in at the 900-1000MHz range initially.
Second that.
Given Eric's comments in the R580 interview, if core speeds scale at a similar ratio at memory, why would we expect any more of the elements that consume bandwidth? Take a read of the interview again, carefully, and bear in mind what items are likely to be on his mind when he's replying (given that R580 is a historic item to him at that point).
Also, note, just because GDDR4 is coming doesn't mean that this will translate into an immediate and massive leap in bandwidth. GDDR3 started at the high point of GDDR2's end - i.e. GDDR2's high point was 500MHz, but 500MHz GDDR3 was far more prevalent; it wouldn't surprise me to see GDDR4 coming in at the 900-1000MHz range initially.
So 750/1.8 would be roughly balanced, so long as current TMU is actually BW-limited and not unit-limited?
Kombatant
30-May-2006, 13:27
On a bit unrelated note, is it me or some people see the R600 like the second coming of Christ or something? :)
On a bit unrelated note, is it me or some people see the R600 like the second coming of Christ or something? :)
R400 hangover. :lol: Tho the comparison is interesting, given that --Christ is a bit late too based on the original thinking of when he'd return. :lol:
PeterAce
30-May-2006, 13:38
After re-reading the interview it's very clear that the number of ROPs will probably not increase.
So the question turns to whether the current 1:1 TEX to ROP ratio will change. 32 TEX with 16 ROPs seems very unbalanced, but maybe 24 TEX. Still, 16 TEX doesn't seem too bad.
Regarding ALUs, R580 was designed to perform well on titles for the coming year (higher ALU to TEX ratios).
R600 will also be in that period so the question becomes will R600 increase the ALU to TEX ratios again to higher than 3:1. Maybe only the high-end refreshes will alter the ratio?
ATI's design route of high numbers of threads seems to match well with the increase in ALUs as this helps hide the texture fetch latency.
I'd definitely lean towards four arbiter/sequencer logic blocks for the first R600-series chip (and I'm not convinced the first chip is actually called R600), controlled by three command queues and 4:1 shader:tex ratio (and 16 ROPs).
I'd definitely lean towards four arbiter/sequencer logic blocks for the first R600-series chip (and I'm not convinced the first chip is actually called R600), controlled by three command queues and 4:1 shader:tex ratio (and 16 ROPs).
Arrgh! :sad: Somebody poke Huddy and tell him to clean up his mess! :lol: Xbit isn't responding for me, or I'd provide the supposed quote from Huddy that sounds very much like "a" chip, and not a family.
Kombatant
30-May-2006, 19:16
I'd definitely lean towards four arbiter/sequencer logic blocks for the first R600-series chip (and I'm not convinced the first chip is actually called R600), controlled by three command queues and 4:1 shader:tex ratio (and 16 ROPs).
I wonder where you got that idea... :lol:
I wonder where you got that idea... :lol:
My backside, as always :lol:
I'd definitely lean towards four arbiter/sequencer logic blocks for the first R600-series chip (and I'm not convinced the first chip is actually called R600), controlled by three command queues and 4:1 shader:tex ratio (and 16 ROPs).
I sense an inside joke...:|
4:1 ratio...fine. 16 ROPs...fine. Four arbiter/sequencer with three command queues...??? (Might help if I knew the difference between an arbiter/sequencer and a command queue.)
Seriously tempted to say: "You're just making stuff up." :)
*sigh* I betray my own ignorance.
ERK
Hmm. That third command queue is interesting. PS, VS, texturing? The third has to be textures or GS, I'm guessing. . . tho I've been putting GS with VS.
Edit: Otoh, I don't think Xenos has a texture command queue, so that would suggest GS is more likely.
I sense an inside joke...:|
4:1 ratio...fine. 16 ROPs...fine. Four arbiter/sequencer with three command queues...??? (Might help if I knew the difference between an arbiter/sequencer and a command queue.)
Seriously tempted to say: "You're just making stuff up." :)
*sigh* I betray my own ignorance.
ERK
It's the basic makeup of Xenos, really. A command queue for each thread type, then arbiters/sequencers to decide what's run and when, and in what order, then the shader arrays and texture units to service those requests, ROPs to output.
It's basically me saying there's a GS queue and the chip is wider than Xenos, and you take it from there.
Read Wavey's Xenos piece to get the goodies.
Dave Baumann
31-May-2006, 00:17
Hmm. That third command queue is interesting. PS, VS, texturing? The third has to be textures or GS, I'm guessing. . . tho I've been putting GS with VS.
Edit: Otoh, I don't think Xenos has a texture command queue, so that would suggest GS is more likely.
Texturing is just an instruction within a shader program, so any texture ops would exist within the programs scheduled to execute in one of the reservation stations.
Thanks, Rys,
You know I've read the Xenos article a couple of times, but there's so much information in there it's easy to forget a lot.
ERK
Sorry for going on a tangent from the kind of discussion going on here, but Fuad is back with a piece, this one on R600. Supposedly R600 is the first cable-less crossfire (http://theinquirer.net/?article=32034) chip even though RV560/RV570 with the same functionality are being released first.
Chalnoth
31-May-2006, 04:48
Sorry for going on a tangent from the kind of discussion going on here, but Fuad is back with a piece, this one on R600. Supposedly R600 is the first cable-less crossfire (http://theinquirer.net/?article=32034) chip even though RV560/RV570 with the same functionality are being released first.
Well, it would be interesting if they went with another line of communication between the cards, like nVidia has. Having no cable only works well with lower-end designs where bandwidth between the cards isn't such a huge concern. Anyway, just like most of what Fuad spews, I'm sure it's got no credibility at all.
EasyRaider
31-May-2006, 07:52
Also, note, just because GDDR4 is coming doesn't mean that this will translate into an immediate and massive leap in bandwidth. GDDR3 started at the high point of GDDR2's end - i.e. GDDR2's high point was 500MHz, but 500MHz GDDR3 was far more prevalent; it wouldn't surprise me to see GDDR4 coming in at the 900-1000MHz range initially.
Samsung announced 1.6 GHz GDDR4 in February. I got the impression the chips could be delivered in sufficient quantity for high-end this year.
For a R580 update this summer, I would expect 1 - 1.2 GHz.
Dave Baumann
31-May-2006, 08:24
Well, it would be interesting if they went with another line of communication between the cards, like nVidia has. Having no cable only works well with lower-end designs where bandwidth between the cards isn't such a huge concern.
It's not talking about communication via the bus.
It's not talking about communication via the bus.
no cable, no bus...wireless? :) (kidding)
Kombatant
31-May-2006, 09:52
It's not talking about communication via the bus.
http://img160.imageshack.us/img160/7070/yep1ko.gif
Mintmaster
31-May-2006, 09:55
Erm, you're practically agreeing with me, 32 shader units equals:
Maybe I wasn't clear. I was implying that 32 texture units was my "unreasonable wish", and I'd be happy with 24. You're taking the bigger one and asking for twice the ROPs and 4 times the z units.
Sure, GDDR4 may give a big bump in BW, but I don't think ROPs are much of a limitation. 16 ROPs at 700MHz means a fullscreen 1600x1200 write in about 170 microseconds, i.e. 1% of rendering time at 60fps.
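The fill-time estimate can be checked with a few lines of arithmetic. This is only a sketch: it assumes one pixel written per ROP per clock and ignores bandwidth, blending and AA entirely. The function names are made up for illustration.

```python
def fullscreen_fill_time_us(width, height, rops, core_mhz):
    """Microseconds to write every pixel once, at one pixel per ROP per clock."""
    pixels = width * height
    pixels_per_us = rops * core_mhz  # 1 MHz is one cycle per microsecond
    return pixels / pixels_per_us


def share_of_frame(fill_us, fps):
    """Fraction of one frame's time budget used by a single fullscreen write."""
    frame_us = 1_000_000 / fps
    return fill_us / frame_us


fill_us = fullscreen_fill_time_us(1600, 1200, rops=16, core_mhz=700)
print(f"{fill_us:.0f} us per fullscreen write")            # about 171 us
print(f"{share_of_frame(fill_us, 60):.1%} of a 60 fps frame")  # about 1.0%
```

The reciprocal of the same figure gives the number of fullscreen fills per second, which is why a raw ROP count rarely looks like the bottleneck on paper.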
RV530-style double-rate ROPs aren't enough magic, here's two graphics cards that both have 22.4GB/s of bandwidth:
SC:CT:
http://www.beyond3d.com/reviews/ati/rv5xx/index.php?p=13
http://www.beyond3d.com/previews/nvidia/g73/index.php?p=09
FEAR:
http://www.beyond3d.com/reviews/ati/rv5xx/index.php?p=14
http://www.beyond3d.com/previews/nvidia/g73/index.php?p=10
It seems to me that X1600XT has a serious excess of bandwidth. It doesn't have enough TMUs and ROPs to fully utilise the available bandwidth. (I'm not saying 7600GT is using its bandwidth efficiently - it appears bandwidth-starved.)
I'd guess that an ATI GPU would need to have 8 TMUs and 8 ROPs to fully utilise that 22.4GB/s of bandwidth. Supposedly that's RV560.
A simple linear scaling by a factor of three brings us to 67GB/s with 24 TMUs and 24 ROPs. And that's with fast-RV530 ROPs, not the leisurely R520/R580 ROPs.
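The linear scaling in that estimate is simple enough to write down. The sketch below just multiplies the guessed balanced configuration (8 TMUs and 8 ROPs for 22.4GB/s) by a factor; whether real hardware scales linearly is an assumption of the post, not something the code can show.

```python
def scale_config(bandwidth_gbs, tmus, rops, factor):
    """Scale bandwidth and unit counts by the same factor (pure linear scaling)."""
    return bandwidth_gbs * factor, tmus * factor, rops * factor


bw, tmus, rops = scale_config(22.4, tmus=8, rops=8, factor=3)
print(f"{bw:.1f} GB/s with {tmus} TMUs and {rops} ROPs")  # 67.2 GB/s, 24 and 24
```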
Jawed
Dave Baumann
31-May-2006, 11:34
16 is already bandwidth limited, any more is just a waste.
Chalnoth
31-May-2006, 13:04
16 is already bandwidth limited, any more is just a waste.
Even with high-sample AA and frame/z-buffer compression?
EasyRaider
31-May-2006, 17:17
16 is already bandwidth limited, any more is just a waste.
Surely not for depth-only rendering with compression. It might make sense to allocate transistors for up to 64 samples/clock, depending on compression, frequencies and bandwidth efficiency.
Surely not for depth-only rendering with compression. It might make sense to allocate transistors for up to 64 samples/clock, depending on compression, frequencies and bandwidth efficiency.
Do you question the Great Dave?
Hallowed is B3D:!:
Dave Baumann
31-May-2006, 20:53
Surely not for depth-only rendering with compression. It might make sense to allocate transistors for up to 64 samples/clock, depending on compression, frequencies and bandwidth efficiency.
That's what testing indicates. Even 7800 GTX 512MB, which has one of the largest bandwidth to core clock ratios, falls shy of its theoretical colour write and Z only rates.
EasyRaider
31-May-2006, 21:51
That's what testing indicates. Even 7800 GTX 512MB, which has one of the largest bandwidth to core clock ratios, falls shy of its theoretical colour write and Z only rates.
Well, for next-gen parts with GDDR4 that ratio may increase even further. Still, it would probably be more efficient to have 16 ROPs with quad-speed Z, or 32 simple ones (say, two cycles for 4-channel colour). (edit: as opposed to simply doubling ROPs)
EasyRaider
31-May-2006, 22:15
16 is already bandwidth limited, any more is just a waste.
Surely not for depth-only rendering with compression.
To clarify, I meant 16 ATI-typical ROPs. Just going double-Z would likely be enough.
Dave Baumann
31-May-2006, 22:17
As I said upstream, I wouldn't be expecting some massive schism in relative clock to memory bandwidths.
EasyRaider
31-May-2006, 22:30
I fully expect 1.4 - 1.6 GHz memory for top of the line. I doubt core speeds will reach 1 GHz or higher, though I wouldn't mind being wrong.
Chalnoth
31-May-2006, 23:42
As I said upstream, I wouldn't be expecting some massive schism in relative clock to memory bandwidths.
I don't either. But I also don't entirely buy the statement that ROPs are currently bandwidth-limited, given frame/z-buffer compression. I mean, sure, there's a fair possibility that they are memory bandwidth-limited at relatively low resolutions without FSAA (say, in 3DMark), but make the move to high-res with FSAA, and suddenly those compression techniques become much more effective, decreasing the memory bandwidth pressure on these units.
Now, all that said, I do expect ROPs to become less important for the majority of rendering completely independent of memory bandwidth, in that as shaders get longer, the ROP becomes a smaller and smaller portion of the needed processing power. So by that token we may still expect the number of ROPs to not increase.
Dave Baumann
31-May-2006, 23:54
I mean, sure, there's a fair possibility that they are memory bandwidth-limited at relatively low resolutions without FSAA (say, in 3DMark), but make the move to high-res with FSAA, and suddenly those compression techniques become much more effective, decreasing the memory bandwidth pressure on these units.
Enabling FSAA will increase bandwidth requirements in comparison to no AA at the same resolution. The compression factor may increase, but you will still be producing more samples no matter what happens - the very best you can hope for is the same relative bandwidth utilisation as no AA, but that's unlikely and it will go up.
Chalnoth
01-Jun-2006, 00:16
Enabling FSAA will increase bandwidth requirements in comparison to no AA at the same resolution. The compression factor may increase, but you will still be producing more samples no matter what happens - the very best you can hope for is the same relative bandwidth utilisation as no AA, but that's unlikely and it will go up.
Not relative to fillrate usage. The only pixels where bandwidth requirements would increase are those that share multiple triangles, and those also require more fillrate with multisampling (here I'm obviously considering frame buffer compression: z-buffer can be a bit more difficult to think about in such a simple manner, but is likely to follow the same basic idea).
Now, if you consider that current ROP units apparently can't output more than 2 samples per clock, it seems that there is an inefficiency there waiting to be exploited for the currently more frequent 4-sample and (for ATI) 6-sample multisampling patterns. If this were addressed, it could either come in the form of more ROP units, or the same number of ROP units capable of doing more.
EasyRaider
01-Jun-2006, 00:48
Now, if you consider that current ROP units apparently can't output more than 2 samples per clock, it seems that there is an inefficiency there waiting to be exploited for the currently more frequent 4-sample and (for ATI) 6-sample multisampling patterns.
So what if the unit spends 2 or 3 cycles? Just a single trilinear texture read takes 2 cycles anyway. The only time I can see it really matter is for shadow volumes, hardly worth spending more transistors on.
(edit)
Texture units which can do a trilinear or 2x AF texture sample per clock would be more sensible.
Chalnoth
01-Jun-2006, 00:53
So what if the unit spends 2 or 3 cycles? Just a single trilinear texture read takes 2 cycles anyway. The only time I can see it really matter is for shadow volumes, hardly worth spending more transistors on.
Well, it depends upon how many operations the hardware can do per clock cycle, and how long the typical shaders are.
Dave Baumann
01-Jun-2006, 00:56
Not relative to fillrate usage. The only pixels where bandwidth requirements would increase are those that share multiple triangles, and those also require more fillrate with multisampling (here I'm obviously considering frame buffer compression: z-buffer can be a bit more difficult to think about in such a simple manner, but is likely to follow the same basic idea).
Under normal operation, if you're dealing with colour operations then you're probably going to have z sample/update and blends occurring (as well as texture sampling); post processing may yield closer to the colour fill peak, but it still has to take the original sample in and if the post process op takes a number of shader instructions then you may not hit the maximum ROP utilisation. FSAA is also going to require bandwidth for the resolve.
Will ati bother to make XP drivers for that thing?
Will ati bother to make XP drivers for that thing?
"that thing"? :lol: Not a fan, I take it? It'd be suicide if they didn't, wouldn't it?
EasyRaider
01-Jun-2006, 03:16
Well, it depends upon how many operations the hardware can do per clock cycle, and how long the typical shaders are.
During normal rendering, I think it's safe to assume that a texture unit and multiple ALUs will need to spend a bunch of cycles for every visible fragment. So unless the VPU had relatively few ROPs, single cycle 4x AA couldn't make a difference most of the time.
Do the math and you'll find Dave is right, bandwidth is the limitation. Just 32-bit colour blending limits high end cards to roughly 8 pixels/clock. Many ATI cards can max out Z-only fillrate well below what bandwidth with compression should allow, but I doubt you'll find more limitations than that.
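The "roughly 8 pixels/clock" figure can be sketched as follows. The card figures are assumptions chosen to resemble a 2006 high-end part (about 49.6 GB/s of memory bandwidth at a 650 MHz core clock); they are not taken from the thread. The sketch also assumes uncompressed colour, one read plus one write per blended pixel, and 100% bus efficiency, so real numbers would come out lower.

```python
def blend_pixels_per_clock(bandwidth_gbs, core_mhz, bytes_per_pixel=4):
    """Upper bound on blended pixels per core clock from memory bandwidth alone."""
    bytes_per_clock = bandwidth_gbs * 1e9 / (core_mhz * 1e6)
    # Blending reads the destination colour and writes the result back.
    bytes_per_blended_pixel = 2 * bytes_per_pixel
    return bytes_per_clock / bytes_per_blended_pixel


# Assumed, illustrative figures: 49.6 GB/s and a 650 MHz core.
limit = blend_pixels_per_clock(49.6, 650)
print(f"{limit:.1f} blended pixels per clock at most")  # about 9.5
```

With realistic bus efficiency below 100%, the bound drops toward the 8 pixels/clock mentioned above, well short of 16 ROPs.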
I also don't think it's a good idea to have a higher AA sample rate than depth/stencil rate with AA off. If you have enough bandwidth for a certain amount of AA samples with colour on some of the time, then you should be able to push at least as many depth samples when that's all you're doing, AA or not.
So I think a unit that does 4 AA samples per clock, should be designed to also push 4 depth samples with AA off. (I suspect shadow map render speed will become quite important soon.)
EasyRaider
01-Jun-2006, 03:39
Even 7800 GTX 512MB, which has one of the largest bandwidth to core clock ratios, falls shy of its theoretical colour write and Z only rates.
BTW, this is a bit odd. Assuming that Z values typically compress to 8 bits, it should be able to hit theoretical Z only, with some margin to spare. It seems that either Z data doesn't compress that well (at least in that measurement), or bandwidth efficiency is quite far from optimal.
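A rough sketch of that margin, under loudly stated assumptions: the figures used for the 7800 GTX 512MB (550 MHz core, 54.4 GB/s, 32 Z samples per clock from 16 double-rate ROPs) are assumed here and not quoted from the thread, and every Z sample is counted as one read plus one write.

```python
def z_only_bandwidth_gbs(z_samples_per_clock, core_mhz, bits_per_sample, accesses=2):
    """Bandwidth in GB/s needed to sustain the peak Z-only rate."""
    bytes_per_sample = bits_per_sample / 8
    samples_per_second = z_samples_per_clock * core_mhz * 1e6
    return samples_per_second * bytes_per_sample * accesses / 1e9


AVAILABLE_GBS = 54.4  # assumed memory bandwidth of the card

compressed = z_only_bandwidth_gbs(32, 550, bits_per_sample=8)
uncompressed = z_only_bandwidth_gbs(32, 550, bits_per_sample=32)
print(f"8-bit Z:  {compressed:.1f} GB/s needed of {AVAILABLE_GBS} available")
print(f"32-bit Z: {uncompressed:.1f} GB/s needed of {AVAILABLE_GBS} available")
```

Under these assumptions the compressed case fits with room to spare while the uncompressed case does not, which is the oddity the post points at.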
Chalnoth
01-Jun-2006, 04:40
Do the math and you'll find Dave is right, bandwidth is the limitation. Just 32-bit colour blending limits high end cards to roughly 8 pixels/clock. Many ATI cards can max out Z-only fillrate well below what bandwidth with compression should allow, but I doubt you'll find more limitations than that.
Fine. So let's say you can do 8 pixels per clock given memory bandwidth constraints. If you're running at 4x AA with blending, since the ROP's of modern architectures, I believe, run at half-speed with blending enabled, the ROP's will only be able to output 4 pixels per clock. If you're running at 6x AA with blending, they'll only be able to output 2.666 pixels per clock (on average).
In the situation of framebuffer compression combined with higher-sample AA, the ROP's are not memory bandwidth limited.
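The per-pixel rates in this argument follow from a short calculation. The sketch assumes 16 ROPs, 2 samples per ROP per clock (the figure mentioned earlier in the thread) and a halving of rate when blending is enabled; that last assumption is the poster's belief and is disputed elsewhere in the thread for G7x and ATI parts.

```python
from fractions import Fraction


def aa_pixels_per_clock(rops, samples_per_rop_clock, aa_samples, blend_penalty=2):
    """Pixels per clock the ROPs can output when each pixel needs aa_samples samples."""
    samples_per_clock = Fraction(rops * samples_per_rop_clock, blend_penalty)
    return samples_per_clock / aa_samples


for aa in (4, 6):
    rate = aa_pixels_per_clock(16, 2, aa)
    print(f"{aa}x AA with blending: {float(rate):.3f} pixels/clock")
```

Both results sit below the 8 pixels/clock bandwidth figure, which is the basis for the claim that the ROPs rather than memory would be the limit in this case.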
I also don't think it's a good idea to have a higher AA sample rate than depth/stencil rate with AA off. If you have enough bandwidth for a certain amount of AA samples with colour on some of the time, then you should be able to push at least as many depth samples when that's all you're doing, AA or not.
Well, of course. But the compressibility of the z-buffer similarly improves with higher-sample AA. I was thinking of increasing z-rate along with color sample output rate as implied in my argument.
Mintmaster
01-Jun-2006, 06:42
So what if the unit spends 2 or 3 cycles? Just a single trilinear texture read takes 2 cycles anyway. The only time I can see it really matter is for shadow volumes, hardly worth spending more transistors on.
At high res most pixels are magnified, so trilinear is a non-factor. NVidia also has more texture units than ROPs, so only with two textures are you guaranteed not to be ROP limited.
But, in the end I'm going to agree with you. 16 ROPs lets you fill a 2Mpix screen over 5000 times per second. Who cares if that gets halved.
Texture units which can do a trilinear or 2x AF texture sample per clock would be more sensible.
I'd have to disagree here. First of all, magnification and trilinear optimizations mean you only fetch texels from two mipmaps maybe 20% of the time or less. Secondly single cycle trilinear costs almost as much as double the texture units. You need almost double the filtering hardware, double the data path from there to the texture cache (which needs double the read ports), double the request queue for memory loads, almost double the address calcs, etc. It's a waste of space compared to the more flexible option of additional bilinear texture units. NVidia abandoned this strategy with the Geforce2, even when they didn't have any trilinear opts.
Mintmaster
01-Jun-2006, 07:00
Fine. So let's say you can do 8 pixels per clock given memory bandwidth constraints. If you're running at 4x AA with blending, since the ROP's of modern architectures, I believe, run at half-speed with blending enabled, the ROP's will only be able to output 4 pixels per clock. If you're running at 6x AA with blending, they'll only be able to output 2.666 pixels per clock (on average).
Just curious, how do current GPUs do alpha blending with AA? If a pixel is compressed and all samples are identical, does it only take one cycle to do the blending?
FSAA is also going to require bandwidth for the resolve.
Yeah, but this is relatively peanuts. Colour buffer for 4xAA, 1600x1200, FP16 is 60MB. Needs only 7% of a frame's available BW (on high end cards) in the worst case of every pixel being uncompressed, and will more likely be in the 2-3% neighbourhood.
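The buffer size and the resolve cost can be reproduced with a short sketch. FP16 is taken as four 16-bit channels (8 bytes per sample); the 50 GB/s and 60 fps figures are assumptions standing in for "high end cards", and only the read of every sample is counted, not the write of the resolved image.

```python
def colour_buffer_bytes(width, height, samples, bytes_per_sample):
    """Size of an uncompressed multisampled colour buffer."""
    return width * height * samples * bytes_per_sample


def resolve_share(buffer_bytes, bandwidth_gbs, fps):
    """Fraction of one frame's bandwidth budget spent reading the buffer once."""
    bytes_per_frame = bandwidth_gbs * 1e9 / fps
    return buffer_bytes / bytes_per_frame


size = colour_buffer_bytes(1600, 1200, samples=4, bytes_per_sample=8)
share = resolve_share(size, bandwidth_gbs=50, fps=60)  # assumed card and frame rate
print(f"{size / 1e6:.1f} MB colour buffer")    # about 61.4 MB
print(f"{share:.1%} of per-frame bandwidth")   # about 7.4%
```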
Chalnoth
01-Jun-2006, 07:52
Just curious, how do current GPUs do alpha blending with AA? If a pixel is compressed and all samples are identical, does it only take one cycle to do the blending?
That's a damned fine question. You'd think the hardware should be capable of that. I've never heard of such a thing, though, so it is conceivable that the functionality was never implemented just because it wouldn't have improved performance much.
Yeah, but this is relatively peanuts. Colour buffer for 4xAA, 1600x1200, FP16 is 60MB. Needs only 7% of a frame's available BW (on high end cards) in the worst case of every pixel being uncompressed, and will more likely be in the 2-3% neighbourhood.
Right. I think this is the real answer: for nearly any application, the ROPs aren't your limitation. I don't buy that bandwidth is what is influencing IHV's to increase fillrate while not increasing ROP's. This would be true if there was no frame/z-buffer compression, of course, but this is hardly the norm.
Dave Baumann
01-Jun-2006, 11:04
Many ATI cards can max out Z-only fillrate well below what bandwidth with compression should allow, but I doubt you'll find more limitations than that.
Yes, it's clear that they have plenty of room to manoeuvre on the Z sample rate, as pointed out in the R520/R580 articles - RV530's Z sampling capabilities would be sensible across the entire range for the next generation, but whether they will use it I don't know. RV530's ROPs also maintain their double Z rate even with FSAA.
So let's say you can do 8 pixels per clock given memory bandwidth constraints. If you're running at 4x AA with blending, since the ROP's of modern architectures, I believe, run at half-speed with blending enabled, the ROP's will only be able to output 4 pixels per clock.
That was the case for NV4x, not for G7x, nor any ATI - there's no ROP limitation in relation to the colour write performance for these parts.
Yeah, but this is relatively peanuts. Colour buffer for 4xAA, 1600x1200, FP16 is 60MB. Needs only 7% of a frame's available BW (on high end cards) in the worst case of every pixel being uncompressed, and will more likely be in the 2-3% neighbourhood.
Yes, which is why it was tacked on the end of a list of other bandwidth consumers. Point being, it's still a requirement that's there in a bandwidth-constrained environment.
Right. I think this is the real answer: for nearly any application, the ROPs aren't your limitation. I don't buy that bandwidth is what is influencing IHV's to increase fillrate while not increasing ROP's. This would be true if there was no frame/z-buffer compression, of course, but this is hardly the norm.
If you are at a point where you're at the maximum ROP utilisation you'll be at the limits of bandwidth, but more often, when you're rendering a scene, you'll not reach that ROP utilisation in the first place because you're backed up by shaders, so this is a balance. Before going to more ROP's I would guess that they will get more capable ROP's in terms of high depth rendering needing fewer cycles and more of the "right type" of samples for a particular operation per ROP.
Kombatant
01-Jun-2006, 13:08
Will ati bother to make XP drivers for that thing?
The real question is, how much flaming will there be in the Rage3D forums if ATI doesn't release Win98 drivers? :lol:
Chalnoth
01-Jun-2006, 13:21
That was the case for NV4x, not for G7x, nor any ATI - there's no ROP limitation in relation to the colour write performance for these parts.
Well, that makes some sense, but if they can't do as many z-tests, then there's not a whole lot of point. It seems like this functionality would only be used in cases where you could have a trivial z-pass, but I doubt that's going to be the norm.
Just curious, how do current GPUs do alpha blending with AA? If a pixel is compressed and all samples are identical, does it only take one cycle to do the blending?
It does for NV40 and later, not sure about ATI.
trinibwoy
01-Jun-2006, 13:49
"that thing"? :lol: Not a fan, I take it?
Is there anything to be a fan of at this point besides a codename? What do we know about R6xx?
NocturnDragon
01-Jun-2006, 13:57
Is there anything to be a fan of at this point besides a codename? What do we know about R6xx?
First PC Unified Shader chip?
Is there anything to be a fan of at this point besides a codename? What do we know about R6xx?
"bother. . .that thing" suggested to me that tEd had something more specific that displeased him in mind. I was hoping he would tell us what it is --possibly something to do with USCs in a non-DX10/Vista environment? I dunno, but I'm curious.
It's a very common phrase around here ("that thing"), I wouldn't interprete too much into it.
It's a very common phrase around here ("that thing"), I wouldn't interprete too much into it.
The man asked if ATI would "bother" to make WinXP drivers for what is going to be their next flagship, released when XP is going to be somewhere between 95-100% of all installs, and you're not interested in the collection of unspoken assumptions he had in mind that would make that a reasonable question? tEd is no noob; I'm thinking there was something interesting behind that.
trinibwoy
01-Jun-2006, 14:51
lol, maybe he had a temporary lapse into noobness :lol:
lol, maybe he had a temporary lapse into noobness :lol:
:lol: If that's it, I'll have to apologize to him for pointing at it. I certainly do it often enuf! :razz:
The man asked if ATI would "bother" to make WinXP drivers
Maybe just because XP will be the "old" OS, since the world will instantly switch to Vista once it's released. You know, like when ATI expected everyone to switch to PCIe instantly back then, kinda... ;)
EasyRaider
01-Jun-2006, 19:00
At high res most pixels are magnified, so trilinear is a non-factor.
[...]
First of all, magnification and trilinear optimizations mean you only fetch texels from two mipmaps maybe 20% of the time or less.
This depends on texture resolution and scene composition. It's not hard to find cases where nearly everything is minified. You could be sampling from two mipmaps up to half of the time. I'd guess more than 20% typically.
Secondly single cycle trilinear costs almost as much as double the texture units. You need almost double the filtering hardware, double the data path from there to the texture cache (which needs double the read ports), double the request queue for memory loads, almost double the address calcs, etc. It's a waste of space compared to the more flexible option of additional bilinear texture units. NVidia abandoned this strategy with the Geforce2, even when they didn't have any trilinear opts.
Back then anisotropic wasn't the norm either. Tril. opts. reduce fetches less than high AF increases them.
8x AF needs up to 16 bilinear reads, maybe about 3 on average, while bandwidth should increase much less due to cache hits. So the idea is to save some transistors by using half as many, more capable units, while retaining texturing speed when bandwidth is less of a factor. But I really don't know if it would be worth it.
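The trade-off being proposed can be sketched numerically. The worst case of 16 reads for 8x AF is read here as 8 anisotropic taps across 2 mip levels; the unit counts (16 versus 8) are hypothetical, and perfect utilisation of the wider units is assumed, which real hardware would not achieve.

```python
def max_bilinear_reads(af_level, trilinear=True):
    """Worst-case bilinear reads for one filtered texture fetch."""
    mip_levels = 2 if trilinear else 1
    return af_level * mip_levels


def fetches_per_clock(units, reads_per_unit_clock, avg_reads_per_fetch):
    """Filtered fetches per clock, assuming the units are always kept busy."""
    return units * reads_per_unit_clock / avg_reads_per_fetch


print(max_bilinear_reads(8))  # 16

# Hypothetical configurations with the post's average of 3 reads per fetch.
many_simple = fetches_per_clock(units=16, reads_per_unit_clock=1, avg_reads_per_fetch=3)
few_capable = fetches_per_clock(units=8, reads_per_unit_clock=2, avg_reads_per_fetch=3)
print(many_simple, few_capable)
```

On paper the two configurations match; the disagreement in the thread is over whether the wider unit is actually cheaper in transistors than two simple ones.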
SugarCoat
01-Jun-2006, 19:07
Well it's not exactly stupid. There will be no DX10 for XP but of course the market of DX9 games will continue so one would assume these coming parts, aimed primarily at being the best DX9 parts yet, would have support for XP. However if ATI or Nvidia wanted they could absolutely cause "issues" to force Vista sales and DX10 adoption by dropping driver support for XP. It doesn't sound plausible but it's possible. Though if anyone were to cause that it would be some strong arming from Microsoft. I think Devs across the world would go nuts if that ever happened.
ATI's PCIE transition was fine. All the way up to the X850 they had AGP and PCIE parts. It was the 7800GTX in PCIE only that slapped people. R520 didn't come until months later when people were no longer really surprised by the lack of AGP.
Chalnoth
01-Jun-2006, 19:14
Microsoft is already doing that by not supporting any SM4 functionality in XP, though. If you want to make best use of your new DX10 card, you're going to need Vista, independent of whether or not IHV's have reduced driver support for XP (no driver support seems ridiculous to me).
ATI's PCIE transition was fine. All the way up to the X850 they had AGP and PCIE parts. It was the 7800GTX in PCIE only that slapped people. R520 didn't come until months later when people were no longer really surprised by the lack of AGP.
Hrrm? Perhaps a search on "Rialto" + delays might be in order? It was, what, 4 months late? 6 months?
trinibwoy
01-Jun-2006, 21:12
However if ATI or Nvidia wanted they could absolutely cause "issues" to force Vista sales and DX10 adoption by dropping driver support for XP.
I'm sure that would be tops on their list if their goal was to sell as few next generation cards as possible. Game developers are going to maintain support for XP and DX9 even in upcoming games and engines since that combo will describe the overwhelming majority of the gaming market well into 2007.
If DX9 is anything to go on, we'll be lucky to have more than a handful of DX10 engines next year and a smattering of games to go along with them.
Chalnoth
01-Jun-2006, 21:16
I expect DX10 adoption to be significantly slower than previous DX versions, too, partially due to Microsoft's refusal to implement any level of support of it in previous Windows versions.
SugarCoat
01-Jun-2006, 21:25
Hrrm? Perhaps a search on "Rialto" + delays might be in order? It was, what, 4 months late? 6 months?
Color me confused. ATI beat Nvidia to market with Retail PCIE products by 3-4 months which were in good availability. This is at the time when AGP was still dominant so other than the X800XL launch and the early X700s I don't remember ATI forcing anyone to go PCIE for specific price/performance sectors... in fact I'm confused entirely because you seem to be complaining about something opposite of what I am defending. Rialto was mainly made for PCIE to go AGP.... What was the issue where that was required in late 04 early 05? There were plenty of AGP parts at that time. And the X850 wasn't any huge leap over the X800s so... I'm confused by what you're talking about.
Sorry for OT.
I started to write a whole thing circling back to the original context of this threadlet, and realized the whole threadlet is OT anyway, and based on a bit of a throw-away quip in the second place.
So, meh, the uffdaland rep votes to table and move on. :smile:
After finally optimizing the Texture Cache in the simulator a bit, the Texture Units don't seem that bandwidth limited in old games like Doom3. The configuration is something like a unified R520 with the same clock for GPU and memory.
Doom3
http://people.ac.upc.edu/vmoya/img/doom3-charac-f767.png
Blazkowicz
03-Jun-2006, 14:35
wow, doom 3 is already old? after all, it's a game made for NV2x. what makes it old is the lowres textures. with games such as quakewars with more and higher res textures, should we say we need even more texturing power, or we won't because the TMU will use more bandwidth :?:
Chalnoth
03-Jun-2006, 14:38
wow, doom 3 is already old? after all, it's a game made for NV2x. what makes it old is the lowres textures. with games such as quakewars with more and higher res textures, should we say we need even more texturing power, or we won't because the TMU will use more bandwidth :?:
Well, since I'd be willing to bet that compressed textures aren't going anywhere for some time, even with HDR rendering, I think we could have quite a few more texture units before running into a bandwidth wall (assuming memory clocks continue to trace core clocks, roughly, and memory bus width doesn't change).
Galduta
04-Jun-2006, 21:18
http://www.ati-power.fr/Pourquoi-passer-a-une-architecture-unifiee-,ah147-3.htm
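Aux dires de Richard Huddy, les cartes seront prêtes dès septembre donc bien avant la sortie de Vista annoncée pour 2007. Nous n'avons donc plus longtemps à attendre afin de pouvoir profiter de cette évolution technologique qui est un tournant dans l'industrie graphique et qui je le souhaite continuera à porter la supériorité des cartes ATI.
[According to Richard Huddy, the cards will be ready as early as September, well before the release of Vista announced for 2007. So we do not have much longer to wait to benefit from this technological evolution, which is a turning point in the graphics industry and which, I hope, will continue to carry the superiority of ATI cards.]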
http://www.ati-power.fr/Pourquoi-passer-a-une-architecture-unifiee-,ah147-3.htm
Aux dires de Richard Huddy, les cartes seront prêtes dès septembre donc bien avant la sortie de Vista annoncée pour 2007. Nous n'avons donc plus longtemps à attendre afin de pouvoir profiter de cette évolution technologique qui est un tournant dans l'industrie graphique et qui je le souhaite continuera à porter la supériorité des cartes ATI.
English please? :mad:
Galduta
04-Jun-2006, 22:08
According to the statements of Richard Huddy, the cards will be ready as of September
According to the statements of Richard Huddy, the charts will be ready as of September
cartes can also be translated as cards, which makes a whole lot more sense than charts ;)
September? That's just a month after RV560/RV570. And no R590?
Galduta
04-Jun-2006, 22:59
Yes , cartes is " cards" :oops:
Wasn't the r590 release depending on how development for r600 goes anyway. Never really a done deal. I guess if r600 comes that early that there won't be a r590.
soo, conclusions ppl, boards ready september, on markets for holiday season?
http://www.hexus.net/content/item.php?item=5886
As for R600, he's tasked Jeff Fu with an innovative cooling solution for a high-end GPU he describes as "very very hot", and they're both convinced the market will be very happy with what's coming. From what I've seen, as Jeff finalises some placement issues, it'll be worth waiting for when it arrives decently before Christmas.
Nice work Rys.
Jawed
very very hot doesn't sound very very good
SugarCoat
08-Jun-2006, 21:17
I'd be quite surprised if it's much worse than what the X1900XTX clocks in at. Over 85C would be quite strange for 24/7 operation in my opinion.
it'll be worth waiting for when it arrives decently before Christmas.
That sounds very November-y as well. I think I'm at the point of just solidly sticking R600 in the November bucket and daring somebody to produce a credible source worth calling that into question.
Now G80? Still head-scratching on that one, tho November as well seems most likely at this point (and supported by some data points, just not as many as R600), I'm just not entirely comfortable that's nailed down.
SugarCoat
08-Jun-2006, 21:37
still waiting for the "Taped out" reports from the Inq on both chips. ;)
PeterAce
08-Jun-2006, 21:43
That sounds very November-y as well. I think I'm at the point of just solidly sticking R600 in the November bucket and daring somebody to produce a credible source worth calling that into question.
Now G80? Still head-scratching on that one, tho November as well seems most likely at this point (and supported by some data points, just not as many as R600), I'm just not entirely comfortable that's nailed down.
So we will get a couple of months testing R600 on WinXP + DX9, before the January-y 2007 release of Vista (and D3D10).
still waiting for the "Taped out" reports from the Inq on both chips. ;)
Well, if they haven't taped out yet then it is unlikely we'd see them in November, that is certainly true! That's less than 6 months now to Nov 30.
EasyRaider
08-Jun-2006, 21:54
I'd be quite surprised if it's much worse than what the X1900XTX clocks in at. Over 85C would be quite strange for 24/7 operation in my opinion.
Think wattage, not temperature. To the end user, 0 C or 200 C doesn't matter as long as it keeps working and stable. A higher wattage means higher requirements for PSU and cooling, which is bad for everyone.
Chalnoth
08-Jun-2006, 22:23
More correctly, power output and operating temperature aren't directly related, because you also need to know the efficiency of the heat dissipation to obtain normal operating temperatures.
It is entirely possible to have a product that has twice the power output operate at the same temperatures. Thus, the statement, "it's very, very hot," can really only be a statement about its power output. Operating temperature will be set by the cooling solution used.
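The point can be made concrete with the usual steady-state model, in which temperature rise equals power times the thermal resistance of the cooling path. The resistance values and wattages below are made-up illustrative numbers, not measurements of any card.

```python
def die_temperature_c(power_w, thermal_resistance_c_per_w, ambient_c=25.0):
    """Steady-state temperature for a given power and cooler thermal resistance."""
    return ambient_c + power_w * thermal_resistance_c_per_w


# Hypothetical numbers: doubling the power while halving the cooler's
# thermal resistance leaves the operating temperature unchanged.
modest_card = die_temperature_c(power_w=100, thermal_resistance_c_per_w=0.5)
hot_card = die_temperature_c(power_w=200, thermal_resistance_c_per_w=0.25)
print(modest_card, hot_card)  # 75.0 75.0
```

Read the other way round, for a fixed power a higher permitted temperature allows a higher thermal resistance, that is, a smaller cooler.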
EasyRaider
08-Jun-2006, 22:56
It is entirely possible to have a product that has twice the power output operate at the same temperatures.
It's not only possible, but very common.
Thus, the statement, "it's very, very hot," can really only be a statement about its power output.
Although unlikely, it could mean that it's designed to operate at an unusually high temperature. (In theory, that is. In the context of innovative cooling, it surely means high power output.)
SugarCoat
08-Jun-2006, 23:04
Well, if they haven't taped out yet then it is unlikely we'd see them in November, that is certainly true! That's less than 6 months now to Nov 30.
They usually say something about taped out dates, i think i remember G80 being in a second or third revision but info on the R600 series has been quiet.
More correctly, power output and operating temperature aren't directly related, because you also need to know the efficiency of the heat dissipation to obtain normal operating temperatures.
It is entirely possible to have a product that has twice the power output operate at the same temperatures. Thus, the statement, "it's very, very hot," can really only be a statement about its power output. Operating temperature will be set by the cooling solution used.
You read more into it than me. Cores are getting bigger and faster, and the acceptable operating temperature has risen quite a bit. Thus, very very hot means it runs hot. Just look at how small the fans and heatsinks were on the past cards, how can you say it's not directly related. Back in 2001 they were these little 20mm fans and sometimes didn't have fans at all all the way up to some 9800 parts. There is most certainly a trend
I could care less about the power requirements, if you can put down $600 or more for a card you can certainly put down $150-$200 for a very decent PSU. I invested in a 650 watt enermax almost a year ago for $170ish which has been running great and i expect to have zero issues with the next cards as well, even in SLI or CrossFire since it has 4 different sufficient rails. Basically i fail to see the power issue when it comes to the high end cards. They'll have to break the barrier of the common 400-450W that most people have at some point. Might as well be now.
Talking in terms of over all system stability that is, not 400-450W for just the card.
Something else people are forgetting, Vista in WGF2.0 will have optional 2D mode but most users will be running it in 3D mode which means the cards will be drawing substantially more power 24/7 for those users based off what the difference is today in 2D and 3D. I'm sure both companies will have their own tricks to help with that.
Think wattage, not temperature. To the end user, 0 C or 200 C doesn't matter as long as it keeps working and stable. A higher wattage means higher requirements for PSU and cooling, which is bad for everyone.
PSUs have never had a never ending shelf life. Especially for purchasers of enthusiast products. I fail to see the issue. Larger PCBs with fancy power grids/gates, more memory, faster memory, larger cores, faster cores, more cores = more juice. Its not like the performance is going to be the same but the power is growing. This happens with every release.
EasyRaider
09-Jun-2006, 00:05
Cores are getting bigger and faster, and the acceptable operating temperature has risen quite a bit. Thus, very very hot means it runs hot. Just look at how small the fans and heatsinks were on the past cards; how can you say it's not directly related? Back in 2001 they were these little 20mm fans, and some cards didn't have fans at all, all the way up to some 9800 parts. There is most certainly a trend.
What are you saying? We need more cooling because parts are running at ever higher temperatures? Wrong. Increasing acceptable operating temperature means less cooling needed (at the same wattage).
I could care less about the power requirements, if you can put down $600 or more for a card you can certainly put down $150-$200 for a very decent PSU. I invested in a 650 watt enermax almost a year ago for $170ish which has been running great and i expect to have zero issues with the next cards as well, even in SLI or CrossFire since it has 4 different sufficient rails. Basically i fail to see the power issue when it comes to the high end cards. They'll have to break the barrier of the common 400-450W that most people have at some point. Might as well be now.
PSUs have never had a never ending shelf life. Especially for purchasers of enthusiast products. I fail to see the issue. Larger PCBs with fancy power grids/gates, more memory, faster memory, larger cores, faster cores, more cores = more juice. Its not like the performance is going to be the same but the power is growing. This happens with every release.
Everything else staying the same, higher wattage is worse, that's all.
Personally, I'm not concerned about the need for expensive PSUs. But I care a lot about noise. Many people prefer NVidia cards because of lower heat and noise for similar performance.
SugarCoat
09-Jun-2006, 01:41
You're not making too much sense....we need more robust cooling because parts are using more and more power due to running faster than ever and thus creating more heat than ever. Why would you argue against that?
I was miffed at Chalnoth's comment and was using the example of what we've bloated to in the last 4 years for acceptable cooling as proof that it will be hard pressed to get much better air cooling than what we're seeing without getting even larger. What's next, 3 slot fan/copper block cooling? I don't want that. Power output and operating temp certainly are related, in that they are both growing. And I don't see too many more years left in fans and blowers if they keep growing at the rate they have. Not for flagship high end parts anyway.
Heat is bad, I don't know where the idea came from that it's no big deal. Not to mention the hotter components get, the more likely internal errors will happen. It also contributes largely to the MTBF.
EasyRaider
09-Jun-2006, 02:00
Heat is bad, I don't know where the idea came from that it's no big deal.
No one claimed otherwise. In fact, that's just what I was saying.
A higher wattage means higher requirements for PSU and cooling, which is bad for everyone.
Chalnoth
09-Jun-2006, 02:12
I still think you're mixing two different things, SugarCoat. Heat, the transfer of thermal energy, is not the same as temperature.
Yes, heat is bad, because it requires larger, more elaborate cooling systems, and has often been associated with loud cooling systems as well (which I think may be one big reason why the 7900 GTX seems to be doing better than the X1900 XT/XTX).
Yes, high operating temperature is bad, because it contributes to overall system instability and shorter mean time before failure.
All I'm saying is that the two things aren't directly correlated, because companies don't apply a GeForce 7900 GTX-style cooler to a GeForce 7300, or vice versa. So the real cost that you eat on the higher-end designs comes in with the bulkier, and sometimes noisier, cooling setup. Operating temperatures are usually kept about the same.
Well, if they haven't taped out yet then it is unlikely we'd see them in November, that is certainly true! That's less than 6 months now to Nov 30.
Here's something to keep in mind if you're talking about the release of R600... On a roadmap from MSI they mentioned a new card for the July/August timeframe. Funny thing is that it was called R580XTP. Obviously roadmaps can change in the blink of an eye, so make of it what you wish. ;)
nutball
09-Jun-2006, 07:38
I could care less about the power requirements, if you can put down $600 or more for a card you can certainly put down $150-$200 for a very decent PSU.
I'm happy for you, however I do care about power requirements. I care about noise you see. If I care about noise I have to care about heat in my case. If I care about heat in my case I have to care about the power consumption of the parts I put in the case.
ATI have already lost a sale with me this generation due to their apparent inability to produce a product which competes with 7900GT on performance and power consumption. Everything I'm reading about the R6xx generation leads me to believe they're going to repeat that little trick.
silent_guy
09-Jun-2006, 07:58
Although unlikely, it could mean that it's designed to operate at an unusually high temperature. (In theory, that is. In the context of innovative cooling, it surely means high power output.)
It's indeed very unlikely that it's designed for a high temperature. Companies that use standard cell based chip design (and that's basically all of them, except Intel, AMD and maybe a few others who have their own fab) use a library of cells.
All libraries are characterized at a few select operating conditions.
- fast: 0C, high voltage, fast process corner
- slow: 125C, low voltage, slow process corner
Those are the minimum necessary.
Sometimes also provided, but usually not used:
- typical: 25C, medium voltage, standard process
- very fast: like fast, but at -40C (used for e.g. outdoors telecom equipment and military)
It is *extremely* time consuming to characterize a standard cell library for different operating conditions, so no fabless semiconductor company is going to do this (anymore).
All chips are designed for worst-case conditions: slow for setup violations, fast for hold violations (the latter is not relevant here). That means: if the goal of the chip is to meet a certain speed, it will always be measured against the slow library conditions.
If you are going to use your chips at speeds that are guaranteed by the slow conditions, you will, in theory, not need to do at-speed testing or speed binning during production, which makes it much cheaper to test.
If you want to go higher (and I assume that's what GPU manufacturers do), you will have to do chip specific characterization tests, because the library will not guarantee the speed by default.
As for temperature and power: the temperatures in the libraries are the junction temperatures of the transistors. A fab (TSMC, UMC, IBM, ...) will guarantee correct operation and reliability up to 125C. Reliability typically means a 15 to 20 year lifetime. If you increase the voltage (to go faster) or increase temperature, reliability will go down. Higher voltages are much worse than higher temperatures: at a certain point, they will drastically increase all kinds of nasty effects that can effectively destroy a transistor.
Now when you're going to design the cooling of your boards, you're going to have to determine the target temperature of the die that's necessary to get a certain speed, and then the mathematics is entirely equivalent to an Ohm's law calculation:
Temperature ~ voltage
Thermal resistance ~ resistance
Power ~ current
Say you want a die temperature of 90C, you have an ambient temperature (in your computer case) of 40C and your chip consumes 100W. Then you'll need a thermal resistance of no more than R = ΔT/P = (90-40)/100 = 0.5 C/W, where ΔT is the difference between die and ambient temperature.
The resistance has multiple components: resistance between die and package (if any), inside the package, between package and cooling element, inside the cooling element etc. All this together has to be less than the 0.5 C/W.
If you don't make it, your junction temperature will go up and switching speed will go down.
If you can't find a good enough cooler, you could decrease the ambient temperature from 40 to 20 by using a fan to blow 20C air into your computer case etc.
If you want to add cooling yourself, the C/W is the number you're looking for to compare thermal quality. Do a google for "zalman cooler C/W" and you'll end up here:
http://www.directron.com/cnps6500aalcu.html
which has a thermal resistance of 0.36 C/W.
Anyway, I guess the main point is: it's all a matter of removing the heat as quickly as possible to prevent higher junction temperatures, since that's what kills the transistor speed.
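The thermal budget arithmetic above can be sketched in a few lines of Python. This is only an illustration of the Ohm's law analogy: the function names are made up for this example, and the 0.10 C/W figure for package and interface resistance is an assumption, not a number from the post.

```python
def max_thermal_resistance(t_die_c, t_ambient_c, power_w):
    """Largest total thermal resistance (C/W) that keeps the die at or below t_die_c."""
    if power_w <= 0:
        raise ValueError("power_w must be positive")
    return (t_die_c - t_ambient_c) / power_w


def die_temperature(t_ambient_c, power_w, resistances_c_per_w):
    """Die temperature for thermal resistances in series (they simply add up)."""
    return t_ambient_c + power_w * sum(resistances_c_per_w)


# Numbers from the post: 90C target die temperature, 40C case air, 100W chip.
budget = max_thermal_resistance(90, 40, 100)       # 0.5 C/W

# Blowing 20C air into the case relaxes the budget for the same chip.
budget_cool = max_thermal_resistance(90, 20, 100)  # 0.7 C/W

# The 0.36 C/W cooler from the post plus an ASSUMED 0.10 C/W for
# die-to-package and package-to-cooler resistance.
stack = [0.36, 0.10]
t_die = die_temperature(40, 100, stack)            # roughly 86C, inside the 90C target

print(budget, budget_cool, round(t_die, 1))
```

With these assumed numbers the cooler fits the 0.5 C/W budget with a little margin; doubling the power to 200W at the same resistances would push the die well past the target, which is the point about wattage made earlier in the thread.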
LeStoffer
09-Jun-2006, 08:35
Runs very, very hot + rumors of very high watt usage = most OEM's running screaming away.
Yes, I know things are probably much worse with the high end parts of the R600 series, but these issues will migrate somewhat down to the mid range as well (unless performance is cut a lot of course).
To be honest I now think that the R600 won't be much of a success before they move it to the 65nm process. And I bet you that ATI will try to do that as fast as possible.
I think both vendors will have very hot and power hungry parts, thus the OEM's won't have much choice.
They usually say something about taped out dates, i think i remember G80 being in a second or third revision but info on the R600 series has been quiet.
Can you find that one to point at?
SugarCoat
09-Jun-2006, 15:47
hmmm no i cant. just this
http://www.theinquirer.net/?article=30624
may have been my imagination
While graphics card power requirements have scaled quite sharply the past couple of years, it won't go on forever. CPUs are already there: new generation cores require the same amount or even less power than previous generation. GPUs will get there very soon, and maybe have already. In any case I don't see single chip graphics card power requirements increasing drastically anymore.
Sunrise
10-Jun-2006, 12:34
While graphics card power requirements have scaled quite sharply the past couple of years, it won't go on forever. CPUs are already there: new generation cores require the same amount or even less power than previous generation. GPUs will get there very soon, and maybe have already. In any case I don't see single chip graphics card power requirements increasing drastically anymore.
As has already been discussed, "power requirements" in the GPU sector is not really the best yardstick when thinking about the future. Sure, you could always design even your high-end parts not to reach new highs, but we certainly haven't reached that point yet, nor will we within about 5 years from now.
From an ecological standpoint it's always highly debatable, but this is not really how IHVs think, sadly.
What we can certainly all agree on is that it won't drastically increase forever, not at this current scale. If this development continues and consumer cooling solutions stay that "limited" (concerning heat transport and transfer off graphics cards), we'll certainly hit a wall pretty soon, but this may be only temporary if you take process technology into the equation again. As a side note, there are already enough solutions (for the enthusiast market) which raise these limits substantially.
One thing you shouldn't do, however (I know that you only said "single chip", but let me mention it anyway), is compare current and coming GPUs to CPUs. ("GPUs" is a little short-sighted, because graphics cards don't necessarily consist of only one GPU and one frame buffer (memory) anymore, so these "power requirements" you are talking about can add up pretty quickly.) We can speak again when GPUs get even more general purpose, but in the near future (5 years) I certainly don't see that coming. A CPU is still a general purpose processor, whereas GPUs are getting better at it every day, but they certainly won't replace CPUs, nor will they be comparable to CPUs process technology-wise etc. just yet.
http://www.vr-zone.com/?i=3724
vr-zone learned that r600 taped out and on 65nm!?
Kinda doubtful that it is on 65nm imo.
VR Zone with Red rumors? 65nm you say? Uh huh. Is it 192 shaders but only 3% yield?
Is it 192 shaders but only 3% yield?
no :smile:
ATi solutions require 2 bridges to work so perhaps the bandwidth is doubled as such.
So 2 bridges will be the new reason for crying about ATI's solution now that the dongles have gone. :lol:
Nice find tEd. :)
LeStoffer
13-Jun-2006, 15:03
ATI's R600 is on track for Q4:
http://www.theinquirer.net/?article=32387
Another interesting tidbit in the report is the mention of ATI’s next high end GPU—the R580+. It is expected the R580+ will be an interim product that is essentially the R580 with GDDR4 support till R600 (http://www.dailytech.com/article.aspx?newsid=2785) is ready. R580+ is expected to be available by the end of the year, possibly in time for the holiday shopping season. However, R580+ is not supposed to be an 80nm component.
http://www.dailytech.com/article.aspx?newsid=2889
There seems to be a messaging struggle going on about when R600 might be available. Why would you need an "interim" high-end part timed for late 3rd or 4th quarter unless R600 is into 2007?
LeStoffer
17-Jun-2006, 15:13
Why would you need an "interim" high-end part timed for late 3rd or 4th quarter unless R600 is into 2007?
I most certainly hope that the R580+ is just meant as a safety precaution in the unlucky event that R600 slips into 2007.
But if it means that ATI themselves don't believe that they can have R600 ready at any decent mass production scale before 2007, I'll be very sad. I can't wait that long to build my next DX10-capable rig and would most certainly prefer to have both G80 and R600 ready to choose freely from.
karlotta
17-Jun-2006, 15:52
http://www.dailytech.com/article.aspx?newsid=2889
There seems to be a messaging struggle going on about when R600 might be available. Why would you need an "interim" high-end part timed for late 3rd or 4th quarter unless R600 is into 2007?
Because it won't be a high-end part when R600 is released? And they will keep the 80nm going for midrange till the R600 is 65nm. Why have a 9800XT when the X800s came out...
If you draw out six-monthly intervals from when R520 should have released, you inevitably get R590 being an 80nm refresh around May/June 06 and R600 at the end of the year.
Seems to me that R590 and R600 will co-exist, with the latter occupying some nutty price bracket designed to entrap the early adopters of D3D10 tech.
Jawed
Bouncing Zabaglione Bros.
17-Jun-2006, 18:53
Could R600 be delayed due to (a) delay in Vista/DX10, and (b) delays in the 65nm process, which if rumours are to be believed is running behind schedule at most fabs?
Sunrise
17-Jun-2006, 19:39
If you draw out six-monthly intervals from when R520 should have released, you inevitably get R590 being an 80nm refresh around May/June 06 and R600 at the end of the year.
Seems to me that R590 and R600 will co-exist, with the latter occupying some nutty price bracket designed to entrap the early adopters of D3D10 tech.
Jawed
R580+ doesn't seem to be 80nm based; I would be very surprised if it is, actually. It looks like R580+ is going to be exactly what its naming suggests: "just" an R580 based part, selected ASICs with the possibility of AIB SKUs with GDDR4. RV570XT should be enough to take care of price/performance. Going even further with that ASIC, with dual-chip/CrossFire configurations in mind, this would also be very competitive.
Since R580+ can be equipped with GDDR4, which was also speculated to be possible in our R580+/R590 thread, it only makes sense as a high-end part that is solely for the purpose of upping the performance bar for about the same price (MSRP) the R580XTX originally had when introduced, or possibly even cheaper. It's just an interim part, like geo mentioned.
JHoxley
17-Jun-2006, 19:42
Seems to me that R590 and R600 will co-exist, with the latter occupying some nutty price bracket designed to entrap the early adopters of D3D10 tech.
Good thing I'll have a job paying a decent salary by then. I require D3D10 hardware as a matter of urgency :cry:
Could R600 be delayed due to (a) delay in Vista/DX10
I doubt it. Remember that Vista's business/enterprise release is still November (unless I missed something?) - it's only consumer that's been delayed until January.
A telling statement by SteveB a while back suggested that the MS side of things is fine (especially the system level components) and the consumer delays are to give the ISV's/IHV's time to get ready. Depending on how you look at it, things like R600 could be causing the delay or they could be taking advantage of the delay and not rushing it out any sooner than they need to.
hth
Jack
It was said, that WinXP will not fully support DX10. Couldn't be R580+ a high-end alternative to R600 for WinXP users, who aren't going to upgrade their OS? That could explain their possible coexistence.
It was said, that WinXP will not fully support DX10. Couldn't be R580+ a high-end alternative to R600 for WinXP users, who aren't going to upgrade their OS? That could explain their possible coexistence.
Man, I hope that's not it! For one thing, it strikes me as economic suicide for the R6 family.
SugarCoat
17-Jun-2006, 20:09
It was said, that WinXP will not fully support DX10. Couldn't be R580+ a high-end alternative to R600 for WinXP users, who aren't going to upgrade their OS? That could explain their possible coexistence.
Your theory would only make sense if people saw very little benefit from next gen parts on winXP which is not true. They'll be marketed as the best performing DX9 parts to date.
I dont really see what the big deal is. Until we see a real sign of the card going to mass production it may not show its head. Remember the X850XT 512MB that was only to show off the higher amount of memory at tradeshows (and i think 1 or 2 people won them)?
Its simply a marketing ploy should Nvidia get to market first with their cards. In order to keep interest by making a high end part (even one with low demand/stock) ATI pushes the street prices on all previous high end cards down not to mention price cuts that they would be sure to do. Even if Nvidia only got a 2-3 month head start its just good business to try to stifle their launch with something.
R580+ doesn't seem to be 80nm based; I would be very surprised if it is, actually. It looks like R580+ is going to be exactly what its naming suggests: "just" an R580 based part, selected ASICs with the possibility of AIB SKUs with GDDR4. RV570XT should be enough to take care of price/performance. Going even further with that ASIC, with dual-chip/CrossFire configurations in mind, this would also be very competitive.
Since R580+ can be equipped with GDDR4, which was also speculated to be possible in our R580+/R590 thread, it only makes sense as a high-end part that is solely for the purpose of upping the performance bar for about the same price (MSRP) the R580XTX originally had when introduced, or possibly even cheaper. It's just an interim part, like geo mentioned.
I shoulda spent a bit longer making my earlier posting.
The way I see it, R590 would have been the 80nm refresh. But if ATI's given up on it (because 80nm is late) then perhaps the alternative is R580+, a tweaked R580 on 90nm - with the supposed 15% gain from GDDR4 :roll:
Additionally, ATI's still going to need a $300 GPU (X1900XT) and a $400 GPU (X1950XT) when R600 launches in $500 and $600 variants. RV570=X1900GTO will be higher-margin replacement for X1900GT and will be around $200 by the end of the year.
If ATI doesn't meet those prices then clearly NVidia will just carry on cleaning up in the mainstream. I see it as extremely unlikely ATI will release anything other than an enthusiast R6xx part, therefore X1900GTO/XT prices have to push further down through the mainstream.
Does anyone doubt that 7900GT (or an 8-series replacement at ~performance) will cost about $200 by the end of the year?
Jawed
INKster
17-Jun-2006, 21:09
Your theory would only make sense if people saw very little benefit from next gen parts on winXP which is not true. They'll be marketed as the best performing DX9 parts to date.
I dont really see what the big deal is. Until we see a real sign of the card going to mass production it may not show its head. Remember the X850XT 512MB that was only to show off the higher amount of memory at tradeshows (and i think 1 or 2 people won them)?
Its simply a marketing ploy should Nvidia get to market first with their cards. In order to keep interest by making a high end part (even one with low demand/stock) ATI pushes the street prices on all previous high end cards down not to mention price cuts that they would be sure to do. Even if Nvidia only got a 2-3 month head start its just good business to try to stifle their launch with something.
Even now Nvidia has a higher profit margin per each 7 series GPU than ATI's X1K (due to the relatively lower transistor count and chip design complexity).
I doubt that strategy would work for ATI once G80 is out.
I believe Nvidia can push the 79xx series prices further down, and still be competitive with the upcoming X1K SKU's.
No one expected 7950 GX2 to be so "cheap", given its new approach on SLI, and i think the 90nm TSMC process will be quite mature by the time G80/NV50 is out this Fall, even considering the 500M+ transistors in it.
IMHO, it would be wise for ATI not to release "side-by-side" top of the line cards for DX9 and DX10.
If anything, the appeal to buy new DX10 parts should come from the top, in order to start the build up of the famous "Halo Effect" and word-of-mouth among enthusiasts, and then trickle down to the rest of the market segments when the production costs/yields fall to DX9 mainstream or lower levels.
Chalnoth
17-Jun-2006, 21:13
Could R600 be delayed due to (a) delay in Vista/DX10, and (b) delays in the 65nm process, which if rumours are to be believed is running behind schedule at most fabs?
Well, ATI had no problem launching the Radeon 9700 Pro quite a while before DX9, so I doubt (a) is an issue.
I shoulda spent a bit longer making my earlier posting.
The way I see it, R590 would have been the 80nm refresh. But if ATI's given up on it (because 80nm is late) then perhaps the alternative is R580+, a tweaked R580 on 90nm - with the supposed 15% gain from GDDR4 :roll:
Additionally, ATI's still going to need a $300 GPU (X1900XT) and a $400 GPU (X1950XT) when R600 launches in $500 and $600 variants. RV570=X1900GTO will be higher-margin replacement for X1900GT and will be around $200 by the end of the year.
If ATI doesn't meet those prices then clearly NVidia will just carry on cleaning up in the mainstream. I see it as extremely unlikely ATI will release anything other than an enthusiast R6xx part, therefore X1900GTO/XT prices have to push further down through the mainstream.
Does anyone doubt that 7900GT (or an 8-series replacement at ~performance) will cost about $200 by the end of the year?
Jawed
That could work. Using the existing gpu means they wouldn't have much investment they have to make back on R580+. But given that ddr4 support is already in R580, if it isn't going to 80nm, what justifies the "+" at all --why isn't there just an X1950xt sku with gddr4 featuring R580?
INKster
17-Jun-2006, 21:21
That could work. Using the existing gpu means they wouldn't have much investment they have to make back on R580+. But given that ddr4 support is already in R580, if it isn't going to 80nm, what justifies the "+" at all --why isn't there just an X1950xt sku with gddr4 featuring R580?
I wonder if an ATI R600 with GDDR3 would be cheaper than a R580+ with GDDR4, which is a non-existent type of memory (for now) ?
Bouncing Zabaglione Bros.
17-Jun-2006, 21:29
Well, ATI had no problem launching the Radeon 9700 Pro quite a while before DX9, so I doubt (a) is an issue.
They had to write and release a DX8 driver for it though. I don't know how practical that is given the major differences in DX10 and how R600 will leverage them in hardware.
That could work. Using the existing gpu means they wouldn't have much investment they have to make back on R580+. But given that ddr4 support is already in R580, if it isn't going to 80nm, what justifies the "+" at all --why isn't there just an X1950xt sku with gddr4 featuring R580?
I suppose "+" could be like an interim coding (as opposed to adding 1 for alternate fab or adding 10 for 80nm variant), so it could be anything, faster running, cooler, lower-voltage, better-yielding...?
I don't actually believe the "+" to be honest - it's prolly some dodgy thinking introduced at some point to account for GDDR4 memory being on board. Or some wishful marketing aimed at the AIBs by ATI: "now with GDDR4!!!".
Jawed
Sunrise
17-Jun-2006, 22:34
I shoulda spent a bit longer making my earlier posting.
The way I see it, R590 would have been the 80nm refresh. But if ATI's given up on it (because 80nm is late) then perhaps the alternative is R580+, a tweaked R580 on 90nm - with the supposed 15% gain from GDDR4 :roll:
That's basically following my own logic.
It would've made perfect sense to bring an 80nm high-end part to market, because as ATi transitions to cheaper but very performant SKUs, they will certainly welcome every cost saving measure they can get. R590 would've made a lot of sense in that context, but it doesn't look like it's going to happen, because they likely found their yields of R580 ASICs have improved to a point where there is a rather high percentage of cores that would easily run at >650MHz, and it would also benefit/affect their inventory. These could then be selected and paired with more capable memory that gives you at least 15-20% more performance across the board, and they could also price them very competitively. It may not be as good as a potential, speculative R590 part, but they also save the time and other associated costs of having to tape out another ASIC, working on yields and the other headaches.
Sunrise
17-Jun-2006, 22:43
I suppose "+" could be like an interim coding (as opposed to adding 1 for alternate fab or adding 10 for 80nm variant), so it could be anything, faster running, cooler, lower-voltage, better-yielding...?
I don't actually believe the "+" to be honest - it's prolly some dodgy thinking introduced at some point to account for GDDR4 memory being on board. Or some wishful marketing aimed at the AIBs by ATI: "now with GDDR4!!!".
Jawed
Well, i suppose they just needed something to differentiate between already "available" R580 cores/SKUs and a new part, so maybe it's just an internal "ATi to AIB" moniker, because it makes talking about those a lot easier. It certainly won't tell the typical layman what exactly ATi could've changed, but i think we're pretty much on the right track. ;)
@geo: Well, it's basically the same with RV560/RV570, which haven't been given a marketing name yet. Not officially, at least, because you know: "We don't talk about unannounced products." :lol:
SugarCoat
17-Jun-2006, 22:46
Even now Nvidia has a higher profit margin per each 7 series GPU than ATI's X1K (due to the relatively lower transistor count and chip design complexity).
I doubt that strategy would work for ATI once G80 is out.
I believe Nvidia can push the 79xx series prices further down, and still be competitive with the upcoming X1K SKU's.
No one expected 7950 GX2 to be so "cheap", given its new approach on SLI, and i think the 90nm TSMC process will be quite mature by the time G80/NV50 is out this Fall, even considering the 500M+ transistors in it.
IMHO, it would be wise for ATI not to release "side-by-side" top of the line cards for DX9 and DX10.
If anything, the appeal to buy new DX10 parts should come from the top, in order to start the build up of the famous "Halo Effect" and word-of-mouth among enthusiasts, and then trickle down to the rest of the market segments when the production costs/yields fall to DX9 mainstream or lower levels.
It wouldn't be any different than what Nvidia did with the 7800GTX 512 when the X1800XT launched. They don't have to spend a lot of money to try to take some glory away from Nvidia's launch (price cuts are pretty much a given). And if they can in fact show a tangible benefit because of the memory controller with GDDR4 that they say exists, then the earlier they show that the better, because people will see it as free performance that the R600 series is guaranteed to have. If it's indeed large enough people may in fact wait. What they certainly don't want to do is watch Nvidia launch their card and then remain quiet for the next 2-3 months, which causes people and OEMs alike to get paranoid that it's R520 all over again and they may as well purchase the only offering out now, being the 8800GTX or what have you.
ATI can use the NV50 launch for something like "Here is a little taste of whats to come"
INKster
17-Jun-2006, 22:59
It wouldn't be any different than what Nvidia did with the 7800GTX 512 when the X1800XT launched. They don't have to spend a lot of money to try to take some glory away from Nvidia's launch (price cuts are pretty much a given). And if they can in fact show a tangible benefit because of the memory controller with GDDR4 that they say exists, then the earlier they show that the better, because people will see it as free performance that the R600 series is guaranteed to have. If it's indeed large enough people may in fact wait. What they certainly don't want to do is watch Nvidia launch their card and then remain quiet for the next 2-3 months, which causes people and OEMs alike to get paranoid that it's R520 all over again and they may as well purchase the only offering out now, being the 8800GTX or what have you.
ATI can use the NV50 launch for something like "Here is a little taste of whats to come"
That is, of course, dependent of Samsung's own timetable.
What if the 7800 GTX 512 case (you know, the "1.1ns GDDR3, very expensive, still low product quantities on the market") happens again ?
That was serious enough for having to downclock those same 1.1ns chips for the 7900 GTX, 4 months later !
As far as i know, GDDR4 is an extension of GDDR3, but was designed to achieve higher operational speeds.
But what if it is so expensive and scarce that even a top-of-the-line card can't have its launch depending on it ?
Going top grade GDDR4 in '06 is too risky, either for ATI or Nvidia.
The first product on the market with GDDR3 was the "rev. 2.0" of GeforceFX 5700 Ultra, remember ?
Lower speed memory chips, higher yields.
That's what they need to do this year, to "test the waters" first with a mainstream product.
SugarCoat
17-Jun-2006, 23:07
Who said they need the best GDDR4? GDDR4 is going to essentially start where GDDR3 left off, around 800-900MHz. Samsung should be all set to supply both companies with GDDR4 at those speeds, and yields and thus cost of the chips should actually be better than, for instance, 1.1ns GDDR3. Not to mention power consumption, which is going to be a real issue for both camps. I don't expect either R600 or NV50 to have GDDR4 memory speeds too far past 2000MHz effective, if they can even get there at all. Eventually it should support speeds of up to 3200MHz, but that might not be until 2008 or even 2009.
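As a rough sketch of how the clock, "effective" and cycle-time figures in this post relate: the Python below assumes double data rate signalling (two transfers per clock) and, purely for illustration, a 256-bit memory bus. The function names are made up for this example, and nothing here is a confirmed R600 or NV50 specification.

```python
def clock_from_cycle_ns(cycle_ns):
    """Memory clock in MHz for a chip rated at the given cycle time in ns."""
    return 1000.0 / cycle_ns


def effective_rate_mhz(clock_mhz, transfers_per_clock=2):
    """'Effective' data rate, assuming two transfers per clock (DDR)."""
    return clock_mhz * transfers_per_clock


def bandwidth_gb_per_s(effective_mhz, bus_width_bits):
    """Peak bandwidth in GB/s (1 GB = 1e9 bytes) for a given bus width."""
    return effective_mhz * 1e6 * (bus_width_bits / 8) / 1e9


ASSUMED_BUS_BITS = 256  # illustration only, not a figure from the thread

# 1.1ns GDDR3 works out to roughly a 909MHz clock.
gddr3_clock = clock_from_cycle_ns(1.1)

# A 900MHz starting point for GDDR4 is 1800MHz effective.
gddr4_start = effective_rate_mhz(900)

# Peak bandwidth at the 2000MHz effective figure mentioned above.
peak = bandwidth_gb_per_s(2000, ASSUMED_BUS_BITS)

print(round(gddr3_clock), gddr4_start, peak)
```

On the assumed 256-bit bus, going from 1800MHz to 2000MHz effective is about an 11% bandwidth increase, so how much of that shows up as frame rate depends on how bandwidth-limited the card is.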
Chalnoth
17-Jun-2006, 23:17
They had to write and release a DX8 driver for it though. I don't know how practical that is given the major differences in DX10 and how R600 will leverage them in hardware.
I don't see any reason why it'd be hard to write a DX9 driver for DX10 hardware. It's still a superset of the functionality.
I don't see any reason why it'd be hard to write a DX9 driver for DX10 hardware. It's still a superset of the functionality.
Must take a lot of resources though.