View Full Version : So is R580 supposed to be 48 pipes or 48 ALUs?
http://www.theinquirer.net/?article=27805
If it's pipes, seems they would have no problem "taking back the performance crown from Nvidia" as INQ says. So why phrase it like that?
If it's ALU's, we get a good idea of X360 performance early. Which most excites me.
Though it may be clocked at 700 MHz.
http://www.xbitlabs.com/news/video/display/20051111144411.html
Also Xbit refers to them as "pixel processors" rather than pipes, which is odd.
And another oddity, 8 vertex processors wouldn't seem to be able to support 48 pipes, but 48 ALU's maybe.
SynapticSignal
19-Nov-2005, 11:09
I think 48 ALUs,
so 12 shader cores
SynapticSignal
19-Nov-2005, 11:12
To be clear, 12 of these?
http://www.tgmonline.it/tgmfiles/bovas/r520/foto/engin3.jpg
trinibwoy
19-Nov-2005, 11:37
48 shader cores, 2 ALU's each, one full, one mini ?
48 shader cores, 2 ALU's each, one full, one mini ?
Yes and each ALU is to coissue 1 scalar and 1 Vec3 instruction.
SynapticSignal
19-Nov-2005, 12:11
Yes and each ALU is to coissue 1 scalar and 1 Vec3 instruction.
Hmm, it appears to be different, since ATI named the ALUs (ALU 1 and ALU 2).
It seems to be scalar ALU 1 + scalar ALU 2 + vector ALU 1 + vector ALU 2 = 4 ALUs.
Ailuros
19-Nov-2005, 13:19
4 quad "MIMD" (I'd actually call them 16 SIMD channels, but then I'll get corrected again LOL) and 48 ALUs.
If it's ALU's, we get a good idea of X360 performance early. Which most excites me.
Not really comparable either, since Xenos has "general purpose" ALU units; 3* 16-way SIMD to be more precise.
I'm not so sure Xenos is even faster than R520, let alone R580.
Dave Baumann
19-Nov-2005, 13:31
(I'd actually call them 16 SIMD channels, but then I'll get corrected again LOL)
And that certainly wouldn't work in this case.
Megadrive1988
19-Nov-2005, 18:04
http://www.theinquirer.net/?article=27805
If it's pipes, seems they would have no problem "taking back the performance crown from Nvidia" as INQ says. So why phrase it like that?
If it's ALU's, we get a good idea of X360 performance early. Which most excites me.
Though it may be clocked at 700 MHz.
http://www.xbitlabs.com/news/video/display/20051111144411.html
Also Xbit refers to them as "pixel processors" rather than pipes, which is odd.
And another oddity, 8 vertex processors wouldn't seem to be able to support 48 pipes, but 48 ALU's maybe.
It seems R580 will have 16 pixel pipes, like R520 (both have 16 ROPs), but R580 will have 3 shader ALUs (or fragment shaders) per pipe, so that's where the 48 comes from.
that's the explanation from me, a non-techie :)
R520 has four quads. Each quad has:
four texture pipes,
four shader pipes and
four ROP pipes.
R580 also has four quads. Each quad has:
four texture pipes,
12 shader pipes and
four ROP pipes.
The shader pipes in both R520 and R580 consist of a primary ALU that can execute all maths instructions and a secondary ALU with minimal functionality.
R580's 12 shader pipes in each quad take it in turns to use the texture and ROP pipes in the quad - four pipes at a time.
Jawed
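The per-quad breakdown above can be turned into chip totals with a few lines of arithmetic. This is only a sketch: the dictionary layout and the `totals` function are illustrative names, and the unit counts are the ones claimed in the post, not confirmed specifications.

```python
# Per-quad unit counts as described in the post above. The dict layout
# and names are illustrative assumptions, not ATI terminology.
QUADS = 4

CHIPS = {
    "R520": {"texture": 4, "shader": 4, "rop": 4},
    "R580": {"texture": 4, "shader": 12, "rop": 4},
}


def totals(chip):
    """Multiply the per-quad counts by the number of quads."""
    return {unit: count * QUADS for unit, count in CHIPS[chip].items()}


for name in CHIPS:
    t = totals(name)
    # Shader pipes available per texture pipe.
    ratio = t["shader"] // t["texture"]
    print(name, t, "shader:texture =", ratio)
```

On these numbers R580 ends up with 48 shader pipes sharing the same 16 texture pipes and 16 ROPs as R520, which is where both the "48" and the "16 x 3" readings come from.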
I don’t get how some people on this forum failed to see, from Dave's article and the explanation of the RV530 architecture, what R580 will look like. To totally simplify, R580 will be 4 times RV530, just like R520 is 4 times RV515!
The word “quad” in this context comes from the grouping of four PS pipes, so you can’t really have “quads” with twelve pipes. But what you can have is “twelve quad engines” – 12 sets of four PS pipes.
To totally simplify, R580 will be 4 times RV530, just like R520 is 4 times RV515!
Not quite. RV530 has 8 Z units per quad; R580 has 4.
48 fragment shader units, each of which consists of one full and one mini ALU (arranged in four quads with 12 shader units per quad). So, multiple fragment ALUs per shader unit.
AFAIK, Xenos also has 48 shaders (http://www.beyond3d.com/articles/xenos/index.php?p=07) (not specifically fragment), but they're differently arranged (three shader "pipes" with 16 shaders per pipe) and don't include the mini ALU. So, one ALU per shader unit, but the ALU isn't equivalent to something on R5x0.
You'd be doing yourself a favor by not using The Inq--OK, Fuad specifically--for technical details of GPUs. :)
R300King!
20-Nov-2005, 06:13
This is all fine and dandy but what performance % over the R520 will the R580 be?
5% - 15%
15% - 25%
25% - 35%
35% - 45%
45% - 60%
60% - 80%
80% - 100%
100%+
Skrying
20-Nov-2005, 06:28
Depends HIGHLY on what game you're playing.
Chalnoth
20-Nov-2005, 07:36
There may be some very special scenarios where the performance improvement climbs to 100%. But most of the time it's going to be under 50%, is my guess.
Edit:
Of course, some pure ALU synthetic benchmarks may show the full 3x performance.
Skrying
20-Nov-2005, 07:47
I think R580 will be clocked at 725 MHz and 1800 MHz. I think we'll see around a 35% average increase and in some cases much higher.
wireframe
20-Nov-2005, 07:56
I think R580 will be clocked at 725 MHz and 1800 MHz. I think we'll see around a 35% average increase and in some cases much higher.
Those seem to be some fairly optimistic numbers on the core clock. Maybe ATI can pull it off, but let's not forget how massive this thing is compared to R520. Even if they could pull that off, would they want to? You don't want to take away the incentives for purchasing the next generation GPU (yay, I said GPU for ATI instead of VPU...*erm*)
The point here is that R580 is a step up in shading power. Technically, I don't think it needs a higher clock because what it proposes to do it will do through sheer parallelism. There is no way R520 can touch it and compete. It's in the clear, by design.
Also, is it really interesting to think about those 35% improvements? That would be for current games, and I think it is safe to say that it will be for somewhat dated current games that have moderate shader usage. Perhaps something like Far Cry. The thing to keep in mind here is that those games run just fine on current hardware, R520 and G70, and I don't know that it is interesting to even pay attention to those 35% numbers. In newer, more shader-heavy titles, however, you should see a huge impact, and I don't think it is stretching it to say that upcoming games that rely heavily on shaders will see a 100% increase in performance. Then again, those games will surely also scale to suit less endowed hardware and the question then becomes: are those effects that are made feasible with the new hardware worth it?
Basically, what I am getting at here is that performance can no longer be incremental like back in the vertex & texture days. I think the next step in shading usage can be likened to going from wireframe graphics to solid fill graphics and then to high-resolution textured graphics. The first step, done, is just a "filler" (pardon the pun) while the next, adding textures, you either go all the way or you forget about it. It's going to be costly.
Chalnoth
20-Nov-2005, 07:58
How well it does will depend on the ALU to TEX op ratios. And let's not forget that texturing can take quite a while with anisotropic filtering.
wireframe
20-Nov-2005, 08:07
I just thought of another thing that goes with how I see the difference in computation/performance evolving. Take a newish game like FEAR and look at it. You can actually see where shaders are running, like boundaries, like little dabs of color here and there on a canvas. I think it is safe to say that in future games there will be absolutely no pixel on the screen that hasn't been shaded at least once or twice (or more). Add to that more complex operations and you get some idea exactly how processing intensive it will be to generate a complete frame.
The R580 seems to take a step down this path. I don't even know that it should be compared to R520. We saw the same thing going from Geforce 2 Ultra to the Geforce 3. People were comparing numbers and wondering why the GF2 was still faster in many ways, ways that didn't matter for the future. While the R580 doesn't have a new technology like in the step from GF2 to GF3, it has it in quantity that enables a new way of operating.
Skrying
20-Nov-2005, 17:20
My predicted clocks are what I think ATi might do. They just simply always seem to clock something beyond what they can get a reasonable supply at.
I agree, the R580 will be much more "future" proof (if there is such a thing) than any other current card right now. With the right mix of dynamic branching and shading power it could be a really great card.
When is the theoretical invisible launch of the R580 taking place?
wireframe
20-Nov-2005, 18:24
I agree, the R580 will be much more "future" proof (if there is such a thing) than any other current card right now. With the right mix of dynamic branching and shading power it could be a really great card.
I realize you put future in quotation marks and even questioned it in parentheses, but I think you could have solved the problem by thinking another way. The R580 design is the future. It is not future proof. It is just the proportions that are needed to advance the industry.
In this way, I don't think R580 will be anything special, just like how Geforce 3, when all was said and done, wasn't very special either. It was a first-of-breed and did what it had to do.
One thing to look for, which might be interesting, is to see how R520 and G70 compare in the near future. The G70 shares some of this top-heavy design. More is, of course, needed, but it hints at that direction. It might also be fun to speculate on how far the 16 ROP base will carry. Will we see 16-3-9-1 designs in the future?
Matasar
20-Nov-2005, 18:41
I think it will be a monster card, something that drops jaws.
CarstenS
20-Nov-2005, 19:03
There may be some very special scenarios where the performance improvement climbs to 100%. But most of the time it's going to be under 50%, is my guess.
Edit:
Of course, some pure ALU synthetic benchmarks may show the full 3x performance.
Has that been proven yet?
In all the benchmarks I've run so far, even the most synthetic ones, the highest I achieved was a factor of 2.81 for X1600XT versus X1300 Pro. Far more common was a factor of only two and below - the average should be between 1.6 and 2.4 (just a guess, did not bother to calculate). Sometimes it wasn't even faster at all, though those must have been the proverbial exceptions to the rule.
Those shaders were run with the texture instructions left in place, but no texturing carried out at all. So in "real" applications, where textures and filtering also play a role, this performance gain might shrink even further.
I doubt that this will change much with R580 compared to R520, partially due to the increased thread size which seems to reduce efficiency a bit from the 16-Pixel-Threads on R520 and RV515.
Does anyone know about Fetch4 and the doubled amount of Z-throughput (without MSAA applied) to be in- or excluded in R580 for sure? My guess would be "included". :)
If AnandTech is right about the R580 launch timeframe, and if we're generally right that we aren't expecting NV 90nm high-end until March, then I'd tend to think R580 will launch a little more conservatively on clocks than just suggested and leave a little something in their pocket for a topper if they need one in March/April. Much like NV did with G70.
Chalnoth
20-Nov-2005, 19:28
Has that been proven yet?
Of course not. There won't be any proof until the card is released.
I expect R580 somewhere around Christmas… Orthodox, of course! And concerning the NV 90nm product, they have one right now, but it’s only a two-pipe architecture – the integrated 6100. For CeBIT, I expect only mid-range products – G72(3), and for Computex their 90nm high end. The superior architecture of R580 will not require clocks higher than those we’ve seen on R520, and I think that ATI will push more into tuning the architecture for the transfer to 80nm, and go for more convenient clocks at the beginning of R580.
Dave Baumann
21-Nov-2005, 00:17
NVIDIA have two 90nm products already - Go 7300 is shipping.
DOGMA1138
21-Nov-2005, 00:20
It seems R580 will have 16 pixel pipes, like R520 (both have 16 ROPs), but R580 will have 3 shader ALUs (or fragment shaders) per pipe, so that's where the 48 comes from.
that's the explanation from me, a non-techie :)
Isn't the 520 texture and fillrate limited, with a lot of "extra" shader power?
So what's the point of "gimping" the R580 in the same way?
AlphaWolf
21-Nov-2005, 00:27
Isn't the 520 texture and fillrate limited, with a lot of "extra" shader power?
So what's the point of "gimping" the R580 in the same way?
Depends on the titles. The idea is that as time goes by the games become more shader limited.
Ailuros
21-Nov-2005, 03:35
And that certainly wouldn't work in this case.
We've had that debate again, hence my parenthesis. Neither "full" SIMD nor "full" MIMD, but in a relative sense something in between.
Personally I'd be very surprised if I'd see in the foreseeable future full MIMD units with the first unified shader cores for the PC; it rather sounds to me like an increased transistor budget with both benefits and downsides. Add the possible workarounds for the downsides and I'm not so sure such a case scenario is affordable yet.
Always IMHLO of course ;)
Ailuros
21-Nov-2005, 03:38
48 fragment shader units, each of which consists of one full and one mini ALU (arranged in four quads with 12 shader units per quad). So, multiple fragment ALUs per shader unit.
AFAIK, Xenos also has 48 shaders (http://www.beyond3d.com/articles/xenos/index.php?p=07) (not specifically fragment), but they're differently arranged (three shader "pipes" with 16 shaders per pipe) and don't include the mini ALU. So, one ALU per shader unit, but the ALU isn't equivalent to something on R5x0.
You'd be doing yourself a favor by not using The Inq--OK, Fuad specifically--for technical details of GPUs. :)
Beautifully illustrated I might say :)
http://www.digitimes.com/news/a20051124A7039.html
ATI Technologies is currently in the final stage of testing its R580 graphics processor units (GPUs) with customers, sources at Taiwan graphics card makers indicated, adding that the graphics chip vendor expects to launch the flagship chip in early 2006.
ATI will also be enhancing its entry-level GPU product line on the 80nm production node, with new RV560, RV535, and RV505 chips, the sources noted.
The R580 will be ATI’s latest offering in its flagship GPU line, following the launch of the Radeon X1800 series in early October. The new chip, which entered the tape-out stage in the middle of the third quarter, is now in volume shipments for sampling, according to the sources.
ATI has started implementing an 80nm process technology to produce its entry-level RV560, RV535, and RV505 chips at Taiwan Semiconductor Manufacturing Company (TSMC), aiming for better production cost efficiency, indicated the sources. Volume shipments are slated for the first quarter of next year, the sources expect.
80nm in the first quarter? My goodness. Wouldn't it be something if ATI is shipping 80nm before NV is shipping their top-end in 90nm. . .
wireframe
24-Nov-2005, 14:00
80nm in the first quarter? My goodness. Wouldn't it be something if ATI is shipping 80nm before NV is shipping their top-end in 90nm. . .
It would be something if ATI shipped anything at all! :razz:
(that one was difficult to resist)
kyetech
24-Nov-2005, 14:02
So I assume that, by knowing the config of the R580 and how many transistors the R520 is made from, it is possible to deduce the transistor count?
So what will it be, roughly? Does anyone have any ideas?
410m +-5%?
I'll sign up for that range as well (390-430).
vol2005
24-Nov-2005, 23:04
According to analysis firm (http://www.xbitlabs.com/news/multimedia/display/20051123214405.html) the Xenos' BOM cost was estimated to be $141 (incl. EDRAM chip).
So, is it possible to evaluate somehow what the cost of R580 hence would be :?:
AlphaWolf
24-Nov-2005, 23:10
According to analysis firm (http://www.xbitlabs.com/news/multimedia/display/20051123214405.html) the Xenos' BOM cost was estimated to be $141 (incl. EDRAM chip).
So, is it possible to evaluate somehow what the cost of R580 hence would be :?:
Do you mean just the cost of the r580 chip? Or a card based on it?
vol2005
24-Nov-2005, 23:19
Do you mean just the cost of the r580 chip? Or a card based on it?
Chip, of course...
(Though the latter is of no less interest :wink: )
I expect R580 somewhere around Christmas… Orthodox, of course!
Orthodox Christmas is 25th December...unless you're Russian, Copt (IIRC)...
NVIDIA have two 90nm products already - Go 7300 is shipping.
RSX also? ;)
Orthodox Christmas is 25th December...unless you're Russian, Copt (IIRC)...
What do you mean? Russians are Orthodox. And Christmas is somewhere at the beginning of January, on the 7th I think.
EDIT:
So is it 4 x 12 or 16 x 3 now?
Ailuros
25-Nov-2005, 12:07
There are two kinds of Calendars in the Orthodox churches.
So is it 4 x 12 or 16 x 3 now?
Is what what? R500 appears to be 3x16, R520 is 16x1, and R580 will be 16x3. It's just that the second figure isn't equal between both chip families. Clear now? ;)
If I understand correctly, R580 should be 3x more powerful than R520 in GPGPU and pixel-shader-intensive games (like UE3, I suppose), but only in those; the rest will come from speed increases (if there are any), right :?:
AlphaWolf
26-Nov-2005, 19:32
If I understand correctly, R580 should be 3x more powerful in GPGPU and shader-intensive games (like UE3, I suppose), right :?:
That would be the theoretical maximum assuming equal clocks. The reality is most likely going to lie somewhere below that.
Due to lower efficiency, I guess... if so, why does it have lower efficiency?
Chalnoth
27-Nov-2005, 05:33
Due to lower efficiency, I guess... if so, why does it have lower efficiency?
It has the same number of texture units, for one. So we're not even talking shader-intensive games (as shaders may well use a large number of textures). We're talking about games that use very long shaders compared to the number of texture accesses within the shader.
I don't believe that 3x improvement will ever be seen outside of synthetic benchmarks.
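A toy model makes the point about ALU:TEX ratios concrete. It assumes, purely for illustration, that ALU and texture work overlap perfectly, that each ALU op and each texture op costs one time unit on the baseline chip, and that the newer chip has three times the ALU rate with an unchanged texture rate. Real hardware (filtering cost, latency hiding, bandwidth) will behave less cleanly; `relative_speedup` is a hypothetical helper, not a measurement.

```python
def relative_speedup(alu_ops, tex_ops, alu_ratio=3.0):
    """Speedup of a chip with `alu_ratio` times the ALU rate and the same
    texture rate, under a simple max(ALU time, TEX time) bottleneck model.

    Assumption: one ALU op and one texture op each cost one time unit on
    the baseline chip, and ALU and texture work overlap perfectly.
    """
    base_time = max(alu_ops, tex_ops)
    new_time = max(alu_ops / alu_ratio, tex_ops)
    return base_time / new_time


for alu, tex in [(1, 1), (2, 1), (3, 1), (12, 1)]:
    print(f"ALU:TEX {alu}:{tex} -> {relative_speedup(alu, tex):.2f}x")
```

In this model a shader with as many texture ops as ALU ops gains nothing, and the full 3x only shows up once the shader has at least three ALU ops per texture access, which matches the "very long shaders compared to the number of texture accesses" condition.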
That would be the theoretical maximum assuming equal clocks. The reality is most likely going to lie somewhere below that.
I seem to recall caboosemoose saying on a regular basis that the original ATI roadmap suggests 1.5x performance. Of course that may be old enough now (9 months since it was first discussed here) that the actual trend in games may (or may not) impact their forecast. Come to think of it, that interview being discussed in another thread seems to suggest the 3-1 thing is actually running a bit ahead of that in recent games, so maybe it'll do a bit better.
It has the same number of texture units, for one. So we're not even talking shader-intensive games (as shaders may well use a large number of textures). We're talking about games that use very long shaders compared to the number of texture accesses within the shader.
I don't believe that 3x improvement will ever be seen outside of synthetic benchmarks.
I don't see how you can say that. With a 256-bit interface adding more texture units won't buy you much.
Chalnoth
28-Nov-2005, 00:11
I don't see how you can say that. With a 256-bit interface adding more texture units won't buy you much.
Well, that doesn't really impact what I wrote at all.
But I just crunched the numbers, and I don't believe this is the case. Consider the case of trilinear filtering and compressed textures. Compressed textures can take up only 4 bits per texel.
If the texture cache is doing its job, with trilinear filtering, one would average one texel per pixel for the first MIP map level, and one quarter that for the second. This makes for an average of 5 bits per two bilinear filters (trilinear filtering is made up of two bilinear filters).
So, that's going to be an average of only 40 bits per clock that you would need to keep a 16-texture pipeline architecture full, when using your standard compressed textures.
Now, one can imagine that with anisotropic filtering, there may not be as much overlap between the various bilinear filters that the hardware takes within each pixel, but this will not result in more than a 4-fold increase in bandwidth for each bilinear filter. So we're only up to 160 bits per clock for a 16-texture pipeline architecture.
You could even go back to generating the second MIP map for trilinear filtering on the fly, which would result in a bandwidth drop to 128 bits per clock.
Anyway, for compressed textures at least, I can easily see improvements in performance all the way up to 32 texture pipelines per clock.
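The numbers in this post can be reproduced step by step. The sketch below assumes, as the post does, 4 bits per compressed texel, a perfect texture cache, and one bilinear filter per texture pipe per clock; the function and parameter names are illustrative.

```python
BITS_PER_TEXEL = 4    # compressed texture figure from the post
TEXTURE_PIPES = 16    # texture pipes in the architecture being discussed


def bits_per_clock(texels_per_trilinear, aniso_factor=1, pipes=TEXTURE_PIPES):
    """Average texture bandwidth (bits/clock) needed to keep `pipes` busy.

    One trilinear sample is two bilinear filters, and each pipe is assumed
    to issue one bilinear filter per clock.
    """
    bits_per_bilinear = BITS_PER_TEXEL * texels_per_trilinear / 2
    return bits_per_bilinear * aniso_factor * pipes


# One texel per pixel from the first MIP level, a quarter from the second.
trilinear = bits_per_clock(1 + 0.25)
# Worst case assumed in the post: anisotropic filtering costs 4x per filter.
aniso = bits_per_clock(1 + 0.25, aniso_factor=4)
# Second MIP level generated on the fly, so only the first level is fetched.
on_the_fly = bits_per_clock(1, aniso_factor=4)

print(trilinear, aniso, on_the_fly)  # 40.0 160.0 128.0
```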
Are those texture units reserved solely for textures, or are they used for general memory access? For instance, do the shader units use the texture units to access other data types in memory?
Mintmaster
28-Nov-2005, 04:57
Anyway, for compressed textures at least, I can easily see improvements in performance all the way up to 32 texture pipelines per clock.
I think it could be even more than that.
Your example is only considering minification. Under magnification, which covers a lot of pixels at high resolution, you'll have even lower bandwidth requirements.
The GTX has 70 bytes per clock, so you could easily do 100-200 texture accesses per clock on average if bandwidth is your limitation and compressed textures are used.
Of course, the situation changes drastically when textures are not compressed and even more so when high precision formats are used. Dynamically rendered textures, shadow maps, HDR, etc. are changing the BW demands of texturing. So, rwolf, I don't think this had anything to do with the decision.
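As a rough check on the 100-200 figure, the 70 bytes per clock quoted above can be divided by the per-access cost from Chalnoth's estimate (5 bits per trilinear sample, so 2.5 bits per bilinear filter). Both inputs are taken from the thread; the helper name is illustrative.

```python
BYTES_PER_CLOCK = 70                    # figure quoted in the post for the GTX
BITS_AVAILABLE = BYTES_PER_CLOCK * 8    # 560 bits per clock


def accesses_per_clock(bits_per_access):
    """Texture accesses per clock that the memory bandwidth alone could feed."""
    return BITS_AVAILABLE / bits_per_access


print(accesses_per_clock(5))    # trilinear samples: 112.0
print(accesses_per_clock(2.5))  # bilinear filters:  224.0
```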
Chalnoth
28-Nov-2005, 05:02
Are those texture units reserved solely for textures, or are they used for general memory access? For instance, do the shader units use the texture units to access other data types in memory?
Textures are the only data type available for reading within the pixel shader.
Well, that doesn't really impact what I wrote at all.
But I just crunched the numbers, and I don't believe this is the case. Consider the case of trilinear filtering and compressed textures. Compressed textures can take up only 4 bits per texel.
If the texture cache is doing its job, with trilinear filtering, one would average one texel per pixel for the first MIP map level, and one quarter that for the second. This makes for an average of 5 bits per two bilinear filters (trilinear filtering is made up of two bilinear filters).
So, that's going to be an average of only 40 bits per clock that you would need to keep a 16-texture pipeline architecture full, when using your standard compressed textures.
Now, one can imagine that with anisotropic filtering, there may not be as much overlap between the various bilinear filters that the hardware takes within each pixel, but this will not result in more than a 4-fold increase in bandwidth for each bilinear filter. So we're only up to 160 bits per clock for a 16-texture pipeline architecture.
You could even go back to generating the second MIP map for trilinear filtering on the fly, which would result in a bandwidth drop to 128 bits per clock.
Anyway, for compressed textures at least, I can easily see improvements in performance all the way up to 32 texture pipelines per clock.
512 bits (256 x 2 for DDR2 and up) / 32 bits for each memory access = 16 memory accesses per memory clock.
512 bits (256 x 2 for DDR2 and up) / 64 bits for each memory access = 8 memory accesses per memory clock.
You would need pretty good caching of textures when you surpass 16 texture units because you can only make 8 or 16 memory accesses per memory clock depending on the granularity of the individual memory access. Nvidia I believe can only do 8 large reads per memory clock and ATI can do 16 small reads.
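The access counts above follow directly from the bus width and the access granularity. A small sketch, assuming a double-data-rate bus (two transfers per memory clock) as in the post; the function name is illustrative.

```python
def accesses_per_memory_clock(bus_width_bits, access_bits, transfers_per_clock=2):
    """Independent memory accesses of `access_bits` that fit in one memory clock."""
    return bus_width_bits * transfers_per_clock // access_bits


print(accesses_per_memory_clock(256, 32))  # 16 small reads
print(accesses_per_memory_clock(256, 64))  # 8 large reads
```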
Chalnoth
28-Nov-2005, 08:27
You would need pretty good caching of textures when you surpass 16 texture units because you can only make 8 or 16 memory accesses per memory clock depending on the granularity of the individual memory access. Nvidia I believe can only do 8 large reads per memory clock and ATI can do 16 small reads.
Compressed textures must be read in at 64 bits at a time regardless. And for uncompressed textures, 64 bits is at most four texels, which is an ideal amount to read for texture data.
Mintmaster
30-Nov-2005, 02:36
You would need pretty good caching of textures
Trust us, rwolf. ATI and NVidia have had good texture caching for many generations now.
If you want to know how much bandwidth is consumed per pixel, you take the number of texels on the screen (or a portion of it), multiply by bits per texel, and divide by the number of pixels. This gives you the BW required given a perfect cache, assuming the cache can't hold the whole texture during tiling.
I'd wager that even in 2001, ATI and NVidia didn't consume much more than two times this number very often. Now, it's probably not far from unity. When they load a block of texels, chances are most will be used before being flushed from the cache. There are tiling mechanisms and pixel rendering orders to maximize this.
I remember people used to calculate bandwidth usage by saying 32-bits * 4 for bilinear filtering. The reality is that texture bandwidth pales in comparison to framebuffer/z bandwidth the majority of the time. Only advanced texture uses start to consume more memory.
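The perfect-cache estimate described above is a one-line formula. The sketch below applies it to a made-up example: a 1600x1200 region textured with a 4-bit-per-texel compressed texture, touching one texel per pixel plus a quarter of that for the second MIP level. The resolution and the helper name are assumptions for illustration only.

```python
def bits_per_pixel(texels_touched, bits_per_texel, pixels_drawn):
    """Texture bandwidth per pixel assuming every texel is fetched exactly once."""
    return texels_touched * bits_per_texel / pixels_drawn


pixels = 1600 * 1200        # hypothetical screen region
texels = pixels * 1.25      # first MIP level plus a quarter for the second
print(bits_per_pixel(texels, 4, pixels))  # 5.0
```

A real cache refetches some texels, so the actual cost is this number times an inefficiency factor; the post's guess is that the factor was rarely much more than two in 2001 and is close to one now.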
Chalnoth
30-Nov-2005, 04:57
Yeah. An obvious situation where you could use tons of texture memory bandwidth would be one where you are calculating the texture coordinates to read within the pixel shader (i.e. using a lookup table). Here the texture cache has a much harder time. Current texture caches may be okay for small, 1-D textures. But large 2-D lookup tables can be hell on any texture cache (if the usage pattern isn't very coherent).
I remember people used to calculate bandwidth usage by saying 32-bits * 4 for bilinear filtering. The reality is that texture bandwidth pales in comparison to framebuffer/z bandwidth the majority of the time. Only advanced texture uses start to consume more memory.
You must mean when AA is enabled, because when relatively long shaders are used this can't be the case. When you are fragment shading limited you are likely to be accessing more than one texture at a time while you only output color once every n cycles. Add z compression (and the HZ) and maybe even some color compression (but I doubt that there is much gain from using color compression without MSAA) and your framebuffer bandwidth is reduced to nearly nothing. With the simple z compression algorithm I'm using bw is already around half of color and color is already low.
I know that the simulator is currently quite bad at properly using texture bandwidth but even taking that into account, for non-AA scenarios the framebuffer bw is quite low unless you are in fillrate limited regions (for example when stencil building or testing in Doom3 for shadows). Fragment shading limited regions don't consume much bandwidth ... unless you are applying a few textures.
When you are fragment shading limited you are likely to be accessing more than one texture at a time while you only output color once every n cycles.
If you take Chalnoth's number of 5 bits/pixel/texture request of bandwidth, you'd need to make ~11 texture requests with little or no math in between to match the 32-bits color + 24-bit Z's output bandwidth, in aliased mode.
Compression can buy you a little in the aliased case, but not much. AA makes this worse: ROP bandwidth increases, but Texture's doesn't. MRT makes this *much* worse.
Of course, non-compressed textures and floating-point textures change this story significantly: You've multiplied the required texture bandwidth by 8 or 16 (or more).
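The comparison above is simple division. The sketch uses the 5 bits per texture request from Chalnoth's estimate and the uncompressed output sizes named in the post. The 32-bit and 64-bit texel sizes used for the 8x and 16x multipliers are an assumption about which formats are meant (for example 8-bit RGBA and FP16 RGBA).

```python
TEX_BITS_PER_REQUEST = 5    # Chalnoth's compressed-texture estimate
COMPRESSED_TEXEL_BITS = 4


def requests_to_match(output_bits):
    """Texture requests needed to equal the given ROP output bandwidth."""
    return output_bits / TEX_BITS_PER_REQUEST


def texel_cost_multiplier(texel_bits):
    """How much more bandwidth a texel costs than a 4-bit compressed texel."""
    return texel_bits / COMPRESSED_TEXEL_BITS


print(requests_to_match(32 + 24))   # color + Z: 11.2
print(requests_to_match(32))        # color only: 6.4
print(texel_cost_multiplier(32))    # assumed 32-bit texels: 8.0
print(texel_cost_multiplier(64))    # assumed 64-bit FP16 texels: 16.0
```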
Chalnoth
30-Nov-2005, 16:32
Of course, non-compressed textures and floating-point textures change this story significantly: You've multiplied the required texture bandwidth by 8 or 16 (or more).
But it does make more sense to use floating point framebuffers before you start using floating point textures, typically.
If you take Chalnoth's number of 5 bits/pixel/texture request of bandwidth, you'd need to make ~11 texture requests with little or no math in between to match the 32-bits color + 24-bit Z's output bandwidth, in aliased mode.
Are we talking about the real world here or about some ideal scenario? Because my ideal scenario for texturing requires something like 8-16 KB of bandwidth ... per scene. Just use a few of those 16x16 textures repeated for the whole scene so that they can be kept in texture cache the whole time. Memory bandwidth required for texturing? None.
And why 24-bit z if you could be using z compression and then can be as low as 6 bits?
Do you really think that in real games only DXT1 textures are used? Do you really think that 16xAF has no impact on the required texture bandwidth? Do you really think that caches are perfect and can hold the data for 8 concurrent textures without any penalty or inefficiency?
Are we talking about the real world here or about some ideal scenario?
If your textures are 1x1, then you need virtually no memory bandwidth. Obviously, that's not what I'm talking about.
See Chalnoth's post (http://www.beyond3d.com/forum/showpost.php?p=636560&postcount=56) for the assumptions.
I'm obviously not going to reveal what the actual numbers are on real apps on the GPUs we make.
And why 24-bit z if you could be using z compression and then can be as low as 6 bits?
Because compression doesn't work all that well in aliased mode, and some GPUs don't compress at all in aliased mode. Ok, so let's say I ignore Z altogether; If we just look at color, that's 6-7 texture requests for 1 32-bit color write, which is usually not compressed.
Do you really think that in real games only DXT1 textures are used? Do you really think that 16xAF has no impact on the required texture bandwidth? Do you really think that caches are perfect and can hold the data for 8 concurrent textures without any penalty or inefficiency?
Obviously not. I'm not going to go out of my way to get real numbers, and even if I did, I couldn't reveal them to you anyway.
Chalnoth
30-Nov-2005, 21:20
Obviously not. I'm not going to go out of my way to get real numbers, and even if I did, I couldn't reveal them to you anyway.
Come on! Be a rebel!
After all, what's in a job?
Because compression doesn't work all that well in aliased mode, and some GPUs don't compress at all in aliased mode. Ok, so let's say I ignore Z altogether; If we just look at color, that's 6-7 texture requests for 1 32-bit color write, which is usually not compressed.
6-7 textures is about what DOOM3 is using and some of those aren't even compressed.
Obviously not. I'm not going to go out of my way to get real numbers, and even if I did, I couldn't reveal them to you anyway.
Well I have no problems revealing numbers but it's late here and you would have to wait until tomorrow :lol:
I'm pretty sure that the simulator is consuming more texture bandwidth than it should but I doubt it's an order of magnitude or worse larger than the real thing. The tiled approach, theoretically replicating ATI, the lack of L2, the sick minimum transaction size (hopefully that will change soon) and the 'weird' texture cache line sizes required for those transaction sizes take a toll. But I doubt it's that large. And with no AA and AF set to 8x the shader limited regions of all the games I'm testing use more bw for textures than for color or z.
OpenGL guy
30-Nov-2005, 22:28
If you take Chalnoth's number of 5 bits/pixel/texture request of bandwidth, you'd need to make ~11 texture requests with little or no math in between to match the 32-bits color + 24-bit Z's output bandwidth, in aliased mode.
I don't think it's quite that bad. First, not all color values will be written because of Z rejection. Second, Z compression helps alleviate some of the Z bandwidth. Also, not all Z values will be written because of Z rejection.
Compression can buy you a little in the aliased case, but not much. AA makes this worse: ROP bandwidth increases, but texture bandwidth doesn't. MRT makes this *much* worse.
If things are highly compressible, then AA doesn't have to be much different than non-AA. MRTs are bad though.
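A quick back-of-the-envelope check of the "~11 texture requests" figure quoted above. This is only a sketch: it takes Chalnoth's 5 bits per texture request at face value and assumes one uncompressed 32-bit color write plus one 24-bit Z write per pixel, with no Z rejection. The function and constant names are made up for illustration.

```python
TEX_BITS_PER_REQUEST = 5   # Chalnoth's estimate, as quoted in the thread
COLOR_BITS = 32            # one uncompressed color write
Z_BITS = 24                # one uncompressed Z write


def texture_requests_to_match(output_bits, bits_per_request=TEX_BITS_PER_REQUEST):
    """Texture requests needed to equal a given amount of output bandwidth."""
    return output_bits / bits_per_request


color_and_z = texture_requests_to_match(COLOR_BITS + Z_BITS)  # 56 / 5
color_only = texture_requests_to_match(COLOR_BITS)            # 32 / 5

print(f"color + Z: {color_and_z:.1f} requests, color only: {color_only:.1f} requests")
```

The color-only case lines up with the "6-7 texture requests for 1 32-bit color write" mentioned earlier; Z rejection and Z compression would push the break-even point lower.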
Chalnoth
30-Nov-2005, 23:01
I don't think it's quite that bad. First, not all color values will be written because of Z rejection. Second, Z compression helps alleviate some of the Z bandwidth. Also, not all Z values will be written because of Z rejection.
One would hope that z rejection is caught early these days so that pixels that are rejected usually don't contribute to fillrate or texture bandwidth, either.
One would hope that z rejection is caught early these days so that pixels that are rejected usually don't contribute to fillrate or texture bandwidth, either.
If you enable alpha test or alpha blending you can't use early z. Nor, obviously, if you are modifying the z in the shader (but I guess this happens less often in games).
Demirug
01-Dec-2005, 06:35
If you enable alpha test or alpha blending you can't use early z. Nor, obviously, if you are modifying the z in the shader (but I guess this happens less often in games).
Alpha blending works fine with Early-Z because all pixels are still written to the buffers. As long as you have only a Z-compare and no Z-write, Early-Z can always work.
It can even work, with limitations, in alpha-test and shader fragment-kill scenarios.
But shaders that change z are a really bad thing for Early-Z. IIRC I have never seen such a shader in a real game.
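The rules of thumb argued over in these posts can be summarised as a small decision function. This is a sketch of the thread's reasoning only, not of how any particular GPU decides; `RenderState`, `early_z_mode` and the three mode names are hypothetical, and the stencil corner cases raised later in the thread are not modelled.

```python
from dataclasses import dataclass


@dataclass
class RenderState:
    depth_test: bool = True
    depth_write: bool = True
    alpha_test: bool = False           # also stands in for texkill / fragment kill
    alpha_blend: bool = False          # does not affect the decision below
    shader_writes_depth: bool = False


def early_z_mode(state: RenderState) -> str:
    """Return 'full', 'test_only' or 'none' for a given state (illustrative)."""
    if not state.depth_test or state.shader_writes_depth:
        return "none"        # final Z is unknown until the shader has run
    if state.alpha_test and state.depth_write:
        return "test_only"   # compare early, defer the Z write until after the kill
    return "full"            # blending, or compare-only Z, leaves early Z intact


print(early_z_mode(RenderState(alpha_blend=True)))
print(early_z_mode(RenderState(alpha_test=True)))
print(early_z_mode(RenderState(shader_writes_depth=True)))
```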
Chalnoth
01-Dec-2005, 07:21
If you enable alpha test or alpha blending you can't use early z. Nor, obviously, if you are modifying the z in the shader (but I guess this happens less often in games).
I really don't see why not. Z compression is surely broken with alpha tests, but early z should work just fine. And alpha blending shouldn't ever break early z.
The bit on alpha blending was a mistake on my part (I was half asleep). If anything, I meant texture kill.
If you perform the z read before shading you could still be using early z rejection. But then, can you guarantee that the line for that fragment, if it passes, will remain in the z cache until the fragment is shaded? That can take hundreds of cycles depending on how long the shader program is, how your shader processor works and how many texture misses happen. And if you implement z write as write-through (or maybe, though I doubt it really works this way, as masked Z, similar to how color without blending may be implemented) you can't use z compression.
And in the case where you have alpha test or fragment kill and a stencil update function based on the depth test result, there is no way around it. All fragments must be shaded.
Chalnoth
01-Dec-2005, 07:30
And in the case where you have alpha test or fragment kill and a stencil update function based on the depth test result, there is no way around it. All fragments must be shaded.
I don't see why. You'd just only do the z-compare in the early z unit, then do the z write on pixel write. Clearly you have to be careful that nothing has written to that pixel in the meantime (z or color), but that shouldn't be such a huge deal.
Clearly you have to be careful that nothing has written to that pixel in the meantime (z or color), but that shouldn't be such a huge deal.
You need a slightly stronger guarantee: no overlapping pixel can enter the EarlyZ unit until the final Z for the previous overlapping fragments has been resolved.
I don't see why. You'd just only do the z-compare in the early z unit, then do the z write on pixel write. Clearly you have to be careful that nothing has written to that pixel in the meantime (z or color), but that shouldn't be such a huge deal.
Where is alpha test/fragment kill in the logical graphics pipeline? Before or after z/stencil? If you reject a fragment before z/stencil, that fragment can't update stencil or z.
arjan de lumens
01-Dec-2005, 08:08
You need a slightly stronger guarantee: no overlapping pixel can enter the EarlyZ unit until the final Z for the previous overlapping fragments has been resolved.
Not necessarily; if the Z-compare function is LESS or LEQUAL all the time, then you know that the far-Z values in the EarlyZ buffer are always greater than both the current Z-buffer contents and any value that will ever be read from or written to the Z-buffer. So you can still use the far-Z value for early pixel rejection, even if the application uses alpha-test on everything.
arjan de lumens
01-Dec-2005, 08:10
Where is alpha test/fragment kill in the logical graphics pipeline? Before or after z/stencil? If you reject a fragment before z/stencil, that fragment can't update stencil or z.
Alpha-test is logically placed before Z/stencil test, at least in OpenGL/Direct3D. This makes Early-Z more complicated but doesn't prevent it.
Wrong? Sure, simulators are always wrong. The question is by how much.
All the frames were simulated at 1024x768 resolution with 8xAF enabled (the traces are from our old GeForce 5900 and I was too lazy to patch or force 16xAF). The traces are from Doom 3 trdemo2, UT2004 Primeval (3DCenter timedemo) and Quake 4 trdemo4 (I think). The configuration may resemble a unified version of an R520 in terms of number of pipelines and organization.
DOOM3 Frame 758
http://people.ac.upc.edu/vmoya/img/DOOM3.frame0758.png
BW (absolute)
http://people.ac.upc.edu/vmoya/img/DOOM3-frame758-mem-absolute.png
BW (relative)
http://people.ac.upc.edu/vmoya/img/DOOM3-frame758-mem-relative.png
if the Z-compare function is LESS or LEQUAL all the time, then you know that the far-Z values in the EarlyZ buffer are always greater than both the current Z-buffer contents and any value that will ever be read from or written to the Z-buffer.
What does "far-Z" mean in the context of doing the depth test early?
But yes, I agree: If the depth test is LESS (for example), the incoming Z value is greater than the Z value in the depth buffer, and you don't replace depth in the shader, then you can cull the fragment without shading, even if you already have an overlapping fragment in flight.
However, you can't commit either Z write because you don't know which of the Z values currently in flight is the one that is supposed to be on top. So you'll need to redo the Z test at the exit of the fragment shader.
UT2004 Frame 417
http://people.ac.upc.edu/vmoya/img/UT2004.frame0417.png
BW (absolute)
http://people.ac.upc.edu/vmoya/img/UT2004-frame-417-mem-absolute.png
BW (relative)
http://people.ac.upc.edu/vmoya/img/UT2004-frame-417-mem-relative.png
Quake 4 Frame 299
http://people.ac.upc.edu/vmoya/img/QUAKE4.frame0299.png
BW (absolute)
http://people.ac.upc.edu/vmoya/img/QUAKE4-frame-299-mem-absolute.png
BW (relative)
http://people.ac.upc.edu/vmoya/img/QUAKE4-frame-299-mem-relative.png
Alpha-test is logically placed before Z/stencil test, at least in OpenGL/Direct3D. This makes Early-Z more complicated but doesn't prevent it.
I may be missing something but what do you win from doing early z in that case anyway? You still need to perform alpha test for all fragments regardless of what the z result is. The result of alpha test is required to change the z and stencil buffer.
And sorry for the mega sized graphic spam.
I may be missing something but what do you win from doing early z in that case anyway? You still need to perform alpha test for all fragments regardless of what the z result is. The result of alpha test is required to change the z and stencil buffer.
Ignoring stencil, you can perform the Z test (but *not* Z write) early. You can then write the Z value later, if the pixel wasn't killed by the alpha test (subject to the conditions I mentioned in a previous post in this thread).
This saves you some shader work if you are drawing something with alpha test turned on, with a long shader, and you happen to draw that behind other objects that mask it.
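A toy model of "early Z test, late Z write" with alpha test enabled, to show where the saved shader work comes from. It is only a sketch: fragments are processed strictly one at a time, so the in-flight overlap hazard discussed above never arises, the compare function is assumed to be LESS, and `draw` and its arguments are illustrative names.

```python
def draw(fragments, depth, early_z_test):
    """Return how many fragments ran the (expensive) shader.

    fragments: list of (pixel_index, z, alpha_passes) in submission order.
    depth: list of current depth values, updated in place (compare = LESS).
    """
    shaded = 0
    for pixel, z, alpha_passes in fragments:
        if early_z_test and not z < depth[pixel]:
            continue              # culled before shading; nothing is written
        shaded += 1               # the long shader would run here
        if not alpha_passes:
            continue              # killed by alpha test: no color, no Z write
        if z < depth[pixel]:      # late Z test, followed by the deferred Z write
            depth[pixel] = z
    return shaded


frags = [(0, 0.7, True), (1, 0.3, False), (1, 0.4, True), (0, 0.6, True)]
with_early = [0.5, 0.5]
without_early = [0.5, 0.5]
print(draw(frags, with_early, True), draw(frags, without_early, False))
print(with_early == without_early)
```

Both runs leave the depth buffer in the same state; the early test only changes how many fragments reach the shader.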
arjan de lumens
01-Dec-2005, 08:38
What does "far-Z" mean in the context of doing the depth test early?
If you are using some sort of Hierarchical Early-Z mechanism, you would typically store, for each pixel block, a 'far-Z' value that is guaranteed to be larger (farther away) than the actual Z value of every pixel in the block, and a 'near-Z' value that is guaranteed to be smaller than the actual Z value of every pixel too.
But yes, I agree: If the depth test is LESS (for example), the incoming Z value is greater than the Z value in the depth buffer, and you don't replace depth in the shader, then you can cull the fragment without shading, even if you already have an overlapping fragment in flight.
However, you can't commit either Z write because you don't know which of the Z values currently in flight is the one that is supposed to be on top. So you'll need to redo the Z test at the exit of the fragment shader.
Not necessarily; at the Early-Z level, you can test against 'near-Z' too. If this test passes, you never need to redo the Z-test. (When you test against the 'near-Z' value, take care to update the 'near-Z' value too; this way, the 'near-Z' test will continue to function correctly for subsequent polygons too. It is perfectly safe to update 'near-Z' even for a polygon where all pixels subsequently fail Alpha-test.)
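The near-Z/far-Z scheme described above can be sketched as a three-way classification per pixel block. This assumes a LESS compare function and a conservative Z range for the incoming fragments in the block; `classify_block` and `update_near_z` are hypothetical names, and real coarse-Z hardware will differ in the details.

```python
def classify_block(frag_z_min, frag_z_max, near_z, far_z):
    """Coarse test of incoming fragments against one block's stored Z bounds.

    near_z <= every stored Z in the block <= far_z (conservative bounds).
    """
    if frag_z_min >= far_z:
        return "reject"      # fails LESS against every pixel: cull without shading
    if frag_z_max < near_z:
        return "accept"      # passes against every pixel: no Z re-test needed later
    return "ambiguous"       # must fall back to the per-pixel Z test


def update_near_z(near_z, frag_z_min):
    """Keep near-Z a valid lower bound, even if alpha test later kills everything."""
    return min(near_z, frag_z_min)


print(classify_block(0.8, 0.9, near_z=0.2, far_z=0.7))
print(classify_block(0.05, 0.1, near_z=0.2, far_z=0.7))
print(classify_block(0.1, 0.5, near_z=0.2, far_z=0.7))
```

Lowering near-Z for fragments that are later killed only makes the bound looser, never wrong, which is why the update is safe to do before the alpha-test result is known.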
If you are using some sort of Hierarchical Early-Z mechanism, you would typically store, for each pixel block, a 'far-Z' value that is guaranteed to be larger (farther away) than the actual Z value of every pixel in the block, and a 'near-Z' value that is guaranteed to be smaller than the actual Z value of every pixel too.
Oh yes, I am well aware of the many optimizations that open up if you do coarse-grained fragment rejection (whether it's HierZ or ZCull or anything else). I tried to avoid taking the thread in that direction because this doesn't really solve the whole problem and because there is far too much detail that no one really wants to or can reveal.
arjan de lumens
01-Dec-2005, 08:45
I may be missing something but what do you win from doing early z in that case anyway? You still need to perform alpha test for all fragments regardless of what the z result is. The result of alpha test is required to change the z and stencil buffer.
And sorry for the mega sized graphic spam.
If Z-test fails, then the pixel (color/Z value) isn't drawn in any case, regardless of whether alpha-test passes or not - thus you can safely do Early-Z rejection even if alpha-test is enabled. (There are some special cases with certain stencil sfail/zfail functions where this isn't strictly true, but these cases are rare in practice)
Ignoring stencil, you can perform the Z test (but *not* Z write) early. You can then write the Z value later, if the pixel wasn't killed by the alpha test (subject to the conditions I mentioned in a previous post in this thread).
This saves you some shader work if you are drawing something with alpha test turned on, with a long shader, and you happen to draw that behind other objects that mask it.
But wouldn't that require two different shader programs? Or branching based on the z test result? And fragment color is the likely final result of the shader, unless the color and alpha channels are calculated differently for some reason, so there should be few instructions that you can remove.
I'm talking about the stencil update function here, which requires knowing both whether alpha passed and whether z passed.
Chalnoth
01-Dec-2005, 16:11
But wouldn't that require two different shader programs? Or branching based on z test result?
Er, this is all fixed function stuff here. If the z test is set so that it is write on z pass, do nothing on z fail, then no matter what happens in the fragment program, a z fail means that nothing needs to be done.
Conditional writes have been a part of pixel shaders for a while (1.something).
See the Digitimes article that Kemosabe pointed at in another thread? It seems to say that R580 will only be announced in January, and won't ship until March. I sure hope that's not true.
http://www.digitimes.com/news/a20051201A5027.html
AlphaWolf
02-Dec-2005, 05:39
See the Digitimes article that Kemosabe pointed at in another thread? It seems to say that R580 will only be announced in January, and won't ship until March. I sure hope that's not true.
http://www.digitimes.com/news/a20051201A5027.html
Whose first quarter? Doesn't ATI's first quarter end in December or something?
See the Digitimes article that Kemosabe pointed at in another thread? It seems to say that R580 will only be announced in January, and won't ship until March. I sure hope that's not true.
http://www.digitimes.com/news/a20051201A5027.html
Right with the launch of GDDR4
SugarCoat
02-Dec-2005, 06:53
See the Digitimes article that Kemosabe pointed at in another thread? It seems to say that R580 will only be announced in January, and won't ship until March. I sure hope that's not true.
http://www.digitimes.com/news/a20051201A5027.html
BusinessWeek says Vista will be launching by the 3rd quarter of next year. Speculation is that it will launch with a further modified DX9 version, with DX10 to follow. To me this makes sense of a refresh of all cores on ATI's part in March; earlier seems very strange to me at this time. ATI still has a lot of money to make off its current, just-launched low/mid-end cards. R580 will play the stopgap and be the launch card for Vista, with R600 to follow just before DX10's release, hopefully in late 2006 or early 2007. Things don't always move at the speed of light in the hardware industry :).
To add a bit of good news for ATI, analysts have issued warnings saying Nvidia may be aiming too high for its coming earnings and is still underestimating ATI.
Don't forget, we're not going to see CrossFire for X1800s until January either, and that's IF ATI finally keeps to a launch window. They won't undermine their current high end by launching the R580, no matter what Nvidia does. They knew they were going to take a hit. I think they need to delay it a bit longer.
Chalnoth
02-Dec-2005, 08:06
BusinessWeek says Vista will be launching by the 3rd quarter of next year. Speculation is that it will launch with a further modified DX9 version, with DX10 to follow. To me this makes sense of a refresh of all cores on ATI's part in March; earlier seems very strange to me at this time. ATI still has a lot of money to make off its current, just-launched low/mid-end cards. R580 will play the stopgap and be the launch card for Vista, with R600 to follow just before DX10's release, hopefully in late 2006 or early 2007. Things don't always move at the speed of light in the hardware industry :).
But nVidia should be more than ready to move to a new architecture by late next year. Why should ATI hold up this process?
SynapticSignal
02-Dec-2005, 13:35
But nVidia should be more than ready to move to a new architecture by late next year. Why should ATI hold up this process?
I agree :???:
SynapticSignal
02-Dec-2005, 13:50
so if I understand well...
we'll have an improvement with r480 on x1800xt as the x1600xt have on x1300pro (same clock, same rops, same TMU but more shaders)?
so if I understand well...
we'll have an improvement with r480 on x1800xt as the x1600xt have on x1300pro (same clock, same rops, same TMU but more shaders)?
not exactly, more like: as the x1600xt is to the x1300pro, so the r580 is to the x1800xt. must have been a typo :wink:
Dave Baumann
02-Dec-2005, 14:08
I wouldn't put much stock in Digitimes timings.
I wouldn't put much stock in Digitimes timings.
Good! :grin:
SynapticSignal
02-Dec-2005, 14:44
not exactly, more like: as the x1600xt is to the x1300pro, so the r580 is to the x1800xt. must have been a typo
ERRRRRR
R580 no R480
excuse me:oops:
so, out of curiosity I'm looking at nordichardware for some x1600xt/1300pro numbers to compare....
HL2 1280
X1600XT 78 FPS
1300PRO 46 FPS
IMPROV. +70%
BF2 1280
X1600XT 35 FPS
1300PRO 19 FPS
IMPROV. +84%
FC 1280
X1600XT 74 FPS
1300PRO 38 FPS
IMPROV. +94,7%
FEAR (with bug :( ) 1024
X1600XT 63 FPS
1300PRO 33 FPS
IMPROV. +91%
COD2 1280
X1600XT 21 FPS
1300PRO 14 FPS
IMPROV. +50%
SPLINTER CELL 1280
X1600XT 23 FPS
1300PRO 13 FPS
IMPROV. +77%
COLIN 2005 1600
X1600XT 39 FPS
1300PRO 21 FPS
IMPROV. +77%
NFSU2
X1600XT 36 FPS
1300PRO 24 FPS
IMPROV. +50%
so, same clock, same rops, same tmu, more shaders makes average improvements of +74,2 %
if R580 is +74% than x1800xt.. :shock:
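The percentages above can be rechecked directly from the quoted FPS pairs. The sketch below just recomputes them; the dictionary layout and function name are illustrative. Worked from the FPS figures as posted, the Colin 2005 pair comes out nearer +86% than the +77% listed, which nudges the average to roughly +75%.

```python
# (X1600XT fps, X1300PRO fps) as quoted in the post above
results = {
    "HL2 1280": (78, 46),
    "BF2 1280": (35, 19),
    "FC 1280": (74, 38),
    "FEAR 1024": (63, 33),
    "COD2 1280": (21, 14),
    "Splinter Cell 1280": (23, 13),
    "Colin 2005 1600": (39, 21),
    "NFSU2": (36, 24),
}


def improvement_pct(fast_fps, slow_fps):
    """Relative gain of the faster card over the slower one, in percent."""
    return (fast_fps / slow_fps - 1.0) * 100.0


gains = {name: improvement_pct(*fps) for name, fps in results.items()}
average = sum(gains.values()) / len(gains)

for name, gain in gains.items():
    print(f"{name:20s} +{gain:.1f}%")
print(f"{'average':20s} +{average:.1f}%")
```

As the replies point out, the two cards also differ in memory clock and memory controller, so this average is not a pure measure of the extra shader ALUs.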
The ROPs aren't the same, though.
Additionally, X1300Pro doesn't have the ring bus memory controller.
So that's a double-whammy on the "memory efficiency" of X1300Pro.
It's why I've shied away from making a specific comparison of X1300Pro and X1600XT so far.
B3D's analysis of this question, so far, has been disappointingly thin.
Jawed
You have to look at pure speed results, not AA and AF results, since the added ALUs will only affect that area, not AA and AF :wink: It's more like 25-50%. There won't be much change in the AA and AF department; you will see a similar performance hit to the one the R520 takes when AA and AF are applied. IMHO of course.
The tests I have performed show that with AF (usually set to 8x) the texture unit utilization (address and filter ALUs) is quite high in frames of UT2004, Quake4 and Doom3. Sometimes even higher than shader ALU utilization. So with a few more shader ALUs it may become the bottleneck quite easily.
I didn't test a configuration similar to RV515 against a configuration similar to RV530 (but I could try now that I have a few free days) but I tested a configuration with 3 quad shader units from which I removed first one and then two of the corresponding quad texture units and the hit was pretty large for two less TUs. The tested frames were completely limited by the texture ALUs (utilization at 100%) while the shader ALUs were at 50% or less.
I'm using an AF algorithm similar to the '6 petal' angle-dependent algorithm used by NVidia and ATI GPUs, so I guess that a non-angle-dependent algorithm may show even greater utilization of the texture ALUs.
The simulator is actually quite suboptimal right now, with many weird bottlenecks and excess bandwidth usage compared with what I think real GPUs consume, so if anything it could be considered a worst-case lower bound on performance.
I wouldn't put much stock in Digitimes timings.
In general or in this instance?
I could be wrong, but I seem to recall Digitimes being fairly on the money when mentioning timescales in their previous pieces -- weren't they also one of the first sources to cite the soft ground issue?
Whose first quarter? Doesn't ATI's first quarter end in December or something?
Nov 30 should have been end of Q1...
You have to look at pure speed results, not AA and AF results, since the added ALUs will only affect that area, not AA and AF :wink: It's more like 25-50%. There won't be much change in the AA and AF department; you will see a similar performance hit to the one the R520 takes when AA and AF are applied. IMHO of course.
I'm confused. Those were the pure speed numbers.
I'm confused. Those were the pure speed numbers.
Sorry wasn't paying attention to the review :oops:
Nordic's differences seem to be quite a bit higher than in the other reviews I've seen that compare both cards.
Also it depends on what subsystem is being stressed. I'm thinking the X1300 is getting hit a lot harder than the X1600 as res goes up; pixel fillrate seems to be getting hit hard on it.
Unknown Soldier
02-Dec-2005, 21:14
7800 GTX 512 gets the nod over the 1800 XT (http://www.gamepc.com/labs/view_content.asp?id=h2k5512&page=1&MSCSProfile=95385A1F52DEA1A229D5B375420544640EBEDB 284F2B194E39D62AD0B3E7B2BFFE400B20E5FA59C9D1571BE9 0BFE4A4EC9F8905D3F4544C1250C017BDE6B9F73EDE43353D7 DAADB4E47C91C4875414E460EFE642592E2E9612C01B6F3768 9626FA10699AC0264FED7556FBBDC255FDCF798683BA28D631 DE03C38D9EE0F6211C8DBC1562E585F4C8)
US
so, same clock, same rops, same tmu, more shaders makes average improvements of +74,2 %
Did you guys forget that the X1600 XT has something like a +70% higher memory clock than the X1300 Pro, or did I miss something? Also there is a big VS disproportion.
Did you guys forget that the X1600 XT has something like a +70% higher memory clock than the X1300 Pro, or did I miss something? Also there is a big VS disproportion.
oh yeah :lol:
Skrying
03-Dec-2005, 01:12
I'm thinking right now that the R580 will be on average 15%-25% faster than the X1800XT at the same clock speeds. But I don't think R580 will ship with the same clock speeds. I think the core clock will either stay the same or rise only a bit, but I expect to see 1.8GHz memory on the R580, or the fastest memory that can be delivered in good quantity at the time of shipping. So I expect the shipping version of R580 will be about 25%-45% faster on average, depending on clocks.
Chalnoth
03-Dec-2005, 01:39
Did you guys forget that the X1600 XT has something like a +70% higher memory clock than the X1300 Pro, or did I miss something? Also there is a big VS disproportion.
Yeah, if you drop the memory clock of the X1600 XT, you'd probably have a fair idea of how well the R580 could perform. Most likely, it'd be an upper bound on performance, given that the X1600 XT also has the more advanced memory controller.
Given the low VS count of RV515 I tend to believe it is quite geometry limited, as opposed to every other X1K chip. One question has been bugging me for a while: does RV530 really have the same TMUs as the others?
It seems like it does - I can't think why the TMUs would be different in RV530 from RV515 or R520.
Jawed
If X1600 can perform the way it does with just four of the usual texturing units, then it means every other current chip has about half of its units for nothing. All we need is decoupling?
That depends on the workload. If you were testing texture fillrate or a scene that has a high AF hit I doubt the results would be that nice. But the trend is towards more arithmetic instructions in shaders and relatively less texture loads (even if the number keeps growing arithmetic instructions should grow more).
I wonder how many of the textures used in this Quake 4 frame were using 8x AF ...
http://people.ac.upc.edu/vmoya/img/quake4.f299.workload.png
It's the out of order thread scheduling in combination with the de-coupling of the TMUs that should be weaving some magic.
But as I hinted earlier in the thread, getting our hands on concrete evidence for magic is extremely tortuous.
One thing it's worth remembering is that with a 64-fragment thread (batch) size, it's generally not possible to hide texturing latency, even with the de-coupled design of the TMUs (which means that texture address calculation runs independently of the main ALUs). In earlier GPUs threads are much larger, which means that the couple of hundred (or more) ALU clock cycles of latency inherent in texturing and texture filtering have a reduced impact. Otherwise the ALU pipeline stalls, waiting for texturing results to be returned.
So a small thread-size means less possibility of hiding latency. So to combat that requires out of order scheduling - to find another shader program that can execute one or more instructions on threads of 64 fragments. R5xx GPUs have up to 128 threads in flight at one time, per quad-pipeline, so in theory there's plenty of choice!
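A rough feel for the numbers in this argument, as a sketch only. It assumes a quad pipeline retires 4 fragments per clock, that each thread has a fixed number of independent ALU instructions to run between texture fetches, and a flat 200-clock texture latency (the "couple of hundred" clocks mentioned above). `threads_to_hide_latency` is a made-up helper, not anything from ATI documentation.

```python
import math

THREAD_SIZE = 64          # fragments per thread (batch) on R5xx, per the post
FRAGS_PER_CLOCK = 4       # assumption: one quad per clock per quad-pipeline
THREADS_IN_FLIGHT = 128   # per quad-pipeline, per the post


def threads_to_hide_latency(latency_clocks, alu_instr_between_fetches=1,
                            thread_size=THREAD_SIZE,
                            frags_per_clock=FRAGS_PER_CLOCK):
    """Other threads needed to keep the ALUs busy while one thread waits."""
    clocks_per_instruction = thread_size // frags_per_clock
    work_per_thread = clocks_per_instruction * alu_instr_between_fetches
    return math.ceil(latency_clocks / work_per_thread)


print(THREAD_SIZE // FRAGS_PER_CLOCK)        # clocks to issue one instruction
print(threads_to_hide_latency(200, 1))       # texture-heavy code
print(threads_to_hide_latency(200, 3))       # 3:1 ALU:TEX code
print(THREAD_SIZE * THREADS_IN_FLIGHT)       # fragments in flight per pipeline
```

Under these assumptions a single 64-fragment thread covers only 16 clocks per instruction, so hiding the latency depends on the scheduler finding other ready threads, and a higher ALU:TEX ratio reduces how many it needs.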
The question is, where does the new architecture break even in comparing texturing latency-hiding with the older architectures? Is a 1:1 ALU:TMU ratio (such as that found in R520 and RV515) breaking even?
Or does the 3:1 ratio break even?
I'll be honest, before R520 launched, I expected R520 (RV515, too) to be well past break-even, i.e. significantly more efficient in texturing.
Right now, I dare say R520 is roughly breaking-even. RV530 with its increased ALU:TMU ratio should be significantly beyond break-even - but that may only come with shader code that actually isn't texturing-bound.
An awful lot of shader programs would seem to be fairly texture-bound (D3 seems to fit the bill, as far as I can tell, due to its normal, ambient and specular maps) but the problem is that the games are often bound by other issues simultaneously. I've spent a while trying to unravel this stuff and failed miserably.
Jawed
To add software based ratio data to what you're talking about....
A little birdy tells me F.E.A.R. is ~7:1 ALU:TEX in its shader programs (hadn't looked at that one myself, cheers for the data little birdy!), and analysis of something like B&W2 for example shows ~5:1. Most modern games will outstrip 3:1 (at least before post processing effects) quite easily, I'd bet (although I've not had a look at much recently, due to time constraints).
So the hardware processing side of things that'll run these shaders seems vindicated in favour of increasing ALU:TEX in silicon.
Demirug
03-Dec-2005, 17:32
FarCry with SM3 but no HDR has ~5:1.
Yesterday I modified one of my small tools to calculate these values very quickly once you have extracted the shaders from a game.
If you compare X1300 and X1600 you will see an average increase in raw shader power in current games of a factor of 2 (measured with another tool; ATI doesn't seem to like it).
Raw shader power = (only point filtering on very small textures)
Wow, sounds groovy.
The distribution of texture operations over the course of a shader is generally going to be pretty uneven I would think - so that there are prolly sections of code where the ALU:TEX ratio is 1:1 and others where it might be 20:1.
My understanding is that the compiler and/or driver compiler will identify which texture operations can be prefetched and perform those prefetches as soon as possible while the shader executes.
So, while the code might look like this:
tex A
alu
alu
alu
alu using A
alu
alu
tex B (not dependent on previous instructions)
alu
alu
alu
alu using B
alu
alu
tex B can actually start executing at 5/6.
Jawed
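For a static ALU:TEX ratio of the kind quoted earlier in the thread, counting opcodes is enough. The sketch below runs on the toy listing from the post above (with its last two lines separated); `alu_tex_counts` is a hypothetical helper, and a real tool would have to parse actual shader assembly rather than the placeholder `alu`/`tex` mnemonics.

```python
LISTING = """\
tex A
alu
alu
alu
alu using A
alu
alu
tex B (not dependent on previous instructions)
alu
alu
alu
alu using B
alu
alu
"""


def alu_tex_counts(listing):
    """Count ALU and texture instructions by the first token on each line."""
    alu = tex = 0
    for line in listing.splitlines():
        tokens = line.split()
        if not tokens:
            continue                 # skip blank lines
        if tokens[0] == "tex":
            tex += 1
        elif tokens[0] == "alu":
            alu += 1
    return alu, tex


alu, tex = alu_tex_counts(LISTING)
print(f"{alu} ALU : {tex} TEX = {alu / tex:.0f}:1")
```

A static count like this says nothing about how the texture fetches are distributed through the shader, which is the point the post is making.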
Demirug
03-Dec-2005, 18:19
Jawed, if you prefetch too early you will need too much memory to store the values.
I cannot say much about ATI chips, as they don't give us developers the right tools for this. But nVidia tries to fetch as late as possible. They prefer to use the fetched value in the ALU behind the TMU in the same cycle.
Primarily, the shader compilers try to pack as many instructions into one cycle as possible. NV4x/G7x works best with one TEX instruction and many ALU instructions in each cycle. Cycles with only ALU instructions can cause performance problems.
IMHO ATI uses the late-fetch method too.
Sure, the ALU:TEX ratio is going ballistic, but I was pointing to a decrease in TMU count rather than an increase in PS ALU count. With efficiency like RV530's, 16 TMUs look like overkill even for R580. Maybe they will take a larger hit with good filtering in specific cases, but I doubt designers care when most benchmarks provide only average FPS scores. X1600 does not take any special hit with AF.
Aha! Demirug, I didn't think of the register count/register re-use problem.
I have to admit I was prompted partly by thinking about out of order scheduling with small thread sizes - where you presumably want to maximise the number of threads that are not in the state "waiting for texture result". That's because you don't want to exhaust the supply of "tex-ready - alu needed" threads when a high latency texture operation arrives (particularly if all the threads are actually running the same shader).
I was thinking that by fetching as early as possible you can more-evenly distribute the total number of texture operations over the duration of all the threads. e.g. if you have one shader for 8192 fragments across 128 threads, with an average latency for the texture result of 2 cycles (due to caching) the first threads can't afford to wait until 2 cycles before the texture result is needed because texturing latency on the un-cached textures is going to be much higher, say 4 cycles average for the first 20 threads. (Naturally I'm making these figures up...)
Running these threads with out of order scheduling would seem to re-write the rules. In older GPUs, with fixed thread-sizes the GPU just took a brute-force, one-size fits all approach: "fetch as late as possible".
Hmm... maybe this is all junk... Still, tis fascinating.
Jawed
Chalnoth
03-Dec-2005, 19:04
FarCry with SM3 but no HDR has ~5:1.
Yesterday I modified one of my small tools to calculate these values very quickly once you have extracted the shaders from a game.
I doubt that's going to be the case that often in this game. There's a whole lot of rather simple-looking foliage. Maybe indoors that ratio will hold up.
Yeah, the infamously long 4-light shader in FC (i.e. indoors) is 107 shader and 7 texture operations in length.
Jawed
Demirug
03-Dec-2005, 19:09
Your latency values are too low. If you have 16 pixels per thread but the units do one quad per cycle, you will need 4 cycles per thread per pipeline stage. TMUs have many pipeline stages. Many of these stages are only there to compensate for cache misses.
You don't need to care about the number of threads in the TMUs as long as the right number is available every time the ALUs can take more threads.
One effect I have seen with our implementation of in-order execution of fragments is that it tends to concentrate processing on each kind of unit in different phases. When all the fragments start to fetch a texture, the texture unit soon starts to get saturated. And if there is something that limits the throughput, be it memory bandwidth, an improperly sized queue or a bug, performance takes a big hit. Out-of-order execution of fragments seems to distribute the workload more evenly, with texture fetches and ALU operations spread through the whole execution time. That seems to be one of the reasons our implementation of in-order execution is quite a bit slower (20% or more) than a similar out-of-order implementation (same number of fragments in execution).
On the other hand in order execution seems to get more benefit from adding more shader ALUs as the ALU phases are executed faster. But still the performance never comes near out of order execution. Out of order execution in our current implementation seems to be either not limited at all, at around 2048 fragments/4096 registers per quad shader processor, or limited by some bug or bottleneck we haven't discovered yet in the Texture Unit. And adding more shader ALUs doesn't seem to help much (2-5% in the tested frames of Doom3 and Quake4, 10% in UT2004) maybe because of the bottleneck, or as the previous graphic shows, because for some reason the Texture Unit is already fully utilized.
Putas mentioned that tests with AF didn't seem to make a large difference with the X1600, but I haven't seen those tests and would appreciate a link. That makes me wonder if the texture unit is exactly the same as the 'normal' texture units.
R5xx has the fully associative (and I think bigger) cache and, in X1600/1800, the new memory controller - the net effect is less of a hit from AF, as I understand it.
You can see a small single texture performance gain:
http://www.beyond3d.com/reviews/ati/r520/index.php?p=15#fill
increasing as texture layers are added.
Also, with a careful bit of page jumping:
FarCry no AF (http://www.beyond3d.com/reviews/ati/r520/index.php?p=20)
HL-2 no AF (http://www.beyond3d.com/reviews/ati/r520/index.php?p=24)
FarCry and HL-2 16xAF & HQ AF (http://www.beyond3d.com/reviews/ati/r520/index.php?p=27)
you can see the hit for AF for the X800XL is 8% in FC and 19.5% HL-2 at 1600x1200.
Jawed
SugarCoat
04-Dec-2005, 04:52
I wouldn't put much stock in Digitimes timings.
I would. It's the best business plan ATI has. I think a January or February launch, seeing cards in stock, would be crazy on their part. Some of today's hottest titles are barely working on the R520. Embarrassingly so. And they just cleared AIBs for out of the box overclocking.
And honestly, what does ATI have to rush against? They have a good product at low prices (just ordered an X1800xt 512mb for 507 shipped). They'd have to bite down pretty hard on Nvidia's 512gtx P.R. campaign in order to push another card out that fast, in my opinion. Nvidia's AIBs, eVGA specifically, have stated in their forums that Nvidia is shipping very limited numbers of these cores, and what has it caused other than some of the highest prices ever and lowest quantities in a retail launch?
There are just too many variables, such as immature drivers for the R520, and clearing AIBs for overclocking when only one or two have released public SKUs so far. No master cards for the current lineup for Crossfire. GDDR4 being something they could definitely wait for and use against Nvidia. And the accompanying refresh cores for the just-released low and mid range 90nm parts. All of that would be thrust out the door by the arrival of an R580 so early. I give the current cards a lifespan of at least 4 or 5 months from time of launch. Any less would be bad business sense. I understand that the card may be in good health and prepped, perhaps that's what you hear, but I don't think they would dare undermine sales of current products that they badly need by releasing an expensive high end part and shutting down production on current low-to-high end cores just after it hit full swing.
AlphaWolf
04-Dec-2005, 05:55
I would. It's the best business plan ATI has. I think a January or February launch, seeing cards in stock, would be crazy on their part. Some of today's hottest titles are barely working on the R520. Embarrassingly so. And they just cleared AIBs for out of the box overclocking.
And honestly, what does ATI have to rush against? They have a good product at low prices (just ordered an X1800xt 512mb for 507 shipped). They'd have to bite down pretty hard on Nvidia's 512gtx P.R. campaign in order to push another card out that fast, in my opinion. Nvidia's AIBs, eVGA specifically, have stated in their forums that Nvidia is shipping very limited numbers of these cores, and what has it caused other than some of the highest prices ever and lowest quantities in a retail launch?
There are just too many variables, such as immature drivers for the R520, and clearing AIBs for overclocking when only one or two have released public SKUs so far. No master cards for the current lineup for Crossfire. GDDR4 being something they could definitely wait for and use against Nvidia. And the accompanying refresh cores for the just-released low and mid range 90nm parts. All of that would be thrust out the door by the arrival of an R580 so early. I give the current cards a lifespan of at least 4 or 5 months from time of launch. Any less would be bad business sense. I understand that the card may be in good health and prepped, perhaps that's what you hear, but I don't think they would dare undermine sales of current products that they badly need by releasing an expensive high end part and shutting down production on current low-to-high end cores just after it hit full swing.
They can't just keep pushing products back because the ones they have out there are ok. There is no reason to assume the r600 will be late, so no matter what you do with r580 it's either going to infringe on the lifecycle of r520 (at the backend) or r600 (at the frontend), or perhaps some option in between, which seems unlikely given talk of midyear vista. If those are the choices you don't screw over r600, you bail on the failed product.
That said, I still think they could get plenty of play out of r520 in the higher midrange and release r580 as an enthusiast only part. This would only mean holding back on rv560 for a little while.
SugarCoat
04-Dec-2005, 06:23
They can't just keep pushing products back because the ones they have out there are ok. There is no reason to assume the r600 will be late, so no matter what you do with r580 it's either going to infringe on the lifecycle of r520 (at the backend) or r600 (at the frontend), or perhaps some option in between, which seems unlikely given talk of midyear vista. If those are the choices you don't screw over r600, you bail on the failed product.
That said, I still think they could get plenty of play out of r520 in the higher midrange and release r580 as an enthusiast only part. This would only mean holding back on rv560 for a little while.
Then the impact will be from the R600. An early R580 would only make business sense if that product were looming. I have seen very, very little info on the R600, so I must assume it is still scheduled for a late 2006 or early 2007 introduction. ATI simply has no need for the next few months to go to a larger, more expensive, more complex core and supersede its low/mid range. Especially if the performance isn't there due to drivers. At the current moment I think the R580 may look very embarrassing in today's titles, with the exception of some synthetic shader benchmarks. One thing I do think we can count on is further driver development in areas where the R520 and R580 are the same, namely the memory controller. I would expect things of this nature to pick up heavily over the next few Cat releases, while other areas such as AOE3, FEAR and HL2 will hopefully get attention as well.
AlphaWolf
04-Dec-2005, 06:45
Then the impact will be from the R600. An early R580 would only make business sense if that product were looming. I have seen very, very little info on the R600, so I must assume it is still scheduled for a late 2006 or early 2007 introduction. ATI simply has no need for the next few months to go to a larger, more expensive, more complex core and supersede its low/mid range. Especially if the performance isn't there due to drivers.
Define looming. If they release r580 in March (as you think is the earliest possible that makes sense), then matching an r600 release with a possible august vista release would only give r580 a total of 5 months, which you seem to think is crazy to do for r520.
At the current moment I think the R580 may look very embarrassing in today's titles, with the exception of some synthetic shader benchmarks.
How is it going to be slower than r520? There may be instances where it won't be much faster, but embarrassing?
The simple fact is that they had a significant product delay and they are going to have to short cycle something if they want to get back on track.
R5xx has the fully associative (and I think bigger) cache and, in X1600/1800, the new memory controller - the net effect is less of a hit from AF, as I understand it.
I haven't yet tested how much a fully associative cache affects performance. I usually set a 16 KB, 4 set associative cache. In any case a better cache isn't going to help when you are limited by the address and filter ALUs (1 bilinear per cycle), which is what the blue line is showing. When you hit 1 (100% utilization) the Texture Unit is already at full rate, and only implementing more ALUs (a higher bilinear throughput) would increase the performance.
You can see a small single texture performance gain:
http://www.beyond3d.com/reviews/ati/r520/index.php?p=15#fill
increasing as texture layers are added.
Also, with a careful bit of page jumping:
FarCry no AF (http://www.beyond3d.com/reviews/ati/r520/index.php?p=20)
HL-2 no AF (http://www.beyond3d.com/reviews/ati/r520/index.php?p=24)
FarCry and HL-2 16xAF & HQ AF (http://www.beyond3d.com/reviews/ati/r520/index.php?p=27)
you can see the hit for AF for the X800XL is 8% in FC and 19.5% HL-2 at 1600x1200.
Jawed
But what I'm interested in is the impact of AF on the X1600, which should be quite large if the three games I'm testing are representative or, if it isn't, there is something not obvious that I'm missing in its architecture.
The R3xx used 8 KB texture caches (with some strange limitations when the accessed mipmap texel to pixel ratio wasn't 1:1 or near) while the NV3x had 4 KB texture caches (with no limitations, and the observed performance seemed to correspond with the four banked, free bilinear cache described in the literature). And my cheap Radeon 9000 (RV250) at home has a 2 KB cache with no limitations, like the NVidia caches. My theory is that in the R3xx generation ATI increased the cache size by a lot, likely reducing the number of banks, keeping bilinear free for most 1:1 or 1:2 texel to pixel ratios (minification) and allowing 2 or even 4 cycle penalties for the other, more uncommon ratios (which shouldn't happen if mipmapping is used).
Discovering the size of the texture caches (at any level) shouldn't be hard. Just render a few dozen to a hundred full screen textured quads and measure the framerate, divide and you get the texture bw. Increasing the size of the texture, set to cover the whole quad, from 1 KB to 256/512+ MB (if you also want to test the system bus bw), you get the 'bw steps' that define the different cache sizes. Doesn't GPUBench have that test?
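A rough sketch of how the 'bw steps' idea above could be post-processed, in Python. Everything here is hypothetical: the function names, the 0.75 drop threshold and the sample numbers are made up for illustration and are not measurements from any GPU or output from GPUBench.

```python
def texture_bandwidth(fps, quads_per_frame, bytes_per_quad):
    """Bytes of texture data consumed per second for a run of full screen quads."""
    return fps * quads_per_frame * bytes_per_quad


def find_bandwidth_steps(samples, drop_ratio=0.75):
    """Return the texture sizes after which the measured bandwidth drops sharply.

    samples: list of (texture_bytes, bandwidth) pairs sorted by texture size.
    A drop below drop_ratio times the previous bandwidth is treated as a step,
    i.e. the working set stopped fitting in some cache level.
    """
    steps = []
    for (size_a, bw_a), (_size_b, bw_b) in zip(samples, samples[1:]):
        if bw_b < bw_a * drop_ratio:
            steps.append(size_a)
    return steps


# Synthetic, made-up curve: bandwidth stays flat up to 8 KB and then falls.
synthetic = [
    (1024, 100.0), (2048, 100.0), (4096, 100.0), (8192, 98.0),
    (16384, 40.0), (32768, 40.0), (65536, 39.0),
]
print(find_bandwidth_steps(synthetic))
```

With that made-up curve the only step reported is at 8192 bytes, which under this model would suggest an 8 KB cache level. Real measurements would be noisier and would probably need averaging over several runs.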
Increasing associativity helps to avoid conflict misses, which happen when a number of different cache lines are mapped to the same cache set and keep replacing each other without having been fully used, increasing (in the worst case by a lot) the number of misses. With multitexturing you are potentially accessing more buffers that may be aligned in a similar way in GPU memory, thus increasing the chance of two accesses to different textures being mapped to the same set.
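The conflict miss effect described above can be illustrated with a toy cache model in Python. This is only a sketch under assumptions: an LRU set associative cache, a 16 KB cache with 64-byte lines, and two textures whose base addresses differ by a multiple of the cache size, read alternately. None of these parameters describe a real GPU.

```python
from collections import OrderedDict


def count_misses(addresses, cache_bytes, line_bytes, ways):
    """Count misses for a byte address stream in an LRU set associative cache."""
    num_sets = (cache_bytes // line_bytes) // ways
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        line = addr // line_bytes
        cache_set = sets[line % num_sets]
        if line in cache_set:
            cache_set.move_to_end(line)        # hit: mark as most recently used
        else:
            misses += 1
            if len(cache_set) >= ways:
                cache_set.popitem(last=False)  # evict the least recently used line
            cache_set[line] = True
    return misses


CACHE_BYTES, LINE_BYTES = 16 * 1024, 64
TEX_A, TEX_B = 0, 4 * CACHE_BYTES  # both textures map onto the same sets

# Multitexturing: alternate 4-byte texel reads from texture A and texture B.
addresses = []
for offset in range(0, 1024, 4):
    addresses += [TEX_A + offset, TEX_B + offset]

for ways in (1, 2, 256):
    print(ways, count_misses(addresses, CACHE_BYTES, LINE_BYTES, ways))
```

In this model the direct mapped case (1 way) misses on all 512 accesses because the two textures keep evicting each other, while 2 ways or a fully associative cache (256 ways here) only take the 32 compulsory misses.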
Kombatant
04-Dec-2005, 10:22
If you think about it, R580 is already late.
I haven't yet tested how much a fully associative cache affects performance. I usually set a 16 KB, 4 set associative cache. In any case a better cache isn't going to help when you are limited by the address and filter ALUs (1 bilinear per cycle), which is what the blue line is showing. When you hit 1 (100% utilization) the Texture Unit is already at full rate, and only implementing more ALUs (a higher bilinear throughput) would increase the performance.
I've always seen out of order, fully-decoupled TMU, shader execution as a means to increase both ALU and TEX utilisation.
But what I'm interested in is the impact of AF on the X1600, which should be quite large if the three games I'm testing are representative
Sadly, I don't think UT2K4, D3 and Q4 are representative. ATI says the 3:1 thing is forward-looking. Those are "DX8 games", as far as I can tell - very short fragment shaders with extremely high texturing intensity.
But...
or if it isn't there is something not obvious that I'm missing in its architecture.
The way I see it is that when AF is switched on, the scheduling of threads for an identical shader, becomes quite different.
Brutally simplifying: imagine 128 threads (for 2048 fragments within one triangle) running the same shader. Without filtering:
16 threads run to completion
16 threads run to completion
...
final 16 threads run to completion
In this scenario, the effective batch size of the GPU is 16 threads x 16 fragments, i.e. 256 fragments. This is big enough to hide the latency associated with texturing, including filling the cache.
With trilinear, the same 128 threads have to run in "bigger" batches to hide texturing latency:
32 threads run to completion
32 threads
32 threads
final 32 threads
This means that the effective batch size is now 512 fragments.
With maximum AF we might see 128 threads running together - a batch of 2048 fragments - but most importantly, that's 512 cycles per AF texture op (is that enough?).
As the effective batch size increases, the TMUs run continuously, for longer stretches of time, with increasing cache coherency.
I'm making this up...
Discovering the size of the texture caches (at any level) shouldn't be hard. Just render a few dozen to a hundred full screen textured quads and measure the framerate, divide and you get the texture bw. Increasing the size of the texture, set to cover the whole quad, from 1 KB to 256/512+ MB (if you also want to test the system bus bw), you get the 'bw steps' that define the different cache sizes. Doesn't GPUBench have that test already?
It certainly covers similar territory. There's enough parameters in the Fetch Costs test to have a field day.
http://graphics.stanford.edu/projects/gpubench/test_fetchcosts.html
I presume altering the size of the framebuffer alters the size of the source texture - but I'm not sure.
Jawed
If you think about it, R580 is already late.
I keep saying this, and Dave keeps saying that ATI says they won't hold-up R580 just because R520 was late, but...
Jawed
I keep saying this, and Dave keeps saying that ATI says they won't hold-up R580 just because R520 was late, but...
Jawed
Well if it did encounter the same problems as R520/RV530, it could be a bit late because of that, even though they know the fix, it takes a second to apply it too?
So it would be late because of that, not because ATI would be holding it back
I think you are confusing a bit latency with throughput here.
As we have implemented out of order execution, the only limitation on the number of fragments on execution comes from the per fragment register usage. For example, I'm setting 128 threads, each thread a group of four quads (or 16 fragments), for each quad shader processor. That is a maximum of 2048 fragments stored in the shader at any time. Then I set 2 registers per fragment as storage (so 4096 128-bit registers). That means that if the fragment program requires 8 registers, only a maximum of 512 fragments (32 threads) will really be stored in the shader, because you are limited by the number of registers. Those 512 fragments could then hide a texture access latency of 128 cycles (in case of in order execution) or more.
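The arithmetic in that example can be written down as a small Python sketch. The function names are hypothetical, and the rate of 4 fragments (one quad) issued per cycle is an assumption chosen because it reproduces the 128 cycle figure quoted above; it is not taken from the simulator's code.

```python
def resident_threads(threads, frags_per_thread, regs_budget_per_frag, shader_regs):
    """Threads and fragments that fit once the register file is the limit.

    The register file is sized as threads * frags_per_thread * regs_budget_per_frag,
    and a shader that needs shader_regs registers per fragment consumes it faster.
    """
    register_file = threads * frags_per_thread * regs_budget_per_frag
    fragments_by_registers = register_file // shader_regs
    fitting_threads = min(threads, fragments_by_registers // frags_per_thread)
    return fitting_threads, fitting_threads * frags_per_thread


def hidden_latency_cycles(fragments, frags_per_cycle=4):
    """Cycles of texture latency covered by issuing the resident fragments in order.

    frags_per_cycle=4 (one quad per cycle) is an assumption of this sketch.
    """
    return fragments // frags_per_cycle


threads_in_flight, fragments_in_flight = resident_threads(128, 16, 2, 8)
print(threads_in_flight, fragments_in_flight)      # 32 512
print(hidden_latency_cycles(fragments_in_flight))  # 128
```

A shader that stays within the 2 register budget keeps all 128 threads (2048 fragments) resident in this model.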
But the number of AF samples required by a texture access, or the maximum AF, has nothing to do with the shader. That's a problem for the Texture Unit and the queues that are implemented there. Depending on how you implement the different queues and how the bilinear or trilinear accesses are created and stored, the Texture Unit may be limited to storing a number of 'fragments'. But those fragments, the thread that represents those fragments in any case, is still in the shader in a blocked state waiting for the Texture Unit. Whether you have a single blocked state (waiting for texture result) or two (waiting to enter TU and waiting to exit TU) because of limitations in the TU queues doesn't affect the number of fragments in the shader.
With AF the texture access takes longer and the latency is increased, so you require more fragments, more threads that could be ready, to hide that latency. But what is really limiting the performance, even in the most ideal case where all the data is already stored in the cache, is the sample throughput. If a texture access requires 16 bilinear samples and your TU pipeline is designed for a throughput of 1 bilinear per cycle, all the ALUs and stages in the TU pipeline will be reserved, at some point, for 16 cycles. In those 16 cycles the TU stage won't be working on other texture accesses. That means that for 16 cycles, if the shader requires another texture access, it will have to stall (or change to another thread, but in the end if all the threads require the same number of samples this doesn't help).
You can hide latency by adding more fragments (and corresponding registers) on execution and increasing the TU pipeline depth. But the only way you can 'hide' (in the sense of fully utilizing all other processing units) throughput is by providing another task to be done in parallel by the other, non limited, processing units. And that means that for 'hiding' a 16 sample texture access you need 16 execution cycles (the number of corresponding instructions depending on the ALU configuration and the program) in the shader ALUs. And the worst case of 32 samples per texture access requires at least a 1:32 ratio between texture and arithmetic instructions in the fragment program.
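To make the throughput argument concrete, here is a minimal Python model. It assumes a Texture Unit that produces 1 bilinear sample per cycle, perfect overlap between ALU and TU work, and no latency or cache effects; the function name is hypothetical and the model is far simpler than a real simulator.

```python
def unit_utilization(tex_ops, samples_per_tex, alu_cycles):
    """Return (alu_utilization, tu_utilization) for one fragment program.

    The Texture Unit needs tex_ops * samples_per_tex cycles at 1 bilinear per
    cycle. With perfect overlap, the slower of the two units sets the total time.
    """
    tex_cycles = tex_ops * samples_per_tex
    total_cycles = max(tex_cycles, alu_cycles)
    return alu_cycles / total_cycles, tex_cycles / total_cycles


# One texture access with 16 AF samples and a 3:1 ALU:TEX program.
print(unit_utilization(1, 16, 3))    # (0.1875, 1.0)
# The same access with 16 ALU cycles available to overlap.
print(unit_utilization(1, 16, 16))   # (1.0, 1.0)
# Worst case of 32 samples needs 32 ALU cycles to keep both units busy.
print(unit_utilization(1, 32, 32))   # (1.0, 1.0)
```

Under these assumptions a 3:1 program with a 16 sample access leaves the ALUs idle more than 80% of the time, which is the imbalance the post is pointing at.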
I have to agree with UT2004 and DOOM3 being quite DX8 like (and UT2004 being more DX7 like). But we can't do anything about it if there are no 'DX9' OpenGL games out there (or at least none I know of that I can make work in the framework). ATI's 3:1 seems too low for those games but seems ok for future or DX9 games. But in my opinion AF changes that a bit. Of course that depends on how many fragments in a scene require a large number of AF samples and how many of the textures applied to each fragment are set to use high AF modes. Only the IHVs know those numbers for all (or at least most) current games, and future ones (with their close relation with developers).
Kombatant
04-Dec-2005, 12:41
I keep saying this, and Dave keeps saying that ATI says they won't hold-up R580 just because R520 was late, but...
Jawed
A good example of how these companies handle situations like that, is nVidia. Just remember the life span of NV30 and how quickly NV35 was introduced, and you will have your answer.
I think you are confusing a bit latency with throughput here.
As we have implemented out of order execution, the only limitation on the number of fragments on execution comes from the per fragment register usage. For example, I'm setting 128 threads, each thread a group of four quads (or 16 fragments), for each quad shader processor. That is a maximum of 2048 fragments stored in the shader at any time. Then I set 2 registers per fragment as storage (so 4096 128-bit registers). That means that if the fragment program requires 8 registers, only a maximum of 512 fragments (32 threads) will really be stored in the shader, because you are limited by the number of registers. Those 512 fragments could then hide a texture access latency of 128 cycles (in case of in order execution) or more.
Agreed.
Xenos has a limit of 12 registers across 63 threads (each thread being 64 fragments) - therefore 48384 register file slots. The documentation I have on this is somewhat sketchy, because it links the 12-register limit to "pixel shaders" (of which there can only be 63 threads), but I'm not sure if the same or a lower or higher limit applies to the 31 vertex threads that can also be current along with those 63 fragment threads. If we assume that the 31 vertex threads also have a 12-register limit, then the total register slots for Xenos is 72192.
In X1600XT there's 128 threads x 48 fragments active, so that's 6144 slots in the register file per coded register (a 2-register shader uses 12288 slots). I've no idea what the cut-over is, whether it's 4 or 8 etc. before the number of concurrent threads is restricted by the slots available in the register file.
In X1800XT there's 512 threads x 16 fragments active, so 8192 slots per coded register.
In X800XT there's 4 threads x 256 fragments (? assuming a 16x16 screen-tile defines a thread), so that's 1024 slots per coded register.
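The register file slot counts quoted above are just products of threads, fragments per thread and registers, which a few lines of Python can reproduce. The helper name is hypothetical; the inputs are the figures from the post, including its stated assumptions (a 12-register limit for Xenos vertex threads, a 16x16 tile per X800XT thread).

```python
def register_slots(threads, frags_per_thread, regs=1):
    """Register file slots needed for the given number of coded registers."""
    return threads * frags_per_thread * regs


print(register_slots(63, 64, 12))        # Xenos, 63 fragment threads: 48384
print(register_slots(63 + 31, 64, 12))   # Xenos plus 31 vertex threads: 72192
print(register_slots(128, 48))           # X1600XT per coded register: 6144
print(register_slots(128, 48, 2))        # X1600XT, 2-register shader: 12288
print(register_slots(512, 16))           # X1800XT per coded register: 8192
print(register_slots(4, 256))            # X800XT per coded register: 1024
```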
That's a problem for the Texture Unit and the queues that are implemented there. Depending on how you implement the different queues and how the bilinear or trilinear accesses are created and stored, the Texture Unit may be limited to storing a number of 'fragments'.
You got me there. I've always assumed that the texture pipes in Xenos/R5xx are single-threaded (but pipelined, obviously). I don't have a good understanding of the data structures required to support the looping required to implement AF, or how multi-texturing commands are implemented...
But those fragments, the thread that represents those fragments in any case, is still in the shader in a blocked state waiting for the Texture Unit. Whether you have a single blocked state (waiting for texture result) or two (waiting to enter TU and waiting to exit TU) because of limitations in the TU queues doesn't affect the number of fragments in the shader.
You can hide latency by adding more fragments (and corresponding registers) on execution and increasing the TU pipeline depth. But the only way you can 'hide' (in the sense of fully utilizing all other processing units) throughput is by providing another task to be done in parallel by the other, non limited, processing units.
Agreed, "hide" should mean that an ALU instruction can execute in some other thread, even when the shader contains a tight dependency:
tex r0, texture A
mul r0, r0, parameter
If that's the entire shader, and all 128 threads contain that shader, then texturing latency can't be fully hidden because the ALU pipeline will be idle for some time.
But if there's more to the shader, including an extended portion with no texturing, then that's a perfect opportunity to hide AF - as long as you have some threads that are ready to execute that portion of the shader when the heavy AF load of instructions 1 and 2 hits other threads.
I'm sure I don't need to say this to you, cos this is simple stuff to you - I'm saying it just to be sure I'm on the right wavelength.
And that means that for 'hiding' a 16 sample texture access you need 16 execution cycles (the number of corresponding instructions depending on the ALU configuration and the program) in the shader ALUs. And the worst case of 32 samples per texture access requires at least a 1:32 ratio between texture and arithmetic instructions in the fragment program.
The slow-down associated with AF has never been as dramatic as 4x though, has it? I've got a pretty poor understanding of texture filtering and the history of implementations and performance in different games and synthetic benchmarks :oops:
I suppose what would be interesting is to see a graph like those that you showed earlier for texture pipe utilisation in a single-quad GPU:
256-fragment, single-threaded, no-AF
256-fragment, single-threaded, 16xAF
16-fragment, 128-threaded (1:1), no-AF
16-fragment, 128-threaded (1:1), 16xAF
48-fragment, 128-threaded (3:1), no-AF
48-fragment, 128-threaded (3:1), 16xAF
---
After all this (it's been about 5 hours mulling!), I think the question of 3:1 versus 1:1 may be moot in terms of increased AF efficiency (which is the point you've been arguing), because the scheduler is still only operating on two threads: one thread to run on the ALUs and another to run in the TMU - the ratio is immaterial. All the AF efficiency improvements in R5xx seem to come from cache and memory bus/controller features.
Going round in circles here. Oh well... Wish B3D had tested RV515 and RV530...
Jawed
Well if it did encounter the same problems as R520/RV530, it could be a bit late because of that, even though they know the fix, it takes a second to apply it too?
So it would be late because of that, not because ATI would be holding it back
Yes, R580 was held-up at some point in spring/summer due to the problem.
All I'm saying is that ATI has since pushed-forwards with R580, but leaving room for RV530 and RV515 to take priority.
Actually, I forgot that the roadmap from spring shows RV530 as coming before RV515, originally - so they were swapped around. I think RV515 came out almost on time and not having had the problem, like R520, was prolly only affected by the engineers shifting focus onto R520.
Jawed
kemosabe
04-Dec-2005, 17:19
A good example of how these companies handle situations like that, is nVidia. Just remember the life span of NV30 and how quickly NV35 was introduced, and you will have your answer.
Except that, if my recollection is accurate, NV35 had already reached its projected launch schedule since NV30 was held up longer than R520 (in excess of 6 months). Moreover, R520 is not nearly the performance disappointment that NV30 was (relative to the competition) and is looking to sell rather well despite the delay.
Dave Baumann
04-Dec-2005, 23:05
Jawed - ATI is still working on the original execution path, the working order of the chips hasn't changed save for the fact that RV515 didn't experience problems, so it only required two spins, as opposed to RV530's 3.
http://www.pcbuyersguide.co.za/showthread.php?t=834
Obviously questionable provenance that diagram (could date from Q3 2004 for all I know...)
So, is there a diagram around with RV515 before RV530?
Jawed
Dave Baumann
04-Dec-2005, 23:32
Err, Jawed, what did I just say?
They began work on R520 first, then RV530, then RV515 then R580 - had they all hit on the same number of spins thats the order they would have come out in; RV530 took an extra spin so it was delayed an extra length of time in relation to RV515. Work didn't stop on RV515 so the upshot was that it got released before RV530 despite being started after.
Dunno, it seemed like you were trying to say that what I said about the order of RV515 and RV530 was wrong, so I posted a picture of why I said that.
But I suspect there are other roadmap pix around (if only privately), so I was fishing for alternatives to see if the roadmap itself had changed.
Jawed
Megadrive1988
04-Dec-2005, 23:46
Except that, if my recollection is accurate, NV35 had already reached its projected launch schedule since NV30 was held up longer than R520 (in excess of 6 months). Moreover, R520 is not nearly the performance disappointment that NV30 was (relative to the competition) and is looking to sell rather well despite the delay.
my thoughts exactly.
anyway, I'm expecting R580 to be ATI's winter-spring flagship product - then R600 in the fall of 06.
SugarCoat
05-Dec-2005, 00:01
Define looming. If they release r580 in March (as you think is the earliest possible that makes sense), then matching an r600 release with a possible august vista release would only give r580 a total of 5 months, which you seem to think is crazy to do for r520.
How is it going to be slower than r520? There may be instances where it won't be much faster, but embarrassing?
The simple fact is that they had a significant product delay and they are going to have to short cycle something if they want to get back on track.
No, I never said 5 months was crazy; I said it's the least amount of time I'd see ATI giving the R520 before they bring in the R580. And I think you'll find the R580 filling a longer cycle than 5 months should it launch in February or March. Don't forget as well that if Vista launches early, there is a chance that it may launch with a modified DX9 version rather than DX10. ATI and Nvidia would know this, I don't, but if it doesn't, what's the rush on the R600? That's one card I think they hope to cash in on with holiday sales and market heavily as the first unified architecture for PC gaming. The core will be very complex, large, and costly.
One thing theinq has been very decent about is info on chips getting taped out, so once we see that the R600 has been taped out, then you can bring it into the equation. Right now it's a non-existent product. I cannot wait until the R600 is released, or even until we start to get technical details, but it won't be for a while. If we still know nothing by March I do indeed expect the R600 products to launch into a holiday 2006 plan.
And my speculation on the R580 launch date is aimed at those who treat it like we could or will see it this month or the next, which I think makes zero sense with it coexisting with the R520. Late or not, ATI does want to maximize profits, and they do not want to fail in that area and simply take the lead in benchmarks, otherwise we WOULD have seen the R520 cancelled.
At $500 or $550 the 512mb XT still has plenty of selling potential.
Ailuros
05-Dec-2005, 05:14
Microsoft's Vista delays turned the IHV's roadmaps into a mess long before today.
Whenever ATI/NVIDIA will release their D3D10 GPUs, it sounds more to me like a flagship GPU release with mainstream and budget still using the existing inventory of SM3.0 GPUs throughout 2006 until they release R6xx/G8x (?) related mainstream and budget GPUs later on.
It's quite natural that most of us enthusiasts concentrate mostly on the higher end of the market, but the high volumes and real profit are elsewhere. Or else the product line based on R580 will have a way longer lifetime than I can read out of some posts here.
hope this wasn't mentioned before:
Seoul, December 4, 2005 - Hynix Semiconductor, Inc. (‘Hynix’ or ‘the Company’, www.hynix.com) today announced the availability of the world’s first 512 Megabit GDDR4 DRAM, the DRAM industry’s fastest and highest density graphics memory.
GDDR is an ultra-high speed graphics DRAM that processes moving pictures and graphic data in personal computers and game consoles. The fourth-generation graphics memory GDDR4, which improves data processing speed by close to two times than that of GDDR3, is ideal for 64-bit computer operating systems that manage vast amounts of data at once.
The Hynix’s 16Mx32 512Mb GDDR4 operates at 2.9Gbps and processes 11.6 Gigabytes of data in 1 second.
Hynix will shortly begin to sample its GDDR4 products to leading graphic chipset suppliers and plans to start mass production in early 2006.
Hynix plans to introduce its GDDR4 DRAM with 14.4GB speed by second half of 2006.
http://hynix.com/allnews/eng/preng_readB.jsp?NEWS_DATE=2005-12-05:09:39:37
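The bandwidth figure in the press release follows from the per-pin data rate and the 32-bit interface of a 16Mx32 part. A quick Python check, where the function names are hypothetical and the 3.6 Gbps per-pin rate for the later part is derived here from the quoted 14.4 GB figure rather than stated by Hynix:

```python
def chip_bandwidth_gbytes(gbps_per_pin, bus_width_bits=32):
    """Peak bandwidth of one DRAM chip in gigabytes per second."""
    return gbps_per_pin * bus_width_bits / 8


def per_pin_rate_gbps(gbytes_per_second, bus_width_bits=32):
    """Per-pin data rate implied by a given chip bandwidth."""
    return gbytes_per_second * 8 / bus_width_bits


print(round(chip_bandwidth_gbytes(2.9), 2))   # 11.6 GB/s, matching the release
print(round(per_pin_rate_gbps(14.4), 2))      # 3.6 Gbps per pin for the later part
```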
Interesting.. Do you think it might be too early for ATI to use this RAM if the R580 is released in January or February?
AlStrong
06-Dec-2005, 01:12
Interesting.. Do you think it might be too early for ATI to use this RAM if the R580 is released in January or February?
Maybe not for the X1850 XTPEUGTO etc. :wink:
But seriously, they could just start out with limited supplies of R580 with GDDR4.
kemosabe
06-Dec-2005, 04:57
Interesting comments about R580 from Rich Heye (http://phx.corporate-ir.net/phoenix.zhtml?p=irol-eventDetails&c=105421&eventID=1171470) (VP desktop business unit) at the recent Credit Suisse First Boston Technology conference (12 min 30 sec mark). Confirmed the improved design vs. R520 with substantially higher performance. Currently sampling to partners and OEMs (where they are bullish about spring design wins) and first production wafers now coming out of TSMC with volume shipments slated for Q1/06.
vol2005
07-Dec-2005, 13:18
x1900's ? (http://www.hkepc.com/bbs/viewthread.php?tid=517255) (chinese )
Same PCB and heatsink
russo121
07-Dec-2005, 13:26
x1900's ? (http://www.hkepc.com/bbs/viewthread.php?tid=517255) (chinese )
Same PCB and heatsink
And they are pointing to US$1000 - that's expensive - I think it's time to put a console in the buy list and forget pc.
vol2005
07-Dec-2005, 13:33
And they are pointing to US$1000 - that's expensive - I think it's time to put a console in the buy list and forget pc.
I think they have mentioned $1000 as the price of high-end towards 2007 ( not this card ). Here is link (http://www.systranbox.com/systran/box?systran_lp=zt_en&systran_id=SystranSoft-en&systran_url=http://www.hkepc.com/bbs/viewthread.php?tid=517255&systran_f=1133961258) to translated page
Chalnoth
07-Dec-2005, 18:42
And they are pointing to US$1000 - that's expensive - I think it's time to put a console in the buy list and forget pc.
Oh, come on. It's not like these high-end parts are close to necessary to play games well on a PC. It is possible to play any game out there today on the PC with a $50 graphics card.
Oh, come on. It's not like these high-end parts are close to necessary to play games well on a PC. It is possible to play any game out there today on the PC with a $50 graphics card.
Not true.
Try 100+.
You might be able to get it to render the game, but not at any reasonable framerate.
http://www.newegg.com/Product/ProductList.asp?Manufactory=&PropertyCodeValue=0&PropertyCodeValue=0&PropertyCodeValue=0&PropertyCodeValue=0&PropertyCodeValue=0&PropertyCodeValue=0&PropertyCodeValue=0&PropertyCodeValue=0&PropertyCodeValue=0&PropertyCodeValue=0&PropertyCodeValue=0&PropertyCodeValue=0&PropertyCodeValue=0&PropertyCodeValue=0&PropertyCodeValue=0&PropertyCodeValue=0&description=&MinPrice=40&MaxPrice=50&SubCategory=48&Submit=Property
That's what $50 gets you at Newegg:
6200TCs and Radeon 9XXX and even a beloved FX5200 and 5500 and HyperMemory X300s etc.
Chalnoth
07-Dec-2005, 19:08
Not true.
Try 100+.
You might be able to get it to render the game but not at any reasonable framerate.
Just because you can't play the games at 1024x768 or higher doesn't mean you can't play them.
Just because you can't play the games at 1024x768 or higher doesn't mean you can't play them.
So you regularly recommend 6200TC and Radeon X300SE to people based on that?
An X300SE HyperMemory can run Doom 3 at 640 at ~40 fps though, twice the speed of my old 8500 I might add :D
So basically you're gonna be playing at 800x600 in most games, 1024 for the lightweight games (hl2)
And if said system has an lcd display, like dell and other pc makers like to get you to buy, the image will look pretty bad.
And games do look horrendous (bar dark games like Doom 3/Q4, Riddick etc.) at 1024 or lower if you've ever looked at games with fsaa and/or 4x fsaa.
Ailuros
08-Dec-2005, 05:18
You can always combine a $50 GPU with a 17"-whatever monitor. Especially on a CRT where you're not bound to a native resolution, the resolution is in analogy to the viewable space. While 800*600 will look mediocre on a =/>19" display it'll look more than acceptable on a 17" display.
You can always combine a $50 GPU with a 17"-whatever monitor. Especially on a CRT where you're not bound to a native resolution, the resolution is in analogy to the viewable space. While 800*600 will look mediocre on a =/>19" display it'll look more than acceptable on a 17" display.
I have a 17" CRT and 800*600 looks like crap. You see more aliased textures, and the dark spacing in between the horizontal scan lines (I think that's what it's called) is a lot more visible.
DudeMiester
08-Dec-2005, 20:12
You would be insane to play next gen games like UT2007 with anything less than a $250CAD vid card these days, imho.
Ailuros
08-Dec-2005, 21:38
I have a 17" CRT and 800*600 looks like crap. You see more aliased textures, and the dark spacing in between the horizontal scan lines (I think that's what it's called) is a lot more visible.
The quoted text states quite clearly more acceptable on a 17" display. Do measures of relativity tell you anything because that's what it was about; resolution in relation to viewable screen space.
800 by 600 pixels means usually something around 64 dpi on a 17" CRT, while on a 19" it drops to around 56 dpi.
Ideal resolutions for 17" CRTs are either 1024*768 or 1152*864 depending always on supported refresh rates.
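The dpi figures above can be checked with a quick back-of-the-envelope calculation. This is only a sketch: it assumes square pixels, and the viewable diagonals of roughly 15.6" for a 17" CRT and 18.0" for a 19" CRT are assumed typical values, not numbers given in the post.

```python
import math

def dpi(width_px, height_px, viewable_diagonal_in):
    """Pixels per inch along the diagonal, assuming square pixels."""
    diagonal_px = math.hypot(width_px, height_px)
    return diagonal_px / viewable_diagonal_in

# Viewable diagonals are assumed values for typical CRTs, not measured ones.
VIEWABLE_17 = 15.6
VIEWABLE_19 = 18.0

for label, diag in (('17" CRT', VIEWABLE_17), ('19" CRT', VIEWABLE_19)):
    print(f"800x600 on a {label}: {dpi(800, 600, diag):.0f} dpi")
```

Under the same assumptions, 1024x768 on the 17" tube comes out at about 82 dpi, which is the "resolution in relation to viewable screen space" point in numbers.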
The quoted text states quite clearly more acceptable on a 17" display. Do measures of relativity tell you anything because that's what it was about; resolution in relation to viewable screen space.
800 by 600 pixels means usually something around 64 dpi on a 17" CRT, while on a 19" it drops to around 56 dpi.
Ideal resolutions for 17" CRTs are either 1024*768 or 1152*864 depending always on supported refresh rates.
Ya, obviously a bigger CRT is going to need a higher res to look good, but in games 800x600 still looks horrible in all but the darkest games.
You would be insane to play next gen games like UT2007 with anything less than a $250CAD vid card these days, imho.
I agree, it's why I'm reading a forum about R580 :grin:
250 CAD is like 50 dollars USD, right? :lol:
AlphaWolf
08-Dec-2005, 23:23
250 CAD is like 50 dollars USD, right? :lol:
At the rate it's going, by the time R580 is released the US dollar might be worth less than the CAD. (It's currently at its highest point in 15 years).
Inquirer: ATI's R580 chips in production
http://www.theinquirer.net/?article=28228
The question is - how long? :-)
Ailuros
08-Dec-2005, 23:55
Ya, obviously a bigger CRT is going to need a higher res to look good, but in games 800x600 still looks horrible in all but the darkest games.
Can you understand the sense my former reply was written in or not? It's tiresome if you have to repeat the same things over and over again.
Again resolution in relation to monitor size. While 1024*768 is a high resolution on a 15" display, it does look more than mediocre on a 21" monitor. I said nothing else than "more acceptable than".
At the rate it's going, by the time R580 is released the US dollar might be worth less than the CAD. (It's currently at its highest point in 15 years).
What's going on with the US dollar? :cry:
Can you understand the sense my former reply was written in or not? It's tiresome if you have to repeat the same things over and over again.
Again resolution in relation to monitor size. While 1024*768 is a high resolution on a 15" display, it does look more than mediocre on a 21" monitor. I said nothing else than "more acceptable than".
But... there's still the problem of aliasing, and 1152x864 is not great enough for a 17", and $50 cards can usually only run 2x fsaa, and a lot of games these days need good old SSAA to look right at low res.
If that's all you've known then I guess it's acceptable.
sonix666
09-Dec-2005, 05:20
OT: Euro for the win!!! ;)
Ailuros
09-Dec-2005, 05:23
But... there's still the problem of aliasing, and 1152x864 is not great enough for a 17", and $50 cards can usually only run 2x fsaa, and a lot of games these days need good old SSAA to look right at low res.
If that's all you've known then I guess it's acceptable.
For one CRTs that can display "real resolutions" beyond a 960 height are rare and second if you're spending only $50 for a GPU your demands should be also equally modest and that goes for both resolutions and general in game settings.
Second and again (since this hasn't been also the first time it's being mentioned) Supersampling will NOT replace any higher resolution, it'll merely improve output on the same resolution or else 1024*768 noAA << 1024*768 + 4xSSAA and not in any way 1024*768 + 4xSSAA > 1280*960 noAA as simple examples.
And a very redundant third, could we stick to the facts and avoid even indirect personal crap?
For one CRTs that can display "real resolutions" beyond a 960 height are rare and second if you're spending only $50 for a GPU your demands should be also equally modest and that goes for both resolutions and general in game settings.
Second and again (since this hasn't been also the first time it's being mentioned) Supersampling will NOT replace any higher resolution, it'll merely improve output on the same resolution or else 1024*768 noAA << 1024*768 + 4xSSAA and not in any way 1024*768 + 4xSSAA > 1280*960 noAA as simple examples.
I like the free "aa" I get with my .26 or somewhere about there on my 19" crt thank you very much.. I run at 1600x1200 with 1280x960 likely being the preferred res.
I didn't say it did, it would merely clean up the whole screen instead of just the edges :???:
Ailuros
09-Dec-2005, 05:33
I like the free "aa" I get with my .26 or somewhere about there on my 19" crt thank you very much.. I run at 1600x1200 with 1280x960 likely being the preferred res.
I didn't say it did, it would merely clean up the whole screen instead of just the edges :???:
If you want free AA - for which there never will be something entirely for free anyway - then you obviously need a >mainstream or $300 GPU. That means way higher demands than someone that spends only $50 on a GPU. And yes there are also users out there that are not willing or can't spend more.
What Supersampling does is known; it still won't replace a higher resolution. Dots per inch.
If you want free AA - for which there never will be something entirely for free anyway - then you obviously need a >mainstream or $300 GPU. That means way higher demands than someone that spends only $50 on a GPU. And yes there are also users out there that are not willing or can't spend more.
What Supersampling does is known; it still won't replace a higher resolution. Dots per inch.
What I mean by "free aa" is that there aren't enough dots to show 1600x1200.
I flat out said SSAA doesn't replace resolution.
Where are you getting this stuff?
overclocked
09-Dec-2005, 13:26
I'm wondering a little about the supply.
In Sweden you can't get an X1800XT 512 (haven't seen a 256 yet either), but the X1800XL 256 is decently available; secondly, the X1600s are to come in Feb or so. Is ATI cutting production of the high-end parts in order to have plenty of R580-based cards with their refreshes? I.e. the X1600 80nm would go to the X1300 price segment and so on for the rest. It just seems weird if ATI isn't trying to put the whole lineup together as fast as possible. I understand that they try to make as much money as possible with their cards, but is it likely that the memory on the X1800XT, and especially the 512MB version, is in such tight supply that it's better to shift the focus and start building inventory of said memory for the R580?
http://www.xbitlabs.com/articles/video/display/games-2005.html
However, our “mega-review” that covers loads of games and graphics boards is on track to be released early next year and it will include not only currently unrivalled NVIDIA’s SLI platform, but also ATI’s CrossFire platform featuring a brand-new ATI GPU that is supposed to be released massively in mid- or late-January, 2006.
A pretty bold statement from Anand regarding the release of the R580, no? It'd be quite misleading if he was referring to an X1800 XT PE when saying "brand-new".
Hmm. :)
That's pretty early.
Anand? Did you mean Anton?
Gotta like "massively" tho. Just gotta. "From his lips" etc.
If it's going to be widely available in late January that probably means GDDR3 memory and not GDDR4.
If it's going to be widely available in late January that probably means GDDR3 memory and not GDDR4.
Most of us are expecting a R580+GDDR3 combo and possibly R590+GDDR4.
Anand? Did you mean Anton?
Gotta like "massively" tho. Just gotta. "From his lips" etc.
No, I had just finished reading something @ Anandtech and still had it in my head. Just a mixup in my head.
If it's going to be widely available in late January that probably means GDDR3 memory and not GDDR4.
Wavey has hinted as much previously.
Most of us are expecting a R580+GDDR3 combo and possibly R590+GDDR4.
I thought that 80nm parts are slower/cheaper versions (like 110nm R430 vs. 130nm R480)...
Posted by someone on a Dutch forum (http://gathering.tweakers.net/forum/list_message/24780395#24780395):
X520XT (625/750) 18971 / 11185
X580?? (621/650) 20033 / 12583
He didn't say what the scores were for, but he was pretty spot on previously with the entire R520/RV530/RV515 lineup (with some minor adjustments but those were made by ATi at last minute). He also mentions that he saw RV580 and R590 passing by. The time that we will have to wait according to him: 1.5 months and a lot could still be tweaked/changed. So Xbit Labs statement about the "brand-new ATI GPU that is supposedly to be released massively in mid- or late-January, 2006" seems to be correct. The name will be X1900.
Someone mentioned the first score being 3DM2K3 and the second one being 3DM2K5, but the X1800XT seems to be scoring a bit too much. 12583 for the X1900XT seems possible though.
AlphaWolf
12-Dec-2005, 04:18
Posted by someone on a Dutch forum (http://gathering.tweakers.net/forum/list_message/24780395#24780395):
He didn't say what the scores were for, but he was pretty spot on previously with the entire R520/RV530/RV515 lineup (with some minor adjustments but those were made by ATi at last minute). He also mentions that he saw RV580 and R590 passing by. The time that we will have to wait according to him: 1.5 months and a lot could still be tweaked/changed. So Xbit Labs statement about the "brand-new ATI GPU that is supposedly to be released massively in mid- or late-January, 2006" seems to be correct. The name will be X1900.
Someone mentioned the first score being 3DM2K3 and the second one being 3DM2K5, but the X1800XT seems to be scoring a bit too much. 12583 for the X1900XT seems possible though.
An x1800xt wouldn't score near that in 3dmark with stock settings. 10k is about the max for a non-overclocked x1800xt in 05.
The x1600xt actually does really well in 3dmark so it wouldn't surprise me if the r580 also does relatively well. 12.5k wouldn't be a high guess imo, although cpu limits might be an issue at that level.
kemosabe
12-Dec-2005, 04:25
Posted by someone on a Dutch forum (http://gathering.tweakers.net/forum/list_message/24780395#24780395):
He didn't say what the scores were for, but he was pretty spot on previously with the entire R520/RV530/RV515 lineup (with some minor adjustments but those were made by ATi at last minute). He also mentions that he saw RV580 and R590 passing by. The time that we will have to wait according to him: 1.5 months and a lot could still be tweaked/changed. So Xbit Labs statement about the "brand-new ATI GPU that is supposedly to be released massively in mid- or late-January, 2006" seems to be correct. The name will be X1900.
Someone mentioned the first score being 3DM2K3 and the second one being 3DM2K5, but the X1800XT seems to be scoring a bit too much. 12583 for the X1900XT seems possible though.
With slower memory than R520, I take it that would be a preproduction sample (core speed would appear somewhat lowish as well). Also those X1800XT 3DMark scores are indeed well above those reported in any review (at stock speeds). I agree that a R580 score in that range would seem realistic.
Edit: Alpha took the words out of my mouth. That's what you get when you open a reply window and then pause to brush your teeth. :)
2nd edit: Errr, actually those R580 specs/scores could also fit nicely with an XL version, as "X580??" would allow for that possibility in the mind of eternal optimists. ;)
A (http://www.hkepc.com/bbs/viewthread.php?tid=519829) + B (http://www.systranbox.com/systran/box) = R580 (X1900) on January 24, 2006 with an MSRP of $599-649
Bring out your calendars folks ;)
trinibwoy
12-Dec-2005, 16:41
A (http://www.hkepc.com/bbs/viewthread.php?tid=519829) + B (http://www.systranbox.com/systran/box) = R580 (X1900) on January 24, 2006 with an MSRP of $599-649
:!:
That be bonus season too!! :grin:
Unknown Soldier
12-Dec-2005, 17:32
Ye . .but will there be stock ;)
I hope so . .ATI (according to the old CEO) was stocking them for a while already.
Bouncing Zabaglione Bros.
12-Dec-2005, 18:51
A (http://www.hkepc.com/bbs/viewthread.php?tid=519829) + B (http://www.systranbox.com/systran/box) = R580 (X1900) on January 24, 2006 with an MSRP of $599-649
Bring out your calendars folks ;)
Holy Crap! That's my birthday!
Holy Crap! That's my birthday!
but, you're not on the list yet! :(
January 24th in History
1839: Charles Darwin elected member of Royal Society
1848: Gold is first discovered in California, starting the gold rush
1902: Denmark sells Virgin Islands to USA
1924: Russian city of St Petersburg renamed Leningrad
1924: Italian dictator Mussolini disallows non-fascists work union
1935: Canned Beer is sold for the first time, in Virginia (United States)
1978: A Russian satellite accidentally crashes in Canada's Northwest Territory
1984: Apple unveils its Macintosh personal computer, the first with a graphical interface
1985: The space shuttle 'Discovery' is launched on its first military mission
January 24th birthdays
1732: Pierre de Beaumarchais, French writer
1862: Edith Wharton, American writer
1917: Ernest Borgnine, American actor
1941: Neil Diamond, American singer/songwriter
1949: John Belushi, American actor
1961: Nastassja Kinski, German actress
Martin Eddy
14-Dec-2005, 13:06
Fuad's a bit slow isn't he. The Inq (http://www.theinquirer.net/?article=28341)
We finally confirmed that ATI's upcoming R580 GPU will again have sixteen pipelines only but this time it will have 48 Pixel Shader units. What does that mean?
Well, it's not the easiest subject to explain, but pipelines in this case can be associated with a job done by texture memory units. :roll:
Megadrive1988
14-Dec-2005, 21:08
Inq is really falling behind. They used to be "better" than this.
Fuad's a bit slow isn't he. The Inq (http://www.theinquirer.net/?article=28341)
:roll:
Inquirer / Fuad about R580:
21. 03. 2005: "same number of pipelines, similar specifications, but a faster card" (http://www.theinquirer.net/?article=21981)
09. 06. 2005: "Whatever happens with R520, R580 will have all 32 pipelines enabled." (http://www.theinquirer.net/?article=23822) ;-)
30. 08. 2005: "We don’t expect that it (R580) will be just have a different number of pipes, it might have a few changes" (http://www.theinquirer.net/?article=25802)
04. 11. 2005: "ATI will add more pipes to R580 and it will get back in the game, aiming for the performance crown." (http://www.theinquirer.net/?article=27456)
14. 12. 2005: "ATI's R580 has 16 pipelines but 48 Pixel Shader units" (http://www.theinquirer.net/?article=28341) - finally correct info about R580!
---
Beyond 3D forum / caboosemoose:
22. 03. 2005: "my information describes R520 as: 16-1-1-1 and R580 as 16-1-3-1" (http://www.beyond3d.com/forum/showpost.php?p=433130&postcount=168)
...9 months earlier than The Inquirer :D
LVSeminole
14-Dec-2005, 23:39
:lol: Yeah. It's good to know that the Inq has known what they have been talking about all this time.
LVS
caboosemoose
15-Dec-2005, 01:37
Inquirer / Fuad about R580:
21. 03. 2005: "same number of pipelines, similar specifications, but a faster card" (http://www.theinquirer.net/?article=21981)
09. 06. 2005: "Whatever happens with R520, R580 will have all 32 pipelines enabled." (http://www.theinquirer.net/?article=23822) ;-)
30. 08. 2005: "We don’t expect that it (R580) will be just have a different number of pipes, it might have a few changes" (http://www.theinquirer.net/?article=25802)
04. 11. 2005: "ATI will add more pipes to R580 and it will get back in the game, aiming for the performance crown." (http://www.theinquirer.net/?article=27456)
14. 12. 2005: "ATI's R580 has 16 pipelines but 48 Pixel Shader units" (http://www.theinquirer.net/?article=28341) - finally correct info about R580!
---
Beyond 3D forum / caboosemoose:
22. 03. 2005: "my information describes R520 as: 16-1-1-1 and R580 as 16-1-3-1" (http://www.beyond3d.com/forum/showpost.php?p=433130&postcount=168)
...9 months earlier than The Inquirer :D
Thank you fans. No autographs, please.
Martin Eddy
15-Dec-2005, 02:01
Thank you fans. No autographs, please.
Oooh pleeeeaaaaase. I could sell it on Ebay. :razz:
Dave Baumann
15-Dec-2005, 02:16
Well, the point being is that Fudo probably earnt a reasonable amount of money for each article submitted; caboose got f' all!
kemosabe
15-Dec-2005, 02:32
I'd say that's Mike Magee's problem to rectify. How many people out there would stop taking a paycheque for doing a job poorly while being held accountable for nothing? :lol:
P.S. By the way, caboose, what's your info saying about R600? :)
Dave Baumann
15-Dec-2005, 02:34
I think Fudo sometimes knows more than he's letting on, but that doesn't mean he has to let you know that straight away.
IRQ Conflict
15-Dec-2005, 03:01
So, thats Kudo's and Mulo for Fudo's FUD? is that about right Dave? :)
I think it takes a real Moose to tell the truth and not expect anything in return ;)
Well, the point being is that Fudo probably earnt a reasonable amount of money for each article submitted; caboose got f' all!
Ailuros
15-Dec-2005, 06:05
I think Fudo sometimes knows more than he's letting on, but that doesn't mean he has to let you know that straight away.
Uh oh....there's a higher degree of "nothing" <shrug> :D
Unknown Soldier
15-Dec-2005, 08:10
I think Fudo sometimes knows more than he's letting on, but that doesn't mean he has to let you know that straight away.
Or it might be he's covering all his bases.
Says this, says that .. hey .. one of them has to be right. Right!?
;)
US
For once I thought the Inquirer had something insightful to ask today. It hasn't happened before so they must be learning as they poach points of view from the folk here at Beyond3d.
The article in question (below) had the insight that ATI goes for fewer, busier, higher clocked pipes with a 3:1 pixel shader to texture unit ratio, whilst NVidia backs slower pipes but more of them and a 3:2 pixel shader to TMU ratio. The Inq postulated that those differing ratios might make it quite hard for a game developer to generally optimise games, given the majors have different balance points.
I'd love a game developer to respond to that...
http://www.theinquirer.net/?article=28341
ATI's R580 has 16 pipelines but 48 Pixel Shader units
BACK IN September, ATI told us that you need more per pixel operations then you need textures. It’s the ratio that changed over the years, but we don’t want to make a history class of this article. It said that it believes that the industry should go that way. Nvidia, on the other hand, believes in different ratios between pixels and textures.
We finally confirmed that ATI's upcoming R580 GPU will again have sixteen pipelines only but this time it will have 48 Pixel Shader units. What does that mean?
Well, it's not the easiest subject to explain, but pipelines in this case can be associated with a job done by texture memory units.
In the R580 case, the chip has 16 traditional pipelines that can process 16 textures while at the same time it can calculate 48 pixel data. The RV530 marchitecture is a good way to understand what ATI is doing now. With this chip, you have 12 pixel Shaders but just four texture memory units. Its a 3:1 ratio, as ATI believes that you need that many pixel information compared to textures.
In R520 for example ATI is using 16 pixel shaders, pipelines and has 16 matching texture memory units. In perfect scenario in each clock you can get 16 textures over 16 pipelines. Nvidia's G70 has 24 pipelines, let's call them pixel Shaders but it can process 16 textures. Nvidia is using a 3:2 ratio, as it believes that just need a little bit more pixels than textures.
We strongly believe that Nvidia is going to increase the number of its pixel operations and texture units in its upcoming G71 GPU. We believe it might move to 32 pixels/pipelines and 24 TMUs but we cannot confirm this at press time.
This might bring us back to the times when graphic companies will pressure developers to program for their hardware. It will be their choice to use 3:2 Nvidia's pixel to texture ratio or 3:1, which is what ATI wants to do. Nvidia spends more money in its TWIMTBP, The Way Its Meant To Be Played so it might win more of these battles
Ailuros
15-Dec-2005, 09:45
I'd love to have those supposed 3:2 or whatever ratios explained on GeForces *snicker*
fbomber
15-Dec-2005, 10:40
I think the inquirer is confusing texture units with rops.
G70 has 24 pipes, but only 16 rops, a 3:2 ratio.
Ailuros
15-Dec-2005, 12:13
Gaining tidbits of information is half of what you'd need for any sort of supposed journalism; the other half would be the ability to decipher said information, otherwise it's useless.
Chalnoth
15-Dec-2005, 12:54
Wow, that's just....stupid. But it's what I've come to expect from the Inquirer. Just like the Radeons, GeForces have always been capable of executing more math operations than texture operations per clock (even back before pixel shaders).
Blastman
16-Dec-2005, 11:06
I don’t think it would make much sense to triple the shader power from the R520 to R580. I think the die size would be unmanageable.
Even going to 48 single full-ALU’s would give the 580 a minimum of 50% more shader power (48 vs 32 -- 16 full-ALU + 16 mini-ALU’s) over the R520. Probably quite a bit more when one factors in how efficiently the second mini-ALU can be used on the current R520.
Martin Eddy
16-Dec-2005, 13:12
I don’t think it would make much sense to triple the shader power from the R520 to R580. I think the die size would be unmanageable.
Even going to 48 single full-ALU’s would give the 580 a minimum of 50% more shader power (48 vs 32 -- 16 full-ALU + 16 mini-ALU’s) over the R520. Probably quite a bit more when one factors in how efficiently the second mini-ALU can be used on the current R520.
It's already been established that R580 has/will have 3 times the (theoretical) shader power of R520.
Even going to 48 single full-ALU’s would give the 580 a minimum of 50% more shader power (48 vs 32 -- 16 full-ALU + 16 mini-ALU’s) over the R520.
Efficient shader compilers are harder to write than you think they are.
Uttar
Blastman
16-Dec-2005, 23:17
It's already been established that R580 has/will have 3 times the (theoretical) shader power of R520.
I read the whole thread. I didn’t see anywhere where it was confirmed that the R580 would have triple the shader power (unless I missed that).
Efficient shader compilers are harder to write than you think they are.
My thinking is that one could avoid the inefficiencies of parallelism in putting a mini-ALU with a full-ALU by just going with lots of single ALU’s.
How much extra shader power does the mini-ALU add in real world situations, considering it both has to work in parallel with the full-ALU and is a mini-ALU to begin with that can’t handle all types of calculations?
If the 16 mini-ALU’s (R520) add … let’s say 60% (a guess) more real world shader power to the full ALU’s, then the R520 is effectively worth 1.6 where the R580 is worth 3.0. 3.0/1.6 = 1.875 … the 48 single full-ALU’s would have 87.5% more shader power in real world situations over the R520. Almost double -- it would depend on how efficient those second mini-ALU’s are. I think doubling the shader power over the R520 would be considered a large jump in overall shader power.
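The arithmetic in the post above (and the "minimum of 50%" figure quoted earlier) can be written out as a small sketch. The 60% mini-ALU contribution is the poster's own guess, the 48 single full-ALU layout is the hypothetical being argued, and `effective_alus` is just an illustrative helper name.

```python
def effective_alus(full, mini, mini_gain):
    """Full-ALU equivalents, counting each mini-ALU as a fraction of a full ALU."""
    return full + mini * mini_gain

# R520 as described in the thread: 16 full ALUs, each paired with a mini-ALU.
r520_paper = effective_alus(16, 16, 1.0)  # every unit counted as a whole ALU
r520_guess = effective_alus(16, 16, 0.6)  # 0.6 is the poster's guess, not a measurement

# Hypothetical R580 with 48 single full ALUs and no mini-ALUs.
r580 = effective_alus(48, 0, 0.0)

print(f"vs. paper unit count: {r580 / r520_paper:.3f}x")
print(f"vs. 60% mini-ALU guess: {r580 / r520_guess:.3f}x")
```

The first ratio is the 48 vs 32 floor of 1.5x, the second is the 3.0/1.6 = 1.875 figure. Changing `mini_gain` shows how sensitive the conclusion is to that one guess.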
"I read the whole thread. I didn’t see anywhere where is was confirmed that the R580 would have triple the shader power (unless I missed that)."
Imagine the X1600 with 4 times the ROPs and 4 times the ALUs; that's basically the R580.
Martin Eddy
17-Dec-2005, 01:19
I read the whole thread. I didn’t see anywhere where it was confirmed that the R580 would have triple the shader power (unless I missed that).
Follow this link that was posted at the top of this page.
Beyond 3D forum / caboosemoose:
22. 03. 2005: "my information describes R520 as: 16-1-1-1 and R580 as 16-1-3-1" (http://www.beyond3d.com/forum/showpost.php?p=433130&postcount=168)
From the 57 page R520 informania (http://www.beyond3d.com/forum/showthread.php?t=18270) thread.
AlphaWolf
17-Dec-2005, 01:29
He's using the 'confirmed' argument, it won't be confirmed until it is announced.
Blastman:
You can choose to go against the consensus and make your argument, but just your personal opinion isn't going to get you very far. As you can see from the length of the debate on this topic, most everything has been covered.
This interview (http://www.digit-life.com/articles2/video/r520-part6.html) might be of interest to you as regarding shader power levels that ATI perceives necessary in the future.
If we take a look at high-tech games, released last year (Half Life 2, Far Cry) and this year (Age of Empires 3, FEAR, and Splinter Cell), you will see the ratio between texture and arithmetic operations already within 1:3.5 - 1:5. Extrapolating to the next couple of years, we can expect a larger ratio.
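For a rough comparison with that 1:3.5 to 1:5 range, here is a sketch that reduces the unit counts quoted in the Inquirer piece earlier in this thread to ALU:TEX ratios. Counting units is a crude stand-in for counting operations per clock, so treat it as illustration only; the R580 entry in particular is the rumoured configuration, not a confirmed spec.

```python
from fractions import Fraction

# (pixel shader units, texture units) as quoted in the Inquirer article above.
chips = {
    "R520": (16, 16),
    "RV530": (12, 4),
    "R580": (48, 16),  # rumoured at the time of the thread
}

# Workload range from the interview: 3.5 to 5 arithmetic ops per texture op.
WORKLOAD_LOW, WORKLOAD_HIGH = Fraction(7, 2), Fraction(5)

def alu_tex_ratio(shaders, tmus):
    """Shader units per texture unit, reduced to lowest terms."""
    return Fraction(shaders, tmus)

for name, (shaders, tmus) in chips.items():
    ratio = alu_tex_ratio(shaders, tmus)
    position = "below" if ratio < WORKLOAD_LOW else "within or above"
    print(f"{name}: {ratio.numerator}:{ratio.denominator} ALU:TEX, "
          f"{position} the quoted workload range")
```

On these numbers even a 3:1 part sits slightly under the low end of the quoted range, which is consistent with the interview's expectation that the ratio keeps growing.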
Frostwake
19-Dec-2005, 21:03
Imagine the X1600 with 4 times the ROPs and 4 times the ALUs; that's basically the R580.
And the addition of a 256 bit bus... I think this will be a very impressive card, especially in 3DMark06 (and future games of course). I wouldn't be surprised if it managed 2x the R520 score.
I'd say that's Mike Magee's problem to rectify. How many people out there would stop taking a paycheque for doing a job poorly while being held accountable for nothing? :lol:
P.S. By the way, caboose, what's your info saying about R600? :)
I understand your concern, but his job is not to be accurate. As the name of the website implies, he collects information -- whether it's right or wrong. We wouldn't be having half the information we have today. Of course he feels he's a star and thinks he has a big [excuse the expression] IT penz0r; he travels a lot, meets different people (including a handful of engineers). What do you expect him to report? Yea, the information he provides is sometimes questionable, but you don't expect him to verify every single bit?
TheInq:
A single X1600XT card will score 5093 in 3Dmark05 while the two cards will score respectable 8730 marks (in Crossfire).
http://www.theinquirer.net/?article=28422
If R580 is really 4 times the RV530, I guess it would be more or less reasonable to expect around 15 000 3DMarks in 05, especially considering the higher core clock (RV530 is 590MHz, R580 definitely more). That would be noticeably better than the 512MB 7800 'Ultra', but would it be enough to beat G80?
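One way to sanity-check the "around 15 000" guess is to reuse the CrossFire scaling from the two Inquirer scores. This is a sketch under a big assumption: that each doubling of hardware scales the 3DMark05 score by the same factor that two X1600XT cards in CrossFire achieved. It ignores clock speed, memory bandwidth and CPU limits entirely.

```python
SINGLE_X1600XT = 5093     # 3DMark05, one card (Inquirer figure)
CROSSFIRE_X1600XT = 8730  # 3DMark05, two cards in CrossFire (Inquirer figure)

def extrapolate(base_score, doubling_gain, doublings):
    """Apply the same per-doubling gain repeatedly (an assumed scaling model)."""
    return base_score * doubling_gain ** doublings

gain = CROSSFIRE_X1600XT / SINGLE_X1600XT  # observed gain for one doubling
naive = SINGLE_X1600XT * 4                 # perfect 4x scaling
scaled = extrapolate(SINGLE_X1600XT, gain, 2)  # 4x hardware = two doublings

print(f"gain per doubling: {gain:.2f}x")
print(f"perfect 4x scaling: {naive}")
print(f"two doublings at CrossFire-like scaling: {scaled:.0f}")
```

Under that model the estimate lands just under 15 000, well short of the 20 372 that perfect scaling would give, so the guess above is at least in a plausible range.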
AlphaWolf
19-Dec-2005, 22:33
TheInq:
A single X1600XT card will score 5093 in 3Dmark05 while the two cards will score respectable 8730 marks (in Crossfire).
http://www.theinquirer.net/?article=28422
If R580 is really 4 times the RV530, I guess it would be more or less reasonable to expect around 15 000 3DMarks in 05, especially considering the higher core clock (RV530 is 590MHz, R580 definitely more). That would be noticeably better than the 512MB 7800 'Ultra', but would it be enough to beat G80?
90nm G7x vs r580, G80 vs r600 I'd think are the battles that are setting up for next year.
Some R580 and RV560 info.. . .
http://www.theinquirer.net/?article=28460
R580 boards wandering the countryside (can 3DM scores be far behind?), and I think that's the first I've seen someone land a solid timeframe on RV560.
trinibwoy
20-Dec-2005, 13:25
Some R580 and RV560 info.. . .
http://www.theinquirer.net/?article=28460
R580 boards wandering the countryside (can 3DM scores be far behind?), and I think that's the first I've seen someone land a solid timeframe on RV560.
Hmm only about a month to launch - would have thought the Inq would have more "concrete" info by now. We should at least have some clockspeeds or something already.
Bouncing Zabaglione Bros.
20-Dec-2005, 14:09
Hmm only about a month to launch - would have thought the Inq would have more "concrete" info by now. We should at least have some clockspeeds or something already.
ATI managed to keep things nailed up pretty tightly until just before R520 launched - maybe they've been able to do the same again with R580. Like last time however, that may imply that launch and shipping will not happen together.
Dave Baumann
20-Dec-2005, 14:21
OEMs have had eval samples for a while now.
trinibwoy
20-Dec-2005, 14:54
OEMs have had eval samples for a while now.
Which makes it even more curious why nothing solid has leaked yet....
Which makes it even more curious why nothing solid has leaked yet....
NDA finally working?
pakotlar
20-Dec-2005, 15:24
NDA finally working?
:D hah!
PatrickL
20-Dec-2005, 15:27
Maybe because Nvidia's "leaks" are part of their marketing plan to create buzz, while ATI uses another way?
NDA finally working?
AIBs/OEMs not wanting to crap their own nest during Christmas buying season?
trinibwoy
20-Dec-2005, 15:33
Maybe because Nvidia's "leaks" are part of their marketing plan to create buzz, while ATI uses another way?
Actually, if I remember correctly, solid info on upcoming Nvidia parts is usually harder to come by. Look at how long ago we knew about the pipeline configuration of R520/R580 - we knew practically nothing about G70 until reviews popped up.
Actually, if I remember correctly, solid info on upcoming Nvidia parts is usually harder to come by. Look at how long ago we knew about the pipeline configuration of R520/R580 - we knew practically nothing about G70 until reviews popped up.
I would agree with that. Can't decide if it is competitive secrecy, trying to hold to the "hard launch" mantra, or both.
AIBs/OEMs not wanting to crap their own nest during Christmas buying season?
Aye, given R580's seemingly imminent arrival, it's probably a case of making hay while the sun shines/snow falls, especially with decent X1800XL/XT supplies about and 512MB GTX supplies seriously constrained.
Does anyone have info on those new X1900 chips?
I mean, is the 1900XL supposed to replace the X1800XL
and the 1900XT replace the 1800XT,
or do they come in addition to the line? At about the same prices??
Because I was about to purchase a 1800XL card, but I don't know about that anymore :\
Kanyamagufa
20-Dec-2005, 16:59
Does anyone have info on those new X1900 chips?
I mean, is the X1900XL supposed to replace the X1800XL
and the X1900XT replace the X1800XT,
or do they come in addition to the current line, at about the same prices?
A very good point. Is R580 going to replace the R520 altogether, or expand the entire line for ATI?
No point in expanding, in my opinion. The problem is they [ATi] will have to lower the price of the X1800 XT by a great margin. Those parts (as has always been the case) will most likely replace this year's chips.
A very good point. Is R580 going to replace the R520 altogether, or expand the entire line for ATI?
Replace the R520 chip completely. ATI is no longer ordering R520 chips to be fabbed. They will run through their existing inventory, but no telling how long those boards will remain in the channel.
kemosabe
20-Dec-2005, 17:28
Considering they're getting higher than expected yields on 625MHz+ X1800XT and availability seems strong, I would think they'll want to continue selling them at the ~$400 price point (competing with the 256MB GTX) when R580 comes along and bring the XL down to $300 where they currently have nothing very competitive.
IMO, this is what things might look like when the new chips are released and shipping in volume (street prices):
X1300/Pro ---> sub-$100
X1600Pro/XT ---> up to $150
X1700Pro/XT ---> $200-$250
X1800XL ---> $300-350
X1800XT ---> $400-450
X1900XL? ---> $500
X1900XT ---> $550-600
kemosabe
20-Dec-2005, 17:29
ATI is no longer ordering R520 chips to be fabbed. They will run through their existing inventory, but no telling how long those boards will remain in the channel.
Speculating or you know this for a fact?
Replace the R520 chip completely. ATI is no longer ordering R520 chips to be fabbed. They will run through their existing inventory, but no telling how long those boards will remain in the channel.
With the Inquirer reporting that the board design for R580 is pin-compatible with R520 (with other minor changes) might the final R520s appear on the R580-revision of the board?
That would seem to me to be a good way of smoothing out the cut-over from R520 to R580, while providing bargain hunters with hope of some seriously cheap R520s throughout the months following R580's release.
Jawed
Considering they're getting higher than expected yields on 625MHz+ X1800XT and availability seems strong, I would think they'll want to continue selling them at the ~$400 price point (competing with the 256MB GTX) when R580 comes along and bring the XL down to $300 where they currently have nothing very competitive.
People have already reported within the past month seeing X1800XL in retail stores for $300.
As for the item of R520 still being produced, it had been reported (rumors?) that ATI has not ordered any more chips to be fabbed after the beginning of December.
There seems to be grumbling about 90nm capacity. Gotta wonder if that is/was a factor in their thinking.
People have already reported within the past month seeing X1800XL in retail stores for $300.
As for the item of R520 still being produced, it had been reported (rumors?) that ATI has not ordered any more chips to be fabbed after the beginning of December.
Doesn't it take about 3 months for a wafer order to be satisfied?
So there'll be fresh X1800s entering retail in March.
Jawed
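For anyone who wants to check the lead-time arithmetic behind that March estimate, here is a minimal sketch. It assumes the "about 3 months" figure from the post and reads "beginning of December" as 1 December 2005; both are illustrative assumptions rather than confirmed numbers, and `add_months` is a hypothetical helper written just for this example.

```python
from datetime import date


def add_months(start: date, months: int) -> date:
    """Shift a date forward by whole months, clamping the day to 28 so it stays valid."""
    total = start.year * 12 + (start.month - 1) + months
    year, month_index = divmod(total, 12)
    return date(year, month_index + 1, min(start.day, 28))


# Assumption: the "about 3 months" wafer lead time quoted in the thread.
LEAD_TIME_MONTHS = 3

# Assumption: "beginning of December" taken as 1 December 2005.
last_order = date(2005, 12, 1)

arrival = add_months(last_order, LEAD_TIME_MONTHS)
print(arrival.isoformat())
```

Under those assumptions the last wafers ordered would come back around the start of March 2006, which lines up with the estimate above. A shorter or longer real lead time shifts the date accordingly.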
kemosabe
20-Dec-2005, 17:55
There seems to be grumbling about 90nm capacity. Gotta wonder if that is/was a factor in their thinking.
Rich Heye recently made it clear that ATI was seeing no 90nm capacity constraints at TSMC.
kemosabe
20-Dec-2005, 18:09
People have already reported within the past month seeing X1800XL in retail stores for $300.
As for the item of R520 still being produced, it had been reported (rumors?) that ATI has not ordered any more chips to be fabbed after the beginning of December.
XLs at $300 are still very rare birds - most are still $350 and up.
Josh wrote (http://www.penstarsys.com/) on Nov. 16:
It looks as though all R520 chip production will be halted by December, so the final orders will be filled by finished products by March.
Not quite sure where that idea originated but hopefully Orton will shed some light on R520/R580 plans at tomorrow morning's conference call.