The NEXT LAST R600 Rumours & Speculation Thread
Pressure
05-May-2007, 11:31
Looks nice; this ice/cold sticker on the cooler is much better than the reference flames :smile:
I couldn't care less, it is not like I am looking inside my computer. I tend to look at the screen connected to my computer but people are different I know... ;)
NocturnDragon
05-May-2007, 12:41
I couldn't care less, it is not like I am looking inside my computer. I tend to look at the screen connected to my computer but people are different I know... ;)
Cases are for wimps. :P
Skinner
05-May-2007, 13:51
Still waiting for a fist in depth writeup about IQ and perf. under heavy shader/gfx environments, like Stalker indoor... etc.
Still waiting for a fist in...
You're willing to do that for a videocard? :eek:
Arnold Beckenbauer
05-May-2007, 13:58
Or 64 Vec4+1 ALUs with 5 MADDs and you get 475 GFLOP/s (740 MHz).
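The arithmetic behind that figure can be sketched as follows. This is only a back-of-envelope check, assuming each MADD counts as two floating-point operations (one multiply, one add); the function name `peak_gflops` is made up for illustration. It comes out at roughly 473.6 GFLOP/s, which the post rounds to 475.

```python
def peak_gflops(units: int, lanes_per_unit: int, clock_mhz: float,
                flops_per_madd: int = 2) -> float:
    """Theoretical peak in GFLOP/s if every lane issues one MADD per clock.

    flops_per_madd=2 is an assumption: a multiply-add is counted as two
    floating-point operations.
    """
    return units * lanes_per_unit * flops_per_madd * clock_mhz / 1000.0


# 64 Vec4+1 ALUs, 5 MADD lanes each, 740 MHz (numbers taken from the post)
r600_peak = peak_gflops(units=64, lanes_per_unit=5, clock_mhz=740)
print(f"{r600_peak:.1f} GFLOP/s")
```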
R520's eight VS ALUs are "5D" - Vec4+1 and grouped in one "block" (SIMD).
The next idea:
:lol:
I win.
http://directupload.com/files/myjzmivzgmyw5iwzdgxi.jpg
http://img251.imageshack.us/img251/6121/superscalar1tg0.jpg
Is it possible that R600 splits all instructions > Vec1 into small Vec1 instructions like G80?
Chalnoth
05-May-2007, 14:07
Awwww....look at that little branch execution unit! It's so cute!
(sorry, couldn't resist)
Skinner
05-May-2007, 14:14
You're willing to do that for a videocard? :eek:
Yeah, not bad huh ;)
Is it possible that R600 splits all instructions > Vec1 into small Vec1 instructions like G80?
Yes:
http://forum.beyond3d.com/showthread.php?p=978652#post978652
http://forum.beyond3d.com/showthread.php?p=979437#post979437
except that, per pixel, the scalar instructions are issued concurrently rather than being issued sequentially as in G80.
Jawed
trinibwoy
05-May-2007, 15:24
I'm still baffled by their decision to go this way yet continue to rely on instruction co-issue.
Actually, maybe not so much. The other option was to issue the same scalar instruction for a different object per slot. But to keep a decent batch size these slots would have to be broken up in clusters - say 16 each given ATI's obsession with DB performance? But each cluster would then require its own set of control logic, thread scheduling, cache etc. Probably not possible to do this with 320 scalars / 20 clusters in a reasonable transistor budget - so the other option is fewer clusters at a higher clock. Oh wait, that's G80 :)
It's going to be veeeeery interesting to see how these two architectures scale at 65nm.
except that, per pixel, the scalar instructions are issued concurrently rather than being issued sequentially as in G80.
Isn't this going to make scheduling (given register file restrictions) a headache?
The pro is that you need fewer pixels to be processed in flight..
On the evening of May 10th, AMD is throwing a party in San Francisco for you! There will be demos of the soon-to-market Barcelona and Radeon 2900 XT as well as the chance to play Prey DM on some of the newest AMD hardware. AMD staff will be there to answer your questions. Our own HardForum members will get priority and are encouraged to take pictures and post about the event. Please email Kyle@HardOCP.com with an invitation request for you and a guest. I will be there as well as other Web journalists to partake in the fun. You must be 21 and IDs will be checked at the door. See you there, Kyle!
http://www.hardocp.com/
AMD party in SF. Too bad it's such a long trip from Sweden. :lol:
Arnold Beckenbauer
05-May-2007, 16:46
http://www.hardocp.com/
AMD party in SF. Too bad it's such a long trip from Sweden. :lol:
Let me guess: Alcatraz?
Sound_Card
05-May-2007, 16:54
Let me guess: Alcatraz?
I fell out of my chair.:lol:
It's a trap by AMD for Kyle.:razz:
PatrickL
05-May-2007, 16:57
Web journalist? Is it a code to let people know you are in fact not a journalist at all?
Let me guess: Alcatraz?
Doesn't say. :smile:
Didn't ATI and Valve have an event there before they launched R520 and HL2?
Arnold Beckenbauer
05-May-2007, 17:01
Doesn't say. :smile:
Didn't ATI and Valve have an event there before they launched R520 and HL2?
R360 aka 9800XT. And HL2 voucher.
R360 aka 9800XT. And HL2 voucher.
Ahh yes! Time does fly. :wink:
Mintmaster
05-May-2007, 17:45
Isn't this going to make scheduling (given register file restrictions) a headache?
The pro is that you need fewer pixels to be processed in flight..
I agree. It doesn't really make much sense to design it this way.
The fewer pixels in flight is only a marginal advantage, too. Because the texture latency is a few hundred cycles, that is your primary design factor for hiding latency, and it doesn't matter much if you co-issue arithmetic instructions or run them sequentially. 300 cycle latency for 16 texture units, for example, needs 4800 pixels in flight to maintain throughput. 320 scalar units with a lengthy 15 cycle latency can operate efficiently with that. On top of that, R580 apparently had 25k pixels in flight (http://www.beyond3d.com/content/reviews/2/3) (talk about overkill!), so that doesn't seem to be an issue.
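The latency-hiding arithmetic in that paragraph can be sketched roughly like this. It assumes the simplest possible model, in which every unit accepts one new pixel per clock and therefore needs `latency × units` pixels in flight to stay busy; the function name is hypothetical and the figures are the ones quoted in the post.

```python
def pixels_in_flight(latency_cycles: int, units: int) -> int:
    """Pixels needed to keep `units` busy while each result takes
    `latency_cycles` to come back (one new pixel per unit per clock)."""
    return latency_cycles * units


# 16 texture units with a 300-cycle texture latency
texture_need = pixels_in_flight(latency_cycles=300, units=16)

# 320 scalar ALUs with a 15-cycle arithmetic latency
alu_need = pixels_in_flight(latency_cycles=15, units=320)

print(texture_need, alu_need)
```

Under this model both requirements land on the same number, which is the point being made: the pixels already needed to cover texture latency are enough to cover the ALU latency as well.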
The only reason I can think of for doing this is that maybe the batch size is small, so if R600 is 4 groups of a texture quad and 80 stream processors, then you can't easily attack the math one channel at a time (G80, being comprised of 8 groups of a texture quad and 16 stream processors, doesn't have that problem). Still, I'd rather divide the chip into smaller groups and share the texture quads rather than go through all this co-issue scheduling.
I really wish that for R600's shader architecture ATI just doubled Xenos and went scalar in the Nvidia/SOA way (with, of course, DX10 additions). Process the 64-pixel batches one channel at a time instead of 16 pixels at a time.
On the other hand, ATI did fit more than double the math units (but half the texture units) in the same space as G80, and Nvidia trounced ATI in this respect last generation. Maybe we're missing something.
leoneazzurro
05-May-2007, 17:53
I agree. It doesn't really make much sense to design it this way.
The fewer pixels in flight is only a marginal advantage, too. Because the texture latency is a few hundred cycles, that is your primary design factor for hiding latency, and it doesn't matter much if you co-issue arithmetic instructions or run them sequentially. 300 cycle latency for 16 texture units, for example, needs 4800 pixels in flight to maintain throughput. 320 scalar units with a lengthy 15 cycle latency can operate efficiently with that. On top of that, R580 apparently had 25k pixels in flight (http://www.beyond3d.com/content/reviews/2/3) (talk about overkill!), so that doesn't seem to be an issue.
The only reason I can think of for doing this is that maybe the batch size is small, so if R600 is 4 groups of a texture quad and 80 stream processors, then you can't easily attack the math one channel at a time (G80, being comprised of 8 groups of a texture quad and 16 stream processors, doesn't have that problem). Still, I'd rather divide the chip into smaller groups and share the texture quads rather than go through all this co-issue scheduling.
I really wish that for R600's shader architecture ATI just doubled Xenos and went scalar in the Nvidia/SOA way (with, of course, DX10 additions). Process the 64-pixel batches one channel at a time instead of 16 pixels at a time.
On the other hand, ATI did fit more than double the math units (but half the texture units) in the same space as G80, and Nvidia trounced ATI in this respect last generation. Maybe we're missing something.
What if R600 is divided in 4 groups and each group has more levels consisting of 4 subgroups of 4 shader units with 5 scalar processors each (we already know that they are grouped by 5, with one unit to handle the SF)?
Thanks for the good post Mint.
Also, we don't know how orthogonal R600's stream processors are; for example, from the leaked slides it is not very clear whether they support integer math on any ALU, or at least on most of them.
Kinc is such a tease.. :lol:
http://www.kinc.se/05.JPG
Can't tell you much about the screen but looks good for a single graphics card with its default cooler.
http://www.nordichardware.com/forum/viewtopic.php?topic=8428&forum=45
06 scores too.
http://www.kinc.se/06.JPG
I'm still baffled by their decision to go this way yet continue to rely on instruction co-issue.
It seems like an evolution of the R3xx pipeline (which, don't forget, is four-instructions per object). Fat trimmed off, D3D10-specific stuff added-in.
I've decried the NVidia superscalar pipeline in NV4x/G7x for the complexities of instruction issue, how instruction dependencies and ordering can really hammer the utilisation (along with the complexities of texture addressing being dependent upon the top ALU). The R3xx pipeline seems only marginally better.
R600 is more symmetrical, but I definitely wonder how well it's going to maintain utilisation. For example we don't know if the special function unit is single-cycle for all instructions - that can mess things up.
Actually, maybe not so much. The other option was to issue the same scalar instruction for a different object per slot.
Yay, thread packing, as I waxed lyrically many moons ago.
But to keep a decent batch size these slots would have to be broken up in clusters - say 16 each given ATI's obsession with DB performance?
Well I proposed something more fiddlesome to do with instruction windowing as well as thread packing (effectively packing 1, 2, 3 or 4 batches of objects to make up a temporary batch for each instruction) to keep batch sizes low whilst maintaining high utilisation etc.
The R600 ALU pipeline seems like it'll be distinctly more efficient than the R3xx pipeline (which is also in R5xx). Worst case is no slower than R3xx, but wasting less units, while the best case would appear to be multiples faster. I just dunno what the typical case is, 20% more throughput per clock? 50%?...
Also, there's been no slides leaked about dynamic branching. That may be because R600 DB is nothing to write home about...
It's going to be veeeeery interesting to see how these two architectures scale at 65nm.
I suppose I see 65nm as being more about die size than clocks (and about cutting power/heat). ALUs, historically, seem to be a relatively small part of the die - 30% tops kinda thing.
At the same time I wonder if AMD will ever get past using four shader units (16 TUs, 16 RBEs + some variable quantity of ALU pipelines) in a GPU or just ramp up the clock rate.
Erm...
Jawed
Isn't this going to make scheduling (given register file restrictions) a headache?
It'd be nice to know for sure about the old R3xx pipeline design, but it really does seem like it's a four-instruction issue pipeline:
vec3 MAD - 3 operands
scalar MAD/SF - 3 operands or 1 operand? How many of the 3 operands can be distinct from those associated with the vec3 MAD?
vec3 ADD - 2 operands
scalar ADD/SF - 2 operands? SF present?
R600 is hairier, agreed. I'm just not convinced it's a night and day difference in terms of operand fetch complexity. I think MAD instructions (3 operands) are the worst part of it, because it would appear to require a 3-way (all duplicates) register file.
Jawed
The only reason I can think of for doing this is that maybe the batch size is small, so if R600 is 4 groups of a texture quad and 80 stream processors, then you can't easily attack the math one channel at a time
You might argue that ATI didn't want it to have more than 4 shader units, because of its memory tiling (textures and render targets) and hierarchical-Z tiling systems. Additionally, if ATI didn't have the tools to build a 2x-clocked ALU pipeline, then a sequential-scalar ALU would create a large increase in batch size. This is assuming that they've kept the 1-instruction=4-clocks organisation. If they went serial-scalar, that'd produce batches 5x bigger (now there's 5x more pixels per clock), hurting DB. Helps the texture caches, though.
Maybe we're missing something.
I know I'm missing something with the texturing units, so there's prolly more in there...
Jawed
silent_guy
05-May-2007, 20:03
I'm still baffled by their decision to go this way yet continue to rely on instruction co-issue.
I think it's a logical decision to go this way: a nice evolution of something they already had. Definitely more efficient, but with the option to initially reuse the compiler that was already there (while temporarily forgoing the additional efficiency until improvements are made.)
Not knowing the architecture of the competition, it's the low risk and low cost way to transition to your next product. Most established companies work that way, especially when they are resource limited due to a number of other projects on the side (xBox/Wii): instead of throwing everything overboard you only improve existing stuff and spend most time on what needs to be designed from scratch. (DX10 stuff, tessellation, ...)
I don't really see how the shader architecture is going to be the source of performance problems for 3D stuff.
It's going to be veeeeery interesting to see how these two architectures scale at 65nm.
I wouldn't expect too much, if they are just evolutions: the incremental speed gain going from 80nm to 65nm isn't very impressive. And that's to be expected with wire delays becoming the determining factor even more...
I really think that the instructions issued to the individual ALUs include wide (4vec + scalar) as well as small (scalar) ops, simply because that would reduce the needed bandwidth a whole lot. And they might include a loop counter as well as some conditionals. Those conditionals could be applied to the exit condition, or the conditional writing/branching of individual instructions.
Could you please explain that ?
We will show two examples
Let's add two 4D vectors and two 1D scalar information. ATI will do this in a single clock because it has one 4D and one 1D unit to spend. Nvidia will do this in a single clock but it will take 5 units from 128 possible while ATI used only 2 from 128 possible (64 4D + 64 1D).
ATI is theoretically faster, but Nvidia has Shaders that run at 1.35 GHz or some 60 percent faster than ATI's that run at 800 MHz. This makes the situation much better for Nvidia. This example is very rare in real games. Shaders very rarely use the operations with 4D vectors. Shaders usually work with 3D data such as 3D coordinates, 3D normals and 3D RGB channels without alpha. In some cases Shaders use 2D functions only e.g. 2D coordinate in texture or even 1D.
Example 2:
Let's add two 3D vectors, two 2D Vectors and one 1D scalar. Nvidia again needs 5 Shaders or 2+2+1 units while ATI cannot do this in a single clock. ATI needs a single clock to add two 3D vectors and at the same clock it can add the 1D scalar. However it needs a second clock to add 2 2D vectors. This means that ATI used 1+1 units in the first and a single unit in the second clock; together it used three units in two clocks. As R600 Shaders are 60 percent slower than Nvidia's it means that R600 is two times slower than G80 in this particular example.
If you decide to make a Shader that will add eight 1D vectors Nvidia can take 8 units in a single clock and ATI has to do it in four clocks as it can use 1 4D and 1 1D unit per clock to finish this operation. In this case Nvidia can be four times faster and still finishes the calculation faster.
Even if it looks really bad for ATI there is still some hope for the R600. You have to take into consideration that Microsoft HLSL (Higher Level Shader Language) compiler will do its small miracle and will try to optimize the Shader code. Compiler will try to "glue" a few scalars in a single VEC4 unit.
Compiler
We have two scalar informations, the first one called Fudo and second one called JenHsun. The compiler will glue these two scalar values in a single 2D vector called "FudoJenHsun" and it will access Fudo scalar part as the FudoJenHsun.X and JenHsun part as FudoJenHsun.Y. The good part is that all possible operations with a new variable 2D vector FudoJenHsun will run parallel as one 2D vector. This won't help Nvidia at all, but it will really mean a lot for ATI. Nvidia still has an advantage of having full flexible scalar units, while ATI doesn’t, at least not that we or any of our sources knows of.
We know it is a complicated part but we could not simplify things much more than this. We apologise for the complexity of the part but it just doesn’t go easier than that. We simplified it as much as we could.
and will the future games use these vec4 instructions for shaders ?
Thx
Chalnoth
05-May-2007, 21:51
There's really no reason to expect any increase in vec4 instruction utilization.
so how ppl expect a better performance of R600 than G80 in DX10 games ??
Could you please explain that ?
Thx
source: Fudzilla. That's the whole explanation :-)
but i really wanna know why r600 would be better in Dx10 and G80 is better in DX9
will DX10 take more advantage of the Vec5 *vec4 + scalar* architecture and R600's 320 Sps will be used in a more efficient way ?
It will because it must because it's from ATi and it's late and it may not be a scorcher in terms of general performance so there must be a catch because ATi doesn't make sucky stuff ever ever ever.
On a serious note, I think it's fairly hard to predict whether or not it'll rule WRT DX10, because we've yet to see typical DX10 workloads (no, dubious demos quoted in marketing slides don't count). It'll come down to what ppl actually do with DX10 and how that jibes with the 2 competing architectures. The R600 seems to be more adept than the G80 when it comes to Geometry shading; whether or not that'll matter/it is fast enough for it to matter is a different thing. Then there's compiler magic that comes into play... until the NDA is off and a pertinent analysis can be performed, it's all a complex exercise of who can pull theories from the deeper parts of their anatomy at a greater pace (no disrespect intended towards fellows like Jawed, mind you)
Say, you have an instruction stream coming into your GPU, that uses half the available memory bandwidth. The other half is used for data. Say, you have 256 ALUs (processing elements), that all need a new instruction each clock to be able to do something useful. And each instruction + operands is 64 bits wide (that's very conservative), and you can read one each clock from memory. That requires a 256 * 64 * 2 (for data) = 32768 (!!) bits wide bus.
Of course, that isn't practical. Fortunately, GPUs are SIMD: Single Instruction, Multiple Data. In other words: the ALUs are grouped. Each instruction is executed multiple times in parallel. And there are other ways to reduce the bandwidth (bus width) needed: either you use instructions that tell multiple groups of ALUs what to do (like the superscalar R600), or you execute those instructions over multiple clock cycles (like the serial 8800). Both ways, you reduce the amount of instructions needed each clock to a manageable amount.
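The bus-width estimate, and the effect of grouping ALUs so that one instruction drives a whole group, can be sketched as below. The 256 ALUs, 64-bit instruction and factor of 2 for data come from the post; the SIMD group width of 16 is purely an illustrative assumption, and the function name is made up.

```python
def instruction_bus_bits(alus: int, bits_per_instruction: int,
                         data_factor: int = 2, simd_width: int = 1) -> int:
    """Bus width needed if every group of `simd_width` ALUs fetches one
    instruction per clock, with `data_factor` covering the data traffic."""
    groups = alus // simd_width
    return groups * bits_per_instruction * data_factor


# One instruction per ALU per clock: the impractical case from the post
naive = instruction_bus_bits(alus=256, bits_per_instruction=64)

# Same ALUs, but sharing one instruction across a group of 16
# (16 is an assumed group size, not a known hardware figure)
grouped = instruction_bus_bits(alus=256, bits_per_instruction=64, simd_width=16)

print(naive, grouped)
```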
Branching is further increasing the need for instruction scheduling, or reducing the throughput. Because, at each if..then..else statement, some of the elements might go one way, and the others the other way. At that point, you can split the streams (doubling the amount of ALUs and bandwidth needed), calculate both possibilities in sequence and only write the ones that are valid for that case, or simply calculate both paths in sequence. The latter two both halve the throughput every time they happen.
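A toy model of the throughput cost of a divergent branch, under stated assumptions: a batch shares one instruction stream, a boolean mask records which elements take the `then` side, and a side is skipped only when no element in the batch needs it. This is a simplification for illustration, not a description of how either GPU actually schedules branches; all names are hypothetical.

```python
def branch_cycles(mask: list[bool], then_cycles: int, else_cycles: int) -> int:
    """Cycles a SIMD batch spends on an if/else when the two sides run
    in sequence and results are written only for the matching elements."""
    needs_then = any(mask)        # at least one element takes the 'then' side
    needs_else = not all(mask)    # at least one element takes the 'else' side
    return (then_cycles if needs_then else 0) + (else_cycles if needs_else else 0)


uniform = branch_cycles([True] * 16, then_cycles=4, else_cycles=4)
divergent = branch_cycles([True] * 15 + [False], then_cycles=4, else_cycles=4)

print(uniform, divergent)
```

With equal-length sides, a single stray element doubles the cycles spent on the branch, which is the halving of throughput described above.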
As the R600 has 4 ALU blocks, that all consist of 6 (4 vec, 1 scalar and 1 branching/conditional) units, which all have to receive instructions each clock, you either need instructions that tell all of them what to do in all cases (from 6 independent, scalar instructions, up to a single combined vec4 + scalar and a conditional instruction). That requires either very long instruction words (VLIW, say up to 1024 bits each for each block), or clever scheduling.
Because, most combined instructions can be simplified into a single instruction for the whole ALU block. But, if you only schedule a single instruction for a single ALU, all the others would be wasted for that clock.
So, it's most likely, that they kept the instruction length manageable, but made it possible to stack instructions, so they can be executed all at once.
silent_guy
05-May-2007, 22:41
Could you please explain that ?
We will show two examples
Let's add two 4D vectors and two 1D scalar information. ATI will do this in a single clock because it has one 4D and one 1D unit to spend. Nvidia will do this in a single clock but it will take 5 units from 128 possible while ATI used only 2 from 128 possible (64 4D + 64 1D).
... (I stopped reading after this)
Julidz, forget everything of the paragraph above: it's completely wrong.
ATI will do the 4D operation and the 1D operation in 1 clock cycle and can do 64 of those actions (for 64 different threads) at the same time.
The 8800GTX requires 5 clock cycles to do them and can do 128 in parallel, but at a higher clock speed.
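Put into numbers, the comparison looks roughly like this. It is a sketch that takes the thread counts and cycle counts from the two sentences above and, as an assumption, the 800 MHz and 1.35 GHz clocks from the quoted article; the function name is made up and real throughput depends on much more than this.

```python
def vec4_plus_scalar_rate(threads_in_parallel: int, clocks_per_thread: int,
                          clock_ghz: float) -> float:
    """Billions of (4D op + 1D op) pairs completed per second."""
    return threads_in_parallel / clocks_per_thread * clock_ghz


# R600: 64 threads, 4D + 1D issued together in 1 clock, 0.8 GHz (assumed clock)
r600_rate = vec4_plus_scalar_rate(64, 1, 0.8)

# 8800GTX: 128 threads, 5 scalar clocks each, 1.35 GHz (assumed clock)
g80_rate = vec4_plus_scalar_rate(128, 5, 1.35)

print(f"R600: {r600_rate:.2f} G/s, G80: {g80_rate:.2f} G/s")
```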
...and will the future games use these vec4 instructions for shaders ?
AFAIK, current shaders are already using 4D instruction: they are used in homogeneous [x,y,z,w] coordinate system. (But I'm sure somebody will correct me if I'm talking BS. :wink:)
Arnold Beckenbauer
05-May-2007, 22:46
but i really wanna know why r600 would be better in Dx10 and G80 is better in DX9
will DX10 take more advantage of the Vec5 *vec4 + scalar* architecture and R600's 320 Sps will be used in a more efficient way?
Let's add two 4D vectors and two 1D scalar information. ATI will do this in a single clock because it has one 4D and one 1D unit to spend. Nvidia will do this in a single clock but it will take 5 units from 128 possible while ATI used only 2 from 128 possible (64 4D + 64 1D).
Pardon?
G80 never works on Vec2, Vec3, Vec4... instructions, but always on Vec1. If there is, e.g., a Vec4 instruction, G80's compiler splits it into four Vec1 instructions and it takes four cycles to work on it.
Imagine: You have to work on four RGBA quads. The G70's quad ALU (4D) needs four cycles for these four RGBA quads: first cycle - first RGBA quad, second cycle - second, third...
The G80 splits these four quads in four pixel groups with 16 pixels: 16*R, 16*G, 16*B, 16*A. A Vec16-ALU needs four cycles for these four groups: first cycle - 16*R, second cycle - 16*G...
A Vec16-ALU is clocked up to 1512 MHz.
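The regrouping described above, from 16 RGBA pixels to four 16-wide channel groups, amounts to a transpose. A minimal sketch follows, with made-up pixel values and a hypothetical helper name; it only illustrates the data layout, not how the hardware does it.

```python
def to_channel_groups(pixels: list[tuple[int, int, int, int]]) -> dict[str, list[int]]:
    """Turn a list of (r, g, b, a) pixels into one list per channel."""
    r, g, b, a = (list(channel) for channel in zip(*pixels))
    return {"R": r, "G": g, "B": b, "A": a}


# Four quads = 16 pixels; values are arbitrary placeholders
pixels = [(i, i + 100, i + 200, i + 300) for i in range(16)]
groups = to_channel_groups(pixels)

# A 16-wide scalar unit would then take one channel group per cycle
for cycle, name in enumerate("RGBA", start=1):
    print(f"cycle {cycle}: 16*{name} ->", groups[name][:4], "...")
```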
And now: you can forget all this pixel stuff. Current GPUs like R520/R580 or G80 work on threads/batches/green elephants.
What really matters is: input (10101010101...) and output (what you see on your display).
The only reason I can think of for doing this is that maybe the batch size is small, so if R600 is 4 groups of a texture quad and 80 stream processors, then you can't easily attack the math one channel at a time (G80, being comprised of 8 groups of a texture quad and 16 stream processors, doesn't have that problem). Still, I'd rather divide the chip into smaller groups and share the texture quads rather than go through all this co-issue scheduling.
Or eight SIMDs and four fully decoupled TUs (texture units)?
silent_guy
05-May-2007, 22:52
but i really wanna know why r600 would be better in Dx10 and G80 is better in DX9
will DX10 take more advantage of the Vec5 *vec4 + scalar* architecture and R600's 320 Sps will be used in a more efficient way ?
If you look at pure calculation power, R600 has the advantage wrt the amount of calculations per second it can do. That's very nice to have and will almost certainly make it a winner for a number of calculation oriented benchmarks.
But the execution pipes (shaders) don't have much to do with the DX9 vs DX10 performance. For that, the stuff surrounding the shaders is what counts: how do you store geometry shading data? How do you feed the shaders with data? How fast can it do branching?
We currently don't know a lot about the DX10 organization for G80, and even less for R600. It's impossible to know how they will behave for DX10 games, but if there's a major difference, I think it's reasonable to say that the difference in shader executing pipeline won't be the major factor.
Right now, I know of only 1 report about relative GS performance, and that's a blog post about MS Flight Simulator. Not exactly a large body of evidence to go on...
Silent_Buddha
05-May-2007, 23:08
Aye it'd be interesting to hear from the developers of Call of Juarez, Company of Heroes, Crysis, etc, who are either working on DX10 or patching their games to DX10 what they think of the two architectures. Although I'm supposing they are also still under NDA from ATI.
From the admittedly VERY few screens of Call of Juarez, the DX10 version looks absolutely nothing like the DX9 version that I tried out. Well other than buildings and terrain being in the same place. :D
I'm not sure however if that is a sign of things to come, or if it's just that they've had an extra year or so to add more bling to the game.
Regards,
SB
btwango
06-May-2007, 03:01
What are the ramifications of the R600 "doing" audio as well as graphics? Is this going to render (no pun intended) after market sound cards superfluous?
Anarchist4000
06-May-2007, 04:31
What are the ramifications of the R600 "doing" audio as well as graphics? Is this going to render (no pun intended) after market sound cards superfluous?
The main benefits of an actual sound card are:
1) Offload processing requirements from the CPU
2) Better sound quality with analog output.
3D effects are the only thing that comes to mind that requires serious audio processing and with multicore CPUs being more prevalent the need to offload the processing to a card seems increasingly less. As far as quality goes a CPU can do just as well as a card, if not better since there are no processing restrictions. The big sound difference in the past came from better components used to output analog signals. With digital signals there isn't a whole lot that can be done to improve their quality component wise.
I'd be willing to bet that the only audio functions the card will have will be to output that digital signal through HDMI. Most onboard audio solutions are nothing but software. The hardware part is just a matter of analog to digital conversion and vice versa.
Unknown Soldier
06-May-2007, 07:38
http://img405.imageshack.us/img405/2517/x2900xt102401319944ik7.png
http://img405.imageshack.us/img405/5362/x2900xt10240231a133qw8.png
http://www.xtremesystems.org/forums/showthread.php?t=143104
Hi,
I'm surprised that no one picked up that the GDDR4 1GIG card seems to be underclocked. The original rumours spouted that the GDDR4 card(ala XTX?) would use 2.2(or 2200Mhz) yet these pics only show 2000Mhz.
Of course as someone mentioned, these are most probably developer cards thus why it only has 2000Mhz and is called XT.
US
HD2900XT GDDR4 OEM? (750/2000). Maybe the same version, like in DailyTech "preview"... (they call it XTX (http://dailytech.com/ATI+Radeon+HD+2900+XTX+Doomed+from+the+Start/article7052.htm), but I think they are just wrong :-) )
The main benefits of an actual sound card are:
1) Offload processing requirements from the CPU
According to publicly leaked information, the processing is still done on the CPU. You know, just like what any modern integrated audio solution will do; these things leave all the major processing duty to the CPU. In the end, that kind of work is so minimalist it doesn't really matter. Some EAX effects (which neither R6xx nor 99% of integrated solutions support in practice anyway) *might* be a bit more expensive.
So, R600 would just have an integrated audio codec, similar to those found on nearly all mainboards?
Unknown Soldier
06-May-2007, 09:28
According to publicly leaked information, the processing is still done on the CPU. You know, just like what any modern integrated audio solution will do; these things leave all the major processing duty to the CPU. In the end, that kind of work is so minimalist it doesn't really matter. Some EAX effects (which neither R6xx nor 99% of integrated solutions support in practice anyway) *might* be a bit more expensive.
If so, then the PCI-e should cope well. ;)
Other sound cards usually work off PCI.
US
bigtabs
06-May-2007, 09:37
It will be nice to finally get rid of those sound + graphics IRQ conflicts anyway. :smile:
This should be from AMD's own papers:
http://img523.imageshack.us/img523/6871/tdpvfr600td6.jpg
BlizzardOne
06-May-2007, 10:55
This should be from AMD's own papers:
http://img523.imageshack.us/img523/6871/tdpvfr600td6.jpg
:???:
I thought GDDR4 was supposed to be more efficient from an energy side of things, when compared to GDDR3, at the same clockspeeds? (I know 2.0 W is next to nothing anyway.. but still)
And the GPU TDP looks off too, both are estimated at 750-800, same voltages etc.. just one is GDDR3, one is GDDR4, and the latter has a TDP 20 W higher?
Maybe i'm just being oblivious to the obvious though (i'll blame the hangover):-|
Twinkie
06-May-2007, 11:05
:???:
I thought GDDR4 was supposed to be more efficient from an energy side of things, when compared to GDDR3, at the same clockspeeds? (I know 2.0 W is next to nothing anyway.. but still)
And the GPU TDP looks off too, both are estimated at 750-800, same voltages etc.. just one is GDDR3, one is GDDR4, and the latter has a TDP 20 W higher?
Maybe i'm just being oblivious to the obvious though (i'll blame the hangover):-|
One's 1024MB of GDDR4, and the other's 512MB of GDDR3. :wink:
BlizzardOne
06-May-2007, 11:07
*goes and sits in the corner with his dunce hat*
pjbliverpool
06-May-2007, 11:18
From the admittedly VERY few screens of Call of Juarez, the DX10 version looks absolutely nothing like the DX9 version that I tried out. Well other than buildings and terrain being in the same place. :D
Given that the DX9 version of that game is already easily one of the best looking games available today, the DX10 version if the difference is as big as you say it is must be out of this world!
I may even pick it up. Didn't have much interest in this game but tried the demo recently and it was surprisingly very good. And like I mentioned, the graphics are stunning!
Hi,
I'm surprised that no one picked up that the GDDR4 1GIG card seems to be underclocked. The original rumours spouted that the GDDR4 card(ala XTX?) would use 2.2(or 2200Mhz) yet these pics only show 2000Mhz.
Of course as someone mentioned, these are most probably developer cards thus why it only has 2000Mhz and is called XT.
US
What you quoted is not an XTX and nor is it retail, so you shouldn't worry. :wink:
Pre-order on Amazon. $448.99
http://www.amazon.com/ATI-Radeon-2900-512MB-PCIE/dp/B000Q6J17G/ref=sr_1_1/103-9399096-8907021?ie=UTF8&s=electronics&qid=1178446300&sr=1-1
PatrickL
06-May-2007, 11:43
Pre orders attempting to rip off as usual ?
EasyRaider
06-May-2007, 11:48
One's 1024MB of GDDR4, and the other's 512MB of GDDR3. :wink:
How does that change GPU TDP? And why does board TDP stay the same?
How does that change GPU TDP? And why does board TDP stay the same?
Because it's pulled out of someone's arse? Amazing huh what the internet can do with numbers..
225W doesn't square with the numbers for the GPU and memory. And TDP values are always represented as "typical" values under normal operation and not the maximum value. That would mean that under heavy benchmarking the card would draw.. what.. 50% more power than the 225W represented there?
Edit: Oops wait.. if ATI follows AMD's lead then the numbers there should be the maximum possible power draw, period. 8800GTX TDP is 185 Watt, so the whole world is going to melt with this one extra 40W lightbulb?
Kombatant
06-May-2007, 13:22
Pre orders attempting to rip off as usual ?
Hehe, I guess so :p
There are other sources reporting the same TDP data.
There are other sources reporting the same TDP data.
Do you seriously believe that R600 uses 20 Watts more just because it works with GDDR4 instead of GDDR3?
I know the 1900XTX had a TDP of 90W officially, does anyone know the number of the 1950XTX?
Do you seriously believe that R600 uses 20Watts more just because it works with GDDR4 instead of GDDR3?
Don't know how much 1024MB GDDR4 uses, but I saw another document with the same data.
The table itself says the PROCESSOR uses 20 Watt more.. not the board.
Perhaps the GDDR4 is "harder" on the memory controller, which would then eat more power?
AnarchX
06-May-2007, 15:57
http://img440.imageshack.us/img440/2783/06b3daa9rv4.jpg
http://we.pcinlife.com/thread-760129-3-1.html
Not bad for stock cooling.
Pressure
06-May-2007, 16:01
The table itself says the PROCESSOR uses 20 Watt more.. not the board.
Higher voltage, higher core clock (would be the obvious call).
Guess we will know soon enough :D
{Sniping}Waste
06-May-2007, 16:01
Found some more Benches on a HD2900XT 1 gig GDDR4.
This is Specviewperf v9 with older drivers but gives some numbers for OpenGL.
http://www.xtremesystems.org/forums/showthread.php?t=143104
http://www.iamxtreme.net/video/r600/x2900xt1024_02.png
Perhaps the GDDR4 is "harder" on memorycontroller, which would then eat more power?
No not really, if anything GDDR4 is supposed to consume much less power due to a lower voltage requirement (1.5 volts) than GDDR3, clock for clock. The higher wattage number (under GPU TDP too) is probably indicative of a higher clock on the core, e.g. the XTX is understandably shipping at higher frequencies than the regular XT.
Yes but the GDDR4's lower voltage isn't related to GPU's power draw, also, the volts & clocks are mentioned to be the same for both R600 GDDR3 and R600 GDDR4 (750-800MHz on both, VDDC is 1.175-1.2V on both)
Oh dang, you're right. :oops: I skimmed over the chart and assumed the higher clocking for what was the XTX, since it's the only one shipping with 1 gb of GDDR4... man, that's gotta be a typo.
Geeforcer
06-May-2007, 17:37
http://img440.imageshack.us/img440/2783/06b3daa9rv4.jpg
http://we.pcinlife.com/thread-760129-3-1.html
Not bad for stock cooling.
The scores themselves, while very good, should throw some cold water on the "IT WILL MURDER THE GTX!!!!2" crowd.
nicolasb
06-May-2007, 17:44
If so, then the PCI-e should cope well. ;)
Other sound cards usually work off PCI.
PCIe is apparently quite problematic for sound cards. I forget the exact argument, but it's to do with PCIe being designed to shift very high-bandwidth data in short bursts. If you want to transmit something like an audio stream, where you have small data bandwidth but it has to be broadcast in a long-term, uninterrupted stream, the overhead is apparently prohibitively large on PCIe (as compared with PCI). This is supposedly the reason why it's taken Creative (and, indeed, everybody else) so long to make a PCIe soundcard.
However, this is all a bit off-topic for R600, because...
So, R600 would just have an integrated audio codec, similar to those in all over the mainboards?
...this statement is incorrect.
Look at all of the photos of R600 boards: do you see any analogue sound outputs? Do you even see S/PDIF? No. That's because R600 is not a sound-card in the normal sense of that term. It will not be able to function in the way that either an integrated sound-chip or an add-on soundcard does. It will not be able to generate sound at all. What it will do is simply act as a pass-through for pre-recorded digital audio soundtracks associated with (e.g.) DVD/HDDVD/BluRay movies.
This is designed to appeal to the HTPC crowd. The optimum way to feed digital audio to an off-board surround-sound processor is via HDMI. (S/PDIF will no longer be adequate for the best-quality soundtracks on BluRay and HD-DVD discs, because it simply doesn't have the bandwidth to transmit lossless, high-resolution, multi-channel audio). If you're not proposing to use your PC as a video player connected to an outboard sound processor then, as far as you're concerned, R600 has no useful audio capability at all.
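To put rough numbers on the bandwidth argument, here is a minimal Python sketch that computes raw (uncompressed) PCM bit rates. The two formats are chosen purely for illustration and are assumptions, not figures from this thread: a 16-bit/48 kHz stereo stream of the kind S/PDIF was built around, and an 8-channel 24-bit/96 kHz stream standing in for a lossless high-resolution soundtrack. Framing overhead is ignored.

```python
def pcm_mbit_per_s(channels, sample_rate_hz, bits_per_sample):
    """Raw PCM payload rate in Mbit/s (no framing or protocol overhead)."""
    return channels * sample_rate_hz * bits_per_sample / 1e6


# Illustrative formats only (assumed, not taken from any spec sheet quoted here).
stereo = pcm_mbit_per_s(channels=2, sample_rate_hz=48_000, bits_per_sample=16)
surround = pcm_mbit_per_s(channels=8, sample_rate_hz=96_000, bits_per_sample=24)

print(f"stereo 16/48:    {stereo:.3f} Mbit/s")
print(f"8-channel 24/96: {surround:.3f} Mbit/s")
print(f"ratio:           {surround / stereo:.0f}x")
```

Under these assumptions the multi-channel stream needs about twelve times the payload rate of the stereo one, which is the kind of gap the post is pointing at.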
while it can only output digital signal, there's no apparent reason why it wouldn't work with games too
Higher voltage, higher core clock (would be the obvious call).
The table lists the same voltage and clocks for both boards, the same cpu ..
just another piece of BS posted on some forum by some asian guy looking for hits.
It isn't from asian source :wink:
edit: and haven't seen it on any website (as news or anything) either
The scores themselves, while very good, should throw some cold water on the "IT WILL MURDER THE GTX!!!!2" crowd.
Who ever said the 2900xt would kill the GTX? Maybe one or two lurkers but definitely not a "crowd".
On a serious note, I think it's fairly hard to predict whether or not it'll rule WRT DX10, because we've yet to see typical DX10 workloads (no, dubious demos quoted in marketing slides don't count). It'll come down to what ppl actually do with DX10 and how that jibes with the 2 competing architectures.
I'm not sure that we'll see that many games that'll look different between the DX9-DX10 paths.
Take these comments from some well known developers:
http://www.gameinformer.com/News/Story/200701/N07.0109.1737.15034.htm
There’s no massive pull for me for DX10. It would be more a question of if we don’t think we’re going to get done until Vista is broadly adopted, it might just save us development and support things to say it’s a DX10 game--but there’s no huge thing where we’re dying to use any particular DX10 feature. It would just more be a question about practically, is the market there where we can write off everything else? Quake Wars is definitely not DX10.
http://www.megaleecher.net/node/886
Tim Sweeney: Unreal Engine 3 will make full use of DirectX 10, and many of our and our partners' games will ship in 2007 with full support for DirectX 10 and Windows Vista. But, despite the marketing hype, DirectX 10 isn't all that different from DirectX 9, so you'll mainly see performance benefits on DirectX 10 rather than striking visual differences.
It'll be interesting to see how much impact the DX10 paths will have with regards to performance. I don't really expect more than 10%, perhaps 20% at the most. But I guess that some developers here might know more about this.
edit: and haven't seen it on any website (as news or anything) either
Where do you think the "OMG R600 225Watts" come from?
Inq already reported this exact number in November last year. so this graph (or, the entire R600 service manual for that matter) has been in the circuit since that time.
Oooohhh...
Orange X2900... (must be the lighting though.. Window or something...)
http://www.fx57.net/?p=637
So who's going to add the HD2600 as Physics GPU to their set-up?
SugarCoat
06-May-2007, 19:03
Looks like a reflection off the copper block, though if you take the heatsink shell off and leave it out in the hot mid-day sun for a few days I'm sure you can get a colour like that.
Sound_Card
06-May-2007, 19:12
So who's going to add the HD2600 as Physics GPU to their set-up?
I'm just going to use my x1800gto2 for physics.
Where do you think the "OMG R600 225Watts" come from?
Inq already reported this exact number in November last year. so this graph (or, the entire R600 service manual for that matter) has been in the circuit since that time.
They also reported gazillion other "exact numbers" in case you forgot, so what makes you think this would be related to that in any way?
So who's going to add the HD2600 as Physics GPU to their set-up?
That was my plan, if the 2600 is decent enough to hold me over until the refresh options.
Did anyone notice this?
ROPs and TMUs: 16 and 32 units
They also reported gazillion other "exact numbers" in case you forgot, so what makes you think this would be related to that in any way?
That the 225 obviously wasn't guesswork, unlike the temperature numbers in which R600 does a better job at staying cool...
So who's going to add the HD2600 as Physics GPU to their set-up?
That would be an utter waste of money. Better invest it in a quad-core CPU, or a balanced CrossFire setup.
Dalton Sleeper
06-May-2007, 19:34
Anyone knows how much you gain running a gfx that calculate physics?
I guess it must be implemented in every single game, are there any games that support this today?
Skrying
06-May-2007, 19:36
Anyone knows how much you gain running a gfx that calculate physics?
I guess it must be implemented in every single game, are there any games that support this today?
No game is currently implementing it and therefore it's currently impossible to say. I wouldn't hold my breath on the issue personally. But if you have say a X1900 card currently then it might be useful holding onto it for the time being.
NH|Delph1
06-May-2007, 19:38
http://img440.imageshack.us/img440/2783/06b3daa9rv4.jpg
http://we.pcinlife.com/thread-760129-3-1.html
Not bad for stock cooling.
Correct source:
http://www.nordichardware.com/forum/viewtopic.php?topic=8428&forum=45 ;) :P
//Andreas
Anyone knows how much you gain running a gfx that calculate physics?
I guess it must be implemented in every single game, are there any games that support this today?
That's not the only point. CTM could be used for a wide variety of APIs both by first and third-party developers, there just isn't any created yet.. Perhaps first physics will be off-loaded, but later perhaps AI will be, as well as many other things. There's tons of room for middleware IE what havok/AGEIA have done for physics, specifically aimed towards using the FLOPs of a secondary lesser card, which should still be much greater than or comparative to that of a high-end CPU (wish I could find that slide with RV630's flop count...). This is why I think it's a good option.
Currently physics are done mainly on the CPU; Havok 4.0 added using the GPU as well, I imagine it can be adapted for use on only secondary cards, leaving the main GPU for other things, although I haven't heard anything from ATi/Nvidia confirming that beyond that their physics systems were going to use Havok, but it would make the most sense. Any difference is a positive one, and considering how many current and future games use havok (http://en.wikipedia.org/wiki/List_of_games_using_physics_engines), it's a step in the right direction.
We don't really have anything to judge the impact it will cause because the only thing we have to compare it to is PhysX, which obviously isn't a great example, as it's proprietary.
Mintmaster
06-May-2007, 20:33
What if R600 is divided in 4 groups and each group has more levels consisting of 4 subgroups of 4 shader units with 5 scalar processors each (we already know that they are grouped by 5, with one unit to handle the SF)?
Or eight SIMDs and four fully decoupled TUs (texture units)?
I don't know R600's exact layout, but I was just trying to see why they went the co-issue route instead of the SOA route (as in G80) for scalar granularity. I heard about R600's scalar ability almost a year ago, and the SOA implementation was the first thought in my head due to its simplicity and efficiency.
If the 320 math units were divided further into smaller subgroups, then batch size would no longer be an issue. They could all execute the same instruction on the same channel, just like G80 does, and the co-issue decision remains a baffling one.
Thanks for the good post Mint.
Also we don't know how orthogonal all of R600's stream processors are; for example, from the leaked slides it is not very clear if they support integer math on any ALU, or at least on most of them.
True, but I expect it to be the same as in G80. It's pretty trivial to use the mantissa of FP math units for most integer operations. AFAIK, everything from R300 onwards worked that way on DX8 shaders.
Additionally, if ATI didn't have the tools to build a 2x-clocked ALU pipeline, then a sequential-scalar ALU would create a large increase in batch size. This is assuming that they've kept the 1-instruction=4-clocks organisation. If they went serial-scalar, that'd produce batches 5x bigger (now there's 5x more pixels per clock), hurting DB. Helps the texture caches, though.
That's the other thing I was thinking about: 4 clocks to change the instruction.
It doesn't seem like much of a design issue to me if one was to reduce this. Maybe the dynamic branching aspect of instruction scheduling is a little expensive to make that fast, but I see no problem in stuffing some NOP instructions to keep jump frequency down to once every 4 instructions. The co-issue method wouldn't be any faster.
I know you have the Xenos documents, so you know that the three 16-ALU (5D) shader arrays operate on vectors of 64 pixels/vertices. It would take 4 clocks for an array to execute any instruction pair (4D+scalar) on a vector. Now imagine if the arrays were simply a group of 64 ALUs executing the same 1D instruction. It would still take 4 clocks to execute a 4D instruction, but only one clock to execute a scalar instruction provided it wasn't a special function (which can be handled several different ways).
Yes, you need to change the instruction every clock instead of every 4 clocks, but you also have 64-way SIMD instead of 16-way SIMD for the R600 route (assuming 4x16 organisation). Moreover, you can have dependency between the instructions and don't have to worry about efficient packing by the compiler or scheduler.
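The clock counts in that comparison can be checked with a small Python sketch. It assumes only what the post states: a batch (vector) of 64 pixels, either 16 ALUs that each issue a 4D+scalar pair, or 64 ALUs that each execute one 1D operation per clock. The function names are made up for illustration, and special functions are left out.

```python
def vector_array_clocks(batch_size=64, alus=16):
    """Clocks for an array of 5D (4D + scalar) ALUs to issue one instruction
    or co-issued pair over a whole batch: each ALU handles one pixel per clock."""
    return batch_size // alus


def scalar_array_clocks(components, batch_size=64, alus=64):
    """Clocks for an array of 1D ALUs to run one instruction over a batch,
    one channel (x, y, z or w) per clock across all pixels."""
    return components * (batch_size // alus)


for name, components in [("4D", 4), ("scalar", 1)]:
    print(name, vector_array_clocks(), scalar_array_clocks(components))
```

The 16x5D array spends 4 clocks on the batch whatever it is issuing, while the 64x1D array spends 4 clocks on a 4D instruction and 1 clock on a scalar one, so the second layout wastes nothing on code that does not pack into 4D+scalar pairs.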
Silent_Buddha
06-May-2007, 20:35
PCIe is apparently quite problematic for sound cards. I forget the exact argument, but it's to do with PCIe being designed to shift very high-bandwidth data in short bursts. If you want to transmit something like an audio stream, where you have small data bandwidth but it has to be broadcast in a long-term, uninterrupted stream, the overhead is apparently prohibitively large on PCIe (as compared with PCI). This is supposedly the reason why it's taken Creative (and, indeed, everybody else) so long to make a PCIe soundcard.
However, this is all a bit off-topic for R600, because...
...this statement is incorrect.
Look at all of the photos of R600 boards: do you see any analogue sound outputs? Do you even see S/PDIF? No. That's because R600 is not a sound-card in the normal sense of that term. It will not be able to function in the way that either an integrated sound-chip or an add-on soundcard does. It will not be able to generate sound at all. What it will do is simply act as a pass-through for pre-recorded digital audio soundtracks associated with (e.g.) DVD/HDDVD/BluRay movies.
This is designed to appeal to the HTPC crowd. The optimum way to feed digital audio to an off-board surround-sound processor is via HDMI. (S/PDIF will no longer be adequate for the best-quality soundtracks on BluRay and HD-DVD discs, because it simply doesn't have the bandwidth to transmit lossless, high-resolution, multi-channel audio). If you're not proposing to use your PC as a video player connected to an outboard sound processor then, as far as you're concerned, R600 has no useful audio capability at all.
Almost correct. You can't just pass through audio for HD-DVD (not sure about Blu-ray) as HD-DVD requires you to composite different audio streams. In dedicated HD-DVD players, this is done in hardware usually. I'm assuming that for the R600 and RV6xx cards that the audio compositing will be done in software. At which point it will be passed to the Video Card which in turn just passes it through the HDMI connector.
However, I'm also assuming that you are correct in that it won't generate audio like a traditional sound card. Meaning it can either 1) pass through audio from a DVD source or 2) take the audio streams from a HD-DVD, composite them as required and pass them through.
Although since the low end cards ARE meant for HTPC. One might also assume that those people would do their computing on their HDTV's. So there is a possibility that RV6xx might be able to either pass through sound from onboard audio codecs (?) or have some software codec that would allow general system sounds to be passed through the HDMI connector.
Regards,
SB
Russell
06-May-2007, 20:45
Take these comments from some well known developers:
Sweeney's comment shouldn't be a surprise as he's absolutely right. This has been known for some time now. The benefit of DX10 is the performance, but also the fact that the reduced overhead will allow greater complexity in rendered scenes.
That would be an utter waste of money. Better invest it in a quad-core CPU, or a balanced CrossFire setup.
We're talking about the 2600 as a third card here
Silent_Buddha
06-May-2007, 20:56
Or for someone like me with multiple monitors where crossfire is basically useless (or at best a royal pain in the behind) since I'd have to constantly enable/disable it.
I'm actually more interested to see how the HD 2400 series does with Physics and GPGPU. If it does well, then a passively cooled card that can drive my 3rd and 4th monitors while also doing physics calcs in possible future games would be right up my alley.
Regards,
SB
Anarchist4000
06-May-2007, 21:09
Almost correct. You can't just pass through audio for HD-DVD (not sure about Blu-ray) as HD-DVD requires you to composite different audio streams. In dedicated HD-DVD players, this is done in hardware usually. I'm assuming that for the R600 and RV6xx cards that the audio compositing will be done in software. At which point it will be passed to the Video Card which in turn just passes it through the HDMI connector.
However, I'm also assuming that you are correct in that it won't generate audio like a traditional sound card. Meaning it can either 1) pass through audio from a DVD source or 2) take the audio streams from a HD-DVD, composite them as required and pass them through.
Although since the low end cards ARE meant for HTPC. One might also assume that those people would do their computing on their HDTV's. So there is a possibility that RV6xx might be able to either pass through sound from onboard audio codecs (?) or have some software codec that would allow general system sounds to be passed through the HDMI connector.
Regards,
SB
An onboard audio codec doesn't even need hardware to go along with it. It just needs a way to get the data to the speakers. Then some inputs never hurt to have around. If the HDMI audio output is something that the system could easily map, and I hope it will be, I don't think it would be unreasonable to forward all audio data through the HDMI port. Since all audio is digital to begin with R600 could probably just grab the audio stream and forward it on.
This might even be a reason why Vista doesn't seem to like sound cards. On a lot of cards the audio compositing is done on the card. In this case you'd have to get the final audio stream back off the card to be forwarded elsewhere. I suppose SPDIF would work but there would be some quality loss and going over PCI-E seems to be problematic for sound cards. For the onboard solutions it seems fairly straightforward as long as they can provide some sort of handle to the audio stream.
leoneazzurro
06-May-2007, 21:41
I don't know R600's exact layout, but I was just trying to see why they went the co-issue route instead of the SOA route (as in G80) for scalar granularity. I heard about R600's scalar ability almost a year ago, and the SOA implementation was the first thought in my head due to its simplicity and efficiency.
If the 320 math units were divided further into smaller subgroups, then batch size would no longer be an issue. They could all execute the same instruction on the same channel, just like G80 does, and the co-issue decision remains a baffling one.
I asked this because an extremely reliable source told me that if G80's organization can be seen as an 8-2-8, R600 is like a 4-4-4-5.
...specifically aimed towards using the FLOPs of a secondary lesser card, which should still be much greater than or comparative to that of a high-end CPU...
Core 2 Extreme QX6800: 93.8 GFLOPS
GeForce 8600 GTS: 92.8 GFLOPS
I know this is just multiply-accumulate, but the numbers don't lie: CPUs are catching up in floating-point performance, FAST. Prices for quad-cores will drop pretty quickly soon. And what else to use these cores for?
This is why I believe 'dedicated' physics will never be more than a gimmick. There's nothing special about physics calculations that would make another processor significantly faster than the CPU. The 530 million sphere collisions per second specified for PhysX is laughably low.
There's just no convincing reason to send the physics calculations to another processor. I also utterly dislike the idea of being forced to invest in another co-processor just to play a game properly. With just a little more effort it can all be done on a multi-core CPU.
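The two peak figures quoted above can be reproduced with simple arithmetic. The sketch below is in Python and the inputs are assumptions not stated in the post: 4 cores at 2.93 GHz with 8 single-precision FLOPs per core per clock for the QX6800 (a 4-wide SSE multiply plus a 4-wide SSE add), and 32 stream processors at a 1.45 GHz shader clock with 2 FLOPs per clock (one MAD) for the 8600 GTS. These are theoretical peaks, not measured throughput.

```python
def peak_gflops(units, clock_ghz, flops_per_unit_per_clock):
    """Theoretical peak = execution units x clock (GHz) x FLOPs per unit per clock."""
    return units * clock_ghz * flops_per_unit_per_clock


# Assumed configurations (see the note above); not taken from the thread itself.
cpu = peak_gflops(units=4, clock_ghz=2.93, flops_per_unit_per_clock=8)
gpu = peak_gflops(units=32, clock_ghz=1.45, flops_per_unit_per_clock=2)

print(f"Core 2 Extreme QX6800: {cpu:.1f} GFLOPS")
print(f"GeForce 8600 GTS:      {gpu:.1f} GFLOPS")
```

With those inputs the formula lands on the same 93.8 and 92.8 GFLOPS quoted in the post.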
PaulDune
06-May-2007, 22:40
This should be from AMD's own papers:
http://img523.imageshack.us/img523/6871/tdpvfr600td6.jpg
Hmmm, first post, hope I don't make a fool out of myself.
But, has anyone noticed the max clock: (750-800)E/1100M
Could this mean: 750-800 MHz economy; 1100 MHz max?
It couldn't be the DDR speed; either true or effective....
Or am I completely missing something?:roll:
Gr. Paul
E = engine (core)
M = memory
Hmmm, first post, hope I don't make a fool out of myself.
But, has anyone noticed the max clock: (750-800)E/1100M
Could this mean: 750-800 MHz economy; 1100 MHz max?
It couldn't be the DDR speed; either true or effective....
Or am I completely missing something?:roll:
Gr. Paul
Welcome Paul Atreides.
The clocks also go below 750. You'll see cards at, say 742, some at 750, some at 757. Memory speed will also vary but we know the current stock speed is 1GHz.
I haven't seen any XT yet which will ship at 800 but the professional overclock people already wrote that this card easily does 1GHz (cpu) on stock cooling with stock voltage.
Actually I don't think anyone said 1GHz would be on stock cooling, just stock volts.
Also, since the default cooler's supposed "capacity" is 210W, I don't see that 1GHz happening on it.
Ah sorry...
I misread Kinc's "It can't do 1000MHz on stock cooler. That would be sick really"
I hate it when people forget to push one key and make the sentence take a u-turn in meaning.. because of dailytech already hitting 850 on stock cooling (http://www.dailytech.com/article.aspx?newsid=6138)
I'll go check. It was in a post about volt adjustment not being possible yet, so it was of no use installing the LN cooling etc.
Who said LN2? Might have been just beefier aircooling, perhaps some water or dryice testing too
Sampsa said Dry Ice (http://www.xtremesystems.org/forums/showpost.php?p=2169939&postcount=67) and I got so aroused about kingpin's Ultra numbers and forgot that he is the one using LN (
http://www.xtremesystems.org/forums/showthread.php?t=143117)
Sampsa said "Now we just need to wait for new version of ATITool which brings voltage adjustments for R600 and I'll run tests under dry ice cooling." - No dryice yet :wink:
you forget..
Hopefully that'll be next week.
Sound_Card
07-May-2007, 01:12
Actually I don't think anyone said 1GHz would be on stock cooling, just stock volts.
Also, since the default cooler's supposed "capacity" is 210W, I don't see that 1GHz happening on it.
210W means exactly what though? I don't think it's referring to the GPU and memory combined. It's referring to the GPU's thermal output max. We have been told 170W-180W on R600. Increasing freq. by 250MHz without increasing the volts is not going to change the wattage by 30W-40W.
Wizzard is doing a review for R600, so I would take it he is not messing with the cooler.
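One way to frame the frequency-versus-wattage question is the usual first-order model for dynamic power, P ∝ V²·f. The Python sketch below applies it to the figures in the post, with two loud assumptions: the base clock is taken as 750 MHz (the low end of the 750-800 range in the leaked table), and all of the 170-180 W is treated as dynamic switching power. Leakage and anything else that does not scale with clock is ignored, so the result is an upper-end estimate rather than a prediction.

```python
def scaled_dynamic_power(base_watts, base_mhz, new_mhz, base_volts=1.0, new_volts=1.0):
    """First-order estimate: dynamic power scales with frequency and voltage squared."""
    return base_watts * (new_mhz / base_mhz) * (new_volts / base_volts) ** 2


# Assumed: 750 MHz base clock, unchanged voltage, all power treated as dynamic.
for base_watts in (170, 180):
    estimate = scaled_dynamic_power(base_watts, base_mhz=750, new_mhz=1000)
    print(f"{base_watts} W at 750 MHz -> about {estimate:.0f} W at 1000 MHz")
```

Under this simplified model a 250 MHz increase at constant voltage scales 170-180 W to roughly 227-240 W; how close a real card gets to that depends on how much of its power draw is actually frequency-dependent.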
That's the other thing I was thinking about: 4 clocks to change the instruction.
G80 appears to be the same, 4 clocks per instruction. Vertex/geometry shaders supposedly have half-sized batches (16) so I'm not sure what's going on there.
It doesn't seem like much of a design issue to me if one was to reduce this.
Seems incredibly fundamental to me.
Maybe the dynamic branching aspect of instruction scheduling is a little expensive to make that fast, but I see no problem in stuffing some NOP instructions to keep jump frequency down to once every 4 instructions.
Whatever the branching capability of Xenos is, it appears to suffer performance degradation due precisely to short loops (which have to be "padded out" with NOPs). I get the feeling it's something ATI wasn't happy with...
[...]Moreover, you can have dependency between the instructions and don't have to worry about efficient packing by the compiler or scheduler.
G80 is heavily dependent upon its compiler to produce a decent sequencing of the scalar instructions. Merely scheduling by dependency isn't the answer because of co-issue of SFs (this co-issue will introduce bubbles in the MAD ALU if you naively schedule - this is the problem I was having originally with my pipeline pizza). This is further complicated because some SFs take twice as long as others.
Also, different compilations/schedulings affect the way temporary registers are assigned, e.g. think of an "unrolled loop" per component of a clause of instructions that uses 3 or 4 temporaries. There's a big saving in register file to be had if you can get the compilation right by this kind of unrolling - but the interaction of SFs + interpolations for texturing causes all sorts of co-issue complications.
The result is that the compiler (and the scheduler) in G80 is most definitely not sitting back with a cigar and glass of whisky :smile:
Ultimately R600's evolution of the ALU pipeline might be nothing more than a case of diminishing returns, seen from where ATI was already with R5xx. ATI hasn't been paying the instruction throughput price of in-ALU pipe interpolation and texture addressing
Jawed
210W means exactly what though? I don't think it's referring to the GPU and memory combined. It's referring to the GPU's thermal output max. We have been told 170W-180W on R600. Increasing freq. by 250MHz without increasing the volts is not going to change the wattage by 30W-40W.
Wizzard is doing a review for R600, so I would take it he is not messing with the cooler.
Oh yes it could, if you look at the g80, and overclock it, with just a 75 mhz bump on the core it goes from 175-180 to over 200 watts.
INKster
07-May-2007, 02:02
Oh yes it could, if you look at the g80, and overclock it, with just a 75 mhz bump on the core it goes from 175-180 to over 200 watts.
G80 is a 90nm core, not a 80nm one. ;)
Sound_Card
07-May-2007, 02:04
Oh yes it could, if you look at the g80, and overclock it, with just a 75 mhz bump on the core it goes from 175-180 to over 200 watts.
How are you monitoring your G80's wattage?
How are you monitoring your G80's wattage?
my ups,
Inkster, the process doesn't really matter much, if you're overclocking you need more power.
INKster
07-May-2007, 02:55
my ups,
Inkster, the process doesn't really matter much, if you're overclocking you need more power.
Process technology and core voltage have a large impact on power consumption as well.
Just look at the 7800 GTX vs 7900 GT situation.
Sound_Card
07-May-2007, 03:23
I'm buying water block for R600. If wiz can hit 1ghz with stock volts, I bet I can hit 1.3ghz with a water block and a good ol volt mod.
Process technology and core voltage have a large impact on power consumption as well.
Just look at the 7800 GTX vs 7900 GT situation.
Design is more important than the process itself. But if you increase the frequency you need more power.
Edit: the R600 XT is already in the 200 - 210 watt range at stock, it doesn't need much overclocking to go above 225 watts. Now the cooler on this thing, I'm not sure if the max wattage is 210 or if that's at a regular fan speed; has there been any confirmation on this? It's a bit unbelievable since the GTX is an all aluminum heat sink and from what we saw of the R600 coolers they are copper.
210W means exactly what though? I don't think it's referring to the GPU and memory combined. It's referring to the GPU's thermal output max. We have been told 170W-180W on R600. Increasing freq. by 250MHz without increasing the volts is not going to change the wattage by 30W-40W.
Wizzard is doing a review for R600, so I would take it he is not messing with the cooler.
It means the max that the default cooler can dissipate, if the information is correct
Sweeney's comment shouldn't be a surprise as he's absolutely right. This has been known for some time now. The benefit of DX10 is the performance, but also the fact that the reduced overhead will allow greater complexity in rendered scenes.
True, but there has been some talk about "DX10 workloads", and some DX10 benchmarks here where the G80 supposedly is a LOT slower than the R600. I'm just a bit sceptical that any of the IHVs will have serious problems even if they are much slower on some DX10 specific things.
On the other hand, if one IHV can gain say 20% with DX10 and the other 5% then that's still a huge difference.
True - with regard to gaming, the first generation being more or less efficient doesn't matter that much. For example 9700 was a very good DX9 card, but there weren't mountains of games that didn't run on NVidia offerings. Of course it would be interesting to run Oblivion/Stalker on those cards to see how these would run "real" DX9 games ;).
Also, R580 is very good at running complex shaders, but all it got was the F@H client and maybe larger GPGPU-research penetration. But there devrel plays at least as important a role (if not more) as actual hardware. As far as I remember, there are GPGPU-offerings based on both ATI and NVidia cards.
If R600 is clearly better or worse at something, again it doesn't change the world, it merely leaves some playing field for refreshes. Remember 3Dc (and Fetch4)? What's the state of those in R600 (and in games)?
DemoCoder
07-May-2007, 10:40
G80 is heavily dependent upon its compiler to produce a decent sequencing of the scalar instructions. Merely scheduling by dependency isn't the answer because of co-issue of SFs (this co-issue will introduce bubbles in the MAD ALU if you naively schedule - this is the problem I was having originally with my pipeline pizza). This is further complicated because some SFs take twice as long as others.
Mint's point isn't that the G80 compiler doesn't need to do any work, but that scheduling co-issue of a scalar MAD + SF is simpler than scheduling co-issue of 5 ALUs (with SFs) + a branch unit. No matter how you slice it, the R600 introduces more work for the compiler author, and more potential points of failure for pathological performance breakdown (e.g. by the compiler goofing up). This was one of the problems that hamstrung the NV3x and invited all the shader replacement issues, because the first rev of the drivers for the new architecture had a crappy compiler, it took ages for it to mature (by which time, the NV4x was out), and by contrast, the R300 architecture was a lot simpler for a DX9 driver to optimize for.
This is not to say that the r600 is an Nv3x, but that the R600 is much more dependent on good compiler technology than the R300 was to extract peak throughput.
I highly doubt that the G80 driver only schedules by dependency. A topological sort of the dependency graph is merely the first step of any scheduler. Pretty much anyone who does compiler work is aware that data hazards are only the trivial issue with scheduling (and a solved problem, algorithmically), and most of the work that goes into schedulers deals with the other hazards, resource hazards, pipeline hazards, etc all NP complete, which is why there is still tons of active research for best heuristics.
I will say this tho: the more help you get from the HW, the better, which is why VLIW architectures demand much beefier compilers. In general, compilers that don't use profile feedback or dynamic reoptimization are limited, since there are limits to how well a static compiler can do.
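As a rough illustration of the "first step" mentioned in this post, a topological sort of a dependency graph can be sketched as below. The instruction names and the `deps` mapping are made up for the example; this is plain Kahn's algorithm, not anything taken from an actual G80 or R600 driver, and it deliberately ignores the resource and pipeline hazards that the post says make up most of the real work.

```python
from collections import deque

def topo_order(deps):
    """Return instructions in an order that respects data dependencies.

    deps maps each instruction name to the list of instructions whose
    results it reads (hypothetical names, for illustration only).
    """
    # Count unmet dependencies and build the producer -> users map.
    pending = {op: len(srcs) for op, srcs in deps.items()}
    users = {op: [] for op in deps}
    for op, srcs in deps.items():
        for src in srcs:
            users[src].append(op)

    # Instructions with no unmet dependencies can be emitted right away.
    ready = deque(op for op, n in pending.items() if n == 0)
    order = []
    while ready:
        op = ready.popleft()
        order.append(op)
        for user in users[op]:
            pending[user] -= 1
            if pending[user] == 0:
                ready.append(user)

    if len(order) != len(deps):
        raise ValueError("dependency cycle")
    return order

# Hypothetical four-instruction shader fragment.
example = {
    "tex0": [],
    "mul1": ["tex0"],
    "rsq2": ["mul1"],
    "mad3": ["mul1", "rsq2"],
}
print(topo_order(example))
```

Any order this returns is only legal with respect to data hazards; choosing between the many legal orders is where co-issue and latency heuristics would come in.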
Mintmaster
07-May-2007, 12:07
G80 appears to be the same, 4 clocks per instruction. Vertex/geometry shaders supposedly have half-sized batches (16) so I'm not sure what's going on there.
I think that's just for efficiency purposes. Texturing is rarely used, so having fewer vertices in flight doesn't hurt performance in reality. The advantage is that you don't need as large of a post-transform cache and get finer granularity for load balancing and DB/early-cull.
Seems incredibly fundamental to me.
Fundamental, sure, but I don't think it's expensive. On a hardware level, selecting the instruction shouldn't be much different than selecting the data. Even the instruction issue rate is the same whether you go serial or 5x1D co-issue, assuming your compiler can make the latter as efficient as the former with perfect packing.
Whatever the branching capability of Xenos is, it appears to suffer performance degradation due precisely to short loops (which have to be "padded out" with NOPs). I get the feeling it's something ATI wasn't happy with...
Yeah, but my point is that doing this doesn't make you any slower than the co-issue route. Whether it's 64 scalar processors or 16 vector processors, you can still make it branch once every 4 clocks, 8 clocks, or whatever.
G80 is heavily dependent upon its compiler to produce a decent sequencing of the scalar instructions. Merely scheduling by dependency isn't the answer because of co-issue of SFs (this co-issue will introduce bubbles in the MAD ALU if you naively schedule - this is the problem I was having originally with my pipeline pizza). This is further complicated because some SFs take twice as long as others.
The simplest way to deal with this is for the compiler to just treat the SF as a texture instruction. It's really the same as the DX9 era, where you want to co-issue texture and math instructions together.
It's not a hard scheduling problem, either. On Vec4+scalar hardware, you'd execute a vector instruction alongside a SF. With the same pairing on G80, it's easy to divide up the task amongst the 16 SPs and 4 SFs:
Clock 1: Vec instr on x channel of pixels 1-16, SF instr on pixels 1-4
Clock 2: Vec instr on y channel of pixels 1-16, SF instr on pixels 5-8
Clock 3: Vec instr on z channel of pixels 1-16, SF instr on pixels 9-12
Clock 4: Vec instr on a channel of pixels 1-16, SF instr on pixels 13-16
Anyway, as DC said, the point is not that G80 has a trivial compiler, but rather that a co-issue system is much harder to compile for. Sometimes even the perfect compiler won't help you because the dependency issues won't allow co-issue to have the efficiency of serial execution.
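The four-clock Vec + SF pairing listed above is regular enough to generate mechanically. A minimal sketch, assuming the 16 SP / 4 SF split and the x, y, z, a channel order given in the post (the function name and tuple layout are invented for illustration):

```python
def vec_sf_schedule(pixels=16, sf_units=4, channels="xyza"):
    """Return one (clock, channel, vec_pixel_range, sf_pixel_range) row per clock.

    Each clock runs one channel of the vector instruction on all pixels,
    while the SF units work through the pixels sf_units at a time.
    """
    rows = []
    for clock, channel in enumerate(channels, start=1):
        first = (clock - 1) * sf_units + 1
        last = clock * sf_units
        rows.append((clock, channel, (1, pixels), (first, last)))
    return rows

for clock, ch, vec_px, sf_px in vec_sf_schedule():
    print(f"Clock {clock}: Vec instr on {ch} channel of pixels "
          f"{vec_px[0]}-{vec_px[1]}, SF instr on pixels {sf_px[0]}-{sf_px[1]}")
```

With the defaults, the SF units finish their 16 pixels on the same clock that the vector instruction finishes its fourth channel, which is why this particular pairing leaves no bubbles.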
I'm buying water block for R600. If wiz can hit 1ghz with stock volts, I bet I can hit 1.3ghz with a water block and a good ol volt mod.
The screenshot of the AMD clock tool is just a demonstration that there are no limits in the tool for how high it can set the clocks, as there are with current versions of PowerStrip. On the stock cooler (without any extra fan) at default voltages, the range of 830-860MHz is realistic. With voltage modification (done by software) and watercooling there should be a lot of potential, maybe 1GHz for 24/7 usage.
A side note: everyone who is planning on using watercooling on the 2900 should know that the shim is higher than the core, which will result in the need for an extra coldplate or a custom waterblock. Removing the shim turns out to be harder than it looks.
Kinc
:smile2:
it's a nice thing that you're happy, do you want to make us happy as well? :razz:
Any 2900XL specs/clockspeeds out yet? Fuad talks a little about that card today.
http://www.fudzilla.com/index.php?option=com_content&task=view&id=854&Itemid=1
I hope they haven't crippled the GPU and just lowered the clocks. A $349 card maybe? :) Very interesting days to come.
Very interesting days to come.
And that's to say the very least.
http://www.vr-zone.com/?i=4946&s=1
X2900 review. Well, not really, but the menu "works". :lol:
Page 19: UnReal Overclocking! Probably playing around with voltmods and uber cooling.
Kombatant
07-May-2007, 13:17
:smile2:
I take your :smile2: and raise it to :wink:
Sound_Card
07-May-2007, 13:35
http://www.vr-zone.com/?i=4946&s=1
X2900 review. Well, not really, but the menu "works". :lol:
Page 19: UnReal Overclocking! Probably playing around with voltmods and uber cooling.
haha :shock:
18. overclocking:razz:
19. UnReal overclocking!!:shock:
Sound_Card
07-May-2007, 13:55
The screenshot of the AMD clock tool is just a demonstration that there are no limits in the tool for how high it can set the clocks, as there are with current versions of PowerStrip. On the stock cooler (without any extra fan) at default voltages, the range of 830-860MHz is realistic. With voltage modification (done by software) and watercooling there should be a lot of potential, maybe 1GHz for 24/7 usage.
A side note: everyone who is planning on using watercooling on the 2900 should know that the shim is higher than the core, which will result in the need for an extra coldplate or a custom waterblock. Removing the shim turns out to be harder than it looks.
Kinc
Thanks for the clear up, but what about this cooler?
http://resources.vr-zone.com//newspics/Mar07/18/tt.jpg
R600 happens to be late, without XTX and insanely overclockable at the same time. AMD sure knows how to butter up enthusiasts ;).
True, but I expect it to be the same as in G80. It's pretty trivial to use the mantissa of FP math units for most integer operations. AFAIK, everything from R300 onwards worked that way on DX8 shaders.
DX8 shaders just use FP math.
PSU-failure
07-May-2007, 14:44
It's not a hard scheduling problem, either. On Vec4+scalar hardware, you'd execute a vector instruction alongside a SF. With the same pairing on G80, it's easy to divide up the task amongst the 16 SPs and 4 SFs:
Clock 1: Vec instr on x channel of pixels 1-16, SF instr on pixels 1-4
Clock 2: Vec instr on y channel of pixels 1-16, SF instr on pixels 5-8
Clock 3: Vec instr on z channel of pixels 1-16, SF instr on pixels 9-12
Clock 4: Vec instr on a channel of pixels 1-16, SF instr on pixels 13-16
I think you're wrong here...
G80 has to execute all work related to each pixel on the same SIMD unit as the cache is inside it, so it needs a lot of data manipulations in order to get it to a good efficiency.
If you exit a SIMD unit with a result to use it inside another one, it will add even more latency and chance is you're going to have a bottleneck here if you do this too often.
Anon Lamer
07-May-2007, 14:56
Actually I don't think anyone said 1GHz would be on stock cooling, just stock volts.
Also, since the default cooler's supposed "capacity" is 210W, I don't see that 1GHz happening on it.
If you read (good) heatsink reviews closely, you will notice that heatsinks benefit noticeably from a higher airspeed over their fins. This is particularly true of the closely spaced fins of heatsinks that are supposed to be fan driven. I think heavier fans will push the max up higher. Then some of the board makers will replace the reference heatsink with one of their own design. The XTX fan has a max output of 24 watts, that's humongous! I held a 120x38 propeller fan with similar power in my hand once, it was like an outboard engine propeller!
trinibwoy
07-May-2007, 14:57
I take your :smile2: and raise it to :wink:
My :twisted: hates both of you!
If you read (good) heatsink reviews closely, you will notice that heatsinks benefit noticeably from a higher airspeed over their fins. This is particularly true of the closely spaced fins of heatsinks that are supposed to be fan driven. I think heavier fans will push the max up higher. Then some of the board makers will replace the reference heatsink with one of their own design. The XTX fan has a max output of 24 watts, that's humongous! I held a 120x38 propeller fan with similar power in my hand once, it was like an outboard engine propeller!
This should be from the same documents as the previous pic
http://img391.imageshack.us/img391/4407/tsr600jb7.jpg
102-B007 is DragonsHead2 aka 2900 XTX 1GB DDR4 .. the delayed one..
B006 was CatsEye the 10-15% slower clocked than XTX with 512MB .
B001 was Dragonshead, or the OEM XTX
So XT could launch at 750 and XTX at 825
I do however see no mention of UFO (XL, B002) in there.
DemoCoder
07-May-2007, 16:31
I think you're wrong here...
G80 has to execute all work related to each pixel on the same SIMD unit as the cache is inside it, so it needs a lot of data manipulations in order to get it to a good efficiency.
If you exit a SIMD unit with a result to use it inside another one, it will add even more latency and chance is you're going to have a bottleneck here if you do this too often.
There is no data exchange between SIMD units needed during a PS execution on the G80 and vectorized shaders are trivially serialized by a compiler. If you're thinking about swizzling, the G80 needs no swizzling, swizzling is done at compile time.
All I'm gonna say is, 3dmark scores mean jack in relation to real gaming performance that I am experiencing.
Brent Justice over at [H] makes me nervous again. :???:
http://www.hardforum.com/showpost.php?p=1031020107&postcount=28
It would be nice to fall asleep tonight and wake up on May 14th. This waiting is killing me. :lol:
vertex_shader
07-May-2007, 16:54
Brent Justice over at [H] makes me nervous again. :???:
http://www.hardforum.com/showpost.php?p=1031020107&postcount=28
It would be nice to fall asleep tonight and wake up on May 14th. This waiting is killing me. :lol:
Buy an 8800GTX and then you can sleep :wink:
Robin B
07-May-2007, 17:03
Brent Justice over at [H] makes me nervous again. :???:
http://www.hardforum.com/showpost.php?p=1031020107&postcount=28
It would be nice to fall asleep tonight and wake up on May 14th. This waiting is killing me. :lol:
Or if you have a wife or girlfriend, you know what to do to fall asleep. :oops:
SugarCoat
07-May-2007, 17:29
Or if you have a wife or girlfriend, you know what to do to fall asleep. :oops:
Tell them you'd really like a sandwich and a warm glass of milk at around midnight?
Remember 3Dc (and Fetch4)? What's the state of those in R600 (and in games)?
3Dc is part of D3D10, it's just not called that any more:
DXGI_FORMAT_BC4_xxx
DXGI_FORMAT_BC5_xxx
http://msdn2.microsoft.com/en-us/library/bb173059.aspx
Don't know about fetch4 though.
Jawed
wishiknew
07-May-2007, 17:38
Didn't know Brent was still there.
cadaveca
07-May-2007, 18:24
Jawed...looking like scalar GPU...so, how is the GPU organized? Scalar ALUs, or VEC4+1? We discussed this a bit earlier, however now that more info is out, I'm interested in what you think....:wink:
3Dc is part of D3D10, it's just not called that any more.
Glad to hear that (btw, you are one (y) body of knowledge indeed!).
Mint's point isn't that the G80 compiler doesn't need to do any work, but that scheduling co-issue of a scalar MAD + SF is simpler than scheduling co-issue of 5 ALUs (with SFs) + a branch unit. No matter how you slice it, the R600 introduces more work for the compiler author, and more potential points of failure for pathological performance breakdown (e.g. by the compiler goofing up).
I agree, R600 has increased complexity. But my point stands that with R5xx as a base, which has 4 instruction-issue + branch, it's an evolution. R600's ALU organisation simplifies things to a degree, because for non-SF instructions, the 5 ALUs are equivalent. Whereas in R5xx (and earlier) you had:
vec3 MAD
scalar MAD
vec3 ADD
scalar ADD
which increases the complexity of compilation.
This is not to say that the r600 is an Nv3x, but that the R600 is much more dependent on good compiler technology than the R300 was to extract peak throughput.
I think you're wrong, because R3xx isn't as simple as it first appears. Originally everyone interpreted it as merely vec3+scalar MAD - it was a long time before the second ALU, ADD, came out of the woodwork.
In worst-case 1 instruction per clock, R600 will have better throughput than R300, simply because more of R300 will be idle.
In average-case 1 vec4 (or vec3) instruction per clock, R300 is still worse off, because more of it is idle.
If you look at CTM code you can see the futzing required across the doubly-asymmetric ALU organisation.
I highly doubt that the G80 driver only schedules by dependency. A topological sort of the dependency graph is merely the first step of any scheduler. Pretty much anyone who does compiler work is aware that data hazards are only the trivial issue with scheduling (and a solved problem, algorithmically), and most of the work that goes into schedulers deals with the other hazards, resource hazards, pipeline hazards, etc all NP complete, which is why there is still tons of active research for best heuristics.
Scheduling within the G80 pipeline is nowhere near as simple as you make out, either. In addition to the programmer-generated shader code, there's the interpolation overhead being carried by the SF unit. That interpolation overhead has to be performed no later than just-in-time for the decoupled TMUs to then use those texture coordinates.
That's extra code that the compiler has to decide on the correct timing to issue. Do you decant the interpolants at the start of a shader, consuming a lump of temporary registers or do you try to muddle through issuing interpolation instructions as you go? Sounds like a compilation/scheduling headache to me.
At least it increases the utilisation of the SF units, though, since co-issuing MUL through them doesn't seem to work :smile:
Jawed
Brent Justice over at [H] makes me nervous again. :???:
http://www.hardforum.com/showpost.php?p=1031020107&postcount=28
Keep this in mind: while I like the style H uses in their reviews, it's only one of many methods to use in judging a video card's performance. And his level of playable performance != yours all the time. Food for thought...
I think that's just for efficiency purposes. Texturing is rarely used, so having fewer vertices in flight doesn't hurt performance in reality.
Hmm what about fetches from various (up to 8) vertex buffers? Or do you think D3D10 functionality is irrelevant?
The advantage is that you don't need as large of a post-transform cache and get finer granularity for load balancing and DB/early-cull.
So, what you're saying is that reduced peak throughput is ok?
Fundamental, sure, but I don't think it's expensive. On a hardware level, selecting the instruction shouldn't be much different than selecting the data. Even the instruction issue rate is the same whether you go serial or 5x1D co-issue, assuming your compiler can make the latter as efficient as the former with perfect packing.
I'm not arguing that sequential-scalar is wrong - I'm arguing that it's structurally a much bigger change from R5xx than R600 currently is. Most obviously ATI would have needed to increase either the number of batches in flight or widened the ALUs. They're both significant structural changes, based on the 4-way tiled architecture that already exists. That's what I'm saying is "fundamental".
Yeah, but my point is that doing this doesn't make you any slower than the co-issue route. Whether it's 64 scalar processors or 16 vector processors, you can still make it branch once every 4 clocks, 8 clocks, or whatever.
And, maybe, having a minimum loop length like that burns all that was gained by implementing a dedicated branching pipeline? Why do CPUs invest so much effort in branch prediction?
The simplest way to deal with this is for the compiler to just treat the SF as a texture instruction. It's really the same as the DX9 era, where you want to co-issue texture and math instructions together.
Except in G80 the MAD and SF/MI are co-issued.
It's not a hard scheduling problem, either. On Vec4+scalar hardware, you'd execute a vector instruction alongside a SF. With the same pairing on G80, it's easy to divide up the task amongst the 16 SPs and 4 SFs:
Clock 1: Vec instr on x channel of pixels 1-16, SF instr on pixels 1-4
Clock 2: Vec instr on y channel of pixels 1-16, SF instr on pixels 5-8
Clock 3: Vec instr on z channel of pixels 1-16, SF instr on pixels 9-12
Clock 4: Vec instr on a channel of pixels 1-16, SF instr on pixels 13-16
Yep, that's how G80 co-issues that particular pair of instructions, as long as the vec instruction is identical for all four channels.
G80 is actually 16x (8x MAD + 2x SF). A 32-wide warp therefore takes 16 clocks for an SF (or 32 if it's one of the double-duration SFs).
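The clock counts quoted here follow from dividing the warp width by the number of lanes of each type. A small worked sketch, taking the 8 MAD + 2 SF per-cluster figures and the 32-wide warp from the post as given (the helper name and the `passes` parameter are assumptions made for the example):

```python
import math

def clocks_per_warp(warp_size, lanes, passes=1):
    """Clocks needed to push one instruction through a whole warp.

    passes models an instruction that occupies a lane for more than
    one clock per element, such as the double-duration SFs.
    """
    return math.ceil(warp_size / lanes) * passes

WARP = 32
MAD_LANES = 8   # MAD units per cluster, from the post
SF_LANES = 2    # SF units per cluster, from the post

print(clocks_per_warp(WARP, MAD_LANES))            # 4
print(clocks_per_warp(WARP, SF_LANES))             # 16
print(clocks_per_warp(WARP, SF_LANES, passes=2))   # 32
```

On these numbers a single SF instruction spans four MAD issue slots, so roughly one SF per four MADs is the most that can be hidden behind the MAD stream.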
Jawed
Jawed...looking like scalar GPU...so, how is the GPU organized? Scalar ALUs, or VEC4+1? We discussed this a bit earlier, however now that more info is out, I'm interested in what you think....:wink:
http://forum.beyond3d.com/showthread.php?p=981827#post981827
That should answer everything!
One thing we don't know is how SFs execute, what variation in instruction duration do they have? Are they all able to complete in 1-clock or are they variable?
Additionally, there may be some fancy hidden stuff in the pipeline that reduces operand lag between dependent instructions. We'll have to wait for the launch to find out how it really works.
Jawed
Keep this in mind: while I like the style H uses in their reviews, it's only one of many methods to use in judging a video card's performance. And his level of playable performance != yours all the time. Food for thought...
Hmm has Brent ever come out and said anything like this before, must be infectious :razz:
DemoCoder
07-May-2007, 20:03
I think you're wrong, because R3xx isn't as simple as it first appears. Originally everyone interpreted it as merely vec3+scalar MAD - it was a long time before the second ALU, ADD, came out of the woodwork.
The ADD on the R300 is much like the 'MUL' on the G80 in the sense that the ADD was doing double duty as both a real ADD ALU as well as implementing the typical DX source modifiers like *2, bias, etc. It would not have been idle, and most likely, the first trivial driver compiler simply used it as a 'mini ALU' with later compiler revisions trying to utilize it more to pack in an extra co-issued ADD when needed. Modifier operations can be trivially scheduled. The R300 had to handle a lot more DX8-port-to-DX9 shader workloads back then, didn't have branches or loops, and the R300's decoupled texture system removes a huge headache of memory hazard scheduling. (Vertex texture fetch compilation on the NV4x anyone?) Needless to say, instruction scheduling is *a lot* easier when you don't have dynamic branching and loops. Most of the work in instruction scheduling algorithms deals with resource bounded loop scheduling, since on CPU architectures, this is the main workload.
In worst-case 1 instruction per clock, R600 will have better throughput than R300, simply because more of R300 will be idle.
Only if your workload had lots of potential co-issuable ADDs. If for example, your shader is 90% MULs, then the ADD will be idle, period, and there's nothing that the compiler can do about it, and likewise, doing a huge amount of work to co-issue ADDs on the R300 may have marginal gains compared to co-issue on the R600.
If I don't do any co-issue ADDs at all on the R300, I'm not losing 50% potential performance over all workloads. But if I don't do any co-issue on the R600, I lose 4/5ths of my performance.
In other words, co-issue scheduling is more important on the R600 else you have a higher probability of wasted potential, but on the R300, co-issuing the ADD is merely 'nice', and that's because the second ALU has less opportunity to be used anyway than the 5 general purpose units on the R600. The R300 could get a lot of mileage out of a trivial compiler on the workloads being sent to it, especially in comparison to the tough job the NV3x had, especially with all of the numerous pipeline hazards in that architecture.
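To make the "4/5ths" argument concrete, here is a toy greedy packer for a 5-wide co-issue machine. It is only a sketch under simple assumptions: all five slots are treated as interchangeable, every op takes one clock, and the op names are invented. It is not a description of how the real R600 compiler packs instructions.

```python
def pack_bundles(ops, width=5):
    """Greedily pack scalar ops into bundles of up to `width` co-issued slots.

    ops is a list of (name, [names it depends on]) in program order.
    An op may only join a bundle if everything it depends on was
    completed in an earlier bundle.
    """
    done = set()
    remaining = list(ops)
    bundles = []
    while remaining:
        bundle = []
        for name, srcs in remaining:
            if len(bundle) < width and all(s in done for s in srcs):
                bundle.append(name)
        if not bundle:
            raise ValueError("unsatisfiable dependency")
        done.update(bundle)
        remaining = [(n, s) for n, s in remaining if n not in done]
        bundles.append(bundle)
    return bundles

def utilisation(bundles, width=5):
    """Fraction of issue slots that actually carried an op."""
    used = sum(len(b) for b in bundles)
    return used / (width * len(bundles))

# One vec4 MUL: four independent scalar ops fit in a single bundle.
vec4_mul = [("mul.x", []), ("mul.y", []), ("mul.z", []), ("mul.w", [])]
# A tightly dependent scalar chain: nothing can be co-issued.
chain = [("s0", []), ("s1", ["s0"]), ("s2", ["s1"]), ("s3", ["s2"])]

print(utilisation(pack_bundles(vec4_mul)))  # 0.8
print(utilisation(pack_bundles(chain)))     # 0.2
```

The vec4 case packs trivially, while the dependent chain leaves four of the five slots empty every clock, which is the loss being described.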
Not to mention that if you have lots of data dependence and short shaders, you could be screwed.
Look, if the 'missing MUL' was permanently missing and never co-issued, would it be so bad for the G80? The MUL is not likely eating up that much real estate since some portion of it is shared by the other SF/interpolator HW, and the chip is performing spectacularly, and appears to be holding its own against a much beefier R600 design, and whereas ATI had an advantage in the drivers of merely being an 'evolution', NVidia had to deal with a revolutionary architectural change, and yet a relatively new architecture seems to be doing well against a mature one. (e.g. not getting its ass handed to it the way the on-paper specs say it should) I'd say that's a testament to the fact that extracting high efficiency from the G80 is a simpler task.
Scheduling within the G80 pipeline is nowhere near as simple as you make out, either. In addition to the programmer-generated shader code, there's the interpolation overhead being carried by the SF unit. That interpolation overhead has to be performed no later than just-in-time for the decoupled TMUs to then use those texture coordinates.
I don't know where I've made it out to be "simple", just "simpler" than the R600. I would say you're making instruction scheduling out to be harder than it is, treating anything more than dependency graph sorting as some kind of programming headache. Something is only a programming headache if you haven't done it before. The guys working on the G80 and R600 compilers are not likely to be newbies to scheduling. And I think it is fairly obvious if you've done any scheduling before, that the less hazards, the better.
I mean, vec3/4 workloads are still going to be dominant in 3D graphics, so even if the R600 driver does a bad job, it's still going to deliver decent performance. I just think it's a lot easier to leave performance on the table for the R600 vs the G80 and that the ATI driver team is going to have to work harder (and IMHO, you will definitely see hand-tweaked shader replacements, especially for benchmark games)
trinibwoy
07-May-2007, 20:03
Hmm has Brent ever come out and said anything like this before, must be infectious :razz:
Well they lost a lot of credibility in my book after undeservedly gushing over the 8600GTS....
The ADD on the R300 is much like the 'MUL' on the G80 in the sense that, the ADD was doing double duty as both a real ADD ALU as well as implementing the typical DX source modifiers like *2, bias, etc. It would not have been idle, and most likely, the first trivial driver compiler simply used it as a 'mini ALU' with later compiler revisions trying to utilize it more to pack in an extra co-issued ADD when needed.
Precisely my point, that in SM2/3 code, R300 needed a more-capable compiler than was originally considered, based upon the simplistic DX8 view.
Only if your workload had lots of potential co-issuable ADDs. If for example, your shader is 90% MULs, then the ADD will be idle, period, and there's nothing that the compiler can do about it, and likewise, doing a huge amount of work to co-issue ADDs on the R300 may have marginal gains compared to co-issue on the R600.
If I don't do any co-issue ADDs at all on the R300, I'm not losing 50% potential performance over all workloads. But if I don't do any co-issue on the R600, I lose 4/5ths of my performance.
Huh? A sequence of vec4 MULs on R300 leaves ADD idle, but on R600 it's trivially issuable across 4/5ths of the scalar ALUs. You're asserting that issuing a single vector instruction as a co-issue across 5 scalar ALUs is difficult :?: :!:
In other words, co-issue scheduling is more important on the R600 else you have a higher probability of wasted potential, but on the R300, co-issuing the ADD is merely 'nice', and that's because the second ALU has less opportunity to be used anyway than the 5 general purpose units on the R600. The R300 could get a lot of mileage out of a trivial compiler on the workloads being sent to it, especially in comparison to the tough job the NV3x had, especially with all of the numerous pipeline hazards in that architecture.
For what it's worth, I agree that an idling ADD unit is not a disaster - compared with a MUL (MAD) unit, it's simpler - resulting in less transistors doing sod all.
Not to mention that if you have lots of data dependence and short shaders, you could be screwed.
Short shaders aren't a problem because the shader export block is going to be the throughput limit (post transform cache, streamout bandwidth, fillrate, whatever follows on once the shader has completed).
But if you issue scalar-only code on R600 with tight dependencies you'll certainly get far worse performance than G80. You won't get me arguing with that. (Unless there's something really funky in R600 we're not aware of...)
Look, if the 'missing MUL' was permanently missing and never co-issued, would it be so bad for the G80? The MUL is not likely eating up that much real estate since some portion of it is shared by the other SF/interpolator HW,
I agree.
and the chip is performing spectacularly, and appears to be holding its own against a much beefier R600 design, and whereas ATI had an advantage in the drivers of merely being an 'evolution', NVidia had to deal with a revolutionary architectural change, and yet a relatively new architecture seems to be doing well against a mature one. (e.g. not getting its ass handed to it the way the on-paper specs say it should) I'd say that's a testament to the fact that extracting high efficiency from the G80 is a simpler task.
When you find an example of an ALU-limited game that backs up this assertion, I'd be intrigued to see the numbers.
In terms of ALU capability, all we have is anecdotal evidence that G80 is no more capable than R580 at folding@home or GPGPU ray-tracing. A very rough indication, unquestionably and biased against G80 simply because they'll all be vector-instruction dominated.
I mean, vec3/4 workloads are still going to be dominant in 3D graphics, so even if the R600 driver does a bad job, it's still going to deliver decent performance.
On this kind of workload it'll easily have better throughput per ALU "unit" than R3xx...R5xx. Of all the "easy" bits of the R600 compiler's job, that's the easiest.
I just think it's a lot easier to leave performance on the table for the R600 vs the G80 and that the ATI driver team is going to have to work harder (and IMHO, you will definitely see hand-tweaked shader replacements, especially for benchmark games)
I think they each have their corner cases: R600's are tightly-dependent scalar/vec2 instructions, while G80's are centred on bottlenecking in the SF/MI.
Jawed
Silent_Buddha
07-May-2007, 20:59
I just think it's a lot easier to leave performance on the table for the R600 vs the G80 and that the ATI driver team is going to have to work harder (and IMHO, you will definitely see hand-tweaked shader replacements, especially for benchmark games)
So, when R600 finally reaches retail, and if at that point in time R600 drivers are better and more stable than G80 drivers, does that mean the NV driver team is understaffed or lazy, considering it's easier to make a driver for the G80 architecture than the R600 architecture?
Not talking about G80 drivers when G80 launched, but G80 drivers when R600 launches. Presumably G80 is easier to write a driver for according to DemoCoder and they've had a longer time to work on them.
However, if R600 drivers end up being worse than G80 drivers, I guess that would prove his point, no?
Or maybe I'm oversimplifying things, and R600 ends up having twice the resources assigned to it for driver development compared to G80?
It'll be interesting to see whether Jawed or DemoCoder is more or less correct in their interpretations of how G80 and R600 work. In either case, it makes for some very interesting and enlightening reading.
Regards,
SB
So, when R600 finally reaches retail and if at that point in time R600 drivers are better and more stable than G80 drivers.
Relative to current games, R600 and G80 have such an over-abundance of ALU capability (see also R580), that you're unlikely to find much evidence for a better "ALU-compiler" in either camp.
Jawed
Amazing.. the XT prices drop even before the card is out in retail yet..
€357 in holland.. (sapphire)
http://www.icomputers.nl//articledetail.aspx?A_ID=10878
€369 for the HIS
http://www.sallandautomatisering.nl/?redirect=/content/productinfo.php?pid=34368
On the topic of driver performance as far as I heard, ati went with stability first, performance later.
Silent_Buddha
07-May-2007, 21:24
Relative to current games, R600 and G80 have such an over-abundance of ALU capability (see also R580), that you're unlikely to find much evidence in current games for a better "ALU-compiler" in either camp.
Jawed
Do you think the DX10 patched games will be able to stress the ALU-compiler? Or is it something we'll have to wait for native DX10 apps to be able to judge with any degree of certainty?
Or is it something that upcoming late gen DX9 games might be able to expose?
If performance for both ends up being roughly the same, then it comes down to Price, Driver Stability, and IQ as the determining factors for purchase I would imagine.
Regards,
SB
willardjuice
07-May-2007, 21:33
Brent Justice over at [H] makes me nervous again. :???:
http://www.hardforum.com/showpost.php?p=1031020107&postcount=28
It would be nice to fall asleep tonight and wake up on May 14th. This waiting is killing me. :lol:
That post by Brent is fairly confusing though because I am not sure which 3DMark scores he is referencing. :???:
The link in the first post of that thread shows the 2900 XT getting low 3DMark scores, while a few posts up someone linked to a site showing the 2900 XT getting high 3DMark scores. So is Brent trying to hint that the 2900 XT has high real world performance (as opposed to its low 3DMark score) or that the 2900 XT has low real world performance (as opposed to its high 3DMark score)? I don't know. :sad:
I think that Brent is saying that an overclocked 2900 that reaches Ultra performance in 3dmark is not a guarantee that all game benchmarks will show similar results..
trinibwoy
07-May-2007, 21:44
Or maybe I'm oversimplifying things
Yeah you are. I am sure that most of the Vista and game issues G80 is facing have nothing to do with the shader compiler.
willardjuice
07-May-2007, 22:21
I think that Brent is saying that an overclocked 2900 that reaches Ultra performance in 3dmark is not a guarantee that all game benchmarks will show similar results..
Since when does [H] overclock in their reviews? I doubt Brent is talking about the performance of an overclocked 2900 XT, probably the general performance of the R600.
Amazing.. the XT prices drop even before the card is out in retail yet..
It does not seem like a good sign..
expletive
07-May-2007, 22:39
I don't know if this was announced or if it's appropriate in this thread, but there's a system integrator/reseller on AVS that may have let slip the MSRP of the 2900XT being $399. Sorry if this is useless news but thought I'd post it here. He also mentioned that the card can only do 2 channel LPCM over HDMI (but wasn't 100% on that), much like the 360 Elite.
Mintmaster
07-May-2007, 22:41
Huh? A sequence of vec4 MULs on R300 leaves ADD idle, but on R600 it's trivially issuable across 4/5ths of the scalar ALUs. You're asserting that issuing a single vector instruction as a co-issue across 5 scalar ALUs is difficult :?: :!:
He was talking about the penalty in efficiency when the compiler can't co-issue, not comparing R600 to R300 for the same instruction.
A 4-instruction scalar expression like A = B * (C + D * (E + F * (G + H))) can't be packed due to dependency issues, and would run at 20% efficiency on R600. It would run at 80% efficiency on G80 (SF will be idle).
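The 20% and 80% figures can be reproduced with a toy model. This is only a sketch: the greedy packer below is a hypothetical stand-in for a real shader compiler, and the unit counts (5 scalar slots per R600 ALU, 8 MAD + 2 SF per G80 cluster) are taken from the posts in this thread, not from vendor documentation.

```python
# Toy model only: the scheduler is an assumption, not ATI's or NVIDIA's compiler.

def pack_cycles(deps, width):
    """Greedy packing: each clock, issue up to `width` ops whose inputs are already done."""
    done = set()
    pending = dict(deps)
    cycles = 0
    while pending:
        ready = [op for op, srcs in pending.items() if set(srcs) <= done][:width]
        if not ready:
            raise ValueError("dependency cycle in instruction list")
        for op in ready:
            del pending[op]
        done.update(ready)
        cycles += 1
    return cycles

# A = B * (C + D * (E + F * (G + H))) written as four dependent instructions.
chain = {
    "t0": [],        # ADD t0 = G + H
    "t1": ["t0"],    # MAD t1 = E + F * t0
    "t2": ["t1"],    # MAD t2 = C + D * t1
    "A": ["t2"],     # MUL A  = B * t2
}

R600_SLOTS = 5  # scalar slots that can be co-issued per pixel per clock
chain_cycles = pack_cycles(chain, R600_SLOTS)
r600_util = len(chain) / (chain_cycles * R600_SLOTS)

# For contrast: five independent scalar ops fill every slot in one clock.
independent = {name: [] for name in "abcde"}
independent_cycles = pack_cycles(independent, R600_SLOTS)

# G80 issues one scalar op at a time, so the MAD units stay busy on the
# chain while the SF units have nothing to do.
G80_MAD, G80_SF = 8, 2
g80_util = G80_MAD / (G80_MAD + G80_SF)

print(f"R600 chain: {chain_cycles} clocks, {r600_util:.0%} of slots used")
print(f"R600 independent ops: {independent_cycles} clock")
print(f"G80 chain: {g80_util:.0%} of units used")
```

The point of the contrast case is that the penalty comes from the dependency chain, not from the 5-wide design as such: independent scalar work packs perfectly.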
Hmm what about fetches from various (up to 8) vertex buffers? Or do you think D3D10 functionality is irrelevant?
So, what you're saying is that reduced peak throughput is ok?
Yup. Even if VTF or VB fetches are half as efficient as pixel shader fetches (which may not be the case, because batches of 16 don't necessarily halve performance from batches of 32 even with the same batch count), G80 still has many times more vertex performance than G70. It would have to be some kind of pathological workload for this design choice to really hurt G80.
On the other hand, 16-vertex granularity probably doesn't offer much, so who knows why they did it.
I'm not arguing that sequential-scalar is wrong - I'm arguing that it's structurally a much bigger change from R5xx than R600 currently is. Most obviously ATI would have needed to increase either the number of batches in flight or widened the ALUs. They're both significant structural changes, based on the 4-way tiled architecture that already exists. That's what I'm saying is "fundamental".
But I just explained to you a modification of Xenos that doesn't require either of those. It's really just analogous to SOA vs AOS. Same vector size, and about the same number of read ports and write ports.
As to your "4-way tile architecture" idea, I assume you're talking about tiling the 64 pixels in a vector into 4 groups, right? Tiling them by channel is really no different than tiling them by location. That's what I mean by SOA. It's trivial stuff.
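For anyone not familiar with the SOA vs AOS shorthand, here is a minimal sketch of the two layouts for one 64-pixel vector. The variable names and the 16-pixel tile size are illustrative assumptions; nothing here describes the actual register file of Xenos, R600 or G80.

```python
# Illustrative layouts only; the names and tile size are assumptions.
CHANNELS = "rgba"
BATCH = 64

# AoS (array of structures): one rgba record per pixel, as a vec4 ALU sees it.
aos = [tuple(float(p * 4 + c) for c in range(4)) for p in range(BATCH)]

# SoA (structure of arrays): one 64-entry array per channel, as a scalar ALU
# working through a batch channel by channel would see it.
soa = {name: [pixel[i] for pixel in aos] for i, name in enumerate(CHANNELS)}

# Tiling by location: four groups of 16 whole pixels.
tiles_by_location = [aos[i:i + 16] for i in range(0, BATCH, 16)]

# Tiling by channel: four groups of 64 same-channel values (this is just SoA).
tiles_by_channel = [soa[name] for name in CHANNELS]

# A scalar instruction such as "MUL r, r, 2.0" touches exactly one SoA array.
soa["r"] = [x * 2.0 for x in soa["r"]]

# Converting back shows both layouts hold the same data.
back = list(zip(*(soa[name] for name in CHANNELS)))
print(back[1])
```

Either tiling splits the same 256 values into four equal groups, which is the sense in which the two are "really no different".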
And, maybe, having a minimum loop length like that burns all that was gained by implementing a dedicated branching pipeline? Why do CPUs invest so much effort in branch prediction?
You're not understanding what I'm trying to say. Look back to my modified Xenos example again.
In the original Xenos, it would take 4 cycles for the shader array to process an instruction for a batch. I don't know its branching limitation, but the fastest it could branch, then, is once every 4 cycles (i.e. a single instruction loop in the shader code). For the modified Xenos, we could keep that same branching rate. It would be no slower than the original. Boost the non-branching issue rate to allow the instruction to change every cycle, and you now have perfect scalar efficiency without any dependency limitations and a simple compiler to boot.
Now if we compare this to a third option of a 16 shader 5x1D architecture like R600, it would execute 5 instructions in parallel on 16 pixels at a time, and it would take 4 clocks to get through a batch. The instruction issue rate is up to 5/4 per clock, so this design is no simpler in that respect. Branching is no better than the other design either.
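The issue-rate arithmetic for the three options can be written out as a short sketch. It assumes a 64-pixel batch and a 16-lane shader array for all three designs, which is how the numbers in these posts appear to be derived; the names are made up for illustration.

```python
from fractions import Fraction

# Assumed parameters: 64-pixel batch, 16 pixels processed per clock.
BATCH = 64
LANES = 16

clocks_per_batch = BATCH // LANES  # clocks to push one batch through the array

# Original Xenos: one instruction per batch, so a new instruction
# (or a branch) at most once every `clocks_per_batch` clocks.
xenos_issue = Fraction(1, clocks_per_batch)

# Modified Xenos: non-branching instructions may change every clock,
# while branching stays at the original rate.
modified_issue = Fraction(1, 1)
modified_branch = Fraction(1, clocks_per_batch)

# R600-like 5x1D: up to 5 co-issued scalar instructions per batch pass.
r600_issue = Fraction(5, clocks_per_batch)
r600_branch = Fraction(1, clocks_per_batch)

print("clocks per batch:", clocks_per_batch)
print("issue rate per clock:", xenos_issue, modified_issue, r600_issue)
print("branch rate per clock:", modified_branch, r600_branch)
```

Under these assumptions the 5x1D design has to sustain a higher peak issue rate (5/4) than the one-instruction-per-clock scalar design, with the same branch rate.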
Except in G80 the MAD and SF/MI are co-issued.
Not sure what you're saying. I'm suggesting that you treat the SF instructions like a texture instruction, then compile just like you'd do for R300 (except, of course, no Vec3+scalar co-issuing to worry about). It's an identical problem, so G80's design doesn't add any new difficulties for the compiler; rather, it makes it easier.
Yep, that's how G80 co-issues that particular pair of instructions, as long as the vec instruction is identical for all four channels.
G80 is actually 16x (8x MAD + 2x SF). A 32-wide warp therefore takes 16 clocks for an SF (or 32 if it's one of the double-duration SFs).
I wrote it that way for simplicity. Whether it's 16 SP + 4 SF or 2x(8 MAD + 8 quarter-throughput SF) doesn't really matter.
I'm just pointing out that compiling for the SF in G80 is no harder than it was for a Vec4+scalar architecture. The worst case scenario is a lack of improvement in efficiency when the old compiler co-issued something smaller than a vec4 with the scalar.
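The warp timings quoted above follow directly from the unit counts. A minimal sketch, assuming each cluster has 8 MAD and 2 SF units as stated in the quoted post and that a warp is simply streamed through them (the helper name is hypothetical):

```python
def clocks_per_warp(warp_size, units, passes=1):
    """Clocks for one instruction over a whole warp on `units` parallel lanes."""
    return (warp_size // units) * passes

WARP = 32  # warp width used in the post above

mad_clocks = clocks_per_warp(WARP, units=8)                  # MAD units
sf_clocks = clocks_per_warp(WARP, units=2)                   # special-function units
sf_double_clocks = clocks_per_warp(WARP, units=2, passes=2)  # double-duration SF

print(mad_clocks, sf_clocks, sf_double_clocks)
```

With these numbers one SF instruction occupies the SF units for as long as four MAD instructions occupy the MAD units, which is why scheduling it like a long-latency texture instruction is a natural way for a compiler to think about it.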
Amazing.. the XT prices drop even before the card is out in retail yet..
€357 in holland.. (sapphire)
http://www.icomputers.nl//articledetail.aspx?A_ID=10878
€369 for the HIS
http://www.sallandautomatisering.nl/?redirect=/content/productinfo.php?pid=34368
On the topic of driver performance as far as I heard, ati went with stability first, performance later.
Lower prices for different brands and in different shops doesn't mean the prices are dropping before launch [wink]
He was talking about the penalty in efficiency when the compiler can't co-issue, not comparing R600 to R300 for the same instruction.
"Can't co-issue" is a straw man (how else do I interpret his "lose 4/5ths"?), because general shader code (vec3 or vec4 operations) is full of trivially-schedulable co-issues across the R600's 5 scalar ALUs - if the R600 compiler treats it as a fixed vec4+SF pipeline, it will generally have better utilisation of its ALU components (i.e. RGBA is 4 components) than R300.
So whatever corner-case scenarios of being "unable to co-issue" that afflict the "v0.9" R600 compiler, it won't be for trivial cases such as single vec3/4 instructions.
A 4-instruction scalar expression like A = B * (C + D * (E + F * (G + H))) can't be packed due to dependency issues, and would run at 20% efficiency on R600. It would run at 80% efficiency on G80 (SF will be idle).
Given that I keep pointing at this case, don't colour me surprised.
Yup. Even if VTF or VB fetches are half as efficient as pixel shader fetches (which may not be the case, because batches of 16 don't necessarily halve performance from batches of 32 even with the same batch count), G80 still has many times more vertex performance than G70. It would have to be some kind of pathological workload for this design choice to really hurt G80.
Hey, hopefully we'll see with D3D10 code, eh?
On the other hand, 16-vertex granularity probably doesn't offer much, so who knows why they did it.
Yeah, hence the other thread.
As to your "4-way tile architecture" idea, I assume you're talking about tiling the 64 pixels in a vector into 4 groups, right?
No. Think of SuperAA, how since R300, tiled screen-space regions have been a key part of the architecture. Once a triangle has been setup and split into regions that cover independent screen-space tiles, all further operations on that portion of the triangle can progress entirely independently of the other screen-space tiles.
R420, R520 and now R600 (apparently - with the caveat of the L2/L3 cache trick) all have 4 independent rasterisation/pixel-shading/RBE pipelines (I refer to these as shader units, for what it's worth) - they're all 4-way tiled.
If R600 was 4 Xenos GPUs tied together, where each quarter is:
1 rasteriser
1 sequencer
4x 4-way ALUs (scalar, say) (Xenos is 3x 16-way)
1x 4-way TU (texture and vertex fetch combined) (Xenos is 1x 16-way - excluding vertex fetch)
each sequencer would be more costly simply because there's more ALU batches in flight (four, instead of R600's one) whilst having to support the same operand fetch complexity per clock (as R600 currently does). Additionally, shader export would have increased complexity, now having to deal with more input queues than we see in R600 (because of the increased number of independent ALU pipelines).
I'm merely guessing at their motivations, suggesting that from the R5xx baseline the 5x superscalar ALU pipeline is evolutionary - and that the 4-way tiled architecture is, in a way, a constraint.
I'm just pointing out that compiling for the SF in G80 is no harder than it was for a Vec4+scalar architecture. The worst case scenario is a lack of improvement in efficiency when the old compiler co-issued something smaller than a vec4 with the scalar.
(Xenos is actually the only recent vec4+scalar architecture GPU that matches your criteria here. R5xx and NV4x both have two vector instruction ALUs to be co-issued (as well as SF).)
So, G80 is, in one sense easier than these older GPUs. But the timing of SFs and the added workload of interpolations cut into those benefits, at least in terms of compiler implementation if not efficiency. G80 is, generally, radically more efficient than G7x so don't think I'm down on it. The goodness in G80 is intensified by the badness that came before.
R600 seems to be a revision of the R3xx...R5xx superscalar ALU, tweaked for utilisation while keeping the surrounding hardware as unchanged as possible. That's how I read it.
One thing we don't know is R600's throughput for integer instructions and other funnies that D3D10 introduces. Are all ALUs equally able or is R600 asymmetric in this respect? I can't remember the capability of G80 in this respect. Does it matter if these new instructions are not first class citizens? Have these new instructions influenced the ALU organisation?
Jawed
it does not seem like a good sign..
It is for us as consumers. ;-)
Mintmaster
08-May-2007, 01:36
"Can't co-issue" is a straw man (how else do I interpret his "lose 4/5ths"?), because general shader code (vec3 or vec4 operations) is full of trivially-schedulable co-issues across the R600's 5 scalar ALUs - if the R600 compiler treats it as a fixed vec4+SF pipeline, it will generally have better utilisation of its ALU components (i.e. RGBA is 4 components) than R300.
So whatever corner-case scenarios of being "unable to co-issue" that afflict the "v0.9" R600 compiler, it won't be for trivial cases such as single vec3/4 instructions.
I don't know how I can explain it any simpler than I did in the last post. For the last time, DC is not talking about a vector instruction on R300 vs. a vector instruction on R600. He is simply pointing out that if ADD co-issue wasn't done on R300, no big deal. If scalar co-issue isn't done on R600, it is a big deal. He's talking about extracting the max throughput possible. Shader code is more likely to have lots of scalar and vec2 code than a crapload of ADDs.
ATI obviously put a lot of work and die area into making R600 a 5x1D architecture, clearly for the purpose of improving speed more than could be done by extra Vec4+1D units instead. Not getting co-issue right in the compiler for R600 would be a lot more damaging (especially in comparison to G80) than with R300.
R420, R520 and now R600 (apparently - with the caveat of the L2/L3 cache trick) all have 4 independent rasterisation/pixel-shading/RBE pipelines (I refer to these as shader units, for what it's worth) - they're all 4-way tiled.
Oh, okay.
So basically you're suggesting the same rationale that I did previously (http://forum.beyond3d.com/showthread.php?p=981888#post981888). The batch size is too small for one quarter of the ALUs to all operate on the same channel like in G80.
I still think that's the way to go. Just use a 64-pixel batch size. It won't make that much difference, and I certainly think it would be much less than from near-perfect utilization.
(Xenos is actually the only recent vec4+scalar architecture GPU that matches your criteria here. R5xx and NV4x both have two vector instruction ALUs to be co-issued (as well as SF).)
Yeah, I know. I just thought it would be best for comparison to isolate my point. Compiling for Xenos is even simpler than for R5xx/NV4x.
A side note: everyone planning to use watercooling on a 2900 should know that the shim is higher than the core, which will result in the need for an extra coldplate or a custom waterblock. Removing the shim turns out to be harder than it looks.
Let's hope that EK will come out with another awesome waterblock for the R600 then. What are the actual measurements of the core and the shim?
mushroom
08-May-2007, 04:11
So, when R600 finally reaches retail, if at that point in time R600 drivers are better and more stable than G80 drivers, does that mean the NV driver team is understaffed or lazy, considering it's easier to make a driver for the G80 architecture than the R600 architecture?
However, in the same post DemoCoder also says - which you seemed to have ignored :-
ATI had an advantage in the drivers of merely being an 'evolution', NVidia had to deal with a revolutionary architectural change, and yet a relatively new architecture seems to be doing well against a mature one. (e.g. not getting it's ass handed to it the way the on-paper specs say it should) I'd say that's a testament to the fact that extracting high efficiency from the G80 is a simpler task.
Which implies Nvidia will not have as mature driver as ATI's since G80 is a bigger departure from its own previous architectures as compared to ATI...
So I think you are oversimplifying the situation indeed...
ATi to launch 5 Mobile GPUs on 14 May 2007 (http://hdtvsg.blogspot.com/2007/05/ati-to-launch-5-mobile-gpus-on-14-may.html)
ATI Mobility Radeon™ HD 2300
ATI Mobility Radeon™ HD 2400
ATI Mobility Radeon™ HD 2400XT
ATI Mobility Radeon™ HD 2600
ATI Mobility Radeon™ HD 2600XT
Some 2900XT photos (http://my.ocworkbench.com/bbs/showthread.php?p=412421#post412421), I can see only 2 of them cause I am not a registered member there.
Did I mention that I'm planning to buy a laptop this summer? :grin:
Tho IHVs looking for a fair and impartial evaluation of a fully-disclosed "Vista Premium review laptop" should feel free to drop me a line! :yep2:
I wonder if the "Naked Iceman" Sapphire HD2900XT is going to be a factory OC'ed card... There must be a reason for using a custom three heatpipe HSF instead of the reference design two heatpipe version.
memberSince97
08-May-2007, 07:11
lets see some numbers mushroom :)
Some 2900XT photos (http://my.ocworkbench.com/bbs/showthread.php?p=412421#post412421), I can see only 2 of them cause I am not a registered member there.
Tasty.. R600 with 07-10 Datestamp .. printed on the metal edge of the package and not on top of the die as we're used to.. revision A13.
The bottom right corner in this shot:
http://my.ocworkbench.com/bbs/attachment.php?attachmentid=380&d=1178590395
is the right corner in the roden shot
http://vgacentral.files.wordpress.com/2006/11/ati_r600_1.jpg
The package itself looks much healthier...
http://i18.tinypic.com/677o8xf_th (http://i14.tinypic.com/677o8xf.jpg)
http://i14.tinypic.com/628znt0_th (http://i14.tinypic.com/628znt0.jpg)
epicstruggle
08-May-2007, 08:49
http://i18.tinypic.com/677o8xf_th (http://i14.tinypic.com/677o8xf.jpg)
http://i14.tinypic.com/628znt0_th (http://i14.tinypic.com/628znt0.jpg)
Thats a lot better than the videos that floated a few weeks ago.
First picture: the level of detail is really impressive. But I can't believe the left edge of the chin is blocky... did they run out of polygons or what?
Tasty.. R600 with 07-10 Datestamp .. printed on the metal edge of the package and not on top of the die as we're used to..
Is that doubling as the shim?
flopper
08-May-2007, 09:20
Jawed,
with the intel available right now, and in plainer English without too much tech lingo (I have no reference points for it),
what does the difference between Nvidia's and ATI's cards really mean for performance?
We can talk DX9 and XP or DX10 and Vista,
OpenGL and D3D.
I understand that ATI did something that seems more future-oriented: efficiency with longer shaders and more DX10.
I am just a user of the cards. I overclock a little and read forums like these just for fun, even if words like vector and TMU mean little to me.
If you would take what is known and guessed as of now, and translate and calculate what that means for games that are out now and are coming, I'd be interested to hear how you think the cards will do.
3DMARK06: 8800Ultra(default) VS. 2900XT(oc)
http://r800.blogspot.com/2007/05/3dmark06-8800ultra-vs-2900xt.html
3DMARK06
He probably overclocked until he reached the desired result :wink:. What I find interesting is that the 2900XT has a relatively higher SM3 score (SM2 score is 267 points lower, but SM3 score is 364 points higher, 5645 vs 6009). Or alternatively - it's just noise...
AnarchX
08-May-2007, 10:04
http://img91.imageshack.us/img91/8485/msi2900xt01zj0.jpg
http://img91.imageshack.us/img91/9803/msi2900xt12hu3.jpg
16 memory chips...
http://forum.coolaler.com/showthread.php?t=152864 More inside!
SCDA PIC Compared :R600 VS. G80 VS. R580
http://r800.blogspot.com/2007/05/scda-pic-compared-r600-vs-g80-vs-r580.html
trinibwoy
08-May-2007, 11:50
SCDA PIC Compared :R600 VS. G80 VS. R580
http://r800.blogspot.com/2007/05/scda-pic-compared-r600-vs-g80-vs-r580.html
Yaaay more single screenshot performance numbers!
Cuthalu
08-May-2007, 12:01
That wire tray floor looks bad with R600.
Yaaay more single screenshot performance numbers!
well 3dmark really doesn't mean much, but at least it's better than the game screenshots lol.
Look at the detail on the ground, it's all blurred out in the r600 supposed screenshot again :/, looks better for AA, but AF is shot to hell or something is going on.
That wire tray floor looks bad with R600.
it just looks different to me, not bad (different lighting)
That wire tray floor looks bad with R600.
The mspaint-style compression isn't helping one bit... to me all the pictures look terrible because of it. It's hard to tell if the shitty compression is fudding the edges or if there's really a difference :mad:
http://i18.tinypic.com/677o8xf_th (http://i14.tinypic.com/677o8xf.jpg)
Looks like she's about to throw up :razz: Seriously, where's the Dangerous Curves Ruby? She was hot...
trinibwoy
08-May-2007, 12:34
Why are they comparing IQ on the X1900GT to 8800GTX :?:
Arnold Beckenbauer
08-May-2007, 12:40
http://www.forum-3dcenter.org/vbulletin/showthread.php?p=5474374#post5474374
Click on "spoiler" to see all graphics.
8 SPs per SM (streaming multi-processor), 512 threads per SM???
Why is the SFU "outside"?
Hm, is it possible, that it's G80?
Why are they comparing IQ on the X1900GT to 8800GTX :?:
that's just some screenshots thrown together on different cards. it's also there to show the "bugs" on g80/r600 in relation to the older hardware...
Dalton Sleeper
08-May-2007, 12:52
All those SC pics look like ****, like they've been blurred and then sharpened a few times.
Yaaay more single screenshot performance numbers!
What if I told you the GTX card was extremely OCed? Also, please pay attention to the picture quality of the GTX. By the way, what's the current retail price of the GTX? If you could tell me, I would appreciate it.
That wire tray floor looks bad with R600.
well 3dmark really doesn't mean much, but at least it's better than the game screenshots lol.
Look at the detail on the ground, it's all blurred out in the r600 supposed screenshot again :/, looks better for AA, but AF is shot to hell or something is going on.
Hey, you two guys, you'd better play SCDA yourselves: when Fisher reaches the top of the Shanghai mansion, it's raining heavily with strong wind, so the ground detail is changing every second.
it just looks different to me, not bad (different lighting)
correct.
Why are they comparing IQ on the X1900GT to 8800GTX :?:
I don't have too much time, you can compare IQ between 2900XT and 8800 GTX for me, thank you.
SCDA PIC Compared :R600 VS. G80 VS. R580
http://r800.blogspot.com/2007/05/scda-pic-compared-r600-vs-g80-vs-r580.html
The guy used the older drivers, same as XS guy Denny. Use the 8.37 they are much much better believe me. :wink:
Plus be careful with that. Denny doesn't have a R600XT 512MB GDDR3, but he has an OEM workstation version 1GB GDDR4. :wink:
And Brent meant a good thing for the R600, not a bad thing. :wink:
Just hold on.
Put it this way, and I've posted this elsewhere too. In mid-November on exactly the same platform and test bench (MB is nV 680i and thus optimized to some extent for nV) the G80GTX posted a 11500 3DM06 score at best with the QX6700.
The HD2900XT 512MB GDDR3 beats that on the same platform with pre-release drivers.
The difference between the two cards is what Kombatant will also know- R600XT is a massive OC'er, better than G80GTX and G80U and PCIe frequency scaling is also very good.
itsmydamnation
08-May-2007, 13:28
this thread is getting worse than cartman waiting for the wii,
only 6 more days till r600.....
what time UTC is the NDA no more?
The guy used the older drivers, same as XS guy Denny. Use the 8.37 they are much much better believe me. :wink:
Plus be careful with that. Denny doesn't have a R600XT 512MB GDDR3, but he has an OEM workstation version 1GB GDDR4. :wink:
And Brent meant a good thing for the R600, not a bad thing. :wink:
Just hold on.
Put it this way, and I've posted this elsewhere too. In mid-November on exactly the same platform and test bench (MB is nV 680i and thus optimized to some extent for nV) the G80GTX posted a 11500 3DM06 score at best with the QX6700.
The HD2900XT 512MB GDDR3 beats that on the same platform with pre-release drivers.
The difference between the two cards is what Kombatant will also know- R600XT is a massive OC'er, better than G80GTX and G80U and PCIe frequency scaling is also very good.
I do believe you, buddy.
The guy used the older drivers, same as XS guy Denny. Use the 8.37 they are much much better believe me. :wink:
Plus be careful with that. Denny doesn't have a R600XT 512MB GDDR3, but he has an OEM workstation version 1GB GDDR4. :wink:
And Brent meant a good thing for the R600, not a bad thing. :wink:
Just hold on.
Put it this way, and I've posted this elsewhere too. In mid-November on exactly the same platform and test bench (MB is nV 680i and thus optimized to some extent for nV) the G80GTX posted a 11500 3DM06 score at best with the QX6700.
The HD2900XT 512MB GDDR3 beats that on the same platform with pre-release drivers.
The difference between the two cards is what Kombatant will also know- R600XT is a massive OC'er, better than G80GTX and G80U and PCIe frequency scaling is also very good.
waouuu impressive ! kudos ATI/AMD for bringing better product than 6 months old G80 with better process :roll:
well if it's true, I hope ATI/AMD will enjoy these coming 3 months of performance lead because when G92 will arrive, R600/650 will look like a toy, same way as R580 was a toy compared to G80...
Galduta
08-May-2007, 13:51
Yaaay more single screenshot performance numbers!
An IQ comparison is not possible at this point ;) it is a variable scene, with water particles etc.
vertex_shader
08-May-2007, 13:53
waouuu impressive ! kudos ATI/AMD for bringing better product than 6 months old G80 with better process :roll:
well if it's true, I hope ATI/AMD will enjoy these coming 3 months of performance lead because when G92 will arrive, R600/650 will look like a toy, same way as R580 was a toy compared to G80...
:lol:
waouuu impressive ! kudos ATI/AMD for bringing better product than 6 months old G80 with better process :roll:
well if it's true, I hope ATI/AMD will enjoy these coming 3 months of performance lead because when G92 will arrive, R600/650 will look like a toy, same way as R580 was a toy compared to G80...
Wasn't G9x a bit delayed and now supposed to arrive in Dec 07/Jan 08? The 3 months you're talking about would indicate a release in Aug/Sep which should coincide with the release of R650.
Fornowagain
08-May-2007, 14:08
That wire tray floor looks bad with R600.
Where is the ground mist, its on the GTX pic?
Sound_Card
08-May-2007, 14:23
waouuu impressive ! kudos ATI/AMD for bringing better product than 6 months old G80 with better process :roll:
well if it's true, I hope ATI/AMD will enjoy these coming 3 months of performance lead because when G92 will arrive, R600/650 will look like a toy, same way as R580 was a toy compared to G80...
I sure am going to enjoy my R600.:razz:
The R600 must die a horrible death...just because it seems to have spawned a wanking generation of single screenshot releasing ZOMG OWNORRZZZZZZZ ROXXXORZ ZA BEST forumites/internetites whatever. It`s a sexy architecture, kudos to ATi for bringing an arguably good product to the market, but this capital sin cannot be forgiven. BUUUUURN:D
The guy used the older drivers, same as XS guy Denny. Use the 8.37 they are much much better believe me. :wink:
Plus be careful with that. Denny doesn't have a R600XT 512MB GDDR3, but he has an OEM workstation version 1GB GDDR4. :wink:
And Brent meant a good thing for the R600, not a bad thing. :wink:
Just hold on.
Put it this way, and I've posted this elsewhere too. In mid-November on exactly the same platform and test bench (MB is nV 680i and thus optimized to some extent for nV) the G80GTX posted a 11500 3DM06 score at best with the QX6700.
The HD2900XT 512MB GDDR3 beats that on the same platform with pre-release drivers.
The difference between the two cards is what Kombatant will also know- R600XT is a massive OC'er, better than G80GTX and G80U and PCIe frequency scaling is also very good.
I do believe you, buddy.
Brent was hinting that 3dmark scores matter shit to be frank, and yes he didn't mean it in a good way, the r600 will do well in 3dmark...... And the new drivers you guys are talking about 3% improvement overall in games, but gives like 700-1000k 3dmark 06 bonus
The R600 must die a horrible death...just because it seems to have spawned a wanking generation of single screenshot releasing ZOMG OWNORRZZZZZZZ ROXXXORZ ZA BEST forumites/internetites whatever. It`s a sexy architecture, kudos to ATi for bringing an arguably good product to the market, but this capital sin cannot be forgiven. BUUUUURN:D
I've seen driver profiles especially for FRAPS.. ;)
dizietsma
08-May-2007, 14:46
How come R600 did not get scrapped for R620 this time like the previous generations of R4 and R5 before it?
How come R600 did not get scrapped for R620 this time like the previous generations of R4 and R5 before it?
uhm..
R300 and R400 were developed quite close to each other.
R400 was also not performing the way ATI expected it. So instead of going for this radical "unified" design ati Pumped R350, tweaked it, added a few new tricks and launched it as R420.
R400 project was picked up as R500 which.. again could not fulfill expectations as a PC part. R520 is a contingency plan on that part. don't see it as a last ditch attempt. but quite early in the development of that project it was clear that it would not perform in the current environment.
That's why R500 was placed in Xenos. having a clear environment to start from was obviously more beneficial than trying to fit the unified design in a DX9 suit.
The reason it didn't get "scrapped" for R620 is, in my view, simply because this new architecture simply did not have a back-up plan.. there isn't something like bolting a few alu's on R580 and hoping it will do DX10.
Heh, the roadmap/codename juggling that began in 2002 finally rejoins the main line.
Btw, I'm feeling cranky today. If I see another rolly eyes sarcasm-dripping troll in this thread today there will be one or more vacations handed out that will last until a few days after the expected R600 launch. :twisted:
bigtabs
08-May-2007, 15:19
Brent was hinting that 3dmark scores matter shit to be frank, and yes he didn't mean it in a good way, the r600 will do well in 3dmark...... And the new drivers you guys are talking about 3% improvement overall in games, but gives like 700-1000k 3dmark 06 bonus
Why is it that when any drops of ATI info land on your stony surface they always fall a particular direction? I think that either you broke the chaos theory or perhaps wind is a factor. :razz:
Brent was hinting that 3dmark scores matter shit to be frank, and yes he didn't mean it in a good way, the r600 will do well in 3dmark...... And the new drivers you guys are talking about 3% improvement overall in games, but gives like 700-1000k 3dmark 06 bonus
Well, there are multiple possibilities here.
1). They're cheating. Get the pitchforks.
2). Their part is particularly suited to 3DM06. Possible, I suppose.
3). As some people have suggested upstream, their compiler "optimization opportunity" is pretty steep, and they're starting with high profile benchie stuff like 3DM06. . . and those kind of increases will be along for many other apps in the months to come as they get their arms around their new baby.
We want the whole Ruby Demo with maximum CCC settings at 1680*1050, if anyone got enough space for upload :grin:
I have 3.xG pics of it on my dvdr
Well, there are multiple possibilities here.
1). They're cheating. Get the pitchforks.
2). Their part is particularly suited to 3DM06. Possible, I suppose.
3). As some people have suggested upstream, their compiler "optimization opportunity" is pretty steep, and they're starting with high profile benchie stuff like 3DM06. . . and those kind of increases will be along for many other apps in the months to come as they get their arms around their new baby.
I don't think they are cheating, driver opts, possibly, 3dmark 06 does push pixel shaders and vertex shaders pretty hard, so thats probably where the performance is coming from.
Why is it that when any drops of ATI info land on your stony surface they always fall a particular direction? I think that either you broke the chaos theory or perhaps wind is a factor. :razz:
Because look at the crap Mao was posting for the past 2 weeks: he says it's way better than the GTX, but shows screenshots that aren't even close to the same scene, and when they are, the frame rates come out very close. Coincidence? I don't think so.
INKster
08-May-2007, 15:51
Wasn't G9x a bit delayed and now supposed to arrive in Dec 07/Jan 08? The 3 months you're talking about would indicate a release in Aug/Sep which should coincide with the release of R650.
If the "R650 in the Summer" silly season starts in force now, AMD might have an "Osborne Effect" on their hands.
I don't think they would want to undermine the R600 family launch like that.
they wouldn't have put the price of the r600 as it is, there won't be an osborne effect.
I don't think they are cheating, driver opts, possibly, 3dmark 06 does push pixel shaders and vertex shaders pretty hard, so thats probably where the performance is coming from.
Could be. All that "version 2 of unified" messaging they've been doing, while literally true, I don't know that it helps them much on optimizing, frankly. A closed environment where everyone writes for your part on an api customized for your part is quite a bit different than the PC platform. So I'm going to guess they really do still have a significant optimization opportunity in front of them. But then 6 months in, my impression is that G80 hasn't really shown any eye-popping performance improvements, and I'd have thought they'd have had a similar opportunity.
INKster
08-May-2007, 15:57
they wouldn't have put the price of the r600 as it is, there won't be an osborne effect.
And what if R650 comes out priced the same as the R600 XT ?
It's a direct successor after all, isn't it ?
But then 6 months in, my impression is that G80 hasn't really shown any eye-popping performance improvements, and I'd have thought they'd have had a similar opportunity.
The delay of G80's new XP drivers might have something to do with it.
Transition from WDM to WVDDM is no easy feat (let alone with a brand new hardware architecture), and it could have forced performance optimizations (that were there in XP, with a mature driver model) to take a backseat to compatibility. This, of course, in light of the strong public backlash regarding the poor Vista driver quality.
Fortunately, most 150 series driver releases seem to be much better than the old ones, but there's still a lot of work ahead until they can concentrate on true performance, i'm sure.
nutball
08-May-2007, 16:01
But then 6 months in, my impression is that G80 hasn't really shown any eye-popping performance improvements, and I'd have thought they'd have had a similar opportunity.
True, though arguably NVIDIA has had other priorities for their driver effort to date, and without any meaningful competition there's no real rush for them to bring performance improvements to the end-user just yet (presuming they have them, hiding in their repository somewhere).
Geeforcer
08-May-2007, 16:04
But then 6 months in, my impression is that G80 hasn't really shown any eye-popping performance improvements, and I'd have thought they'd have had a similar opportunity.
Of course, already possessing a 50-150% performance edge, Nvidia had little incentive to showcase any performance improvements.
EDIT: Meh, nutball beat me to it.
Could be. All that "version 2 of unified" messaging they've been doing, while literally true, I don't know that it helps them much on optimizing, frankly. A closed environment where everyone writes for your part on an api customized for your part is quite a bit different than the PC platform. So I'm going to guess they really do still have a significant optimization opportunity in front of them. But then 6 months in, my impression is that G80 hasn't really shown any eye-popping performance improvements, and I'd have thought they'd have had a similar opportunity.
Well, one thing that strikes me as odd is that in every single 3dmark06 score the r600 does much better in sm 3.0 than in sm 2.0, which really doesn't make much sense at this point looking at the g80 and r600 architectures, since 3dmark06 doesn't really show much difference between sm 3.0 and sm 2.0 on dx10 GPUs. Maybe more game benchmarks will show this as well, or nV is still holding back performance, or is going to get more performance in upcoming drivers, or is playing a cruel joke.
Inkster, I think the price level of the r600 is a huge indication of what to expect of the r600, AMD isn't a charitable company lol, yes they want some marketshare back, but if the r600 is a better performer (not just marginally or sometimes), they don't need to drop their pants to get it back. I don't think the r650 will suffer from the same problems of the r600, at least not to the degree I'm expecting. So it should be able to compete well, unless nV has plans to release another GPU in the same time frame.
LeStoffer
08-May-2007, 16:11
3). As some people have suggested upstream, their compiler "optimization opportunity" is pretty steep, and they're starting with high profile benchie stuff like 3DM06. . . and those kind of increases will be along for many other apps in the months to come as they get their arms around their new baby.
:yes:
Geeforcer
08-May-2007, 16:14
The SM3.0 are HDR tests and that's the scenario where you'd expect superior bandwidth of XT to come through. It should be particularly evident in texture-light, HDR-heavy Deep Freeze.
Might R600 finally (for an ATI part) having PCF support (as required in the DX10 spec) play in there somehow?
If the rumored pricing is correct, then it is likely very indicative. Even if it's an excellent price/performance part, history is pretty conclusive that it's unlikely to be an absolute performance king at that price. AMD has a fiduciary responsibility to their shareholders even to not give away the store. What was the last performance king that launched at the price point the rumor sites are publishing? R360? Edit: Nope, 9800 XT was $499. . . .
nicolasb
08-May-2007, 16:28
If the "R650 in the Summer" silly season starts in force now, AMD might have an "Osborne Effect" on their hands.
I don't think they would want to undermine the R600 family launch like that.
Much though it pains me to have to agree with Fuad Abazovic about anything, I suspect his guess of "late Q3" (i.e. well into September) is probably a good estimate for R650.
http://forum.coolaler.com/showthread.php?t=152864 More inside!
So, A13 is still the current variation of the chip, it seems. When did this first appear? I can't remember, November? January?
Jawed
Might R600 finally (for an ATI part) having PCF support (as required in the DX10 spec) play in there somehow?
If the rumored pricing is correct, then it is likely very indicative. Even if it's an excellent price/performance part, history is pretty conclusive that it's unlikely to be an absolute performance king at that price. AMD has a fiduciary responsibility to their shareholders even to not give away the store. What was the last performance king that launched at the price point the rumor sites are publishing? R360? Edit: Nope, 9800 XT was $499. . . .
X1950 XTX was fairly close, wasn't it?
Heh, the roadmap/codename juggling that began in 2002 finally rejoins the main line.
Btw, I'm feeling cranky today. If I see another rolly eyes sarcasm-dripping troll in this thread today there will be one or more vacations handed out that will last until a few days after the expected R600 launch. :twisted:
Sorry for your pain Geo, but like you with my post, I've started to be really annoyed by the last 20 pages of single shot framerates showing us nothing except some ATI fanatics trying to convince us R600 is sooooo good...
Hope you understand, I did not want to create any problem here, even if I admit my move was not the smartest one.
best regards
aeryon
CarstenS
08-May-2007, 16:41
Well, there are multiple possibilities here.
1). They're cheating. Get the pitchforks.
2). Their part is particularly suited to 3DM06. Possible, I suppose.
3). As some people have suggested upstream, their compiler "optimization opportunity" is pretty steep, and they're starting with high profile benchie stuff like 3DM06. . . and those kind of increases will be along for many other apps in the months to come as they get their arms around their new baby.
1) Everyone cheats. ;) In fact: The whole 3D-graphics-industry is all about cheating. ;)
2) Definitely. This and a 5-wide massive MAD-Kernel really makes an R600 shine.
3) Less likely. Though I agree that there's still performance buried somewhere in the 20 million lines of Catalyst code, I doubt that a 10 to 20 percent boost will be the standard we're gonna see in games. I'd be more inclined to make this a 5 to 10 percent.
So, A13 is still the current variation of the chip, it seems. When did this first appear? I can't remember, November? January?
What, not A15? :lol:
November is a bit early for A13 I think, at least if you mean coming back. . .
1) Everyone cheats. ;) In fact: The whole 3D-graphics-industry is all about cheating. ;)
Oh, let's do that one in a different thread! Tho I understand what you mean there, re the whole mindset is how to approximate reality with as little resources as possible. I suppose since that's the mindset in the beginning, it can have some unpleasant flow through consequences that might not seem all that odd or objectionable to the folks engaged in them.
But, like I said, "good point, wrong thread". :wink:
{Sniping}Waste
08-May-2007, 16:47
Denny AKA Guess2098 has a HD2900XT 1 gig GDDR4 and it looks like there's no NDA on him. He has older drivers but might be able to get newer ones later. These are the latest benches, run with an OC but with the stock cooler. Stock speeds for the card he has are 750/1009. Here's the thread: http://www.xtremesystems.org/forums/showthread.php?t=143104
http://www.iamxtreme.net/video/r600/2900XT1024_06.PNG
http://www.iamxtreme.net/video/r600/2900XT1024_07.PNG
http://www.iamxtreme.net/video/r600/2900XT1024_08.PNG
Because look at the crap Mao was posting for the past 2 weeks: he says it's way better than the gtx, but shows screenshots that aren't even close to the same, and when they are, the frame rates come out very close. Coincidence? I don't think so.
Crap Razor1, you can compare a default-clock 2900XT with an extreme-OC 8800 GTX, you always can, I have strong faith in your crap comparisons.
Crap Razor1, you can compare a default-clock 2900XT with an extreme-OC 8800 GTX, you always can, I have strong faith in your crap comparisons.
When did I do that :grin: , I don't put much stock in DT's benchmarks :wink:
What, not A15? :lol:
November is a bit early for A13 I think, at least if you mean coming back. . .
I think A13 came back at end of Feb.
Geeforcer
08-May-2007, 17:07
After reading this thread one thing is obvious: XT 2900 will score between 10 and 14K in 3Dmark06. It's all clear now!
I don't know how I can explain it any simpler than I did in the last post. For the last time, DC is not talking about a vector instruction on R300 vs. a vector instruction on R600. He is simply pointing out that if ADD co-issue wasn't done on R300, no big deal. If scalar co-issue isn't done on R600, it is a big deal. He's talking about extracting the max throughput possible. Shader code is more likely to have lots of scalar and vec2 code than a crapload of ADDs.
Which will run at lower utilisation on R300 than on R600 if no co-issue is possible (due to dependency). Further, for scalar or vec2 instructions, any co-issue that can be identified by the R300 compiler is going to work on R600. So, again, R600 comes out better than R300.
ATI obviously put a lot of work and die area into making R600 a 5x1D architecture, clearly for the purpose of improving speed more than could be done by extra Vec4+1D units instead. Not getting co-issue right in the compiler for R600 would be a lot more damaging (especially in comparison to G80) than with R300.
There's a lot of low-hanging fruit in co-issue for this architecture, though. As I keep saying vec3/vec4 instructions make a mockery of the suggestion that ATI would be struggling with a compiler that can co-issue in the most basic cases. And that basic case makes up a lot of code.
Unravelling dependencies and eliminating dead code amongst vector instructions is what the set of static single assignment patent applications is all about. Sure, they make DemoCoder sneer in their obviousness - but they lie at the heart of making the compiler do the non-trivial co-issues that you guys are so desperate to show are impossibly complicated and bound to make R600 fall on its own sword.
I've got no argument that there are corner cases of tightly-dependent code that will run like a dog on R600 and I'm under no illusion that co-issue is generally trivial. The devrel guys are always begging gamedevs to explicitly mask their outputs - so that should be clue enough that the compiler guys have a hard time...
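To put rough numbers on the co-issue argument, here is a minimal sketch. It assumes a toy model that is not from any ATI documentation: five scalar lanes per pixel per issue, a hypothetical instruction mix given only as component widths, and fully independent instructions. Real shader code has dependencies, which is exactly the hard part for the compiler, so this is the best case rather than a prediction.

```python
SLOTS = 5  # scalar lanes per pixel per issue (the "5x1D" layout discussed above)


def utilisation_without_coissue(widths):
    """Every instruction occupies a whole 5-lane issue on its own."""
    return sum(widths) / (SLOTS * len(widths))


def pack_greedy(widths):
    """First-fit packing of instructions into 5-lane bundles.

    Assumption: all instructions are independent, so any two may share
    a bundle as long as their component widths fit in the five lanes.
    """
    bundles = []
    for w in widths:
        for bundle in bundles:
            if sum(bundle) + w <= SLOTS:
                bundle.append(w)
                break
        else:
            bundles.append([w])
    return bundles


def utilisation_with_coissue(widths):
    """Lane utilisation after greedy packing."""
    return sum(widths) / (SLOTS * len(pack_greedy(widths)))


# Hypothetical mix: mostly scalar and vec2 work, as argued in the thread.
shader = [3, 1, 2, 4, 1, 2, 1, 1]

print(f"no co-issue:   {utilisation_without_coissue(shader):.1%}")
print(f"with co-issue: {utilisation_with_coissue(shader):.1%}")
```

With this particular mix the eight instructions fit into three bundles, so the gap between packing and not packing is large. A mix dominated by vec4 instructions would show a much smaller gap.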
Oh, okay.
So basically you're suggesting the same rationale that I did previously (http://forum.beyond3d.com/showthread.php?p=981888#post981888). The batch size is too small for one quarter of the ALUs to all operate on the same channel like in G80.
No, my suggestion is that having a 4-way tiled architecture, they wouldn't then want to split-up each of the 4 tiles of batch scheduling with a more advanced sequencer. The sequencer from R5xx is enough (one ALU batch, one texture batch - roughly speaking) instead of using the Xenos sequencer (multiple ALU batches and one texture batch - roughly speaking).
I still think that's the way to go. Just use a 64-pixel batch size. It won't make that much difference, and I certainly think it would be much less than from near-perfect utilization.
But now you have spent more die space on the sequencers to get the same batch size. The payback is that two consecutive and dependent scalar or vec2 instructions will run at full speed. Maybe the payback isn't worth it?
---
Last night I realised that R600's ALU organisation offers another fundamental advantage (a direct inheritance from Xenos, prolly). Since each of the four shader units contains five 16-way ALUs, ATI's fine-grained ALU-redundancy scheme works a charm. In this setup, a 17th ALU pipeline is added to each array. So the redundancy overhead is 6%. (The theory is that each of Xenos's three 16-way ALUs have a 17th pipeline for redundancy.)
If R600 was implemented as lots of smaller ALUs (e.g. to build a sequential scalar GPU like G80) then the redundancy overhead would be significantly higher.
Again, another indication of evolution...
Jawed
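The redundancy arithmetic above can be checked with a short sketch. It assumes exactly one spare pipeline per ALU array, which is the theory stated in the post and not a confirmed design; the 8-way and 4-way widths are hypothetical stand-ins for "lots of smaller ALUs" and are not claimed to match any real chip.

```python
def redundancy_overhead(width, spares=1):
    """Extra pipelines as a fraction of the working pipelines in one array."""
    return spares / width


# 16-way is the array width described above; 8 and 4 are hypothetical.
for width in (16, 8, 4):
    print(f"{width}-way array + 1 spare: {redundancy_overhead(width):.2%} overhead")
```

One spare per 16-way array is 1/16, or 6.25%, which matches the roughly 6% quoted above. Halving the array width doubles the relative cost of keeping one spare per array.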
What, not A15? :lol:
November is a bit early for A13 I think, at least if you mean coming back. . .
http://www.hexus.net/content/item.php?item=7437
Which implies A13 was just about to pop out.
If we're at A15, then, hairy muff...
Jawed
http://www.hexus.net/content/item.php?item=7437
Which implies A13 was just about to pop out.
If we're at A15, then, hairy muff...
Jawed
A15 was me funnin'. Somebody or other was claiming that a few months back. :smile: I see what you see on that chip shot. I suspect the Hexus piece was closer to when the order for A13 went to the fab than when it came back. . .
chavvdarrr
08-May-2007, 17:30
Last night I realised that R600's ALU organisation offers another fundamental advantage (a direct inheritance from Xenos, prolly). Since each of the four shader units contains five 16-way ALUs, ATI's fine-grained ALU-redundancy scheme works a charm. In this setup, a 17th ALU pipeline is added to each array. So the redundancy overhead is 6%. (The theory is that each of Xenos's three 16-way ALUs have a 17th pipeline for redundancy.)
If R600 was implemented as lots of smaller ALUs (e.g. to build a sequential scalar GPU like G80) then the redundancy overhead would be significantly higher.
Again, another indication of evolution...
Jawed
ehm... and what if there are 18 or 19 ALUs? Or if the whole redundancy scheme is different...
I just don't get why a hypothetical scheme should be considered as evolution.
http://www.forum-3dcenter.org/vbulletin/showthread.php?p=5474374#post5474374
Click on "spoiler" to see all graphics.
8 SPs per SM (streaming multi-processor), 512 threads per SM???
Why is the SFU "outside"?
Hm, is it possible, that it's G80?
Clicking on the spoiler, the top diagram shows NV4x/G7x ALUs.
The set of slides that follows is about writing good code. The first slide is to encourage developers to specify output masks on their code. The masks then allow the compiler to simply co-issue.
These examples are for old ATI hardware, though they'll continue to be relevant. They appear on slides 51 and 46 of:
http://ati.amd.com/developer/Dark_Secrets_of_shader_Dev-Mojo.pdf
Jawed
ehm... and what if there are 18 or 19 ALUs? Or if the whole redundancy scheme is different...
I just don't get why a hypothetical scheme should be considered as evolution.
GRAPHICS PROCESSING LOGIC WITH VARIABLE ARITHMETIC LOGIC UNIT CONTROL AND METHOD THEREFOR (http://v3.espacenet.com/textdoc?DB=EPODOC&IDX=US2006053189&F=0)
Jawed
Robin B
08-May-2007, 19:17
Does anyone here know if the R600 comes with an 8-pin PCI-E adapter? I think it could make or break some sales, as not many PSUs have one yet.
I don't know that anyone's seen a retail pack yet to say that for sure. Places like Newegg are generally pretty good about showing pictures of all the accoutrements, so checking there when available might be the way to go.
Mintmaster
08-May-2007, 20:59
Which will run at lower utilisation on R300 than on R600 if no co-issue is possible (due to dependency).
Forget about it Jawed. You don't get it. This isn't about % ALU utilization on R300 vs. R600. It's about the importance of compiler co-issue for competitiveness.
The extra ADD doesn't use much die space on R300, nor does its usage make much difference in most shader code. Neither applies to R600's coissue. NVidia said somewhere they get nearly 2x speed boost by going scalar, and looking at PS benchmarks I believe them.
But now you have spent more die space on the sequencers to get the same batch size. The payback is that two consecutive and dependent scalar or vec2 instructions will run at full speed. Maybe the payback isn't worth it?
Why do we need more sequencers? You're really twisting my Xenos example around with the whole smaller Xenos quadrupled interpretation.
Forget about that example for now. Take R600, and assume the batch size is 64 pixels. Now make the single ALU array per quarter act on one channel of all pixels instead of all channels of 16 pixels. That's pretty much the sum of it (ignoring the SF details).
----------------
HOWEVER, I realized that I forgot something. Latency requires you to cycle between batches. An ALU array in R600 takes 4 cycles to get through a batch, so a 12-cycle instruction latency, for example, requires you to have three batches in a cycle that switches every 4 clocks. My design would need 12 batches in the cycle that switches every clock. That's a lot more data shuffling.
I guess there is indeed a legitimate reason for going the co-issue route.
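The batch-count arithmetic above works out as follows. This is a minimal sketch: the 12-cycle latency is the example figure from the post, not a measured value, and the function name is made up for illustration.

```python
import math


def batches_in_flight(latency_cycles, cycles_per_batch):
    """Batches that must rotate so the ALU never waits on its own result."""
    return math.ceil(latency_cycles / cycles_per_batch)


# 4 clocks per batch (all channels of 16 pixels per clock, 64-pixel batch)
print(batches_in_flight(12, 4))

# 1 clock per batch (one channel of all 64 pixels per clock)
print(batches_in_flight(12, 1))
```

The first case gives three batches and the second gives twelve, which is where the extra data shuffling comes from.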
Russell
08-May-2007, 21:04
Does anyone here know if the R600 comes with an 8-pin PCI-E adapter? I think it could make or break some sales, as not many PSUs have one yet.
When HIS's site posted the R600 product page early, I did note that there was no reference to the 8-pin adapter in the accessories list. Whether that is the case, or is the case for all manufacturers, I do not know.
http://www.imagebanana.com/img/b11ska11/xh29.jpg
Dave Baumann
08-May-2007, 21:18
HOWEVER, I realized that I forgot something. Latency requires you to cycle between batches. An ALU array in R600 takes 4 cycles to get through a batch, so a 12-cycle instruction latency, for example, requires you to have three batches in a cycle that switches every 4 clocks.
If you take Xenos as an example, it is built around an 8-cycle ALU latency: each SIMD array has two sequencers/arbiters to handle and swap between two separate threads (executing over 4 cycles each).