Old 05-May-2007, 20:03   #4526
silent_guy
Senior Member
 
Join Date: Mar 2006
Posts: 1,713
Default

Quote:
Originally Posted by trinibwoy View Post
I'm still baffled by their decision to go this way yet continue to rely on instruction co-issue.
I think it's a logical decision to go this way: a nice evolution of something they already had. Definitely more efficient, but with the option to initially reuse the compiler that was already there (while temporarily forgoing the additional efficiency until improvements are made.)

Not knowing the architecture of the competition, it's the low risk and low cost way to transition to your next product. Most established companies work that way, especially when they are resource limited due to a number of other projects on the side (xBox/Wii): instead of throwing everything overboard you only improve existing stuff and spend most time on what needs to be designed from scratch. (DX10 stuff, tessellation, ...)

I don't really see how the shader architecture is going to be the source of performance problems for 3D stuff.

Quote:
It's going to be veeeeery interesting to see how these two architectures scale at 65nm.
I wouldn't expect too much, if they are just evolutions: the incremental speed gain going from 80nm to 65nm isn't very impressive. And that's to be expected with wire delays becoming the determining factor even more...
silent_guy is offline  
Old 05-May-2007, 20:47   #4527
Frank
Certified not a majority
 
Join Date: Sep 2003
Location: Sittard, the Netherlands
Posts: 3,182
Default

I really think that the instructions issued to the individual ALUs include wide (4vec + scalar) as well as small (scalar) ops, simply because that would reduce the needed bandwidth a whole lot. And they might include a loop counter as well as some conditionals. Those conditionals could be applied to the exit condition, or the conditional writing/branching of individual instructions.
__________________
The Laws of nature are NOT subject to the majority vote. In the long run.
Frank is offline  
Old 05-May-2007, 21:41   #4528
Julidz
Junior Member
 
Join Date: May 2007
Posts: 57
Default

Could you please explain that ?

We will show two examples
Let's add two 4D vectors and two 1D scalar information. ATI will do this in a single clock because it has one 4D and one 1D unit to spend. Nvidia will do this in a single clock but it will take 5 units from 128 possible while ATI used only 2 from 128 possible (64 4D + 64 1D).

ATI is theoretically faster, but Nvidia has Shaders that run at 1.35 GHz or some 60 percent faster than ATI's that run at 800 MHz. This makes the situation much better for Nvidia. This example is very rare in real games. Shaders very rarely use the operations with 4D vectors. Shaders usually work with 3D data such as 3D coordinates, 3D normals and 3D RGB channels without alpha. In some cases Shaders use 2D functions only e.g. 2D coordinate in texture or even 1D.


Example 2:

Let's add two 3D vectors, two 2D Vectors and one 1 D scalar. Nvidia again needs 5 Shaders or 2+2+1 units while ATI can not make this in a single clock. ATI needs a single clock to add two 3D vectors and at the same clock it can add the 1D scalar. However it needs a second clock to add 2 2D vectors. This means that ATI used 1+1 units in first and a single unit in second clocks, together it used three units in two clocks. As R600 Shaders are 60 percent slower than Nvidia's it means that R600 is two times slower than G80 in this particular example.
If you decide to make a Shader that will add eight 1D vectors Nvidia can take 8 units in a single clocks and ATI has to do in four clocks as it can use 1 4D and 1 1D unit per clock to finish this operation. In this case Nvidia can be four times faster and still finishes the calculation faster.
Even if it looks really bad for ATI there is still some hope for the R600. You have to take into consideration that Microsoft HLSL (Higher Level Shader Language) compiler will do its small miracle and will try to optimize the Shader code. Compiler will try to "glue" a few scalars in a single VEC4 unit.

Compiler

We have two scalar values, the first one called Fudo and the second one called JenHsun. The compiler will glue these two scalar values into a single 2D vector called "FudoJenHsun" and it will access the Fudo scalar part as FudoJenHsun.X and the JenHsun part as FudoJenHsun.Y. The good part is that all possible operations with the new 2D vector variable FudoJenHsun will run in parallel as one 2D vector. This won't help Nvidia at all, but it will really mean a lot for ATI. Nvidia still has the advantage of having fully flexible scalar units, while ATI doesn’t, at least not that we or any of our sources know of.

We know it is a complicated part but we could not simplify things much more than this. We apologise for the complexity of the part but it just doesn’t go easier than that. We simplified it as much as we could.






and will the future games use these vec4 instructions for shaders ?


Thx
Julidz is offline  
Old 05-May-2007, 21:51   #4529
Chalnoth
 
Join Date: May 2002
Location: New York, NY
Posts: 12,679
Default

There's really no reason to expect any increase in vec4 instruction utilization.
Chalnoth is offline  
Old 05-May-2007, 21:52   #4530
Julidz
Junior Member
 
Join Date: May 2007
Posts: 57
Default

so how do ppl expect better performance from R600 than G80 in DX10 games ??
Julidz is offline  
Old 05-May-2007, 21:58   #4531
no-X
Senior Member
 
Join Date: May 2005
Posts: 2,042
Default

Quote:
Originally Posted by Julidz View Post
Could you please explain that ?

Thx
source: Fudzilla. That's the whole explanation
__________________
Sorry for my English. But I hope it's better than your Czech
no-X is offline  
Old 05-May-2007, 22:07   #4532
Julidz
Junior Member
 
Join Date: May 2007
Posts: 57
Default

but i really wanna know why r600 would be better in Dx10 and G80 is better in DX9


will DX10 take more advantage of the Vec5 *vec4 + scalar* architecture and R600's 320 Sps will be used in a more efficient way ?
Julidz is offline  
Old 05-May-2007, 22:16   #4533
AlexV
Heteroscedasticitate
 
Join Date: Mar 2005
Posts: 2,362
Default

It will because it must because it's from ATi and it's late and it may not be a scorcher in terms of general performance so there must be a catch because ATi doesn't make sucky stuff ever ever ever.

On a serious note, I think it's fairly hard to predict whether or not it'll rule WRT DX10, because we've yet to see typical DX10 workloads (no, dubious demos quoted in marketing slides don't count). It'll come down to what ppl actually do with DX10 and how that jibes with the 2 competing architectures. The R600 seems to be more adept than the G80 when it comes to Geometry shading; whether or not that'll matter, or whether it is fast enough for it to matter, is a different thing. Then there's compiler magic that comes into play... until the NDA is off and a pertinent analysis can be performed, it's all a complex exercise of who can pull theories from the deeper parts of their anatomy at a greater pace (no disrespect intended towards fellows like Jawed, mind you)
__________________
Donald Knuth: Science is what we understand well enough to explain to a computer. Art is everything else we do.
AlexV is offline  
Old 05-May-2007, 22:35   #4534
Frank
Certified not a majority
 
Join Date: Sep 2003
Location: Sittard, the Netherlands
Posts: 3,182
Default

Say, you have an instruction stream coming into your GPU, that uses half the available memory bandwidth. The other half is used for data. Say, you have 256 ALUs (processing elements), that all need a new instruction each clock to be able to do something useful. And each instruction + operands is 64 bits wide (that's very conservative), and you can read one each clock from memory. That requires a 256 * 64 * 2 (for data) = 32768 (!!) bits wide bus.

Of course, that isn't practical. Fortunately, GPUs are SIMD: Single Instruction, Multiple Data. In other words: the ALUs are grouped. Each instruction is executed multiple times in parallel. And there are other ways to reduce the bandwidth (bus width) needed: either you use instructions that tell multiple groups of ALUs what to do (like the superscalar R600), or you execute those instructions over multiple clock cycles (like the serial 8800). Both ways, you reduce the amount of instructions needed each clock to a manageable amount.
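
The bus-width estimate above can be checked with a few lines of Python. This is only a back-of-envelope sketch: the 256 ALUs, the 64-bit instruction and the factor of 2 for data come from the post, while the function names and the SIMD group size of 16 are illustrative assumptions and not figures for any real chip.

```python
def naive_bus_width(alus: int, instr_bits: int, data_factor: int = 2) -> int:
    """Bits per clock if every ALU fetches its own instruction each clock.

    data_factor=2 models the post's assumption that data traffic takes
    as much bandwidth as instruction traffic.
    """
    return alus * instr_bits * data_factor


def simd_instruction_bits(alus: int, instr_bits: int, group_size: int) -> int:
    """Instruction bits per clock when ALUs are grouped SIMD-style.

    Each group of `group_size` ALUs shares one instruction, so only
    alus / group_size instructions are fetched per clock.
    group_size is a hypothetical parameter for illustration.
    """
    if alus % group_size != 0:
        raise ValueError("alus must be a multiple of group_size")
    return (alus // group_size) * instr_bits


print(naive_bus_width(256, 64))            # 256 * 64 * 2
print(simd_instruction_bits(256, 64, 16))  # 16 groups * 64 bits
```

With these inputs the naive case reproduces the 32768-bit figure, and grouping 16 ALUs per instruction cuts the instruction fetch alone to 1024 bits per clock. Data traffic is not reduced by grouping and is left out of the second function.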

Branching is further increasing the need for instruction scheduling, or reducing the throughput. Because, at each if..then..else statement, some of the elements might go one way, and the others the other way. At that point, you can split the streams (doubling the amount of ALUs and bandwidth needed), calculate both possibilities in sequence and only write the ones that are valid for that case, or simply calculate both paths in sequence. The latter two both halve the throughput every time they happen.
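
The throughput cost of a divergent branch can be sketched with a toy cost model. It assumes a batch of elements that share one instruction stream and run both sides of an if..then..else in sequence whenever the batch disagrees; `batch_cycles` and the cycle costs are made up for illustration and do not describe how R600 or G80 actually schedule branches.

```python
def batch_cycles(mask: list[bool], then_cost: int, else_cost: int) -> int:
    """Cycles a SIMD batch spends on an if/else.

    mask[i] is True when element i takes the 'then' path.
    A path is executed if at least one element needs it; elements on
    the other path simply have their writes masked off.
    """
    takes_then = any(mask)
    takes_else = not all(mask)
    return (then_cost if takes_then else 0) + (else_cost if takes_else else 0)


uniform = [True] * 16                 # every element agrees
divergent = [True] * 8 + [False] * 8  # half go each way

print(batch_cycles(uniform, 4, 4))    # only the 'then' path runs
print(batch_cycles(divergent, 4, 4))  # both paths run back to back
```

In this model a uniform batch pays for one path and a divergent batch pays for both, which is the halving of throughput described above when the two paths cost the same.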

As the R600 has 4 ALU blocks, each consisting of 6 units (4 vec, 1 scalar and 1 branching/conditional) that all have to receive instructions each clock, you need instructions that tell all of them what to do in all cases (from 6 independent scalar instructions up to a single combined vec4 + scalar and a conditional instruction). That requires either very long instruction words (VLIW, say up to 1024 bits for each block), or clever scheduling.

Because, most combined instructions can be simplified into a single instruction for the whole ALU block. But, if you only schedule a single instruction for a single ALU, all the others would be wasted for that clock.

So, it's most likely, that they kept the instruction length manageable, but made it possible to stack instructions, so they can be executed all at once.
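
The idea of stacking several small instructions so they execute at once can be illustrated with a toy packer. This is a sketch under loud assumptions: every op is independent of the others, a bundle has five freely usable lanes (4 vec + 1 scalar treated as interchangeable, which nothing in this thread confirms), and packing is plain first-fit. `pack_bundles` is a hypothetical helper, not anyone's real compiler.

```python
def pack_bundles(op_widths: list[int], lanes: int = 5) -> list[list[int]]:
    """First-fit packing of independent ops into fixed-width bundles.

    op_widths[i] is the number of lanes op i needs (1 = scalar, 4 = vec4).
    Returns a list of bundles, each a list of the op widths it holds.
    """
    bundles: list[list[int]] = []
    for width in op_widths:
        if not 1 <= width <= lanes:
            raise ValueError(f"op width {width} does not fit in {lanes} lanes")
        for bundle in bundles:
            if sum(bundle) + width <= lanes:
                bundle.append(width)  # stack onto an existing bundle
                break
        else:
            bundles.append([width])   # no room anywhere: start a new bundle
    return bundles


def utilization(bundles: list[list[int]], lanes: int = 5) -> float:
    """Fraction of lanes doing useful work across all bundles."""
    return sum(sum(b) for b in bundles) / (lanes * len(bundles))


for ops in ([4, 1], [3, 2, 1], [1] * 8):
    packed = pack_bundles(ops)
    print(ops, "->", packed, f"{utilization(packed):.0%}")
```

Under these assumptions a vec4 plus a scalar fills one bundle completely, while a 3D + 2D + 1D mix or eight scalars each need two bundles with some lanes left idle. How well real shader code packs depends on dependencies between ops, which this sketch ignores.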
__________________
The Laws of nature are NOT subject to the majority vote. In the long run.

Last edited by Frank; 05-May-2007 at 22:49.
Frank is offline  
Old 05-May-2007, 22:41   #4535
silent_guy
Senior Member
 
Join Date: Mar 2006
Posts: 1,713
Default

Quote:
Originally Posted by Julidz View Post
Could you please explain that ?

Quote:
We will show two examples
Let's add two 4D vectors and two 1D scalar information. ATI will do this in a single clock because it has one 4D and one 1D unit to spend. Nvidia will do this in a single clock but it will take 5 units from 128 possible while ATI used only 2 from 128 possible (64 4D + 64 1D).

... (I stopped reading after this)
Julidz, forget everything of the paragraph above: it's completely wrong.

ATI will do the 4D operation and the 1D operation in 1 clock cycle and can do 64 of those actions (for 64 different threads) at the same time.
The 8800GTX requires 5 clock cycles to do them and can do 128 in parallel, but at a higher clock speed.
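
Put into numbers, the comparison above looks roughly like this. The thread counts (64 and 128) and cycle counts (1 and 5) come from the post; the 800 MHz and 1350 MHz clocks are the ones given in the quoted article and are unconfirmed, and `vec5_ops_per_us` is a made-up helper name.

```python
def vec5_ops_per_us(threads: int, cycles_per_op: int, clock_mhz: int) -> float:
    """Completed 'vec4 add + scalar add' pairs per microsecond.

    threads       : how many threads execute in parallel
    cycles_per_op : clocks each thread needs for the vec4 + scalar pair
    clock_mhz     : shader clock in MHz (cycles per microsecond)
    """
    return threads * clock_mhz / cycles_per_op


r600 = vec5_ops_per_us(threads=64, cycles_per_op=1, clock_mhz=800)    # assumed clock
g80 = vec5_ops_per_us(threads=128, cycles_per_op=5, clock_mhz=1350)   # assumed clock

print(f"R600: {r600:.0f} per us")
print(f"G80:  {g80:.0f} per us")
print(f"ratio R600/G80: {r600 / g80:.2f}")
```

This only covers the one case where every thread has exactly a 4D and a 1D operation to issue together, which is the best case for the vec4 + scalar layout. For purely scalar work the same arithmetic favours the scalar design.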

Quote:
...and will the future games use these vec4 instructions for shaders ?
AFAIK, current shaders are already using 4D instructions: they are used in the homogeneous [x,y,z,w] coordinate system. (But I'm sure somebody will correct me if I'm talking BS. )
silent_guy is offline  
Old 05-May-2007, 22:46   #4536
Arnold Beckenbauer
Senior Member
 
Join Date: Oct 2006
Location: Germany
Posts: 1,003
Default

Quote:
Originally Posted by Julidz View Post
but i really wanna know why r600 would be better in Dx10 and G80 is better in DX9


will DX10 take more advantage of the Vec5 *vec4 + scalar* architecture and R600's 320 Sps will be used in a more efficient way?
Quote:
Let's add two 4D vectors and two 1D scalar information. ATI will do this in a single clock because it has one 4D and one 1D unit to spend. Nvidia will do this in a single clock but it will take 5 units from 128 possible while ATI used only 2 from 128 possible (64 4D + 64 1D).
Pardon?
G80 never works on Vec2, Vec3, Vec4... instructions, but always on Vec1. If there is, e.g., a Vec4 instruction, G80's compiler splits it into four Vec1 instructions and it takes four cycles to work on it.
Imagine: You have to work on four RGBA quads. The G70's quad ALU (4D) needs four cycles for these four RGBA quads: first cycle - first RGBA quad, second cycle - second, third...
The G80 splits these four quads into four groups of 16 values, one group per channel: 16*R, 16*G, 16*B, 16*A. A Vec16-ALU needs four cycles for these four groups: first cycle - 16*R, second cycle - 16*G...
A Vec16-ALU is clocked up to 1512 MHz.
And now: forget all this pixel stuff. Current GPUs like R520/R580 or G80 work on threads/batches/green elephants.
What really matters is: input (10101010101...) and output (what you see on your display).
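
The channel-at-a-time scheme described above can be mimicked in plain Python. This is a sketch of the data layout only: a batch of 16 RGBA pixels is regrouped into one list per channel, and each "cycle" processes one channel across all 16 pixels. The function names are hypothetical, and nothing here models real hardware timing.

```python
def to_channels(pixels: list[tuple]) -> list[list]:
    """Regroup [(r, g, b, a), ...] into [[r...], [g...], [b...], [a...]]."""
    return [list(channel) for channel in zip(*pixels)]


def add_channelwise(a_pixels, b_pixels, components: int = 4):
    """Add two pixel batches one channel per cycle.

    Returns (result_channels, cycles). Only the first `components`
    channels are processed, so a Vec3 op costs 3 cycles, not 4.
    """
    a = to_channels(a_pixels)
    b = to_channels(b_pixels)
    out, cycles = [], 0
    for ch in range(components):
        # One 16-wide scalar operation: same channel, all pixels in the batch.
        out.append([x + y for x, y in zip(a[ch], b[ch])])
        cycles += 1
    return out, cycles


batch_a = [(i, i + 1, i + 2, i + 3) for i in range(16)]  # 16 RGBA pixels
batch_b = [(1, 1, 1, 1)] * 16

_, vec4_cycles = add_channelwise(batch_a, batch_b, components=4)
_, vec3_cycles = add_channelwise(batch_a, batch_b, components=3)
print(vec4_cycles, vec3_cycles)
```

The point of the layout is that the cost follows the number of components actually used: a Vec3 operation simply skips the fourth pass instead of leaving a lane idle.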

Quote:
Originally Posted by Mintmaster View Post
The only reason I can think of for doing this is that maybe the batch size is small, so if R600 is 4 groups of a texture quad and 80 stream processors, then you can't easily attack the math one channel at a time (G80, being comprised of 8 groups of a texture quad and 16 stream processors, doesn't have that problem). Still, I'd rather divide the chip into smaller groups and share the texture quads rather than go through all this co-issue scheduling.
Or eight SIMDs and four fully decoupled TUs (texture units)?
__________________
Hail Brothers and Sisters! Coranon Silaria, Ozoo Mahoke
Eta Kooram Nah Smech!

Find Chuck Norris.

Last edited by Arnold Beckenbauer; 05-May-2007 at 23:01.
Arnold Beckenbauer is offline  
Old 05-May-2007, 22:52   #4537
silent_guy
Senior Member
 
Join Date: Mar 2006
Posts: 1,713
Default

Quote:
Originally Posted by Julidz View Post
but i really wanna know why r600 would be better in Dx10 and G80 is better in DX9

will DX10 take more advantage of the Vec5 *vec4 + scalar* architecture and R600's 320 Sps will be used in a more efficient way ?
If you look at pure calculation power, R600 has the advantage wrt the amount of calculations per second it can do. That's very nice to have and will almost certainly make it a winner for a number of calculation oriented benchmarks.

But the execution pipes (shaders) don't have much to do with the DX9 vs DX10 performance. For that, the stuff surrounding the shaders is what counts: how do you store geometry shading data? How do you feed the shaders with data? How fast can it do branching?

We currently don't know a lot about the DX10 organization for G80, and even less for R600. It's impossible to know how they will behave for DX10 games, but if there's a major difference, I think it's reasonable to say that the difference in shader executing pipeline won't be the major factor.

Right now, I know of only 1 report about relative GS performance, and that's a blog post about MS Flight Simulator. Not exactly a large body of evidence to go on...
silent_guy is offline  
Old 05-May-2007, 23:08   #4538
Silent_Buddha
Regular
 
Join Date: Mar 2007
Posts: 9,227
Default

Aye it'd be interesting to hear from the developers of Call of Juarez, Company of Heroes, Crysis, etc, who are either working on DX10 or patching their games to DX10, what they think of the two architectures. Although I'm supposing they are also still under NDA from ATI.

From the admittedly VERY few screens of Call of Juarez, the DX10 version looks absolutely nothing like the DX9 version that I tried out. Well other than buildings and terrain being in the same place.

I'm not sure however if that is a sign of things to come, or if it's just that they've had an extra year or so to add more bling to the game.

Regards,
SB
Silent_Buddha is offline  
Old 06-May-2007, 03:01   #4539
btwango
Registered
 
Join Date: Apr 2004
Posts: 8
Default

What are the ramifications of the R600 "doing" audio as well as graphics? Is this going to render (no pun intended) after market sound cards superfluous?
btwango is offline  
Old 06-May-2007, 04:31   #4540
Anarchist4000
Member
 
Join Date: May 2004
Location: Somewhere, IN USA
Posts: 313
Default

Quote:
Originally Posted by btwango View Post
What are the ramifications of the R600 "doing" audio as well as graphics? Is this going to render (no pun intended) after market sound cards superfluous?
The main benefits of an actual sound card are:
1) Offload processing requirements from the CPU
2) Better sound quality with analog output.

3D effects are the only thing that comes to mind that requires serious audio processing and with multicore CPUs being more prevalent the need to offload the processing to a card seems increasingly less. As far as quality goes a CPU can do just as well as a card, if not better since there are no processing restrictions. The big sound difference in the past came from better components used to output analog signals. With digital signals there isn't a whole lot that can be done to improve their quality component wise.

I'd be willing to bet that the only audio function the card will have will be to output that digital signal through HDMI. Most onboard audio solutions are nothing but software. The hardware part is just a matter of analog to digital conversion and vice versa.
Anarchist4000 is offline  
Old 06-May-2007, 07:38   #4541
Unknown Soldier
Senior Member
 
Join Date: Jul 2002
Posts: 2,178
Default

Quote:
Originally Posted by AnarchX View Post
Hi,

I'm surprised that no one picked up that the 1 GB GDDR4 card seems to be underclocked. The original rumours claimed that the GDDR4 card (a la XTX?) would use 2.2 (or 2200 MHz) yet these pics only show 2000 MHz.

Of course as someone mentioned, these are most probably developer cards, which is why it only has 2000 MHz and is called XT.

US
__________________
God put me on earth to do a certain number of things. Right now i'm so far behind that i'll never die.

Random 512Kb onboard -> S3 Virge 4MB -> RivaTNT2 -> GeforcePro -> GF3 -> NV3x -> R420 -> R580 -> G80 -> G92 -> 5870 -> ???
Unknown Soldier is offline  
Old 06-May-2007, 08:17   #4542
no-X
Senior Member
 
Join Date: May 2005
Posts: 2,042
Default

HD2900XT GDDR4 OEM? (750/2000). Maybe the same version as in the DailyTech "preview"... (they call it XTX, but I think they are just wrong )
__________________
Sorry for my English. But I hope it's better than your Czech
no-X is offline  
Old 06-May-2007, 08:35   #4543
Arun
Unknown.
 
Join Date: Aug 2002
Location: UK
Posts: 4,883
Default

Quote:
Originally Posted by Anarchist4000 View Post
The main benefits of an actual sound card are:
1) Offload processing requirements from the CPU
According to publicly leaked information, the processing is still done on the CPU. You know, just like what any modern integrated audio solution will do; these things leave all the major processing duty to the CPU. In the end, that kind of work is so minimal it doesn't really matter. Some EAX effects (which neither R6xx nor 99% of integrated solutions support in practice anyway) *might* be a bit more expensive.
__________________
Focusing on non-graphics projects in 2013 (but I still love triangles)
"[...]; the kind of variation which ensues depending in most cases in a far higher degree on the nature or constitution of the being, than on the nature of the changed conditions."
Arun is offline  
Old 06-May-2007, 08:54   #4544
fellix
Senior Member
 
Join Date: Dec 2004
Location: Varna, Bulgaria
Posts: 2,832
Default

So, R600 would just have an integrated audio codec, similar to those found on most mainboards?
__________________
Apple: China -- Brutal leadership done right.
Google: United States -- Somewhat democratic.
Microsoft: Russia -- Big and bloated.
Linux: EU -- Diverse and broke.
fellix is offline  
Old 06-May-2007, 09:28   #4545
Unknown Soldier
Senior Member
 
Join Date: Jul 2002
Posts: 2,178
Default

Quote:
Originally Posted by Arun Demeure View Post
According to publicly leaked information, the processing is still done on the CPU. You know, just like what any modern integrated audio solution will do; these things leave all the major processing duty to the CPU. In the end, that kind of work is so minimal it doesn't really matter. Some EAX effects (which neither R6xx nor 99% of integrated solutions support in practice anyway) *might* be a bit more expensive.
If so, then the PCI-e should cope well.

Other sound cards usually work off PCI.

US
__________________
God put me on earth to do a certain number of things. Right now i'm so far behind that i'll never die.

Random 512Kb onboard -> S3 Virge 4MB -> RivaTNT2 -> GeforcePro -> GF3 -> NV3x -> R420 -> R580 -> G80 -> G92 -> 5870 -> ???
Unknown Soldier is offline  
Old 06-May-2007, 09:37   #4546
bigtabs
Senior Member
 
Join Date: Jan 2007
Location: TDO, Germany
Posts: 1,222
Default

It will be nice to finally get rid of those sound + graphics IRQ conflicts anyway.
bigtabs is offline  
Old 06-May-2007, 10:30   #4547
Kaotik
yes, i'm drunk
 
Join Date: Apr 2003
Posts: 4,854
Default

This should be from AMD's own papers:
__________________
I'm nothing but a shattered soul...
Been ravaged by the chaotic beauty...
Ruined by the unreal temptations...
I was betrayed by my own beliefs...
Kaotik is offline  
Old 06-May-2007, 10:55   #4548
BlizzardOne
Junior Member
 
Join Date: Sep 2006
Location: North West UK
Posts: 81
Default

Quote:
Originally Posted by Kaotik View Post
This should be from AMD's own papers:


I thought GDDR4 was supposed to be more efficient from an energy side of things, when compared to GDDR3, at the same clockspeeds? (I know 2.0 W is next to nothing anyway.. but still)

And the GPU TDP looks off too, both are estimated at 750-800, same voltages etc.. just one is GDDR3, one is GDDR4, and the latter has a TDP 20 W higher?

Maybe i'm just being oblivious to the obvious though (i'll blame the hangover)
BlizzardOne is offline  
Old 06-May-2007, 11:05   #4549
Twinkie
Member
 
Join Date: Oct 2006
Posts: 386
Default

Quote:
Originally Posted by BlizzardOne View Post


I thought GDDR4 was supposed to be more efficient from an energy side of things, when compared to GDDR3, at the same clockspeeds? (I know 2.0 W is next to nothing anyway.. but still)

And the GPU TDP looks off too, both are estimated at 750-800, same voltages etc.. just one is GDDR3, one is GDDR4, and the latter has a TDP 20 W higher?

Maybe i'm just being oblivious to the obvious though (i'll blame the hangover)
One's 1024 MB of GDDR4, and the other's 512 MB of GDDR3.
Twinkie is offline  
Old 06-May-2007, 11:07   #4550
BlizzardOne
Junior Member
 
Join Date: Sep 2006
Location: North West UK
Posts: 81
Default

*goes and sits in the corner with his dunce hat*
BlizzardOne is offline  

 
