3D Technology & Architecture

Dave Baumann · May 22, 2007

aeryon said:
3/ HDMI 1.2 is NOT fully compatible with BR and HD-DVD. As You said, it lacks features so it cannot be 100% compatible. from official HDMI website :
http://www.hdmi.org/about/faq.asp#hdmi_1.3
BR and HD-DVD offer the option to use Dolby TrueHD and DTS-HD Master Audio lossless audio formats, only available if you have HDMI1.3. However, with HDMI 1.2 you can play these 2 lossless audio streams in PCM format if your player and receiver are compatible.

Yes, 1.2 (and before) is fully compatible because HD DVD and Blu-ray have a baseline audio support (much like Dobly was the baseline for DVD). Dolby TrueHD and DTS- Master Audio are optional formats that can be included with a title as and extra sound track - if the disk supports either one of these then it will also support the required tracks as well, meaning that it can still be used. These audio format are likely to remain limited in their use to A/V amps and at present there are few that actually support them.

and finally the last revision of HDMI offers the ability to use deeper color space (10, 12 or 16 bit per component instead of only 8bit for HDMI 1.2) and broader color space (IEC 61966-2-4 xvYCC color standard with no color space limitation).

At present deep color has the issue of software support. In terms of the titles, but also, for the PC, in terms of the OS.

All of these HDMI specifications are optional and are not requirements of the spec.

nicolasb · May 22, 2007

Zvekan said:
As far as I can see, both G80 and R600, offer quite similar offload abilities.

Yeah, that's what worries me.

I was hoping R600 would offer H.264 decoding at least as good as that of G84/86, with full hardware-assisted VC-1 as an added bonus. But if R600's video decoding is no better than that of G80, then it's yet another reason not to buy R600. :???:

_xxx_ said:
When you mention a HTPC, I think of SFF and not a full-blown tower. And an SFF will hardly have enough room for the card, or a >400W power supply or proper cooling for anything of the likes.

Yeah, but you can't exactly use a system like that to play Oblivion, can you? And, given that I am not a complete moron, I am aware that you can't, and I am therefore not proposing to try.

Clearly if you expect the same system to function both as a gaming rig and as a video player there are certain compromises that need to be made, but I was hoping R600 would allow the compromises not to be too severe. (The more I hear about R600, the less likely this seems.

)

Dave Baumann said:
Yes, 1.2 (and before) is fully compatible because HD DVD and Blu-ray have a baseline audio support (much like Dobly was the baseline for DVD). Dolby TrueHD and DTS- Master Audio are optional formats that can be included with a title as and extra sound track - if the disk supports either one of these then it will also support the required tracks as well, meaning that it can still be used. These audio format are likely to remain limited in their use to A/V amps and at present there are few that actually support them.

While I wouldn't disagree with any of that, I have to ask what the point is of having audio via HDMI if you're not going to support Dolby TrueHD, etc. Current-generation audio formats can be output via S/PDIF without any problems. The principal benefit of HDMI audio (so far as I can see) is that it has the ability to carry audio formats that can't be sent via S/PDIF because S/PDIF simply doesn't have the bandwith. Pretty much any PC these days will have S/PDIF on it somewhere, so what do you actually gain by having HDMI audio?

I suppose there's the possibility of outputing encrypted DVD-Audio music tracks in digital form, but how many audio processors could actually decode that? And how many people actually listen to DVD-A music discs?

_xxx_ · May 22, 2007

nicolasb said:
Yeah, but you can't exactly use a system like that to play Oblivion, can you?

Why not? I did the whole last year and still do (not in 16x12 or above obviously, but it's not like you'd see any difference upwards from 1280 on a 32" TV from a 3-4m distance anyway).

My problem with R600 is exactly that, it won't fit my SFF. Hopefuly the single-slot midrange offerings from ATI won't suck, or maybe nV will release something single-slot with more oomph than the 8600. Otherwise I'm stuck with my X1800.

Jawed · May 22, 2007

Arun, earlier when I described the sequence of instructions at the start-up of a batch, I referred to "until 8 pixels have executed their 128 MADs", and counted this period as 128 clocks. This was presuming that the following execution could occur:

Code:

Clock Ins  Pixels
0001  MAD 12345678
0002  MAD 12345678
0003  MAD 12345678
0004  MAD 12345678
...
0126  MAD 12345678
0127  MAD 12345678
0128  MAD 12345678

But I think it's extremely unlikely that the pipeline can repeatedly issue an instruction on just 8 pixels, like that. It seems to me that at least one entire warp (32 pixels) has to run its 128 MADs, before MAD+SF can be issued. So I'm thinking there must be another sequence for warps/pixels.

So, now I'm wondering if it's best to think of things from the point of view of the maximum number of pixels in flight in a multiprocessor, specifically how the register file is used.

Because your shaders only use one register (scalar register), this means that 768 pixels can be issued to fill the register file, which can support 768 scalar registers.

I'm wondering if there are "fixed" warp pairings for the issue of MAD + SF instructions. This might be related to the way warps and their pixels are mapped into the register file.

e.g. warp 1 and warp 13 form a pair, so you can have a W01-MAD + W13-SF or W13-MAD + W01SF. Or W12-MAD + W24-SF etc.

Imagine warps 1-24 are in batch 1 and warps 25-48 are in batch 2 and bear in mind that only 24 warps can occupy the register file. I think what's happening is that as warp 1 completes execution, its place is taken by warp 25.

I'm just trying to think of another way to feed warps/pixels/instructions through a multiprocessor and trying to spot where bubbles can occur in either the MAD or SF units. I haven't tried to model this yet.

Jawed

nicolasb · May 22, 2007

_xxx_ said:
Why not? I did the whole last year and still do (not in 16x12 or above obviously, but it's not like you'd see any difference upwards from 1280 on a 32" TV from a 3-4m distance anyway).

Maybe, but who would use a TV like that?

For gaming I would be using my "faithful" Dell 2405FPW (24", 1920x1200) with a viewing distance of less than 1m. For high def video playback I would be using my Sony 55A2000 TV (European version, with the XBR2 electronics) which is 55" and 1920x1080, at a viewing distance of about 2.5m. For gaming on my Dell monitor I would like to have the option to use reasonable levels of AA and AF and turn the detail settings up high. Something tells me that is beyond the reach of any "mid range" card on a game like Oblivion.

Anyway, this is getting a bit off topic....

Mintmaster · May 22, 2007

trinibwoy said:
Isn't that the case for all architectures? I thought G80's strength was that it could come closer to its best case numbers in real world scenarios.

The whole point of that paragraph in my post is that it's not the same for all architectures.

Demos and shader benchmarks usually only showcase one effect. There aren't many parallel instruction streams, so R600 won't be able to get very good utilization. In a real game, chances are that you are using several techniques on simultaneously on most materials, so there's more opportunity to co-issue instructions together. G80, on the other hand, will get near optimal utilization in all scenarios.

For example, if G80 is faster than R600 in a VSM demo and lighting shader demo, it won't necessarily be faster when both are used together.

dnavas · May 22, 2007

Jawed said:
But I think it's extremely unlikely that the pipeline can repeatedly issue an instruction on just 8 pixels, like that. It seems to me that at least one entire warp (32 pixels) has to run its 128 MADs, before MAD+SF can be issued. So I'm thinking there must be another sequence for warps/pixels.

I've been following this discussion on and off, and that was a concern for me as well. In theory, I would think you can execute a new instruction every other cycle (that's what happens in the vertex case, for example). If you can issue instructions from multiple warps, then I don't think we can account for the percent utilization, but a new instruction every cycle begs the question as to why we have 16/32 branch granularity, instead of 8.... Maybe the organization is different? Could it be that there are two schedulers for the 16 scalars, but instead of each driving 8, they actually alternate ownership by clock? Hmm, not sure that helps either.

At anyrate, I haven't got any clever test to discover what might actually be going on in there.

Frank · May 22, 2007

nicolasb said:
Maybe, but who would use a TV like that? For gaming I would be using my "faithful" Dell 2405FPW (24", 1920x1200) with a viewing distance of less than 1m. For high def video playback I would be using my Sony 55A2000 TV (European version, with the XBR2 electronics) which is 55" and 1920x1080, at a viewing distance of about 2.5m. For gaming on my Dell monitor I would like to have the option to use reasonable levels of AA and AF and turn the detail settings up high. Something tells me that is beyond the reach of any "mid range" card on a game like Oblivion.

Anyway, this is getting a bit off topic....

You might be better off simply using your tv for gaming, as it will be about the same size from that distance and you get free AA with it. That reduces the GPU horsepower you need quite a bit at the same time.

Nick · May 22, 2007

Mintmaster said:
In a real game, chances are that you are using several techniques on simultaneously on most materials, so there's more opportunity to co-issue instructions together.

So you believe R600 is capable of issuing instructions from different threads to the same 5-way VLIW shader unit?

Jawed · May 22, 2007

dnavas said:
Could it be that there are two schedulers for the 16 scalars, but instead of each driving 8, they actually alternate ownership by clock? Hmm, not sure that helps either.

I was under the impression that there are alternating batches in the ALU pipeline. AAAABBBB, each instruction for each batch running for 4 cycles. It's notable that pixels 1-8 could run for four consecutive cycles, but only because each source-destination set of registers is different on each clock, i.e. clock 1 is .x, clock 2 is .y, etc.

AAAABBBB complicates my suggestion some more, because now it implies that there's 2 lots of 384 pixels in the register file, each of which is 12 warps in size.

So the pairing I was proposing should actually be W01-MAD + W07-SF, etc. on the A clocks and W13-MAD + W19-SF, etc. on the B clocks Hmm...

Jawed

OpenGL guy · May 22, 2007

Nick said:
Don't they have the exact same shader units?

R580 has three times as many pixel shader ALUs as R520...

trinibwoy · May 22, 2007

Nick said:
So you believe R600 is capable of issuing instructions from different threads to the same 5-way VLIW shader unit?

I think he meant the shader would be doing more stuff so there's a higher chance that there will be co-issuable independent instructions.

nicolasb · May 22, 2007

Frank said:
You might be better off simply using your tv for gaming, as it will be about the same size from that distance and you get free AA with it. That reduces the GPU horsepower you need quite a bit at the same time.

Not that much difference going from 1920x1200 to 1920x1080. And I don't see where the free AA comes from, either; maybe you're thinking of a Cathode Ray Tube TV; this is SXRD (LCOS) rear projection, which is a fixed-pixel device.

I might well use it for gaming sometimes, as I have a fairly nice surround sound setup to go with it; but I don't want to clock up too many hours on the bulb, as they're expensive to replace....

Nick · May 22, 2007

OpenGL guy said:
R580 has three times as many pixel shader ALUs as R520...

I know, but doesn't it use these for processing more pixels in parallel? Hence, wouldn't it take the same number of clock cycles to execute a shader on R520 and R580?

Nick · May 22, 2007

trinibwoy said:
I think he meant the shader would be doing more stuff so there's a higher chance that there will be co-issuable independent instructions.

Oh, now I see his point.

Nick · May 22, 2007

I started reading the R580 article here on Beyond3D again, and quickly noticed this:

[Eric Demers] I'd have to check with the compiler team, but on average, I think we see about 2.3 scalars per instructions being close to the average. Being able to do 2 full scalars (one using VEC and one Scalar) pretty much means that we are pegged out; as well the smaller ALU gets used a lot as well, giving an effective 2~4 scalars per cycle.

This must mean they see a lot of single scalar instructions. With some dependencies between them, wouldn't that mean R600 often can't fully utilize it's ALU's either?

Jawed · May 23, 2007

Eric's answers in this related thread are also great:

http://forum.beyond3d.com//showthread.php?t=27642

Jawed

psurge · May 23, 2007

Mintmaster - do you see that type of ILP as remaining significant in the future? It will be much harder to extract that kind of ILP if shaders get more and more complicated in terms of control flow, no?

OpenGL guy · May 23, 2007

Nick said:
I know, but doesn't it use these for processing more pixels in parallel? Hence, wouldn't it take the same number of clock cycles to execute a shader on R520 and R580?

What difference does that make? The ALU throughput on R580 is 3x the ALU throughput on R520. Isn't that significant? The shader I referred to takes 9 instruction slots. On R520 that means you get 16/9 pixels per clock. On R580 you get ~48/9 pixels per clock. This averages to 9 cycles on R520 and ~4 cycles on R580.

If you were computing a single pixel, there would be no difference, but that's not efficient on any GPU.

Nick · May 23, 2007

OpenGL guy said:
What difference does that make? The ALU throughput on R580 is 3x the ALU throughput on R520. Isn't that significant? The shader I referred to takes 9 instruction slots. On R520 that means you get 16/9 pixels per clock. On R580 you get ~48/9 pixels per clock. This averages to 9 cycles on R520 and ~4 cycles on R580.

The difference it makes is that we're comparing shader unit efficiency here. It's great if you have 300+ ALU's but if only half are utilized and they're clocked at half the frequency of the competition you're in a pickle.

If you were computing a single pixel, there would be no difference, but that's not efficient on any GPU.

I'm not talking about a single pixel, I'm talking about how many times each pixel passes through a shader unit. Clocks per pipeline stage, if you want.

So, I really don't know what this "~4 cycles" means. Are we talking about different things or am I missing something crucial in my understanding?

3D Technology & Architecture

Dave Baumann

Gamerscore Wh...

nicolasb

_xxx_

Jawed

nicolasb

Mintmaster

dnavas

Frank

Certified not a majority

Nick

Jawed

OpenGL guy

trinibwoy

Meh

nicolasb

Nick

Nick

Nick

Jawed

psurge

OpenGL guy

Nick

Similar threads