AMD: R7xx Speculation

It needs to, since things like rasterisation, texture-filtering and Z-testing all have to be able to run as independent programs. These programs are out of sight on a current GPU.
My concern for flexibility is in workloads that don't use rasterization, filtering, or Z-testing.
AMD would be concerned about this too.

As far as I can tell both G80 and R600 only support one kernel executing in GPGPU mode at any one time (shared by all SIMDs). I expect that to change...
That's a whole other level of flexibility I hadn't been thinking of, actually.
For some reason I thought R600's virtualized resources already offered some flexibility in this regard.

When a GPU can time-slice arbitrary kernels I don't see how you can reduce feasibility/utility/performance down to a statement defined by concurrently active kernels. Flexibility normally comes at the cost of performance.
SIMD count puts a very usable upper bound on the amount of concurrent and independent work a GPU in the style of R600 can do.
We know, by virtue of there being 4 SIMDs, that we won't be seeing 5 different shader clauses hitting the ALUs at the same time.

Inflexibility normally comes at the cost of utilization, which in turn impacts performance and power.
Power, in particular, is something that will be even more critical in the future.
I had honestly hoped for a more assertive repudiation by now (from execs, some of whom are no longer with AMD) of statements I heard earlier claiming that GPUs didn't have to worry about a power ceiling.

Now, if you were to argue that GPUs will never time-slice kernels upon individual SIMDs, well that's a different matter. I might have the wrong end of the stick here, expecting GPUs to do so :cry:
Even if they did time-slice, time-slicing isn't concurrent execution.
 
I could be getting very confused, so excuse me. I understood that the way the ultra-threaded dispatch processor is set up, each SIMD has two arbiters and two sequencers. That would mean each SIMD can run two independent programs at once. Kind of like hyper-threading, correct? So in theory, R600 has 8 processors?
In a given clock cycle, only one instruction is actually going through the ALUs. The SIMD alternates between the two sequences.
 
Your definition of execution is broader than mine.
My use of the term execution is limited to the time the instruction is actively engaging an ALU, not just getting to it.

I was looking more at R600 and AMD's use of the word SIMD to describe the 16-element array, whose elements in current instantiations all run in lock-step, executing the same instruction.

Oh, I'm using the exact same definitions that you are. My possibly incorrect assumption is that an atomic instruction ADD/MUL/MAD etc is actually spread across multiple clocks/pipeline stages which opens the door for multiple batches to be "executing" at once. If the entire calculation is performed in a single cycle then I see what you're saying.
 
I think I was unconsciously assuming single-cycle ops, but I think the utilization argument still stands.

If we stretch things over multiple cycles in a pipelined unit, we're injecting bubbles that still indicate ALU units are not performing useful work.
If 1/3 of an ALU is idle in cycle one, then another 1/3 in cycle two, and then the remaining 1/3 in cycle three, after 3 cycles, it's the same as if the whole ALU were idle for 1.
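As a trivial accounting sketch (the 1/3 fractions are just the example numbers above, not real pipeline figures):

```python
# Toy accounting of pipeline bubbles: fractional idleness per cycle adds
# up to whole wasted ALU-cycles over time. Fractions are illustrative only.
idle_fraction_per_cycle = [1/3, 1/3, 1/3]
wasted_alu_cycles = sum(idle_fraction_per_cycle)
print(wasted_alu_cycles)  # 1.0 -> the same loss as the whole ALU idling for one cycle
```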

We won't find more than one instruction in the same stage of execution across all 16 SIMD elements.
 
Dunno, but it'll be interesting to see whether cache size is increased. L2 cache in RV630 is 128KB for 8 TUs while in R600 it's 256KB for 16 TUs. Would cache need to double again?
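A speculative ratio check, assuming the ~16KB of L2 per texture unit seen in R600/RV630 were simply held constant (nothing more than an assumption):

```python
# Hold R600's ratio of 256KB L2 across 16 TUs constant and see what that
# would imply for wider TMU configurations. Purely speculative scaling.
l2_per_tu_kb = 256 / 16
for tmus in (16, 32, 40):
    print(f"{tmus} TMUs -> {int(tmus * l2_per_tu_kb)}KB L2")
```
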
What's 'enough power'? =) And 32 is still less than 64 -)
Their 320 ALUs are idling right now while the 16 TMUs fetch values.
If they increase the TMU count to 32 while increasing the ALU count to 800, that would mean their ALUs will be idle even more than they are right now.

This is just a theory: let's say ATI increases the L2 cache to 1024KB. Could a larger cache, combined with only 32 TMUs, remove the limitation on keeping RV770's 800 processors fed?
 
It would be more interesting to see whether G80 and R600 could do well with more than 2 chips in SLI/Crossfire.
You can see that today, where adding more GPUs gives you diminishing returns and a worse gaming experience because of AFR. The number of customers today with the ability to go to 3+ GPUs must surely be disappointed with their purchase, or wondering why they should bother, since those are supported shipping configurations.

Multi-GPU needs new ways to scale rendering performance across the chips, or a renewed focus on an increased IQ contribution, and ideally both, for it to succeed in the longer term.

If R700 is still RV770 x 2 (and I think it is), then going 3-way+ seems no more attractive to me than it is now, without something in the architecture to help it do new things there.

EDIT: whoops, thought I was quoting a brand new post there. My points still stand though :LOL:
 
Even if they did time-slice, time-slicing isn't concurrent execution.
Well I have to say you've got me utterly mystified about the general point you're making.

Clearly there's a sweet spot for performance when you have a set of kernels that make up an application, all of which need to communicate. But that sweet spot is heavily processor-specific due to bandwidths, latencies, cache sizes, supported-batch counts, state overheads, etc.

It would certainly be an entertaining prospect if AMD is planning a GPU of 4 SIMDs, each processing 96 objects in parallel (1920 SPs in today's parlance, batch size of 384), for 2010.

In the meantime GPUs' inability to timeslice kernels or simply execute multiple kernels side by side is definitely a problem.

Jawed
 
Well I have to say you've got me utterly mystified about the general point you're making.
If body of executable code A for whatever reason winds up utilizing overall resources by 40% and code B is even worse at 20% (made up numbers for the sake of argument), timeslicing them changes nothing.

In that case, we'd wind up sandwiching time slices of 40%/20%/40%/20% utilization. No time whatsoever is saved, and units are wasted twiddling their thumbs.
If the GPU could run them concurrently in an ideal world, we'd instead see a single period of 60% utilization.

In a less than ideal world, we see that an R600-style GPU's ability to run concurrently is limited by the fact that the SIMDs are the basic unit of granularity.
At 4 SIMDs, we could at best see a step function of quarter increments.

If the SIMDs only got longer and still only numbered at 4, we'd only ever see utilization increments of 25% of even coarser and less utilized SIMDs.

If we had more than 4 code streams, we'd see absolutely no change in utilization, because one stream must inevitably sit around and wait its turn.
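Here's a minimal sketch of that argument in code, using the same made-up 40%/20% figures; the simple averaging model and function names are mine, purely for illustration:

```python
# Time-slicing two under-utilizing kernels vs. running them concurrently.
# Utilization figures (0.4 and 0.2) are the made-up numbers from above.

def timesliced(util_a, util_b, slices=4):
    # Alternate slices of A and B: total time is unchanged, so overall
    # utilization is just the average of the two.
    utils = [util_a if i % 2 == 0 else util_b for i in range(slices)]
    return sum(utils) / len(utils)

def concurrent(util_a, util_b):
    # Ideal case: the idle units during one kernel absorb the other's work.
    return min(1.0, util_a + util_b)

print(round(timesliced(0.4, 0.2), 2))  # 0.3 -> no time saved, units still idle
print(round(concurrent(0.4, 0.2), 2))  # 0.6 -> one period at 60% utilization
```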

It would certainly be an entertaining prospect if AMD is planning a GPU of 4 SIMDs, each processing 96 objects in parallel (1920 SPs in today's parlance, batch size of 384), for 2010.
I would find it interesting, but even a moderate amount of underutilization would be like leaving an RV670's-worth of performance (but perhaps not the TDP) on the table.
If it got that long, I'd hope for better utilization, or a way for a more definitive power-down of unused units and datapaths.

If there were some way to fudge the SIMD units out of lockstep, or to allow them to occasionally sneak in other work, then the effective performance impact of extreme batch sizes could be reduced.

In the meantime GPUs' inability to timeslice kernels or simply execute multiple kernels side by side is definitely a problem.

Jawed
I agree on that.
 
If body of executable code A for whatever reason winds up utilizing overall resources by 40% and code B is even worse at 20% (made up numbers for the sake of argument), timeslicing them changes nothing.

In that case, we'd wind up sandwiching time slices of 40%/20%/40%/20% utilization. No time whatsoever is saved, and units are wasted twiddling their thumbs.
If the GPU could run them concurrently in an ideal world, we'd instead see a single period of 60% utilization.

In a less than ideal world, we see that an R600-style GPU's ability to run concurrently is limited by the fact that the SIMDs are the basic unit of granularity.
At 4 SIMDs, we could at best see a step function of quarter increments.
The count of SIMDs in a GPU doesn't affect the divergence penalty for dynamic branching. It's the size of a batch, which is some multiple of SIMD-width, that determines this.

If what you're saying is that more SIMDs are better in dynamic branching performance for a fixed number of pixels per clock - then I agree with that. For a given number of pixels, say 128, all executed during the same clock cycle at the same time, then it's better to have 16 SIMDs of 8 lanes instead of 4 SIMDs of 32 lanes. Clearly the former costs more die area.

Now, the comparison of G80 and R600 is interesting because G80 processes 128 pixels per clock, while R600 processes 64. G80 uses 16 SIMDs, while R600 uses 4. So G80 has 8 lanes per SIMD while R600 has 16.

We already know that in pixel shading R600 has twice the batch size of G80, i.e. it's worse. In vertex shading it's even worse, 16 on G80, 64 on R600.
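A rough probabilistic sketch of why batch size is what hurts (the 2% branch-taken rate and the independence assumption are invented purely for illustration):

```python
# Probability that a batch contains both branch outcomes, forcing it to
# execute both sides. SIMD count never appears; only batch size matters.
def prob_batch_pays_both_paths(batch_size, p_taken):
    # Assumes each element takes the branch independently (a big simplification).
    return 1.0 - p_taken**batch_size - (1.0 - p_taken)**batch_size

# G80 VS, G80 PS, R600 PS, and the hypothetical 2010 part mentioned earlier
for batch in (16, 32, 64, 384):
    print(batch, round(prob_batch_pays_both_paths(batch, p_taken=0.02), 3))
```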

G80's custom logic implementation for the ALUs allows NVidia to use half as many as would otherwise have been needed. Otherwise G80 would have been something like 32 SIMDs that are 8 wide or 16 SIMDs that are 16 wide.

If there were some way to fudge the SIMD units out of lockstep, or to allow them to occasionally sneak in other work, then the effective performance impact of extreme batch sizes could be reduced.
Conditional Routing:

http://cva.stanford.edu/publications/2004/kapasi-thesis/

The problem is the fragmentation you get with nesting and the bandwidth/time consumed in forming the temporary batches required for CR.

Jawed
 
AMD is preparing its answer to the GeForce 9800 GX2. Like the Radeon HD 3870 X2, it will be a dual-chip card. It will be based on the 55nm RV770 chip, which will replace RV670 in the single-chip segment.

At CeBIT 2008 there was a lot of information about the RV770 chip. First, RV770 will be 30-40% faster than its predecessor (RV670). Further improvements to the 55nm process will allow higher clock frequencies. Furthermore, the number of functional units in the chip may be increased.

http://xtreview.com/addcomment-id-4461-view-RV770-30-40-percent-faster.html

30-40% faster than RV670? That's good enough for me. Two 4850's in CF should be sweet.
 
Yeah, I agree with that 100%. But I'm just saying that having an underutilized shader core isn't necessarily a disadvantage vs the competition. I think G80 is a similar case. It has 'enough' shading power and an abundance of probably underused texturing capability, yet it still dominates because it has sufficient muscle where it counts.
Anything idle in a chip is a waste of transistors -- that's a disadvantage.
I don't think that G80 has any underused texturing capability. G92 -- yes, it has more than 'enough' TMUs, but I suppose that going from 32 trilinear to 64 bilinear didn't cost much in die area, so it can't really be considered a disadvantage.
Going from 320 underutilized ALUs to 800 underutilized ALUs will cost plenty of die space, and if they're idle most of the time then that die space is wasted. And if RV770 is a mid-range chip then I don't see any room to waste die area in it (because what matters in the mid-range is the average cost/performance ratio in today's applications).
 
Going from 320 underutilized ALUs to 800 underutilized ALUs will cost plenty of die space, and if they're idle most of the time then that die space is wasted.
Two points:
1) If TMUs are also increased to 40, then underutilization in a given workload wouldn't be any higher (excluding non-shader-core bottlenecks). Clearly going for only 32 TMUs would increase it slightly, though.
2) This would be a substantially faster chip, which means it is more likely to be used at higher resolutions. Because texture resolutions are limited, multi-cycle texture filtering then becomes less likely to be necessary, which increases the workload's effective ALU:TEX ratio.
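One way to see the second point, with a made-up shader of 20 ALU instructions and 4 texture fetches standing in for a real workload:

```python
# Effective ALU:TEX ratio of a shader, measured in hardware cycles: cheaper
# (single-cycle) filtering at high resolutions makes the same shader look
# more ALU-heavy, which suits an ALU-rich chip better.
def effective_alu_tex(alu_instr, tex_instr, avg_filter_cycles):
    return alu_instr / (tex_instr * avg_filter_cycles)

print(effective_alu_tex(20, 4, 2.0))  # 2.5:1 with 2-cycle filtering
print(effective_alu_tex(20, 4, 1.0))  # 5.0:1 once filtering is single-cycle
```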
 
This is just a theory: let's say ATI increases the L2 cache to 1024KB. Could a larger cache, combined with only 32 TMUs, remove the limitation on keeping RV770's 800 processors fed?

If we are going to be technical, let's be technical. The best case, unless RV770 is a ground-up redesign, is 12 or fewer processors, each consisting of 80 or more shader units/ALUs/SIMD bitslices.

Aaron Spink
speaking for myself inc.
 
You can see that today, where adding more GPUs gives you diminishing returns and a worse gaming experience because of AFR. The number of customers today with the ability to go to 3+ GPUs must surely be disappointed with their purchase, or wondering why they should bother, since those are supported shipping configurations.

Multi-GPU needs new ways to scale rendering performance across the chips, or a renewed focus on an increased IQ contribution, and ideally both, for it to succeed in the longer term.

I'm not even sure we should *really* be counting the FPS numbers in AFR modes. Certainly they should be de-rated. I would also like to see a much stronger focus on using SLI (a misnomer if there ever was one) or CF either to increase image quality or to accelerate single-frame rendering rates (which requires some sort of distributed workload).

Especially with the single-card designs, they should be able to design in ~18-24 GB/s of inter-chip bandwidth, which should be plenty to support cross-buffer reads and actually allow real improvements in frame rates.
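As a rough sanity check on that figure (resolution, buffer formats and frame rate below are just illustrative assumptions):

```python
# Traffic for shipping a full render target between chips every frame,
# versus the ~18-24 GB/s of inter-chip bandwidth suggested above.
def buffer_traffic_gb_s(width, height, bytes_per_pixel, fps):
    return width * height * bytes_per_pixel * fps / 1e9

# 2560x1600, FP16 colour (8B) plus 32-bit depth (4B), at 60 frames per second
print(buffer_traffic_gb_s(2560, 1600, 8 + 4, 60))  # ~2.95 GB/s
```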

Aaron Spink
speaking for myself inc.
 
For what it's worth: http://www.fudzilla.com/index.php?option=com_content&task=view&id=6721&Itemid=1

We learned that the RV770 GPU has already gone to production and there is a big chance we'll see the R700 generation much earlier than anyone expected. TSMC is already in pilot production and the chips are, as we reported earlier, built on a 55nm process.

The way it looks now, there is a strong possibility that R700 should show its face at Computex (June 3) while the launch itself might be shortly before or after Computex. We still don’t have the final details.

RV770 is based on the R700 design and the R700 itself consists of two RV770 chips. From what we know, RV770 should be at least fifty percent faster than the RV670.
 
Depending on the game, that would put it on equal footing with the 8800GT/S in single-card configs. I'm still waiting for the ATI that brought us the 9700 to wake up.

Err, I think it would be a good deal better than that. In some games the 3870 even beats the 8800GT/S. Generally the 3870 is about as fast as a 9600GT, which isn't far behind an 8800GT either. Heck, fix the AA hit and the 3870 would compete with the 8800GT right now (as it generally does with AA turned off).

But I agree with your point, I am sure Nvidia is making a solution aimed at being 100% faster than G80. 30-40% is what AMD expects to compete with?

I mean, if the tech specs are true I don't see how that is either... all RV770 specs seem at least doubled from RV670... on the other hand, the die size rumors place it pretty small.

The fudzilla report is pretty exciting though, June 3 is less than two months away.
 
Given numbers like in the review link below, it had better be 50 percent faster MINIMUM over the 3870x2 if it's going to compete all-around against the GX2. 100 percent or more may just bring it up to perhaps compete with the GT200 when that is released.

http://www.anandtech.com/video/showdoc.aspx?i=3266&p=1

Some of the benchies here bring the 3870x2 to within 15 percent, but most of them put it way behind, due to either OpenGL driver-related problems or simple performance issues in hardware. The 3870x2 is considerably cheaper though; that's its benefit... if you're willing to settle for 50 percent less performance in many games.

ATI must know they need to at least take out the GX2 (and get close to the GT200) one way or another, even if they aren't all about "high-end" at this time... whatever they come out with in June has GOT to be amazing or the next thing I'm getting is 2x "9800" gt's or the GT200.

A good start would be more supportive drivers, combined with bringing performance up on those games that suffer severely against the GX2. It's too much of a mixed bag with the RV670 series at this time.
 
ATI must know they need to at least take out the GX2 (and get close to the GT200) one way or another, even if they aren't all about "high-end" at this time... whatever they come out with in June has GOT to be amazing or the next thing I'm getting is 2x "9800" gt's or the GT200.
Mm, and I bet ATI is really worried about what your next purchase will be. :cool:
 
Mm, and I bet ATI is really worried about what your next purchase will be. :cool:

It wasn't intended as a threat, of course :). I was simply informing you guys of my decision factors, not setting a target for them... obviously. They will ultimately release what they want and see as most beneficial. I'm simply pointing out the differences in competition level requirements.

Just to ensure clarification and avoid another unneeded response, I meant "performance" related competition :p.
 