That's clearly not true - a cursory comparison of G80 and R600 reveals radically different architectures with quite different emphases on SIMD width.
I've said in another post I was limiting this discussion to a GPU in the vein of R600 and RV770.
I don't consider G80 to be an R600-style GPU.
I think you're trying to generalise too much and ignoring how architectures scale. At the same time, super-wide SIMDs are always going to look bad against narrower SIMDs, solely because of the dynamic branching problem. But it's the batch size that's the crunch point, not the SIMD width per se.
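To put a toy number on that crunch point (my own sketch, nothing vendor-specific - the taken-probability and path lengths are made up):

```python
# Toy model: the expected cost of a 2-way dynamic branch tracks the
# batch size, regardless of how the batch is spread over SIMD lanes
# and clocks. Assumes independent per-thread branch outcomes.

def branch_cost(batch_size, p_taken, taken_len, not_taken_len):
    """Expected instruction slots per thread for a 2-way branch.

    If any thread in the batch diverges from the rest, the whole
    batch must execute both paths under predication.
    """
    all_taken = p_taken ** batch_size
    none_taken = (1 - p_taken) ** batch_size
    divergent = 1 - all_taken - none_taken
    return (all_taken * taken_len
            + none_taken * not_taken_len
            + divergent * (taken_len + not_taken_len))

for batch in (16, 32, 64, 128):
    cost = branch_cost(batch, p_taken=0.05, taken_len=20, not_taken_len=20)
    print(f"batch {batch:3d}: {cost:.1f} slots per thread")
```

With a 5% taken-rate and equal 20-instruction paths, a 16-thread batch averages ~31 slots per thread while a 128-thread batch is pinned at ~40. Note the model doesn't care whether those 128 threads sit on one 128-wide SIMD or a 32-wide SIMD over 4 clocks - only the batch size enters into it.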
Non-coherent execution can exist outside of the individual batches.
What happens if AMD decides to add another kind of shader thread (just sayin')?
What happens if it's more than 4 such types?
More likely, what happens if AMD adds the ability to run multiple independent or loosely coupled programs, or the ability to make some special kind of procedure call?
I've been talking about the way the TU is constructed, hypothesising that it's a monolithic unit in RV670, with each TEX instruction running for 4 clocks. If RV770 is the same, then this enforces a batch size of 128 on the SIMDs (since a TU batch is assumed to be 32 wide * 4 clocks). So the basic design choices restrict the options for SIMD width. Only 5 SIMDs, each 32 wide, fit.
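Spelling out the arithmetic in a quick Python sketch (the 800-ALU/160-unit total for RV770 is my assumption here; the 32-wide, 4-clock TU is the hypothesis above):

```python
# Arithmetic behind the monolithic-TU hypothesis.
# Assumed: RV770 has 800 scalar ALUs organised as 160 VLIW-5 units,
# and a TEX instruction occupies a 32-wide TU for 4 clocks, as
# hypothesised for RV670.

tu_width = 32        # lanes in the (hypothesised) monolithic TU
tex_clocks = 4       # clocks per TEX instruction
batch_size = tu_width * tex_clocks        # 128 threads per batch

alu_clocks = 4       # a SIMD also runs a batch over 4 clocks
simd_width = batch_size // alu_clocks     # forces 32-wide SIMDs

total_units = 800 // 5                    # 160 VLIW-5 units (assumed)
simd_count = total_units // simd_width    # leaves room for only 5 SIMDs

print(batch_size, simd_width, simd_count)  # 128 32 5
```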
Clearly ATI can fiddle with the count of SIMDs. But if the batch size is 128 due to basic architectural factors, no variation in SIMD count is going to alter the divergence penalty.
All the R6xx GPUs thus far have kept TU width on a 1:1 basis with SIMD width.
If we assume that remains the case, then fiddling with the TU width maps 1:1 to fiddling with SIMD width, which, if we keep the ALU count and organization otherwise the same, means we haven't really moved away from fiddling with SIMD count.
We could even treat the TU as a SIMD, albeit more specialized, if we go by what a poster or two have said at other times...
So, the question is, am I right about the monolithic TU? If not, then there's more flexibility in both SIMD width and count.
I like the idea of a non-monolithic TU better in the abstract, though what I like obviously doesn't matter to AMD one bit.
A wider TU will suffer from more utilization issues, and on reflection, I think the price of underutilizing the equivalent of memory or cache ports is worse than the price of underutilizing the ALUs.
Apart from the start-up/shut-down costs of a program and branch divergence penalties, the SIMD count/width makes no difference, if the total ALU capability is constant.
It would make a difference if the GPU did what I said I hoped would happen in the future: generate clauses from independent programs and apportion them to a SIMD.
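As a purely hypothetical sketch of what I mean (the program names and the round-robin policy are invented for illustration):

```python
from collections import deque

# Hypothetical clause scheduler: independent programs each expose a
# queue of ready clauses; free SIMDs pull the next clause from the
# programs in round-robin order. Purely illustrative.

programs = {
    "shadow_pass": deque(["alu0", "tex0", "alu1"]),
    "physics":     deque(["alu0", "alu1"]),
    "post_fx":     deque(["tex0", "alu0"]),
}

num_simds = 4
schedule = [[] for _ in range(num_simds)]
order = deque(programs)

simd = 0
while any(programs.values()):
    name = order[0]
    order.rotate(-1)              # visit programs round-robin
    if programs[name]:
        schedule[simd].append((name, programs[name].popleft()))
        simd = (simd + 1) % num_simds

for i, clauses in enumerate(schedule):
    print(f"SIMD {i}: {clauses}")
```

The point being that in a scheme like this the SIMD, not the lane, is the unit the scheduler reasons about - which is why count matters independently of width.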
With programmability growing in many areas of functionality, the SIMD itself is increasingly a useful unit of granularity.
My posts have lacked clarity, so I failed to make it clear that when I speak of utilizing a GPU's resources, I don't mean only ALU utilization, but utilization everywhere.
As programmability increases, the number of ties to the shader arrays is bound to increase.
Future units that get tied into the execution pipeline are likely to have a way to interface with the SIMDs, but of course we'd only get garbage if two of them hit the same SIMD simultaneously.
The SIMD count becomes the upper bound on the number of independent programs that can run simultaneously, or at least the upper bound on programs that use any ALUs.
At its simplest you can see this problem if you consider the implementation of a 1MB register file. A 4 SIMD GPU will perform better than a 16 SIMD GPU - the latter will have fewer batches per SIMD available to hide latency. And as the register allocation increases, the total number of available batches across the entire GPU will fall faster in the 16 SIMD case because of fragmentation.
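Here's a back-of-the-envelope version of that comparison (assuming the batch size stays pinned at 128 per my TU argument, and vec4 fp32 registers; the per-thread register counts are illustrative):

```python
# Toy occupancy model for a 1MB register file split across SIMDs.
# Assumed: batch size is fixed at 128 threads (the monolithic-TU
# argument), and a batch's registers must fit entirely within its
# own SIMD's slice of the file - whole batches only.

REG_FILE_BYTES = 1 << 20    # 1 MB total register file
BATCH_THREADS = 128         # fixed by the hypothesised monolithic TU
REG_BYTES = 16              # one fp32 vec4 register per thread

def batches_per_simd(simd_count, regs_per_thread):
    slice_bytes = REG_FILE_BYTES // simd_count
    batch_bytes = BATCH_THREADS * regs_per_thread * REG_BYTES
    return slice_bytes // batch_bytes   # rounds down: fragmentation

for regs in (4, 8, 16, 24):
    print(f"{regs:2d} regs/thread: "
          f"4 SIMDs -> {batches_per_simd(4, regs)} batches/SIMD, "
          f"16 SIMDs -> {batches_per_simd(16, regs)} batches/SIMD")
```

At 24 registers per thread the 16 SIMD case is down to a single batch per SIMD - no latency hiding at all - while the 4 SIMD case still has 5, because the rounding-down happens 16 times instead of 4.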
Won't batch sizes scale with SIMD length?
I'm thinking a SIMD 1/4 the size of the first example would have 1/4-sized batches to match. The fragmentation and batching overhead issues would increase, true.
Does the increased performance of dynamic branching in the 16 SIMD GPU compensate for the lower latency-hiding in complex shaders?
How reduced is the latency hiding capability?
Besides, that question can't be answered for all workloads.
It could be done either way.
Streaming processors hide fetch latencies by oversubscribing the SIMDs. I can't work out the meaning of "issue-port restriction" to be honest.
Merely a comment on the fact that sending an instruction to be executed is analogous to a CPU's issue port, just fanned out by a factor of 16.
If it has 5 instructions in flight, each of which exercises a different area of the chip, it's only going to be able to issue 4 of them and must leave part of the chip idle.
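A toy illustration (the 5 unit types are invented; the 4 issue ports echo the "only going to be able to issue 4" above):

```python
# Toy issue model: 5 functional-unit types but only 4 issue ports
# per clock (numbers invented for illustration). With one ready
# instruction per unit type every clock, some unit always idles.

UNITS = ["alu", "tex", "interp", "export", "branch"]
ISSUE_PORTS = 4

idle_counts = {u: 0 for u in UNITS}
for clock in range(10):
    # round-robin pick of which 4 of the 5 ready instructions issue
    issued = [UNITS[(clock + i) % len(UNITS)] for i in range(ISSUE_PORTS)]
    for u in UNITS:
        if u not in issued:
            idle_counts[u] += 1

print(idle_counts)   # each unit idles ~1 clock in 5
```

Over any stretch of clocks, each unit sits idle about 1 clock in 5, purely from the port restriction.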
I'm also not sold on the idea that hiding fetch latency is the absolute highest priority in all cases, not if speculation or divergent work causes the chip to hit the TDP barrier.
Is latency tolerance massively important? Yes, but perhaps sometimes going to the ends of the earth to avoid that last cycle of unhidden latency isn't worth the price.
This doesn't tell us anything interesting, though, because of the freedom of architectural choices (and the learning curve). G71 vs R580 is another good example, with respect to batch size and dynamic branching, where both designs could perform 48 MADs per clock.
I'm basing my argument on the idea that R7xx will bear some resemblance to R6xx, and that AMD intends to extend the architecture into the future when it runs into Larrabee in the GPGPU space.
This could be wildly wrong, or AMD may not exist by that point, but that is where I'm coming from.