AMD: R8xx Speculation

How soon will Nvidia respond with GT300 to the upcoming ATI RV870 lineup of GPUs?

  • Within 1 or 2 weeks

    Votes: 1 0.6%
  • Within a month

    Votes: 5 3.2%
  • Within a couple of months

    Votes: 28 18.1%
  • Very late this year

    Votes: 52 33.5%
  • Not until next year

    Votes: 69 44.5%

  • Total voters
    155
  • Poll closed.
Who says that NVIDIA will stay at a size of 32?

As far as I have heard, they are improving dynamic branching with some kind of unknown scheme to avoid bubbles while still keeping the Vec8s. So the warp size should still be 32, but you should get much better performance on branch divergence than on G80/GT200.

I don't know if such a scheme is also viable on AMD's VLIW architecture, because you have up to 5 different dependencies per instruction.

The only way round this that I can think of is two parallel rasterisers, each 16 wide, each feeding half the clusters. Then it'd be similar to R580 where the 16 wide rasteriser built batches of 48.
Wouldn't it be possible to just use large enough FIFOs between ALUs and ROPs?
 
Yes, I didn't think ahead to the 80 versus 32 mismatch :oops:

The only way round this that I can think of is two parallel rasterisers, each 16 wide, each feeding half the clusters. Then it'd be similar to R580 where the 16 wide rasteriser built batches of 48.

Jawed

Or maybe 1200 is actually 1280?
 
Market isn't set in stone; it changes according to what's on offer. AMD made NV drop its ASPs pretty significantly this time - I believe NV was "aiming at the sweet spot" too, but it turned out their aim was wrong because the sweet spot moved.
You plan for what you know you can plan for - costs are known, or at least can be predicted relatively well, so you can plan a strategy around that. That's what the "sweet spot" strategy is about: the $200-$300 bracket at launch (with the knowledge that it will scale down afterward).

What happens in a competitive environment is not a strategy that is planned at the inception of a product or family.
 
Yes, I didn't think ahead to the 80 versus 32 mismatch :oops:

The only way round this that I can think of is two parallel rasterisers, each 16 wide, each feeding half the clusters. Then it'd be similar to R580 where the 16 wide rasteriser built batches of 48.

There has been talk about increasing the raster throughput that could be consistent with doubled setup.
It could be two rasterizers, each supplying half of the SIMDs all the time, or it could be that each rasterizer supplies all the SIMDs half the time. The tessellation pipeline might figure in as well.
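A quick sanity check of the raster arithmetic, as a sketch only: it assumes a rasteriser emits a fixed number of pixels per clock and that a batch launches only once it is full, which is a simplification. `clocks_to_fill` is a hypothetical helper, not anything from AMD.

```python
from typing import Optional


def clocks_to_fill(batch: int, width: int) -> Optional[int]:
    """Whole clocks a `width`-pixel-per-clock rasteriser needs to fill a batch.

    Returns None when the batch is not a multiple of the rasteriser width,
    i.e. the "80 versus 32" mismatch case. Simplified model, not real hardware.
    """
    if batch % width != 0:
        return None
    return batch // width


# R580: a 16-wide rasteriser building batches of 48.
print("48 @ 16-wide:", clocks_to_fill(48, 16))
# Rumoured 80-thread batches fed by a single 32-wide rasteriser.
print("80 @ 32-wide:", clocks_to_fill(80, 32))
# Two 16-wide rasterisers, each building 80-pixel batches for half the SIMDs.
print("80 @ 16-wide:", clocks_to_fill(80, 16), "clocks each; 32 pixels/clock combined")
```

Under this model the 16-wide option fills an 80-pixel batch in 5 clocks, much as R580 filled 48 in 3, while a single 32-wide unit does not divide 80 evenly.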

A side effect of 20-wide SIMDs is that the ROPs have to work for 25% longer to handle writeout for a batch.
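The 25% figure follows if batch size is the SIMD width times a 4-clock cadence (as with RV770's 64 = 16 x 4) and ROP throughput stays the same. A minimal sketch under those two assumptions; the function names are illustrative only.

```python
from fractions import Fraction

CYCLES_PER_INSTRUCTION = 4  # assumption: RV770-style cadence, 64 = 16 x 4


def batch_size(simd_width: int) -> int:
    """Threads per batch for a given physical SIMD width."""
    return simd_width * CYCLES_PER_INSTRUCTION


def writeout_scale(old_width: int, new_width: int) -> Fraction:
    """Relative ROP write-out time per batch.

    Assumes unchanged ROP throughput (pixels/clock), so write-out time
    scales linearly with the number of pixels in a batch.
    """
    return Fraction(batch_size(new_width), batch_size(old_width))


scale = writeout_scale(16, 20)
print(batch_size(16), batch_size(20), scale, float(scale))  # 64 80 5/4 1.25
```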
 
Did you include spare die for DX11 compliance?

I'm not sure what parts of DX11 other than the tessellator will require much die space. If it's true that tessellation in DX11 is based on Xenos, then the only things left to implement in that area are the 2 new types of shaders, as I'd assume the tessellation unit in RV770 is probably close to what will be needed for a DX11-class card.

Personally, I'm not expecting anything too radical from RV870. Maybe more of the same, but no radical changes; that's only a personal feeling, though. Maybe ATI will surprise me with something. :) But other than performance, I'm expecting a rather more pedestrian and boring evolution in architecture.

Personally, I'm way more interested to see what Nvidia does with GT300. It's already large, and they'll want to implement hardware tessellation for DX11. I'd imagine a major goal of theirs would also be to reduce size while increasing capabilities.

Regards,
SB
 
If the CrossFire-on-a-stick strategy is still in ATi's mind for a flagship SKU, then I think some considerations will favour a sub-300 mm² design.
Speaking of that, what about the side-port thingy? It is obvious by now that the bandwidth it adds to the existing bridge interconnect is just not enough to bother about, so why not use the extra port for a bridge-less X2 setup in a kind of master-slave configuration, and save some pennies by ditching the "third wheel"!

Isn't the next X2 said to be an MCM? (two dies on the same package).
I'm curious about it. It may end up being the first good dual GPU since Voodoo2 SLI and Voodoo5, if they do something interesting with that link.
Strange, everyone agreed that AFR sucked back then on the ATI Rage Fury Maxx. AFR hasn't changed as far as I know :p
 
Isn't the next X2 said to be an MCM? (two dies on the same package).
I'm curious about it. It may end up being the first good dual GPU since Voodoo2 SLI and Voodoo5, if they do something interesting with that link.

The 4870X2 was supposed to be the revolutionary MCM too.

Strange, everyone agreed that AFR sucked back then on the ATI Rage Fury Maxx. AFR hasn't changed as far as I know :p

Well it's a concept and concepts don't change. The implementation is vastly different now though, especially at the software level with all the inter-frame traffic going on these days.
 
These specs don't add up for me. Look at the TU:ROP ratio: from RV670 to RV770 to RV870, it goes 1 -> 2.5 -> 1.5. Can anybody explain to me how that makes sense?

Increased SIMD width would be a serious miscalculation IMHO. Nv is at 32 (and attempting to reduce it), LRB is at 16 (vector masking), and AMD is going for 100 :rolleyes: (assuming the 48 TUs are tied to the 1200 SPs, so TUs in packets of 4 serve each 100-wide SIMD)

It could be just FUD, you know :p After all, RV790 was widely rumoured to have 960 SPs until close to launch.
 
These specs don't add up for me. Look at the TU:ROP ratio: from RV670 to RV770 to RV870, it goes 1 -> 2.5 -> 1.5. Can anybody explain to me how that makes sense?
There's not really any need for a fixed relationship between TU and ROP counts, they don't directly interface.
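For reference, the quoted ratios can be reproduced from unit counts. The RV670 and RV770 figures are the shipped chips; the RV870 ROP count here is simply back-solved from the rumoured 48 TUs and the 1.5 ratio, so treat it as an assumption. This is arithmetic only and, as noted, implies no fixed coupling between TUs and ROPs.

```python
from fractions import Fraction


def tu_rop_ratio(tus: int, rops: int) -> Fraction:
    """Texture units per ROP, kept exact."""
    return Fraction(tus, rops)


# Assumption: RV870 ROP count back-solved from 48 TUs and the quoted 1.5 ratio.
RUMOURED_RV870_ROPS = int(Fraction(48) / Fraction(3, 2))

chips = {
    "RV670": (16, 16),
    "RV770": (40, 16),
    "RV870 (rumour)": (48, RUMOURED_RV870_ROPS),
}

for name, (tus, rops) in chips.items():
    print(f"{name}: {tus} TU / {rops} ROP = {float(tu_rop_ratio(tus, rops))}")
```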

Increased SIMD width would be a serious miscalculation IMHO. Nv is at 32 (and attempting to reduce it), LRB is at 16 (vector masking), and AMD is going for 100 :rolleyes: (assuming the 48 TUs are tied to the 1200 SPs, so TUs in packets of 4 serve each 100-wide SIMD)
Those numbers are confusing physical SIMD width with batch width.
Nvidia's physical width is 8.
Larrabee's minimum batch width is 16, but in order to hide texturing latency, multiples of 16 may be put on a fiber.

If the numbers on RV870 are correct, and the scheme is similar to RV770, it is more like five 20-wide SIMDs per array.
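Assuming RV870 keeps RV770's arrangement of one texture quad per SIMD array and 5-wide VLIW stream cores, the rumoured 1200 SPs and 48 TUs work out to twelve 100-lane arrays, i.e. 20 threads wide. A sketch of that arithmetic; the `layout` helper and `Layout` type are hypothetical.

```python
from typing import NamedTuple

VLIW_SLOTS = 5    # each stream core issues a 5-wide VLIW instruction
TUS_PER_SIMD = 4  # RV770 pairs each SIMD array with a quad of texture units


class Layout(NamedTuple):
    simds: int           # number of SIMD arrays
    lanes_per_simd: int  # ALU lanes (SPs) per array
    simd_width: int      # threads per clock = lanes / VLIW slots


def layout(sps: int, tus: int) -> Layout:
    """Derive the array layout, assuming the RV770 TU-per-SIMD arrangement holds."""
    simds = tus // TUS_PER_SIMD
    lanes = sps // simds
    return Layout(simds, lanes, lanes // VLIW_SLOTS)


print("RV770:", layout(800, 40))          # 10 arrays x 80 lanes, 16 wide
print("RV870 rumour:", layout(1200, 48))  # 12 arrays x 100 lanes, 20 wide
```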
 
rpg.314: it's the same bus/ROPs ratio as on RV740 ;)

You can also remember RV530 and its double-Z ROPs, which had twice the Z performance of R520 or R580. ATi then used double-Z ROPs on the entire R6xx generation. This seems to be similar.
 
If the numbers on RV870 are correct, and the scheme is similar to RV770, it is more like five 20-wide SIMDs per array.

To get the semantics right, you are referring to RV770 having five 16-wide SIMDs per array, right?
 
Yeah, but why is that useful? That may be how the hardware is set up, but that's certainly not how the software sees it. It's still 16 threads issuing 5-wide VLIW instructions, and branch granularity is based on the number of threads....
 
Those numbers are confusing physical SIMD width with batch width.

Oops, my bad. :oops:

AMD right now has a branch granularity of 64 threads on a 16-wide SIMD. If the SIMD became 20 wide instead, the branch granularity would become 80 threads (presumably), which is moving in the opposite direction. :cry:
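To make the granularity point concrete, here is a toy model, not how the hardware actually schedules: each instruction occupies the SIMD for 4 clocks (so granularity = width x 4), and a batch pays for both sides of an if/else whenever its threads disagree. With a single divergence edge in 320 threads, bigger batches waste more lane-slots. The helper names are hypothetical.

```python
CYCLES_PER_INSTRUCTION = 4  # assumption: RV770-style 4-clock cadence (64 = 16 x 4)


def branch_granularity(simd_width: int) -> int:
    """Threads that must follow the same path: physical width x clocks per instruction."""
    return simd_width * CYCLES_PER_INSTRUCTION


def simd_efficiency(taken: list[bool], batch: int) -> float:
    """Useful lane-slots / issued lane-slots for an if/else with equal-cost sides.

    A batch runs the 'if' side when any thread takes it and the 'else' side
    when any thread does not; every pass costs a full batch of lane-slots.
    """
    issued = 0
    for start in range(0, len(taken), batch):
        group = taken[start:start + batch]
        passes = int(any(group)) + int(not all(group))
        issued += passes * batch
    # each thread does exactly one side's worth of useful work
    return len(taken) / issued


# 320 threads with one divergence edge (e.g. a primitive boundary).
taken = [i < 100 for i in range(320)]

for width in (8, 16, 20):
    g = branch_granularity(width)
    # prints efficiencies 0.909, 0.833 and 0.800 respectively
    print(f"{width:>2}-wide SIMD -> {g:>2}-thread batches, "
          f"efficiency {simd_efficiency(taken, g):.3f}")
```

Under the same 4-clock assumption, an 8-wide SIMD gives 32-thread batches, which matches Nvidia's warp size.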
 
To get the semantics right, you are referring to RV770 having five 16-wide SIMDs per array, right?

Yes.

Yeah, but why is that useful? That may be how the hardware is set up, but that's certainly not how the software sees it. It's still 16 threads issuing 5-wide VLIW instructions, and branch granularity is based on the number of threads....

It's useful when making distinctions between three different schemes used by the different designs in terms of batch size and physical width. Indicating how AMD subdivides its stream processors explains why it is considered 16-wide despite having so many units per SIMD.
 
Oops, my bad. :oops:

AMD right now has a branch granularity of 64 threads on a 16-wide SIMD. If the SIMD became 20 wide instead, the branch granularity would become 80 threads (presumably), which is moving in the opposite direction. :cry:

Well I guess scheduling changes are a possibility so a 20-wide SIMD wouldn't imply branch granularity of 80.
That said, I'd say that rumour is probably about as reliable as the one about rv770 having 480 SPs...
 
Well I guess scheduling changes are a possibility so a 20-wide SIMD wouldn't imply branch granularity of 80.

Possible, except that it goes against the definition of a SIMD. The branch granularity can only be an integer multiple of the number of physical ALUs. So it must be 20 (next to impossible), 40 (probably not), 60 (that would mean a 3-cycle MADD latency, so unlikely), 80, or 100 (definitely not).
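The candidate list is just the integer multiples of a 20-wide physical SIMD up to 100, each paired with the number of clocks one instruction would occupy the SIMD (which the post ties to the latency it can cover). A minimal sketch with a hypothetical helper name:

```python
def candidate_granularities(simd_width: int, max_batch: int) -> list[tuple[int, int]]:
    """Batch sizes that are whole multiples of the physical SIMD width,
    paired with the clocks each instruction would occupy the SIMD."""
    return [(k * simd_width, k) for k in range(1, max_batch // simd_width + 1)]


for batch, clocks in candidate_granularities(20, 100):
    print(f"{batch:>3} threads -> {clocks} clock(s) per instruction")
```

For comparison, `candidate_granularities(16, 64)` ends at `(64, 4)`, today's 64-thread, 4-clock arrangement.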
 
Well, I suppose all of us need a "lalalalala I don't believe the RV870 SP count lalalalalala" qualifier in our signatures :LOL:

Jawed

You can safely add "lalalalala it's on 40nm" to your signature this time *snicker*
 