AMD: R8xx Speculation

TimothyFarrar · May 5, 2009

Jawed said:
Multiples of 10s of cycles compared with always having to hide ~500 cycles? It's not even close. And Larrabee has 4 hardware threads to help amortise L2 latencies. Just as those threads amortise branch mis-prediction latency. And it's not as if Intel is forced to stick with 4 hardware threads with later versions.

Except I wasn't talking about going to global memory for the "software" blending in NVidia's case, but rather doing it in shared memory (no bank conflicts)!!

Jawed · May 5, 2009

3dilettante said:
That decision was helpful in increasing the orthogonality of D3D.
I suppose Microsoft could have allowed emulation of a unified pipeline on top of physically segregated units. I'm not clear if the specification required physical unification.

There is no unified pipeline - merely unified resources.

None that I know of right now, but my view that the established players have an interest in not making it easy to emulate everything they do in software. If they can set up certain parts of functionality that serve the majority of the market (their cards) but just happen to conveniently perform at a mediocre level on a new entrant, they'd be remiss if they didn't try.

NVidia seems to have held out with PCF for as long as it could.

I think Microsoft's mediation has put paid to any further shenanigans - the best that a cmpany can do, say, is to get features pulled or get data-structures capped lower - e.g. the 16 attribute limit per vertex and the 1024 limit on stream out in D3D10.

The first avenue I see that most directly can hurt Larrabee's flexibility is adding new incompatible texture formats and then evangelizing the heck out of them.

I have to admit I was a bit surprised at the new texture formats in D3D11. Thought texture formats were pretty much done. I'm obviously not remotely qualified to speculate on anything new in future.

Sure Larrabee could emulate support on the main cores, but not without a performance cost.

In the mix of other rendering can the performance cost be discerned?

Intel might even go along with this. Forced obsolescence will sell more Larrabees than if they can be upgraded in software forever.
This one isn't necessarily a problem with Larrabee's cache structure unless the format conveniently is a bad fit for 64 byte cache lines.

I was thinking it might be a bad fit in terms of the lack of fixed-function decompression hardware. Any format that doesn't pack neatly into 512-bits sounds like stupidity, to be honest.

The future direction I think the GPU makers can take is them getting cute with writes to shared memory or shared registers, with multiple small scatters that cause different threads to modify elements within the same cache line.

Larrabee could handle this case correctly, but the penalties from false sharing would rise, particularly if the current shared memory or global register pools it has to emulate remain relatively small.

Maybe swizzle strands into different fibres?

Granted, current GPUs like RV790 don't like writing to shared memory in a non-aligned and dynamically determined manner.

I don't see any advantage in GT200 here... They're both suffering with banking collisions.

Wacky writing to weird formats is something ROPs could be used to accellerate.

Off die memory and on-die caches both penalise that stuff, so you're appealing to something horribly complex that takes dozens/hundreds of cycles to compute, rather than irregular data structures, I guess. Before, Tim brought up compression of render target data - something that's entirely outside D3D as it stands as it's purely a performance optimisation. So it may not be so much a complex format as something that benefits greatly from compression, say. Though the L2 bandwidth-efficiency factor still plays a key role here.

Jawed

nAo · May 5, 2009

So..what's more likely: next gen GPUs having proper (multiprocessor) shared caches or next gen LRB having some fast GPU-like support for uncached memory? I guess you all know the answer.

Jawed · May 5, 2009

TimothyFarrar said:
Except I wasn't talking about going to global memory for the "software" blending in NVidia's case, but rather doing it in shared memory (no bank conflicts)!!

Unless NVidia changes the rasteriser/setup configuration to one that bins triangles by screen-space tile, then shared memory simply isn't going to be anywhere near big enough to "cache" render targets over any meaningful period of time, i.e. the render target data will be repeatedly being swapped into shared memory as any one cluster/multiprocessor will be in charge of numerous screen-space tiles. Non-overlapping in-flight triangles can complete in any order within any tile.

So the latency of those swaps dominates, though obviously the coherence of pixels in a render target means that latency can be shared over quite a few threads fairly easily.

It would be fun if NVidia did actually build a binning forward renderer.

---

EDIT: the swap rate may actually be way lower than I was thinking, because with only 1 or 2 thousand fragments in flight per multiprocessor at any one time, shared memory of 32KB (less whatever's in-use to hold interpolants) would tend to be able to hold a substantial portion, if not all of, the render target colour data. Shared memory could easily be larger than this, too.

Jawed

3dilettante · May 5, 2009

Jawed said:
In the mix of other rendering can the performance cost be discerned?

If the format were incompatible with Larrabee's texture hardware, the order(s) of magnitude performance deficits that prompted Intel to add texture units would come into play.
I think it would have an impact.
It would not make Larrabee useless, but it might make it lose a few 3dmarks.

Maybe swizzle strands into different fibres?

This works if it doesn't throw off the behavior of the strand's other accesses, which might align well enough with its current neighbors.
In general, there might be an incremental performance penalty. Not ruinous, but still present.

I don't see any advantage in GT200 here... They're both suffering with banking collisions.

I didn't comment on GT200 because I wasn't sure what measures it had for this situation.
The problem isn't necessarily collisions, which are a valid source of contention.
The problem, in some future scenario, would be when there are accesses by different cores to different memory locations that are currently mapped to the same long coherent cache line, where the hardware generates collisions that do not exist.

Off die memory and on-die caches both penalise that stuff, so you're appealing to something horribly complex that takes dozens/hundreds of cycles to compute, rather than irregular data structures, I guess.

It would be creative abuse of shared registers or shared memory at a granularity that leads to sharing conflicts between cores.
It's not a current problem, but it can be a future problem if someone wanted to make it one.

Before, Tim brought up compression of render target data - something that's entirely outside D3D as it stands as it's purely a performance optimisation. So it may not be so much a complex format as something that benefits greatly from compression, say. Though the L2 bandwidth-efficiency factor still plays a key role here.

Writing to a compressed target can lead to a number of potential pitfalls as far as a cache is concerned, depending on the design and the intent of the designer. Each modification is a read/modify/write operation, and a non-standard format can mess with alignment with cache line boundaries.

Something like a compressed render target with strange alignment or small granularity that is also shared (maybe in the future) would be a problem case. It could be possible to engineer a situation where every write to a location in the target has no true collisions, but it would still end up ping-ponging cache lines across the chip.
If the cache is MESI, this can be more brutal than it needs to be.

Current GPUs may not fare any better, though it may be that coalesced memory accesses at narrower strides might hide a few such writes. In some future design they might have the same problem, or they could special case this, because they are ASICs and can do what they want.

Jawed · May 5, 2009

3dilettante said:
This works if it doesn't throw off the behavior of the strand's other accesses, which might align well enough with its current neighbors.

The way I see it, with multiple fibres sharing a thread, the flexibility's there to do both - it's a question of the overhead of juggling between 2 or more swizzles, ultimately, versus the cost of L1-fetch serialisation.

Jawed

TimothyFarrar · May 5, 2009

nAo said:
So..what's more likely: next gen GPUs having proper (multiprocessor) shared caches or next gen LRB having some fast GPU-like support for uncached memory? I guess you all know the answer.

Perhaps I'll end up eating my words, I don't see next gen GPUs having large coherent sharable+writable caches (do see them sticking to the L2+L1 read only caches). Given that you don't want to share cache lines on LRB anyway, and you don't want to access shared L2 likely because of the added ring buffer trip, what's the point of large shared caches? L1 is simply an extension of the register file, L2 is just a slower and larger work-group shared memory. I still don't see all this fancy general purpose code making use of the caches (outside the constraints I've described) and scaling to the level of hundreds of millions of N (N being whatever, triangles, verts, pixels, etc).

Also, things like atomic operations can scale hugely better on GPUs without coherent caches. Think about what happens when you have your global atomic operations happening after global memory request serialization (ie at the time of the memory access). At least one paper (stressing global atomic operations in ways not possible with coherent caches) saw only a 15% cost increase from global atomics compared to just global writes.

JasonCross · May 5, 2009

Sound_Card said:
You know it's funny you bring that up. I was talking to someone about this before.
I was thinking three times the SIMD cores, but half as wide with half the TMU's. This gives you 240 ALU's (1200), and 60 TU'S. Assuming they have the same die size target as before, I believe it fits quite nicely.

Yeah, they could play around with it. If memory serves, they sort of did that with RV730. RV770 has 10 SIMD units, 80 stream processors in each, and 4 TMUs on each. RV730 has 8 SIMD units, 40 stream processors in each, 4 TMUs on each one. RV730 has twice the TMU-to-ALU ratio as RV770 by having half as deep SIMD units.

There's no reason each SIMD chunk can't have a variable number of ALUs (within reason) and a variable number of TMUs (within reason), without making major architectural changes.

keritto · May 6, 2009

nicolasb said:
4 or more chips could easily be the sweet spot if we've moved away from AFR to an architecture where the driver sees all four chips as being one big virtual GPU. Think about the way (say) Voodoo 2 used to work....

Yep but lets say how many actual games worked on Vodoo5 class cards and all of them need special compiling just to make it happen to work with all 2/4 cores well in those days to work.

Lukfi said:
Exactly! Ever heard of LucidLogix Hydra? They say it can do this with today's GPUs. But we have yet to see it in action.

Yep, Hidra (coolHydra) seems amazing on the papers. But let's presume that almight chip could manage switching of 4 independent xPU would we like it to use same power as one core of 50W or even more of that. That way power/performance scale up could be considerably less than a two chip solution over direct second PCI-e 2.0 link embedded in the same core as we now have only one. I know that sideport is probably better thing than 16 PCI-e lanes and it's already there on RV770. For me xydra is just a fan boy nerd talk to push something relatively expensive on market that eventually could pay it off even when it might not need it.

Shtal said:
You mean Nvidia believed RV770 will be slightly better then RV670 with only 480SP's (96 *5D) for RV770.

I think they believe it would have only double SPUs of R600. Cause r&d leakies between ATi & envy guys were (are) certainly going on. But then ATi thougt it'll kill them in a matter of size and consumption if overbeef the rv770 with two additional SIMD clusters for which they claimed that they simply had a spare room to put it. Ofc they had when all ATi's shader units takes less than 40% space of chip while at the same time envidia had to add two full processing clusters which are pretty more die expensive than 8% of the GT200 size.
So knowing that their shader units are pretty much lighter than nvidia ones, ATi could add in fact as much as 2000SP with half an effort but only if nvidia could somehow follow it. So I just hope ATi won't be too wise considering to much if and how competition could follow them: 16cluster 16SIMD x8SP + 16x4TMU + 6RBE (24ROPs) i believe would fit under same 260mm2 or even less. I don't believe L2 cache don't need to be much beefed up from that 4MB+ that RV770 has (lets say to some 6MB+ if any at all if they will go back to ring which will finally benifit from DX11 configuration style). And fit it with some 256bit(1GB) /384bit (1.5GB) 1.25GHz GDDR5.

Shtal said:
RV770 were 2.5x more power then RV670.....

lol: Is it possible to predict RV870 will also have 2.5x more power then RV770.

SPECULATION
ATI RV870
40nm
(480 *5D) = 2400 Stream Processors
120 TMU's
32 ROP's
354m?
512bit GDDR5 memory

LOL. Thy don+t need that much TMUs as nvidia's DX10 architecture What's the most sad thing then it might not be an 260-300mm2 chip, and certainly ATi wouldn't want yet another R600 kind of Diezilla with 200W+ TDP on that 354mm2 of size.

CarstenS said:
Don't forget the 60% increase in clock rate, plus shader hotclock and of course, the boost to 80 fat ALUs per SIMD - so you can just promote everything to FP64 and do not have to make a living on the obsolete FP32.

One of the above was actually serious. Go figure...

I see we're on the same trail, but why 80SP more per SIMD and odd 320bit configuration when 256bit would get you almost everything we need? That god special 5th unit still can handle regular operations and with it mirrored and 2 additional little stream units per SIMD i think ATi will have monster that'll crunch even Larabee's 512bit architecture. And i believe ATi must not forget about it and wait until Larabee really catch the spot of sunlight and then go with hussle cause as we all now AMD has an issue with compiler supports for a few years after they release their inovative architecture. And it would be a shame that Intel trash kills this warp2 architecture with their 15yrs old and obsolete nuclear reactor.

Wirmish said:
Concept -> 1 master chip and up to 4 slave chips.

The master chip is like a normal GPU... but with 4 SidePorts.
These SidePorts use the Rambus FlexIO Interface:
· Rambus FlexIO is capable of running from 400 MHz to 8 GHz.
· Contains only 12 lanes (5 lanes are inbound, 7 outbound)
· Theoretical peak I/O bandwidth of 76.8 GB @ 8 GHz (44.8GB out, 32GB in)
· Total bandwidth: 76.8GB/s x 5 SIMD Core (1 internal + 4 "slave chips") = 384GB/s

Nice concept. But then 'hd5850', 5670-5830 hd 5650 would be too costly with too much shader processors that wouldn't fit into current TSMC package no matter how small chip. I don't think intel has somewhat similar for their Atoms. And at the top of all ... too many chips on mainstream boards. If that would be the case will see MCM rather than this kind of multi chip solutions. And it's too costly to properly produce cards with even three of those small monsters, they'll then had more pcb yield issue than die yield issue

And best of all, if it would work, master-slave concept is on last decade extinction list. They even avoid to use it in CFX solutions starting from R600 and PS4.0 so why going two steps back :???:

Lukfi said:
How do you connect 24 ROPs to a 256bit bus?

Ring bus

And even that only if 40nm would be such a elctrical junkie as ATi used to state for their RV740 and doesn't yield up well @950MHz+. In that case current 4RBE would suffice even crazies demand because TMUs & RBE are not tied together since dx10 forward. And since dx11 further unties TMU from SPU we don't need crazy number of them and 48TMUs hanging on a ring would probably be sufficient enough for that 2048 SPU (16x16x8) proposed by me.

Arun said:
As GDDR5 becomes the norm rather than the exception, it makes sense to double the number of ROPs per MC.

A 128-bit GDDR5 RV770 on 40nm could definitely be 20% faster than RV770... (remember in that timeframe, 2.5GHz GDDR5 is perfectly reasonable and 40nm promises a pretty nice clockspeed improvement)

I'm with you on that. It would be great that they use leading edge gddr5 technology on lower mainstream cards, but as we know from the past newer tech process which is been used as testing ground for their new monster chips and which is covered with all that mysterious talkies about leaking problems that RV740 supposedly have just to cover up how excellent RV740 scales up to 900MHz, and if you company it with excellent memory chips (1.25GHz) it would cripple RV790/RV870 sales.
It would kill even HD4870 card owners not even a year after. And ATi considers faint of heart problems of their customers

. We now can see moderate results of that 'leaky' RV740 http://forum.beyond3d.com/showpost.php?p=1289805&postcount=388 and only think what would ATi do if they done it better than this. Faithful customers would go frenzy.

@Jawed i think he meant real 1.25GHz (5GHz in ATi notation since they all fussed up with gddr3->gddr5 conversion methods, somewhere even define it as QDR)

Jawed said:
Could 8 quad-RBEs with a 128-bit bus be more bandwidth efficient than 8 quad-RBEs and a 256-bit bus (both configurations having "the same bandwidth", GDDR5 and GDDR3 respectively, say)?

There are no eight Quad RBEs just four of them

Why do we always considering somekind of die-size ratio approach in our speculations and think that they should double that monstrous 4MB+ of L2 cache it wouldn't be a die shrink then and RV770 monstrous cache is there only for test purposes, probably only with 2-3MB would be almost as efficient as it is now. And from my reasoning larger cache in ATi GPU needs more memory connected to it. Not less

Freak'n Big Panda said:
Well if you think about it ATI has been 1 year refreshes since R300 really. 9800pro wasn't that much of an upgrade for 9700pro users and then there was the 9800xt... and then R420 came along almost 2 years after. r420 wasn't replaced until r520 in 2005.. about a 1.5 year wait there. I guess there's a little exception here with r580 but anyways you get my point. 1 year cycles are pretty common for ATI.

Development cycles are generally getting longer these days as well so I wouldn't be surprised in the slightest if ATI didn't have anything to fill the time gap between rv770 release and rv870 release.

You've been time wise and exact. They hoped for easy scale up on 40nm but all they done is produce bigger 55nm chip that consumes more power and probably gets less in return than cheap RV770.

BRiT · May 6, 2009

keritto said:
Yep but lets say how many actual games worked on Vodoo5 class cards and all of them need special compiling just to make it happen to work with all 2/4 cores well in those days to work.

All games seemed to work. Games did not need any special compiling or tricks to support all 2/4 cores of the Voodoo 5 / 6 series.

keritto · May 6, 2009

Jawed said:
Hmm, this is kinda interesting. Remember how R600 was intended to be only a moderate performance improvement over R580? This "rumour" seems to suggest that the D3D10.1->11 transition (if this is truly D3D11) is going to cause the same kind of mediocre increase in performance.

dx9->dx10 was incomparable transition wasn't it. We cant have both on the same system :mrgreen:

While on the other hand dx11 requirements like 32kB SP cluster local cache (L1) are just doubled up from dx10 specs. Not the mention real time tessellation and FP64 requirements. Just with that it shouldnt be just a minor update. Big wishes.
And yes dx8(r200)->dx9(r300) and dx9(r350)->dx9b(r420) and dx9b(r420)->dx9c (r580) doubled performance. Now that TSMC join the latest litography techie club, and they weren't there only 2yrs back. So from that benefit we can expect more flexibility in ATis designs in the matter of easily doubling performance. We only now need enough SPUs to doubled up performance (DX11).

3dilettante said:
In addition, chips destined for X2 boards would have been culled from the part of the pool of chips that had better than average power characteristics, leaving the thermally less desirable chips for the single-chip cards.
So yes, one could say that the chips on X2 boards burn less power, because more power hungry chips aren't allowed on X2 boards.

Exactly it's all in the chip binning and sorting. It's time consuming and the same principles couldn't be shared on MCM solution. So when you once pack it you could only get lower grade MCM if not satisfing TDP so they must produce better and worse products also, based on MCM if they used it. And even weirder performance per watt configurations if they want to close up to 100% yield which is a reason for small MCM solutions in the first place.
So in fact the whole palette should be based on MCM which would be even more power consuming than some heavier full functional single chip from that MCM when a big part of MCM would be disabled so they have as little to nothing to waste. It's a devious marketing from my point of view. Premium products still would be a premium one with mainstream level product buyer paying price for extra power consumption that could be avoided.

Somebody beat my speculations
http://www.hardware-infos.com/news.php?news=2908

But i can hardly see why damn 32ROPs when that trash is suppose to work @900MHz with only 50% more shaders as many expecting (only 2,16TFlops not even doubled RV770 computational power). It would be a total crap and I hope Intel's Larabee would easily beat them to the ground. It's just too many RBEs and too little shaders for the proposed clock. This chip should go @1.2GHz or somewhere near to have respectable computational power and at that speed even 4RBEs would be more than enough. Or it's maybe sub 90W TDP part then it's pretty prodical chip i must add. Still waiting 32nm power conscious RV840.

neliz said:
$649 cards are a sweet spot? everyone knows that sales at those figures were a small percentage versus the rest of the market. the sweet spot always has been below the high price/high performance parts. placing a part at 80-90% of that performance for 40% of the price seems more like 'finding a sweet spot" than marketing your new product as "previous high end price+$50"

You beat my idea

. HD5770 for me should also be 299$ but only if they make 3-4TFlops scale as i presume. Anything below could sell at 199$ in my opinion. Not that it will ever happen

DegustatoR said:
What makes you so sure that it won't be AMD (again!) who will be in a position of "sweet spot" miss this time? How many times did they missed that spot for the last years? How many times NV did? RV770 is a great chip but let's not forget that it's great in comparision to its competition. And since you don't know what that competition will be you can't know that "sweet spot" we're talking about. You may hope but that doesn't mean that your hopes will become a reality. RV770 is just one chip, you can't assume that RV870 will repeat its success just because RV770 was great. G80 was great, GT200 was pants. Voodoo Graphics was great, Voodoo 3 was pants. Market always changes.

It basically not true cause envy need to revise their architecture which basically only small upgrades since pain in the ass NV30 series followed with great NV40 and then totally revamped in G70. For now envy first needs to learn how to walk and big hope not to learn how to crawl like with NV30 inbred. They in fact are on life support machines since they present GT200 as their prodigy of dying architecture. They had glory days since G70 and that is now fade away. They realize it a year ago when they trash talked AMD & Intel presenting x86 as a dinosaur cumbered with it's age and their glorious cuda as a successor to it.

DegustatoR said:
Why is that bothering you? 48 TUs with 4 TUs per cluster means 12 clusters. 1200/12=100 which is 4 superscalars more than in RV770 cluster.

dx11, dx11 .... and it put it to 190% computational power of RV770 hopfully at 60% RV770's TDP

Novum said:
Doesn't fit with 48 TMUs.

Everything fits especially now in this every part of gpu for itself specifications. And as already has been said

BRiT said:
Because in speculation land, anything is possible.

rpg.314 said:
Increased SIMD width would be a serious miscalculation IMHO. Nv is at 32 (and attempting to reduce it), LRB is at 16 (vector masking) and AMD going for 100 (assuming 48 tu's are tie to 1200 sp's, so 1tu serving 100 wide simd in packets of 4)

I also thought they should split their 16SIMDs (80SP) into two but that still not the way to go with all that shading needs (doubled L1 from DX10) and it's not really easy to have too many independent pipelines/clusters. So for now even buildng it up to 16SIMDs x8SP per cluster =128SPs would be a way to go to build up processing power in my opinion. nV & intel have pretty obscure architectures i might add, one is with us for 6.5 years and one has 'pretty inventive' comeback after 15 years not mentioning how flanky was it at its prime time.

JasonCross said:
It's not that they need to do more. I think an RV770 or 740 or whatever with a new tessellation unit and maybe bigger local cache in the SIMDs might even be enough to be DX11 compliant.

I just don't expect all this time to have passed and for there not to be other improvements. Not "major features" but some architectural bits and bobs and tuning and re-thinking ratios to squeeze greater perf per watt, mm2, and transistor than before.

That's what we should be afraid about. Just shiny new face lift on an old engine. DX11 compliancy without any real boost inside SIMD internal structure design. needed for real FP64 power in DX11. Just a prototype like ill born NV30 but why do that when they can easily scale up the whole thing up. Probably they think of crisis and lower cash inflow so they don't want to give us an revamped architecture we could get on sub-300m2 40nm chip. Not just jet. It's too sad to even think of it.

Jawed said:
It's like comparing the "transistor count" of RV630 and RV635. RV635 gained D3D10.1 functionality and supposedly lost 12M transistors. Transistor count isn't much use though.

They might lost it in shared L1 of R6xx architecture or some advancement in L2 production tech or even losing some weight there.

no-X said:
For more signigicant performance increase, die has to be bigger, too. ~400mm2 die is sufficient for 512bit bus.

What would you like to put under the hood of yours RV870 that you need 400mm2? 80TMUs and 32ROPs? We don't need that much with transitioning to that intel's proposed sick ray tracing instead of abundant texture mappings we're used to.
I'd to like more parallax occlusion maping (is any game ever used that??) to intels so ill bastard raytracing they need to compensate traditionally poor TMU performance of their so called GPU.

Jawed said:
If GT300 is radical then maybe they've fixed all the wastefulness in GT200 and don't need such huge amounts of TMUs/ROPs.

If rumours of ALUs being more sophisticated are true then the implication is that they'll take yet more area per FLOP. NVidia may only be able to afford that kind of extravagance if the TMUs/ROPs go on an extreme diet. Though the wastefulness is only in the region of 30-50% I reckon, so there isn't a monster saving to be made there.

Ofc GT300 should be a radical change because GT200 is on extinction list, and really can't get much performance with that kind of self replicating complex clusters they inherited from G70. Luckily for them they jumped from 192SPs /6clusters to 240SPs/10clusters in the past cause we saw otherwise they would suck badly with 192SP top product that consumed as much as HD4870 and have 10-15 less of performance. Luckily for them they didn't have an insight that ATi would beef up their clusters from 8 to 10. And while ATi do that for a price of a penny, for nvidia it was pretty expensive thing to go from 400mm2 to over 500mm2 die size.

And finally if somebody missed

my ultimate RV870 architecture would be
2048SPUs (16clusters x16SIMDs x8 32bit SPUs (2 of them that big ALUs) easily scalable for FP64 operations needed in DX11
48TMUs (or maybe 64TMUs for better old game lot of texturing capability

)
24ROPs (6RBEs) or some 32 simpler dx11 ROPs (4RBEs) cause i don't know what dx11 specifies
Ring bus with 256bit 1GB@1.25GHz for HD5750 part (249USD) and 384bit 1.5GB@1.25GHz for HD5770 part (or 256bit 1GB@1.40GHz) (299USD)
6MB+ L2 cache (up from 4MB+ RV770)
Core clock: @700-750MHz (2.85-3.05GFlops) HD5750 and @900-950MHz HD5770 (3.7-3.9TFlops)

Considering some issues with a large number of SPUs per cluster we could see a might be future reuse design with only 12SIMDs x8SPUs but that would only total in 1536SPUs (2.9TFlops @950MHz) with a wasted die space and wasted memory bandwidth which then could be shrunk to 1.25GHz 256bit for HD5770 top model and need for a crazy yields to stay in the performance dome. And is futile if 320SPs dx10 cost only as much as 22mm2 this wouldn't save more than 34mm2 and it's better to have beefier core which you could partially disable or cripple with slower GDDR5 than something that simply doesn't reach it's peak.

keritto · May 6, 2009

BRiT said:
All games seemed to work. Games did not need any special compiling or tricks to support all 2/4 cores of the Voodoo 5 / 6 series.

I used to have some game that was bundled with some of Voodoo4 or 5 series and it didn't work at all on dx6 based Rage128 nor newer radeons RV100-R300. So it might be my misinterpretation but a friend used to have it Voodoo4 with 4chips and say it needed some special tweaks for it to work. Might be a driver issues i really don't remember cause wasn't my itch on my bald head. Anyway i think he rejuvenate with GF2GTS.

Kaotik · May 6, 2009

keritto said:
I used to have some game that was bundled with some of Voodoo4 or 5 series and it didn't work at all on dx6 based Rage128 nor newer radeons RV100-R300. So it might be my misinterpretation but a friend used to have it Voodoo4 with 4chips and say it needed some special tweaks for it to work. Might be a driver issues i really don't remember cause wasn't my itch on my bald head. Anyway i think he rejuvenate with GF2GTS.

The games bundled with the Voodoos were probably Glide games [wink]
And Voodoo4 series only had 1 chip cards (V4 4500 was the only one released, but there was prototypes for additional V4 models), V5 series had 2 chips (V5 5500) and 4chip (V5 6000) but the 4chip model never got to markets, there's several different revisions of it with different bridge chips etc out there, and drivers were especially back in those days something that didn't properly support V5 6k.

Blazkowicz · May 6, 2009

I ran a V5 for a while and its compatibility was brilliant. Strong drivers and IQ features make it a great card for legacy gaming (high quality 16bit, 4x AA w/ negative LOD)

Not a single time had I to care about SLI. I always ran at vsync off double buffering, > 60Hz. There really is no scaling and temporal issues as on modern dual GPU, but that's because it were the fixed function days.
With windows 98, it will run pretty much every game from the 1980's till 2001, and DX8/DX9 titles after that as long as they have a fixed function path. I ran GTA vice city, mafia, max payne 2 and postal 2, the ut2003 demo too.

OpenGL guy · May 6, 2009

keritto said:
I used to have some game that was bundled with some of Voodoo4 or 5 series and it didn't work at all on dx6 based Rage128 nor newer radeons RV100-R300. So it might be my misinterpretation but a friend used to have it Voodoo4 with 4chips and say it needed some special tweaks for it to work. Might be a driver issues i really don't remember cause wasn't my itch on my bald head. Anyway i think he rejuvenate with GF2GTS.

Many old games had bad behavior due to hard coding many things internally. For example, they expected the pixel formats to be reported a particular way and would show corruption (or worse) otherwise. Also, if you exposed too many formats, many games would crash since they would statically allocate an array to hold all the pixel formats and since they didn't know in advance how many formats there were...

To get around these issues, vendors had to report their pixel formats in the same order as 3dfx and also had to limit what formats were reported on certain games. I.e. you might avoid reporting DXT formats. DX8 resolved these issues because games had to explicitly check for each format they wanted to use instead of requesting a list of available formats from the driver.

hkultala · May 6, 2009

BRiT said:
All games seemed to work. Games did not need any special compiling or tricks to support all 2/4 cores of the Voodoo 5 / 6 series.

Those 9-year old games did not do any "render to texture" tricks,
and voodoo5 did not even have T&L support.

It's much different thing to parallelize simple triangle drawing and texture mapping than all those rendering to texture and geometry calculation things.

Lukfi · May 6, 2009

keritto said:
Ring bus And even that only if 40nm would be such a elctrical junkie as ATi used to state for their RV740 and doesn't yield up well @950MHz+. In that case current 4RBE would suffice even crazies demand because TMUs & RBE are not tied together since dx10 forward. And since dx11 further unties TMU from SPU we don't need crazy number of them and 48TMUs hanging on a ring would probably be sufficient enough for that 2048 SPU (16x16x8) proposed by me.

But... ATI doesn't use ring-bus anymore, do they?

itaru · May 6, 2009

RV870 may become the structure that resembled PPU of AGEIA.

Lets Get Physical: Inside The PhysX Physics Processor
Part 2 - Physical Hardware
http://www.blachford.info/computer/articles/PhysX2.html

Jawed · May 6, 2009

TimothyFarrar said:
Given that you don't want to share cache lines on LRB anyway, and you don't want to access shared L2 likely because of the added ring buffer trip, what's the point of large shared caches?

Intel put L2 sharing and coherency in there for fun?

Also, things like atomic operations can scale hugely better on GPUs without coherent caches. Think about what happens when you have your global atomic operations happening after global memory request serialization (ie at the time of the memory access).

Any global atomic requires global thread serialisation, not merely global memory serialisation.

The sensible way to serialise in this situation is to distribute serialisation of addresses across extant threads, e.g. if Larrabee has 32 cores, then across 128 threads. Simple tiling of the atomic variables will ensure that cache-line/RAM-bank collisions don't occur.

To serialise, each thread enters itself in the queue for a tile. As long as each thread has enough fibres, it'll be able to hide the latency of the tile queue as it'll be in entered into multiple tile queues.

A large on-die cache then enables tiles to stay on-die for significant periods rather than incurring RAM latency.

At least one paper (stressing global atomic operations in ways not possible with coherent caches) saw only a 15% cost increase from global atomics compared to just global writes.

Antiradiance uses an atomic add, and for 3 light exchanges the total time is 77.3ms for 1,090,000 links. So the average is 42.3M light exchanges per second. GTX260 with 192 lanes at 1.242GHz clock can execute 238.4B ALU cycles per second. So each light interaction costs ~5637 cycles.

With 21212 patches, each containing 256 bins, there's nominally 5.4M bins being atomically updated. So about 50 links per patch and presumably multiple bins updated per link in the light exchange.

I'm unsure about precise timings for portions of their algorithm where the atomics are focused. e.g. table 2 shows 1.2ms bin search + 3.6ms write-minimum, which I guess is a simple tree traversal for the interacting patches plus atomicMin. I can't work out how many times that's executed.

What "goodness" about atomics can you discern from these performance figures?

Jawed

MfA · May 6, 2009

Jawed said:
Intel put L2 sharing and coherency in there for fun?

No, they put it in there because they wanted to present developers with a x86 processor supporting classical multithreaded development ... but it does tie them down a bit.

AMD: R8xx Speculation

How soon will Nvidia respond with GT300 to upcoming ATI-RV870 lineup GPUs

Within 1 or 2 weeks

Within a month

Within couple months

Very late this year

Not until next year

TimothyFarrar

Jawed

nAo

Nutella Nutellae

Jawed

3dilettante

Jawed

TimothyFarrar

JasonCross

keritto

BRiT

(>• •)>⌐■-■ (⌐■-■)

keritto

keritto

Kaotik

Drunk Member

Blazkowicz

OpenGL guy

hkultala

Lukfi

itaru

Jawed

MfA

Similar threads