Chiphell insists it's a dual-core design.
> - each 5-way SIMD is 16 strands wide (just like now) but the ALU:TEX is doubled to 8:1 - this leads to a massive increase in compute density as the TUs currently cost ~29% of a cluster's area, and this would reduce that penalty to a mere ~17%. Put another way, that would double per-cluster compute for 71% more area.
If each TU is single-cycle fp16, that means each TU could be ~70% bigger. This huge increase in TU area might be the motivation to increase ALU:TEX.
http://www.guru3d.com/news/ati-rv870-scores-p18000-in-vantage/
According to the latest rumors, the launch of AMD's ATI RV870 flagship graphics card is imminent (sometime next month).
On an Asian bulletin board, it seems that the RV870 surfaced with specs and all. It supposedly comes with a 384-bit memory interface, but that info seems shady and somewhat inaccurate, so we'll leave it as is.
Other colleagues already published a 3DMark Vantage score in the Performance preset for the mainstream solution RV840 (Juniper) a few days ago. That chip reached P9500 points, though they did not give any further details about the test system. This places it between the HD 4850 and HD 4870.
The same source also has a result for the flagship product RV870 (Cypress) on hand. They did not disclose an exact performance number, only a range between P16000 and P18000, which is almost twice the RV840 score and surpasses the HD 4870 X2.
* Cypress/RV870: P16000-P18000
* Juniper/RV840: P95xx
* Redwood/RV830: P46xx
Though 3DMark says very little, it's safe to assume that the new flagship card will double the performance of the last generation.
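As a quick sanity check on the "almost twice" claim, here is a minimal sketch that just divides the rumored score ranges. It assumes "P95xx" means roughly 9500, which is my reading and not something the article states exactly.

```python
# Rumored 3DMark Vantage (Performance preset) scores from the article.
# "P95xx" is taken as 9500 here, which is an assumption.
JUNIPER_SCORE = 9500
CYPRESS_RANGE = (16000, 18000)


def score_ratio_range(flagship_range, baseline):
    """Return the (low, high) ratio of the flagship range to the baseline."""
    low, high = flagship_range
    return low / baseline, high / baseline


low, high = score_ratio_range(CYPRESS_RANGE, JUNIPER_SCORE)
print(f"Cypress vs Juniper: {low:.2f}x to {high:.2f}x")
```

On those numbers the flagship lands at roughly 1.7x to 1.9x the mainstream part, so "almost twice" only holds at the top of the rumored range.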
> "Dual-shader" might mean that one cluster contains two 5-way SIMDs, both of which share a quad-TU. This provides two options:
> - each 5-way SIMD is 8 strands wide, i.e. a thread is 32 wide - this substantially reduces branch incoherence penalties
> - each 5-way SIMD is 16 strands wide (just like now) but the ALU:TEX is doubled to 8:1 - this leads to a massive increase in compute density as the TUs currently cost ~29% of a cluster's area, and this would reduce that penalty to a mere ~17%. Put another way, that would double per-cluster compute for 71% more area.
> Jawed
Maybe it just refers to the direly needed dual-setup-unit, which is indicated by the double-complex scheduler?
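The area arithmetic behind the 8:1 option can be reproduced with a few lines. This is only a sketch: it assumes everything in a cluster that is not TU scales with the ALU count (which is what the 71% figure implies), and the `tu_scale` parameter for a ~70% bigger single-cycle fp16 TU is a hypothetical combination of the two ideas, not something stated as a single claim in the thread.

```python
def scaled_cluster(tu_share=0.29, alu_scale=2.0, tu_scale=1.0):
    """Return (relative_area, new_tu_share) for a rescaled cluster.

    tu_share  : fraction of today's cluster area spent on TUs (~29% per the post)
    alu_scale : multiplier on the non-TU part (2.0 = ALU:TEX doubled to 8:1)
    tu_scale  : multiplier on TU area (1.7 = hypothetical single-cycle fp16 TU)
    """
    alu_area = (1.0 - tu_share) * alu_scale
    tu_area = tu_share * tu_scale
    total = alu_area + tu_area
    return total, tu_area / total


area, share = scaled_cluster()
print(f"8:1, same TUs:       {area - 1:.0%} more area, TU share {share:.0%}")

area, share = scaled_cluster(tu_scale=1.7)
print(f"8:1, 70% bigger TUs: {area - 1:.0%} more area, TU share {share:.0%}")
```

With unchanged TUs this gives back the 71% and ~17% quoted above. With the bigger TUs the cluster grows by about 91% and the TU share only drops to about 26%, which is one way to see why larger TUs would push ALU:TEX up.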
> Maybe it just refers to the direly needed dual-setup-unit, which is indicated by the double-complex scheduler?
You mean the AAAABBBB instruction-issue?
[Update 2] AMD will host Evergreen’s official launch on the aircraft carrier U.S.S. Hornet, moored in Alameda, Calif.
> You mean the AAAABBBB instruction-issue?
> Jawed
Not sure what you mean, so I cannot tell if I mean the same.
> So, ahem, guesstimating:
> - Juniper 181mm² - 5 clusters, 800 ALUs, 20 TUs, 16 RBEs, 128-bit
> - Cedar 220mm² - 8 clusters, 1280 ALUs, 32 TUs, 16 RBEs, 128-bit
> - Cypress 316mm² - 12 clusters, 1920 ALUs, 48 TUs, 32 RBEs, 256-bit
> Jawed
Your Juniper specs are very RV670-like. The only significant difference is the higher number of ALUs. I can't believe that Juniper could be the same size as RV670 on a manufacturing process that is 1.9x denser...
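For what it's worth, the guesstimated configurations are internally consistent with the "two 5-way SIMDs sharing a quad-TU" reading. A minimal sketch that checks this, using only the numbers from the list; the per-cluster breakdown in the comments is my interpretation, not something spelled out in the post.

```python
# Guesstimated configurations from the post: name -> (clusters, ALUs, TUs)
GUESSES = {
    "Juniper": (5, 800, 20),
    "Cedar": (8, 1280, 32),
    "Cypress": (12, 1920, 48),
}

VLIW_WIDTH = 5  # each SIMD lane is 5-way


def per_cluster(clusters, alus, tus):
    """Return (ALUs per cluster, TUs per cluster, ALU:TEX ratio)."""
    alus_pc = alus // clusters
    tus_pc = tus // clusters
    # ALU:TEX counts 5-way lanes per TU, e.g. 32 lanes / 4 TUs = 8:1
    alu_tex = (alus_pc // VLIW_WIDTH) // tus_pc
    return alus_pc, tus_pc, alu_tex


for name, (clusters, alus, tus) in GUESSES.items():
    alus_pc, tus_pc, alu_tex = per_cluster(clusters, alus, tus)
    print(f"{name:8s} {alus_pc} ALUs/cluster, {tus_pc} TUs/cluster, ALU:TEX {alu_tex}:1")
```

All three come out at 160 ALUs and 4 TUs per cluster, i.e. 2 SIMDs x 16 lanes x 5-way per quad-TU, which matches the 8:1 option.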
> Not sure what you mean, so I cannot tell if I mean the same.
Two threads are issued on the ALUs, with thread A issued as a single instruction over four cycles (AAAA), then thread B takes its turn. So the SIMD looks like it's executing two threads at the same time.
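The AAAABBBB pattern can be sketched as a tiny generator. This is an illustration of the issue order described above, not a model of the actual sequencer; the function names are made up for the example. The second helper shows why an 8-strand SIMD implies a 32-wide thread when one instruction still takes four cycles.

```python
from itertools import islice


def issue_sequence(threads=("A", "B"), cycles_per_instruction=4):
    """Yield the label of the thread that owns the SIMD on each cycle."""
    while True:
        for label in threads:
            for _ in range(cycles_per_instruction):
                yield label


def thread_width(lanes, cycles_per_instruction=4):
    """Strands per thread when one instruction is spread over several cycles."""
    return lanes * cycles_per_instruction


print("".join(islice(issue_sequence(), 16)))
print("16-lane SIMD:", thread_width(16), "strands per thread")
print(" 8-lane SIMD:", thread_width(8), "strands per thread")
```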
> Your Juniper specs are very RV670-like. The only significant difference is the higher number of ALUs. I can't believe that Juniper could be the same size as RV670 on a manufacturing process that is 1.9x denser...
Well, there's a black hole labelled D3D11 that seems to be sucking up die area - it's just a guess...
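The ~1.9x figure in the objection falls out of ideal area scaling between process nodes. A rough sketch, assuming RV670 is on 55nm and Juniper on 40nm (the node sizes are not given in the thread) and that area scales with the square of the feature size, which real designs never quite achieve.

```python
def ideal_area_scale(old_nm, new_nm):
    """Ideal density gain when shrinking from old_nm to new_nm."""
    return (old_nm / new_nm) ** 2


RV670_NODE_NM = 55    # assumption, not stated in the thread
JUNIPER_NODE_NM = 40  # assumption, not stated in the thread
JUNIPER_GUESS_MM2 = 181  # guesstimate from the post

scale = ideal_area_scale(RV670_NODE_NM, JUNIPER_NODE_NM)
print(f"ideal density gain: {scale:.2f}x")
print(f"181mm² at 40nm ~ {JUNIPER_GUESS_MM2 * scale:.0f}mm² worth of 55nm logic")
```

So a 181mm² Juniper would have room for roughly 340mm² of 55nm-equivalent logic under ideal scaling, which is the gap the "D3D11 black hole" guess is trying to fill.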
> For what it's worth I'm wary of these big TUs, just because they have such a large hit on 8-bit performance. But, ahem, ATI tried once before, the question is, when does the tipping point come that makes them worthwhile?
Maybe never, because at this point it would make more sense to just use shader ALUs for filtering?
> I think pretty much any single-cycle FP16 implementation would have to provide double throughput for INT8, right? Otherwise that's a lot of wasted space.
That's why I've been querying how expensive it is to meet the filtering precision specification of D3D11 for 8-bit textures. Single-cycle fp16 TUs are 70% more expensive in R600, apparently. Maybe less (if there is some overhead associated with process/library from back then).
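To put the 70% area figure and the double-rate INT8 question side by side, here is a minimal per-area throughput sketch. The rates are relative and partly assumed: the baseline TU is taken to filter fp16 at half its INT8 rate, and the double-rate INT8 variant is assumed to cost no extra area beyond the 70%, neither of which is confirmed in the thread.

```python
from dataclasses import dataclass


@dataclass
class TextureUnit:
    name: str
    area: float       # relative to a current TU (1.0)
    int8_rate: float  # filtered 8-bit texels per clock, relative
    fp16_rate: float  # filtered fp16 texels per clock, relative

    def int8_per_area(self) -> float:
        return self.int8_rate / self.area

    def fp16_per_area(self) -> float:
        return self.fp16_rate / self.area


CANDIDATES = [
    # Baseline: fp16 at half rate is an assumption.
    TextureUnit("current TU", 1.0, 1.0, 0.5),
    # 70% bigger TU per the R600 figure quoted in the thread.
    TextureUnit("single-cycle fp16", 1.7, 1.0, 1.0),
    # Hypothetical variant that also doubles INT8 throughput.
    TextureUnit("single-cycle fp16 + 2x INT8", 1.7, 2.0, 1.0),
]

for tu in CANDIDATES:
    print(f"{tu.name:28s} INT8/area {tu.int8_per_area():.2f}  "
          f"fp16/area {tu.fp16_per_area():.2f}")
```

Under these assumptions the big TU loses about 41% of its 8-bit throughput per unit area unless INT8 runs at double rate, in which case it comes out ahead on both formats.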
> It is not just the bigger tex units for the 16-bit compute lanes; the texture cache and load bandwidth for sampling must also be accounted for in the transistor budget.
The bandwidth is already there. 128 bits of unfiltered data are as fast as 32 bits. L1 is very small, so doubling it is hardly a world of pain - I think RV770 doubled L1 per cluster in comparison with RV670, so for performance parity against the per-unit capability of RV670 the cache is already there.
> Anyway, from that perspective, should we expect some advancement in AF quality this time?
I guess so, but I don't know what D3D11's filtering specifications cover.
> @jawed
> why do you care about 8bit textures in this day and age ?
Pretty much all of what you see on surfaces in games originates as 8-bit textures.
> Two threads are issued on the ALUs, with thread A issued as a single instruction over four cycles AAAA, then thread B takes its turn. So the SIMD looks like it's executing two threads at the same time.
> Jawed
Thanks, but no, that's not what I meant. I was merely referring to the "rumor" that started with Olick's presentation on id Tech 6 and wormed its way through doubled ROP-count on all DX11 chips compared to their DX10 predecessors (which may very well be more than just a rumor).
> I'd like to know what the major cause of the R600's TMU performance is (compared to today's TMUs) - whether the native FP16 support or the point samplers. Still thinking about the low (60%) performance difference between HD2900XT and HD4890 in current games w/o FSAA.
Be warned, though CB is using the same driver this time, they'd have to revert to medium details quite often in order to let the X800 and 6800 compete at all.
> That's why I've been querying how expensive it is to meet the filtering precision specification of D3D11 for 8-bit textures. Single-cycle fp16 TUs are 70% more expensive in R600, apparently. Maybe less (if there is some overhead associated with process/library from back then).
> Jawed
Would the TA have to be beefed up significantly / a bit in order to support the required 16k textures?