AMD: R9xx Speculation

Another thing just crossed my mind. We know AMD used additional vias to counter the signaling woes of TSMC's 40nm process, which "exploded" the die size a little beyond the projected numbers. So I wonder whether there has been any progress on this issue, and what the net benefit would be for NI, being on the same node.
 
Actually, when one thinks about it, it should work pretty well for graphics workloads, where the wavefront size doesn't matter that much (compared to some GPGPU algorithms, where you are basically tied to certain work group sizes because of LDS and the like). All that changes is that the tile size of a wavefront (at least the natural one; the rasterizers can reorder it anyway with a small penalty) would grow from 4x4 quads (8x8 = 64 pixels) to 5x4 quads (10x8 = 80 pixels).
Edit: Just as I read my own post: one of the current rasterizers, at 16 pixels/clock, would fill a wavefront of 80 pixels over 5 cycles instead of the current 4. Should be no problem.
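As a quick sanity check on that edit (raster rate and wavefront sizes as above):

```python
# Back-of-the-envelope: cycles for a 16 pixel/clock rasterizer to fill one wavefront.
from math import ceil

def cycles_to_fill(wavefront_pixels: int, raster_rate: int = 16) -> int:
    """Cycles needed for the rasterizer to produce enough quads for one wavefront."""
    return ceil(wavefront_pixels / raster_rate)

print(cycles_to_fill(64))   # current 4x4-quad wavefront: 4 cycles
print(cycles_to_fill(80))   # hypothetical 5x4-quad wavefront: 5 cycles
```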

It just seems to me that even if things work well enough for graphics, this isn't really the right approach. Granted, AMD doesn't emphasize GPGPU nearly as much as Nvidia, but still... Also, even for graphics you can observe that Evergreen does quite a bit better against Fermi the higher the resolution - or, put the other way around, it does poorly at low resolutions. While there are certainly other factors than wavefront size at work here, increasing it won't make that any better.

I had hoped AMD would overhaul the texture filtering a bit, as it currently appears to be L1 texture cache bandwidth starved for trilinear and AF (that the bilinear rate is far above Fermi's isn't much of an advantage in quite a few scenarios). Without a redesign of the TMUs and the texture cache, slightly lowering the ALU:TEX ratio might have proven beneficial, and maybe enough to take some more samples without impacting performance too much.
Yes, and instead, this way you'd get an actual (effective, not just theoretical) increase in ALU:TEX.
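To put numbers on those ratios: the per-SIMD figures below are the usual Cypress ones (16 VLIW units and one quad TMU per SIMD); the TMU-sharing factor is a hypothetical knob, not anything confirmed:

```python
# Sketch of per-clock ALU:TEX for a Cypress-like SIMD, and how sharing one quad
# TMU between several SIMDs would raise the effective ratio.
def alu_tex_ratio(vliw_units_per_simd: int = 16, tmus_per_quad: int = 4,
                  simds_per_quad_tmu: int = 1) -> float:
    """VLIW bundles issued per bilinear result, per clock."""
    return (vliw_units_per_simd * simds_per_quad_tmu) / tmus_per_quad

print(alu_tex_ratio())                      # Cypress today: 4.0
print(alu_tex_ratio(simds_per_quad_tmu=2))  # two SIMDs per quad TMU: 8.0
```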
 
I'll be shocked if AMD changes the wavefront size to 80. It would make programming the thing so much harder, for little to no gain in hardware efficiency, and at a steep hardware and software design cost.

In my book, any rumor that says wavefront size is 80 is just shoddily and hastily constructed. It's a non-starter.
 
Can't add SIMDs without adding TMUs.

What if the TMUs went on a diet, something that the patent documents I linked earlier vaguely hint at...

What about sharing a quad TMU between 2 (4, 8...) SIMDs :?: We are at the point where Cypress' 80 TMUs are more than enough ;).
 
Or they go back to the R600 design?
Those are single-cycle fp16 filtering TMUs.

http://v3.espacenet.com/publication...=A1&FT=D&date=20091126&DB=EPODOC&locale=en_gb

[0012] While the more sophisticated bilinear, trilinear, and anisotropic filtering techniques produce better results they require higher amounts of computation. In addition, where the dynamic range of sampled texels is large, the required computations typically are done using floating point arithmetic solutions in order to preserve data quality. Floating point calculations require the use of floating point arithmetic logic units within a GPU which increases the associated cost and area required in a circuit to implement.

[0013] When an interpolation, such as a bilinear interpolation, is generated using normalized fixed point texel data where the range of data (i.e., the differences between the texel magnitudes) is large, there can be a loss of precision due to the limited number of bits of calculation provided in a single precision bilinear filtering unit. Such a limitation could be overcome by the use of extended precision filtering using multiple single precision bilinear filtering units which would, by operating in parallel on texel data, generate extended precision floating point texel data.

[0014] What are needed, therefore, are systems and/or methods to alleviate the aforementioned deficiencies. Particularly, what is needed is a system and method to dynamically determine when an interpolator should generate extended precision results and a bilinear filter system that could generate such results when desired.

[...leading to...]

[0040] FIG. 4 is a more detailed illustration of bilinear interpolator 310 with the ability to be dynamically reconfigured for extended precision according to an embodiment of the present invention. Bilinear interpolator 310 comprises input control, shifter, and multiplexer 410, dual bilinear interpolators, A 412 and B 414, and output control, summation, and multiplexer 420. Bilinear interpolator 310 may operate in either a "standard" precision mode where, as an example, two blocks of normalized mantissas 242 are presented to bilinear interpolator 310. Input control, shifter, and multiplexer 410 will pass one block to bilinear interpolator A and the other block to bilinear interpolator B.

[0041] In this "standard" precision mode bilinear interpolator A 412 and bilinear interpolator B 414 act independently. Bilinear interpolator A 412 utilizes the horizontal and vertical weights, wH1, 312-1 and wV1 314-1 whereas bilinear interpolator B 414 utilizes the horizontal and vertical weights, wH2 312-2 and wV2 314-2. Output control, summation, and multiplexer 420 will keep the results of bilinear interpolator A 412 and bilinear interpolator B 414 separate and distinct, outputting the bilinear filter results of bilinear interpolator A 412 through path 421-1 and bilinear interpolator B 414 through path 421-2. In the "standard" precision mode, bilinear interpolator 310 produces two bilinear filtered results per cycle.

[0042] However, when input control, shifter and multiplexer 410 inspects an incoming pair of normalized mantissas 242 where the exponent range exceeds a certain threshold, for example as in the example previously presented when the exponent range of a block of texel data is greater than the difference between the number of bits of filtering precision in the bilinear module and the number of bits in the texel data mantissa, the input control, shifter and multiplexer 410 would allow just a single bilinear interpolator operation to occur whereby both bilinear interpolator A 412 and bilinear interpolator B 414 are used in an "extended" precision mode. In this manner the bilinear filter precision width is doubled to 2M where M is the number of bits of precision in a single bilinear interpolator.

[0043] Therefore, in a double bilinear interpolator embodiment there is no loss of precision where the exponent range is less than twice the filter precision width of a single bilinear interpolator, assuming the widths of the interpolators are equal, less the width of the texel mantissa. This example of a double bilinear interpolator is not meant to limit the implementation of an extended precision bilinear interpolator as other embodiments could be implemented using any number of bilinear interpolators.

[0044] When the input control, shifter, and multiplexer 410 identifies an incoming pair of normalized mantissas 242 where the exponent range exceeds a certain threshold, it will multiplex the most significant bits of the mantissa into a bilinear interpolator, for example into bilinear interpolator B 414, and the least significant bits of the mantissa into the other bilinear interpolator, for example into bilinear interpolator A 412. In an extended precision mode the horizontal weights applied to bilinear interpolator A 412 and bilinear interpolator B 414 must be equivalent as the same weighting factor must be applied to all of the mantissa bits representing a particular texel. In the same manner, the vertical weights must also be equivalent. Therefore, in this dual bilinear interpolator example, horizontal weight wH1, 312-1 is equal to wH2 312-2, and vertical weight wV1, 314-1 is equal to wV2 314-2.

[0045] Once bilinear interpolator A 412 and bilinear interpolator B 414 complete an interpolation cycle, the results are presented to output control, summation and multiplexer 420. In the situation where there has just been an extended precision interpolation performed, output control, summation and multiplexer 420 will sum the results of bilinear interpolator A 412 and bilinear interpolator B 414, shifting the least significant bits left by the width of a single precision interpolator (M), thereby producing a single bilinear filtered result mantissa of double precision which is then output on either path 421-1 or 421-2 as desired.
The text describes a dual interpolator capable of outputting two filtered results, from units A and B. Each unit can accept 4 texels and produce a bilinearly filtered result. If enhanced precision is required, the two units share the workload of producing a single bilinearly filtered result, with the high bits going to B and the low bits going to A.
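To make the ganging concrete, here is a toy model of that split. It only demonstrates the linearity argument that makes the trick work; M, the weight format, and all values are placeholders of mine, not the patent's actual datapath:

```python
# Toy model of the "extended precision" trick: because bilinear filtering is
# linear in the texel values, two M-bit interpolators fed the high and low
# halves of a 2M-bit mantissa (with identical weights, per [0044]) reproduce
# the full-width result exactly.
M = 8          # bits per single-precision interpolator (hypothetical)
QBITS = 8      # fractional bits in the fixed-point weights (hypothetical)
Q = 1 << QBITS

def bilerp(texels, wh, wv):
    """Bilinear blend of 4 texels; result is scaled by Q*Q (no rounding)."""
    t00, t10, t01, t11 = texels
    top = t00 * (Q - wh) + t10 * wh
    bot = t01 * (Q - wh) + t11 * wh
    return top * (Q - wv) + bot * wv

texels = [0x1F3A, 0x00C4, 0xE801, 0x7FFF]    # 16-bit (i.e. 2M) mantissas
wh, wv = 77, 200                             # arbitrary weights

hi = [t >> M for t in texels]                # high M bits -> interpolator B
lo = [t & ((1 << M) - 1) for t in texels]    # low M bits  -> interpolator A

ganged = (bilerp(hi, wh, wv) << M) + bilerp(lo, wh, wv)
assert ganged == bilerp(texels, wh, wv)      # exact, by linearity
```

Because the blend is linear in the texel values, splitting the mantissa and recombining afterwards loses nothing, which is presumably why identical weights on both halves are sufficient.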

In theory a standard quad-TMU could be constructed from a pair of these blocks.

I interpret this as the basis of two quad TMUs working together sharing a single L1 cache, based upon:

[0036] FIG. 3 is an illustration of multiple shader pipe texture filters with associated dual-ported level one caches and multiple level two cache systems according to an embodiment of the present invention. System 300 comprises one or more dual-ported level one cache systems, of which each supports up to two shader pipe texture filters, and a level two cache system.
From:

http://v3.espacenet.com/publication...=A1&FT=D&date=20100610&DB=EPODOC&locale=en_gb

which to me implies a pair of SIMDs sharing an L1 with two quad-TMUs that can either work independently, or can be ganged when high-precision filtering is required.
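If that reading is right, the throughput trade-off would look something like this (a sketch with my own numbers; the patent only says a ganged pair produces one result where an independent pair produces two):

```python
# Throughput of a pair of quad TMUs behind one dual-ported L1 (illustrative):
# independent at standard precision, halved when ganged for extended precision.
def bilinear_results_per_clock(quad_tmus: int = 2, ganged: bool = False) -> int:
    per_quad_tmu = 4                     # one result per pipe of a quad TMU
    total = quad_tmus * per_quad_tmu
    return total // 2 if ganged else total

print(bilinear_results_per_clock(ganged=False))  # 8 results/clock
print(bilinear_results_per_clock(ganged=True))   # 4 results/clock
```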
 
Another thing just crossed my mind. We know AMD used additional vias to counter the signaling woes of TSMC's 40nm process, which "exploded" the die size a little beyond the projected numbers. So I wonder whether there has been any progress on this issue, and what the net benefit would be for NI, being on the same node.

I did some rough calculations based on standard cell sizes. It's about 10% larger than it should be based on transistor count. TSMC said a while back that they fixed the electromigration issues in the vias, but I would still bet on ATi using yield enhancement techniques at the cost of die space. Also, vias don't require much space and can be stacked.
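For scale, a rough version of that calculation (the Cypress figures are the commonly cited ones, ~2.15B transistors on ~334 mm²; the 10% overhead is the estimate above, not a measured number):

```python
# Rough numbers behind the ~10% estimate.
transistors_m = 2150.0           # millions (commonly cited Cypress figure)
die_area_mm2 = 334.0             # commonly cited Cypress die size
overhead = 0.10                  # claimed via/yield-enhancement overhead

expected_area = die_area_mm2 / (1 + overhead)
print(f"implied density: {transistors_m / die_area_mm2:.2f} Mtransistors/mm^2")
print(f"expected area:   {expected_area:.0f} mm^2 "
      f"(~{die_area_mm2 - expected_area:.0f} mm^2 spent on extra vias etc.)")
```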
 
What about sharing a quad TMU between 2 (4, 8...) SIMDs :?: We are at the point where Cypress' 80 TMUs are more than enough ;).
Yes, I suggested this a while back, based on the patent documents I've just linked, yet again.

I suspect 80 TMUs in Cypress is excessive, so making them smaller would at least be a start.

Though to be honest I don't know if the descriptions I've linked relate to a lower-cost TMU than seen in Evergreen.

Alternatively it may be that the 80 TMUs in Cypress are excessive because the L1/L2 system can't cope, so the revisions I've linked might be a way to unlock performance.

Another interesting aspect of the L1s described there is that they can talk to each other, so texels don't have to come from L2 if another L1 already contains them. But then we get into the question of whether Cypress is struggling with L2-to-L1 bus bandwidth or with L2 transaction rate. Honestly, I suspect the former, in which case having the L1s talk to each other would appear to require even more bandwidth.

That could be achieved with an L1-only peer bus. Or there could be a ring bus for the whole L2/L1 system :p
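A sketch of the lookup order such peer-to-peer L1s would imply - the structure and naming are entirely mine; the document only says the L1s can exchange texels:

```python
# Texel lookup order with peer-to-peer L1s: own L1 first, then sibling L1s
# over a peer bus, then L2. Caches are modeled as plain dicts for brevity.
def fetch_texel(addr, own_l1: dict, peer_l1s: list, l2: dict):
    if addr in own_l1:                    # local hit
        return own_l1[addr]
    for peer in peer_l1s:                 # peer hit: saves an L2 transaction,
        if addr in peer:                  # but costs peer-bus bandwidth
            own_l1[addr] = peer[addr]
            return own_l1[addr]
    own_l1[addr] = l2[addr]               # fall back to L2
    return own_l1[addr]
```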

:LOL:
 
Those are single-cycle fp16 filtering TMUs.
[...]
which to me implies a pair of SIMDs sharing an L1 with two quad-TMUs that can either work independently, or can be ganged when high-precision filtering is required.
What I meant is going back to R600's vertical SIMDs (and 2 clocks of latency):
Two blocks, with 1 "TMU-SIMD" (eight TMU quads) and 5 shader SIMDs (eight shader quads each) per block.
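If I'm reading those numbers right (a quad being 4 lanes and each lane a 5-wide VLIW unit are my assumptions, not stated above), the totals come out to Cypress' SP count with fewer TMUs:

```python
# Totals for the hypothetical R600-style layout above (my reading of the numbers).
blocks = 2
shader_simds_per_block, shader_quads_per_simd = 5, 8
tmu_quads_per_block = 8                  # the "TMU-SIMD"
lanes_per_quad, vliw_width = 4, 5

sps = (blocks * shader_simds_per_block * shader_quads_per_simd
       * lanes_per_quad * vliw_width)
tmus = blocks * tmu_quads_per_block * lanes_per_quad
print(sps, tmus)                         # 1600 SPs, 64 TMUs
```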
But this won't happen. So all those Barts presentations are partially true: the HD5000 specs are real.
 
You are late. :rolleyes:


BTW, this sounds like the other PDF should be true.
But there is another problem with SIMD-size.

960 SPs in 12 SIMDs pretty much doesn't fit with 4D VLIW. :p
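The arithmetic behind that objection, spelled out (16 lanes per SIMD being the usual AMD layout):

```python
# Why 960 SPs across 12 SIMDs sits badly with 4-wide VLIW:
sps, simds = 960, 12
per_simd = sps // simds          # 80 SPs per SIMD
print(per_simd, per_simd / 16)   # 80 SPs / 16 lanes = 5 -> VLIW5, not VLIW4
```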

Soooo...?
 
AMD marketing guys: 1
Napoleon: 0

I wonder whether he'll have to change his predictions once more - where there is a "second edition of the final version of the Roadmap", there might very well be a third version, even more final than the second-time-final one, right? :LOL:

Napoleon is really a couple of weeks late with his, ahem, "information".
 