ELSA hints GT206 and GT212
CarstenS
05-Feb-2009, 08:59
Arguably GT21x GPUs were designed before the shock and awe of RV770 hit, so what are the chances NVidia will respond specifically by increasing ALU:TEX?
"Specifically" almost nil, though I am not sure how modular their TPCs are and if they had plans in the drawer for various ALU:TEX-ratios.
But chances are, if they didn't think of increasing that rate by themselves and before Q2/2008, we're not going to see any drastic change.
OTOH, seeing the massive overhead they have with their (scheduling) logic outside of the TPCs, it would be almost foolish for them to be carrying that burden without thinking of it as an investment in future GPUs.
Are you referring to the "Pixel Setup" data point in figures 13 and 14 in the Siggraph paper? That's about 10%.
I'd need to find it again, but I think it varied from 10 to 25% depending on the game. Of course my memory could be very faulty indeed :) Either way, it's not negligible, so it's not a straightforward "We should remove this" even if it did make sense in practice.
And, I'm still looking, high and low, for any sign of a transcendental ALU in Larrabee.
Presumably they've got an instruction or two to make it a bit cheaper, but not a dedicated instruction per-se?
Well, forgetting the register file for a second, all ALU operands have to come through the operand collectors, whether they're from the register file, shared memory, the constant cache, video memory or attribute parameter buffer.
But that's a much lower die size penalty than having to try hiding register bandwidth limitations for real-world scenarios. Also some of that complexity must surely be there anyway for shared memory, as required by CUDA.
Regardless, the operand collector is still bigger simply to deal with the increased bandwidth of a MAD+MI configuration.
You mean the register file bandwidth? (since the others clearly aren't really affected by it).
The way I see it both are legacies of GPU history, accelerated interpolation was a key part of getting texturing to work when most rendering cycles were texturing bottlenecked and fast transcendentals were needed to get vertex shading at decent speeds (especially given how few vertex pipes there were).
I'm not sure I agree completely; how many interpolation operations nowadays are directly for the registers that'll be used for texture fetches? It has already moved past that goal and is still useful.
As you say, the real question is how useful it is - or more precisely, how much does its usefulness *vary*? Because if it's always not-very-useful, you can just make it less powerful but cheaper. If sometimes it's very useful and sometimes completely useless and wasted, then it starts making some sense to unify that functionality into another block. However being able to extract a little bit of the MUL clearly already helps that problem a bit (but only if adding the MUL on its own makes sense; otherwise it's less of a win).
I'd need to find it again, but I think it varied from 10 to 25% depending on the game. Of course my memory could be very faulty indeed :) Either way, it's not negligible, so it's not a straightforward "We should remove this" even if it did make sense in practice.
In the paper the performance data is collected by simulating Larrabee cores - so the ~10% cost of Pixel Setup on those graphs is using whatever ALUs they are planning for Larrabee.
Maybe there is a transcendental ALU and maybe there's an attribute interpolator too, but there's no mention of either of these things so far as I can tell.
Presumably they've got an instruction or two to make it a bit cheaper, but not a dedicated instruction per-se?
I can imagine a look-up table instruction to seed the transcendental macros with the data they need to start:
http://developer.intel.com/technology/itj/q41999/pdf/transendental.pdf
where double-precision vec2 MAD ALUs can calculate DP transcendental functions. A vec2-DP ALU can do any transcendental in 52-72 cycles, which is a throughput of 1/104 to 1/144. So in Larrabee a SIMD-16 would have a throughput of 1/26 to 1/36, again for double-precision. I dunno how much faster single-precision would be. Twice as fast? : 1/13 to 1/18.
GT200 currently has 1/4 to 1/8 throughput (maybe there are some things that are 1/16). Halving those again (as you propose) produces 1/8 to 1/16 - very much like the rate for Intel's macros on IA-64, translated onto Larrabee.
Though of course NVidia's pair of SIMDs (MAD-8 and MI-4 - MI-8 is capable of 2 transcendentals or 8 interpolations per clock, so MI-4 is half) is more like half the width of Intel's SIMD-16.
But for overall throughput, wouldn't you rather have one SIMD-16 than two MAD-8s and two MI-4s? Sure, transcendentals are faster in the latter, but the SIMD-16 is smaller, once you take into account the control overhead of the operand collectors and the 4 SIMDs in NVidia's architecture, not the one in Larrabee.
But that's a much lower die size penalty than having to try hiding register bandwidth limitations for real-world scenarios. Also some of that complexity must surely be there anyway for shared memory, as required by CUDA.
Yes to both those points. But there's still dedicated scoreboarding for MI operands - they still have to be scheduled and tracked.
You mean the register file bandwidth? (since the others clearly aren't really affected by it).
Operand and resultant bandwidth has to be managed regardless of the ALU lanes involved.
I'm not sure I agree completely; how many interpolation operations nowadays are directly for the registers that'll be used for texture fetches? It has already moved past that goal and is still useful.
Yes, but D3D10 only allows 16 attributes per vertex, while D3D10.1 allows for 32 attributes. So, per fragment, the worst case interpolation is currently 32 32-bit values.
A single G92 cluster can interpolate 16 attributes in 1 cycle (2 multiprocessors, each doing 8 per clock). The desired rate is set by the rasteriser/fillrate, not by ALU throughput. In other words, G92's rasterisation rate of 16 fragments per core clock is supported by an interpolation rate of 128 attributes per ALU clock, so in 9800GTX at 675/1688, that's a 1:20 ratio, rasterisation:interpolation.
GT200 rasterises at 32 per core clock and interpolates at 240 per ALU clock, so GTX280 602/1296 has a 1:16 ratio.
A speculative, small, GT300 with 384 interpolations-per-clock (384 MAD lanes) with say 4:1 ALU:TEX, and with say 600/1300 clocks would be 1:26. Why? :shock:
ATI gets by with 32 interpolators, or 1:2.
So, yes, my conservative GT300 has way way too much interpolation rate, so halving the MIs or dropping them entirely really sounds like a good idea.
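The rasterisation:interpolation ratios quoted above can be re-derived in a few lines. This is only a sketch of the arithmetic in the post: the function and dictionary names are made up for illustration, and the 32 fragments per core clock used for the speculative GT300 is an assumption (it is the value that reproduces the quoted 1:26, the post does not state it).

```python
def interp_per_raster(frags_per_core_clk, core_mhz, attrs_per_alu_clk, alu_mhz):
    """Interpolated attributes available per rasterised fragment."""
    raster_rate = frags_per_core_clk * core_mhz   # Mfragments/s
    interp_rate = attrs_per_alu_clk * alu_mhz     # Mattributes/s
    return interp_rate / raster_rate

# (fragments per core clock, core MHz, attributes per ALU clock, ALU MHz)
chips = {
    "9800GTX (G92)": (16, 675, 128, 1688),
    "GTX280 (GT200)": (32, 602, 240, 1296),
    # ASSUMPTION: 32 fragments per core clock, not stated in the post
    "speculative GT300": (32, 600, 384, 1300),
}

ratios = {name: interp_per_raster(*cfg) for name, cfg in chips.items()}
for name, ratio in ratios.items():
    print(f"{name}: 1:{ratio:.0f}")
```

Against ATI's quoted 1:2, all three figures come out roughly an order of magnitude higher, which is the point being made.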
If deleting MI allows GT300 to go from 4 to 5 multiprocessors per cluster, that would be a bonus eh?
As you say, the real question is how useful it is - or more precisely, how much does its usefulness *vary*? Because if it's always not-very-useful, you can just make it less powerful but cheaper. If sometimes it's very useful and sometimes completely useless and wasted, then it starts making some sense to unify that functionality into another block. However being able to extract a little bit of the MUL clearly already helps that problem a bit (but only if adding the MUL on its own makes sense; otherwise it's less of a win).
Clearly the MUL was a complete waste of time until GT200 arrived - and the compilation and instruction-dependency scoreboarding issues it generates are hardly a decent pay-back for the shitty utility.
The reality is that interpolation is bloody cheap the way NVidia's done it, by tacking it on the side of transcendental. Going forwards, though, it looks like a case of the tail wagging the dog
http://www.cupidity.f9.co.uk/ATTF1010.jpg
So the intriguing question is, does interpolation explode in your face if the MI is dropped? Seeing the stratospheric interpolation rates of even G92, I don't see how NVidia can justify keeping it in future.
(I'm actually wondering if I've radically misunderstood interpolation rates in NVidia's architecture because the numbers are so silly.)
Jawed
trinibwoy
05-Feb-2009, 14:20
Which of those is an ALU-specific test? I know 3DMark06 Perlin Noise is ALU-bound (just about).
On the Nvidia hardware I've had, the two Perlin Noise tests are pretty much 100% shader bound and the POM tests significantly so despite the texturing happening there. And what else do you propose is happening with the GS tests that would limit performance besides ALU throughput?
HD4870 can't get any slower than the serial MAD test I linked, i.e. 68% performance per mm2 or 37% of the absolute performance of GTX285.
As to the "nature" of more general code, the issue is really about the memory system. Some general code is so compute bound it barely uses any kind of memory resources, either video RAM or on-die shared RAM - just registers, basically. That code will be quite happy in naive scalar form.
Oh your percentages were normalized per mm. Misread them. The memory issue is a fair point and it was raised in that paper you linked above. But isn't that an implementation detail? I'm sure there are cache structures that work well for scalar issue.
Eh? Until AMD re-writes the core to use LDS/GDS, F@H tells us precisely nothing.
Oh my bad, I thought they were doing so already....and why does AMD have to rewrite the core? What's stopping the stanford guys from doing so?
DegustatoR
05-Feb-2009, 14:55
Nvidia's GT225, GT220 and GT210 are die shrinks (http://www.fudzilla.com/index.php?option=com_content&task=view&id=11868&Itemid=34)
I've officially lost it. Should we assume that Fuad's talking about retail names and all these GTs will be based on GT218? Will GT218 be enough to replace 9600GSO?..
I have a feeling that news is bullshit.
a.tard.living.in.his.own
05-Feb-2009, 16:13
so gt300 just a die shrink to 40nm from gt200?
why not? who cares about graphiccards anyway? i myself got a xbox and just LOVE the fact that i'll be TOP NOTCH performance and hardware wise until the earliest of 2012!
a proud GAMER!
On the Nvidia hardware I've had, the two Perlin Noise tests are pretty much 100% shader bound and the POM tests significantly so despite the texturing happening there. And what else do you propose is happening with the GS tests that would limit performance besides ALU throughput?
Shader bound on NVidia doesn't mean shader bound on ATI.
3DMark06 Perlin Noise, while ALU bound on ATI, won't be ALU bound if ATI increases the ALU:TEX of the hardware even a smidgen, say to 5:1.
The GS tests seem to do a lot of "texturing", i.e. reading data from memory, as far as I can tell.
Oh your percentages were normalized per mm. Misread them. The memory issue is a fair point and it was raised in that paper you linked above. But isn't that an implementation detail? I'm sure there are cache structures that work well for scalar issue.
Yep, you'll find them in CPUs.
Oh my bad, I thought they were doing so already....and why does AMD have to rewrite the core? What's stopping the stanford guys from doing so?
Mike Houston appears to be the one "writing the core". Originally he was a Stanford guy writing Brook, but now he's an AMD guy writing Brook+. The core is, as far as I can tell, a platform for the Stanford guys to codify the molecule simulation they want to run - sort of a library I think.
Maybe Stanford has guys writing Brook+, too, dunno - I don't understand the organisation of their programming teams. But I think Mike's responsibilities extend into the functionality of Brook+ itself, i.e. how it's compiled to run upon IL. I'm not saying he's a one man band or anything, merely that between a F@H core and AMD hardware there's several layers, and LDS/GDS functionality is lost in there somewhere...
The first step is to get IL using LDS/GDS (maybe that works already?) and then extending Brook+ so that it correctly targets LDS/GDS within IL.
Why not drop Brook+ and just write the core in IL? Dunno... What about OpenCL?...
Jawed
pjbliverpool
05-Feb-2009, 22:55
so gt300 just a die shrink to 40nm from gt200?
No
why not? who cares about graphiccards anyway?
Everyone else in this thread
i myself got a xbox and just LOVE the fact that i'll be TOP NOTCH performance and hardware wise until the earliest of 2012!
Huh? :???:
Mintmaster
05-Feb-2009, 23:08
The reality is that interpolation is bloody cheap the way NVidia's done it, by tacking it on the side of transcendental.
Cheap in terms of arithmetic logic, not data flow. Compared to ATI's top of the pipe attribute interpolation, storage and data flow costs are huge.
(I'm actually wondering if I've radically misunderstood interpolation rates in NVidia's architecture because the numbers are so silly.)
I think you have, because they're not high enough to support the TMUs in G92 for straight bilinear, IIRC.
Cheap in terms of arithmetic logic, not data flow. Compared to ATI's top of the pipe attribute interpolation, storage and data flow costs are huge.
I was tempted to waffle on this subject earlier, but didn't.
Assuming that ATI stores all the fully interpolated attributes per fragment (figure 7):
http://ati.amd.com/products/radeonx800/RadeonX800ArchitectureWhitePaper.pdf
Once the visible pixels have been determined, the next step is to assign them initial values for basic parameters such as color, depth, transparency (alpha), and texture co-ordinates. The initial values are determined by looking at the values assigned to each vertex of the current triangle, and interpolating them to the location of the current pixel. The interpolated values are stored in registers that are used by the pixel shader units.
then I guess the worst case scenario (32 attributes per fragment) must cause a drastic reduction in fragments in flight, along the normal lines of register file trade-off for registers-per-fragment versus fragments-in-flight.
It seems it doesn't take many attributes per vertex to really hit the register file allocation per fragment, e.g. I can imagine 8x fp32 (2x vec4) is reasonably common. Naturally some of these can be freed-up if they contain texture coordinates, once the texture coordinates have been consumed.
I suppose register file storage of pre-computed interpolated attributes would also explain the strong bias in ATI towards a large register file.
In NVidia it seems to me that the extra data path to fetch interpolant data is a real issue, as you mention. i.e. there's an interpolant buffer which the operand collector must fetch from, and then schedule, for issue to MI.
That's a fair bit of bandwidth. I think the fetch bandwidth is 5x 32-bit (A, B, C, x, y are all 32-bit) per quad, and it generates 4x fp32 interpolated attributes. Multiply by 2 for the entire SIMD-8 MI. Or call it 1.25 scalars fetch bandwidth per attribute per clock and 1 scalar per attribute per clock resultant bandwidth.
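To make the 1.25 figure concrete, here is a minimal sketch of per-quad plane-equation interpolation, U(x, y) = A*x + B*y + C. It assumes that (x, y) is the position of the quad's top-left pixel and that the other three pixels are reached by fixed offsets; the function name and that layout are illustrative, not taken from any NVidia documentation.

```python
def interpolate_quad(A, B, C, x, y):
    """Evaluate U(x, y) = A*x + B*y + C at the four pixels of a 2x2 quad.

    ASSUMPTION: (x, y) is the quad's top-left pixel; the remaining pixels
    are found by adding fixed (dx, dy) offsets.
    """
    offsets = [(0, 0), (1, 0), (0, 1), (1, 1)]
    return [A * (x + dx) + B * (y + dy) + C for dx, dy in offsets]

fetched = 5                                   # A, B, C, x, y: five 32-bit operands
results = interpolate_quad(0.5, 0.25, 1.0, 10.0, 20.0)
fetch_per_attribute = fetched / len(results)  # scalars fetched per scalar produced

print(results, fetch_per_attribute)
```

Five operands in and four interpolated values out gives the 1.25 scalars fetched per attribute mentioned above.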
The data paths in NVidia (register file, constants, shared memory, interpolants, RAM) sure look messy when compared with the clean L1/register file model in Larrabee.
Yet another target for a radical clean-up in GT300?
I think you have, because they're not high enough to support the TMUs in G92 for straight bilinear, IIRC.
:shock: Higher is even sillier. You sure? (I once thought the same, too.)
The ALUs are >2x TMU clock rate, and each multiprocessor can generate a 2D texture coordinate for 4 fragments in one ALU clock (8 attributes per clock), so G92 has > twice the interpolation rate required for bilinear. What am I missing?
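The "more than twice" claim can be sanity-checked with the 9800GTX clocks quoted earlier in the thread. This sketch takes 16 multiprocessors (128 attributes per ALU clock at 8 per multiprocessor) and, as an assumption not stated in the thread, 64 bilinear texture fetches per core clock.

```python
multiprocessors = 128 // 8        # 128 attributes per ALU clock, 8 per multiprocessor
coords_per_mp_per_clk = 8 // 2    # a 2D texture coordinate needs 2 attributes
alu_mhz, core_mhz = 1688, 675     # 9800GTX clocks quoted earlier in the thread
tmus = 64                         # ASSUMPTION: bilinear fetches per core clock

coord_rate = multiprocessors * coords_per_mp_per_clk * alu_mhz  # M coordinates/s
fetch_rate = tmus * core_mhz                                    # M bilinear fetches/s
headroom = coord_rate / fetch_rate

print(f"interpolation headroom over bilinear: {headroom:.2f}x")
```

Under those assumptions the interpolators produce about 2.5 texture coordinates for every bilinear fetch the TMUs can issue, provided the MI really does run at full speed.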
Jawed
So, per fragment, the worst case interpolation is currently 32 32-bit values.
That should be 32 vector interpolants, or 128 scalar 32-bit values.
Thanks for the correction, didn't read:
http://download.microsoft.com/download/f/2/d/f2d5ee2c-b7ba-4cd0-9686-b6508b5479a1/Direct3D10_web.pdf
closely enough, sigh.
That should be 32 vector interpolants, or 128 scalar 32-bit values.
:shock: - well that puts quite a dent in ATI's register file, allowing only 8 batches in flight and assuming at least one register can be clawed-back on the first instruction executed :razz:
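A rough sketch of where a figure like 8 batches comes from. The 256 vec4 registers per lane is an assumption picked because it reproduces the number in the post; it is not stated anywhere in this thread.

```python
SCALARS_PER_FRAGMENT = 128                          # worst case: 32 vec4 interpolants
vec4_regs_per_fragment = SCALARS_PER_FRAGMENT // 4  # 32 vec4 registers

REGS_PER_LANE = 256   # ASSUMPTION: vec4 registers available to each lane

batches_in_flight = REGS_PER_LANE // vec4_regs_per_fragment
spare_regs = REGS_PER_LANE - batches_in_flight * vec4_regs_per_fragment

print(batches_in_flight, spare_regs)
```

With zero registers spare there is no room for a temporary, hence the caveat about clawing back at least one register on the first instruction.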
Jawed
The data paths in NVidia (register file, constants, shared memory, interpolants, RAM) sure look messy when compared with the clean L1/register file model in Larrabee.
Hmm, that list may well be shorter: interpolants could be held in shared memory.
Jawed
trinibwoy
06-Feb-2009, 17:48
The data paths in NVidia (register file, constants, shared memory, interpolants, RAM) sure look messy when compared with the clean L1/register file model in Larrabee.
Would the current texture cache be a good candidate for conversion into a generalized per-cluster L1? It can then serve as a texture/constant/global memory cache. And is there documentation that points to dedicated interpolant storage on Nvidia's hardware - why wouldn't they just reside in the register file?
Larrabee's L1 looks clean if you're listing it out like that but the implementation and accesses to that L1 are certainly much messier and more complicated than Nvidia's multi-banked shared memory approach.
Would the current texture cache be a good candidate for conversion into a generalized per-cluster L1? It can then serve as a texture/constant/global memory cache.
Don't think so, the L1 is currently very precisely targeted at serving texels, in fairly complicated access patterns, to the texel fetch/filtering stages.
The shared memory (PDC) might be.
For what it's worth Larrabee's dedicated texture units have their own cache.
And is there documentation that points to dedicated interpolant storage on Nvidia's hardware - why wouldn't they just reside in the register file?
OPERAND COLLECTOR ARCHITECTURE (http://v3.espacenet.com/publicationDetails/biblio?KC=A1&date=20080508&NR=2008109611A1&DB=EPODOC&locale=en_GB&CC=US&FT=D)
[0066] FIG. 7 is a block diagram of another exemplary embodiment of the Register File Unit and Execution Unit(s) of FIG. 2, in accordance with one or more aspects of the present invention. Register File Unit 750 and Execution Unit(s) 770 perform the functions of Register File 250 and Execution Unit(s) 270, respectively. A Collector Unit 730 receives operand addresses and corresponding program instructions for execution by an Execution Unit A 765 from Register Address Unit 240. In some embodiments of the present invention, Execution Unit A 765 is configured to perform interpolation, reciprocal, square root, multiplication, logarithm, sine, cosine, and power function operations. In particular, Execution Unit A 765 may be configured to execute program instructions with up to two operands. For some of the program instructions two operands are read from one or more banks 320. For other program instructions two operands are read from another storage unit (not shown) that stores per primitive values, such as triangle attributes, plane equation coefficients, and the like. Additionally, two operands can be read from a combination of the per primitive value storage unit and Banks 320. Access to the per primitive value storage unit is arbitrated since more than one Collector Unit 735 and/or 730 may request access in single cycle.
Larrabee's L1 looks clean if you're listing it out like that but the implementation and accesses to that L1 are certainly much messier and more complicated than Nvidia's multi-banked shared memory approach.
Why?
Do you mean the software threading model and cache misses? I certainly have some qualms there, but current GPUs have large functional blocks that are idling very often and I think that that wastage is far worse than the instantaneous hardware-thread switching overhead that Larrabee will suffer.
Jawed
Mintmaster
07-Feb-2009, 01:57
In NVidia it seems to me that the extra data path to fetch interpolant data is a real issue, as you mention. i.e. there's an interpolant buffer which the operand collector must fetch from, and then schedule, for issue to MI.
Some comments in this forum actually suggest that G80 onwards has multiple post-transform vertex caches, and if more than one multiprocessor is working on the quads from a polygon, its vertex attributes are copied to each.
It's hard to say which method is best. NVidia would use less space for attributes, but would need a separate storage area for attribute data and another data path into the shader units. ATI gets dual use out of its register space, and I think register-heavy shaders tend not to use the iterators much (in the case of GPGPU, almost none), and super-high interpolator use is pretty rare. From that perspective, the storage cost is almost free if you're already designing a GPU capable of handling a high register load.
ATI's method seems cleaner to me. No need to worry about interpolation in the shader core, as it's already done.
:shock: Higher is even sillier. You sure? (I once thought the same, too.)
The ALUs are >2x TMU clock rate, and each multiprocessor can generate a 2D texture coordinate for 4 fragments in one ALU clock (8 attributes per clock), so G92 has > twice the interpolation rate required for bilinear. What am I missing?
I thought each multiprocessor can generate a 2D texture coordinate for only 1 fragment per ALU clock, because it's quarter speed. I see that the B3D review says that they're full speed, but that doesn't make sense according to the multifunction interpolator design (http://arith.polito.it/foils/11_2.pdf). If it's quarter speed at doing f(X) = C0 + C1*X + C2*X^2, then it'll be quarter speed at doing U(x,y) = A*x + B*y + C.
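The two forms being compared are both short multiply-add chains, which is why one datapath can plausibly serve both. The sketch below approximates 1/x with a table-driven quadratic f(X) = C0 + C1*X + C2*X^2 and puts the planar form next to it. It is not the hardware's design: the 64-segment table and the Taylor coefficients are stand-ins chosen for illustration, where a real unit would use tuned fixed-point coefficients.

```python
SEGMENTS = 64  # hypothetical table size, not taken from the paper

def build_table():
    """One (midpoint, C0, C1, C2) entry per segment of [1, 2)."""
    table = []
    for i in range(SEGMENTS):
        m = 1.0 + (i + 0.5) / SEGMENTS
        # Taylor coefficients of 1/x about the segment midpoint m
        table.append((m, 1.0 / m, -1.0 / m**2, 1.0 / m**3))
    return table

TABLE = build_table()

def reciprocal(x):
    """Approximate 1/x for x in [1, 2) as C0 + C1*X + C2*X^2."""
    m, c0, c1, c2 = TABLE[int((x - 1.0) * SEGMENTS)]
    X = x - m                      # offset from the segment midpoint
    return c0 + c1 * X + c2 * X * X

def planar(A, B, C, x, y):
    """Attribute interpolation: U(x, y) = A*x + B*y + C."""
    return A * x + B * y + C

worst = max(abs(reciprocal(1.0 + k / 1000) - 1.0 / (1.0 + k / 1000))
            for k in range(1000))
print(f"worst-case error of the quadratic approximation: {worst:.2e}")
```

Both evaluations need two multiplies and two adds per result, so the throughput question comes down to how many copies of the multiplier array the unit carries.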
:shock: - well that puts quite a dent in ATI's register file, allowing only 8 batches in flight and assuming at least one register can be clawed-back on the first instruction executed :razz:
Well, that's a worst case scenario. Nobody uses anywhere close to 128 attributes. The other thing to note is that often attributes are only used once in a program, particularly when you have so many, so the register they occupy is freed as the program progresses; likewise, you don't need all registers to be available at the beginning of a program.
Mintmaster
07-Feb-2009, 02:22
why wouldn't they just reside in the register file?
Vertex attribute access doesn't match the behaviour of register access, nor are the units writing the data the same.
When you have batches of 32 processed by 8-SIMD units, no register location needs to be read/written more than once every 4th clock, because each register is only used for one pixel, and there are savings to be had in designing the register file with this constraint in mind. Attribute data, on the other hand, is used for possibly all pixels in a batch. The other difference is that the attribute data is written to each cache from the setup engine, whereas register data has no need to have an external connection except to pass the final pixel info to the ROPs.
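The "once every 4th clock" constraint falls straight out of the batch and SIMD widths. A small sketch, assuming pixels are issued in consecutive groups of 8 (the actual pixel-to-clock ordering is not described in the thread):

```python
BATCH, LANES = 32, 8

def schedule(batch=BATCH, lanes=LANES):
    """Pixels processed on each clock of a single instruction.

    ASSUMPTION: pixels are issued in consecutive groups of `lanes`.
    """
    return [list(range(c * lanes, (c + 1) * lanes)) for c in range(batch // lanes)]

sched = schedule()
clocks_per_instruction = len(sched)

# How many times each pixel's private copy of a register is touched per instruction
touches = {p: sum(p in clock for clock in sched) for p in range(BATCH)}

print(clocks_per_instruction, set(touches.values()))
```

Each per-pixel register location is touched once in every 4 clocks, whereas a per-primitive attribute such as A, B or C would be wanted on all 4 clocks.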
Some comments in this forum actually suggest that G80 onwards has multiple post-transform vertex caches, and if more than one multiprocessor is working on the quads from a polygon, its vertex attributes are copied to each.
The latter seems inevitable. No one seems to have worked out what's happening with the routing/scheduling of geometry.
It's hard to say which method is best. NVidia would use less space for attributes, but would need a separate storage area for attribute data and another data path into the shader units. ATI gets dual use out of its register space, and I think register-heavy shaders tend not to use the iterators much (in the case of GPGPU, almost none), and super-high interpolator use is pretty rare. From that perspective, the storage cost is almost free if you're already designing a GPU capable of handling a high register load.
I suppose it's a question of the cost of the interpolators. Each interpolator is a MUL and an ADD running in parallel:
Vertex data processing with multiple threads of execution (http://v3.espacenet.com/publicationDetails/biblio?KC=A1&date=20060216&NR=2006033757A1&DB=EPODOC&locale=en_V3&CC=US&FT=D)
My first thought is that's a 64-way MIMD configuration for RV770's 32 interpolators (32x2). But I think it should be possible to SIMD-ise that, e.g. 4x32.
ATI's method seems cleaner to me. No need to worry about interpolation in the shader core, as it's already done.
I'm thinking that G80's MI design, where transcendentals and interpolations share an ALU and at least some data paths is a good solution for an intermediate period before they junk MI. It seems to me that utility can only decrease with time, as the per-fragment cost of interpolation falls off with more and more complex pixel shaders (and the increasing cost of non-pixel shaders) and the proportion of transcendental calculations in general computational code seems to be not high enough to warrant building dedicated transcendental ALUs.
I thought each multiprocessor can generate a 2D texture coordinate for only 1 fragment per ALU clock, because it's quarter speed. I see that the B3D review says that they're full speed, but that doesn't make sense according to the multifunction interpolator design (http://arith.polito.it/foils/11_2.pdf). If it's quarter speed at doing f(X) = C0 + C1*X + C2*X^2, then it'll be quarter speed at doing U(x,y) = A*x + B*y + C.
No, it's full speed, as I explained here:
http://forum.beyond3d.com/showthread.php?p=870075&highlight=b3d72#post870075
http://www.cupidity.f9.co.uk/b3d72.jpg
Hmm, this thread's pretty useful :smile:
http://forum.beyond3d.com/showthread.php?t=41358
Well, that's a worst case scenario. Nobody uses anywhere close to 128 attributes. The other thing to note is that often attributes are only used once in a program, particularly when you have so many, so the register they occupy is freed as the program progresses; likewise, you don't need all registers to be available at the beginning of a program.
I suspect that all the registers occupied by attributes have to be pre-allocated - the interpolator doesn't produce them on demand, because it's got a stream of vertices coming in and presumably a very limited buffer holding triangle data (i.e. per-vertex attributes and A, B and C).
Jawed
Mintmaster
07-Feb-2009, 23:07
No, it's full speed, as I explained here:
http://forum.beyond3d.com/showthread.php?p=870075&highlight=b3d72#post870075
I see. It seems that I wasn't paying proper attention. In that case, the MI really isn't saving much space at all - maybe 15% of total SF+INT space. There's a lot of multipliers in those interpolation-only sections.
I have a feeling that GT200 eliminates those side branches, or at the very least GT300 will if they decide to stick with the distributed vertex caches. You're right - as shown the interpolation rate is definitely overkill. Also, is this really where the second MUL happens? It just seems really silly to try and wedge it into there, because the data paths are all wrong.
I suspect that all the registers occupied by attributes have to be pre-allocated - the interpolator doesn't produce them on demand, because it's got a stream of vertices coming in and presumably a very limited buffer holding triangle data (i.e. per-vertex attributes and A, B and C).
Oh, of course. Registers are always preallocated, and this applies to NVidia, too. A fragment takes a known amount of register space from beginning to end, and it's known at compile time. ATI's compiler simply has to deal with some initial values in registers, and this may or may not increase the total register load depending on how many simultaneous attributes and temporary values are needed to complete a shader.
trinibwoy
08-Feb-2009, 00:26
OPERAND COLLECTOR ARCHITECTURE (http://v3.espacenet.com/publicationDetails/biblio?KC=A1&date=20080508&NR=2008109611A1&DB=EPODOC&locale=en_GB&CC=US&FT=D)
Cool, thx. Guess I need to read more carefully.
Do you mean the software threading model and cache misses? I certainly have some qualms there, but current GPUs have large functional blocks that are idling very often and I think that that wastage is far worse than the instantaneous hardware-thread switching overhead that Larrabee will suffer.
Not really. For some reason I thought Larrabee's caches would be multiported in an attempt to mimic the broadcast of shared memory. But it looks like they'll be using multithreading to hide cache latencies as well. So I guess there'll be some work there to pack data into single cache lines as there won't be a way to explicitly set up bank aligned data sets like in cuda.
Vertex attribute access doesn't match the behaviour of register access, nor are the units writing the data the same.
When you have batches of 32 processed by 8-SIMD units, no register location needs to be read/written more than once every 4th clock, because each register is only used for one pixel, and there are savings to be had in designing the register file with this constraint in mind. Attribute data, on the other hand, is used for possibly all pixels in a batch. The other difference is that the attribute data is written to each cache from the setup engine, whereas register data has no need to have an external connection except to pass the final pixel info to the ROPs.
Right, that makes complete sense. I think Jawed is right though. Nvidia will probably clean up their memory structure in the next round and implement some sort of cluster level L1 (distinct from the texture cache) similar to LRB. But how would that change operand fetch? Would they now have to support multiple inflight fetches from cache to the register file and treat those fetches as yet another latency to be hidden?
I see. It seems that I wasn't paying proper attention. In that case, the MI really isn't saving much space at all - maybe 15% of total SF+INT space. There's a lot of multipliers in those interpolation-only sections.
Re-reading the paper:
Logic Block                     Area (full-adders)
==================================================
17b squarer                                     90
CS to radix-4 SD converter                      45
lookup table ROM                              1380
function overhead total                       1515
--------------------------------------------------
2 optimized 17x24 mults                        945
8 5x24 mults                                  2040
3 24b right-shifters                           280
3 24b two’s complementers                      110
4 45b right-shifters                           840
4 CSAtree                                      730
4 45b CPA                                      640
4 normalizers                                  930
planar interpolation total                    6515
==================================================
multifunction total                           8030
==================================================
I think it's fair to say the design sees transcendental functions as a small overhead on the considerably larger interpolators, which makes it much harder to justify dropping transcendental altogether.
I have a feeling that GT200 eliminates those side branches, or at the very least GT300 will if they decide to stick with the distributed vertex caches.
Another alternative is to serialise the four branches of interpolation:
==================================================
17b squarer                                     90
CS to radix-4 SD converter                      45
lookup table ROM                              1380
function overhead total                       1515
--------------------------------------------------
2 optimized 17x24 mults                        945
2 5x24 mults                                   510
3 24b right-shifters                           280
3 24b two’s complementers                      110
1 45b right-shifters                           210
1 CSAtree                                      183
1 45b CPA                                      160
1 normalizer                                   233
planar interpolation total                    2631
==================================================
multifunction total                           4146
==================================================
which ~halves the unit entirely. GT300 with 4:1 or higher ALU:TEX should be happy.
But now transcendental is about 37% of MI area, whereas it was 19% in the layout described originally.
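The percentages follow directly from the two area tables. A quick check of the sums, using only the full-adder counts listed above (the variable names are illustrative):

```python
# Full-adder counts copied from the two tables above
FUNCTION_OVERHEAD = 90 + 45 + 1380   # squarer + SD converter + lookup ROM

parallel_interp = 945 + 2040 + 280 + 110 + 840 + 730 + 640 + 930  # four branches
serial_interp = 945 + 510 + 280 + 110 + 210 + 183 + 160 + 233     # one branch

def transcendental_share(interp_area):
    """Fraction of the multifunction unit spent on transcendental-only logic."""
    return FUNCTION_OVERHEAD / (FUNCTION_OVERHEAD + interp_area)

total_parallel = FUNCTION_OVERHEAD + parallel_interp
total_serial = FUNCTION_OVERHEAD + serial_interp

print(f"original:   {total_parallel} full-adders, "
      f"{transcendental_share(parallel_interp):.0%} transcendental")
print(f"serialised: {total_serial} full-adders, "
      f"{transcendental_share(serial_interp):.0%} transcendental")
print(f"serialised unit is {total_serial / total_parallel:.0%} of the original")
```

The serialised unit comes out at a little over half the original area, and the fixed transcendental overhead grows from about 19% to about 37% of it.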
Only with detailed testing of attribute throughput could we find out the actual throughput of these GPUs - I don't know of any documented tests.
You're right - as shown the interpolation rate is definitely overkill. Also, is this really where the second MUL happens? It just seems really silly to try and wedge it into there, because the data paths are all wrong.
Yeah, somewhere in there - the patent application paragraph I quoted earlier, [0066] :
In some embodiments of the present invention, Execution Unit A 765 is configured to perform interpolation, reciprocal, square root, multiplication, logarithm, sine, cosine, and power function operations.
In the other thread I linked, Bob said:
http://forum.beyond3d.com/showpost.php?p=1008712&postcount=24
I think you should try writing a MUL (or MAD)-only shader that interpolates different attributes at each instruction. The SFU really will do a MUL at the rate of 1 per clock per thread, largely independent of what happens in the MAD pipe.
Implying that our ideas of MUL utility are incorrect, that there's more throughput available there.
Jawed
Cool, thx. Guess I need to read more carefully.
Can't blame you when the source document is patentese. And that paragraph is merely a possibility, after all.
Not really. For some reason I thought Larrabee's caches would be multiported in an attempt to mimic the broadcast of shared memory. But it looks like they'll be using multithreading to hide cache latencies as well.
Yeah, software multi-threading.
So I guess there'll be some work there to pack data into single cache lines as there won't be a way to explicitly set up bank aligned data sets like in CUDA.
Obviously a lot of data will naturally fall into cache lines. They'll also have some swizzle functionality in the ALUs, so that might help. And of course for memory operations there's gather functionality, explicitly designed to make use of coalesced memory accesses and produce efficient cache lines.
Right, that makes complete sense. I think Jawed is right though. Nvidia will probably clean up their memory structure in the next round and implement some sort of cluster level L1 (distinct from the texture cache) similar to LRB. But how would that change operand fetch? Would they now have to support multiple inflight fetches from cache to the register file and treat those fetches as yet another latency to be hidden?
One sticky question I haven't resolved is how NVidia handles virtualisation of the register file.
D3D10 requires that each element can access 4096 vec4 fp32s. The naive interpretation is a gargantuan register file. So in reality some kind of paging mechanism is required.
Related to this is the question of indexed register accesses, e.g. r(r3.x) (the register at the address stored in r3.x), which is a feature of D3D11 (ATI GPUs already do this - it's a recipe for waterfalling :sad: ).
As far as I can tell Larrabee virtualises the entire "D3D register file". The vector unit's registers are completely dynamic. There is absolutely no static allocation of pixels to slots in a register file like we see in GPUs (for the duration of a shader). Simply because the vector register file is small, much like a SSE register file is small (though, ahem, not quite that tiny I guess). So if the vector unit is a SIMD-16 there might be 16x 16-wide registers, where each register is 16x32 bits wide, so a total of 16x 64 byte registers.
The vector unit only needs enough registers to cover pipeline latency for all the operands it can fetch. And since a single operand can be read from L1 per clock, that will further reduce the need for a large register file attached to the vector ALU.
So the D3D register file (all 4096 vec4 fp32s per element!) is actually implemented in memory. It is merely cached through L2/L1 for use by whichever hardware thread is in context. As far as Larrabee is concerned, register names are merely an abstraction of memory addresses.
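To put numbers on the gap between the architectural "D3D register file" and a small vector register file, here is a rough sketch. The 16 registers by 16 lanes configuration is only the "might be" guess from the post, not a confirmed Larrabee spec, and the helper names are invented for illustration:

```python
BYTES_PER_FP32 = 4


def d3d_register_bytes_per_element(registers=4096, components=4):
    """Bytes needed if every element really owned 4096 vec4 fp32 registers."""
    return registers * components * BYTES_PER_FP32


def vector_register_file_bytes(num_registers=16, lanes=16):
    """Bytes in a hypothetical SIMD-16 register file with 16 registers."""
    return num_registers * lanes * BYTES_PER_FP32


per_element = d3d_register_bytes_per_element()
vector_rf = vector_register_file_bytes()
bytes_per_vector_register = vector_rf // 16

print(f"D3D registers per element:  {per_element // 1024} KiB")
print(f"for 16 elements (one vector): {16 * per_element // 1024} KiB")
print(f"hypothetical vector RF:      {vector_rf} bytes "
      f"({bytes_per_vector_register} bytes per register)")
```

Under these assumptions a single 16-wide vector's worth of elements would need 1 MiB of naive register storage against a 1 KiB vector register file, which is why backing the register space with memory and caching it looks plausible.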
Now, it may be that NVidia goes in this direction, too. It's a hell of a big change. It would mean that there's no point in having the parallel data cache as a dedicated block for shared memory between threads.
Jawed
Mintmaster
08-Feb-2009, 22:45
Another alternative is to serialise the four branches of interpolation
That's what I meant. Drop those branches so that you only do one pixel per clock, whether for SF or interpolation. It makes even more sense if NVidia is going to up the ALU:TEX ratio.
Implying that our ideas of MUL utility are incorrect, that there's more throughput available there.
The thing is that MUL needs a second operand, unlike all the other SF functions. It's a new path towards the register file. You also can't use 5x24 multipliers, which are possible for interpolation because they're only dealing with pixel offsets from the quad center.
I dunno. Maybe these documents only display enough functionality to get the patent, and the real thing is quite different.
That's what I meant. Drop those branches so that you only do one pixel per clock, whether for SF or interpolation. It makes even more sense if NVidia is going to up the ALU:TEX ratio.
It'd make great sense if SF was all that frequent either; which it isn't, AFAICT. I could be horribly wrong, but I still prefer the possibility of having one SF/MI unit per multiprocessor instead of two in the GT21x generation. As for GT3xx, who knows given we have zero idea what the shader core looks like... :)
The thing is that MUL needs a second operand, unlike all the other SF functions. It's a new path towards the register file.
Yup, but this is obviously resolved pretty easily if it's half-speed... ;) (well not really since SF is quarter-speed and would be 1/8th speed then, but the relative cost is still lower)
You also can't use 5x24 multipliers, which are possible for interpolation because they're only dealing with pixel offsets from the quad center.
Oh, the MUL isn't anywhere in that diagram for a very simple reason: it's not for the interpolation per-se, it's for the division by 1/w (the original RCP for that is done on demand, BTW, and can therefore be avoided by the driver if no interpolation is ever required - I remember testing that way back in the day...) - so it definitely has to be FP32 anyway, there's very little waste here except for the RF part of the equation.
In case anyone hasn't heard already (well done nvidia, fantastic idea)
GeForce 9 rebranding may take effect next month - Nvidia will re-introduce the GeForce 9800 GTX+ as the GeForce GTS 250 and the GeForce 9800 GT as the GeForce GTS 240
I wonder whether there's any connection between the renaming and new 40nm GPUs - could they be delayed? Or is it that only cheaper cards won't be renamed (at least for retail) and will be superseded by new GPUs in April or March, while G92 won't have a replacement until Q3?
trinibwoy
09-Feb-2009, 12:32
It'd make great sense if SF was all that frequent either; which it isn't, AFAICT. I could be horribly wrong, but I still prefer the possibility of having one SF/MI unit per multiprocessor instead of two in the GT21x generation. As for GT3xx, who knows given we have zero idea what the shader core looks like..
I figure if they drop a MI we can say goodbye to MUL co-issue. Have you guys considered the implications for warp size? Without the need or ability to issue to the MI every other core clock will we see superwarps go away and everything run in 16 warp sizes?
I figure if they drop a MI we can say goodbye to MUL co-issue. Have you guys considered the implications for warp size? Without the need or ability to issue to the MI every other core clock will we see superwarps go away and everything run in 16 warp sizes?
They could do that if it simplified their scheduler in any way, but I suspect they'd still expose the MUL exactly as in GT200: as I said previously, for Graphics, GT200 can only expose half a MUL per clock cycle for RF/scheduler reasons (but it can use it entirely in practice if you use it for a MI half the time and a MUL the other half). If you removed the half MUL that can't be exposed anyway, it might still make sense to expose the other.
As for warp size, I guess theoretically 24 would be possible, but I doubt NV wants to go away from multiples of 16 for backwards compatibility reasons (i.e. CUDA programs that optimized everything with 16 in mind, for example). [EDIT: Actually now that I think about it, it might really not be that easy to implement anyway...]
I figure if they drop a MI we can say goodbye to MUL co-issue.
It isn't co-issue. It's "issue at some arbitrary rate", where the rate is determined by the availability of the ALU, and with a timing offset from the MAD ALU.
"Issue" is a funny concept in NVidia's design, as an instruction is issued only every "x" cycles (4, 8 etc.) but operands/resultants seem to be flowing continuously.
Have you guys considered the implications for warp size? Without the need or ability to issue to the MI every other core clock will we see superwarps go away and everything run in 16 warp sizes?
Does "superwarp" refer to a pair of 16-wide batches? As far as I can tell (rusty memory alert) this was only used in G80, for pixel shading, while VS (GS too?) used 16-wide batches ("half-warps" was one name for them, though officially they are warps in the strict sense). The later GPUs have 32-wide batches. Regardless of size, NVidia seems to use a pair of batches/warps (a convoy) as it helps with register file banking.
In general I'm not sure that slowing down MI would directly impact warp size.
Jawed
pjbliverpool
09-Feb-2009, 12:59
In case anyone hasn't heard already (well done nvidia, fantastic idea)
I dunno, that kinda makes sense to me. Those G92's look very similar to what a cut down GT2xx would be anyway and I doubt there is any performance detriment in comparison.
So for me this makes the product line a lot tidier compared to what it was previously. The 9800GTX+ was a horrible name anyway!
I assume the standard 9800GTX has been dropped from the product line altogether.
I dunno, that kinda makes sense to me. Those G92's look very similar to what a cut down GT2xx would be anyway and I doubt there is any performance detriment in comparison.
It lacks GT200's TMU efficiency gains and it has the "wrong ALU:TMU". I think G92's also lacking full VC-1 decode (introduced with G98 I think).
Jawed
trinibwoy
09-Feb-2009, 15:36
It isn't co-issue. It's "issue at some arbitrary rate", where the rate is determined by the availability of the ALU, and with a timing offset from the MAD ALU.
Oh yeah, totally agree. But the setup is such that they could alternate issue to the MAD and the MI every other core clock if necessary. This was explicitly laid out in one of the patents as one of the reasons for 32-wide pixel batches. And to be honest I can't think of another good reason for them.
Does "superwarp" refer to a pair of 16-wide batches? As far as I can tell (rusty memory alert) this was only used in G80, for pixel shading, while VS (GS too?) used 16-wide batches ("half-warps" was one name for them, though officially they are warps in the strict sense). The later GPUs have 32-wide batches. Regardless of size, NVidia seems to use a pair of batches/warps (a convoy) as it helps with register file banking.
Yeah it seems that the unit of work is a half-warp and two of those are ganged together to reduce pressure on instruction issue. Not sure why you say super-warps help with register file accesses though. Based on the patents and CUDA documentation I've read the coalescing rules for global memory accesses all happen within the scope of a half-warp. Shared memory is 16-way banked as well.
Also, since there's no caching of global memory and any unneeded bytes in a given memory transaction seem to be discarded anyway the other half-warp doesn't appear to reap any benefit from the first half-warp's memory requests.
In terms of the register file there's no evidence that larger warps help there. Although Nvidia recommends multiples of 64 for block sizes for whatever reason:
The compiler and thread scheduler schedule the instructions as optimally as possible to avoid register memory bank conflicts. They achieve best results when the number of threads per block is a multiple of 64. Other than following this rule, an application has no direct control over these bank conflicts. In particular, there is no need to pack data into float4 or int4 types.
Yeah it seems that the unit of work is a half-warp and two of those are ganged together to reduce pressure on instruction issue. Not sure why you say super-warps help with register file accesses though.
I just don't remember any reference to "super-warp", that's why I asked.
Based on the patents and CUDA documentation I've read the coalescing rules for global memory accesses all happen within the scope of a half-warp. Shared memory is 16-way banked as well.
But ALUs run ~twice as fast as registers and memory, so from the point of view of memory, the ALUs are 16-wide.
Also, since there's no caching of global memory and any unneeded bytes in a given memory transaction seem to be discarded anyway the other half-warp doesn't appear to reap any benefit from the first half-warp's memory requests.
Depends on interleaving factors and "wrap-around". Look at the examples relating to bank conflicts.
In terms of the register file there's no evidence that larger warps help there. Although Nvidia recommends multiples of 64 for block sizes for whatever reason:
The reason being that a pair of warps (each being 32 wide) run in lock step in a "convoy".
Jawed
trinibwoy
09-Feb-2009, 17:18
I just don't remember any reference to "super-warp", that's why I asked
My bad. It was actually supergroup and the term pops up in a few patents.
[0083]In another alternative embodiment, SIMD groups containing more than P threads ("supergroups") can be defined. A supergroup is defined by associating the group index values of two (or more) of the SIMD groups (e.g., GID1 and GID2) with each other. When issue logic 424 selects a supergroup, it issues the same instruction twice on two successive cycles: on one cycle, the instruction is issued for GID1, and on the next cycle, the same instruction is issued for GID2. Thus, the supergroup is in effect a SIMD group. Supergroups can be used to reduce the number of distinct program counters, state definitions, and other per-group parameters that need to be maintained without reducing the number of concurrent threads.
But ALUs run ~twice as fast as registers and memory, so from the point of view of memory, the ALUs are 16-wide.
Yep exactly, which is why the unit of memory access is 16-wide as well.
Depends on interleaving factors and "wrap-around". Look at the examples relating to bank conflicts.
Ok, but why would "wrap-around" matter at anything above a multiple of 16? Why is 64 the magic number? Sorry if it's something really simple that I'm missing.....
The reason being that a pair of warps (each being 32 wide) run in lock step in a "convoy".
I don't think that's the definition of a convoy though.
Each slot in the instruction buffer can hold up to two instructions from a convoy (a group of 32) of threads.
By using different clock rates and providing multiple execution pipelines, a large amount of threads can be grouped together into a convoy of threads according to the formula: convoy_size = (number of execution pipelines) × (number of data paths in each execution pipeline) × (ratio of the clock rate of the data processing side to the clock rate of the instruction processing side).
Therefore convoy_size = 2 * 8 * 2 = 32. So if you remove the MI it becomes convoy_size = 1 * 8 * 2 = 16.
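The patent's formula is easy to restate as code. A minimal sketch; the function and parameter names are made up here, and the values are the ones used in the post (two pipelines, 8 data paths each, 2:1 clock ratio):

```python
def convoy_size(pipelines, data_paths_per_pipeline, clock_ratio):
    """Threads per convoy according to the formula quoted from the patent."""
    return pipelines * data_paths_per_pipeline * clock_ratio


with_mi = convoy_size(pipelines=2, data_paths_per_pipeline=8, clock_ratio=2)
without_mi = convoy_size(pipelines=1, data_paths_per_pipeline=8, clock_ratio=2)

print("MAD + MI pipelines:", with_mi)
print("MAD pipeline only: ", without_mi)
```

Whether real hardware would actually shrink the convoy when a pipeline is removed is a separate question; this only restates the arithmetic.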
This excerpt below is what I was referring to in saying that 32 wide warps enable issuing to both the MAD and MI pipelines.
http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=%2Fnetahtml%2FPTO%2Fsearch-adv.html&r=2&p=1&f=G&l=50&d=PG01&S1=%28nvidia.AS.+AND+convoy.BIS.%29&OS=an/nvidia+and+spec/convoy&RS=(AN/nvidia+AND+SPEC/convoy)
[0033] In the preferred embodiment, the issue logic 320, when issuing instructions out of the instruction buffer 310, alternates between instructions of the MAD type and instructions of the SFU type. In this manner, both of the execution pipelines 222, 224 can be kept completely busy. Successive issuances of MAD type instructions or SFU type instructions may be permitted if the instruction buffer 310 contains only single type of instructions. However, a convoy of 32 threads requires 2 T clocks or 4H clocks to execute, and so, successive issuances of same-type instructions (e.g. MAD-MAD or SFU-SFU) can occur at most every other T clock. Issuing different-type instructions alternately to the two pipelines, on the other hand, permits an instruction to be issued at every T clock and provides for higher performance. The compiler can help with the scheduling of the instructions so as to ensure that different-type instructions are stored in the instruction buffer 310. Allowing different convoys to be slightly apart in the program may also improve performance.
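The issue rule in [0033] can be illustrated with a toy model. This is only a sketch under stated assumptions: one instruction issued per T clock at most, each pipeline occupied for 2 T clocks per convoy, all instructions independent, and the issue logic free to pick the oldest instruction whose pipeline is idle. The function name and the selection policy are hypothetical, not taken from the patent:

```python
def issue_times(instructions, busy_t_clocks=2):
    """Return the T clock at which each instruction issues.

    instructions: list of "MAD" / "SFU" strings in program order.
    A pipeline that accepts an instruction stays busy for busy_t_clocks.
    """
    pending = list(enumerate(instructions))
    free_at = {"MAD": 0, "SFU": 0}   # first T clock at which each pipeline is idle
    times = [None] * len(instructions)
    t = 0
    while pending:
        # Issue at most one instruction per T clock: the oldest one whose pipeline is idle.
        for position, (index, kind) in enumerate(pending):
            if free_at[kind] <= t:
                times[index] = t
                free_at[kind] = t + busy_t_clocks
                del pending[position]
                break
        t += 1
    return times


print(issue_times(["MAD", "MAD", "MAD", "MAD"]))   # same-type only
print(issue_times(["MAD", "SFU", "MAD", "SFU"]))   # alternating types
```

With only MAD instructions the model issues every other T clock, while an alternating MAD/SFU stream issues on every T clock, which is the behaviour the paragraph describes.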
Cookie Monster
09-Feb-2009, 20:43
It lacks GT200's TMU efficiency gains and it has the "wrong ALU:TMU". I think G92's also lacking full VC-1 decode (introduced with G98 I think).
Jawed
What about CUDA capabilities?
The patent documentation is so old it relates to G80, where the hardware has a batch defined as 16.
Since then NVidia GPUs have changed so that the smallest batch is 32. NVidia doubled the clock count per issued instruction, essentially.
I think supergroup is referring to an instruction being issued for a single batch over successive cycles, i.e. a supergroup in GT200 consists of 2T clocks ("2 thread clocks", yay, the true meaning of thread in this architecture) or 4H clocks (hot, i.e. ALU clocks) for a single batch. So a convoy has to be re-defined to be:
SIMD-width * number-of-data-paths * ALU-throughput-multiplier * supergroup-size
In G80 the supergroup was 1, in GT200 it's 2. That's my interpretation, anyway. Of course there's a bit of a complication in GT200, because the number of data paths is really 3 if you count the double-precision ALU :razz:
CUDA documentation is misleading about G80 because it implies that a batch is 32-wide, but it's really "half-warp" sized. But since the GPU always runs convoys it doesn't matter - whereas in graphics G80 supposedly runs vertex shader batches un-convoyed.
---
As to wrap-around, see figure 5-6 in the CUDA 2.0 Programming Guide, to see how the banks wrap around. Additionally, by placing a convoy's data in an interleaved pattern in memory (i.e. mixing A and B registers for AAAABBBB batch pattern) you can use the burst length to fetch data for both batches in a convoy and also utilise all the banks evenly, instead of chucking away half the burst.
Obviously there are access patterns that will always cause grief, which is why the CUDA documentation goes to great lengths to explain this stuff.
Jawed
What about CUDA capabilities?
Yep, there's a difference there too. Double-precision, some atomicity improvements, predicate evaluation for whole warps and an increase in capacity for in-flight warps/registers. See appendix A of the CUDA 2.0 Programming Guide.
Jawed
trinibwoy
10-Feb-2009, 03:41
The patent documentation is so old it relates to G80, where the hardware has a batch defined as 16.
Since then NVidia GPUs have changed so that the smallest batch is 32. NVidia doubled the clock count per issued instruction, essentially.
From where I'm sitting nothing has changed in the context of that patent. There are still two pipelines and there is still a 2:1 clock ratio. The definition of convoy and supergroup are still the same. Even if GT200 issues everything as a supergroup the definition doesn't change as a result and there's no bearing on the behavior of future iterations that follow the G8x model.
As to wrap-around, see figure 5-6 in the CUDA 2.0 Programming Guide, to see how the banks wrap around.
Yeah but that's based on the stride between data elements within a single half warp. Still don't get the relationship to the block size.....
A common case is for each thread to access a 32-bit word from an array indexed by
the thread ID tid and with some stride s:
__shared__ float shared[32];
float data = shared[BaseIndex + s * tid];
In this case, the threads tid and tid+n access the same bank whenever s*n is a
multiple of the number of banks m or equivalently, whenever n is a multiple of m/d
where d is the greatest common divisor of m and s. As a consequence, there will be
no bank conflict only if half the warp size is less than or equal to m/d. For devices
of compute capability 1.x, this translates to no bank conflict only if d is equal to 1,
or in other words, only if s is odd since m is a power of two.
Those conditions don't seem to have any dependency on block size. Unless the stride is related to block size somehow which the docs don't mention anything about (don't see why it would be either).
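The quoted rule can be checked by brute force. A small sketch, assuming 16 banks of 32-bit words and a half-warp of 16 threads (compute capability 1.x as described in the guide); the function names are invented for illustration and the model ignores broadcast:

```python
from collections import Counter
from math import gcd


def max_bank_conflict(stride, banks=16, threads=16, base_index=0):
    """Largest number of threads in one half-warp that land on the same bank."""
    hits = Counter((base_index + stride * tid) % banks for tid in range(threads))
    return max(hits.values())


def conflict_free_by_rule(stride, banks=16, threads=16):
    """The guide's condition: half-warp size <= m / d, with d = gcd(m, s)."""
    return threads <= banks // gcd(banks, stride)


for s in (1, 2, 3, 4, 8, 16):
    print(f"stride {s:2d}: {max_bank_conflict(s)}-way access, "
          f"rule says conflict-free = {conflict_free_by_rule(s)}")
```

Nothing in either function depends on the block size, only on the stride within a half-warp, which matches the point made in the reply.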
Additionally, by placing a convoy's data in an interleaved pattern in memory (i.e. mixing A and B registers for AAAABBBB batch pattern) you can use the burst length to fetch data for both batches in a convoy and also utilise all the banks evenly, instead of chucking away half the burst.
Register files have a burst length? I thought it was simply 1 32-bit read per clock per bank?
Register files have a burst length? I thought it was simply 1 32-bit read per clock per bank?
Each H clock, MAD needs 3 scalar operands for each of 8 lanes, which is 24 operands. Each T clock, which is ~2 H clocks, MAD needs 48 operands.
Similarly each H clock, MI needs 1 operand for each of 2 lanes (for transcendental), so that's 2 operands - so each T clock that's 4 operands.
If MI is doing MUL, then it's 8 operands per T clock.
So the worst case is 48 operands per T clock for MAD and 8 operands per T clock for MUL = 56 operands.
So each T clock the register file needs to produce 64 operands to cover all these cases. So each of the 16 banks in the register file produces a burst of 4 scalars per T clock.
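The operand counting above can be written out as a short calculation. A sketch using the post's own figures; the "2 operands per lane" for MUL on the MI is my reading of the 8-operand figure, and the names are made up:

```python
H_CLOCKS_PER_T = 2   # the ALUs run at roughly twice the register file clock


def operands_per_t(operands_per_lane, lanes):
    """Scalar operands a pipeline consumes during one T clock."""
    return operands_per_lane * lanes * H_CLOCKS_PER_T


mad = operands_per_t(operands_per_lane=3, lanes=8)                # MAD pipeline
mi_transcendental = operands_per_t(operands_per_lane=1, lanes=2)  # MI doing a transcendental
mi_mul = operands_per_t(operands_per_lane=2, lanes=2)             # MI doing a MUL

worst_case_demand = mad + mi_mul
supply = 16 * 4   # 16 banks, burst of 4 scalars per bank per T clock

print("MAD operands per T:          ", mad)
print("MI (transcendental) per T:   ", mi_transcendental)
print("MI (MUL) per T:              ", mi_mul)
print("worst-case demand per T:     ", worst_case_demand)
print("register file supply per T:  ", supply)
```

The supply of 64 operands per T covers the worst case of 56 with a little headroom.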
From where I'm sitting nothing has changed in the context of that patent. There are still two pipelines and there is still a 2:1 clock ratio.
Your earlier question about removal of MI essentially affects the need to use a convoy, as the convoy is constructed specifically to twiddle batches across MAD and MI.
I've also taken this as an optimisation for operand fetching, as by pairing up two batches in a convoy you can use a burst to read "half" of each batch's operands. This means that when a batch is reading registers from all over the register file, the addressing rate (1 per T per bank) and the burst length are less likely to produce surplus operands.
The ideal case (in G80), with a burst length of 4, is four batches interleaved in the register file. This allows:
MAD r0, r1, r5, r9
RCP r13, r19
where r1, r5, r9 and r19 are fetched on four consecutive Ts. This produces 16 banks * burst length 4 * T count 4 = 256 operands. That's enough operands for 4 batches of 16, where each batch wants 52 operands.
The resulting data will feed a pair of convoys over 4 consecutive Ts:
        MAD            MI
        (r1, r5, r9)   r19
=============================
T0      A              B
T1      B              A
T2      C              D
T3      D              C
Obviously this requires that MAD and RCP are fully independent instructions (as they are in this example).
So alternating MAD+MI and interleaving of batches in register-file bursts synergistically maximises the bandwidth utilisation of register-file and operand collection.
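The alternation in the table can be generated mechanically. A toy sketch, assuming each convoy is a pair of batches that swap between the MAD and MI pipelines on consecutive T clocks; the function name is hypothetical and this models only the schedule, not the operand fetch:

```python
def convoy_schedule(convoys):
    """Return (T, batch on MAD, batch on MI) rows for a list of batch pairs."""
    rows = []
    for first, second in convoys:
        rows.append((len(rows), first, second))   # first batch on MAD, second on MI
        rows.append((len(rows), second, first))   # then the two batches swap pipelines
    return rows


schedule = convoy_schedule([("A", "B"), ("C", "D")])
for t, on_mad, on_mi in schedule:
    print(f"T{t}  MAD: {on_mad}  MI: {on_mi}")
```

Each batch visits each pipeline exactly once and both pipelines are occupied on every T clock, which is the property the post relies on.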
Of course this all starts to fall apart with branch divergence...
What's interesting is that a pair of batches joined to make a convoy make a de-facto batch. If branch divergence affects one of the batches the entire convoy is affected.
Even more simply, if the batches in a convoy diverge e.g. batch A takes the THEN clause (MAD r0, r1, r2, r3) while batch B takes the ELSE clause (MAD r5, r6, r7, r8), then the operand collector is effectively trying to fetch operands for independent instructions on the same T clocks and so it will prolly run out of bandwidth.
This makes me think that GT200 is merely an enforced-convoy architecture, with a batch size of 16 - but for branch divergence and memory operations it counts as having a batch size of 32. Now it could be that GT200 actually has a baseline batch size of 32, making a convoy 64 elements. That would make for effective batch size of 64. Dunno.
Whereas I think RV770 has an effective batch size of 128.
The definition of convoy and supergroup are still the same. Even if GT200 issues everything as a supergroup the definition doesn't change as a result and there's no bearing on the behavior of future iterations that follow the G8x model.
I think you might have a point here.
I'm still trying to reconcile the scaling introduced by "supergroup" (i.e. when would it be used) with the fact that GT200 supposedly issues a single batch instruction for 4 H clocks, not 2 as in G80.
My interpretation is that GT200 has 32-wide batches that are convoyed to make a super-batch of 64.
An alternative interpretation is that a 32-wide batch is formed from a convoy, indivisibly. Some kind of internal change has been made that enforces this scheduling. G80 didn't enforce convoys, it seems (vertex shader batches seem to be truly 16 in size).
Yeah but that's based on the stride between data elements within a single half warp. Still don't get the relationship to the block size.....
Think of it as a 3-dimensional block of memory, where you're allowed to cut any plane you like as long as it only uses 2 dimensions to address. The interleaving in figure 5.6 shows how the banks work independently. If you replace the bank dimension with time (T) or registers fetched in a burst, then other useful interleavings present themselves. e.g. as I explained earlier, by interleaving batches, a burst can produce operands for 4 batches and produce zero wastage.
Banks are the "best" dimension, since they're truly granular (each bank is independent of the others). Operand collection has limited time in which to produce the operands for a batch, so you can't go crazy with distinct reads. And burst length is fixed (i.e. forces multiple reads over multiple T if the data isn't in the burst).
So it all boils down to register allocation.
What I haven't mentioned so far is that the TMUs also have to fetch operands (each T clock).
Those conditions don't seem to have any dependency on block size. Unless the stride is related to block size somehow which the docs don't mention anything about (don't see why it would be either).
So it makes sense to use blocks of 64 since that's the number of elements that can have one operand fetched from register file in one T.
Jawed
willardjuice
10-Feb-2009, 17:50
Sorry to interject, but what's the rough eta for these cards?
In case anyone hasn't heard already (well done nvidia, fantastic idea)
IMHO Nvidia's marketing department should get a serious spanking. What on earth is wrong with these people? Do they get a bonus for who can confuse their customers the most?
In Dutch we call this "ouwe wijn in nieuwe zakken" (old wine in new bags).
DegustatoR
10-Feb-2009, 22:57
http://www.theinquirer.net/inquirer/news/925/1050925/nvidia-tapes-cpus
Maybe it's time to change the thread title?
trinibwoy
10-Feb-2009, 23:12
Heh, Charlie's articles are like a breath of fresh air.
ShaidarHaran
11-Feb-2009, 00:35
I was a bit worried that I would feel a fool having purchased a GTX 285 the other day, but if GT212 is not to be then I feel I have made a wise purchase.
Ailuros
11-Feb-2009, 10:55
Heh, Charlie's articles are like a breath of fresh air.
It wouldn't surprise me if he's british; after all they managed to kill entire generations with their bad teeth *ducks and runs*
ShaidarHaran
11-Feb-2009, 18:53
It wouldn't surprise me if he's british; after all they managed to kill entire generations with their bad teeth *ducks and runs*
He lives in Minneapolis. I saw him downtown last weekend. Was gonna heckle him but thought "meh, what's the point?"
I was a bit worried that I would feel a fool having purchased a GTX 285 the other day, but if GT212 is not to be then I feel I have made a wise purchase.
It could be worse. You could have bought the GT212 in Summer, and the GT300 could have arrived at Xmas. I do not see the point of this GT212, when the GT300 is the card to get this year.
Btw, I got a 285 too, one of those with 180GB/s :D It should be enough till the GT300 arrives :)
DegustatoR
11-Feb-2009, 21:08
I do not see the point of this GT212, when the GT300 is the card to get this year.
GT212 is supposed to have around 300 mm^2 die size while GT300 will probably be close to 600 mm^2. That makes GT212 a good candidate for ~$200 price range while GT300 will cost quite a bit more.
The point is that those two chips aren't exactly in the same price range.
AnarchX
11-Feb-2009, 22:24
600mm² @ 40nm? :???: That would give around 3 billion transistors.
I would think GT300 is a more efficient approach.
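As a rough sanity check on the "around 3 billion" figure, one can scale GT200's transistor density from 65nm to 40nm. This is a back-of-envelope sketch only: it assumes GT200 has about 1.4 billion transistors (NVIDIA's public figure, not stated in this thread), uses the 583mm² die size mentioned elsewhere in the thread, and assumes ideal area scaling, which real processes do not reach. The function name is made up:

```python
GT200_TRANSISTORS = 1.4e9   # assumption: NVIDIA's public figure for GT200
GT200_AREA_MM2 = 583        # die size as given in the thread (65nm)


def ideal_shrink_transistors(area_mm2, old_nm=65, new_nm=40):
    """Transistor count for a die of area_mm2 if density scaled as (old/new)^2."""
    density_old = GT200_TRANSISTORS / GT200_AREA_MM2
    density_new = density_old * (old_nm / new_nm) ** 2
    return area_mm2 * density_new


estimate = ideal_shrink_transistors(600)
print(f"ideal-scaling estimate for 600 mm^2 at 40nm: {estimate / 1e9:.1f} billion")
```

Ideal scaling gives an upper bound of a bit under 4 billion, so a figure of around 3 billion for 600mm² is in the right region once less-than-ideal scaling is allowed for.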
Ailuros
12-Feb-2009, 07:40
600mm² @ 40nm? :???: That would give around 3 billion transistors.
I would think GT300 is a more efficient approach.
You're thinking wrong I guess. Albeit 600 sqmm sounds a tad too much, I wouldn't be in the least surprised if the result would lie roughly =/<GT200's die size. Point being if NV manages this time to convince everyone that the monolithic single high end chip approach does really make a difference; personally I wasn't particularly convinced with anything GT2x0 so far.
He lives in Minneapolis. I saw him downtown last weekend. Was gonna heckle him but thought "meh, what's the point?"
I know that Charlie is downright a nice guy; it was merely a harmless joke since I love to poke the brits once in a while LOL.
CarstenS
12-Feb-2009, 09:03
You're thinking wrong I guess. Albeit 600 sqmm sounds a tad too much, I wouldn't be in the least surprised if the result would lie roughly =/<GT200's die size.
Which is, according to my measurements with a sliding rule, roughly 600mm² in 65nm. :)
Which is, according to my measurements with a sliding rule, roughly 600mm² in 65nm. :)
Yup, it's 583mm².
Ailuros
12-Feb-2009, 09:17
Oh make up your mind with the darned thing; it cannot be 576, 583 or even 600 at the same time ROFL :lol:
Oh make up your mind with the darned thing; it cannot be 576, 583 or even 600 at the same time ROFL :lol:
576 is what you get if you estimate it at 24x24, 600 is what you get if you estimate it with a ruler based on the package size, and 583 is the official number from NVIDIA. Happy now? ;) In fact to be even more precise, NVIDIA claims it's a 24.3x24 chip...
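The three numbers are easy to reproduce. A small sketch; the square-die assumption for the 600mm² ruler estimate is mine, and the names are invented for illustration:

```python
from math import sqrt


def die_area_mm2(width_mm, height_mm):
    """Area of a rectangular die in square millimetres."""
    return width_mm * height_mm


rough_estimate = die_area_mm2(24, 24)               # the 24x24 estimate
official = round(die_area_mm2(24.3, 24), 1)         # the 24.3x24 figure attributed to NVIDIA
ruler_side = sqrt(600)                              # edge length if 600 mm^2 were a square die

print("24 x 24:   ", rough_estimate, "mm^2")
print("24.3 x 24: ", official, "mm^2")
print(f"600 mm^2 as a square: {ruler_side:.1f} mm per side")
```

So the 576 and 583 figures differ only by 0.3mm on one edge, while the ruler-based 600 corresponds to a few tenths of a millimetre extra on each edge.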
CarstenS
12-Feb-2009, 09:45
I wasn't aware that Nvidia was giving out official numbers on die-size.
http://home.arcor.de/quasat/Die-Shot/GT200-Width.jpg
http://home.arcor.de/quasat/Die-Shot/GT200-Height.jpg
:)
So from the discrepancy between NVidia's numbers and Carsten's measurements, it seems that there's 0.5mm of sealant/packaging on each of the width and height. Useful to bear in mind when future die measurements are performed...
Jawed
Indeed, that's interesting. I wonder how much of that is at the packaging level, and how much is at the wafer level (i.e. inter-chip spacing) - clearly the latter would matter much more wrt cost than the former... :)
DegustatoR
12-Feb-2009, 23:49
600mm² @ 40nm? :???: That would give around 3 billion transistors.
I would think GT300 is a more efficient approach.
What do the number of transistors and die size have to do with efficiency?
And don't forget - GT300 is the real LRB competitor from NV so I wouldn't be very surprised if it'll have loads of programmable memory/cache on die - Larrabee style.
compres
13-Feb-2009, 02:38
OMG LOL@Charlies article.
The reason is pretty obvious, the TSMC 40nm process is leaky as hell, and the bigger the chip, the more transistors that leak. Top this off with a totally botched design that couldn't be shrunk from 65nm to 55nm sanely, so 40 was very iffy.
Ailuros
13-Feb-2009, 11:56
576 is what you get if you estimate it at 24x24, 600 is what you get if you estimate it with a ruler based on the package size, and 583 is the official number from NVIDIA. Happy now? ;) In fact to be even more precise, NVIDIA claims it's a 24.3x24 chip...
I'll be happy if you keep those discrepancies in mind when estimating future dies *har har har*
And don't forget - GT300 is the real LRB competitor from NV so i wouldn't be very surprised if it'll have loads of programmable memory/cache on die - Larrabee style.
Well I personally doubt that LRB could shoot up to such magnitudes of die area as GT3x0 and even then power consumption might be higher for the first.
I think if there will be really no GT212 as a chip with rumoured 384SPs and 96TMUs, the best NVIDIA could do is releasing GPU (GT215??) with 256SP, 64TMU, 16 ROPs and 256-bit MC (or 24 ROPs and 384-bit MC) with GDDR5. Chip with these specs should have small die size and should be even faster than GTX285 and could be worthy competitor to Rv790.
Ailuros
14-Feb-2009, 07:11
I think if there will be really no GT212 as a chip with rumoured 384SPs and 96TMUs, the best NVIDIA could do is releasing GPU (GT215??) with 256SP, 64TMU, 16 ROPs and 256-bit MC (or 24 ROPs and 384-bit MC) with GDDR5. Chip with these specs should have small die size and should be even faster than GTX285 and could be worthy competitor to Rv790.
Even if GT212 is a 384SP/4*64bit it would depend on its final frequencies if it would set itself above the 285 or slightly below.
While NV most certainly needs above all budget to mainstream 40nm chips for all its markets to further reduce manufacturing costs, it still remains that it doesn't make much sense in the longrun to NOT have a 40nm performance chip when GT3x0 arrives.
CarstenS
14-Feb-2009, 09:40
Even if GT212 is a 384SP/4*64bit it would depend on its final frequencies if it would set itself above the 285 or slightly below.
Now, that's a nice oxy moros we have here.
A chip with a much more advanced process technology and paper-specs that seemingly blow the old one out of the water and yet supposedly it depends on clock frequencies if it's faster or not - even though theoretically, 40nm-tech should be guaranteeing a healthy increase in frequency headroom. :)
Even if GT212 is a 384SP/4*64bit it would depend on its final frequencies if it would set itself above the 285 or slightly below.
While NV most certainly needs above all budget to mainstream 40nm chips for all its markets to further reduce manufacturing costs, it still remains that it doesn't make much sense in the longrun to NOT have a 40nm performance chip when GT3x0 arrives.
There is no way GT212 could be positioned below GTX285. Why? Because NVIDIA wouldn`t pack in such a big number of SPs, with such a big increase in transistors, if they had planned to make a "performance" chip. That would be a very bad move in terms of performance/production cost ratio.
What NVIDIA needs most is a worthy successor of G92. They need this chip to have a worthy competitor for Rv790 and other Rv7xx 40nm GPUs from ATI.
So i think a good move for NVIDIA could be to make a GPU with specs something like these:
-256SP ~ 1,8-2 Ghz
-64 TMUs
-16 ROPs (or 24 but more possible is 16)
-256-bit GDDR5
-clock domain for TMUs and ROPs about 750-800 Mhz.
To get these clocks in 40nm shouldn`t be a problem and die size of this GPU shouldn`t be bigger than 250mm^2 and performance will be above GTX285 in overall.
I hope that chip something like this is now a mysterious GT215.
For another highend GPU of GT2xx family i`m still saying that 320SP/80TMU/512-bit/32ROPs (the same number of clusters as a GT200/GT200B has) will be enough to beat any ATI single solution and it will still give a possibility to make GX2 variants. IMO NVIDIA doesn`t need GT212 chip with specs which leaked some weeks ago.
Ailuros
16-Feb-2009, 11:21
Now, that's a nice oxy moros we have here.
Oxymoron (don't mind the greek spelling police...)
A chip with a much more advanced process technology and paper-specs that seemingly blow the old one out of the water and yet supposedly it depends on clock frequencies if it's faster or not - even though theoretically, 40nm-tech should be guaranteeing a healthy increase in frequency headroom. :)
Apart from the significant increase in arithmetic performance due to probably 60% more ALUs, there's only a rather mediocre increase in texel fillrate and not necessarily an increase in all other fillrates considering that rumoured specs speak of 4 ROP partitions and not 8.
I wouldn't be at all surprised to see a 20% increase in frequencies for a chip with similar complexity as today's GT200b just on 40nm. The usual up to date 212 scenarios suggest a higher complexity than that of a hypothetical 200b shrunk to 40nm. I'm merely having second thoughts if 40nm isn't that troublefree to allow at the same time roughly the same increase of chip complexity and frequency.
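The unit ratios behind that argument can be written down as a small sketch. The GT212 numbers are the rumoured ones from this thread (384 SPs, 96 TMUs, 4 ROP partitions); the GT200 numbers (240 SPs, 80 TMUs, 8 ROP partitions) are the shipping chip's. Clocks are deliberately left out, so this only shows how much frequency would have to make up for.

```python
# Per-unit ratios of the rumoured GT212 against GT200.
# GT212 values are rumours quoted in the thread, not confirmed specs.

gt200 = {"sp": 240, "tmu": 80, "rop_partitions": 8}
gt212_rumoured = {"sp": 384, "tmu": 96, "rop_partitions": 4}

ratios = {unit: gt212_rumoured[unit] / gt200[unit] for unit in gt200}

for unit, ratio in ratios.items():
    print(f"{unit:>15}: {ratio:.2f}x")

# Clock multiplier needed for the rumoured chip to merely match GT200's
# ROP throughput, assuming identical ROPs per partition (an assumption).
rop_clock_needed = 1.0 / ratios["rop_partitions"]
print(f"ROP clock needed to break even: {rop_clock_needed:.1f}x")
```

So arithmetic goes up by 60%, texel rate by 20%, and the ROP side is halved unless clocks or per-ROP capabilities change.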
For another highend GPU of GT2xx family i`m still saying that 320SP/80TMU/512-bit/32ROPs (the same number of clusters as a GT200/GT200B has) will be enough to beat any ATI single solution and it will still give a possibility to make GX2 variants. IMO NVIDIA doesn`t need GT212 chip with specs which leaked some weeks ago.
The supposed GT212 specs didn't leak just a couple of weeks ago, yet rather weeks before that ELSA roadmap picture in the original post in this thread. There's no GT21x model with 8*64bit afaik and I have severe doubts you'll see such a wide bus configuration even with the next D3D11 generation.
So do you think even the fastest GT21x 40nm GPU will have 256-bit MC? Don`t you think it is going to be a bottleneck in higher resolutions? Maybe not caused by 256-bit Mem bus because of using GDDR5 but 256-bit means only 16ROPs and i don`t think that NVIDIA is going to improve them in 40nm GT21x GPUs.
So what about "a good old" 384-bit MC which is doing great even was introduced 2,5 year ago? ;)
256-bit means only 16ROPs and i don`t think that NVIDIA is going to improve them in 40nm GT21x GPUs.
NVidia appears adamant that per ROP they don't need more bandwidth, so indeed it will be interesting to see how many ROPs there are on a GT2xx GPU with GDDR5.
Jawed
Well, 16 ROPs with GPUs as GT215 - supposed "Performance" level - could be enough but it seems that all GT21x GPUs will have 256-bit/16ROPs configuration so i don`t see a point to make GPUs with 1,5X or more shader power than current top end GPUs (GTX280/285) with nearly 2X less ROPs bandwidth.
Other thing is if NVIDIA does something like 256-bit/32ROPs configuration (but i doubt it). Then ROPs performance will be at the same level as nowadays or a little higher and GDDR5 will do their job as well :)
PS. While about Rv790 rumours are more and more detailed then NVIDIAs GPUs specs are still mysterious and unknown for anyone. Moreover we still don`t know even GT212 is alive or canceled.
Ailuros
17-Feb-2009, 07:51
Well, 16 ROPs with GPUs as GT215 - supposed "Performance" level - could be enough but it seems that all GT21x GPUs will have 256-bit/16ROPs configuration so i don`t see a point to make GPUs with 1,5X or more shader power than current top end GPUs (GTX280/285) with nearly 2X less ROPs bandwidth.
I'm not sure what you mean with ROP bandwidth exactly, since such a hypothetical GPU's final bandwidth depends obviously on the GDDR5 they'd use.
There's definitely going to be a significant Z-/Pixel Fillrate reduction if you cut the amount of ROPs in half.
I've never really understood the reasoning behind the concept of such a configuration myself to be completely honest.
Other thing is if NVIDIA does something like 256-bit/32ROPs configuration (but i doubt it). Then ROPs performance will be at the same level as nowadays or a little higher and GDDR5 will do their job as well :)
IMHLO it would be way easier (in a purely hypothetical speculation exercise) to think of increasing capabilities per ROP than doubling the amount of the existing ROPs per partition. That is of course if the current ROPs actually need any increases. Recent driver versions point rather in the direction that they might have finally woken up and are optimizing them for better 8xMSAA performance.
PS. While about Rv790 rumours are more and more detailed then NVIDIAs GPUs specs are still mysterious and unknown for anyone. Moreover we still don`t know even GT212 is alive or canceled.
In what way are 790 rumours more detailed? Has the rumour mill up to now made up its mind how many units that one contains? It sounds more like just a frequency increase to me and one reason more for me to believe that increasing both the amount of units as well as frequencies for 40nm could be a too tough exercise for anyone at this point. If GT212 is alive and it truly has the rumoured config, I'm having second thoughts on noteworthy frequency increases that's all.
KonKort
28-Feb-2009, 09:10
I don't have good news for you. I heard from Nvidia that it is not certain whether GT212 will really come. So Charlie could be right with his report.
The good news is that the development of G300 is going well. It looks like it will launch this year, but it has not taped out yet.
Ailuros
28-Feb-2009, 09:28
Many of us I guess were suspecting something like that. It's far more reasonable for any IHV at this point to concentrate more on the next generation than anything else.
GT3x0's final tape out under 40nm is the critical point of the entire story.
Yea, releasing a chip like GT212 with the rumoured 384SP doesn`t make sense when it is really planned for Q3/09. Moreover the most important thing for NVIDIA at the moment is making and releasing new Mainstream and Performance GPUs in 40nm to compete with Rv740/Rv790 in this generation. So i wonder what specs we can expect from GT216, GT215 and GT214 if all these GPUs are going to be released.
What does the GT212 cancellation mean? Could it be that GT300 is doing well and maybe there is a chance of releasing it, hmm, somewhere about September/October? Or maybe GT300 is still scheduled for Q4 (Nov/Dec)? I hope that first option is right ;)
Or maybe the high-end market is just dead right now because of the economy? *shrugs*
So you want to say GT300 maybe won`t be a real Highend chip too like G80 or GT200 have been?
ShaidarHaran
28-Feb-2009, 14:44
Or maybe the high-end market is just dead right now because of the economy? *shrugs*
I don't know, I just purchased a GTX 285 and my roommate just bought a 4870X2. Prices were too good to pass on, despite the economy. We each still have jobs and our bills haven't changed, plus we just got our tax returns so timing was right. I suspect plenty of others will do the same with their tax returns.
Remember every time your framerate is higher than his you must point and laugh
Well this kind of settles it. I will grab me a couple of 285s in the coming months.
ShaidarHaran
28-Feb-2009, 20:24
Remember every time your framerate is higher than his you must point and laugh
:lol: no, that wouldn't be good since I recommended all the hardware he purchased for his new Core i7 rig. If my year old 4GHz C2D system outperforms his and cost less to build I think he might be a bit unhappy with me...
So i don`t get it at all
http://www.fudzilla.com/index.php?option=com_content&task=view&id=12422&Itemid=1
It is said that NVIDIAs 40nm GPU (GT218 there mentioned as a performance chip) is going to be a real beast (maybe "the second G92"??). But as we remember GT218 was said to be a low-end GPU. It could be great to see again a great chip from NVIDIA, worthy successor of GF8800GT but it`s somewhat strange to me if it is going to be released in Q3 this year and compete with new AMDs architecture. IMO most likely GT3xx is a Rv8xx competitor not another GT2xx chip even in 40nm.
I think if GT218 stands against Rv870 NVIDIA will be defeated with no doubt.
So i wonder when and what 40nm GPUs from NVIDIA we will see this year. They need them against current ATIs GPUs (cheap to produce and great in performance) and following Rv740/790 as well.
Ailuros
07-Mar-2009, 07:05
GT218 should be a 40nm chip for budget/low end. Fuad most likely is referring to GT3x0 than anything else.
CarstenS
07-Mar-2009, 11:09
GT218 should be a 40nm chip for budget/low end.
That'll make Domell's statement even more likely, won't it? ;)
I think if GT218 stands against Rv870 NVIDIA will be defeated with no doubt.
trinibwoy
07-Mar-2009, 12:47
Why would GT218 stand against RV870? They're in two completely different segments. It'll be interesting to see whether or not AMD goes big again for DX11 or sticks to the $300 price point. And if so, would Nvidia have a ~300mm^2 part to compete or would they have to use a cut down version of a bigger chip.
But the most important question is what chip is GT218? Low-end, Mainstream or Performance? Did NVIDIA change their codenames again or Fudzilla is simply wrong about GT218 as performance GPU?
Will there be other 40nm GT2xx GPUs?? Or maybe NVIDIA have cancelled some of their 40nm GT2xx GPUs like GT214 or GT215 (GT212 is most likely cancelled too)??
NVIDIAs strategy is very confusing. There is no new info about GT2xx 40nm GPUs and there is no GT3xx info too. AMDs Rv740/Rv790 are going closer and closer and Rv740 seems to be a real big thing. Very small die size and performance comparable to slower versions of Rv770 is simply amazing. I think that at this point NVIDIA should make the best they could do and push GPU with performance/price ratio comparable to AMDs killer.
It is interesting to see what NVIDIA will do during this month and April after releasing by AMD their new GPUs.
It is interesting to see what NVIDIA will do during this month and April after releasing by AMD their new GPUs.
Hmm, there is still plenty of SKUs to rebrand out there. :lol:
On a serious note, at least we should see some GDDR5 adoption from NV, if nothing else. :roll:
Heck, even the skinny mobile 4800 series will be getting 5GHz parts, now!
NVIDIAs strategy is very confusing.
GT200's lateness (though in fact it may well have been a replacement for "G100") indicates that the wheel nuts started coming undone towards the end of 2007.
NVidia's strategy also led to the poor performance increment of GT200, distracted as they were by CUDA functionality.
Then there was their apparent arrogance in the face of ATI.
And then their repeated failures to make 65nm and 55nm technology deliver the goods.
So a wheel or two have fallen off of their strategy. I wouldn't call it a strategy any more.
GT300's strategy, with a bit of luck, hasn't really been affected. You could say it boils down to 40nm now, as an external factor, and not much else.
Question is, why has NVidia struggled to get the most out of TSMC, both in terms of die area and in terms of timeliness?
---
Interestingly enough, G92b with a GDDR5 interface would prolly give RV770 a good run for its money. I think AMD was lucky with GDDR5 (RV770 would have been pretty lame without it) and come this autumn I imagine there'll be a level playing field there, though NVidia might not deploy GDDR5 as widely as AMD, electing to keep the low end on GDDR3?
Jawed
MarkoIt
09-Mar-2009, 07:35
GT200's lateness (though in fact it may well have been a replacement for "G100") indicates that the wheel nuts started coming undone towards the end of 2007.
NVidia's strategy also led to the poor performance increment of GT200, distracted as they were by CUDA functionality.
Then there was their apparent arrogance in the face of ATI.
And then their repeated failures to make 65nm and 55nm technology deliver the goods.
So a wheel or two have fallen off of their strategy. I wouldn't call it a strategy any more.
GT300's strategy, with a bit of luck, hasn't really been affected. You could say it boils down to 40nm now, as an external factor, and not much else.
Question is, why has NVidia struggled to get the most out of TSMC, both in terms of die area and in terms of timeliness?
---
Interestingly enough, G92b with a GDDR5 interface would prolly give RV770 a good run for its money. I think AMD was lucky with GDDR5 (RV770 would have been pretty lame without it) and come this autumn I imagine there'll be a level playing field there, though NVidia might not deploy GDDR5 as widely as AMD, electing to keep the low end on GDDR3?
Jawed
So we can probably expect a G92bb with GDDR5 support, named GTS260! :lol:
Seriously, a G92@40nm with a 128bit MC and GDDR5 would outperform RV740. The die-size would be probably a little bigger (150-160 mm^2 maybe?), but it would be close enough to be a competitive solution. I actually think that GT215 isn't going to be much different from this.
Now, the problem is: when will Nvidia be able to launch such a solution? Q3? This gives ATi almost 6 months of overwhelming superiority both in desktop and notebook market.
Ailuros
09-Mar-2009, 08:01
IHVs have their ups and their downs exactly because of specific strategies per timeframe and that goes for all IHVs and all markets. There's no such thing as getting things always right and never making any mistakes and it's quite irrelevant if it's Intel, NVIDIA, AMD or anything else.
We can of course start counting how many corpses each and everyone has hidden in its own dungeon, but it'll result to the same nonsensical drivels as always.
If there's one lesson some should have learned over the years while following the whole 3D circus (especially those that followed it from the very beginning) is that it doesn't take too long until tables turn. Some balances hold longer and some less. The challenge for the consumer is to find out which hardware (irrelevant if it's GPU, CPU or anything else) offers the best bang for the buck for a given timeframe and buy exactly that without any senseless brand loyalties or feelings.
We are going through extremely tough times and AMD not only did their very best achieving an outstanding perf/mm2 ratio for their chips, it reduced prices significantly across the market in times where it's quite tough for everyone (and no I don't think when AMD conceived the RV7x0 family that they could have foreseen the financial crisis). The challenge now is to repeat a similar stunt with the D3D11 generation. Albeit their odds having RV7x0 in the back of one's mind seem excellent at the moment, it's still no absolute guarantee.
The only D3D11 architecture a few details are known about at the moment is Intel's Larrabee and that only because Intel is entering the GPU market with a fundamentally new architecture and has to start evangelizing for it as early as possible. The other two remain a quite big question mark; it might be safe to assume that AMD will continue its performance GPUs only strategy and NVIDIA its monolithic high end single chip strategy but that tells next to nothing about what may be the better solution after all.
Any of the above aside it is true that NVIDIA this time has really to convince the world that their monolithic single high end chip approach is the better strategy. Personally it didn't convince me one bit this past round.
I think AMD was lucky with GDDR5 (RV770 would have been pretty lame without it)
I don't quite agree with that. Judging by the results some published (color fill benchmarks, overclocking experiments), efficiency of gddr5 (at least with the memory controller in rv770) seems to be quite a bit lower than gddr3, and the hd4850 indeed scales very well with memory clock. I think a rv770 with same gpu clock as HD4870 but using these factory-overclocked 1.3Ghz gddr3 parts would be quite close in performance to the HD4870 as we know it.
I wasn't aware of 1.3GHz GDDR3 :oops: This is pretty damn nippy:
http://www.evga.com/products/moreInfo.asp?pn=01G-P3-1288-AR
It wouldn't have quite worked out for 8xAA situations. But yes, GDDR5 may not have been quite so lucky.
Jawed
AFAIK those are not exactly 2,6 GHz GDDR3, but overclocked chips rated for lower clocks (2,5 GHz eff. perhaps?). Anyway, such chips weren't probably available for HD 4870's release, and again, even 2,2 GHz GDDR3 doesn't limit RV770 to a point we could call "lame".
AFAIK those are not exactly 2,6 GHz GDDR3, but overclocked chips rated for lower clocks (2,5 GHz eff. perhaps?).
Well, samsung does offer 1.3Ghz GDDR3 chips (only 512mbit ones, however, fastest 1gbit ones are 1Ghz but qimonda has faster 1gbit chips - 1.2Ghz). AFAIK everything over 1Ghz though requires more than 1.8V (hence I call them factory-overclocked).
Anyway, such chips weren't probably available for HD 4870's release, and again, even 2,2 GHz GDDR3 doesn't limit RV770 to a point we could call "lame".
Possible, though the 1.1Ghz gddr3 was available way before that (GF8800U had mem clock of 1.08Ghz). Maybe 1.2Ghz would have been available, not sure. I wouldn't call it lame with 2.2Ghz GDDR3 neither, but such a configuration clearly would be no longer competitive with GTX 260.
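To put numbers on the GDDR3 vs GDDR5 argument, here is a small bandwidth sketch for a 256-bit bus. The 993 MHz (HD 4850) and 900 MHz GDDR5 (HD 4870) clocks are commonly quoted figures assumed here, not taken from this thread; the 1.1, 1.2 and 1.3 GHz GDDR3 speeds are the ones discussed above. GDDR3 moves 2 transfers per clock and GDDR5 is treated as 4, which is the usual shorthand.

```python
# Peak memory bandwidth for a 256-bit bus at the memory speeds discussed.
# Assumptions: GDDR3 = 2 transfers/clock, GDDR5 = 4 transfers/clock;
# HD 4850 and HD 4870 clocks are commonly quoted values, not from the thread.

def bandwidth_gbs(bus_bits, mem_clock_mhz, transfers_per_clock):
    """Peak bandwidth in GB/s (decimal) for a given bus width and memory clock."""
    bytes_per_transfer = bus_bits / 8
    return bytes_per_transfer * mem_clock_mhz * transfers_per_clock / 1000.0

configs = [
    ("HD 4850, GDDR3 @ 993 MHz", 993, 2),
    ("GDDR3 @ 1100 MHz (2,2 GHz eff.)", 1100, 2),
    ("GDDR3 @ 1200 MHz", 1200, 2),
    ("GDDR3 @ 1300 MHz (2,6 GHz eff.)", 1300, 2),
    ("HD 4870, GDDR5 @ 900 MHz", 900, 4),
]

results = {name: bandwidth_gbs(256, clock, tpc) for name, clock, tpc in configs}
for name, gbs in results.items():
    print(f"{name:<34} {gbs:6.1f} GB/s")
```

This only compares peak numbers; the point made above about GDDR5 efficiency on RV770's memory controller is exactly the part a peak figure cannot show.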
CarstenS
29-Mar-2009, 11:25
So from the discrepancy between NVidia's numbers and Carsten's measurements, it seems that there's 0.5mm of sealant/packaging on each of the width and height. Useful to bear in mind when future die measurements are performed...
Jawed
Here's some new stuff to ponder about...
G80 (90nm): 486,5 mm² (21,7 mm x 22,42 mm)
GT200 (65nm): 606,9 mm² (24,77 mm x 24,5 mm)
GT200b (55nm): 497,3 mm² (22,3 mm x 22,3 mm)
G92b (55nm): 264,8mm²
RV770 (55nm): 274,7mm²
http://www.pcgameshardware.de/aid,679763/Reale-Chipgroessen-von-G80-GT200-und-GT200b-nachgemessen/Grafikkarte/News/
(in german only, so you might want to use your favourite online translator)
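The measured areas above are easy to cross-check against the quoted edge lengths, and against NVIDIA's claimed 24.3x24 mm for GT200 mentioned earlier in the thread. The sketch below just multiplies the numbers out and shows the per-edge difference that was attributed to sealant/packaging.

```python
# Cross-check of the measured die sizes quoted above (edge lengths in mm).

measured = {
    "G80 (90nm)": (21.7, 22.42),
    "GT200 (65nm)": (24.77, 24.5),
    "GT200b (55nm)": (22.3, 22.3),
}

areas = {name: w * h for name, (w, h) in measured.items()}
for name, area in areas.items():
    print(f"{name:<14} {area:6.1f} mm^2")

# NVIDIA's claimed GT200 dimensions, as quoted earlier in the thread.
official_w, official_h = 24.3, 24.0
official_area = official_w * official_h
gt200_w, gt200_h = measured["GT200 (65nm)"]

print(f"GT200 official {official_area:6.1f} mm^2")
print(f"extra width:  {gt200_w - official_w:.2f} mm")
print(f"extra height: {gt200_h - official_h:.2f} mm")
```

The extra width and height both come out at about half a millimetre, which is consistent with the 0.5mm estimate quoted above.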
GT200 (65nm): 606,9 mm² (24,77 mm x 24,5 mm)
That would count for 85 chips per wafer (gross selection), and from all the pictures available you can count definitely more there -- 94~93, by my own measurements.
CarstenS
29-Mar-2009, 12:28
I was at 95 (maybe minus 1). But take into account the glue around each die. That might make a small difference.
So each 55nm wafer is what, $4000-5000 each?
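For the dies-per-wafer and cost question, a rough sketch using one common gross-die approximation for a 300mm wafer. The formula ignores scribe lines, edge exclusion and yield, so it will not necessarily reproduce either the 85 or the 93 to 95 counted from wafer photos; the $4000-5000 wafer price is just the guess from the post above, not a known figure.

```python
import math

# Gross dies per wafer via a common approximation:
#   DPW = pi * d^2 / (4 * S) - pi * d / sqrt(2 * S)
# with d = wafer diameter (mm) and S = die area (mm^2).
# Ignores scribe lines, edge exclusion and defects, so treat results as rough.

def gross_dies(die_area_mm2, wafer_diameter_mm=300.0):
    """Approximate number of whole die candidates on a round wafer."""
    d, s = wafer_diameter_mm, die_area_mm2
    return int(math.pi * d * d / (4.0 * s) - math.pi * d / math.sqrt(2.0 * s))

def cost_per_gross_die(wafer_cost_usd, die_area_mm2):
    """Wafer cost spread evenly over gross dies (no yield loss modelled)."""
    return wafer_cost_usd / gross_dies(die_area_mm2)

for label, area in [("GT200 measured", 606.9), ("GT200 official", 583.2), ("GT200b measured", 497.3)]:
    print(f"{label:<16} {area:6.1f} mm^2 -> ~{gross_dies(area)} gross dies")

# Wafer prices are the guess from the thread, used only for illustration.
for wafer_cost in (4000, 5000):
    print(f"${wafer_cost} wafer, GT200b: ~${cost_per_gross_die(wafer_cost, 497.3):.0f} per gross die")
```

Note how sensitive the count is to the area used: the half millimetre per edge between the measured and official GT200 figures is already worth a few dies per wafer.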
Oh look, quad GT200 GPUs on one card! :runaway:
http://www.extrahardware.cz/files/images/clanky/2009/03brezen/gtx390/nahledy/IMG_1287.jpg
http://www.extrahardware.cz/node/3266
Wow, it's April [the first] already! :D
AnarchX
01-Apr-2009, 09:24
Really nice looking fake or it is a real GT200b workstation-card with 4 GPUs, since there are rumors about GTX 295 equivalents:???:
Pretty nicely done, can't even make out which card it's based on... Do we even have pictures of the coming single-pcb 295?
AnarchX
01-Apr-2009, 09:35
4x 6-Pin + PEG = 375W maximum board power.
8 memory chips per GPU = 256-Bit per GPU.
Two Display Ports.
Now combine it with the Quadro FX 3800: 256-Bit, 108W board power:
http://www.nvidia.com/object/product_quadro_fx_3800_us.html
So this might really be a Quadro FX 5x00X4, maybe aimed at the new Multi-OS SLI.
But of course it's April 1st...
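The power and bus-width reasoning above, as a quick sketch. It assumes 75 W from the PEG slot and 75 W per 6-pin connector, plus 32-bit wide memory chips; the 108 W figure is the Quadro FX 3800 board power quoted in the post, used as-is.

```python
# Board power budget and per-GPU bus width for the 4-GPU card in the photos.
# Assumptions: 75 W from the PCIe x16 (PEG) slot, 75 W per 6-pin connector,
# and 32-bit memory chips.

PEG_SLOT_W = 75
SIX_PIN_W = 75

def max_board_power(six_pin_connectors):
    """Maximum in-spec board power for a slot plus N 6-pin connectors."""
    return PEG_SLOT_W + six_pin_connectors * SIX_PIN_W

def bus_width_bits(memory_chips, bits_per_chip=32):
    """Memory bus width if every chip has its own 32-bit channel."""
    return memory_chips * bits_per_chip

budget = max_board_power(4)
per_gpu_budget = budget / 4
quadro_fx_3800_w = 108  # board power quoted in the post

print(f"board budget:   {budget} W")
print(f"per-GPU budget: {per_gpu_budget:.2f} W")
print(f"bus per GPU:    {bus_width_bits(8)}-bit")
print(f"4x FX 3800:     {4 * quadro_fx_3800_w} W")
```

Four full FX 3800 boards' worth of power would exceed the connector budget, though a single shared PCB would not need to replicate everything four times, so this is only a rough bound.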
If real it's some kind of Tesla, no display connectors and no NVIO (and imho not enough PWM).
And the noise from that cooler can only go into a data center, not a living room ;)
edit: ok, looks like it has 2x displayport..
AnarchX
01-Apr-2009, 10:07
http://img24.imageshack.us/img24/2410/kecxlgh73z7ns6530605.jpg (http://img24.imageshack.us/my.php?image=kecxlgh73z7ns6530605.jpg)http://img23.imageshack.us/img23/2161/13ev8m88ev6fs.jpg (http://img23.imageshack.us/my.php?image=13ev8m88ev6fs.jpg)
http://img23.imageshack.us/img23/6450/1p19qmbvo4a5s6566230.jpg (http://img23.imageshack.us/my.php?image=1p19qmbvo4a5s6566230.jpg)
:lol:
http://translate.google.ch/translate?u=http%3A%2F%2Fdiy.yesky.com%2Fvga%2F414 %2F8778914.shtml&sl=zh-CN&tl=de&hl=de&ie=UTF-8
Too sad that GT212 seems to be canceled in reality...
090401 might be a clue, eh?
Jawed
compres
01-Apr-2009, 11:08
090401 might be a clue, eh?
Jawed
:lol:
Does this Digitimes piece mean anything (quoting Commercial Times)?:
Nvidia recently set its outsourcing schedule with Taiwan Semiconductor Manufacturing Company (TSMC) and United Microelectronics Corporation (UMC) for the second quarter, according to a Chinese-language Commercial Times report.
In addition to expanding its 55nm GT200b GPU outsourcing, the company will also start mass production of the 40nm entry-level GT218, high-end mobile GT215 and mainstream GT214 and GT216 GPUs in mid to late second quarter, according to the paper.
We're already more than half way through Q2 :???:
Jawed
Does this Digitimes piece mean anything (quoting Commercial Times)?:
That the GT214 is still alive possibly. The chip is functionally the same as the GT215, just a smaller die area and bit cheaper i guess.
The GT218 and GT216 should be coming first though, maybe later than originally expected, would be very lucky if they can launch either before june.
The GT215 is also quoted as a mobile part, guess nvidia might want this to compete with the mobile RV740...dont think started producing this yet, so maybe middle late Q3 at best.
trinibwoy
21-Apr-2009, 12:35
Does this Digitimes piece mean anything (quoting Commercial Times)?:
We're already more than half way through Q2 :???:
Calendar Q2? That's mid-May.
D'oh, slaps forehead :oops: :lol:
Jawed
http://vr-zone.com/articles/nvidia-preparing-40nm-geforce-gt-240m--g210m/6954.html?doc=6954
Now, VR-Zone has learned that Nvidia is preparing GeForce GT 240M and GeForce G210M for launch somewhere between May-June.
But no sign of any desktop parts.
Jawed
isn't the GT240M a 40nm variant of G92 (9800GT) versus the GTX260/80M G92?
http://vr-zone.com/articles/nvidia-preparing-40nm-geforce-gt-240m--g210m/6954.html?doc=6954
The article has speculated that the G210M is GT216 based and the GT240M is GT215 based. I don't think this is correct, the GT215 is delayed somewhat.
The current mobile lineup is roughly:
G 1X0M - G98 ver 2 - 64 bit
GT 1X0M - G96 - 128 bit
GTS 1X0M - G92 - 256bit (...with units disabled, might also have snuck some G94s in there as well)
GTX 2XXM - G92 - 256bit
the above i think being replaced by:
G 210M - GT218 - 64 bit
GT 240M - GT216 - 128 bit
GTS 2XXM - GT215 - 192 bit
GTX 2XXM - G92 - 256 bit
But no sign of any desktop parts.
The GT218 and GT216 are too far along, very unlikely to be cancelled now, desktop parts should turn up somewhere. Suppose there is some possibility that the yield is presently too low on these such that the 55nm equivalent parts are cheaper. The parts are thus being restricted to the mobile space for their power saving till they can get yields up sufficiently.
I wonder if NVidia has GDDR5 working on any of these 40nm chips. If it wants to compete with HD4770 it needs GDDR5 or there's going to be a lot of spare die in the ~190mm² needed for a 256-bit bus.
Jawed
I wonder if NVidia has GDDR5 working on any of these 40nm chips. If it wants to compete with HD4770 it needs GDDR5 or there's going to be a lot of spare die in the ~190mm² needed for a 256-bit bus.
Sorry for delay replying.
The 2 lower end chips GT216 and GT218 dont appear to have GDDR5 support. GDDR5 is supposed to debut with the GT300 series when the MCs get redesigned. The last chip the GT215 i suppose there is some small chance.
These chips are designed as shrinks of current tech for low end/mobile/oem customers. Idea was to clean up all of the high volume products to deal with the lower asps expected over the next 12 plus months.
The first 2 chips are aimed lower than the RV740. The GT215 is a G92, die size is just enough for a 256 bit bus, but current yields on TSMCs 40nm process mean that its not economic to go ahead with except as a mobile part for some power savings.
Are nVidia's 40nm woes more or less attributable to its use of domains? I can imagine that with 40nm being a leaky process, a part of the chip running at close to 2GHz is like the Niagara Falls.
CarstenS
06-May-2009, 09:06
It can't be that bad, considering HD 4770s power and OC characteristics - which allow for high clock rates also.
Is this idea that "mobile is lower volume than desktop" valid? Is the volume of laptops that might be fitted with the equivalent of $70 and $100 desktop discrete graphics lower than the discrete cards? I was under the impression that laptops are selling so much that this seems kinda unlikely.
It seems to me that laptop might be more important to NVidia because of higher margins.
Jawed
mobile volumes are indeed lower than desktop. 2006's numbers were 70+ million chips total, of which 50+ on desktop. discrete mobile numbers were around 4 million versus ~20m on the desktop market. yet somehow that 20% of the volume of discrete notebook chips helps revenue better than the other 80%.
DegustatoR
06-May-2009, 11:09
It seems to me that laptop might be more important to NVidia because of higher margins.
Mobile 40nm is more important because of less power consumption and longer introduction-to-market time. If you want to have some major design wins in the future you have to provide a product which is same/better than competitor's products now. Really high volumes of mobile GPUs will be needed later.
The first 2 chips are aimed lower than the RV740. The GT215 is a G92, die size is just enough for a 256 bit bus, but current yields on TSMCs 40nm process mean that its not economic to go ahead with except as a mobile part for some power savings.
Just following up on this. There is a thread on chiphell here:
http://bbs.chiphell.com/viewthread.php?tid=44437
Not 100% on the translation but it appears to be stating that nvidia is trying to get the GT215 running on UMCs 40nm process. This is generally considered to be running behind TSMCs equivalent process, so if above is true is somewhat remarkable. On the other hand Xilinx is apparently shipping samples of their Virtex6 FPGA to select customers with full availability scheduled for 2nd half of the year so i guess just maybe.
(If someone could check above translation, that would be greatly appreciated)
Also there is some confusion whether the part has a 192bit or 256bit memory interface. Die size is such that neither could be ruled out.
shyam335
11-May-2009, 19:12
Interesting. I had seen some news that UMC is first with HP 40nm.
But how much better is it compared to TSMC?
shyam335
11-May-2009, 19:17
ok maybe,it should be read as first with 'real' 40nm as said by vrz.
OT:wheres my edit button?!
ok maybe,it should be read as first with 'real' 40nm as said by vrz.
OT:wheres my edit button?!
You will get your edit button after you've posted more
The GT218 and GT216 should be coming first though, maybe later than originally expected, would be very lucky if they can launch either before june.
Think these will be lucky to launch before late june or early july, will be shortages otherwise.
Is some confusion over sp's count, was widely reported (http://www.guru3d.com/news/nvidia-gt218-40nm-specs-surfaced/) GT218 at 32 in january. Might come in lower at 16, not sure if this is a cut down part to increase yields which have been problematic. Similarly GT216 at 48 and 64 sp's depending on who you believe.
Finally have no info on dates for GT215. :cry: Guess will be quite a way into Q3.
Blazkowicz
20-May-2009, 15:31
that 32SP card with 800MHz doesn't seem that terrible, somewhat half a RV730 if that's to be believed.
considering it would probably end up with regular desktop 800MHz ddr3 and sold at 30€, that will be a nice crappy card.. so much better that the 9200SE, X300SE, FX5200 and Intel 945 stuff you can still regularly come across on random existing PCs.
that 32SP card with 800MHz doesn't seem that terrible, somewhat half a RV730 if that's to be believed.
considering it would probably end up with regular desktop 800MHz ddr3 and sold at 30€, that will be a nice crappy card.. so much better that the 9200SE, X300SE, FX5200 and Intel 945 stuff you can still regularly come across on random existing PCs.
The bigger brother GT216 is supposed to go against the RV730. GT218 is more to replace G96/G98, i guess competing with RV710 and older products like RV635. How both chips go i guess all comes to down to the yield, how good it is now and how it ramps over the next 6 months or so. Net price makes or breaks.
Interesting nobody much is looking at or talking about the GT214/GT215 and all things that have happened. Seemed to have moved on to GT300 blissfully ignorant of all the events that have occurred and what they mean.
trinibwoy
25-May-2009, 04:15
Seemed to have moved on to GT300 blissfully ignorant of all the events that have occurred and what they mean.
What do they mean? I think everyone has already recognized that Nvidia is having all sorts of problems getting their chips out. Or are you referring to a possible change in strategy for GT3xx derivatives? I'm still wondering how Nvidia plans to combat RV870 given its expected featureset advantage over GT2xx.
Interesting nobody much is looking at or talking about the GT214/GT215 and all things that have happened. Seemed to have moved on to GT300 blissfully ignorant of all the events that have occurred and what they mean.
Uhm, GT212 (GDDR5) and GT214 (GDDR3) were replaced by GT215 (GDDR5) because: a) GT212 was too big. b) GDDR3 on GT214 was dumb. End of story, nothing to see here folks.
DegustatoR
29-May-2009, 20:10
Uhm, GT212 (GDDR5) and GT214 (GDDR3) were replaced by GT215 (GDDR5) because: a) GT212 was too big. b) GDDR3 on GT214 was dumb. End of story, nothing to see here folks.
So what's GT215 in your opinion? GT214 with GDDR5 or something bigger?
Fuad is now saying Nvidia's 40nm line up, most presumably the GT21x are DirectX 10.1, wonder how Nvidia will play this up? (Keeping in mind, 100% of their current line up will suffer from a pseudo negative performance impact if they push devs for DirectX 10.1)
Fuad is now saying Nvidia's 40nm line up, most presumably the GT21x are DirectX 10.1, wonder how Nvidia will play this up? (Keeping in mind, 100% of their current line up will suffer from a pseudo negative performance impact if they push devs for DirectX 10.1)
I heard about that many months ago, yeah. Never mentioned it because: a) I initially couldn't. b) I wasn't completely sure.
We'll see what happens. Fun stuff.
Hehe, I remember us talking about it early this year. Indeed funny stuff if it's true.
Btw... GT220 and G210 here we come... in August. Oh and both will be available in DDR2 and DDR3 versions. GT220 will be placed in the $55-$60 slot while G210 will be in the $30-$35 bracket.
Hehe, I remember us talking about it early this year. Indeed funny stuff if it's true.
Btw... GT220 and GT210 here we come... in August. Oh and both will be available in DDR2 and DDR3 versions. GT220 will be placed in the $55-$60 slot while GT210 will be in the $30-$35 bracket.
CJ, are you sure about the name 'GT210' for the lowest end? In the retail names, aren't the lowest-end parts prefixed with just a plain 'G' and the mainstream ones with 'GT'?
Arun re the GT215 having GDDR5, that is news. Nobody seemed sure, few said GT3XX parts was where that would debut. Also problems with the GT214/215 wrt leakage and TSMC40nm, rumor was they had to redo the design in order to mitigate the problem.
Oops. Sorry you're right. It says G210 here. My bad.
trinibwoy
02-Jun-2009, 01:59
Btw... GT220 and GT210 here we come... in August.
Whoa. Are things still that bad even for chips that small? Still no HD4770's in sight either. What's really going on with TSMC....can't remember things ever being this bad in recent history.
One thing to note maybe is that they changed the retail name for the GT216 from GT240 => GT220. Hinting that perhaps the first part released might have reduced units and/or reduced clocks to get more volume. Some time later a full strength chip will appear with higher clocks and/or more units.
Suppose this could be a part to release in bulk to OEMs then closer to christmas a higher performing more retail oriented part for regular consumers.
trinibwoy
02-Jun-2009, 04:03
How does this all tie in with the mobile parts that are supposedly launching at Computex this week? Or is that BS?
Some more info...
GeForce GT 220 can come in three versions: P681 (DDR3), P682 (DDR2), P680 (GDDR3). Design kits are available early June. Mass production in August. G210: P690 (DDR3), P691 (DDR2), design kits available in June. MP in August.
DegustatoR
02-Jun-2009, 08:27
CJ, you're talking about discrete graphics, right?
http://www.digitimes.com/news/a20090602VL201.html
Q: Just to finish can we talk briefly about the manufacturing side at Nvidia. You're currently in the process of moving to Taiwan Semiconductor Manufacturing Company's (TSMC's) 40nm process, how is that going?
A: We're moving to TSMC's 40nm node with our OEM products first. This is mainly due to capacity; TSMC is not building up their 40nm capacity until the second half of this year, probably close to the fourth-quarter, so based on the type of volumes we do in the channel, millions and millions of units per quarter, it doesn't make sense for us to move those products to 40nm just yet. But for the tons of OEM design wins we have on desktops and notebooks, we need to support those, and so we're using TSMC's limited 40nm capacity to supply those customers first. We'll transition to 40nm for our channel products probably at the end of this year.
Jawed
"tons of OEM design wins we have on desktops and notebooks"
Is this their way of saying:
"We've lost many OEM deals because we messed up last year and can't get 40nm products out until September"
I know TSMC's problems are AMD problems too ... but, at least they have a 40nm product on shelves now.
Anyone know how AMD is doing with design wins?
trinibwoy
02-Jun-2009, 12:47
I know TSMC's problems are AMD problems too ... but, at least they have a 40nm product on shelves now.
Not as far as I can see. A month after launch and it's out of stock everywhere with lead times up to two weeks at some retailers. So far RV740's woes are corroborating Nvidia's side of the story.
Haven't heard anything about OEM design wins for AMD's 40nm products but that could be because they don't toot their own horn nearly as loud as the competition.
Well AMD used to tout design wins with Puma, and in the end it was rather non-existent (the Neo on dv2 alone probably did better than the whole Puma ordeal, and Neo only has 2 design "wins" AFAIK, HP and MSI)
So design win touts could mean anything, I suppose.
For something that's supposed to go into mass production in August, seems rather unnatural that it'd score large sums of wins now, too. WRT AMD's 55nm mobile parts though, the 40nm nVidias could do quite some damage, so we aren't really too sure what's the clear line out of this.
http://www.semiaccurate.com/2009/06/02/nvidia-40nm-parts-yawn-delayed-again/
short news, no 40nm before August. no DX10.1 (it wouldn't make sense for me anyway) or wait.. is this CJ's post as a news blurb?... :(
CarstenS
02-Jun-2009, 21:15
Nah, cannot be. Remember: We'se are not worthy of Charlie's time.
LordEC911
02-Jun-2009, 21:29
http://www.semiaccurate.com/2009/06/02/nvidia-40nm-parts-yawn-delayed-again/
is this CJ's post as a news blurb?... :(
That's what I was thinking as well.
I have seen this stuff happen time and time again over the past couple of years.
I think he's seen the same info that I've seen... he specifically mentioned something I didn't... which is that the prices quoted are FOB, which means that end users will have to pay more in the end.
He did however make the same mistake as I did by calling the lowest part GT210 instead of G210, so I think he just has to read the slides a bit better.
How about a hint about what if anything, NVidia will be showing at show this week.
Just looking at the semiaccurate article it might be worth abandoning the GT21X code names and substituting the NV21X style names instead.
The clash between the retail style name and internal code name is kind of confusing, especially if not following too closely. Secondly not really sure how much tesla these mainstream parts will have.
Trying this now:
How about a hint about what if anything, NVidia will be showing at show this week.
They are supposed to announce(or maybe only show in private not sure) the NV216 and NV218 chips. Also an outside chance of samples of the NV214/215 also being on show, though likely not announced. If you are interested in the high end i think it is also the official announcement of the GTX275. There is also quite a bit of ion stuff + 3d glasses to appeal to ordinary people.
:lol: "We admit it's good and crunchy" -- Nvidia finally embraces DirectX 10.1 (http://www.fudzilla.com/content/view/14034/1/)
So, the mobile GT2** would be a test vehicle for the NV's DX10.1 impl?
:lol: "We admit it's good and crunchy" -- Nvidia finally embraces DirectX 10.1 (http://www.fudzilla.com/content/view/14034/1/)
So, the mobile GT2** would be a test vehicle for the NV's DX10.1 impl?
This is Nvidia's first 40nm chip and it's DirectX 10, but there is a big chance that it even supports DirectX 10.1. Unfortunately, we cannot confirm this at press time.
It will also come with DX11, or it might not. It might also ship with a picture of JHH's abdomen, or .. it might not.
http://www.fudzilla.com/content/view/14035/1/
Nvidia's first desktop 40nm chip is finished. It is called NV218 and should most likely end up with the GT210 brand. So far, this chip is OEM only, and it is currently shipping to big OEMs such as Acer and Dell.
Jawed
http://www.fudzilla.com/content/view/14050/1/
The card has 128-bit memory interface and it comes with 1GB of DDR3 memory. It should cost slightly over $50 and the first shipments should be available in late June.
Jawed
Next news they'll post is that GT216 has 48 SPs and GT218 has 24 SPs. And then it's the clockspeeds (600-650 for GT216 and ~600 for GT218)... and finally a newsarticle with the Vantage scores (~P3xxx & ~P1xxx)... that way they'll have news and hits for the next couple of days. ;)
Next news they'll post is that GT216 has 48 SPs and GT218 has 24 SPs. And then it's the clockspeeds (600-650 for GT216 and ~600 for GT218)... and finally a newsarticle with the Vantage scores (~P3xxx & ~P1xxx)... that way they'll have news and hits for the next couple of days. ;)
You caught on to them too eh? Funny how they can post the same article three times and no-one seems to notice.
trinibwoy
06-Jun-2009, 23:10
Everybody noticed long ago and we all bitched about it back then too.
24 SP's = uninspired GT200 clones? Meh.
Silent_Buddha
07-Jun-2009, 02:01
Does 1 GB of memory even make sense for an approximately 50 USD chip and the performance it would theoretically be capable of?
Is this possibly just Nvidia trying to offload more inventory? I'm just struggling to see how 1 GB would be useful on a low end budget chip.
Regards,
SB
So is there any credibility to the idea of a DX10.1 part?
If so the parts must surely be very late with DX11 due out in only a few months O_o
Or they are actually DX11?!
Next news they'll post is that GT216 has 48 SPs and GT218 has 24 SPs. And then it's the clockspeeds (600-650 for GT216 and ~600 for GT218)... and finally a newsarticle with the Vantage scores (~P3xxx & ~P1xxx)... that way they'll have news and hits for the next couple of days. ;)
I now have 3 different unit counts for GT218 - 16,24 and 32, and still 2 for GT216: 48 and 64. If GT216 is released at only 48shaders think might be a bit of a gap above it to the G92 or GT214/215(supposed to be 128SPs). The more aggressive unit counts were early before nvidia perhaps saw the lay of the land at 40nm.
Re Fudzilla 'GT210' name turns up again, what a virulent meme.
Does 1 GB of memory even make sense for an approximately 50 USD chip and the performance it would theoretically be capable of?
Is this possibly just Nvidia trying to offload more inventory? I'm just struggling to see how 1 GB would be useful on a low end budget chip.
Regards,
SB
It all depends on the application; GTA4 is an extreme sample, but in benchmarks made by iirc quite reputable site (which name i can't recall atm, but i think it was some german site) HD38-series card with 1GB mem actually was faster than HD48-series card with 512MB mem when the textures were set to high(est) quality
Silent_Buddha
07-Jun-2009, 06:11
It all depends on the application; GTA4 is an extreme sample, but in benchmarks made by iirc quite reputable site (which name i can't recall atm, but i think it was some german site) HD38-series card with 1GB mem actually was faster than HD48-series card with 512MB mem when the textures were set to high(est) quality
In that case, Nvidia is expecting whatever this is to be as fast or faster than HD 46xx. And looking at those. Yeah I guess they do come in 512 meg and 1 gig sizes. Interesting, I hadn't looked at that market segment in a while.
Amazing that we have old HD 3870 level of performance in the 50 USD price range.
Regards,
SB
GT214/215(supposed to be 128SPs).
Last I heard about GT215 was four clusters of 24, so 96 SPs and 32 textures.
Vincent
07-Jun-2009, 18:49
Last I heard about GT215 was four clusters of 24, so 96 SPs and 32 textures.
Is there any upcoming GT21X as the 10 clusters of 24 :?:
Vincent
08-Jun-2009, 11:47
Everybody noticed long ago and we all bitched about it back then too.
24 SP's = uninspired GT200 clones? Meh.
Performance wise, it seems questionable to me.
So are NVidia's next GPUs really going to be D3D10.1? Is this because 10.1 is the "baseline" for W7, with any missing hardware capabilities either emulated (seemingly in Warp?) or a fall-back used (hurting performance?).
Jawed
Windows 7 can take advantage of D3D10.1 when rendering Aero, but it will only improve performance.
There will be no WARP emulation of features. WARP is only used if you explicitly create a software device.
Ah, I was under the impression that software rendering would kick in where the hardware is lacking.
Maybe that's only for Aero, but in this case the use of 10.1 is so limited (tracking hotspot on the taskbar, anything else?) that no-one will notice?
Jawed
Blazkowicz
08-Jun-2009, 13:43
no way they are 10.1, those are G80 derivatives again.
If I believe a few rumors and my opinion, I'd say the cards are 1x32 SP, 2x32 SP and 4x32 SP
DegustatoR
08-Jun-2009, 19:44
no way they are 10.1, those are G80 derivatives again.
10.1 is a somewhat simple update. NV can easily implement 10.1 support in a G8x derivative.
If I believe a few rumors and my opinion, I'd say the cards are 1x32 SP, 2x32 SP and 4x32 SP
Yeah, well, that doesn't mean that one of four multiprocessors in each TPC won't be used for redundancy on TSMC's 40G for the time being...
10.1 is a somewhat simple update. NV can easily implement 10.1 support in a G8x derivative.
If it really were that simple, why didn't nVidia implement DX10.1 in GT200, huh? :smile:
If it really were that simple, why didn't nVidia implement DX10.1 in GT200, huh? :smile:
Supposedly they have to change some stuff in the texture units to support a DX10.1 feature like Gather4/Fetch4. ATI already supported that ever since R580/RV530. So for ATI it was a small step to support DX10.1 while for NV it most likely is a bigger step.
DegustatoR
08-Jun-2009, 21:58
If it really were that simple, why didn't nVidia implement DX10.1 in GT200, huh? :smile:
Probably because 10.1 specs were published when GT200 design was already complete (10.1 was introduced in Autumn of 2007 with RV670; and GT200 was supposed to be launched at the end of 2007 or at the beginning of 2008). Implementing 10.1 support at this stage would push GT200 to the end of 2008 or even in 2009. For such an unimportant functionality update this wasn't an option. But it looks like for GT21x which went through the design stage after 10.1 specs were finalised it was.
Supposedly they have to change some stuff in the texture units to support a DX10.1 feature like Fetch4. ATI already supported that ever since R580/RV530. So for ATI it was a small step to support DX10.1 while for NV it most likely is a bigger step.
I'd say that most of changes for 10.1 support are needed in G8x ROPs. NVs TUs have supported something like Fetch4 since NV30. But per-MRT blending modes and programmable MSAA patterns never was a part of NVs ROPs.
Come to think of it, they may well implement 10.1 via shader ALUs in some parts. This way they'll have the checkbox and undermine 10.1 performance to the point where everyone will simply use 10.0 (and 11.0?) and not even bother with 10.1 anywhere.
http://222.73.168.146/en/NEWS/news_detail.aspx?id=20
February 2008 S3 had D3D10.1.
Jawed
And ATI had D3D10.1 in 2007, meaning the specs must have been available for some time then, at least to the GPU guys. Doesn't change anything on the fact that it's probably not so easy for nVidia, so they let go with GT200 and they will most probably let go with GT21x.
Supposedly they have to change some stuff in the texture units to support a DX10.1 feature like Fetch4. ATI already supported that ever since R580/RV530. So for ATI it was a small step to support DX10.1 while for NV it most likely is a bigger step.
The ROPs is where it's at.
willardjuice
09-Jun-2009, 01:27
The ROPs is where it's at.
And the feature in question I believe has been supported since the 9700 for ATi (someone might have to verify that). :razz:
Forgot to post this, found via vr-zone last week:
http://www.4gamer.net/games/086/G008634/20090605058/
Pictures of various laptops containing the new chips, including info from the nvidia control panel on the GT240 model:
Driver Version: 185.95
Stream Processors: 48
Graphics Clock: 550MHz
Memory Clock: 790MHz (1580MHz)
Memory interface: 128-bit
Dedicated video memory: 1024MB
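For a rough sense of what those control panel numbers imply, here is a back-of-the-envelope sketch of peak memory bandwidth. It assumes the 1580 MHz figure in parentheses is the effective (double data rate) transfer rate and that the full 128-bit interface is populated; the function name is made up for illustration and nothing here comes from NVidia.

```python
def peak_bandwidth_gb_s(effective_mt_s, bus_width_bits):
    """Peak memory bandwidth in GB/s (decimal gigabytes).

    effective_mt_s: effective transfer rate in millions of transfers per second
                    (assumed to be the bracketed 1580 figure from the panel).
    bus_width_bits: memory interface width in bits.
    """
    bytes_per_transfer = bus_width_bits / 8          # 128 bits -> 16 bytes
    return effective_mt_s * bytes_per_transfer / 1000  # MB/s -> GB/s


# Values as listed for the GT240 laptop model above.
gt240m_bandwidth = peak_bandwidth_gb_s(1580, 128)
print(f"{gt240m_bandwidth:.2f} GB/s")
```

This is a theoretical peak only; real-world throughput depends on the memory type actually fitted and how the controller is used.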
From the same event in english:
http://news.driversdown.com/News/200906/05-11271_4.html
However, the GPUs themselves aren't any different and will continue be using the G92 graphics architecture and not the GT200 architecture that's seen in the high-end desktop graphics cards of the GeForce GTX 285 and the likes. While this wasn't totally unexpected since the old G92 is plenty adequate in both performance and features, the new 40nm based mobile GPUs will take on newer but confusing naming scheme such as the GTX 260M and GTS 210M - and that something we dreaded since it wasn't even using the newer GT200 architecture to begin with. In any case here's a variety of notebooks you can expect in the next few months to sport this new graphics engine.
Hopefully someone at event bothered to double check all the high end models labelled GTX2xx were G92 based and there was no GT214/215 present.
http://www.semiaccurate.com/2009/06/14/nvidia-launches-5-40nm-mobile-parts-monday
English and full of vitriol. :D
DegustatoR
14-Jun-2009, 21:05
and that something we dreaded since it wasn't even using the newer GT200 architecture to begin with
That's weak. They should try harder next time.
put me down with a not surprised checkbox.
Hmmm, cannot find any of the GTS chips anywhere, not sure what is going on, also the numbers overlap G92 - GTX260M and GT214/215 - GTS260M which i dont think they have done before. Also was sure this chip was physically bigger, enough at least to support a 192bit interface....apart from the obvious GDDR5 support wondering why they have released 2 chips with 128bit interfaces.
According to Tech Report (http://www.techreport.com/discussions.x/17067), the new chips indeed support DX10.1, and are GT2xx based, not G9x
http://www.nvidia.com/object/geforce_m_series.html
PhysX is not supported on GT240M, GT230M and G210M. D3D10 is listed, not 10.1.
Jawed
Seems NVidia's specifications page for these GPUs is erroneous, as the Features page says this:
High Performance GeForce DirectX 10.1 Graphics Processor
NVIDIA® GeForce ®GTS enthusiast class GPUs include a powerful DirectX 10.1, Shader Model 4.1 graphics processor, offering full compatibility with past and current game titles with all the texture detail, high dynamic range lighting and visual special effects the game developer intended the consumer to see. Water effects, soft shadows, facial details, explosions, surface textures and intricate geometry create cinematic virtual worlds filled with adrenalin pumping excitement. Of course all these special effects run at high resolution and playable frame rates for immersive heart-pounding action.
Jawed
trinibwoy
15-Jun-2009, 17:17
Why no texturing specs? Does anyone know how these things are structured cluster wise?
Why no texturing specs?
That would give it all away, no? :grin:
I suspect that even with the purported distancing from G9X, the parts are still pretty much based on the same clustering, and castrated for initial yields. DP SFUs? I don't think so.
Would the following scenario hold true?
G92c with 128-bit bus and GDDR5 support plus some tweaks to support DX10.1?
G94c, and G96c on 64-bit (brutal halving of SPs as of now, scary)
128 - > 96
64 -> 48
32 -> 16
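To put numbers on the SP reductions listed above, a small sketch that works out the relative cut for each pair. The counts are the speculative ones from this post, not confirmed specs, and the variable names are just for illustration.

```python
# (previous SP count, rumoured new SP count), taken from the list above.
sp_pairs = [(128, 96), (64, 48), (32, 16)]

# Relative reduction for each pair, in percent.
cuts = [(old - new) / old * 100 for old, new in sp_pairs]

for (old, new), cut in zip(sp_pairs, cuts):
    print(f"{old} -> {new}: {cut:.0f}% fewer SPs")
```

On these figures only the smallest part is actually halved; the two larger ones lose a quarter of their SPs.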
I think it's 100% clear when Matt Wuebling says "Leverage the architecture of our Previous Desktop GPU's" in this Hexus interview.
http://tv.hexus.net//show/2009/06/COMPUTEX_2009_NVIDIA_launch_new_notebook_GPUs/
And then goes on to "based on a different architecture"
Does nVidia actually know what their GPUs are based on?
trinibwoy
15-Jun-2009, 19:46
You make it sound like they're mutually exclusive. Do you know of many chips that were not both leveraging past architectures and based on a new architecture?
I am a bit confused. So what chips are those new mobile cards using???
The 16 SP part more looks like a G92 derivative, while the other ones more like G200 derivative (not only because of the 16 SPs but also because it reaches a higher shader clock than any of the others, similar to what we see on desktop g92 vs g200).
Otherwise these chips look somewhat decent to me, particularly since they seem to have reasonable TDP. Now, the better-than-IGP G210M definitely looks a bit slow, the GT230M/GT240M however could compete quite well against some AMD parts (I doubt the GT240M can touch the HD4670 Mobility if both are using the top-end configuration, but a lot will depend on actual memory (clock) used - Nvidia now also (or did they before) support ddr3, but HD4650 Mobility should be in range I guess).
GTS250M/GTS260M also don't look too shabby, they might be slower than the top-end nvidia mobile chips but have way more reasonable power draw (though the GTS260M is quite bad compared to GTS250M here). Neither one will be able to touch HD4860 Mobility but HD 4830 Mobility should be in range I guess (particularly versions using GDDR5 - way less compute power than HD 4830 but way more memory bandwidth).
DegustatoR
15-Jun-2009, 21:48
I am a bit confused. So what chips are those new mobile cards using???
Not really hard to guess:
G 210 = GT218
GT 230/240 = GT216
GTS 250/260 = GT215
The 16 SP part more looks like a G92 derivative, while the other ones more like G200 derivative (not only because of the 16 SPs but also because it reaches a higher shader clock than any of the others, similar to what we see on desktop g92 vs g200).
I don't think that you may make any assumptions about GPU features from SP numbers and shader domain frequencies.
NV's saying that they all support 10.1 plus 250/260 support GDDR5 which means new ROPs are a given. That makes them more than derivatives of anything, closer to a GT200 evolution. (Probably why they're called GT21x.)
So... What's this?
http://i41.tinypic.com/jkd4d5.jpg
Ideas anyone?
OK, this looks to be good old G92 again. Wonder what's it doing in GT21xM announcement...
I wonder why NVIDIA won't release a higher level GT2xx 40nm GPU - with at least 192SP or more. I mean something to replace GT200B because GT300 is still about 6 months away.
Ailuros
15-Jun-2009, 22:45
They might have cancelled the fabled GT212 and diverted its resources elsewhere.
http://i41.tinypic.com/jkd4d5.jpg
Ideas anyone?
OK, this looks to be good old G92 again. Wonder what's it doing in GT21xM announcement...
Maybe NVidia's saving the full 128 ALU lanes version for desktop? Or even mobile, when the yields improve (GTX270M, say).
Jawed
This picture is G92b and was used by Nvidia in the latest prez to illustrate the GeForce GTX 280M.
Now here is the GT215 die shot :
http://www.hardware.fr/medias/photos_news/00/26/IMG0026274.jpg
trinibwoy
16-Jun-2009, 01:03
Nice. GT215 is the first invisible GPU!
Nice. GT215 is the first invisible GPU!
Maybe the colon represents a dual GPU solution comprised of very tiny chips?
-FUDie
Sorry I forgot that direct links don't work from hardware.fr. You can get the picture there : http://www.hardware.fr/news/10283/40nm-gddr5-dx-10-1-nvidia.html
So, it's official now... nVidia has DX10.1 support. And 40 nm as well :)
I wonder if the other rumours are true as well... that there will be desktop-derivatives of these new DX10.1 chips soon.
I've especially doubted that because where would that leave the DX11 parts? But you could say it's wishful thinking. I mean, I suppose most of us will want to see DX11 parts from both nVidia and AMD... but there is a possibility that nVidia's chips are delayed, so there will be a 40 nm DX10-refresh first, and that refresh may include DX10.1 support, just as on the mobile side.
It's fairly normal, with the introduction of a new GPU design, for the cheaper SKUs from the prior generation to hang around for a while until they're replaced by cheaper members of the new family. G80 was alone until G84 arrived 5 months later.
Apparently GT2xx family has only just been filled out with its smaller variants.
It seems unlikely that GT300 will be joined by cheaper members simultaneously. Unless the machinations at 40nm crunch everything up. Given that 3 GPUs have just launched simultaneously, maybe something similar could be repeated with GT300.
It'll be interesting to see if AMD's supposed lead means that there'll be several D3D11 ATI chips on the market by the time that GT300 launches, as part of AMD's strategy is faster family fill-out. RV770, RV730 and RV710 all arrived within 3 months of each other.
Jawed
It seems unlikely that GT300 will be joined by cheaper members simultaneously. Unless the machinations at 40nm crunch everything up. Given that 3 GPUs have just launched simultaneously, maybe something similar could be repeated with GT300.
Historically nVidia has always started on a new process with lower-end chips first, to 'test the waters'...
If these mobile 40 nm parts are the 'testing the waters'-phase, then GT300 is probably still a few months off (assuming it's 40 nm, and not 55 nm... but I think that's a safe assumption).
In that light I think there is room for some lower-end 40 nm desktop parts as well.
But looking at the actual desktop product line from nVidia, I'm not sure where it would fit in... Their high-end has moved to 55 nm quite recently (January this year, I believe?), and shrinking to 40 nm so soon would probably make the 55 nm shrink a poor investment.
So the 9800/GTS250-range would be the most logical target for 40 nm... But... that product line is getting VERY long in the tooth. Do they really want to shrink that architecture yet again?
So the way I see it, putting 40 nm on the desktop at this point wouldn't be a very good investment. They'll probably want to wait a few months, then shrink GT200 to 40 nm, and replace the older G92-based product line with it.
But that all depends on how GT300 is going. Will they have a 40 nm GT200 as mainstream parts, with a GT300 high-end? Or will there be mainstream GT300 variations at launch? Or perhaps GT300 is more than just a few months off?
DegustatoR
16-Jun-2009, 11:32
Nice die shot, too bad the quality is too low to clearly see what is copy-pasted and what is not, and which blocks are the same ones but synthesized multiple times
Maybe this one is better?
http://i41.tinypic.com/10pzgaq.jpg
I wonder if it's only two GPUs really with 230/240 and 250/260 using the same GPU but with half of TPCs in 230/240 case...
That would mean that 214/215 is still somewhere out there...
Their high-end has moved to 55 nm quite recently (January this year, I believe?), and shrinking to 40 nm so soon would probably make the 55 nm shrink a poor investment.
Bear in mind that G94, G96 and GT200 all had 65nm and then, within a few months, 55nm versions. Only G92 had a long interval. The interval seemed longer because NVidia had so much 65nm inventory. So while short intervals imply wasted money, the process nodes have mucked things up on a fairly substantial scale this last 18 months.
So the 9800/GTS250-range would be the most logical target for 40 nm... But... that product line is getting VERY long in the tooth. Do they really want to shrink that architecture yet again?
I have been expecting GT200 to be shrunk, with a performance increment (i.e. more ALUs and prolly more TMUs).
The biggest of these newest GPUs are essentially G92 shrunk to 40nm (admittedly with a bit of a shortfall in ALUs & TMUs - but that didn't hurt G94 much in comparison with G92) with added features (D3D10.1) and tweaks (some GT200 texturing efficiency and register file sizing?). So there's no need to do any more shrinking for GPUs below GT200.
These prolly should have been here last November, but who knows, eh?
Will they have a 40 nm GT200 as mainstream parts, with a GT300 high-end?
Seems pretty likely to me, for what it's worth. There'll be a huge performance gap between GT300 and GT215 - assuming that the largest of these NVidia GPUs is indeed GT215.
Jawed
I wonder if it's only two GPUs really with 230/240 and 250/260 using the same GPU but with half of TPCs in 230/240 case...
That would mean that 214/215 is still somewhere out there...
It wouldn't be surprising, since NVidia needs some kind of redundancy.
Unless what's actually just launched is a single GPU :shock: Turning off half the ALUs/TMUs for the crappy GT240M/GT230M. And additionally turning off half the memory and ROPs for the shitfactor G210M.
Holy fuck.
Jawed
Bear in mind that G94, G96 and GT200 all had 65nm and then, within a few months, 55nm versions. Only G92 had a long interval. The interval seemed longer because NVidia had so much 65nm inventory.
True, but the G94 and G96 are also higher volume parts, so you can get your return on investment more quickly.
The biggest of these newest GPUs are essentially G92 shrunk to 40nm (admittedly with a bit of a shortfall in ALUs & TMUs - but that didn't hurt G94 much in comparison with G92) with added features (D3D10.1) and tweaks (some GT200 texturing efficiency and register file sizing?). So there's no need to do any more shrinking for GPUs below GT200.
Well, that is if you don't make any distinction between mobile and desktop chips.
Currently these 40 nm chips are only for mobile parts. Which means the desktop line still needs a shrink (even though nVidia has already done the work for it with the mobile line). The production of desktop parts is still limited to 55 nm. Hence my question... when nVidia puts 40 nm desktop parts into production, will they be derivatives of this mobile line, or will they go for something else on the desktop?
Seems pretty likely to me, for what it's worth. There'll be a huge performance gap between GT300 and GT215 - assuming that the largest of these NVidia GPUs is indeed GT215.
That is what you'd expect, given the past strategy of nVidia. The 8800 series launched as high-end first. Everything else was just last gen's DX9 hardware.
By the time the 8600 arrived, they had already revamped the core logic a bit.
Just wondering if that strategy isn't a bit dated now. Then again, Intel still does it with their CPUs as well.
DegustatoR
16-Jun-2009, 12:19
Seems pretty likely to me, for what it's worth. There'll be a huge performance gap between GT300 and GT215 - assuming that the largest of these NVidia GPUs is indeed GT215.
Jawed
Even if we assume that GT215 isn't that 96 SPs / 32 TUs chip but some kind of a bigger G92b replacement (192 SPs, 64 TUs, 192-bit GDDR5?) the gap between it and 512 SP 512-bit GDDR5 G300 is still quite large.
GT212 cancellation may mean something here. I hope that some kind of G30x middle class GPU (~256 SPs, 256-bit GDDR5?) which should fill this gap left by GT212 cancellation isn't that far off...
Looking at these pictures:
G210M:
http://www.pcgameshardware.com/&menu=browser&mode=article&image_id=1144031&article_id=687342&page=1
GT230M
http://www.pcgameshardware.com/&menu=browser&entity_id=154104&image_id=1144035&article_id=687342&page=1
GT240M
http://www.pcgameshardware.com/&menu=browser&entity_id=154104&image_id=1144039&article_id=687342&page=1
there's at least 2 GPUs in this range (though "GT240M" versus "GT230M" looks like a photochop). So there doesn't seem to be an MC and/or ROP redundancy option amongst these SKUs.
Considering that the 128-bit memory bus should occupy around 18mm of perimeter (RV770's 256-bit bus occupies 36mm of perimeter), is it reasonable to guess that the biggest of these is about 11.3x12mm, 137mm²?
If this biggest chip is: 96 MADs+24MULs, 32 TMUs, 16 ROPs, 128-bit at 137mm², it makes for an interesting comparison with RV740's 640 MADs, 32TUs, 16 RBEs and 128-bit bus, doesn't it?
Put another way, this chip can do one MAD for 96 pixels in parallel while RV740 can do 5 MADs for 128 pixels in parallel. Even adjusting for clocks and counting the MUL, this NVidia GPU's compute density seems pitiful.
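For anyone who wants the arithmetic behind that comparison, here is a rough sketch in Python. It uses only the unit counts quoted above (96 pixels x 1 MAD versus 128 pixels x 5 MADs) and deliberately ignores clocks and the extra MUL, since no clock figures are given here. The function name is made up for illustration.

```python
# Back-of-envelope peak MAD issue per clock, from the unit counts in the post.
# Clock speeds and the co-issued MUL are NOT modelled (assumption: per-clock only).

def mads_per_clock(pixels_in_parallel: int, mads_per_pixel: int) -> int:
    """Peak MADs issued per clock for a given width and per-pixel MAD count."""
    return pixels_in_parallel * mads_per_pixel

nvidia_chip = mads_per_clock(pixels_in_parallel=96, mads_per_pixel=1)
rv740 = mads_per_clock(pixels_in_parallel=128, mads_per_pixel=5)

ratio = rv740 / nvidia_chip  # roughly 6.7x in RV740's favour, before clocks

print(f"NVidia chip: {nvidia_chip} MADs/clk")
print(f"RV740:       {rv740} MADs/clk")
print(f"ratio:       {ratio:.2f}x")
```

A higher shader clock on the NVidia part and the MUL would narrow the gap, but they would have to make up a lot of ground.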
Does this new GPU have double-precision?
Jawed
Well, that is if you don't make any distinction between mobile and desktop chips.
Why would NVidia make a distinction?
Are the prior mobile GPUs distinct from the desktop GPUs? (I don't honestly know, for what it's worth).
Jawed
Why would NVidia make a distinction?
Are the prior mobile GPUs distinct from the desktop GPUs? (I don't honestly know, for what it's worth).
In a way they are, yes. At the very least they are clocked lower than their desktop counterparts and/or binned differently. After all, mobile GPUs need to work under different heat and power requirements. So you'd need to have a slightly different validation process at least.
Mobile parts may also be built on a slightly different manufacturing process, focusing less on maximum performance, but more on minimum leakage and/or size.
In some cases, mobile GPUs also have the video-memory on the GPU package. So they are physically different.
For example, this ATi Mobile Radeon 4690:
http://files.macbidouille.com/mbv2/news/news_06_09/e4690.jpg
The question 'why' isn't that important, since clearly nVidia DOES make the distinction. They specifically announced mobile parts, no sign of any desktop parts at 40 nm and with DX10.1 yet.
trinibwoy
16-Jun-2009, 13:02
Put another way, this chip can do one MAD for 96 pixels in parallel while RV740 can do 5 MADs for 128 pixels in parallel. Even adjusting for clocks and counting the MUL, this NVidia GPU's compute density seems pitiful.
Wasn't that difference in "peak" compute density already baked into the architectures since R600/G80? Given the different approaches what compute density would you consider to not be pitiful?
In a way they are, yes. At the very least they are clocked lower than their desktop counterparts and/or binned differently.
If these chips are expensive for the foreseeable future (because 40nm yields will ramp very slowly) and existing 55nm chips are cheaper in their desktop incarnations, then you might be on to something. But in terms of the chips, technologically, I don't see anything that restricts them from being used as desktop parts.
Jawed
Wasn't that difference in "peak" compute density already baked into the architectures since R600/G80? Given the different approaches what compute density would you consider to not be pitiful?
All of this is conditional on the die size of this new chip. Also remember that feature differences (D3D10.1) made a comparison against RV7xx problematic until now.
So, when someone measures it, and when the full specification is actually revealed, we'll have more to go on.
But, for what it's worth, this is looking more pitiful than GT200 versus RV770 implies - not surprising with the added features, but has NVidia also not benefitted from 40nm as much as AMD?
Since this isn't a desktop launch there's not much marketing muscle behind this, so we'll just have to wait it seems.
Jawed
If these chips are expensive for the foreseeable future (because 40nm yields will ramp very slowly) and existing 55nm chips are cheaper in their desktop incarnations, then you might be on to something. But in terms of the chips, technologically, I don't see anything that restricts them from being used as desktop parts.
Well no, you COULD use mobile parts for discrete videocards... just like it is possible to use mobile CPUs in a desktop.
It's just that this rarely happens. Mobile parts are generally more expensive, and have lower performance than their desktop counterparts (eg a GTX280M is nowhere near as fast as a discrete GTX280).
And although nVidia could easily create desktop variations of the current line of mobile products, it is a specific action they have to take, because it requires a slightly different manufacturing and validation process. Until then, their partners simply cannot order any 40 nm desktop parts, and as such there won't be any 40 nm-based videocards on the market.
Well no, you COULD use mobile parts for discrete videocards...
Hopefully someone will come up with evidence one way or the other.
Clearly ATI is using the same die for both mobile and desktop in the case of RV740. I just don't know what NVidia has done historically.
Jawed
Oh, as has been pointed out to me:
http://www.nvidia.com/object/product_geforce_gts_260m_us.html
shows a clearly different GPU, again. So three different GPUs have launched.
Jawed
Clearly ATI is using the same die for both mobile and desktop in the case of RV740. I just don't know what NVidia has done historically.
Same die yes, but they do run at different clockspeeds, and probably also at different voltages, so they have different validation processes at the least.
I'm not sure if they even physically come off the same production line.
Maybe some do and some don't? Back in the X1000 era, ATI launched a few chips which used strained silicon manufacturing (RV530-derived Rad. Mobility X1700), and those were just for the mobile market.
CarstenS
16-Jun-2009, 14:53
What I am missing a bit are the TMUs, which were directly coupled to the SIMDs previously.
Looking at the 96sp die, the SIMD arrays are mostly a carbon copy of GT200's, shrunk to the new process, of course, but the texture units are obviously no longer aligned to the clusters [sort of]:
http://img34.imageshack.us/img34/6901/gt200mobile.png
CarstenS
16-Jun-2009, 15:41
If you rotate the SIMD-blocks to the same alignment, you'll see that they're not the same size, nor identical otherwise wrt control logic (the "non-yellow-part").
http://img190.imageshack.us/img190/9509/tpc.png
Left: 40nm TPC
Right: 65nm TPC (w/ TMU)
Note: Images are not scaled to the corresponding process -- just for viewing convenience!
CarstenS
16-Jun-2009, 16:33
What I meant was this:
http://home.arcor.de/quasat/GT2xALUMirror.PNG
Could be a more loose alignment of the various blocks -- look at a hi-res die shot of GT200 and you'll notice how some of the TPCs are definitely looking a tad bigger, while in fact the building components are just a bit "shaken" off their places. ;)
nor identical otherwise wrt control logic (the "non-yellow-part").
The control logic blocks are functionally identical, despite looking a bit different from each other -- that is due to the extensive use of automated placement rather than hand-tuned circuits, which are used only for the critical parts.
CarstenS
16-Jun-2009, 20:01
I'm still skeptical. Those are too far apart for my taste to be attributed to automated circuitry placement variations. Plus, the SIMDs also look different from GT200's - at least for a 1:1 copy.
Did anyone come up with a decent theory on the blue marked regions btw?
http://home.arcor.de/quasat/GT2y.jpg
http://img40.imageshack.us/img40/5750/88137739.png
http://images.nvidia.com/products/geforce_gt_240m/GeForce_gt_240m_front_med.png
http://images.nvidia.com/products/geforce_gts_260m/GeForce_gts_260m_front_med.png
http://farm4.static.flickr.com/3640/3323338507_993b4519f3.jpg?v=0
Would it be fair to assume that the substrate on these are the same size, or at least for the 260 and RV740, which are both 128bit, gddr5 etc?
And in that way determine the die size (sure, all 3 are 3d renders, but I guess the scale is correct).
In that case I get basically the same size for both.
And do we have any hard info on the number of ROPS in GTS260M?
Just for fun...
Looking at the die and package shots nvidia provided for G210m (http://www.nvidia.com/object/product..._g210m_us.html)(under additional images) here, can see die is almost exactly 1/3 the package size.
Back in January vr-zone (http://vr-zone.com/articles/nvidia-gt218-card--specs-surfaced/6529.html?doc=6529) had a gt218 P692 (GDDR3?) board which via PCI-E (http://www.interfacebus.com/PCIe_Card_Dimensions.html) specs gave the package size as 22.6x22.6mm. Which implies the die size is 22.6mm / 3 = 7.67mm or 58.8mm2 die area for the G210m.
Similarly fudzilla (http://www.fudzilla.com/index.php?option=com_content&task=view&id=11755&Itemid=34) reckon the package size for the GT216 was 29mm. Looking at the GT240m (http://www.nvidia.com/object/product_geforce_gt_240m_us.html) that is also almost exactly 33% width and length of package size implying 29 x 0.33 = 9.66mm for die sides => die size of GT240m is 93mm2
Of course could have changed the packaging in the intervening 6 months or use different packaging for mobile and desktop parts...
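The package-ratio estimate above is easy to redo, so here is a small Python sketch of it. It assumes a square die whose side is exactly 1/3 of the package side (the ratio eyeballed from the product shots) and takes the 22.6mm and 29mm package sizes at face value. The helper name is just for illustration.

```python
# Die size estimated from package size, assuming a square die and a fixed
# die-to-package side ratio read off the product photos (assumption: 1/3).

def die_from_package(package_side_mm: float, die_to_package_ratio: float = 1 / 3):
    """Return (die side in mm, die area in mm^2) for a square die."""
    side = package_side_mm * die_to_package_ratio
    return side, side * side

g210m_side, g210m_area = die_from_package(22.6)    # GT218 package per the PCI-E spec page
gt240m_side, gt240m_area = die_from_package(29.0)  # GT216 package per fudzilla

print(f"G210m : {g210m_side:.2f} mm per side, {g210m_area:.1f} mm^2")
print(f"GT240m: {gt240m_side:.2f} mm per side, {gt240m_area:.1f} mm^2")
```

Taken literally, 22.6mm / 3 comes to about 7.5mm per side and roughly 57mm2 for the G210m, and 29mm / 3 gives about 9.7mm and 93mm2 for the GT240m. The result is very sensitive to the ratio: a couple of percent either way on the side length moves the area by several mm2.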
Now for heroic leaps....
Looking at the G210m (http://www.nvidia.com/object/product_geforce_g210m_us.html) and GT240m (http://www.nvidia.com/object/product_geforce_gt_240m_us.html) on the nvidia site under additional views for each chip the die shot and a shot of the pin layout.
Ok from previous post i get that G210m is 57mm2 and GT240m is 93mm2 ie GT240m is 1.65 times the size.
Flipping the package over and looking at the pins, if can make the really big assumption that the minimum distance between the pins will be constant between the 2 packages...i get that GT240m is 1.64x the size. This is close enough to the above measure to take this foolish leap...
Looking at GTS260m (http://www.nvidia.com/object/product_geforce_gts_260m_us.html) from measuring the pins and assuming constant separation gives it as 2.12x the size of the G210m.
ie by constant pin separation GTS260m is 121.5mm2
Obligatory caveat about the laughable reliability of this calculation: by constant pin separation the GTX280m is 2.87x the G210m size ie 167mm2 or a third less than it actually is in real life
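A quick Python sketch of the pin-pitch method, for what it's worth. It assumes the quoted multipliers are area ratios relative to the G210m and uses the rounded 57mm2 base, so the outputs land a little under the 121.5mm2 and 167mm2 quoted above, which presumably came from an unrounded base. The names are illustrative only.

```python
# Die area scaled from the G210m estimate using the area ratios read off the
# pin-layout photos (assumption: constant pin pitch across packages).

G210M_AREA_MM2 = 57.0  # rounded base estimate from the package-ratio method

AREA_RATIO_VS_G210M = {
    "GT240m": 1.64,
    "GTS260m": 2.12,
    "GTX280m": 2.87,
}

def scaled_area(ratio: float, base_area_mm2: float = G210M_AREA_MM2) -> float:
    """Estimated die area in mm^2 given an area ratio against the base chip."""
    return ratio * base_area_mm2

estimates = {chip: scaled_area(r) for chip, r in AREA_RATIO_VS_G210M.items()}

for chip, area in estimates.items():
    print(f"{chip}: ~{area:.1f} mm^2")
```

The GTX280m row is the sanity check: if that one comes out a third too small, the constant-pitch assumption clearly does not hold across every package.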
Would it be fair to assume that the substrate on these are the same size, or at least for the 260 and RV740, which are both 128bit, gddr5 etc?
Might also be worth correlating with the GPU being "replaced", e.g. G94 being replaced by GT215 (GTS260M), since it seems likely that the package sizes are the same.
I'm dubious that RV740's substrate matches though (except coincidentally). Unless a common third-party cooler design influences things?
And do we have any hard info on the number of ROPS in GTS260M?
Nope, NVidia's being shy about ROPs and TMUs. ROPs scale with bandwidth and theoretically the smallest of these chips, GT218, has a single cluster of 16 ALU lanes, so prolly has 8 TMUs.
The remaining question is whether these have had an overhaul as part of D3D10.1 changes or implementation of GDDR5 :???:
Jawed
Just for fun...
Looking at the die and package shots nvidia provided for G210m (http://www.nvidia.com/object/product..._g210m_us.html)(under additional images) here, can see die is almost exactly 1/3 the package size.
Back in January vr-zone (http://vr-zone.com/articles/nvidia-gt218-card--specs-surfaced/6529.html?doc=6529) had a gt218 P692 (GDDR3?) board which via PCI-E (http://www.interfacebus.com/PCIe_Card_Dimensions.html) specs gave the package size as 22.6x22.6mm. Which implies the die size is 22.6mm / 3 = 7.67mm or 58.8mm2 die area for the G210m.
Similarly fudzilla (http://www.fudzilla.com/index.php?option=com_content&task=view&id=11755&Itemid=34) reckon the package size for the GT216 was 29mm. Looking at the GT240m (http://www.nvidia.com/object/product_geforce_gt_240m_us.html) that is also almost exactly 33% width and length of package size implying 29 x 0.33 = 9.66mm for die sides => die size of GT240m is 93mm2
Ooh, that's nice.
Both those sizes would seem to be targeting the minimum for 64-bit and 128-bit buses. Earlier I was querying the ability to design a GPU for minimum area for a given bus size on a process, when the process is very new (i.e. hard to be precise).
Now I'm wondering if the long gestation of a node (18 months?) from the time the libraries are available to the time production starts, means that the designer can be fairly precise?
Jawed
Obligatory caveat about the laughable reliability of this calculation: by constant pin separation the GTX280m is 2.87x the G210m size ie 167mm2 or a third less than it actually is in real life
Aha, that's pretty cool, though this caveat does screw things up.
For what it's worth, I get a different size for GT216, using the "additional views" pix from NVidia. Assuming the package is 29mm on each side, the chip is indicated to be 10.1mm per side, 102mm².
It seems to me that GT215 has the same package size as the balls are exactly the same. This leads to 11.9mm per side, 141.6mm².
If we assume 0.5mm of die encapsulation, then these come out as 92mm² and 130mm².
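Those four figures can be reproduced with a few lines of Python. The sketch assumes square dies and reads "0.5mm of die encapsulation" as 0.5mm taken off each measured side length in total, which is the reading that matches the 92mm² and 130mm² numbers. The function name is invented for the example.

```python
# Square-die area from a measured side length, optionally removing an
# encapsulation allowance from the side (assumption: 0.5 mm total per side).

def die_area_mm2(measured_side_mm: float, encapsulation_mm: float = 0.0) -> float:
    """Area in mm^2 of a square die after subtracting encapsulation from the side."""
    side = measured_side_mm - encapsulation_mm
    return side * side

gt216_raw = die_area_mm2(10.1)        # about 102 mm^2
gt215_raw = die_area_mm2(11.9)        # about 141.6 mm^2
gt216_net = die_area_mm2(10.1, 0.5)   # about 92 mm^2
gt215_net = die_area_mm2(11.9, 0.5)   # about 130 mm^2

print(f"GT216: {gt216_raw:.1f} mm^2 measured, {gt216_net:.1f} mm^2 without encapsulation")
print(f"GT215: {gt215_raw:.1f} mm^2 measured, {gt215_net:.1f} mm^2 without encapsulation")
```

Half a millimetre off the side costs about 10mm² on chips this size, so the encapsulation guess matters as much as the package-size guess.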
Jawed
CarstenS
17-Jun-2009, 09:58
Possible?
http://home.arcor.de/quasat/GT2z.jpg
For what it's worth, I get a different size for GT216, using the "additional views" pix from NVidia. Assuming the package is 29mm on each side, the chip is indicated to be 10.1mm per side, 102mm²
Redoing the calc i get the same as above. Also 54.8mm2 for G210m. Originally i did it only using shot of the 2 chips one face up and one face down both of which were at different angles, tried hard to account for the perspective guess i slipped up somewhere though.
Afterwards noticed that there were also higher resolution straight up shots present...:oops:
It seems to me that GT215 has the same package size as the balls are exactly the same. This leads to 11.9mm per side, 141.6mm².
If we assume 0.5mm of die encapsulation, then these come out as 92mm² and 130mm².
Yes package does look very similar between both chips. If it is the same size then for the GTS260m i get just less than 140mm2
Not sure if i was imagining it, but thought i could see black bits sticking out from behind the silver covering they photoshopped on top on the GT240m and GTS260m images but not G210m.
I think fellix's ideas are on the right track, though the "stacked" stuff is quite a puzzler.
Carsten if you compare with the annotated GT200 here (even if there are some who doubt its accuracy):
http://www.techreport.com/articles.x/14934/2
you should see that TMUs and ROPs take up acres of space.
Jawed
I think fellix's ideas are on the right track, though the "stacked" stuff is quite a puzzler.
It looks like the PCIe Phy interface is damn harder to scale than anything else. ;)
By looking at the overall die, it definitely doesn't seem to be much pad limited -- the GDDR paddings are conveniently stretched all the way, without cornering or stacking, for that matter. You can clearly see the separation of command/address and data pads.
I should have said I think PCI Express is that long thin section down the left hand side - compared with other die shots, PCI Express is thinner than DDR.
That stacked stuff might be GDDR5-specific. I dunno.
Jawed
CarstenS
17-Jun-2009, 13:08
Carsten if you compare with the annotated GT200 here (even if there are some who doubt its accuracy):
http://www.techreport.com/articles.x/14934/2
Apart from this maybe not being very accurate: In the GT200-Shots, there's two vertical units "above" each set of processing cores - AFAIK that was scheduling/control and the two quad-TMUs attached to each SIMD. Those are missing in the 40nm shots altogether and the control-part is attributed to "texture" in the picture from techreport.
you should see that TMUs and ROPs take up acres of space.
Which you were one of the greatest critics of, IIRC. :)
I should have said I think PCI Express is that long thin section down the left hand side - compared with other die shots, PCI Express is thinner than DDR.
That stacked stuff might be GDDR5-specific. I dunno.
Excerpt from the GT200 die-shot:
http://img35.imageshack.us/img35/2811/77722033.jpg
The vertical row on the left side is the PCIe phy link -- the pattern looks very similar to the "stacked" one in the 40nm 96sp die.
Ooh, yeah, snap! Right, well, seems pretty conclusive to me.
I wonder what that long thin section is...
Jawed
Just a copy from the "nvidia strain" topic:
http://vr-zone.com/articles/nvidia-partners-get-ready-40nm-g210--gt-220-cards/7272.html?doc=7272
vr-zone reports mass production of 40nm won't start until August, with mass availability for both laptop and desktop parts not expected until October, i.e. Win7 time.
GPU-Z/VR-Zone's reporting the wrong shader count, 24 instead of 16, on GT218.
Jawed
Possible?
http://home.arcor.de/quasat/GT2z.jpg
Nope. The TMUs are right next to the ALUs and should be much larger. The layout of the TMUs of this chip must be irregular.
GT200:
http://mental-asylum.de/files2/gt200marked.jpg
red = Vec8
green = octo TMU.
3xVec8 + octo TMU = TPC.
What you have marked as ROPs+TMUs is most likely the GDDR5 interface.
GPU-Z/VR-Zone's reporting the wrong shader count, 24 instead of 16, on GT218.
Hmm, I've been told that NVidia's specifications page is wrong and it's 24.
So that would appear to indicate the entire line-up is based upon 3 multiprocessors with a pair of quad TMUs per cluster.
So TMUs appear to be:
GT218 - 8
GT214 - 16
GT215 - 32
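The counts above follow from the cluster recipe, which can be sketched in Python as below. The per-cluster figures (3 multiprocessors of 8 lanes, plus a pair of quad TMUs) come from this thread. The cluster counts per chip are inferred from the TMU list and are an assumption, not a published spec. The middle chip is entered as GT216, which is what "GT214" is corrected to further down the thread.

```python
# Unit totals per chip, assuming every chip is built from identical clusters:
# 3 multiprocessors x 8 ALU lanes, plus a pair of quad TMUs (8 TMUs).

MULTIPROCESSORS_PER_CLUSTER = 3
LANES_PER_MULTIPROCESSOR = 8
TMUS_PER_CLUSTER = 2 * 4  # pair of quad TMUs

# Cluster counts inferred from the TMU list (assumption, not official).
CLUSTERS = {"GT218": 1, "GT216": 2, "GT215": 4}

def cluster_totals(clusters: int) -> dict:
    """ALU lanes and TMUs for a chip with the given number of clusters."""
    return {
        "alu_lanes": clusters * MULTIPROCESSORS_PER_CLUSTER * LANES_PER_MULTIPROCESSOR,
        "tmus": clusters * TMUS_PER_CLUSTER,
    }

totals = {chip: cluster_totals(n) for chip, n in CLUSTERS.items()}

for chip, t in totals.items():
    print(f"{chip}: {t['alu_lanes']} ALU lanes, {t['tmus']} TMUs")
```

With one cluster this gives 24 lanes and 8 TMUs for GT218, and four clusters give the 96 lanes / 32 TMUs discussed earlier for the biggest chip.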
Jawed
CarstenS
30-Jun-2009, 13:46
Nope. The TMUs are right next to the ALUs and should be much larger. The layout of the TMUs of this chip must be irregular.
GT200:
http://mental-asylum.de/files2/gt200marked.jpg
red = Vec8
green = octo TMU.
3xVec8 + octo TMU = TPC.
What you have marked as ROPs+TMUs is most likely the GDDR5 interface.
I know it's this way with GT200 and older chips. GT21x are a new breed and 'til now, I fail to identify the TMU area(s) on those GPUs.
3xVec8 + octo TMU = TPC.
What you've marked doesn't add up to an entire cluster. Could be general control or it could be TMU. Dunno.
Also, what's interesting is that in the GT215 die shot the clusters appear to contain much less logic than GT200 (the ratio of area for "ALUs" to "TMUs" is wildly different comparing the two) - implying that the layout of GT215 doesn't have clusters as single contiguous units.
Either that or there are far fewer TMUs. Or scaling to 40nm has been wildly non-linear depending upon unit :???:
The scaling of the ALUs, for what it's worth, appears to be ~2x, from 65nm GT200 to 40nm GT215. One "ALU" in GT200 is 0.654mm² and the same unit in GT215 is 0.323mm².
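To put that ~2x in context, here is a small Python comparison against an ideal shrink. It assumes the textbook rule that area scales with the square of the node ratio, which real processes rarely achieve, and uses the two measured areas quoted above.

```python
# Observed ALU area scaling versus an idealised shrink from 65 nm to 40 nm.
# Assumption: ideal area scales with the square of the feature-size ratio.

alu_area_65nm_mm2 = 0.654  # one "ALU" in GT200, as measured from the die shot
alu_area_40nm_mm2 = 0.323  # the same unit in GT215

observed_scaling = alu_area_65nm_mm2 / alu_area_40nm_mm2  # a little over 2x
ideal_scaling = (65 / 40) ** 2                             # about 2.64x

shortfall = 1 - observed_scaling / ideal_scaling  # fraction of the ideal shrink not realised

print(f"observed: {observed_scaling:.2f}x")
print(f"ideal:    {ideal_scaling:.2f}x")
print(f"shortfall vs ideal: {shortfall:.0%}")
```

So the ALUs shrank by roughly three quarters of what a perfect 65nm to 40nm scaling would give, assuming the die-shot measurements are comparable in the first place.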
Jawed
There are four similarly structured rectangular blocks, situated between the pairs of TPCs distinguishable in the die shot -- those could be texturing hardware, being just the samplers, mapping units or even both (too small for eight TMU quads, anyway... duh!). :???:
trinibwoy
30-Jun-2009, 17:08
So TMUs appear to be:
GT218 - 8
GT214 - 16
GT215 - 32
What's GT214? Isn't it GT216?
RussSchultz
30-Jun-2009, 17:43
What you've marked doesn't add up to an entire cluster. Could be general control or it could be TMU. Dunno.
It's oddly marked, that's for sure, but there's obviously 10x(3x+1) instances.
For the die shot linked to, I don't believe the areas marked 'SIMD' should cover the area that they do. Each SIMD block does seem to represent 3x of something, but the piece attached to it (which I'm saying shouldn't be part of it) isn't a duplicate on each of the different blocks. It might be a routing issue that's making them look different (and they're only instanced on lower metal layers), but I kinda doubt that.
I don't think that what that person has labeled as the same thing on the lower and left hand edges are actually the same thing.
What I see is
4x(3x)--what's marked SIMD
8x --what's marked octo-dunnos
8x --what's marked QTU on the left
4x --what's marked QROP on the left
8x --what's marked QTU on the bottom
4x --what's marked QROP on the bottom
I'd gather that there are 4 functional units, each composed of:
3x something (SIMD)
2x something (QROP on the left)
2x something (QROP on the bottom)
2x something (OCTO on the top)
1x something (QTU on the left)
1x something (QTU on the bottom)
What's GT214? Isn't it GT216?
:oops: yep!
Jawed
It's oddly marked, that's for sure, but there's obviously 10x(3x+1) instances.
For the die shot linked to, I don't believe the areas marked 'SIMD' should cover the area that they do.
Agreed.
Each SIMD block does seem to represent 3x of something, but the piece attached to it (which I'm saying shouldn't be part of it) isn't a duplicate on each of the different blocks. It might be a routing issue that's making them look different (and they're only instanced on lower metal layers), but I kinda doubt that.
I don't think that what that person has labeled as the same thing on the lower and left hand edges are actually the same thing.
What I see is
4x(3x)--what's marked SIMD
8x --what's marked octo-dunnos
Appears to be PCI Express
8x --what's marked QTU on the left
4x --what's marked QROP on the left
8x --what's marked QTU on the bottom
4x --what's marked QROP on the bottom
IO connections for GDDR, with what's labelled QROP actually prolly corresponding with command bus with the remainder being data bus.
Jawed
http://img197.imageshack.us/img197/9460/48212361.png