Looks like we have an obvious trend here: G80 (90nm), GT200b (55nm) and GF100 (40nm) -- all falling within the narrow range of ~480mm².
Remains to be seen whether the latter is as exciting as the first one or as boring as the middleman.
Also, assuming that is the RV870 die shot, I could be wrong, but it looks like there are 24 SIMDs there.
Mmm, fish, yum
Yeah, I think RV870 is actually 24 SIMDs (1920 SPs) with a few disabled. And I think it was Charlie D who also hinted that the real shader count is not 1600...
5890 in about 6 months maybe?
/Kef
It was Charlie D and not Suzie Q huh? Who's that Charlie bloke again and why is that story complete bullshit?
That was unnecessary.
At its GPU Technology Conference (GTC) last week Nvidia specifically noted that it paid a lot of attention to boosting double-precision floating point computational performance on its Fermi-G300 chip (about 750GFLOPs) and thus will be able to address several new markets, e.g., high-performance computing. However, DP performance is rarely needed by average consumers in the desktop or laptop markets. Obviously, in order to create a power-efficient version of Fermi for notebooks or low-cost desktops, Nvidia will have to sacrifice some of its capabilities.
“We're not talking about other (chips) at this point in time but you can imagine that we can scale this part by having fewer than the 512 cores and by having these cores have fewer of the features, for example less double-precision,” said Mr. Dally, who did not explain how it is possible to reduce double-precision floating point performance without decreasing single-precision floating point speed, something which is needed by video games. In fact, Mr. Dally’s comment may imply that non-flagship Fermi derivatives will not only be slower in terms of performance, but will also be seriously different in terms of implementation.
The way I understand it, he talked about the total number of cores per SM. So one of the groups of 16 is DP and can also be used for SP; the other is SP only. INT cannot ever produce "FL"OPS.

According to an interview with Andy Keane and PCGH, each CUDA core consists of:
a DP-FMA, an SP-FMA and an integer ALU
... and they say in some cases the DP-FMA can be used for SP-tasks.
So we are talking about up to 4 FLOPs per CUDA core?
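Whichever way that turns out, here's a rough back-of-the-envelope sketch using the conservative 2-FLOPs-per-core (FMA) counting, the ~1.5 GHz hot clock floated elsewhere in the thread (an assumption, not a confirmed spec) and DP assumed at half the SP rate:

```c
#include <stdio.h>

int main(void)
{
    /* All of these are assumptions, not confirmed Fermi specs. */
    const double cores         = 512;  /* CUDA cores in the full chip     */
    const double hot_clock_ghz = 1.5;  /* rumoured ~1.5 GHz hot clock     */
    const double flops_per_fma = 2.0;  /* one FMA counted as mul + add    */

    double sp_gflops = cores * hot_clock_ghz * flops_per_fma; /* SP-FMA on every core */
    double dp_gflops = sp_gflops / 2.0;                       /* DP assumed half rate */

    printf("peak SP: %.0f GFLOPs\n", sp_gflops); /* ~1536 */
    printf("peak DP: %.0f GFLOPs\n", dp_gflops); /* ~768  */
    return 0;
}
```

That lands at ~1.5 TFLOPs SP and ~768 GFLOPs DP, which is at least in the ballpark of the ~750 GFLOPs Nvidia quoted at GTC.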
Those terms seem much better than the transistor count-based scaling idiocy I see everywhere I look. Also, I don't see how I could make the caveat stronger.
Jawed
Average transistor density (million transistors per mm²):
Cypress -- 6.45 (2154M:334mm²)
Fermi -- 6.25 (3000M:480mm²)
Pretty even this round, eh?
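(For anyone who wants to check the arithmetic, with the counts and die sizes above taken at face value:)

```c
#include <stdio.h>

int main(void)
{
    /* Transistor counts (millions) and die sizes (mm^2) as quoted above. */
    const double cypress_mt = 2154.0, cypress_mm2 = 334.0;
    const double fermi_mt   = 3000.0, fermi_mm2   = 480.0;

    printf("Cypress: %.2f Mtransistors/mm^2\n", cypress_mt / cypress_mm2); /* ~6.45 */
    printf("Fermi:   %.2f Mtransistors/mm^2\n", fermi_mt / fermi_mm2);     /*  6.25 */
    return 0;
}
```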
And bandwidth. Not FLOPs though.

The way you posted it, it sounds like you considered GTX285 to be faster than HD 4890 only because of the ROPs/TMUs.
The deletion of SPI in ATI will mean the advantage is even lower, theoretically. Though you might argue that simply means higher utilisation on ATI (since it's rarely more than 85% on R700 I guess). Still don't know what the instruction throughput for attribute interpolation is, or what instructions are executed to fulfill it.

The whole Fermi-chip (TWFC) with 512 ALUs will probably be losing its edge in terms of those two (comparatively speaking), but otoh gain (quite a lot?) on the compute side of things - even if it's only GT200-style FLOPs, Cypress will only have 77% more SP-MAD FLOPs (using a 1500ish hotclock), whereas it used to be a ~92%ish advantage.
With post-processing and shared memory or memory-intensive algorithms it could be a lot faster, since NVidia seems to have made more effort in that direction.

Plus, as you said, we don't know the individual effectiveness of TWFC yet.
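For reference, the numbers behind those two percentages, assuming 800 SPs at 850 MHz for HD 4890, 240 SPs at a 1476 MHz hot clock for GTX 285 (counting the MAD only, ignoring the extra MUL), 1600 SPs at 850 MHz for Cypress, and 512 ALUs at the purely speculative ~1.5 GHz for Fermi:

```c
#include <stdio.h>

/* Peak SP MAD throughput in GFLOPs: ALUs * clock (GHz) * 2 (mul + add). */
static double mad_gflops(int alus, double ghz) { return alus * ghz * 2.0; }

int main(void)
{
    double hd4890  = mad_gflops( 800, 0.850); /* ~1360 GFLOPs                */
    double gtx285  = mad_gflops( 240, 1.476); /* ~708 GFLOPs, MAD only       */
    double cypress = mad_gflops(1600, 0.850); /* ~2720 GFLOPs                */
    double fermi   = mad_gflops( 512, 1.500); /* ~1536 GFLOPs, clock assumed */

    printf("HD 4890 over GTX 285: +%.0f%%\n", (hd4890 / gtx285 - 1.0) * 100.0); /* ~92%  */
    printf("Cypress over Fermi:   +%.0f%%\n", (cypress / fermi - 1.0) * 100.0); /* ~77%  */
    return 0;
}
```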
Will DP have a meaning in the supposed lifetime of Fermi outside of some small professional market? And wouldn't Nvidia really love to have a strong reason for this professional market to buy the more expensive Quadros and/or Teslas instead of Geforces? This could prove way more potent than just some BIOS-level stuff to differentiate pro from consumer products, couldn't it?

That seems to be a big issue for maintaining compatibility amongst cards. If there are cards with no DP, then it makes things a hell of a lot harder for developers - you end up with all sorts of ugly checks to find feature sets... ironically, just like a CPU.
What happens if code is written assuming DP and then runs on a card with no DP? Does the JIT just emit an error?
David
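As an aside, the kind of feature-set check this would force on developers already exists in CUDA today. A minimal sketch using the runtime API, assuming (as with GT200) that native DP requires compute capability 1.3 or higher:

```c
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "no CUDA device found\n");
        return 1;
    }

    /* Double precision first appeared with compute capability 1.3 (GT200);
       anything older needs an SP fallback path. */
    int has_dp = (prop.major > 1) || (prop.major == 1 && prop.minor >= 3);

    printf("%s (sm_%d%d): %s\n", prop.name, prop.major, prop.minor,
           has_dp ? "native DP available" : "no DP - SP fallback needed");
    return 0;
}
```

IIRC the toolchain doesn't error out either way: compile for a pre-1.3 target and doubles just get demoted to floats with a warning, which is arguably worse than a hard failure.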
"Designing a CPU and designing a GPU is a remarkably different thing," he continued, noting that one was all about locality of content in cache in very high frequencies and speculative execution. "The key word is speculation, so that they can have more parallelism," said Huang, adding "we don't speculate anything. We don't speculate at all."
The NVIDIA chief went on to explain that because his firm's technology was latency tolerant, it could simply use another thread "out of the 15,000 other threads we're executing on at any other time, instead of just one or two threads. We just keep running."
"The two architectures are radically different, the two approaches are radically different, the problems faced are dramatically different," continued an unrelenting Huang.
Mostly I agree. It's quite hard to filter out the different limitations, but I think there's a lot improved internally in Fermi, whereas AMD chose more or less the brute force approach, scaling mainly the number of execution units while leaving internal bandwidth from and to caches etc. untouched and thus on the level of the identically clocked HD 4890.

And bandwidth. Not FLOPs though.
The deletion of SPI in ATI will mean the advantage is even lower, theoretically. Though you might argue that simply means higher utilisation on ATI (since it's rarely more than 85% on R700 I guess). Still don't know what the instruction throughput for attribute interpolation is, or what instructions are executed to fulfill it.
The hunt is still on for a game that's notably ALU-bound... Current reviews seem devoid of one.
Though you might say that such a game is not defined exclusively by: "when HD4890 is faster than GTX285", since it's possible to write a shader that has terrible ALU utilisation on ATI while being fine on NVidia - but you'd have to make the majority of all shading be bottlenecked on such shader(s) for that to become obvious in benchmarking.
And, obviously in general, some shaders can be ALU-bound in a game, but the game still mostly scales with other capabilities. Hard to separate out all this stuff. e.g. the SSAO in Battleforge:
http://www.anandtech.com/video/showdoc.aspx?i=3650&p=2
looks like a candidate for being ALU-bound (compare HD4870 and GTX285) but that might actually be unfiltered texel rate (the speed-up under CS5 apparently derives from the bandwidth efficiency of fetching from shared memory)...
With post-processing and shared memory or memory-intensive algorithms it could be a lot faster, since NVidia seems to have made more effort in that direction.
I vehemently disagree. Charlie was actually one of the saner people I've met in my lifetime; he's a hoopy frood who really knows where his towel is.
It doesn't seem all too hard to reduce or eliminate DP without affecting SP.
The data paths are already structured for peak SP, so that can remain unchanged.
The load/store units are structured for SP, so they won't mind.
One hackish possibility is to gut DP functionality from one of the pipelines in a core. It cuts DP and likely certain INT functions in half, and the schedulers will have to be updated so they only issue heavy operations on the main pipe.
It could be dumped entirely, which means yanking the heavier ALUs and putting in skinnier units.
Possibly the schedulers and operand collectors might have logic for the dual-register special case elided.
The effectiveness of this will also depend on how much of the core is taken up by the ALUs. There are some big areas in those cores that probably won't change all that much.
I don't see any volumes of ATI's DX11 chips. All they need is a (paper) launch to make people wait a little bit longer for Fermi.

1/ Nvidia can compete with 5870/5850 being shipped in volume
2/ Nvidia can ship GT300 in volume before Q1 2010 when they couldn't even produce (or at least spare one to show publicly) an actual prototype card at the start of Q4 2009? Because I don't think that's ever been done in this industry.