#1 |
|
Tiled
Join Date: Oct 2003
Location: Kings Langley, UK
Posts: 2,675
|
I don't have anything serious (no discrete article anyway that I can have ready) by now, so I think I'm going to post some details and just talk about it on the forums until I've got something formal ready. HD 5870 needs finishing first really anyway.
The big highlights (some is guesswork, NV won't talk about graphics transistors or clocks today, so beware I might be wrong in places there):

- 3.0B transistors @ TSMC, 40nm
- 2 x 16-way FMA SM, IEEE754-2008, 16 SMs
- Each SM has four SFUs
- 384-bit GDDR5
- ~650/1700/4200MHz (base/hot/mem)
- 8 pixels/clock address and filter per SM
- 48 ROPs, 8Z/C per clock
- 64KiB L1/smem per SM (48/16 or 16/48 per clock config, not user programmable as far as I know, at least not yet)
- Unified 768 KiB L2 (not partitioned now, so a write to L2 from any SM is visible to all others immediately)
- Unified memory space (hardware TLB, 1TiB address, 40-bit if my brain's working)
- Each SM dual-issues per clock on two half warps, for two clocks. Instructions can be mixed, so FP+INT, or FP+FP, or SFU+FP, etc. If DP instructions are running, nothing else runs. Although I don't think that's quite right, need to run some CUDA on a chip to test.
- 1.5K threads per SM in flight (1K in GT200), 32K FP32 registers per SM (up from 16K in GT200).
- DP is half rate as mentioned, and it's a FMA too.
- All memories the chip talks to, from registers up, are ECC protected (potentially, nobody ships ECC GDDR5, and I think the chip will address 'PC' DDR3 for that in the end). Not sure what scheme or penalty.
- New generation of PTX, CUDA 3.0. C++ in CUDA because of the unified address space.
- Some new predication support, although it's really not clear how the hardware makes it happen. Seems you can predicate any instruction.
- New atomic performance. Seems like it'll coalesce atomic ops in a warp and won't hit DRAM if the update fails, instead using L2 (GT200 replayed the transaction at DRAM hundreds of clocks later). The whitepaper explanation is wrong.
- Seems RF per SM has enough ports (256) and support from the operand fetch hardware to sustain full FMA rate across the chip.
- It can run multiple CUDA kernels now at the same time. Limit is 16 per chip (one per SM), but I think that'll be capped at 8.
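On the FMA point above: what separates a fused multiply-add from a separate multiply followed by an add is that the product is not rounded before the addition, so only one rounding happens. Here is a minimal sketch of that difference in plain Python doubles. The helper `fma_emulated` is a hypothetical name made up for illustration, and it emulates the fused operation with exact rationals rather than using any hardware FMA, so it says nothing about how Fermi implements it.

```python
from fractions import Fraction


def fma_emulated(a: float, b: float, c: float) -> float:
    """Emulate a fused multiply-add: compute a*b + c exactly, then round once.

    Hypothetical helper for illustration only; it uses exact rational
    arithmetic instead of a hardware FMA instruction.
    """
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # the single rounding step


# Inputs chosen so the exact product is 1 - 2**-60, which is not
# representable as a double and rounds to 1.0 when computed on its own.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

separate = a * b + c           # product is rounded first, the tiny term is lost
fused = fma_emulated(a, b, c)  # the tiny term survives the single rounding

print("separate multiply then add:", separate)
print("fused multiply-add:        ", fused)
```

The same effect applies in single precision, just with a larger rounding step.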
I think the tessellator is a software pipe with very little hardware support, too. Anyway, that's from memory, more later when I'm free. If you want more, dkanter's ready with his (and it's excellent) here.
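For anyone who wants headline rates out of the highlights above, here is a rough back-of-envelope sketch. Every input is one of the estimates from the list (the clocks in particular are guesswork), and it assumes the 4200MHz memory figure is the effective data rate per pin, which the post doesn't spell out. Treat the results as what those guesses would imply, not as specs.

```python
# Inputs taken from the highlights above; clocks are estimates, not confirmed.
SMS = 16
FMA_LANES_PER_SM = 2 * 16   # "2 x 16-way FMA SM"
HOT_CLOCK_GHZ = 1.7         # estimated shader ("hot") clock
MEM_BUS_BITS = 384
MEM_RATE_GTPS = 4.2         # assumption: 4200MHz is the effective per-pin rate
ADDRESS_BITS = 40

fma_lanes = SMS * FMA_LANES_PER_SM

# One FMA counts as two floating-point operations (multiply + add).
sp_gflops = fma_lanes * 2 * HOT_CLOCK_GHZ
dp_gflops = sp_gflops / 2   # "DP is half rate"

# Bytes per transfer across the whole bus, times transfers per second.
bandwidth_gbs = MEM_BUS_BITS / 8 * MEM_RATE_GTPS

# A 40-bit address reaches 2**40 bytes, i.e. exactly 1 TiB.
address_tib = 2 ** ADDRESS_BITS / 2 ** 40

# Registers available per thread when an SM is fully populated.
regs_per_thread_fermi = 32 * 1024 / 1536
regs_per_thread_gt200 = 16 * 1024 / 1024

print(f"FMA lanes:        {fma_lanes}")
print(f"Peak SP:          {sp_gflops:.1f} GFLOPS")
print(f"Peak DP:          {dp_gflops:.1f} GFLOPS")
print(f"Memory bandwidth: {bandwidth_gbs:.1f} GB/s")
print(f"Address space:    {address_tib:.0f} TiB")
print(f"Regs/thread:      {regs_per_thread_fermi:.2f} (GT200: {regs_per_thread_gt200:.0f})")
```

If the final clocks land elsewhere, only the constants at the top need changing.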
__________________
A major redesign of the core ALU pineapple boomerang fortress. Last edited by Rys; 01-Oct-2009 at 21:25. Reason: 8 pixels/clock address and filter |
|
|
|
|
|
#2 |
|
Epsilon plus three
Join Date: Feb 2002
Location: Chania
Posts: 7,818
|
__________________
People are more violently opposed to fur than leather; because it's easier to harass rich ladies than motorcycle gangs. |
|
|
|
|
|
#3 |
|
Tiled
Join Date: Oct 2003
Location: Kings Langley, UK
Posts: 2,675
|
Can we keep this thread free of crap, please (other than my initial post!). Post links to the webcast so people can keep up, and other Fermi pieces from around the web when they pop up. Old thread got a bit silly at times, less of that if poss.
__________________
A major redesign of the core ALU pineapple boomerang fortress. |
|
|
|
|
|
#4 |
|
hardly a Senior Member
Join Date: Jul 2008
Location: still camping with a mauler
Posts: 3,676
|
|
|
|
|
|
|
#5 |
|
Member
Join Date: Apr 2002
Location: London
Posts: 269
|
Some info from that anandtech article:
"The price is a valid concern. Fermi is a 40nm GPU just like RV870 but it has a 40% higher transistor count. Both are built at TSMC, so you can expect that Fermi will cost NVIDIA more to make than ATI's Radeon HD 5870. Then timing is just as valid, because while Fermi currently exists on paper, it's not a product yet. Fermi is late. Clock speeds, configurations and price points have yet to be finalized. NVIDIA just recently got working chips back and it's going to be at least two months before I see the first samples. Widespread availability won't be until at least Q1 2010. I asked two people at NVIDIA why Fermi is late; NVIDIA's VP of Product Marketing, Ujesh Desai and NVIDIA's VP of GPU Engineering, Jonah Alben. Ujesh responded: because designing GPUs this big is "fucking hard"." |
|
|
|
|
|
#6 |
|
Regular
Join Date: Jun 2003
Posts: 6,177
|
There's no hardware out there yet, and it's going to be expensive and damn difficult to make. I hope Nvidia don't have to compromise the design and give us something less than all these slides promise.
I don't want to raise the spectre of NV30, but the last time we saw this kind of forward publicity from Nvidia was when they were under pressure to produce an extremely ambitious design on a process that wasn't ready for it, that was running late, and something had to be put out as a spoiler against ATI's recently launched and very successful R300. We got a lot of promises that didn't translate into the finished product.

One thing that will work both for and against GF100 is that they seem not to be focussing on the gaming side of things, but are sidestepping into the GPGPU realm. Obviously AMD and Intel may not follow them there, as they have CPUs to sell, but for Nvidia, it might make sense to make this new chip something other than a CPU or a GPU and effectively carve out a new market for themselves. The only problem will be if gamers no longer see this as a gaming product, and don't go for it. I'm not sure OEMs will want it at what's got to be a higher price than competing products, when it seems to be aimed at the GPGPU segment rather than gaming or general purpose use.

It seems to be an amazing product if it lives up to the hype, but in the same way a Bugatti Veyron is an amazing thing - but it's not one I am likely to buy except for its gaming/video applications. Where's all the gaming stuff, or is Nvidia moving away from that market? |
|
|
|
|
|
#7 |
|
Senior Member
|
__________________
Apple: China -- Brutal leadership done right.
Google: United States -- Somewhat democratic. Microsoft: Russia -- Big and bloated. Linux: EU -- Diverse and broke. |
|
|
|
|
|
#8 |
|
hardware monkey
Join Date: Mar 2007
Posts: 3,910
|
So the 16 SMs are on the "north" and "south" sides of the chip w/PCI-e and GDDR5 interfaces along the borders, any guesses as to what's in the center? Especially the very center. Scheduling?
|
|
|
|
|
|
#9 |
|
Epsilon plus three
Join Date: Feb 2002
Location: Chania
Posts: 7,818
|
Rys,
16 pixels/clock address and setup/SM? Are you sure 256 TMUs aren't way too much overkill for that kind of bandwidth? Also when you state 8Z/8C samples /clock for the ROPs, I assume it's either/or as in today's GPUs?
__________________
People are more violently opposed to fur than leather; because it's easier to harass rich ladies than motorcycle gangs. |
|
|
|
|
|
#10 | |
|
Junior Member
Join Date: Aug 2009
Posts: 80
|
Quote:
Also hardware.fr is reporting that each SFU unit can do 8 interpolations (if I understood correctly). Each SM has 1 SFU unit, so the GT300 can do 16x8=128 interpolations. Wouldn't the design (256 TMUs) be limited that way (to 128 interpolations)? EDIT: I just rechecked hardware.fr. It reports 16 interpolations. So probably we have 256 TMUs. Last edited by elsence; 01-Oct-2009 at 10:45. |
|
|
|
|
|
|
#11 | |
|
Epsilon plus three
Join Date: Feb 2002
Location: Chania
Posts: 7,818
|
Quote:
Else if it truly has something like 256 TMUs or equivalents I doubt real-time fillrate could even peak to such heights. One aspect that nobody has asked about so far, and obviously there's probably no data available on it, is 8xMSAA performance on GF100. I'd dare to say it has 48 ROPs (wtf it would need 48 pixels/clock for is beyond my uneducated imagination), but the 8Z/8C note in Rys' diagram isn't particularly telling IMO in that department yet either. Finally, if it ends up with something like 1200MHz GDDR5, the 384bit bus "grants" it 230.4GB/sec, which might be sufficient and marks a roughly 50% increase over GTX285, closer to what Dally allowed me to speculate from his PCGH interview and way less than anything the 512bit scenarios so far granted.
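The 230.4GB/sec figure follows from GDDR5 moving four bits per pin per command clock. A quick sketch of that arithmetic follows; the function name is made up for illustration, and the GTX 285 number (159.0 GB/s) is its published bandwidth, which I'm supplying here since the post doesn't quote it.

```python
def gddr5_bandwidth_gbs(command_clock_mhz: float, bus_bits: int) -> float:
    """Peak bandwidth in GB/s for a GDDR5 bus (hypothetical helper).

    GDDR5 transfers 4 bits per pin per command clock, so a 1200MHz
    command clock means 4.8 Gbit/s per pin.
    """
    transfers_per_second = command_clock_mhz * 1e6 * 4
    return transfers_per_second * bus_bits / 8 / 1e9


gf100_gbs = gddr5_bandwidth_gbs(1200, 384)

# Assumption: GTX 285's published peak bandwidth, not stated in the thread.
GTX285_GBS = 159.0
ratio = gf100_gbs / GTX285_GBS

print(f"384-bit @ 1200MHz GDDR5: {gf100_gbs:.1f} GB/s")
print(f"Relative to GTX 285:     {ratio:.2f}x")
```

The ratio comes out a little under 1.5x, so the 50% in the post is a round-up rather than an exact figure.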
__________________
People are more violently opposed to fur than leather; because it's easier to harass rich ladies than motorcycle gangs. |
|
|
|
|
|
|
#12 | |
|
Junior Member
Join Date: Aug 2009
Posts: 80
|
Quote:
With G92b, Nvidia had a design with 738MHz 16ROPs/64TMUs/128SPs and with 1.1GHz GDDR3 (70.4GB/sec with a 256bit memory controller). GT300 has 2.85X pixel fillrate / 3.8X texel fillrate / 3.3X memory bandwidth. This logic is kinda simplistic, but things don't look so bad if the GDDR5 memory controller has good efficiency. Nvidia must do something to reduce the hit with 8xMSAA.

Anyway, I guess (wild guess) GT300 will be around 1.5X faster per MHz in relation to a 5870 (4X AA). The problem with the 5870 is that the performance improvement in relation to a 4890 is not consistent (it has much higher variations in perf. than what 2X specs would normally have, I know about the bandwidth...). I don't know why that is, but I guess it is either the geometry setup engine (the Geometry/Vertex assembler has the same performance as the 4890's) or something about the Geometry shader performance?

I could only find 3DMark Vantage tests, if you check: http://www.pcper.com/article.php?aid...=expert&pid=12

- GPU Cloth: 5870 is only 1.2X faster than 4890. (vertex/geometry shading test)
- GPU Particles: 5870 is only 1.2X faster than 4890. (vertex/geometry shading test)
- Perlin Noise: 5870 is 2.5X faster than 4890. (Math-heavy Pixel Shader test)
- Parallax Occlusion Mapping: 5870 is 2.1X faster than 4890. (Complex Pixel Shader test)

It shouldn't be a problem of the dual rasterizer/dual SIMDs engine efficiency, since the synthetic Pixel Shader tests are fine (more than 2X) while the synthetic geometry shading tests are only 1.2X. Are these synthetic vertex/geometry shading tests so bandwidth limited that they deliver 1.2X instead of 2X? Or are the Pixel Shader tests like the Parallax Occlusion Mapping test not bandwidth limited at all? (Why do they deliver more than 2X?) And anyway, it is not logical for the 5870 to be extremely bandwidth limited. Why would ATI waste transistor resources like that, if the design is not going to deliver (to such a degree) because of the bandwidth? Certainly, they would have used the transistor space in a more efficient way.

Last edited by elsence; 01-Oct-2009 at 14:01. |
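For reference, the 2.85X / 3.8X / 3.3X ratios in the post above can be reproduced with the sketch below. The post doesn't say which GT300 core clock it used; 700MHz is my back-solved assumption because it is the value that makes the fillrate ratios come out as quoted. The 48 ROPs, 256 TMUs and 230.4GB/sec are the figures being floated earlier in the thread, not confirmed specs.

```python
# G92b figures as given in the post above.
g92b = {"core_mhz": 738, "rops": 16, "tmus": 64, "bandwidth_gbs": 70.4}

# Speculated GT300 figures from this thread.
# Assumption: the 700MHz core clock is inferred, the post never states it.
gt300 = {"core_mhz": 700, "rops": 48, "tmus": 256, "bandwidth_gbs": 230.4}


def fillrate(chip: dict, units_key: str) -> float:
    """Units times core clock: pixels or texels per second, in millions."""
    return chip[units_key] * chip["core_mhz"]


pixel_ratio = fillrate(gt300, "rops") / fillrate(g92b, "rops")
texel_ratio = fillrate(gt300, "tmus") / fillrate(g92b, "tmus")
bandwidth_ratio = gt300["bandwidth_gbs"] / g92b["bandwidth_gbs"]

# Cross-check of the G92b bandwidth: 1100MHz GDDR3 is double data rate
# on a 256-bit (32-byte) bus.
g92b_check_gbs = 1100e6 * 2 * (256 / 8) / 1e9

print(f"Pixel fillrate ratio: {pixel_ratio:.2f}x")
print(f"Texel fillrate ratio: {texel_ratio:.2f}x")
print(f"Bandwidth ratio:      {bandwidth_ratio:.2f}x")
print(f"G92b bandwidth check: {g92b_check_gbs:.1f} GB/s")
```

With the ~650MHz base clock guessed in the first post instead, the pixel and texel ratios would drop somewhat, so these multipliers are only as good as the clock guess.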
|
|
|
|
|
|
#13 | |
|
Senior Member
|
Quote:
I.e., their attached filtering units: 16x4 values bilinearly interpolated, or only 16 values (i.e. a traditional quad-TMU's worth)? Would they be the only units responsible for transactions to/from memory, i.e. are they replacing the traditional ROPs, or are separate ROPs (or parts of them, such as Z-compares) still present? I think Nvidia did not reveal very much of Fermi; there's a lot of guesswork left.
__________________
English is not my native tongue. Before flaming please consider the possibility that I did not mean to say what you might have read from my posts. Work | Recreation. Warning! This posting may contain unhealthy doses of gross humor, sarcastic remarks and exaggeration! |
|
|
|
|
|
|
#14 | |
|
Senior Member
Join Date: Oct 2002
Posts: 2,462
|
Quote:
edit: oh hmm, forgot about the higher clock of the alus vs. tmus. Still, imho 256 TMUs would make no sense at all, it would increase TEX:ALU ratio from GT200 considerably (back to the G92 level). Last edited by mczak; 01-Oct-2009 at 14:24. |
|
|
|
|
|
|
#15 |
|
Regular
|
|
|
|
|
|
|
#16 |
|
Member
Join Date: Aug 2002
Posts: 230
|
|
|
|
|
|
|
#17 |
|
Senior Member
|
So, the addition of ECC to the GDDR interface would definitely reflect on the chip's perimeter occupancy -- 64+8 bits per channel, for a grand total of a 432-bit data bus!?
And it looks like there will be a third revision of the NVIO companion ASIC for the thing.
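The 432-bit figure is just the 384-bit bus split into 64-bit channels with 8 check bits added to each. A tiny sketch of that sum is below. It only models the side-band scheme this post is wondering about (extra pins per channel); whether the chip actually does ECC that way is not established anywhere in the thread.

```python
DATA_BUS_BITS = 384         # from the spec highlights in the first post
CHANNEL_DATA_BITS = 64      # per-channel data width assumed in this post
ECC_BITS_PER_CHANNEL = 8    # the "64+8" side-band scheme being speculated about

channels = DATA_BUS_BITS // CHANNEL_DATA_BITS
total_bus_bits = channels * (CHANNEL_DATA_BITS + ECC_BITS_PER_CHANNEL)
overhead = ECC_BITS_PER_CHANNEL / CHANNEL_DATA_BITS

print(f"Channels:          {channels}")
print(f"Bus with ECC pins: {total_bus_bits} bits")
print(f"Pin overhead:      {overhead:.1%}")
```

An in-band scheme that stores the check bits in ordinary memory would keep the bus at 384 bits and pay in capacity and bandwidth instead of pins.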
__________________
Apple: China -- Brutal leadership done right.
Google: United States -- Somewhat democratic. Microsoft: Russia -- Big and bloated. Linux: EU -- Diverse and broke. |
|
|
|
|
|
#18 | |||
|
Member
Join Date: Jan 2007
Posts: 334
|
Quote:
Target of 750, I doubt they will be able to do it. Then again, Dear Leader might be flogging the troops until morale improves, and is gunning for higher, but that will likely mean only more delays. See G200 - the world's first .933TF GPU for more on this. Quote:
Quote:
http://www.theinquirer.net/inquirer/...0-architecture Almost like I knew what I was talking about all those months ago. Who would have thought. -Charlie |
|||
|
|
|
|
|
#19 | |
|
Tiled
Join Date: Oct 2003
Location: Kings Langley, UK
Posts: 2,675
|
Quote:
__________________
A major redesign of the core ALU pineapple boomerang fortress. |
|
|
|
|
|
|
#20 | |
|
Member
Join Date: Jan 2007
Posts: 334
|
Quote:
All this said, no, I am not going to 'prove' my numbers. Yes I am certain of them, I wouldn't have printed them if I wasn't. I have an article on this coming, but I haven't been able to finish it yet. One of the biggest problems is that NV itself doesn't have enough working silicon to characterize the #(*$&ing parts. The target is 750 +/- a bit. Will they get there? I doubt it, G200 missed targets by 10+% and 7 months, and this one is more rushed, plus has about 5 very critical risk areas. They aren't giving out specs because they don't have a clue what they can make yet. -Charlie |
|
|
|
|
|
|
#21 |
|
Unknown.
Join Date: Aug 2002
Location: UK
Posts: 4,882
|
Nice copy-pasting. I mean by NVIDIA's synthesis team, not by you, of course.
Overall, I quite like the SM design - I was expecting the dual-MADD layout for a number of reasons (hint: GT200 didn't expose the full 1024 threads, so I knew it was going to jump to 1536, which meant 6 virtual RF read ports), although I'm surprised they've gone for dual-warp instead of dual-instruction; pleasantly surprised, mind you. I'm not pleasantly surprised by the fact that 99% of your execution hardware is taking a nap when doing, say, basic integer operations, which are quite important to me. Oh well - you can't please everybody! Even in terms of MUL/ADDs for graphics programs though, it seems rather inefficient.

The SMs, TMUs, and MC-linked blocks are all easy to notice on the die shot. In the bottom left of the central block lies all the 'unique' stuff, conveniently quite near to the PCI Express analogue. What I find interesting, however, is that the MC-linked block is so huge. Seems like a lot of formerly "central" functionality was moved to the MC-linked blocks; I wonder if that includes input assembly and all of its little friends later in the pipeline! (also I really should go on IRC sometime!) |
|
|
|
|
|
#22 |
|
Regular
|
Anyone seen any TMUs?
Jawed
__________________
Can it play WoW? |
|
|
|
|
|
#23 |
|
Epsilon plus three
Join Date: Feb 2002
Location: Chania
Posts: 7,818
|
I've already asked Rys where the 16 pixels/clock address & setup per SM come from but I'm still waiting for his answer. I don't think the 16 load/store units have anything to do with it.
__________________
People are more violently opposed to fur than leather; because it's easier to harass rich ladies than motorcycle gangs. |
|
|
|
|
|
#24 |
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,242
|
The design put forth really hammers some low-hanging fruit that earlier Nvidia GPUs (and others) lacked.
The multiple kernels, the closed/semi-closed write/read loop, 1/2 DP throughput. The other stuff is downright crazy to see: indirection, exceptions, IEEE compliance, ECC, simplified addressing.

The mapping of separate memory spaces to lie within the global address space is an elegant way to have the benefits of hardware peculiarity in a more specialized instance without having it impinge on the general computation case. I had sort of thought of a design using special page table bits that would allow hardware to route to special on-chip storage if enabled, and easily forgettable if not. This isn't quite the same, but the idea of using the target memory location to delineate special things you want done with it is a rather nice touch.

The size of the chip shows the price of generality, though. FLOP density is not likely to be anywhere near Cypress, and I'd be curious to know if Larrabee's final clocks will mean even the x86 will have an advantage. I don't know how it will fare in gaming, or how many other problems there may be, but I have to give Nvidia credit: this design took balls.

As far as DP is concerned, the quality of this implementation is enough to make Cypress appear as useful as its botanical namesake in HPC. Physical and economic realities may intrude on this (it doesn't exist on a store shelf), but as a topic of discussion, I find this architecture much more interesting to discuss. The posited tool sets and initiatives are such that this is the first time I've ever thought a GPU designer took serious computation seriously.
__________________
Dreaming of a .065 micron etch-a-sketch. |
|
|
|
|
|
#25 |
|
Nutella Nutellae
Join Date: Feb 2002
Location: San Francisco
Posts: 4,308
|
I am waiting to see how many ppl will start to complain about the introduction of some sort of a semi-coherent cache.
__________________
[twitter] More samples, we need more samples! [Dean Calver] First they ignore you, then they laugh at you, then they fight you, then you win. [Mahatma Gandhi] The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way |
|
|
|