NVIDIA Kepler speculation thread

Erm.. there was an option for either FP16 or FP32 IIRC.. but that doesn't make them less advanced than the Tegra's PSs...

Of course they had split precision, but that doesn't mean they weren't capable of FP32, unlike T2's PS ALUs, which are 80-bit RGBA (the VS ALUs being FP32, though). By the way, the Kishonti database also lists only 16-bit Z precision for Tegras.
 
G7x? When did that happen? I thought Tegra was vintage NV4x. :)

Ail, where do you get info on mobile GPUs? Seems there's never any detailed public info available.

Hmm, just something I picked up along the way. I'm not overly informed about these weak gizmos:smile:
 
G7x? When did that happen? I thought Tegra was vintage NV4x. :)

Ail, where do you get info on mobile GPUs? Seems there's never any detailed public info available.

Uhmm don't hurt me but the answer for the PS ALUs is NV's homesite: http://www.nvidia.com/content/PDF/t...ing_High-End_Graphics_to_Handheld_Devices.pdf

Page 7:

The GeForce GPU includes four pixel shader cores and four vertex shader cores for high speed vertex and pixel processing. The GPU pipeline uses an 80-bit RBGA pixel format with FP20 data precision in the pixel pipeline, and FP32 precision in the vertex pipeline.
Since even the diagram is showing "8 cores" it's most likely 1 Vec4 PS ALU and 1 Vec4 VS ALU. Highest frequency for tablets should be at 333MHz (and no, for heaven's sake, don't expect any hotclocks either) for the ULP GeForce in T2. You said a bunch? To beat Fermi performance by a margin of up to 2x, you'd need a gazillion of those ;)

I'd even bet that they didn't go for FP24 but chose FP20 in order to annoy ATI :p

As for Z precision, what do you figure out of that one? http://www.glbenchmark.com/phonedet...&D=Samsung+GT-P7100+Galaxy+Tab+2&testgroup=gl
 
Keep on topic: KEPLER

What's with all these posts about Tegra.

This is the Nvidia KEPLER thread.

If you want to talk about Tegra start a Tegra thread and quit hijacking the Kepler thread.
 
Uhmm don't hurt me but the answer for the PS ALUs is NV's homesite: http://www.nvidia.com/content/PDF/t...ing_High-End_Graphics_to_Handheld_Devices.pdf

As for Z precision, what do you figure out of that one? http://www.glbenchmark.com/phonedet...&D=Samsung+GT-P7100+Galaxy+Tab+2&testgroup=gl

Lol, thanks :oops:

I figure depth precision is already a problem, although a quick Google search brings up complaints of z-fighting on the 24-bit SGX540 as well.

Since even the diagram is showing "8 cores" it's most likely 1 Vec4 PS ALU and 1 Vec4 VS ALU. Highest frequency for tablets should be at 333MHz (and no, for heaven's sake, don't expect any hotclocks either) for the ULP GeForce in T2. You said a bunch? To beat Fermi performance by a margin of up to 2x, you'd need a gazillion of those ;)

Hmmm, that makes sense but doesn't explain the 12 "cores" of Kal-El. Totally different arch maybe.

I'm not expecting much in the way of architectural innovation from Kepler to be honest. I suspect it's going to be akin to G71 and nVidia will use the 28nm transition to put their chips on a diet with the big chip at ~1.6x the perf of the 580. Maybe a re-balancing of unit counts here and there but nothing dramatic. Of course, this is just random speculation based on the lack of news on the desktop and the ginormous focus on mobile devices.
 
Hmmm, that makes sense but doesn't explain the 12 "cores" of Kal-El. Totally different arch maybe.

I'd say that TKK might have hit the nail on its head. Since when ALU vector lanes justify a "core" description is another chapter.

I'm not expecting much in the way of architectural innovation from Kepler to be honest. I suspect it's going to be akin to G71 and nVidia will use the 28nm transition to put their chips on a diet with the big chip at ~1.6x the perf of the 580. Maybe a re-balancing of unit counts here and there but nothing dramatic. Of course, this is just random speculation based on the lack of news on the desktop and the ginormous focus on mobile devices.
How about 2*32, instead of 32+16@GF1x4? Oh and by the above witless core definition, it will consist of at least a thousand+ "cores".
 
I also expect GF104-type SMs with better warp schedulers and bigger registers in order to keep efficiency up: 768 CCs, a 384-bit bus, 48 enhanced ROPs, 128 TMUs, ~800-900 MHz core clocks and a die size around 400 mm^2.
 
Or take trinibwoy's idea of putting the GK100 a bit on diet, add an evolution of the GF104 SMs (with doubled registers and local memory/L1) with 64 ALUs as mentioned by Ailuros, and you get 1024 ALUs in 16 SMs and probably just 5 billion transistors (i.e. less than a factor 1.7 increase).
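
As a rough sanity check of those numbers, here is a minimal Python sketch. The ~3.0 billion transistor baseline for GF110 is an assumption taken from public spec sheets rather than from this thread, and the function names are made up for illustration.

```python
# Back-of-the-envelope check for the hypothetical Kepler config above.
# Assumption (not from this thread): GF110 has roughly 3.0 billion transistors.
GF110_TRANSISTORS = 3.0e9


def total_alus(sm_count: int, alus_per_sm: int) -> int:
    """Total ALU count for a chip built from identical SMs."""
    return sm_count * alus_per_sm


def transistor_ratio(new_count: float, old_count: float = GF110_TRANSISTORS) -> float:
    """Growth factor of the transistor budget relative to the baseline chip."""
    return new_count / old_count


if __name__ == "__main__":
    print("ALUs:", total_alus(16, 64))
    print("Transistor growth factor:", round(transistor_ratio(5.0e9), 2))
```

With 16 SMs of 64 ALUs each this gives the 1024 ALUs mentioned above, and 5 billion over the assumed 3 billion baseline stays just under the 1.7x figure.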
 
Or take trinibwoy's idea of putting the GK100 a bit on diet, add an evolution of the GF104 SMs (with doubled registers and local memory/L1) with 64 ALUs as mentioned by Ailuros, and you get 1024 ALUs in 16 SMs and probably just 5 billion transistors (i.e. less than a factor 1.7 increase).

...and probably a MC worthier of something like Fermi hm? ;)
 
I also expect GF104-type SMs with better warp schedulers and bigger registers in order to keep efficiency up: 768 CCs, a 384-bit bus, 48 enhanced ROPs, 128 TMUs, ~800-900 MHz core clocks and a die size around 400 mm^2.
Would be quite a small chip for nvidia. After the rather problematic GF100 though why not...
One thing the GF104-type SMs can't do is fast FP64. I'm not sure that, if you extended them to support it (at 1/3 rate?), these SMs would still be terribly more efficient than the GF110 ones.
Can't see why you'd need "enhanced" ROPs (unless you're talking more cache or some such). They look "too fast" this generation so should serve next generation just fine. If anything I'd expect more pixel export capability from the SMs to the ROPs (especially if Kepler uses GF104 style SMs).
 
One thing the GF104 type SMs can't do is fast fp64 - not sure if you'd extend them to support that (at 1/3 rate?) these SMs are terribly more efficient though than GF110 ones.
Can't see why you'd need "enhanced" ROPs (unless you're talking more cache or some such). They look "too fast" this generation so should serve next generation just fine. If anything I'd expect more pixel export capability from the SMs to the ROPs (especially if Kepler uses GF104 style SMs).

Despite the GTX 580 having 50% more ROPs, it is still behind the 6970 in some cases, and sometimes performs like a 6870. That's why I say it needs enhanced ROPs. I don't know, but maybe that's the reason why Fermis behave worse at high resolutions, along with texture fillrates :?:
http://www.hardware.fr/medias/photos_news/00/30/IMG0030754.gif
http://www.hardware.fr/medias/photos_news/00/30/IMG0030755.gif
http://www.hardware.fr/medias/photos_news/00/30/IMG0030753.gif
http://www.hardware.fr/articles/818-6/tests-theoriques-pixels.html
 
I don't think the ROPs are the issue but rather:

Beyond3D said:
Now, you may be thinking: no more than 8 fragments can be rasterised per GPC per base clock, thus it'd take 4 base clocks to fill a fragment warp, thus apparent rate would be 8 fragments per GPC per clock and thus 32 across the entire chip – why so many ROPs (6 of them equate to a theoretical maximum of 48-fragments per base clock)? Two reasons, at least in our opinion: first, the memory controller-to-ROP connection is so tight that it would have been quite intrusive to remove the extra ROPs, and second, atomics.

Each SM only does 2 ppc...
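
To put numbers on that, here is a small Python sketch of the per-clock fragment rates at each stage. The 8 fragments per GPC, the 6 ROP partitions (48 ROPs) and the 2 pixels per SM per clock come from the posts above; the 4 GPCs and 16 SMs are assumed from public GF110 specs, and all names are illustrative only.

```python
# Per-clock fragment throughput at each stage of a GF100/GF110-style chip.
# Assumptions (public specs, not stated in this thread): 4 GPCs and 16 SMs.
GPCS = 4
SMS = 16
ROP_PARTITIONS = 6
ROPS_PER_PARTITION = 8


def raster_rate(gpcs: int = GPCS, frags_per_gpc: int = 8) -> int:
    """Fragments rasterised per base clock across all GPCs."""
    return gpcs * frags_per_gpc


def sm_export_rate(sms: int = SMS, pixels_per_sm: int = 2) -> int:
    """Pixels the SMs can export to the ROPs per clock."""
    return sms * pixels_per_sm


def rop_rate(partitions: int = ROP_PARTITIONS, rops: int = ROPS_PER_PARTITION) -> int:
    """Theoretical maximum fragments the ROPs can accept per clock."""
    return partitions * rops


if __name__ == "__main__":
    stages = {
        "raster": raster_rate(),
        "SM export": sm_export_rate(),
        "ROPs": rop_rate(),
    }
    for name, rate in stages.items():
        print(f"{name}: {rate} per clock")
    # The slowest stage bounds the whole pipeline.
    print("bottleneck:", min(stages, key=stages.get))
```

Under these assumptions the rasterisers and the SM export path both top out at 32 per clock, below the 48 the ROPs could take, which is the point being made here.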

With regard to texture fill, the 580 has fewer TMUs (64 vs 96) which are clocked lower, and are performing as well as can be expected I'd say.
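
A quick way to see the gap on paper is the Python sketch below. The TMU counts are the ones quoted above; the core clocks (772 MHz for the GTX 580, 880 MHz for the HD 6970) are assumed from public spec sheets, and one texel per TMU per clock is a simplification.

```python
# Peak texel rates on paper for the two cards being compared.
# Assumption (not from this thread): core clocks taken from public spec sheets.
CARDS = {
    "GTX 580": {"tmus": 64, "core_mhz": 772},
    "HD 6970": {"tmus": 96, "core_mhz": 880},
}


def texel_rate_gtexels(tmus: int, core_mhz: float) -> float:
    """Peak texel rate in GTexels/s, assuming one texel per TMU per clock."""
    return tmus * core_mhz / 1000.0


if __name__ == "__main__":
    for name, spec in CARDS.items():
        rate = texel_rate_gtexels(spec["tmus"], spec["core_mhz"])
        print(f"{name}: {rate:.1f} GTexels/s")
```

On these assumed clocks the 580 has well under two thirds of the 6970's theoretical texel rate, so trailing in texture fill tests is expected.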
 
The right way to think about ROPs is, I think, that they are one potential bottleneck (among many) in the system, and one that is intimately connected to memory bandwidth. So a well-designed chip will be made such that the ROPs are capable of saturating the available memory bandwidth the majority of the time (note that the available memory bandwidth will often not be the full memory bandwidth, because a good amount of it is taken up by things like texture reads).

As such, if you're interested in performance, as long as the chip is well-designed so that the ROPs are not often a bottleneck, then you will get a better estimate of performance by looking at things like shader processing power and memory bandwidth than you will by looking at ROPs. But in the end, with the complexity of today's chips, it's not really possible to look at a spec sheet and predict performance. You have to test it.
 
Why? Maybe they just added another Vec4 PS ALU.

Oh yes, of course you're probably right :)

these SMs are terribly more efficient though than GF110 ones.

I don't know if that's true. The 580 only has a 25% flops advantage over the 560 Ti yet is much faster at compute tasks. If they continue with GF104's "superscalar" approach they need to improve register bandwidth or whatever else is necessary to feed the ALUs.
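
For reference, the 25% figure can be reproduced with the Python sketch below. The core counts and shader (hot) clocks are assumed from public spec sheets, not from this thread: 512 cores at 1544 MHz for the GTX 580 and 384 cores at 1644 MHz for the GTX 560 Ti, with 2 flops per core per clock for a fused multiply-add.

```python
# Peak single-precision throughput on paper, GTX 580 vs GTX 560 Ti.
# Assumption (not from this thread): core counts and shader clocks from public specs.
GTX_580 = {"cores": 512, "shader_mhz": 1544}
GTX_560_TI = {"cores": 384, "shader_mhz": 1644}


def peak_gflops(cores: int, shader_mhz: float, flops_per_clock: int = 2) -> float:
    """Peak GFLOPS, counting an FMA as 2 flops per core per clock."""
    return cores * flops_per_clock * shader_mhz / 1000.0


def flops_advantage(a: dict, b: dict) -> float:
    """Ratio of peak GFLOPS of card a over card b."""
    return peak_gflops(a["cores"], a["shader_mhz"]) / peak_gflops(b["cores"], b["shader_mhz"])


if __name__ == "__main__":
    print(f"GTX 580:    {peak_gflops(**GTX_580):.0f} GFLOPS")
    print(f"GTX 560 Ti: {peak_gflops(**GTX_560_TI):.0f} GFLOPS")
    print(f"advantage:  {flops_advantage(GTX_580, GTX_560_TI):.2f}x")
```

The paper advantage comes out at roughly 1.25x, so anything much larger than that in real compute tasks points at how well the ALUs are being fed rather than at raw flops.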

The 32-bit path from the SM's to ROPs may not be a bottleneck in reality but it just seems like such a striking limitation on paper that they should do something about it too.
 
I don't know if that's true. The 580 only has a 25% flops advantage over the 560 Ti yet is much faster at compute tasks. If they continue with GF104's "superscalar" approach they need to improve register bandwidth or whatever else is necessary to feed the ALUs.

Indeed. You can check here for the GTX 460's peak integer perf and single MD5 speed vs the GTX 465's. Despite the GTX 460 having higher peak integer perf, it is still slower than a GTX 465 in all tests.

http://www.golubev.com/gpuest.htm
 
I don't know if that's true. The 580 only has a 25% flops advantage over the 560 Ti yet is much faster at compute tasks. If they continue with GF104's "superscalar" approach they need to improve register bandwidth or whatever else is necessary to feed the ALUs.
Yes, that's what I wanted to say, the GF104 SMs don't look that much more efficient now, and if you'd extend them for faster FP64 they might not really have that much of an advantage.
 
The 32-bit path from the SM's to ROPs may not be a bottleneck in reality but it just seems like such a striking limitation on paper that they should do something about it too.
I thought it's 64bit so not quite that bad. Maybe at least for the 32 ALU SMs it is "enough".
 