Nvidia Pascal Announcement

Regardless of what they say there, at launch they were really clear about it: 4 GB / 224 GB/s and 8 GB / 256 GB/s, with the possibility of AIBs bringing different memory clocks on both products.

That is only for reference cards. That big "*" next to 256 GB/s tells you all you need to know. The official minimum spec for RX480 is 7 Gbps.
 
The big "*" refers to both 224 & 256 GB/s, not just 256 GB/s, meaning in both cases AIBs can offer different configurations, or at least that's how I understood their messaging at launch.
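
For reference, the two bandwidth figures follow directly from the memory data rate. A quick sketch of the arithmetic, assuming the RX 480's 256-bit memory bus (the bus width is not stated in this thread, and the function name is just illustrative):

```python
def memory_bandwidth_gb_s(data_rate_gbps: float, bus_width_bits: int) -> float:
    """Peak bandwidth in GB/s: per-pin data rate (Gbit/s) times bus width (bits), over 8 bits per byte."""
    return data_rate_gbps * bus_width_bits / 8


# Assumption: 256-bit bus for the RX 480; not taken from the posts above.
RX480_BUS_WIDTH_BITS = 256

for rate_gbps in (7.0, 8.0):
    bandwidth = memory_bandwidth_gb_s(rate_gbps, RX480_BUS_WIDTH_BITS)
    print(f"{rate_gbps:.0f} Gbps -> {bandwidth:.0f} GB/s")
```

So 7 Gbps gives the 224 GB/s figure and 8 Gbps gives 256 GB/s; any other AIB memory clock scales linearly from there.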
 
First official notification of 1060 3GB we got was from Inno3d, followed by EVGA:
http://www.inno3d.com/products.php?refid=1&subid=18
http://eu.evga.com/articles/01047/evga-geforce-gtx-1060-3gb/

So it's official now: 3 GB and 1152 ALUs, with only the „3GB“ differentiating the card from the 6-GB-SKU on the outside, confusion inbound.

Pricing for Germany seems to be 219 EUR including all necessary taxes.
It's just stupid having 5 versions of a 1060 3 GB, or any specific card for that matter.
 
But x86 or Phi coupled to on-package or on-die FPGA functionality shouldn't be more than 5 years away. And that, in my view, marks the end of GPUs in HPC of any kind.
How would you program the FPGA?

There is a lot of CUDA code out there right now. Phi+FPGA would face this sunk cost mountain and I don't see how it can deliver the 10x perf/W advantage over GPU necessary to motivate code porting.
 
You need to differentiate between Knights Landing and Knights Corner (which was a disappointment); the downside with the latest is the extra coding required to optimise for KNL.
Cheers
What extra coding is required for KNL?

Could you please describe it for me?
 
Well, you do not need to, but for large-scale projects it is pretty important to modernise/optimise for KNL. That is more obvious when going from, say, Haswell to KNL, but it also applies when upgrading from Knights Corner.
Worth noting that some in the link you provided also differentiate between Knights Corner and Knights Landing; the complaints in that thread come from people with experience using Knights Corner.

There is quite a bit of info out there relating to projects implementing KNL, such as from Baidu/NERSC, and from Intel with frameworks such as Caffe/Torch and GEMM algorithms.
https://software.intel.com/en-us/ar...ations-from-knights-corner-to-knights-landing
https://software.intel.com/sites/default/files/managed/b4/3a/319433-024.pdf

http://www.nersc.gov/users/computat...ce/getting-started-and-optimization-strategy/
Some good case studies such as Berkeley in that list: http://www.nersc.gov/users/computat...ing-and-performance/application-case-studies/

I also had some articles on vectorization considerations and a couple of other large scale projects along with the modernization/optimization of Caffe and Torch but cannot find them for now, and well with Olympics on kinda distracted :)
Cheers
 
I wonder what impact the Thunderbolt GPU docks will have on all the high-end mobile GPUs that Nvidia has announced so far.
 
Just sounds like another thing to carry around; OK for LAN parties, but that is about it, is my guess. I never really thought they were high-selling products.
 
I wonder what impact the Thunderbolt GPU docks will have on all the high-end mobile GPUs that Nvidia has announced so far.

No more (and probably less) than the impact they have today (which is negligible).
http://www.nvidia.com/object/drive-px.html
  • Scalable from 1 to 4 processors (2 next generation Tegra SoC and 2 Pascal GPUs)
  • Dual NVIDIA Tegra® processors delivering a combined 2.5 Teraflops
  • Dual NVIDIA Pascal discrete GPUs delivering over 5 TFLOPS and over 24 DL TOPS
  • Interfaces for up to 12 cameras, radar, lidar, and ultrasonic sensors
  • Periodic software/OS updates
- See more at: http://www.nvidia.com/object/drive-px.html#sthash.auXSPWNS.dpuf

Cross-posting from the Tegra thread. That gives us over 2.5 TFLOPS per GPU. So GP107 is looking like 768 CCs at slightly over 1.6 GHz (no issues with the Samsung process then...). Pretty much what I expected, and it should be a solid mid-range and laptop GPU. (FWIW my speculation for GP108 is 512 CCs, 8 ROPs, 64-bit.)
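
The 768 CCs / 1.6 GHz guess can be sanity-checked with the usual peak-FP32 arithmetic. This is only a sketch: it assumes each CUDA core retires one fused multiply-add (counted as 2 FLOPs) per clock, and that the "over 5 TFLOPS" quoted for the two discrete GPUs splits evenly to 2.5 TFLOPS each. The function names are illustrative.

```python
FLOPS_PER_CORE_PER_CLOCK = 2  # assumption: one FMA per core per clock = 2 FLOPs


def fp32_tflops(cuda_cores: int, clock_ghz: float) -> float:
    """Peak FP32 throughput in TFLOPS for a given core count and clock."""
    return FLOPS_PER_CORE_PER_CLOCK * cuda_cores * clock_ghz / 1000


def clock_ghz_for(target_tflops: float, cuda_cores: int) -> float:
    """Clock (GHz) needed to reach a target peak FP32 throughput."""
    return target_tflops * 1000 / (FLOPS_PER_CORE_PER_CLOCK * cuda_cores)


print(f"768 CCs @ 1.6 GHz: {fp32_tflops(768, 1.6):.2f} TFLOPS")
print(f"Clock needed for 2.5 TFLOPS on 768 CCs: {clock_ghz_for(2.5, 768):.2f} GHz")
```

At exactly 1.6 GHz a 768-core part lands just under 2.5 TFLOPS, which is why the clock has to be slightly above 1.6 GHz (roughly 1.63 GHz) to reach the quoted figure.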
 
First official notification of 1060 3GB we got was from Inno3d, followed by EVGA:
http://www.inno3d.com/products.php?refid=1&subid=18
http://eu.evga.com/articles/01047/evga-geforce-gtx-1060-3gb/

So it's official now: 3 GB and 1152 ALUs, with only the „3GB“ differentiating the card from the 6-GB-SKU on the outside, confusion inbound.

Pricing for Germany seems to be 219 EUR including all necessary taxes.

Welp, time to start warning my customers: if they decide to buy a GTX 1060, the 3 GB version will be slower than the 6 GB versions at the same clock speeds, beyond just the memory difference.

I find it rather bizarre that they did this without some nomenclature other than memory difference, which in the past hasn't meant significantly different specifications other than memory, to denote that it is a less capable card.

Regards,
SB
 
Are you sure about those ROP numbers? All other Pascal cards have 16 ROPs per 64 bit of memory bus. So 16 ROPs seem far more likely to me.

Not sure at all... pure speculation on my part. But there's no reason it can't be different. With Maxwell, GM206 and GM107 had 32 ROPs and 16 ROPs respectively, both on a 128-bit bus. And GM108 was 8 ROPs on a 64-bit bus, which is what GP108 would be a successor to.
 
I know there's no real reason it can't be different, but logic dictates it's probably going to be 16. In the examples you give, both GM1xx chips have something in common: 8 ROPs per memory partition, the same amount as Kepler chips had. GM206 has 16, which is something it shares with every other GM2xx chip and all Pascals so far. The difference in ROPs between GM107 and GM206, despite both being 128-bit, is explained by the fact that they each belong to a different family. For what it's worth, even Tegra X1 had 16 ROPs (also based on Maxwell 2).

EDIT: In a way, what I'm trying to say is this: there's no reason it can't be different, but is there any reason for it to actually be different?
 
Could someone wrap up Pascal generations half-precision details? Do they all have 2x FP32 performance?
 
They all support 2x FP32 instructions for compatibility reasons. But except for GP100, they don't achieve even remotely 2xSP throughput, as these instructions are only executed by 2 half-precision-FPUs per SMM. Only the non-vectorized, legacy half-precision instructions run at full SP rate.
 
I know there's no real reason it can't be different, but logic dictates it's probably going to be 16. In the examples you give, both GM1xx chips have something in common: 8 ROPs per memory partition, the same amount as Kepler chips had. GM206 has 16, which is something it shares with every other GM2xx chip and all Pascals so far. The difference in ROPs between GM107 and GM206, despite both being 128-bit, is explained by the fact that they each belong to a different family. For what it's worth, even Tegra X1 had 16 ROPs (also based on Maxwell 2).

EDIT: In a way, what I'm trying to say is this: there's no reason it can't be different, but is there any reason for it to actually be different?

Well... the reason is usually to balance out the chip. GM206 has 32 ROPs because it has 10 SMs compared to 16 and 6 for GM107. A 256-bit GP104 has 64 ROPs and 20 SMs. I can't imagine a 128-bit GP107 with 6 SMs (30%) needing 32 ROPs (50%), for example.
 
There's a couple mistakes there. GM206 had 8 SMs and GM107 had 5 SMs.

And I get your point, but I think that the relation between memory controller and ROPs is stronger than the relation between SMs and ROPs, whatever the reason may be (probably convenience for not having to redesign the ROP/L2/Mem partition).

History tells just as much, as I'll show down below. Maxwell 2 was unique in that GM200, GM204 and GM206 all had the exact same balance of units, but nearly all other families have a different mix.

GP106 has only 10 SMs but 48 ROPs on a 192-bit interface, compared to 64 ROPs on a 256-bit interface for GP104 (20 SMs) and 96 ROPs, 384-bit and 30 SMs on GP102. Here the constant is 16 ROPs per 64-bit interface and not a balance of units.

Not enough evidence, but let's look at Kepler (and I'll put Maxwell 1 into the mix):

GK110: 15 SMs, 48 ROPs, 384-bit --> 3.2 ROP-per-SM, 8 ROP-per-controller
GK104: 8 SMs, 32 ROPs, 256-bit --> 4 ROP-per-SM, 8 ROP-per-controller
GK106: 5 SMs, 24 ROPs, 192-bit --> 4.8 ROP-per-SM, 8 ROP-per-controller
GK107: 2 SMs, 16 ROPs, 128-bit --> 8 ROP-per-SM, 8 ROP-per-controller
GK208: 2 SMs, 8 ROPs, 64-bit --> 4 ROP-per-SM, 8 ROP-per-controller

GM107: 5 SMs, 16 ROPs, 128-bit --> 3.2 ROP-per-SM, 8 ROP-per-controller
GM108: 3 SMs, 8 ROPs, 64-bit --> 2.7 ROP-per-SM, 8 ROP-per-controller

See a pattern?
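
For anyone who wants to check the ratios, here is a small sketch that recomputes them from the SM/ROP/bus-width figures quoted in this post. It assumes one memory controller per 64 bits of bus width; the names are only illustrative.

```python
# (chip, SMs, ROPs, bus width in bits), figures as quoted in the post above
CHIPS = [
    ("GK110", 15, 48, 384),
    ("GK104", 8, 32, 256),
    ("GK106", 5, 24, 192),
    ("GK107", 2, 16, 128),
    ("GK208", 2, 8, 64),
    ("GM107", 5, 16, 128),
    ("GM108", 3, 8, 64),
    ("GP106", 10, 48, 192),
    ("GP104", 20, 64, 256),
    ("GP102", 30, 96, 384),
]

CONTROLLER_WIDTH_BITS = 64  # assumption: one controller per 64-bit slice


def rop_ratios(sms: int, rops: int, bus_bits: int) -> tuple[float, float]:
    """Return (ROPs per SM, ROPs per 64-bit memory controller)."""
    controllers = bus_bits // CONTROLLER_WIDTH_BITS
    return rops / sms, rops / controllers


for name, sms, rops, bus_bits in CHIPS:
    per_sm, per_ctrl = rop_ratios(sms, rops, bus_bits)
    print(f"{name}: {per_sm:.1f} ROP-per-SM, {per_ctrl:.0f} ROP-per-controller")
```

ROP-per-SM varies from chip to chip, while ROP-per-controller stays at 8 for every Kepler and Maxwell 1 part listed and at 16 for the three Pascal parts.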
 
There's a couple mistakes there. GM206 had 8 SMs and GM107 had 5 SMs.

Ahh, my bad... got it mixed up with GP106 and GP107. But I think you got what I meant.
I get what you're saying and you could well be right. Like I said, that was purely speculation. Either way, it's only a matter of a few weeks before we find out.
 