Recent content by LiXiangyang

  1. Nvidia Ampere Discussion [2020-05-14]

    Has anyone tested the 3090 yet? Some machine-learning programmers in China report disappointing Tensor Core performance on the 3090: basically no gain over Turing, and sometimes even slower than the latter.
  2. Nvidia Ampere Discussion [2020-05-14]

    Well, the compiler can generate different instructions for different GPU architectures as well, and in practice it can reorder instructions to better fit each architecture's pipeline; a good programmer can likewise make the most of the target architecture by allocating resource...
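    As a sketch of that point, assuming the usual CUDA toolchain: __CUDA_ARCH__ lets one source compile to different device code per target, and __launch_bounds__ is one way a programmer trades threads per block against registers per thread (the kernel and the path choices below are made up for illustration):

        // Hypothetical kernel: one source, different device code per target arch.
        __global__ void __launch_bounds__(256)   // cap threads/block so the compiler
        scale(float* data, int n, float s)       // can budget more registers per thread
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
        #if __CUDA_ARCH__ >= 700
            // Volta and newer: simple one-element-per-thread path.
            if (i < n) data[i] *= s;
        #else
            // Older targets: grid-stride loop to keep more work per thread in flight.
            for (int j = i; j < n; j += gridDim.x * blockDim.x) data[j] *= s;
        #endif
        }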
  3. Nvidia Ampere Discussion [2020-05-14]

    The application may not know much about the hardware configuration (well, unless you are an informed programmer), but the compiler sure does...
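    (For what it's worth, an informed application can query the configuration at run time; a minimal sketch using the standard runtime API:)

        // Ask the CUDA runtime what device 0 actually is.
        #include <cstdio>
        #include <cuda_runtime.h>

        int main() {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, 0);
            printf("%s: SM %d.%d, %d SMs, %zu KB shared memory per SM\n",
                   prop.name, prop.major, prop.minor, prop.multiProcessorCount,
                   prop.sharedMemPerMultiprocessor / 1024);
            return 0;
        }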
  4. Nvidia Ampere Discussion [2020-05-14]

    Well, if Volta's TDP is any indication, the higher TDP may already take the extra FP32 units into account. For instance, regardless of the compute load (large SGEMM and Tensor Core HGEMM included), the real power draw of my Volta rarely reaches more than 80% of its TDP unless you do...
  5. Nvidia Ampere Discussion [2020-05-14]

    The benchmark may not be able to take advantage of GA102's new FP32 units if it has not been recompiled with the new arch/sm options, so the performance gain here may simply come from the higher boost clock.
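    For reference, targeting GA102 natively means building for compute capability 8.6 (supported from CUDA 11.1); something along these lines, where bench.cu is just a placeholder name:

        nvcc -O3 -gencode arch=compute_86,code=sm_86 \
                 -gencode arch=compute_86,code=compute_86 bench.cu -o bench

    The second -gencode entry embeds PTX so the binary can still be JIT-compiled for architectures newer than the ones it was built for.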
  6. Nvidia Ampere Discussion [2020-05-14]

    I have many in-house ML algorithms that benefit greatly from higher FP32 CUDA core counts, but not so much from more Tensor Cores, so if Nvidia's new FP32 CUDA "cores" are really just as capable as the old ones, then I am impressed with the product (I actually have the feeling of skip...
  7. Nvidia Volta Speculation Thread

    Don't have time to test it on AMBER, but for the computing software (mostly in-house) I have tested so far, the Titan V works just as well as my other cards and always produces reproducible results unless the software is designed not to, but I have only tested it on a 3-GPU workstation with 2 of...
  8. Nvidia Volta Speculation Thread

    Contacted a local Nvidia guy; it seems the boost on the Titan V is just that low (1335MHz for my two cards, and the boost is much less flexible than on an average GeForce, more like a Tesla/Quadro, so maybe the Titan V should be renamed Tesla V80 instead), but when playing games, the card...
  9. Nvidia Volta Speculation Thread

    Never mind, I just checked the CUDA 9.1 documentation; it seems that cublasSgemm will just convert FP32 to FP16 when Tensor Core math is enabled:
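    A minimal sketch of the opt-in being described (matrix contents are left uninitialized, this only shows the call sequence; link with -lcublas):

        #include <cublas_v2.h>
        #include <cuda_runtime.h>

        int main() {
            const int n = 4096;
            float *A, *B, *C;
            cudaMalloc(&A, sizeof(float) * n * n);
            cudaMalloc(&B, sizeof(float) * n * n);
            cudaMalloc(&C, sizeof(float) * n * n);

            cublasHandle_t handle;
            cublasCreate(&handle);
            // With this mode set, cuBLAS may down-convert the FP32 inputs to FP16
            // internally so the SGEMM can run on the Tensor Cores.
            cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);

            const float alpha = 1.0f, beta = 0.0f;
            cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                        &alpha, A, n, B, n, &beta, C, n);

            cublasDestroy(handle);
            cudaFree(A); cudaFree(B); cudaFree(C);
            return 0;
        }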
  10. Nvidia Volta Speculation Thread

    Just got my 2 Titan Vs today and have tested a few kernels; the results are good, but it seems the boost clock is overstated: in my tests the GPU boost clock only reaches 1355MHz, far lower than my GP102, which can reach 1850+MHz. The most interesting part is the GEMM test with...
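    For context, a GEMM throughput figure like this is usually measured with CUDA events around the library call, counting 2*n^3 FLOPs per SGEMM; a rough sketch (the size and warm-up policy are arbitrary; link with -lcublas):

        #include <cstdio>
        #include <cublas_v2.h>
        #include <cuda_runtime.h>

        int main() {
            const int n = 8192;
            float *A, *B, *C;
            cudaMalloc(&A, sizeof(float) * n * n);
            cudaMalloc(&B, sizeof(float) * n * n);
            cudaMalloc(&C, sizeof(float) * n * n);
            cublasHandle_t h;
            cublasCreate(&h);
            const float one = 1.0f, zero = 0.0f;

            cudaEvent_t t0, t1;
            cudaEventCreate(&t0);
            cudaEventCreate(&t1);
            cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                        &one, A, n, B, n, &zero, C, n);   // warm-up
            cudaEventRecord(t0);
            cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                        &one, A, n, B, n, &zero, C, n);   // timed run
            cudaEventRecord(t1);
            cudaEventSynchronize(t1);

            float ms = 0.0f;
            cudaEventElapsedTime(&ms, t0, t1);
            printf("%.1f TFLOPS\n", 2.0 * n * n * n / (ms * 1e-3) / 1e12);
            return 0;
        }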
  11. Nvidia Volta Speculation Thread

    According to their own docs, they actually do the DL work through a 256x256 matrix multiply within a warp, as a mixed-precision multiply; note that there is insufficient storage in either the registers (warp-wide) or shared memory to hold the temporary results, so they have to write results back to main memory...
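    For reference, the warp-level primitive CUDA 9 exposes for this kind of mixed-precision multiply (the WMMA API) works on 16x16x16 tiles, FP16 inputs with an FP32 accumulator; a minimal sketch (build for sm_70+ and launch with a single warp):

        #include <mma.h>
        #include <cuda_fp16.h>
        using namespace nvcuda;

        // One warp computes C(16x16) = A(16x16) * B(16x16) in mixed precision.
        __global__ void wmma_tile(const half* A, const half* B, float* C) {
            wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
            wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
            wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

            wmma::fill_fragment(acc, 0.0f);
            wmma::load_matrix_sync(a, A, 16);   // leading dimension 16
            wmma::load_matrix_sync(b, B, 16);
            wmma::mma_sync(acc, a, b, acc);     // Tensor Core multiply-accumulate
            wmma::store_matrix_sync(C, acc, 16, wmma::mem_row_major);
        }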
  12. Nvidia Volta Speculation Thread

    Judging by the spec, I have a hard time believing this Titan V can achieve 110 TFLOPS of DL: the memory bandwidth on the V100 is barely sufficient to feed the mixed-precision computation, and now they have cut 1/4 of it off.
  13. Nvidia Volta Speculation Thread

    Not necessarily an improvement: since a self-driving DL computer in a car only needs to run inference with a pre-trained DL model instead of training the network itself, the 130 TOPS could very well be very low precision stuff like INT8 or even lower, just as GP102 can do nearly 50T DL ops but GP100...
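    The high INT8 numbers on GP102 come from the dp4a instruction (four 8-bit products accumulated into a 32-bit integer in one operation); a minimal sketch of the intrinsic, with a hypothetical dot-product kernel, requiring sm_61 or newer:

        // Each int packs four signed 8-bit values.
        __global__ void int8_dot(const int* a, const int* b, int n, int* out) {
            int acc = 0;
            for (int i = threadIdx.x; i < n; i += blockDim.x)
                acc = __dp4a(a[i], b[i], acc);   // 4 multiplies + 4 adds per instruction
            atomicAdd(out, acc);
        }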
  14. Nvidia Volta Speculation Thread

    The V100 has 2.85x the shared-memory bandwidth of Pascal, a greatly improved L1 cache, and a greatly improved memory controller; it also has a more flexible SIMD execution model at the warp level, and it can execute FP and INT computation (the latter usually for index calculation) at the same time...
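    On the warp-level point: Volta's independent thread scheduling lets diverged lanes interleave, so reconvergence is explicit in the programming model; a small sketch (the kernel is made up for illustration, launch with a single warp):

        __global__ void divergent(int* data) {
            int lane = threadIdx.x % 32;
            if (lane < 16)
                data[lane] += 1;                 // half the warp takes this path
            else
                data[lane] -= 1;                 // the other half takes this one
            __syncwarp();                        // reconverge before a warp-wide exchange
            // The shuffle below then sees every lane's updated value.
            data[lane] = __shfl_sync(0xffffffff, data[lane], lane ^ 16);
        }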
  15. Nvidia Volta Speculation Thread

    I have contacted a local GPU supplier in Beijing; the listed price of a Tesla V100 is a bit cheaper than I thought, and it will become available sooner as well. The Tesla V100 PCIe will cost about the same as the Tesla P100 PCIe did at launch, and about 10%-20% more expensive comparing...