Nvidia Pascal Announcement

Discussion in 'Architecture and Products' started by huebie, Apr 5, 2016.

  1. A1xLLcqAgt0qc2RyMz0y

    Regular

    Joined:
    Feb 6, 2010
    Messages:
    988
    Likes Received:
    280
    Because price is determined by performance, not by the size of the die.

    If Intel followed the logic that die size determines selling price, they would be selling processors from new nodes at much lower prices than the previous node, even though the new ones outperform the older ones. Clearly they are not doing that.

    I'm not sure why people expect Nvidia to do something that nobody else in the industry does.
     
  2. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY

    Yep, it's ~2x the cost per transistor.
     
  3. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY

    Well, going by the diagrams of the dynamic load balancing, it seems both graphics and compute kernels can run on an individual SMM, and the ALUs that finish can then switch over to another kernel when they are free (or is this done via preemption? It was kind of vague). There seems to be no time delay, though, so I'm assuming there is no need for preemption.
     
    pharma likes this.
  4. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    337
    Likes Received:
    294
    All I see on page 32 is concurrent execution of multiple grids of the same kernel on the TK1 in the final batch of stages, and both grids even had the same dimensions(!). I don't see overlap of different kernels, apart from a small measurement error most likely originating from monitoring the completion signals on the CPU long after the GPU has internally dispatched the next grid.

    If there were any parallel execution on the same SM, it should have shown up in the early stages as well, where only different kernels were involved, and not just in the final stages. Only the K40 shows a measurable overlap between different kernels, but it also has more than one SMX, so that was expected.

    On the TK1, using multiple streams really just cut the overhead of switching between kernels, to the point where the order of the signals on the CPU appears inverted, but not to the point where we could speak of concurrent execution.

    So far, that looks more like a case in point that different kernels cannot run in parallel; but the workload wasn't orthogonal as requested, so we can't rule out that the concurrency was merely prevented by the whole SMX coincidentally being saturated for the entire duration of one grid.
    Yes, from compute to graphics at instruction granularity, from graphics to compute at vertex/pixel granularity. I think it was hinted, though, that the driver needs to trigger the rescheduling; it's not entirely automated. And I don't know if this actually involves preemption, or if it is just triggered when the SMM runs idle, to avoid the overhead involved with preemption.
     
    nnunn likes this.
  5. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
    Well, if it's using preemption, there should be some latency incurred; we shall see when tests come out. At least more latency than the driver forcing a reschedule, I presume.
     
  6. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Yeah, I thought Dynamic Parallelism (and that was an example they provided somewhere) changed between Fermi, then Kepler, and now Pascal.
    I thought Maxwell was the same as Kepler *shrug*.
    Ah well, I wonder what changed in it then.
    Cheers
     
  7. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    711
    Likes Received:
    282
    GPU performance has increased exponentially over the last 15 years; prices have not.
    The same could be said of ARM CPUs.

    It's undeniable that over the last couple of generations the introduction price of GPUs has gone up by $50+.
    A healthy competitive ecosystem keeps prices low.
    We can only hope AMD comes up with something great again.
    (Speaking, of course, as a customer, not a stakeholder.)
     
    #927 Voxilla, May 16, 2016
    Last edited: May 16, 2016
  8. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    2,935
    Likes Received:
    1,629
    Hopefully we will know more once some detailed reviews start to come in, though not sure many will delve into this aspect.
     
  9. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    The leaked slide on that mentioned a "sub 100usec latency". So probably still potentially something like 100,000 clock cycles. Nothing I would call blazing fast. Perfectly fine for a single asynchronous time warp kernel call per frame, but a hundred context switches may add up to something significant. But who knows how this will work in practice. That needs to be tested.
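    A quick back-of-envelope sketch in Python puts those numbers in context (the ~1 GHz core clock and the 60 Hz frame budget are illustrative assumptions, not figures from the slide):

```python
# Rough arithmetic behind the "sub 100usec" figure (illustrative numbers only).
switch_latency_s = 100e-6   # leaked slide: sub-100-microsecond context switch
gpu_clock_hz = 1e9          # assumed ~1 GHz core clock (not from the slide)
frame_budget_s = 1 / 60     # assumed 60 Hz frame: ~16.7 ms

cycles_per_switch = switch_latency_s * gpu_clock_hz
print(f"cycles per switch: {cycles_per_switch:,.0f}")  # ~100,000 cycles

# One asynchronous timewarp switch per frame is negligible...
print(f"1 switch:     {switch_latency_s / frame_budget_s:.1%} of frame")
# ...but a hundred switches eats a large slice of the frame budget.
print(f"100 switches: {100 * switch_latency_s / frame_budget_s:.1%} of frame")
```

    So under these assumptions a single switch is well under 1% of a frame, while a hundred of them consumes more than half of it.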
     
    #929 Gipsel, May 16, 2016
    Last edited: May 16, 2016
    nnunn and Razor1 like this.
  10. pixelio

    Newcomer

    Joined:
    Feb 17, 2014
    Messages:
    47
    Likes Received:
    75
    Location:
    Seattle, WA
    Makespan drops from 84 ms to 78.9 ms using concurrent kernels.

    The kernels at the end of the job are small enough to cohabitate the single SMX.

    Concurrency doesn't mean preemption... hopefully we are discussing the same thing.
     
  11. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    337
    Likes Received:
    294
    The grids at the end of the job are apparently also identical in grid dimensions and in the kernel invoked; only the passed parameters differ, at least if I didn't mistake the context. This is still in line with the assumption that different kernels cannot simultaneously cohabitate the same SMX. At no point was it stated that they would have to belong to the same grid.
    But preemption is a possible method of providing concurrency. The question is whether we can have simultaneous, and not merely concurrent, execution.
     
  12. Grall

    Grall Invisible Member
    Legend

    Joined:
    Apr 14, 2002
    Messages:
    10,801
    Likes Received:
    2,172
    Location:
    La-la land
    Except I didn't dispute that.

    In fact, it's pretty much implied by my statement. :p

    Didn't dispute that either. Although as a customer I wouldn't mind less moneygrubbing from NV, but lacking any true competition there's no pressure on them not to gouge the hell out of their users, just like Intel is doing with their high-end chips. $1.5k for an 8-core Socket 2011 CPU, whee... *golfclap*
     
  13. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,128
    Likes Received:
    2,887
    Location:
    Well within 3d
    The option of allowing a wavefront a fixed period of time to complete before a context switch is invoked was proposed in some of AMD's descriptions of preemptible compute. That seems in line with other QoS measures that try to provide predictable or deterministic latencies rather than the absolute minimum latency.

    Some of that latency could come from that kind of grace period, since that's plenty of time to write out context, or it might give an idea of the latency involved in determining at the driver level that something must be preempted and signalling it. That might be fast for an external request, given the queuing delays that can be experienced in other scenarios.

    That might not be used in the allocation method, similar to how AMD's priority queue functionality (edit for clarity: quick response queue) has a minimum GCN version that cannot preempt graphics.

    The dynamic resource allocation slide, if it accurately describes prior Nvidia GPUs in the static scenario, might indicate they lacked the ability to rapidly check SM availability for compute issue.
    That ability would seemingly fall out of the implementation of preemption functionality (what's the point of preempting if you don't know when a switch is ready?), even if the allocation method itself does not use a preemption request. There does not seem to be any actual change in the area devoted to the graphics portion, so it's not getting cut off.
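    The grace-period idea above can be sketched as: give the running wavefront a fixed window to retire on its own, and pay the full context-save cost only if it overruns. A toy model in Python (all names and numbers are hypothetical, not vendor figures):

```python
def switch_cost(remaining_us, grace_us=30.0, save_cost_us=70.0):
    """Toy model of grace-period preemption (hypothetical numbers).

    remaining_us: time the running wavefront still needs to finish.
    If it retires within the grace window, the switch costs only that wait;
    otherwise we wait out the window and pay the full context-save cost.
    """
    if remaining_us <= grace_us:
        return remaining_us            # wave drains voluntarily, no state saved
    return grace_us + save_cost_us     # forced context save after the window

print(switch_cost(5.0))    # short wave: 5.0 us, cheapest case
print(switch_cost(500.0))  # long wave: 100.0 us, the worst case a QoS bound must budget
```

    The worst case in this toy (window plus save cost) is the deterministic bound a QoS guarantee would have to quote, which is consistent with quoting a latency ceiling rather than a typical latency.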
     
    #933 3dilettante, May 16, 2016
    Last edited: May 16, 2016
    nnunn likes this.
  14. xEx

    xEx
    Regular Newcomer

    Joined:
    Feb 2, 2012
    Messages:
    939
    Likes Received:
    398
    Price is determined by how much it costs to make the product versus how much consumers are willing to pay for it. If sales of the 1080 drop because it's too expensive, then Nvidia will have to drop the price. Something we saw (and suffered) with Intel: selling three-year-old CPUs for more than when they came out. A complete insult in my opinion, but as long as people keep giving them 5 stars on Amazon they are going to keep selling, and the prices (even for used ones...) won't drop a single dollar.
     
  15. pixelio

    Newcomer

    Joined:
    Feb 17, 2014
    Messages:
    47
    Likes Received:
    75
    Location:
    Seattle, WA
    No, they're not. They wouldn't have shorter runtimes if they were the same size grid.

    Different kernels can run concurrently on the same SM if there are enough resources.

    Here's an example: https://gist.github.com/allanmac/049837785a10b7999fce6ca282f62dc6

    Kernel "a" and "b" are similar and both require half of the Maxwell v1's shared memory.

    Thus the expectation is that only a pair of kernels can fit into an SMM.

    Two timelines are shown running on a 3 SMM Quadro K620:
    1. Launch 6 kernels on 6 streams. The first kernel is "b" the rest are "a". All run concurrently.
    2. Launch 7 kernels on 7 streams. The first and 7th are "b". The rest are "a". Six run concurrently and the seventh runs by its lonesome.
    This is reliable and routine behavior that CUDA devs depend on.

    Any questions?

    ck --device=1 --nkernels=6
    (timeline image: all six kernels run concurrently)

    ck --device=1 --nkernels=7
    (timeline image: six kernels overlap, the seventh runs alone)
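    The resource-fit behavior described above can be modeled with a toy first-fit allocator in Python (3 SMMs with room for two half-shared-memory kernels each, matching the Quadro K620 setup described; the allocator itself is a guess for illustration, not NVIDIA's actual policy):

```python
def schedule(n_kernels, n_sm=3, slots_per_sm=2):
    """First-fit toy: each kernel needs half an SMM's shared memory,
    so each SMM can hold at most two resident kernels at a time."""
    free = [slots_per_sm] * n_sm
    placed, waiting = [], []
    for k in range(n_kernels):
        for sm in range(n_sm):
            if free[sm] > 0:
                free[sm] -= 1
                placed.append((k, sm))   # kernel k becomes resident on SMM sm
                break
        else:
            waiting.append(k)            # no SMM has room: runs after a drain

    return placed, waiting

placed, waiting = schedule(6)
print(len(placed), waiting)   # 6 []  -> all six run concurrently
placed, waiting = schedule(7)
print(len(placed), waiting)   # 6 [6] -> the seventh runs by its lonesome
```

    The toy reproduces both timelines: six kernels fill the 3x2 slots exactly, and a seventh has nowhere to go until something retires.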
     
    elect, nnunn, CarstenS and 3 others like this.
  16. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    337
    Likes Received:
    294
    No, thank you very much for proving it properly and providing a reproducible test.
    The second run with two "b" kernels doesn't allow any conclusion, as they might have been matched, but the first one with the 5:1 ratio does.
     
  17. pixelio

    Newcomer

    Joined:
    Feb 17, 2014
    Messages:
    47
    Likes Received:
    75
    Location:
    Seattle, WA
    If you run the example for 6, 12, 18, 24, 30, etc. times you'll see many different concurrent arrangements of "a" and "b" kernels.

    Translation: the scheduling rule is probably somewhat dynamic, a mix of first-fit and round-robin. I wouldn't be surprised if there were a heuristic hiding in there to drain more of the same kernel type before launching a different kernel on the same SM (for many, many reasons).

    For 600 concurrent kernels it looks like this (purple is launch mod 6 = 0):

    (timeline image: the varying concurrent arrangements of the 600 kernels)
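    For contrast, a round-robin placement (again only a guess at the hardware's heuristic, for illustration) spreads the same six kernels across the SMMs in a different order than plain first-fit, which fills each SMM before moving on:

```python
def round_robin(n_kernels, n_sm=3, slots_per_sm=2):
    """Round-robin toy placement: rotate across SMMs instead of filling
    each one before moving on (pure speculation about the real heuristic)."""
    free = [slots_per_sm] * n_sm
    placed = []
    sm = 0
    for k in range(n_kernels):
        tried = 0
        # Skip over full SMMs, giving up after one full rotation.
        while free[sm] == 0 and tried < n_sm:
            sm = (sm + 1) % n_sm
            tried += 1
        if free[sm] > 0:
            free[sm] -= 1
            placed.append((k, sm))   # kernel k lands on SMM sm
            sm = (sm + 1) % n_sm     # next kernel starts at the next SMM
    return placed

print(round_robin(6))   # kernels 0..5 land on SMMs 0,1,2,0,1,2
```

    Either policy fills all six slots, so the occupancy result is the same; only the arrangement of kernels across SMMs differs, which would show up as the varied patterns in the timelines.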
     
    elect, nnunn, CSI PC and 3 others like this.
  18. A1xLLcqAgt0qc2RyMz0y

    Regular

    Joined:
    Feb 6, 2010
    Messages:
    988
    Likes Received:
    280
    What time tomorrow (May 17, 2016) does the GTX 1080 NDA expire?
     
  19. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Would be interesting if someone could run this on a Fermi GPU, to see if there is a difference between the architectures.
    Cheers
     
  20. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,490
    Likes Received:
    400
    Location:
    Varna, Bulgaria
    PCPer will host a live stream with NV's Tom Petersen at 10:00am PST (1:00pm EST), so I guess it'll be by that time -- http://www.pcper.com/live/
     
    #940 fellix, May 16, 2016
    Last edited: May 16, 2016
    pharma and pixelio like this.