Is everything on one die a good idea?

Discussion in 'Architecture and Products' started by punchinthejunk, Jul 21, 2014.

  1. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,406
    Likes Received:
    430
    Location:
    Cleveland, OH
    Starting with the cache, here's an example of an A15 die shot:

    http://www.techinsights.com/inside-samsung-galaxy-s4/

    I'm not looking at pixels exactly or anything, but I'd give a really rough estimate that the icache + dcache + tags constitute about 1/3rd of the core area (this isn't including L2, which is off to the side somewhere). Multiply that by 3 and you've got 5/3 the space - not really 2x but it's getting pretty close. I expect L1 ITLB and DTLB sizes to roughly scale with L1 cache sizes.

    nVidia also claims that branch prediction will be better (not just a lower misprediction penalty, which they also claim, but a lower misprediction rate). This tends to imply bigger, sometimes much bigger data structures for tracking different types of branch histories and voting on the results. Loosely related is the 1K optimized branch target cache which normal CPUs won't have.

    Denver will save on not needing as many ARM decoders, renamers, a ROB, schedulers, etc. But if it's anything like Efficeon in this regard, I expect it to have larger speculation buffers than a normal in-order core so it can speculate on things like alias prediction over larger and less dynamic regions than a pipeline shadow. This will need various buffers and potentially some big structures to look up conflicts. Likewise, some buffers will be needed for holding additional state for runahead execution, if it indeed supports it.

    Now for a totally made up guess as to how the compilation and optimization process could work. I emphasize, this is all totally made up, but it's too annoying to annotate everything with "I'm heavily speculating on an idea or something very very vaguely similar" so please just read that implicitly :p

    I'm thinking something like dynamic statistics for the top N blocks, where a new block enters the list and increments an execution counter, while all the execution counters are periodically decremented, so all but hotly executed blocks fall off the list. Blocks that pass a threshold of execution move to another list, while also being marked for recompilation.

    The second list contains some profile statistics for the next top N recompiled blocks. These blocks are continually monitored for catastrophic events - big slow things that fall outside the critical path like cache and/or TLB misses, branch mispredictions, alias mispredictions, I/O accesses, and so on. But the amount of actual data that can be stored is limited because multiple blocks are tracked simultaneously. Like the execution counters, the numbers decay over time automatically to emphasize the hottest offenders. When a block passes a threshold, it's flagged to be compiled again.

    A block that's flagged for compilation or recompilation will run through the plain ARM decoders again, but this time will also collect more in-depth profiling data stored in one buffer just for that block. After the block is executed (or maybe executed a few or all times if it's a loop?) the core will interrupt to the recompiler, which will have the recent live statistics data in addition to the accumulated statistics data to inform its decisions.
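    Purely as an illustration of the two-tier counter scheme I'm describing (all names, thresholds, and data structures here are my own invention, not anything nVidia has documented), a software sketch might look like this:

    ```cpp
    #include <cstdint>
    #include <unordered_map>

    // Hypothetical two-tier hot-block tracker; purely illustrative.
    struct BlockStats {
        uint32_t execCount = 0;   // decayed periodically so cold blocks fall away
        uint32_t missEvents = 0;  // cache/TLB misses, mispredicts, etc.
    };

    class HotBlockTracker {
        std::unordered_map<uint64_t, BlockStats> tier1;  // candidate blocks
        std::unordered_map<uint64_t, BlockStats> tier2;  // recompiled, monitored blocks
        static constexpr uint32_t kPromoteThreshold   = 1000;
        static constexpr uint32_t kRecompileThreshold = 200;

    public:
        // Called on every block execution (in hardware this would be a counter table).
        void onExecute(uint64_t blockAddr) {
            BlockStats &s = tier1[blockAddr];
            if (++s.execCount >= kPromoteThreshold) {
                tier2[blockAddr] = s;      // promote to the profiled/recompiled list
                tier1.erase(blockAddr);
                flagForRecompilation(blockAddr);
            }
        }

        // Called when a "catastrophic" event hits a block that was already optimized.
        void onMissEvent(uint64_t blockAddr) {
            auto it = tier2.find(blockAddr);
            if (it != tier2.end() && ++it->second.missEvents >= kRecompileThreshold)
                flagForRecompilation(blockAddr);  // re-optimize with fresh profile data
        }

        // Periodic decay so only persistently hot (or persistently misbehaving)
        // blocks stay above the thresholds.
        void decay() {
            for (auto &kv : tier1) kv.second.execCount >>= 1;
            for (auto &kv : tier2) kv.second.missEvents >>= 1;
        }

    private:
        void flagForRecompilation(uint64_t /*blockAddr*/) { /* hand off to the optimizer */ }
    };
    ```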

    For the very hottest blocks that benefit from it the most there could be multiple copies following long traces spanning multiple branches. It's possible that the branch prediction hardware itself works with the software via a mechanism that reduces or eliminates the overhead of picking traces based on future branch paths.
     
  2. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    The same results can be obtained with profile-guided optimization. The only advantage of doing it at run-time is that it can optimize legacy binaries for newer architectures. That's very helpful, and I mentioned it as one of the advantages of out-of-order execution as well, but it just can't achieve the performance of out-of-order execution for real-world applications. You don't need memory latencies to be very erratic to gain from out-of-order execution. With static scheduling you can't take any variation into account. So if you predict data to be in L2 then you end up stalling for the cases where it's in L3 or RAM, and for the times it's actually in L1 you're unnecessarily delaying instructions, which might affect performance.

    Also, startup time has a big effect on the user experience, and that's where dynamic code generation makes things worse. The dynamic code optimization thread can preempt foreground programs, and that adds unwanted variability in execution time. JIT-compiled languages are typically used by less performance critical applications, but when it's done as part of the architecture, everything gets penalized. Out-of-order execution does not make things worse for software that is already highly optimized.

    It's an interesting variation on Transmeta technology, but I'm highly skeptical it's going to float this time around. To quote Linus Torvalds: it's BS.

    Don't get me wrong. I would love for there to be an easier path toward CPU-GPU unification. Denver certainly is a step closer to the in-order architecture of the GPU, and perhaps NVIDIA will achieve unification that way. I just don't think it will be at the same practical performance level, and achieving that level will require out-of-order execution, or an equal amount of complexity. I also think dynamic code generation is very valuable, but it should be used selectively in a controllable or predictable way.

    Denver will have to compete against Broadwell, which features a turbo clock of up to 2.6 GHz, AVX2, and 5% higher IPC. Meanwhile NVIDIA's white paper on Denver doesn't exhibit a whole lot of confidence: "It's a big gamble for the company and we're about to see if it paid off". My bets are on the sure thing.
     
  3. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,406
    Likes Received:
    430
    Location:
    Cleveland, OH
    Not really. Profile-guided optimization has two big limitations vs. runtime recompilation:

    1) The statistics are based on a single profiling run instead of a specific user's inputs and actions
    2) The statistics provide one static data point which is supposed to characterize a best case for all runs

    And the more you try to generalize #1 the more it'll make #2 a smeared input.

    Note that nVidia's approach can include recompiling the same code multiple times at runtime as conditions change. Transmeta has in fact described some conditions where they've done this.

    There are some residual effects too, like profiling overhead potentially influencing the execution shape of what you're profiling.
     
  4. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,493
    Likes Received:
    474
    While you're correct that GPUs often push out data the previous thread was using, GPUs do clock gate quite often. If they didn't have this opportunity, peak power would be reached more often.
     
  5. Andrew Lauritzen

    Andrew Lauritzen Moderator
    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,629
    Likes Received:
    1,227
    Location:
    British Columbia, Canada
    It's not about the scheduling, it is literally about tracing the path through the ideal data dependency graph. The graph being wide doesn't change the fact that you ultimately still need something to rip through that path. I have never been talking about bad code here, but the ideal complexities of these algorithms. As I've qualified, this is not likely to be a fundamental issue in what are still relatively narrow GPUs today, but the trajectory is clearly towards it getting worse. This isn't conjecture... you can measure how much time these sorts of critical paths take relative to the throughput of the GPUs and it's obviously getting worse over the years (i.e. they are taking up a larger percentage of the total frame time).

    I also regularly need to point out to people that even distributed scheduling and "lockless" data structures obviously require atomics, which is fundamentally synchronization too. It's obviously the way to go and in a good cache system it does localize the conflicts somewhat, but it really just changes the constants, not the overall argument. TSX is similar - it just helps make synchronization dynamic and finer grained (i.e. cache line level) which is often difficult to do statically (as efficiently) with data-driven dependencies.
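    To make the "lockless still means synchronization" point concrete, here's a minimal sketch of a Treiber-style lock-free stack push (the structure and names are just for illustration): there's no mutex anywhere, but the compare-and-swap on the head pointer is still a serialization point between conflicting threads.

    ```cpp
    #include <atomic>

    struct Node {
        int value;
        Node *next;
    };

    std::atomic<Node *> head{nullptr};

    void push(Node *n) {
        n->next = head.load(std::memory_order_relaxed);
        // compare_exchange_weak reloads the current head into n->next on failure,
        // so we simply retry until no other thread raced us on the same line.
        while (!head.compare_exchange_weak(n->next, n,
                                           std::memory_order_release,
                                           std::memory_order_relaxed)) {
        }
    }
    ```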

    So while it's not something you guys necessarily need to think about for a while, it absolutely is something people designing hardware have to think about :) I think it's questionable to say that these critical paths will always be short enough that you can - for instance - take a big hit in serial performance and still be fine 10-20 years from now.

    Anyways enough said on that as we're sort of off in the weeds, but I wanted to clarify the high level point.

    Theory in this space tends to fall on deaf ears I'm afraid... you can always make the argument that a magic optimizing compiler with perfect knowledge can JIT the ideal schedule for any given situation. I've even made similar arguments myself over the years :) Unfortunately these are not new arguments and previous attempts to apply the concepts haven't exactly met with stunning success.

    If NVIDIA has somehow done something fundamentally different and it works way better, that's great. But it's fair to say that scepticism is highly warranted until we get to see these chips in the wild. Unfortunately these sorts of architectures are also the type that make benchmarks almost completely useless in terms of predicting the performance in arbitrary workloads. Incidentally that's not all that different from GPUs I suppose :)
     
  6. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,406
    Likes Received:
    430
    Location:
    Cleveland, OH
    Hey, I'm not under any illusion here that nVidia's approach can close the gap between typical in-order (or even worse, exposed pipeline VLIW) and proper OoOE, particularly for a wide variety of use cases. I'm just saying, it's not right at all to say that runtime compilation and recompilation offers no opportunities beyond traditional PGO.

    In nVidia's case I think this is going to be less about making up all the difference with the compiler and more about trying to make it up by spending the saved power budget on other things (more cache, better branch prediction, aggressive prefetchers, run-ahead execution, etc.). But for the time being I also don't put any real stock in their numbers, much less would I consider them representative for many programs.
     
  7. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Are you speaking of a double compare and swap?
    Is it a double compare and swap of two locations at arbitrary addresses?
    Since we are mixing discussion between CPU and GPU, by cache line do you mean it updates in chunks of 64 contiguous and aligned bytes (CPU packed operation or a GCN vector op with sequential addresses) or that the scheme works through the cache subsystem in line quantity, but it doesn't have to actually write those lines fully (divergent/masked/serialization at a lane level)?

    I thought the discussion of TSX was arguing for transactional memory in the vein of Intel's, rather than DCAS.
    DCAS has the distinction of being both beyond what CPUs support for their atomic operations and yet less ambitious than what AMD flirted with, Intel is bringing out, or what IBM uses.

    If that subset of full transactional memory support is what works for what you want, then fine, although it doesn't resolve the unknown as to whether GPUs as we know them are capable of giving what you want.
    GCN doesn't have the instruction format, and the ISA docs are dodgy on the ordering behavior of separate writes.
    The Sea Islands ISA doc is particularly untrustworthy because AMD forgot whether it was talking about GCN or VLIW in that section, which has been pointed out on this forum and I'm sure AMD is rushing to fix it while we all make a point of politely pretending it is still early 2013.

    There seems to be some assumed mechanism and method for this DCAS implementation that I think should be spelled out.
    An Intel-like formulation of transaction processing involves tracking whether a line or lines have been accessed/modified/evicted in core-private cache, and upon completion of the transaction the write set atomically becomes globally visible. GCN's L2 is the one point that provides something close to this, but it is shared and lacks significant tracking or promises of atomicity or ordering.
    The long latencies bring into question the atomicity of transaction commitment and raise the chances of an untoward event cancelling the transaction.
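    For reference, here is a minimal sketch of what that Intel-like formulation looks like from the software side, using the RTM interface of TSX (the counter and the fallback spinlock are made up for illustration). The hardware tracks the transaction's read/write set in core-private cache and either commits it atomically at _xend() or aborts back to the fallback path:

    ```cpp
    #include <immintrin.h>   // _xbegin/_xend/_xabort; needs RTM support (e.g. -mrtm)
    #include <atomic>

    int counter = 0;
    std::atomic<bool> fallbackLock{false};

    void increment_transactionally() {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            // Reading the fallback lock adds it to the read set, so a concurrent
            // fallback writer makes this transaction abort instead of racing it.
            if (fallbackLock.load(std::memory_order_relaxed))
                _xabort(0);
            counter += 1;    // stays in the core-private write set for now
            _xend();         // commit: the write set becomes globally visible atomically
        } else {
            // Aborted (conflict, capacity, lock held, ...): take the locked path.
            while (fallbackLock.exchange(true, std::memory_order_acquire)) { /* spin */ }
            counter += 1;
            fallbackLock.store(false, std::memory_order_release);
        }
    }
    ```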

    Having a wavefront wait while some other CU's transaction is finished is a thornier problem, because the question is how the CU knows a transaction is in progress, and whether the design shifts from silent discarding of failed transactions to some kind of assertion of ownership over lines in a globally visible shared last-level cache.


    There are various reasons why they might not, since the trends these days are usually not towards openness, even for AMD and Intel.

    I have doubts about relying on marketing slides for this.

    Also, as was pointed out for Transmeta, a software translation layer is great at covering up problems with the implementation.
    The code morphing software did things that pointed to interlock problems or hardware bugs, and at a hardware level it would have been unacceptably primitive and dangerous to use without the interpreter--if things could be made functional.
    A modern core isn't a trivial undertaking and time isn't on Nvidia's side.
    An honest to goodness competitive OoO core would have been a challenge to get right without an extended history of making such cores.

    edit:

    It's generally a quadratic relationship with voltage, and linear with clock speed.
    Since higher clocks eventually necessitate voltage bumps, it actually translates to a cubic relationship over the whole voltage/clock range of a design. The extremely high peak at heavy overclocks is what happens when a design is being pushed past its speed and power targets.
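    As a rough sketch of that relationship (the standard first-order model, with my own symbols):

    ```latex
    % P_static = leakage, C = switched capacitance, V = supply voltage, f = clock
    P \approx P_{\text{static}} + C\,V^{2}f
    % If the required voltage grows roughly linearly with frequency past the
    % nominal operating point (V \propto f), the dynamic term behaves as
    P_{\text{dynamic}} \propto f^{3}
    ```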
     
    #147 3dilettante, Aug 27, 2014
    Last edited by a moderator: Aug 27, 2014
  8. keldor314

    Newcomer

    Joined:
    Feb 23, 2010
    Messages:
    132
    Likes Received:
    13
    I wouldn't rule out Nvidia pulling something like this off. They have the resources and a solid product base to fund their research. Of course, I don't expect them to get it all right the first generation either:razz:

    I think the big thing that's done to stop/reduce the voltage hikes is to increase pipeline length, which is in fact precisely why all these latency reduction technologies came to be! To maintain voltage, I believe you'd have to increase pipeline length linearly with clock speed, so that each rank of transistors in a given stage has the same amount of time to stabilize. However, I would strongly suspect that increasing the pipeline length four times to go from 1 GHz to 4 GHz would impact latency well beyond what you could fix with OOO etc.

    In fact, I think we may be making a mistake comparing latency optimizations needed at 1 GHz to those at 4 GHz, ESPECIALLY if you add superscalar to the 4 GHz case! Between more pipelines and longer ones, you need quite a bit more ILP to cover it!
     
  9. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    Indeed out-of-order execution in and of itself doesn't result in optimal scheduling. But out-of-order execution doesn't mean you can no longer statically schedule. Out-of-order execution will further improve on any static schedule whenever latencies are not as predicted. In addition, register renaming allows the static scheduling to ignore false register dependencies, which is something you can’t do for an in-order architecture.

    I believe that in theory out-of-order scheduling is optimal when you have infinite execution units and an infinite instruction window, for any order of the instructions and any latencies. For in-order execution this isn't true, even with infinite resources, so it’s always at a disadvantage. And for all practical purposes, Haswell’s eight execution ports (four arithmetic) and 192-entry instruction window get pretty close to optimal scheduling when you make a reasonable effort at static scheduling (e.g. trace scheduling).

    So your downplaying of out-of-order execution isn't really justified. I’m not denying that practical implementations aren't 100% optimal, but it gets darn close and in-order execution essentially can’t compete with that. Mobile CPUs use out-of-order execution too now, and there’s no turning back from that. With all due respect NVIDIA’s first attempt at designing its own CPU will probably be remembered in the same way as NV1: using fundamentally the wrong architecture.
    Scheduling is cheap in hardware, not in software. Out-of-order execution essentially uses CAMs which perform hundreds of compares per cycle. In software it’s so expensive some JIT compilers don’t (re)schedule, and those that do use cheap heuristics that aren't as thorough as out-of-order scheduling hardware.
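    As a rough software rendering of what that CAM-based wakeup does every cycle (the window size and issue width below are illustrative, not tied to any particular core):

    ```cpp
    #include <array>
    #include <cstdint>

    // Each cycle, every broadcast result tag is compared against both source tags
    // of every waiting entry. Hardware does this in parallel in a CAM; the loops
    // below just spell out how many comparisons that is.
    constexpr int kWindow = 60;       // illustrative scheduler window size
    constexpr int kIssueWidth = 4;    // results broadcast per cycle

    struct Entry { uint16_t src1, src2; bool ready1, ready2; };

    void wakeup(std::array<Entry, kWindow> &window,
                const std::array<uint16_t, kIssueWidth> &broadcastTags) {
        for (Entry &e : window)                    // 60 entries...
            for (uint16_t tag : broadcastTags) {   // ...x 4 tags x 2 sources = 480 compares
                if (e.src1 == tag) e.ready1 = true;
                if (e.src2 == tag) e.ready2 = true;
            }
    }
    ```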

    Trying to improve on out-of-order execution is great, but it should use out-of-order execution as the starting point, not in-order execution. There’s some research on dual-stage scheduling which uses a slow big scheduler and a fast small scheduler to save on power while increasing the total scheduling window, but as far as I’m aware that hasn't been used in practice yet, probably due to direct improvements on single-stage scheduling. There are also ideas to switch to in-order execution when there’s good occupancy due to Hyper-Threading, but when you have lots of parallelism you probably want wide SIMD instead, and the cost of out-of-order execution gets amortized.
    First of all you don’t have to write in assembler to have to deal with scheduling for an in-order architecture. Secondly, compilers are written by developers too, and are in a constant state of flux. Scheduling is not a solved problem, and between all the other things that are expected of a compiler, it is hard to keep up to date. Also, again, JIT compilation has a very limited time budget for scheduling. This is a serious concern for run-time compiled shaders and compute kernels, which shifts some of the problem to the application developer. Out-of-order execution makes things a lot easier for everybody. You can do a somewhat sloppy job and still get great results, and have your legacy code run faster without recompilation.

    You can curse at developers for this, but the reality is that they have a lot of other issues on their mind, mostly high-level ones, to not want to bother with low-level issues. The increase in productivity you get from out-of-order execution is something that shouldn't be underestimated, and is a significant factor in its widespread success.
    Yes, my argument is that you can have it either way, or both. If the branching is uniform and predictable and the conditional blocks are large, you can have a real jump instruction. If the branching is less uniform and the blocks are small, just execute both branches. Note that with AVX-512’s predicate masks, it can improve upon the latter by skipping instructions for uniformly non-taken branches. In theory it should be able to skip four such instructions per cycle, so the threshold for wanting to use an actual jump becomes quite high. On the other hand if you’re not sure about the branching behavior it’s still safe to have a real jump because prediction will help you out in 95% of the cases. You don’t get that luxury with in-order non-speculative execution.
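    A minimal sketch of the "execute both sides under a mask" case with AVX-512F intrinsics (the condition and the two arithmetic paths are made up for illustration):

    ```cpp
    #include <immintrin.h>   // AVX-512F intrinsics (compile with e.g. -mavx512f)

    // Both sides of the "branch" execute; the predicate mask decides which lanes
    // each side actually writes.
    __m512 select_path(__m512 x, __m512 threshold) {
        __mmask16 taken = _mm512_cmp_ps_mask(x, threshold, _CMP_GT_OQ);

        // "Taken" lanes get x*x; the other lanes pass x through unchanged...
        __m512 r = _mm512_mask_mul_ps(x, taken, x, x);
        // ...then the "not taken" lanes get x + threshold merged into the same register.
        r = _mm512_mask_add_ps(r, _mm512_knot(taken), x, threshold);
        return r;
    }
    ```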
    That’s not true. x86-64 doubled the number of registers without breaking backward compatibility for x86-32 applications. AVX-512 will double the SIMD register count without affecting any prior code. Also, when you compare different ISAs it’s clear that it’s not so critical in the first place. Micro-architectural features and backward compatible extensions have a bigger effect on performance than any overhauls that would break compatibility.

    Dynamic compilation is a valuable technique, but you don’t need/want it everywhere.
    The problem is that even though 32 registers might seem plenty compared to x86-64’s 16 registers, the CPU can leave several megabytes of data on the stack. Want a local array of 16k elements? No problem on the CPU, impossible on the GPU. Want to recurse the same function a few thousand times? No problem on the CPU, impossible on the GPU. Of course if the data is shared you could explicitly use shared memory and manage access to it, and you can convert recursion into a loop, but that’s putting a heavy burden on the developer.

    Microsoft increased the register limit from 32 in Shader Model 3.0, to 4096 in Shader Model 4.0. SM3 is ancient nowadays. And while you won’t run into the 4096 limit that easily yet with the relatively small bits of code the GPU is expected to run, anything in between results in lower occupancy.

    There’s a reason GPUs continue to increase their register set or lower the latencies every generation. It allows them to run more complex code without going back to the CPU. The end goal is to be able to start executing parallel code in the middle of sequential code, even when you already have tens of function’s stack frames on the stack. That’s only going to be achieved by a unified architecture.
    You don’t really get to choose how many local variables you need, or how deep your stack gets. For GPUs to have a flourishing software ecosystem you have to be able to call other people’s GPU code libraries. For this they have to be able to park stack frames in generic memory instead of precious registers, and they have to lower the thread count so that this data will still be in caches by the time the call returns. And to achieve that you need out-of-order execution.
    If and when they guess wrong, yes. But automatic prefetching is quite conservative (it really needs a strong pattern), and if it predicts wrongly it immediately aborts. This can be tuned as desired for best performance/Watt. Not prefetching results in stalls or switching threads, and the latter also ‘pollutes’ the cache for the first thread. Research shows that GPUs should prefetch too to improve power efficiency.
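    A minimal sketch of the kind of conservative stride prefetcher I mean (the entry layout and confidence threshold are invented for illustration): it only issues a prefetch after the same stride has been observed repeatedly, and resets the moment the pattern breaks:

    ```cpp
    #include <cstdint>

    struct StrideEntry {
        uint64_t lastAddr   = 0;
        int64_t  stride     = 0;
        int      confidence = 0;   // 0..2; prefetch only at 2
    };

    void onLoad(StrideEntry &e, uint64_t addr, void (*prefetch)(uint64_t)) {
        int64_t observed = static_cast<int64_t>(addr - e.lastAddr);
        if (observed == e.stride && observed != 0) {
            if (e.confidence < 2) ++e.confidence;
        } else {
            e.stride = observed;   // pattern broke: remember it, but don't prefetch yet
            e.confidence = 0;
        }
        if (e.confidence == 2)
            prefetch(addr + static_cast<uint64_t>(e.stride));  // strong pattern only
        e.lastAddr = addr;
    }
    ```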
    First of all, the speedup from out-of-order execution greatly depends on the rest of the micro-architecture, especially the execution width. In-order architectures simply don’t come as wide as out-of-order ones, so it gets hard to compare. Secondly, don’t trust singular results. It is obviously going to greatly depend on the software, data set, and usage pattern how much you gain.

    The Atom Z3770 is well over times faster than the Z2760. Even if it was just 30% when eliminating all other differences, that’s not something anyone is willing to give up. So unification will require the scalar part to use out-of-order execution. Strictly speaking the SIMD part could use in-order execution, but since the cost is amortized over a great vector width, you might as well stick with the out-of-order execution. Alternatively it could use less aggressive out-of-order execution. In fact that’s exactly what I’m proposing by executing AVX-1024 on 512-bit units. It takes two cycles to issue each instruction, so you can have two SIMD clusters and have a single scheduler alternate between them. Of course that’s the same cost as issuing them on 1024-bit units each cycle when you have just one SIMD cluster, but that wouldn't come with the latency hiding benefits. Also, if you pin each of the two 512-bit clusters to a single thread, you can really use out-of-order execution to augment the utilization.
     
  10. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    I think it would be much simpler to just hold the CPU<->GPU command buffers in dedicated SRAM on chip, which the OS can control. There is no reason to communicate here via DRAM when the cores themselves are on the same die.
     
  11. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Which part did they pull off?
    Not an OoO core--referencing Denver, and most probably not a complex CPU core with no errata--referencing everyone.

    Barring a crippling flaw that somehow breaks the software layer, Nvidia can do what Transmeta did and compensate for vulnerabilities in the hardware by putting the necessary workarounds in the translated code cache.

    Additional pipeline stages can be added to allow for better voltage scalability in critical paths, but beyond selective application to things like the memory pipeline it can threaten the performance of the design. Heavier pipelining also increases transistor counts that directly contribute to the static power figure in the graph cited earlier.

    You're likely to run into a number of roadblocks with extreme pipelining for the sake of not bumping voltage.
    One is performance tanking in absolute terms and per-watt.
    Then there is the already mentioned increase in transistor count that will increase the static component of leakage in that graph.
    Another is that while wire delay is typically a bigger obstacle these days than transistor switch times, splitting things up and adding drive stages cuts wire delay at the cost of an accumulation of small-but-not-zero transistor delays, and the proportion of delay sources is not constant throughout a design.

    Then there's the cost of adding a pipe stage, since each stage involves passing signals through logic that takes data from one latch and outputs to the next.
    The latches are a non-trivial contributor to pipe delay, and every reduction of the amount of logic work per stage also increases the fraction of a pipe stage taken up by stage overhead.
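    A back-of-the-envelope way to see it (standard pipelining model, my own notation): splitting a fixed logic delay into N stages still pays a per-stage latch/clocking overhead, so

    ```latex
    f_{\max} = \frac{1}{T_{\text{logic}}/N + T_{\text{ovh}}}
    % As N grows, cycle time approaches T_ovh and the fraction of each stage
    % spent on overhead, T_ovh / (T_logic/N + T_ovh), approaches 1.
    ```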

    Just as an example:
    http://www.realworldtech.com/cell/3/

    The CELL processor had a very aggressively pipelined design whose cycle time was approximately 50% pipe stage overhead, and CELL still had to bump voltage to reach the upper range of its clock envelope.
    http://www.realworldtech.com/cell/8/

    Taken to the furthest extreme, you will simply run out of logic layers that do any work if you add enough stages, and the chip that can do nothing is probably not going to hit the super high clock range in that graph without a voltage bump.
     
  12. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    Wrong. Just adding more parallelism will run you into the Bandwidth Wall. The GeForce 8800 GTX could do 518 GFLOPS and had 86.4 GB/s of bandwidth, while a 680 GTX can do 3090 GFLOPS with 192.25 GB/s. That's six times the computing power but just twice the bandwidth to feed it. That’s only possible because of larger on-die caches to feed the additional units, and lower instruction latencies and dual-issue to improve locality. In effect they had to increase per-thread IPC (or rather decrease the CPI).
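    Put in terms of how much arithmetic each byte of bandwidth has to feed, using those numbers:

    ```latex
    \frac{518\ \text{GFLOPS}}{86.4\ \text{GB/s}} \approx 6\ \text{FLOP/byte (8800 GTX)}
    \qquad
    \frac{3090\ \text{GFLOPS}}{192.25\ \text{GB/s}} \approx 16\ \text{FLOP/byte (GTX 680)}
    ```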

    The Bandwidth Wall also translates into a Power Wall. Bandwidth can’t be increased aggressively because it costs a lot more power, and unlike raw computing throughput it’s not helped much by the silicon process scaling. So ever more bandwidth will have to be delivered by caches, and the more you can fetch data from a closer cache level, the better the power consumption. To achieve that you have to run a low number of threads. Note that even Hyper-Threading with just two threads can result in no gains due to cache thrashing. So GPUs will have to continue to extract more ILP to stay efficient.
    Sure, but as long as it’s achieving a good performance : power ratio, that’s fine. Intel’s Broadwell architecture is claimed to uphold a 2:1 ratio or more for each new enhancement. And that’s for scalar code. For SIMD, the scaling from 128-bit to 512-bit will result in much better ratios still. The effective GFLOPS/Watt is increasing spectacularly.
    When you quantify what you call “quite a lot”, and its rate of adoption, it’s really rather bleak compared to the rate at which computing power is increasing. Also, people don’t just want more pixels, they want more realism per pixel. Meanwhile the total frame latency can’t increase, and should ideally be decreased. So the only way to achieve all that is to compute things faster on a per pixel / per thread level. If you add more ALUs to increase parallelism, a good chunk of that has to go to instruction level parallelism.
    But then you’re wasting hardware by having an aggregate IPC much lower than 1, and you're ignoring locality issues due to running one thread per core. Obviously we don’t want to waste anything, so the choices are between out-of-order execution to increase the single-threaded IPC, or SMT to achieve an equivalent aggregate utilization. Again bandwidth and power consumption become your enemies as you scale up. You can only increase the number of cores up to the point where bandwidth is saturated at any level. And since you want the bandwidth further away from the cores to be as low as possible to minimize power consumption, the only way to have a high core count is to have a high locality of accesses. Out-of-order execution achieves a higher locality than SMT, so it will prevail.
     
  13. keldor314

    Newcomer

    Joined:
    Feb 23, 2010
    Messages:
    132
    Likes Received:
    13
    Before you go extrapolating about increased IPC using Kepler as a base, consider Maxwell. Maxwell is basically Kepler *without* the new IPC and with a larger L2. How did it perform? Quite well, thank you, *especially* in compute benchmarks!

    I'm somewhat limited in providing concrete data for this since pretty much all the compute benchmarks are OpenCL based. Given the state of Nvidia's OpenCL stack, it's not clear how much of the gains come from "Maxwell performs much better" as opposed to "Kepler's OpenCL support sucked, but Nvidia put a bit more work into it for Maxwell". Anyone have some Cuda benchmark comparisons?

    What this really goes to show is that the IPC micro-architecture first introduced with GF104 was a dead end. At the very least, it shows that it's better to spend the space/power you would use for OOO or something and put it into more cache and/or more processors.
     
    #153 keldor314, Aug 28, 2014
    Last edited by a moderator: Aug 28, 2014
  14. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    I don't think that's entirely the case. There's a constant, a linear and a quadratic component to it, and while that makes it O(n²), the constant + linear component actually dominates as long as you stay within the design space of the process. In fact, as you can see from the image, for most of the curve you can get twice the clock speed for much less than twice the power consumption. Also, overclocking a 3 GHz design to 5 GHz will result in worse power than if it had been designed for 5 GHz in the first place.

    Last time I checked, the performance/Watt results for CPUs vs GPUs were not off by 4x despite the big difference in clock speed. AVX-512 should further diminish that gap, and we'd need AVX-1024 for the fairest comparison. Either way, with a CPU designed for 4 GHz the power consumption for the performance you get out of it isn't bad at all.

    And then there are several ways to actually lower the frequency in case that would be necessary to achieve unification. Downclocking to keep the core within the power budget while executing wide SIMD code could improve performance/Watt if the voltage was also lowered. The same design techniques used for near-threshold voltage computing can keep things stable at a moderate voltage to optimize for performance/Watt. And while that would affect absolute scalar performance, it could ramp up again when no SIMD code is executed for a while. In fact Intel is already balancing the power budget between the CPU and iGPU in a similar fashion. Alternatively we could have two 1024-bit SIMD clusters per core, but clock them at half the frequency of the scalar part. That would have the downside of impacting legacy SIMD code which doesn't take advantage of the full width, but maybe that's quite acceptable, especially when the two clusters would keep multi-threaded execution on par.

    Also note that while GPU core frequencies are relatively low, the RAM can run at up to 6 GHz! For CPUs it's practically the reverse situation. It appears that future GPUs will use lower clocked stacked RAM but use wider interfaces, which might allow them to increase core frequency, whereas CPUs will switch to higher clocked (but also power efficient) DDR4 soon.
     
  15. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    "dependent arithmetic instruction latencies have been significantly reduced"
    "scheduler still has the flexibility to dual-issue"

    Kepler couldn't actually dual-issue 4-operand instructions, due to register port conflicts. Also, whether or not you can dual-issue depends on the instruction dependencies. So dual-issue rates probably weren't that great. Instead a significant reduction in latencies, and the ability to still dual-issue different instruction types, must result in greater single-threaded performance. Ergo, another small convergence step toward the CPU.
    It seems more like one step back to make two steps forward in single-threaded performance. You can't conclude from this that out-of-order execution won't ever be viable. Also note that you have to cross the valley in one go. We won't actually see the architectures converge to the same point before unification. Instead they'll have competitive performance and power characteristics and CPUs will suddenly be able to take on graphics computing. Still, GPUs clearly can't wander off to the other extreme and have to incrementally achieve better access locality and hit rates through lower latencies and larger caches.
     
  16. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    Note that sebbbi was talking about doing it manually, and thus statically. Anyway, yes, in theory the same code can be recompiled multiple times. But that's not novel either. In fact proponents of JIT compilation keep bringing it up as a reason why JIT compiled languages should be faster than statically compiled ones. In reality they're not. The hotspots of performance critical applications are typically optimized by the developers themselves, already utilizing specialized paths for changing conditions. You can't really rely on a JIT compiler to do that for you. Where JIT actually can outperform static compilation is when the number of conditions to specialize for is too high to do it manually, but once again in reality JIT compilers are all too conservative to risk spewing out a lot of code. I'm not expecting Denver to be any different.
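    To illustrate what I mean by developers specializing hot paths themselves (the kernel and the alignment condition are just placeholders), a typical hand-rolled version dispatches on the runtime condition instead of waiting for a JIT to discover it:

    ```cpp
    #include <cstddef>
    #include <cstdint>

    // The developer provides the specialized paths up front.
    template <bool kAligned>
    void scale(float *dst, const float *src, std::size_t n, float s) {
        // A real kernel would pick aligned vs. unaligned vector loads/stores here.
        for (std::size_t i = 0; i < n; ++i) dst[i] = src[i] * s;
    }

    void scale_dispatch(float *dst, const float *src, std::size_t n, float s) {
        bool aligned = reinterpret_cast<std::uintptr_t>(src) % 64 == 0 &&
                       reinterpret_cast<std::uintptr_t>(dst) % 64 == 0;
        if (aligned) scale<true>(dst, src, n, s);
        else         scale<false>(dst, src, n, s);
    }
    ```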

    With Reactor I'm generating exactly the specialized code that I want at run-time. So even for cases where there are lots of conditions it's feasible and preferable to have the developer guide the process.
     
  17. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,406
    Likes Received:
    430
    Location:
    Cleveland, OH
    Denver isn't comparable to a JITed VM for a language like Java because it's translating machine code, not unoptimized or poorly optimized bytecode. The hotspots of performance critical applications will be optimized by the developers exactly as much as they always have been. Maybe not targeting Denver specifically, but on Android apps have very little in the way of uarch-specific optimization, not least of all because there's one binary per architecture without dispatching. While Denver is technically running a separate architecture in between, I'm sure it has been tuned to fit ARM well.
     
  18. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Can you mention what the panelists had to say on shared memory? I.e., why did they dislike it? And what are your own views on shared memory?
     
  19. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    You mean 6 Gbps, right?
    GDDR5 doesn't clock its bus to 6 GHz. Since it is DDR, the data bus clock is half that.
    The upper tiers of DDR4 may eventually bump into the lower range of GDDR5.

    High bandwidth solutions may be the very wide HBM or ultra-short-reach HMC. The USR variant of HMC is targeting 10 Gbps. The next iteration of that standard is apparently aiming for 30 Gbps.
     
  20. Blazkowicz

    Legend

    Joined:
    Dec 24, 2004
    Messages:
    5,607
    Likes Received:
    256
    GDDR5 sends data at quad the clock rate, not double?
     