Bulldozers are faster in reverse

Discussion in 'Architecture and Products' started by Nick, May 6, 2013.

  1. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    I'm thinking more like 8 cores, 16 threads, 16 SIMD clusters with 2 FMA units each. That would still be barely bigger than 100 mm² at 14 nm, but could pump out 2 TFLOPS.
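    The 2 TFLOPS figure can be sanity-checked with a quick back-of-the-envelope calculation. The lane width, FMA accounting, and ~2 GHz clock below are my assumptions, not numbers stated in the post:

```python
# Back-of-the-envelope check of the ~2 TFLOPS claim.
# Assumed: 512-bit SIMD units (16 single-precision lanes),
# FMA counted as 2 FLOPs, and a ~2 GHz clock.
clusters = 16          # SIMD clusters
fma_units = 2          # FMA units per cluster
lanes = 16             # 512 bits / 32-bit single precision
flops_per_fma = 2      # multiply + add
clock_ghz = 2.0        # assumed clock

gflops = clusters * fma_units * lanes * flops_per_fma * clock_ghz
print(gflops)  # 2048.0 GFLOPS, i.e. ~2 TFLOPS
```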

    Leaving some of it dark isn't an issue at all. It's a necessity due to the power wall and the bandwidth wall. The important thing is to have the right execution logic for any type of workload, right where the data is.
    AMD tried that with Bulldozer, and look where it got them. Even for Intel, the highest Linpack scores are obtained when turning off Hyper-Threading. Granted, that's a synthetic benchmark and Hyper-Threading typically does help, but it shows that it comes with overhead that needs to be 'overcome' before it starts to contribute anything. That's why I'm proposing to execute AVX-1024 instructions on 512-bit units, instead. It helps hide latency, while doubling the scheduling window, and lowering the front-end activity.
    Unified cores then, right?
     
  2. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    627
    Likes Received:
    414
    I disagree on the dark silicon issue, as I already posted. We already have a way to profitably use every last transistor (even if with diminishing returns), we're not going to start reducing cache to plop in extra FPUs that will idle for most of the time.

    I still hold that the FPU is not the weak point. Every single FPU-heavy load I've ever run on the BD has been cache or memory throughput-limited, not compute-limited. Every last one. I have never managed to do anything that approaches looking like real work and that runs out of execution resources on BD. The caches are that bad.

    I honestly think that for a lot of real problems, a write-back L1 would double the total throughput of the system.

    Not only is Linpack so unrepresentative of real work that it should never be mentioned, it also completely skirts my point. My point wasn't "HT speeds up FPU-heavy loads", it was "in mixed-load environments, HT increases efficiency". Run Linpack and 4 scalar threads at the same time, and you bet you're going to see very good gains for HT.

    This overhead is cache pressure. There is no overhead to overcome in the actual execution parts. Dedicating FPU clusters does not help overcome the cache pressure overhead in any way, shape or form.

    Eventually, but I think this will take a long time still.

    Note that I don't necessarily think this is the best way to go (I haven't studied the problem long enough to pick a position), but I do think that Intel is institutionally predisposed to go this way, and thus it will happen.
     
  3. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    One question, Nick: Where do you put the shared resources for the integrated GPU's math engines (which you want to merge with the vector units of the CPU core), and how do you propose the work should be fused afterwards?
     
  4. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    Why settle for diminishing returns already? That can always work as a last resort. Intel has been able to keep the L3 cache for four cores at 8 MB for three process nodes now, using the increasing transistor budget for better purposes. I'd rather know that each core is concentrating on one task, achieving maximum performance/Watt, than ensure full utilization of a few cores while leaving performance on the table.

    There are three basic kinds of workloads to deal with: high ILP, high TLP, and high DLP. For high ILP, you want four scalar execution ports like Haswell. For high TLP, you want to share those between two threads. And for high DLP, you just want wide SIMD units, and lots of them, at a lower frequency and hiding latency while maximizing data locality with long-running instructions.

    The architecture I'm proposing here is designed for each of these, and any mix of them. Basically, CPUs like Haswell are already very good at ILP and TLP, but they need GPU-like SIMD units, without lowering the data locality by running lots of threads, and other overhead associated with that.
    I have no first-hand experience with Bulldozer, so I'm not doubting your findings at all. But improving the caches and lowering the latencies would still not make it a good step towards unified computing. To match my proposal, it would need 32 scalar cores. There's no use for that in the consumer market, and it would waste a lot of space (and I'm not talking about low utilization, I'm talking about a complete waste - even using it for more cache would have been better).

    Of course the root of the problem is that AMD doesn't want unified computing at all. It wants small scalar cores that share an FPU for legacy purposes, and a big GPU to handle all throughput computing needs. That may sound good on paper, but it's fraught with heterogeneous computing issues. They're hoping for developers to miraculously deal with that, while Intel is pampering developers with a better ROI proposal.
    Like I said, I mentioned Linpack only to illustrate that Hyper-Threading has an overhead. I'm not arguing that it increases utilization in mixed-load environments, but I do argue that it doesn't offer the best performance/Watt, precisely due to the inherent overhead.

    Then why do I propose keeping Hyper-Threading for the scalar portion of the core? Because there's a tipping point where low utilization becomes a waste: during high-TLP workloads, sharing four scalar execution ports between two threads is more power efficient than one thread using on average only a couple of them. The four ports still matter for increasing IPC in single-threaded workloads, though, due to Amdahl's law.

    So it's all about finding the right balance for each type of workload. I don't think SMT is optimal for the SIMD units. They suffer the most from cache pressure when the thread count is increased. AVX-1024 offers the necessary latency hiding qualities to increase utilization (the good kind that improves power efficiency) while lowering front-end and scheduling overhead.

    I'm open to alternatives, but I really don't think Bulldozer sets a good example.
    I think it will happen regardless of Intel's desire for it. That, plus their process advantage, will just make it happen sooner than the sheer necessity dictates. I have continuously underestimated how fast they would converge things, and I think that's saying something. If Skylake features 512-bit SIMD units, then we're looking at a 32-fold increase in computing power between a dual-core Westmere and an 8-core Skylake, in five years' time. That would probably put getting rid of the integrated GPU on the roadmap next.
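    The 32-fold figure works out at equal clock speed under a couple of assumptions (mine, not spelled out in the post): Westmere issues a 128-bit SSE multiply and add per cycle, while the hypothetical 512-bit Skylake would have two FMA ports, as Haswell does at 256 bits:

```python
# Per-clock peak single-precision FLOPs per core:
# Westmere: separate 128-bit mul + add ports -> 4 + 4 = 8 FLOPs/cycle
# Hypothetical 512-bit Skylake: two FMA ports -> 2 * 16 * 2 = 64 FLOPs/cycle
westmere_total = 2 * 8    # dual-core Westmere
skylake_total = 8 * 64    # 8-core Skylake with 512-bit FMA
print(skylake_total / westmere_total)  # 32.0
```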
     
  5. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    I think it can be done in the execution units instead of requiring a separate math box. Sines and cosines can be closely approximated with mainly just a handful of FMA operations. SSE/AVX also already has support for pipelined reciprocal and reciprocal square root approximation (although it could use some improvement). Next, gather instructions can be used for table look-ups for piece-wise approximations. The starting point for the exponential and logarithm functions is to extract/insert the IEEE-754 exponent field, for which Xeon Phi has some interesting instructions. Extending the 'Bit Manipulation Instructions' to AVX would also help with that.
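    As an illustration of the "handful of FMA operations" point, here is a scalar sketch (my own, not any library's actual implementation) of sin(x) on [-π/2, π/2] via an odd polynomial. Each Horner step is one multiply-add, i.e. one FMA per SIMD lane in a vectorized version; a production routine would use minimax rather than Taylor coefficients:

```python
import math

def sin_approx(x):
    """Degree-9 odd polynomial for sin(x), accurate on [-pi/2, pi/2]."""
    x2 = x * x
    p = 1.0 / 362880.0              # x^9 coefficient (1/9!)
    p = p * x2 - 1.0 / 5040.0       # one FMA: -1/7!
    p = p * x2 + 1.0 / 120.0        # one FMA: +1/5!
    p = p * x2 - 1.0 / 6.0          # one FMA: -1/3!
    p = p * x2 + 1.0                # one FMA
    return p * x

# Four FMAs plus two multiplies already gives ~1e-5 accuracy on the range.
for x in (0.1, 0.5, 1.0, 1.5):
    assert abs(sin_approx(x) - math.sin(x)) < 1e-5
```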
     
  6. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    I think we're talking about different things here, Nick. Maybe I should not have called the execution units "math engines" - too close to the Intel term for special function units, perhaps.

    What I actually meant was: where to put the front-end, if you distribute/merge the execution units into/with the vector units of the CPU cores? You'll need some kind of front end (work distribution) and some kind of back end (fusion) which unifies all the involved cores.
     
  7. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    I'm not sure if I'm interpreting you correctly this time, but there are no different cores involved. Instead of like today's CPU cores with Hyper-Threading where both the scalar and vector execution units are shared, my proposal is to have dedicated vector execution units for each thread. That's really the gist of it. The front-end and back-end basically stay the same.

    Just look at Bulldozer. It has two dedicated scalar clusters, one shared vector cluster, and one shared front-end and back-end. I just want the reverse for the scalar and vector clusters.

    To keep that many vector units occupied with just a 4 instruction wide front-end, there would be AVX-1024 instructions that split into two 512-bit operations on issue, executed sequentially. Running the SIMD clusters at half frequency would save power and further lessen the burden on the front-end.
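    The splitting scheme can be sketched as a toy model (mine, not an actual ISA definition): a 1024-bit register holds 32 single-precision lanes, and one decoded instruction executes as two sequential 512-bit beats, halving front-end work per unit of math:

```python
LANES_1024 = 32   # 1024 bits / 32-bit single precision
LANES_512 = 16    # width of one execution beat

def fma_1024(a, b, c):
    """One AVX-1024 FMA: decoded once, executed as two 512-bit halves."""
    result = [0.0] * LANES_1024
    for beat in range(2):                     # two back-to-back beats
        lo = beat * LANES_512
        for i in range(lo, lo + LANES_512):   # one 512-bit execution beat
            result[i] = a[i] * b[i] + c[i]
    return result

a = [float(i) for i in range(32)]
b = [2.0] * 32
c = [1.0] * 32
assert fma_1024(a, b, c) == [2.0 * i + 1.0 for i in range(32)]
```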
     
  8. pMax

    Regular

    Joined:
    May 14, 2013
    Messages:
    327
    Likes Received:
    22
    Location:
    out of the games
    I think he is referring to the fact that a lambda function is divided into many work-items that progress in parallel on a GPU.
    Imagine a shader that copies a texture, with some custom pattern, some dependencies on other parallel copies, and some IFs in between that make the flow diverge.
    Unless you want your compiler to manage all of this, you need 'something' that deals with it (including managing possibly non-coherent retirement of data).

    i.e. you cannot spawn hundreds of threads on a CPU as on a GPU - that is not convenient.
     
    #48 pMax, Jul 11, 2013
    Last edited by a moderator: Jul 11, 2013
  9. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    Software scheduling FTW. :grin:

    But you are right, it is in my opinion much more convenient to run larger tasks asynchronously and automatically distributed over the available resources, which effectively necessitates a separate scheduler with access to all vector units if you want to do it in hardware. Wait! That wouldn't be that homogeneous anymore. So scratch it.
    Otherwise you would need a specialized scheduler in the OS handling this (and some feedback to the app of course to spawn the right amount of threads at the right time). Basically it would amount to some kind of runtime environment for these tasks (which does the thread creation, work distribution and maybe even the scheduling [at least helps the normal thread scheduler of the OS]).
     
  10. iMacmatician

    Regular

    Joined:
    Jul 24, 2010
    Messages:
    797
    Likes Received:
    223
    Knights Landing is supposed to be around 3+ TFLOPS on 14 nm, and Haswell already reaches Knights Corner-levels of FLOPS/W, so the difference between Haswell and Xeon Phi is not very big once TDP is factored in. Do you see Xeon Phi as being relevant 5 or so years from now?
     
  11. Novum

    Regular

    Joined:
    Jun 28, 2006
    Messages:
    335
    Likes Received:
    8
    Location:
    Germany
    That would surprise me. Who measured that? Under what circumstances?
     
  12. iMacmatician

    Regular

    Joined:
    Jul 24, 2010
    Messages:
    797
    Likes Received:
    223
    [Image: Intel Xeon Phi roadmap slide, showing Knights Landing at 14 nm with 3+ TFLOPS and GFLOPS/W figures]

    It may not be a measured 3+ TFLOPS, but it is at least their plan.
     
  13. Novum

    Regular

    Joined:
    Jun 28, 2006
    Messages:
    335
    Likes Received:
    8
    Location:
    Germany
    I only see GFLOPs/Watt for the Knights, not for Haswell.
     
  14. willardjuice

    willardjuice super willyjuice
    Moderator Veteran Alpha

    Joined:
    May 14, 2005
    Messages:
    1,386
    Likes Received:
    299
    Location:
    NY
    But we have wide access to Haswell! Peak FLOPS for the 4770 is ~500 GFLOPS and its TDP is ~100 W (rough estimates, can't exactly remember). iMac's statement seems fair to me.
     
  15. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,018
    Likes Received:
    582
    Location:
    Taiwan
    DP FLOPS is half of that though. So Haswell is roughly half of KC's FLOPS/W in DP.
     
  16. iMacmatician

    Regular

    Joined:
    Jul 24, 2010
    Messages:
    797
    Likes Received:
    223
    The Xeon E3-1280 v3 (integrated GPU disabled) is 3.6 GHz at 82 W, for 2.8 DP GFLOPS/W, while Knights Corner parts range from 3.3 to 4.5 DP GFLOPS/W (unless I missed some parts). So admittedly not equal, but not half either. [I wasn't thinking when I made the quick calculations in the above post.]
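    The 2.8 DP GFLOPS/W figure can be reproduced as follows, assuming Haswell's two 256-bit FMA ports (4 DP lanes each, FMA counted as 2 FLOPs):

```python
# DP GFLOPS/W for the Xeon E3-1280 v3 (Haswell, 4 cores, 3.6 GHz, 82 W).
cores, clock_ghz, tdp_w = 4, 3.6, 82.0
dp_flops_per_cycle = 2 * 4 * 2   # 2 FMA ports * 4 DP lanes * 2 FLOPs per FMA
gflops = cores * clock_ghz * dp_flops_per_cycle   # 230.4 DP GFLOPS peak
print(round(gflops / tdp_w, 1))  # 2.8 DP GFLOPS/W
```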
     
  17. Novum

    Regular

    Joined:
    Jun 28, 2006
    Messages:
    335
    Likes Received:
    8
    Location:
    Germany
    Yeah, I just did the calculation and was pretty shocked that Larrabee, with 60 cores, is so weak compared to Haswell, especially considering the power it uses.

    Or maybe it's the very fat 512 bit GDDR5 controller that burns a lot of power?
     
  18. rapso

    Newcomer

    Joined:
    May 6, 2008
    Messages:
    215
    Likes Received:
    28
    LRB isn't that much different from GK110 if you look at FLOPS/W. One outstanding point is the 30 MB of cache (I think GK110 has 1.5 MB?). While caches are very optimized, I'd guess they still have some impact on power consumption.

    I've played with my new Haswell this weekend; I'm really impressed by how efficiently you can run code on it. I got some stuff vectorized/optimized to reach 90% of the theoretical peak, and with HT it went those 10% further. I've tried some matrix*matrix (SP FMA) and MD5 (AVX2).
    My older i7 could get up to a 30% boost with HT, but was never closer than 90% of peak performance.
    I'd be really curious how well I could optimize some code for the Xeon Phi parts. I feel like comparing just the raw numbers doesn't do justice to Haswell's excellent efficiency.
     