The ISA for a UPU

Discussion in 'Architecture and Products' started by Nick, Mar 6, 2013.

  1. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    856
    Likes Received:
    260
    Sorry, yes, a single GPU SIMD unit/ALU block isn't partitionable. What I wanted to express is: if you put several GPU SIMD units together to form a SIMD vector as wide as a CPU SIMD register, and call that a PU (I don't have AMD's OpenCL doc here, it has the right abbreviation), then the advantage of that PU - which consists of multiple 4-wide SIMD units - is, for example, that it is partitionable, while the CPU SIMD vector (a register) is not.
     
  2. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    Are you saying the 4 wavefronts of a workgroup with 256 work items cannot be distributed over the SIMDs, but all 4 have to run on the same SIMD? Why? Do the barriers only work within a single instruction buffer/SIMD? That's hard to believe, as the ISA manual states that a work group can consist of up to 16 wavefronts (more than a single SIMD can handle).
     
  3. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,509
    Likes Received:
    839
    You're going to see the same cache contention whether data is read and stored in one single über-context or eight to sixteen smaller ones. The same work needs to be performed. Cache contention is a result of lack of size or associativity.

    Your serial monster CPU is going to need a much bigger ROB, which will impact cycle time AND power consumption.

    The structure of the ROB and register file is fixed in silicon. You need bigger structures to support your larger instruction window. Accessing these bigger structures is slower and requires more energy.

    It is clearly not! Haswell uses a lot more silicon. Wider issue width, wider execution units, wider data paths.

    If there is a need for these units, then it will make sense to widen them. But widening them spends more silicon real estate and burns more power (both active and idle) - power that could instead be used to increase single-thread performance. Widening them only makes sense when the performance/power trade-off allows it.

    Cheers
     
  4. OpenGL guy

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,357
    Likes Received:
    28
    No, that's not what I am saying. What I am saying is that it is generally more efficient to use workgroups of 64 threads.
     
  5. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    I got that, but why? If the four wavefronts of a 256 work item workgroup get distributed over all SIMDs in a CU, what is the major difference compared to four smaller workgroups, each consisting of a single wavefront? Just that one may synchronize larger groups? Is there any difference left if the kernel needs no barrier?
     
  6. OpenGL guy

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,357
    Likes Received:
    28
    Barriers are one advantage. Another is that if the machine is full, it can schedule a new workgroup sooner when the workgroup is smaller, as you only need to wait for a single wavefront to finish, not 4 wavefronts.
     
  7. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    Okay, another (in most cases not too large) effect. ;)
    It's OT here, but anyway: I would think one uses larger work groups only if one gets some benefit from data sharing over that larger work group. Otherwise it doesn't make much sense to begin with. So are there really some people choosing larger work groups for no reason?
     
  8. OpenGL guy

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,357
    Likes Received:
    28
    Yes and no. OpenCL, unlike DirectCompute, doesn't require that the application specify the workgroup size for a kernel at compilation time. To avoid recompilation, we use a default that is as large as our maximum (256 threads in OpenCL); this way we get correctness no matter what workgroup size is used with the dispatch. At dispatch time, the application can pass NULL for the workgroup size, indicating we should use whatever was used to compile the program.

    So we could use a default of 64 threads per workgroup and then hope applications query the workgroup size used for compilation, but I suspect we would break a bunch of applications.

    Along with this, if the application doesn't specify the workgroup size, then we also don't know what the best dimensions to use are. Say the application is using a 2D dispatch; which is best: 64x1, 32x2, 16x4, 8x8, 4x16, 2x32, 1x64? Well, it really depends on the data access patterns and such, so the driver really can't guess at what's optimal. I've had this discussion with some developers and they just don't get it, even if I point out how the competition's performance could improve as well. Even if we could profile each dispatch to find the shape that ran best, we don't know that the next dispatch will behave the same way.
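
    To make the two paths concrete, here is a minimal host-side sketch (C with the OpenCL API; the names are made up, the context/queue/kernel/device are assumed to be created elsewhere, and error checking is omitted):

    Code:
        /* Sketch only: the two dispatch paths described above. */
        #include <CL/cl.h>
        #include <stdio.h>

        void dispatch_both_ways(cl_command_queue queue, cl_kernel kernel,
                                cl_device_id device)
        {
            const size_t global[2] = { 1024, 1024 };

            /* (a) The application picks the shape itself: 8x8 = 64 work
             *     items, i.e. a single wavefront per workgroup on GCN.  */
            const size_t local_8x8[2] = { 8, 8 };
            clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local_8x8,
                                   0, NULL, NULL);

            /* (b) The application passes NULL and leaves the choice to the
             *     driver, which falls back to whatever the kernel was
             *     compiled for (the 256-thread default described above). */
            clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, NULL,
                                   0, NULL, NULL);

            /* If the kernel source pins the size with
             * __attribute__((reqd_work_group_size(8, 8, 1))), the host can
             * query what was used and dispatch accordingly.              */
            size_t compiled[3] = { 0, 0, 0 };
            clGetKernelWorkGroupInfo(kernel, device,
                                     CL_KERNEL_COMPILE_WORK_GROUP_SIZE,
                                     sizeof(compiled), compiled, NULL);
            printf("compiled for %zu x %zu x %zu\n",
                   compiled[0], compiled[1], compiled[2]);
        }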
     
  9. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    Thanks for those details. So the solution is some more convergence. However, note that only part of the issues would be addressed by having nearby latency-optimized scalar cores handle task scheduling. You'd still have tasks that are not parallel enough for the many threads that current compute cores demand. As those cores become ever more parallel, the only solution will be more convergence.
    Indeed, and I suggested further convergence than that. Maybe not this year or the next, but several years down the road. It's obvious things are heading in that direction overall.
    Don't inflate what I agree upon. Haswell's 2+1 load/store with three AGUs is a fact, and it's an improvement over all previous Intel architectures. But code doesn't consist exclusively of memory accesses. x86 became the dominant ISA despite its relatively low number of logical registers and despite Intel hanging on to 1+1 L/S for a long time. CPUs can easily schedule around any L/S contention, and Haswell is going to be even harder to bottleneck. GPUs, on the other hand, can easily become L/S bottlenecked, and they demand extra threads to work around temporary contention.
    First of all, with two load ports it never has to be moved far. Of all the port contention a CPU, or a GPU for that matter, has to deal with, this is a huge luxury. But even so, CPUs can move it a long way and the code doesn't have to be much of a limitation. Especially when you're executing a loop with independent iterations, as would be typical for a throughput workload, the CPU can be executing instructions from multiple iterations.
    I don't think so. You're simply going to run out of threads if the contention on your texture or L/S port is too high, even if just locally. My point was that CPUs can still execute independent arithmetic instructions in this situation. Not only that, they can also reorder the memory accesses themselves.
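
    The kind of loop I have in mind is nothing exotic; a minimal sketch in plain C (my own example, not taken from any particular workload):

    Code:
        /* A throughput-style loop with fully independent iterations. An
         * out-of-order core can keep loads, multiplies and stores from
         * several iterations in flight at once, so a stalled access in one
         * iteration doesn't block the arithmetic of the others.          */
        #include <stddef.h>

        void saxpy(float a, const float *x, const float *y,
                   float *out, size_t n)
        {
            for (size_t i = 0; i < n; ++i)
                out[i] = a * x[i] + y[i];  /* no cross-iteration dependence */
        }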

    What I'm really trying to get at is that CPU architectures are on average every bit as good at high-throughput workloads as the GPU; they just lack SIMD width. In other words, you don't need a massive number of threads to label something as an architecture capable of high throughput. And the low thread count improves access locality, which is a problem for the GPU going forward. We do need wider SIMD in a unified architecture, but that's easy enough after a few more silicon nodes, and we can use long-running instructions to deal with a drop in cache hit rate, as discussed above.

    So we're much closer to making unification viable than you might think from comparing thread count or register file size.
     
  10. ninelven

    Veteran

    Joined:
    Dec 27, 2002
    Messages:
    1,699
    Likes Received:
    117
    Except they are not. High-throughput workloads are often highly bandwidth dependent, and bandwidth isn't free. There is a reason Xeon Phi has a 512-bit GDDR5 memory interface. I'm sure if Xeon Phi only had a 128-bit DDR3 memory interface it could have even better FLOPs/watt. But in real world usage...
     
  11. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    Currently, the needed amount of parallelism is grossly exaggerated as one has to amortize the huge setup costs.
    Read what I wrote! I just stated the basic fact that almost every architecture (that includes GPUs as well as common x86 CPUs) will be limited by the throughput of its memory system if that is pegged. Nothing more. No need to argue about something obvious. ;)
    Sure, the L1 bandwidth per flop is quite a bit lower on GPUs, but they also enjoy a much larger register space, so they don't need to use the L1 as an extended register file the way x86 CPUs do. Depending on the code and conditions, that may be an advantage or a disadvantage (to either side). That's the way architectural differences play out. But that is no solid argument for anything.
    How does the number of load ports influence the memory latency?
    So you are telling me that a GCN CU easily runs out of its up to 40 wavefronts (a wavefront can have up to 16 reads and 8 writes pending)? And usual code also has a few arithmetic instructions in between. Even when looking at just a single IB and vector ALU, a burst of memory accesses from one wavefront isn't going to stall the vALU/sALU/LDS instructions from the other wavefronts, let alone the vALU/sALU/LDS instructions from the wavefronts running from the neighboring instruction buffers and vALUs.
    I've no idea how long the queue in front of the AGUs is (i.e. how many vector memory instructions can be issued before the issue port gets blocked [it's available again after a few cycles, when the AGUs have processed the next access in the queue]), but in any case that would block exclusively the vmem issue port for a short time - not the vALU port, not the sALU port, not the branch port, not the LDS port, not the export (mem write) port, and for sure not the handling of the internal instructions (like synchronization; they are handled directly in the instruction buffer and don't need an issue port). That means all arithmetic or local memory instructions from other wavefronts continue to be issued. If you have such huge amounts of memory accesses that this would play a role, you end up being bandwidth limited anyway. But short bursts of memory accesses (let's say 8 reads directly after each other with no arithmetic instructions in between) are usually handled quite well if that is just a part of the kernel. There is apparently no performance degradation from such a grouping.
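    Just to illustrate what I mean by such a burst, a toy kernel fragment (OpenCL C, made-up names, nothing more than a sketch) that issues eight reads back to back before doing any arithmetic:

    Code:
        /* Eight back-to-back global reads, arithmetic only afterwards. The
         * vmem port queues the reads while vALU/sALU/LDS work from other
         * wavefronts keeps issuing.                                       */
        __kernel void burst_reads(__global const float *in,
                                  __global float *out)
        {
            size_t i = get_global_id(0) * 8;

            float a0 = in[i + 0], a1 = in[i + 1], a2 = in[i + 2], a3 = in[i + 3];
            float a4 = in[i + 4], a5 = in[i + 5], a6 = in[i + 6], a7 = in[i + 7];

            out[get_global_id(0)] = (a0 + a1 + a2 + a3) + (a4 + a5 + a6 + a7);
        }
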
    So do GPUs.
    Is it as good in the sense that it of course can finish the task? :lol:
    But it burns way more power for the same throughput. Having a 4 GHz CPU with fast caches and low latency is great for serial performance, but it is an unnecessary power burden for throughput tasks. You don't need an L1 hit to be served in a single nanosecond and an FMA instruction in 1.3 ns. Having more leeway on the latency side (GCN does an SPFMA in ~4 ns, a DPFMA currently in ~16 ns [latency]) enables a more power efficient design of the functional units. We don't even have to start with the simpler and therefore less power-consuming scheduling logic. It's obviously better to have a different set of lower clocked but wider execution units next to the latency optimized ones. And it's hard to impossible to scale the frequency of a pipeline by a factor of 4 without running into inefficiencies.

    Just look what happened to the serial performance of Larrabee/Knights Corner. That is a more or less throughput-optimized design which kept some latency optimizations (or didn't go as far as GPUs) to fare better at intermediate-sized tasks with dynamic and varying amounts and types of parallelism. Nevertheless, it clocks just above 1 GHz. It also does an SPFMA in ~3.5 to slightly below 4 ns. Why does it have 62 cores @ 1.1 GHz and not 16 cores @ 3.5 GHz? Coincidence or a pattern?
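
    Spelled out as rough numbers (my own back-of-envelope figures, assuming a ~4-cycle SPFMA pipeline on the throughput parts and the usual ~5-cycle FMA on a 4 GHz core):

        GCN:        4 cycles / 1.0 GHz ≈ 4.0 ns  (SPFMA latency)
        KNC:        4 cycles / 1.1 GHz ≈ 3.6 ns
        4 GHz CPU:  5 cycles / 4.0 GHz ≈ 1.25 ns

    Both throughput parts land at roughly the same FMA latency in nanoseconds, which is why it looks like a pattern rather than a coincidence to me.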
     
  12. Davros

    Legend

    Joined:
    Jun 7, 2004
    Messages:
    14,813
    Likes Received:
    2,229
    I've got a question:
    It doesn't seem possible to run a GPU at CPU speeds, e.g. 3.5 GHz, so would a UPU be limited to the 1 GHz mark?
     
  13. hkultala

    Regular

    Joined:
    May 22, 2002
    Messages:
    284
    Likes Received:
    6
    Location:
    Herwood, Tampere, Finland
    It's perfectly possible to design a GPU that would run at 3.5 GHz.

    Such a GPU would just have ~5 times fewer execution units than the current 1 GHz GPUs and would still consume more power.

    So it would be like 1.5 times slower and 2 times less power-efficient.
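
    Roughly where those numbers come from (my own back-of-envelope sketch using the standard dynamic-power rule of thumb, not an exact model):

        throughput:  3.5 GHz / 1.0 GHz x (1/5 of the units) ≈ 0.7x  ->  roughly 1.4-1.5x slower
        power:       P_dyn ≈ C * V^2 * f, and reaching 3.5 GHz needs a higher V,
                     so energy per operation grows roughly with V^2  ->  worse perf/W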

    And then there is the question of whether the core would support OOE or not; OOE is needed for high single-thread performance, but increases power consumption and is not needed for multi-threaded workloads.

    It's all about tradeoffs - what to optimize the chip for. High clock speeds and OOE are needed for high single-thread performance, for latency-critical applications. Lower clock speeds with a huge number of parallel functional units and no OOE (using multi-threading inside a core to hide latencies) are best for embarrassingly parallel workloads such as the ones GPUs run.


    There is however some research into how to make an ordinary high-performance CPU execute highly-threaded code more efficiently: MorphCore, which allows execution of 2 threads in OOE mode and 8 threads when OOE is turned off.

    http://hps.ece.utexas.edu/people/khubaib/pub/morphcore_micro2012.pdf
     
  14. Davros

    Legend

    Joined:
    Jun 7, 2004
    Messages:
    14,813
    Likes Received:
    2,229
    So if we have a UPU, will single-threaded performance be awful because of the low clock speed?
     
  15. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,723
    Likes Received:
    193
    Location:
    Stateless
    As a side note, reading this thread one would think that the GPU has already won "the war": Nvidia got the contract and will be powering exascale and so on, AMD's approach to the APU (with a "big" GPU relative to the competition) won, and so on.

    None of that has happened yet; actually, the first generation of many-core chips has pretty much just shipped (Xeon Phi and Power A2). The fight is ahead of us, not behind.
    The disregard toward many-core chips is not too surprising on this forum, but I think the pretense (or that's how it sounds to me, or close to it) that GPUs have already "won" is quite a stretch.
     
  16. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,723
    Likes Received:
    193
    Location:
    Stateless
    Another side note: I don't get how it is relevant these days, when comparing GPUs and CPUs, to speak about memory types and bandwidth.
    Nvidia announced Volta and stacked memory, but the same tech is coming to CPUs; the difference in bandwidth between CPUs and GPUs is about to be leveled out, and I would think that CPUs are the ones that will benefit the most from it.
    1 TB/s is 4 or 5 times what high-end GPUs have to play with now; for CPUs we're talking about a 20x jump.
     
  17. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    As CPUs tend to crawl through large (parallel) data sets much slower than GPUs, it isn't such a constraint for pure CPUs as it is for GPUs (or APUs).
     
  18. ninelven

    Veteran

    Joined:
    Dec 27, 2002
    Messages:
    1,699
    Likes Received:
    117
    Because for certain workloads you need that bandwidth. If you scale CPU bandwidth up, your FLOPs/watt number is going to drop. If you want to see how it is disingenuous, imagine Nvidia released a TITAN GPU with only a single 64-bit GDDR5 memory channel clocked very low (in other words, modern CPU bandwidth). The FLOPs/watt of such a chip would be very good on paper, but I doubt they would have customers lining up for such a device.

    Indeed, the power GPUs spend per GB/s of bandwidth will be significantly reduced, making comparisons more accurate.
     
  19. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
    This thread is from an era where power was not THE issue. Well, it turns out it is THE issue :)
     
  20. Tahir2

    Veteran

    Joined:
    Feb 7, 2002
    Messages:
    2,978
    Likes Received:
    86
    Location:
    Earth
    Coming from a layman's perspective and reading this thread, I quickly saw a pattern arise:

    1. Points made seem reasonable but examples used are heavily biased to prove a hypothesis
    2. Peers point out the examples are not reasonable due to reason x or y, making the hypothesis an ideal that is not sound or based on current reality
    3. Counters are made advising that the examples are fine and people just don't get it.

    This is repeated several times over.

    In the end the thread, and excuse me for the language, seems like a lot of intellectual ego stroking.

    I have also read this same thread a few times over the past few years... it begins to get a bit pointless after a while, unless I am missing something and we are on the verge of a great intellectual discovery?!
     
