I'm not sure if they ever explicitly said it's gonna be 4 GB, period?
AFAIK not, no. They always choose expressions that leave room for interpretation.
A 1250MHz clock and 640GB/s suggests that they're using HBM2 and only 2 stacks.
hm?
Isn't that just 4096-bit / 8 * 1250 = 640Giggle/s?
Then the BW doesn't work.

It could be a dual chip monster like R9 295X2.
Yeah, that's what I was thinking.

The 1.25GHz is actually 625MHz, 125MHz over default HBM1 clocks. 625MHz, DDR, = 1.25Gbps.
Regular HBM is clocked at 500MHz, but can do 1Gbps; so if 1.25GHz were the actual clock speed, this could do 2.5Gbps and therefore 640GB/s with a 2048-bit bus, i.e. two stacks.
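To make the numbers in the last few posts easy to check, here's a quick back-of-the-envelope sketch (plain host-side code; the two-stack HBM2 configuration is just the hypothetical from this thread, not a confirmed spec):

```
#include <cstdio>

// Bandwidth (GB/s) = bus width (bits) / 8 * per-pin data rate (Gbps).
static double bandwidth_gb_s(int bus_bits, double gbps_per_pin) {
    return bus_bits / 8.0 * gbps_per_pin;
}

int main() {
    // Four HBM1 stacks: 4096-bit bus at 625MHz DDR = 1.25Gbps per pin.
    std::printf("HBM1, 4 stacks: %.0f GB/s\n", bandwidth_gb_s(4096, 1.25));
    // Hypothetical two-stack HBM2 config from this thread: 2048-bit at 2.5Gbps.
    std::printf("HBM2, 2 stacks: %.0f GB/s\n", bandwidth_gb_s(2048, 2.5));
    return 0;
}
```

Both configurations land on the same 640GB/s, which is why the bandwidth figure alone doesn't settle the HBM1 vs HBM2 question.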
For APUs it is a multi-pronged problem. From an engine level it may be fine, but the implementation appears to be responsible for a sizeable portion of the significantly longer memory latency AMD's recent chips have versus pre-APU and Intel chips, which would affect the baseline performance of the system.

Yes, the CPU<->GPU coherence is not yet perfect. However, we have not seen a problem with that, since we sidestepped it completely by moving the whole graphics engine to the GPU. We don't need tight communication between the rendering and the game logic (physics, AI, etc). Obviously for general purpose computing it would be a great improvement to get a shared CPU+GPU L3 cache and good cache coherence between the units (without needing to flush caches frequently or use a slower, lower-BW bus).
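For what it's worth, a minimal host-side sketch of the kind of decoupling described here, using a hypothetical double-buffered frame snapshot between game logic and rendering (names and structure are illustrative, not sebbbi's actual engine code):

```
#include <vector>

// Game logic writes a complete snapshot of what rendering needs each frame;
// the renderer only ever reads the previous frame's snapshot. The single
// synchronisation point is the swap, so no fine-grained CPU<->GPU coherence
// is required. All names here are hypothetical.
struct ObjectState { float transform[16]; unsigned meshId; unsigned materialId; };

struct FrameSnapshot {
    std::vector<ObjectState> objects;   // copied once per frame, then read-only
};

class SnapshotQueue {
    FrameSnapshot buffers[2];
    int writeIndex = 0;
public:
    FrameSnapshot&       writeSide()      { return buffers[writeIndex]; }      // game logic
    const FrameSnapshot& readSide() const { return buffers[writeIndex ^ 1]; }  // renderer
    void swap() { writeIndex ^= 1; }       // once per frame
};
```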
The other portion of my discussion indirectly addresses this. That atomics and other traffic have to go back to the L2 is a power cost.

On the GPU side, the atomics are going directly to L2.
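For anyone less familiar with the distinction, a small CUDA sketch of the analogous idea (NVIDIA terminology, but the point about GCN is the same: shared-memory/LDS atomics stay inside the CU, while global atomics are serviced at the chip-wide L2):

```
// Per-block histogram: LDS/shared atomics stay local; only the final merge
// uses global atomics that travel out to the L2.
__global__ void histogram256(const unsigned char *data, int n, unsigned int *bins) {
    __shared__ unsigned int localBins[256];
    for (int i = threadIdx.x; i < 256; i += blockDim.x) localBins[i] = 0;
    __syncthreads();

    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        atomicAdd(&localBins[data[i]], 1u);      // shared/LDS atomic: on-CU
    __syncthreads();

    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        atomicAdd(&bins[i], localBins[i]);       // global atomic: goes to L2
}
```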
The paper at least provided for full FP support, and did rely on LDS for parts of its functionality.

Heh, I was going to say that I want a full float instruction set and a full LDS read/write instruction set... but that I don't need memory write support.
But I see some important use cases for scalar unit memory writes, especially if the scalar unit cache is not going to be coherent with the SIMD caches.
I've seen the desire for VALU ops to be able to source more than one scalar register. Expanding the register access methods and more fully integrating the domains might make that more of a possibility.

I just try to get maximum performance out of the hardware. This is why I suggest things that allow the programmer to write new algorithms that perform faster. All kinds of hardware optimizations that allow shutting down unneeded transistors are obviously very much welcome.
I did not mean to discount those use cases, which are major. I was trying to make the case from the GPGPU or HPC standpoint, since GCN was willing to sacrifice graphics-domain capabilities for the sake of them.

16-bit fields are also sufficient for unit vector processing (big parts of the lighting formulas) and for color calculations (almost all post processing). No need to waste double the amount of bits if you don't need them.
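As a purely illustrative example of the 16-bit point (CUDA's packed-half intrinsics here, since GCN's own 16-bit path isn't exposed like this; requires a GPU with native FP16 support):

```
#include <cuda_fp16.h>

// Scale-and-bias on packed FP16: two colour channels per 32-bit register,
// and half precision is plenty for LDR colour / post-processing math.
__global__ void scale_bias(__half2 *rg, int n, __half2 scale, __half2 bias) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        rg[i] = __hfma2(rg[i], scale, bias);   // fused multiply-add on both halves
}
```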
I think 3dcgi and a few others had more in-depth knowledge on it.

3dilettante and a few other hardware-oriented guys (sorry, forgot who) were discussing the GCN tessellator and geometry shader design in another thread a year or two ago. The conclusion of that discussion was that the load balancing needs improvements (hull, domain and vertex shaders have load balancing issues because of the data passing between these shader stages).
At least for GPUs, the ALUs do not have their own register memory, as they source from a wide register file. Other notable thread contexts like their instruction pointers and wavefront states are either global state for the wavefront or housed in physically common hardware.

Call it "ALU" or "simple core" - as long as each of them still has its own register memory, it's a matter of classification.
The SRAM register files in a CU have capacities on the order of some CPU L1 caches. In aggregate, they are larger than some caches. The hit logic and address translation (the need to map register IDs to their placement in the shared file) are not there, or are not there to the same extent. There are properties to their physical accesses that are closer to accessing a low-port cache array than they would be to a CPU's more heavily ported and lower-density register file.

Sorry, I don't think it's a good idea from a hardware designer's point of view.
While registers and cache memory serve the same purpose, i.e. provide fast transistor-based memory to offset slow capacitor-based RAM, they belong to very different parts of the microarchitecture and are implemented with very different design techniques.
That's the best-known use of a cache, but the register cache idea was given the name so that it would be considered different in terms of what it targeted. There are other caches out there, like the page walker caches used by the TLB fill mechanism, that are not even accessible to software.

Registers are part of the instruction set architecture; they are much closer to the ALU and are directly accessible from the microcode, so they are "hard-coded" as transistor logic. Caches are a part of the external memory controller, accessible through the memory address bus only, designed in blocks of "cells", and they are transparent to the programmer and not really a part of the ISA.
I haven't seen an actual transcript of the call, but multiple articles had the authors specifically stating they asked about this. It would seem odd that so many managed to misinterpret the answers the same way, or that AMD managed to accomplish this accidentally.

Did AMD actually go on record saying this?
I think they are waiting for the 14nm node next year for drastic changes.

AMD has a dwindling set of options for maintaining relevance and maintaining some kind of volume for its products, which goes to whether said future GCN will exist, and what resources it will be given to have a competent implementation.
Hopefully they are being conservative, since their quoted energy efficiency improvement is consistent with the process node transition only.
The transition from 28nm to 14/16nm is a hybrid case in and of itself.
20nm had some mild power efficiency benefits, but the structural change with the addition of FinFETs is the big power-efficiency gain for the hybrid nodes.
Their strong point tends to be below the highest speeds, where they are massively better at leakage reduction and lower voltages when they aren't being pushed. That's a clock range that the GPUs can operate just fine in, and their massive transistor counts like the lower leakage.
The foundries are pushing to eke out a little more improvement over 20nm, despite the shared interconnect, and they freely market half the power or less at equivalent speed (vs 28nm).
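Rough arithmetic on why the lower voltages matter so much: dynamic power scales roughly as C·V²·f, so at the same clock even a modest voltage drop accounts for a large fraction of the advertised saving. The voltages below are made-up illustrative numbers, not foundry data:

```
#include <cstdio>

// Dynamic power ~ C * V^2 * f. Hold C and f fixed and compare two
// hypothetical operating voltages at the same clock.
int main() {
    double v_28nm   = 1.00;   // illustrative 28nm planar operating voltage
    double v_finfet = 0.80;   // illustrative 14/16nm FinFET voltage at iso-clock
    std::printf("dynamic power ratio: %.2f\n",
                (v_finfet * v_finfet) / (v_28nm * v_28nm));   // ~0.64
    return 0;
}
```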
A single CU has 256KB of register file. Fiji, with 64 CUs of the same design as prior GCN GPUs, would have 16MB of registers.

Register caching sounds like a good idea. The majority of the registers are just holding data that is not needed right now, as GPUs don't have register renaming and register spilling to stack. A register cache should allow a considerably larger register file (a little bit farther away), while keeping the performance and power consumption similar. This would definitely help GCN.
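Spelling out the arithmetic behind the quoted figure (the per-CU number follows from the commonly cited GCN layout of 4 SIMDs, each with 256 VGPRs × 64 lanes × 4 bytes):

```
#include <cstdio>

// GCN CU vector register capacity and the Fiji total quoted above.
int main() {
    const int simds_per_cu   = 4;
    const int vgprs_per_simd = 256;
    const int lanes          = 64;    // wavefront width
    const int bytes_per_vgpr = 4;
    const int fiji_cus       = 64;

    int per_cu_kb = simds_per_cu * vgprs_per_simd * lanes * bytes_per_vgpr / 1024;
    std::printf("per CU: %d KB, Fiji total: %d MB\n",
                per_cu_kb, per_cu_kb * fiji_cus / 1024);   // 256 KB and 16 MB
    return 0;
}
```

In aggregate that is larger than many CPUs' entire last-level caches, which is why a register cache would have plenty of cold data to push farther away.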
I didn't realize the foundries were so bullish about their FinFET processes. I had in mind figures closer to -30% power at isospeed, but in hindsight, that might have been relative to 20nm.
You could start here and work backwards:

3dilettante and a few other hardware-oriented guys (sorry, forgot who) were discussing the GCN tessellator and geometry shader design in another thread a year or two ago. The conclusion of that discussion was that the load balancing needs improvements (hull, domain and vertex shaders have load balancing issues because of the data passing between these shader stages). These guys could pop in and give their in-depth knowledge. I have not programmed that much for DX10 Radeons (as consoles skipped DX10). GCN seems to share a lot of geometry processing design with the 5000/6000 series Radeons.