You just contradicted yourself. And no, you can't strictly partition the data into things that benefit from moving a shorter distance and things that don't. If that were the case, why use the LLC instead of RAM if it all stays put anyway? Clearly you think something is worth sharing, so why stop at the LLC?
I don't get your reasoning at all. Regarding your question: it is usually more efficient to share it at a higher level, as the lower levels get thrashed (the data doesn't fit in) and the access latency is actually lower (you don't have to do a round trip to other cores' lower-level caches to get the data, just to a common higher-level cache, which is usually faster). And as you yourself said, the bandwidth of an L1-L2 interface is large enough, and the potentially shared data in the L1 small enough, that it is no problem to push it out to a higher level. It is basically nothing other than what a multicore CPU does to ensure coherency anyway. It doesn't matter whether those are symmetric cores or a mix of latency-optimized and throughput-optimized cores. The TOCs will just operate on larger datasets on average. Differences in the expected workload may also favour other choices for the design of the lower-level caches (like the L1).
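To make the latency part of that concrete, here is a toy model; the cycle counts are purely illustrative assumptions, not measurements:

# Toy latency model for the point above; cycle counts are illustrative assumptions only.
SHARED_CACHE_HIT = 30           # hit in the common higher-level cache (assumed)
PEER_FORWARD = 15               # extra hop to pull the line out of another core's lower-level cache (assumed)

def shared_read_latency(line_sits_in_peer_l1: bool) -> int:
    """Cycles to read shared data that is NOT in the requesting core's own L1."""
    # Keeping shared data low means going to the common level *plus* the peer's cache;
    # keeping it in the common higher-level cache avoids the second hop.
    return SHARED_CACHE_HIT + (PEER_FORWARD if line_sits_in_peer_l1 else 0)

print(shared_read_latency(True))    # 45 cycles: data parked in another core's L1
print(shared_read_latency(False))   # 30 cycles: data kept at the shared level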
Interesting. Could you point me to anything that describes it? Thanks.
Just look in any ISA manual for the VLIW architectures. The VLIW architectures explicitly controlled this behaviour through the instructions. If one wanted to use the result of an operation as input to a back-to-back dependent operation, one had to explicitly use the result of the preceding instruction as the source instead of a register (the compiler did this automatically; as the manual stated, that result was not written back to the registers). This encoding into the instructions makes it simpler (the processor doesn't have to detect such dependencies, the behaviour of the bypass multiplexers is controlled directly by the instruction) without loss of functionality in the case of in-order execution with a fixed latency for all ALU operations (as in AMD's VLIW architectures).
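A toy model of the principle (plain Python; the encoding and field names are made up for illustration and have nothing to do with the real TeraScale format):

# Toy model of instruction-controlled bypassing (hypothetical encoding, not a real ISA).
# Each source operand states explicitly whether it reads a register or the previous result.

from dataclasses import dataclass

@dataclass
class Src:
    from_prev: bool      # True -> take the bypass (result of the preceding instruction)
    reg: int = 0         # register index, only used when from_prev is False

@dataclass
class Insn:
    op: str
    a: Src
    b: Src
    dst: int

def run(program, regs):
    prev = 0                              # the "previous result" latch driving the bypass mux
    for insn in program:
        a = prev if insn.a.from_prev else regs[insn.a.reg]
        b = prev if insn.b.from_prev else regs[insn.b.reg]
        prev = a + b if insn.op == "add" else a * b
        regs[insn.dst] = prev             # write-back (the quoted manual says the real hardware
                                          # suppresses this; kept here to keep the toy simple)
    return regs

# r2 = r0 + r1; r3 = (previous result) * r1 -- the second instruction encodes the bypass explicitly
regs = run([Insn("add", Src(False, 0), Src(False, 1), 2),
            Insn("mul", Src(True), Src(False, 1), 3)],
           {0: 3.0, 1: 4.0, 2: 0.0, 3: 0.0})
print(regs[3])   # 28.0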
That's what I said. VLIW5 and even VLIW4 became overkill due to changes in workload characteristics, so they switched to single-issue (which I don't think is the right move since modest multi-issue does have valuable benefits).
Multi-issue costs complexity (which you can try to minimize with static, compiler-determined scheduling) and therefore die space and power. It's not so easy for armchair experts like us to say what the right choice is.
I understand your question is genuine but I'm not going to derail this thread with a big AMD versus NVIDIA architecture discussion. Anything relevant to unified architectures has already been discussed and it would take too much weeding through other architectural differences to get to a conclusive agreement about single-issue versus dual-issue. This is a tiny difference anyway compared to the CPU's IPC versus the GPU's CPI which is far more relevant to the topic here. If you're adamant about it I'd be happy to share my opinions on GPU to GPU differences if you cared to create a new thread about it.
It has nothing to do with AMD vs. nV. I was just curious where your claim (that dual issue is the optimum for a GPU) comes from. I would think this really depends on a lot of factors, like the overall design and the expected workload, and can't be judged in isolation. If you don't want to answer it here, just write me a PM.
I know these things but I have to admit I'm not entirely sure what to call this multi-issue-but-not-from-a-single-thread behavior.
For throughput tasks, single-thread behaviour (the "thread" could also be a subset of the data elements) isn't very important. One GCN processing core (a CU) can, in a single cycle, issue multiple instructions for a subset of the many threads running on it. On that level, it's the same as what an nVidia SM does (and different from a VLIW CU, which only issues for a single thread in a given cycle).
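Roughly like this, if you want the rule spelled out (a simplified Python sketch with made-up structures; the real hardware has more restrictions than shown here):

# Simplified sketch of per-cycle issue in a throughput core: several instructions may issue
# per cycle, but at most one per execution-unit type and at most one per wavefront,
# so single-thread ILP is never exploited.

from collections import deque

def issue_cycle(wavefronts):
    """wavefronts: list of deques of (unit_type, insn) pairs. Returns what issues this cycle."""
    issued, used_units = [], set()
    for wf_id, wf in enumerate(wavefronts):
        if not wf:
            continue
        unit, insn = wf[0]
        if unit not in used_units:          # one instruction per unit type per cycle
            used_units.add(unit)
            issued.append((wf_id, wf.popleft()[1]))
        # issued or not, move on: only one instruction per wavefront per cycle
    return issued

wfs = [deque([("vector_alu", "v_add"), ("vector_mem", "load")]),
       deque([("vector_alu", "v_mul"), ("scalar", "s_cmp")]),
       deque([("scalar", "s_add")])]
print(issue_cycle(wfs))   # [(0, 'v_add'), (2, 's_add')] -- wavefront 1 waits, its v_mul lost the vector ALU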
Exactly. So we can't compare them as such. Their usage and purpose differ greatly, but the results obtained are much more closely comparable than the differences appear to lead some to assume.
What you still ignore to a large extent is the argument about the differing amounts of parallelism (and how it's exploited) in different workloads. What is your graph (which you've now linked for the second time; I already commented on it) really showing?
That would fall under the definition of "trading blows". I gave examples of CPUs beating (integrated and discrete) GPUs. You're giving examples of (discrete) GPUs beating CPUs. Fits my argument just fine, especially since integrated GPUs are far weaker.
No. What you demonstrate is the variety of workloads, and that a certain type of processor is vastly better suited to one kind of workload while another type of processor is vastly better at dealing with other workloads. Defining that as "trading blows" appears a bit ridiculous.
NVIDIA's warps are 32 elements, but Kepler has 32-element SIMD units, so they only take one cycle (at least for the regular 32-bit arithmetic ones).
Are there really 32 physical slots in each SIMD unit, or just 16 (still single-cycle issue, but then the unit is blocked for issue in the next cycle)? The latter would fit better with the number of schedulers in each SM.
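The difference is just how long one warp instruction occupies the unit; a trivial check with both assumed lane counts:

# Quick sanity check: cycles a 32-element warp occupies a SIMD unit, for both assumed lane counts.
WARP_SIZE = 32

for lanes in (32, 16):                      # the two possibilities discussed above
    cycles = WARP_SIZE // lanes             # issue is blocked for this many cycles per instruction
    print(f"{lanes} physical lanes -> {cycles} cycle(s) per warp instruction")
# 32 physical lanes -> 1 cycle(s) per warp instruction
# 16 physical lanes -> 2 cycle(s) per warp instruction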
Anyway, that is not the topic here.
GPUs do have to hide ALU latency. It is larger than one cycle so they have to swap threads to hide it.
GCN apparently doesn't have to do so. You only have to do that if latency > throughput. AMD's GPUs tend to keep latency = throughput (to some extent that is also true for the VLIW architectures; wavefronts are not exactly swapped on an instruction-to-instruction basis but only for larger instruction groups called clauses; the physical latency is a fixed 8 cycles, a wavefront gets a new VLIW instruction every 8 cycles, so the apparent latency is none, and exactly two wavefronts are processed in parallel at any given time).
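To put some approximate numbers on the latency = throughput point (the GCN cadence is my understanding of it, the VLIW figures are the ones stated above):

# Rough illustration of "latency = throughput": a wavefront only stalls on a back-to-back
# dependency if the ALU latency exceeds its own issue interval. Figures are approximate.

def stall_cycles(alu_latency, issue_interval):
    """Extra cycles a dependent instruction of the same wavefront has to wait."""
    return max(0, alu_latency - issue_interval)

# GCN: a 64-wide wavefront over a 16-wide SIMD -> one vector instruction per wavefront every
# 4 cycles, and the vector ALU latency is also ~4 cycles.
print(stall_cycles(alu_latency=4, issue_interval=4))    # 0 -> no thread swapping needed

# VLIW: fixed 8-cycle physical latency, a new VLIW instruction per wavefront every 8 cycles
# (two wavefronts interleaved), so again no apparent latency.
print(stall_cycles(alu_latency=8, issue_interval=8))    # 0

# A design where latency > issue interval has to switch to other threads to hide the gap:
print(stall_cycles(alu_latency=10, issue_interval=1))   # 9 cycles to hide by swapping threads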
The problem with the GPU's L1 cache isn't so much latency as it is bandwidth. GK104 is at 0.33 bytes per FLOP, whereas Haswell can do 4(+2) bytes per FLOP. Of course, as noted before, the usage of registers and caches is different between the GPU and the CPU, and in this case the high cache bandwidth of the CPU isn't that huge an advantage because it needs the L1 cache for more temporary variables (for which the GPU has its larger register file).
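Roughly where those figures come from (the Haswell side follows from its per-cycle capabilities, and the FMA apparently has to be counted as a single operation to arrive at 4(+2); the GK104 line just inverts the quoted number):

# Back-of-the-envelope for the bytes-per-FLOP figures above.

# Haswell core, per cycle:
simd_ops_per_cycle = 2 * 8            # two 256-bit FMA ports x 8 SP lanes (FMA counted as one op)
l1_load_bytes      = 2 * 32           # two 32-byte loads per cycle
l1_store_bytes     = 1 * 32           # one 32-byte store per cycle

print(l1_load_bytes / simd_ops_per_cycle)    # 4.0 -> the "4" in 4(+2) bytes per op
print(l1_store_bytes / simd_ops_per_cycle)   # 2.0 -> the "(+2)"

# GK104: 8 SMX x 192 lanes = 1536 ops per cycle; the quoted 0.33 B/FLOP then implies an
# aggregate L1 bandwidth of roughly 0.33 * 1536 bytes per clock, i.e. ~64 bytes/cycle per SMX.
print(0.33 * 8 * 192)                        # ~507 bytes per clock across the chip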
I definitely agree with the second sentence.
It may also be interesting to look at some size and bandwidth numbers (even if they don't tell us too much, they give a feeling for what the relevant figures are). A single GCN CU (the Tahiti ones with the higher DP speed and ECC; the other GCN members use slightly smaller ones) measures about 5.5mm² in 28nm, while a Haswell core in 22nm measures about 14.5mm² (which would be 23.5mm² normalized to 28nm assuming perfect scaling). A GCN CU @ 1 GHz can do 128 GFLOPS (SP), Haswell at a somewhat optimistic 4 GHz the same (it is faster at DP though). The GCN CU integrates 256kB of vector registers with an aggregate bandwidth of ~1 TB/s, 8kB of scalar registers with a bandwidth of 16GB/s, 64kB of shared memory with a bandwidth of 128GB/s, and 16kB of vector memory L1 cache with a bandwidth of 64GB/s (the L1-L2 connection also provides 64GB/s). Furthermore, it can access the I$ with 32GB/s (one needs fewer instructions for the same arithmetic throughput than a CPU does) and the scalar data cache with about 16GB/s.
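Most of those numbers fall straight out of the per-cycle widths at 1 GHz; a quick sanity check (the operand count per FMA is my assumption):

# Sanity check of the GCN CU figures at 1 GHz (at 1 GHz, X bytes/cycle is simply X GB/s).
clock_ghz    = 1.0
vector_lanes = 4 * 16                 # four 16-wide SIMDs per CU

# Peak SP arithmetic: one FMA per lane per cycle, counted as 2 FLOPs.
print(vector_lanes * 2 * clock_ghz)             # 128 GFLOPS

# Vector register bandwidth: an FMA needs 3 x 4-byte reads and 1 x 4-byte write per lane.
print(vector_lanes * (3 + 1) * 4 * clock_ghz)   # 1024 GB/s, i.e. ~1 TB/s

# The cache/LDS numbers above map directly to bytes per cycle at this clock:
for name, gbps in (("LDS", 128), ("vector L1", 64), ("L1-L2 link", 64), ("I$", 32), ("scalar", 16)):
    print(f"{name}: {gbps} GB/s -> {gbps / clock_ghz:.0f} bytes/cycle")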
Haswell, on the other hand, has 168 physical registers each for integer (64 bits) and floating point/SIMD (256 bits). That's ~1.3 kB of integer registers and 5.25 kB of SIMD registers. Let's concentrate on the SIMD part. The reg file probably has 6 read ports and 3 write ports, or something in that range. That would be a total bandwidth of ~1.1 TB/s (it has to be roughly the same, as the arithmetic throughput is the same as that of a GCN CU). The 32kB L1 cache offers a bandwidth of 256 GB/s for reads and 128GB/s for writes, the 256kB L2 cache can be accessed with 256GB/s, and the L1I cache with 64GB/s.
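And the same exercise for the Haswell side (the SIMD port counts are the guess stated above, the rest follows from the per-cycle widths), plus the resulting area ratio:

# The arithmetic behind the Haswell figures; the SIMD port counts are the guess from above.
clock_ghz = 4.0

print(168 * 8 / 1024)              # ~1.31 kB of 64-bit integer registers
print(168 * 32 / 1024)             # 5.25 kB of 256-bit SIMD registers

# SIMD reg file bandwidth with the assumed 6 read + 3 write ports of 32 bytes each:
print((6 + 3) * 32 * clock_ghz)    # 1152 GB/s, i.e. ~1.1 TB/s

# L1D: two 32-byte loads and one 32-byte store per cycle:
print(2 * 32 * clock_ghz)          # 256 GB/s for reads
print(1 * 32 * clock_ghz)          # 128 GB/s for writes

# Normalized area ratio from the die sizes above (28nm equivalents):
print(23.5 / 5.5)                  # ~4.3x the area of a GCN CU for the same SP throughput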
So in the end, a Haswell core needs about 4 times the normalized area (and, at 4 GHz, a lot of additional power) to provide the same arithmetic throughput and a comparable amount of SRAM in its L1+L2 caches, with somewhat comparable bandwidth numbers (lower than the comparably sized reg files of a GPU of course, but higher than their smaller caches).
So both have qualities and limitations that cancel out against each other to some degree.
Yes, but in the end we are left with the much higher die area and power required for the same throughput. What that means for throughput-oriented tasks is clear. It is even clearer what it means for graphics, as the die size number I gave already includes the conversion and filtering as well as the texture decompression logic contained in the TMUs.
It means unification is closer than you might think from comparing these kinds of numbers without the broader context.
What I left out above is the scalar performance. There, a Haswell core shines against a GCN CU. It simply puts much more emphasis on that area. That is what it means.
That's irrelevant. It only took one example of a GPU not "utterly destroying" a CPU to prove that this is not "always" the case. The "always" qualifier gave me free rein to look for any example fitting the definition of "huge out of order speculative cores" and "a GPU like throughput oriented architecture".
You again forget about the massive differences existing between workloads. The "utterly destroyed" referred to a certain class of workloads (Novum was alluding specifically to graphics!) for which it is true. And the "always" was meant as a temporal qualifier, i.e. it has been that way since the inception of GPUs and will be the same in a few years' time (okay, always is quite strong; but you can't know for sure that it's wrong, and your reasoning doesn't make much sense in this context, as the statement is definitely right for the foreseeable future).