OpenGL guy
Veteran
Quote: "Nowhere in your linked article does it state 1 TFLOPS DP. Also when ECC is enabled performance drops."

ECC does not affect compute performance on Tahiti.
Quote: "Nowhere in your linked article does it state 1 TFLOPS DP."

The presentation in the background states it's 1 TFLOP DP.
rapso said: ECC should just increase the latency slightly afaik, but that's what GPUs are supposed to hide, so it could be slower, but it shouldn't be visible in normal cases.

Quote: "Nowhere in your linked article does it state 1 TFLOPS DP."

Look at the slide seen on the wall!

Quote: "Also when ECC is enabled performance drops."

Memory performance drops; peak throughput does not.
Quote: "Guys, one question if you don't mind. So, what is the purpose of this thingie? Only supercomputers, right?"

I guess that's it; it would explain why it's called "Xeon" (nothing like a gaming or multimedia device).

Quote: "We don't expect Intel can compete with AMD and NV, regarding drivers, DirectX support, etc. My point being is that they will never offer a gaming card based on this."

No DX for sure. But who knows, maybe there will be some MIC device for consumers, kind of like there was a CELL on (my lovely) WinFast PXVC1000.
rapso said: like I said the page before, I hope skylake will have all the LRB juice in it. 512bit SIMD on 4 consumer cores with ~4GHz might end up with 1TFlop SP.

Putting wide SIMD in a 3 GHz, quad-issue OoO core will defeat the entire purpose of SIMD. You want a simple in-order core to go with the vector units.
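rapso's "1TFlop SP" figure can be sanity-checked with simple peak-throughput arithmetic. This is a sketch; the number of FMA units per core is an assumption (the thread leaves it open), and the figure only works out if you count two FMA units per core:

```python
# Peak single-precision throughput: cores * clock (GHz) * SIMD lanes * flops per lane per cycle.
def peak_gflops(cores, ghz, lanes, flops_per_lane_per_cycle):
    return cores * ghz * lanes * flops_per_lane_per_cycle

lanes_sp = 512 // 32  # 16 floats fit in a 512-bit vector

# One FMA unit per core: each lane retires 2 flops (mul + add) per cycle.
one_fma = peak_gflops(4, 4.0, lanes_sp, 2)   # 512 GFLOPS
# Two FMA units per core would double that, landing at rapso's ~1 TFLOP.
two_fma = peak_gflops(4, 4.0, lanes_sp, 4)   # 1024 GFLOPS
print(one_fma, two_fma)
```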
Quote: "Putting wide SIMD in a 3GHz, quad issue OoO core will defeat the entire purpose of SIMD. You want a simple in order core to go with the vector units."

4x float or 8x float SIMD is OK, but a "wide" 16x float SIMD is defeating its purpose? I cannot really come up with any idea why you might think that; can you elaborate?
Quote: "8 (cores) * 3.9 GigaCycles * 2 ops/cycle * 4 (double SIMD), and I end up with 249 GFlops. If you assume Intel will double the execution units for FMA, then we'll end up with ~500 GFlops for doubles. Am I missing something?"

Indeed, Haswell is believed to have two FMA units per core. Current architectures have two floating-point execution ports: one MUL and one ADD. Any configuration with just one FMA unit would lead to lower performance for legacy code, or port contention for new code, also resulting in lower throughput. So it should approach 500 GFLOPS DP for an 8-core.
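The back-of-the-envelope numbers in that exchange can be reproduced directly (a sketch using only the figures quoted in the thread: 8 cores, 3.9 GHz, 256-bit AVX holding 4 doubles):

```python
# Peak double-precision throughput: cores * clock (GHz) * SIMD lanes * flops per lane per cycle.
def peak_gflops(cores, ghz, lanes, flops_per_lane_per_cycle):
    return cores * ghz * lanes * flops_per_lane_per_cycle

# One FMA unit per core (2 flops per lane per cycle): ~249.6 GFLOPS, the "249 GFlops" above.
one_fma = peak_gflops(8, 3.9, 4, 2)
# Two FMA units per core: ~499.2 GFLOPS, the "~500 GFlops for doubles".
two_fma = peak_gflops(8, 3.9, 4, 4)
print(one_fma, two_fma)
```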
Quote: "Putting wide SIMD in a 3GHz, quad issue OoO core will defeat the entire purpose of SIMD. You want a simple in order core to go with the vector units."

As far as I know, the "purpose of SIMD" and out-of-order execution are completely orthogonal. We have GPUs with single-issue SIMD, dual-issue SIMD, and VLIW SIMD, and we have CPUs with in-order or out-of-order SIMD. And the choices seem unrelated to the width of the vectors. It would actually make sense for wider vectors to be paired with more dynamic scheduling, since it would increase efficiency at a lower relative cost.
Quote: "4x float or 8x float SIMD is ok, but a "wide" 16x float SIMD is defeating it's purpose?"

If you are running a highly data-parallel application, the superscalar hardware will just be idling ... if you are running scalar code, the SIMD will just be idling.
Quote: "If you are running a highly data parallel application the superscalar hardware will just be idling ... if you are running scalar code the SIMD will just be idling."

Why would the superscalar HW be idle with SIMD code? The throughput should be as high as with scalar code, but on the other hand you'll have way more bandwidth demand on the caches and main memory; chances are higher that data is not available, and the order of data reads by the memory controller is not strictly related to the request order.
Quote: "With a heterogenous solution, say Ivy Bridge and MIC, you can have a mix of SIMD code and scalar code without half your hardware idling ..."

It's actually well known that it's the other way around: it's nearly impossible to have a producer and a consumer and keep both equally busy on heterogeneous solutions (usually the producer ends up idle).
Can you imagine how big your hypothetical 25-core Haswell would be? I doubt it could even be manufactured.
Well, you could fit ~10 Ivy Bridge cores in 160mm^2 without the L3 and GPU. So without the L3 and GPU it could probably be manufactured. Whether it should be, on the other hand...
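Taking the thread's numbers at face value (a rough extrapolation, not real die measurements): ~10 Ivy Bridge cores in 160mm^2 implies ~16mm^2 per core, so a 25-core part would need ~400mm^2 for the cores alone, before any L3, GPU, or uncore:

```python
# Rough die-area extrapolation from the figures quoted in the thread.
area_per_core_mm2 = 160 / 10           # ~16 mm^2 per Ivy Bridge core (no L3/GPU)
cores = 25                             # the hypothetical 25-core Haswell
core_area_mm2 = cores * area_per_core_mm2  # area before L3, GPU, and uncore
print(core_area_mm2)
```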
Quote: "That's because the L3 cache is not just a performance optimization but actually plays a central role in the function of the processor, and you can't omit it."

Sure you can. Whether it is a wise thing to do is another matter, but you can certainly make a multicore processor without L3.
Quote: "The processor rapso described would also need much more space for the very wide memory controllers."

I included the controllers in my estimation.