In 3DMark 11, Broadwell ULV with 24 EUs is faster than Haswell ULV with 40 EUs.
Haswell ULV with 20 EUs is faster than Haswell ULV with 40 EUs too....
There is not enough BW to feed 40 EUs (without eDRAM). I believe the main purpose of the non-eDRAM 40 EU models was to save power (40 EUs at half the clocks is more power friendly than 20 EUs). Of course a 40 EU chip costs more to manufacture, so more EUs at lower clocks is only viable for premium models (such as Apple's MacBook Air).
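To put rough numbers on the "not enough BW" point, here is a quick back-of-the-envelope sketch in C. The throughput and clock figures are my own assumptions for a 40 EU Haswell ULT part (HD 5000 class) with dual-channel DDR3/LPDDR3-1600, not numbers from this thread:

#include <stdio.h>

int main(void)
{
    /* Assumed peak figures for a 40 EU Haswell ULT GT3 (HD 5000 class). */
    const double eus              = 40.0;
    const double flops_per_eu_clk = 16.0;    /* 2x SIMD-4 FMA per EU per clock */
    const double gpu_clk_hz       = 1.1e9;   /* ~1.1 GHz max turbo */
    const double mem_bw_bytes     = 25.6e9;  /* dual-channel DDR3-1600 */

    double peak_flops = eus * flops_per_eu_clk * gpu_clk_hz;  /* ~704 GFLOPS */

    printf("peak compute : %.0f GFLOPS\n", peak_flops / 1e9);
    printf("memory BW    : %.1f GB/s (shared with the CPU)\n", mem_bw_bytes / 1e9);
    printf("FLOPs available per byte of DRAM traffic: %.1f\n",
           peak_flops / mem_bw_bytes);
    return 0;
}

Roughly 27 FLOPs per byte of DRAM traffic, with that bandwidth also feeding the CPU, is a lot to hide behind the LLC alone, which is why the 128 MB eDRAM variants exist.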
If I understood correctly, this means that Intel Gen8 has 2x FP16 rate, just like Nvidia and Imagination. ... I wonder if there are some other architecture limitations that cap the peak rates of 16 bit operations.
Yes, it's 2x FP16 rate with no limitations that I know of in terms of getting full FP16 MAD throughput. Unlike Imagination however you do need instructions to convert FP32<->FP16 (I think Maxwell does as well?) so you will lose some efficiency on mixed precision math.
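A small OpenCL C kernel sketch of what that looks like in practice (my own illustration, not Intel sample code; the kernel name and parameters are made up). Converting the FP32 uniforms to half once keeps the per-element MAD in pure FP16, which is what is eligible for the doubled rate; mixing half and float directly in the expression would promote the math to FP32 and add conversions on every element:

#pragma OPENCL EXTENSION cl_khr_fp16 : enable

__kernel void scale_bias_fp16(__global const half *src,
                              __global half       *dst,
                              float scale,   /* FP32 parameters from the host */
                              float bias)
{
    size_t i = get_global_id(0);

    half s = convert_half(scale);   /* explicit FP32 -> FP16 conversion */
    half b = convert_half(bias);    /* explicit FP32 -> FP16 conversion */

    dst[i] = src[i] * s + b;        /* pure FP16 multiply-add */
}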
With the huge shared L3 caches (and even L4 on some configurations) this would be totally awesome, as the CPU<->GPU traffic wouldn't need to hit memory at all.
Note that this is already possible today and indeed it happens a fair bit even with standard graphics APIs. You don't need shared *virtual* memory for this, only shared *physical* memory, which has been present for a few generations. Shared virtual memory just makes it possible to directly share pointers embedded in data structures and so on. Down the road OS support could obviously also make this a lot more powerful (arbitrary-but-not-necessarily-useful example: texturing from a memory mapped file).
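For the "pointers embedded in data structures" part, here is roughly what that enables with OpenCL 2.0 coarse-grained SVM, which I believe Gen8 exposes. This is just my sketch: the helper names are made up and the context/queue/kernel setup is assumed to exist elsewhere; plain zero-copy of flat buffers (shared physical memory) was already possible before via CL_MEM_USE_HOST_PTR:

#define CL_TARGET_OPENCL_VERSION 200
#include <CL/cl.h>
#include <stddef.h>

typedef struct Node { float value; struct Node *next; } Node;

/* Build a small linked list directly in SVM memory; a kernel can chase the
   same 'next' pointers because CPU and GPU agree on the virtual addresses. */
Node *build_list_in_svm(cl_context ctx, size_t count)
{
    Node *nodes = (Node *)clSVMAlloc(ctx, CL_MEM_READ_WRITE,
                                     count * sizeof(Node), 0);
    if (!nodes) return NULL;
    for (size_t i = 0; i < count; ++i) {
        nodes[i].value = (float)i;
        nodes[i].next  = (i + 1 < count) ? &nodes[i + 1] : NULL;
    }
    return nodes;
}

cl_int launch_walk(cl_command_queue queue, cl_kernel walk_kernel, Node *head)
{
    /* Hand the raw SVM pointer to the kernel; no clCreateBuffer or
       clEnqueueWriteBuffer staging copy is involved. */
    cl_int err = clSetKernelArgSVMPointer(walk_kernel, 0, head);
    if (err != CL_SUCCESS) return err;

    size_t one = 1;
    return clEnqueueNDRangeKernel(queue, walk_kernel, 1, NULL, &one, NULL,
                                  0, NULL, NULL);
}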
At 15W, GT2 (HD 4400) is faster than GT3 (HD 5000); at 28W, GT3 (Iris 5100) is (often but not always) quicker than 35W GT2 (HD 4600). TDP is the primary limiting factor for HD 5000 and has some impact on Iris 5100, but memory bandwidth does become a limiting factor in some scenarios for Iris 5100.
It really does depend on the workload, but indeed at 15W both Haswell GT2 and GT3 deliver very similar performance. 28W GT3 is slightly faster. Thorburn's point is absolutely correct though: the limit is as much about TDP as bandwidth in many cases. You simply can't power up the entire GT3 at high frequencies - usually also while you have some CPU render thread maxing out turbo there. Indeed we showed in the DirectX 12 demo that the power limitation is very real and freeing up power - of which the CPU uses a lot - can directly buy performance.
Gen7.5 was already highly bandwidth starved (without the 128 MB eDRAM). 20%-30% gains without any BW improvements are quite nice. I don't see how they could have improved the performance much further with the same memory configuration (DDR3/LPDDR3-1600). Let's wait for the eDRAM-equipped models (with 48 EUs). Desktop / server models likely support DDR4 as well.
I'm not convinced at least the 15W models were all that constrained by memory bandwidth (at least when equipped with dual channel memory, which, unlike in the AMD camp, almost all vendors do...). And even if so, this could be improved even with the same memory bandwidth - if you look at GM108, the performance it achieves with just half the bandwidth of these Intel chips (as it's saddled with 64-bit DDR3) is quite astonishing; those bandwidth saving measures apparently work wonders.
Do the 16 bit operations (int16 and fp16) operate in 16 or 32 bit registers? I want to know whether this can be used to reduce the GPR footprint (allow higher occupancy).
But did Gen 7.5 have cache coherency between the GPU and the CPU caches? If so, then Intel has been further ahead than I expected.
The Intel chip shares its BW with the CPU, and it has half the TDP (and even that is shared between the CPU and the GPU). Memory accesses require lots of power. 15 W is likely not enough to run the bus + the chips at full pace for long periods of time.
The register file in Gen is extremely general. Short answer is yes you can reduce register pressure by using 16-bit data. Longer answer is the register file itself can be indexed in a pretty wide range of ways (different operand sizes, strides, etc), so at the architecture level how you load/store intermediates in it is quite flexible.
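As a rough illustration of the register footprint angle: as far as I know each Gen hardware thread has a fixed 128 x 256-bit GRF (4 KB), so a SIMD-8 temporary costs a whole register at 32 bits but only half of one at 16 bits. Whether the compiler actually packs values that tightly is up to the backend, so treat this little C sketch as an upper bound, not a statement about what the Intel compiler does:

#include <stdio.h>

int main(void)
{
    /* Assumed Gen7.5/Gen8 layout: 128 GRF registers of 32 bytes each
       per hardware thread. */
    const int grf_count  = 128;
    const int grf_bytes  = 32;
    const int simd_width = 8;                 /* SIMD-8 shader compilation */

    int total_bytes = grf_count * grf_bytes;  /* 4096 bytes */
    int fp32_bytes  = simd_width * 4;         /* one SIMD-8 float temporary */
    int fp16_bytes  = simd_width * 2;         /* one SIMD-8 half temporary  */

    printf("SIMD-8 fp32 temporaries that fit: %d\n", total_bytes / fp32_bytes);
    printf("SIMD-8 fp16 temporaries that fit: %d\n", total_bytes / fp16_bytes);
    return 0;
}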
It's a bit more nuanced than that as it depends on exactly which cache. In general (and my memory of the details is slightly fuzzy) GPU accesses will tend to snoop CPU caches but not the other way around. i.e. GPU writes need to be manually flushed from caches in Gen7.5. Additionally the texture cache is not coherent as that is very expensive, but normally not a huge deal as doing the required flushes before texturing is usually straightforward.
Definitely true about it being shared with the CPU. Note that the LLC is shared with the CPU as well, and on the 15W parts it's half the size (dual core). The latter actually often hurts a lot more in practice than the former, as 4MB of LLC really isn't enough for both the CPU and GPU to mitigate significant portions of the bandwidth usage, so in some cases the driver will elect to cache fewer surfaces in LLC than would be desirable.
Heh, believe me there are a large number of single channel Intel systems out there too. For whatever reason OEMs seem to like providing base models with only 1 DIMM, then providing 2 DIMMs as an "upgrade". But of course to a user the option is just presented as "4GB vs 8GB RAM" without the note that the 4GB system is effectively crippled and will run at literally half the speed in a lot of cases. Sigh.
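The "half the speed" part is just channel arithmetic; a tiny sketch assuming DDR3-1600:

#include <stdio.h>

int main(void)
{
    const double transfers_per_sec  = 1600e6;  /* DDR3-1600 */
    const double bytes_per_transfer = 8.0;     /* one 64-bit channel */

    double one_channel = transfers_per_sec * bytes_per_transfer;  /* 12.8 GB/s */
    double two_channel = 2.0 * one_channel;                       /* 25.6 GB/s */

    printf("1 DIMM (single channel): %.1f GB/s\n", one_channel / 1e9);
    printf("2 DIMMs (dual channel) : %.1f GB/s\n", two_channel / 1e9);
    return 0;
}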
And GT3 may be wasting a bit more static power and therefore possibly have lower performance? Or will GT3 shut down half the GPU in this case?
Did Intel already sell a dual-core solution as a Core i7?
I was aware that some Core i5s could be dual core, but for a Core i7 it is new to me.
On the topic of performance per watt, it is nigh impossible to come up with a fair comparison for Intel GPUs:
Intel sells no discrete GPU.
The closest we have is AMD APUs, and they use a different process.
I would assume that running 20 EUs at 1300 MHz would consume quite a bit more power than running 40 EUs at 650 MHz (isn't power consumption roughly quadratic to clocks?). Having more EUs is also better, as it lets the GPU finish the frame sooner and shut down all caches, front ends, etc. (race-to-sleep). I am not a HW engineer; maybe Andrew could help us understand this better.
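To make the hand-waving a bit more concrete, here is a toy dynamic-power model, P ~ N_EU * V^2 * f, with an invented linear V(f) curve. The voltage numbers are purely illustrative and leakage/static power (the GT3 concern raised above) is ignored:

#include <stdio.h>

double voltage(double f_ghz)          /* assumed: V = 0.6 V + 0.4 V per GHz */
{
    return 0.6 + 0.4 * f_ghz;
}

double rel_dynamic_power(double eus, double f_ghz)
{
    double v = voltage(f_ghz);
    return eus * v * v * f_ghz;       /* arbitrary units */
}

int main(void)
{
    double narrow = rel_dynamic_power(20.0, 1.30);  /* 20 EUs at 1.3 GHz  */
    double wide   = rel_dynamic_power(40.0, 0.65);  /* 40 EUs at 0.65 GHz */

    printf("20 EU @ 1.30 GHz: %.1f units\n", narrow);
    printf("40 EU @ 0.65 GHz: %.1f units (%.0f%% of the narrow config)\n",
           wide, 100.0 * wide / narrow);
    return 0;
}

Even with this crude model the wide-and-slow configuration ends up around 60% of the dynamic power for the same peak throughput; the open question is how much of that the extra leakage and die area of 40 EUs eats back.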
Not really. If you equalize memory bandwidth and TDP used, 15W U GT3 is always faster than 15W U GT2. Note though that in some models the GT2 uses cTDP-up (like the Vaio Duo 13, and some non-convertibles as well) and performs much better than the models using nominal TDP.
I would politely disagree - across 200+ data points of my own testing in various games and 3D application tests on otherwise identical hardware configurations, HD 4400 consistently outperforms HD 5000. HD 5000 may well be quicker in synthetic tests that only stress individual elements of the GPU and leave the CPU cores more or less idle, though.
I am honestly wondering what the answer to this question is. I mean, why did they bother with doubling the die on the GT3 15W parts?
Haswell was running at very high GPU clocks. High-end laptop models had 1.35 GHz max turbo (and even ULV models were running at 1.2 GHz). All the announced Broadwell models are running at more conservative GPU clocks of 950 MHz and 1.0 GHz. It seems that Intel has come to the conclusion that a wider GPU at more conservative clocks is better for performance/watt. If you look at the current Nvidia and AMD desktop designs, you'll notice that ~1.0 GHz seems to be the magical number for GPU efficiency. Fermi had a ~1.5 GHz shader clock; Kepler dropped the clocks back to ~1.0 GHz and provided one of the biggest improvements in performance/watt in Nvidia history. 14 nm allows Intel to fit a bigger GPU, so they don't need to keep it small and clock it up to 1.35 GHz anymore. I am sure this new design is better for performance/watt.