In 3DMark 11, Broadwell ULV with 24 EUs is faster than Haswell ULV with 40 EUs.
Haswell ULV with 20 EUs is faster than Haswell ULV with 40 EUs too....
There is not enough BW to feed 40 EUs (without eDRAM). I believe the main purpose of the non-eDRAM 40 EU models was to save power (40 EUs at half the clocks is more power friendly than 20 EUs). Of course a 40 EU chip costs more to manufacture, so more EUs at lower clocks is only viable for premium models (such as Apple's MacBook Air).
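To put rough numbers on the "not enough BW" point, here is a quick back-of-the-envelope sketch in C. The throughput and clock figures are my own assumptions for a 40 EU Haswell ULT part (HD 5000 class) with dual-channel DDR3/LPDDR3-1600, not numbers from this thread:

#include <stdio.h>

int main(void)
{
    /* Assumed peak figures for a 40 EU Haswell ULT GT3 (HD 5000 class). */
    const double eus              = 40.0;
    const double flops_per_eu_clk = 16.0;    /* 2x SIMD-4 FMA per EU per clock */
    const double gpu_clk_hz       = 1.1e9;   /* ~1.1 GHz max turbo */
    const double mem_bw_bytes     = 25.6e9;  /* dual-channel DDR3-1600 */

    double peak_flops = eus * flops_per_eu_clk * gpu_clk_hz;  /* ~704 GFLOPS */

    printf("peak compute : %.0f GFLOPS\n", peak_flops / 1e9);
    printf("memory BW    : %.1f GB/s (shared with the CPU)\n", mem_bw_bytes / 1e9);
    printf("FLOPs available per byte of DRAM traffic: %.1f\n",
           peak_flops / mem_bw_bytes);
    return 0;
}

Roughly 27 FLOPs per byte of DRAM traffic, with that bandwidth also feeding the CPU, is a lot to hide behind the LLC alone, which is why the 128 MB eDRAM variants exist.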
If I understood correctly, this means that Intel Gen8 has 2x FP16 rate, just like Nvidia and Imagination. ... I wonder if there are some other architecture limitations that cap the peak rates of 16 bit operations.
Yes, it's 2x FP16 rate with no limitations that I know of in terms of getting full FP16 MAD throughput. Unlike Imagination however you do need instructions to convert FP32<->FP16 (I think Maxwell does as well?) so you will lose some efficiency on mixed precision math.
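A small OpenCL C kernel sketch of what that looks like in practice (my own illustration, not Intel sample code; the kernel name and parameters are made up). Converting the FP32 uniforms to half once keeps the per-element MAD in pure FP16, which is what is eligible for the doubled rate; mixing half and float directly in the expression would promote the math to FP32 and add conversions on every element:

#pragma OPENCL EXTENSION cl_khr_fp16 : enable

__kernel void scale_bias_fp16(__global const half *src,
                              __global half       *dst,
                              float scale,   /* FP32 parameters from the host */
                              float bias)
{
    size_t i = get_global_id(0);

    half s = convert_half(scale);   /* explicit FP32 -> FP16 conversion */
    half b = convert_half(bias);    /* explicit FP32 -> FP16 conversion */

    dst[i] = src[i] * s + b;        /* pure FP16 multiply-add */
}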
With the huge shared L3 caches (and even L4 on some configurations) this would be totally awesome, as the CPU<->GPU traffic wouldn't need to hit memory at all.
Note that this is already possible today and indeed it happens a fair bit even with standard graphics APIs. You don't need shared *virtual* memory for this, only shared *physical* memory, which has been present for a few generations. Shared virtual memory just makes it possible to directly share pointers embedded in data structures and so on. Down the road OS support could obviously also make this a lot more powerful (arbitrary-but-not-necessarily-useful example: texturing from a memory mapped file).
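For the "pointers embedded in data structures" part, here is roughly what that enables with OpenCL 2.0 coarse-grained SVM, which I believe Gen8 exposes. This is just my sketch: the helper names are made up and the context/queue/kernel setup is assumed to exist elsewhere; plain zero-copy of flat buffers (shared physical memory) was already possible before via CL_MEM_USE_HOST_PTR:

#define CL_TARGET_OPENCL_VERSION 200
#include <CL/cl.h>
#include <stddef.h>

typedef struct Node { float value; struct Node *next; } Node;

/* Build a small linked list directly in SVM memory; a kernel can chase the
   same 'next' pointers because CPU and GPU agree on the virtual addresses. */
Node *build_list_in_svm(cl_context ctx, size_t count)
{
    Node *nodes = (Node *)clSVMAlloc(ctx, CL_MEM_READ_WRITE,
                                     count * sizeof(Node), 0);
    if (!nodes) return NULL;
    for (size_t i = 0; i < count; ++i) {
        nodes[i].value = (float)i;
        nodes[i].next  = (i + 1 < count) ? &nodes[i + 1] : NULL;
    }
    return nodes;
}

cl_int launch_walk(cl_command_queue queue, cl_kernel walk_kernel, Node *head)
{
    /* Hand the raw SVM pointer to the kernel; no clCreateBuffer or
       clEnqueueWriteBuffer staging copy is involved. */
    cl_int err = clSetKernelArgSVMPointer(walk_kernel, 0, head);
    if (err != CL_SUCCESS) return err;

    size_t one = 1;
    return clEnqueueNDRangeKernel(queue, walk_kernel, 1, NULL, &one, NULL,
                                  0, NULL, NULL);
}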
At 15W, GT2 (HD 4400) is faster than GT3 (HD 5000); at 28W, GT3 (Iris 5100) is (often but not always) quicker than 35W GT2 (HD 4600). TDP is the primary limiting factor for HD 5000 and has some impact on Iris 5100, but memory bandwidth does become a limiting factor in some scenarios for Iris 5100.
It really does depend on the workload, but indeed at 15W both Haswell GT2 and GT3 deliver very similar performance. 28W GT3 is slightly faster. Thorburn's point is absolutely correct though: the limit is as much about TDP as bandwidth in many cases. You simply can't power up the entire GT3 at high frequencies - usually also while you have some CPU render thread maxing out turbo there. Indeed we showed in the DirectX 12 demo that the power limitation is very real and freeing up power - of which the CPU uses a lot - can directly buy performance.
Gen7.5 was already highly bandwidth starved (without the 128 MB eDRAM). 20%-30% gains without any BW improvements are quite nice. I don't see how they could have improved the performance much further with the same memory configuration (DDR3/LPDDR3-1600). Let's wait for the eDRAM-equipped models (with 48 EUs). Desktop / server models likely support DDR4 as well.
I'm not convinced at least the 15W models were all that constrained by memory bandwidth (at least when equipped with dual channel memory, which, unlike in the AMD camp, almost all vendors do...). And even if so, this could be improved even with the same memory bandwidth - if you look at GM108, the performance it achieves with just half the bandwidth of these Intel chips (as it's saddled with 64-bit DDR3) is quite astonishing; those bandwidth saving measures apparently work wonders.
Do the 16 bit operations (int16 and fp16) operate in 16 or 32 bit registers? I want to know whether this can be used to reduce the GPR footprint (allow higher occupancy).
But did Gen 7.5 have cache coherency between the GPU and the CPU caches? If so, then Intel has been further ahead than I expected.
The Intel chip shares its BW with the CPU, and it has half the TDP (and even that is shared between the CPU and the GPU). Memory accesses require lots of power. 15 W is likely not enough to run the bus + the chips at full pace for long periods of time.
The register file in Gen is extremely general. Short answer is yes you can reduce register pressure by using 16-bit data. Longer answer is the register file itself can be indexed in a pretty wide range of ways (different operand sizes, strides, etc), so at the architecture level how you load/store intermediates in it is quite flexible.
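As a rough illustration of the register footprint angle: as far as I know each Gen hardware thread has a fixed 128 x 256-bit GRF (4 KB), so a SIMD-8 temporary costs a whole register at 32 bits but only half of one at 16 bits. Whether the compiler actually packs values that tightly is up to the backend, so treat this little C sketch as an upper bound, not a statement about what the Intel compiler does:

#include <stdio.h>

int main(void)
{
    /* Assumed Gen7.5/Gen8 layout: 128 GRF registers of 32 bytes each
       per hardware thread. */
    const int grf_count  = 128;
    const int grf_bytes  = 32;
    const int simd_width = 8;                 /* SIMD-8 shader compilation */

    int total_bytes = grf_count * grf_bytes;  /* 4096 bytes */
    int fp32_bytes  = simd_width * 4;         /* one SIMD-8 float temporary */
    int fp16_bytes  = simd_width * 2;         /* one SIMD-8 half temporary  */

    printf("SIMD-8 fp32 temporaries that fit: %d\n", total_bytes / fp32_bytes);
    printf("SIMD-8 fp16 temporaries that fit: %d\n", total_bytes / fp16_bytes);
    return 0;
}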
It's a bit more nuanced than that as it depends on exactly which cache. In general (and my memory of the details is slightly fuzzy) GPU accesses will tend to snoop CPU caches but not the other way around. i.e. GPU writes need to be manually flushed from caches in Gen7.5. Additionally the texture cache is not coherent as that is very expensive, but normally not a huge deal as doing the required flushes before texturing is usually straightforward.
Definitely true about it being shared with the CPU. Note that the LLC is shared with the CPU as well, and on the 15W parts it's half the size (dual core). The latter actually often hurts a lot more in practice than the former, as 4MB of LLC really isn't enough for both the CPU and GPU to mitigate significant portions of the bandwidth usage, so in some cases the driver will elect to cache fewer surfaces in LLC than would be desirable.
Heh, believe me there are a large number of single channel Intel systems out there too. For whatever reason OEMs seem to like providing base models with only 1 DIMM, then providing 2 DIMMs as an "upgrade". But of course to a user the option is just presented as "4GB vs 8GB RAM" without the note that the 4GB system is effectively crippled and will run at literally half the speed in a lot of cases. Sigh.
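The "half the speed" part is just channel arithmetic; a tiny sketch assuming DDR3-1600:

#include <stdio.h>

int main(void)
{
    const double transfers_per_sec  = 1600e6;  /* DDR3-1600 */
    const double bytes_per_transfer = 8.0;     /* one 64-bit channel */

    double one_channel = transfers_per_sec * bytes_per_transfer;  /* 12.8 GB/s */
    double two_channel = 2.0 * one_channel;                       /* 25.6 GB/s */

    printf("1 DIMM (single channel): %.1f GB/s\n", one_channel / 1e9);
    printf("2 DIMMs (dual channel) : %.1f GB/s\n", two_channel / 1e9);
    return 0;
}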
And GT3 may be wasting a bit more static power and therefore possibly have lower performance? Or will GT3 shut down half the GPU in this case?
Did Intel already sell a dual-core solution as a Core i7?
I was aware that some Core i5s could be dual core, but for a Core i7 it is new to me.
On the topic of performance per watt, it is nigh impossible to come up with a fair comparison for Intel GPUs:
Intel sells no discrete GPU.
The closest we have is AMD APUs, and they use a different process.
I would assume that running 20 EUs at 1300 MHz would consume quite a bit more power than running 40 EUs at 650 MHz (isn't power consumption roughly quadratic to clocks?). Having more EUs is also better, as it lets the GPU finish the frame sooner and shut down all caches, front ends, etc. (race-to-sleep). I am not a HW engineer; maybe Andrew could help us understand this better.
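To make the hand-waving a bit more concrete, here is a toy dynamic-power model, P ~ N_EU * V^2 * f, with an invented linear V(f) curve. The voltage numbers are purely illustrative and leakage/static power (the GT3 concern raised above) is ignored:

#include <stdio.h>

double voltage(double f_ghz)          /* assumed: V = 0.6 V + 0.4 V per GHz */
{
    return 0.6 + 0.4 * f_ghz;
}

double rel_dynamic_power(double eus, double f_ghz)
{
    double v = voltage(f_ghz);
    return eus * v * v * f_ghz;       /* arbitrary units */
}

int main(void)
{
    double narrow = rel_dynamic_power(20.0, 1.30);  /* 20 EUs at 1.3 GHz  */
    double wide   = rel_dynamic_power(40.0, 0.65);  /* 40 EUs at 0.65 GHz */

    printf("20 EU @ 1.30 GHz: %.1f units\n", narrow);
    printf("40 EU @ 0.65 GHz: %.1f units (%.0f%% of the narrow config)\n",
           wide, 100.0 * wide / narrow);
    return 0;
}

Even with this crude model the wide-and-slow configuration ends up around 60% of the dynamic power for the same peak throughput; the open question is how much of that the extra leakage and die area of 40 EUs eats back.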
Not really. If you equalize memory bandwidth and TDP used, 15W U GT3 is always faster than 15W U GT2. Note though that in some models the GT2 uses cTDP-up (like the Vaio Duo 13, and some non-convertibles as well) and performs much better than the models using nominal TDP.
I would politely disagree - across 200+ data points of my own testing in various games and 3D application tests on otherwise identical hardware configurations, HD 4400 consistently outperforms HD 5000. HD 5000 may well be quicker in synthetic tests that only stress individual elements of the GPU and leave the CPU cores more or less idle, though.
I am honestly wondering what the answer to this question is. I mean, why did they bother with doubling the die on the GT3 15W parts?
Haswell was running at very high GPU clocks. High-end laptop models had 1.35 GHz max turbo (and even ULV models were running at 1.2 GHz). All the announced Broadwell models are running at more conservative GPU clocks of 950 MHz and 1.0 GHz. It seems that Intel has come to the conclusion that a wider GPU at more conservative clocks is better for performance/watt. If you look at the current Nvidia and AMD desktop designs, you'll notice that ~1.0 GHz seems to be the magical number for GPU efficiency. Fermi had a ~1.5 GHz shader clock; Kepler dropped the clocks back to ~1.0 GHz and provided one of the biggest improvements in performance/watt in Nvidia history. 14 nm allows Intel to fit a bigger GPU, so they don't need to keep it small and clock it up to 1.35 GHz anymore. I am sure this new design is better for performance/watt.