22 nm Larrabee

liolio · May 13, 2011

I've a couple of "honest" questions. Some here are real software developers as Nick others seems to know their fair share either about hardware/micro-electronic and software, I'm just a geek, so no offence

First in regard to the comparison between SwiftShader and intel HD3000.
There's x5 difference in the 3D mark06 score. OK.
*What is the cost of running ShiftShader "itself" on the CPU? Is it in the same ball park as running the HD3000 drivers? Or higher, if yes significantly?
*In regard to power consumption, what is the usual power consumption of a CPU running 3Dmark06 on a discrete GPU? As I think it would be fair to consider the incompressible/fixed CPU cost to run something as 3Dmark.
*Overall can we consider the overall cost (in power and compute power) of swiftshader in the same ballpark as drivers?
*Another thing the HD3000 is not tiny by any mean if this floormap correct, it looks more like equal ~2 cores:

Overall it would be more fair to compare a quadcore to a dual core+IGP. From a costumer POV what serves the most? A quadcore? a dual core+ (shitty anyway)IGP? In regard to power how a HD3000 compares to two SnB cores? I guess that tough to find out. Anyway the IGP is likely way better in perfs per Watts by quiet an healthy margin.

Some questions more specifically aimed at you Nick.
* Is swiftShader optimized for AVX already?
* What are your expectations in regard to for example 3Dmark06 if it were implement if not straight to the metal using various libraries? How close do you think it would come to the IGP/HD3000?
* Say a bench or game were desgin with a CPU as hardware target, how close to think the end result would compare to an IGP (the HD3000 can serve as ref). Say you pass on some calculations and use more complex, bigger datastructures so more precompute values, do sacrifices clever trick elsewhere. Devs could count on 4GB or more of RAM, lot of cache, etc.
Basically do you think that it would be possible achieve for a quad-cores the "same" result as with an IGP+dual cores.

rpg.314 · May 13, 2011

liolio said:
Another thing the HD3000 is not tiny by any mean if this floormap correct, it looks more like equal ~2 cores:

It's more like 1.5. You have to count the L3 cache per core as well.

rpg.314 · May 13, 2011

Nick said:
It's not a lot of hardware at all. Like I said AVX already reserves the encoding bits to extend it to 1024-bit operations, FMA instructions are already specified, and gather/scatter requires little more than two 512 to 128-bit shuffle networks. Yet these minor things would make a major difference in SIMD efficiency (both effective performance and power consumption).

Sure, all you need is encoding space and the generalized shuffle networks don't cost a dime of L1D latency. Or area.

And none of this is specific to graphics. Every other application out there that uses SIMD (for which Intel clearly considered it worthwhile widening things to 256-bit execution), would gain significant benefit from these few things as well. And for applications which previously saw no gains from SIMD, gather/scatter can make the difference. Not to mention all of the new applications that become possible when the CPU approaches usable 1 TFLOP performance. So the value of this goes way beyond graphics without boundaries.

Where are these apps which scale with cpu cores and vector width, but not with GPU's, let alone scaling even more with GPU's? Really, where are they?

I don't want or need magical fairly tales. I want real examples.

There's no evidence of that. The latest games are already specifying quad-core CPUs in the recommended system spec, and AMD will soon launch a highly anticipated 8-core CPU. The software world is slow to adopt multi-core programming techniques, but it's really a one-time investment. Once you have a scalable software architecture, more cores get you direct benefit. Even NVIDIA's Kal-El processor is betting on 4 cores becoming the norm soon. It would be foolish to think that once the majority of software is making use of 4 cores, it's not going to evolve beyond.

Like I said, games aren't the answer. And still no real examples, just sermons.

That IGP is helpless on its own. So you have to take the power consumption of the API and driver layers running on the CPU into account, as well as the power consumption of the L3 cache and memory controller.

Pointless as these costs apply to sw rendering as well.

There will still be a power efficiency difference, but once again you have to look at the complete bundle of advantages you get in return.

I get really big expensive and power hungry transistors with their potential users running towards the other side of the die as fast as they can.

Also, you said yourself I should compare it to an IGP expected in that time frame: It will be more generic, meaning it's actually closer to a CPU architecture itself, and in relative terms less power efficient than a more dedicated IGP.

It will still thrash the cpu cores. Absolutely thrash them.

And again, once the software actually starts to make optimal use of the highly programmable throughput CPU architecture, you can do more with less. I know for example of a medical application with a dedicated SSE4 optimized software renderer for voxel data, which is several times faster than using their Direct3D 9 rendering path and SwiftShader.

If it is bog standard voxel rendering then what were they thinking? They should have bundled a $50 video card with their app instead of investing all that money into code.

That's an entirely different situation. LRB1 was supposed to compete in the high end market. Missing its target by 40% was completely unforgivable. If instead it takes 70 Watt for a system to achieve the same legacy graphics performance as a system with an IGP achieves at 50 Watt, that's not nearly as disastrous. Power consumption is a limiting factor in the high-end, but not so much in the low-end. Price and features are at least as important for commercial success.

Anyway, there's no need to ditch the IGP as long as it serves a purpose. Intel (as well as AMD) could add gather/scatter and AVX-1024 support and leave it up to the developers to choose between legacy graphics or cutting-edge custom development. By the way, the latter doesn't mean everyone has to reinvent the wheel. People can create open-source or commercial libraries/frameworks/engines for various application fields, expanding the possibilities way beyond the current small set of restrictive APIs. Also, developers would become independent of hardware drivers (both for stability and performance they currently still cause a lot of issues).

While that would be nice, that road has a lot of stumbling blocks. Games are made for consoles these days, with a few PC specific features used. With no console with a flexi arch around, who'll invest that much for one chip out of three?

That said, it's still underutilized, and the one reason for that is the lack of gather/scatter. Compilers have a hard time parallelizing loops without it. Once gather/scatter support is added, a mere recompile will speed up any application which has loops with independent iterations. That's practically all applications. So again, the use and value of these features goes well beyond graphics alone.

You expect Office to speed up with scatter/gather. Or Windows?

rpg.314 · May 13, 2011

Nick said:
You're going to have to have to show us some proof of that. It's very early days since single-die CPU+IGP chips have only just appeared. So far I've only seen evolutionary progress, while quad-core and wider vectors are entering the mainstream CPU market.

Just measure the total throughput growth of CPU's over the last 10 years. Contrast it with the area devoted to igp's focusing on last 3 years.

Hint, only 55% area of SB is CPU, where has the rest gone?

compres · May 13, 2011

rpg.314 said:
It's more like 1.5. You have to count the L3 cache per core as well.

Yo are stating that the GPU does not share the L3 BW with the cores. Can anyone confirm?

3dilettante · May 13, 2011

The GPU has access to the L3. The L3's dimensions are determined by the cores in SB. The tiny L2 on SB and its advanced power gating rely on there being an L3 tile per core.

compres · May 14, 2011

So the more cores the more L3. And the GPU has access.

Are there any tests showing the same GPU in 2 vs 4 core sandyb. configurations?

entity279 · May 14, 2011

rpg.314 said:
Where are these apps which scale with cpu cores and vector width, but not with GPU's, let alone scaling even more with GPU's? Really, where are they?

I don't want or need magical fairly tales. I want real examples.

I concur. Nick, you went way too quick over this point. The vector part of our x86 CPUs has always seemed underutilized to me, by consumer apps at least. It's not always trivial to code for either, and you need to give the compiler hints.

CarstenS · May 14, 2011

compres said:
So the more cores the more L3. And the GPU has access.

Are there any tests showing the same GPU in 2 vs 4 core sandyb. configurations?

The additional SB cores can also access other cores tiles, but they have to go the long way, increasing latency. It's not like the IGP suddenly has more memory for itself.

rpg.314 · May 14, 2011

afaik, that access is read only.

Nick · May 16, 2011

liolio said:
I've a couple of "honest" questions. Some here are real software developers as Nick others seems to know their fair share either about hardware/micro-electronic and software, I'm just a geek, so no offence

I'm actually a computer engineer with a minor in embedded systems. But no offence taken.

*What is the cost of running ShiftShader "itself" on the CPU? Is it in the same ball park as running the HD3000 drivers? Or higher, if yes significantly?

Good question. The vast majority of execution time goes to dynamically generated processing routines. The rest is divided between some 'fixed-function' processing, format conversions, and the actual 'driver' and API functionality. The latter two (which is what I assume you meant by SwiftShader "itself") are really thin layers. There's a very short path between the application and starting the actual calculations.

That said, some reviews report that Intel puts a lot of load on the CPU while rendering 3D graphics: CPU Usage in Graphics. Some even claim all geometry shaders execute on the CPU.

In any case to objectively compare pure software rendering against the IGP, I don't think we can neglect the many roles the CPU still plays for assisting the IGP. Unfortunately I don't have a Sandy Bridge system myself so I can't provide any accurate numbers.

Is swiftShader optimized for AVX already?

No.

What are your expectations in regard to for example 3Dmark06 if it were implement if not straight to the metal using various libraries? How close do you think it would come to the IGP/HD3000?

Hard to say. If I recall correctly it uses some blur filters which could be implemented way more efficiently with custom vector code instead of lots of texture lookups. But I'm sure that by having a full overview of the rendering process at an application level, there's a lot more that can be optimized by departing from the legacy graphics pipeline.

Just look at the sheer computing power. An i7-2600 can do 218 GFLOPS (not counting in any turbo mode). At 800x600, that's a staggering 450,000 floating-point operations per pixel per second, or a budget of 15,000 operations per pixel at 30 frames per second. Currently a lot of this power goes to waste though because of the lack of gather/scatter (forcing some memory accesses to be serial scalar operations), and because the API demands certain detours.

Basically do you think that it would be possible achieve for a quad-cores the "same" result as with an IGP+dual cores.

With gather/scatter, FMA and AVX-1024 support, yes, I'm convinced that the IGP would be a waste of silicon. It might take many more years for gather/scatter support to be implemented though, so quad-cores are probably outdated by then. But given that the CPU is already ahead of the IGP in GFLOPS, FMA will double it again, the IGP is limited by bandwidth, and graphics itself is getting more generic, I think it's very doubtful that the IGP can outrun its fate.

Nick · May 16, 2011

rpg.314 said:
Sure, all you need is encoding space and the generalized shuffle networks don't cost a dime of L1D latency. Or area.

It's only two instruction encodings, not a big deal. And gather/scatter can have higher L1 latency than other memory accesses. Especially with AVX-1024 on 256-bit execution units that latency is easily hidden. And area shouldn't be too much of an issue either given that LRB3 apparently has wider shuffle networks and more cores.

Where are these apps which scale with cpu cores and vector width, but not with GPU's, let alone scaling even more with GPU's? Really, where are they?

Why exclude applications which scale with the GPU? Every single GPGPU application is a really nice example of something that would greatly benefit from gather/scatter and extra cores.

Besides, it's a chicken-and-egg problem. There aren't many truly scalable multi-threaded applications yet because there's still a fairly low percentage of quad-core systems. But that's going to change in the next few years. Also note that there are very few consumer GPGPU applications, for the exact same sort of reason (few DX10+ capable systems). Developers simply won't invest into something that is not likely to pay off. But that doesn't mean we can't start looking at the sort of technology that will be most interesting for the future. Given that the CPU is ahead of the IGP in processing power (and there's more to come with FMA), but lacks some efficiency, it makes sense to add gather/scatter support, lower power consumption with FMA-1024, and replace the IGP with more CPU cores.

Pointless as these costs apply to sw rendering as well.

Only partially, and it shifts the balance. If for instance the IGP itself consumes 20 Watt and the rest of the system consumes 30 Watt during rendering, then a total power consumption of 70 Watt for pure software rendering isn't all that bad. Some would incorrectly compare the 20 Watt against 70 Watt, while it's really 50 Watt versus 70 Watt. And when you look at the potential for doing more with less the balance can totally tip in favor of generic software.

While that would be nice, that road has a lot of stumbling blocks. Games are made for consoles these days, with a few PC specific features used. With no console with a flexi arch around, who'll invest that much for one chip out of three?

The Xbox 360 has three CPU cores, the PlayStation 3 has Cell. Over the course of their existence game developers have started to use all this power, and the same multi-threaded engines were deployed on the PC as well. So even if the next generation of consoles don't have a fully homogeneous architecture, it's still quite likely that they'll sport more cores and continue to advance multi-threaded game development in the PC market as well.

You expect Office to speed up with scatter/gather. Or Windows?

Absolutely. Any non-trivial codebase has loops which can be auto-vectorized a lot more efficively with gather/scatter.

Nick · May 16, 2011

entity279 said:
The vector part of our x86 CPUs has always seemed underutilized to me, by consumer apps at least. It's not always trivial to code for either, and you need to give the compiler hints.

Yes, SIMD is underutilized, and the number one reason is that compilers have a really hard time parallelizing code. And that's because ever scalar operation has a vector equivalent, except for load/store! Support for gather/scatter would fix that.

nAo · May 16, 2011

Nick said:
Yes, SIMD is underutilized, and the number one reason is that compilers have a really hard time parallelizing code. And that's because ever scalar operation has a vector equivalent, except for load/store! Support for gather/scatter would fix that.

If an hypothetical compiler is able to generate gather/scatter instructions for a given code sequence then it would also be able to replace those instructions (if not supported) with loads and stores, it's not really rocket science. Performance might be less optimal but it's certainly not the lack of gather/scatter instructions in some ISAs making the life of certain parallelizing compilers hard.

Nick · May 16, 2011

nAo said:
If an hypothetical compiler is able to generate gather/scatter instructions for a given code sequence then it would also be able to replace those instructions (if not supported) with loads and stores, it's not really rocket science. Performance might be less optimal but it's certainly not the lack of gather/scatter instructions in some ISAs making the life of certain parallelizing compilers hard.

Less optimal is a huge understatement. Emulating a 256-bit gather operation takes 18 instructions! Even a nearly braindead hardware implementation of it could have reduced it to two parallel sets of 4 serial load operations without occupying any ALU pipelines. That would have been "less optimal". Today's situation is just horrible.

I didn't say gather/scatter support would make the compiler's life less hard per se, but it would make it a whole lot more effective. Currently a lot of effort into auto-vectorization simply goes to waste because the lack of gather/scatter negates the results.

Also note that it's not getting any better. FMA support will make Intel's architecture capable of 32 floating-point operations per cycle per core. Compared to the 18 instructions it takes to gather 8 values, that's like driving an F1 car with the parking brakes on. AVX and FMA make the serial load/store bottleneck appear four times narrower. So it's clear that something needs to be done if they want this wide SIMD ISA to be utilized more and offer a return on their investment. Fortunately Intel researchers appear to realize this too:

Atomic Vector Operations on Chip Multiprocessors

Note that gather/scatter units with a maximum throughput of 1 vector every cycle are considered perfectly feasible. And with AVX-1024 executed in four cycles on 256-bit execution units they'd get the same SIMD width as NVIDIA, while reducing the out-of-order execution overhead by a factor four. FMA increases performance/Watt as well. It's all within reach.

So the question isn't whether or not this will one day be added to CPU architectures. The question is what will GPU manufacturers do to compete with it? AMD is in a nice position because it can add these features to its CPU line too while at the same time offering GPUs that continue to target hardcore gamers. NVIDIA appears to be forced to sacrifice some graphics performance to increase GPGPU efficiency. Project Denver has the potential to conquer some desktop/laptop CPU market space, but they have a lot of catching up to do to design something like this, compensate for the process disadvantage, and get developers to program for it. The ARM architecture and NVIDIA's experience with throughput computing could result in a killer platform though.

rpg.314 · May 16, 2011

Nick said:
Less optimal is a huge understatement. Emulating a 256-bit gather operation takes 18 instructions! Even a nearly braindead hardware implementation of it could have reduced it to two parallel sets of 4 serial load operations without occupying any ALU pipelines. That would have been "less optimal". Today's situation is just horrible.

I didn't say gather/scatter support would make the compiler's life less hard per se, but it would make it a whole lot more effective. Currently a lot of effort into auto-vectorization simply goes to waste because the lack of gather/scatter negates the results.

Also note that it's not getting any better. FMA support will make Intel's architecture capable of 32 floating-point operations per cycle per core. Compared to the 18 instructions it takes to gather 8 values, that's like driving an F1 car with the parking brakes on. AVX and FMA make the serial load/store bottleneck appear four times narrower. So it's clear that something needs to be done if they want this wide SIMD ISA to be utilized more and offer a return on their investment. Fortunately Intel researchers appear to realize this too:

Atomic Vector Operations on Chip Multiprocessors

Note that gather/scatter units with a maximum throughput of 1 vector every cycle are considered perfectly feasible. And with AVX-1024 executed in four cycles on 256-bit execution units they'd get the same SIMD width as NVIDIA, while reducing the out-of-order execution overhead by a factor four. FMA increases performance/Watt as well. It's all within reach.

So the question isn't whether or not this will one day be added to CPU architectures. The question is what will GPU manufacturers do to compete with it? AMD is in a nice position because it can add these features to its CPU line too while at the same time offering GPUs that continue to target hardcore gamers. NVIDIA appears to be forced to sacrifice some graphics performance to increase GPGPU efficiency. Project Denver has the potential to conquer some desktop/laptop CPU market space, but they have a lot of catching up to do to design something like this, compensate for the process disadvantage, and get developers to program for it. The ARM architecture and NVIDIA's experience with throughput computing could result in a killer platform though.

That's a 3 year old paper using a sw simulator. Doesn't mean much for/against a real product.

3dilettante · May 16, 2011

The SIMD model simulated in the paper adds a second memory pipeline right next to the LSU, and adds an additional data path to check the contents of the load/store queues.

The core is assumed to be in-order, and the memory hierarchy expands the check process somewhat. It does assume a very heavily banked L2, and a directory-based coherence protocol with a smaller number of states than what is customary.
Only one can be issued at a time per-thread, and the operations are blocking.

I think there are significant barriers to implementing the scheme as described on a high-speed OoO design with a different memory pipeline coupled with heavy memory speculation.

Gipsel · May 16, 2011

Nick said:
Less optimal is a huge understatement. Emulating a 256-bit gather operation takes 18 instructions! Even a nearly braindead hardware implementation of it could have reduced it to two parallel sets of 4 serial load operations without occupying any ALU pipelines.

One simple question:
And how much would it improve the performance if the memory accesses cannot be coalesced? If they can, it should perform very well with the already existing loads, isn't it? And if it can't be coalesced, I would look up how much performance this costs for GPUs for instance (occupying the ALUs isn't a problem in such cases). SRAM or DRAM arrays don't get more ports and a higher bandwidth just because an ISA supports gather/scatter

.

3dilettante · May 16, 2011

There is an instruction cache benefit if a bunch of loads can be represented by a single gather, even if internally the chip just ran a little microcoded loop and spat out scalar loads in sequence. Perhaps the scatter/gather could run through a similar process as some of the string operations that run in microcode.

Being able to coalesce would save on the number of cache accesses and shave off cycles. How aggressive the implementation is in pursuing coalescing opportunities and how well it can broadcast and permute from cache line to vector lane would determine how complex the memory pipeline would be.
We may need to agree on what kind of implementation we are speculating on before guessing at numbers.

Gipsel · May 16, 2011

My point was that gather/scatter alone isn't going to win the game.

Of course one needs less space in the instruction cache for instance. But that gives you a performance benefit probably just in the single percentage range. And one occupies a lot less execution resources, true. But again, this won't be a game changer, even if it would buy 30% or even 50% performance on average (which it probably won't do). What would help is a cache/memory structure which can handle a lot of simultaneous requests to different addresses, i.e. a cache with let's say 8 read ports or something like that. So where is the missing performance factor of 4 to 8 to low end GPUs/IGPs supposed to come from? Not from gather/scatter additions ito the ISA in my opinion. It makes things simpler and also a bit faster, but not that much.

By the way, texture units are great things, especially the accompanying specialized caches. Nvidia didn't remove the separate L1 texture caches and did not integrate them into the general purpose L1/local memory. And even Larrabee had texture units/cache. If I would have to guess the reason ...

22 nm Larrabee

liolio

Aquoiboniste

rpg.314

rpg.314

rpg.314

compres

3dilettante

compres

entity279

CarstenS

Moderator

rpg.314

Nick

Nick

Nick

nAo

Nutella Nutellae

Nick

rpg.314

3dilettante

Gipsel

3dilettante

Gipsel

Similar threads