Software/CPU-based 3D Rendering

That's why it would make sense to fuse the vector units: no special dev support needed, no disruptive change. Haswell has about half a TFLOP/s on the CPU side and even more than that on the GPU side, but depending on what you use, half the die sits idle. It doesn't burn much power (probably), yet it's such a waste. And that's only going to increase in the future: Broadwell will probably have a twice as powerful GPU (rumors :) ), and Skylake will support 512-bit SIMD, i.e. twice the vector width. It wouldn't be smart to have an idle TFLOP/s on either side. The other parts (x86-specific logic and the fixed-function units on the GPU) won't increase dramatically in size.
Without developer adoption you miss all the benefits from increased flexibility.
I don't think the vector ALUs by themselves take up much space, so I doubt sharing them between otherwise separate CPU and GPU makes sense given the logistics overhead. "Wasting" die area often is an acceptable trade-off.
 
Nick-
I think your assessment of Haswell's CPU vs iGPU power/performance is out of whack. First, let's look at an example:

2C + GT2 desktop: the CPU cores at 4 GHz and ~30 W give ~250 GFlops; the GT2 on the same chip at 1.2 GHz and ~15 W gives ~400 GFlops.

It looks like the GT2 die area is roughly equivalent to 3 CPU cores (based on these die shots). This means that at desktop power-envelope frequencies, we are looking at ~125 GFlops per HSW CPU core and roughly the same perf/mm² for the GPU. But the GPU has 3x the GFlops/watt: 8 GF/W for the CPU versus 26 GF/W for the GPU. If you actually did replace the GPU with more CPU cores running at 4 GHz at iso area, then to hit the same total package flops, package power would go up from ~45 W to ~75 W.
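For reference, a quick sanity check of where those peak figures come from, assuming Haswell's two 256-bit FMA ports per core and GT2's 20 EUs at 16 SP flops per clock:

```latex
\begin{aligned}
\text{HSW core: } & 2\ \text{FMA ports} \times 8\ \text{SP lanes} \times 2\ \tfrac{\text{flops}}{\text{FMA}} \times 4\,\text{GHz} = 128\ \text{GFlops}
  \;\Rightarrow\; 2\text{C} \approx 250\ \text{GFlops},\ \approx 8\ \text{GF/W at } 30\,\text{W} \\
\text{GT2: } & 20\ \text{EUs} \times 16\ \tfrac{\text{flops}}{\text{clock}} \times 1.2\,\text{GHz} = 384\ \text{GFlops},\ \approx 26\ \text{GF/W at } 15\,\text{W}
\end{aligned}
```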

Some games are CPU-heavy with physics, AI, gameplay logic, etc. Those computations still have to happen, so you can't dedicate all the compute on the package to graphics in those cases. But let's ignore that for now and assume a best-case scenario where the game logic is light (e.g. a unified SW rendering device can dedicate nearly all the flops to graphics).

Now let's look at a mobile part.

In a HSW 2C + GT2 @ 15 W Ultrabook, my impression is that, while doing graphics-intensive gaming, the CPU typically runs at ~3 W (heavily downclocked, idle much of the time), the uncore is another 1-2 W, and the GPU is ~11 W, running at ~1 GHz. So the GPU is hitting 350 GFlops @ 30 GF/W.

Now let's convert that to a unified device with 5C (2 original + 3 from replacing the GT). What frequency do we have to run them at to hit 350 GFlops? Based on the 125 GF/core @ 4 GHz, this works out to about 2.2 GHz (a 0.56 frequency ratio). Let's assume we are on the cubic part of the power curve over the entire 2.2 to 4 GHz range (best-case power scaling). Per-core power would then be ~17% of the 4 GHz figure, i.e. ~2.6 W at 26 GF/W. Multiply by 5 cores, add the uncore, and we're right back at ~15 W. So in this best possible case, the CPU just barely hits the same flops performance.
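Spelling that out under the same best-case assumptions (perfect cubic power/frequency scaling, ~15 W per core at 4 GHz):

```latex
\begin{aligned}
f &\approx \frac{350\ \text{GFlops}}{5 \times 125\ \text{GFlops}} \times 4\,\text{GHz} \approx 2.2\,\text{GHz} \quad (0.56\times) \\
P_{\text{core}} &\approx 0.56^3 \times 15\,\text{W} \approx 2.6\,\text{W},
  \qquad 5 \times 2.6\,\text{W} \approx 13\,\text{W}\ (+\ 1\text{--}2\,\text{W uncore} \approx 15\,\text{W})
\end{aligned}
```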

Alternatively, you could throw more die area at the problem and go "wider and slower", like the 2C + GT3 SKU in the MacBook Air. Remember that 1.2 GHz is well into the cubic portion of the GPU power/frequency curve. Backing off from the ~1 GHz mobile operating point to, say, 800 MHz yields 80% of the graphics perf at half the power per "GT unit" (0.8^3 = 0.512). Thus we can double the number of GT units at iso power while still hitting 1.6x the performance, for a GPU total of ~500 GFlops @ 11 W = 46 GF/W.

Total die area is now equivalent to an 8C CPU (2 original + 3 CPU cores per "GT slice" × 2 slices). To get 500 GFlops out of 8 CPU cores, they would have to be running at 2 GHz. At what point does the CPU power/frequency curve start being linear instead of cubic? Does going slower on the CPU continue to yield a proportional perf/watt benefit?

Remember that I am giving the best possible scenario. If the SW renderer is less efficient in ANY way (as the 100 W CPU vs 50 W GPU comparison suggests), the SW renderer will lose on perf/area and perf/watt hands down. I have seen forum threads indicating that AVX2 workloads actually have worse perf/watt than suggested here; some indications are that the power control unit adds an extra 0.1 V when AVX2 is active.

I won't even take this conversation down to lower power envelopes (e.g. tablets and phones), which is where "mainstream" usage is going. In those environments every milliwatt counts. Desktop composition effects like "Aero" are being moved from the 3D engine to fixed-function hardware in the display path, to save power by avoiding lighting up the 3D engine and burning memory bandwidth on rendering passes. Media codec blocks offer much lower power than programmable shaders and are required to hit 6-11 hours of video playback time.
 
I don't really see it. It certainly took some time for the GPU market to mature to a point where creating low-end models even made sense. But you can't compare the entire market today to the range of performance bins of a single chip which was never meant to be a low-end offering. The span has been huge for over a decade.
I see your point. For a while there was no need to get a 3D card unless you were a hardcore gamer, so 2D cards were the actual low-end offering at that time. Still, I find it interesting that what we consider a commercially viable dedicated 3D graphics processing solution evolved from one chip design with a small differentiation in clock speed, into a whole range of designs with widely different performance and cost. Even the 2D cards can do 3D now, so to speak. The next logical step is that enhanced CPUs can do 3D adequately at a reduced cost, and far more. So again my only point was that we should not be looking at the current high-end to evaluate the viability of CPU-GPU unification.
No doubt integrated GPUs are popular these days. Someone might try unification, but it would be a disruptive change.
It's already being tried. It seems nowhere near what Intel could achieve with AVX-512 though. Anyway, it's not disruptive at all. Note that today's DirectX 11 hardware still supports DirectX 7, which had no concept of programmable shaders, let alone the unified ones the hardware is made up of. Legacy APIs are implemented using drivers and/or user mode libraries so the architectural changes are abstracted away. Likewise unifying the CPU and GPU wouldn't cause any disruption of legacy functionality. It's only when you actually want to make use of the new capabilities that you'll have to adopt new techniques.
There are significant hurdles to overcome. Some technical ones like power draw, but also getting developer support...
There are absolutely big hurdles to overcome. Getting developer support isn't one of them though. There will be no disruption of legacy functionality, and there will be no shortage of interest in unified computing. LLVM is already adding AVX-512 support as we speak! Developers are eager to enter a new era of computing where the hardware and its drivers no longer determine what's supported and what's not. It also solves a slew of QA and support issues.

Some seem to think that heterogeneous computing will already lead us into that era, but the reality is that heterogeneous computing is not compiler-friendly. Developers still have to deal with a large number of complex issues, and then have to do it all over again for different heterogeneous setups (discrete vs. integrated, different brands, different generations, different performance categories, different drivers). HSA attempts to standardize heterogeneous computing, but it just joins the ranks of other frameworks that have not and will not gain ubiquity. To illustrate the importance of being compiler-friendly: multi-threading is far simpler than heterogeneous computing, and yet its adoption is not without challenges. TSX will help with that, but heterogeneous computing isn't going to get easier unless it becomes more homogeneous. Unified memory is a step in that direction but won't be the last one. As computing power increases and the pitfalls grow, things inevitably have to become more manageable for developers.

Unifying things takes micro-management tasks out of the hands of developers and deals with them in hardware instead of software. Shader unification did that, memory unification does that, and computing unification will too.
...and the impact on market segmentation. After all, a chip replacing low-end CPU+GPU combinations would have to be much more powerful as a CPU alone to fulfil the same roles. But that puts it in direct competition with midrange offerings for customers willing to pay more, potentially lowering margins substantially.
That's an interesting point, and I have to admit that marketing can do crazy things to technological advances that leave me somewhat puzzled (both slowing them down and speeding them up). It seems too early for octa-core and 64-bit technology in mobile phones, while fusing off TSX in the i7-4770K is hampering progress. Anyway, I guess much of it comes down to competition. Intel has been selling quad-cores to the enthusiast market for too long because current AMD octa-cores are a disappointment. Steamroller may change that, and Intel would counter by making 6-core and 8-core parts more affordable. They'll just create even more powerful chips with high margins to take the place of the current high-margin ones.
 
A single image buffer/texture thrashes the slow L3 cache completely! Your 8 MBytes are nothing - NOTHING - in modern graphics.
No need to yell. 8 MB is definitely something, and with 25.6 GB/s of RAM bandwidth it can be refilled about 50 times per frame at 60 Hz. So the fact that writing a "single" image buffer/texture, with no other accesses happening in between, would thrash the L3 cache "completely" isn't all that terrible.

In practice it's far more effective than that worst case, thanks to the 16-way associativity and (P)LRU replacement policy: data that is accessed a few times per frame doesn't get evicted. This is probably the case for FQuake, for instance. It achieves up to 160 FPS at 2560x1600, which is a 16 MB color buffer, so the L3 can't hold it, yet this represents only ~10% of RAM bandwidth usage. The textures are tiny and repetitive, so it's entirely possible for them to stay in the 8 MB L3 even though 16 MB is written to the color buffer every frame.
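For the record, the arithmetic behind the "50 refills" and the ~10% figure:

```latex
\begin{aligned}
25.6\ \text{GB/s} \div 60\ \text{Hz} &\approx 427\ \text{MB per frame} \approx 53 \times 8\ \text{MB} \\
2560 \times 1600 \times 4\ \text{B} &\approx 16\ \text{MB}, \qquad 16\ \text{MB} \times 160\ \text{FPS} \approx 2.6\ \text{GB/s} \approx 10\%\ \text{of } 25.6\ \text{GB/s}
\end{aligned}
```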

Also note that automatic prefetching ensures that, after the cache has been thrashed, it takes only a very short while for the hit ratio to become very high again for a coherent access pattern (which is more often than not the case for graphics). In fact I found it impossible to beat the automatic prefetching with software prefetching. So you can get good use out of those 50 refills while maintaining a high average hit ratio, preventing the cores from stalling a lot.
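To make the comparison concrete, here's a minimal sketch (my own, not code from the thread) of what software prefetching looks like for a coherent, linear scan such as summing a color buffer. On Haswell-class cores the hardware stride prefetchers typically detect this pattern on their own, which is why the explicit hint rarely wins:

```cpp
// Sketch of software prefetching over a linear scan (assumed example, not from the thread).
#include <immintrin.h>
#include <cstddef>
#include <cstdio>
#include <vector>

static float sum_buffer(const float* buf, std::size_t count) {
    // Assumes count is a multiple of 8.
    __m256 acc = _mm256_setzero_ps();
    for (std::size_t i = 0; i < count; i += 8) {
        // Hint: pull data ~512 bytes (8 cache lines) ahead into all cache levels.
        _mm_prefetch(reinterpret_cast<const char*>(buf + i + 128), _MM_HINT_T0);
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(buf + i));
    }
    alignas(32) float lanes[8];
    _mm256_store_ps(lanes, acc);
    float sum = 0.0f;
    for (float v : lanes) sum += v;   // horizontal sum of the 8 lanes
    return sum;
}

int main() {
    std::vector<float> color(2560 * 1600, 0.5f);   // one 2560x1600 single-channel plane
    std::printf("%f\n", sum_buffer(color.data(), color.size()));
}
```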

In my experience software rendering is arithmetic-limited rather than bandwidth-limited, and the L3 cache is one of the reasons why. It would take AVX-512 to change that, and by then DDR4 will be standard, while an L4 cache would retain all color buffers, depth buffers and lots of other data, with roughly twice the bandwidth and much lower power consumption than RAM. So I don't see bandwidth as an issue at all for CPU-GPU unification.
We are not talking about a Java lecture class teaching how to draw something on the screen in the easiest and most elegant way! We are talking about how you can do the best graphics with the limited resources given!
The reality is that sacrificing some low-level performance is necessary to gain high-level performance. Nobody writes their shaders in assembly any more, because it's too tedious and hell to maintain. It is humanly impossible to keep track of all the little details and keep everything tuned to perfection while the high-level rendering approaches change and are sometimes completely overhauled. So we use high-level languages to crank up productivity, and we rely on compilers to do the repetitive low-level optimization work. The transition from assembly shaders to high-level shaders happened around the time of shader unification, because that's when you no longer had to worry about which instructions your code would translate into, or how to optimize register usage for the different cores.

We're now approaching the point where heterogeneous computing is riddled with too many low-level details for the developer to manage effectively. The solution is unification to make it compiler-friendly, and while that definitely comes at a cost it will eventually be lower than what is gained by allowing developers to focus on the high-level computing problems which have a greater impact on performance.

This is an important driving force for more convergence. In the near future we're getting unified memory for the exact same reasons, and even though it comes at a hardware cost it's advertised as a performance feature, because it allows developers to worry less about managing memory transfers and lets the hardware handle them more efficiently overall.
Having a unified processing pipeline requires additional explicit commands in order to tell the CPU that we don't want to trash the caches, stream data, etc., and trying to write efficient code without any structured boundaries is logistically a contradiction.

Separating critical code sections for finely tuned processing of sorted data, with as little interaction with outside code as possible, is the most fundamental principle of performance-efficient software on the CPU, GPU, APU, everywhere.
It is a pipe dream to think that you can finely control all these things and gain a large benefit, across hundreds of different heterogeneous setups.

Things like streaming writes and software prefetches rarely result in a significant speedup, but can easily work against you in scenarios you didn't foresee. For example, a small render target texture should not use streaming stores if it's reused soon, but it should if it isn't, and what counts as small and what counts as soon will change between hardware generations and may not even be deterministic due to user input. So it's generally better to let the hardware use conservative heuristics and to focus on your algorithms and universally applicable optimizations. Don't get me wrong, I'm not saying you should be oblivious to the architecture(s) you're targeting, but don't take Knuth's advice lightly: premature optimization is the root of all evil.
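To illustrate that render-target example (a sketch of my own, not code from the thread): the streaming version avoids polluting the caches, which is a win if the target isn't read again soon, and a loss if the next pass samples it while it would still have fit in L3.

```cpp
// The same clear loop with regular stores and with streaming (non-temporal) stores.
// Assumes dst is 32-byte aligned and count is a multiple of 8.
#include <immintrin.h>
#include <cstddef>

void clear_cached(float* dst, std::size_t count, float value) {
    const __m256 v = _mm256_set1_ps(value);
    for (std::size_t i = 0; i < count; i += 8)
        _mm256_store_ps(dst + i, v);    // regular store: the lines stay cache-resident
}

void clear_streaming(float* dst, std::size_t count, float value) {
    const __m256 v = _mm256_set1_ps(value);
    for (std::size_t i = 0; i < count; i += 8)
        _mm256_stream_ps(dst + i, v);   // non-temporal store: bypasses the cache hierarchy
    _mm_sfence();                       // order the streamed writes before later reads/writes
}
```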
Showing examples where a scattered access to a data stream here and there can improve something is not an argument for your radical denial that there are problem classes which will always be processed fundamentally more efficiently by hardware designed to process them.
That argument instantly falls apart when we look at unified shaders. Vertices and pixels are no longer fundamentally processed more efficiently by hardware that is designed to process them.
I won't even start to argue with your view that the failure of Larrabee as a GPU is no argument against your agenda... :rolleyes:
I don't view it that black and white. Larrabee taught us things that work and things that don't, to various degrees and under various circumstances. All I was saying is that Larrabee is definitely not direct proof that unification of the CPU and integrated GPU won't work, because that's not what it is by a long shot. You're going to have to use very specific arguments, and yet not lose track of the bigger picture, to argue against that.
 
Without developer adoption you miss all the benefits from increased flexibility.
That is not entirely true. Unified shaders automatically balanced the vertex and pixel workloads, even for DirectX 9 games that were released before their existence. Also, even for features that do require developer adoption to make use of them, it can happen at various levels. Application developers generally don't have to be aware of new CPU instructions. It's the compiler developers who adopt them, and application developers can benefit from it with the click of a button or by implicitly using an updated library from another developer.
I don't think the vector ALUs by themselves take up much space, so I doubt sharing them between otherwise separate CPU and GPU makes sense given the logistics overhead. "Wasting" die area often is an acceptable trade-off.
The vector ALUs themselves don't take up much space, but the register files, caches, scheduling logic, memory interfaces, etc. to be able to use them do. So you can't look at the ALUs in isolation. That said, doubling the vector ALU width of an existing architecture costs significantly less than doubling up all of the structures that surround it.

I don't think wasting die area is that easily acceptable. Sure, power consumption limits dictate that a lower percentage of transistors can be active, so instead of having a smaller die you can still have more transistors, which you might invest into a specialized unit "for free" since you're paying for the same die size anyway. That's misleading though, because those extra transistors from a process shrink used to be worth more than that. On the other hand, the technology needed to keep shrinking is driving up wafer cost. Viable future lithography options such as multi-patterning, EUV and e-beam all suffer from lower throughput. 450 mm wafers can help offset that, but they're a major investment.

So a small die size is favorable. This should make something that can fulfill multiple roles without duplicate logic quite attractive. Some specialized instructions can maximize performance/Watt while not costing a lot of area.
 
Everyone discussing in this thread should read this paper (if they haven't already). This is exactly what I have been trying to say in my earlier posts. More threads (better occupancy) isn't always a performance boost; often it's the other way around.
I have been lobbying to get a HLSL extension for GPU scalar unit access (warp_invariant keyword for variables) as both Nvidia and AMD GPUs have fully programmable scalar units these days. Automated compiler support for scalarization would be even better, but I would prefer to have an option to have full programmer control (until the automatic compiler scalarization is perfect).

If we had automated scalarization support in the DirectX shader compiler, software (HLSL compute shaders) would start to use more scalar instructions. That would simplify many algorithms that need to mix coarse (warp-wide) processing with fine (pixel/thread-wide) processing. Currently you need to either multipass (separate coarse from fine processing at the kernel level) or split the kernel into two steps with a barrier and use groupshared memory. But the barrier split is tricky to get right, since you'd want radically different thread counts for coarse- and fine-grained processing (and you can't change the thread group size between barriers). The scalar unit solves most of these problems elegantly, and it allows you to keep data in registers instead of writing/reading it to/from groupshared memory between different stages of the kernel. The Nvidia occupancy paper (mentioned earlier in this post) shows major performance gains when data is kept in registers instead of in groupshared memory (and it's good for the programmer as well, as you don't need to fight bank conflicts). If GPU scalar units get official support (from DirectX, for example), I expect much heavier use of them, and thus beefier scalar units in future GPU hardware (an increased scalar:vector throughput ratio).
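As a rough CPU-side analogy of that coarse/fine split (my own sketch using AVX intrinsics, not HLSL and not the proposed warp_invariant extension): the per-tile value plays the role of warp-invariant data held in a scalar register, while the per-pixel work runs across the vector lanes, with no round-trip through shared memory or barriers.

```cpp
// Hypothetical example: a "coarse" bias chosen once per 8-pixel tile and kept
// in a register, then applied during the "fine" per-pixel SIMD work.
// Assumes width is a multiple of 8.
#include <immintrin.h>
#include <cstddef>

void shade_row(const float* depth, float* out, std::size_t width,
               float near_bias, float far_bias) {
    for (std::size_t tile = 0; tile < width; tile += 8) {
        // Coarse, tile-wide decision: one scalar value for the whole tile.
        const float tile_bias = (depth[tile] > 0.5f) ? near_bias : far_bias;
        const __m256 bias = _mm256_set1_ps(tile_bias);   // broadcast; never leaves registers

        // Fine, per-pixel work across the 8 SIMD lanes.
        const __m256 d = _mm256_loadu_ps(depth + tile);
        _mm256_storeu_ps(out + tile, _mm256_add_ps(d, bias));
    }
}
```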
 
That's why it would make sense to fuse the vector units: no special dev support needed, no disruptive change. Haswell has about half a TFLOP/s on the CPU side and even more than that on the GPU side, but depending on what you use, half the die sits idle. It doesn't burn much power (probably), yet it's such a waste. And that's only going to increase in the future: Broadwell will probably have a twice as powerful GPU (rumors :) ), and Skylake will support 512-bit SIMD, i.e. twice the vector width. It wouldn't be smart to have an idle TFLOP/s on either side. The other parts (x86-specific logic and the fixed-function units on the GPU) won't increase dramatically in size.
The problem with that approach is simple: unification of the vector ALUs alone would increase the distance that data must move inside the chip. Vector ALUs must be close to the register files (for both latency and energy efficiency reasons), and the register files must be (quite) close to the renaming logic, which again must be close to the ROBs and instruction decoding.

Bulldozer is already sharing vector ALUs between two (closely tied together) CPU cores. Even that increases instruction latencies quite a bit. A big vector unit shared by all the CPU cores and the GPU would be much further away; I would expect latency similar to what we have with the shared L3 caches (30+ extra cycles). But the biggest problem would be the much increased energy consumption (needed for the data movement inside the chip).
 
Anyway, it's not disruptive at all. Note that today's DirectX 11 hardware still supports DirectX 7, which had no concept of programmable shaders, let alone the unified ones the hardware is made up of. Legacy APIs are implemented using drivers and/or user mode libraries so the architectural changes are abstracted away. Likewise unifying the CPU and GPU wouldn't cause any disruption of legacy functionality. It's only when you actually want to make use of the new capabilities that you'll have to adopt new techniques.

There are absolutely big hurdles to overcome. Getting developer support isn't one of them though. There will be no disruption of legacy functionality, and there will be no shortage of interest in unified computing. LLVM is already adding AVX-512 support as we speak! Developers are eager to enter a new era of computing where the hardware and its drivers no longer determine what's supported and what's not. It also solves a slew of QA and support issues.
I did not mean lack of "legacy" API support, but disruption in the economic sense: a different value proposition.

For the use case of running most existing D3D/OGL games, I highly doubt a "UPU" of the same die size would be better (in terms of performance and performance/W). But it would be substantially faster as a CPU alone, and it would be more flexible than a GPU, allowing the use of new graphics techniques.

That, however, requires developer adoption. And since such UPUs would not be ubiquitous for quite a while, and since they won't replace high-end GPUs (certainly not initially) dropping support for existing APIs isn't an option for most developers. Granted, most game developers just license an engine, but that and having to target both UPUs and GPUs may limit the extent to which they take advantage of the flexibility of the former.
 
I did not mean lack of "legacy" API support, but disruption in the economic sense: a different value proposition.

For the use case of running most existing D3D/OGL games, I highly doubt a "UPU" of the same die size would be better (in terms of performance and performance/W). But it would be substantially faster as a CPU alone, and it would be more flexible than a GPU, allowing the use of new graphics techniques.

That, however, requires developer adoption. And since such UPUs would not be ubiquitous for quite a while, and since they won't replace high-end GPUs (certainly not initially) dropping support for existing APIs isn't an option for most developers. Granted, most game developers just license an engine, but that and having to target both UPUs and GPUs may limit the extent to which they take advantage of the flexibility of the former.
Agreed. This is why several more years of convergence are required, also from the GPU side. The gap has to be small enough to make the leap. But keep in mind that replacing the integrated GPU with more CPU cores will provide a final push. Also, I'd expect this to happen in the desktop market before the portable/mobile markets, since an increase in power consumption isn't a big deal there. Furthermore, Intel is investing heavily in graphics research, and they're using the CPU to assist the integrated GPU, which I expect to really take off with AVX-512F. So that can ease the transition. And as early as the next generation, GPUs might support render target read/modify/write, which drops the ROPs; it comes at a small cost but enables a ton of new techniques that will surely find good developer adoption (the power cost is definitely small, since Tegra already supports it). There's also demand for making the graphics pipeline more flexible, which as discussed before requires high scalar performance to control things; that adds cost, but once again increases performance after gradual adoption.

So the disruption will be smaller than what you might expect it to be today. It just has to happen in bite-size steps. Note that shader unification also didn't happen in one disruptive swoop (but still pretty darn fast)!
 
Without developer adoption you miss all the benefits from increased flexibility.
Indeed, it would maintain the status quo. You don't get any benefits, yet you don't rely on getting them to amortise the cost of removing fixed-function units.

I don't think the vector ALUs by themselves take up much space, so I doubt sharing them between otherwise separate CPU and GPU makes sense given the logistics overhead. "Wasting" die area often is an acceptable trade-off.
The point is to make them less and less separate. It's not just the pure ALUs, it's everything involved: registers, caches, load/store units.
In the end, 'shaders' would be nothing more than micro-ops, invoked by the fixed-function rasterizer, just like the other micro-ops coming from the x86 decoder (hyper-threading not just from different sources, but from different kinds of sources).
 