In the stock market there is a saying that past performance is not an indicator of future success, and a trend is a trend until it isn't. That's what I think about this. The current trend may continue or it may not. The fact that it's there catches my attention, but it doesn't prove anything.

It's not just pixel processing that got more programmable. Vertex processing, too, first evolved from fixed-function to programmable, and then borrowed features from pixel processing, such as texture lookups, before unification made sense. My point is that convergence happened from both ends. I find it surprising that you think this has little bearing on the value of CPU-GPU unification, despite the very similar double convergence. CPUs became multi-core, got SIMD units that kept widening, gained instructions such as FMA and gather, added SMT, etc. GPUs became programmable, got generic floating-point and integer instructions, data access was generalized and caches were added, arithmetic latencies dropped from hundreds of clock cycles to just a few, etc.
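To make that convergence concrete, here's a minimal sketch of how GPU-like those CPU instructions already are: a texture-lookup-style gather followed by a fused multiply-add, written with AVX2/FMA intrinsics. The intrinsics are real; the function and variable names (gather_fma, table, indices) are made up for illustration, and it assumes a compiler with AVX2 and FMA support (e.g. -mavx2 -mfma):

```c
#include <immintrin.h>

/* Hypothetical sketch: a GPU-style "gather, then multiply-accumulate"
   on the CPU. Processes 8 floats per call. */
void gather_fma(const float *table, const int *indices,
                float scale, float bias, float *out)
{
    __m256i idx = _mm256_loadu_si256((const __m256i *)indices); /* 8 indices */
    __m256  v   = _mm256_i32gather_ps(table, idx, 4);  /* texture-like gather */
    __m256  s   = _mm256_set1_ps(scale);
    __m256  b   = _mm256_set1_ps(bias);
    _mm256_storeu_ps(out, _mm256_fmadd_ps(v, s, b));   /* fused multiply-add */
}
```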
We can continue to make this more flexible without having full unification. Look at HSA vs. AVX. Anyway, I'm not really interested in looking that far into the future, as so much can happen manufacturing-wise that changes the equation. That's one reason I usually don't participate in this discussion. If I can see 5 years out, that's plenty of time to react. Until then I want to be aware of possibilities, but not place any bets.

But perhaps you consider these to be meaningless facts, which may show similar convergence up to now but have no correlation between them that would result in a similar outcome? What ties them together is that the underlying forces which caused each of these things are really the same. There are hardware and software aspects. Hardware gets cheaper as you share resources: as you indicated yourself, you no longer have to over-provision different structures to prevent them from becoming a bottleneck. This applies to CPU-GPU unification as well. It also reduces the cost of design and validation, which should not be underestimated. Software was and is a huge motivator for unification too. Having the same capabilities in each shading stage, and not having to worry about balancing, made programmers' lives easier and allowed them to develop new algorithms and applications. Likewise, CPU-GPU unification will simplify things and create boundless new possibilities.
I don't think I suffer from retrospective bias, as the first time someone suggested a unified shader to me (probably late 2002) I thought it was a great idea. Up to that point I wasn't thinking about these issues.

I think you're suffering from retrospective bias. You perceive shader unification as obvious because you already know it happened. To eliminate that bias you have to put yourself back in the timeframe well before it happened and observe that it was far from obvious. Or, if you do observe that there was convergence taking place and fundamental forces driving it toward unification, you should compare that to today's situation...
Did the hardware and software engineers at any point say "vertex and pixel processing have become programmable enough now; we'll stop right here and not unify them"? No! They did unify, and what's more, GPUs as a whole continued to become more programmable. So I see no reason to assume that the desire for more programmability on the GPU and more computing power on the CPU will die out. AVX-512, which is derived from the Xeon Phi ISA, will widen the CPU's SIMD units yet again and add more GPU features such as predication. Meanwhile, GPUs will add virtual memory support that matches the CPU's. That's a significant step in programmability, but it will still leave developers wanting more.
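Predication here means per-lane masking, which is the same mechanism GPUs use to handle divergent branches: every lane follows the same instruction stream, but only the selected lanes commit results. A minimal sketch with AVX-512 intrinsics (the intrinsics are real; masked_recip and its setup are hypothetical, and it assumes an AVX-512 capable compiler and CPU):

```c
#include <immintrin.h>

/* Hypothetical sketch: lanes where x[i] > 0 get 1/x[i], the other
   lanes keep y[i], with no branch -- GPU-style predication on the CPU. */
void masked_recip(const float *x, float *y)
{
    __m512    vx = _mm512_loadu_ps(x);
    __m512    vy = _mm512_loadu_ps(y);
    __mmask16 m  = _mm512_cmp_ps_mask(vx, _mm512_setzero_ps(), _CMP_GT_OQ);
    /* Only the lanes set in m compute the division; the rest pass vy through. */
    __m512    r  = _mm512_mask_div_ps(vy, m, _mm512_set1_ps(1.0f), vx);
    _mm512_storeu_ps(y, r);
}
```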
The underlying reason for that is consumer demand. Hardcore gamers want new experiences. So once performance is fine, transistors and power should be spent on enabling developers to create such new experiences. This has been a driving force for shader unification, and continues to be a driving force for CPU-GPU convergence.
That timeline doesn't start with 'hardware' vertex processing. For a while the "obvious" thing to do was to use the CPU for vertex processing and the GPU for pixel processing. So once again I think you're suffering from some retrospective bias. Back then, the idea of not only moving vertex processing to the GPU, but also making it fully programmable, making pixel processing use floating-point, and unifying it all, was ridiculous due to the extreme cost. In contrast, an 8-core CPU with AVX-512 at 14 nm would be small and cheap, and with 2 TFLOPS of computing power its graphics capabilities will be nothing to sneeze at. And that's not even the point in time when I'm expecting CPU-GPU unification to happen. There's much to be gained from adding some specialized instructions to overcome the remaining cost.
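For what it's worth, that 2 TFLOPS figure follows from straightforward peak math, if we assume two 512-bit FMA issues per core per clock and a clock around 4 GHz:

8 cores × 2 FMA/clock × 16 single-precision lanes × 2 FLOPs × 4 GHz ≈ 2 TFLOPS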
The days of CPU vertex processing were determined by economics, not because people thought GPU vertex processing was a bad idea. In the future the silicon cost might equalize, but power remains a question. So there are more variables this time.
What is your point in time for this to happen?