Software/CPU-based 3D Rendering

Blazkowicz · Oct 29, 2012

milk said:
I think it is reasonable to think that if today's cpu architacture was more efficient then gpu's, amd, ati, game consoles, smartphones and the like would be using them intead.

Or we would at least have an occasional voxel-based PC game, or game liberally using arbitrary software rendering code.

Davros · Oct 29, 2012

I just tried Ati SmartShader on the off chance it still worked, It doesnt

milk · Oct 30, 2012

Blazkowicz said:
Or we would at least have an occasional voxel-based PC game, or game liberally using arbitrary software rendering code.

Well, to counter-argue ourselves, there is this...

Davros · Oct 30, 2012

And this

Dag · Oct 30, 2012

As of April 23, Battlefield 4. (That may, hopefully, have changed).

3dilettante · Oct 31, 2012

3D_world said:
The future of 3D rendering:

Fast - 4 ghz
Complex - out-of-order instruction dispatch, massive caches
Not massively parallel - 12-24 cores

Whatever future date this is, trends within the forseeable future do not point to this being feasible for consumer-level real-time graphics.
If Intel's near-threshold computing works out, it pushes for a stronger bifurcation between the throughput-oriented logic and the high-speed logic.
Fast cores will not tolerate the changes made to allow for extremely low-voltage operation, while throughput-oriented cores and logic will favor parallelism, since they may not clock very high individually but will require very, very little power.

As aggravating as the GPU on chips like Sandy Bridge and the like may be to some, it simply provides more utility for the user than a few more high-speed cores they won't use, and it does so within a much more confined power and die budget.
With mobile and cloud computing overtaking the thermally-maxed client PC market as a driving force for development, this doesn't look to change in the near term.

There is some possibility for a Larrabee-type solution of small throughput cores being put on the same die as powerful OoO heavies, which has been on Intel's to-do list for probably going on a decade now. It seems like it almost made it at some point before delays or the competitiveness of the IGP won out. Even with this, the likely demands on the silicon and how Intel has accepted the use of specialized logic when it is needed point to the idea of a dozen or so OoO core monster chip as the future of rendering not being likely for the next 2-3 silicon nodes. We are running out of those, by the way.

Rodéric · Oct 31, 2012

Davros said:
And this

Best game ever.

milk · Oct 31, 2012

And there we go. Outcast. It took just two pages.

Osamar · Oct 31, 2012

Rodéric said:
Best game ever.

Sorry for the semi-off topic.

It could be possible to "patch" Outcast with an improved voxel renderer for higher resolution using modern CPU or even a GPU voxel renderer?

Just, mind flying.

Voxilla · Oct 31, 2012

Osamar said:
It could be possible to "patch" Outcast with an improved voxel renderer for higher resolution using modern CPU or even a GPU voxel renderer?

As it just renders a height field no true voxel rendering is needed. Long time ago I made a prototype to render the height map with polygons (using CPU based depth adaptive tessellation). It ran pretty fast, on those now ancient GPUs (besides much better image quality)

Davros · Oct 31, 2012

Osamar said:
Sorry for the semi-off topic.

It could be possible to "patch" Outcast with an improved voxel renderer for higher resolution using modern CPU or even a GPU voxel renderer?

Just, mind flying.

Last time I ran outcast there was a 1024x768 patch

Nick · Nov 1, 2012

3dilettante said:
Whatever future date this is, trends within the forseeable future do not point to this being feasible for consumer-level real-time graphics.

Some trends do, some trends don't. Let's try to avoid cherry picking...

If Intel's near-threshold computing works out, it pushes for a stronger bifurcation between the throughput-oriented logic and the high-speed logic.

That would mean GeForces and Radeons will use such design in the foreseeable future? I don't see that "trend" happening. In fact current GPUs have similar or higher voltages than CPUs.

It's a very interesting technology for lowering the power consumption during standby operation of mobile devices, but as soon as you want some real work to be done the voltage and frequency have to swing up to deliver cost-effective performance. Note that Intel already adopted an 8T SRAM design, precisely to allow lowering the voltage when parked. This kind of technology doesn't apply any more or any less to the GPU or CPU.

As aggravating as the GPU on chips like Sandy Bridge and the like may be to some, it simply provides more utility for the user than a few more high-speed cores they won't use, and it does so within a much more confined power and die budget.

Multi-core adoption has been slow due to the programming challenges involved and due to the chicken-or-egg issue between developers and consumers. These issues are slowly but surely getting resolved. Intel's TSX technology greatly simplifies multi-threaded development and lowers the synchronization overhead, while quad-core is becoming mainstream and thus it becomes interesting for developers to invest into it and subsequently the increase in multi-threaded software will be an incentive for consumers to buy more cores. It's still early days though. We haven't witnessed any paradigm shift in software development, where multi-core design become as natural as say object-oriented design. But it's bound to come. There's still plenty of untapped task parallelism in generic software. And even graphics engines are becoming multi-threaded.

Also note that clock frequency and voltage can be regulated on a core by core basis. When the workload is SIMD-intensive, the operating parameters can be adjusted. And again, executing very wide vector operations on less wide execution units reduces the switching activity elsewhere. So there's no reason why such a CPU wouldn't be able to be as power efficient as a GPU (and I do mean a contemporary GPU, the convergence is still happening from both ends). Die budget isn't an issue either. Between Yonah and Haswell there will be an eightfold increase in floating-point throughput per core. That's not costing an eightfold increase in transistor budget. Extending the SIMD units further will also be relatively cheap.

There is some possibility for a Larrabee-type solution of small throughput cores being put on the same die as powerful OoO heavies, which has been on Intel's to-do list for probably going on a decade now. It seems like it almost made it at some point before delays or the competitiveness of the IGP won out. Even with this, the likely demands on the silicon and how Intel has accepted the use of specialized logic when it is needed point to the idea of a dozen or so OoO core monster chip as the future of rendering not being likely for the next 2-3 silicon nodes. We are running out of those, by the way.

The cores have to be homogeneous: a set of scalar integer units and a set of wide vector units, fed off the same instruction stream. It may pose some hardware challenges, but it's the only thing software developers will adopt. HSA doesn't stand a chance against the programmer-friendliness of AVX2 and its successors.

That said, looking at the very rapid evolution from fixed-function to programmable to unified that happened to the GPU, I don't think we'll run out of silicon nodes before the CPU and GPU will unify. Note that if the iGPU was ditched and replaced with CPU cores, a mainstream Haswell chip could already achieve close to 1 TFLOP. It won't happen in the next few years, but we're certainly closer to unified graphics than you think.

Davros · Nov 1, 2012

Nick said:
It won't happen in the next few years,

Hang on you said earlier avx2 will kill discreet gpu's
and avx2 is coming next year

3dilettante · Nov 1, 2012

Nick said:
Some trends do, some trends don't. Let's try to avoid cherry picking...

A large number of trends are pointing against consumer-level 24-core OoO chips.
The ROI for such chips for the consumer market is uncompelling; the set of applications users have and those that need that many cores is aside from niche cases software that is often a solution searching for a problem; those chips don't work for more power-constrained environments in a market where a big chunk of the new revenue is power-constrained, and an extra generic core on top of many provides no signifcant utility, "blue crystals", or strong enough means to encourage upgrades.

That would mean GeForces and Radeons will use such design in the foreseeable future? I don't see that "trend" happening. In fact current GPUs have similar or higher voltages than CPUs.

Their operating parameters are far closer to what is necessary for near-threshold operation than a CPU clocked at 4 GHz, particularly mobile GPUs.
One of the demonstration circuits Intel had for the technology besides a 10-100 MHz Pentium was a lighting accellerator for a low-voltage graphics solution.

It's a very interesting technology for lowering the power consumption during standby operation of mobile devices, but as soon as you want some real work to be done the voltage and frequency have to swing up to deliver cost-effective performance.

It's not just for standby operation. It is meant to provide performance per watt that is potentially, but only theoretically at this point, an order of magnitude better that what is possible with silicon that cannot sustain regular operation much below 1V. If applied to specialized or fixed-function hardware, it would be hardware that would be lost in the noise of a single active CPU core's ramping up and down.
For serial performance, it appears to be a non-starter because ticking along anywhere near desktop CPU speeds is physically penalized or impossible because of the process and design features of the tech. It's much more interesting for specialized or throughput-oriented uses, hence the emphasis on mobile graphics and HPC.

Multi-core adoption has been slow due to the programming challenges involved and due to the chicken-or-egg issue between developers and consumers.

And a lack of compelling need.
If you need the mass throughput of a multicore, Intel wants you to get a SB-E or somesuch. Failing that, there are cloud servers using Xeons with many cores.
Consumers on average have dropped back to the dual to quad core transition with the proliferation of mobile media consumption devices with rapidly advancing GPUs.
We can do core upon core upon core, with the biggest challenge being that, beyond the low single digits, core count is a Do Not Care.

Also note that clock frequency and voltage can be regulated on a core by core basis. When the workload is SIMD-intensive, the operating parameters can be adjusted. And again, executing very wide vector operations on less wide execution units reduces the switching activity elsewhere. So there's no reason why such a CPU wouldn't be able to be as power efficient as a GPU (and I do mean a contemporary GPU, the convergence is still happening from both ends).

Intel's mobile graphics solutions IGP or Larrabee-ish cores will, if its tech works, never clock anywhere near as high as minimally acceptable for the primary CPU. Their active power while working--not idle--could be 10-100x lower.
This is very interesting for actual workloads people care about.

Nick · Nov 1, 2012

3dilettante said:
A large number of trends are pointing against consumer-level 24-core OoO chips.
The ROI for such chips for the consumer market is uncompelling, the set of applications users have and those that need that many cores is aside from niche cases software that is a solution searching for a problem, those chips don't work for more power-constrained environments in a market where a big chunk of the new revenue is power-constrained, and an extra generic core provides no "blue crystals" or strong enough means to encourage upgrades.

I never said 24-core. Also note that discrete graphics cards are slowly becoming a niche market. People who do require such performance level are also likely in the same market for a 24-core CPU several years from now. So let's be very clear about what segment and time frame we're talking about. It's obvious that the CPU will unify with the iGPU first, before core counts go up. Something like an 8-core successor to Haswell with no iGPU can have plenty of generic computing power for mainstream graphics needs and many other purposes (including new ones).

Their operating parameters are far closer to what is necessary for near-threshold operation than a CPU clockd at 4 GHz, particularly mobile GPUs.

Please don't compare a mobile GPU against a desktop CPU. Many discrete graphics cards are bigger power hogs than the CPU, even at 4 GHz. Mobile Haswell CPUs will consume as low as 10 Watt (and that's CPU+GPU).

So exactly what operating parameters do you believe to be "far" closer to what is necessary for near-threshold operation on a GPU versus a CPU? Peak clock frequency is affected by pipeline length but otherwise it seems to me that a CPU is just as close to being able to operate at near-threshold voltage than a GPU.

One of the demonstration circuits Intel had for the technology besides a 10-100 MHz Pentium was a lighting accellerator for a low-voltage graphics solution.

Actually that Pentium (a 4 stage architecture) was able to run at up to 915 MHz at 1.2 Volt, and the logic side was still operational at 0.28 Volt. So I don't see any reason to assume that a GPU would be "far" closer to NVT operation than any CPU. The required design changes are the same for both.

It's not just for standby operation. It is meant to provide performance per watt that is potentially, but only theoretically at this point, an order of magnitude better that what is possible with silicon that cannot sustain regular operation much below 1V.

Yes, but striving for this optimal performance/Watt completely obliterates performance/dollar. Hence outside of ultra-low performance niche devices that need to run on harvested energy, the only practical use is for standby operation, still requiring it to be able to run at a relatively high frequency during peak usage, to be commercially viable.

And a lack of compelling need.
If you need the mass throughput of a multicore, Intel wants you to get a SB-E or somesuch. Failing that, there are cloud servers using Xeons with many cores.
Consumers on average have dropped back to the dual to quad core transition with the proliferation of mobile media consumption devices with rapidly advancing GPUs.
We can do core upon core upon core, with the biggest challenge being that, beyond the low single digits, core count is a Do Not Care.

No. This is exactly the chicken-and-egg issue I mentioned. Back when 640 kB was enough for everyone, there was no "compelling need" for a mobile phone capable of running Angry Birds. You don't miss what you never had. Likewise, today there appears to be a low demand for more cores, but that's only because of a lack of software, which is in turn caused by the huge challenges of multi-core development. It's not due to a lack of task parallelism, nor a lack of desire for higher performance itself. People still want CPUs with higher single-threaded performance. TSX will no doubt be a game-changer for multi-core by simplifying things for developers and making it more efficient at the same time.

Intel's mobile graphics solutions IGP or Larrabee-ish cores will, if its tech works, never clock anywhere near as high as minimally acceptable for the primary CPU. Their active power while working--not idle--could be 10-100x lower.
This is very interesting for actual workloads people care about.

Haswell consumes 10x less power at low frequency and voltage. So like I said, the operating parameters of future CPUs with very wide SIMD units could be adjusted to the workload on a core-by-core basis. So you'll get the benefits of homogeneous computing, with the performance of heterogeneous computing.

keldor314 · Nov 1, 2012

That CPUs are better than GPUs at complex code patterns is somewhat of a misconception. If anything, GPUs can perform better here.

The big reason is that additional parallelism gives the processor much more flexibility to schedule instructions. Whereas a CPU must resort to complex OOO schemes to avoid stalls in the pipeline, which are less efficient with complex code, a GPU simply schedules an instruction from another thread, which, provided there are enough threads ready to run, completely masks a stall.

Consider a branch with nasty, data-dependent, behavior. In this case, the CPU can only guess which way the branch will go, meaning there's a high chance for a mispredict, which costs perhaps 15 cycles x 6 way superscalar = 90 instructions. A GPU would just schedule another thread, and not even attempt to predict the branch, meaning that as long as you can somehow keep the SIMD lanes coherent (which is sometimes actually possible in this case!), you take no branch penalty. This is even more troublesome in a difficult to decode architecture like x86, where it may take several cycles to decode the instruction to the point where you even know that it's a branch in the first place, meaning you can sometimes mispredict on static branches, or even non-branches!

Another tough issue with superscalar architectures is that you can very easily run out of instruction level parallelism. Worse, the amount of silicon it takes to achieve ILP of width N is of complexity order O(N^2), meaning you'll fall further and further behind a simple multithreaded architecture, which has complexity closer to O(N) or perhaps O(N log N).

One big argument for CPUs is that GPUs rely on SIMD instructions, which break down in highly irregular control flow situations. Keep in mind that CPUs also rely on SIMD, though not as wide (4 or 8 in the case of AVX, vs. 32 or 64 for Nvidia and ATI, respectively). Thing is, even with only 1 live SIMD lane, on something like a GTX 680, this still translates to 48 operations per clock (8 SMX x 4 instructions/SMX x 2 wide superscalar, with 6 ALU banks per SMX), which is still very near to CPU scalar performance in the best case (6 cores x 6 wide superscalar = 36 operations/clock, though at a higher clock frequency). Keep in mind that this assumes that the CPU is able to avoid stalls, as well as fill in all of the instruction slots every cycle, which is rather difficult, especially for complex code, meaning that the CPU numbers are inflated. The GPU is much harder to stall, due to the aggressive hyperthreading, and code that actually fragments control flow to this degree is really quite rare outside of pathogenic cases, so it's likely to perform much better that these numbers suggest.

As for programability, the SIMT model is far easier to use than the explicit SIMD model CPUs use. As for efficiency, remember that you can always force the SIMT model to work as an explicit SIMD model is software, but you cannot necessarily do the same in reverse, since SIMT requires full lane level predication (including predicated branches).

The bottom line is that for any problem where it is possible to break it down into thousands of threads, a GPU will almost always outperform a CPU, and the margin by which it does will tend to grow. Since pretty much any rendering algorithm operates on millions of independent pixels (with multiple independent samples for each one!), GPUs will pretty much always trump CPUs for rendering.

GPUs have changed a lot in the last 5 years, to the point that it's very hard to tell the difference between a GPU and a CPU at a functional level. The difference is in the microarchitecture, where GPUs are optimized for running many simultaneous threads, while CPUs are optimized for running only a few.

3dilettante · Nov 1, 2012

Nick said:
I never said 24-core.

My initial statement wasn't in response to a claim you made. If a one-sentence reply is all that is made to contradict my claim, I don't find it unreasonable to assume you are taking the same position.

Also note that discrete graphics cards are slowly becoming a niche market.

Aside from certain advantages such as PCB-mounted high speed memory, the discrete/integrated dichotomy is almost orthogonal. With the likely advance in memory stacking and 2.5D/3D integration coming at some point in the future, the GPU add-in board could go away. My argument on the divergent needs of silicon targeting one workload or the other would not change based on location.

People who do require such performance level are also likely in the same market for a 24-core CPU several years from now. So let's be very clear about what segment and time frame we're talking about. It's obvious that the CPU will unify with the iGPU first, before core counts go up. Something like an 8-core successor to Haswell with no iGPU can have plenty of generic computing power for mainstream graphics needs and many other purposes (including new ones).

I was specific on the segment given: consumer-level real time graphics.
The time frame I gave for forseable could have been tightened down to indicate that it was based on data that could be pulled from publically available process roadmaps from the major foundries and Intel, along with any product roadmaps--although this detail peters out earlier. That's about the 14nm node or equivalent for the foundries, maybe one more for Intel.

A user looking to replace their Radeon 7970 or GeForce 680 on a quad or hex core system is not going to be in the same market for a 24-core CPU.
I'm not sure what you mean by a Haswell successor. Haswell will most likely have SKUs that can max out consumer TDPs with 4 cores, and its immediate successor isn't going to make six times as many cores more palatable. Go much further and it's less likely to be a successor than a replacement.

Please don't compare a mobile GPU against a desktop CPU. Many discrete graphics cards are bigger power hogs than the CPU, even at 4 GHz. Mobile Haswell CPUs will consume as low as 10 Watt (and that's CPU+GPU).

That's a specious argument based on a superficial sampling of top-end GPUs not designed for NTV operation. Those GPUs burn power, but it is expected that they can accomplish far more graphically than the 4 GHz CPU, and they do.
To then compare discrete desktop product to an ultraportable platform is pointless, and ignores that Haswell's portable variant actually has more GPU as a fraction of areas specifically because having more low-clocked silicon saves power.

So exactly what operating parameters do you believe to be "far" closer to what is necessary for near-threshold operation on a GPU versus a CPU?

Their clock speeds are far lower and architecturally they tend to favor simpler pipelines and an economy in logic implementation. Their processing engines are closer to an original Pentium than a Haswell. Some mobile GPUs can operate in the hundreds of MHz, which is much closer than a multi-GHz processor to the low ceiling NTV puts on switching speeds.
NTV adds area and complexity costs, and it becomes a negative once it approaches regular speeds and voltages.

Peak clock frequency is affected by pipeline length but otherwise it seems to me that a CPU is just as close to being able to operate at near-threshold voltage than a GPU.

The switch speeds ceiling allowed by NTV would require a very long pipeline, assuming that an acceptable FO4 per stage is reachable with a pipeline specified to run NTV and at 4 GHz.

Actually that Pentium (a 4 stage architecture) was able to run at up to 915 MHz at 1.2 Volt, and the logic side was still operational at 0.28 Volt. So I don't see any reason to assume that a GPU would be "far" closer to NVT operation than any CPU. The required design changes are the same for both.

Its power efficiency curve is not as interesting at 1.2V, and we see the frequency curve just about stall above .8V. It's a small and ancient core that's burning power at the upper end of its range that can be matched by more modern designs with more performance.
Configuring the chip for NTV requires trade-offs against high-speed operation, and forcing it to those speeds actually makes it less efficient or less manufacturable.

Yes, but striving for this optimal performance/Watt completely obliterates performance/dollar.

The glut is in transistor counts and the sheer number of die the industry can produce to service a slower-growing global demand. Intel's already idling fabs at 22nm due to softness in demand. There is more flexibility in terms of transistor count and area, but very little for power going forward.

Hence outside of ultra-low performance niche devices that need to run on harvested energy, the only practical use is for standby operation, still requiring it to be able to run at a relatively high frequency during peak usage, to be commercially viable.

Intel's not getting funding from the US government on NTV vector permute engines for the sake of harvested energy computing. The power constraints for HPC at the exascale level are immense. Haswell's low-wattage variant is pushing further towards broad areas of low-speed logic as a power/performance tradeoff.

No. This is exactly the chicken-and-egg issue I mentioned. Back when 640 kB was enough for everyone, there was no "compelling need" for a mobile phone capable of running Angry Birds.

Back then mobile phones bricks, and a desktop tower couldn't have run Angry Birds.
There was no compelling need for the physically impossible, or at least no more than any other thing requiring unicorns.

You don't miss what you never had. Likewise, today there appears to be a low demand for more cores, but that's only because of a lack of software, which is in turn caused by the huge challenges of multi-core development. It's not due to a lack of task parallelism, nor a lack of desire for higher performance itself. People still want CPUs with higher single-threaded performance. TSX will no doubt be a game-changer for multi-core by simplifying things for developers and making it more efficient at the same time.

There is a fundamental shift in the dynamics of the market, from the outset of the IBM-compatible era.
Until recently, the PC had the anomolous benefit of being a business, media, and personal use portal. It was an open and fragmented era where creative, commercial, and individual use flexibility and capabilities were satisfied and funded by the same pool of silicon and the same pool of dollars.
This is not the same era.
The drivers for creative computing or scientific computing are no longer the same as consumer computing, or the same as business computing or enterprise system computing.

It used to be that engineering and revenue went into and came from this one big pool where all stakeholders could benefit from the PC chip as a disruptive technology.
If any sector stagnated, there were other needs or other customers who wanted more, and their contribution pushed the whole forward. The marginal utility of the next big thing drove rapid upgrade cycles across the whole domain.

The market trends now are for a fragmentation of a mature platform, one that is no longer disruptive but mundane and plodding.
For various reasons, we see spending going away from the single clunky box or merchant chip that does everything inconveniently for the consumer.
The consumer market is at least in part regressing, because silicon integration has advanced so far that people now have portable devices that can do just enough of the job of that clunky box that does everything, just not very prettily. The new platform is an inflexible portal for consumption, locked down, and hostile to creating content or processing it. It doesn't need to last, and it is better the more disposable it becomes.
Their money is not going to bring about a need for 24-core PC chips. Their devices do not necessarily want cloud servers running on those either. The supercomputers want more than those chips can provide.
There is still a need for pushing the envelope here, but it is not universally beneficial, so it is not going to be the product priced for the consumer.

Haswell consumes 10x less power at low frequency and voltage.

Will you be able to test at some point in the future what FPS is acheivable for some games using Swiftshader for a 10W Haswell chip?
You can then run the same games with the GPU on.
Log battery life.

So like I said, the operating parameters of future CPUs with very wide SIMD units could be adjusted to the workload on a core-by-core basis. So you'll get the benefits of homogeneous computing, with the performance of heterogeneous computing.

It's not enough for those interested in NTV, particularly since so much of Haswell's output will rely on binning to get the cream of the crop. NTV is meant for even lower power consumption with better throughput per Watt, and it is meant to do so consistently.

Davros · Nov 1, 2012

3dilettante said:
Will you be able to test at some point in the future what FPS is acheivable for some games using Swiftshader for a 10W Haswell chip?

Not sure about the 10w but I can take a very, very (cue a few more very's) rough guess for a 4.8ghz haswell (if such a thing was to ever exist) running UT2004 1680x1050 max details, a small 2 player level = 15-20fps

Nick · Nov 2, 2012

Davros said:
Hang on you said earlier avx2 will kill discreet gpu's
and avx2 is coming next year

I never said that.

Nick · Nov 2, 2012

keldor314 said:
That CPUs are better than GPUs at complex code patterns is somewhat of a misconception. If anything, GPUs can perform better here.

The big reason is that additional parallelism gives the processor much more flexibility to schedule instructions. Whereas a CPU must resort to complex OOO schemes to avoid stalls in the pipeline, which are less efficient with complex code, a GPU simply schedules an instruction from another thread, which, provided there are enough threads ready to run, completely masks a stall.

If that was the silver bullet, all we'd need is more than 2-way SMT on a CPU with very wide SIMD units.

But it's not that simple, and GPUs do stall. Complex code uses a big working set, and if your register file and/or caches are too small to hold that working set, then you use up precious bandwidth just to pull in frequently used data. Once you're out of bandwidth, and this can happen at many levels, the GPU stalls.

The CPU's out-of-order execution enables it to keep running just one or two threads per core, thereby maximizing access localities and ensuring high cache hit rates, making it inherently bandwidth efficient at every level. Computing power is getting cheaper, while bandwidth is getting harder to scale, so inevitable the GPU has to learn tricks from the CPU. In fact this has already been going on for years; they try to keep the thread count low by decreasing the back-to-back execution latency. Eventually they'll want to bypass the results back to the top of the execution units, and it's a relatively small step from that to not always schedule instructions from different threads, but to also schedule independent instructions from the same thread.

Consider a branch with nasty, data-dependent, behavior. In this case, the CPU can only guess which way the branch will go, meaning there's a high chance for a mispredict, which costs perhaps 15 cycles x 6 way superscalar = 90 instructions.

Actually the misprediction rate of a modern CPU is incredibly low. And graphics is far more regular than the average code.

A GPU would just schedule another thread, and not even attempt to predict the branch...

Which only works when you have enough threads to schedule between. Due to high memory access latencies, having many data-dependent branches will stall the GPU.

Branch prediction is a necessity to keep the thread count low. Sooner or later GPUs will need it. Stalling all the time because your data is far away, is worse than mispredicting every now and then.

Another tough issue with superscalar architectures is that you can very easily run out of instruction level parallelism.

Not really. First of all, Sandy Bridge has an out-of-order window of 168 instructions. There's not a lot of ILP to miss. In fact Hyper-Threading only offers at most 30% speedup. If ILP was a major issue you'd expect that to be significantly more.

And again, the GPU's urge to switch to another thread to maximize ALU utilization can work against itself. The register file and cache contention create a bottleneck that is far worse than missing a bit of ILP.

Worse, the amount of silicon it takes to achieve ILP of width N is of complexity order O(N^2), meaning you'll fall further and further behind a simple multithreaded architecture, which has complexity closer to O(N) or perhaps O(N log N).

It's really not that complex. The NetBurst architecture was theoretically capable of executing four arithmetic operations per clock cycle. It had all the logic for it, 12 years ago. The effective IPC was far lower though, but this had several reasons beyond how wide the execution core was. Core 2 was less wide, but IPC went up, and it didn't cost O(N^2) in transistor budget. Haswell has four arithmetic execution ports again, but this isn't all that complex by today's standards. There's very little logic that has to scale by O(n^2) to make that happen.

As for programability, the SIMT model is far easier to use than the explicit SIMD model CPUs use. As for efficiency, remember that you can always force the SIMT model to work as an explicit SIMD model is software, but you cannot necessarily do the same in reverse, since SIMT requires full lane level predication (including predicated branches).

The CPU supports any programming model, including SIMT with predication.

The bottom line is that for any problem where it is possible to break it down into thousands of threads, a GPU will almost always outperform a CPU, and the margin by which it does will tend to grow. Since pretty much any rendering algorithm operates on millions of independent pixels (with multiple independent samples for each one!), GPUs will pretty much always trump CPUs for rendering.

You're making (false) assumptions here. Aside from the lack of gather support, which will soon be fixed, the CPU is behind on the GPU due to difference in SIMD width. That can be fixed too while still calling it a CPU.

GPUs have changed a lot in the last 5 years, to the point that it's very hard to tell the difference between a GPU and a CPU at a functional level. The difference is in the microarchitecture, where GPUs are optimized for running many simultaneous threads, while CPUs are optimized for running only a few.

Yes, many differences between the CPU and GPU have disappeared one by one. But the convergence is still ongoing, and it's happening from both ends, so it's only a matter of years before those last few differences fade as well.

Software/CPU-based 3D Rendering

Blazkowicz

Davros

milk

Like Verified

Davros

Dag

3dilettante

Rodéric

a.k.a. Ingenu

milk

Like Verified

Osamar

Voxilla

Davros

Nick

Davros

3dilettante

Nick

keldor314

3dilettante

Davros

Nick

Nick

Similar threads