Software/CPU-based 3D Rendering

That's pretty much what they do these days. You're basically writing out the same words Qualcomm uses to describe the power management for Krait in its marketing.
A core with dynamic voltage and frequency control is able to get information from activity counters, firmware heuristics, and possibly the OS scheduler to determine what the workload demands are.
Aggressively integrated gating and dynamic frequency adjustments have made their way into any power constrained environment.

I do not follow why we're on this tangent as if this is a new concept, or how a core that can vary its voltage or clock gate isn't just the exact same set of circuits, except at a different voltage and with some of the clocks not enabled.
Indeed it's not a totally new concept. I already mentioned Turbo Boost. But I expect this to be extended to also adjust the frequency and voltage based on workload type, not just on workload intensity.

Today's integrated GPUs have a lower absolute power consumption than the CPU cores. So to create a unified CPU, which takes over the role of said integrated GPU, the cores that run a highly parallel workload should be throttled down. This also leaves more power budget for sequential tasks, which are more important for the responsiveness of the system. Today's CPUs treat all workloads the Hurry-Up-and-Get-Idle (HUGI) way. But this is only suitable for sequential workloads that do go idle. Parallel workloads need a different strategy for the best user experience.
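To make that concrete, here is a toy governor sketch (purely illustrative; the enum and frequencies are hypothetical, not any shipping power manager): race-to-idle for bursty sequential work, a lower sustained frequency near the voltage/frequency sweet spot for long parallel work.

```cpp
// Hypothetical policy split: HUGI for sequential bursts, sustained efficiency for parallel work.
enum class Workload { SequentialBurst, ParallelThroughput };

int chooseFrequencyMHz(Workload w, int fMaxMHz, int fEfficientMHz) {
    switch (w) {
        case Workload::SequentialBurst:    return fMaxMHz;        // finish fast, then go idle
        case Workload::ParallelThroughput: return fEfficientMHz;  // run near the V/f sweet spot
    }
    return fEfficientMHz;
}
```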
I'm talking about computing devices that literally reconfigure themselves. A truck that shifts from second to third gear is still the same truck.
Nice analogy, except that today's processors have very little that resembles those gears yet. A truck with few gears isn't very efficient at driving up slopes or driving on the highway, or both. So we need processors to not only throttle the gas, but be able to switch gears to adapt to different conditions. We have multi-core and SIMD for basic TLP and DLP extraction, but there's lots more that can be done to increase throughput and lower power consumption. Together with adapting to the workload type with things like long-running wide vector instructions and lowering the voltage/frequency, CPUs would offer plenty of the reconfiguration you're looking for. The rest of the reconfiguration for different uses can be handled in software.
Properly supplying quadruple SIMD throughput is more expensive than you let on, and I've stated the position that for the desired performance goals by 2018 or 2020 for Exascale, the default power budget is too high to begin with.
To reiterate, the proposed gains are modest and the baseline not good enough.
Which is why my power consumption argument had four parts. Don't isolate one and say it's not good enough.
There are reports that there may be Haswell high performance SKUs with 160W+ TDPs.
Westmere stopped at 130.
Can you add some clarification on what you mean by this?
My point was that despite substantially increasing the throughput per core, Haswell's power consumption will still be lower (per core). And they're only one process node apart.
FinFET is quite impressive in the lower voltage domain, especially in more modestly clocked designs.
The improvement in the 4 GHz, 1V+ realm is back to the modest tens of percent.
Which is exactly why during a throughput-oriented workload the frequency should be reduced.
I'm not sure why it's fine to pin hopes on one lab's silicon nanowires that may someday be looked at, while a whole NTV Pentium that physically exists and has been manufactured has to be discounted.
I'm not saying to completely discount it. But the problem with NTV today is that it comes with severe compromises. Having to drop the clock frequency by an order of magnitude just isn't commercially viable outside of some niche markets. It's valuable to drop the voltage a little with every new node, but this doesn't demand the full set of changes required for true NTV operation. In essence, NTV technology is a last resort and we want as little of it as possible.

Junctionless transistors on the other hand have very promising scaling characteristics. It's a more desirable technology than being forced to go the NTV route. But I'm not pinning all my hopes on junctionless transistors. Intel is also for instance experimenting with III-V TFETs, which might operate at 0.3 Volt but without sacrificing clock frequency. So my main point was that with so much R&D now going into lowering the power consumption of transistors, there's bound to be some progress that makes current trends too pessimistic.
Why would a mobile GPU with a short pipeline, relatively simple design and operating point in the hundreds of MHz fare worse with NTV than a Pentium with a short pipeline, relatively simple design, and an operating point in the hundreds of MHz?
It wouldn't fare worse. It would fare just as badly.
Which steps do you think are left that aren't already heavily gated? Because yes, extensive gating is being done already. This is why I've stated your supposed gains are modest: they seem to include things that have been done for five years.
There are no long-running vector instructions in consumer CPUs. So even though the gating mechanisms are already largely in place, it's not being taken full advantage of during high DLP workloads.
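To make the idea concrete, here is a rough AVX sketch (illustrative only) of what a hypothetical 1024-bit "long-running" add corresponds to today: four separately fetched, decoded and scheduled 256-bit operations. The proposal is to issue it once and let the execution unit sequence the four 256-bit steps internally, so front-end and scheduler activity can be gated in the meantime.

```cpp
#include <immintrin.h>

// What a hypothetical long-running 1024-bit add would cover: today this is four
// independent 256-bit instructions, each paying fetch/decode/schedule overhead.
void add1024(const float* a, const float* b, float* out) {
    for (int i = 0; i < 4; ++i) {                       // 4 x 256 bits = 1024 bits
        __m256 va = _mm256_loadu_ps(a + 8 * i);
        __m256 vb = _mm256_loadu_ps(b + 8 * i);
        _mm256_storeu_ps(out + 8 * i, _mm256_add_ps(va, vb));
    }
}
```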
Duplicated logic is duplicated stuff with costs going to zero.
The transistor cost has already been going to zero for decades, but the transistor count has been increasing at roughly the same pace. So area cost isn't going to zero. And even though not all of the transistors can be active at the same time, there's still an increasing absolute amount that can. Also, there's large portions that can be gated off temporarily even during high performance operation. All of this means it's still a substantial waste to have duplicate functionality and use only half.

I'm not denying the problem of dark silicon, but once again it looks like you're interpreting something that is in fact highly undesirable as something that's somehow an advantage to GPUs. They suffer from this just as much. It may be a hurdle for CPU-GPU unification, but it's definitely not reversing it. And again, there are substantial advantages to unification aside from transistor (area) cost. It improves data locality and makes writing efficient code much easier.
Data movement can be managed and coalesced so that a design can intelligently weigh the occasional burst in consumption when starting the offload process against the ongoing economies in power consumption.
Managing and coalescing the data movement doesn't come for free. It worsens the data locality and you'll end up running latency sensitive code on the GPU. It works from the point of view of minimizing the data transfer overhead, but it's making things less efficient elsewhere.

This may sound like a simple matter of optimizing it until you reach the right balance, but it's a veritable programming nightmare. There are tons of heterogeneous system configurations; some with powerful discrete GPUs, some with feeble integrated GPUs, some with lots of registers per thread, some with few, some with good shared memory bandwidth, some with severe bottlenecks, some with crippled double-precision performance, some with limited integer performance, etc.

To make matters worse, things are not getting any better for heterogeneous systems. Bandwidth and latency doesn't increase at the same rate as computing power. So moving data between the CPU and GPU becomes ever more costly. The only solution is unification.
Why does the memory hierarchy see things as being significantly different? The data cache and the memory controller have very little awareness of what instructions are doing outside of the memory accesses they generate. The long-running SIMD instructions basically make a quarter-wide SIMD demand the same amount of operand data.
Yes, the hardware prefetcher isn't currently aware of long-running SIMD instructions and their latency hiding qualities. So it would have to be made aware of that. This isn't hard, it's just a gear shift.
This would make me start to question why this is on a big OoO core when it seems all its design features are negated, but it has to jump through hoops to appear simpler.
Did unifying the GPU cores negate their design features for vertex and pixel shaders? No, they just cater for both now, and leave features unused when not needed by the shader. Likewise, long-running vector instructions would just make the CPU cater for high efficiency DLP extraction, which allows to gate some features aimed at ILP extraction. Those ILP features are still highly necessary for sequential scalar workloads. Again, just another gear shift.
Intel's power control unit has been subtly overriding OS power state demands since Nehalem and possibly one of the Atom designs at the time.
But that appears to be about homogeneous cores. Sure, it can be extended to heterogeneous ones, but it's not solving the inherent problem that bandwidth and latency between heterogeneous cores is scaling more slowly than computing power. And in fact with a homogeneous ISA (virtual or not) across heterogeneous cores and a unified address space across disjoint memories, the developer becomes less aware of where the code is running and where the data is located. Just for arguments sake, he could be running the OS on the GPU and graphics on the CPU, with both pulling data from the other side. I really don't think that disguising a heterogeneous system as being a homogeneous one fixes things. It might be an improvement on average, but it's really just another convergence step toward full unification.

It's already clear that the discrete GPU's days are numbered, so the RAM, the memory controllers and part of the cache hierarchy all become physically and not just virtually unified. And while this isn't optimal for specific types of workloads, it's way better than a CPU pulling data out of graphics RAM or a discrete GPU pulling data out of system RAM, and the risk of that bringing down performance is only getting worse.

It's only a matter of time before the cores have to be physically unified as well, to prevent having to run code on the wrong type of core because bandwidth and/or latency don't allow to migrate it. This problem doesn't occur when all cores are equally capable of extracting ILP, DLP and TLP.
Going forward, Intel has been putting forward standards to allow system components to communicate guard bands on latency requirements, so that their next SOC will be able to coalesce activity periods at its discretion to better enable power gating.
What standards?
It does exactly what I want it to do, and exactly what the customer would want it to do.
Improve graphics performance at the expense of GPGPU?
Or what Daly said, removing things incidental to the real problem.
Removing memory space barriers means adding hardware features. So you can call it removal all you want, it's still convergence.
When you have a hammer...
If it quacks like a duck...

Seriously, what you described is definitely closer to software rendering and further away from heterogeneous hardware. Compilation and scheduling are latency sensitive tasks, and since you'll want each core to adapt individually, you want each core to have CPU-like qualities. In theory you could just pair up each GPU core with a CPU core, but since you want a shared ISA as well we can at least unify instruction fetch and decode. Assuming that at this point the CPU side of each core uses out-of-order execution and the GPU side uses in-order execution, you still need ways to synchronize data between them. So the memory subsystem also has to be tightly interwoven, especially since you also want a uniform memory space. This practically just leaves scheduling. But scheduling instructions or scheduling threads isn't horrendously dissimilar. You may as well have just one generic out-of-order SMT scheduler and have things like long-running SIMD instructions to lower the switching activity.
 
Note that not only are we comparing state of the art tracers, but we even have identical test scenes used, so this is probably as fair a comparison as you're likely to get.
Uhh, dude that's an awful comparison. Did you even read the abstract of the second paper? "Recently, a new family of dynamic ray tracing algorithms, called divide-and-conquer ray tracing, has been introduced. This approach partitions the primitives on-the-fly during ray traversal, which eliminates the need for an acceleration structure". Yeah, *totally* comparable to the GPU benchmarks with precomputed static geometry... oh wait.

I'm not going to go into the details here (you can go look them up and run the relevant code yourselves), but CPUs are extremely competitive with GPUs at ray tracing iso-power.

Yes, good algorithms try to keep the rays as coherent as possible, but the curse of dimensionality limits how much of this is possible (ray space is 6 dimensional, and hit space is 3 dimensional - there's a LOT of room for rays to avoid each other and prevent reuse).
Sure it's not an easy problem, but it's the *only* problem going forward. You can't keep brute forcing stuff with memory bandwidth when power becomes the limiting concern. Really, you have to get that power efficiency is the primary determining metric for performance going forward, and burning memory bandwidth is not the way to do it. You're well served by treating everything like a distributed database problem: only count memory movements.

Some form of diffuse ray tracing is going to find its way into gaming in the near future. In fact, the upcoming Unreal Engine 4 has showcased a form of it...
... and it's not path tracing and doesn't scale the same way that you have described.

Come on man, that post is basically marketing. I'm all for efficient ray tracing on any suitable architecture but you're trying to trivialize something that is way more complicated than you seem to understand and declare a winner based on your extremely simplified model.

I think the claim was that the original (unreleased) Larrabee reached XBox 360 levels
Does anyone really think Larrabee was so pathetic that it could only match an Xbox 360? Ugh :S

Beyond that, Haswell isn't going to be nearly as good at automatically hiding latency, and the instruction set will still have deficiencies; for instance predication will have to be done explicitly with blends instead of with control words, needing more registers (and instructions) in an already more constrained register file.
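For illustration, this is what explicit predication with blends looks like in AVX (a toy snippet, not SwiftShader or Larrabee code): the predicate occupies a full vector register and the select costs an extra instruction, where a mask/control-word ISA would do it implicitly.

```cpp
#include <immintrin.h>

// Predicated clamp: a[i] = (a[i] > t) ? t : a[i], eight floats at a time.
void clampAboveThreshold(float* a, float t) {
    __m256 va   = _mm256_loadu_ps(a);
    __m256 vt   = _mm256_set1_ps(t);
    __m256 mask = _mm256_cmp_ps(va, vt, _CMP_GT_OQ);   // all-ones lanes where a > t
    __m256 sel  = _mm256_blendv_ps(va, vt, mask);      // blend needs its own register and issue slot
    _mm256_storeu_ps(a, sel);
}
```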
While I agree it's never as simple as FLOPS counting, what advantage do you think Larrabee had in latency hiding that the big cores do not? Latency hiding in Larrabee was done via prefetching and software fibering (which actually worked quite well)... in fact the hardware was *worse* at hiding it automatically than big cores (thus it was critical to do a good job in software).

But I agree there are lots of missing bits... masking, format conversions, texture sampling. Haswell EP may be a decent software rasterizer but I doubt it'll match even the original Larrabee.
 
Does anyone really think Larrabee was so pathetic that it could only match an Xbox 360? Ugh :S

sebbbi said it was slightly faster. I'm not that familiar with it myself. AFAIK people saw results in 2009 (or was it 2008?) and were pretty underwhelmed, so it couldn't have been that amazing. Of course, the raw attainable compute would have been a good order of magnitude higher than Xenos + Xenon.


While I agree it's never as simple as FLOPS counting, what advantage do you think Larrabee had in latency hiding that the big cores do not? Latency hiding in Larrabee was done via prefetching and software fibering (which actually worked quite well)... in fact the hardware was *worse* at hiding it automatically than big cores (thus it was critical to do a good job in software).

Much more parallelism, in that Larrabee is much slower and wider, by a factor of around 4, so in terms of clock cycles there's far less latency to hide, and it has four-way SMT instead of two-way.

If by software fibering you mean SMT then I guess we're talking about different things here. Yes, you have to manage the threads explicitly, so I'm not saying that Larrabee will hide latency for you in a serial stream of code, but I am saying that it presents to the programmer far better tools for hiding latency if the code is highly data regular.

... okay, on re-reading my post I see I did use the word "automatically." I guess SMT and wide SIMD/slow clocks are kind of semi-automatic? I mean, you can target it with the same OpenCL or what have you that you run through a compiler, and the compiler will take care of it... the programmer doesn't have to do very much latency hiding.
 
Either that or SwiftShader is relying on unsafe compiler optimizations that only work correctly most of the time. Shader compilers are rather notorious for that.

I'm guessing the majority of SwiftShader's performance is dictated by code it generates for shaders, which may also inline some stuff that's normally fixed function like texturing (at least to handle computed addresses). I have a feeling a lot of the other performance critical stuff is done with hand-coded assembly.
 
Daly's talking about the energy required to drive signals over wires from one point in the chip to another, or some kind of transaction with endpoints on and off chip.

One of the implied components to reducing scheduling overhead, aside from the transistor savings, is the reduction in the data transport related to that process. All scheduling and propagation down the pipeline is implicitly generating some number of bits of data per instruction, as manifested by the switching of signal wires and the changes in internal bookkeeping state. Other costs, such as accessing the branch prediction hardware, are also data movement, just data that the software doesn't see.
Sure, but the bookkeeping doesn't increase with increasing SIMD width. And with long-running SIMD instructions the bookkeeping cost can even go down. So combined it would provide a substantial improvement in performance/Watt.
The energy cost of that movement is dependent on the distance that needs to be traveled and is influenced by the properties of the interconnect and how it is physically and electrically implemented. He mentions creating a very small register operand cache to keep the operand paths in the common case extremely short...
CPUs have a bypass network which sends results straight back into the ALUs. The 16-entry ORF in Daly's presentation is pretty close to being the same thing. Convergence.

But just like a CPU, it requires minimizing the number of threads to maximize the chances of actually being able to reuse these recent results. This demands some clever scheduling. So you shouldn't cherry-pick the things that will lower power consumption and not look at the bigger picture.
 
I'll take that as refusing to answer the question. If you are going to persist in being purposefully obtuse, then there really isn't any point.
It's not me not answering the question that would be the cause of there not being any point. I kindly asked you to state the point you were trying to make.
EDIT: I suppose I can attempt one last time...

Why, exactly, do you think it is more accurate to compare GF110 to GK104 than GK110? And if that is the case, then to what would you compare GK110?
GK110 has 7.1 billion transistors. Even if you account for the denser design, there's nothing in the Fermi family to compare it to.

Your turn. Why do you believe this is not readily obvious?
 
Uhh, dude that's an awful comparison. Did you even read the abstract of the second paper? "Recently, a new family of dynamic ray tracing algorithms, called divide-and-conquer ray tracing, has been introduced. This approach partitions the primitives on-the-fly during ray traversal, which eliminates the need for an acceleration structure". Yeah, *totally* comparable to the GPU benchmarks with precomputed static geometry... oh wait.

The numbers I cited were from the high performance MBVH ray tracer they were comparing their results to. Hence, as fast of a tracer as they were able to find. I did not use the numbers from the new tracer they presented, which has performance around 5x worse than the static MBVH CPU one.

I'm not going to go into the details here (you can go look them up and run the relevant code yourselves), but CPUs are extremely competitive with GPUs at ray tracing iso-power.
If you can find better numbers, then by all means, do so. But keep in mind that the diffuse ray numbers are the ones of interest. I'm sure you can find much higher rates for primary rays, but they are more or less irrelevant to the bulk of the work in generating a high quality render. You need only a handful of primary rays per pixel, but hundreds of diffuse rays to get any sort of noise free global illumination, and let's not even mention caustics...

... and it's not path tracing and doesn't scale the same way that you have described.

Not naive path tracing, no. It does, however, require tracing large numbers of rays with very poor coherence. Remember, ray space is 6 dimensional, meaning the chance of any two rays not only passing near to each other briefly but following a similar route through the acceleration structure is very, very small. Honestly, if you can come up with a method that can match the quality of an unbiased renderer while retaining memory coherence, it will be a major breakthrough.
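To make the incoherence concrete, here's a minimal toy sketch (not from either paper) of a cosine-weighted bounce direction: two primary rays hitting neighbouring points get essentially unrelated secondary directions, so their traversals through the acceleration structure diverge immediately.

```cpp
#include <cmath>
#include <random>

struct Vec3 { float x, y, z; };

// Cosine-weighted hemisphere sample around a unit normal n.
Vec3 cosineSampleHemisphere(const Vec3& n, std::mt19937& rng) {
    std::uniform_real_distribution<float> u01(0.0f, 1.0f);
    float u1 = u01(rng), u2 = u01(rng);
    float r = std::sqrt(u1), phi = 6.2831853f * u2;
    // build an orthonormal basis (t, b, n) around the normal
    Vec3 a = std::fabs(n.x) > 0.9f ? Vec3{0.f, 1.f, 0.f} : Vec3{1.f, 0.f, 0.f};
    Vec3 t = { a.y*n.z - a.z*n.y, a.z*n.x - a.x*n.z, a.x*n.y - a.y*n.x };
    float tl = std::sqrt(t.x*t.x + t.y*t.y + t.z*t.z);
    t = { t.x/tl, t.y/tl, t.z/tl };
    Vec3 b = { n.y*t.z - n.z*t.y, n.z*t.x - n.x*t.z, n.x*t.y - n.y*t.x };
    float x = r * std::cos(phi), y = r * std::sin(phi), z = std::sqrt(1.0f - u1);
    return { x*t.x + y*b.x + z*n.x, x*t.y + y*b.y + z*n.y, x*t.z + y*b.z + z*n.z };
}
```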

Sure it's not an easy problem, but it's the *only* problem going forward. You can't keep brute forcing stuff with memory bandwidth when power becomes the limiting concern. Really, you have to get that power efficiency is the primary determining metric for performance going forward, and burning memory bandwidth is not the way to do it. You're well served by treating everything like a distributed database problem: only count memory movements.

No one's arguing that the voxel cone tracing method that UE4 uses is identical to path tracing. The argument I'm making is that it has very similar memory and compute characteristics to conventional path tracing, since both are essentially traversing an acceleration structure looking for intersections. The real point is that this type of workload is almost certain to become commonplace in games in the future.

Secondly, the performance numbers strongly suggest that ray tracing is bound to global memory performance, and that large caches don't matter. If they did matter, the scaling wouldn't match the respective memory bandwidth so cleanly. Thus, the question is what amount of power is wasted on the CPU chip itself. The GPU is running say 200 watts and the CPU is running 100, but the GPU has about 10x more performance, which means it's doing the same work with only 20% the power consumption. A second question is whether the CPU would be able to keep up with the workload if it suddenly had 10x more bandwidth.
 
Let's talk ray tracing. Specifically, diffuse rays, which are both highly divergent in both memory and execution, as well as being the important part of global illumination. In general, diffuse rays are both the slowest class of rays, as well as being absolutely necessary, not to mention you have to trace a greater number of them in a given render than primary rays.

Here are the most recent papers I can find, for both GPU (kepler) and CPU (i7 with AVX).

http://www.tml.tkk.fi/~timo/publications/aila2012hpg_techrep.pdf
http://dl.dropbox.com/u/10411297/Downloads/incoherent_dacrt_eg2012_final.pdf

Note that not only are we comparing state of the art tracers, but we even have identical test scenes used, so this is probably as fair a comparison as you're likely to get.
I'll do you one better: identical algorithms. The GTX 680 loses quite badly against the quad-core CPU, especially when you take power consumption into account. Haswell will most definitely grow that gap, even against the 680's refresh.
It's also worth noting that the GTX680 claims 192.2 GB/s memory bandwidth, while the i7-2600 claims 21 GB/s, a 9x difference. This means that for all the talk about superior memory systems with large caches and OOO, the GPU is actually beating the CPU in performance by slightly more than the difference in memory bandwidth...
Actually the GPU article you linked quite clearly shows that the performance closely follows the peak FLOPS, not the bandwidth. So you can't draw any conclusions from a correlation between the CPU's and one GPU's bandwidth and their ray-tracing performance (using different algorithms).
The basic reason that CPUs see no benefit from their large caches is that, for diffuse/secondary rays, you very rapidly lose any coherence between rays, so they are all accessing random parts of the scene. Yes, good algorithms try to keep the rays as coherent as possible, but the curse of dimensionality limits how much of this is possible (ray space is 6 dimensional, and hit space is 3 dimensional - there's a LOT of room for rays to avoid each other and prevent reuse). Since the scene is large - generally as large as will fit in memory, since we want fancy scenes - this means the cache miss rate is near worst case.
If that was true then things would be bandwidth limited, but we just ruled that out.

So clearly you're making wrong conclusions.
 
Larrabee is a GPU style architecture, not a CPU.
It's something in the middle really. Its modest 4-way round-robin SMT and relatively large L2 caches make it capable of running operating system code at bearable performance. I have yet to see any GPU from NVIDIA or AMD do that.

It's a compromise though. It's neither catering for high-performance sequential code nor for best-in-class rasterization graphics. What I'm proposing instead is a unified architecture which retains 100% of the CPU performance but can adapt to high throughput workloads to offer adequate graphics performance for the lower end of the market, and work its way up from there.
 
Nvidia's OpenCL implementation really stinks. I don't believe they've even touched it since the early fermi days, at least, no more so than necessary to have it run (somewhat) on new hardware. I suspect they're trying to play politics and sink the API so that people use Cuda. Or something. Their OpenCL support before fermi was actually pretty decent, so yeah. In fact, my old GTX 295 outperforms my GTX 680 on highly compute bound code, which, given the code (very small working set, lots of arithmetic), makes no sense whatsoever.

As for AMD, their OpenCL implementation is merely unstable. I've had it crash the computer (not just the program, the whole OS, forcing a reboot) with code that runs perfectly fine on Nvidia hardware (apart from the incorrect results from an unrelated regression from *2 years ago* that they still haven't fixed, but never mind that...). Actually, that particular piece of code breaks tesla too, but it does run on fermi and kepler. This is on an older card (HD5770, so juniper), so maybe it works on GCN, but it's still not good.

My dad also ported my work to OSX, and has had his own set of problems with bugs in OpenCL, some quite severe. I think it took Apple 2 months to get the new Macbook Pro to even get through OpenCL initialization without crashing, hard, forcing a reboot. On both CPU and GPU devices. Blew up on clCreateBuffer, iirc.

As for the API itself, not only is it painfully low level, but it isn't even implemented literally. Did you know that when you call clEnqueueWriteBuffer it doesn't actually copy anything until you execute a kernel? What's the point of making it explicit when the driver (for both AMD and Nvidia) ignores it and copies data over whenever it feels like it, generally at kernel launch? Even then, it sometimes (AMD) reconfigures the kernel to use memory in strange and unusual ways, like splitting apart basic kernels that access multiple arrays into multiple kernels accessing single arrays. A nice feature, I suppose, except that it completely broke my method of testing for the amount of available memory (and no, there's no way in the API to query this directly). I had my AMD card running simple kernels successfully which had working sets far larger than the memory on the card, but of course, trying to use that amount of memory with a real production kernel crashed. I never did figure out a way to get available memory on an AMD card, other than taking the total memory and multiplying by .75 or so.
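A minimal host-side sketch of the workaround (assuming a valid context and queue already exist): making the write blocking and draining the queue is about the only way to be sure the transfer has actually happened before the kernel launches.

```cpp
#include <CL/cl.h>

// Force the upload to happen now rather than whenever the driver feels like it.
cl_int uploadNow(cl_context ctx, cl_command_queue queue,
                 const void* hostPtr, size_t bytes, cl_mem* outBuf) {
    cl_int err = CL_SUCCESS;
    *outBuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY, bytes, nullptr, &err);
    if (err != CL_SUCCESS) return err;
    // blocking_write = CL_TRUE: the call does not return until the data has been copied
    err = clEnqueueWriteBuffer(queue, *outBuf, CL_TRUE, 0, bytes, hostPtr, 0, nullptr, nullptr);
    if (err != CL_SUCCESS) return err;
    return clFinish(queue);   // drain the queue so nothing is left deferred to kernel launch
}
```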

So basically, I don't feel like any sort of OpenCL benchmark is particularly valid, at least not when comparing hardware, since the state of the implementations is so bad. It's like trying to benchmark two CPUs by seeing how long it takes to open a web page, where one is running Firefox and the other Chrome. Or maybe they're running Netscape and Internet Explorer. That would probably be a more accurate analogue.

I never had these sorts of problems with Cuda, a pity it won't do what I need it to for my current project (no runtime compile support).
 
sebbbi said it was slightly faster. I'm not that familiar with it myself. AFAIK people saw results in 2009 (or was it 2008?) and were pretty underwhelmed, so it couldn't have been that amazing.
I'm not clear on what people saw, but it certainly was not the final optimized renderer. I don't imagine I can talk details but while it took a long time to get the software up to decent performance, it did eventually hit close to the initial targets (although far too late). And I think you can guess that those targets were not Xbox 360 or anywhere near its performance level. I'm kind of insulted that people would think any decent graphics person would have gone to work on the project if that was the target ;)

If by software fibering you mean SMT then I guess we're talking about different things here.
No, four hardware threads is insufficient to hide texture latency (100s to 1000s of cycles). You need to actually have a software mechanism to store and reload different contexts - that's what I mean by "software fibering" (i.e. extremely lightweight, cooperatively scheduled "fibers" with minimal context save/restore). Turns out you can make it pretty efficient for domain specific stuff like shading languages.

Yes, you have to manage the threads explicitly, so I'm not saying that Larrabee will hide latency for you in a serial stream of code, but I am saying that it presents to the programmer far better tools for hiding latency if the code is highly data regular.
Not really TBH. The original boards did not use hardware prefetch, and remember that they used GDDR which has much higher latencies than regular DDR, so you were pretty much screwed if you didn't have good software prefetch. For predictable patterns this is fairly easy but for unpredictable stuff you need software fibering or something similar (i.e. a way to prefetch when you know the memory address, then go do something else for a while). That said, it did work pretty well for shading, but you don't want to try and write that code in C...
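For readers who haven't seen the pattern, a toy C++ sketch of the idea (hypothetical Fiber struct, nothing like the actual Larrabee shading code): issue the prefetch, yield to the other fibers, and only consume the data on the next turn.

```cpp
#include <xmmintrin.h>  // _mm_prefetch

struct Fiber {
    const float* data;   // address this fiber reads next (e.g. a texel)
    float        acc;    // running result
    bool         ready;  // false = prefetch pending, true = data should be resident
};

float runFibers(Fiber* fibers, int count, int rounds) {
    for (int r = 0; r < rounds; ++r) {
        for (int i = 0; i < count; ++i) {
            Fiber& f = fibers[i];
            if (!f.ready) {
                _mm_prefetch(reinterpret_cast<const char*>(f.data), _MM_HINT_T0);
                f.ready = true;              // yield: move on to the next fiber
            } else {
                f.acc += *f.data;            // the cache line is (hopefully) resident by now
                f.data += 16;                // next element, one 64-byte line away
                f.ready = false;
            }
        }
    }
    float total = 0.0f;
    for (int i = 0; i < count; ++i) total += fibers[i].acc;
    return total;
}
```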

I mean, you can target it with the same OpenCL or what have you that you run on compilers and the compiler will take care of it.. the programmer doesn't have to do very much latency hiding.
Sure, the compiler can handle the relevant code, but this is true on CPUs as well. I doubt Nick/SwiftShader just completely rely on the hardware to hide latency automagically, but he can correct me if I'm wrong :)

I'm sure you can find much higher rates for primary rays, but they are more or less irrelevant to the bulk of the work in generating a high quality render.
Agreed that primary rays are irrelevant as they are almost always more efficient to rasterize. But you can't really directly compare "diffuse rays" from separate papers even if they use the same scene as the BRDFs are of critical importance to secondary effects (among other things).

As Nick pointed out, there's really no excuse for not running exactly the same code on both GPUs and CPUs these days... almost everything that runs well on GPU compute runs pretty well on CPUs as well (exception being texture sampling and similar special functions), so there's really no need to play "let's try and find the best implementation on target A/B!" anymore. And I don't mean you have to implement it in OpenCL and JIT it for both - you just should do exactly the same algorithm and parallelism/SIMD on both and get exactly the same results. Realistically, I simply don't trust random paper implementations for architectural comparisons.

You need only a handful of primary rays per pixel, but hundreds of diffuse rays to get any sort of noise free global illumination, and let's not even mention caustics...
That's far from clear. For conventional brute force path tracing, sure, but there are approximate methods that are noise free by construction. You'll find that anyone who's doing real-time rendering isn't too concerned with being unbiased... it's far more important to look visually pleasing and avoid digital artifacts (noise, aliasing, etc). That's really where the interest in voxel cone tracing comes from... it's extremely wrong, but it behaves in a reasonable way.

Not naive path tracing, no. It does, however, require tracing large numbers of rays with very poor coherence.
"Tracing" cones through volumetric data structures is a very different algorithm than intersecting individual rays with triangles. Tracing cones is actually totally predictable memory access patterns and the natural LOD that falls out of walking up a mip or similar chain brings a lot of coherence back. In fact, you get the sort of strange effect that blurry effects are cheaper than sharp ones (which you can kind of get at with path tracing + fancy reconstruction but not as directly).


Honestly, if you can come up with a method that can match the quality of an unbiased renderer while retaining memory coherence, it will be a major breakthrough.
Agreed, but unbiased is pretty uninteresting in real-time, for the reasons that I mentioned. It's nice theoretically but not a really important constraint for making a game look pretty.

Secondly, the performance numbers strongly suggest that ray tracing is bound to global memory performance, and that large caches don't matter.
Like Nick I'm confused by this statement... the NVIDIA paper argues the opposite.

The GPU is running say 200 watts and the CPU is running 100, but the GPU has about 10x more performance, which means it's doing the same work with only 20% the power consumption.
Ok, ignoring that the 10x number is bogus, you actually have to measure the power utilized while running the workload (at the wall, not just the GPU since it can't run in isolation)... you can't just use TDP.

I downloaded Embree (free CPU ray tracer optimized for incoherent rays - i.e. it doesn't really even take advantage of primary rays and is fairly insensitive to recursion depth) and ran some of the sample scenes and saw ~45Mrays/s path tracing (with full GI, etc.) with zero tweaking, and I don't think Embree even uses AVX/AVX2 yet. On a 50 million poly scene (all onscreen) with path tracing to a depth of 16 (i.e. significantly more complicated than the examples in the NVIDIA paper), I still see 15Mrays/s, so I think it's safe to say that the 20Mrays/s primary rays on a fairly simple scene number you quoted is "suboptimal", to be kind...

Anyways I don't really have the time or inclination to dig out more numbers, but I think it's clear that naive comparison you made isn't the reality. There are some forthcoming publications on this very topic in flight, so I'm inclined to wait for the topic to be treated more formally.
 
...I still see 15Mrays/s, so I think it's safe to say that the 20Mrays/s primary rays on a fairly simple scene number you quoted is "suboptimal", to be kind...

My numbers were for diffuse rays, not primary, so 20 Mray/s is fairly consistent with your 15 Mray/s. Actually, the paper gives numbers for both SSE and AVX, with the SSE number being around 15. Once you get past a few hundred thousand tris, performance is pretty level for the incoherent rays, since the cache stops playing a role, and ray traversal is O(log n) by number of tris. Unless of course you run out of memory altogether, but 50 million isn't *that* big.

As for the hairball scene, looking at that thing, it seems likely that any bounding hierarchy is likely to be a bit of a mess, so it's probably a worst case outlier as far as performance goes.

For what it's worth, there are other cases where CPUs lag far behind. A good example is transcendental functions (sin, cos, log, exp, etc.) where the gap, at least for single precision, is quite large. Kepler can retire 256 of these instructions per clock cycle. There's just no way that a CPU can come anywhere close, especially since it doesn't have hardware paths for these functions. My work with GPGPU has been dealing with flame fractals, which do in fact use transcendentals fairly heavily, and I've measured my performance gap at around 40x, comparing my old GTX 295 to an equally old Phenom X6. All indications point to the gap actually being wider with current hardware, since GPUs have more than doubled since then, but CPUs have only gotten around 50% faster. With a gap this wide, it actually hurts performance if you try to have the CPU run at the same time as the GPU, since any gaps in scheduling kernels on the GPU outweigh the amount the CPU can contribute.
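For context on why the gap exists, a toy snippet (not from the flame-fractal renderer): on a CPU a single-precision sine is a polynomial evaluated with ordinary multiplies and adds, several instructions per result even when vectorized, while Kepler's SFUs produce it natively.

```cpp
// Toy 7th-order Taylor approximation of sin(x), reasonable only near [-pi/2, pi/2];
// real libm/SIMD implementations add range reduction and a better minimax polynomial.
float sin_poly(float x) {
    float x2 = x * x;
    return x * (1.0f + x2 * (-1.0f/6.0f + x2 * (1.0f/120.0f - x2 * (1.0f/5040.0f))));
}
```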

To be fair, the CPU reference makes a silly mistake and uses double precision, which is really completely unnecessary, but with a gap that wide, GPUs are quite clearly far faster, though maybe the real number is only 15x. I would rate both the CPU and the GPU implementation (besides the double precision thing) as having had about the same amount of time spent on optimization. I haven't been able to measure the OpenCL implementation directly on the CPU as it crashes (did I mention that OpenCL really stinks?). It is worth noting that when my dad ported the thing to OSX he got around a 5x difference, but there we're comparing an i7 in the MacBook Pro to a mobile-class GPU (I want to say 650M), which by my comparisons is around 4x slower than my GTX 680. That gives around a 20x performance gap between the 680 and the i7, both running identical-ish OpenCL code, so far as this is possible given the limitations of Apple's OpenCL CPU implementation (max 1 thread per block! WTF Apple??).
 
Indeed it's not a totally new concept. I already mentioned Turbo Boost. But I expect this to be extended to also be able to adjusting the frequency and voltage based on workload type, not just on workload intensity.
Using long-running SIMD instructions is basically another metric, where the heuristics maintain a history of the instruction mix. This is the same as with Sandy Bridge and 256-bit AVX.
If you want something more in-depth, there are some possible settings at system initialization that can disable functionality if the system's workload is known not to use them.
It sounds like you are implying the need for explicit software means of dictating the gating status of the core, which is more complicated than just software hints since these cores can freely pull in instruction streams from other workloads, absent some controlled environment or OS policy.

Today's CPUs treat all workloads the Hurry-Up-and-Get-Idle (HUGI) way. But this is only suitable for sequential workloads that do go idle. Parallel workloads need a different strategy for the best user experience.
This has long been the best means of achieving optimum energy efficiency, particularly if a design is overspecified for the load.

Nice analogy, except that today's processors have very little that resembles those gears yet. A truck with few gears isn't very efficient at driving up slopes or driving on the highway, or both. So we need processors to not only throttle the gas, but be able to switch gears to adapt to different conditions. We have multi-core and SIMD for basic TLP and DLP extraction, but there's lots more that can be done to increase throughput and lower power consumption. Together with adapting to the workload type with things like long-running wide vector instructions and lowering the voltage/frequency, CPUs would offer plenty of the reconfiguration you're looking for.
The sort of device I'm discussing is the endpoint of the subject we are debating. It's not a conclusion of which solution is superior, but the point when what we are discussing becomes irrelevant. It is a futurist's prediction and I stated it is far off, probably past the last lithography node.
The footprint of gates, wires, and the units and data paths of a long-SIMD design were set in stone, or set in silicon and projection masks in the factory. This is not the flexibility I'm speaking of.

I may need some detail on how long-running SIMD is a change from what is currently being done, because the way I see it described is exactly what is being done, with the exception that it is being done so aggressively that it forces pipeline stalls and screws with the OoO engine.

My point was that despite substantially increasing the throughput per core, Haswell's power consumption will still be lower (per core). And they're only one process node apart.
That's true if the top bins of Haswell EX have more than 10 cores.
Westmere EX had 130W over 10 cores.
We'll have to see where in the 160-180W range the big Haswell chips fall, as the core count and TDP can be combined to fall above and below the per-core consumption of Westmere.
Driving cores to the upper clock range is sufficient to enter very bad power/performance scaling, so in part this may be due to Intel's taking away a self-imposed limitation on Westmere, or the best way for Haswell to be a sufficient leap in general performance over the chips currently residing at 130 and 150W.


Junctionless transistors on the other hand have very promising scaling characteristics. It's a more desirable technology than being forced to go the NTV route. But I'm not pinning all my hopes on junctionless transistors. Intel is also for instance experimenting with III-V TFETs, which might operate at 0.3 Volt but without sacrificing clock frequency. So my main point was that with so much R&D now going into lowering the power consumption of transistors, there's bound to be some progress that makes current trends too pessimistic.
Waiting on III-V is 8 or more years. TFET is something more speculative and may be a decade or more.
Other than the slowly creeping replacement of silicon, this is starting to bump into other long-term attempts at replacing silicon and replacing lithography.
We could be left awkwardly waiting years for something awesome to come along.

It wouldn't fare worse. It would fare just as badly.
The original Pentium ran at several hundred MHz and continued to run in the same ballpark with NTV.
Mobile GPUs run at several hundred MHz.
Niche HPC hardware and FPGAs run at several hundred MHz.
They can all do as horribly as they do right now, just many times more power efficiently.

The transistor cost has already been going to zero for decades, but the transistor count has been increasing at roughly the same pace. So area cost isn't going to zero.
The rise in the count is why the cost goes to zero, and area is a critical component in the cost equation. If area didn't go down or stay constant while the transistor count doubled, Moore's law wouldn't work.
On top of the packaging and processing costs that put a minimum cost for anything manufactured, the area=$ part is why per transistor costs are negligible.

And even though not all of the transistors can be active at the same time, there's still an increasing absolute amount that can.
20-30% per node, optimistically. That means that the general-purpose silicon can get its 20-30% increase in transistor count, and then there's 70-80% of the chip that they can put something in or just have unexposed wafer.

Also, there's large portions that can be gated off temporarily even during high performance operation.
Already being done.

I'm not denying the problem of dark silicon, but once again it looks like you're interpreting something that is in fact highly undesirable as something that's somehow an advantage to GPUs.
I consider it a reality. It's not an advantage so much as it undermines your cost argument because everybody has factored it in already. Specialized silicon that is off is part of the 80% (and increasing) of the transistor count that is off anyway. I don't see it as an advantage to GPUs. It just makes me shrug when the objection to having it is that the GPU's transistors won't be on if not running an appropriate load.
It costs little to nothing when on a load where it contributes nothing, but when it does kick in, it has a high upside in efficiency, peak performance, and product utility.
It seems like a decent tradeoff in my eyes.

Managing and coalescing the data movement doesn't come for free. It worsens the data locality and you'll end up running latency sensitive code on the GPU. It works form the point of view of minimizing the data transfer overhead, but it's making things less efficient elsewhere.
The lack of a free lunch doesn't seem like a strong detraction from anything.
There is absolutely nothing we've discussed that comes for free.
It strikes me as extra baffling because the earlier part of this discussion concerning long-running SIMD is a design that somehow knows the workload it's running and sort of adjusts itself.

It strikes me as very disingenuous to say that only one type of core can have access to this knowledge.
It's particularly true since major shifts in unit activity do incur costs, as we see with Sandy Bridge and its warmup period and known performance penalty for excessively switching between SIMD widths.

There are a lot of transfers and costs that can be considered acceptable if they are within the same ballpark in terms of latency and overhead for incidental events such as that.
If Intel is free to caution programmers not to do X, or risk wrecking performance on SB, the same leeway can be granted elsewhere.

This may sound like a simple matter of optimizing it until you reach the right balance, but it's a veritable programming nightmare.
I don't consider it simple. It's what everyone is either doing in some part already where the costs are justified, and researching in expanding.

To make matters worse, things are not getting any better for heterogeneous systems. Bandwidth and latency doesn't increase at the same rate as computing power. So moving data between the CPU and GPU becomes ever more costly. The only solution is unification.
Or just moving them within millimeters of each other and using physical integration to provide growth in bandwidth.
Whatever gets the job done works for me.

Yes, the hardware prefetcher isn't currently aware of long-running SIMD instructions and their latency hiding qualities. So it would have to be made aware of that. This isn't hard, it's just a gear shift.
I would say that the caches, interconnect, and memory controller would be in the same boat.
And that extra coalescing doesn't come for free, if you consider that a valid argument.

Did unifying the GPU cores negate their design features for vertex and pixel shaders? No, they just cater for both now, and leave features unused when not needed by the shader.
The actual hardware demands for the two types in terms of units and data paths weren't that dissimilar.

Likewise, long-running vector instructions would just make the CPU cater for high efficiency DLP extraction, which allows to gate some features aimed at ILP extraction. Those ILP features are still highly necessary for sequential scalar workloads. Again, just another gear shift.
The physical circuits and units involved are massively over-provisioned for that use case, and the cost of that is significantly higher than zero.
Every load and store would be rammed through a memory pipeline designed for 4GHz OoO speculation and run by a scheduler, retirement logic, and bypass networks specified to provide peak performance and data transfer to portions of the core you declare are fine to be unused.
The actual physical stuff in the core would know the difference. It's why shifting gears doesn't let a double-decker bus drive under a low bridge, to continue the labored metaphor.


Just for arguments sake, he could be running the OS on the GPU and graphics on the CPU, with both pulling data from the other side.
What user program has the authority to move the OS threads anywhere?
I guess it is true that if a programmer has more power than he should have and he does pointless things that there could be a problem. It's sort of solved by the growing trend of the chip's cores, microcontrollers, and firmware very quietly overriding what software thinks is happening.

I really don't think that disguising a heterogeneous system as being a homogeneous one fixes things. It might be an improvement on average, but it's really just another convergence step toward full unification.
Improvement on average is a good thing in my eyes.

It's only a matter of time before the cores have to be physically unified as well, to prevent having to run code on the wrong type of core because bandwidth and/or latency don't allow to migrate it. This problem doesn't occur when all cores are equally capable of extracting ILP, DLP and TLP.
Threads migrate all the time, even between homogenous cores. The costs are measurable and can be scheduled and managed if they aren't explicitly spelled out by the software so that the chip knows what kind of core a thread needs.

What standards?
Haswell's new power states and management include changes that fall under the Platform Activity Alignment header.
The chip's microcontroller, the OS, and the hardware negotiate at a low level the latency tolerance of various interrupts, and the timing of OS events so that there are longer stretches of time between wakeup periods. This is on top of a power control unit that can at its discretion ignore what the OS says a core should be at.

Intel is also providing lists of recommended system components and firmware versions for low-power Haswell platforms.

Removing memory space barriers means adding hardware features. So you can call it removal all you want, it's still convergence.
What the software thinks the silicon does and what it actually does aren't the same thing. The separation of memory pools is not a requirement of specialized silicon, and the ISA is an implementation detail.

Seriously, what you described is definitely closer to software rendering and further away from heterogeneous hardware. Compilation and scheduling are latency sensitive tasks, and since you'll want each core to adapt individually, you want each core to have CPU-like qualities.
What I described is something meant for everything. I guess that means that software rendering falls in that set, as do the optimizing compilers and firmware used by GPUs.
It also leverages things like the run-time optimizations used by a JIT VM or various forms of binary translation. There's a wide window of difference between lightweight scheduling and occasional re-optimizations or function changes and static compilation.

There's a long history of really cool stuff down each individual avenue, and people want to unify that.
The result of it is that it puts at least another thin layer between all but bare-metal code and software, but it can use dynamic information to make better choices at run-time.

In theory you could just pair up each GPU core with a CPU core, but since you want a shared ISA as well we can at least unify instruction fetch and decode.
It doesn't seem to be strictly necessary. The throughput core is meant for workload intensive and generally latency tolerant work. The link only needs to be that strong if I have a latency-tolerant workload that is simultaneously latency-sensitive.
The same argument would go against running a workload on multiple threads and they go to different homogenous cores. Who knows what could happen?

Unifying fetch and decode might be a choice, but it too doesn't seem to be strictly necessary since the fetch and decode requirements can be different between cores. There would be no software-visible difference.

Assuming that at this point the CPU side of each core uses out-of-order execution and the GPU side uses in-order execution, you still need ways to synchronize data between them.
It is true that out-of-order and in-order are orthogonal to synchronization.

So the memory subsystem also has to be tightly interwoven, especially since you also want a uniform memory space.
It is true that if cores are cache-coherent, they should be interconnected somehow. I wasn't aware this was a new development.


Sure, but the bookkeeping doesn't increase with increasing SIMD width. And with long-running SIMD instructions the bookkeeping cost can even go down. So combined it would provide a substantial improvement in performance/Watt.

CPUs have a bypass network which sends results straight back into the ALUs. The 16-entry ORF in Daly's presentation is pretty close to being the same thing. Convergence.
The bookkeeping hardware and wires are as physically large or as long with long-running SIMD as they are with regular instructions.
I've not disputed that some mild power improvements can be had after compromising the OoO engine. The mass of extra wires and extra distances they cross won't be changing, and on top of the portion of the core that can't be gated, units that are being used will be sized to serve a much larger core. While the end result of an OoO engine that is kneecapped is the same, the units meant to service a core with higher instruction throughput and speed will provide a big core's energy cost for a small core's outcome.

A bypass network is a part of the processor pipeline that skips the otherwise necessary write-to-register/read-register-again round trip, and bypassing is usually done automatically by hardware via tag checks on the bus.

The value on the bypass bus is whatever value came out of the ALU, irrespective of the destination register, and unless the ALU performs the same operation twice, that value is gone afterwards and further accesses will need to come from the register file.
The ORF is a software-managed set of registers used to keep excessive evictions from occurring from the RFC, and can service multiple accesses across multiple cycles. It is guided by the compiler's choices in register ID usage, and because the source value is in the instruction, no tag checking is needed. It's very much not a bypass network.

Even if it weren't a software-managed data store, and the ORF were a bypass network, the fact that bypassing has existed in countless specialized and generalized architectures is barely a better sign of convergence than the fact that Nvidia's design also uses instruction decoders.
 
Larrabee is a GPU style architecture, not a CPU. Unless you write GPU style code for it, you're stuck with x86 cores somewhere between the 486 and pentium 1 in architecture (dual issue like P1, but no MMX). The most similar architecture I can think of is AMD's GCN, for the HD7000 series.
Larrabee didn't have any "GPU-style" fixed function hardware apart from its texture filtering units. The whole graphics pipeline (including blending, depth buffering, triangle setup, rasterization) was just pure x86 software. You could implement a similar software pipeline on any CPU.

The recently released Xeon Phi is basically Larrabee without the texture filtering units. It can run four threads per core, just like its predecessor, and it has plenty of simple in-order cores. But this doesn't make it a GPU, unless you also consider almost all the other high end supercomputer/database processors to be GPUs: IBM PowerPC A2 sports 16 in-order cores that run 4 threads each (featured in Sequoia and many other supercomputers in the top 50 list), Oracle Sparc T2/T3/T4 have up to 16 cores and run 8 threads per core (featured for example in many database servers). Even the monolithic (out-of-order) IBM Power7/8 CPUs run 4 threads per core. It seems Intel is the last one running only two threads per core in their high end CPUs. Intel is the only one that has to balance between (often single threaded) consumer applications and server workloads. 4-way SMT does nothing if the software has only a single thread.

sebbbi said it was slightly faster. I'm not that familiar with it myself. AFAIK people saw results in 2009 (or was it 2008?) and were pretty underwhelmed, so it couldn't have been that amazing. Of course, the raw attainable compute would have been a good order of magnitude higher than Xenos + Xenon.
Slightly faster might have been an understatement by me. It's sometimes hard to remember that console performance hasn't improved a single bit in 7 years, while at the same time PC GPU performance has skyrocketed :(

Intel was running Gears of War 1 in their test setup, and Gears 1 required 25 Larrabee cores to reach a stable 60 fps (at 1600x1200). The final Larrabee chip would have had either 32 or 48 cores (depending on the source), so it would have had enough performance to render Gears 1 at a stable 60 fps at 1920x1080. That's roughly four times the Xbox 360's graphics performance. So yes, Larrabee was faster than current generation consoles, but still considerably behind the PC GPUs of 3 years ago (when Larrabee was announced).

Link to the paper:
http://software.intel.com/sites/default/files/m/9/7/6/9/c/18198-larrabee_manycore.pdf

Looking back at it from today's perspective, it seems that Larrabee actually did pretty well in the benchmarks, because the rendering code was designed for a GPU and simply ran in a software rasterizer trying to emulate the GPU. As Andrew said earlier, you can use more sophisticated rendering methods on the CPU (instead of relying so much on brute force). This can give you a big boost in performance and image quality. Quadtree based light culling is one good technique. Logarithmic shadow mapping (and other more advanced shadow mapping techniques) is another (http://gamma.cs.unc.edu/LOGSM/). GPUs are also slow at rendering one-pixel-sized triangles (pixel quad overhead drops performance to 1/4). A CPU can select the rasterization algorithm based on triangle size/properties. In the future we will have more and more detailed geometry (smaller triangles). GPUs will have to adapt.
No, four hardware threads are insufficient to hide texture latency (100s to 1000s of cycles). You need to actually have a software mechanism to store and reload different contexts - that's what I mean by "software fibering" (i.e. extremely lightweight, cooperatively scheduled "fibers" with minimal context save/restore). It turns out you can make it pretty efficient for domain-specific stuff like shading languages.
I agree that software fibers should be a good technique if you want to run unknown shader code on a CPU (via a common front end), rather than hand-coding each rendering step yourself (which lets you manually schedule / prefetch things based on your algorithm).
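To make the fiber idea a bit more concrete, here is a minimal sketch of the scheme as I understand it (not taken from any actual shading compiler): each "fiber" keeps only the handful of live values it needs, yields when it issues a texture fetch (after a prefetch), and a round-robin scheduler resumes it later when the data is hopefully in cache. The texture layout and the saved state are purely illustrative.

```cpp
// Sketch: cooperatively scheduled "software fibers" for shading.
// A fiber yields at a texture fetch instead of stalling; the scheduler runs other
// fibers in the meantime so the prefetched line has arrived when the fiber resumes.
#include <xmmintrin.h>   // _mm_prefetch
#include <cstdint>
#include <vector>

struct Fiber {
    int         stage = 0;                // which part of the shader to resume (-1 = done)
    const char* pendingAddr = nullptr;    // address of the outstanding texture fetch
    float       u = 0.0f, v = 0.0f;       // illustrative live state
    float       accum = 0.0f;
};

// One cooperative step of a hypothetical two-stage shader.
void step(Fiber& f, const uint8_t* texture, int texWidth)
{
    switch (f.stage) {
    case 0:
        // Compute the fetch address, prefetch it, then yield instead of stalling.
        f.pendingAddr = reinterpret_cast<const char*>(
            texture + (int(f.v) * texWidth + int(f.u)) * 4);
        _mm_prefetch(f.pendingAddr, _MM_HINT_T0);
        f.stage = 1;                      // resume here next time
        break;
    case 1:
        // Other fibers ran in between; the cache line should be in L1/L2 by now.
        f.accum += uint8_t(*f.pendingAddr) / 255.0f;
        f.stage = -1;                     // finished
        break;
    }
}

void run_fibers(std::vector<Fiber>& fibers, const uint8_t* texture, int texWidth)
{
    for (bool anyLive = true; anyLive; ) {      // round-robin until all fibers finish
        anyLive = false;
        for (Fiber& f : fibers)
            if (f.stage >= 0) { step(f, texture, texWidth); anyLive = anyLive || f.stage >= 0; }
    }
}
```

The context save/restore here is just a stage index plus a few floats, which is what makes this so much cheaper than OS threads or full hardware contexts.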

Actually, if you had both sophisticated out-of-order execution and 4-way SMT (Power7/8 at least has both), most of the texture latency could be hidden pretty well. Haswell's ROB has room for 192 instructions. If you are rendering data to the g-buffer first (and doing all the ALU heavy stuff later), your shaders are not going to have many instructions (likely fewer than 50). One AVX2 instruction processes eight pixels at once, and we likely have ROB visibility into at least one other 8-pixel group (and all the remaining texture fetches in the currently processed eight pixels). So the CPU could be fetching most of them simultaneously (and not stalling on each individual fetch).

Thoughts about g-buffer shader texture sampling:
- We can assume 3 texture fetches to 128 bit DXT block compressed formats. One 64 byte cache line holds four 4x4 DXT blocks (8x8 area is better than 4x16 area, because of fetch locality).
- 7*7 / 8*8 = 77% of (random) bilinear texture fetches need only a single 64 byte cache line (i.e. the sample doesn't land in the 0.5 pixel outer border of an 8x8 block). In trilinear filtering, four higher-mip cache lines share a single cache line of lower-mip data. As texture coordinates are linearly interpolated across polygonal surfaces, the texture access patterns are usually cache friendly.
- Virtual texturing is free, because the CPU can reserve a big virtual memory area (1 TB) that can hold all the texture data, and load 4 KB pages from HDD + commit them on demand. No "shader" instruction is needed for that... (see the sketch below)
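As a rough illustration of that reserve-then-commit idea, assuming a Linux-style mmap/mprotect API (on Windows it would be VirtualAlloc with MEM_RESERVE/MEM_COMMIT); the offsets and sizes here are placeholders, not a real streaming system:

```cpp
// Sketch: reserve a huge virtual address range up front for all texture data, then
// commit and fill individual pages only when a tile is actually streamed in.
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

int main()
{
    const size_t ARENA_SIZE = 1ull << 40;      // 1 TB of address space, no RAM yet

    // Reserve address space only: PROT_NONE + MAP_NORESERVE commits nothing.
    void* arena = mmap(nullptr, ARENA_SIZE, PROT_NONE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (arena == MAP_FAILED) { perror("mmap"); return 1; }

    // When the streaming system wants a 64 KB texture tile, make its pages usable
    // and read the data from disk into them (placement chosen by the page allocator).
    const size_t tileOffset = 0;               // illustrative placement
    const size_t tileSize   = 64 * 1024;
    mprotect(static_cast<char*>(arena) + tileOffset, tileSize, PROT_READ | PROT_WRITE);
    // ... pread() the tile contents from the HDD into that range ...

    // The sampling code then just dereferences pointers into the arena; the page
    // tables do the indirection, so no extra "shader" instruction is involved.
    munmap(arena, ARENA_SIZE);
    return 0;
}
```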

Thoughts about g-buffer shader ALU instructions:
- One AVX2 instruction would always process eight pixels at the same time (SoA style, like GPUs do). It could be for example a 4x2 block of pixels, while the other HT thread processes the 4x2 block below it (for better L1 cache locality). A single core would thus be processing a 4x4 block at a time. It could then use Morton order to process the blocks as cache efficiently as possible (locality is important for caches).
- For each pixel you need instructions for:
-- Interpolate the input texture coordinate
-- Calculate bilinear weights for the texture coordinate. As all textures share a single texture coordinate (usual for virtual texturing), you do not need to do this 3 times.
-- Fetch eight samples for each texture (4 for bilinear, x2 for trilinear).
--- A DXT fetch isn't actually that hard to do if you store the second DXT color as a difference to the first: extract the two-bit interpolator, then do a single FMA (col2*interp+col1). Most of the bit extract logic (mask calculations) can also be done once per pixel (as all texture fetches share the same texture coordinate). See the scalar sketch after this list.
--- Actually you can do even better than that if all the four bilinear samples are inside the same 4x4 DXT block. In that case you only need to calculate the average of the four interpolators (there's an AVX2 instruction for this) and interpolate between the colors with it.
--- AVX2 is actually very helpful in DXT decompression, as it adds support for 256 bit integer operations (including 32 x 8 bit). So we have four wide 8 bit integer SIMD per pixel per cycle (we are processing eight pixels simultaneously). That's handy for texture processing (8888 RGBA output to g-buffer).
--- Creating a custom DXT-style format for the game's needs would improve cache utilization when the texture access patterns are not good (as all the 10-12 texture channels would be always in the same cache line).
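Here is a minimal scalar sketch of the delta-stored-endpoint decode mentioned above. It assumes a custom DXT-style block (as suggested in the last point) where the 2-bit index maps linearly to a weight, which is simpler than real DXT1's index ordering; the struct layout is purely illustrative.

```cpp
// Sketch: decode one texel from a custom DXT-style 4x4 block where the second
// endpoint is stored as a delta against the first, so a texel decode is a single
// FMA per channel: color0 + weight * delta.
#include <cmath>
#include <cstdint>

struct BlockDelta {
    float    color0[3];   // endpoint 0, pre-expanded to float RGB
    float    delta[3];    // (endpoint1 - endpoint0), stored instead of endpoint1
    uint32_t indices;     // 16 x 2-bit interpolators, row-major
};

// Decode the texel at (x, y) inside the 4x4 block.
static inline void decode_texel(const BlockDelta& b, int x, int y, float rgb[3])
{
    // Extract the 2-bit interpolator and map it linearly to a weight in {0, 1/3, 2/3, 1}
    // (a custom ordering, simpler than real DXT1's index table).
    uint32_t bits   = (b.indices >> (2 * (y * 4 + x))) & 3u;
    float    weight = float(bits) * (1.0f / 3.0f);

    for (int c = 0; c < 3; ++c)
        rgb[c] = std::fma(weight, b.delta[c], b.color0[c]);   // one FMA per channel
}
```

The SIMD version would do the same thing across eight pixels at once, with the interpolator extraction shared as described above.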

I don't think the g-buffer rasterization step would actually be that bad for the CPU either. The CPU can at least utilize its ALUs while rendering to the g-buffer. GPU ALUs are usually completely wasted, as g-buffer rendering consists mainly of fixed function hardware usage: triangle setup, interpolation, ROPs, z-buffering and texture fetch/filtering. 2048 shader cores of modern GPUs are just idling (moving data from register that receives texture fetch to register that is output to render target). The only real math you likely have is a 3x3 matrix * vector multiply to transform the normal vector from tangent space to view space, and that is 3 (scalar) MUL + 6 (scalar) FMA = 9 scalar instructions per pixel, or 9 AVX2 instructions for an eight-pixel tile (sketched below).
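For reference, a sketch of that 3x3 tangent-to-view transform over an eight-pixel SoA tile with AVX2/FMA intrinsics: exactly 3 multiplies and 6 FMAs on the vector data, matching the instruction count above. The struct and function names are just illustrative.

```cpp
// Sketch: transform eight tangent-space normals to view space in one go (SoA layout)
// with AVX2/FMA intrinsics. Requires a Haswell-class CPU (AVX2 + FMA).
#include <immintrin.h>

struct Normals8 { __m256 x, y, z; };   // one component of eight normals per register

static inline Normals8 transform_normals(const float m[9], Normals8 n)
{
    // The _mm256_set1_ps broadcasts of the matrix would normally be hoisted out of
    // the per-tile loop; they are inlined here to keep the sketch self-contained.
    Normals8 out;
    out.x = _mm256_fmadd_ps(_mm256_set1_ps(m[2]), n.z,
            _mm256_fmadd_ps(_mm256_set1_ps(m[1]), n.y,
            _mm256_mul_ps  (_mm256_set1_ps(m[0]), n.x)));   // m00*nx + m01*ny + m02*nz
    out.y = _mm256_fmadd_ps(_mm256_set1_ps(m[5]), n.z,
            _mm256_fmadd_ps(_mm256_set1_ps(m[4]), n.y,
            _mm256_mul_ps  (_mm256_set1_ps(m[3]), n.x)));   // m10*nx + m11*ny + m12*nz
    out.z = _mm256_fmadd_ps(_mm256_set1_ps(m[8]), n.z,
            _mm256_fmadd_ps(_mm256_set1_ps(m[7]), n.y,
            _mm256_mul_ps  (_mm256_set1_ps(m[6]), n.x)));   // m20*nx + m21*ny + m22*nz
    return out;
}
```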

As I said earlier, the other rendering steps (lighting, post processing, shadow map rendering) are fine for a CPU. The CPU doesn't need any kind of SMT for them, as all memory accesses in those steps are perfectly linear.
"Tracing" cones through volumetric data structures is a very different algorithm than intersecting individual rays with triangles. Tracing cones is actually totally predictable memory access patterns and the natural LOD that falls out of walking up a mip or similar chain brings a lot of coherence back. In fact, you get the sort of strange effect that blurry effects are cheaper than sharp ones (which you can kind of get at with path tracing + fancy reconstruction but not as directly).
Yes, who doesn't love SVOs because of that... and the automated perfect (connected) LOD :)
 
Either that or SwiftShader is relying on unsafe compiler optimizations that only work correctly most of the time.
There are five accuracy settings in the public demo, labeled IEEE, WHQL, Accurate, Partial and Approximate. The default is Accurate, which causes a tiny few WHQL tests to fail, but does not result in rendering errors in any game I know of. That said, even the WHQL or IEEE settings barely affect performance.

There are also nine different run-time compiler optimizations to choose from, all of which are generic and preserve code correctness. The default is to only use Instruction Combining, aside from the hard-coded optimization passes. Feel free to play with these settings and find any errors.
 
Nvidia's OpenCL implementation really stinks.
That really shows they're throwing the baby out with the bath water when it comes to consumer GPGPU. For it to be anywhere near useful, a standard is needed that lets developers write code only once and not spend time tweaking for specific architectures. I'm afraid HSA will suffer the same fate. AMD is investing a lot of money into it, but unless they get NVIDIA and Intel on board, it's not going to pay off. AVX2, on the other hand, will do great since it can be used to implement any compute language and to auto-vectorize legacy code.

That said, NVIDIA's apparent decision to no longer invest in consumer GPGPU hardly matters to my conclusion about ray-tracing performance. Even the HD 7970 only outperforms the CPU by 3x while consuming 2x more power and costing more than that CPU (and you still need a CPU anyway, so you can't actually look at the GPU's power and cost in isolation). And again, it will look worse against Haswell.

Besides, regardless of how much an IHV's implementation of OpenCL "stinks", it's really sad how poorly these GPUs perform against the CPU, for which OpenCL wasn't even developed in the first place. So many TFLOPS, so little result. It tells us something about how fragile the GPU architecture and its driver really are when used for something outside their comfort zone. Meanwhile the CPU is making big strides to fix its weaknesses at a relatively low cost, and there's plenty more potential.
As for AMD, their OpenCL implementation is merely unstable. I've had it crash the computer (not just the program, the whole OS, forcing a reboot) with code that runs perfectly fine on Nvidia hardware (apart from the incorrect results from an unrelated regression from *2 years ago* that they still haven't fixed, but never mind that...). Actually, that particular piece of code breaks Tesla too, but it does run on Fermi and Kepler. This is on an older card (HD 5770, so Juniper), so maybe it works on GCN, but it's still not good.

My dad also ported my work to OS X, and has had his own set of problems with bugs in OpenCL, some quite severe. I think it took Apple 2 months to get the new MacBook Pro to even get through OpenCL initialization without crashing hard and forcing a reboot. On both CPU and GPU devices. It blew up on clCreateBuffer, IIRC.
This also shows why the industry eventually needs/wants software rendering. It guarantees that what you see on your development system is what customers see on their system. You basically ship the same 'driver' with your application. Better performance from dedicated hardware isn't worth much if it only actually runs reliably on one class of hardware, and the support costs of graphics issues can be very substantial. Outside the gaming industry, consumer application developers steer clear of using the GPU, despite its theoretical potential, precisely because of its unreliability and the high expertise it requires. With a plethora of APIs and compute languages of increasing complexity, this isn't getting any better for dedicated hardware, while software implementations remain highly reliable. So all we really need is CPUs with higher throughput and reduced power consumption. That's exactly where they're heading.
So basically, I don't feel like any sort of OpenCL benchmark is particularly valid, at least not when comparing hardware, since the state of the implementations is so bad.
But this is really the state of things. If you want a fair comparison, you have to run the same benchmark and it has to output the same result. So at some level you have to run the same code. And I don't think looking at an OpenCL benchmark is somehow unfairly biasing things in favor of the CPU; it's the closest thing we have to a GPGPU standard. It's true that the implementations have to be close to optimal to correctly compare the hardware, but if you think that both NVIDIA's and AMD's OpenCL implementations are really poor, then that means OpenCL will only help sell more multi-core CPUs with wide SIMD units. This goes to show how important the software side of things really is. Fabulous hardware specifications and exclusive benchmarks are worthless; the only thing that counts is what the average consumer's application gets out of it. So any way you look at these OpenCL results, things are more promising for the CPU.
 
My numbers were for diffuse rays, not primary, so 20 Mray/s is fairly consistent with your 15 Mray/s.
The 15/20 number would be most directly comparable to the 58 Mray/s number in the NVIDIA paper, not to the fairly small conference scene. And even then, the San Miguel scene is significantly less expensive because it has large polygons mixed with small ones and lots of occluded/offscreen geometry. The scene I tested had 5x more polygons *and* pretty much all of them were onscreen and subpixel, which introduces significantly more divergence in the traversal structure.

Regardless of how you want to hand-wave these numbers, you're talking 1-2x at iso-power at most for GPUs, not 10x or some other ridiculous claim. And if NVIDIA's claim that this is FLOPS-bound is true, Haswell is going to bump that pretty significantly back the other direction...

For what it's worth, there are other cases where CPUs lag far behind. A good example is transcendental functions (sin, cos, log, exp, etc.), where the gap, at least for single precision, is quite large.
Sure, but that's partially because GPU transcendentals are not as accurate. If you want 0.5 ulp it's going to be just as expensive, and conversely you can use cheaper instruction sequences on the CPU for similar accuracy to what GPUs deliver (see ISPC's different math implementations, for instance). Certainly GPUs do still win there, but that's not really an architectural thing... it's trivial to add hardware to make transcendentals faster on CPUs, but it's not exactly a bottleneck in most applications. There are a few workloads that are basically trig benchmarks (some financial stuff, etc.), but it's hardly worth optimizing an architecture around those.
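As an illustration of that accuracy/speed trade-off (this is the well-known parabola approximation, not ISPC's actual code): a sine valid on [-pi, pi] with roughly 1e-3 maximum absolute error, using only a few multiplies and adds per value, so the same code vectorizes trivially per SIMD lane.

```cpp
// Sketch: a cheap sine for x in [-pi, pi], trading accuracy for speed.
// Maximum absolute error is roughly 1e-3 (nowhere near 0.5 ulp), but it costs only
// a handful of mul/add operations and has no table lookups or branches.
#include <cmath>

static inline float fast_sin(float x)
{
    const float PI = 3.14159265358979f;
    const float B  =  4.0f / PI;
    const float C  = -4.0f / (PI * PI);
    const float P  =  0.225f;                  // refinement constant

    float y = B * x + C * x * std::fabs(x);    // parabola through (-pi,0), (0,0), (pi,0)
    return P * (y * std::fabs(y) - y) + y;     // one refinement step pulls the error down
}
```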

All indications point to the gap actually being wider with current hardware, since GPUs have more than doubled since then, but CPUs have only gotten around 50% faster.
Uhh, AVX alone should net you 2x on float math, and I think you'd be hard pressed to argue that Ivy Bridge is otherwise the same speed as (or 50% slower than) a Phenom X6... And Haswell will obviously give another huge bump.

Seriously, grab a modern CPU and start playing around with ISPC. I think you'll be surprised with how decently you can do in most workloads.

It is worth noting that when my dad ported the thing to OS X he got around a 5x difference, but there we're comparing the i7 in the MacBook Pro to a mobile-class GPU (I want to say a 650M), which by my comparisons is around 4x slower than my GTX 680. That gives around a 20x performance gap between the 680 and the i7.
Right, because a mobile i7 is exactly the same speed as a desktop one too... Seriously, you want to be comparing things at iso-power if you're going to draw any architectural conclusions. And yeah, if you're going to write GPU-specific code, at least take the time to copy/paste most of it into ISPC too and give it a run there on a decent CPU. I'm not saying you'll never see wins for the GPU (IFS are a great example where GPUs are well-suited), but these 40x+ wins are simply not possible at iso-power with good code, excepting cases that can make use of texture filtering.
 
Larrabee didn't have any "GPU-style" fixed function hardware in addition to its texture filtering units.
Indeed. Larrabee had a wider vector unit and a few specific instructions to do rasterization more efficiently, but honestly the math of rasterization is not really a concern going forward. Larrabee was certainly more CPU than GPU, although there were definitely aspects of each in how you write code for it (but I guess that's kind of the point of the discussion here... GPU-style code tends to run well on CPUs too).

So yes, Larrabee was faster than current generation consoles, but still considerably behind the PC GPUs 3 years ago (when Larrabee was announced).
Ah, so apparently someone did mention some numbers publicly...
http://www.tomshardware.com/news/intel-larrabee-nvidia-geforce,7944.html

Larrabee was intended to be released in 2010, so indeed it would have been somewhat slower than Fermi in the same year, but that's not too bad for a first generation product. And of course it is far more flexible in terms of custom rendering pipelines; there are a few graphics workloads where I still haven't been able to match Larrabee's performance with my 680/7970, actually... mostly due to GPU programming model limitations. Overall though, it's still a tough sell when the vast majority of the market is DX/GL and conventional GPUs are very optimized for those specific pipelines.

Pretty interesting thoughts on texture sampling... I wonder what SwiftShader and similar CPU rasterizers are doing and where their bottlenecks lie. When playing with WARP it's clear that texture sampling is basically the entire bottleneck, followed by ROPs (especially MSAA). I think some of this could be improved in software, but there's certainly something to be said for these pieces of GPU hardware, especially when they are able to work with reduced precision data types, which massively helps power.

It could then use Morton order to process the blocks as cache efficiently as possible (locality is important for caches).
Right, although every time I've written a CPU rasterizer, binning/tiled rendering has far and away been the fastest and most power-efficient way. And as I know from looking at traces, game developers still love throwing 200x overdraw of particles at rendering APIs, and in those cases CPU software binning rasterization can actually outperform high-end GPUs :p
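For what it's worth, the core of such a binning pass is tiny. A rough sketch (the Triangle type, tile size and bin layout are purely illustrative) that assigns each triangle's screen-space bounding box to 64x64-pixel tiles, after which each tile can be rasterized independently out of cache:

```cpp
// Sketch: bin triangles into 64x64-pixel screen tiles by their 2D bounding box.
// Each tile is then rasterized independently, which keeps the framebuffer tile and
// its depth data in cache even under heavy overdraw (e.g. particles).
#include <algorithm>
#include <cstdint>
#include <vector>

struct Vec2     { float x, y; };
struct Triangle { Vec2 v[3]; };                // screen-space vertices (illustrative)

static const int TILE = 64;

void bin_triangles(const std::vector<Triangle>& tris, int width, int height,
                   std::vector<std::vector<uint32_t>>& bins)   // one triangle list per tile
{
    const int tilesX = (width  + TILE - 1) / TILE;
    const int tilesY = (height + TILE - 1) / TILE;
    bins.clear();
    bins.resize(size_t(tilesX) * tilesY);

    for (uint32_t i = 0; i < tris.size(); ++i) {
        const Triangle& t = tris[i];
        const float minX = std::min({t.v[0].x, t.v[1].x, t.v[2].x});
        const float maxX = std::max({t.v[0].x, t.v[1].x, t.v[2].x});
        const float minY = std::min({t.v[0].y, t.v[1].y, t.v[2].y});
        const float maxY = std::max({t.v[0].y, t.v[1].y, t.v[2].y});

        // Clamp the bounding box to the screen and convert to tile coordinates.
        const int tx0 = std::max(0, int(minX) / TILE);
        const int ty0 = std::max(0, int(minY) / TILE);
        const int tx1 = std::min(tilesX - 1, int(maxX) / TILE);
        const int ty1 = std::min(tilesY - 1, int(maxY) / TILE);

        for (int ty = ty0; ty <= ty1; ++ty)
            for (int tx = tx0; tx <= tx1; ++tx)
                bins[size_t(ty) * tilesX + tx].push_back(i);   // store triangle index
    }
}
```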

2048 shader cores of modern GPUs are just idling (moving data from register that receives texture fetch to register that is output to render target).
Right, but to some extent in this discussion the ALUs on GPUs are one of their least interesting parts. It seems to be more and more clear that we *are* able to just slap more math power on big cores and do pretty well with it, so it's the other architectural differences that are more interesting to me (scheduling, texture sampling, etc). Your thoughts on these are quite interesting, so thanks for posting!
 