Software/CPU-based 3D Rendering

Genefer uses fp64 for its computation. That the GeForce 680 is slower than the 580 has little to do with "scheduling".

Edit: I see the 560 Ti being faster than the 680. Now that's more worrisome. I will run this benchmark tomorrow, when I have access to a 680.
 
I didn't.
Maybe not intentionally, but when you asked to compare K20 against Fermi, you didn't quite follow the conversation.
And K20 is representative of Kepler's GPGPU compute architecture...
Yes, but that doesn't mean that comparing it against any Fermi chips will allow you to make accurate conclusions about the architecture. GK107 is just as much a "representative" of the Kepler architecture, but it wouldn't be correct to compare it against GF110.
Wait, I thought it was an architectural discussion... I'm confused.
 
Nick said:
GK107 is just as much a "representative" of the Kepler architecture, but it wouldn't be correct to compare it against GF110.
Why not? I don't disagree, but I am curious as to your reasoning.

Nick said:
you didn't quite follow the conversation.
Oh, I think I followed it quite well.
 
But it's not adjusted based on the workload, which is what I was asking about. To balance a CPU's responsiveness and throughput for a given TDP, it has to adjust things based on the workload, per core.
That's pretty much what they do these days. You're basically writing out the same words Qualcomm uses to describe the power management for Krait in its marketing.
A core with dynamic voltage and frequency control is able to get information from activity counters, firmware heuristics, and possibly the OS scheduler to determine what the workload demands are.
Aggressively integrated gating and dynamic frequency adjustments have made their way into any power constrained environment.
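
Roughly speaking, the per-core governor being described boils down to something like this (a conceptual sketch only; read_activity() and set_pstate() are hypothetical stand-ins for the platform's activity counters and P-state interface, which in practice live in firmware/the PCU rather than in software):

// Conceptual per-core DVFS loop: step the P-state up when the core is busy,
// step it down when utilization falls.
#include <chrono>
#include <thread>

double read_activity(int core);          // hypothetical: fraction of unhalted cycles, 0..1
void   set_pstate(int core, int pstate); // hypothetical: 0 = lowest V/f, max_pstate = highest

void governor(int core, int max_pstate) {
    int pstate = 0;
    for (;;) {
        double util = read_activity(core);
        if (util > 0.90 && pstate < max_pstate) ++pstate;  // demand high: raise V/f
        else if (util < 0.30 && pstate > 0)     --pstate;  // demand low: lower V/f
        set_pstate(core, pstate);
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
}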

I do not follow why we're on this tangent as if this is a new concept, or how a core that can vary its voltage or clock gate isn't just the exact same set of circuits, only at a different voltage and with some of the clocks disabled.
I'm talking about computing devices that literally reconfigure themselves. A truck that shifts from second to third gear is still the same truck.

But you asked specifically: "how many more silicon nodes do you think we have left to hide this future in".
Silicon nodes are the primary means for increasing the core count of aggressive OoO straightline cores in a consumer device with one or possibly two dies for a socket (or BGA package if it's Broadwell).

I've posited the use of technologies and design choices that can allow designers to work around this, particularly in areas where silicon nodes are providing less than the necessary scaling despite the progression of Moore's law.

The power consumption might indeed be the trickiest part. But I don't see any reason for despair. First of all, CPU cores can double or quadruple the SIMD throughput (again) without costing an equal increase in power consumption, because it represents only a fraction of the power budget.
Properly supplying quadruple SIMD throughput is more expensive than you let on, and I've stated the position that, for the performance goals desired by 2018 or 2020 for Exascale, the default power budget is too high to begin with.
To reiterate, the proposed gains are modest and the baseline not good enough.

There's no way Haswell will consume more than Westmere, and that trend will most likely continue.
There are reports that there may be Haswell high performance SKUs with 160W+ TDPs.
Westmere stopped at 130 W.
Can you add some clarification on what you mean by this?

And then there's the opportunity for long-running wide vector instructions which allow a further reduction in power consumption. Next, there's the piecemeal introduction of NTV technology and adjusting the clock frequency based on the workload.
Modest gains insufficient for the order(s) of magnitude scaling desired, and adjusting clock frequency is officially old hat at this point.
Do you mean something more to adjusting clock frequency than I am interpreting?


And lastly, tons of research is going into lowering the transistors' power consumption now. Multigate transistors were an important breakthrough, and junctionless transistors could be the next major leap which make the ITRS projections highly conservative.
FinFET is quite impressive in the lower voltage domain, especially in more modestly clocked designs.
The improvement in the 4 GHz, 1V+ realm is back to the modest tens of percent.
I'm not sure why it's fine to pin hopes on one lab's silicon nanowires that may someday pan out, while a whole NTV Pentium that physically exists and has been manufactured has to be discounted.

We don't have to wait and see. The facts are already known. At the "optimal" operating point, the clock frequency is ~9x lower, while the power consumption is ~45x lower. To compensate for this loss in absolute performance, you'd need an order of magnitude more transistors. And that's just to keep the performance level. It offers a nice 5x reduction in power consumption, but at an insane increase in die size. Note that I didn't even calculate in the transistor/area increase due to NTV technology itself yet, nor any performance loss due to Amdahl's Law.
A design that targets density, an economy in complexity and transistors, and has latency-tolerant and highly parallel workloads should be very interested in this. It's just not handy for aggressive OoO desktop cores, but that's not enough to make me discount it.
A design that cuts transistors at the expense of general performance can still appeal to power-limited parallel computation, and the low absolute power consumption is very helpful if using high levels of integration. Die stacking can bring multiple layers of low-power silicon to bear, while also allowing stacking with memory, whereas a die with an order of magnitude more power consumption can severely constrain it.
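
To make the arithmetic in the quoted figures explicit (a sketch using only the ~9x and ~45x numbers above):

constexpr double freq_ratio  = 9.0;   // clock (and roughly per-core performance) drops ~9x at NTV
constexpr double power_ratio = 45.0;  // power drops ~45x at NTV

constexpr double perf_per_watt_gain = power_ratio / freq_ratio;      // ~5x better perf/W
constexpr double cores_to_match     = freq_ratio;                    // ~9x more cores/transistors for iso-performance
constexpr double power_at_iso_perf  = cores_to_match / power_ratio;  // ~0.2x, i.e. the ~5x power reduction, at ~9x the area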

Transistors may be getting cheaper, but only at the rate of Moore's Law, at best. Only niche markets where low power consumption is way more important than absolute performance can afford to have chips that nominally run at NTV voltage. The only commercially viable use for consumer products is for low idle power consumption.
So it's only an exponential curve. This does point to a widening of the scope, since we've moved beyond harvested energy-only products.

Slow in absolute speed. A GPU running at NTV voltage will decimate the framerate. That does compromise the user experience.
Why would a mobile GPU with a short pipeline, relatively simple design and operating point in the hundreds of MHz fare worse with NTV than a Pentium with a short pipeline, relatively simple design, and an operating point in the hundreds of MHz?

Wide out-of-order execution CPUs and DirectX 11 GPUs are coming to mobile devices. So the desktop is still the trendsetter. Regardless, the majority of people aren't gamers. They rarely use the GPU to its fullest. Again just look at the distribution of HD 2500s and HD 4000s. Business desktops benefit more from a quad-core than from a more powerful integrated GPU aimed at gaming.
They're coming to Windows mobile devices, and the Haswell mobile CPUs with integrated GPUs are deliberately expanding the area devoted to the GPU and lowering clocks to derive better performance per watt.

There's nothing wrong about adjusting to the workload. And there's nothing artificial or constraining about it either. When the buffers are full of long-running instructions, the previous stage(s) can be clock gated for a certain number of cycles since there's plenty of work anyway. They might even do this already (they certainly do something similar for the uop cache and decoders, although that's in the in-order part).
Which steps do you think are left that aren't already heavily gated? Because yes, extensive gating is being done already. This is why I've stated your supposed gains are modest: they seem to include things that have been done for five years.
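
For what it's worth, the kind of decision being discussed is roughly this (a conceptual sketch, not a description of any particular design):

// Front-end gating decision, evaluated every cycle: if the schedulers are
// already full of long-latency work, fetching and decoding more instructions
// only burns power, so clock-gate those stages for a few cycles.
struct FrontEndState {
    int issue_queue_occupancy;   // entries waiting to issue
    int issue_queue_capacity;
    int long_latency_in_flight;  // e.g. long-running SIMD ops or outstanding loads
};

bool should_gate_fetch_decode(const FrontEndState& s) {
    bool queue_nearly_full = s.issue_queue_occupancy > (s.issue_queue_capacity * 7) / 8;
    bool plenty_of_work    = s.long_latency_in_flight > 0;
    return queue_nearly_full && plenty_of_work;
}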

And yes, you could have different architectures for each and every workload, but then there's duplication of logic, extra data movement, and more programming troubles.
Duplicated logic is duplicating something whose cost is trending toward zero.
Data movement can be managed and coalesced so that a design can intelligently weigh the occasional burst in consumption when starting the offload process against the ongoing economies in power consumption.
The absolute size of the specialized logic can also play in its favor, since multiple such units can fit in the same area as a single large core. Their transfer costs would need to be evaluated in this regard as well, on top of power savings related to specialization at the design and silicon level.

Indeed, with long-running SIMD instructions there is potential for making things like prefetching less aggressive. I've mentioned that before in the Larrabee thread.
Why does the memory hierarchy see things as being significantly different? The data cache and the memory controller have very little awareness of what instructions are doing outside of the memory accesses they generate. The long-running SIMD instructions basically create the same per-cycle operand demand as quarter-width SIMD.
They might not be using the Haswell method of gathering vector data, which is apparently what I posited a while back: a microcoded loop of reads.

No. Scalar instructions are interleaved with the long-running SIMD instructions. So they execute at a slower pace as well.
This would make me start to question why this is on a big OoO core at all, when it seems all of its design features are negated, yet it has to jump through hoops to appear simpler.

Please elaborate on these design choices. And what do their SoCs do behind the scenes that the software isn't aware of (that other designs don't do)?
Intel's power control unit has been subtly overriding OS power state demands since Nehalem, and possibly one of the Atom designs at the time. That might not have been an SoC, so that may have been an inaccurate recollection on my part. Going forward, Intel has been putting forward standards to allow system components to communicate guard bands on latency requirements, so that their next SoC will be able to coalesce activity periods at its discretion to better enable power gating.

Parallel to this are firmware and hardware changes by upcoming ARM designs that will make even core assignment something much more fluid under the hood than would be visible to software.

But we weren't talking about Kepler's primary function.
I'm talking about doing anything and everything to get the most performance per Watt, including using silicon that is dedicated to subsets of different workloads that may be underutilized or gated fully off at other times.
It does exactly what I want it to do, and exactly what the customer would want it to do.
Crying over potentially underutilized transistors that have halved in price for almost 50 years is not high on my list of priorities. Most of the transistors on a chip have been aggressively kept off for about a decade, so what's a few more to add to the pile?

Bringing it one step closer to unification.
Or what Dally said, removing things incidental to the real problem.


Sounds like software rendering to me.
When you have a hammer...
 
Deferred lighting and post processing are pure 2D passes and easy to prefetch perfectly (all fetched data in L1 + streaming stores). In total those passes take 45% of the frame time.
Just to add a minor plug here... if you get the ISPC package from http://ispc.github.com/, one of the included examples is called "deferred", and it does the entire tiled-deferred shading pass (a la Frostbite 2 or equivalent) on the CPU using multithreading and SSE/AVX. If you happen to have ICC installed, you can compile the Cilk version, which also uses a much more efficient quad-tree based recursion with work stealing to do the light binning rather than the naive static tiling that GPUs are typically forced to do.

If you run this on a fairly modern system (Ivy Bridge or equivalent is the best since it has AVX+half float conversions as sebbbi mentions), it's surprisingly competitive with GPUs iso-power. Haswell should get a pretty significant boost as well (2x FMA, more cache BW).
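
For anyone who wants the flavour of it without grabbing the package, the light-binning half is roughly the following (a simplified scalar C++ sketch of naive static tiling; the actual example vectorizes the per-pixel shading with ISPC and, in the Cilk version, replaces this with the recursive quad-tree binning):

#include <algorithm>
#include <vector>

struct Light { float x, y, radius; };  // light position/extent, already projected to screen space

// Naive static tiling: test every light against every 16x16 screen tile,
// then shade each tile using only the lights that touch it.
std::vector<std::vector<int>> bin_lights(int width, int height,
                                         const std::vector<Light>& lights, int tile = 16) {
    int tx = (width  + tile - 1) / tile;
    int ty = (height + tile - 1) / tile;
    std::vector<std::vector<int>> bins(tx * ty);
    for (int i = 0; i < (int)lights.size(); ++i) {
        const Light& l = lights[i];
        int x0 = std::max(0,      (int)((l.x - l.radius) / tile));
        int x1 = std::min(tx - 1, (int)((l.x + l.radius) / tile));
        int y0 = std::max(0,      (int)((l.y - l.radius) / tile));
        int y1 = std::min(ty - 1, (int)((l.y + l.radius) / tile));
        for (int y = y0; y <= y1; ++y)
            for (int x = x0; x <= x1; ++x)
                bins[y * tx + x].push_back(i);  // light i affects tile (x, y)
    }
    return bins;  // the shading pass then walks bins[tile] per pixel, SIMD across pixels
}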

Tiled shadow map rendering (256 kB tiles = 512x256 pixels) fits completely inside the Sandy/Ivy/Haswell L2 cache of the CPU cores (and thus all cores can nicely work in parallel without fighting to share the L3). Shadow map vertex shaders are easy to run efficiently on AVX/FMA (just 1:1 mapping, nothing fancy, 100% L1 cache hit).
Indeed, I'm not too concerned about depth-only/shadow-map rendering in software. That's pretty easy to do efficiently and certainly doesn't require the machinery of a GPU to solve. One could argue that the fixed-function rasterizer is fairly power-efficient here, but it's not clear that it matters a lot in this case.
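
As an aside, the tile sizing quoted above works out exactly if the shadow map uses a 16-bit depth format (my assumption, not something stated in the post):

constexpr int tile_w = 512, tile_h = 256;
constexpr int bytes_per_texel = 2;                            // assumed 16-bit depth
constexpr int tile_bytes = tile_w * tile_h * bytes_per_texel; // 262144 B = 256 kB
static_assert(tile_bytes == 256 * 1024, "one 512x256 tile fills 256 kB of L2");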

So basically the only thing that doesn't suit CPU rendering that well is the object rendering to g-buffers. And that's only 25% of the whole frame... and even that can be quite elegantly solved (by using a tile based renderer). I'll write a follow-up on that when I have some more time :)
I've come to a similar conclusion myself actually, that GPU architecture is increasingly most relevant to the start of the frame where you basically just race to lay down a pile of simple interpolated attributes as quickly as possible. I'd be interested in your "further thoughts" though :)

Texture sampling/random access and large register files for hiding the resulting latency are the interesting bits of GPUs right now, and it's not clear to me how much of that survives the transition to power-constrained devices. Certainly the IMR model of just naively blowing bandwidth and trying to hide the latency is not going to be ideal in the future unless we can bring most of the relevant data on-chip. Even then, it seems from a physics point of view it's going to become more and more necessary to optimize locality of access in our algorithms, and the large-HW-context/latency hiding model is not particularly useful to that end.

None of this is to say that we need to get rid of GPUs or that I'm not personally going to make use of any hardware available to me. I mean, the reality of the situation is that we're somewhat stuck behind legacy "feed-forward" APIs (and in some cases, PCI-E) too, so it's hard to actually realize a lot of these possibilities efficiently right now. That said, that shouldn't be allowed to colour the theoretical discussion.
 
Simplifying scheduling is great, but it only removes the scheduling overhead, which is largely unnecessary in a massively threaded processor. It's not really what Dally is referring to when he's talking about "moving data". He's just talking about the basic action of fetching.

Dally's talking about the energy required to drive signals over wires from one point in the chip to another, or some kind of transaction with endpoints on and off chip.

One of the implied components to reducing scheduling overhead, aside from the transistor savings, is the reduction in the data transport related to that process. All scheduling and propagation down the pipeline is implicitly generating some number of bits of data per instruction, as manifested by the switching of signal wires and the changes in internal bookkeeping state. Other costs, such as accessing the branch prediction hardware, are also moving data, just data that the software doesn't see.

The energy cost of that movement is dependent on the distance that needs to be traveled and is influenced by the properties of the interconnect and how it is physically and electrically implemented. He mentions creating a very small register operand cache to keep the operand paths in the common case extremely short, and he talks about mitigating the cost of data movement across larger distances by changing the signaling method to some kind of small-swing signaling.
This can be further modified or complicated by the frequency the wires need to switch at, hence the various signaling methods with high energy efficiency but really low speeds.

The design is trying to minimize execution's footprint in terms of transistor and wire energy consumption, and on top of that to keep things on-chip.
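
As a toy model of the wire-energy argument (the coefficient below is a purely illustrative placeholder, not a measured figure):

// Energy to move a word scales roughly with bits x distance x (pJ per bit-mm),
// which is why a tiny operand cache next to the ALUs beats trips across the die.
constexpr double pj_per_bit_mm = 0.1;   // placeholder on-chip wire cost
constexpr double bits          = 64.0;  // one operand

constexpr double operand_cache_hop = bits * 0.1  * pj_per_bit_mm;  //  ~0.64 pJ over 0.1 mm
constexpr double cross_core_hop    = bits * 2.0  * pj_per_bit_mm;  // ~12.8 pJ over 2 mm
constexpr double cross_die_hop     = bits * 20.0 * pj_per_bit_mm;  //  ~128 pJ over 20 mm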
 
I don't disagree, but I am curious as to your reasoning.
If there's some point you're trying to make, then please state it.

For what it's worth, I fully realize that a new process node may require minor architectural changes to maximize its potential. So Fermi at 28 nm or Kepler at 40 nm and an equal die size would still not make them perfectly comparable, because that's not what they were designed for. That said, they made not minor but quite major architectural changes going from Fermi to Kepler, which clearly compromise GPGPU in favor of graphics. That's not dictated by the new process; there's no reason for it to be slower at anything. It's a deliberate choice.

So it's easy to conclude that optimizing for latency does matter to general-purpose throughput-oriented chips.
 
Nick said:
If there's some point you're trying to make, then please state it.
I'll take that as refusing to answer the question. If you are going to persist in being purposefully obtuse, then there really isn't any point.

EDIT: I suppose I can attempt one last time...

Why, exactly, do you think it is more accurate to compare GF110 to GK104 than GK110? And if that is the case, then to what would you compare GK110?
 
I've come to a similar conclusion myself actually, that GPU architecture is increasingly most relevant to the start of the frame where you basically just race to lay down a pile of simple interpolated attributes as quickly as possible. I'd be interested in your "further thoughts" though :)

Texture sampling/random access and large register files for hiding the resulting latency are the interesting bits of GPUs right now, and it's not clear to me how much of that survives the transition to power-constrained devices. Certainly the IMR model of just naively blowing bandwidth and trying to hide the latency is not going to be ideal in the future unless we can bring most of the relevant data on-chip. Even then, it seems from a physics point of view it's going to become more and more necessary to optimize locality of access in our algorithms, and the large-HW-context/latency hiding model is not particularly useful to that end.

None of this is to say that we need to get rid of GPUs or that I'm not personally going to make use of any hardware available to me. I mean, the reality of the situation is that we're somewhat stuck behind legacy "feed-forward" APIs (and in some cases, PCI-E) too, so it's hard to actually realize a lot of these possibilities efficiently right now. That said, that shouldn't be allowed to colour the theoretical discussion.

The ALUs in a GPU architecture are also relevant, compared to an OoO core with very few threads. While they could use a more flexible memory pipeline, better thread generation, etc., they are certainly better than just adding some vector instructions to a massive OoO core and then fighting the OoO core.
 
Texture sampling/random access and large register files for hiding the resulting latency are the interesting bits of GPUs right now, and it's not clear to me how much of that survives the transition to power-constrained devices. Certainly the IMR model of just naively blowing bandwidth and trying to hide the latency is not going to be ideal in the future unless we can bring most of the relevant data on-chip.
On-chip, or in-stack, or on-interposer?
There's strong interest in all those possibilities, with the likely timeline favoring 2.5D integration first for a decently-sized DRAM pool.
This seems like a good enough win for everything that uses main memory, save very large data sets that might require some kind of pass-through to external DRAM.
Latency could get a one-time boost by shaving off a few nanoseconds in transit time, but unless the DRAM arrays get any faster, the memory device delay is going to remain the primary barrier. Bandwidth would improve significantly, however.

Longer term, there's 3D integration, or perhaps some kind of multi-level logic process that could in certain cases beat out the standard method of having large pools of SRAM on the same plane as the logic.
A 3D integrated Pentium 4 that Intel studied took advantage of TSVs that were physically much shorter than the data lines that crossed the core or connected to the L2.
Heat and mechanical concerns make it a curiosity, at least for now.
 
If CPU/GPU did merge, that could be the saviour of AMD.
Intel hasn't made a good GPU; NV hasn't made a CPU, good or otherwise.
 
Let's talk ray tracing. Specifically, diffuse rays, which are highly divergent in both memory and execution, as well as being the important part of global illumination. In general, diffuse rays are the slowest class of rays as well as being absolutely necessary, not to mention that you have to trace a greater number of them in a given render than primary rays.

Here are the most recent papers I can find, for both GPU (Kepler) and CPU (i7 with AVX).

http://www.tml.tkk.fi/~timo/publications/aila2012hpg_techrep.pdf
http://dl.dropbox.com/u/10411297/Downloads/incoherent_dacrt_eg2012_final.pdf

Note that not only are we comparing state of the art tracers, but we even have identical test scenes used, so this is probably as fair a comparison as you're likely to get.

For the Conference Room scene, the GTX680 achieved 245.5 million rays per second, whereas the i7-2600 managed 18.5-20.4 million, depending on the number of bounces. This is somewhere from a 12-14x difference in speed in the GPU's favor.

For the Fairy Forest scene, the speedup was around 9x (I'm not going to copy any more numbers - look at the papers yourselves :p)

Finally, though there isn't a high-poly scene shared between the papers, I will note that the GTX680 managed to render the San Miguel scene (11 million triangles) at 58.8Mray/s diffuse, while the i7-2600 rendered the Hairball scene (2.9 million triangles) at 4.6-6.1Mray/s, giving around 10x improvement for high-poly scenes as well.

It's also worth noting that the GTX680 claims 192.2 GB/s memory bandwidth, while the i7-2600 claims 21 GB/s, a 9x difference. This means that for all the talk about superior memory systems with large caches and OOO, the GPU is actually beating the CPU in performance by slightly more than the difference in memory bandwidth, though it may just be a case of the CPU having a slower memory kit installed (though the spec sheet claims DDR3 1333 for the 21 GB/s number, so I don't really see it going much lower...).
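
Spelling those ratios out (same figures as above, just divided through):

constexpr double gpu_mrays    = 245.5;                      // GTX 680, Conference, diffuse
constexpr double cpu_mrays_lo = 18.5, cpu_mrays_hi = 20.4;  // i7-2600, depending on bounces

constexpr double speedup_max = gpu_mrays / cpu_mrays_lo;    // ~13.3x
constexpr double speedup_min = gpu_mrays / cpu_mrays_hi;    // ~12.0x
constexpr double bw_ratio    = 192.2 / 21.0;                // ~9.2x bandwidth advantage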

Now before anyone says anything about "bruteforce path-tracer" and "more efficient bidir/MT/any of the many other ways of improving the path-tracing algorithm", I might point out that the difference from a Mray/s standpoint is generally only about 2x for the GPU benchmarks I've seen, meaning the advanced algorithm on GPU is still throwing around 5 times as many rays as the simple brute force algorithm on CPU.

The basic reason that CPUs see no benefit from their large caches is that, for diffuse/secondary rays, you very rapidly lose any coherence between rays, so they are all accessing random parts of the scene. Yes, good algorithms try to keep the rays as coherent as possible, but the curse of dimensionality limits how much of this is possible (ray space is 6 dimensional, and hit space is 3 dimensional - there's a LOT of room for rays to avoid each other and prevent reuse). Since the scene is large - generally as large as will fit in memory, since we want fancy scenes - this means the cache miss rate is near worst case. Again, remember that all the image quality benefits of ray tracing over rasterization are due to the effects of diffuse and secondary rays! Primary rays are equivalent to rasterization.

Some form of diffuse ray tracing is going to find its way into gaming in the near future. In fact, the upcoming Unreal Engine 4 has showcased a form of it...
 
Only good for PS2 level graphics? Many of the best looking PS3 games use similar techniques to do deferred lighting with the CPU (Cell SPUs). Larrabee was also doing software rendering on x86 cores, and the game performance was (slightly) faster than current generation consoles (PS3 & Xbox 360). Haswell-E should have comparable single precision flops to Larrabee (two 256 bit FMAs per cycle, 8 cores, 4 GHz = 1024 GFLOP/s), so it should be much better than PS2 in rendering, and likely also beat PS3 (but only slightly, like Larrabee did).
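
For reference, the arithmetic behind that 1024 GFLOP/s figure is simply (a sketch, assuming the two 256-bit FMA ports per core mentioned above):

constexpr int    lanes_fp32    = 256 / 32;  // 8 floats per 256-bit vector
constexpr int    flops_per_fma = 2;         // multiply + add
constexpr int    fma_ports     = 2;         // two FMA issue ports per core
constexpr int    cores         = 8;
constexpr double ghz           = 4.0;

constexpr double gflops = lanes_fp32 * flops_per_fma * fma_ports * cores * ghz;
static_assert(gflops == 1024.0, "8 * 2 * 2 * 8 * 4 GHz = 1024 GFLOP/s");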

Larrabee is a GPU style architecture, not a CPU. Unless you write GPU style code for it, you're stuck with x86 cores somewhere between the 486 and Pentium 1 in architecture (dual issue like P1, but no MMX). The most similar architecture I can think of is AMD's GCN, for the HD7000 series.

Also current gen consoles = 7 year old hardware, so matching their performance is far from awe-inspiring. I mean really, next-gen smartphones are getting close to doing this...

And I suspect that even "matching" is an overstatement. Ever tried playing Skyrim on high at 780p (or whatever it is) with the WARP directx device? Those settings are more or less what the Xbox 360 runs it at.
 
And I suspect that even "matching" is an overstatement. Ever tried playing Skyrim on high at 780p (or whatever it is) with the WARP directx device? Those settings are more or less what the Xbox 360 runs it at.

I think the claim was that the original (unreleased) Larrabee reached XBox 360 levels, not that current CPUs can with the state of the art software renderers. And I have no idea how well WARP does or does not qualify for that.

Still, the crux of the claim is that an 8-core Haswell-EP will be able to do the same because it'll have the same FP32 performance as the original Larrabee (and presumably because it'll have gather). In my eyes there's still a lot of other things that separate Larrabee from Haswell and make me question this claim. For one thing, the old comparisons were made with Larrabee hardware that still had dedicated TMUs and possibly other fixed function components. Beyond that, Haswell isn't going to be nearly as good at automatically hiding latency, and the instruction set will still have deficiencies; for instance predication will have to be done explicitly with blends instead of with control words, needing more registers (and instructions) in an already more constrained register file.
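
To illustrate the predication point concretely (a minimal AVX sketch; with a mask-register ISA the blend and the extra live temporary simply fold into the instruction's write-mask):

#include <immintrin.h>

// "where cond: x = a + b" across 8 lanes. AVX has no write-masks, so the
// result must be blended back explicitly, costing an extra instruction and
// an extra live register for the temporary.
__m256 predicated_add(__m256 x, __m256 a, __m256 b, __m256 cond_mask) {
    __m256 sum = _mm256_add_ps(a, b);            // compute unconditionally
    return _mm256_blendv_ps(x, sum, cond_mask);  // keep the old x where the mask is clear
}
// A Larrabee/AVX-512 style ISA expresses this as a single masked op, e.g. vaddps zmm0{k1}, zmm1, zmm2.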
 
Still, the crux of the claim is that an 8-core Haswell-EP will be able to do the same because it'll have the same FP32 performance as the original Larrabee (and presumably because it'll have gather).
The gather performance is an interesting question, since the first Larrabee had an undisclosed implementation that may have needed a code sequence or was microcoded, while the 22nm variant has the version we know about. I'm not as certain what scatter implementation it had, or has now, versus the lack of one in Haswell.

I would think the first Larrabee wouldn't do a read for every vector element and could take advantage of locality.
Haswell's gather is apparently an initial version that is a relatively straightforward microcode loop with one read per vector element.
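
Roughly the difference being speculated about, in plain C++ (my own sketch of the two behaviors, not a statement of either design's actual microcode):

#include <stdint.h>

// One read per element: what a straightforward microcoded gather amounts to.
void gather_per_element(float dst[8], const float* base, const int32_t idx[8]) {
    for (int i = 0; i < 8; ++i)
        dst[i] = base[idx[i]];  // 8 separate loads, even if indices share a cache line
}

// Locality-aware variant: service every element falling in the same 64-byte
// line with a single line access per iteration.
void gather_per_line(float dst[8], const float* base, const int32_t idx[8]) {
    bool done[8] = {};
    for (int i = 0; i < 8; ++i) {
        if (done[i]) continue;
        uintptr_t line = (uintptr_t)&base[idx[i]] / 64;  // cache line of element i
        for (int j = i; j < 8; ++j)
            if (!done[j] && (uintptr_t)&base[idx[j]] / 64 == line) {
                dst[j] = base[idx[j]];  // all hits to this line handled in one access
                done[j] = true;
            }
    }
}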
 
And I have no idea how well WARP does or does not qualify for that.

To give you some ballpark idea, from the SwiftShader FAQ:
"on a modern quad-core Core i7 CPU at 3.2 GHz, the SwiftShader DirectX® 9 SM 3.0 demo scores 620 in 3DMark06"

"The SwiftShader DirectX® 9 SM 3.0 demo is twice as fast as Microsoft®’s WARP renderer in benchmarks of Crysis."

Grabbed the first post I found about 3DMark06 on a GPU:
"I use an E8400, 4GB PC2-8000 OCZ Platinum, and 4850 (700/1150) and score ~13300"
 
To give you some ballpark idea, from the SwiftShader FAQ:
"on a modern quad-core Core i7 CPU at 3.2 GHz, the SwiftShader DirectX® 9 SM 3.0 demo scores 620 in 3DMark06"

"The SwiftShader DirectX® 9 SM 3.0 demo is twice as fast as Microsoft®’s WARP renderer in benchmarks of Crysis."

Grabbed the first post about 3DMark06 on a GPU:
"I use an E8400, 4GB PC2-8000 OCZ Platinum, and 4850 (700/1150) and score ~13300"

So at least according to Nick, WARP is far from state of the art in performance. Not really that surprising; MS wouldn't have a huge investment in making this as fast as possible.

The gather performance is an interesting question, since the first Larrabee had an undisclosed implementation that may have needed a code sequence or was microcoded, while the 22nm variant has the version we know about. I'm not as certain what scatter implementation it had, or has now, versus the lack of one in Haswell.

I would think the first Larrabee wouldn't do a read for every vector element and could take advantage of locality.
Haswell's gather is apparently an initial version that is a relatively straightforward microcode loop with one read per vector element.

I think I remember slides detailing gather in the original Larrabee, which confirmed that it could gather all elements in the same cacheline in one cycle (like the current one can). On the other hand I don't remember seeing anything about the explicit nature of this needing the mask loop, but that was probably still the case for it too.

Given that Haswell has two scalar load ports, hopefully the gather ucode can execute at two elements per cycle. A four-cycle gather (for eight 32-bit elements) would be pretty good, even if it blocks the decoders due to coming from the ucode ROM.
 
Microcoded gather is really quite awful for a modern pixel shader, since every memory operation can be a gather, and the optimizer may not even be able to tell, thus forcing it to implement the access as a gather even if it's not really necessary. Just look at relief mapping done in, for instance, Crysis - you have maybe a dozen gather ops per pixel during the ray casting phase, possibly even 50-100. Since each pixel's ray samples are dependent on the particular texels it hits, they won't be in any sort of nice stride, so you really do have to use gather.

Wide SIMD is nice, but you really do need scatter/gather and full predication to make good use of it in anything but the simplest algorithms. Otherwise you end up with 4 SIMD instructions surrounded by 20 serial instructions packing and unpacking data into SIMD vectors.
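
For a concrete picture of that packing/unpacking overhead (a hedged AVX sketch of a dependent texel fetch; fetch_texels_no_gather is my own illustration, not anyone's shipping code):

#include <immintrin.h>

// Dependent texel fetch across 8 pixels. Without a gather instruction the
// vector is dismantled into scalars and rebuilt: a store, 8 serial loads,
// and a vector reload, all wrapped around what is logically one SIMD load.
__m256 fetch_texels_no_gather(const float* texels, __m256i offsets) {
    alignas(32) int   idx[8];
    alignas(32) float tmp[8];
    _mm256_store_si256((__m256i*)idx, offsets);  // unpack the lane indices to memory
    for (int i = 0; i < 8; ++i)
        tmp[i] = texels[idx[i]];                 // 8 serial scalar fetches
    return _mm256_load_ps(tmp);                  // repack into a vector
}
// With AVX2: return _mm256_i32gather_ps(texels, offsets, 4); one instruction,
// even if the hardware still expands it into a loop of reads underneath.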

So at least according to Nick, WARP is far from state of the art in performance. Not really that surprising; MS wouldn't have a huge investment in making this as fast as possible.

This just drives home the point that writing a fast renderer for CPU is *hard*. Microsoft has actually been putting a fair bit of work into WARP, which now supports DX10 and DX11 feature levels as of Windows 8. For them to leave 50% of the performance on the table for "very simple" things like rendering says something important. Either that or SwiftShader is relying on unsafe compiler optimizations that only work correctly most of the time. Shader compilers are rather notorious for that.
 