Ah. I get it now. I guess I've never had the need to do that so it took a while to get it.
It's a very good way to gain performance if you have a special-case branch that can only handle the special case but is considerably faster, and a general-purpose branch that can handle both cases.
For example: the general-purpose branch is 20 instructions and the special-case branch is 10 instructions. Without an "ifAny"-style branch, a divergent warp executes 30 instructions. With it, it executes only 20 (a saving of 10 instructions).
This kind of branching violates the SPMD model, so it's not officially supported in most SPMD languages (OpenCL, DirectCompute). You can do a similar thing in CUDA with warp voting. However, the GPU hardware doesn't need any extra instructions for this.
if (c)
{
// common case
[20 instructions]
}
else
{
// optimal special case
[10 instructions]
}
Compiles to something like this:
do the comparison c
if comparison result of c has no bits set, jump to A
(c) instruction 1
(c) instruction ..
(c) instruction 20
A:
if comparison result of c has all bits set, jump to B
(!c) instruction 1
(!c) instruction ...
(!c) instruction 10
B:
Here the comparison result of c is a bitfield (one bit per lane), and the bracketed (c) and (!c) instructions are predicated.
The so-called "ifAny" version looks like this:
do the comparison c
if comparison result of c has ANY bits set, jump to A
instruction 1
instruction ...
instruction 10
jump to B
A:
instruction 1
instruction ..
instruction 20
B:
The second version uses only jump instructions, and it doesn't even need lane predication. On consoles you can write your own GPU microcode, so you can write constructs like this. In CUDA you can also write constructs like this (with warp voting), but it easily makes your program unportable. You also need to be absolutely sure that the general-case code produces correct results for all the special-case inputs as well; otherwise each thread's result will depend on the other threads in its warp. Floating-point rounding can also be a problem, since the compiler may optimize the two paths differently (again giving slightly different results depending on the other threads in the same warp). But most CUDA code is already written assuming that the warp width is 32, so it isn't portable either way: you can assume that 32 threads are always lock-stepped and use this for optimization purposes (you need fewer barriers this way), but the code breaks down if the warp width is not what was expected. Many libraries do this extensively.
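To make the instruction-count argument concrete, here's a minimal host-side simulation of the two compilation schemes. The cost figures (20 and 10 instructions) come from the example above; everything else (function names, the warp represented as a list of per-lane predicates) is illustrative, not any real GPU API.

```python
# Instruction-count model of the two branch compilation schemes.
# A "warp" is a list of per-lane boolean predicates for condition c.
# Path costs are the figures from the example: 20 instructions for the
# general-purpose path, 10 for the special-case path.

GENERAL_COST = 20   # instructions in the (c) general-purpose path
SPECIAL_COST = 10   # instructions in the (!c) special-case path

def predicated_cost(warp):
    """Standard scheme: a divergent warp walks both predicated paths."""
    cost = 0
    if any(warp):          # at least one lane needs the general path
        cost += GENERAL_COST
    if not all(warp):      # at least one lane needs the special path
        cost += SPECIAL_COST
    return cost

def if_any_cost(warp):
    """'ifAny' scheme: if ANY lane needs the general path, ALL lanes take
    it (so the general code must also be valid for special-case inputs)."""
    return GENERAL_COST if any(warp) else SPECIAL_COST

divergent = [True] * 16 + [False] * 16   # mixed 32-lane warp
uniform_special = [False] * 32           # every lane hits the special case

assert predicated_cost(divergent) == 30  # both paths executed
assert if_any_cost(divergent) == 20      # only the general path: saves 10
assert predicated_cost(uniform_special) == if_any_cost(uniform_special) == 10
```

Note that for non-divergent warps the two schemes cost the same; the "ifAny" version only pays off (and only changes behavior) when the warp is divergent.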
Intel's desktop CPUs aren't going to represent the best balance between CPU and GPU for obvious reasons. Mobile chips, which represent the majority of present-day chips and the future of computing, will tell quite a different tale.
OK, so let's bring some mobile chips into the debate.
Sandy Bridge i7 extreme (2960XM: 4 cores, 2.7 GHz / 3.4* GHz turbo + HD 3000, 650 MHz / 1.3 GHz turbo):
- CPU: 4 (cores) * 8 (AVX) * 2 (separate mul + add ports): Nominal (2.7 GHz) = 172.8 GFLOP/s, turbo (3.4 GHz) = 217.6 GFLOP/s
- GPU: 12 (EU) * 4 (physical width of EU) * 2 (FMA): Nominal (650 MHz) = 62.4 GFLOP/s, turbo (1.3 GHz) = 124.8
Sandy Bridge i7 performance (2760QM: 4 cores, 2.4 GHz / 3.2* GHz turbo + HD 3000, 650 MHz / 1.3 GHz turbo):
- CPU: 4 (cores) * 8 (AVX) * 2 (separate mul + add ports): Nominal (2.4 GHz) = 153.6 GFLOP/s, turbo (3.2 GHz) = 204.8 GFLOP/s
- GPU: 12 (EU) * 4 (physical width of EU) * 2 (FMA): Nominal (650 MHz) = 62.4 GFLOP/s, turbo (1.3 GHz) = 124.8
Sandy Bridge i3 mainstream (2370M: 2 cores, 2.4 GHz, no turbo + HD 3000, 650 MHz / 1.15 GHz turbo):
- CPU: 2 (cores) * 8 (AVX) * 2 (separate mul + add ports): Nominal (2.4 GHz) = 76.8 GFLOP/s
- GPU: 12 (EU) * 4 (physical width of EU) * 2 (FMA): Nominal (650 MHz) = 62.4 GFLOP/s, turbo (1.15 GHz) = 110.4 GFLOP/s
The CPU wins the FLOP/s race in all the other Sandy Bridge mobile parts except the low-end i3 model, and even then only if the GPU is running at its maximum turbo clock. In high-end Ivy Bridge chips the CPU and GPU are basically tied. The mainstream Ivy Bridge models with HD 4000 graphics win against the CPU handily, but the low-end models with HD 2500 graphics do not. So I wouldn't personally describe it as "quite a different tale": currently sold Intel integrated mobile GPUs are pretty much tied with the CPU in peak FLOP/s. Of course this alone doesn't warrant any kind of vector ALU sharing between CPU and GPU. That would require much more than equal peak FLOP/s.
(*) CPU turbo clocks in the chart are based on the lowest turbo values (four cores active).
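The peak-FLOP/s figures above can be sanity-checked with the formula implied by the list: units x SIMD width x flops per lane per clock x clock. The helper below is illustrative; the numbers are the ones from the chart.

```python
# Sanity check of the peak-FLOP/s arithmetic above.
# Assumed formula: units * SIMD width * flops per lane per clock * GHz.

def gflops(units, simd_width, flops_per_clock, ghz):
    return units * simd_width * flops_per_clock * ghz

# 2960XM CPU: 4 cores * 8 (AVX) * 2 (separate mul + add ports)
assert abs(gflops(4, 8, 2, 2.7) - 172.8) < 1e-9   # nominal
assert abs(gflops(4, 8, 2, 3.4) - 217.6) < 1e-9   # turbo

# HD 3000 GPU: 12 EUs * 4 (physical width) * 2 (FMA)
assert abs(gflops(12, 4, 2, 0.65) - 62.4) < 1e-9  # nominal
assert abs(gflops(12, 4, 2, 1.3) - 124.8) < 1e-9  # turbo
assert abs(gflops(12, 4, 2, 1.15) - 110.4) < 1e-9 # i3 turbo cap

# 2370M CPU: 2 cores at 2.4 GHz, no turbo
assert abs(gflops(2, 8, 2, 2.4) - 76.8) < 1e-9
```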
..Haswell GPU is substantially bigger than IB...
Haswell will double GPU performance relative to Ivy Bridge. But Haswell also doubles the CPU flops (dual FMA pipes), so the CPU/GPU FLOP/s ratio shouldn't change much at all.
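A quick per-clock check of that argument (illustrative numbers, ignoring clock speeds; the 16-EU HD 4000 figure is an assumption on my part): doubling both sides leaves the ratio untouched.

```python
# Per-clock peak flops, clocks ignored for simplicity.
ivb_cpu = 4 * 8 * 2          # 4 cores * 8-wide AVX * (mul + add ports)
ivb_gpu = 16 * 4 * 2         # HD 4000: 16 EUs * width 4 * FMA (assumed)

hsw_cpu = 4 * 8 * 2 * 2      # dual FMA pipes: flops per clock double
hsw_gpu = ivb_gpu * 2        # roughly doubled GPU throughput

# Both sides doubled -> the CPU:GPU ratio is unchanged.
assert hsw_cpu == 2 * ivb_cpu
assert hsw_cpu / hsw_gpu == ivb_cpu / ivb_gpu
```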