Software/CPU-based 3D Rendering

With just 100 GFLOPS worth of computing power, you can compete with a 50 GFLOPS GPU.
So seeing as we have GPUs with 3.7 TFLOPS, in what year are you expecting we will see a 7.4 TFLOP CPU?
That's not a very relevant question, because it's a moving target. I recall a discussion right after the GeForce 3 was launched where a developer claimed that it would render "anything you can throw at it" at smooth framerates, and you'd never be able to achieve that with software rendering. Obviously we're way past that point now, and software rendering can deal with far more complex graphics than the GF3 ever could. Despite that, most people understandably don't quite find it good enough yet.

So if it's a moving target, you might wonder if it will ever be 'good enough'? Yes, I strongly believe so. The GeForce 3 only came in three flavors, with the fastest one less than 40% faster than the slowest one. Today we have GPUs with ~4 TFLOPS but we also have integrated GPUs with ~400 GFLOPS that sell really well. This is why I believe CPU-GPU unification will happen from the bottom up. Matching such low-end graphics performance is a much easier target, and even though it moves too, it moves more slowly. By getting rid of the integrated GPU, an affordable 8-core CPU with 512-bit SIMD units would theoretically be feasible with today's technology. If it took zero time to design the optimal instruction set and write all the software, it could already be 'good enough' in the low-end desktop market (not to mention you'd get a powerful 8-core CPU, which is valuable for much more than legacy graphics).

Obviously it doesn't take zero time, but the time it will take allows for more progress/convergence, which makes the market for it bigger and makes it worth the investment.
 
The CPU pipeline for sequential (OoO or not) code is virtually idle for the given task while you do stream computing on the SIMD units, because it trashes/pollutes the caches. Yes, you can use write-combined memory with stream loading, etc., but then scalar access to it from sequential code will be slow as hell.
Except none of that is observed in practice.

The mistake you're making is that you consider scalar and SIMD code to be separate. I guess that stems from a heterogeneous way of thinking. The SIMD code can't make the scalar code go idle, because it's all the same instruction stream. One can't run ahead of the other (outside of the out-of-order execution window). If one slows down the other, they both slow down. But even if it occasionally slows down, that doesn't mean there's a cache trashing problem with this architecture...

Cache trashing is really a misnomer in this scenario. GPUs have tiny caches and trash them all the time. Except we call that 'high efficiency throughput computing', because while reducing the SIMD width (down to a scalar if necessary) would put an end to the trashing, it would also kill performance!

Really the simplest way to look at this is that SIMD code is nothing other than scalar code from a loop body that has been parallelized by executing multiple iterations in lock-step. It's just another form of scalar code. A way to represent multiple scalar cores all executing the same instruction. If it causes a proportional increase in cache traffic, that's really just an indication that you're probably achieving the speedup that you wanted. So take the good with the bad.
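To make that concrete, here's a minimal sketch (plain C++ with SSE intrinsics; the function and array names are made up) of the same loop body written once as scalar code and once with four iterations running in lock-step:

```
#include <xmmintrin.h>  // SSE

// Scalar version: one iteration of the loop body at a time.
void scale_add_scalar(const float* a, const float* b, float* out, int n)
{
    for (int i = 0; i < n; ++i)
        out[i] = a[i] * 2.0f + b[i];
}

// SIMD version: the exact same loop body, but four iterations
// execute in lock-step per instruction. Assumes n is a multiple of 4.
void scale_add_simd(const float* a, const float* b, float* out, int n)
{
    const __m128 two = _mm_set1_ps(2.0f);
    for (int i = 0; i < n; i += 4)
    {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(_mm_mul_ps(va, two), vb));
    }
}
```

Per instruction the SIMD version touches four times as much data, so the cache traffic scales up with the speedup, which is exactly the trade-off described above.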

The only thing that really matters is the cache hit/miss ratio per byte. A unified architecture with out-of-order execution is in theory better than a GPU because with fewer threads a higher data access coherency can be achieved. Note that Hyper-Threading, which is only 2-way SMT, in some cases doesn't provide a speedup due to cache contention (the bad kind of cache trashing). Now imagine how ugly that can get with the many threads a GPU runs. The only saving grace for GPUs is that in graphics the code is typically short and there's data access coherency between nearby threads. For longer code it's important to keep the threads close together. You can even interleave them in software and achieve better performance at lower occupancy.
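As a hedged illustration of the 'interleave them in software' remark (a generic sketch, not anyone's actual code): two independent dependency chains in a single instruction stream give an out-of-order core work to overlap with cache misses, which is the software analogue of hiding latency without piling on more hardware threads:

```
// Two independent accumulations interleaved in one instruction stream.
// The independent chains give the out-of-order core work to overlap
// with cache misses, without needing extra hardware threads.
float sum_interleaved(const float* data, int n)  // n assumed even
{
    float acc0 = 0.0f, acc1 = 0.0f;
    for (int i = 0; i < n; i += 2)
    {
        acc0 += data[i];      // chain 0
        acc1 += data[i + 1];  // chain 1
    }
    return acc0 + acc1;
}
```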

The take-home message is that you can achieve all this and more with a unified architecture and scalarization is fundamentally a good thing.
 
I recall a discussion right after the GeForce 3 was launched where a developer claimed that it would render "anything you can throw at it" at smooth framerates, and you'd never be able to achieve that with software rendering. Obviously we're way past that point now, and software rendering can deal with far more complex graphics than the GF3 ever could.

Complex yes, but faster I very much doubt. Taking a few approximations using SwiftShader, and being generous in assuming a current top-end CPU would outperform a Q6600 by 3x with it in UT2004, as far as I can tell the GeForce 3 wins.
 
Except none of that is observed in practice.
WTF!? Did you ever run a profiler?

The mistake you're making is that you consider scalar and SIMD code to be separate.
Damnit!! You are even in complete denial now that there is something called latency vs. throughput, and that a compromise based on limited resources is what determines the optimal system architecture.

GPUs have tiny caches and trash them all the time.
EXACTLY! Now try to explain to yourself why the CPU has huge caches and the GPU has small ones. :rolleyes:

Whatever... nothing valuable will come out if I continue to argue with you
 
Whatever... nothing valuable will come out if I continue to argue with you
Don't bother. We are not worthy. I effectively got told I know little about video and 3D graphics hardware and so am retreating to stay blissful in my ignorance. :p
 
Which is a perfectly valid assessment of tea maker's skills. All you do in technical threads is reading tea leaves. </lame pun>
 
So if it's a moving target, you might wonder if it will ever be 'good enough'? Yes, I strongly believe so. The GeForce 3 only came in three flavors, with the fastest one less than 40% faster than the slowest one. Today we have GPUs with ~4 TFLOPS but we also have integrated GPUs with ~400 GFLOPS that sell really well.
Low-end GeForce 2 MX variants also sold really well at the time. I'm not sure what your point is with that comparison.
 
I think he's saying there is a market for very low-powered (low power in fps, not watts) graphics, and that's true: there are people who need nothing other than desktop compositing, and CPU-rendered gfx would work for them. But where I disagree is that this shows a trend that will move up, and that one day (I'm talking next 10 years, not next 25+) a GPU-less system will be enough for everyone, that the power of software renderers is increasing faster than the rate of complexity in games, or that we will reach a time when devs say "ok, our games are pretty enough, we won't improve them anymore".

I think we are a very long way from the no-improvement scenario: we have 3D (although niche) and those people ideally want 120 fps, 4K has just arrived (I play at 5.5K when I can), and there's a huge trend towards physical modelling of systems that's going to suck up a ton of flops, plus a million other things that I can't think of.
 
The reality is that mixing sequential processing with stream processing increases code and hardware complexity. It is well understood that the complexity of a system can be reduced significantly by dividing it into its problem classes.
Unification makes the code far simpler, not more complex. If you know how to write a loop with independent iterations, you know how to implicitly mix scalar and vector code. Heterogeneous computing is way more complicated than that if you want acceptable results. As for the hardware, how can a CPU+GPU be less complex than just a CPU? Even with the wider SIMD units, Haswell's abilities as a sequential scalar CPU seem unaffected and they were able to lower the power consumption while increasing performance, all on the same process node.
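A minimal sketch of the 'loop with independent iterations' point (the function and parameter names are made up): ordinary scalar control flow and a vectorizable body sit in the same source and the same instruction stream, and a compiler is free to execute several iterations of the inner loop in lock-step:

```
// Scalar setup and control flow around a body with independent iterations.
// A vectorizing compiler can run several iterations of the inner loop in
// lock-step; the surrounding scalar code stays in the same instruction stream.
void shade_row(const float* depth, float* color, int width, float fog)
{
    if (width <= 0)                           // ordinary scalar branch
        return;
    for (int x = 0; x < width; ++x)           // independent iterations
        color[x] = color[x] * (1.0f - fog) + depth[x] * fog;
}
```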

That said, hardware complexity doesn't have to be an issue in and of itself. For instance branch prediction is a complex but small piece of logic that is well worth the performance advantage and what it contributes to simplifying software. So while CPU-GPU unification will require a lot of effort from the hardware designers, that's what they're paid to do and shifting the problems from software to hardware is a good thing once process technology allows for it. Again, unified shaders on the GPU didn't come for free either.
The fundamental flaw in your logic is that you are insisting on the complete opposite of this solely because you have recently observed a trend of increasing programmability of GPUs and a tighter integration into the system, without any qualified analysis of whether the trend implies a convergence.
First of all I haven't just "recently" observed a trend of increasing programmability of GPUs. I've observed it since the Voodoo2 added dual-texturing capabilities. Secondly the mere observation of that trend for the past 15 years isn't what makes me believe it will inevitably result in unification. It's just in support of that. The difference is in what you and I believe to be the driving force. You seem to believe that I assume the trend will continue simply due to having been a trend so far. Indeed such reasoning is very fragile and wouldn't necessarily result in unification. But instead I believe there is a strong desire for full programmability, and hardware is just the limiting factor to get there without losing too much performance.

Of course I can't present much of a "qualified analysis" for what is mostly a desire. But it's pretty obvious that consumers are not going to buy new hardware unless it has novel capabilities. Higher performance of existing applications isn't a novelty for very long. You have to enable new applications. So GPU manufacturers have to increase programmability to stay in business. CPU manufacturers already have ultimate programmability but lack performance to enable new applications, so they adopt things like multi-core and wide SIMD units. So they're really striving for the same things, without losing previous qualities. CPUs aren't going to let go of out-of-order execution to improve throughput computing and GPUs aren't going to let go of wide SIMD units to become better at scalar workloads. The only solution is a unified architecture which combines all those qualities. It's just a matter of time for process technology to enable that.
Intel has performed an analysis with Larrabee, which is a long way from your radical suggestion, and the result is that there won't be a convergence even for that. So I wonder if any evidence and reason could possibly convince you.
Larrabee isn't an analysis of that, because it's not a unified architecture. Instead of combining the qualities of the CPU and GPU, they made compromises to aim for something in the middle. The result is that it fails as a CPU and fails as a GPU. It was also a mistake to try to enter the high-end market. It would have cost a fortune to sell it cheap enough to gain sufficient market share for developers to want to target it. The only market where such an in-the-middle compromise makes some sense is the HPC market, which they've entered with considerable success. But for graphics in the consumer market they should instead focus on architectures that are already worth the money by still being successful as a CPU and slowly becoming capable of replacing a low-end GPU. With the AVX roadmap they seem to be very well on track for that.

So I'm sorry but your evidence of the failure of Larrabee as a high-end discrete GPU is in no way an indication that the unification of the CPU and integrated GPU will also be a failure.
It is a completely different question if this trend will continue in the future (>5 years) at all or even will turn around.
I really don't have the time to link to the panel talks and papers which conclude that IC performance scaling has been slowing down at an accelerating rate since the introduction of the 32 nm node. Just trust me that this is an established fact.
Intel claims that Broadwell (14 nm) will be at least 30% more power efficient. And after that they seem to want to widen the SIMD units to 512-bit. So if the performance scaling has slowed down since 32 nm I'm not seeing it. I see absolutely no reason why I should trust you that this is an established fact.

Sure, some companies can no longer continue the historic pace due to increasing costs. This is especially a problem for low volume custom designs. CPUs are produced in very high volume though because they are so widely applicable. This is another reason why unification makes sense. One architecture takes care of all your computing needs.
 
Power consumption is everything in the low end arena. This is becoming dominated by tablets and laptops, and people care about battery life.

To the best of my knowledge, the biggest on-chip power usage comes from active caches and buffers. Now, clearly a cache access takes less power than a full memory access, but there are other types of buffers on a CPU that see heavy use.

One good example is OoO hardware. Here you have the scoreboard, the reorder buffer, as well as the instruction issue queue, each of which is accessed every single cycle.

* * *

On the subject of computing evolution, I think we're seeing a new divide: mainstream portable devices such as laptops, tablets, and even smartphones on one side, and high-performance workstations on the other. So the question becomes what workloads they are actually used for.

Since this discussion appears to be about mainstream devices, I'll set aside the workstation category.

There are three main categories I can think of where performance actually matters for mainstream users: gaming, videos, and internet browsing. Let's deal with them one at a time.

Gaming has been heavily dependent on the GPU for more than a decade. The reason for this is that there are two big performance sinks: graphics and, more recently, physics. Both are very much throughput tasks, and neither is particularly dependent on data traffic with the rest of the program: with graphics you occasionally upload some new textures, and with physics you have to track a few thousand objects every frame, which is really only a few dozen KB of data. Benchmarks still show that with games the CPU hardly matters, and that plugging in a faster GPU will increase performance.
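As a rough sanity check of the 'few dozen KB' figure (the object count and per-object layout below are illustrative assumptions, not measurements):

```
// Back-of-the-envelope check: per-frame physics state handed to the renderer.
// Object count and layout are illustrative assumptions, not measurements.
constexpr long objects         = 2000;
constexpr long bytes_per_obj   = 3 * 4    /* position: 3 floats */
                               + 4 * 4;   /* orientation: quaternion */
constexpr long bytes_per_frame = objects * bytes_per_obj;  // 56,000 bytes, ~55 KB
static_assert(bytes_per_frame < 64 * 1024, "a few dozen KB per frame");
```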

Videos are typically handled by special-purpose hardware, since they require a large amount of processing, and the data format is ugly enough that a general-purpose processor would waste a lot of time (and power!) doing bitwise logic to unpack data. Also, there is a lot of use of low-precision data types (bytes and fixed point) which just waste power if you process them on a full-size ALU.
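To illustrate why bit-level formats are unkind to a general-purpose core (a generic sketch, not any particular codec): every variable-length field costs shifts, masks and bookkeeping that dedicated decode hardware performs essentially for free:

```
#include <cstddef>
#include <cstdint>

// Generic MSB-first bit reader: each small field costs the CPU shifts,
// masks and bookkeeping, per field, per block, per frame.
struct BitReader
{
    const uint8_t* data;
    std::size_t    bit_pos = 0;

    uint32_t read_bits(int count)   // count <= 32 assumed
    {
        uint32_t value = 0;
        for (int i = 0; i < count; ++i)
        {
            uint32_t bit = (data[bit_pos >> 3] >> (7 - (bit_pos & 7))) & 1u;
            value = (value << 1) | bit;
            ++bit_pos;
        }
        return value;
    }
};
```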

Internet browsing is really the interesting case, since both the CPU and GPU are taxed: the GPU because those darn ads keep adding new and more expensive effects, and the CPU because of the various scripts that seem to pop up everywhere. Scripting is actually a very interesting case, since it's largely interpreted from bytecode. Interpreters resist most of the serial performance enhancements of CPUs since, when you think about it, every bytecode instruction is interpreted through a data-dependent branch to some bit of code or another that actually performs the instruction.
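A minimal sketch of such a dispatch loop (a made-up bytecode, not any particular engine):

```
#include <cstdint>
#include <vector>

enum Op : uint8_t { PUSH, ADD, MUL, HALT };

// Classic switch-dispatch interpreter: the switch compiles to an indirect
// jump whose target depends on the bytecode data itself, which tends to be
// hard for the branch predictor across unrelated opcodes.
int run(const std::vector<uint8_t>& code)
{
    std::vector<int> stack;
    std::size_t pc = 0;
    for (;;)
    {
        switch (code[pc++])                  // data-dependent dispatch
        {
        case PUSH: stack.push_back(code[pc++]); break;
        case ADD:  { int b = stack.back(); stack.pop_back(); stack.back() += b; break; }
        case MUL:  { int b = stack.back(); stack.pop_back(); stack.back() *= b; break; }
        case HALT: return stack.back();
        }
    }
}
```

For example, run({PUSH, 2, PUSH, 3, ADD, HALT}) returns 5.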

Ultimately, the question is where to spend your power budget.
 
Complex yes, but faster I very much doubt. As far as I can tell the GeForce 3 wins.
Nope. A quad-core CPU outperforms a GeForce3, while only using 128-bit SIMD.

Imagine what four times the SIMD width, twice the number of cores, a proper gather implementation and DDR4 could do in the hopefully not too distant future. It would be comparable to a much more recent GPU while still offering far more capabilities. The market for this should consist not just of people who are not hardcore gamers, but also of people who are hardcore gamers and want their discrete GPU to be accompanied by a powerful CPU instead of a worthless integrated GPU.
 
Disagree. On my PC (Q6600), UT2004 at 1680x1050, high quality, no AA/AF, on an unknown small 2-player level (me + 1 bot), SwiftShader produced 7 fps.
It's hard to find GF3 UT2004 benchmarks, but I did find a forum thread with people posting their results (from 2004):
http://forums.epicgames.com/threads/352328-Post-your-UT2004-Benchmark-results-here-(download-here)
UT2004 Build UT2004_Build_[2004-02-10_03.01]
Windows XP 5.1 (Build: 2600)
AuthenticAMD PentiumPro-class processor @ 1402 MHz with 511MB RAM
NVIDIA GeForce3 (5655)

ctf-bridgeoffate?spectatoronly=true?numbots=17?attract cam=true?quickstart=true -benchmark -ini=UT2004_1024x768.ini -userini=benchmarkuser.ini -seconds=70 -exec=santaduck_ctf.txt

10.535425 / 27.422403 / 88.753197 fps
Score = 27.426121

Lots of GF4 results here (sorry, but UT2004 GeForce3 benchmarks are hard to come by); if you know how much faster the GF4 is than the GF3 you could extrapolate:
http://forums.epicgames.com/archive/index.php/t-349590.html

Here's another quote from 2005:
My Geforce 3 TI 200 runs UT2004 on Medium Settings with an avg of 35-40 FPS
It's not much info, but I would suggest that if that system were to run UT2004 at the same settings as mine it would score more than 7 fps.
 
WTF!? Did you ever run a profiler?
It would be easier to sum up the days I haven't run a profiler.

You're trying to turn something into a problem that really just isn't an issue at all. On a contemporary quad-core with a 256-bit store port per core, it would take 65,536 cycles to 'erase' an 8 MB L3. That's a lot of cycles before a scalar instruction would absolutely have to get its data from RAM (which typically still takes less than 100 cycles). In a more typical scenario it takes far longer for the L3 to be trashed, because data is actually being reused. Of course the scalar code's L3 miss can happen well before the very last cache line is evicted, but the cache is 16-way associative, which doesn't behave very differently from an infinitely associative cache.
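For reference, the 65,536-cycle figure follows directly from the assumed store bandwidth; a quick compile-time check using the numbers in the post:

```
// 8 MB of L3, four cores each storing 32 bytes (256 bits) per cycle:
// it takes 65,536 cycles of nothing but stores to overwrite the whole cache.
constexpr long long l3_bytes        = 8LL * 1024 * 1024;
constexpr long long bytes_per_cycle = 4 /* cores */ * 32 /* bytes per store */;
static_assert(l3_bytes / bytes_per_cycle == 65536, "matches the figure above");
```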

That just leaves the scalar and vector code to share the cache as equals. I mean that in the sense that vector processing is a bundle of scalar operations in parallel as described earlier, so it's entitled to use a larger portion of the cache in the hopes that it can sustain its high throughput. Note that this also means that scalar instructions and vector instructions are equally affected by each other. Vector data occupies more of the cache but there's also a larger impact if a miss occurs.

This may sound like a precarious balance but it works really well in practice. Just check your profiler.
Damnit!! You are even in complete denial now that there is something called latency vs. throughput, and that a compromise based on limited resources is what determines the optimal system architecture.
I'm not in denial of anything, let alone complete denial. There's an impact, but it's much smaller than what you're portraying. GPUs are now way more latency optimized than they used to be, without disastrous consequences. In fact it was highly necessary to reduce on-die storage needs and increase the cache hit ratio to reduce bandwidth. Echelon cares even more about per-thread performance and is expected to run at up to 2.5 GHz. Meanwhile CPUs have become way less aggressive about achieving low latencies, and there's a significant focus on throughput. The Pentium 4 had integer ALUs with an effective back-to-back execution latency of 7.6 GHz^-1 (its double-pumped ALUs at 3.8 GHz), at 90 nm. Nowadays the design target is a mere 4 GHz on a process that is capable of way more if they pushed for it, while the floating-point performance has gone up eightfold per core (nearly 50x for FLOPS/Watt) with more to come.

So there's far less latency vs. throughput focus going on. They don't just target one, but both, and the balances between them are converging.
EXACTLY! Now try to explain to yourself why the CPU has huge caches and the GPU has small ones. :rolleyes:
What's the point you're trying to make? Keep in mind that you have to compare the entire storage hierarchy. Classically GPUs have traded large caches for large register files, because they have lots of thread contexts to store. The two are somewhat interchangeable: the thread count can be reduced, which saves register space, but then the caches have to grow to achieve high hit ratios. Echelon is designed to include 320 MB of SRAM.
Whatever... nothing valuable will come out if I continue to argue with you
It was just getting interesting. I had never deeply thought about why scalar and vector code coexist peacefully in my profiling results until you considered there to be an issue.
 
Low-end GeForce 2 MX variants also sold really well at the time. I'm not sure what your point is with that comparison.
Yes, and older-generation GPUs are also still sold today. But my point was that, looking at the latest generation of hardware in each time frame, we went from a very small range in performance between the low-end and the high-end offering to a very large one.

This tells us something about people's graphics performance expectations. An integrated GPU is considered adequate by a very significant portion of the market. Hence CPU-GPU unification will start to make sense much sooner than when it can match today's high-end GPUs.
 
I don't have a strong opinion one way or the other, but kudos for staying cool headed when others are not.
 
I think he's saying there is a market for very low-powered (low power in fps, not watts) graphics, and that's true: there are people who need nothing other than desktop compositing, and CPU-rendered gfx would work for them. But where I disagree is that this shows a trend that will move up, and that one day (I'm talking next 10 years, not next 25+) a GPU-less system will be enough for everyone, that the power of software renderers is increasing faster than the rate of complexity in games, or that we will reach a time when devs say "ok, our games are pretty enough, we won't improve them anymore".
To be fair, the number of artistic choices which are no longer technology limited is increasing. Many games don't aim for hyperrealism.


Yes, and older-generation GPUs are also still sold today. But my point was that, looking at the latest generation of hardware in each time frame, we went from a very small range in performance between the low-end and the high-end offering to a very large one.
I don't really see it. It certainly took some time for the GPU market to mature to a point where creating low-end models even made sense. But you can't compare the entire market today to the range of performance bins of a single chip which was never meant to be a low-end offering. The span has been huge for over a decade.

No doubt integrated GPUs are popular these days. Someone might try unification, but it would be a disruptive change. There are significant hurdles to overcome. Some technical ones like power draw, but also getting developer support and the impact on market segmentation. After all, a chip replacing low-end CPU+GPU combinations would have to be much more powerful as a CPU alone to fulfil the same roles. But that puts it in direct competition with midrange offerings for customers willing to pay more, potentially lowering margins substantially.


Here's a little fact I found interesting, though: all the major GPU companies (desktop and mobile) also design CPUs.
 
No doubt integrated GPUs are popular these days. Someone might try unification, but it would be a disruptive change. There are significant hurdles to overcome. Some technical ones like power draw, but also getting developer support and the impact on market segmentation. After all, a chip replacing low-end CPU+GPU combinations would have to be much more powerful as a CPU alone to fulfil the same roles.
That's why it would make sense to fuse the vector units: no special dev support needed, no disruptive change. Haswell has about half a TFLOP/s on the CPU side and even more than that on the GPU side, but depending on what you use, half the die is idle. It doesn't burn much power (probably), yet it's such a waste. That's just going to increase in the future: Broadwell will probably have a twice as powerful GPU (rumors :) ), and Skylake will support 512-bit SIMD -> twice the units. It wouldn't be smart to have an idle TFLOP/s on either side. The other parts (x86-specific logic and fixed-function units on the GPU) won't increase in size dramatically.
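The 'half a TFLOP/s on the CPU side' figure roughly checks out; a back-of-the-envelope sketch (the clock speed is an assumption):

```
// Rough peak single-precision throughput of a quad-core Haswell.
// The 3.5 GHz clock is an illustrative assumption.
constexpr double cores         = 4;
constexpr double fma_units     = 2;     // two 256-bit FMA ports per core
constexpr double lanes         = 8;     // 256-bit / 32-bit floats
constexpr double flops_per_fma = 2;     // multiply + add
constexpr double ghz           = 3.5;
constexpr double gflops = cores * fma_units * lanes * flops_per_fma * ghz;  // ~448 GFLOPS
```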


Here's a little fact I found interesting, though: all the major GPU companies (desktop and mobile) also design CPUs.
matrox? :D
 
It would be easier to sum up the days I haven't run a profiler.

You're trying to turn something into a problem that really just isn't an issue at all. On a contemporary quad-core with a 256-bit store port per core, it would take 65,536 cycles to 'erase' an 8 MB L3. That's a lot of cycles before a scalar instruction would absolutely have to get its data from RAM (which typically still takes less than 100 cycles). In a more typical scenario it takes far longer for the L3 to be trashed, because data is actually being reused. Of course the scalar code's L3 miss can happen well before the very last cache line is evicted, but the cache is 16-way associative, which doesn't behave very differently from an infinitely associative cache.
A single image buffer/texture trashes the slow L3 cache completely! Your 8 MBytes are nothing - NOTHING - in modern graphics.

Unification makes the code far simpler, not more complex. If you know how to write a loop with independent iterations, you know how to implicitly mix scalar and vector code. Heterogeneous computing is way more complicated than that if you want acceptable results.
We are not talking about a Java lecture class teaching how to draw something on the screen in the easiest and most elegant way! We are talking about how you can do the best graphics with the limited resources given!
Having a unified processing pipeline requires additional explicit commands in order to tell the CPU that we don't want to trash the caches, to stream data, etc., and trying to write efficient code without any structured boundaries is logistically a contradiction.
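For context, the kind of explicit command being referred to here is presumably a non-temporal store intrinsic; a minimal sketch (the function name is made up) of streaming writes past the caches:

```
#include <immintrin.h>

// Fill a buffer with non-temporal (streaming) stores so the written data
// does not displace other data from the caches. Assumes 'dst' is 32-byte
// aligned and 'n' is a multiple of 8.
void fill_streaming(float* dst, float value, int n)
{
    const __m256 v = _mm256_set1_ps(value);
    for (int i = 0; i < n; i += 8)
        _mm256_stream_ps(dst + i, v);   // bypasses the cache hierarchy
    _mm_sfence();                       // order the streaming stores
}
```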

Separating critical code sections for finely tuned processing of sorted data, with as little interaction with outside code as possible, is the most fundamental principle for performance efficient software on the CPU, GPU, APU, everywhere.
Showing examples where a scattered access to a data stream here and there can improve something is not an argument for your radical denial that there are problem classes which will always be processed fundamentally more efficiently by hardware that is designed to process them.

I won't even start to argue with your view that the failure of Larrabee as a GPU is no argument against your agenda... :rolleyes:
 