Software/CPU-based 3D Rendering

I think vertex and pixel unification was free, as it meant chip architects no longer had to over-provision the separate units to account for the cases in which you are completely bottlenecked by one or the other.
It's not free. You notice that especially on mobile hardware that moves from separate units to unified: suddenly all math is done with 32-bit floats instead of cheaper mid-precision halves.

You make many good points, so I won't focus on those, but I disagree that previous fixed-function tessellators didn't catch on because of limited flexibility. DX11 tessellation doesn't have much more flexibility outside of the hull shader, and most uses of DX11 tessellation so far haven't taken advantage of the additional flexibility.
It's probably not visually obvious, but they do. There is culling, density-based vertex offset (as you can sample textures in the domain shader), shadows where the tessellation is calculated from the camera view rather than the shadow view (so not just some distance-based magic), seam-fixing of displacement on the borders of patches, etc. Those just would not work the old-school way, imo.

@Nick, as you seem to have some in-depth knowledge, how much do hardware rasterizers cost? 20 years ago they were already part of the chips, and those had like 5M transistors in total. Nowadays they spit out more pixels, of course, but it's still 1-4 of them on a chip and it doesn't seem that expensive.
It seems somewhat pointless so far to argue "it's too expensive" vs. "fixed function makes sense" if there are no numbers to argue about.
 
You make many good points, so I won't focus on those, but I disagree that previous fixed-function tessellators didn't catch on because of limited flexibility.

(...)

The multi-pass nature of previous methods might have been an issue for some people, but I view that as a potential performance issue not a case of limited flexibility.
Yes, it was basically a performance issue. However, the tight 16.6 ms frame budget means that performance plays a crucial role in determining the usable feature set. Performance and feature set are tightly linked together in real-time graphics.

The old ATI tessellator was fast enough when you only needed constant edge tessellation factors (or static / infrequently changing CPU-generated edge factors). For dynamic continuous tessellation you could generate these factors on the GPU every frame, but that meant writing the factors to memory and reading them back, and thus the performance usually wasn't good enough. I would argue that the tessellator data paths were not flexible enough to fetch these factors directly from a general purpose on-chip memory (and also that the vertex shader data paths were not flexible enough to push this data into the same on-chip cache on demand for the tessellator).

This is a common problem with fixed-function hardware: it suits certain usage patterns very well, but once you start stretching the usage pattern, the performance doesn't degrade gracefully (as you often need to start doing parts of the work on programmable hardware, and that often means going through main memory). Fully programmable hardware is nice because it always allows you to move data from one part of the algorithm to another through the L1 cache, since all the operations are done by the same general purpose processing unit. The performance also degrades nicely when the data set spills out of L1 (as L2 and L3 back up L1 automatically).
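Just to illustrate the data-flow difference (a toy C++ sketch of the access pattern, not anything hardware specific, and the Edge struct and factor formula are made up): the fixed-function route behaves like the two-pass version below, where the intermediate factors round-trip through a big buffer in memory, while a fully programmable pipeline can fuse producer and consumer so the intermediate value never leaves the registers / L1.

```cpp
#include <cmath>
#include <vector>

struct Edge { float v0[3], v1[3]; };

static float edge_length(const Edge& e) {
    float dx = e.v1[0] - e.v0[0], dy = e.v1[1] - e.v0[1], dz = e.v1[2] - e.v0[2];
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// "Fixed-function style": pass 1 writes the tessellation factors to a buffer,
// pass 2 consumes them. On real hardware that is a round trip through memory.
float two_pass(const std::vector<Edge>& edges) {
    std::vector<float> factors(edges.size());
    for (size_t i = 0; i < edges.size(); ++i)      // pass 1: produce the factors
        factors[i] = edge_length(edges[i]) * 4.0f;
    float sum = 0.0f;
    for (size_t i = 0; i < edges.size(); ++i)      // pass 2: consume the factors
        sum += factors[i];
    return sum;
}

// "Programmable style": producer and consumer fused, so each factor lives only
// in a register, and performance degrades gracefully with the working set size.
float fused(const std::vector<Edge>& edges) {
    float sum = 0.0f;
    for (const Edge& e : edges)
        sum += edge_length(e) * 4.0f;
    return sum;
}
```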
(...)most uses of DX11 tessellation so far haven't taken advantage of the additional flexibility.
That's understandable, as most game developers haven't yet moved to tessellation-based content pipelines. DX11 features (tessellation and compute) on PC are currently mostly used for additional eye candy at ultra-high detail settings (and not as main building blocks of the graphics engines). We need to wait for the current generation of consoles (and DX9/DX10 PCs) to fade out. At that point all studios can use tessellation and compute widely in their content pipelines and graphics engines. Not as extra effects, but as main building blocks.

I don't think this is a hardware problem. Software is just underdeveloped currently. I am sure that we will see many new innovations in the coming years, as the next generation is starting and lots of additional graphics programmers start to use their brainpower to figure out how to exploit programmable DX11 hardware in the best way.
 
It's not free. You notice that especially on mobile hardware that moves from separate units to unified: suddenly all math is done with 32-bit floats instead of cheaper mid-precision halves.
That's true on mobile, but not on the desktop. When ATI moved to a unified architecture, pixel shaders were already using 32-bit floats.
 
That's probably because ATI had already paid the price when they had to move from 24-bit to 32-bit to comply with SM3 requirements. NVidia, on the other hand, was still faster when you worked with halves.

So it's probably 'right' that it was free for ATI, but if SM3 had not enforced FP32, being unified would have cost them.

I also wonder how 'free' it really is in terms of resource usage. I was once told that although the units are unified, there is still some limit to how freely they can be distributed, as some pipelines might queue up for writing to some buffer, and if there was no unit processing that buffer, you'd stall. So technically it's 'free', but in practice you still have some units dedicated to pixel processing to free up buffers, in the worst case.
 
What you were told is basically correct, though that doesn't increase the cost of unification, only the complexity.
 
I think vertex and pixel unification was free, as it meant chip architects no longer had to over-provision the separate units to account for the cases in which you are completely bottlenecked by one or the other.
Oh, so the architects were wrong to ever have them separate?

Obviously we have to look at a much longer timeline. In the days of Shader Model 1.x, unification was an insane idea due to the very high cost. With Shader Model 2.x's floating-point processing, some people began to theorize about it, but the two types of units were still highly specialized at their respective tasks. It was only with Shader Model 3.0 that unification became the obvious next step, with the cost compensated by the advantages.

We're now somewhere around the equivalent of the 2.0 model for CPU-GPU unification. There's overlap between the types of processing they can do, but they each have a high degree of specialization for different tasks. With AVX-512 on the CPU side and fully unified memory spaces with Echelon-like low-latency high-ILP cores on the GPU side, we'll probably be at the equivalent of the 3.0 model. There will still be differences, but the cost of unification in the next generation will be compensated by the advantages.

Load balancing to avoid bottlenecks is definitely one of them. Today's game developers have to account for a wide range of setups with a slow or fast CPU and a slow or fast GPU. What's more, they have to maintain well balanced concurrency at any given point in time to avoid bottlenecks, which isn't trivial at all when there's high latency and low bandwidth between the CPU and GPU.

If you think today's console ports are bad, wait until new games assume an APU-like architecture where CPU-GPU communication is relatively fast. Mark my words: discrete GPUs will suffer. People might blame the game developers for a while, but really the idea of keeping the CPU and GPU far apart is not sustainable. The only direction things will evolve is closer together, and the costs involved will become negligible compared to the advantages of having a homogeneous architecture.
 
Nick said:
The only direction things will evolve is closer together, and the costs involved will become negligible compared to the advantages of having a homogeneous architecture.
While I agree about the advantages of moving to a single die, the part about the costs of going completely homogeneous just screams bullshit to me.
 
Oh, so the architects were wrong to ever have them separate?

Obviously we have to look at a much longer timeline. In the days of Shader Model 1.x, unification was an insane idea due to the very high cost. With Shader Model 2.x's floating-point processing, some people began to theorize about it, but the two types of units were still highly specialized at their respective tasks. It was only with Shader Model 3.0 that unification became the obvious next step, with the cost compensated by the advantages.
I never said it was wrong to have them separate. Only that the unification cost was essentially free at the time they were unified.
 
@Nick, as you seem to have some in-depth knowledge, how much do hardware rasterizers cost? 20 years ago they were already part of the chips, and those had like 5M transistors in total. Nowadays they spit out more pixels, of course, but it's still 1-4 of them on a chip and it doesn't seem that expensive.
It seems somewhat pointless so far to argue "it's too expensive" vs. "fixed function makes sense" if there are no numbers to argue about.
I'm not arguing that a fixed-function rasterizer is too expensive. To my knowledge it's really very cheap. So it's not about the cost of the unit itself, but about the cost of the limitations it imposes on the rest of the architecture and the diminishing value of keeping it.

As an analogy, fixed-function vertex and pixel processing would be very cheap too by today's standards. And when all you're doing is something like compositing a webpage, it would be more efficient than using today's floating-point programmable cores. But when you're trying to render complex scenes you really don't want to be doing it with multi-pass techniques on fixed-function hardware. So regardless of how cheap the hardware is, that's not a sufficient reason to keep it around. I think it's only a matter of time before developers want to take graphics to the next level and find the current rasterizer unit too limited as well. You could try to extend its capabilities or even make it programmable, but then you quickly reach the point where you can just let the programmable cores take over. We've already seen that happen with anti-aliasing techniques.

Getting rid of the rasterizer would also be a consequence of unification regardless of why it happens. For argument's sake, let's agree that after AVX-512 becomes mainstream, the next logical step will be to at least unify the CPU and the integrated GPU's programmable logic. Say it's because it is cost-effective for those who have little demand for hardcore gaming, but any other argument works. Then the question becomes what to do with the remaining fixed-function logic. Personally I don't think you can keep the rasterizer unit as we know it today. How would that even work? I'm really open to suggestions here. But as far as I can tell the best you can do is have some specialized instructions.
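To make that a bit more concrete, here is roughly what the core of rasterization looks like once the programmable cores take over. This is only a bare-bones scalar sketch of a half-space (edge function) rasterizer, nothing like an optimized implementation, but it shows the work reduces to the kind of multiply-add and compare operations that wide SIMD cores are good at:

```cpp
// Minimal half-space triangle rasterizer (scalar, counter-clockwise winding).
// Purely illustrative; a real software rasterizer would tile the screen and
// evaluate many pixels per iteration with SIMD.
#include <algorithm>
#include <cstdint>
#include <vector>

struct Vec2 { float x, y; };

// Twice the signed area of triangle (a, b, c); >= 0 when c lies to the left of edge a->b.
static float edge(const Vec2& a, const Vec2& b, const Vec2& c) {
    return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
}

void rasterize(const Vec2& v0, const Vec2& v1, const Vec2& v2,
               int width, int height, std::vector<uint32_t>& framebuffer) {
    // Bounding box of the triangle, clamped to the screen.
    int minX = std::max(0, (int)std::min({v0.x, v1.x, v2.x}));
    int maxX = std::min(width - 1, (int)std::max({v0.x, v1.x, v2.x}));
    int minY = std::max(0, (int)std::min({v0.y, v1.y, v2.y}));
    int maxY = std::min(height - 1, (int)std::max({v0.y, v1.y, v2.y}));

    for (int y = minY; y <= maxY; ++y) {
        for (int x = minX; x <= maxX; ++x) {
            Vec2 p = { x + 0.5f, y + 0.5f };              // pixel center
            float w0 = edge(v1, v2, p);
            float w1 = edge(v2, v0, p);
            float w2 = edge(v0, v1, p);
            if (w0 >= 0 && w1 >= 0 && w2 >= 0)            // inside all three edges
                framebuffer[y * width + x] = 0xFFFFFFFFu; // "shade" the pixel white
        }
    }
}
```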

So I'm not against fixed-function hardware in itself. Especially not because of cost. I just don't think there's much of a future for it as it becomes too limited and can't be fit into unified architectures.
 
While I agree about the advantages of moving to a single die, the part about the costs of going completely homogeneous just screams bullshit to me.
Thank you for your extensive list of compelling arguments which did not apply to vertex/pixel processing unification but will indefinitely prevent CPU-GPU unification...

Seriously, anyone can call a theory bullshit without having spent a smidgen of thought on it. But good job on agreeing about the advantages of something that has already taken place years ago and has proven itself in practice. Do you also agree on the benefits of a unified address space? And how do you feel about the idea that ROP units may be eliminated as early as the next generation? Just curious about your thought process.
 
I never said it was wrong to have them separate. Only that the unification cost was essentially free at the time they were unified.
Yes, but that is stating the obvious. We all know how vertex/pixel unification panned out and that the cost was smaller than the advantages at the time they unified. So I was digging for a reason why you stated it in the first place. Correct me if I'm wrong, but your response and what you responded to seemed to imply that you think there's a fundamental difference between the cost of vertex/pixel processing unification and the cost of CPU-GPU unification. I just wanted to point out that the cost of vertex/pixel processing unification was huge when looking at a larger timeline than the blatantly obvious "time they were unified". The cost for CPU-GPU unification, at this point in time, is smaller than that and continues to get smaller. Which indicates that sooner or later the CPU and GPU will have their own time when the cost for unification will be essentially free due to the advantages.
 
Nick said:
Just curious about your thought process.
No you aren't. And AFAICS you are not concerned with anyone else's thoughts but your own. You routinely talk past valid points others bring up (limitations of silicon, power consumption, die area, cost), and offer no tangible evidence of your own either to support your own claims or dismiss those of others.

Nick said:
Do you also agree on the benefits of a unified address space?
Yes.
In fact, I also see the advantages of a unified ISA (or at least one being a superset of the other). But whether those will ultimately be worth it or even necessary is unclear to me. I would also like to point out that even if you have a unified ISA that does not mean you have to have a pool of identical cores.

Nick said:
And how do you feel about the idea that ROP units may be eliminated as early as the next generation?
Ambivalent?

Nick said:
Seriously, anyone can call a theory bullshit without having spent a smidgen of thought on it.
Right, you know how much thought I have put into it. As you are apparently clairvoyant, perhaps you could provide me with this week's winning lottery numbers and we can put the accuracy of your predictions to a practical test.
 
Yes, but that is stating the obvious. We all know how vertex/pixel unification panned out and that the cost was smaller than the advantages at the time they unified. So I was digging for a reason why you stated it in the first place. Correct me if I'm wrong, but your response and what you responded to seemed to imply that you think there's a fundamental difference between the cost of vertex/pixel processing unification and the cost of CPU-GPU unification. I just wanted to point out that the cost of vertex/pixel processing unification was huge when looking at a larger timeline than the blatantly obvious "time they were unified". The cost for CPU-GPU unification, at this point in time, is smaller than that and continues to get smaller. Which indicates that sooner or later the CPU and GPU will have their own time when the cost for unification will be essentially free due to the advantages.
I think unifying vertex and pixel shaders has little bearing on the utility of unifying the CPU and GPU into a homogeneous architecture. It's a statement that, as pixel processing got more programmable, unification with vertex processing made sense, and it does not indicate what will happen in the future.

Many will agree that GPUs will get more programmable. It is less obvious that the end game of programmability is a homogeneous architecture. I don't claim to know how things will end up and keep an open mind about the options, yet if a homogeneous architecture does win out, I think it is far in the future.

I don't know the workstation timeline, but in the consumer market hardware vertex processing started in 1999 and unification happened in 2005. It would have happened a year or more earlier had ATI not messed up with R400. That was a quick transition because the benefits were obvious.

You say that at this point the cost of CPU-GPU unification and software rendering is smaller than that of vertex-pixel unification. I disagree.
 
Just curious about your thought process.
No you aren't.
Yes I am. Do you think I have these lengthy discussions just to kill time? No, I enjoy computer graphics technology and to understand where it's heading I am genuinely interested in why people come to certain conclusions.
AFAICS you are not concerned with anyone else's thoughts but your own. You routinely talk past valid points others bring up (limitations of silicon, power consumption, die area, cost), and offer no tangible evidence of your own either to support your own claims or dismiss those of others.
If I was only concerned about my own thoughts I would keep them to myself and not put them out in the open for everyone to criticize. Heck I invite people to criticize them, because I am interested in their thoughts. I invite you to criticize them. But be prepared to engage in a discussion where I will formulate counter-arguments. That's not the same thing as talking past valid points. I take those points very seriously and for unification to happen they would have to be overcome.

Regarding offering "tangible evidence", let's get it straight that this isn't a trial for a past event. We're all speculating about the future, and there is no tangible evidence that the CPU and GPU will not unify either. All we have is evidence of past evolution and the assumption that it will continue, or indications that it won't. The evidence of past convergence between the CPU and GPU is strongly in favor of unification and has thus far overcome the many arguments that are used against it continuing. There is also tangible evidence of things that have converged and then unified in the past. So like it or not, it's harder to defend the claim that unification won't happen.

If there are indications that this convergence will turn around, which you think I'm underestimating, then I'm most interested in hearing about your reasoning.
Yes.
In fact, I also see the advantages of a unified ISA (or at least one being a superset of the other). But whether those will ultimately be worth it or even necessary is unclear to me. I would also like to point out that even if you have a unified ISA that does not mean you have to have a pool of identical cores.
Wow, major flashback! It's uncanny how almost your exact words were used to discuss vertex/pixel processing unification on the GPU. A lot of people considered it an advantage to have mostly the same instructions available for both, but it was still unclear to them what the exact value would be, and they didn't consider physical unification to be a necessary consequence of it. Guess what, the GPU did get a fully unified ISA, the advantages for creating more advanced graphics (and beyond) are huge, and as far as I know they all have unified cores.

So how is your desire for a unified address space and a homogeneous ISA fundamentally different and not in favor of CPU-GPU unification? What exactly have I been underestimating to such an extent that it "screams bullshit" to you?
Ambivalent?
I am simply asking you to take a stance and provide some argumentation of your own. But if you're really ambivalent about the ROPs unifying into the shader cores, why are you unambivalent about CPU-GPU unification? Again, a penny for your thoughts.
Right, you know how much thought I have put into it.
With all due respect, if this is the best you can do, you don't seem to have put much thought into it at all. You just rehash what others have said without making the arguments your 'own' by defending them with insight, instead of going off on a personal tangent. But I'll give you the benefit of the doubt once more. Enlighten me with your wisdom. You might want to start with the ROPs.
 
I'll give you my very simple reasoning:
CPU rendering is too slow for the people who need rendering.
You could argue it's getting faster, but is it?
Run Crysis in software on the best CPU available when SwiftShader was released.
Run it again on the best CPU available now.
Run it again on the best GPU available when SwiftShader was released.
Run it again on the best GPU available now.
What's shown the biggest speedup?
 
Nick said:
So how is your desire for a unified address space and a homogeneous ISA fundamentally different and not in favor of CPU-GPU unification?
Because I don't avoid the reality that power consumption, heat, die size, and cost make this an unlikely proposition, however theoretically attractive?

There is a difference between wanting something to happen and believing it will.

I want to win the lottery. Do I believe I am going to?
 
I think unifying vertex and pixel shaders has little bearing on the utility of unifying the CPU and GPU into a homogeneous architecture. It's a statement that, as pixel processing got more programmable, unification with vertex processing made sense, and it does not indicate what will happen in the future.
It's not just pixel processing that got more programmable. Vertex processing first evolved from fixed-function to programmable too. It also borrowed features from pixel processing such as texture lookups, before unification made sense. My point is that convergence happened from both ends. I find it surprising that you think this has little bearing on the value of CPU-GPU unification, despite the very similar double convergence. CPUs became multi-core, got SIMD units that kept widening, they got instructions such as FMA and gather, they can do SMT, etc. GPUs became programmable, got generic floating-point and integer instructions, data access was generalized and caches were added, the arithmetic latencies lowered from hundreds to only a few clock cycles, etc.

But perhaps you consider these to be meaningless facts which may show similar convergence up to now but have no correlation between them that would result in a similar outcome? What ties them together is that the underlying forces which caused each of these things are really the same. There are hardware and software aspects. Hardware gets cheaper as you can share resources. As you indicated yourself, you no longer have to over-provision different structures to prevent them from being a bottleneck. This applies to CPU-GPU unification as well. It also reduces the cost of design and validation, which should not be underestimated. Software also was and is a huge motivator for unification. Having the same capabilities in each shading stage and not having to worry about balancing made the programmers' lives easier, and allowed them to develop new algorithms and applications. Likewise CPU-GPU unification will simplify things and create boundless new possibilities.
Many will agree that GPUs will get more programmable. It is less obvious that the end game of programmability is a homogeneous architecture. I don't claim to know how things will end up and keep an open mind about the options, yet if a homogeneous architecture does win out, I think it is far in the future.
I think you're suffering from retrospective bias. You perceive shader unification as obvious, because you already know it happened. To eliminate that bias you have to put yourself back in the timeframe well before it happened and observe that it was far from obvious. Or if you do observe that there was convergence taking place and fundamental forces driving it toward unification, you should compare that to today's situation...

Did the hardware and software engineers at any point say "vertex and pixel processing has become programmable enough now, we'll stop right here and not unify them"? No! They did unify, and what's more, GPUs as a whole continued to become more programmable. So I see no reason to assume that the desire for more programmability of the GPU and more computing power for the CPU will die out. AVX-512 is derived from the Xeon Phi ISA and would widen the CPU's SIMD units yet again and add more GPU features such as predication. Meanwhile GPUs will add virtual memory support that matches that of the CPU. That's a significant step in programmability, but it will still leave developers wanting more.
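For what that predication looks like on the CPU side, here is a small sketch with AVX-512 mask registers (just an illustration of the feature; the function and values are made up). The mask plays the same role as a GPU's per-lane execution mask:

```cpp
#include <immintrin.h>

// Per-lane "if (x > 0) x = x * scale + bias;" without branching.
// Requires an AVX-512F capable CPU.
__m512 predicated(__m512 x, __m512 scale, __m512 bias) {
    __mmask16 active = _mm512_cmp_ps_mask(x, _mm512_setzero_ps(), _CMP_GT_OQ);
    // Lanes with the mask bit set get x * scale + bias; inactive lanes keep x.
    return _mm512_mask_fmadd_ps(x, active, scale, bias);
}
```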

The underlying reason for that is consumer demand. Hardcore gamers want new experiences. So once performance is fine, transistors and power should be spent on enabling developers to create such new experiences. This has been a driving force for shader unification, and continues to be a driving force for CPU-GPU convergence.
I don't know the workstation timeline, but in the consumer market hardware vertex processing started in 1999 and unification happened in 2005. It would have happened a year or more earlier had ATI not messed up with R400. That was a quick transition because the benefits were obvious.
That timeline doesn't start with 'hardware' vertex processing. For a while the "obvious" thing to do was to use the CPU for vertex processing, and the GPU for pixel processing. So once again I think you're suffering from some retrospective bias. Back then the idea of not only moving vertex processing to the GPU, but also making it fully programmable, making pixel processing use floating-point, and unifying it all was ridiculous due to the extreme cost. In contrast, an 8-core CPU with AVX-512 at 14 nm would be small and cheap, and with 2 TFLOPS of computing power its graphics capabilities would be nothing to sneeze at. And that's not even the point in time when I'm expecting CPU-GPU unification to happen. There's much to be gained from adding some specialized instructions to overcome the remaining cost.
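For reference, the 2 TFLOPS figure is just peak FMA arithmetic under some assumptions of mine, namely two 512-bit FMA ports per core and a clock around 4 GHz, neither of which is a confirmed spec:

```cpp
// Peak single-precision throughput of a hypothetical 8-core AVX-512 CPU.
constexpr double cores      = 8;
constexpr double fma_ports  = 2;        // assumed 512-bit FMA units per core
constexpr double lanes      = 512 / 32; // 16 single-precision lanes per unit
constexpr double flops_fma  = 2;        // one multiply plus one add
constexpr double clock_hz   = 4.0e9;    // assumed clock frequency
constexpr double peak_flops = cores * fma_ports * lanes * flops_fma * clock_hz;
// peak_flops == 2.048e12, i.e. roughly 2 TFLOPS
```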
 
I'll give you my very simple reasoning:
CPU rendering is too slow for the people who need rendering.
I'm sorry, but your reasoning is way too simple. It's not about what the situation is today, it's about what could be realized when significant resources are put behind it. That's what progress is about. You don't judge ideas by how well they're executed today, but by looking at their future potential.
You could argue it's getting faster, but is it?
Run Crysis in software on the best CPU available when SwiftShader was released.
Run it again on the best CPU available now.
Run it again on the best GPU available when SwiftShader was released.
Run it again on the best GPU available now.
What's shown the biggest speedup?
Your vision is extremely limited if you evaluate it like that. For starters, it is not making use of AVX2, FMA, TSX, gather, or F16C, so you're not even looking at today's full potential. Then there's still the direct future potential of AVX-512, and beyond that it is conceivable that a few new instructions could speed it up even more.
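To give an idea of the kind of instructions being left on the table, here is a rough sketch assuming an AVX2 + FMA + F16C capable CPU (illustrative routines of mine, not actual SwiftShader code):

```cpp
#include <immintrin.h>
#include <cstdint>

// Fetch 8 texels from arbitrary offsets with a single AVX2 gather, then apply
// a fused multiply-add (texel * intensity + ambient) for 8 pixels at once.
__m256 shade8(const float* texture, __m256i texelIndices,
              __m256 intensity, __m256 ambient) {
    __m256 texels = _mm256_i32gather_ps(texture, texelIndices, 4); // AVX2 gather
    return _mm256_fmadd_ps(texels, intensity, ambient);            // FMA
}

// F16C: expand 8 half-precision values to full floats in one instruction,
// i.e. the cheap mid-precision storage format GPUs have used for years.
__m256 load_half8(const uint16_t* src) {
    __m128i packed = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
    return _mm256_cvtph_ps(packed);
}
```

Gather alone turns eight dependent texture fetches into a single instruction, which is exactly the kind of memory access pattern a software texture sampler spends its time on.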

Next, it's not really sensible to compare the best available hardware. Titan is one heck of a GPU, but its market share is ridiculously small. Likewise there are 10-core CPUs which are even more expensive, but it's dubious to label them as best since they lack AVX2 for now. But even if we settle on, say, a $200 budget, that will buy you a pretty decent quad-core CPU with AVX2 (although the integrated GPU is stealing die space that could nearly double the core count), while a discrete GPU is helpless by itself, so you get maybe a GT 640 and a Pentium-branded CPU. Note also that while the second setup will still be faster at graphics, it will suck at everything else.

So compared to today's software rendering, there is potential for 16x higher floating-point computing power a few years from now in the same price class. GPUs have nowhere near that kind of potential over the same time span.
 
Will performance ever be fine?
Do you ever see a time when people will say "that's it, no need to ever buy a new GPU, this one is powerful enough"?
You completely missed the point. It's about spending transistors and power on performance versus on programmability. In that sense, yes, performance is definitely fine (as in, good enough for now), since people continuously want more programmability too, instead of spending everything on maximizing the performance of legacy functionality.

As I've discussed before, they are related. You don't want an uber-fast fixed-function graphics card that would render UT2003 at a thousand FPS but would require a million passes to render Battlefield 3. Even if the arithmetic were fast and high-precision, it would still end up being slower due to being bandwidth limited. Note that the address space unification for next-generation GPUs is also advertised as a programmability feature that will improve performance by optimizing bandwidth usage.

So increasing programmability is both desirable and a strict must. It will continue to bring GPU architectures closer to the CPU, while at the same time CPUs gain many of the same qualities as GPUs, since clock frequency and ILP are practically depleted but TLP and DLP can continue to scale.
 