Fast software renderer

Discussion in 'Rendering Technology and APIs' started by Voxilla, Jul 19, 2009.

  1. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    711
    Likes Received:
    282
    It must be like how our brain works. Seriously, specialization has its benefits. You would not want to emulate floating point with integer arithmetic.
     
  2. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    All the descriptions I've seen indicate SMT mode has each thread take turns fetching instructions at the front end.
    That's just one cycle of hidden latency for a branch that won't resolve for another 15 cycles (best case).

    Branch prediction doesn't help with data cache misses, and an instruction cache miss is going to cause a stall either instantly or in a cycle if there's some kind of target instruction buffer.

    For a subset of the workloads out there, this is a worthwhile compromise.

    The point of conflict I see here is that the design compromises necessary for executing serial code as fast as possible impact execution in all non-serial cases, and also have knock-on effects on other parts of the system.

    The focus of that paper was on latency and performance impact, which is not exactly what I am focusing on.

    As for the approximation you are using:
    What is the mispredict rate you've chosen? Is it a 10% chance of misprediction per individual branch, or the cumulative probability of a misprediction somewhere in a 64-instruction window with 6 branches in that range?
    Successive branches compound, so even reasonable per-branch misprediction rates can leave around 30% of ROB entries uncommitted.
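    A quick sketch of the difference between those two readings (the rate and the branch density here are illustrative assumptions, not figures from the paper, and branches are treated as independent):

    ```cpp
    // Per-branch mispredict rate vs. the chance of at least one mispredict
    // somewhere in a window containing several branches.
    #include <cmath>
    #include <cstdio>

    int main() {
        const double per_branch = 0.05;    // assumed mispredict rate per branch
        const int branches_in_window = 6;  // e.g. 6 branches in a 64-instruction window
        const double window_rate =
            1.0 - std::pow(1.0 - per_branch, branches_in_window);
        std::printf("per-branch %.0f%% -> window %.1f%%\n",
                    per_branch * 100.0, window_rate * 100.0);  // ~26.5%
    }
    ```
    Which reading you pick changes the conclusion considerably, hence the question.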

    What do you mean by IPC in this case?
    Barring an I-cache miss, a good 4-wide speculative processor is going to push close to 4 instructions through the front end of the pipeline every clock.
    Referencing the stacked penalty model in that paper, the front-end penalty is going to be over 3 times higher than that of the 5-stage pipeline chosen in the model.

    100% of speculatively issued instructions go through the pipeline up to the final point of execution, whether they turn out to be mispredicted or not. In terms of ROB entries that never commit, that's 30% of instructions passing through hardware that, as a fraction of the non-cache core area, is close to two-thirds or more active logic.
    This is a fixed power cost 100% of the time.
    Some as yet undetermined percentage of the wasted instructions will go so far as to execute and have their results pending in the ROB when they are negated, depending on the situation. That is the decode+schedule power that all speculated instructions draw, plus a certain amount of execution-unit consumption, for which I have no figures but which will vary depending on the operation type.
    Loads and stores in i7 are very heavily speculated, more so than what is already the case in other OoO chips.

    As GPUs probably try to do and Larrabee's VPU has been documented as doing, a known invalid lane is clock-gated.
    That's different from running an instruction through the pipeline and not knowing it is invalid until the end.

    While not desirable, silicon is relatively cheap in this situation.
    If all the chip is doing is drawing a wall or a 10 pixel character, whatever time it takes will be sufficiently fast for the purposes of the GPU's target market.

    That doesn't sound particularly compelling, in the face of other engineered solutions that aren't searching for a problem to solve.

    Right, CPUs don't keep much data on chip, they just send it off to the 8-12 MB of cache--oh I see what we have here.
     
  3. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    Strand sounds OK to me. Intel's terminology for Larrabee seems to make sense, so I'll stick to that as closely as possible in further discussions. Thanks for the suggestion.

    Dear god, CUDA's definitions are an abomination! ;)
    Yeah exactly. But I still stick to my opinion that software developers shouldn't have to care. They simply write the code for a single strand and instantiate it many times over an array of data. It's up to the framework and the device to determine how many strands/fibers/threads it should really be using. Just think of OpenCL, which makes no explicit mention of it, and can be implemented on a GPU, CPU, Cell, Larrabee, etc. By the way, OpenCL uses the terms kernel and work-item, clearly just to make sure that the last few people who were able to keep the terminologies apart go nuts as well. ;)
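    To make that concrete, a minimal sketch (plain C++, no particular API; the names are made up for illustration) of the "write one strand, instantiate it over the data" model:

    ```cpp
    #include <cstddef>
    #include <vector>

    // The developer's side: the body for a single strand/work-item
    // (a hypothetical per-element saxpy, just for illustration).
    inline void kernel(std::size_t i, float a, const float* x, float* y) {
        y[i] = a * x[i] + y[i];
    }

    // The framework's side: instantiate the kernel over the whole range.
    // How this maps onto threads, fibers or SIMD lanes is the device's business;
    // a real runtime would not use a plain serial loop like this.
    void launch(std::size_t n, float a, const std::vector<float>& x, std::vector<float>& y) {
        for (std::size_t i = 0; i < n; ++i)
            kernel(i, a, x.data(), y.data());
    }
    ```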
    I jumped the gun saying it's a mistake. I'm not worried about the 4-way SMT; they can adjust that in the future without far-reaching consequences, but the 16-wide SIMD is a different story. Future evolutions may demand widening (or narrowing) it, or making the units MIMD. But I realize now that's a software issue; more precisely a compiler issue. I just hope they succeed at getting people to adopt the entire platform, and don't run into backward compatibility issues forcing them to drag old choices along.
    GPUs have (texture) caches too... They are simply the best option to reduce bandwidth and latency, despite indeed containing some data that goes stale.

    You can either try loading the necessary data explicitly, or you can let caching heuristics do it for you automatically. Given the increase in complexity and the wide variety of algorithms, my vote goes to caches. Maybe not as much as 24 MB, but Larrabee's 4-8 MB sounds about right for a wide variety of workloads.
     
  4. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    The italics around apparently being the key point. (I can't find the original source of that 90% number.)

    Let's define resources in terms of area of the full chip.

    I can see how Crysis at the highest quality level on a G98 leaves major parts of the chip sitting idle, waiting for a shader to complete. But on a well-proportioned RV770 or G98 or GT200? Come on...
     
  5. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    What? So if someone manages to create a chip half the size that performs the same, it would be no big deal?
    I'm afraid you missed the whole point of the discussion. It's about (future) convergence. I don't care if today's CPUs waste 30% of their cycles on speculation; it has benefits too, and in the right proportion it could help GPUs reach higher efficiency. And not just speculation, every CPU concept. As Jawed clarified, NVIDIA already puts a form of out-of-order execution to good use. And Larrabee uses caches and SMT. The successful architectures of the future will not disregard any technique just because on today's CPUs it appears to have drawbacks that collide with the classic understanding of a GPU architecture. GPU manufacturers are fighting their own kind of MHz war, and it's about to end. Things like ease of programmability are becoming far more important than raw theoretical performance.

    Vice versa, CPUs have a lot to learn from GPUs, but they're quickly moving in the right direction. Today, a Core i7 delivers about 100 GFLOP/s. By 2012 the number of cores will have doubled, the vector width will have doubled, we'll have FMA support, and it will run at a slightly higher clock frequency. That's good for about 1 TFLOP/s. Soon after that we should see support for scatter/gather appear. That does become compelling.
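    That's just multiplying out the factors above (the clock bump is my rough guess at "slightly higher"):

    ```cpp
    #include <cstdio>

    int main() {
        const double today = 100.0;  // Core i7 today, ~100 GFLOP/s single precision
        const double cores = 2.0;    // core count doubled
        const double width = 2.0;    // vector width doubled (128-bit -> 256-bit)
        const double fma   = 2.0;    // fused multiply-add doubles peak throughput
        const double clock = 1.25;   // assumed "slightly higher" clock
        std::printf("%.0f GFLOP/s\n",
                    today * cores * width * fma * clock);  // 1000, i.e. ~1 TFLOP/s
    }
    ```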

    The most beautiful thing of all is that developers will be largely unrestricted by the API or any fixed-function hardware. As shown by FQuake, one can achieve amazing performance by doing things the way the developer intends to do them, instead of bending over backwards to work within the constraints of the API and the hardware. APIs won't disappear, but they'll become more of a framework, like OpenCL. Today's GPUs are still very much designed specifically to support the Direct3D pipeline. Everything is meticulously balanced to support that as best as possible. But as soon as you do something out of the ordinary you hit one bottleneck after the other, and that's happening a lot given the growing diversification of graphics techniques.

    Anyway, I have a hard time believing that GPU architects are sitting on their hands. Programmable texture sampling can be implemented using largely generic scatter/gather units and doing the filtering in the shader units. ROP is also destined to become programmable, and tessellation units could eventually become programmable rasterizers as well. By moving everything to programmable cores the de facto bottlenecks vanish. Larrabee has a head start, but possibly not for very long.
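    Filtering in the shader, for instance, is conceptually just this (a simplified single-channel, clamp-to-edge sketch; the struct and function names are made up for illustration):

    ```cpp
    #include <algorithm>
    #include <cmath>
    #include <vector>

    struct Texture {
        int width, height;
        std::vector<float> texels;  // single channel, row-major
        float fetch(int x, int y) const {            // the "gather" part
            x = std::clamp(x, 0, width - 1);
            y = std::clamp(y, 0, height - 1);
            return texels[y * width + x];
        }
    };

    // Bilinear sample at normalized coordinates (u, v) in [0,1],
    // done entirely in "shader" code instead of a fixed-function sampler.
    float sampleBilinear(const Texture& tex, float u, float v) {
        float fx = u * tex.width  - 0.5f;
        float fy = v * tex.height - 0.5f;
        int x0 = static_cast<int>(std::floor(fx));
        int y0 = static_cast<int>(std::floor(fy));
        float wx = fx - x0, wy = fy - y0;            // fractional weights
        float t00 = tex.fetch(x0, y0),     t10 = tex.fetch(x0 + 1, y0);
        float t01 = tex.fetch(x0, y0 + 1), t11 = tex.fetch(x0 + 1, y0 + 1);
        return (t00 * (1 - wx) + t10 * wx) * (1 - wy)
             + (t01 * (1 - wx) + t11 * wx) * wy;
    }
    ```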

    I don't think there's any question whether in the end the CPU or the GPU will 'win'. They'll both win, but each in their own domain. Discrete graphics cards will keep ruling the high-performance graphics market, while CPUs will become adequate for the low end. Compare it to sound processing: nowadays 95% of us don't have a discrete sound card any more; the CPU processes sound in a driver "on the side". With a CPU already capable of 1 TFLOP/s, there's no need to shell out for a second, more dedicated chip, unless you want the absolute latest in graphics.
     
  6. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    6,807
    Likes Received:
    473
    GPUs in fact have a far easier time using every CPU concept than x86 ... I don't think Intel is ever going to introduce split branches for instance.
     
  7. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,435
    Likes Received:
    263
    Does anyone besides Intel use this term? Nvidia, AMD, and Microsoft don't use it.
     
  8. DavidC

    Regular

    Joined:
    Sep 26, 2006
    Messages:
    347
    Likes Received:
    24
    What are the chances of someone with a Core i7 running this same benchmark? With SwiftShader and 3DMark2001, the Core i7 is significantly faster than the Core 2 Quad.

    http://www.forum-3dcenter.org/vbulletin/showthread.php?t=267950&page=6
     
  9. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    711
    Likes Received:
    282
    Nick has a Core i7, so he might tell.
    There seems to be a tendency for the total CPU usage percentage to drop on machines with more cores. The same happens at lower screen resolutions. It could be the driver part that copies the image from CPU to GPU memory. When I only display every 10th rendered frame, CPU utilization is well over 90%.
     
    #109 Voxilla, Aug 12, 2009
    Last edited by a moderator: Aug 12, 2009
  10. Scali

    Regular

    Joined:
    Nov 19, 2003
    Messages:
    2,127
    Likes Received:
    0
    I think nVidia already went this direction with the G80.
    They started adding features that have no meaning for graphics, and since G80 they really haven't added any graphics features; they've only improved CUDA.
    Mainly things like the shared cache and the double precision math. Rather radical changes from a hardware point-of-view, but Direct3D10 doesn't even know they exist. Direct3D11 only knows what to do with those things for compute shaders, not graphics.

    And in a slightly different direction, Intel also started experimenting with more programmability and less hardwired functionality in their IGPs. Part of the clipping and triangle setup is actually performed by special kernels. Same with acceleration of video decoding.
    Obviously Larrabee will be a similar approach... I wonder when nVidia and AMD are going to go down this route as well. Perhaps already with DX11 hardware?
     
  11. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Phew!

    Still, having lots of space for context on chip is currently the only solution. And with the performance gap between on-die memory and off-die memory only increasing, there's no alternative in sight.

    3D chip stacking is likely to lead to entire layers of nothing but memory - so on-chip memory is set merely to grow, not shrink. Regardless of the name you give the processor, CPU or GPU.

    Jawed
     
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Hmm, I was expecting a single thread to run at full speed. The other 3 threads would simply be idle.

    At least it'll have bags of cache :lol:

    Jawed
     
  13. Arwin

    Arwin Now Officially a Top 10 Poster
    Moderator Legend

    Joined:
    May 17, 2006
    Messages:
    17,682
    Likes Received:
    1,200
    Location:
    Maastricht, The Netherlands
    I looked into this, and from what I read you can configure each core to run 1 to 4 threads, which would suggest that if you configure it to run with 1 thread, it would run at full speed.
     
  14. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    Your example isn't one where such a chip exists. It's an example of a chip that, even at significantly reduced graphics settings, falls an order of magnitude short, while a non-speculating chip with a similar TDP runs circles around it.

    If the example actually did exist, I would obviously go for the smaller chip, unless the bigger chip also came with a free unicorn.

    Speculation has benefits where it is appropriate, depending on the performance criteria of a design.
    The wastage may be invariant over time. Reaching a certain performance level in code that provides no obvious extra non-speculative work per clock may only come through speculation.
    For workloads where this is not a problem, the amount of non-computational active logic and die area expended on what turns out to be irrelevant bit-fiddling becomes increasingly inappropriate.

    For power-limited and mobile applications, which are increasingly driving the market, it does not yet appear to be appropriate, and I haven't seen anything rumored from now until the transition to 22nm; after that there's still nothing, but then there's nothing substantive about anything else that far out anyway.

    As a side note:
    In the case of other constraints, like the limits of the memory bus or on-die communications networks, the more precious a resource becomes, the less acceptable it is to waste it on things that turn out not to be needed.

    I've seen it described more as an out of order completion, though the exact workings of the scheduler escape me at the moment.

    Larrabee's threading model is not new to GPUs. The individual SIMDs in RV6xx onwards are effectively sequenced in a dual-threaded manner.
    Read/write coherent caches are not yet in use, but their presence is orthogonal to the question of speculation.

    Larrabee's level of speculation is bare-minimum. It has an extremely short pipeline; it is an in-order design, and much of the code it runs is statically unrolled.
    As far as use of its vector capability, the pixel shader emulation reduces the amount of speculation to as close to zero as possible. It also reduces the use of the coherent capability of its caches to a very low level, but that is a separate issue.

    This is an interesting side-debate, but conservative architectures are just as programmatically flexible as highly speculative ones. If there were a difference in how the software saw speculation on a chip, it would be either a programmer gotcha that will haunt coders who code too close to the metal (or miss some of the weird memory behaviors that can result in multi-threaded code), or a sign that the chip is broken.

    It still seems like a weak case to me. It assumes we won't have replaced "but can it play Crysis (on medium)" as a lame gag for another 3 years.
    I think the world will have moved on from Crysis at that point.
    The more prevalent hardware case in 2012 won't be 8-core desktops, but probably hybrid solutions with on-die IGPs or GPUs, since the "core race" has already become uninteresting for most users.

    This is separate from the speculation debate, and in part sounds like a "bottlenecks you know versus the bottlenecks not yet discovered" case.
     
  15. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    On my Core i7 920 @ 3.2 GHz (Turbo Boost enabled) running the 64-bit version and facing a wall I'm getting 835 MPix/s. When I set the process affinity to only odd cores, to disable Hyper-Threading, I get 730 MPix/s. In both cases CPU usage is slightly over 80%.
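    For reference, "setting the affinity to only odd cores" boils down to something like this (a Windows-only sketch, assuming the usual enumeration where logical CPUs 2n and 2n+1 are Hyper-Threading siblings of the same physical core):

    ```cpp
    #include <windows.h>

    int main() {
        // 0xAA = 0b10101010: keep only the odd logical CPUs, i.e. one per physical core,
        // which effectively takes Hyper-Threading out of the picture for this process.
        DWORD_PTR mask = 0xAA;
        SetProcessAffinityMask(GetCurrentProcess(), mask);
        // ...launch the benchmark from here, or set the mask on the target process instead.
        return 0;
    }
    ```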
    Yes, that's due to Core 2 Quad being two dies connected by the FSB, while Core i7 is a single symmetric die. SwiftShader's vertex processing and primitive setup are also multi-threaded. So inter-core latency and bandwidth play a role in how performance scales. Core 2 Quad is pretty bad because the FSB is quite high latency and low bandwidth. Core i7 is way better, but SwiftShader 2.0 was actually optimized with a Core 2 Quad in mind (and the 2.01 update fixed an issue with multi-threading performance on Phenom).
     
  16. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    That would be interesting. The descriptions I've seen so far hinted at a less flexible round-robin scheme.
    If the core can be set to a 1-thread mode, Larrabee would look like a 1 GHz Pentium.
     
  17. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Brook (not Brook+, because of the introduction of LDS) is a pure stream language and forces the programmer to code exactly as you say. So are OpenGL, D3D10 and earlier versions of D3D.

    CUDA introduced the concept of a fixed-size shared memory. This low-level architectural detail "breaks" the stream programming model. It's the MS-DOS 640KB limit all over again :roll:

    D3D11 mandates 32KB of shared memory (and surely CUDA will mandate at least that for GT3xx cards).

    Arguably shared memory is a performance kludge - an alternative to launching two kernels in succession (since all writes to shared memory require bounding with a fence). You could say it's no worse than programming for a CPU with a known 32KB L1, 256KB L2 and 4MB L3. The GPUs lose a lot of performance moving data off chip only to read it back again - even though they're capable of hiding such latencies. So shared memory takes on the role of joining two distinct kernels separated by a fence, turning them into arbitrary writes/reads of shared memory within a single kernel, with the performance constraints of fences and limited capacity. Shared memory was originally called parallel data cache - hinting at what I think was its original graphics-related purpose of caching vertex and geometry kernel output - i.e. barycentrics for triangles and GS-amplification - for use by pixel-shading and setup.
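    For illustration, a rough C++20 analogue of that "two kernels joined through shared memory, with a fence in between" pattern, using std::barrier in place of a work-group fence - not how a GPU actually does it, just the shape of it:

    ```cpp
    #include <array>
    #include <barrier>
    #include <cstdio>
    #include <thread>
    #include <vector>

    int main() {
        constexpr int kGroupSize = 4;            // stand-in for a work-group
        std::array<int, kGroupSize> tile{};      // stand-in for shared/local memory
        std::barrier<> fence(kGroupSize);        // stand-in for the shared-memory fence

        std::vector<std::thread> group;
        for (int id = 0; id < kGroupSize; ++id) {
            group.emplace_back([&, id] {
                tile[id] = id * id;              // "kernel 1": produce into shared memory
                fence.arrive_and_wait();         // all writes fenced before anyone consumes
                int neighbour = tile[(id + 1) % kGroupSize];  // "kernel 2": read another lane's result
                std::printf("work-item %d sees %d\n", id, neighbour);
            });
        }
        for (auto& t : group) t.join();
    }
    ```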

    Is 16KB enough? Dunno. I'm doubtful. As I am that 32KB is enough. Future versions of D3D are supposedly planned to increase this amount.


    OpenCL allows the programmer to query the device to obtain the "key dimensions" (I just picked out a few from Table 4.3 of the specification, v1.0.43):
    • CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS - Maximum dimensions that specify the global and local work-item IDs used by the data parallel execution model.
    • CL_DEVICE_MAX_WORK_ITEM_SIZES - Maximum number of work-items that can be specified in each dimension of the work-group
    • CL_DEVICE_MAX_WORK_GROUP_SIZE - Maximum number of work-items in a work-group executing a kernel using the data parallel execution model
    • CL_DEVICE_GLOBAL_MEM_CACHE_TYPE - Type of global memory cache supported. Valid values are: CL_NONE, CL_READ_ONLY_CACHE and CL_READ_WRITE_CACHE
    • CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE
    • CL_DEVICE_GLOBAL_MEM_CACHE_SIZE
    • CL_DEVICE_LOCAL_MEM_TYPE - Type of local memory supported. This can be set to CL_LOCAL implying dedicated local memory storage such as SRAM, or CL_GLOBAL
    • CL_DEVICE_LOCAL_MEM_SIZE - Size of local memory arena in bytes. The minimum value is 16 KB
    It's then up to the programmer to decide if they can be bothered to optimise for the myriad variations.
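    For what it's worth, querying a few of those through the OpenCL C API is only a handful of lines (error handling omitted; assumes at least one platform and device are present):

    ```cpp
    #include <CL/cl.h>
    #include <cstdio>

    int main() {
        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, nullptr);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, nullptr);

        size_t max_work_group = 0;
        cl_ulong local_mem = 0;
        cl_device_local_mem_type local_type;
        clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                        sizeof(max_work_group), &max_work_group, nullptr);
        clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                        sizeof(local_mem), &local_mem, nullptr);
        clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_TYPE,
                        sizeof(local_type), &local_type, nullptr);

        std::printf("max work-group size: %zu\n", max_work_group);
        std::printf("local mem: %llu bytes (%s)\n", (unsigned long long)local_mem,
                    local_type == CL_LOCAL ? "dedicated" : "global");
    }
    ```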

    Do you think AVX will have similar problems?

    As far as I can tell ATI pre-fetches texture data, as the coordinates for texturing are known before the pixel shader commences - though I don't know how the hardware resolves the ordering of the pre-fetches, nor how much or what kind of pre-fetch interval is used. Additionally Larrabee has various cache control techniques which should enable the programmer to "drive" the caches more effectively than merely relying upon basic cache-line fetching and line-sizes. Though I'm aware that x86 does have some of this stuff, I think Larrabee goes a step further, taking a leaf or two out of Itanium's book. For texturing, specifically, I've no idea if Larrabee does prefetching - there are dedicated texel caches, but that's all that's known. I'm doubtful anything automatic takes place; it'd be totally up to the programmer/driver-writer.
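    As an aside on the x86 side of that, the existing hooks are software prefetch hints and non-temporal (streaming) stores - something like this sketch (SSE intrinsics; 16-byte alignment and a multiple-of-4 count assumed, and the prefetch distance is a guess):

    ```cpp
    #include <xmmintrin.h>

    void scale_stream(const float* src, float* dst, int n, float s) {
        __m128 scale = _mm_set1_ps(s);
        for (int i = 0; i < n; i += 4) {
            // Hint the cache to fetch data we'll need a little later.
            _mm_prefetch(reinterpret_cast<const char*>(src + i + 64), _MM_HINT_T0);
            __m128 v = _mm_mul_ps(_mm_load_ps(src + i), scale);
            // Non-temporal store: don't pollute the cache with output we won't re-read.
            _mm_stream_ps(dst + i, v);
        }
        _mm_sfence();  // make the streaming stores globally visible
    }
    ```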

    Jawed
     
  18. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    For what it's worth, I reckon that shared memory has two uses in graphics:
    1. holds attributes + barycentrics stuff for the pixel shaders to interpolate attributes, as required
    2. holds the output of GS when amplification is turned on
    So the CUDA-specific burden on G80 seems to me to be practically non-existent. You'd have to argue that there's something about CUDA, per se, that made NVidia go after "serial scalar" rather than vector processing of vertices and pixels.

    Jawed
     
  19. Scali

    Regular

    Joined:
    Nov 19, 2003
    Messages:
    2,127
    Likes Received:
    0
    Do you know it is actually implemented this way? Shared memory is not the only way to implement these features, as other vendors have demonstrated.
    So theoretically nVidia *could* have implemented it this way, but I wouldn't be surprised if they didn't. In fact, I'd be surprised if they did.

    Either way, it demonstrates the point that by adding extra 'general purpose' features and instructions, you can expand the usefulness of the processing core and use it for both new graphics and non-graphics applications. So that would indicate a trend that we're already moving away from a 'hardwired' or 'optimized' Direct3D implementation to something more general purpose, which happens to also work okay for Direct3D.
     
  20. Arwin

    Arwin Now Officially a Top 10 Poster
    Moderator Legend

    Joined:
    May 17, 2006
    Messages:
    17,682
    Likes Received:
    1,200
    Location:
    Maastricht, The Netherlands
    Why would a core on Larrabee run at half the chip's clock? On Cell the SPEs run at the same speed as the PPE, all on the main clock. Or are you assuming Larrabee runs at 1 GHz (as some of the early articles do, because initial performance numbers were reported at that clock speed)? I'd assume we're not going to see many LRB chips in PCs at less than 1 GHz.

    Not sure we'll see it even in any chip at all.

    Talking about fast software renderers, I have a completely new idea for setting up data and rendering it. Since I have a relatively poor background in rendering I'm not sure how new it really is, and I'm wondering if I should just discuss it here or file a patent immediately. :D

    Probably just discussing it here would be good, or in a new thread.
     