Fast software renderer

Discussion in 'Rendering Technology and APIs' started by Voxilla, Jul 19, 2009.

  1. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    711
    Likes Received:
    282
    It must be like how our brain works. Seriously, specialization has its benefits. You would not want to emulate floating point with integer arithmetic.
     
  2. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    All the descriptions I've seen indicate SMT mode has each thread take turns fetching instructions at the front end.
    That's just one cycle of hidden latency for a branch that won't resolve for another 15 cycles (best case).

    Branch prediction doesn't help with data cache misses, and an instruction cache miss is going to cause a stall either instantly or in a cycle if there's some kind of target instruction buffer.

    For a subset of the workloads out there, this is a worthwhile compromise.

    The point of conflict I see here is that the design compromises necessary for executing serial code as fast as possible impact execution in all non-serial cases, and also have knock-on effects on other parts of the system.

    The focus of that paper was on latency and performance impact, which is not exactly what I am focusing on.

    As for the approximation you are using:
    What is the mispredict rate you've chosen? Is it a 10% chance of misprediction per individual branch, or the cumulative probability of a misprediction somewhere in a 64-instruction window with 6 branches in that range?
    Successive branches compound, so even reasonable per-branch misprediction rates can leave around 30% of ROB entries uncommitted.
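    A quick sketch of the difference between those two readings (the rate and the branch density here are illustrative assumptions, not figures from the paper, and branches are treated as independent):

    ```cpp
    // Per-branch mispredict rate vs. the chance of at least one mispredict
    // somewhere in a window containing several branches.
    #include <cmath>
    #include <cstdio>

    int main() {
        const double per_branch = 0.05;    // assumed mispredict rate per branch
        const int branches_in_window = 6;  // e.g. 6 branches in a 64-instruction window
        const double window_rate =
            1.0 - std::pow(1.0 - per_branch, branches_in_window);
        std::printf("per-branch %.0f%% -> window %.1f%%\n",
                    per_branch * 100.0, window_rate * 100.0);  // ~26.5%
    }
    ```
    Which reading you pick changes the conclusion considerably, hence the question.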

    What do you mean by IPC in this case?
    Barring an I-cache miss, a good 4-wide speculative processor is going to push close to 4 instructions through the front end of the pipeline every clock.
    Referencing the stacked penalty model in that paper, the front-end penalty is going to be over 3 times higher than that of the 5-stage pipeline chosen in the model.

    100% of speculatively issued instructions go through the pipeline up to the final point of execution, whether they turn out to be mispredicted or not. In terms of ROB entries that never commit, that's 30% of instructions passing through hardware that, as a fraction of the non-cache core area, is close to two-thirds or more active logic.
    This is a fixed power cost 100% of the time.
    Some as yet undetermined percentage of the wasted instructions will go so far as to execute and have their results pending in the ROB when they are negated, depending on the situation. That is the decode+schedule power that all speculated instructions draw, plus a certain amount of execution-unit consumption, for which I have no figures but which will vary depending on the operation type.
    Loads and stores in i7 are very heavily speculated, more so than what is already the case in other OoO chips.

    As GPUs probably try to do and Larrabee's VPU has been documented as doing, a known invalid lane is clock-gated.
    That's different from running an instruction through the pipeline and not knowing it is invalid until the end.

    While not desirable, silicon is relatively cheap in this situation.
    If all the chip is doing is drawing a wall or a 10 pixel character, whatever time it takes will be sufficiently fast for the purposes of the GPU's target market.

    That doesn't sound particularly compelling, in the face of other engineered solutions that aren't searching for a problem to solve.

    Right, CPUs don't keep much data on chip, they just send it off to the 8-12 MB of cache--oh I see what we have here.
     
  3. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    Strand sounds OK to me. Intel's terminology for Larrabee seems to make sense, so I'll stick to that as closely as possible in further discussions. Thanks for the suggestion.

    Dear god, CUDA's definitions are an abomination! ;)
    Yeah exactly. But I still stick to my opinion that software developers shouldn't have to care. They simply write the code for a single strand and instantiate it many times over an array of data. It's up to the framework and the device to determine how many strands/fibers/threads it should really be using. Just think of OpenCL, which makes no explicit mention of it, and can be implemented on a GPU, CPU, Cell, Larrabee, etc. By the way, OpenCL uses the terms kernel and work-item, clearly just to make sure that the last few people who were able to keep the terminologies apart go nuts as well. ;)
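    To make that concrete, a minimal sketch (plain C++, no particular API; the names are made up for illustration) of the "write one strand, instantiate it over the data" model:

    ```cpp
    #include <cstddef>
    #include <vector>

    // The developer's side: the body for a single strand/work-item
    // (a hypothetical per-element saxpy, just for illustration).
    inline void kernel(std::size_t i, float a, const float* x, float* y) {
        y[i] = a * x[i] + y[i];
    }

    // The framework's side: instantiate the kernel over the whole range.
    // How this maps onto threads, fibers or SIMD lanes is the device's business;
    // a real runtime would not use a plain serial loop like this.
    void launch(std::size_t n, float a, const std::vector<float>& x, std::vector<float>& y) {
        for (std::size_t i = 0; i < n; ++i)
            kernel(i, a, x.data(), y.data());
    }
    ```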
    I jumped the gun saying it's a mistake. I'm not worried about the 4-way SMT; they can adjust that in the future without far-reaching consequences, but the 16-wide SIMD is a different story. Future evolutions may demand widening (or narrowing) it, or making the units MIMD. But I realize now that's a software issue; more precisely a compiler issue. I just hope they succeed at getting people to adopt the entire platform, and don't run into backward compatibility issues forcing them to drag old choices along.
    GPUs have (texture) caches too... They are simply the best option to reduce bandwidth and latency, despite indeed containing some data that goes stale.

    You can either try loading the necessary data explicitly, or you can let caching heuristics do it for you automatically. Given the increase in complexity and the wide variety of algorithms, my vote goes to caches. Maybe not as much as 24 MB, but Larrabee's 4-8 MB sounds about right for a wide variety of workloads.
     
  4. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    The italics around apparently being the key point. (I can't find the original source of that 90% number.)

    Let's define resources in terms of area of the full chip.

    I can see how Crysis at the highest quality level on a G98 leaves major parts of the chip sitting idle, waiting for a shader to complete. But on a well-proportioned RV770 or G98 or GT200? Come on...
     
  5. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    What? So if someone manages to create a chip half the size that performs the same, it would be no big deal?
    I'm afraid you missed the whole point of the discussion. It's about (future) convergence. I don't care if today's CPUs waste 30% of their cycles on speculation; it has benefits too, and in the right proportion it could help GPUs reach higher efficiency. And not just speculation, every CPU concept. As Jawed clarified, NVIDIA already puts a form of out-of-order execution to good use. And Larrabee uses caches and SMT. The successful architectures of the future will not disregard any technique just because on today's CPUs it appears to have drawbacks that collide with the classic understanding of a GPU architecture. GPU manufacturers are fighting their own kind of MHz war, and it's about to end. Things like ease of programmability are becoming far more important than raw theoretical performance.

    Vice versa, CPUs have a lot to learn from GPUs, but they're quickly moving in the right direction. Today, a Core i7 delivers about 100 GFLOP/s. By 2012 the number of cores will have doubled, the vector width will have doubled, we'll have FMA support, and it will run at a slightly higher clock frequency. That's good for about 1 TFLOP/s. Soon after that we should see support for scatter/gather appear. That does become compelling.
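    That's just multiplying out the factors above (the clock bump is my rough guess at "slightly higher"):

    ```cpp
    #include <cstdio>

    int main() {
        const double today = 100.0;  // Core i7 today, ~100 GFLOP/s single precision
        const double cores = 2.0;    // core count doubled
        const double width = 2.0;    // vector width doubled (128-bit -> 256-bit)
        const double fma   = 2.0;    // fused multiply-add doubles peak throughput
        const double clock = 1.25;   // assumed "slightly higher" clock
        std::printf("%.0f GFLOP/s\n",
                    today * cores * width * fma * clock);  // 1000, i.e. ~1 TFLOP/s
    }
    ```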

    The most beautiful thing of all is that developers will be largely unrestricted by the API or any fixed-function hardware. As shown by FQuake, one can achieve amazing performance by doing things the way the developer intends to do them, instead of bending over backwards to work within the constraints of the API and the hardware. APIs won't disappear, but they'll become more of a framework, like OpenCL. Today's GPUs are still very much designed specifically to support the Direct3D pipeline. Everything is meticulously balanced to support that as best as possible. But as soon as you do something out of the ordinary you hit one bottleneck after the other, and that's happening a lot given the growing diversification of graphics techniques.

    Anyway, I have a hard time believing that GPU architects are sitting on their hands. Programmable texture sampling can be implemented using largely generic scatter/gather units and doing the filtering in the shader units. ROP is also destined to become programmable, and tessellation units could eventually become programmable rasterizers as well. By moving everything to programmable cores the de facto bottlenecks vanish. Larrabee has a head start, but possibly not for very long.
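    Filtering in the shader, for instance, is conceptually just this (a simplified single-channel, clamp-to-edge sketch; the struct and function names are made up for illustration):

    ```cpp
    #include <algorithm>
    #include <cmath>
    #include <vector>

    struct Texture {
        int width, height;
        std::vector<float> texels;  // single channel, row-major
        float fetch(int x, int y) const {            // the "gather" part
            x = std::clamp(x, 0, width - 1);
            y = std::clamp(y, 0, height - 1);
            return texels[y * width + x];
        }
    };

    // Bilinear sample at normalized coordinates (u, v) in [0,1],
    // done entirely in "shader" code instead of a fixed-function sampler.
    float sampleBilinear(const Texture& tex, float u, float v) {
        float fx = u * tex.width  - 0.5f;
        float fy = v * tex.height - 0.5f;
        int x0 = static_cast<int>(std::floor(fx));
        int y0 = static_cast<int>(std::floor(fy));
        float wx = fx - x0, wy = fy - y0;            // fractional weights
        float t00 = tex.fetch(x0, y0),     t10 = tex.fetch(x0 + 1, y0);
        float t01 = tex.fetch(x0, y0 + 1), t11 = tex.fetch(x0 + 1, y0 + 1);
        return (t00 * (1 - wx) + t10 * wx) * (1 - wy)
             + (t01 * (1 - wx) + t11 * wx) * wy;
    }
    ```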

    I don't think there's any question whether in the end the CPU or the GPU will 'win'. They'll both win, but each in their own domain. Discrete graphics cards will keep ruling the high-performance graphics market, while CPUs will become adequate for the low end. Compare it to sound processing: nowadays 95% of us don't have a discrete sound card any more; the CPU processes sound in a driver "on the side". With a CPU already capable of 1 TFLOP/s, there's no need to shell out for a second, more dedicated chip, unless you want the absolute latest in graphics.
     
  6. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    6,807
    Likes Received:
    473
    GPUs in fact have a far easier time using every CPU concept than x86 ... I don't think Intel is ever going to introduce split branches for instance.
     
  7. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,435
    Likes Received:
    263
    Does anyone besides Intel use this term? Nvidia, AMD, and Microsoft don't use it.
     
  8. DavidC

    Regular

    Joined:
    Sep 26, 2006
    Messages:
    347
    Likes Received:
    24
    What are the chances of someone with a Core i7 running this same benchmark? With SwiftShader and 3DMark2001, the Core i7 is significantly faster than the Core 2 Quad.

    http://www.forum-3dcenter.org/vbulletin/showthread.php?t=267950&page=6
     
  9. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    711
    Likes Received:
    282
    Nick has a Core i7, so he might tell.
    There seems to be a tendency for the total CPU usage percentage to drop on machines with more cores. The same happens at lower screen resolutions. It could be the driver part that copies the image from CPU to GPU memory. When I only display every 10th rendered frame, CPU utilization is well over 90%.
     
    #109 Voxilla, Aug 12, 2009
    Last edited by a moderator: Aug 12, 2009
  10. Scali

    Regular

    Joined:
    Nov 19, 2003
    Messages:
    2,127
    Likes Received:
    0
    I think nVidia already went this direction with the G80.
    They started adding features that have no meaning for graphics, and since G80 they really haven't added any graphics features; they've only improved CUDA.
    Mainly things like the shared cache and the double precision math. Rather radical changes from a hardware point-of-view, but Direct3D10 doesn't even know they exist. Direct3D11 only knows what to do with those things for compute shaders, not graphics.

    And in a slightly different direction, Intel also started experimenting with more programmability and less hardwired functionality in their IGPs. Part of the clipping and triangle setup is actually performed by special kernels. Same with acceleration of video decoding.
    Obviously Larrabee will be a similar approach... I wonder when nVidia and AMD are going to go down this route as well. Perhaps already with DX11 hardware?
     
  11. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Phew!

    Still, having lots of space for context on chip is currently the only solution. And with the performance gap between on-die memory and off-die memory only increasing, there's no alternative in sight.

    3D chip stacking is likely to lead to entire layers of nothing but memory - so on-chip memory is set merely to grow, not shrink. Regardless of the name you give the processor, CPU or GPU.

    Jawed
     
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Hmm, I was expecting a single thread to run at full speed. The other 3 threads would simply be idle.

    At least it'll have bags of cache :lol:

    Jawed
     
  13. Arwin

    Arwin Now Officially a Top 10 Poster
    Moderator Legend

    Joined:
    May 17, 2006
    Messages:
    17,682
    Likes Received:
    1,200
    Location:
    Maastricht, The Netherlands
    I looked into this, and from what I read you can configure each core to run 1 to 4 threads, which would suggest that if you configure it to run with 1 thread, it would run at full speed.
     
  14. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    Your example isn't one where such a chip exists. It's an example of a chip that, even at significantly reduced graphics settings, falls an order of magnitude short, while a non-speculating chip with a similar TDP runs circles around it.

    If the example actually did exist, I would obviously go for the smaller chip, unless the bigger chip also came with a free unicorn.

    Speculation has benefits where it is appropriate, depending on the performance criteria of a design.
    The wastage may be invariant over time. Reaching a certain performance level in code that provides no obvious extra non-speculative work per clock may only come through speculation.
    For workloads where this is not a problem, the amount of non-computational active logic and die area expended on what turns out to be irrelevant bit-fiddling becomes increasingly inappropriate.

    For power-limited and mobile applications, which are increasingly driving the market, it does not yet appear to be appropriate, and I haven't seen anything rumored from now until the transition to 22nm; after that there's still nothing, but then there's nothing substantive about anything else that far out anyway.

    As a side note:
    In the case of other constraints, like the limits of the memory bus or on-die communications networks, the more precious a resource becomes, the less acceptable it is to waste it on things that turn out not to be needed.

    I've seen it described more as an out of order completion, though the exact workings of the scheduler escape me at the moment.

    Larrabee's threading model is not new to GPUs. The individual SIMDs in RV6xx onwards are effectively sequenced in a dual-threaded manner.
    Read/write coherent caches are not yet in use, but their presence is orthogonal to the question of speculation.

    Larrabee's level of speculation is bare-minimum. It has an extremely short pipeline; it is an in-order design, and much of the code it runs is statically unrolled.
    As far as use of its vector capability, the pixel shader emulation reduces the amount of speculation to as close to zero as possible. It also reduces the use of the coherent capability of its caches to a very low level, but that is a separate issue.

    This is an interesting side-debate, but conservative architectures are just as programmatically flexible as highly speculative ones. If there were a difference in how the software saw speculation on a chip, it would be either a programmer gotcha that will haunt coders who code too close to the metal (or miss some of the weird memory behaviors that can result in multi-threaded code), or a sign that the chip is broken.

    It still seems like a weak case to me. It assumes we won't have replaced "but can it play Crysis (on medium)" as a lame gag for another 3 years.
    I think the world will have moved on from Crysis at that point.
    The more prevalent hardware case in 2012 won't be 8-core desktops, but probably hybrid solutions with on-die IGPs or GPUs, since the "core race" has already become uninteresting for most users.

    This is separate from the speculation debate, and in part sounds like a "bottlenecks you know versus the bottlenecks not yet discovered" case.
     
  15. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    On my Core i7 920 @ 3.2 GHz (Turbo Boost enabled) running the 64-bit version and facing a wall I'm getting 835 MPix/s. When I set the process affinity to only odd cores, to disable Hyper-Threading, I get 730 MPix/s. In both cases CPU usage is slightly over 80%.
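    For reference, "setting the affinity to only odd cores" boils down to something like this (a Windows-only sketch, assuming the usual enumeration where logical CPUs 2n and 2n+1 are Hyper-Threading siblings of the same physical core):

    ```cpp
    #include <windows.h>

    int main() {
        // 0xAA = 0b10101010: keep only the odd logical CPUs, i.e. one per physical core,
        // which effectively takes Hyper-Threading out of the picture for this process.
        DWORD_PTR mask = 0xAA;
        SetProcessAffinityMask(GetCurrentProcess(), mask);
        // ...launch the benchmark from here, or set the mask on the target process instead.
        return 0;
    }
    ```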
    Yes, that's due to Core 2 Quad being two dies connected by the FSB, while Core i7 is a single symmetric die. SwiftShader's vertex processing and primitive setup are also multi-threaded. So inter-core latency and bandwidth play a role in how performance scales. Core 2 Quad is pretty bad because the FSB is quite high latency and low bandwidth. Core i7 is way better, but SwiftShader 2.0 was actually optimized with a Core 2 Quad in mind (and the 2.01 update fixed an issue with multi-threading performance on Phenom).
     
  16. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    That would be interesting. The descriptions I've seen so far hinted at a less flexible round-robin scheme.
    If the core can be set to a 1-thread mode, Larrabee would look like a 1 GHz Pentium.
     
  17. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Brook (not Brook+, because of the introduction of LDS) is a pure stream language and forces the programmer to code exactly as you say. So are OpenGL, D3D10 and earlier versions of D3D.

    CUDA introduced the concept of a fixed-size shared memory. This low-level architectural detail "breaks" the stream programming model. It's the MS-DOS 640KB limit all over again :roll:

    D3D11 mandates 32KB of shared memory (and surely CUDA will mandate at least that for GT3xx cards).

    Arguably shared memory is a performance kludge - an alternative to launching two kernels in succession (since all writes to shared memory require bounding with a fence). You could say it's no worse than programming for a CPU with a known 32KB L1, 256KB L2 and 4MB L3. The GPUs lose a lot of performance moving data off chip only to read it back again - even though they're capable of hiding such latencies. So shared memory takes on the role of joining two distinct kernels separated by a fence, turning them into arbitrary writes/reads of shared memory within a single kernel, with the performance constraints of fences and limited capacity. Shared memory was originally called parallel data cache - hinting at what I think was its original graphics-related purpose of caching vertex and geometry kernel output - i.e. barycentrics for triangles and GS-amplification - for use by pixel-shading and setup.
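    For illustration, a rough C++20 analogue of that "two kernels joined through shared memory, with a fence in between" pattern, using std::barrier in place of a work-group fence - not how a GPU actually does it, just the shape of it:

    ```cpp
    #include <array>
    #include <barrier>
    #include <cstdio>
    #include <thread>
    #include <vector>

    int main() {
        constexpr int kGroupSize = 4;            // stand-in for a work-group
        std::array<int, kGroupSize> tile{};      // stand-in for shared/local memory
        std::barrier<> fence(kGroupSize);        // stand-in for the shared-memory fence

        std::vector<std::thread> group;
        for (int id = 0; id < kGroupSize; ++id) {
            group.emplace_back([&, id] {
                tile[id] = id * id;              // "kernel 1": produce into shared memory
                fence.arrive_and_wait();         // all writes fenced before anyone consumes
                int neighbour = tile[(id + 1) % kGroupSize];  // "kernel 2": read another lane's result
                std::printf("work-item %d sees %d\n", id, neighbour);
            });
        }
        for (auto& t : group) t.join();
    }
    ```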

    Is 16KB enough? Dunno. I'm doubtful. As I am that 32KB is enough. Future versions of D3D are supposedly planned to increase this amount.


    OpenCL allows the programmer to query the device to obtain the "key dimensions" (I just picked out a few from Table 4.3 of the specification, v1.0.43):
    • CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS - Maximum dimensions that specify the global and local work-item IDs used by the data parallel execution model.
    • CL_DEVICE_MAX_WORK_ITEM_SIZES - Maximum number of work-items that can be specified in each dimension of the work-group
    • CL_DEVICE_MAX_WORK_GROUP_SIZE - Maximum number of work-items in a work-group executing a kernel using the data parallel execution model
    • CL_DEVICE_GLOBAL_MEM_CACHE_TYPE - Type of global memory cache supported. Valid values are: CL_NONE, CL_READ_ONLY_CACHE and CL_READ_WRITE_CACHE
    • CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE
    • CL_DEVICE_GLOBAL_MEM_CACHE_SIZE
    • CL_DEVICE_LOCAL_MEM_TYPE - Type of local memory supported. This can be set to CL_LOCAL implying dedicated local memory storage such as SRAM, or CL_GLOBAL
    • CL_DEVICE_LOCAL_MEM_SIZE - Size of local memory arena in bytes. The minimum value is 16 KB
    It's then up to the programmer to decide if they can be bothered to optimise for the myriad variations.
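    For what it's worth, querying a few of those through the OpenCL C API is only a handful of lines (error handling omitted; assumes at least one platform and device are present):

    ```cpp
    #include <CL/cl.h>
    #include <cstdio>

    int main() {
        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, nullptr);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, nullptr);

        size_t max_work_group = 0;
        cl_ulong local_mem = 0;
        cl_device_local_mem_type local_type;
        clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                        sizeof(max_work_group), &max_work_group, nullptr);
        clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                        sizeof(local_mem), &local_mem, nullptr);
        clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_TYPE,
                        sizeof(local_type), &local_type, nullptr);

        std::printf("max work-group size: %zu\n", max_work_group);
        std::printf("local mem: %llu bytes (%s)\n", (unsigned long long)local_mem,
                    local_type == CL_LOCAL ? "dedicated" : "global");
    }
    ```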

    Do you think AVX will have similar problems?

    As far as I can tell ATI pre-fetches texture data, as the coordinates for texturing are known before the pixel shader commences - though I don't know how the hardware resolves the ordering of the pre-fetches, nor how much or what kind of pre-fetch interval is used. Additionally Larrabee has various cache control techniques which should enable the programmer to "drive" the caches more effectively than merely relying upon basic cache-line fetching and line-sizes. Though I'm aware that x86 does have some of this stuff, I think Larrabee goes a step further, taking a leaf or two out of Itanium's book. For texturing, specifically, I've no idea if Larrabee does prefetching - there are dedicated texel caches, but that's all that's known. I'm doubtful anything automatic takes place; it'd be totally up to the programmer/driver-writer.
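    As an aside on the x86 side of that, the existing hooks are software prefetch hints and non-temporal (streaming) stores - something like this sketch (SSE intrinsics; 16-byte alignment and a multiple-of-4 count assumed, and the prefetch distance is a guess):

    ```cpp
    #include <xmmintrin.h>

    void scale_stream(const float* src, float* dst, int n, float s) {
        __m128 scale = _mm_set1_ps(s);
        for (int i = 0; i < n; i += 4) {
            // Hint the cache to fetch data we'll need a little later.
            _mm_prefetch(reinterpret_cast<const char*>(src + i + 64), _MM_HINT_T0);
            __m128 v = _mm_mul_ps(_mm_load_ps(src + i), scale);
            // Non-temporal store: don't pollute the cache with output we won't re-read.
            _mm_stream_ps(dst + i, v);
        }
        _mm_sfence();  // make the streaming stores globally visible
    }
    ```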

    Jawed
     
  18. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    For what it's worth, I reckon that shared memory has two uses in graphics:
    1. holds attributes + barycentrics stuff for the pixel shaders to interpolate attributes, as required
    2. holds the output of GS when amplification is turned on
    So the CUDA-specific burden on G80 seems to me to be practically non-existent. You'd have to argue that there's something about CUDA, per se, that made NVidia go after "serial scalar" rather than vector processing of vertices and pixels.

    Jawed
     
  19. Scali

    Regular

    Joined:
    Nov 19, 2003
    Messages:
    2,127
    Likes Received:
    0
    Do you know it is actually implemented this way? Shared memory is not the only way to implement these features, as other vendors have demonstrated.
    So theoretically nVidia *could* have implemented it this way, but I wouldn't be surprised if they didn't. In fact, I'd be surprised if they did.

    Either way, it demonstrates the point that by adding extra 'general purpose' features and instructions, you can expand the usefulness of the processing core and use it for both new graphics and non-graphics applications. So that would indicate a trend that we're already moving away from a 'hardwired' or 'optimized' Direct3D implementation to something more general purpose, which happens to also work okay for Direct3D.
     
  20. Arwin

    Arwin Now Officially a Top 10 Poster
    Moderator Legend

    Joined:
    May 17, 2006
    Messages:
    17,682
    Likes Received:
    1,200
    Location:
    Maastricht, The Netherlands
    Why would a core on Larrabee run at half the chip's clock? On Cell the SPEs run at the same speed as the PPE, all on the main clock. Or are you assuming Larrabee runs at 1 GHz (as some of the early articles do, because initial performance numbers were reported at that clock speed)? I'd assume we're not going to see many LRB chips in PCs at less than 1 GHz.

    Not sure we'll see it even in any chip at all.

    Talking about fast software renderers, I have a completely new idea for setting up data and rendering it. Since I have a relatively poor background in rendering I'm not sure how new it really is, and I'm wondering if I should just discuss it here or file a patent immediately. :D

    Probably just discussing it here would be good, or in a new thread.
     