Recent content by armchair_architect

  1. Weighted Random Pick of RGBA

    You could probably use a table lookup (using a 1D texture, or in DX10 maybe a constant buffer). Say you create a 100x1 RGBA texture, fill the first 60 pixels with (1.0,0,0,0), and the next 40 with (0,0,0,1.0). Since your random function gives you a number in [0..1] you can use that as the 's'...
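
    A minimal sketch of that table idea in C-style code (names are mine, not from the post; in the actual shader this would be a point-sampled 100x1 texture):

      #include <cuda_runtime.h>  // only for float4 / make_float4

      // Fill the 100-entry table: first 60 entries (1,0,0,0), next 40 (0,0,0,1).
      void fill_table(float4 table[100]) {
          for (int i = 0; i < 100; ++i)
              table[i] = (i < 60) ? make_float4(1.f, 0.f, 0.f, 0.f)
                                  : make_float4(0.f, 0.f, 0.f, 1.f);
      }

      // A uniform random r in [0..1] then returns red 60% of the time,
      // just like sampling the texture at texcoord s = r.
      float4 weighted_pick(const float4 table[100], float r) {
          int i = (int)(r * 100.f);
          return table[i < 99 ? i : 99];  // clamp in case r == 1.0
      }
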
  2. AMD splits in two...

    I'm no expert on merchant fab business models, but I think it boils down to having more customers, which has two benefits: First, multiple customers with various products means demand is probably going to be more consistent. It costs a lot of money just to keep a fab operating and staffed...
  3. Why didn't DX 10.1 catch on?

    I've seen this claim about Nvidia killing memory virtualization a couple of times on B3D forums, but have never seen any source (reliable or otherwise) quoted for it. Got one?
  4. Nvision 2008

    On G80 you have two independent SIMD arrays per cluster. Each one issues an instruction to 16 threads over 2 ALU clocks for vertex work, or an instruction to 32 threads over 4 ALU clocks for pixel and CUDA work. (My theory is they still issue an instruction every two clocks to get dual-issue --...
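
    Working the arithmetic out (my numbers, just restating the post):

      vertex:     16 threads / 2 ALU clocks = 8 threads per clock
      pixel/CUDA: 32 threads / 4 ALU clocks = 8 threads per clock

    Either way it's consistent with each SIMD array being 8 ALUs wide; only the batch size changes.
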
  5. Nvision 2008

    CUDA documentation gives pretty good evidence that each of the 2 or 3 units is independent of the others in the cluster. They share a texture unit, but that's about all as far as I can tell. So I'd count G80 as 16 cores and GT200 as 30.
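
    For the record, the counts follow from the cluster configurations:

      G80:   8 clusters x 2 units = 16 cores
      GT200: 10 clusters x 3 units = 30 cores
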
  6. Do unused MIP levels take video memory?

    If I understand it right, CUDA and CTM are still subject to WDDM. The APIs -- DX9, DX10, OGL, CUDA, CTM, etc. -- are peers; they're user-mode clients of the WDDM kernel services. WDDM virtualizes the physical resources across all of them.
  7. Do unused MIP levels take video memory?

    I don't think we could tell whether it does or not, given Vista's driver model (WDDM1.0). From what I know of it, WDDM1 memory management is built around the capabilities of older hardware (R300, NV30), and doesn't give a lot of opportunity for NV and AMD to take advantage of more advanced...
  8. Larrabee at Siggraph

    Coalescing would be an implementation detail. You could relax those rules significantly and (a) existing code would still run without change, and (b) apps that do uncoalesced access would automatically go faster. Whether global memory is cached or not is also in this category. The on-chip shared memory...
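
    To illustrate, here's a rough CUDA sketch of the two access patterns the coalescing rules distinguish (kernel names are made up):

      // Neighboring threads read neighboring words: on G80-class rules
      // this is one memory transaction per half-warp.
      __global__ void coalesced(float* out, const float* in) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          out[i] = in[i];
      }

      // The same copy with a stride scatters neighbors across segments:
      // many transactions, much slower -- unless the rules are relaxed.
      __global__ void strided(float* out, const float* in, int stride) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          out[i] = in[i * stride];
      }
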
  9. Larrabee at Siggraph

    I've looked at it. Ct to me looks like a library version of old vector architectures married to the fancy template meta-programming linear algebra libraries that started popping up a few years ago. Works great for some problems, but doesn't seem as broadly applicable as CUDA. Maybe I just lack...
  10. Larrabee at Siggraph

    They're more about the cache line size, which happens to match the SIMD width for reasons that should be obvious (and is why SSE/AVX/Larrabee will all have the same issue). Yes, it does break the abstraction a bit. But like I said, in my experience the 80/20 rule applies. Getting the last 20%...
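
    Spelling out the "obvious" reason with Larrabee's numbers:

      16 fp32 lanes x 4 bytes/lane = 64 bytes = one cache line

    So an aligned full-width vector load touches exactly one line; misalign it and you pay for two.
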
  11. Larrabee at Siggraph

    I think you're confusing HW threads and SW threads. EDIT: having finished reading the thread, I guess you're not. Not sure what you meant in this post though. They're reserving the term "thread" for HW threads, which do indeed switch every cycle just like on a GPU. Each HW thread has real...
  12. Larrabee at Siggraph

    From reading the paper, it sounds like Larrabee puts more of a burden on the programmer for extracting efficiency than GPUs do (especially in OGL/D3D but also CUDA). They've got smart people working hard to handle this for D3D and OGL apps, but if you go outside those or other libraries you take...
  13. Direct3D 11

    You're talking about different things. DX11 multi-threading is all about having the app rendering engine, DX runtime, and driver take advantage of multiple CPU cores. It has nothing to do with SLI/Crossfire.
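
    Concretely (my example, not from the post): the DX11 mechanism here is deferred contexts -- worker threads record command lists, and one thread submits them. A rough, untested sketch:

      #include <d3d11.h>

      // Each worker thread records into its own deferred context.
      void record_on_worker(ID3D11Device* dev, ID3D11CommandList** out) {
          ID3D11DeviceContext* dc = nullptr;
          dev->CreateDeferredContext(0, &dc);
          // ... Draw/SetState calls on dc here ...
          dc->FinishCommandList(FALSE, out);  // bake into a command list
          dc->Release();
      }

      // The main thread replays the recorded lists on the immediate
      // context. Note that nothing here knows how many GPUs exist.
      void submit(ID3D11DeviceContext* immediate, ID3D11CommandList* list) {
          immediate->ExecuteCommandList(list, TRUE);
          list->Release();
      }
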
  14. Larrabee at Siggraph

    Only if they're transferred as fp32: 32 pixels * 4 components/pixel * 4 bytes/component = 512 bytes. The ability to convert 8-bit unorm to fp32 when reading from L1 means that for 8-bit textures they can stay 8-bit over the ringbus: only 128 bytes for 32 filtered pixels. I'm really curious...
  15. Larrabee at Siggraph

    Which is kinda the point. Imagine, using all those transistors Moore's Law has been handing us for actual execution units :shock:. To me, Larrabee demonstrates what we've been giving up all these years so those lazy programmers could keep sitting on their single-threaded butts :lol:.