NVIDIA Fermi: Architecture discussion

Discussion in 'Architecture and Products' started by Rys, Sep 30, 2009.

  1. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Read my post again. Branching is done at the same rate as before (I'm assuming that current hardware does it once every instruction group when I say my method will branch once every four scalar instructions).
    I'm pretty sure that my method doesn't really change things at all here. The only minor issue is that in an 8-cycle period, the total possible locations that need to be accessed from the register file is four times larger with my method. The actual transfer rate will be the same, and the size of the register file is the same, too.

    I'm not going to break the clusters, though. The T units will put into their pipeline 16 pixels of the same batch that the ALUs will. The ALUs go round robin on 8 batches, and each batch will stay active for at least four visits (32 cycles) to allow the T-units and branch units to finish up.

    (FYI, by active I mean that they have data going through the ALU stages. There's plenty more batches in flight, put on hold either for texture fetches or simply waiting their turn.)
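    A minimal sketch of one reading of the round-robin rotation described above (the 8 active batches, 64-pixel batches, and 16 pixels/clock issue rate are from this thread; everything else is my assumption, not a confirmed design):

```python
# Toy model of the round-robin issue pattern sketched above (an assumed
# reading, not a confirmed design): 8 active batches of 64 pixels each,
# ALUs issuing 16 pixels/clock, so one scalar instruction per batch
# per visit occupies 4 consecutive cycles.
ACTIVE_BATCHES = 8
BATCH_SIZE = 64
ISSUE_WIDTH = 16
CYCLES_PER_VISIT = BATCH_SIZE // ISSUE_WIDTH   # 4 cycles per visit

def visit_starts(batch, num_visits):
    """Global cycles at which successive visits to `batch` begin."""
    round_length = ACTIVE_BATCHES * CYCLES_PER_VISIT   # 32-cycle rotation
    return [batch * CYCLES_PER_VISIT + v * round_length
            for v in range(num_visits)]

# Batch 0 is revisited every 32 global clocks; the gap between visits is
# what gives the T and branch units time to finish their 16-wide work.
print(visit_starts(0, 4))   # [0, 32, 64, 96]
```

    Under this reading, each batch gets one 4-cycle visit per 32-cycle rotation, which matches the 32-global-clock wait discussed a few posts down.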
     
  2. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,365
    Likes Received:
    3,955
    Location:
    Well within 3d
    There are so many cross-cutting issues with splitting up the register banks that I'm having a hard time wrapping my head around them, given how byzantine access is currently.

    For RV770, the 4-component vector registers are implemented as 4 basically independent single-component register banks that can each read one value per clock into a GPRn collecting register over 3 cycles. All units share from this.

    Splitting the units up makes this highly redundant.
    But do we keep the 4-component register organization and balloon the register file, or do we cut it down so each unit gets a bank and that bank is addressed as a bunch of scalar registers? What costs are there?

    What happens with TEX unit writes to the register file?
    Since each bank is now per-pixel, that's additional write ports per separate bank.
     
  3. Groo The Wanderer

    Regular

    Joined:
    Jan 23, 2007
    Messages:
    334
    Likes Received:
    2
    Fake it isn't; I know who showed me the original. That said, they might be playing product naming games, i.e. not tipping their hat to what they were planning on renaming the part to. They do that a lot.

    -Charlie
     
  4. thambos

    Newcomer

    Joined:
    Sep 29, 2007
    Messages:
    194
    Likes Received:
    0
  5. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,365
    Likes Received:
    3,955
    Location:
    Well within 3d
    I missed that part.
    So in this case, a batch will do branch processing for 4 cycles from its point of view, before it can try coming back for ALU instruction execution.
    With 8 batches actively cycling and bringing their own branches, the first one won't get to ALU work for 32 global clocks.

    The number of independent accesses doesn't seem possible under the current scheme.
    The physical size of the register file would be bigger because of the number of ports, and under the current access method it would be highly redundant.
    Some of the port-sharing methods made possible by the weird register access scheme will not work with 64-pixel wavefronts. There is no sharing of reads to the same register address, since each lane is dedicated to a separate context.


    You have to if you want to process an instruction for 64 pixels at once. The clustering is what enforces the 16 pixels per clock.
     
  6. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Is there a PDF outlining all of this? I remember seeing one for R600 or RV670, but that's it.

    The main point I was trying to get across is that if I ignore any access restrictions, currently we have the vec4 unit of RV770's SIMD needing 12 groups (three operands per channel) of 16 floats every cycle. My method needs only 3 groups of 64 floats every cycle, which is much simpler on the face of it. However, it's possible that through pipelining, RV770 actually fetches data as 12 groups of 64 floats every four cycles, but that's still no easier than what I'm proposing.
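    The operand-traffic comparison above can be checked with back-of-the-envelope arithmetic (the group counts and widths are from the paragraph above; the script just multiplies them out):

```python
# Per-cycle operand traffic for the two schemes being compared. Both move
# the same number of floats per cycle; the difference is how many
# independently addressed groups the register file must serve.
OPERANDS = 3                          # e.g. a MAD reads three sources

# RV770-style vec4 unit: 3 operands per channel, 4 channels, 16 pixels/clk
vec4_groups = OPERANDS * 4            # 12 independent groups
vec4_floats = vec4_groups * 16        # 192 floats per cycle

# Proposed scalar unit: 3 operands, each 64 contiguous floats
scalar_groups = OPERANDS              # 3 independent groups
scalar_floats = scalar_groups * 64    # 192 floats per cycle

assert vec4_floats == scalar_floats   # same raw bandwidth either way
print(vec4_groups, scalar_groups)     # 12 vs 3 addressed groups
```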

    Now, regarding your description, are you saying that RV770 would take three cycles to calculate R1.x * R2.x + R3.x? Or that register file loads are scheduled into the instruction stream so that GPRX has all three values by the time they're needed?
     
  7. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Are you pointing this out as a problem, or just making an observation? Throughput is not going to change.

    I really need more info here, because I don't know any of the details to this scheme. There are more independent accesses when you fetch 12 operands (each being 16 contiguous floats) for a vec4 unit than when you fetch 3 operands (each being 64 contiguous floats) for a scalar unit.

    That's only for the MAD units. The T-units will only do 16 at a time. After 32 cycles, the 8 batches will each get 4 MAD instructions completed and 1 SF instruction completed. Currently the same happens for 2 batches in 8 cycles, but the 4 MAD instructions have to be in parallel.
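    As a sanity check on the throughput claim, the two arrangements retire instructions at the same rate (the figures are from the paragraph above; the tally itself is mine):

```python
# Instructions retired per cycle under each scheme, per the figures above.
# Proposed scheme: 8 batches each complete 4 MADs + 1 SF in 32 cycles.
proposed = 8 * (4 + 1) / 32
# Current scheme: 2 batches complete the same 4 MADs + 1 SF in 8 cycles,
# but the 4 MADs must be found as parallel (independent) work.
current = 2 * (4 + 1) / 8

assert proposed == current    # 1.25 batch-instructions/cycle either way
```

    The rates match; the difference is only that the current scheme needs the 4 MADs to be independent, while the proposal lets them be serially dependent.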
     
  8. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,365
    Likes Received:
    3,955
    Location:
    Well within 3d
    http://developer.amd.com/gpu_assets/R700-Family_Instruction_Set_Architecture.pdf

    Page 68 on is where this is detailed. I'm not sure I have a full handle on it yet.

    My text was incorrect earlier.
    It's 4 separate memories, each with 3 read ports, that load corresponding lanes into 3 vec4 collector registers per instruction, I think.
    The ALUs pick through this assortment over the course of 3 cycles.

    I think this is what happens, anyway.
    I'm not sure why it's this complex, maybe for sharing.

    This should be available in the X component of those GPRn registers. It can't read more than one GPR.X value per cycle. I think this is pipelined so these should be ready by the time the EX stage is hit for a given pixel instruction.
     
    #608 3dilettante, Oct 8, 2009
    Last edited by a moderator: Oct 8, 2009
  9. DemoCoder

    Veteran

    Joined:
    Feb 9, 2002
    Messages:
    4,733
    Likes Received:
    81
    Location:
    California
    I think ultimately, CUDA will be subsumed by OpenCL/DX11, since they have mostly adopted its model, but that doesn't mean NVidia will lose out. If you look at the Visual Studio tools they're shipping, plus the debuggability of their hardware, developers could still choose to use NVidia tools and hardware as their primary platform, even if they ultimately generate output for multiple cards.

    A good set of developer tools that boost productivity is hard to ignore.

    Look at Sun Microsystems during their heyday. They managed to ward off threats from Intel, HP, DEC, et al, by offering superior software tools, including Java and its ecosystem. Java was available on all platforms, but people still ended up buying Sun/Solaris hardware. The dot-com bust and recession rebooted the market, and made everyone focus on outsourcing, cloud services, etc.

    Of course, they ultimately lost out to commodity hardware, but some of their decline was due to the ineptitude of their management and their failure to adapt to a shifting market.

    NVidia will have to walk a fine line, pushing Fermi and dev tools and supporting portable standards, while finding ways to address the low-cost and mid-range chip markets, especially with their chipset revenues about to go bye-bye. Again, it doesn't seem like they had much choice. They could have gone AMD's route this cycle, but they'd be facing a bigger threat from Intel in 2011/12. They're placing a big venture bet on GPGPU while trying to hold onto graphics. If it works, they'll reap big rewards, like any risky venture. If they fail, it'll be spectacular. I felt the same way when Steve Jobs came back to Apple and introduced the iMac after killing the clones. I thought changing Apple into a software company (just sell OS X to clone makers) was a good play, but clearly Apple made vertical integration work well, so much so that I don't even own a PC anymore; I'm all Mac now.
     
  10. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,365
    Likes Received:
    3,955
    Location:
    Well within 3d
    Just restating to make sure I understood it.

    My understanding is that the reg file is 4 12-ported register banks per cluster.
    Whatever the other clusters are doing in the SIMD may not matter as far as how contiguous the registers are, if I understand the part about relative addressing mentioned in the port restrictions.

    My concern was that allowing the ALUs to work on 4 pixels simultaneously would require additional banking or ports.
    I'm not sure now if this is necessary, though I think it might be simpler if the entire file were 64 banks that dispensed with the complex scheme used right now.
     
  11. Blazkowicz

    Legend Veteran

    Joined:
    Dec 24, 2004
    Messages:
    5,607
    Likes Received:
    256
    perhaps that roadmap with codenames is credible and only the interpretation is bollocks: he gets the cards wrong and jumps to conclusions.

    here's my thinking on what the cards are, no "evil renaming scheme" conspiracy:

    D10U : GTX series
    D10P2 : GTS 250
    D10P1 : GT130
    D10M2 : GT120
    D10M1 : G100

    D12U : the big Fermi board
    D10P2 : GTS 250
    D11P1 : GT230
    D11M2 : GT220
    D11M1 : G210

    that roadmap would leave us with no mid-range or mainstream Fermi-based products until Q3 2010.

    well maybe D9 cards are renamed as D10 (/edit : well justified if there are new clocks, new PCBs). But don't those "D" codenames refer to the actual lines of cards on sale, with market segments?
     
    #611 Blazkowicz, Oct 8, 2009
    Last edited by a moderator: Oct 8, 2009
  12. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Okay, thanks. Unfortunately, this doesn't have any batch-level details, which is the most important part for what I'm talking about. There's mention of three read ports, but then they say that only one read is done per element per cycle, so I'm not sure how it all works. Maybe 64 pixels worth are loaded each cycle, so they actually do 3 cycles of reading and one cycle of writing ("Each instruction is distinguished by the destination vector element to which it writes") in the four cycles it takes to process a batch. So while working on batch B for four cycles, batch A's upcoming operands are loaded in three cycles and the writes from an earlier instruction group that exited the ALU pipeline are written in the fourth.
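    The guessed 3-read/1-write rotation can be written out as a hypothetical per-cycle schedule (this is only the reading proposed above, not a confirmed mechanism; the labels are mine):

```python
# Hypothetical port schedule for one 4-cycle batch period, per the guess
# above: three cycles reading the next batch's source operands (one per
# cycle), then one cycle writing back a retiring instruction group.
def port_schedule(cycles):
    sched = []
    for c in range(cycles):
        phase = c % 4
        if phase < 3:
            sched.append(f"cycle {c}: read src{phase} operands for next batch")
        else:
            sched.append(f"cycle {c}: write back retiring instruction group")
    return sched

for entry in port_schedule(4):
    print(entry)
```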

    Anyway, I'm very sure that it wouldn't be any more difficult to feed data into my proposed design. Remember that with a scalar design there is a lot of flexibility in what the compiler can do, as it's no longer trying to put independent streams in parallel.
     
  13. Arty

    Arty KEPLER
    Veteran

    Joined:
    Jun 16, 2005
    Messages:
    1,906
    Likes Received:
    55
    Peter Glaskowsky (one of the authors who were paid by Nvidia to write a white paper on Fermi) says Nvidia will miss the holiday season. But we already know that the holiday season doesn't matter, am I correct?
     
  14. Humus

    Humus Crazy coder
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    3,217
    Likes Received:
    77
    Location:
    Stockholm, Sweden
    Either I'm missing your magic here, or you're forgetting something. If you run 64x1 instead of 16x4 you have 64 threads instead of 16 to utilize the same number of ALUs. I don't understand how that does not translate to 4 times the register file to preserve the same amount of latency hiding. At best the compiler may be able to retire registers somewhat earlier to reduce the GPR count by one or two, but I don't think that's going to be anywhere close to make up for the loss.
     
  15. Groo The Wanderer

    Regular

    Joined:
    Jan 23, 2007
    Messages:
    334
    Likes Received:
    2
    If you read the story, you would see that there were *2* roadmaps shown to me, one with code names by quarter, the other with product names by season.

    And yes, don't expect Fermis in anything more than publicity stunt quantities until late spring or early summer.

    -Charlie
     
  16. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,824
    Likes Received:
    253
    Location:
    Taiwan
    Well said. Right now the most mature OpenCL implementation comes from Apple, and it's still not really that mature. NVIDIA's CUDA SDK for MacOS X has more functions (profilers, for example) than OpenCL on MacOS X. The situation of OpenCL implementations on Windows is even worse.

    The situation with the DX11 compute shader seems to be much better. At least NVIDIA's driver doesn't seem to have any particular performance problem with compute shaders right now (in contrast to the current OpenCL driver on Windows). I don't know about the state of AMD's compute shader driver, but from what I've heard it's pretty good too. However, DX11 compute shader still lacks documentation, and profilers, debuggers, etc. are still very limited.
     
  17. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    I'm not quadrupling the texture rate. Cycles of latency hiding equals #threads divided by texture throughput.

    Think of it this way: You still have the same ALU:TEX ratio, still have clauses, and still have instruction groups, but now you don't need to find 4 independent scalar instructions to fill up the ALU.xyzw slots. The ability to get high utilization for serially dependent instructions is really the only advantage that NVidia's scalar pipeline has over ATI's 5x1D pipeline.
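    The rule of thumb in the first sentence above can be made concrete (the 16/clock texture rate matches the 16-wide T units discussed earlier in the thread; the thread count is purely illustrative):

```python
# Cycles of latency hiding = threads in flight / texture throughput.
# Batch width doesn't enter: going from 16x4 to 64x1 batches with the
# same total threads and the same TEX rate hides the same latency.
def latency_hiding_cycles(threads_in_flight, tex_pixels_per_clock=16):
    return threads_in_flight / tex_pixels_per_clock

# e.g. 4096 pixels in flight at 16 texture lookups per clock:
print(latency_hiding_cycles(4096))   # 256.0 cycles covered
```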
     
  18. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    The difference between Apple and NVidia is that Apple protected the market that they created.

    NVidia's tools may be great, but not only is it probably very easy for ATI to basically copy them feature for feature on the software front once the market becomes larger, but even if they can't, open standards will make it irrelevant because final deployment can be on any hardware.

    If NVidia overhauls their pipeline and can find the same perf/mm2 miracles that ATI did, or if optimizations for their hardware do not carry over to ATI's, then maybe the dev advantage will carry over to the actual deployment of GPGPU-based products. If not, though, then ATI's superior bang for the buck will let them snatch a large part of the market that NVidia created, without having had to undermine their GPU business to do it.
     
  19. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,824
    Likes Received:
    253
    Location:
    Taiwan
    Not necessarily. For example, a good profiler is only useful for particular hardware. Sometimes a shader optimized for one piece of hardware will also run well on another, but unless the architectures of NVIDIA's and AMD's GPUs converge at some point, tools designed for NVIDIA's GPUs are not going to be very useful for AMD's.

    For example, NVIDIA currently exposes a scalar model, although its hardware is actually SIMD based. AMD, on the other hand, chooses a higher-density vector model. An optimizer designed for a scalar model is not going to be very useful on a vector-model machine.
     
  20. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,604
    Likes Received:
    648
    Location:
    New York
    @Mintmaster

    Won't clauses go away as memory access patterns and latencies become more varied and unpredictable? I don't see how they're sustainable in the compute world.
     