NVIDIA Fermi: Architecture discussion

It would need 4x as many branch units.
Read my post again. Branching is done at the same rate as before (I'm assuming that current hardware does it once every instruction group when I say my method will branch once every four scalar instructions).
Breaking up the groups of 4 would bring up register file concerns.
Either this will quadruple the register file to maintain the same amount of registers, or it will split up the weird quad registers and potentially simplify the register access restrictions.
I'm pretty sure that my method doesn't really change things at all here. The only minor issue is that in an 8 cycle period, the total possible locations that need to be accessed from the register file is four times larger with my method. Actual transfer rate will be the same, and the size of the register file is the same, too.
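One way to read that counting, sketched in Python with assumed figures (64-pixel batches, 3 source operands per instruction; only the ratios matter):

# Rough counting behind the "four times larger" remark above.  Assumed figures:
# 64-pixel batches, 3 source operands per instruction.
PIXELS_PER_BATCH = 64
FLOATS_PER_CYCLE = 3 * PIXELS_PER_BATCH      # 192 floats/cycle either way -> same transfer rate

# In an 8-cycle window, the current scheme works through instruction groups from 2 batches;
# the proposed scheme touches one instruction from each of 8 different batches.
current_contexts  = 2 * PIXELS_PER_BATCH     # 128 pixel contexts addressed
proposed_contexts = 8 * PIXELS_PER_BATCH     # 512 pixel contexts addressed

print(proposed_contexts // current_contexts) # 4 -> four times as many possible locations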

The trans units would have a problem, though, since their register accesses piggyback on the datapaths of the slim ALUs, and breaking the clusters will leave them orphaned or picking between a lot of lanes.
I'm not going to break the clusters, though. The T units will put 16 pixels from the same batch the ALUs are working on into their pipeline. The ALUs go round robin on 8 batches, and each batch will stay active for at least four visits (32 cycles) to allow the T units and branch units to finish up.

(FYI, by active I mean that they have data going through the ALU stages. There are plenty more batches in flight, put on hold either for texture fetches or simply waiting their turn.)
 
There are so many cross-cutting issues with splitting up the register banks that I'm having a hard time wrapping my head around them, given how byzantine access is currently.

For RV770, the 4-component vector registers are implemented as 4 basically independent single-component register banks that can read one value each per clock into a GPRn collecting register for 3 cycles. All units share from this.

Splitting the units up makes this highly redundant.
But do we keep the 4-component register organization and balloon the register file, or do we cut it down so each unit gets a bank and that bank is addressed as a bunch of scalar registers? What costs are there?

What happens with TEX unit writes to the register file?
Since each bank is now per-pixel, that's additional write ports per separate bank.
 
Sorry, Charlie, but that roadmap, at least the second table, is fake at best.
Nvidia's top card is the GTX 285, not the 280, and anyway both versions can have 1 or 2 gigabytes of VRAM. And this is only one of the inconsistencies.

Fake it isn't, I know who showed me the original. That said, they might be playing product naming games, i.e. not tipping their hand as to what they were planning on renaming the part to. They do that a lot.

-Charlie
 
Read my post again. Branching is done at the same rate as before (I'm assuming that current hardware does it once every instruction group when I say my method will branch once every four scalar instructions).
I missed that part.
So in this case, a batch will do branch processing for 4 cycles from its point of view, before it can try coming back for ALU instruction execution.
With 8 batches actively cycling and bringing their own branches, the first one won't get to ALU work for 32 global clocks.
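To make the bookkeeping concrete, here is a toy conversion (Python; it assumes, per the description above, that each of the 8 active batches owns one ALU slot per rotation, so cycles counted from a batch's own point of view translate into whole rotations of global clocks):

# Toy model of the restatement above.  Assumption: 8 active batches, each owning
# one ALU slot per rotation, so 1 batch-local cycle = 1 full rotation of global clocks.
BATCHES = 8
BRANCH_CYCLES_PER_BATCH = 4   # a 64-pixel batch through a 16-wide branch unit

def to_global_clocks(batch_local_cycles, batches=BATCHES):
    """Convert cycles seen from one batch's point of view into global clocks."""
    return batch_local_cycles * batches

print(to_global_clocks(BRANCH_CYCLES_PER_BATCH))  # 32 global clocks before ALU work resumes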

I'm pretty sure that my method doesn't really change things at all here. The only minor issue is that in an 8 cycle period, the total possible locations that need to be accessed from the register file is four times larger with my method. Actual transfer rate will be the same, and the size of the register file is the same, too.
That number of independent accesses doesn't seem possible under the current scheme.
The physical size of the register file would be bigger because of the number of ports, and under the access method used now it would be highly redundant.
Some of the port-sharing methods made possible by the weird register access scheme will not work with 64-pixel wavefronts. There is no sharing of reads to the same register address, since each lane is dedicated to a separate context.


I'm not going to break the clusters, though. The T units will put 16 pixels from the same batch the ALUs are working on into their pipeline. The ALUs go round robin on 8 batches, and each batch will stay active for at least four visits (32 cycles) to allow the T units and branch units to finish up.
You have to if you want to process an instruction for 64 pixels at once. The clustering is what enforces the 16 pixels per clock.
 
There are so many cross-cutting issues with splitting up the register banks that I'm having a hard time wrapping my head around them, given how byzantine access is currently.

For RV770, the 4-component vector registers are implemented as 4 basically independent single-component register banks that can read one value each per clock into a GPRn collecting register for 3 cycles. All units share from this.

Splitting the units up makes this highly redundant.
But do we keep the 4-component register organization and balloon the register file, or do we cut it down so each unit gets a bank and that bank is addressed as a bunch of scalar registers? What costs are there?

What happens with TEX unit writes to the register file?
Since each bank is now per-pixel, that's additional write ports per separate bank.
Is there a PDF outlining all of this? I remember seeing one for R600 or RV670, but that's it.

The main point I was trying to get across is that if I ignore any access restrictions, currently we have the vec4 unit of RV770's SIMD needing 12 groups (three operands per channel) of 16 floats every cycle. My method needs only 3 groups of 64 floats every cycle, which is much simpler on the face of it. However, it's possible that through pipelining, RV770 actually fetches data as 12 groups of 64 floats every four cycles, but that's still no easier than what I'm proposing.
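Putting the float counts side by side (a quick check in Python, using only the numbers quoted above):

# Per-cycle operand traffic for both schemes, using the numbers quoted above.
rv770_groups,    rv770_group_size    = 12, 16   # 3 operands x 4 channels, 16 pixels each
proposed_groups, proposed_group_size = 3, 64    # 3 operands, 64 pixels each

print(rv770_groups * rv770_group_size)          # 192 floats/cycle
print(proposed_groups * proposed_group_size)    # 192 floats/cycle
# Same raw bandwidth; the proposed scheme just needs far fewer independently
# addressed groups per cycle (3 instead of 12).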

Now, regarding your description, are you saying that RV770 would take three cycles to calculate R1.x * R2.x + R3.x? Or that register file loads are scheduled into the instruction stream so that GPRX has all three values by the time they're needed?
 
With 8 batches actively cycling and bringing their own branches, the first one won't get to ALU work for 32 global clocks.
Are you pointing this out as a problem or just making an observation? Throughput is not going to change.

That number of independent accesses doesn't seem possible under the current scheme.
The physical size of the register file would be bigger because of the number of ports, and under the access method used now it would be highly redundant.
Some of the port-sharing methods made possible by the weird register access scheme will not work with 64-pixel wavefronts. There is no sharing of reads to the same register address, since each lane is dedicated to a separate context.
I really need more info here, because I don't know any of the details to this scheme. There are more independent accesses when you fetch 12 operands (each being 16 contiguous floats) for a vec4 unit than when you fetch 3 operands (each being 64 contiguous floats) for a scalar unit.

You have to if you want to process an instruction for 64 pixels at once. The clustering is what enforces the 16 pixels per clock.
That's only for the MAD units. The T-units will only do 16 at a time. After 32 cycles, the 8 batches will each get 4 MAD instructions completed and 1 SF instruction completed. Currently the same happens for 2 batches in 8 cycles, but the 4 MAD instructions have to be in parallel.
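Tallying both schemes over the same window (a quick Python check with the figures from this exchange; 64-pixel batches assumed, counts are per-pixel instruction completions):

# Throughput over a 32-cycle window, using the figures above.
PIXELS = 64

# Proposed: 8 batches each complete 4 MAD instructions and 1 SF instruction.
proposed = {"mad": 8 * 4 * PIXELS, "sf": 8 * 1 * PIXELS}

# Current: 2 batches do the same every 8 cycles, i.e. 8 batches per 32 cycles,
# but those 4 MADs must be independent (packed into one instruction group).
current = {"mad": (32 // 8) * 2 * 4 * PIXELS, "sf": (32 // 8) * 2 * 1 * PIXELS}

print(proposed == current)  # True -- same throughput, different packing constraint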
 
Is there a PDF outlining all of this? I remember seeing one for R600 or RV670, but that's it.
http://developer.amd.com/gpu_assets/R700-Family_Instruction_Set_Architecture.pdf

Page 68 on is where this is detailed. I'm not sure I have a full handle on it yet.

My text was incorrect earlier.
It's 4 separate memories with 3 ports per instruction that load to corresponding lanes in 3 vec4 collector registers, I think.
The ALUs pick through this assortment over the course of 3 cycles.

I think this is what happens, anyway.
I'm not sure why it's this complex, maybe for sharing.

Now, regarding your description, are you saying that RV770 would take three cycles to calculate R1.x * R2.x + R3.x? Or that register file loads are scheduled into the instruction stream so that GPRX has all three values by the time they're needed?
This should be available in the X component of those GPRn registers. It can't read more than one GPR.X value per cycle. I think this is pipelined so these should be ready by the time the EX stage is hit for a given pixel instruction.
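A guess at how that staging might look for R1.x * R2.x + R3.x, as a small Python sketch (this only illustrates the collector-register reading above; the names reg_file and gpr_collector are invented, and the real port arbitration is surely more involved):

# Speculative sketch of the 3-cycle operand staging described above.
# reg_file and gpr_collector are invented names; values are arbitrary.
reg_file = {"R1.x": 1.5, "R2.x": 2.0, "R3.x": 0.25}   # x-component bank, one pixel

gpr_collector = {}
for cycle, src in enumerate(["R1.x", "R2.x", "R3.x"]):
    # one GPR.x read per cycle into the collector, as described above
    gpr_collector[src] = reg_file[src]

# By the time the EX stage is reached, all three operands are staged:
result = gpr_collector["R1.x"] * gpr_collector["R2.x"] + gpr_collector["R3.x"]
print(result)  # 3.25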
 
I think ultimately, CUDA will be subsumed by OpenCL/DX11, since they have mostly adopted its model, but that doesn't mean NVidia will lose out. If you look at the Visual Studio tools they're shipping, plus the debuggability of their hardware, developers could still choose to use NVidia tools and hardware as their primary platform, even if they ultimately generate output for multiple cards.

A good set of developer tools that boost productivity is hard to ignore.

Look at Sun Microsystems during their heyday. They managed to ward off threats from Intel, HP, DEC, et al, by offering superior software tools, including Java and its ecosystem. Java was available on all platforms, but people still ended up buying Sun/Solaris hardware. The dot-com bust and recession rebooted the market, and made everyone focus on outsourcing, cloud services, etc.

Of course, they ultimately lost out to commodity hardware, but some of that was due to the ineptitude of their management and their failure to adapt to a shifting market.

NVidia will have to walk a fine line, pushing Fermi and dev tools, supporting portable standards, while finding ways to address low-cost and mid-range chip markets, especially with their chipset revenues about to go bye-bye. Again, it doesn't seem like they had much choice. They could have gone AMD's route this cycle, but they'd be facing a bigger threat from Intel in 2011/12. They're placing a big venture bet on GPGPU while trying to hold onto graphics. If it works, they'll reap big rewards, like any risky venture. If they fail, it'll be spectacular. I felt the same way when Steve Jobs came back to Apple and introduced the iMac after killing clones. I thought changing Apple into a software company (just sell OS X to clone makers) was a good play, but clearly, Apple made vertical integration work well, so much so, that I don't even own a PC anymore, I'm all Mac now.
 
Are you pointing this out as a problem or just making an observation? Throughput is not going to change.
Just restating to make sure I understood it.

I really need more info here, because I don't know any of the details to this scheme. There are more independent accesses when you fetch 12 operands (each being 16 contiguous floats) for a vec4 unit than when you fetch 3 operands (each being 64 contiguous floats) for a scalar unit.
My understanding is that the reg file is 4 12-ported register banks per cluster.
Whatever the other clusters are doing in the SIMD may not matter as far as how contiguous the registers are, if I understand the part about relative addressing mentioned in the port restrictions.

My concern was that allowing the ALUs to work on 4 pixels simultaneously would require additional banking or ports.
I'm not sure now if this is necessary, though I think it might be simpler if the entire file were 64 banks that dispensed with the complex scheme used right now.
 

perhaps that roadmap with codenames is credible, only the interpretation is bollocks, he gets the cards wrong and jumps to conclusions.

here's my thinking on what the cards are, no "evil renaming scheme" conspiracy :

D10U : GTX series
D10P2 : GTS 250
D10P1 : GT130
D10M2 : GT120
D10M1 : G100

D12U : the big Fermi board
D10P2 : GTS 250
D11P1 : GT230
D11M2 : GT220
D11M1 : G210

that roadmap would leave us with no mid-range or mainstream Fermi based products until Q3 2010.

well maybe D9 cards are renamed as D10 (/edit : well justified if there are new clocks, new PCBs). But don't those "D" codenames refer to the actual lines of cards on sale, with market segments?
 
Okay, thanks. Unfortunately, this doesn't have any batch-level details, which is the most important part for what I'm talking about. There's mention of three read ports, but then they say that only one read is done per element per cycle, so I'm not sure how it all works. Maybe 64 pixels' worth are loaded each cycle, so they actually do 3 cycles of reading and one cycle of writing ("Each instruction is distinguished by the destination vector element to which it writes") in the four cycles it takes to process a batch. So while working on batch B for four cycles, batch A's upcoming operands are loaded in three cycles, and the results from an earlier instruction group that exited the ALU pipeline are written in the fourth.
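If that guess is right, the per-bank schedule over one 4-cycle batch slot would look something like the sketch below (Python; purely an illustration of the speculation above, not documented behaviour):

# Sketch of the guessed 4-cycle register-file schedule: while batch B executes,
# batch A's operands are read over 3 cycles and an older batch's results are
# written back in the 4th.  Entirely speculative.
for cycle in range(4):
    alu = "batch B: ALU cycle %d" % cycle
    if cycle < 3:
        regfile = "read operand group %d for batch A" % cycle
    else:
        regfile = "write back results from an earlier batch"
    print(alu, "|", regfile)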

Anyway, I'm very sure that it wouldn't be any more difficult to feed data into my proposed design. Remember that with a scalar design there is a lot of flexibility in what the compiler can do, as it's no longer trying to put independent streams in parallel.
 
Peter Glaskowsky (one of the authors who were paid by Nvidia to write a white paper on Fermi) says Nvidia will miss the holiday season. But we already know that the holiday season doesn't matter, am I correct?
 
I'm pretty sure that my method doesn't really change things at all here. The only minor issue is that in an 8 cycle period, the total possible locations that need to be accessed from the register file is four times larger with my method. Actual transfer rate will be the same, and the size of the register file is the same, too.

Either I'm missing your magic here, or you're forgetting something. If you run 64x1 instead of 16x4 you have 64 threads instead of 16 to utilize the same number of ALUs. I don't understand how that does not translate to 4 times the register file to preserve the same amount of latency hiding. At best the compiler may be able to retire registers somewhat earlier to reduce the GPR count by one or two, but I don't think that's going to be anywhere close to make up for the loss.
 
perhaps that roadmap with codenames is credible, only the interpretation is bollocks, he gets the cards wrong and jumps to conclusions.

here's my thinking on what the cards are, no "evil renaming scheme" conspiracy :

D10U : GTX series
D10P2 : GTS 250
D10P1 : GT130
D10M2 : GT120
D10M1 : G100

D12U : the big Fermi board
D10P2 : GTS 250
D11P1 : GT230
D11M2 : GT220
D11M1 : G210

that roadmap would leave us with no mid-range or mainstream Fermi based products until Q3 2010.

well maybe D9 cards are renamed as D10 (/edit : well justified if there are new clocks, new PCBs). But don't those "D" codenames refer to the actual lines of cards on sale, with market segments?

If you read the story, you would see that there were *2* roadmaps shown to me, one with code names by quarter, the other with product names by season.

And yes, don't expect Fermis in anything more than publicity-stunt quantities until late spring or early summer.

-Charlie
 
I think ultimately, CUDA will be subsumed by OpenCL/DX11, since they have mostly adopted its model, but that doesn't mean NVidia will lose out. If you look at the Visual Studio tools they're shipping, plus the debuggability of their hardware, developers could still choose to use NVidia tools and hardware as their primary platform, even if they ultimately generate output for multiple cards.

A good set of developer tools that boost productivity is hard to ignore.

Well said. Right now the most mature OpenCL implementation comes from Apple, and it's still not really that mature. NVIDIA's CUDA SDK for MacOS X has more functions (profilers, for example) than OpenCL on MacOS X. The situation of OpenCL implementations on Windows is even worse.

The situation with DX11 compute shader seems to be much better. At least NVIDIA's driver doesn't seem to have any particular performance problem with compute shader right now (in contrast to the current OpenCL driver on Windows). I don't know about the situation of AMD's compute shader driver, but from what I've heard it's pretty good too. However, DX11 compute shader still lacks documentation, and profilers, debuggers, etc. are still very limited.
 
Either I'm missing your magic here, or you're forgetting something. If you run 64x1 instead of 16x4 you have 64 threads instead of 16 to utilize the same number of ALUs. I don't understand how that does not translate to 4 times the register file to preserve the same amount of latency hiding. At best the compiler may be able to retire registers somewhat earlier to reduce the GPR count by one or two, but I don't think that's going to be anywhere close to make up for the loss.
I'm not quadrupling the texture rate. Cycles of latency hiding equals #threads divided by texture throughput.
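Plugging illustrative numbers into that relation (Python; the thread hasn't pinned down actual thread counts or texture rates, so these values are placeholders):

# Cycles of latency hiding = threads in flight / texture throughput (from the line above).
# The numbers below are placeholders, not real hardware figures.
def latency_hiding_cycles(threads_in_flight, tex_lookups_per_cycle):
    return threads_in_flight / tex_lookups_per_cycle

# Same threads in flight and the same texture rate in both layouts, so the hiding
# capacity is unchanged whether the ALUs are organised 16x4 or 64x1:
print(latency_hiding_cycles(4096, 16))   # 256.0 either way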

Think of it this way: You still have the same ALU:TEX ratio, still have clauses, and still have instruction groups, but now you don't need to find 4 independent scalar instructions to fill up the ALU.xyzw slots. The ability to get high utilization for serially dependent instructions is really the only advantage that NVidia's scalar pipeline has over ATI's 5x1D pipeline.
 
A good set of developer tools that boost productivity is hard to ignore.

...

I felt the same way when Steve Jobs came back to Apple and introduced the iMac after killing clones. I thought changing Apple into a software company (just sell OS X to clone makers) was a good play, but clearly, Apple made vertical integration work well, so much so, that I don't even own a PC anymore, I'm all Mac now.
The difference between Apple and NVidia is that Apple protected the market that they created.

NVidia's tools may be great, but not only is it probably very easy for ATI to basically copy them feature for feature on the software front when the market becomes larger, but even if they can't, open standards will make it irrelevant because final deployment can be on any hardware.

If NVidia overhauls their pipeline and can find the same perf/mm2 miracles that ATI did, or if optimizations on their hardware do not carry over to ATI's, then maybe the dev advantage will carry over to the actual deployment of the GPGPU-based product. If not, though, then ATI's superior bang for the buck will let them snatch a large part of the market that NVidia created, and they won't have had to undermine their GPU business to do it.
 
The difference between Apple and NVidia is that Apple protected the market that they created.

NVidia's tools may be great, but not only is it probably very easy for ATI to basically copy them feature for feature on the software front when the market becomes larger, but even if they can't, open standards will make it irrelevant because final deployment can be on any hardware.

Not necessarily. For example, a good profiler is only useful for a particular piece of hardware. A shader optimized for one piece of hardware will sometimes run well on another too, but unless NVIDIA's and AMD's GPU architectures converge at some point, tools designed for NVIDIA's GPUs are not going to be very useful for AMD's.

For example, currently NVIDIA uses a scalar model although its hardware is actually SIMD based. On the other hand, AMD chooses to make a higher-density vector model. An optimizer designed for a scalar model is not going to be very useful for a vector model machine.
 
@Mintmaster

Won't clauses go away as memory access patterns and latencies become more varied and unpredictable? I don't see how they're sustainable in the compute world.
 