Nvidia GT300 core: Speculation

Ummm... talking about Nvidia vs AMD/ATi, I wasn't including Intel in those numbers, see "discrete."
Obviously we get our numbers from different sources.


What are your sources, then? I haven't seen you put any numbers up yet.



Yes. First on graphics, no, we don’t at this point see our sequential decline as being driven by share loss. Rather it was predominantly driven by OEM notebook build coming down so it could drain inventory from the system. You know, over half of our sequential decline was driven by notebook graphics.

This is from the AMD conference call, so out of the 15% decline, more than half of it was due to notebook graphics, and the rest, less than half, came from elsewhere? Even if desktop discrete was flat (my understanding is somewhere between a small loss, within 1%, and flat), they didn't gain anything in market share.
 
This is from the AMD conference call, so out of the 15% decline, more than half of it was due to notebook graphics, and the rest, less than half, came from elsewhere? Even if desktop discrete was flat (my understanding is somewhere between a small loss, within 1%, and flat), they didn't gain anything in market share.
Fewer sales don't necessarily mean less market share. If the competition lost even more sales, then AMD could gain market share. AMD stated that they didn't feel their decline in sales was due to share loss; did you not understand that part?

-FUDie
 
Fewer sales don't necessarily mean less market share. If the competition lost even more sales, then AMD could gain market share. AMD stated that they didn't feel their decline in sales was due to share loss; did you not understand that part?

-FUDie


Hmm, nV was confident last quarter that they gained market share as well. Although the specific market-share numbers aren't out, nV has increased around 5% in market share while ATi increased 1-2%, according to JPR, specifically for desktop. I don't see how nV would have lost desktop discrete, because compared to IGPs for desktops, they sell more chips when it comes to add-in boards.
 
I somewhat disagree.

I believe GT300 will be a major architectural change, even if perhaps not an extremely drastic one.

I expect GT300 to be more of a change from GT200 than NV40-->NV47/G70, or G80-->GT200.

Perhaps somewhat less than NV25-->NV30. I'd say going from GT200 to GT300 will be somewhat like going from NV47/G70-->G80.

GT300 is supposed to be Nvidia's first new architecture since G80 in 2006, and it'll be Nvidia's answer to Larrabee. GT300 is, finally, NV60, whereas everything we've had after G80/NV50 has been basically NV5x.

The problem is that once you start speculating in that direction, you'd have to suggest a couple of details that actually point towards less evolutionary changes for GT3x0 in the first place. Just because it's a new technology generation, or because they might internally call it "GT300" (formerly known as "NV60"), doesn't mean much in the end.

Let me make it simpler: at the moment we have (wannabe) speculation that GT3x0 will have 512 SPs/128 TMUs and RV870 will have 1200 SPs/48 TMUs. The point where all these theories might fall flat on their faces is that when IHVs go to new technology generations, they rarely just increase unit counts; first and foremost they improve the efficiency per unit wherever it's necessary.

IMHLO there should be quite a bit of low-hanging fruit even in more "primitive" sections like rasterizing, tri-setup and the like, where IHVs could have increased efficiency, as well as, of course, in the ALUs themselves (besides the necessary additions for DX11 compliance).

Sterile numbers (whether unit counts or codenames) mean jackshit if we don't have a better picture of how each IHV might have improved efficiency at spots A, B or C of an architecture. All I can read out of your post above (with all due respect) is merely a gut feeling.

Anyone with any educated guess that could make sense eventually? (a good chance to bounce back on topic too :) )....
 
Hm, I think the 2900 GTs were all 256-bit. The VRM circuitry was also simplified, no Rage Theater ASIC and so on - clicky!
To tell the truth, I've never seen that card before; the GTs that were on sale here were all downclocked XTs. Though I'm not saying that versions with the 256-bit PCB weren't selling somewhere else.
 
PRO and GT were only short-term products - clearance of stock before RV670 arrived. The majority of them were re-flashed HD2900XT boards with a 512-bit memory bus, even though the original plan was to use a cheaper 256-bit PCB.

Cost-down products aren't good indicators of yields - manufacturers often use full-fledged GPUs, because demand for these cheap products is higher than the number of defective parts. It's better for the manufacturer to sell some GPUs at lower prices than to leave the segment to a competitor.

You can also notice that there are even HD4830 boards whose SIMDs aren't even disabled down to 640...
But then, why use three different versions to clear stocks? Wouldn't one model, maybe the full-blown "Pro", which was only downclocked, have sufficed, and, unlike the 2900 GT, not have damaged the 2900's reputation any further?


I agree that cost-down products per se are not an indicator of good or bad yields, but partly disabled GPUs can be. After all, you need to provide the consumer with some incentive to buy the higher-priced version to make your amortization plans work. You cannot design a product for market range X and then sell the vast majority of it two stories lower. At least not if you want to stay profitable.


To tell the truth, I've never seen that card before; the GTs that were on sale here were all downclocked XTs. Though I'm not saying that versions with the 256-bit PCB weren't selling somewhere else.

What you're talking about were the Pro models. GTs were a different breed altogether, at least here in Germany. They had a different PCB, different VRMs, slightly different cooling, all of them only a 256-bit bus as opposed to the "Pro" and "XT", plus they had one of the four SIMDs disabled.
 
CarstenS: I had an HD2900GT in my hands and, as Lukfi said, at least the models which were available on the local market were based on the XT...

I think the original plan was to launch GT and PRO, which would target the price segment between the HD2600XT-GDDR4 and the HD2900XT. But as we know now, RV670 was ready sooner than expected (lucky A11 revision) and ATi needed to clear old stock as soon as possible, so plans were changed and the new priority was to sell off the stock of HD2900XTs and all R600 GPUs.

You can also find different ATi slides with different GT and PRO specs.
 
CarstenS: I had an HD2900GT in my hands and, as Lukfi said, at least the models which were available on the local market were based on the XT...

Me too. And in our local market, even though not far apart geographically, it definitely was another design:
http://www.pcgameshardware.de/aid,6...adeon-HD2900-GT/Aktuelle-Tests-auf-PCGH/Test/

Especially:
http://www.pcgameshardware.de/screenshots/medium/2007/12/1197030579689.jpg
(you may need to copy the URL into your browser's address bar)
 
IMHLO there should be quite a bit of low-hanging fruit even in more "primitive" sections like rasterizing, tri-setup and the like, where IHVs could have increased efficiency, as well as, of course, in the ALUs themselves (besides the necessary additions for DX11 compliance).

How about a few suggestions then (not sure if these would be considered evolutionary or revolutionary),

1. Hardware now has more than one fixed-function geometry unit (i.e. rasterization, etc.), with the number of geometry units varying by GPU model.

2. Core (core as in a SIMD processing unit) on-chip memory is currently read-only and cached. This memory is used for shader constants and for CUDA constant memory. What if this became writable? (See the CUDA sketch after point 3 below for the read-only status quo.) This might be suggested by a late-2007 NV patent filing referring to this on-chip memory/cache holding "results produced by executing general purpose computing program instructions". That could still mean results produced in a previous kernel execution, however, in which case the memory might still be read-only.

3. In-core instruction scheduling moves from vertex/fragment-specific to general-purpose, meaning there is no longer a context switch from graphics to CUDA mode, and N programs per core of any type can be scheduled at a time. This seems to be needed for DX11 anyway, because of all the new shader types.
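
To make point 2 concrete, here is a minimal CUDA sketch of the programmer-visible status quo it describes: __constant__ memory is read-only from device code (it can only be written from the host), while per-multiprocessor __shared__ memory is writable by the kernel. The kernel and variable names are just illustrative, not anything from NVIDIA's documentation or the patent.

```
#include <cstdio>
#include <cuda_runtime.h>

// Constant memory: cached, per-grid, read-only from device code.
// It can only be filled from the host (cudaMemcpyToSymbol).
__constant__ float coeffs[4];

__global__ void evaluate(const float* x, float* y, int n)
{
    // Shared memory lives on the multiprocessor and IS writable by threads.
    __shared__ float partial[256];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Reading constants is fine; writing to coeffs here isn't allowed,
        // since constant memory is read-only on the device.
        float v = coeffs[0] + coeffs[1] * x[i];
        partial[threadIdx.x] = v;       // legal: shared memory is read/write
        y[i] = partial[threadIdx.x];
    }
}

int main()
{
    const int n = 256;
    float hx[n], hy[n], hc[4] = {1.0f, 2.0f, 0.0f, 0.0f};
    for (int i = 0; i < n; ++i) hx[i] = float(i);

    float *dx, *dy;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(coeffs, hc, sizeof(hc));   // host-side write to constant memory

    evaluate<<<1, 256>>>(dx, dy, n);
    cudaMemcpy(hy, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[10] = %f\n", hy[10]);               // expect 1 + 2*10 = 21

    cudaFree(dx);
    cudaFree(dy);
    return 0;
}
```

If that on-chip memory became writable from kernels, you could imagine the shared-memory half of this sketch and the constant half collapsing into one read/write pool; whether GT3xx does anything like that is pure guesswork.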
 
See, as soon as something interesting comes into the mix, the sound of silence strikes :p

Hardware now has more than one fixed-function geometry unit (i.e. rasterization, etc.), with the number of geometry units varying by GPU model.

Dumb layman's question: could a tessellation unit be "abused" to serve as a secondary fixed-function geometry unit?
 
See, as soon as something interesting comes into the mix, the sound of silence strikes :p

Yeah, where is Jawed?

Dumb layman's question: could a tessellation unit be "abused" to serve as a secondary fixed-function geometry unit?

Seems to me that the output of tessellation ends up going to the raster unit anyway. Perhaps if the fixed-function vertex setup was a limiting factor, one could "abuse" tessellation to get around that bottleneck, assuming the rasterization could keep up.

And another suggestion for possible GT3xx changes,

4. SIMWMD -> Multiple Warp Multiple Data (but same instruction) with a 4:1 ALU:issue clock ratio. A later NV patent covers this in what is described as "Supergroup SIMD".

For comparison, isn't the GT200 ALU 8-wide, with the issue logic dumping out a 16-wide (half-warp) SIMD instruction every pair of ALU clocks? I thought someone on B3D mentioned that verts actually use the 16-wide issue, but pixel shaders and CUDA always pair instructions for two half-warps of the same 32-wide warp.

What if, for GT3xx, instruction issue sent 4 warp indexes {A,B,C,D} to the ALU units per instruction issue instead of one, and instruction issue ran at 1/4 the ALU clock? Then the same instruction would apply to threads 0-7 of warp A, threads 8-15 of warp B, threads 16-23 of warp C, and threads 24-31 of warp D. Effectively SIMWMD, a form of coarse dynamic warp formation.
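
Purely to pin down the mapping being proposed (this is speculation about a hypothetical GT3xx issue scheme, not anything NVIDIA has documented), a little host-side sketch that just enumerates which threads of which warp an 8-wide ALU would touch on each of the four ALU beats of one issue:

```
#include <cstdio>

// Hypothetical "supergroup" issue: one instruction is issued every 4 ALU
// clocks and applies to four different 32-thread warps {A,B,C,D}, with each
// warp contributing a different 8-thread slice to the 8-wide ALU.
int main()
{
    const char* warps[4] = {"A", "B", "C", "D"};
    const int alu_width = 8;

    for (int beat = 0; beat < 4; ++beat) {
        int first = beat * alu_width;            // 0, 8, 16, 24
        int last  = first + alu_width - 1;       // 7, 15, 23, 31
        printf("ALU beat %d: warp %s, threads %2d-%2d\n",
               beat, warps[beat], first, last);
    }
    // Output:
    //   ALU beat 0: warp A, threads  0- 7
    //   ALU beat 1: warp B, threads  8-15
    //   ALU beat 2: warp C, threads 16-23
    //   ALU beat 3: warp D, threads 24-31
    return 0;
}
```

The point of issuing to four different warps rather than four slices of one warp would be that a divergent or stalled warp only wastes its own slice, not the whole issue.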
 
:LOL: OK...
1. Hardware now has more than one fixed-function geometry unit (i.e. rasterization, etc.), with the number of geometry units varying by GPU model.
I suppose it could, e.g. in the way that attribute interpolation rates (per rasterised fragment) in ATI vary depending on GPU.

2. Core (core as in a SIMD processing unit) on-chip memory is currently read-only and cached. This memory is used for shader constants and for CUDA constant memory. What if this became writable? This might be suggested by a late-2007 NV patent filing referring to this on-chip memory/cache holding "results produced by executing general purpose computing program instructions". That could still mean results produced in a previous kernel execution, however, in which case the memory might still be read-only.
Like an F-buffer? Or an A-buffer?

This thread's interesting (pure luck that the thread's been re-awakened, didn't know about it):

http://forums.amd.com/devforum/messageview.cfm?catid=328&threadid=103920

as it seems data can be persistent in "global registers" across kernel calls (these registers are local to a SIMD, they're not GDS).
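
There's no direct CUDA equivalent of those per-SIMD global registers, but as a loose illustration of data surviving across kernel calls, a __device__ variable in global memory persists between launches within the same context (names below are just for the example, and global memory is obviously not on-chip like AMD's global registers):

```
#include <cstdio>
#include <cuda_runtime.h>

// Lives in device global memory for the lifetime of the context,
// so its value carries over from one kernel launch to the next.
__device__ int persistent_counter = 0;

__global__ void bump()
{
    if (threadIdx.x == 0 && blockIdx.x == 0)
        persistent_counter += 1;
}

int main()
{
    bump<<<1, 32>>>();
    bump<<<1, 32>>>();   // the second launch sees the value left by the first

    int host_copy = 0;
    cudaMemcpyFromSymbol(&host_copy, persistent_counter, sizeof(int));
    printf("counter after two launches: %d\n", host_copy);   // prints 2
    return 0;
}
```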

I haven't come across the NVidia patent you're referring to.

3. In-core instruction scheduling moves from vertex/fragment-specific to general-purpose, meaning there is no longer a context switch from graphics to CUDA mode, and N programs per core of any type can be scheduled at a time. This seems to be needed for DX11 anyway, because of all the new shader types.
Don't GPUs already run folding@home at the same time as running Aero or a game? Separately, when a game runs windowed under Aero, there are "independent" sets of multiple kernels running? Not sure what's really going on to be honest.

Jawed
 
What if, for GT3xx, instruction issue sent 4 warp indexes {A,B,C,D} to the ALU units per instruction issue instead of one, and instruction issue ran at 1/4 the ALU clock? Then the same instruction would apply to threads 0-7 of warp A, threads 8-15 of warp B, threads 16-23 of warp C, and threads 24-31 of warp D. Effectively SIMWMD, a form of coarse dynamic warp formation.
That's a nice idea.

A long time ago (after G80, before R600) I mused on the idea of completely blind SIMDs (no concept of program counter). They are given operands and an instruction and Bob's your uncle - the operand collector and the store units do the heavy lifting. If the SIMD is 8-wide there could be 8 different program counters represented there. Some VS, some PS, some GS kernel. An ADD is an ADD, it really doesn't matter what the operands are and since a vector instruction set is "RISC", the chances are pretty high that most instructions you're running can be amalgamated even if they're from different programs. The chances are even higher if you have an architecture like NVidia's with no built-in DP3/4 and the like, i.e. the instruction set is reduced further by the "scalar" scheduling of instructions.

Of course NVidia's built a fancy operand collector and scoreboarder for instruction-issue. But currently that's program-counter driven. If the first stage, the dependency analyser, remained program-counter aware, but the second stage, the operand-collector/instruction-issuer, was PC-blind, then you could have a lot of fun. Bloody expensive, but you know, fun if you're designing one. Much more radical than any kind of dynamic warp formation. Dunno if it would actually work or whether it'd be efficient.
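
Just to make the musing above concrete, here's a toy host-side model of the "PC-blind" idea (nothing to do with any real NVIDIA scheduler; the programs and opcodes are invented): each of 8 lanes carries its own program counter into its own kernel, and the issuer gangs together whichever lanes happen to want the same opcode next, regardless of which program it came from.

```
#include <cstdio>

// Toy model of PC-blind issue: 8 lanes, each with a private program counter.
// The issuer picks the next opcode wanted by the first still-busy lane and
// co-issues every lane whose own next opcode matches, whatever its program.
enum Op { ADD, MUL, DONE };

const Op programA[] = { ADD, MUL, ADD, DONE };   // e.g. a VS kernel
const Op programB[] = { MUL, MUL, ADD, DONE };   // e.g. a PS kernel

struct Lane { const Op* prog; int pc; };

int main()
{
    Lane lanes[8];
    for (int i = 0; i < 8; ++i) {
        lanes[i].prog = (i < 4) ? programA : programB;
        lanes[i].pc   = 0;
    }

    for (int cycle = 0; cycle < 8; ++cycle) {
        // Find a lane that still has work; its next opcode is what gets issued.
        int leader = -1;
        for (int i = 0; i < 8; ++i)
            if (lanes[i].prog[lanes[i].pc] != DONE) { leader = i; break; }
        if (leader < 0) break;

        Op issued = lanes[leader].prog[lanes[leader].pc];
        printf("cycle %d: issue %s to lanes", cycle, issued == ADD ? "ADD" : "MUL");

        // Every lane whose own PC points at the same opcode joins this issue.
        for (int i = 0; i < 8; ++i) {
            if (lanes[i].prog[lanes[i].pc] == issued) {
                printf(" %d", i);
                lanes[i].pc++;          // that lane advances its private PC
            }
        }
        printf("\n");
    }
    return 0;
}
```

The two 3-instruction programs complete in 5 issues instead of 6, because the MUL from the VS-style kernel and the MUL from the PS-style kernel get amalgamated; that's the whole appeal, and also why the operand collection cost would be the scary part.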

I still find it a bit surprising how little research/progress there is on the topic of control flow divergence in SIMD architectures. It's almost as if there was a collective sigh upon reaching predication, "OK, we can do that, after that it's just way too difficult".
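
For anyone who wants to see what that "stop at predication" state of affairs looks like from the programmer's side, a tiny CUDA kernel with a divergent branch (names are just illustrative): within one 32-thread warp, both sides of the if/else are executed in turn with the inactive lanes masked off, so the divergent section costs roughly the sum of both paths.

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void divergent(float* out)
{
    int i = threadIdx.x;

    // Lanes of the same 32-thread warp disagree on this condition, so the
    // hardware runs the 'if' side with the odd lanes masked off, then the
    // 'else' side with the even lanes masked off: both paths are paid for.
    if ((i & 1) == 0)
        out[i] = expf(float(i));     // even lanes
    else
        out[i] = logf(float(i));     // odd lanes

    // Had the branch followed warp granularity (e.g. 'if (i < 32)'),
    // there would be no divergence and each warp would run only one path.
}

int main()
{
    const int n = 64;
    float* d;
    cudaMalloc(&d, n * sizeof(float));
    divergent<<<1, n>>>(d);

    float h[n];
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("out[2] = %f, out[3] = %f\n", h[2], h[3]);
    cudaFree(d);
    return 0;
}
```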

Jawed
 
Woah, didn't know Google had a dedicated patent rummager :D

[0067] Each processing engine 402 also has access, via a crossbar switch 405, to a shared register file 406 that is shared among all of the processing engines 402 in core 310. Shared register file 406 may be as large as desired, and in some embodiments, any processing engine 402 can read to or write from any location in shared register file 406. In addition to shared register file 406, some embodiments also provide an on-chip shared memory 408, which may be implemented, e.g., as a conventional RAM. On-chip memory 408 is advantageously used to store data that is expected to be used in multiple threads, such as coefficients of attribute equations, which are usable in pixel shader programs, and/or other program data, such as results produced by executing general-purpose computing program instructions. In some embodiments, processing engines 402 may also have access to additional off-chip shared memory (not shown), which might be located, e.g., within graphics memory 124 of FIG. 1.
I'm struggling to disentangle 406 and 408, to be honest :???: The next paragraph says "local register file 406." Paragraph 77 says "Core interface 308 allocates sufficient space for an input buffer (e.g., in shared register file 406 or local register file 404) for each processing engine 402 to execute one vertex thread, then loads the vertex data."

My hypothesis has been that shared memory is where the attribute equations live. This patent seems a bit confused and seems to be referring to G80 (specifically mentioning 24 threads per multiprocessor), so ...

Jawed
 
A long time ago (after G80, before R600) I mused on the idea of completely blind SIMDs (no concept of program counter). They are given operands and an instruction and Bob's your uncle - the operand collector and the store units do the heavy lifting. If the SIMD is 8-wide there could be 8 different program counters represented there. Some VS, some PS, some GS kernel. An ADD is an ADD, it really doesn't matter what the operands are and since a vector instruction set is "RISC", the chances are pretty high that most instructions you're running can be amalgamated even if they're from different programs. The chances are even higher if you have an architecture like NVidia's with no built-in DP3/4 and the like, i.e. the instruction set is reduced further by the "scalar" scheduling of instructions.

That's reducing the SIMDs down to bare ALUs, then making the rest of the hardware track many times more program counters and removing forwarding and pipeline registers.
 