NVIDIA Maxwell Speculation Thread

If Maxwell has a timeline that is not served by the possible timing of 20nm, an initial wave on 28nm could be a way to reduce the risk of further schedule slips.

Exactly; besides, there's not a single reason why NVIDIA wouldn't also keep the Kepler development strategy for Maxwell: mainstream/performance chips first, high end last or amongst the last.

Assuming NV releases GK110 SKUs for the desktop in early 2013, it will be give or take one year after the initial Kepler desktop release of the GTX 680. For Maxwell I can't imagine how they'd manage to cram the top dog into 28nm without HUGE risks for such a complicated design. Keeping a similar trend instead leaves enough time for 20nm to mature enough for a desktop high end chip after some timeframe, and at the same time allows a direct shrink of all the Maxwell chips that had been manufactured on 28nm.

For performance and lower end Maxwell parts, 28nm is a viable alternative to release hw ON TIME; the downside is that die area is going to be quite a bit larger than under 20nm later on. Considering what shape 20nm is supposed to be in, I wouldn't suggest that manufacturing costs under 28nm would be noticeably higher despite the larger die area.

Besides, if I'm not reading that one wrong: http://www.xbitlabs.com/news/other/...s_Technologies_for_Longer_Period_of_Time.html

....who's willing to bet that if above is true, it will be an exclusive design decision for NVIDIA?
 

Now you know why they haven't done it; it was simply impossible for Nvidia to bring GK110 up on the first ramp of 28nm in early 2012... completely impossible. They ended up with a Tesla card that sits under 4 TFLOPS SP (with a 1/3 rate for DP), at 732 MHz and with one SMX disabled... something even harder to bring to market than Fermi was. (K20X has 14 SMX, K20 has 13 SMX; the architecture has 15 SMX, so at least one is disabled, and Titan has received only K20X, maybe mixed with some K20.)

(It's a beast, don't misunderstand me, but how would they have been able to adapt this to the consumer card generation? I don't think they could have done it.)

Nvidia has put a lot of effort into making K20(X) usable in the high performance computing market, and it really is a good, shiny achievement, but I can't imagine them having done the same in early 2012, even if the plan may previously have been to use it in the consumer market.

I don't know what to think about this xbitlabs article. First, there are many differences between GPUs and CPUs, and xbitlabs mixes both of them; that's a dangerous approach, as it can lead you to conclusions (or disillusions) far removed from reality. Second, AMD/ATI have always been the first to use any new process and have always done it successfully on the GPU side, because their way of approaching it is clear and really accurate. They are very realistic on this point.
 
Ailuros said:
For performance and lower end Maxwell parts 28nm is a viable alternative to release hw ON TIME and the downside is that die area is going to be quite a bit larger than under 20nm later on.
Eh, I don't see it... why would they care about being "on time" if their only competition doesn't?
 
From a consumer point of view: what is the benefit of an additional CPU on the GPU die? We've heard complaints in the past that DirectX adds a lot of inefficiencies compared to consoles, where you program to the metal. Is an on-die CPU something that could help with this, e.g. take over some work from the main CPU?

Or is it only useful for GPGPU type applications (PhysX?)
 

Not with an ARM-based one, because software basically has to be coded so that this CPU can use it (that will be the role of CUDA, the API and other libraries). If a piece of software fully takes this into account at the source level it will work, but if not, you will need to rework the code before the ARM core can make use of it. So I believe the impact could be really limited, unless Nvidia manages to put developers in a "no choice" situation.

If you are using Fortran or C++, the ARM core can't simply work with that; the instructions you hand it have to be encoded in a form it understands, so you will need to retarget them (through CUDA, or whatever Nvidia will call it for this) before it can work on them, and it only really works well if it can actually process the data. This is what the HSA Foundation is doing: creating a new common language that can be used by ARM and x86 CPUs and by ARM-based GPUs (Mali) and x86-side GPUs. What will the Maxwell ARM co-processor do beyond using a language created by Nvidia alone? I don't know. It is too proprietary, in my view, to be useful outside of Nvidia, and even then they lose a lot of efficiency, since you need to re-encode those libraries again to be used by it (the bigger problem of CUDA). This goes in exactly the opposite direction to where the semiconductor industry (ARM included) is heading. All the time you spend with these libraries and APIs to get results is, to a large extent, lost again because you have to prepare the data for it, and if you then need to check the results the work takes even longer.
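To make the recompilation point concrete, here's a rough CUDA sketch (purely illustrative, not Nvidia's actual Maxwell toolchain): the device kernel compiles to PTX/SASS and doesn't care what ISA the host runs, while the host-side calls around it are the part that would have to be rebuilt for an ARM core instead of an x86 one.

Code:
#include <cstdio>
#include <cuda_runtime.h>

// Device code: compiled for the GPU, indifferent to the host ISA.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Host code: this is the part compiled for x86 today and the part that would
// be recompiled for an on-die ARM core; the kernel above stays as-is.
int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}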
 
For performance and lower end Maxwell parts, 28nm is a viable alternative to release hw ON TIME; the downside is that die area is going to be quite a bit larger than under 20nm later on. Considering what shape 20nm is supposed to be in, I wouldn't suggest that manufacturing costs under 28nm would be noticeably higher despite the larger die area.
I'm recalling what AMD did with the 32 nm versions of Northern Islands:

The 32nm predecessor of Barts was among the earlier projects to be sent to 40nm. This was due to the fact that before 32nm was even canceled, TSMC’s pricing was going to make 32nm more expensive per transistor than 40nm, a problem for a mid-range part where AMD has specific margins they’d like to hit. Had Barts been made on the 32nm process as projected, it would have been more expensive to make than on the 40nm process, even though the 32nm version would be smaller. Thus 32nm was uneconomical for gaming GPUs, and Barts was moved to the 40nm process.
I'm wondering if NVIDIA is doing something similar with Maxwell, for reasons including pricing and launch timeframes. In that case I'm not sure the "big" Maxwell would show up at the front of NVIDIA's 20 nm releases (unlike AMD's situation: if 32 nm hadn't been canceled there would probably have been a 32 nm Cayman predecessor and 40 nm for everything else for some time).
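Just to illustrate the per-transistor argument with some made-up numbers (every figure below is hypothetical, only there to show how a smaller die can still cost more per transistor when the wafer price and early yield move the wrong way):

Code:
#include <cstdio>

// All numbers are invented for illustration; none are real foundry figures.
int main() {
    // Hypothetical mature node: cheaper wafer, better yield, bigger die.
    double mature_wafer_cost = 5000.0, mature_dies_per_wafer = 180.0, mature_yield = 0.80;
    // Hypothetical new node: pricier wafer, worse early yield, smaller die (more dies/wafer).
    double new_wafer_cost = 9000.0, new_dies_per_wafer = 260.0, new_yield = 0.55;
    double transistors_per_die = 2.0e9;   // same design on both nodes

    double mature_cost = mature_wafer_cost / (mature_dies_per_wafer * mature_yield) / transistors_per_die;
    double new_cost    = new_wafer_cost    / (new_dies_per_wafer    * new_yield)    / transistors_per_die;
    // With these inputs the smaller die on the new node ends up more expensive per transistor.
    printf("mature node: %.3g $/transistor, new node: %.3g $/transistor\n", mature_cost, new_cost);
    return 0;
}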

(K20X has 14 SMX, K20 has 13 SMX; the architecture has 15 SMX, so at least one is disabled, and Titan has received only K20X, maybe mixed with some K20.)
Well, each compute node (1 CPU + 1 GPU) in Titan has 32 GB + 6 GB memory so every GPU should be K20X (or I suppose some other GK110 variant with 6 GB memory?).
 
With the instruction set ARM has right now, a complete rewrite will be needed... you will not be able to use it as the primary processor on its own without losing a lot of efficiency versus using an x86 processor to do the work.
It's very likely that the ARM cores on Nvidia's chip are weaker than an x86.
For code that is architected to run primarily on the GPU, existing GPU code that relies on a distant host processor should still be able to run, because it is already insulated by a runtime and communicates over PCIe.
Code structured to minimize the back and forth between the GPU and CPU may not notice much of a difference, and the CUDA code should run as-is.
If the ARM cores can be tasked to run the board, they could do what Larrabee can do and run enough of the driver and runtime to allow the host CPU to be free to do its own thing, or possibly nearly power down and inflate the Green 500 ranking.

Code meant to be able to run in a hybrid mode, which Larrabee promises at a lower level and OpenCL promises as an abstraction, would need a rewrite of the CPU side for full efficiency. Recompilation or the runtime should get it to the "it runs" stage, though.

Even if the CPUs are weaker, and it is very (very very) unlikely a first attempt at a high-ish performance core on a weaker process is going to top an Intel core, the PCIe bus wouldn't be in the way. That has a lot of potential for cases bottlenecked by the latency and disparate memory pools currently inflicted with the host CPU/slave GPU arrangement.
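For what it's worth, the pattern I have in mind is the usual host-orchestrated loop below (a generic sketch, the kernel and the convergence test are made up): each iteration bounces a tiny result back over PCIe before the host can decide to launch the next step, and that's exactly the round trip an on-package CPU would shorten.

Code:
#include <cuda_runtime.h>

// Placeholder update step; stands in for whatever the real solver does.
__global__ void iterate(float *state, float *residual, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) state[i] *= 0.5f;
    if (i == 0) *residual *= 0.5f;   // toy convergence metric
}

void solve(float *d_state, float *d_residual, int n) {
    float h_residual = 1.0f;
    while (h_residual > 1e-6f) {
        iterate<<<(n + 255) / 256, 256>>>(d_state, d_residual, n);
        // This copy is the PCIe round trip: the host must see the residual
        // before it can decide whether another iteration is needed.
        cudaMemcpy(&h_residual, d_residual, sizeof(float), cudaMemcpyDeviceToHost);
    }
}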

Going this route in the future seems plausible since Nvidia does know how to make fully functional host systems. The problem is that they're Tegra, and Nvidia has no history of really pushing hardware that sits around the CPU cores and interfaces with the rest of the system. If they do, however, they could remove a chunk of power consumption from the system, and not let Larrabee hold a decent trump card in its ability to run almost independently.
Further in the future, Nvidia needs to counter a possible socketed Larrabee-only system.

Whether Nvidia succeeds in making Tesla capable of running on its own in strong enough fashion is uncertain. They would have the ability to control the whole compute platform if they do, and would have the option of customizing both the ARM and GPU components.
I'm more curious about the non-compute parts of the system they need to convincingly scale up.
 
I think AMD and Nvidia are very close to each other this generation, with no major obvious flaws. New silicon on the same process can only give incremental changes, IMO. I wonder if it's not better waiting for 20nm for the next big push? Maybe just do some clock speed bump and call it a day.

Well I'm hoping that nVidia has some room to grow on 28nm. 780 GTX is hopefully something like a graphics focused 12 SMX chip / 384bit memory bus. 770 GTX could be 11 SMX and 320bit (if that is possible), with clocks perhaps a tiny bit lower than what's on the 600 series. That would be a nice upgrade, but still not as monstrous as the K20. Just a small clock speed bump and still with a 256 bit bus would imo be a tad too little.
 
Form a consumer point of view: what is the benefit of an additional CPU on the GPU die? We've heard complaints in the past that DirectX adds a lot of inefficiencies compared to consoles, where you program to the metal. Is an on-die CPU something that could help with this? E.g. Take over some work from the main CPU?

Or is it only useful for GPGPU type applications (PhysX?)

If some of the driver's management layer dealing with resource management or kernel invocation can be moved to the GPU, it takes out the round trip on the PCIe bus.
Discrete boards wouldn't be able to keep a coherent view of memory unless PCIe makes some changes.
If the system setup and APIs can be modified to allow programs to run on the GPU side, then the ideal case is that the CPU core can leverage tight integration with the GPU and the units can cooperatively work on the elements of the workload they are best at.
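There's already a small step in this direction on GK110: CUDA's dynamic parallelism lets a kernel enqueue another kernel from device code (compute capability 3.5, built with -rdc=true), so at least that launch decision never crosses the PCIe bus. A rough sketch with made-up kernel names:

Code:
// Requires a CC 3.5 GPU and compilation with -rdc=true (dynamic parallelism).
__global__ void child(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

__global__ void parent(float *data, int n) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        // Device-side launch: the decision and the enqueue stay on the GPU,
        // with no round trip to the host driver in between.
        child<<<(n + 255) / 256, 256>>>(data, n);
    }
}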

I could see some immediate ancillary benefits like making it possible for the GPU to run encoding software with the option to just run it with CPU settings by punting to the on-board CPU.
Another random thought is that if AMD and Nvidia can make this system work, they could potentially create a very FP-heavy host CPU that can just be bunged onto a discrete card and called a GPU.
That probably matters more to AMD, since they could save some money on developing new products.
 
Well I'm hoping that nVidia has some room to grow on 28nm. 780 GTX is hopefully something like a graphics focused 12 SMX chip / 384bit memory bus. 770 GTX could be 11 SMX and 320bit (if that is possible), with clocks perhaps a tiny bit lower than what's on the 600 series. That would be a nice upgrade, but still not as monstrous as the K20. Just a small clock speed bump and still with a 256 bit bus would imo be a tad too little.

Agreed with all that you said here regarding Kepler, but I do not think a completely new Kepler core is likely (as nice as it would be to have a beefed-up GK104). I think the 700 series flagship will be a 13 SMX, 384-bit GK110, followed by a 14-15 SMX GK110 chip with higher clocks 6 months later.
 


My expectations are a March/April 2013 release for 2 GK110 SKUs:

GTX 780: 14 SMX, 384-bit, 6 Gbps GDDR5, ~850 MHz + boost, ~50% faster than GTX 680, $699

GTX 770: 13 SMX, 320-bit, 6 Gbps GDDR5, ~800 MHz + boost, ~30% faster than GTX 680, $499


A full 15smx sku will probably never be released and that 15th unit is purely for enhanced yields. All this is IMO and probably wrong lol.
 
If some of the driver's management layer dealing with resource management or kernel invocation can be moved to the GPU, it takes out the round trip on the PCIe bus.
Discrete boards wouldn't be able to keep a coherent view of memory unless PCIe makes some changes.
If the system setup and APIs can be modified to allow programs to run on the GPU side, then the ideal case is that the CPU core can leverage tight integration with the GPU and the units can cooperatively work on the elements of the workload they are best at.

Are there hardware structures on current chips that could be (perhaps partly) replaced with code running on the general purpose cores - e.g. thread block assignment and/or command stream processing?

Anyway, based on some of the Echelon/Einstein videos with Dally, and since power seems to be the dominating design concern across pretty much all the market segments, I was wondering how much being able to use a single design from top to bottom factors into this. It seems to me that for high-end compute or mobile applications a general purpose core is either very useful/desirable or required. But in the discrete graphics card space and with a 2-year timeframe, it's not obviously a killer feature to me.

So is it possible that ripping the ARM cores out of the desktop incarnations of the Maxwell arch is simply more work than it's worth? Or perhaps that having them allows NV's software infrastructure to evolve incrementally/coherently in the direction advertised for Echelon? I am assuming here that they are not aiming for high-end Intel x86 performance, and that the architecture can be scaled such that the ARM cores are a relatively small percentage of the die/power budget.
 
Now you know why they haven't done it; it was simply impossible for Nvidia to bring GK110 up on the first ramp of 28nm in early 2012... completely impossible. They ended up with a Tesla card that sits under 4 TFLOPS SP (with a 1/3 rate for DP), at 732 MHz and with one SMX disabled... something even harder to bring to market than Fermi was. (K20X has 14 SMX, K20 has 13 SMX; the architecture has 15 SMX, so at least one is disabled, and Titan has received only K20X, maybe mixed with some K20.)

(It's a beast, don't misunderstand me, but how would they have been able to adapt this to the consumer card generation? I don't think they could have done it.)

It's my understanding that the entire Kepler family was planned that way, because they knew beforehand that things wouldn't be easy with such a big and complex die. Up to Fermi they always waited for the top dog to tape out and only after that worked on the smaller cores of the same family. When I suggested shortly after the Fermi release that changing that strategy would be an idea, many here in the forum said it was impossible. And no, as a layman I don't wake up in the morning with such funky ideas; I had heard it somewhere before going as far as to propose it as an idea.

I don't recall the exact order of tape-outs, but from what I recall GK107 came first, followed by GK104, and GK110 taped out only in early March 2012 afaik.

Nvidia has put a lot of effort into making K20(X) usable in the high performance computing market, and it really is a good, shiny achievement, but I can't imagine them having done the same in early 2012, even if the plan may previously have been to use it in the consumer market.

They might have planned initially to release GK110 for the desktop within 2012, but when they saw that there wasn't an absolute necessity for it, they saved the wafer runs for it and invested them elsewhere, going only for a limited GK110 production just to serve whatever HPC design wins they had. They certainly didn't lose any money with that approach, even more so since they are selling a 294mm2 GK104 die at high end SKU prices even as we speak.

I don't know what to think about this xbitlabs article. First, there are many differences between GPUs and CPUs, and xbitlabs mixes both of them; that's a dangerous approach, as it can lead you to conclusions (or disillusions) far removed from reality. Second, AMD/ATI have always been the first to use any new process and have always done it successfully on the GPU side, because their way of approaching it is clear and really accurate. They are very realistic on this point.

I like taking such "risks", especially since I typically sniff around as much as I can. To steal from one source is plagiarism; to steal from many is research. At least so they say. That said, am I alone with that gut feeling? You tell me: http://semiaccurate.com/forums/showpost.php?p=172998&postcount=22
 
My expectations are a March/April 2013 release for 2 GK110 SKUs:

GTX 780: 14 SMX, 384-bit, 6 Gbps GDDR5, ~850 MHz + boost, ~50% faster than GTX 680, $699

GTX 770: 13 SMX, 320-bit, 6 Gbps GDDR5, ~800 MHz + boost, ~30% faster than GTX 680, $499


A full 15smx sku will probably never be released and that 15th unit is purely for enhanced yields. All this is IMO and probably wrong lol.

Keep in mind GF110 had 4 SKUs (GTX 580, GTX 570, GTX 560 Ti 448, GTX 560 OEM), and GK104 has 4 SKUs (GTX 680, GTX 670, GTX 660 Ti, GTX 660 OEM), so it stands to reason that GK110 will also have 4 SKUs. I'm guessing two will have a 384-bit bus and two a 320-bit bus. I doubt they'll cut down to 256 bits with any GK110, as it would probably be outperformed by GK104 due to its higher clocked ROPs.
 
Are there hardware structures on current chips that could be (perhaps partly) replaced with code running on the general purpose cores - e.g. thread block assignment and/or command stream processing?
The command processor might be replaceable, although not much is discussed about what kind of architecture is used there anyway.
The queue management and work allocation sounds like it could be handled as well. On the other hand, it might be convenient for resource virtualization if the cores on the chip don't provide software direct access to that information.
A more ideal outcome would be if the chip could detect when divergence/tiny batch size is wrecking performance on the SIMDs and the warp can migrate to narrower hardware pipelines.
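As a sketch of what software-managed work allocation could look like (purely illustrative, persistent-threads style, not a claim about Maxwell's actual front end): blocks pull chunks of work off a global queue through an atomic counter instead of relying on the fixed-function work distributor.

Code:
#define CHUNK 256   // launch with blockDim.x == CHUNK

// 'next' must be zeroed before launch; 'in'/'out' hold num_chunks * CHUNK floats.
__global__ void worker(const float *in, float *out, int num_chunks, int *next) {
    __shared__ int chunk;
    while (true) {
        if (threadIdx.x == 0) chunk = atomicAdd(next, 1);  // the block grabs the next work item
        __syncthreads();                                    // broadcast 'chunk' to the whole block
        if (chunk >= num_chunks) break;                     // uniform exit: no divergence on the break
        int i = chunk * CHUNK + threadIdx.x;
        out[i] = in[i] * 2.0f;                              // stand-in for the real work
        __syncthreads();                                    // keep 'chunk' stable until everyone has used it
    }
}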

So is it possible that ripping the ARM cores out of the desktop incarnations of the Maxwell arch is simply more work than it's worth? Or perhaps that having them allows NV's software infrastructure to evolve incrementally/coherently in the direction advertised for Echelon? I am assuming here that they are not aiming for high-end Intel x86 performance, and that the architecture can be scaled such that the ARM cores are a relatively small percentage of the die/power budget.
That would depend on what Nvidia has actually done with the cores and their design. We don't have much to go on.
 
Keep in mind GF110 had 4 SKUs (GTX 580, GTX 570, GTX 560 Ti 448, GTX 560 OEM), and GK104 has 4 SKUs (GTX 680, GTX 670, GTX 660 Ti, GTX 660 OEM), so it stands to reason that GK110 will also have 4 SKUs. I'm guessing two will have a 384-bit bus and two a 320-bit bus. I doubt they'll cut down to 256 bits with any GK110, as it would probably be outperformed by GK104 due to its higher clocked ROPs.

But how would they position them? Assuming GK114=GTX680+10%, there is only 40% "left" to work with.

So GTX680+25%, +35% and +50%. Okay, 3 SKUs. 4 seems a bit much.
Maybe a 780 Ultra with 900 MHz, 15 SMX and a 270-300W TDP for $799? They certainly have 15 SMX dies; they will not throw them away, will they?
 

One of the four GTX 680 SKUs is OEM only. Same with GF110. I think it will be similar in this regard: 3 retail GK110 SKUs and 1 OEM SKU (that will likely trade performance with GK114 or be a tiny, tiny bit faster).
 
The command processor might be replaceable, although not much is discussed about what kind of architecture is used there anyway.
The queue management and work allocation sounds like it could be handled as well. On the other hand, it might be convenient for resource virtualization if the cores on the chip don't provide software direct access to that information.
A more ideal outcome would be if the chip could detect when divergence/tiny batch size is wrecking performance on the SIMDs and the warp can migrate to narrower hardware pipelines.


That would depend on what Nvidia has actually done with the cores and their design. We don't have much to go on.

The chip would have to migrate the warp/thread block to a MIMD core to salvage performance from a warp that was heavily divergent.
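For illustration, this is the kind of kernel I'd expect to be a migration candidate (made-up example): neighbouring threads in a warp pick different cases, so today's SIMT hardware executes the branches serially.

Code:
// Heavy per-thread divergence: adjacent threads take different switch cases,
// so the warp serializes; narrower or MIMD-style pipelines wouldn't care.
__global__ void divergent(const int *task_type, float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    switch (task_type[i] & 3) {
        case 0:  data[i] = data[i] * 2.0f;          break;
        case 1:  data[i] = sqrtf(data[i]);          break;
        case 2:  data[i] = 1.0f / (data[i] + 1.0f); break;
        default: data[i] = 0.0f;                    break;
    }
}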
 
So in the Einstein presentation it sounded like each core was MIMD, and that (some of?) the energy efficiency of SIMD was retained by having a mode where, if control flow is coherent, a core runs a single instruction over N data elements for N clocks (vs. on multiple ALUs in parallel). No idea if Maxwell goes this far or not... but it does sound like a large change and so strikes me as pretty unlikely.
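As a toy model of what I understood from the talk (host-side only, it just counts instruction fetches, and the whole mechanism is my reading of the presentation rather than anything confirmed): when all lanes agree on the next instruction it is fetched once and spread over N data elements across N clocks, and when they diverge each lane fetches for itself, MIMD style.

Code:
#include <cstdio>

// Count instruction fetches for one step of an N-lane group.
// Coherent: one fetch amortized over all lanes. Divergent: one fetch per lane.
int fetches_for_step(const int *lane_pc, int n_lanes) {
    for (int i = 1; i < n_lanes; ++i)
        if (lane_pc[i] != lane_pc[0]) return n_lanes;  // diverged: per-lane fetch (MIMD)
    return 1;                                          // coherent: single fetch, run over N clocks
}

int main() {
    int coherent_pcs[4]  = {100, 100, 100, 100};
    int divergent_pcs[4] = {100, 104, 108, 112};
    printf("coherent step:  %d fetch(es)\n", fetches_for_step(coherent_pcs, 4));
    printf("divergent step: %d fetch(es)\n", fetches_for_step(divergent_pcs, 4));
    return 0;
}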
 
Charlie's report on "Nvidia’s Maxwell process choice" is now completely available (it's been 30 days).

Main points:
  • Maxwell will be on "the big die strategy"
  • Maxwell on 28 nm would suggest that NVIDIA doesn't think they can make a large chip on 20 nm right away with good yields, rather than it being a deliberate engineering decision
  • Apple is going to TSMC for 20 nm, so they'll probably take up all of their initial 20 nm wafers. So in any case, Maxwell will have to be on 28 nm.
 