Nvidia Pascal Speculation Thread

If everything can be done through simulations, then why are development kits created at all?
DevKits are created to develop applications on top of a chip. Simulations are used to verify the core features of a chip. They have nothing to do with each other and they target a completely different audience. If you can abstract the core new features behind an API and package it as a (large) performance improvement of the same thing, then a devkit with a previous chip is totally fine. You unblock software developers who will need years before they have something production ready. You unblock the mechanical developers who are developing today the car that will be on the road five years from now.

It's not that a company like Nvidia couldn't afford to make another chip; it's that it would provide no benefit to them, not the benefits that you imagine there to be, and it would tie up core development resources that could be spent more effectively elsewhere. Icera was needed to break into the mobile phone market: a strategic investment for future growth that was later abandoned. Shit happens. Shield is a similar strategic investment. It remains to be seen whether it will pay off at some point, but it's a product that's supposed to sell in volume.

A 28nm Pascal would be a tactical move as part of a strategic push into automotive computing, but with high cost, low value, and no volume.
 
The players in the automotive industry are a bit slower than the general consumer side of things, because of regulations and the long time spans involved with vehicles.

Which is why you would want them to start as soon as possible.

nV has both areas covered, hardware and software. The only thing the automotive companies need to worry about is the interface to that hardware and software. nV will not be making custom parts for different manufacturers. This is what makes simulations work well in a situation like this.

And this is what makes it hard for others to come into this market, because they don't have the software yet. Look at what has happened with the compute market: has Intel or AMD or Qualcomm or any other company made such an impact as nV? By the time these other companies saw what nV did, it was too late, and nV still has the upper hand because of its software.

I don't think nV are the only ones offering software. And it's still a full year away from being released anyway, whereas with the first Drive PX there was only a quarter between announcement and actual release. A lot of things can happen in a year, especially when you don't know where your competitors are. Lack of announcements doesn't mean nV doesn't have any competition out there waiting to be unleashed.

Speaking of the compute market, I don't think nV hesitated one bit spending $30M and then some in order to ensure they'd snatch that market. In fact, I remember reading figures of over $1B for the development of Fermi, which IIRC was said to be nearly twice as much as previous generations. And that on top of designing the architecture in such a way that it was liable to jeopardize their competitiveness in the gaming market. It was a huge risk, a huge bet that nV didn't bat an eye calling. So I fail to see a $30M investment as an excess, considering the potential returns. It's obvious we have a different view on this, and I don't think we are going to agree, so it's probably better to leave it at that.
 
DevKits are created to develop applications on top of a chip. Simulations are used to verify the core features of a chip. They have nothing to do with each other and the target a completely different audience. If you can abstract the core new features behind an API and package it as a (large) performance improvement of the same thing, then a devkit with a previous chip is totally fine.

And what if you can't? Can you honestly say that it is always possible?

I understand that it's not necessary, but I'm still having a hard time believing that it wouldn't help at all. And as I said, it's not only about offering that minor help; it shows commitment. I return to the Phoenix platform for that purpose. It was clear Nvidia wouldn't take the phone market by storm that late in the game no matter what. The phone market is huge, but nV's possibilities with or without the Phoenix platform weren't very different. Still, the platform helped bring T4i to market, and it was an endeavor nV deemed worth taking. Even a 3-6 month difference in the automotive market right now could mark the difference between being just one small player holding 10% of it or being the dominant one. The "pot" in this case is much, much bigger, and hence so are the bets you should be calling.

Anyway, and remember we're speaking hypothetically, what do you think of the possibility of actually releasing a 28nm Pascal chip early this year? There was only about a one-month window for the 750 Ti between its existence being known and its launch.
 
It appears there was a desire to use that platform to help facilitate sales of the silicon to partners.
The test chip scenario involves silicon that is never sold, even though, should it exist, there doesn't seem to be a reason why it couldn't be an improved product for sale.

I don't see how this is so different, except that it's the other way around and that the potential reward is much higher this time, because nV's opportunities in the phone market were already almost nil by then. A never-sold platform to help sell silicon, versus never-sold silicon to help sell a platform. Silicon is more expensive, but the potential reward is also much higher. And as you say, maybe the silicon could be sold anyway.
 
And what if you can't? Can you honestly say that it is always possible?
You're welcome to discuss general hypotheticals somewhere else, but for this particular case, there is a very high likelihood that you can.

Nobody is expecting Nvidia to release some ground breaking new deep learning invention. Just some architectural improvements that will improve speed on some workloads.
 
You're welcome to discuss general hypotheticals somewhere else, but for this particular case, there is a very high likelihood that you can.

Nobody is expecting Nvidia to release some ground breaking new deep learning invention. Just some architectural improvements that will improve speed on some workloads.

No one is speaking about revolutionary, but 3 ops out of something that's only supposed to be able to execute 2xFP16 or 2xINT8? How do you figure that is performed? And could it be satisfactorily emulated in a way that performance on the final product would be predictable for all or most use cases?

For instance, and it's unrelated, but can you really emulate FMAD instructions in silicon that is only capable of MUL and ADD, in a way that would be completely useful for the programmer? It's my understanding that the simulation would consume significantly more bandwidth, aside from requiring twice as many cycles to execute. And I don't think it would be in a totally predictable way in all or most cases.
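For what it's worth, here is a minimal sketch (plain C, nothing vendor specific) of the bit-compatibility side of that question: a fused multiply-add rounds once, while a separate MUL followed by an ADD rounds twice, so the two paths can differ in the last bit even though both are "correct". The inputs are contrived purely to expose the difference.

```c
/* Sketch: why MUL + ADD is not always bit-identical to a fused multiply-add.
 * fmaf() rounds once; a*b + c rounds the product and then the sum.
 * Build with -ffp-contract=off so the compiler doesn't fuse the
 * "separate" expression on its own. */
#include <math.h>
#include <stdio.h>

int main(void) {
    float a = 1.0f + 0x1p-12f;
    float b = 1.0f + 0x1p-12f;
    float c = -1.0f;

    float separate = a * b + c;     /* product rounded to float, then the sum */
    float fused    = fmaf(a, b, c); /* exact a*b + c, rounded once            */

    printf("separate: %.9e\n", (double)separate); /* 4.882812500e-04 */
    printf("fused:    %.9e\n", (double)fused);    /* 4.883408546e-04 */
    return 0;
}
```

Whether that last-bit discrepancy actually matters for the workload is of course a separate question, which is exactly what the reply below gets at.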
 
No one is speaking about revolutionary, but 3 ops out of something that's only supposed to be able to execute 2xFP16 or 2xINT8? How do you figure that is performed? And could it be satisfactorily emulated in a way that performance on the final product would be predictable for all or most use cases?
It. Doesn't. Matter. It's still just a performance optimization.

And if it's emulated, it doesn't even have to be bit compatible, because, for deep learning, it doesn't matter. That's why you can use FP16 in the first place.
 
No one is speaking about revolutionary, but 3 ops out of something that's only supposed to be able to execute 2xFP16 or 2xINT8? How do you figure that is performed? And could it be satisfactorily emulated in a way that performance on the final product would be predictable for all or most use cases?
If you drop back to a design that has been shoehorned into 28nm rather than 14/16nm, we're likely already talking about a performance regression of 2x or more, and if particular low-level differences crop up between different implementations of an architecture at the same node, let alone wholly different ones, there's no guarantee that the whole range of outcomes is going to be applicable.

As far as development ahead of the hardware, this has worked for the consoles for a number of generations despite some rather glaring architectural differences.
If this is merely a development seeding project, there's a raft of non-process related workarounds that can give the necessary compensation for missing hardware.
Can't get a native INT8 instruction and lose by a factor of 4? This isn't a product, so just brute-force it with extra hardware in a tower case. The code doesn't care.

For instance, and it's unrelated, but can you really emulate FMAD instructions in silicon that is only capable of MUL and ADD, in a way that would be completely useful for the programmer?
Turing-complete is what it is, so generally yes. And if the use is for validation, what more does the programmer need?
There are types of validation that need very low-level details, but this is running into an area where a separate physical implementation at 28nm runs the risk of not carrying forward to the target node.
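To put the "brute-force it, the code doesn't care" point in concrete terms, here is a sketch of a 4-way INT8 dot product with 32-bit accumulate (the sort of specialized deep-learning instruction being speculated about) emulated with plain scalar C. The function name dp4a_emulated and the packed layout are just illustrative, not an existing API; the point is that the result is bit-identical to what a native instruction would give, only slower, which is irrelevant for functional validation.

```c
/* Emulation sketch: packed 4-way signed INT8 dot product, accumulated
 * into a 32-bit integer, done with ordinary scalar arithmetic. */
#include <stdint.h>
#include <stdio.h>

static int32_t dp4a_emulated(uint32_t a_packed, uint32_t b_packed, int32_t acc) {
    for (int i = 0; i < 4; ++i) {
        int8_t a = (int8_t)(a_packed >> (8 * i)); /* unpack signed byte i */
        int8_t b = (int8_t)(b_packed >> (8 * i));
        acc += (int32_t)a * (int32_t)b;
    }
    return acc;
}

int main(void) {
    uint32_t a = 0xFC030201u; /* bytes (1, 2, 3, -4), little-endian packing */
    uint32_t b = 0x08070605u; /* bytes (5, 6, 7, 8)                         */
    printf("%d\n", dp4a_emulated(a, b, 0)); /* 1*5 + 2*6 + 3*7 + (-4)*8 = 6 */
    return 0;
}
```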
 
Awesome!!! I didn't know JEDEC specs were downloadable for free.

I compared the signal description table of the GDDR5X (Table 57) with the signal description table of GDDR5 (Table 65). Pins are identical, except for the former allowing for a few more address pins (supporting large memories), and GDDR5X dropping the CS_n pin. So it's trivial to make GPU pinouts backward compatible.

The difference between the 190 and 170 ball package is indeed due to increased VDD/VSS pairs.

Also: GDDR5X is QDR instead of DDR.
Thanks for doing the work! Good news for G5X then, as it might actually be used. :) Only the board-level design needs to be redone.
 
Thanks silent_guy, good to know! :)

So it's QDR... Does anyone know the burst length required to get optimal efficiency for GDDR5 vs 5X then - is it 64B vs 128B? or 256B?!
 
Thanks silent_guy, good to know! :)

So it's QDR... Does anyone know the burst length required to get optimal efficiency for GDDR5 vs 5X then - is it 64B vs 128B? or 256B?!
In QDR, the prefetch size is 512 bits and the burst length is 16. GDDR5X allows for seamless switching between QDR and DDR, but I don't know if that's useful in practice.
An interesting new feature is pseudo channels, where you can select 2 columns of 256 bits each from the same bank/row combo instead of a monolithic 512 bits. In DDR that becomes 2x 128 bits.
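If it helps with the 64B/128B question above, the raw arithmetic looks like this, assuming the usual 32-bit-wide device interface (GDDR5 is an 8n-prefetch design, GDDR5X in QDR is 16n):

```c
/* Access granularity per 32-bit GDDR5/GDDR5X device, from burst length
 * times interface width. The 32-bit interface width is the standard
 * per-device figure, not something specific to any GPU. */
#include <stdio.h>

int main(void) {
    const int iface_bits = 32;
    const int g5_burst   = 8;   /* GDDR5: 8n prefetch (DDR)    */
    const int g5x_burst  = 16;  /* GDDR5X in QDR: 16n prefetch */

    printf("GDDR5  access: %d bits = %d bytes\n",
           g5_burst * iface_bits, g5_burst * iface_bits / 8);        /* 256 b = 32 B */
    printf("GDDR5X access: %d bits = %d bytes\n",
           g5x_burst * iface_bits, g5x_burst * iface_bits / 8);      /* 512 b = 64 B */
    printf("GDDR5X pseudo channels: 2 x %d bits = 2 x %d bytes\n",
           g5x_burst * iface_bits / 2, g5x_burst * iface_bits / 16); /* 2 x 256 b = 2 x 32 B */
    return 0;
}
```

So per 32-bit channel it's 32 bytes for GDDR5 versus 64 bytes for GDDR5X in QDR, with pseudo channels bringing the granularity back down to 2 x 32 bytes.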
 
Is there something about Pascal that would make it uncompetitive with extant 28nm products? If it's an improved architecture ready for 28nm deployment, why couldn't it be made into a mildly better product to replace some of the SKUs out there?
Just wondering, wasn't the "Maxwell-we-got" really the "Paxwell-we-had-to-have" because 20nm got scrapped? Hence juicy stuff getting stripped out, leaving a SP gamer tweak instead of HPC innovation?

Meanwhile there must have been some full-fat contracts in place for automotive/robotics... and robust discussions. Maybe some branch project? How much might the first self-driving robo-cop be worth?
 
Just wondering, wasn't the "Maxwell-we-got" really the "Paxwell-we-had-to-have" because 20nm got scrapped? Hence juicy stuff getting stripped out, leaving a SP gamer tweak instead of HPC innovation?
I have seen some discussion of Maxwell being a result of doing an additional architectural shift at 28nm with planned 20nm ideas feeding into it.
That situation is still gated by what Nvidia decided was worth manufacturing, which in this case is also what Nvidia thought was worth selling; the latter seems to follow given the cost of committing to the former.

The scenario I asked that question about has the "good enough to build" question already answered, which would then raise the question why the usually strong link between building and selling wouldn't apply.

That doesn't mean that there have not been projects that do manufacture chips that are not productized. Sometimes, it's a research project with limited presence outside of the project itself. There are examples where it happens for other reasons, but usually because problems or outside factors force the abandonment of making it a product rather than going through that much of the process with no intent to sell it.
 
No one is speaking about revolutionary, but 3 ops out of something that's only supposed to be able to execute 2xFP16 or 2xINT8? How do you figure that is performed? And could it be satisfactorily emulated in a way that performance on the final product would be predictable for all or most use cases?

For instance, and it's unrelated, but can you really emulate FMAD instructions in silicon that is only capable of MUL and ADD, in a way that would be completely useful for the programmer? It's my understanding that the simulation would consume significantly more bandwidth, aside from requiring twice as many cycles to execute. And I don't think it would be in a totally predictable way in all or most cases.
Occam's Razor says you're probably just misinterpreting PowerPoint slides. Less likely that magic is happening.
 
Occam's Razor says you're probably just misinterpreting PowerPoint slides. Less likely that magic is happening.

I didn't misinterpret anything myself.

http://www.anandtech.com/show/9903/nvidia-announces-drive-px-2-pascal-power-for-selfdriving-cars

"Curiously, NVIDIA also used the event to introduce a new unit of measurement – the Deep Learning Tera-Op, or DL TOPS – which at 24 is an unusual 3x higher than PX 2’s FP32 performance. Based on everything disclosed by NVIDIA about Pascal so far, we don’t have any reason to believe FP16 performance is more than 2x Pascal’s FP32 performance. So where the extra performance comes from is a mystery at the moment. NVIDIA quoted this and not FP16 FLOPS, so it may include a special case operation (ala the Fused Multiply-Add), or even including the performance of the Denver CPU cores."
 
Following the Source link from the Anandtech article, Nvidia says this:

Its two next-generation Tegra® processors plus two next-generation discrete GPUs, based on the Pascal™ architecture, deliver up to 24 trillion deep learning operations per second, which are specialized instructions that accelerate the math used in deep learning network inference.

If you have a theory* about what "specialized instruction" means, I'd really like to hear it.

* If you actually know what it means, I'd appreciate it even more.
 
Will gamers even at 4K really see any noticeable improvement in performance or visuals of games using HBM2 over the upcoming GDDR5X?
Could Nvidia perhaps implement forms of AA that use this gargantuan amount of memory bandwidth if developers don't take advantage of it?

We already are in a console era where big publisher games are designed around the limited memory bandwidth of consoles.

Assuming there is a wide memory interface in the GP104 and GDDR5X is used like the shipping manifest suggested, plus color compression, shouldn't that be more than enough?
http://www.tweaktown.com/news/49578...l-gp104-gpu-spotted-feature-gddr5x/index.html

http://www.techpowerup.com/forums/threads/article-just-how-important-is-gpu-memory-bandwidth.209053/
[Attached charts: estimated memory bandwidth usage across games, from the techPowerUp article above]
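For a rough sense of scale, here is the peak-bandwidth arithmetic with hypothetical round numbers (a 256-bit bus, 7 Gbps GDDR5 versus 10 Gbps GDDR5X; none of these are confirmed GP104 figures), before any gain from color compression:

```c
/* Peak bandwidth = bus width (bits) x per-pin data rate (Gbps) / 8.
 * Bus width and data rates below are assumptions for illustration only. */
#include <stdio.h>

static double bandwidth_gbs(int bus_bits, double gbps_per_pin) {
    return bus_bits * gbps_per_pin / 8.0; /* GB/s */
}

int main(void) {
    const int bus = 256; /* assumed bus width */
    printf("GDDR5  @  7 Gbps: %.0f GB/s\n", bandwidth_gbs(bus, 7.0));  /* 224 */
    printf("GDDR5X @ 10 Gbps: %.0f GB/s\n", bandwidth_gbs(bus, 10.0)); /* 320 */
    /* Color compression raises the effective figure further, but by a
     * workload-dependent factor rather than a fixed multiplier. */
    return 0;
}
```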
 
If you have a theory* about what "specialized instruction" means, I'd really like to hear it.
Voxilla already posted a link to an EETimes article that says that it can do 8 bit operations. Isn't that specialized enough?
Google "deep neural net 8 bit" and you find plenty of article claiming that 8 bit is enough for deep learning. What more do you want to hear?
 