What will it get destroyed by in the server market? Vega is no faster than a Quadro GP100, which has higher memory bandwidth and comparable FP32/FP16 resources (but no DP4A). You are assuming it will have half the hardware idling; we still don't have any concrete information on mixed tensor + traditional pipeline usage. IIRC GV100 is 815mm^2? Vega 10 was estimated to be around 530. A ~54% larger die, something like 8x the FP64 rate, and 5x higher throughput in DL. Compares quite favorably to me.
You have a quote for Vega's tensor FLOPS? Not the FP16 figures, but the throughput when using all the hardware in a pipelined fashion for tensor operations, on an architecture that hasn't been detailed yet? Right now you seem to be comparing apples and oranges.
It is starting to seem like you are deliberately ignoring Nvidia's very clear comments regarding this: they are entirely separate. Tensor cores do not use the existing FP32 units, because it was explicitly stated that full-throughput FP32 saturates only half the dispatch capacity per cycle, and the remaining half can be used for all other instructions/units.
I'm not ignoring them; you just don't understand what they mean. The quotes you've provided say nothing about tensors. In fact, they said the INT32 pipeline was the other half of that dispatch capacity. Or probably dual-issued FP16 where Vec2s aren't required, which is exactly what I proposed for Vega a while ago, where the programmer didn't have to pack anything. It just seems silly to add limited FP32 cores to replace FP32 cores that already exist when, in all likelihood, they won't be running concurrently.
It is abundantly clear this is not so. You have been in such a rush to make these statements that you apparently skipped reading what limited details are available, and mistakenly believed this was doing so-called tensor products.
You'll have to explain this "abundantly clear" part, because the Nvidia statements run counter to what you've been saying. I'm not sure they say what you think they say. They explicitly state one thing, like FP32 and INT32 running concurrently, and you come up with something completely different.
The multiplication of two 4x4 matrices results in a third 4x4 matrix with 16 elements. Each element is built from 4 FMA operations, so 16 elements = 64 operations. The data in matrix C can be loaded into the accumulator from the beginning, and each FMA of the FP16 elements of matrices A and B is fed into that accumulator directly.
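For concreteness, the operation being described is D = A*B + C on 4x4 matrices. A minimal scalar sketch in C++, just to make the operation count explicit (float throughout for simplicity; in hardware A and B would be FP16, and this says nothing about Volta's actual datapath):

```cpp
// D = A*B + C for 4x4 matrices: 16 output elements, each built from
// 4 fused multiply-adds, so 64 FMAs per matrix operation.
void mma4x4(const float A[4][4], const float B[4][4],
            const float C[4][4], float D[4][4]) {
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            float acc = C[i][j];           // accumulator preloaded from C
            for (int k = 0; k < 4; ++k)
                acc += A[i][k] * B[k][j];  // one FMA per k
            D[i][j] = acc;
        }
    }
}
```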
So you're suggesting one cycle to load values into an accumulator, 4 to process the multiplication thanks to dependencies on the adder, one to write out the value of the accumulator, and then repeating that process? As the products are being accumulated across sequential matrices, it would seem far simpler to stay decomposed and add up the components once all the multiplications finish, completely avoiding the dependencies: one cycle per multiplication, as opposed to the 4-6 cycles you propose, or interleaving operations with more complicated data paths.
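A sketch of the decomposed scheduling being suggested here, viewed per output element (hypothetical illustration, not a claim about real hardware): all the multiplies are independent, and the reduction happens only after they complete, so no multiply ever stalls on the adder.

```cpp
// Per output element: form the four partial products with no
// inter-dependencies, then sum them in a separate reduction step.
float dot4_decomposed(const float a[4], const float b[4], float c) {
    float p[4];
    for (int k = 0; k < 4; ++k)
        p[k] = a[k] * b[k];  // independent multiplies, can all issue in parallel
    // reduction only after all multiplies complete
    return ((p[0] + p[1]) + (p[2] + p[3])) + c;
}
```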
I'm not certain how accumulating four operands (4 pairs of FP16 multiplies) in one cycle would work, as accumulation is usually a one-operand-at-a-time operation. It may well be pipelined; the question is how deep, though the wording suggests this is not the case. I'm also very curious how this is handled in terms of dispatch, as it clearly far exceeds the on-paper capacity of the AWS.
There is no such thing as a four-input adder. At best it's a series of dependent adders hidden in one really long clock cycle. The only equivalent of a multiple-input adder that comes to mind is quantum computing, or analog computation involving op-amps. I don't foresee either of those in Volta.
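To illustrate the point: a "four-input add" decomposes in practice into two-input adds, either a serial chain of three dependent adds or a balanced tree with only two dependent levels. A minimal sketch (both variants compute the same sum; only the dependency depth differs):

```cpp
// Serial chain: three dependent two-input adds.
float sum4_chain(float p0, float p1, float p2, float p3) {
    return ((p0 + p1) + p2) + p3;  // each add waits on the previous one
}

// Balanced tree: still two-input adders, but only two dependent levels.
float sum4_tree(float p0, float p1, float p2, float p3) {
    float s01 = p0 + p1;  // level 1, independent
    float s23 = p2 + p3;  // level 1, independent
    return s01 + s23;     // level 2
}
```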
How is that relevant to this discussion? We're talking about a pretty small amount of area.
Small relative to what, though? The entire chip, or the area dedicated to logic? It's relevant because it should provide a means to a more efficient chip.
At more than 1000mm^2, Fiji's interposer exceeded the reticle limits of today's lithography machines, so they needed a double exposure for the interposer. The core die did not. Why would GV100 be any different?
I've never seen anything about Fiji using a double exposure on the interposer. My understanding was that the interposer was as large as conventionally possible, which defined the chip dimensions. If that weren't the case, Fiji wouldn't have had to make any trade-offs.
But there are good arguments not to do it this way: additional power consumption in the non-tensor FP16 and FP32 cases, simplicity of the design, and ease of adding or removing a tensor core from an SM (or replacing it with an even faster integer equivalent for the inference versions).
What consumes extra power though? The non-tensor cores are there regardless, so you might as well make use of them. What's being proposed seems ridiculous to me: disabling FP32/FP16 cores so that more FP32/FP16 cores can be added. The whole concept of the tensor core, from my view, is that a single tensor operation is executed across all the blocks concurrently: use the FP16 units for the multiplications, forward the results to the FP32+INT cores for the adds/accumulation, then repeat. The only difference is that instead of running 16 threads across 16 hardware lanes, an entire matrix operation is completed in a single cycle using all of them, pipelining sequential operations across the blocks with specialized paths. That's why it seems ridiculous to me to replace hardware that already exists with more hardware.
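A toy two-stage model of the pipelining described above (entirely speculative; TensorPipe and its stage names are made up for illustration, not anything Nvidia has documented): the multiply stage stands in for the FP16 units, the accumulate stage for the existing FP32/INT adders, and with both stages kept busy on back-to-back operations, one result retires per cycle.

```cpp
#include <array>

// Hypothetical two-stage pipeline: stage 1 multiplies (the FP16 units),
// stage 2 accumulates (the existing FP32/INT adders). While stage 2 folds
// one element's products into the accumulator, stage 1 can already be
// multiplying the next element's operands.
struct TensorPipe {
    std::array<float, 4> products{};  // latch between the two stages
    float acc = 0.0f;

    // Stage 1: four independent multiplies for one output element.
    void multiply(const std::array<float, 4>& a,
                  const std::array<float, 4>& b) {
        for (int k = 0; k < 4; ++k)
            products[k] = a[k] * b[k];
    }

    // Stage 2: reduce the latched products into the accumulator.
    void accumulate() {
        acc += (products[0] + products[1]) + (products[2] + products[3]);
    }
};
```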