> G70 / G71 / and PS3's RSX = NV47

Although G71 could be called NV47b. It wasn't the exact same chip after all (fewer transistors, full node shrink).
To compete with AMD's possible ~5 TFLOPS SP they should do at least 3:1.
Maybe they're going superscalar like on GF104, and even further:
- triple dual-port scheduler
- one full-speed DP-capable Vec16
- four normal Vec16
- SP to DP 5:1, or 5:2 with parallel looping (GF104-like) on the non-full-speed Vec16 ALUs
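The speculated ratios above can be sanity-checked with a quick sketch. The lane counts and the assumption that GF104-style looping adds another 16 DP ops per clock in aggregate are mine, not stated in the post:

```python
# Hypothetical SM layout speculated above: five Vec16 ALUs, one of which
# runs double precision at full speed. All lane counts are assumptions.
SP_LANES_PER_VEC16 = 16
NUM_VEC16 = 5                                    # 1 DP-capable + 4 SP-only

sp_per_clock = NUM_VEC16 * SP_LANES_PER_VEC16    # 80 SP ops/clock
dp_full_speed_only = SP_LANES_PER_VEC16          # 16 DP ops/clock
# Assume GF104-style "parallel looping" lets the four remaining Vec16s
# contribute another 16 DP ops/clock in aggregate:
dp_with_looping = dp_full_speed_only + 16        # 32 DP ops/clock

print(sp_per_clock / dp_full_speed_only)         # 5.0  -> SP:DP = 5:1
print(sp_per_clock / dp_with_looping)            # 2.5  -> SP:DP = 5:2
```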
> To ease programming, the design is cache coherent across both graphics and traditional processor cores.

Can anybody tell me how on earth NV is going to achieve that with x86 cores?
> AMD has already indicated it would like to extend PCIe so that it supports coherence, in order to have future GPUs function in the same memory space.

So HTX over PCIe then?
As long as the cores are designed to use a coherent interconnect with a common protocol, it doesn't matter what those cores are.
AMD has already indicated it would like to extend PCIe so that it supports coherence, in order to have future GPUs function in the same memory space. Fusion APUs are expected to have a shared space even earlier.
Both Intel and AMD have initiatives that opened up their coherent interconnects to third parties.
Certain FPGAs do have cc-QPI for Intel.
AMD had some abortive initiative that did the same for hypertransport. I forget the name, and I wonder if they remember either.
Some funky things in the article's diagram, like "DRAM cube".
> So HTX over PCIe then?

The AMD spokesperson didn't get into specifics.
There was a bit of ambiguity in the question. The question isn't how coherency b/w xpu and ypu cores is achieved; the question is legal and economic.
If AMD goes forward with its intent to include its discrete GPUs in the same memory space so that they can frolic with Fusion, it may be interesting to implement a PCIe protocol with proprietary extensions while still calling it a PCI SIG-sanctioned expansion slot, which the CPUs with on-die PCIe will be doing. The feasibility of PCI-SIG coming up with a standardized, vendor-neutral extension to the spec to support coherency seems questionable at best.
Seems to be 2014. The timeframe for this seems to be well after PCIe gets integrated on-die, so it may be some kind of ccPCIe, since HTX is not particularly relevant while on the same chip.
Interestingly, what kind of latency can be expected while trying to maintain coherency over PCIe, vis-à-vis something like a multi-socket system?
Nvidia describes 10 teraflops processor
Rick Merritt
11/17/2010 10:47 AM EST
SAN JOSE, Calif. – Nvidia's chief scientist gave attendees at Supercomputing 2010 a sneak peek at a future graphics chip that will power an exascale computer. Nvidia is competing with three other teams to build such a system by 2018 in a program funded by the U.S. Department of Defense.
Nvidia's so-called Echelon system is just a paper design backed up by simulations, so it could change radically before it gets built. Elements of its chip designs ultimately are expected to show up across the company's portfolio of handheld to supercomputer graphics products.
"If you can do a really good job computing at one scale you can do it at another," said Bill Dally, Nvidia's chief scientist who is heading up the Echelon project. "Our focus at Nvidia is on performance per watt [across all products], and we are starting to reuse designs across the spectrum from Tegra to Tesla chips," he said.
In his talk, Dally described a graphics core that can process a floating-point operation using just 10 picojoules of energy, down from 200 picojoules on Nvidia's current Fermi chips. Eight of the cores would be packaged in a single streaming multiprocessor (SM), and 128 of the SMs would be packed into one chip.
The result would be a thousand-core graphics chip with each core capable of handling four double precision floating-point operations per clock cycle—the equivalent of 10 teraflops on a chip. A chip with just eight of the cores would someday power a handset, Dally said.
The Echelon chip packs just twice as many cores as today's high-end Nvidia GPUs. However, today's cores handle just one double precision floating-point operation per cycle, compared to four for the Echelon chip.
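The article's figures can be cross-checked with some back-of-envelope arithmetic. Only the core count, ops per clock, 10 teraflops target, and 10 pJ/op figure come from the article; the implied clock rate and power draw are derived here, not stated by Dally:

```python
# Back-of-envelope check of the Echelon figures in the article.
cores = 8 * 128                      # 8 cores per SM x 128 SMs = 1024 cores
dp_flops_per_clock = cores * 4       # 4 DP ops per core per cycle
target_flops = 10e12                 # 10 teraflops

clock_hz = target_flops / dp_flops_per_clock
print(clock_hz / 1e9)                # ~2.44 GHz implied clock rate

energy_per_flop_j = 10e-12           # 10 picojoules per operation
compute_power_w = target_flops * energy_per_flop_j
print(round(compute_power_w))        # ~100 W for the arithmetic alone
```

Note that ~100 W covers only the floating-point units themselves; the article's point about data movement is that moving operands around the chip can cost far more energy than the arithmetic.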
Many of the advances in the chip come from its use of memory. The Echelon chip will use 256 Mbytes of SRAM memory that can be dynamically configured to meet the needs of an application.
For example, the SRAM could be broken up into as many as six levels of cache, each of a variable size. At the lowest level each core would have its own private cache.
The goal is to get data as close to processing elements as possible to reduce the need to move data around the chip, wasting energy. Thus SMs would have a hierarchy of processor registers that could be matched to locations in cache levels. In addition, the chip would have broadcast mechanisms so that the results of one task could be shared with any nodes that needed that data.
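The configurable SRAM idea can be sketched as a toy model: one 256 MB pool carved into up to six cache levels of application-chosen sizes. The particular split below is invented for illustration; the article gives no example configuration:

```python
# Toy model of Echelon's configurable on-chip SRAM: a single 256 MB pool
# partitioned into a variable-depth cache hierarchy. Sizes are in MB.
SRAM_MB = 256

def partition(level_sizes_mb):
    """Map level names to capacities, checking the total SRAM budget."""
    assert sum(level_sizes_mb) <= SRAM_MB, "exceeds 256 MB SRAM pool"
    return {f"L{i + 1}": size for i, size in enumerate(level_sizes_mb)}

# One possible six-level configuration; an application wanting larger
# private per-core caches could rebalance the budget toward L1 instead.
config = partition([8, 8, 16, 32, 64, 128])
print(config)  # {'L1': 8, 'L2': 8, 'L3': 16, 'L4': 32, 'L5': 64, 'L6': 128}
```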
----
This was published on a financial blog, and I am looking for the link to the actual article.
> Can anybody tell me how on earth NV is going to achieve that with x86 cores?

It's not cheap, but you can do a lot with the TLB.
There's a PDF in the "Computing Theater" section entitled "GPU Computing: To ExaScale and Beyond"... 256MB sounds like fun.
Maybe his presentation will appear here:
http://www.nvidia.com/object/sc10.html