> G70 / G71 / and PS3's RSX = NV47

Although G71 could be called NV47b. It wasn't the exact same chip after all (fewer transistors, full node shrink).
To compete with AMD's possible ~5 TFLOPS SP they should do at least 3:1.
Maybe they're going superscalar like on GF104, and even further:
- triple dual-port scheduler
- one full-speed DP-capable Vec16
- four normal Vec16
- SP to DP 5:1, or 5:2 with parallel looping (GF104-like) on the non-full-speed Vec16 ALUs
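The speculated ratios above can be sanity-checked with a quick sketch. The lane counts and the assumption that GF104-style looping adds another 16 DP ops per clock in aggregate are mine, not stated in the post:

```python
# Hypothetical SM layout speculated above: five Vec16 ALUs, one of which
# runs double precision at full speed. All lane counts are assumptions.
SP_LANES_PER_VEC16 = 16
NUM_VEC16 = 5                                    # 1 DP-capable + 4 SP-only

sp_per_clock = NUM_VEC16 * SP_LANES_PER_VEC16    # 80 SP ops/clock
dp_full_speed_only = SP_LANES_PER_VEC16          # 16 DP ops/clock
# Assume GF104-style "parallel looping" lets the four remaining Vec16s
# contribute another 16 DP ops/clock in aggregate:
dp_with_looping = dp_full_speed_only + 16        # 32 DP ops/clock

print(sp_per_clock / dp_full_speed_only)         # 5.0  -> SP:DP = 5:1
print(sp_per_clock / dp_with_looping)            # 2.5  -> SP:DP = 5:2
```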
> To ease programming, the design is cache coherent across both graphics and traditional processor cores.

Can anybody tell me how on earth NV is going to achieve that with x86 cores?
> AMD has already indicated it would like to extend PCIe so that it supports coherence, in order to have future GPUs function in the same memory space.

So HTX over PCIe then?
As long as the cores are designed to use a coherent interconnect with a common protocol, it doesn't matter what those cores are.
AMD has already indicated it would like to extend PCIe so that it supports coherence, in order to have future GPUs function in the same memory space. Fusion APUs are expected to have a shared space even earlier.
Both Intel and AMD have initiatives that opened up their coherent interconnects to third parties.
Certain FPGAs do have cc-QPI for Intel.
AMD had some abortive initiative that did the same for hypertransport. I forget the name, and I wonder if they remember either.
Some funky things in the article's diagram, like "DRAM cube".
> So HTX over PCIe then?

The AMD spokesperson didn't get into specifics.
There was a bit of ambiguity in the question. The question isn't how coherency b/w xpu and ypu cores is achieved; the question is legal and economic.
If AMD goes forward with its intent to include its discrete GPUs in the same memory space so that they can frolic with Fusion, it may be interesting to implement a PCIe protocol with proprietary extensions while still calling it a PCI SIG-sanctioned expansion slot, which the CPUs with on-die PCIe will be doing. The feasibility of PCI-SIG coming up with a standardized, vendor-neutral extension to the spec to support coherency seems questionable at best.
Seems to be 2014. The timeframe for this seems to be well after PCIe gets integrated on-die, so it may be some kind of ccPCIe, since HTX is not particularly relevant while on the same chip.
Interestingly, what kind of latency can be expected while trying to maintain coherency over PCIe, vis-à-vis something like a multi-socket system?
Nvidia describes 10 teraflops processor
Rick Merritt
11/17/2010 10:47 AM EST
SAN JOSE, Calif. – Nvidia's chief scientist gave attendees at Supercomputing 2010 a sneak peek at a future graphics chip that will power an exascale computer. Nvidia is competing with three other teams to build such a system by 2018 in a program funded by the U.S. Department of Defense.
Nvidia's so-called Echelon system is just a paper design backed up by simulations, so it could change radically before it gets built. Elements of its chip designs ultimately are expected to show up across the company's portfolio of handheld to supercomputer graphics products.
"If you can do a really good job computing at one scale you can do it at another," said Bill Dally, Nvidia's chief scientist who is heading up the Echelon project. "Our focus at Nvidia is on performance per watt [across all products], and we are starting to reuse designs across the spectrum from Tegra to Tesla chips," he said.
In his talk, Dally described a graphics core that can process a floating-point operation using just 10 picojoules of energy, down from 200 picojoules on Nvidia's current Fermi chips. Eight of the cores would be packaged in a single streaming multiprocessor (SM), and 128 of the SMs would be packed into one chip.
The result would be a thousand-core graphics chip with each core capable of handling four double precision floating-point operations per clock cycle—the equivalent of 10 teraflops on a chip. A chip with just eight of the cores would someday power a handset, Dally said.
The Echelon chip packs just twice as many cores as today's high-end Nvidia GPUs. However, today's cores handle just one double precision floating-point operation per cycle, compared to four for the Echelon chip.
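The article's figures can be cross-checked with some back-of-envelope arithmetic. Only the core count, ops per clock, 10 teraflops target, and 10 pJ/op figure come from the article; the implied clock rate and power draw are derived here, not stated by Dally:

```python
# Back-of-envelope check of the Echelon figures in the article.
cores = 8 * 128                      # 8 cores per SM x 128 SMs = 1024 cores
dp_flops_per_clock = cores * 4       # 4 DP ops per core per cycle
target_flops = 10e12                 # 10 teraflops

clock_hz = target_flops / dp_flops_per_clock
print(clock_hz / 1e9)                # ~2.44 GHz implied clock rate

energy_per_flop_j = 10e-12           # 10 picojoules per operation
compute_power_w = target_flops * energy_per_flop_j
print(round(compute_power_w))        # ~100 W for the arithmetic alone
```

Note that ~100 W covers only the floating-point units themselves; the article's point about data movement is that moving operands around the chip can cost far more energy than the arithmetic.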
Many of the advances in the chip come from its use of memory. The Echelon chip will use 256 Mbytes of SRAM memory that can be dynamically configured to meet the needs of an application.
For example, the SRAM could be broken up into as many as six levels of cache, each of a variable size. At the lowest level each core would have its own private cache.
The goal is to get data as close to processing elements as possible to reduce the need to move data around the chip, wasting energy. Thus SMs would have a hierarchy of processor registers that could be matched to locations in cache levels. In addition, the chip would have broadcast mechanisms so that the results of one task could be shared with any nodes that needed that data.
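The configurable SRAM idea can be sketched as a toy model: one 256 MB pool carved into up to six cache levels of application-chosen sizes. The particular split below is invented for illustration; the article gives no example configuration:

```python
# Toy model of Echelon's configurable on-chip SRAM: a single 256 MB pool
# partitioned into a variable-depth cache hierarchy. Sizes are in MB.
SRAM_MB = 256

def partition(level_sizes_mb):
    """Map level names to capacities, checking the total SRAM budget."""
    assert sum(level_sizes_mb) <= SRAM_MB, "exceeds 256 MB SRAM pool"
    return {f"L{i + 1}": size for i, size in enumerate(level_sizes_mb)}

# One possible six-level configuration; an application wanting larger
# private per-core caches could rebalance the budget toward L1 instead.
config = partition([8, 8, 16, 32, 64, 128])
print(config)  # {'L1': 8, 'L2': 8, 'L3': 16, 'L4': 32, 'L5': 64, 'L6': 128}
```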
----
This was published on a financial blog, and I am looking for the link to the actual article.
> Can anybody tell me how on earth NV is going to achieve that with x86 cores?

It's not cheap, but you can do a lot with the TLB.
There's a PDF in the "Computing Theater" section entitled "GPU Computing: To ExaScale and Beyond"... 256MB sounds like fun.
Maybe his presentation will appear here:
http://www.nvidia.com/object/sc10.html