Nvidia GT300 core: Speculation

Magma's Quartz Deployed by NVIDIA as Primary Design Rule Checker for 40-nm and Below

http://www.marketwatch.com/story/ma...n-rule-checker-for-40-nm-and-below-2009-08-25

"We have been using Magma's Quartz physical verification solution in production since we moved to the 65-nm process node, and it has proven to be both accurate and significantly faster than other solutions," said James Chen, VLSI technology manager at NVIDIA. "Through dozens of tapeouts, we've seen that Quartz provides the sign-off accuracy needed via certified runsets provided by our foundry partners. Though design sizes and rule complexity have increased significantly, we've been able to meet aggressive design schedules by leveraging Quartz's linear scalability on standard, low-memory Linux machines."
 
After their excellence of execution over the past few nodes and half-nodes, I wonder what would have become of them if they hadn't used this magical toolkit. ;)
 
Could you explain how that works out? Why should you have to trade bandwidth for speed? And even then, the bandwidth isn't 4x lower even though memory has been quadrupled on Teslas.

I don't believe bandwidth has to scale downward in a linear fashion as vram scales upward. Otherwise we'd see the same trend in consumer graphics where 1GB vram cards have half the bandwidth of 512MB cards, etc.
 
Because the highest-specced GDDR3 memory chips don't come in the capacities you'd need for a 4 GByte framebuffer?
 
Right now GDDR is available in x32 and x16 interfaces, which don't really work for ECC.

One option, depending on bandwidth requirements, is x8 DDR3 at boosted performance levels.

Ideally they would want x36 devices which probably won't happen because they would be extremely niche, so the most likely path is x8 DDR3 and taking the ~50% bandwidth hit.

Really? Mixing GDDR5 and ECC is that hard?

Theo disagrees.

http://www.brightsideofnews.com/news/2009/8/28/nvidia-plans-gt300-demos-for-late-september.aspx

[which is a given, since GDDR5 comes with ECC]

Link from Jawed's post above.
 
GDDR5 has CRC, but AFAIK that is generated on the fly and won't detect bit flips in the cells ... it's also not error correcting.

PS. I'm wrong, it is error correcting.
 
Both the speed increases and feature-additions are what makes any new DRAM type worthwhile, and GDDR5 does have a few tricks up its sleeves. New is an error detection mechanism, which is similar to ECC server memory. It calculates an eight-bit CRC algorithm on each data burst and instantly allows a calculation repeat on a detected error, improving efficiency.

http://forums.techgage.com/showthread.php?t=3629
 
NVIDIA would need to find a manufacturer to actually create chips which store/retrieve the CRC rather than just calculating it on the fly though.
 
It's only computing CRCs on data bursts, so bit flips in memory aren't detected. I don't see how this is similar to ECC in server memory. ECC memory actually stores the check bits, so that when you later retrieve the value you can see whether an error has been introduced.
 
Scan

SYSTEM, METHOD, AND COMPUTER PROGRAM PRODUCT FOR PERFORMING A SCAN OPERATION ON A SEQUENCE OF SINGLE-BIT VALUES USING A PARALLEL PROCESSOR ARCHITECTURE
A system, method, and computer program product are provided for performing a scan operation on a sequence of single-bit values using a parallel processing architecture. In operation, a scan operation instruction is received. Additionally, in response to the scan operation instruction, a scan operation is performed on a sequence of single-bit values using a parallel processor architecture with a plurality of processing elements.
Really all you need to know is that things like this are possible:

__device__ void maybe_write(int *queue, int x, bool should_write)
{
    // PSCAN: warp-wide exclusive prefix scan over the 1-bit predicate,
    // giving each writing thread a unique, densely packed queue slot
    unsigned int i = PSCAN(should_write);
    if (should_write) queue[i] = x;
}


Woah, tasty.

Can be read in conjunction with:

SYSTEM, METHOD AND COMPUTER PROGRAM PRODUCT FOR PERFORMING A SCAN OPERATION
A system, method, and computer program product are provided for efficiently performing a scan operation. In use, an array of elements is traversed by utilizing a parallel processor architecture. Such parallel processor architecture includes a plurality of processors each capable of physically executing a predetermined number of threads in parallel. For efficiency purposes, the predetermined number of threads of at least one of the processors may be executed to perform a scan operation involving a number of the elements that is a function (e.g. multiple, etc.) of the predetermined number of threads.
 
Attributes

On-the-fly reordering of 32-bit per component texture images in a multi-cycle data transfer
Each processing engine 1002 also has access, via a crossbar switch 1005, to a global register file 1006 that is shared among all of the processing engines 1002 in core 910. Global register file 1006 may be as large as desired, and in some embodiments, any processing engine 1002 can read to or write from any location in global register file 1006. In addition to global register file 1006, some embodiments also provide an on-chip shared memory 1008, which may be implemented, e.g., as a conventional RAM. On-chip memory 1008 is advantageously used to store data that is expected to be used in multiple threads, such as coefficients of attribute equations, which are usable in pixel shader programs.
Quite explicit really.

This document is quite a nice summary of the overall operation of the graphics pipeline, too.

Jawed
 
VLIW

Techniques for sourcing immediate values from a VLIW

Sourcing immediate values from a very long instruction word includes determining if a VLIW sub-instruction expansion condition exists. If the sub-instruction expansion condition exists, operation of a portion of a first arithmetic logic unit component is minimized. In addition, a part of a second arithmetic logic unit component is expanded by utilizing a block of a very long instruction word, which is normally utilized by the first arithmetic logic unit component, for the second arithmetic logic unit component if the sub-instruction expansion condition exists.
I wonder if this indicates that NVidia's moving back to VLIW. Or does it merely describe the current ALU organisation? Hard to tell.

[0043] At 620, if the VLIW includes an immediate, one or more given sub-ALUs are disabled. In one implementation, any of the plurality of sub-ALUs may be disabled. In another implementation, a predetermined bit of the VLIW indicates which sub-ALU is disabled. In another implementation, a particular sub-ALU may be easier to disable than the other sub-ALUs. Therefore, the sub-ALU, which is the easiest to disable, is the sub-ALU that is disabled each time. In another implementation, one sub-ALU may be less powerful than the other sub-ALUs or one sub-ALU may be more powerful than the other sub-ALUs. For example, a particular sub-ALU may be able to access an immediate in each of the other sub-ALU portions of the VLIW, while the other sub-ALUs cannot. In another example, a particular sub-ALU may be able to access the output of one or more adders or multipliers of one or more sub-ALUs, while the other sub-ALUs cannot. In yet another example, a particular sub-ALU may be able to perform reciprocal operations while the other sub-ALUs cannot. Therefore, the least powerful sub-ALU may be disabled so that all operations are still available. Alternative, the more powerful sub-ALU may consume more power and yet the addition operations that the more powerful sub-ALU can perform may not be needed and is therefore disabled to conserve more power.
[0044] In one implementation, the one or more given sub-ALUs may be disabled by disregarding or dropping the output of the given sub-ALU. In another implementation, the one or more given sub-ALUs may be disabled by not clocking data through the given sub-ALU, and thereby saving power. In yet another implementation, the one or more given sub-ALUs may be disabled by turning off power to the one or more given sub-ALUs.
[0045] At 630, one or more blocks of the VLIW corresponding to the one or more given sub-ALUs that are disabled are made available to one or more other sub-ALUs as one or more immediates. Therefore, the VLIW is coupled to the one or more multiplexers and demultiplexers for controlling multiplexing and demultiplexing. In addition, the VLIW is also coupled to the one or more multiplexers as immediates for sourcing to inputs of the sub-ALUs. At 640, the active sub-ALUs are configured and controlled according to the VLIW. The configuration and control may include sourcing a portion of the VLIW corresponding to the one or more disabled sub-ALUs as one or more immediates to one or more of the active sub-ALUs. The immediates are advantageously available immediately to the active sub-ALUs to perform operations upon, instead of having to placing them first into a register before being able to use the immediates.

Jawed
 