Nvidia shows signs in [2023]

An undergraduate student used an Nvidia GeForce GTX 1070 and AI to decipher a word in one of the Herculaneum scrolls, winning a $40,000 prize (via Nvidia). Herculaneum was buried in ash by the eruption of Mount Vesuvius, and the more than 1,800 Herculaneum scrolls are among the site's most famous artifacts. The scrolls have been notoriously hard to decipher, but machine learning might be the key.
...
Luke Farritor, an undergrad at the University of Nebraska-Lincoln and SpaceX intern, used his old GTX 1070 to train an AI model to detect "crackle patterns," which indicate where an ink character used to be. Eventually, his GTX 1070-trained AI was able to identify the Greek word πορφυρας (or porphyras), which is either the adjective for purple or the noun for purple dye or purple clothes. Deciphering this single word earned Farritor a $40,000 prize.
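For anyone wondering what "train an AI model to detect crackle patterns" roughly amounts to, it's essentially a small binary classifier over patches of the scan data: ink vs. no ink. The tiny CNN below is purely my own illustrative sketch (layer sizes, 64x64 patches and hyperparameters are assumptions, not Farritor's actual model), mostly to show why a single GTX 1070 is enough for this kind of thing:

```python
import torch
import torch.nn as nn

class CracklePatchClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(32 * 16 * 16, 1))

    def forward(self, x):  # x: (batch, 1, 64, 64) patches of the scanned surface
        return self.head(self.features(x))  # one logit: ink / no ink

device = "cuda" if torch.cuda.is_available() else "cpu"  # a GTX 1070 is plenty
model = CracklePatchClassifier().to(device)
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```

A network this size trained on labelled patches fits comfortably in 8 GB of VRAM, which is why old consumer hardware was enough for the job.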
 

That's a pretty awesome application for AI. It's kinda crazy that we're living through the AI revolution now, which I honestly believe will make all previous revolutions, with the possible exception of the agricultural one, pale in comparison.

For people of the generation that I think many here belong to, i.e. those who can remember a time before mobile phones, the internet and Wi-Fi, we are living through by far the most remarkable transition to happen within a single human lifetime in all of human history.

How privileged are we!
 
There are likely tons of undiscovered applications for AI but it’s far too early to say it’ll be more transformative than the internet. The internet enabled entirely new markets, professions and hobbies and had a tremendous impact on global society and culture. So far AI is just making existing things easier and/or faster.
 
There are likely tons of undiscovered applications for AI but it’s far too early to say it’ll be more transformative than the internet. The internet enabled entirely new markets, professions and hobbies and had a tremendous impact on global society and culture. So far AI is just making existing things easier and/or faster.
AI will make labor-intensive tasks way cheaper; imagine AI creating animations for game characters with solid results that are indistinguishable from real mocap. Speaking of which, I'd love to see cloud AI being used to enhance game worlds in non-latency-sensitive applications, such as:

-Very realistic tree/leaf movement in a game like Forza or GT, where the cloud AI calculates one intensive physics instance and shares it with clients' hardware, so everyone playing that particular track gets very nice-looking physics, while those who don't have access to the cloud simply use a downgraded, locally calculated version (a rough sketch of the flow is below the list).

-Remake higher-polygon assets for old games to breathe new life into them; no matter how much you upres a game, upgrade the lighting or boost the fps, the low-poly objects will always stick out.
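On the cloud physics idea, the shape of it would be something like the sketch below: one authoritative simulation per track in the cloud, shared with every connected client, and a cheap local fallback for everyone else. All function names and the "API" here are made up by me just to illustrate the flow:

```python
from typing import Optional

def cheap_local_foliage_sim(track_id: str, wind: float) -> dict:
    # downgraded, locally calculated fallback: one coarse sway value per track
    return {"detail": "low", "sway": wind / 10.0}

def fetch_shared_foliage_sim(track_id: str, wind: float) -> Optional[dict]:
    # stand-in for a request to the cloud service that runs one expensive
    # physics instance per track and shares the result with every client
    return None  # pretend this player has no cloud access

def foliage_state(track_id: str, wind: float) -> dict:
    shared = fetch_shared_foliage_sim(track_id, wind)
    return shared if shared is not None else cheap_local_foliage_sim(track_id, wind)

print(foliage_state("nordschleife", wind=3.0))  # falls back to the low-detail local sim
```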
 
https://videocardz.com/press-release/nvidia-introduces-hopper-h200-gpu-with-141gb-of-hbm3e-memory

“To create intelligence with generative AI and HPC applications, vast amounts of data must be efficiently processed at high speed using large, fast GPU memory. With NVIDIA H200, the industry’s leading end-to-end AI supercomputing platform just got faster to solve some of the world’s most important challenges.”
said Ian Buck, vice president of hyperscale and HPC at NVIDIA.

They also "teased" Blackwell: more than double the performance of the Hopper H200 GPU, an even heavier focus on AI, and a launch in 2024.

https://videocardz.com/newz/nvidia-...erformance-in-gpt-3-175b-large-language-model

https://wccftech.com/nvidia-blackwell-b100-gpus-2x-faster-hopper-h200-2024-launch/
 
512-bit never made any sense when they were already going to get a massive bandwidth lift from GDDR7. I genuinely don't know how kopite didn't think to give that one a little more skepticism from the get-go, especially if he didn't actually hear '512-bit' specifically in his information.
 
512-bit never made any sense when they were already gonna get a massive bandwidth lift from GDDR7.
The lift won't be nearly as massive for Nvidia since they are using G6X for the second generation now.
Micron's roadmap shows early G7 launching at 32 Gbps, which is a fairly minor jump over late G6X's 24 Gbps.
Which also explains why Blackwell seemingly adds even more L2 in comparison to Lovelace.

The more interesting part of G7 is the addition of a 1.5x capacity step, which will allow new memory sizes on the same bus widths: 12 GB on 128-bit, 18 GB on 192-bit and 24 GB on 256-bit.
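Quick sanity check on those figures, just arithmetic: bandwidth is bus width times per-pin rate, and capacity is one device per 32-bit channel at 2 GB today versus 3 GB with the 1.5x step (the 2 GB/3 GB device sizes are my reading of the densities involved; the rest is the math already in the post):

```python
# bandwidth (GB/s) = bus width in bits / 8 * per-pin data rate in Gbps
def bandwidth_gbs(bus_bits, gbps_per_pin):
    return bus_bits / 8 * gbps_per_pin

# capacity = one memory device per 32-bit channel, 2 GB now vs 3 GB with the 1.5x step
def capacity_gb(bus_bits, gb_per_device):
    return bus_bits // 32 * gb_per_device

print(bandwidth_gbs(384, 24), bandwidth_gbs(384, 32))  # 1152.0 -> 1536.0 GB/s on 384-bit
for bus in (128, 192, 256):
    print(bus, capacity_gb(bus, 2), "->", capacity_gb(bus, 3), "GB")
# 128-bit: 8 -> 12 GB, 192-bit: 12 -> 18 GB, 256-bit: 16 -> 24 GB
```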
 
512-bit never made any sense when they were already going to get a massive bandwidth lift from GDDR7. I genuinely don't know how kopite didn't think to give that one a little more skepticism from the get-go, especially if he didn't actually hear '512-bit' specifically in his information.
Also, the old 128 MB L2 rumor is impossible with 384-bit. AD102 has 16 MB of L2 per 64-bit memory controller; with 128 MB that would be 21.333 MB per 64-bit MC, which obviously doesn't work.
AD102 has 48 slices of 2 MB for a total of 96 MB of L2; for Blackwell I guess they will just increase that to 2.5 MB per slice for a total of 120 MB of L2.
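The divisibility argument in numbers (nothing beyond the figures already in the thread):

```python
mcs_384bit = 384 // 64        # 6 x 64-bit memory controllers
print(128 / mcs_384bit)       # 21.333... MB per MC -> 128 MB can't map cleanly to 384-bit

slices = 48                   # AD102: 48 L2 slices
print(slices * 2.0)           # 96 MB today (2 MB per slice)
print(slices * 2.5)           # 120 MB if Blackwell goes to 2.5 MB per slice
```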
 
The lift won't be nearly as massive for Nvidia since they are using G6X for the second generation now.
Micron's roadmap shows early G7 launching at 32 Gbps, which is a fairly minor jump over late G6X's 24 Gbps.
Which also explains why Blackwell seemingly adds even more L2 in comparison to Lovelace.

The more interesting part of G7 is the addition of a 1.5x capacity step, which will allow new memory sizes on the same bus widths: 12 GB on 128-bit, 18 GB on 192-bit and 24 GB on 256-bit.
If we're just talking flagship parts, a 4090 uses 21 Gbps at stock, not 24. So even at the base 32 Gbps of GDDR7, that's over a 50% increase in bandwidth. Not that a mere 33% would be 'fairly minor' in either case.
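In numbers (same per-pin rates as above):

```python
print(32 / 21 - 1)  # ~0.52 -> 50%+ more per-pin bandwidth than a stock 4090's 21 Gbps
print(32 / 24 - 1)  # ~0.33 -> the jump over 24 Gbps G6X, hardly "fairly minor" either
```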
Also, the old 128 MB L2 rumor is impossible with 384-bit. AD102 has 16 MB of L2 per 64-bit memory controller; with 128 MB that would be 21.333 MB per 64-bit MC, which obviously doesn't work.
AD102 has 48 slices of 2 MB for a total of 96 MB of L2; for Blackwell I guess they will just increase that to 2.5 MB per slice for a total of 120 MB of L2.
Does the L2 have to be tied that directly to the memory controllers in terms of numbers? On AD103, it doesn't seem like they're laid out that way.
 
Does the L2 have to be tied that directly to the memory controllers in terms of numbers? On AD103, it doesn't seem like they're laid out that way.
I think so. In this die shot of GA102, each 64-bit MC is linked to 8 slices of L2 cache. I think it's the same for Ada, but each slice was increased from 128 KB to 2 MB.
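Which lines up with the slice counts being discussed (assuming the 8-slices-per-MC reading of the die shot is right):

```python
mcs = 384 // 64               # 6 x 64-bit MCs on GA102
slices = mcs * 8              # 8 L2 slices per MC in the die shot
print(slices)                 # 48 slices
print(slices * 128 / 1024)    # 6 MB of L2 on GA102 (128 KB slices)
print(slices * 2)             # 96 MB on AD102 (2 MB slices)
```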

 
This is a quite long talk by Bill Dally about the future of deep learning hardware, including some interesting tidbits on log numbers and sparse matrices.

It's a good talk, but it's always interesting to see how research is disassociated from day-to-day hardware at practically every semiconductor company! In the Q&A, Bill Dally says that NVIDIA GPUs process matrix multiplies in 4x4 chunks and go back to the register file after every chunk, which he says is "in the noise" and "maybe 10% energy". But that's definitely NOT the case on Hopper, where matrix multiplication still isn't systolic but works completely differently (an asynchronous instruction that can read directly from shared memory, etc.), with a minimum shape of 64x8x16 (so they've effectively solved that issue): https://docs.nvidia.com/cuda/parall...tml#asynchronous-warpgroup-level-matrix-shape
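A toy way to see why the big asynchronous tile matters for the register-file energy he's talking about: count how many partial-sum write-backs you'd do if a 64x8x16 tile were built out of small chunks that return to the register file each time, versus issuing it as one instruction. The 4x4x4 chunk size and the "one write-back per chunk" model below are just my simplification of Dally's description, not a literal model of the hardware:

```python
def accumulator_writebacks(m, n, k, cm, cn, ck):
    # one partial-result write-back to the register file per (cm x cn x ck) chunk
    return (m // cm) * (n // cn) * (k // ck)

small = accumulator_writebacks(64, 8, 16, 4, 4, 4)     # chunked: 16 * 2 * 4 = 128
big   = accumulator_writebacks(64, 8, 16, 64, 8, 16)   # one warpgroup-level tile: 1
print(small, big)  # 128 vs 1 trips through the register file for the same math
```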

I don't really get his point about glitching with unstructured sparsity, though; that's only an issue because they are trying to detect 0s and shuffle values around *in the same pipeline stage as the computation*. But it doesn't feel that complicated to do it a cycle early and just have a signal specifying the multiplexer behaviour as an input to your "computation" pipeline stage, which should reduce glitching to nearly the same level as structured sparsity... Maybe I'm missing something (e.g. is it actually the multiplexer glitching he's worried about at very small data sizes like 4/8-bit? Or is the cost of the extra pipeline stage much higher than I'd expect?), but it feels like a fairly standard hardware engineering problem to me. On the other hand, the fact that NVIDIA can do this kind of analysis and realise they need to focus on this kind of thing way before implementation does put them in a league of their own, as I'm skeptical most competitors do that kind of analysis that early...

If I had to make a guess for Blackwell, I'd say they will include vector scaling and 4-bit log numbers (the patent diagram he showed with the binning is clever), but he really didn't sound very optimistic about unstructured sparsity or activation sparsity. From a competitive standpoint, I think NVIDIA is going to have much tougher competition in inference than in training, so focusing on this kind of inference-only improvement makes a lot of sense for them.
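For what it's worth, "4-bit log numbers with vector scaling" in its simplest form is just a sign bit plus a 3-bit exponent picking a power-of-two magnitude, with one shared scale per vector. The toy quantizer below is my own guess at that general shape (max-abs scaling, no special zero code), not the scheme from the patent:

```python
import numpy as np

def log4_quantize(vec):
    scale = float(np.max(np.abs(vec))) + 1e-12             # per-vector scale factor
    normed = vec / scale                                    # magnitudes now in [0, 1]
    exp = np.clip(np.round(-np.log2(np.abs(normed) + 1e-12)), 0, 7)  # 3-bit exponent
    quant = np.sign(normed) * np.exp2(-exp)                 # power-of-two magnitude + sign
    return quant * scale                                    # dequantized values

x = np.random.randn(16).astype(np.float32)
print(np.max(np.abs(x - log4_quantize(x))))                 # worst-case absolute error
```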
 