Tesla Dojo

OlegSH · Sep 2, 2021

135 gigabytes per cabinet sounds way too low for something like full-blown GPT-3 with 175 billion parameters, which would translate into about 326 GiB (350 GB) for the model alone with BF16 parameters. I wonder how it would apply to large models.
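For anyone who wants to check the arithmetic: the 326 figure is what you get when 175 billion two-byte (BF16) parameters are counted in binary gigabytes (GiB); in decimal gigabytes it is 350 GB. The sketch below is only a back-of-the-envelope check. It assumes the quoted 135 GB per cabinet is a decimal figure, counts weights only (no activations, gradients or optimizer state), and the names `model_bytes`, `GPT3_PARAMS` and `CABINET_BYTES` are made up for illustration.

```python
import math

# Assumptions for this sketch (not from Tesla): decimal gigabytes for the
# cabinet figure, weights only, 2 bytes per BF16 parameter.
GPT3_PARAMS = 175_000_000_000
CABINET_BYTES = 135 * 10**9


def model_bytes(num_params: int, bytes_per_param: int = 2) -> int:
    """Return the size of the raw weights in bytes."""
    return num_params * bytes_per_param


weights = model_bytes(GPT3_PARAMS)
cabinets_needed = math.ceil(weights / CABINET_BYTES)

print(f"weights: {weights / 10**9:.0f} GB")    # 350 GB
print(f"weights: {weights / 2**30:.0f} GiB")   # 326 GiB
print(f"cabinets needed for weights alone: {cabinets_needed}")
```

So under these assumptions the weights by themselves would span about three cabinets before any training state is counted.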

AzBat · Sep 2, 2021

Thanks for the replies guys. I've watched a few videos too & I will go back & find a few (I remember watching Anastasi though).

One thing I will say is that this was all announced at AI Day. It's a recruitment opportunity. I'm not surprised that things are not completely working (the Q&A showed that). The whole point of the presentation was to get people excited about new stuff & hopefully bring in more people to help solve their issues.

Tommy McClain

pcchen · Sep 2, 2021

OlegSH said:
135 gigabytes per cabinet sounds way too low for something like full-blown GPT-3 with 175 billion parameters, which would translate into about 326 GiB (350 GB) for the model alone with BF16 parameters. I wonder how it would apply to large models.

That's my question too. The SRAM in each node is not very big (1.25MB, smaller than many CPUs' caches). Of course, it's probably more appropriate to compare it to something like an SM or CU in a GPU, and then it looks huge. However, I'm not sure whether it has a shared external memory, as it's not mentioned in the AI Day video. If not, it looks to me that they probably intended to use the chip as an accelerator, with the host system responsible for feeding data to it continuously.

cheapchips · Sep 2, 2021

iroboto said:
yea I forgot about that one.

I haven't watched Thunderf00t's video but I assume it's the usual snark. I can't help but find it as weird as Musk worshipers. Who doesn't take Musk's timelines or presentation taglines with a massive pinch of salt? Unless you bought Autopilot in 2019 expecting 2020 revenue, who cares.

The computer vision / decision making / training pipeline stuff Tesla showed off at AI Day was impressive. They've a pretty clear path to a useful humanoid robot as a product (which is very different to one that goes to the shops for you. Like Robotaxis, it doesn't matter that much if it never meets that goal).

Agility Robotics are already selling a humanoid* robot that does pick and carry tasks. That's with a company of 30.

*Caveat is that it uses backwards knees. 'Bird legs' are self stabilising over terrain changes to a degree that hominids aren't.

OlegSH · Sep 3, 2021

pcchen said:
If not, it looks to me that they probably intended to use the chip as an accelerator, with the host system responsible for feeding data to it continuously.

In that case another question arises: whether they will be able to build a host system where host-accelerator bandwidth is not a bottleneck for the accelerator. Off-the-shelf parts would be very limited in this regard, so they would need custom host HW.
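To put a rough number on the bottleneck concern, here is a small sketch of how long a single pass over 350 GB of BF16 weights (175 billion parameters at 2 bytes each) would take over a host link. Both bandwidth figures are assumptions chosen for illustration, not Dojo specifications: 32 GB/s is roughly what one PCIe 4.0 x16 slot offers in one direction, and 1 TB/s is a purely hypothetical custom link. `stream_seconds` and `ASSUMED_LINKS` are made-up names.

```python
# Weights only: 175 billion BF16 parameters at 2 bytes each.
WEIGHTS_BYTES = 175e9 * 2

# Assumed link speeds in bytes per second (illustrative, not Tesla figures).
ASSUMED_LINKS = {
    "approx. one PCIe 4.0 x16 slot": 32e9,
    "hypothetical custom host link": 1e12,
}


def stream_seconds(payload_bytes: float, link_bytes_per_s: float) -> float:
    """Time to move payload_bytes once over a link at a sustained rate."""
    if link_bytes_per_s <= 0:
        raise ValueError("link bandwidth must be positive")
    return payload_bytes / link_bytes_per_s


for name, bandwidth in ASSUMED_LINKS.items():
    seconds = stream_seconds(WEIGHTS_BYTES, bandwidth)
    print(f"{name}: {seconds:.2f} s per full pass over the weights")
```

Under these assumptions a single off-the-shelf slot needs on the order of ten seconds per pass, which is why the host side would have to be either heavily parallel or custom.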

Jawed · Sep 3, 2021

pcchen said:
That's my question too. The SRAM in each node is not very big (1.25MB, smaller than many CPUs' caches). Of course, it's probably more appropriate to compare it to something like an SM or CU in a GPU, and then it looks huge. However, I'm not sure whether it has a shared external memory, as it's not mentioned in the AI Day video. If not, it looks to me that they probably intended to use the chip as an accelerator, with the host system responsible for feeding data to it continuously.

Yes, the tiles should be viewed as accelerators - or if you prefer together they make one massive accelerator which has a single-level view of working memory.

The presentation danced around the question of the plane. I interpret this to mean that the ExaPod is really two planes and they should be considered independent. One plane has to go to the host to deal with the other plane. Each plane is effectively a single accelerator.

Tesla is also building a custom host. It seemed as if that involves more custom silicon.

I think we should assume that this first machine will be scrapped pretty soon, if it is even built out to the full 10 cabinets.

For comparison, does anyone know about Tesla's version 1 and version 2 inference chips that go in their cars? All cars are currently using the version 3 chip it seems. Were the previous two designs deployed in cars?

It's notable that no one is laughing at Tesla's current inference chip running in cars. What competing hardware is there? If we want to talk about the prospects for Dojo then it seems a comparison with the inference chip would be instructive.

Software is clearly another problem. SpaceX's rockets are so nice because of their software...

Jawed · Sep 5, 2021

HW3 is the version 3 chip I was referring to before. 2019 is when it became public:

Tesla's New HW3 Self-Driving Computer — It's A Beast (CleanTechnica Deep Dive) | CleanTechnica

It was the first chip for this function that Tesla designed. The prior chip, referred to as "HW2", was by NVidia.

According to this article:

Tesla Autopilot Mystery Solved — HW3 Full Potential Soon To Be Unlocked | CleanTechnica

Tesla initially was "emulating" HW2 on HW3 in order to be functional. I suspect "emulating" is the wrong word. But we're never going to know the details.

pcchen · Sep 5, 2021

Jawed said:
It's notable that no one is laughing at Tesla's current inference chip running in cars. What competing hardware is there? If we want to talk about the prospects for Dojo then it seems a comparison with the inference chip would be instructive.

Inference is one thing, but Tesla Dojo is obviously designed for training, not inference.
Of course, I don't doubt that they know what they are doing. This is not something you go into blindly without at least some data to back up your design decisions (although there are unfortunately some precedents from some very famous CPU companies that looked good on paper but turned out to be pretty bad in practice). However, it's likely that Tesla has its own design goals, which may or may not be compatible with other people's requirements.

BRiT · Sep 7, 2021

NonTechnical Offtopic spawned to https://forum.beyond3d.com/threads/elon-musk-spawn.62517/

Jawed · Oct 25, 2021

I accidentally found this:

SBY4B9_tesla-dojo-technology_OPNZ0M.pdf (thron.com)

"A Guide to Tesla’s Configurable Floating Point Formats & Arithmetic"

Jawed · Sep 3, 2022

Hot Chips 34 – Tesla’s Dojo Microarchitecture

To say Tesla is merely interested in machine learning is an understatement. The electric car maker built an in-house supercomputer named Dojo, optimized for training its machine learning models. Un…

chipsandcheese.com

Jawed · Oct 17, 2022

So they're still very much at the development stage. I wonder whether they'll even deploy a version 1.0 system before the version 2.0 hardware is ready to deploy, which sounds like it would be towards the end of 2024.

They talk about using 4 Dojo racks to replace 72 GPU racks and suggest that will happen 2023Q1 if not this year...

MfA · Oct 17, 2022

Jawed said:
I can't help thinking that Tesla realised that the GPGPU roadmap was too slow by about a decade, so a few years ago they decided to build their own AI supercomputers.

"General purpose GPU compute" looks like a dead end now.

There's more to supercomputing than datamining consumer/citizen data with neural networks for advertising/spying (and an expensive waste of time for FSD Never Ever). Though admittedly it's most of supercomputing at this point, some people still need a little more programmability.

DavidGraham · Feb 11, 2024

Elon doesn't seem to be too enthusiastic about Dojo.

Is the much-hyped supercomputer still a going concern? You'd think so, given all that braggadocio — but as highlighted by CleanTechnica, CEO Elon Musk was asked about it at a January investor meeting and his answer was absolutely baffling.

"I mean, the AI auto question is - that is a deep one," he said, tripping over his own words again and again. "So, we're obviously hedging our bets here with significant orders of Nvidia GPUs - or GPUs is the wrong word

"But I would, you know, think of Dojo as a long shot," he eventually admitted, after proffering that training a car is "much like" training a human.

Elon Musk Melts Down in Baffling Answer on Tesla Supercomputer

Tesla CEO Elon Musk made some puzzling comments about the company's "Dojo" supercomputers, which train cars how to drive themselves.

futurism.com

Davros · Feb 11, 2024

cheapchips said:
Who doesn't take Musk's timelines or presentation taglines with a massive pinch of salt?

The media, for one. I remember all the news reports that claimed the Hyperloop would travel at 750mph etc. Sorry, but no, it won't and it didn't. The smartest thing Musk did with Hyperloop was to have nothing to do with it (maybe he read the original 100-year-old patent and realised it was a stupid idea).
I remember all the claims for Hyper Tunnel. What did it turn out to be? A 30mph ride in a taxi in a 500-meter tunnel with disco lights.
PS: your quote is basically the Fox News Tucker Carlson defence, "Only an idiot would take Tucker seriously, so it's OK if he lies". Yes, Fox's lawyers actually used that.

I'll shut up now as I don't want to turn the thread into a Musk rant.

cheapchips · Feb 11, 2024

Davros said:
your quote is basically the Fox News Tucker Carlson

I don't want to get into a discussion about Musk/Tesla timelines either. I will say, even on the Internet, I've never been so insulted in all my life.

DavidGraham · Apr 25, 2024

https://twitter.com/x/status/1783518934882275454

Father_Murphy · May 2, 2024

Speak of the devil... The Dojo wafers are now in full production at TSMC.

Tesla's wafer-sized Dojo processor is in production — 25 chips combined into one

Wafer-scale processors gain traction.

www.tomshardware.com
