Tesla Dojo

Jawed · Aug 24, 2021

I just love how gigacorps are building their own chips from scratch:

The video is only about 20 minutes and has some nice nuance that you won't get from this article:

Tesla's insane new Dojo D1 AI chip, a full transcript of its unveiling | TweakTown

The article is also missing some slides, which is a shame. 5x GPU in off-chip bandwidth for system-wide networking is a non-trivial factor (timestamped video for the slide I'm referring to):

100% of the area out here is going to machine learning training and bandwidth -- there is no dark silicon, there is no legacy support. This is a pure machine-learning machine.

it will be the fastest AI training computer -- 4x the performance at the same cost, 1.3x better performance-per-watt -- that is energy-saving, and 5x smaller footprint

So actual self-driving coming real soon now?

Jawed · Aug 24, 2021

Gary's 10 minute explanation is also worth a watch and might be preferable for some people:

orangpelupa · Aug 24, 2021

Is their presentation of Tesla Dojo "transparent", or are there misleading things in it?

I mean, in their presentation of point clouds, there were things (point clouds generated around corners at an intersection, outside the camera's view; a lidar rack visible in a window reflection) that indicate it was made with LIDAR (or at least used LIDAR as a base or something), but they didn't mention lidar at all.

Jawed · Aug 24, 2021

I haven't watched anything other than those two videos.

Tesla Is Testing Lidar Sensors, Which Elon Musk Has Criticized: Report (businessinsider.com)

Overall, Elon Musk has been effectively lying about their autonomous driving progress for years now, so I ignore the detail.

It seems clear, now, that they were orders of magnitude distant from what's truly required. Easily could be another 10 years for all I know.

I can't help thinking that Tesla realised that the GPGPU roadmap was too slow by about a decade, so a few years ago they decided to build their own AI supercomputers.

"General purpose GPU compute" looks like a dead end now. ExaPod's cabinet is the size of an aisle and the architecture is designed for aisle-level scaling, not blade-level as seen with GPUs. Maybe the next wave of GPUs will achieve cabinet-level scaling... NVidia is building cabinets that are 5x bigger for the same performance, so already far behind on bandwidth density.

(Domestic robot thing sounds like a marketing fuck-up of gigantic proportions, literally making Tesla a laughing stock. I just scrolled past any articles or videos on that topic.)

nutball · Aug 24, 2021

Jawed said:
Overall, Elon Musk has been effectively lying about their autonomous driving progress for years now, so I ignore the detail.

It seems clear, now, that they were orders of magnitude distant from what's truly required. Easily could be another 10 years for all I know.

I'm expecting fully autonomous/self-driving cars around the time that we get the first commercially viable nuclear fusion power plants. I'm not expecting to live to see either.

nAo · Aug 24, 2021

Jawed said:
I can't help thinking that Tesla realised that the GPGPU roadmap was too slow by about a decade, so a few years ago they decided to build their own AI supercomputers.

I am not so sure about the roadmap not moving fast enough. Their own supercomputer, which doesn't exist yet, is already slower than AI supercomputers from last year.

Jawed · Aug 24, 2021

nAo said:
I am not so sure about the roadmap not moving fast enough.

I was waiting for someone to suggest this. The hubris of incumbents is always entertaining. Tesla built this for a joke, obviously.

Well they're already working on version 2.

I wonder if they'll enter the cloud AI business with this, if it works.

iroboto · Aug 24, 2021

Jawed said:
I haven't watched anything other than those two videos.

Tesla Is Testing Lidar Sensors, Which Elon Musk Has Criticized: Report (businessinsider.com)

Overall, Elon Musk has been effectively lying about their autonomous driving progress for years now, so I ignore the detail.

It seems clear, now, that they were orders of magnitude distant from what's truly required. Easily could be another 10 years for all I know.

I can't help thinking that Tesla realised that the GPGPU roadmap was too slow by about a decade, so a few years ago they decided to build their own AI supercomputers.

"General purpose GPU compute" looks like a dead end now. ExaPod's cabinet is the size of an aisle and the architecture is designed for aisle-level scaling, not blade-level as seen with GPUs. Maybe the next wave of GPUs will achieve cabinet-level scaling... NVidia is building cabinets that are 5x bigger for the same performance, so already far behind on bandwidth density.

(Domestic robot thing sounds like a marketing fuck-up of gigantic proportions, literally making Tesla a laughing stock. I just scrolled past any articles or videos on that topic.)

training != runtime
Just to be clear, training is exponentially longer than running; the hardware required to perform self-driving is fairly good as it is.

But let's say your AI is about 98% effective. Gaining each additional 0.1% in accuracy requires significantly more training, research, feature extraction, datasets, etc., and significantly more processing power to squeeze out that small improvement. Each 0.1% is a massive deal when you're dealing with cars in the millions: 99.99% safe is nowhere near as safe as 99.9999% when you consider millions of Tesla vehicles driving millions of kilometers.
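
To put rough numbers on the gap between 99.99% and 99.9999%, here is a minimal sketch. It assumes the percentages are read as a per-kilometer success rate, which the post does not specify, and the fleet size and distance per vehicle are made-up round numbers chosen only for illustration.

```python
def expected_failures(reliability: float, fleet_size: int, km_per_vehicle: float) -> float:
    """Expected number of failure events across a whole fleet.

    Assumption (not from the thread): `reliability` is the probability
    that any single kilometer is driven without a failure.
    """
    total_km = fleet_size * km_per_vehicle
    return (1.0 - reliability) * total_km


# Hypothetical fleet: one million cars driving 10,000 km each.
FLEET_SIZE = 1_000_000
KM_PER_VEHICLE = 10_000

four_nines = expected_failures(0.9999, FLEET_SIZE, KM_PER_VEHICLE)
six_nines = expected_failures(0.999999, FLEET_SIZE, KM_PER_VEHICLE)

print(f"99.99%   -> about {four_nines:,.0f} expected failures")
print(f"99.9999% -> about {six_nines:,.0f} expected failures")
print(f"ratio: about {four_nines / six_nines:.0f}x")
```

Under these assumptions the two extra nines cut the expected failures by a factor of about 100, which is why each small gain matters so much at fleet scale.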

As it stands, adding LIDAR into the AI is ideal, but cameras are likely to represent the majority of the work.

There is a lot that goes into a self-driving AI: following distance, acceleration curves, braking curves, dealing with people cutting in, visual inaccuracies, blinding lights, rain, snow, dust, mud, etc. Being able to handle all those situations without making the car feel like it drives erratically requires some complex neural networks with a fairly extensive driving history (maybe 30 seconds backwards). As someone who was there from Autopilot 1 to where it is today, I can say they've come a long way in how well the car handles the road compared to how it did at the start.

Full self driving will be here much sooner than you think, but drivers should be prepared to 'take over' from time to time. Maybe in 10+ years time, you can be relatively not there as a driver.

Tesla continues to make its own hardware because, quite frankly, nvidia is expensive. They'll likely use this for SpaceX as well.

nutball · Aug 24, 2021

iroboto said:
Full self driving will be here much sooner than you think, but drivers should be prepared to 'take over' from time to time.

In which case it's not full self-drive. It's a slightly smarter cruise control.

Full self-drive is like being in a taxi or chauffeur driven. You go from A to B regardless of where A and B are and what's in between them, without always having to be ready to leap in to the cockpit on the off-chance that something hasn't been modelled properly. Anything less than that is marketing bollocks.

iroboto · Aug 24, 2021

nutball said:
In which case it's not full self-drive. It's a slightly smarter cruise control.

Full self-drive is like being in a taxi or chauffeur driven. You go from A to B regardless of where A and B are and what's in between them, without always having to be ready to leap in to the cockpit on the off-chance that something hasn't been modelled properly. Anything less than that is marketing bollocks.

they could release that today. Doesn't mean it's as safe as they want it to be. There's a big difference between the technology being available, and being 99.9999% safe. I assure you that they have full self driving working already. Doesn't mean it can handle everything you throw at it.

I don't think they ever promised it to be a chauffeur driven experience. You are welcome to point to me where Elon says that, because I'm pretty sure he goes on record that we are well over 20+ years away from that level of self driving.

Mind you that was in 2017 when I bought mine, I haven't seen what he's said lately. Frankly I don't trust it that much per se, but it's been useful for long road trips.

edit: update. Ah, I see he is overstating their output:

The ratio of driver interaction would need to be in the magnitude of 1 or 2 million miles per driver interaction to move into higher levels of automation. Tesla indicated that Elon is extrapolating on the rates of improvement when speaking about L5 capabilities. Tesla couldn’t say if the rate of improvement would make it to L5 by end of calendar year.

yea, I don't see L5 automation being ready for end of 2021, it's unlikely to have occurred during the covid years. Curious to see where they land in 2024.

nutball · Aug 24, 2021

iroboto said:
Doesn't mean it can handle everything you throw at it.

I don't think they ever promised it to be a chauffeur driven experience. You are welcome to point to me where Elon says that, because I'm pretty sure he goes on record that we are well over 20+ years away from that level of self driving.

I'm not singling out Elon or Tesla here. My sole point is that the term "full self driving" needs to be used very carefully IMO lest it get devalued. I realise I'm pissing in the wind and that marketing spin will likely triumph as it usually does.

If Musk is saying 20+ years for what I call full self-driving, then compensating for the Musk Reality Distortion Field gives ~40 years Standard Human Elapsed Time, which is roughly what I'm expecting. So he and I agree. Excellent. And anything up to then is marketing bollocks.

iroboto · Aug 24, 2021

nutball said:
I'm not singling out Elon or Tesla here. My sole point is that the term "full self driving" needs to be used very carefully IMO lest it get devalued. I realise I'm pissing in the wind and that marketing spin will likely triumph as it usually does.

If Musk is saying 20+ years for what I call full self-driving, then compensating for the Musk Reality Distortion Field gives ~40 years Standard Human Elapsed Time, which is roughly what I'm expecting. So he and I agree. Excellent. And anything up to then is marketing bollocks.

It'll likely land much before. He made that comment in 2017 and revised it again in 2019. I think the amount of road data being captured by all these Teslas is something he didn't factor in. 40 years is a long time; there will be several breakthroughs before then. FSD subscriptions come out soonish for the USA, so they'll be able to FSD from point A to B. But there will be times in non-ideal conditions when the AI will ask the driver to take over. Of this I'm certain.

“Full Self-Driving capability is now available as a monthly subscription. Upgrade your Model Y ... for $199 (excluding taxes) to experience features like Navigate on Autopilot, Auto Lane Change, Auto Park, Summon and Traffic Light and Stop Sign Control. The currently enabled features require active driver supervision and do not make the vehicle autonomous.” City-street steering is something that the laws in a particular area will need to allow. That pretty much means anyone who buys into FSD today is a beta participant: as you cross boundary areas, the GPS will shut off certain features.

oddly I don't get to subscribe for my version, I can only pay. But it's 5300 CAD. So much cheaper than the 10K everyone else has to pay.

iroboto · Aug 24, 2021

Right let me reiterate my position here on the subject.
Tesla being far away from FSD is not due to a lack of computational power. The bottleneck is more likely the R&D and the data sets required to make FSD happen at Level 5 automation.

The reason why Tesla would move to make their own supercomputer as opposed to using an existing one comes down to cost. They likely extrapolated the run rates of their training costs many years down the road and determined that this would ultimately be cheaper and better for them.

To provide you an example:
https://dl.acm.org/doi/fullHtml/10.1145/3381831

Such large models have high costs for processing each example, which leads to large training costs. BERT-large was trained on 64 TPU chips for four days at an estimated cost of $7,000. Grover was trained on 256 TPU chips for two weeks, at an estimated cost of $25,000. XLNet had a similar architecture to BERT-large, but used a more expensive objective function (in addition to an order of magnitude more data), and was trained on 512 TPU chips for 2.5 days, costing more than $60,000. It is impossible to reproduce the best BERT-large results or XLNet results using a single GPU, and models such as openGPT2 are too large to be used in production. Specialized models can have even more extreme costs, such as AlphaGo, the best version of which required 1,920 CPUs and 280 GPUs to play a single game of Go, with an estimated cost to reproduce this experiment of $35,000,000.

You can quickly see how this starts to ramp up in cost. There is a lot of processing power out there, but the electricity doesn't come cheap. If there is a way to get similar performance with less power usage, that would be the reason to invent your own silicon.

You are constantly retraining and evaluating models and only when you are satisfied do you release it into production. If a team is training several models a day, you're burning through tons of cash.
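
As a rough illustration of how those figures turn into a run rate, the sketch below takes the three TPU examples from the quoted passage, works out the implied cost per chip-day, and then extrapolates one of them to a year of daily retraining. The number of runs per day is a hypothetical value, not something stated in the thread, and the helper names are made up for this example.

```python
# (name, TPU chips, days, estimated cost in USD), taken from the quoted passage.
# The XLNet figure is quoted as "more than $60,000", so treat it as a lower bound.
QUOTED_RUNS = [
    ("BERT-large", 64, 4.0, 7_000),
    ("Grover", 256, 14.0, 25_000),
    ("XLNet", 512, 2.5, 60_000),
]


def cost_per_chip_day(chips: int, days: float, cost_usd: float) -> float:
    """Implied cost of one TPU chip for one day, given a quoted total."""
    return cost_usd / (chips * days)


def annual_cost(cost_per_run_usd: float, runs_per_day: float, days_per_year: int = 365) -> float:
    """Yearly spend if a team repeats a run of this size every day."""
    return cost_per_run_usd * runs_per_day * days_per_year


for name, chips, days, cost in QUOTED_RUNS:
    rate = cost_per_chip_day(chips, days, cost)
    print(f"{name}: about ${rate:,.2f} per chip-day")

# Hypothetical workload: three BERT-large-sized training runs per day.
RUNS_PER_DAY = 3
yearly = annual_cost(7_000, RUNS_PER_DAY)
print(f"{RUNS_PER_DAY} BERT-large-sized runs per day: ${yearly:,.0f} per year")
```

Even at the cheapest quoted run size, a handful of daily retraining runs reaches millions of dollars per year, which is the kind of extrapolation that could make custom silicon look attractive.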

nAo · Aug 24, 2021

Jawed said:
I was waiting for someone to suggest this. The hubris of incumbents is always entertaining. Tesla built this for a joke, obviously.

What hubris? I simply made a factual statement.

They’re clearly very serious about this effort and perhaps they will be very successful on other important metrics. Raw power doesn't seem to be the highlight, so far, and strictly speaking there is nothing wrong with that.

Well they're already working on version 2.

Like every other company in this space. It’s a marathon..

I wonder if they'll enter the cloud AI business with this, if it works.

Developing the whole thing just for themselves sounds very expensive if in the long term they don’t plan to somehow externally productize their technology.

Jawed · Aug 25, 2021

I'm wondering if people have watched the videos I linked or read the transcript I linked, because I'm seeing lots of comments that contradict selected aspects.

iroboto said:
just to be clear training is exponentially longer than running

I don't think "exponentially" quite captures the difference :mrgreen:

Tesla continues to make its own hardware because quite frankly, nvidia is expensive.

Tesla described lots of other problems. Networking, small-batch performance, physical size being some.

Also, Tesla gets to write its own ISA. And when it spends billions on software engineering to target its ISA, it has full control over that software. It's not praying for NVidia to agree on what's important, and to implement that stuff whenever it suits. It's not paying for a drip-feed of incremental improvements...

nAo said:
What hubris? I simply made a factual statement.

nAo said:

I am not so sure about the roadmap not moving fast enough. Their own supercomputer, which doesn't exist yet, is already slower than AI supercomputers from last year.

The problem is assuming that NVidia's roadmap is useful to Tesla.

There's an asymmetry here: Tesla knows what NVidia tells it about its plans (going back a few years it would have looked like: Ampere, Hopper, Next Next, Next Next Next) and Tesla estimates the fitness of that roadmap for its own use.

Clearly, Tesla decided that NVidia's roadmap was unsuitable.

I see Dojo as being very much like Apple building M1 and the family of chips that follows it. The incumbent was basically clueless and now has a hollow-looking roadmap to attain parity with what Apple has already delivered. Intel's roadmap was utterly useless. In addition to that, Apple realised that vertical integration down to the transistor would result in vastly better products.

Time will tell if NVidia's roadmap was also useless for Tesla, or Tesla dumps Dojo in a few years' time for more NVidia.

pcchen · Aug 25, 2021

I don't know if what NVIDIA is building is good enough for Tesla or not, but it's obvious that Tesla is not the only one doing AI work right now. People who buy from NVIDIA definitely have their own requirements and (at least those who can't afford to build their own chips) will tell NVIDIA what they want, in order to get what they want.
Furthermore, NVIDIA is no longer in the situation where "gaming GPU" is required to subsidize "AI chip" development, as revenue from AI chips already rivals that from gaming GPUs, so they can afford to design a completely different AI chip if that's the way to go.

AzBat · Aug 30, 2021

Kinda surprised at the lack of discussion of the actual D1 chip and its scaling: a tile of 25 D1 processors, then 6 tiles to a tray, then 2 trays to a cabinet, then 10 cabinets to an ExaPod.

Tommy McClain
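
The hierarchy described in the post above multiplies out quickly. A minimal sketch, using only the fan-in numbers given there (the level names as dictionary keys and the helper function are just illustrative):

```python
# Each entry is (level name, number of units from the level below it).
# The first entry counts D1 chips per tile.
HIERARCHY = [
    ("tile", 25),     # 25 D1 chips per tile
    ("tray", 6),      # 6 tiles per tray
    ("cabinet", 2),   # 2 trays per cabinet
    ("ExaPod", 10),   # 10 cabinets per ExaPod
]


def chips_per_level(hierarchy):
    """Return the cumulative number of D1 chips contained in one unit of each level."""
    counts = {}
    running = 1
    for name, fan_in in hierarchy:
        running *= fan_in
        counts[name] = running
    return counts


counts = chips_per_level(HIERARCHY)
for name, chips in counts.items():
    print(f"one {name}: {chips} D1 chips")
```

The total of 3,000 D1 chips per ExaPod agrees with the 3,000 dies mentioned in the SemiAnalysis excerpt quoted later in the thread.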

AzBat · Sep 2, 2021

Still nothing? Maybe if it was painted blue or used an x-clamp. Bleh.

Tommy McClain

Jawed · Sep 2, 2021

AzBat said:
Still nothing? Maybe if it was painted blue or used an x-clamp. Bleh.

Problem is there's really not much detailed information.

Some videos that are mildly interesting:

But there's lots of speculation in those videos.

I think CleanTechnica is wrong to call the 25 chips in a slice "a wafer". I believe this is actually 25 known-good dies that are mounted onto a single wafer. Tesla's video is very brief on the subject of "wafer". CleanTechnica in my opinion is incorrect to assume that redundancy (a la Cerebras) across a single wafer is good enough for the architecture that Tesla is using.

pharma · Sep 2, 2021

The Tesla Dojo Chip Is Impressive, But There Are Some Major Technical Issues – SemiAnalysis

This is as far as Tesla has gotten. They have an extremely expensive, single tile running in their labs at 2GHz. They do not have a full system. The full system is scheduled for some time in 2022. Knowing Tesla’s timing on Model 3, Model Y, Cyber Truck, Semi, Roadster, and Full Self Driving, we should automatically assume we can pad this timing here.

The two most difficult technological feats have not even been accomplished. This is the tile to tile interconnect and software. Each tile has more external bandwidth than the highest end networking switches. To achieve this, Tesla developed custom interconnects. And by Tesla, I mean their partner who has deep expertise in photonics. They are custom silicon photonics transceivers with custom external lasers. This sort of implementation is incredibly expensive.
...
The other lion in the room is software. Tesla does not even claim to have a way to automatically place and route operations of mini-tensors across the architecture. They do claim their compiler takes care of fine-grained and data model parallelism. This handwave here is not enough for us to believe the claim. There are simply far too many firms with AI hardware that can’t even scratch this with many engineers working on the software for chips that have existed for a couple years. Even if they claimed they did, a magical compiler is something worth being skeptical against. When asked questions about the stack in the Q&A they were woefully unprepared. They even said they had not solved the software problem.
...
The biggest question that has been asked by semiconductor professionals has been “How on earth is this economically viable?” Tesla is detailing a very specific set of hardware that isn’t exactly that high volume. Only a total of 3,000 645mm^2 7nm dies have been committed to being deployed. This comes alongside very exotic packaging and custom photonics developed just to be deployed in the ExaPod supercomputer. There is nowhere near enough volume to amortize the huge costs for researching and developing a chip like this. This applies, even though Tesla isn’t the one doing the R&D on the tile-to-tile interconnect or the 112G SerDes.
...
Cost, photonics, memory constraints, lack of software, and the fact that this chip is 2022 or beyond is something we have to keep in mind. This chip is not Tesla designing something that is better than everyone else all by themselves. We are not at the liberty to reveal the name of their partner(s), but the astute readers will know exactly who we are talking about when we reference the external SerDes and photonics IP. The Tesla chip and system design is incredibly impressive, but it should not be hyped to the moon.
