Peter Chen joins the What’s Your Problem Podcast for a conversation with Jacob Goldstein, former host of NPR’s Planet Money. Their discussion traces Peter’s journey from OpenAI to the founding of Covariant, shedding light on the evolution and transformative impact of Robotics Foundation Models in the physical world. Here, we distill key insights from their conversation, offering a glimpse into the dynamic intersection of artificial intelligence and robotics.

Listen: Full Episode
Early conviction in the power of foundation models

Peter Chen: That is really the magic of foundation models. It is the thing that was not obvious to people outside of OpenAI for a very long time. Because a lot of the founding team at Covariant came from OpenAI, we saw that insight earlier, and it allowed us to start Covariant and build foundation models for robotics well before other people even believed in the approach.

Jacob Goldstein: When did you personally have this realization? You're not the only one to have it, but when did you first see the power of foundation models?

Peter Chen: There are two parts to it. The first is that early on at OpenAI, sometime in 2016, we believed in the idea of scaling: really scaling up the models and scaling up the data sets. And you see models getting increasingly smarter as you scale them up.

The other one is that we had conviction in a foundation model for robotics probably earlier than in a foundation model for language. Here is the key thing: if you think about building a large language model that tries to compress the whole internet of knowledge, you have to compress many things that are not really related to each other. Maybe you are browsing Wikipedia and have to recite the material composition of soil on the moon, and you also need to learn how to play chess. There is nothing in common between these two things; they are two separate bodies of knowledge, but you are asking one AI model to learn all of them.

Grounding AI in the physical world

Peter Chen: It makes a lot of sense to us that there is only one physical world. Even when you have many different robots that need to do different things in different factories and different warehouses, they are all still interacting with the same physical world. So building a foundation model for robotics has this amazing property of grounding: no matter what tasks you ask the model to learn, it is learning the same set of physics.

Jacob Goldstein: The grounding is the literal ground: the model has to understand how the physical world works, and that if you drop a thing, it will fall.

Peter Chen: Exactly. If something is deformable and you push it, it will give a little. If something is rigid, it will slide. If something can roll, it will roll away. These things are the same no matter where you are on Earth and what type of robot body you are using. And if you can build one single foundation model that can learn from all of this different data, it will be incredibly powerful.

Jacob Goldstein: So just to state it clearly, you're at OpenAI, you're seeing the power of foundation models, and you decide to leave and start the company that is now Covariant. What are you setting out to do when you start the company?

Peter Chen: So when we started Covariant, we had this really strong conviction that there should be a future with a lot of autonomous robots doing all the things that are repetitive, injury-prone, and dangerous. That could revolutionize the physical world and make it a lot more abundant. To enable that future of autonomous robots, you need really smart AI, and because of that insight we believed the AI had to be a foundation model: a single model that learns from all these different robots together, so they all become smarter together.

Jacob Goldstein: So the basic idea, the dream, is to build one AI foundation model for robots. In the same way that you can ask ChatGPT anything in language and it can answer you in language about anything, you have a model that you can make the “brain” of any robot, and that robot can see the world, move, pick things up, and behave in the world.

Collecting data in the real world

Peter Chen: Exactly. And there's one key problem: unlike a foundation model for language, where you can scrape the whole internet of text as your pre-training data, there's nothing equivalent in the case of robotics. There are some images online, and there are some YouTube videos online, but by and large they don't give you the same type of data, data in the form of robots interacting with the world. And the big problem is that there just are not that many robots doing interesting things in the world.

And a big chunk of what we set out to do as a company comes from recognizing that we need to build a foundation model for robotics. To build a foundation model for robotics, you need large data sets.

And to create large data sets, you have to have robots that are creating value for customers in production at scale, because if you're only collecting data in your lab, you can only collect so much. So for the last six years, Covariant has largely focused on building autonomous robot systems that work really well for customers, doing interesting things at a level of autonomy and reliability that has not been reached before.

Building robots with human-like intuition of the physical world

Jacob Goldstein: How did you solve it? How does it work?

Peter Chen: At the end of the day, the way that it operates is very similar to how a human vision system works.

We have two eyes, and by looking at something with both of them we can figure out the depth of an item, because our two eyes can triangulate a single point in the 3D world. It's the same kind of mechanism here. You can just use multiple regular cameras, like the one you have on your iPhone, and by having multiple of them, you give the neural network the ability to triangulate what's happening.
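
To make the triangulation Peter describes concrete, here is a minimal sketch of classic two-view triangulation (the linear DLT method) in Python with NumPy. The camera intrinsics, poses, and example point below are made up for illustration; per the conversation, Covariant's system uses a neural network to triangulate rather than this closed-form solver, but the underlying geometry is the same.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Estimate a 3D point from its pixel coordinates in two views
    using the linear DLT method. P1, P2 are 3x4 projection matrices;
    x1, x2 are (u, v) pixel coordinates of the same physical point."""
    # Each view contributes two linear constraints on the homogeneous
    # point X: u * P[2] - P[0] and v * P[2] - P[1] must vanish at X.
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # Least-squares solution: the right singular vector associated
    # with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # de-homogenize

# Two hypothetical cameras with identical intrinsics; the second is
# shifted 0.1 m along x -- a stereo baseline, like a pair of eyes.
K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.1], [0.0], [0.0]])])

# A point 1 m in front of the cameras, projected into each view.
X_true = np.array([0.05, 0.02, 1.0, 1.0])
u1, u2 = P1 @ X_true, P2 @ X_true
print(triangulate(P1, P2, u1[:2] / u1[2], u2[:2] / u2[2]))
# ~ [0.05, 0.02, 1.0]
```

With more than two cameras you simply stack additional constraint rows into A, which is why adding views makes the depth estimate more robust.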

Jacob Goldstein: Just the way our two eyes allow us to see depth, essentially. And there are other things: the arm is going to be picking up items of all different weights. Presumably there could be a shirt in a plastic bag; some things are rigid and some things are deformable.

Peter Chen: Yeah. So what we have found is that if you just have a visual understanding of the world that is as robust as a human's, you go a really long way. When I pick up a cup, I'm not doing a lot of calculation about how the force from my fingers translates to the cup to make sure it holds.

Jacob Goldstein: It's part of the miracle of being a person though. It's like a really hard problem.

Peter Chen: It's a very hard problem, but your brain subconsciously solves it for you. Even if your fingers are numb, you can still do this perfectly, just because you have acquired such an intuitive understanding of interacting with the physical world.

Jacob Goldstein: So basically vision gets you most of the way there.

Peter Chen: I would say vision, plus the ability to intuit physics from the visual input you get.

Jacob Goldstein: That second one is wild though. Intuit. And I mean, “intuit” in the context of AI means to make inferences.

Peter Chen: Yeah. By intuit, I mean it's not doing some detailed physical calculation.

Jacob Goldstein: It's not doing math.

Peter Chen: Yeah. It's doing a kind of high-level pattern matching: based on how these things look, this is likely to be a successful way to approach the item and interact with it.
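
As a sketch of what “high-level pattern matching” can look like in code, here is a toy grasp-success scorer in PyTorch. Every name and layer below is a hypothetical illustration, not Covariant's architecture: the point is only that the network maps appearance (plus a candidate grasp location) directly to a success probability, with no explicit force calculation anywhere.

```python
import torch
import torch.nn as nn

class GraspScorer(nn.Module):
    """Toy 'appearance in, success probability out' network.

    Illustrative only. No force or friction equations: just learned
    pattern matching from pixels to a probability a grasp will work.
    """

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, 1)

    def forward(self, rgb, grasp_map):
        # rgb: (B, 3, H, W) image of the scene.
        # grasp_map: (B, 1, H, W) heatmap marking a candidate grasp point.
        x = torch.cat([rgb, grasp_map], dim=1)  # stack as a 4th channel
        h = self.features(x).flatten(1)
        return torch.sigmoid(self.head(h))      # P(grasp succeeds)

# Score 8 candidate grasps (dummy data) and pick the most promising one.
model = GraspScorer()
rgb = torch.rand(8, 3, 96, 96)
candidates = torch.rand(8, 1, 96, 96)
scores = model(rgb, candidates)
print("best candidate:", scores.argmax().item())
```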

What’s next?

Peter Chen: So what is next immediately is very exciting. We are now getting to a place, and by we I mean the AI community, where we have enough computational power and algorithmic and modeling understanding to extract a lot out of data.

Jacob Goldstein: So from any given amount of data, you can get more out of it. That's exciting for you because data is such a constraint on what you're trying to do.

Peter Chen: Exactly. We are building up these large robotics data sets, and by tapping into a lot of these advances, we can get even more out of the data sets we're building. That allows us to build smarter robots and a better Robotics Foundation Model, one that performs better at the current tasks it's supposed to do and also powers more robots.
