Robot Learning Collective: VLA

<aside> 💡

TL;DR: We are developing a compact VLA integrated into LeRobot and pre-trained on the SO100/SO101 embodiments.

</aside>

Why

<aside> 💡

No open-source plug-and-play VLA for ‘consumers’

</aside>

Pi0: JAX, or a PyTorch port with bugs; expensive to train; not pre-trained on SO101; outdated (not tokenised)
SmolVLA: questionable evaluation (no language conditioning); outdated (not tokenised); serious design flaw (frozen VLM)
ACT/Diffusion: no language conditioning
Open-VLA+: poor performance; not pre-trained on SO101; not integrated into LeRobot

Knowledge Insulation-inspired tokenised VLA (mostly a VLM plus a separate de-noising expert)
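
To make "tokenised" concrete, here is a minimal sketch of turning continuous actions into discrete tokens and back. It assumes plain uniform binning, which is swapped in here as the simplest stand-in; the notes do not say which tokeniser the project will use. The names `N_BINS`, `tokenise_actions` and `detokenise_actions`, and the action range, are hypothetical.

```python
import numpy as np

N_BINS = 256  # assumed number of action tokens per dimension (hypothetical)


def tokenise_actions(actions, low, high, n_bins=N_BINS):
    """Map continuous actions in [low, high] to integer bin indices."""
    actions = np.clip(np.asarray(actions, dtype=np.float64), low, high)
    scaled = (actions - low) / (high - low)  # now in [0, 1]
    # The upper bound maps to n_bins, so cap it at the last valid bin.
    return np.minimum((scaled * n_bins).astype(np.int64), n_bins - 1)


def detokenise_actions(tokens, low, high, n_bins=N_BINS):
    """Map bin indices back to the continuous value at each bin centre."""
    centres = (np.asarray(tokens, dtype=np.float64) + 0.5) / n_bins
    return low + centres * (high - low)


# Round trip on a normalised action range of [-1, 1] (assumed).
tokens = tokenise_actions([-1.0, 0.0, 1.0], low=-1.0, high=1.0)
recovered = detokenise_actions(tokens, low=-1.0, high=1.0)
```

The round-trip error is bounded by half a bin width, which is the precision cost the de-noising expert is meant to avoid at inference time.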

Joint training: tokenised (training only) and de-noising (for fast inference)
1. Test in sim (faster)
2. Test on a real robot (50-100 episodes per task)
Infusing the de-noising timestep at every level of the transformer
Robot state as text
System 2 and System 1 in one model (training and inference)
1. Synthetic demonstrations relabelling
2. Live audio guidance
Web data (image captioning, VQA, localisation)
Robot metadata expressed in language
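
As one possible reading of "robot state as text" and of putting robot metadata in language, the sketch below serialises joint positions into integer bins and assembles them with metadata and the task into a single prompt. The prompt layout, the bin count, the joint limits and the helper names `state_to_text` and `build_prompt` are all assumptions for illustration, not the project's actual format.

```python
def state_to_text(joint_positions, low, high, n_bins=256):
    """Discretise each joint value into an integer bin and join as text."""
    bins = []
    for value, lo, hi in zip(joint_positions, low, high):
        clipped = min(max(value, lo), hi)
        # The upper limit maps to n_bins, so cap it at the last valid bin.
        idx = min(int((clipped - lo) / (hi - lo) * n_bins), n_bins - 1)
        bins.append(str(idx))
    return " ".join(bins)


def build_prompt(task, joint_positions, low, high, robot_meta):
    """Assemble one text prompt: robot metadata, task, then state."""
    meta = ", ".join(f"{key}: {value}" for key, value in robot_meta.items())
    state = state_to_text(joint_positions, low, high)
    return f"Robot: {meta}\nTask: {task}\nState: {state}"


# Illustrative values only: normalised joint limits and a made-up metadata dict.
prompt = build_prompt(
    task="pick up the cube",
    joint_positions=[0.0, -1.0, 1.0],
    low=[-1.0, -1.0, -1.0],
    high=[1.0, 1.0, 1.0],
    robot_meta={"arm": "SO101"},
)
```

Writing state as integers keeps it inside the VLM's existing text vocabulary, so no separate state encoder is needed under this assumption.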