VLA for manipulation

SmallVLA experiments

Want to run some experiments with SmallVLA. During the hackathon, I noticed that the model easily learns L→A (e.g. commands like "move left" or "open gripper") from just 2–3 demonstrations, but it struggles with V+L→A (e.g. "pick the blue cube" when there are two different cubes), at least on our small dataset and with 12k training steps. I want to test this properly: compare a binary cube task (where the V part does binary classification of the cube’s position and the A part just picks one of two trajectories) against a coloured binary cube task (where the L part specifies the desired cube colour). I also want to see how well the model retains its pre-training, by comparing 1) correct colour names, 2) inverted colour labels (in both train and test), and 3) a neutral attribute ("cube 1" vs "cube 2"). I just need a stable setup for these experiments, and it’s hard to do this properly at home. Training requires a large GPU (an A100), but I can pay for that.
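To make the comparison concrete, here is a minimal sketch (in Python) of how the three language-labelling schemes could be generated per episode. The `cube_color` metadata field, the `make_instruction` helper, and the instruction templates are illustrative assumptions, not part of any existing pipeline.

```python
# Sketch of the three language-labelling schemes for the coloured binary cube task.
# Assumes each episode's metadata records which cube was picked ("blue" or "red");
# the field name and the templates below are placeholders.

COLOR_MAPS = {
    "correct":  {"blue": "blue",   "red": "red"},     # 1) true colour names
    "inverted": {"blue": "red",    "red": "blue"},    # 2) swapped labels (train and test)
    "neutral":  {"blue": "cube 1", "red": "cube 2"},  # 3) colour-free attribute
}

def make_instruction(cube_color: str, scheme: str) -> str:
    """Build the language instruction for one episode under a given labelling scheme."""
    label = COLOR_MAPS[scheme][cube_color]
    if scheme == "neutral":
        return f"pick {label}"
    return f"pick the {label} cube"

# The same episode gets a different instruction under each scheme:
for scheme in COLOR_MAPS:
    print(scheme, "->", make_instruction("blue", scheme))
```

Since the inverted mapping is applied consistently in train and test, any gap between it and the correct-colour condition presumably comes from pre-trained colour grounding, which is exactly what the comparison is meant to probe.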

SmallVLA→System 1 model

Following the Hi Robot paper, the goal is to retrain or fine-tune SmallVLA into a "System 1" model—capable of following fine-grained and reactive commands, as opposed to the mid-length, non-reactive commands typically used during training. The idea is that this low-level model can be controlled by a more general-purpose system, such as GPT, a human operator (via real-time voice), or a specialized vision-language model (VLM). In this iteration, control is limited to voice only.
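As a very rough sketch of what voice-only control of a System 1 policy could look like, here is a minimal loop where a short spoken command hot-swaps the policy’s language conditioning while it keeps reacting to camera frames. `listen_for_command`, `policy`, and `robot` are placeholders, not a real API.

```python
import time

def listen_for_command():
    """Placeholder for non-blocking speech-to-text; returns a short command or None."""
    return None  # e.g. wire this to a streaming ASR such as Whisper

def control_loop(policy, robot, rate_hz: float = 30.0):
    """Run the low-level policy while hot-swapping its language conditioning."""
    command = "wait"                   # current language conditioning
    period = 1.0 / rate_hz
    while True:
        new_command = listen_for_command()
        if new_command:                # e.g. "a bit to the left", "close the gripper"
            command = new_command
        obs = robot.get_observation()  # camera frame(s) + proprioception
        action = policy.predict(obs, instruction=command)
        robot.send_action(action)
        time.sleep(period)
```

The fine-tuning goal is then to make the policy actually respond to these short, mid-trajectory commands rather than only the mid-length, non-reactive ones it was trained on.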

Steps:

System 2 (with ChatGPT)

Toy version of end-to-end product

Use the System 1 model derived from SmallVLA (previous step) + ChatGPT with a camera to build something cool that requires following user input and multiple steps to complete. Options (a rough sketch of the System 2 loop is below):

  1. Tidy bot: throwing away trash, arranging objects on the table. Similar to this demo from Hi Robot.
  2. Building some structure from toy blocks like these:

video_2025-06-20_15-36-35.mp4
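As a rough sketch of how the System 2 loop could be wired: ChatGPT looks at the current camera frame, returns one short command, and the System 1 policy executes it. The model name, the prompt, and the `camera` / `execute_command` wrappers are all placeholders.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def next_command(goal: str, jpeg_bytes: bytes) -> str:
    """Ask the vision-capable chat model for the next low-level command."""
    image_b64 = base64.b64encode(jpeg_bytes).decode()
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable chat model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Task: {goal}. Reply with one short command for the robot "
                         f"(e.g. 'pick up the red block'), or DONE if the task is finished."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip()

def run_task(goal: str, camera, execute_command):
    """Alternate System 2 planning and System 1 execution until the model says DONE."""
    while True:
        command = next_command(goal, camera.capture_jpeg())
        if command.upper().startswith("DONE"):
            break
        execute_command(command)  # hand the short command to the System 1 policy
```

For the tidy-bot option, `goal` would be something like "put all the trash in the bin", and each returned command drives one short System 1 segment.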