I want to run some experiments with SmolVLA. During the hackathon, I noticed that the model easily learns L→A (e.g. commands like "move left" or "open gripper") from just 2–3 demonstrations, but it struggles with V+L→A (e.g. "pick the blue cube" when there are two different cubes), at least on our small dataset and with 12k training steps. I want to test this properly: compare a binary cube task (where the V part does binary classification of the cube's position and the A part just picks one of two trajectories) against a coloured binary cube task (where the L part specifies the desired cube colour). I also want to see how well the model retains its pre-training: I'll compare 1) correct colour names, 2) inverted colour labels (in both train and test), and 3) a neutral attribute ("cube 1" vs "cube 2"). I just need a stable setup for these experiments, and it's hard to do this properly at home: training requires a large GPU (an A100), but I can pay for that.
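To make the three language conditions concrete, here is a minimal sketch of how I'd relabel the instruction strings in the demonstration episodes. The colour names, function name, and per-episode "true colour" field are placeholders for illustration, not the actual LeRobot/SmolVLA dataset format.

```python
# Sketch: build the language instruction for one episode under each condition.
# Assumes each episode records the ground-truth cube colour ("red"/"blue");
# everything here is a placeholder, not the real dataset schema.

INVERTED = {"red": "blue", "blue": "red"}      # condition 2: swapped colour labels
NEUTRAL = {"red": "cube 1", "blue": "cube 2"}  # condition 3: colour-free attribute

def make_instruction(true_colour: str, condition: str) -> str:
    """Return the instruction string for one episode."""
    if condition == "correct":
        label = true_colour                    # condition 1: correct colour names
    elif condition == "inverted":
        label = INVERTED[true_colour]          # same swap applied in train and test
    elif condition == "neutral":
        return f"pick {NEUTRAL[true_colour]}"
    else:
        raise ValueError(f"unknown condition: {condition}")
    return f"pick the {label} cube"

if __name__ == "__main__":
    for cond in ("correct", "inverted", "neutral"):
        print(cond, "->", make_instruction("blue", cond))
```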
Following the Hi Robot paper, the goal is to retrain or fine-tune SmolVLA into a "System 1" model: one capable of following fine-grained, reactive commands, as opposed to the mid-length, non-reactive commands typically used during training. The idea is that this low-level model can then be controlled by a more general-purpose system, such as GPT, a human operator (via real-time voice), or a specialized vision-language model (VLM). In this iteration, control is limited to voice only.
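A minimal sketch of that control flow with voice as the only high-level channel; `policy`, `camera`, `mic`, and `robot` are hypothetical objects standing in for whatever interfaces end up being used, not an existing API:

```python
# Sketch of the voice -> System-1 control loop; every object here
# (policy, camera, mic, robot) is a placeholder, not an existing API.
import time

def control_loop(policy, camera, mic, robot, hz: float = 10.0):
    """Run the low-level policy, re-conditioning on the latest voice command."""
    command = "wait"                          # default until the operator speaks
    period = 1.0 / hz
    while True:
        new_command = mic.poll_transcript()   # non-blocking; returns None on silence
        if new_command:
            command = new_command             # reactive: commands can change mid-episode
        obs = {"image": camera.read(), "instruction": command}
        action = policy.select_action(obs)    # System-1 model fine-tuned from SmolVLA
        robot.apply(action)
        time.sleep(period)
```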
Steps:
Use the System-1 model derived from SmolVLA (previous step) plus ChatGPT with a camera to build something cool that requires following user input and completing multiple steps. Options: