VLA for manipulation

Plan: August 4, 2025

  1. Debug SmolVLA

    Status: SmolVLA and ACT can both be reproduced offline, which confirms that this is a model (SmolVLA) problem. ACT, even with a chunk size of 50 steps, does not seem to have this problem. Early evidence suggests that SmolVLA fits the training set well.

  2. Collect pick-and-place dataset

  3. Make reproducible eval

    1. Inference from the default position
    2. Control starting conditions with a set of “tasks” and an image overlay

  4. Test ACT against SmolVLA on the pick-and-place task

  5. Implement a simple behavioural model from TRI (CLIP + Action head) and test it on the pick-and-place task

  6. Build a tool to edit a dataset (cut an episode, delete an episode)

  7. Collect a diverse dataset: varied positions, objects, and targets, with language instructions

  8. Train and eval models on these tasks

  9. Implement voice control
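
For step 6 (the dataset editing tool), the two operations could be sketched as below. This is only a minimal illustration under stated assumptions: the dataset is held in memory as a flat list of frames, and `Frame`, `delete_episode`, and `cut_episode` are hypothetical names, not part of LeRobot or any other library. A real tool would also have to rewrite the video files and dataset metadata.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class Frame:
    """One recorded timestep (hypothetical layout, assumed for illustration)."""
    episode_index: int
    frame_index: int
    action: Tuple[float, ...]


def delete_episode(frames: List[Frame], episode: int) -> List[Frame]:
    """Drop all frames of `episode` and shift later episode indices down by one."""
    kept = [f for f in frames if f.episode_index != episode]
    return [
        replace(f, episode_index=f.episode_index - 1)
        if f.episode_index > episode
        else f
        for f in kept
    ]


def cut_episode(frames: List[Frame], episode: int, start: int, end: int) -> List[Frame]:
    """Keep only frames [start, end) of `episode`, re-indexed from 0.

    Frames belonging to other episodes are passed through unchanged.
    """
    out = []
    for f in frames:
        if f.episode_index != episode:
            out.append(f)
        elif start <= f.frame_index < end:
            out.append(replace(f, frame_index=f.frame_index - start))
    return out


# Toy data: 3 episodes of 4 frames each, with a dummy one-dimensional action.
frames = [Frame(e, t, (0.0,)) for e in range(3) for t in range(4)]
trimmed = cut_episode(frames, episode=1, start=1, end=3)  # trim episode 1
cleaned = delete_episode(trimmed, episode=0)              # remove episode 0
```

Both functions return new lists instead of mutating the input, so an edit can be checked before it is written back to disk.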

System 1 is ready


IL Experiments

Speed up training (low priority)