Intelligent embodied agents not only need to accomplish preset tasks, but also to learn to align with individual human needs and preferences. Extracting reward signals from human language preferences allows an embodied agent to adapt through reinforcement learning. However, human language preferences are unconstrained, diverse, and dynamic, making it a major challenge to construct learnable rewards from them. We present ROSETTA, a framework that uses foundation models to ground and disambiguate unconstrained natural language preferences, construct multi-stage reward functions, and implement them with code generation. Unlike prior work that requires extensive offline training to obtain general reward models or fine-grained correction on a single task, ROSETTA allows agents to adapt online to preferences that evolve and are diverse in language and content. We test ROSETTA on both short-horizon and long-horizon manipulation tasks and conduct extensive human evaluation, finding that ROSETTA outperforms SOTA baselines and achieves 87% average success rate and 86% human satisfaction across 116 preferences.
Human-centered embodied intelligence requires that humans be able to guide embodied agents to align with their preferences in their most naturally expressed forms. Reinforcement learning (RL)-based embodied agents have demonstrated the ability to adapt to high-level tasks, and specifying reward is more efficient than collecting extensive training data. However, humans have unique voices and changing goals. Adaptation requires handling unconstrained language and unseen goals that edit, build on, or even contradict prior goals at every step. Generating effective rewards under such conditions is an open problem.
We present ROSETTA: Reward Objectives from Spontaneous Expression Translated to Agents, a code-based reward generation pipeline that enables embodied agents to adapt to human natural language preferences in a single step. It generates rewards from single statements with no constraints on language and limited constraints on content, and it accommodates chains of preferences in ongoing interactions even as they build on, edit, and contradict each other. We contribute 1) the ROSETTA method, 2) state-of-the-art results that outperform baselines, particularly on preferences that change task requirements, and 3) a thorough evaluation framework that avoids the pitfalls of narrow metrics.
ROSETTA uses a three-step pipeline to turn human language preference into dense reward code.
1) Grounding (gpt-4o): the original preference is often vague, colloquial, missing context, and lacking grounding in the scene, so this module uses a VLM to ground and disambiguate it.
2) Reward planning (gpt-4o): ROSETTA takes the grounded preference and generates a staged reward plan. This takes advantage of LLMs' known planning capabilities to describe a staged reward, which promotes flexibility and density.
3) Reward coding (o1-mini): the staged reward plan is given as a specification to the coding module, which writes the resulting dense reward code.
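To make the coding module's output concrete, below is a minimal hand-written sketch of the kind of staged dense reward code it might emit. Everything here is an illustrative assumption rather than actual ROSETTA output: the hypothetical preference ("put the cube in the drawer, then close the drawer gently"), the state inputs, the stage thresholds, and the weights.

```python
# Illustrative sketch only: a hand-written example of staged dense reward code,
# in the spirit of what the coding module might emit for a hypothetical preference
# like "put the cube in the drawer, then close the drawer gently".
# State layout, thresholds, and weights are all assumptions.
import numpy as np

def compute_dense_reward(tcp_pos, cube_pos, drawer_pos, drawer_open_frac, drawer_vel):
    """Staged dense reward: each stage gates the next so progress accumulates."""
    reward = 0.0

    # Stage 1: reach the cube (dense shaping via bounded distance penalty).
    reach_dist = np.linalg.norm(tcp_pos - cube_pos)
    reward += 1.0 - np.tanh(5.0 * reach_dist)

    # Stage 2: once the gripper is at the cube, move the cube toward the drawer.
    if reach_dist < 0.03:
        place_dist = np.linalg.norm(cube_pos - drawer_pos)
        reward += 1.0 + (1.0 - np.tanh(5.0 * place_dist))

        # Stage 3: cube is inside the drawer -> close it, rewarding low closing
        # speed to reflect the "gently" part of the preference.
        if place_dist < 0.05:
            reward += 1.0 + (1.0 - drawer_open_frac)
            reward += 0.5 * (1.0 - np.tanh(10.0 * abs(drawer_vel)))

    return reward
```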
This pipeline results in a set of reward variants. Some of these are trained, and the resulting rollouts are shown to the original annotator, who chooses their favorite. ROSETTA is then ready for their next preference, which may build on the previous ones.
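As a rough sketch of this interaction loop, one preference step might look like the following, where the module interfaces (ground, plan, write_reward_code, train, ask_annotator) are placeholder callables, not ROSETTA's actual API.

```python
# Rough sketch of one ROSETTA preference step. The callables passed in stand for
# the grounding, planning, coding, training, and annotator-query components; their
# names and signatures are placeholders, not ROSETTA's actual API.
def rosetta_step(preference, scene_image, prior_reward_code,
                 ground, plan, write_reward_code, train, ask_annotator,
                 n_variants=3):
    grounded = ground(preference, scene_image)                  # VLM grounding/disambiguation
    plans = [plan(grounded, prior_reward_code) for _ in range(n_variants)]
    variants = [write_reward_code(p) for p in plans]            # dense reward code per plan
    policies = [train(code) for code in variants]               # e.g., PPO or SAC
    favorite = ask_annotator(policies)                          # index of the annotator's pick
    return variants[favorite]                                   # carried into the next preference
```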
ROSETTA preferences are collected in sequential interactions for up to four steps, and have many diverse attributes. Here we visualize the attributes of a subset of preferences, and see an example of a chain that includes corrective, task-changing, colloquial, verbose, and context-dependent preferences.
The main set contains 116 preferences, with an additional 30 difficult preferences used for comparison to baselines and ablations. Within one chain of preferences, the annotator is held constant, thereby testing ROSETTA's ability to handle true subjectivity and narrow preferences.
Environments: task-agnostic; 2 short-horizon (continuous control) and 3 long-horizon (primitive-based) tasks in ManiSkill 3
Training: PPO (continuous control), SAC (MAPLE primitives); see the training sketch after this list
Baselines: Eureka, Text2Reward
Ablations: no-grounding, no-staging, no-followup
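As a generic illustration of how a generated reward function can be dropped into RL training, here is a minimal sketch using a Gymnasium wrapper and Stable-Baselines3 PPO. The actual experiments use ManiSkill 3 tasks; the Pendulum-v1 environment and the lambda reward below are stand-ins, not part of ROSETTA.

```python
# Minimal sketch (assumptions: a Gymnasium-style environment and Stable-Baselines3 PPO;
# the experiments themselves use ManiSkill 3 with PPO/SAC, and the reward function
# would be ROSETTA-generated code rather than the hand-written stub here).
import gymnasium as gym

class GeneratedRewardWrapper(gym.Wrapper):
    """Replaces the environment's reward with a generated dense reward function."""

    def __init__(self, env, reward_fn):
        super().__init__(env)
        self.reward_fn = reward_fn

    def step(self, action):
        obs, _, terminated, truncated, info = self.env.step(action)
        reward = self.reward_fn(obs, info)   # generated code scores the transition
        return obs, reward, terminated, truncated, info


if __name__ == "__main__":
    from stable_baselines3 import PPO

    env = GeneratedRewardWrapper(
        gym.make("Pendulum-v1"),                     # stand-in for a ManiSkill 3 task
        reward_fn=lambda obs, info: -abs(obs[2]),    # stand-in for a generated reward
    )
    model = PPO("MlpPolicy", env, verbose=0)
    model.learn(total_timesteps=10_000)
```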
Eureka and Text2Reward suffer in cases where the task requirements change, which covers 80% of preferences. The effect is clearest for optimizability, since Eureka and Text2Reward are explicitly unable to change task requirements. However, it also holds for alignment, even though a good-looking rollout scores well regardless of numeric task success rate. The baselines perform as well as or better on corrective preferences (25% of all preferences), where task requirements don't change.
The ablations suffer much more on alignment than on optimizability. This is because the non-coding modules that are ablated out are typically most responsible for extracting semantic meaning; the coding module realizes that meaning but is less capable of interpretation. The ablations perform as well on optimizability because they retain ROSETTA's coding power. In fact, the no-grounding ablation in particular performs very well on optimizability. This is often because ungrounded preferences are unclear enough that the later modules ignore them entirely and output the old reward function, which can be easier to optimize. The result is therefore quite optimizable while being unrelated to the actual current preference.