ROSETTA

Constructing Code-Based Reward from Unconstrained Language Preference

1Stanford University   2Northwestern University   3University of Texas Austin   4Stanford Institute for Human-Centered AI

Abstract

Intelligent embodied agents not only need to accomplish preset tasks, but also to learn to align with individual human needs and preferences. Extracting reward signals from human language preferences allows an embodied agent to adapt through reinforcement learning. However, human language preferences are unconstrained, diverse, and dynamic, making it a major challenge to construct learnable rewards from them. We present ROSETTA, a framework that uses foundation models to ground and disambiguate unconstrained natural language preference, construct multi-stage reward functions, and implement them with code generation. Unlike prior work that requires extensive offline training to obtain general reward models or fine-grained correction on a single task, ROSETTA allows agents to adapt online to preferences that evolve and are diverse in language and content. We test ROSETTA on both short-horizon and long-horizon manipulation tasks and conduct extensive human evaluation, finding that ROSETTA outperforms SOTA baselines and achieves an 87% average success rate and 86% human satisfaction across 116 preferences.

Introduction

Human-centered embodied intelligence requires that humans be able to guide embodied agents to align with their preferences in their most naturally expressed forms. Reinforcement learning (RL)-based embodied agents have demonstrated the ability to adapt to high-level tasks, and specifying reward is more efficient than collecting extensive training data. However, humans have unique voices and changing goals. Adaptation requires handling unconstrained language and unseen goals that edit, build on, or even contradict prior goals at every step. Generating effective rewards under such conditions is an open problem.

We present ROSETTA: Reward Objectives from Spontaneous Expression Translated to Agents, a code-based reward generation pipeline that enables embodied agents to adapt to human natural language preference in a single step. It generates rewards from single statements with no constraints on language and limited constraints on content, and it accommodates chains of preferences in ongoing interactions even as they build on, edit, and contradict one another. We contribute 1) the ROSETTA method, 2) state-of-the-art results that outperform baselines, particularly on preferences that change task requirements, and 3) a thorough evaluation framework that evades the pitfalls of narrow metrics.

Methods

ROSETTA Method Figure

ROSETTA uses a three-step pipeline to turn human language preference into dense reward code (a minimal sketch follows the list):

  1. Preference Grounding (gpt-4o): the original preference is often vague, colloquial, missing context, and ungrounded in the scene, so this module uses a VLM to ground and disambiguate it.
  2. Staging (gpt-4o): ROSETTA takes the grounded preference and generates a staged reward plan, leveraging LLMs' known planning capabilities; staging promotes flexibility and reward density.
  3. Coding (o1-mini): the staged reward plan is given as a specification to the coding module, which writes the resulting dense reward code.
    • 3a. Domain knowledge assertion and verification: given the limitations of current LLMs, we provide general knowledge about physical reasoning and reward design, then reiterate it as verification questions to ensure it is incorporated.
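
A minimal sketch of how these three modules might be orchestrated, assuming a hypothetical `call_llm` helper that wraps whichever chat/VLM API backs gpt-4o and o1-mini; the prompts and function names here are illustrative, not the actual ROSETTA prompts:

```python
# Hypothetical orchestration of the three ROSETTA modules. `call_llm` is a
# stand-in for any chat/VLM client; prompts are abbreviated for illustration.

def ground_preference(preference: str, scene_image) -> str:
    """Module 1: resolve vague, colloquial, or context-dependent language
    against the objects actually present in the scene."""
    prompt = ("Ground this preference in the objects visible in the image "
              f"and restate it unambiguously:\n{preference}")
    return call_llm(model="gpt-4o", prompt=prompt, images=[scene_image])


def stage_reward(grounded_preference: str, prior_plan: str | None) -> str:
    """Module 2: plan a multi-stage reward (stage conditions plus dense
    shaping terms), possibly editing the plan from the previous preference."""
    prompt = ("Write a staged reward plan for the following preference, "
              f"updating the previous plan if one exists:\n{grounded_preference}"
              f"\nPrevious plan:\n{prior_plan}")
    return call_llm(model="gpt-4o", prompt=prompt)


def write_reward_code(staged_plan: str, domain_knowledge: str) -> str:
    """Module 3 (+3a): turn the plan into dense reward code, asserting domain
    knowledge up front and re-checking it with verification questions."""
    prompt = (f"{domain_knowledge}\n\nBefore coding, answer: which quantities "
              "are observable from state, and what range should each stage's "
              f"reward take?\n\nNow implement this plan as reward code:\n{staged_plan}")
    return call_llm(model="o1-mini", prompt=prompt)
```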

This pipeline results in a set of variants. Some of these are trained, then shown to the original annotator, who chooses their favorite. ROSETTA is now ready for their next preference, which may build on the previous ones.
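
Each variant is a dense, staged reward function over environment state. The example below is purely illustrative of what one variant might look like for a hypothetical "place the mug near the plate, but set it down gently" preference; the state fields, thresholds, and weights are invented, not taken from ROSETTA's actual outputs:

```python
import numpy as np

# Illustrative reward variant for a hypothetical preference
# ("place the mug near the plate, but set it down gently").
# All state fields, thresholds, and weights are invented.
def compute_dense_reward(state) -> float:
    grasp_dist = np.linalg.norm(state.gripper_pos - state.mug_pos)
    place_dist = np.linalg.norm(state.mug_pos - state.plate_pos)
    impact_vel = np.linalg.norm(state.mug_vel)

    if not state.is_grasping_mug:           # Stage 1: reach and grasp the mug
        return 1.0 - np.tanh(5.0 * grasp_dist)
    if place_dist > 0.05:                   # Stage 2: carry it toward the plate
        return 1.0 + (1.0 - np.tanh(5.0 * place_dist))
    # Stage 3: release gently (penalize impact velocity)
    return 2.0 + (1.0 - np.tanh(10.0 * impact_vel))
```

Each stage's reward is offset so that later stages dominate earlier ones, keeping the signal dense within a stage while preserving the overall ordering.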

Preference Data

ROSETTA preferences are collected in sequential interactions of up to four steps and have many diverse attributes. Here we visualize the attributes of a subset of preferences and show an example of a chain that includes corrective, task-changing, colloquial, verbose, and context-dependent preferences.

The main set contains 116 preferences, with an additional 30 difficult preferences used for comparison against baselines and ablations. Within one preference chain the annotator is held constant, testing ROSETTA's ability to handle true subjectivity and narrow preferences.
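
To make the interaction structure concrete, a preference chain can be represented as an ordered list of statements from one annotator, each paired with the reward variant they eventually selected. The schema below is an illustrative sketch, not the dataset's actual format:

```python
from dataclasses import dataclass, field

# Illustrative schema for a preference chain (not the dataset's actual format).
# One annotator issues up to four sequential preferences; each step records the
# raw statement, its attributes, and the reward variant the annotator selected.
@dataclass
class PreferenceStep:
    raw_preference: str                       # unconstrained language, verbatim
    attributes: list[str]                     # e.g. ["corrective", "colloquial"]
    selected_reward_code: str | None = None   # variant chosen after training

@dataclass
class PreferenceChain:
    annotator_id: str                         # held constant within a chain
    environment: str
    steps: list[PreferenceStep] = field(default_factory=list)
```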

Experimental Setup

  • Environments: task-agnostic; 2 short-horizon (continuous control) and 3 long-horizon (action primitives), built in ManiSkill 3

  • Training: PPO (continuous control), SAC (MAPLE action primitives); see the sketch after this list for how a generated reward slots into training

  • Baselines: Eureka, Text2Reward

  • Ablations: no-grounding, no-staging, no-followup
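
As referenced above, a generated reward plugs into standard RL training by replacing the environment's reward. The wrapper below is a sketch using a gymnasium-style interface; the actual ManiSkill 3 / MAPLE integration and the PPO/SAC implementations are not shown, and `get_state` is a hypothetical state accessor:

```python
import gymnasium as gym

# Sketch: drop LLM-generated reward code into a gymnasium-style environment.
# The ManiSkill 3 / MAPLE specifics are omitted; `get_state` is hypothetical.
class GeneratedRewardWrapper(gym.Wrapper):
    def __init__(self, env, reward_code: str):
        super().__init__(env)
        namespace: dict = {}
        exec(reward_code, namespace)          # trust boundary: code written by the LLM
        self._reward_fn = namespace["compute_dense_reward"]

    def step(self, action):
        obs, _, terminated, truncated, info = self.env.step(action)
        reward = self._reward_fn(self.env.unwrapped.get_state())  # hypothetical accessor
        return obs, reward, terminated, truncated, info
```

In this sketch, only the generated reward reaches the RL algorithm; the language preference itself never enters training.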

Long Horizon Action Primitives: Sim and Real

Long-horizon action primitives setting in simulation.
Long-horizon action primitives setting on a real robot. Observations are state-based, with object poses estimated by FoundationPose. We do not observe a sim2real gap: given a correct state estimate, all action primitive-based policies trained in simulation transfer to the real robot without degradation.
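
A sketch of the real-robot loop this implies, with every helper (`estimate_object_poses`, `build_state`, `execute_primitive`) hypothetical rather than part of any released API:

```python
# Sketch of the sim-to-real transfer loop described above. Object poses are
# estimated from the camera (FoundationPose-style), assembled into the same
# state vector used in simulation, and fed to the unchanged sim-trained policy.
# All helper functions here are hypothetical.
def real_robot_step(policy, camera_frame):
    poses = estimate_object_poses(camera_frame)   # pose estimation on the real scene
    state = build_state(poses)                    # match the simulator's state layout
    primitive, params = policy(state)             # policy trained purely in simulation
    execute_primitive(primitive, params)          # hand off to the robot controller
```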

Evaluation Suite


Alignment

Satisfaction measured through a human survey. We consider the satisfaction of the original preference annotator to be the gold-standard measure of ROSETTA's success. To that end, we measure it through binary and ternary descriptive questions such as "Did the robot incorporate your preference that wasn't met in the previous video?", and through binary and Likert questions that quantify satisfaction.

Optimizability

Task success rate on the generated reward function. To ensure that ROSETTA reward functions are well-behaved and produce desirable behaviors intentionally rather than by happenstance, we also measure standard task success rate. In our setting, however, task success rate is not informative about whether the goal matches the preference, only about whether the algorithm can optimize for the goal. We therefore consider it a measure of optimizability rather than success.
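
Concretely, optimizability reduces to the success rate of the policy trained on a generated reward, evaluated against the environment's own success criterion; the `rollout` helper below is hypothetical:

```python
import numpy as np

# Optimizability as a plain success rate over evaluation episodes.
# `rollout` is a hypothetical helper that runs one episode and returns the
# environment's final info dict, including its boolean "success" flag.
def optimizability(policy, env, n_episodes: int = 100) -> float:
    successes = [rollout(policy, env)["success"] for _ in range(n_episodes)]
    return float(np.mean(successes))
```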

Semantic Match

Expert-evaluated semantic alignment between the preference and the reward code text. This metric evaluates ROSETTA's ability to match the semantics of the preference, making it complementary to the optimizability metric. We evaluate the text directly to bypass 1) noise from measuring human satisfaction, 2) noise from communicating a policy via rollout videos, and 3) undesirable behaviors that stem from small RL shaping details or training quirks, which matter in practice but are unrelated to high-level semantics.

Results

ROSETTA vs. Baselines


Eureka and Text2Reward suffer in cases where the task requirements change, which is 80% of preferences. The effect is clearest for optimizability, since Eureka and Text2Reward are explicitly unable to change task requirements. It also holds for alignment, however, even though a good-looking rollout scores well regardless of numeric task success rate. The baselines perform as well as or better than ROSETTA on corrective preferences (25% of all preferences), where task requirements don't change.

ROSETTA vs. Ablations


Ablations suffer much more on alignment than on optimizability. The non-coding modules that are ablated out are typically most responsible for extracting semantic meaning; the coding module realizes that meaning but is less capable of interpretation. The ablations match ROSETTA on optimizability because they retain its coding power. In fact, the no-grounding ablation performs particularly well there, often because ungrounded preferences are unclear enough that the later modules ignore them entirely and reproduce the old reward function, which may be easier to optimize. The result is quite optimizable while being unrelated to the actual current preference.