PersONAL: Towards a Comprehensive Benchmark for Personalized Embodied Agents

As embodied agents become more deeply integrated into daily life, from household robots to virtual teammates in collaborative software, the question of how to measure the quality of their personalization becomes critical. PersONAL aims to fill that gap with a rigorous, multi-dimensional benchmark that captures not just task performance but also how well an agent tailors its behavior to individual users, contexts, and evolving preferences. The goal is to give researchers and developers a common yardstick for comparing approaches, identifying gaps, and accelerating progress toward truly personalized, safe, and capable embodied agents.

Why a comprehensive benchmark matters

Personalization in embodied systems is not a single knob to tweak. It is a composite of identity modeling, preference inference, memory management, and adaptable decision-making, all of which must operate while maintaining transparency and user trust. A robust benchmark must simulate real-world variability: diverse user profiles, long-term interactions, multimodal communication, and the social dynamics of shared spaces. PersONAL recognizes these complexities and introduces a structured framework to evaluate both micro-tasks (such as selecting an appropriate action in a given moment) and macro-tasks (such as maintaining a coherent long-term relationship with a user).

What PersONAL covers

Personalization targets: user preferences, goals, routines, and personality signals that influence how the agent should respond, assist, or collaborate.
Embodiment contexts: diverse environments (home, office, public spaces) and embodiments (physical robot, VR/AR avatar, or social robot) that shape interaction dynamics.
Interaction modalities: natural language, gesture, gaze, and other non-verbal cues that signal the user's intent and the need for adaptation.
Safety and privacy considerations: privacy-preserving inference, data minimization, and transparent user control over personalization data.
Evaluation regimes: a mix of simulated scenarios and real-user studies, with both objective metrics and user-reported outcomes.

“A benchmark is not a verdict on a single system; it’s a shared playground where diverse ideas can be tested, compared, and improved in a reproducible way.”

Core components of the benchmark structure

PersONAL is organized around three interlocking layers that guide researchers from conception to evaluation:

Scenario suites that stress personalization across daily tasks, social interactions, and long-horizon planning in varied environments.
Evaluation pipeline combining automated testing with human-in-the-loop feedback to capture both objective performance and subjective experience.
Baseline and protocol library offering standard implementations and data formats to ensure fair, apples-to-apples comparisons.
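
To make the idea of a shared data format concrete, here is a minimal sketch of what a scenario record might look like. The class name ScenarioRecord and every field in it are hypothetical, chosen only to mirror the coverage areas listed earlier; this is not the benchmark's official schema.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import Dict, List


@dataclass
class ScenarioRecord:
    """Hypothetical record for one scenario (illustrative, not an official schema)."""
    scenario_id: str
    environment: str   # e.g. "home", "office", "public"
    embodiment: str    # e.g. "physical_robot", "vr_avatar"
    modalities: List[str] = field(default_factory=list)
    user_preferences: Dict[str, str] = field(default_factory=dict)

    def to_json(self) -> str:
        # Sorted keys give a stable text form that is easy to diff across teams.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "ScenarioRecord":
        return cls(**json.loads(text))


record = ScenarioRecord(
    scenario_id="demo-001",
    environment="home",
    embodiment="physical_robot",
    modalities=["language", "gaze"],
    user_preferences={"morning_drink": "tea"},
)
restored = ScenarioRecord.from_json(record.to_json())
print(restored == record)
```

A record that round-trips through plain JSON without loss is one simple way to let different teams exchange scenarios and compare results on identical inputs.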

Evaluation metrics you’ll find in PersONAL

Personalization accuracy — alignment between inferred user preferences and agent actions over time.
Task success with personalization — how personalization contributes to objective task outcomes.
Adaptation speed — how quickly the agent updates its model after a change in user behavior.
User satisfaction — subjective ratings on usefulness, comfort, and trust.
Robustness — resilience to noisy modalities, sudden preference shifts, or context changes.
Privacy and safety — adherence to privacy policies, data minimization, and risk scoring for sensitive inferences.
Compute efficiency — resource use and latency under real-time constraints.
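
Two of the metrics above, personalization accuracy and adaptation speed, can be illustrated with a small sketch. The definitions below are assumptions made for illustration: accuracy is taken as the fraction of steps where the agent's action matches the user's preferred action, and adaptation speed as the number of steps after a preference change before the agent starts a run of consecutive matches. The names Step, personalization_accuracy, and adaptation_speed are hypothetical, and the benchmark's own definitions may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    """One interaction step in a hypothetical evaluation log."""
    preferred_action: str  # what the user actually wanted (ground truth)
    agent_action: str      # what the agent did


def personalization_accuracy(log: List[Step]) -> float:
    """Fraction of steps where the agent's action matched the preference."""
    if not log:
        return 0.0
    matches = sum(1 for s in log if s.agent_action == s.preferred_action)
    return matches / len(log)


def adaptation_speed(log: List[Step], change_index: int,
                     window: int = 3) -> Optional[int]:
    """Steps elapsed after a preference change before the agent begins a run
    of `window` consecutive matches; None if no such run occurs."""
    streak = 0
    for offset, step in enumerate(log[change_index:]):
        if step.agent_action == step.preferred_action:
            streak += 1
        else:
            streak = 0
        if streak == window:
            return offset - window + 1
    return None


log = [
    Step("tea", "tea"),
    Step("tea", "tea"),
    Step("coffee", "tea"),     # preference shifts at index 2
    Step("coffee", "tea"),
    Step("coffee", "coffee"),
    Step("coffee", "coffee"),
    Step("coffee", "coffee"),
]
print(round(personalization_accuracy(log), 3))  # 5 of 7 steps match
print(adaptation_speed(log, change_index=2))    # 2 mismatched steps before the run
```

Requiring a run of consecutive matches, rather than a single match, keeps a lucky guess from counting as adaptation; the window size is a tunable assumption.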

Data, privacy, and ethical guardrails

PersONAL emphasizes principled data handling: decoupled user profiles, on-device personalization where possible, and clear consent flows. The benchmark encourages representations that protect sensitive signals while still enabling meaningful personalization. Researchers are invited to publish datasets and protocols that are designed with ethical considerations at the forefront, ensuring that advances in personalization do not come at the expense of user rights or safety.
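
As one possible reading of these guardrails, the sketch below keys a profile by a pseudonymous identifier and keeps only an explicit allow-list of fields. It assumes that "decoupled" means separating stored preferences from the raw user identity; the field names, the ALLOWED_FIELDS set, and both helper functions are hypothetical and are not part of the benchmark.

```python
import hashlib
import hmac
from typing import Dict

# Hypothetical allow-list: only these fields are retained for personalization.
ALLOWED_FIELDS = {"preferred_drink", "wake_time"}


def pseudonymous_id(user_id: str, secret_key: bytes) -> str:
    """Derive a stable pseudonym with a keyed hash, so stored profiles are
    not indexed by the raw user identifier."""
    return hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def minimize(raw_profile: Dict[str, str]) -> Dict[str, str]:
    """Data minimization: drop every field that is not explicitly allowed."""
    return {k: v for k, v in raw_profile.items() if k in ALLOWED_FIELDS}


raw = {
    "preferred_drink": "tea",
    "wake_time": "07:00",
    "home_address": "redacted in this example",
}
key = b"example-key-kept-on-device"  # placeholder; a real key needs secure storage
stored = {pseudonymous_id("alice", key): minimize(raw)}
print(stored)
```

An allow-list fails closed: a newly collected field is excluded from the stored profile until someone deliberately adds it.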

Impact: who benefits and why it matters

For researchers, PersONAL provides a transparent, reproducible path to demonstrate advances in personalization that go beyond short-term task completion. For industry practitioners, it offers a practical set of scenarios and metrics to guide product decisions, from user experience design to privacy-by-default features. For users, a standardized evaluation helps ensure that future embodied agents are more attuned to individual needs without sacrificing safety or trust.

Getting involved

Review the scenario catalogs and start prototyping personalization strategies within the benchmark’s guidelines.
Contribute baselines and evaluation scripts to foster fair comparisons across teams.
Share insights from user studies to inform best practices in interaction design and privacy controls.

PersONAL is designed as a collaborative, evolving standard that invites researchers and practitioners to iterate toward embodied agents that truly understand and adapt to the people they serve, in ways that feel natural, respectful, and reliable.