SENIOR: Efficient Query Selection and Preference-Guided Exploration in Preference-based Reinforcement Learning

Hexian Ni1, 2, Tao Lu1†, Haoyuan Hu1, 2, Yinghao Cai1, Shuo Wang1
1State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, 2School of Artificial Intelligence, University of Chinese Academy of Sciences

†Corresponding author

Abstract

Preference-based Reinforcement Learning (PbRL) methods provide a way to avoid reward engineering by learning reward models from human preferences. However, poor feedback and sample efficiency remain obstacles that hinder the application of PbRL. In this paper, we present a novel efficient query selection and preference-guided exploration method, called SENIOR, which selects meaningful and easy-to-compare behavior segment pairs to improve human feedback efficiency and accelerates policy learning with designed preference-guided intrinsic rewards. Our key idea is twofold: (1) We design a Motion-Distinction-based Selection scheme (MDS), which selects segment pairs with apparent motion and different directions through kernel density estimation of states; such pairs are more task-related and easier for humans to label with preferences. (2) We propose a novel Preference-Guided Exploration method (PGE), which encourages exploration toward states with high preference and low visitation and continuously guides the agent to valuable samples. The synergy between the two mechanisms significantly accelerates reward and policy learning. Our experiments show that SENIOR outperforms five other existing methods in both human feedback efficiency and policy convergence speed on six complex robot manipulation tasks in simulation and four in the real world.

Method

Figure 1: Illustration of SENIOR. PGE assigns high intrinsic rewards to rarely visited, human-preferred states to encourage efficient exploration through a hybrid-experience policy update, which supplies query selection with more valuable, task-relevant segments. MDS selects easily comparable and meaningful segment pairs with apparent motion distinction, yielding high-quality labels that facilitate reward learning and give the agent accurate reward guidance for PGE exploration. During training, MDS and PGE interact with and complement each other, improving both the feedback- and exploration-efficiency of PbRL.
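To make the MDS selection rule concrete, the sketch below shows one way the "apparent motion" and "different directions" criteria could be scored with a kernel density estimate over each segment's states. This is a minimal sketch, not the paper's implementation: the scoring formula, the use of scipy.stats.gaussian_kde, and the names motion_score, direction, and mds_select are our assumptions, and in practice the density would likely be estimated on low-dimensional state features (e.g., end-effector positions).

```python
import numpy as np
from scipy.stats import gaussian_kde

def motion_score(segment):
    # Fit a KDE to the segment's own states; if the states are spread out
    # (apparent motion), the average density at those states is low.
    states = np.asarray(segment, dtype=float)            # shape (T, state_dim)
    states = states + 1e-6 * np.random.randn(*states.shape)  # avoid a singular KDE for static segments
    kde = gaussian_kde(states.T)                         # scipy expects (dim, T)
    density = kde(states.T)                              # density at each visited state
    return -np.log(density + 1e-8).mean()                # higher score = more motion

def direction(segment):
    # Coarse motion direction: unit displacement from first to last state.
    states = np.asarray(segment, dtype=float)
    d = states[-1] - states[0]
    return d / (np.linalg.norm(d) + 1e-8)

def mds_select(candidate_pairs, n_queries):
    # Hypothetical ranking: favor pairs whose segments both move a lot
    # (motion) and move in different directions (distinction).
    scores = []
    for seg_a, seg_b in candidate_pairs:
        motion = motion_score(seg_a) + motion_score(seg_b)
        distinction = 1.0 - np.dot(direction(seg_a), direction(seg_b))  # in [0, 2]
        scores.append(motion * distinction)
    ranked = np.argsort(scores)[::-1]
    return [candidate_pairs[i] for i in ranked[:n_queries]]
```

The intuition is that such pairs correspond to visibly different behaviors, which are easier for a human to compare and more informative for reward learning.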
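PGE's intrinsic reward can be sketched in a similar spirit: states that the preference-trained reward model scores highly but that have rarely been visited receive a large exploration bonus. The count-based novelty term, the multiplicative combination, and the names below (PGEBonus, beta, precision) are illustrative assumptions under this reading, not the paper's exact formulation.

```python
import numpy as np

class PGEBonus:
    """Sketch of a preference-guided exploration bonus: a high learned
    (preference-based) reward combined with low visitation yields a
    large intrinsic reward."""

    def __init__(self, reward_model, beta=1.0, precision=1):
        self.reward_model = reward_model   # r_psi trained from preference labels
        self.beta = beta                   # bonus scale
        self.precision = precision         # rounding used to discretize states
        self.visit_counts = {}             # discretized state -> visit count

    def _key(self, state):
        return tuple(np.round(np.asarray(state, dtype=float), self.precision))

    def update(self, state):
        # Call once per environment step to track visitation.
        k = self._key(state)
        self.visit_counts[k] = self.visit_counts.get(k, 0) + 1

    def __call__(self, state, action):
        preference = self.reward_model(state, action)                      # how preferred the state is
        novelty = 1.0 / np.sqrt(self.visit_counts.get(self._key(state), 0) + 1)
        return self.beta * preference * novelty                            # intrinsic bonus
```

During policy updates, such a bonus would be added to the learned reward so the agent is steered toward under-explored but preferred regions, which in turn produces more informative segments for MDS to query.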

Experiment

Simulated Experiments

We compare our method (SENIOR) with five baseline methods on six complex robot manipulation tasks in Meta-World: Door Lock, Window Close, Handle Press, Window Open, Door Open, and Door Unlock. The compared PbRL methods are PEBBLE, MRN, RUNE, M-RUNE (RUNE with MRN), QPA, P-SENIOR (PEBBLE with SENIOR), and M-SENIOR (MRN with SENIOR). The experiment was repeated 100 times for each task.

Figure 2: Learning curves on six robotic manipulation tasks, measured by success rate: Door Lock, Window Close, Handle Press, and Window Open (feedback=250), and Door Open and Door Unlock (feedback=1000). The solid lines and shaded regions represent the mean and standard deviation, respectively, across five runs.

Figure 3: Comparison of success rates on the six tasks at 500K and 1000K steps.

Real-World Experiments

We also compare our method (SENIOR) with five baseline methods on four complex robot manipulation tasks in the real world: Door Open, Door Close, Box Open, and Box Close. The compared PbRL methods are PEBBLE, MRN, RUNE, M-RUNE (RUNE with MRN), QPA, and M-SENIOR (MRN with SENIOR). The experiment was repeated 20 times for each task.

Figure 4: Success rates of simulation and real-world experiments on four tasks: Door Open (feedback=1000), Door Close (feedback=50), and Box Open and Box Close (feedback=250).