Datasets and Benchmarks for Offline Safe RL

Introduction

Introducing Our New Offline Safe RL Benchmark Suite

We are thrilled to introduce our comprehensive benchmarking suite, specifically designed for offline safe reinforcement learning (RL). This suite's goal is to aid the development and evaluation of safe learning algorithms during both the training and deployment stages. 

The suite comprises three high-quality packages: expertly curated safe policies, D4RL-style datasets with environment wrappers, and top-notch offline safe RL baseline implementations. The suite is driven by an efficient data collection pipeline and incorporates advanced safe RL algorithms to generate a diverse range of datasets across 38 popular safe RL tasks. The tasks range from robot control to autonomous driving, making the suite applicable to a wide spectrum of challenges.

What Makes Our Benchmark Suite Unique?

We have introduced an array of data post-processing filters that can adjust each dataset's diversity. This lets us simulate various data collection conditions, resulting in hundreds of distinct datasets with varying levels of difficulty.

Furthermore, our benchmark suite includes implementations of popular offline safe RL algorithms. This not only helps speed up research in this area but also offers a comparison of their performance on collected datasets. Our thorough evaluations provide insights into the algorithms' strengths, weaknesses, and areas of potential improvement.

Benchmark Details

As shown in Figure 1, our benchmarking platform comprises three packages: FSRL (Fast Safe RL) [link], DSRL (Datasets for Safe RL) [link], and OSRL (Offline Safe RL) [link]. FSRL features a highly efficient data collection pipeline. DSRL hosts the datasets with a user-friendly API. OSRL offers implementations of a range of existing offline safe learning algorithms. This blend of resources makes our platform the perfect tool for those working in safe RL.

Figure 1. Overview of the benchmarking platform and the three packages: FSRL, DSRL, and OSRL.

To create a broad spectrum of difficulty levels for unbiased algorithm evaluation, we've gathered datasets from three well-established safe RL environments: SafetyGymnasium, BulletSafetyGym, and MetaDrive, as shown in Figure 2. These environments let us evaluate safe RL algorithms in a variety of scenarios, from dynamic driving situations to shorter-horizon control tasks with multiple agents.


For each environment, we train expert policies under varying cost thresholds, producing a pool of raw data that covers different task scenarios. We then apply a density filter that discards redundant trajectories to maintain data diversity.
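As an illustration of this idea, here is a minimal sketch (not the benchmark's actual implementation) that prunes redundant trajectories by binning them according to their reward and cost returns and capping how many are kept per bin; the bin sizes and cap are assumptions:

```python
import numpy as np

def density_filter(trajectories, reward_bin=10.0, cost_bin=5.0, max_per_bin=20):
    """Keep at most `max_per_bin` trajectories per (reward return, cost return) bin.

    `trajectories` is a list of dicts with precomputed episode returns:
    {"reward_return": float, "cost_return": float, ...}.
    """
    kept, counts = [], {}
    for traj in trajectories:
        key = (int(traj["reward_return"] // reward_bin),
               int(traj["cost_return"] // cost_bin))
        if counts.get(key, 0) < max_per_bin:
            counts[key] = counts.get(key, 0) + 1
            kept.append(traj)
    return kept
```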

Figure 2. Visualization of the simulation environments and representative tasks.

To make the datasets readily accessible to researchers, we've packaged them all in the DSRL package. This package mirrors the user-friendly API structure of D4RL, with the addition of a specialized costs entry in the datasets to indicate constraint violations.
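As a quick illustration of this D4RL-style interface, the snippet below loads a dataset and inspects the extra costs array; the exact task name and import details are assumptions here, so please refer to the DSRL documentation for the registered environment IDs:

```python
import gymnasium as gym
import dsrl  # registers the offline safe RL environments (assumed import)

# The task name is illustrative; see the DSRL package for the available IDs.
env = gym.make("OfflineCarCircle-v0")
dataset = env.get_dataset()

# D4RL-style keys, plus a dedicated "costs" array marking constraint violations.
print(dataset.keys())  # e.g., observations, actions, rewards, costs, terminals, ...
print(dataset["observations"].shape, dataset["costs"].shape)
```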


But that's not all! We've also introduced post-processing filters that can adjust each dataset's complexity and difficulty level. From changing data density to introducing outliers, these filters allow for a comprehensive and varied evaluation of offline safe RL algorithms. The filters are visualized in Figures 3 and 4. Here is an intuitive description of these filters:

- Density filter: sub-samples the dataset so that only a chosen percentage of the trajectories is kept, simulating smaller or sparser data collections.
- Noise filter: injects a chosen percentage of outlier trajectories to probe an algorithm's sensitivity to corrupted data.
- Partial data discarding filters: remove trajectories from selected regions of the reward-cost space (for example, keeping high-reward but high-cost "tempting" trajectories) to shape the learning difficulty.


Figure 3. Illustration of the post-process filters.

Figure 4. Illustration of the partial data discarding filters.
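For intuition about the noise filter, here is a toy sketch (again, not the benchmark's actual filter code) that perturbs a chosen percentage of trajectories with Gaussian action noise so they act as outliers; the noise scale and selection scheme are assumptions:

```python
import numpy as np

def inject_outliers(trajectories, outlier_ratio=0.1, noise_scale=0.5, seed=0):
    """Return a copy of the data in which `outlier_ratio` of the trajectories
    have Gaussian noise added to their actions, turning them into outliers."""
    rng = np.random.default_rng(seed)
    n_outliers = int(len(trajectories) * outlier_ratio)
    outlier_ids = set(rng.choice(len(trajectories), size=n_outliers, replace=False).tolist())
    noisy = []
    for i, traj in enumerate(trajectories):
        traj = dict(traj)  # shallow copy so the original dataset is untouched
        if i in outlier_ids:
            actions = np.asarray(traj["actions"], dtype=float)
            traj["actions"] = actions + rng.normal(scale=noise_scale, size=actions.shape)
        noisy.append(traj)
    return noisy
```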

The offline safe RL landscape, with its unique objectives and applications, requires an evaluation system that does justice to its capabilities. We propose a multi-tiered, four-level evaluation system that we believe better assesses offline safe RL algorithms.

While our benchmark primarily addresses the first two levels (constraint satisfaction on the standard datasets and adaptability across varying cost thresholds), we've also provided an array of filters for assessing the remaining two: an algorithm's generalization capabilities and its outlier sensitivity.

Evaluation Metrics

We adopt the normalized reward return and the normalized cost return as the comparison metrics. Denote $r(\pi)$ as the empirical reward return of a policy $\pi$ on task $M$, and let $r_{\max}(M)$ and $r_{\min}(M)$ be the task's constant maximum and minimum reward returns. The normalized reward is computed by:

$$R_{\text{normalized}} = \frac{r(\pi) - r_{\min}(M)}{r_{\max}(M) - r_{\min}(M)}.$$

Note that we use constant maximum and minimum values for a safe RL task rather than for a dataset, because the post-process filters may modify the dataset to create different difficulty levels. The normalized cost is defined differently from the reward to better distinguish the results: it is the ratio between the evaluated cost return $C(\pi)$ and the target threshold $\kappa$:

$$C_{\text{normalized}} = \frac{C(\pi)}{\kappa}.$$

Note that both the cost return and the threshold are always non-negative, so a normalized cost below 1 indicates that the agent satisfies the safety constraint.
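To make these definitions concrete, here is a small helper in Python; the task-level maximum and minimum returns and the example numbers are purely illustrative:

```python
def normalized_reward(reward_return, r_min, r_max):
    """Normalized reward: 0 corresponds to the task's minimum return, 1 to its maximum."""
    return (reward_return - r_min) / (r_max - r_min)

def normalized_cost(cost_return, target_threshold):
    """Normalized cost: values below 1 mean the target cost threshold is satisfied."""
    return cost_return / target_threshold

# Illustrative numbers: a policy with reward return 350 and cost return 15,
# evaluated under cost threshold 20 on a task whose returns span [0, 500].
print(normalized_reward(350.0, 0.0, 500.0))  # 0.7
print(normalized_cost(15.0, 20.0))           # 0.75 -> satisfies the constraint
```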

Our benchmark suite includes a wide range of existing offline safe RL methods, conveniently organized under the OSRL package as shown in Table 1. These methods cover the majority of offline safe learning categories currently available, providing an expansive platform for research and comparison.

Table 1. Implemented offline safe learning algorithms and their base methods.

Experiments and Analysis

We've adopted a Constraint Variation Evaluation to assess algorithm versatility: each algorithm is evaluated on each dataset using three distinct target cost thresholds and three random seeds. This setup lets us better understand each algorithm's adaptability to varying safety conditions. The performance of the algorithms we tested uncovers key challenges of offline safe learning. Here, we explore these insights and what they mean for the future of safe RL. The results are summarized in Table 2.
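A minimal sketch of this protocol is shown below; the specific threshold values and the `evaluate_fn` callable are assumptions rather than the benchmark's actual interface, and the normalization follows the metric definitions above:

```python
import numpy as np

def constraint_variation_eval(evaluate_fn, r_min, r_max,
                              thresholds=(20.0, 40.0, 80.0), seeds=(0, 1, 2)):
    """Average normalized reward/cost over several target thresholds and random seeds.

    `evaluate_fn(threshold, seed)` is a user-supplied callable that trains and
    evaluates one algorithm on one dataset and returns (reward_return, cost_return).
    """
    rewards, costs = [], []
    for kappa in thresholds:
        for seed in seeds:
            reward_return, cost_return = evaluate_fn(kappa, seed)
            rewards.append((reward_return - r_min) / (r_max - r_min))  # normalized reward
            costs.append(cost_return / kappa)                          # normalized cost
    return float(np.mean(rewards)), float(np.mean(costs))
```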


First, let's look at BC-All and BC-Safe. These algorithms mainly imitate policies instead of estimating Q values. While BC-All scores high rewards, it struggles with safety constraints. On the other hand, BC-Safe, which only uses safe trajectories, generally meets safety requirements. However, it is more cautious and yields lower rewards. This difference highlights a vital trade-off in offline safe RL: balancing safety and reward, which largely depends on the training dataset.
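For intuition, here is a minimal sketch of the data-selection step behind BC-Safe; the dictionary layout and threshold value are assumptions, not the benchmark's actual code:

```python
def select_safe_trajectories(trajectories, cost_threshold=20.0):
    """BC-Safe-style filtering: keep only trajectories whose total episode cost
    stays within the threshold, so behavior cloning imitates safe behavior only."""
    return [traj for traj in trajectories if sum(traj["costs"]) <= cost_threshold]

# BC-All would clone on the full `trajectories` list instead, which tends to
# earn higher rewards but violates the safety constraints more often.
```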


Now, consider CDT. Thanks to its advanced architecture and efficient use of data, CDT shows a well-balanced performance. Even though it has trouble with complex tasks in unpredictable environments, it usually secures higher rewards while satisfying the safety constraints, outperforming BC-Safe in most tasks.


In contrast, all Q-learning-based algorithms, including BCQ-Lag, BEAR-Lag, CPQ, and COptiDICE, show inconsistent performance. They tend to oscillate between being too cautious and too risky. For example, CPQ achieves high rewards at the expense of safety in Button tasks, while it shows almost no cost with low rewards in MetaDrive tasks. 


These inconsistencies pinpoint a major challenge for Q-learning-based approaches in offline safe RL: accurately predicting the safety performance. In regular offline RL, small biases in Q estimation don't significantly affect the overall performance. However, when safety thresholds are introduced, the situation changes. Underestimating cost Q values may result in risky policies with minimal safety penalties, while overestimations can lead to overly cautious behaviors.
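To see why, consider a toy sketch of a generic Lagrangian-style multiplier update (not any specific baseline's code); the numbers are illustrative. When the cost estimate is biased below the threshold, the penalty multiplier decays to zero even though the true policy is unsafe:

```python
def lagrangian_update(lmbda, estimated_cost, threshold, lr=0.1):
    """Generic dual update: increase the penalty when the (estimated) cost
    exceeds the threshold, decrease it otherwise, clipped at zero."""
    return max(0.0, lmbda + lr * (estimated_cost - threshold))

lmbda = 1.0
true_cost, bias, threshold = 1.2, -0.3, 1.0  # the true cost violates the threshold
for _ in range(200):
    lmbda = lagrangian_update(lmbda, true_cost + bias, threshold)
print(lmbda)  # 0.0: the underestimated cost (0.9 < 1.0) drives the penalty away
```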


The way forward? Future research could concentrate on refining techniques for accurate safety performance estimation in offline settings. This is particularly important for the development and use of Q-learning-based approaches in offline safe RL. Through these focused efforts, we hope offline safe RL algorithms will reliably strike the right balance between risk and reward.

Table 2. Evaluation results of the normalized reward and cost. The normalized cost threshold is 1. Each value is averaged over 3 distinct cost thresholds, 20 evaluation episodes, and 3 random seeds. Bold: safe agents whose normalized cost is smaller than 1. Gray: unsafe agents. Blue: safe agents with the highest reward.

Our experiments also examined the effects of different data manipulation filters, such as the density and noise-level filters. The results are shown in Figures 5 and 6.


The density filter, for example, revealed that most algorithms saw a decrease in cost values as more data was used. This finding highlights the significant role dataset size plays in influencing algorithmic performance. 


The noise data filter experiments illustrated how increasing the percentage of outlier trajectories affected performance. This was especially true for some algorithms, which showed significant performance degradation, emphasizing the importance of outlier sensitivity in the evaluation of offline safe RL algorithms.


Figure 7 shows the performance under different data-discarding strategies. We can see that the tempting datasets usually lead to high costs and high rewards, i.e., tempting policies, in the BulletSafetyGym tasks. This also suggests that we can adjust the learning difficulty by manipulating the shape of the datasets. Investigating how to selectively use data in the dataset to enhance safety and performance could be an interesting future direction.


In conclusion, our benchmark suite, experimental settings, and post-process filters pave the way for a more comprehensive understanding of the strengths and limitations of offline safe RL algorithms. This understanding is crucial for future research and the advancement of safe RL applications in real-world scenarios.

Figure 5. Average performance with different percentages of dataset trajectories.

Figure 6. Average performance with different percentages of outlier trajectories.

Figure 7. Average performance with different data-discarding strategies.

Moving Forward

We're excited to see how this benchmarking suite will aid in the development of more robust and reliable offline safe RL solutions. By exposing the inherent challenges of offline safe learning problems, we hope to contribute to the evolution of this field, enabling the deployment of safe RL in a myriad of real-world applications.

Stay tuned for more exciting updates and happy benchmarking!