Datasets and Benchmarks for Offline Safe RL
Introducing Our New Offline Safe RL Benchmark Suite
We are thrilled to introduce our comprehensive benchmarking suite for offline safe reinforcement learning (RL). The suite aims to facilitate the development and evaluation of safe learning algorithms in both the training and deployment stages.
The suite comprises three high-quality packages: FSRL (expertly curated safe policies), DSRL (datasets structured similarly to D4RL, together with environment wrappers), and OSRL (offline safe RL baseline implementations). It is driven by an efficient data collection pipeline and uses advanced safe RL algorithms to generate a diverse range of datasets across 38 popular safe RL tasks, spanning robot control to autonomous driving scenarios, making it applicable to a wide spectrum of challenges.
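Because the datasets follow the D4RL convention, loading one should feel familiar. Below is a minimal usage sketch assuming a D4RL-style `get_dataset()` wrapper; the task ID shown is a hypothetical example, so check the DSRL package for the actual environment names.

```python
import gymnasium as gym
import dsrl  # assumed to register the offline safe RL environments on import

# Hypothetical task ID for illustration; see DSRL for the actual environment names.
env = gym.make("OfflineCarCircle-v0")

# D4RL-style dataset: a dictionary of aligned numpy arrays, with an extra
# per-step "costs" signal on top of the usual keys.
data = env.get_dataset()
print(data["observations"].shape)  # (N, obs_dim)
print(data["actions"].shape)       # (N, act_dim)
print(data["rewards"].shape)       # (N,)
print(data["costs"].shape)         # (N,)
```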
What Makes Our Benchmark Suite Unique?
We have introduced an array of data post-processing filters that can modify each dataset's diversity. This lets us simulate various data collection conditions, yielding hundreds of distinct datasets with varying levels of difficulty.
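To make the filtering idea concrete, here is a minimal sketch of one possible post-processing filter: discarding trajectories whose cost return falls outside a target range, which controls the cost distribution of the resulting dataset. The function name and the trajectory representation are our own illustration, not the suite's actual API.

```python
import numpy as np

def cost_range_filter(trajectories, cost_min, cost_max):
    """Keep only trajectories whose total cost return lies in [cost_min, cost_max].

    `trajectories` is assumed to be a list of dicts with a per-step "costs" array;
    this mirrors the spirit of the partial data discarding filters described above.
    """
    kept = []
    for traj in trajectories:
        cost_return = float(np.sum(traj["costs"]))
        if cost_min <= cost_return <= cost_max:
            kept.append(traj)
    return kept

# Example: simulate a data collection condition where only low-cost
# (mostly safe) behavior is available to the offline learner.
# safe_subset = cost_range_filter(all_trajectories, cost_min=0.0, cost_max=20.0)
```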
Furthermore, our benchmark suite includes implementations of popular offline safe RL algorithms, which both accelerates research in this area and enables a comparison of their performance on the collected datasets. Our thorough evaluations provide insights into each algorithm's strengths, weaknesses, and potential areas of improvement.
Figure 1. Overview of the benchmarking platform and the three packages: FSRL, DSRL, and OSRL.
Figure 2. Visualization of the simulation environments and representative tasks.
Figure 3. Illustration of the post-processing filters.
Figure 4. Illustration of the partial data discarding filters.
We adopt the normalized reward return and the normalized cost return as the comparison metrics. Let $r(\pi)$ denote the evaluated reward return of a policy $\pi$ on task $M$, and let $r_{\max}(M)$ and $r_{\min}(M)$ denote the empirical maximum and minimum reward returns for that task. The normalized reward is computed by:

$$R_{\text{normalized}} = \frac{r(\pi) - r_{\min}(M)}{r_{\max}(M) - r_{\min}(M)}.$$
Note that we use constant maximum and minimum values per safe RL task rather than per dataset, because the post-processing filters may modify a dataset to create different difficulty levels. The normalized cost is defined differently from the reward to better distinguish the results: it is the ratio between the evaluated cost return $C(\pi)$ and the target threshold $\kappa$:

$$C_{\text{normalized}} = \frac{C(\pi)}{\kappa}.$$

Note that the cost return and the threshold are always non-negative, so a normalized cost below 1 indicates a safe agent.
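Both metrics are straightforward to compute. The sketch below simply mirrors the two definitions above; the argument names are placeholders for the per-task maximum/minimum returns and the chosen threshold.

```python
def normalized_reward(r_pi: float, r_min: float, r_max: float) -> float:
    """Min-max normalize the evaluated reward return using the
    task-level (not dataset-level) constants r_min and r_max."""
    return (r_pi - r_min) / (r_max - r_min)

def normalized_cost(c_pi: float, threshold: float) -> float:
    """Ratio of the evaluated cost return to the target threshold.
    Both are non-negative; a value below 1 means the agent is safe."""
    return c_pi / threshold
```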
Table 1. Implemented offline safe learning algorithms and their base methods.
Experiments and Analysis
Table 2. Evaluation results of the normalized reward and cost. The normalized cost threshold is 1. Each value is averaged over 3 distinct cost thresholds, 20 evaluation episodes, and 3 random seeds. Bold: safe agents whose normalized cost is smaller than 1. Gray: unsafe agents. Blue: the safe agent with the highest reward.
Figure 5. Average performance with different percentages of dataset trajectories.
Figure 6. Average performance with different percentages of outlier trajectories.
Figure 7. Average performance with different data discarding strategies.
We're excited to see how this benchmarking suite will aid the development of more robust and reliable offline safe RL solutions. By exposing the inherent challenges of offline safe learning problems, we hope to contribute to the evolution of this field and enable the deployment of safe RL in a wide range of real-world applications.
Stay tuned for more exciting updates and happy benchmarking!