Deep reinforcement learning has proven to be a powerful approach to solving control tasks, but its characteristic high-frequency action oscillations make it difficult to apply in real-world environments. While prior work has addressed action oscillations through architectural or loss-based approaches, the latter typically depend on heuristic or synthetic definitions of state similarity to promote action consistency, which often fail to accurately reflect the underlying system dynamics. In this paper, we propose a novel loss-based method built on transition-induced similar states, defined as the distribution of next states reachable from a common preceding state. Because this definition relies only on environmental feedback and actually collected transitions, it captures the system dynamics more faithfully. Building upon this foundation, we introduce Action Smoothing by Aligning Actions with Predictions from Preceding States (ASAP), an action smoothing method that effectively mitigates action oscillations. ASAP enforces action smoothness by aligning actions with those taken in transition-induced similar states and by penalizing second-order action differences to suppress high-frequency oscillations. Experiments in Gymnasium and Isaac-Lab environments demonstrate that ASAP yields smoother control and improved policy performance over existing methods.
A key challenge in action smoothing for reinforcement learning lies in defining similar states in a way that faithfully reflects the underlying system dynamics. Prior loss-based approaches typically rely on heuristic or synthetic neighborhoods, such as Gaussian perturbations around the current state or interpolations along the next-state direction. However, these artificially generated states often deviate from the true state distribution induced by the environment, leading to unstable or overly conservative regularization.
ASAP addresses this limitation by defining similar states directly from the environment's transition distribution. Given a previous state $s_{t-1}$, we define the similar-state distribution as the distribution of next states drawn from the transition kernel $P(\cdot \mid s_{t-1})$. Intuitively, states that originate from the same preceding state are expected to be similar from a dynamics perspective, as their differences arise solely from bounded stochastic disturbances.
Under mild assumptions on the transition function—namely, local Lipschitz continuity with respect to noise and bounded stochastic perturbations—this transition-induced definition forms a spatially bounded neighborhood. As a result, it provides a principled foundation for enforcing local Lipschitz continuity of the policy without introducing synthetic samples or heuristic distance thresholds.
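This boundedness claim can be made explicit in a short derivation. The transition function $f$, noise variable $\xi$, noise bound $\delta$, and Lipschitz constant $L_f$ below are notation introduced here for illustration; they are not defined in the main text:

$$ s_t = f(s_{t-1}, \xi_t), \qquad \|\xi_t\| \le \delta $$

For two samples $s, s' \sim P(\cdot \mid s_{t-1})$ generated by noise realizations $\xi$ and $\xi'$, local Lipschitz continuity of $f$ in its noise argument gives

$$ \|s - s'\| = \left\| f(s_{t-1}, \xi) - f(s_{t-1}, \xi') \right\| \le L_f \|\xi - \xi'\| \le 2 L_f \delta, $$

so the transition-induced neighborhood has radius at most $2 L_f \delta$, which is exactly the spatial boundedness the assumptions are meant to guarantee.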
Within this framework, ASAP enforces action consistency by aligning the policy output at the current state with the expected action over the similar state distribution. This alignment suppresses excessive sensitivity to small state variations and directly mitigates oscillatory behavior caused by transition noise.
Building on the transition-induced similar state formulation, ASAP introduces a practical mechanism to enforce action smoothness during training. Instead of explicitly sampling multiple next states, ASAP augments the policy network with an additional prediction head that estimates the expected action at the next time step given the preceding state.
Specifically, the ASAP actor consists of a shared feature extractor followed by two heads: an action head that outputs the executed action $a_t$ from the current state $s_t$, and a prediction head that predicts the expected next action $\mathbb{E}[a_{t+1} \mid s_t]$. The action head is trained to align its output with the prediction head, thereby encouraging consistency across transition-induced similar states.
To stabilize training, ASAP separates the learning signals for the two heads using stop-gradient operations. The prediction head learns to approximate the policy’s actual outputs, while the action head treats the predicted action as a fixed target. This asymmetric design avoids moving-target instability and enables robust optimization.
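The two-head actor described above can be sketched as follows. This is a minimal illustration, not the paper's exact architecture: the class name, layer sizes, and activations are assumptions, and a real implementation would typically output distribution parameters rather than deterministic actions.

```python
import torch
import torch.nn as nn

class ASAPActor(nn.Module):
    """Sketch of the ASAP actor: a shared feature extractor (trunk) feeding an
    action head and a prediction head. Hidden size and activations are
    illustrative assumptions."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        # Action head: outputs the executed action a_t from the current state s_t.
        self.action_head = nn.Linear(hidden, act_dim)
        # Prediction head: estimates the expected next action E[a_{t+1} | s_t].
        self.prediction_head = nn.Linear(hidden, act_dim)

    def forward(self, obs: torch.Tensor):
        h = self.trunk(obs)
        return self.action_head(h), self.prediction_head(h)

actor = ASAPActor(obs_dim=8, act_dim=2)
a_t, pred_next = actor(torch.randn(4, 8))
print(a_t.shape, pred_next.shape)  # torch.Size([4, 2]) torch.Size([4, 2])
```

Because both heads share the trunk, the extra prediction head adds only a single linear layer of overhead during training, and the action head alone is needed at inference time.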
In addition to spatial alignment, ASAP incorporates a temporal regularization term that penalizes second-order differences in the action sequence. This temporal loss suppresses high-frequency oscillations while preserving the agent’s ability to perform rapid but purposeful action changes when necessary.
The final ASAP objective combines the standard reinforcement learning actor loss with spatial and temporal smoothness terms. As a result, ASAP achieves smooth and stable control policies without modifying the network architecture at inference time, making it readily applicable to real-world robotic systems.
Formally, ASAP optimizes the policy using a composite objective that combines the standard reinforcement learning actor loss with spatial and temporal smoothness penalties:
$$ J^{\mathrm{ASAP}}_{\pi_\phi} = J_{\pi_\phi} + \lambda_S L_S + \lambda_P L_P + \lambda_T L_T $$
Here, $J_{\pi_\phi}$ denotes the standard actor loss used in common RL algorithms such as PPO and SAC. The additional terms encourage smoothness in both the spatial and temporal domains.
The spatial smoothing loss aligns the action produced at the current state with the expected action predicted from the preceding state:
$$ L_S = \left\| \pi_\phi(s_t) - \texttt{stopgrad}\!\left(\pi_P(s_{t-1})\right) \right\|_2^2 $$
To train the prediction head, ASAP introduces a complementary prediction loss, which encourages the predicted action to match the policy’s actual output while treating the target as fixed:
$$ L_P = \left\| \pi_P(s_{t-1}) - \texttt{stopgrad}\!\left(\pi_\phi(s_t)\right) \right\|_2^2 $$
This asymmetric design decouples the learning dynamics of the action head and the prediction head, preventing moving-target instability and allowing each component to be optimized with an appropriate learning strength.
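A minimal sketch of the two alignment losses, assuming PyTorch, where `.detach()` realizes the `stopgrad` operator. The function name and the choice of a squared-error reduction over the batch are illustrative assumptions:

```python
import torch

def asap_alignment_losses(a_t: torch.Tensor, pred_prev: torch.Tensor):
    """Spatial loss L_S and prediction loss L_P (sketch).

    a_t       : action-head output pi_phi(s_t)
    pred_prev : prediction-head output pi_P(s_{t-1}) for the same transition

    .detach() implements stopgrad, so each head receives gradients only from
    its own loss term, matching the asymmetric design described in the text.
    """
    l_s = ((a_t - pred_prev.detach()) ** 2).sum(dim=-1).mean()  # updates action head
    l_p = ((pred_prev - a_t.detach()) ** 2).sum(dim=-1).mean()  # updates prediction head
    return l_s, l_p
```

The asymmetry is visible in the gradient flow: backpropagating `l_s` touches only the action-head parameters, while `l_p` touches only the prediction head, so neither term chases a moving target produced by the other.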
In addition to spatial alignment, ASAP adopts a temporal smoothness penalty based on second-order action differences, originally proposed in Grad-CAPS:
$$ L_T = \left\| \frac{a_{t+1} - 2a_t + a_{t-1}} {\tanh(a_{t+1} - a_{t-1}) + \epsilon} \right\|_2^2 $$
This temporal loss suppresses high-frequency oscillations while preserving flexibility for purposeful action changes. The hyperparameters $\lambda_S$, $\lambda_P$, and $\lambda_T$ control the relative strength of each regularization term.
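The temporal term and the composite objective can be sketched as below. The absolute value on the denominator is an implementation assumption added here for numerical safety (keeping the denominator positive), and the default $\lambda$ values are placeholders, not the paper's tuned settings:

```python
import torch

def temporal_smoothness_loss(a_prev: torch.Tensor,
                             a_t: torch.Tensor,
                             a_next: torch.Tensor,
                             eps: float = 1e-6) -> torch.Tensor:
    """Second-order temporal penalty L_T (sketch of the Grad-CAPS-style term).

    The second-order difference vanishes for linear action trajectories, so
    steady, purposeful action changes are not penalized; only direction
    reversals (oscillations) produce a large numerator.
    """
    second_diff = a_next - 2.0 * a_t + a_prev
    # abs() is an assumption for numerical safety; it keeps the tanh-scaled
    # first-order difference in the denominator strictly positive.
    denom = torch.tanh(a_next - a_prev).abs() + eps
    return ((second_diff / denom) ** 2).sum(dim=-1).mean()

def asap_total_loss(actor_loss, l_s, l_p, l_t,
                    lam_s=0.1, lam_p=0.1, lam_t=0.01):
    # Placeholder lambda values; in practice they are tuned per environment.
    return actor_loss + lam_s * l_s + lam_p * l_p + lam_t * l_t
```

For a perfectly linear action sequence (e.g. 0, 1, 2) the second-order difference is exactly zero and the temporal loss vanishes, which is the sense in which rapid but purposeful changes remain unpenalized.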
We evaluate policies using two complementary metrics: Cumulative Return (Re) and Smoothness Score (Sm).
Cumulative Return measures overall task performance and is computed as the total accumulated reward over an episode.
Smoothness Score quantifies action oscillations based on the frequency spectrum of actions. Following prior work, we compute smoothness using the FFT as:
$$ \mathrm{Sm} = \frac{2}{n f_s} \sum_{i=1}^{n} M_i f_i $$
Here, $f_i$ and $M_i$ denote the frequency and amplitude of the $i$-th spectral component, and $f_s$ denotes the sampling frequency. This metric computes a frequency-weighted average of action magnitudes. Lower values indicate smoother and more stable control, while higher values correspond to increased high-frequency action components.
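The metric can be computed with a standard FFT, as in the NumPy sketch below. Handling a single action dimension and skipping the DC component are assumptions of this sketch; for multi-dimensional actions one would average the score over dimensions:

```python
import numpy as np

def smoothness_score(actions: np.ndarray, f_s: float) -> float:
    """Smoothness metric Sm = (2 / (n * f_s)) * sum_i M_i * f_i (sketch).

    actions : 1-D array holding one action dimension over an episode
    f_s     : control/sampling frequency in Hz
    Lower values indicate smoother control.
    """
    spectrum = np.fft.rfft(actions)
    amplitudes = np.abs(spectrum)[1:]                        # M_i (DC skipped)
    freqs = np.fft.rfftfreq(len(actions), d=1.0 / f_s)[1:]   # f_i
    n = len(amplitudes)
    return 2.0 / (n * f_s) * float(np.sum(amplitudes * freqs))
```

As a sanity check, a constant action sequence has (numerically) zero score, while an action sequence alternating at the Nyquist rate concentrates its amplitude at the highest frequency and scores much higher.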
Experiments on the Gymnasium benchmark were conducted to validate the theoretical assumptions and design principles of ASAP. Gymnasium provides a set of continuous control tasks with relatively simple yet diverse dynamics, making it well suited for analyzing whether the transition distribution-based definition of similar states effectively suppresses action oscillations.
We evaluate ASAP under both PPO and SAC settings, comparing it against prior methods using two metrics: cumulative return and action smoothness. The results show that ASAP consistently reduces high-frequency action oscillations while largely preserving policy performance across most environments, demonstrating that transition-induced similar states and prediction-based alignment accurately capture the underlying system dynamics.
Experiments on Isaac-Lab were conducted to assess whether ASAP remains effective beyond standard benchmarks, under realistic robot dynamics and practical noise conditions. Isaac-Lab provides high-fidelity physics simulation along with domain randomization and observation noise, enabling evaluation of policy stability and control quality in more realistic settings.
We apply ASAP to PPO-based policies and compare performance using two metrics: cumulative return and action smoothness. The results show that ASAP consistently reduces high-frequency action oscillations while maintaining or improving policy performance across multiple robotic tasks, demonstrating that transition-induced similar states and prediction-based alignment remain effective in robotic control scenarios.
We propose ASAP, a novel action smoothing method that combines transition-induced similar states with predictions from preceding states. ASAP effectively suppresses high-frequency action oscillations while preserving policy performance, as demonstrated across Gymnasium and Isaac-Lab benchmarks. The method introduces no additional inference-time overhead, making it readily applicable to real-world robotic control systems.
ASAP defines similar states under the assumption of bounded transition noise. In environments with excessively high noise, the induced neighborhood may become overly large, potentially weakening the spatial alignment effect. This limitation can be mitigated by appropriately tuning the spatial regularization strength.