Deep reinforcement learning has proven to be a powerful approach to solving control tasks, but its characteristic high-frequency action oscillations make it difficult to apply in real-world environments. While prior work has addressed action oscillations through architectural or loss-based approaches, the latter typically depend on heuristic or synthetic definitions of state similarity to promote action consistency, which often fail to accurately reflect the underlying system dynamics. In this paper, we propose a novel loss-based method built on transition-induced similar states, defined as the distribution of next states reachable from a common preceding state. Because this definition relies only on environmental feedback and actually collected transitions, it captures the system dynamics more faithfully. Building upon this foundation, we introduce Action Smoothing by Aligning Actions with Predictions from Preceding States (ASAP), an action smoothing method that effectively mitigates action oscillations. ASAP enforces action smoothness by aligning actions with those taken in transition-induced similar states and by penalizing second-order action differences to suppress high-frequency oscillations. Experiments in Gymnasium and Isaac-Lab environments demonstrate that ASAP yields smoother control and improved policy performance over existing methods.
A key challenge in action smoothing for reinforcement learning lies in defining similar states in a way that faithfully reflects the underlying system dynamics. Prior loss-based approaches typically rely on heuristic or synthetic neighborhoods, such as Gaussian perturbations around the current state or interpolations along the next-state direction. However, these artificially generated states often deviate from the true state distribution induced by the environment, leading to unstable or overly conservative regularization.
ASAP addresses this limitation by defining similar states directly from the environment's transition distribution. Given a previous state $s_{t-1}$, we define the similar-state distribution as the distribution of next states drawn from the transition kernel $P(\cdot \mid s_{t-1})$. Intuitively, states that originate from the same preceding state are expected to be similar from a dynamics perspective, as their differences arise solely from bounded stochastic disturbances.
Under mild assumptions on the transition function—namely, local Lipschitz continuity with respect to noise and bounded stochastic perturbations—this transition-induced definition forms a spatially bounded neighborhood. As a result, it provides a principled foundation for enforcing local Lipschitz continuity of the policy without introducing synthetic samples or heuristic distance thresholds.
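This boundedness claim can be made explicit in a short derivation. The transition function $f$, noise variable $\xi$, noise bound $\delta$, and Lipschitz constant $L_f$ below are notation introduced here for illustration; they are not defined in the main text:

$$ s_t = f(s_{t-1}, \xi_t), \qquad \|\xi_t\| \le \delta $$

For two samples $s, s' \sim P(\cdot \mid s_{t-1})$ generated by noise realizations $\xi$ and $\xi'$, local Lipschitz continuity of $f$ in its noise argument gives

$$ \|s - s'\| = \left\| f(s_{t-1}, \xi) - f(s_{t-1}, \xi') \right\| \le L_f \|\xi - \xi'\| \le 2 L_f \delta, $$

so the transition-induced neighborhood has radius at most $2 L_f \delta$, which is exactly the spatial boundedness the assumptions are meant to guarantee.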
Within this framework, ASAP enforces action consistency by aligning the policy output at the current state with the expected action over the similar state distribution. This alignment suppresses excessive sensitivity to small state variations and directly mitigates oscillatory behavior caused by transition noise.
Building on the transition-induced similar state formulation, ASAP introduces a practical mechanism to enforce action smoothness during training. Instead of explicitly sampling multiple next states, ASAP augments the policy network with an additional prediction head that estimates the expected action at the next time step given the preceding state.
Specifically, the ASAP actor consists of a shared feature extractor followed by two heads: an action head that outputs the executed action $a_t$ from the current state $s_t$, and a prediction head that predicts the expected next action $\mathbb{E}[a_{t+1} \mid s_t]$. The action head is trained to align its output with the prediction head, thereby encouraging consistency across transition-induced similar states.
To stabilize training, ASAP separates the learning signals for the two heads using stop-gradient operations. The prediction head learns to approximate the policy’s actual outputs, while the action head treats the predicted action as a fixed target. This asymmetric design avoids moving-target instability and enables robust optimization.
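The two-head actor described above can be sketched as follows. This is a minimal illustration, not the paper's exact architecture: the class name, layer sizes, and activations are assumptions, and a real implementation would typically output distribution parameters rather than deterministic actions.

```python
import torch
import torch.nn as nn

class ASAPActor(nn.Module):
    """Sketch of the ASAP actor: a shared feature extractor (trunk) feeding an
    action head and a prediction head. Hidden size and activations are
    illustrative assumptions."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        # Action head: outputs the executed action a_t from the current state s_t.
        self.action_head = nn.Linear(hidden, act_dim)
        # Prediction head: estimates the expected next action E[a_{t+1} | s_t].
        self.prediction_head = nn.Linear(hidden, act_dim)

    def forward(self, obs: torch.Tensor):
        h = self.trunk(obs)
        return self.action_head(h), self.prediction_head(h)

actor = ASAPActor(obs_dim=8, act_dim=2)
a_t, pred_next = actor(torch.randn(4, 8))
print(a_t.shape, pred_next.shape)  # torch.Size([4, 2]) torch.Size([4, 2])
```

Because both heads share the trunk, the extra prediction head adds only a single linear layer of overhead during training, and the action head alone is needed at inference time.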
In addition to spatial alignment, ASAP incorporates a temporal regularization term that penalizes second-order differences in the action sequence. This temporal loss suppresses high-frequency oscillations while preserving the agent’s ability to perform rapid but purposeful action changes when necessary.
The final ASAP objective combines the standard reinforcement learning actor loss with spatial and temporal smoothness terms. As a result, ASAP achieves smooth and stable control policies without modifying the network architecture at inference time, making it readily applicable to real-world robotic systems.
Formally, ASAP optimizes the policy using a composite objective that combines the standard reinforcement learning actor loss with spatial and temporal smoothness penalties:
$$ J^{\mathrm{ASAP}}_{\pi_\phi} = J_{\pi_\phi} + \lambda_S L_S + \lambda_P L_P + \lambda_T L_T $$
Here, $J_{\pi_\phi}$ denotes the standard actor loss used in common RL algorithms such as PPO and SAC. The additional terms encourage smoothness in both the spatial and temporal domains.
The spatial smoothing loss aligns the action produced at the current state with the expected action predicted from the preceding state:
$$ L_S = \left\| \pi_\phi(s_t) - \texttt{stopgrad}\!\left(\pi_P(s_{t-1})\right) \right\|_2^2 $$
To train the prediction head, ASAP introduces a complementary prediction loss, which encourages the predicted action to match the policy’s actual output while treating the target as fixed:
$$ L_P = \left\| \pi_P(s_{t-1}) - \texttt{stopgrad}\!\left(\pi_\phi(s_t)\right) \right\|_2^2 $$
This asymmetric design decouples the learning dynamics of the action head and the prediction head, preventing moving-target instability and allowing each component to be optimized with an appropriate learning strength.
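A minimal sketch of the two alignment losses, assuming PyTorch, where `.detach()` realizes the `stopgrad` operator. The function name and the choice of a squared-error reduction over the batch are illustrative assumptions:

```python
import torch

def asap_alignment_losses(a_t: torch.Tensor, pred_prev: torch.Tensor):
    """Spatial loss L_S and prediction loss L_P (sketch).

    a_t       : action-head output pi_phi(s_t)
    pred_prev : prediction-head output pi_P(s_{t-1}) for the same transition

    .detach() implements stopgrad, so each head receives gradients only from
    its own loss term, matching the asymmetric design described in the text.
    """
    l_s = ((a_t - pred_prev.detach()) ** 2).sum(dim=-1).mean()  # updates action head
    l_p = ((pred_prev - a_t.detach()) ** 2).sum(dim=-1).mean()  # updates prediction head
    return l_s, l_p
```

The asymmetry is visible in the gradient flow: backpropagating `l_s` touches only the action-head parameters, while `l_p` touches only the prediction head, so neither term chases a moving target produced by the other.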
In addition to spatial alignment, ASAP adopts a temporal smoothness penalty based on second-order action differences, originally proposed in Grad-CAPS:
$$ L_T = \left\| \frac{a_{t+1} - 2a_t + a_{t-1}} {\tanh(a_{t+1} - a_{t-1}) + \epsilon} \right\|_2^2 $$
This temporal loss suppresses high-frequency oscillations while preserving flexibility for purposeful action changes. The hyperparameters $\lambda_S$, $\lambda_P$, and $\lambda_T$ control the relative strength of each regularization term.
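The temporal term and the composite objective can be sketched as below. The absolute value on the denominator is an implementation assumption added here for numerical safety (keeping the denominator positive), and the default $\lambda$ values are placeholders, not the paper's tuned settings:

```python
import torch

def temporal_smoothness_loss(a_prev: torch.Tensor,
                             a_t: torch.Tensor,
                             a_next: torch.Tensor,
                             eps: float = 1e-6) -> torch.Tensor:
    """Second-order temporal penalty L_T (sketch of the Grad-CAPS-style term).

    The second-order difference vanishes for linear action trajectories, so
    steady, purposeful action changes are not penalized; only direction
    reversals (oscillations) produce a large numerator.
    """
    second_diff = a_next - 2.0 * a_t + a_prev
    # abs() is an assumption for numerical safety; it keeps the tanh-scaled
    # first-order difference in the denominator strictly positive.
    denom = torch.tanh(a_next - a_prev).abs() + eps
    return ((second_diff / denom) ** 2).sum(dim=-1).mean()

def asap_total_loss(actor_loss, l_s, l_p, l_t,
                    lam_s=0.1, lam_p=0.1, lam_t=0.01):
    # Placeholder lambda values; in practice they are tuned per environment.
    return actor_loss + lam_s * l_s + lam_p * l_p + lam_t * l_t
```

For a perfectly linear action sequence (e.g. 0, 1, 2) the second-order difference is exactly zero and the temporal loss vanishes, which is the sense in which rapid but purposeful changes remain unpenalized.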
We evaluate policies using two complementary metrics: Cumulative Return (Re) and Smoothness Score (Sm).
Cumulative Return measures overall task performance and is computed as the total accumulated reward over an episode.
Smoothness Score quantifies action oscillations based on the frequency spectrum of actions. Following prior work, we compute smoothness using the FFT as:
$$ \mathrm{Sm} = \frac{2}{n f_s} \sum_{i=1}^{n} M_i f_i $$
Here, $f_i$ and $M_i$ denote the frequency and amplitude of the $i$-th spectral component, and $f_s$ denotes the sampling frequency. This metric computes a frequency-weighted average of action magnitudes. Lower values indicate smoother and more stable control, while higher values correspond to increased high-frequency action components.
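The metric can be computed with a standard FFT, as in the NumPy sketch below. Handling a single action dimension and skipping the DC component are assumptions of this sketch; for multi-dimensional actions one would average the score over dimensions:

```python
import numpy as np

def smoothness_score(actions: np.ndarray, f_s: float) -> float:
    """Smoothness metric Sm = (2 / (n * f_s)) * sum_i M_i * f_i (sketch).

    actions : 1-D array holding one action dimension over an episode
    f_s     : control/sampling frequency in Hz
    Lower values indicate smoother control.
    """
    spectrum = np.fft.rfft(actions)
    amplitudes = np.abs(spectrum)[1:]                        # M_i (DC skipped)
    freqs = np.fft.rfftfreq(len(actions), d=1.0 / f_s)[1:]   # f_i
    n = len(amplitudes)
    return 2.0 / (n * f_s) * float(np.sum(amplitudes * freqs))
```

As a sanity check, a constant action sequence has (numerically) zero score, while an action sequence alternating at the Nyquist rate concentrates its amplitude at the highest frequency and scores much higher.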
Experiments on the Gymnasium benchmark were conducted to validate the theoretical assumptions and design principles of ASAP. Gymnasium provides a set of continuous control tasks with relatively simple yet diverse dynamics, making it well suited for analyzing whether the transition distribution-based definition of similar states effectively suppresses action oscillations.
We evaluate ASAP under both PPO and SAC settings, comparing it against prior methods using two metrics: cumulative return and action smoothness. The results show that ASAP consistently reduces high-frequency action oscillations while largely preserving policy performance across most environments, demonstrating that transition-induced similar states and prediction-based alignment accurately capture the underlying system dynamics.
Experiments on Isaac-Lab were conducted to assess whether ASAP remains effective beyond standard benchmarks, under realistic robot dynamics and practical noise conditions. Isaac-Lab provides high-fidelity physics simulation along with domain randomization and observation noise, enabling evaluation of policy stability and control quality in more realistic settings.
We apply ASAP to PPO-based policies and compare performance using two metrics: cumulative return and action smoothness. The results show that ASAP consistently reduces high-frequency action oscillations while maintaining or improving policy performance across multiple robotic tasks, demonstrating that transition-induced similar states and prediction-based alignment remain effective in robotic control scenarios.
We propose ASAP, a novel action smoothing method that combines transition-induced similar states with predictions from preceding states. ASAP effectively suppresses high-frequency action oscillations while preserving policy performance, as demonstrated across Gymnasium and Isaac-Lab benchmarks. The method introduces no additional inference-time overhead, making it readily applicable to real-world robotic control systems.
ASAP defines similar states under the assumption of bounded transition noise. In environments with excessively high noise, the induced neighborhood may become overly large, potentially weakening the spatial alignment effect. This limitation can be mitigated by appropriately tuning the spatial regularization strength.