Robust Fall Recovery for Armless Bipedal-Wheeled Robots via Force‑Guided Learning

Beijing Institute of Technology

Abstract

Fall recovery is critical for autonomous legged locomotion. Existing methods have shown that some legged robots, such as humanoids and quadrupeds, can recover from diverse fallen postures by using their arms or coordinating multiple legs to generate support forces. Without arms or additional legs to provide support, a bipedal-wheeled robot must rely solely on the actuation of its two legs, which makes recovery particularly difficult. To address this, we introduce FTSR (Force-guided Teacher-student framework with Stage-wise Rewards). The force-guided method applies an external auxiliary force during simulation training that depends directly on the robot's real-time height, and explicitly formulates this force as an optimizable constraint. Through constrained reinforcement learning, the policy is guided to gradually reduce its dependence on the force while increasing body height, so that it develops its own recovery strategies despite having no arms for support. Height-progressive stage-wise rewards progressively structure posture stabilization during recovery and the transition to sustained locomotion, and are integrated with a teacher-student architecture that distills privileged knowledge of force effects and recovery dynamics. After simulation training, the policy is deployed on a physical armless bipedal-wheeled robot and extensively evaluated. Experiments confirm robust and reliable fall recovery under diverse challenging conditions, demonstrating strong environmental adaptability and motion robustness while maintaining full post-recovery motion capability. The framework also generalizes effectively to a high-DOF humanoid.

Method

The overall framework is illustrated in Fig. 1. Armless bipedal-wheeled robots lack upper-body support during fall recovery, making them prone to convergence to dead points under pure reward optimization. This motivates formulating external auxiliary forces $\mathcal{F}$ and torques $\mathcal{T}$ as optimizable constraints rather than decaying curricula, enabling the policy to gradually reduce dependency on external assistance while discovering physically feasible recovery trajectories. Such constraint-guided exploration necessitates stable posture refinement across phases, which we achieve through height-progressive stage-wise rewards that auto-transition based on batch height statistics rather than fixed durations.
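To make the idea of a height-dependent assist force treated as a constraint more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a linear force law with a cap and a generic Lagrangian-multiplier update, a standard constrained-RL device used here for illustration only. The class `ForceGuide` and all of its parameters (`gain`, `f_max`, `budget`, `lr`) are hypothetical, and the auxiliary torque $\mathcal{T}$ is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class ForceGuide:
    """Illustrative height-dependent assist force with a Lagrangian penalty."""

    gain: float = 200.0    # N per metre of height deficit (hypothetical value)
    f_max: float = 100.0   # cap on the assist force in newtons (hypothetical)
    budget: float = 0.05   # allowed mean normalized force (hypothetical)
    lr: float = 0.5        # multiplier step size (hypothetical)
    lam: float = 0.0       # Lagrange multiplier, kept non-negative

    def assist_force(self, height: float, target_height: float) -> float:
        # The force grows with the height deficit and vanishes at the target.
        deficit = max(0.0, target_height - height)
        return min(self.f_max, self.gain * deficit)

    def cost(self, height: float, target_height: float) -> float:
        # Normalized constraint cost in [0, 1]: how much assistance is in use.
        return self.assist_force(height, target_height) / self.f_max

    def penalized_reward(self, reward: float, height: float,
                         target_height: float) -> float:
        # Lagrangian relaxation: task reward minus weighted constraint cost.
        return reward - self.lam * self.cost(height, target_height)

    def update_multiplier(self, batch_costs: list[float]) -> float:
        # Dual ascent: raise the multiplier while the batch exceeds the budget,
        # lower it (down to zero) once the policy needs little assistance.
        mean_cost = sum(batch_costs) / len(batch_costs)
        self.lam = max(0.0, self.lam + self.lr * (mean_cost - self.budget))
        return self.lam
```

Under these assumptions, a policy that still leans on the assist force sees its reward penalized more heavily over time, while a policy that stands on its own drives the multiplier back toward zero.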

Realizing these mechanisms requires balancing privileged information with proprioceptive-only execution, leading to a teacher-student architecture where teacher encoder $E^t$ processes $\mathbf{x}^t_t$ comprising contact forces, base height, and full base state, while student encoder $E^s$ learns equivalent representations from history $\mathbf{o}^s_{t:t-H}$. This history concatenates observations $\mathbf{o}_t$ over past $H$ steps, where each $\mathbf{o}_t \in \mathbb{R}^{34}$ includes angular velocity, gravity projection, user commands, joint positions, velocities, and previous actions. An "or" module routes latent $\mathbf{z}^t_t$ or $\mathbf{z}^s_t$ to shared Actor-Critic networks by agent group, producing $\mathbf{z}_t$ concatenated with current $\mathbf{o}_t$ and $\mathbf{s}_t \in \mathbb{R}^{188}$ containing height map and robot base height. The networks output actions $\mathbf{a}_t$ and values $\hat{V}_t$, with parameters in Table I and component designs in Sec. II-C and Sec. II-D.
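The data flow described above can be sketched in plain Python. This is an illustrative outline under stated assumptions, not the released code: the encoders themselves are left out, the latent size is arbitrary, the history is taken to hold exactly the most recent $H$ observations, and the names `HistoryBuffer`, `route_latent`, `distillation_loss`, and `policy_input` are hypothetical. Only the dimensions 34 and 188 come from the text.

```python
from collections import deque

OBS_DIM = 34     # size of each proprioceptive observation o_t (from the text)
STATE_DIM = 188  # size of s_t, height map plus base height (from the text)


class HistoryBuffer:
    """Keeps the last `horizon` observations for the student encoder."""

    def __init__(self, horizon: int):
        self.horizon = horizon
        # Zero-padding at episode start is an assumption of this sketch.
        self.steps = deque(
            ([0.0] * OBS_DIM for _ in range(horizon)), maxlen=horizon
        )

    def push(self, obs: list[float]) -> None:
        if len(obs) != OBS_DIM:
            raise ValueError(f"expected {OBS_DIM} values, got {len(obs)}")
        self.steps.append(list(obs))  # the oldest step is dropped automatically

    def flat(self) -> list[float]:
        # Concatenate the stored steps, oldest first, into one student input.
        return [value for step in self.steps for value in step]


def route_latent(z_teacher, z_student, is_teacher_agent: bool):
    # The "or" module: pick the latent that matches the agent's group.
    return z_teacher if is_teacher_agent else z_student


def distillation_loss(z_student, z_teacher) -> float:
    # Mean squared error pulling the student latent toward the teacher latent.
    return sum((s - t) ** 2 for s, t in zip(z_student, z_teacher)) / len(z_teacher)


def policy_input(z, obs, state) -> list[float]:
    # Shared Actor-Critic input: latent z_t concatenated with o_t and s_t.
    return list(z) + list(obs) + list(state)
```

With a hypothetical horizon of 5, `HistoryBuffer(5).flat()` yields a 170-dimensional student input, and the Actor-Critic input has length `len(z) + 34 + 188`.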

Method Pipeline
Figure 1: Force-guided fall-recovery training framework. The framework supports either bipedal-wheeled or humanoid robots, selected via Robot Type. Training data flows from the selected robot through Robot Data to generate the teacher encoder observation $\mathbf{x}^t_t$ and the student history observation $\mathbf{o}^s_{t:t-H}$, which feed $E^t$ and $E^s$ respectively to produce latents $\mathbf{z}^t_t$ and $\mathbf{z}^s_t$. An "or" module then selects between these latents, taking $\mathbf{z}^t_t$ for teacher-group agents and $\mathbf{z}^s_t$ for student-group agents, to form $\mathbf{z}_t$, which is concatenated with $\mathbf{o}_t$ and $\mathbf{s}_t$ as input to the shared Actor-Critic networks ($\pi$ and $\phi$) that output $\mathbf{a}_t$ and $\hat{V}_t$. Yellow arrows indicate training signals: $\mathbf{z}^t_t$ and $\mathbf{z}^s_t$ supervise $E^s$ via an MSE loss, while $\mathcal{F}$, $\mathcal{T}$, $R_t$, and $\hat{V}_t$ update $E^t$, $\pi$, and $\phi$, evolving the policy toward $\pi^*$ by progressively reducing $\mathcal{F}$ and $\mathcal{T}$ via Force-Guided Learning. As the robot base height increases, $R_t$ transitions from $r^u$ to $r^w$ once the agents exceeding the height threshold make up two-thirds of all agents, synchronizing the target height and rewards with the actual height progression.
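The batch-based stage transition in the caption can be illustrated with a short sketch. It assumes one height threshold per stage and at most one stage advance per update. The class name `StageScheduler` and the threshold values in the usage note are hypothetical; only the two-thirds rule comes from the text.

```python
class StageScheduler:
    """Advances the reward stage from batch height statistics, not elapsed time."""

    def __init__(self, thresholds: list[float]):
        self.thresholds = list(thresholds)  # one height threshold per stage
        self.stage = 0                      # index of the active reward stage

    def update(self, heights: list[float]) -> int:
        # Advance when at least two-thirds of agents exceed the current
        # threshold. Integer arithmetic avoids rounding issues with 2/3.
        if heights and self.stage < len(self.thresholds):
            threshold = self.thresholds[self.stage]
            above = sum(1 for h in heights if h > threshold)
            if 3 * above >= 2 * len(heights):
                self.stage += 1
        return self.stage
```

For example, with hypothetical thresholds `[0.2, 0.4]`, a batch of base heights `[0.1, 0.3, 0.3]` moves the scheduler from stage 0 to stage 1, because two of the three agents exceed 0.2 m.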

Simulation Results

Our framework enables armless bipedal-wheeled robots to recover from falls while maintaining sustained locomotion, and further generalizes to a humanoid platform.

Height-progressive stage-wise rewards are crucial for guiding the recovery process. Without the initial reward layers and adaptive target height updates, the force-guided policy may rise with assistance but collapses once the force is removed because of poor posture. This is evident in the pronounced force-curve rebound during training (Fig. 3, Table I), which reflects slow and unstable convergence. In contrast, the policy trained with stage-wise rewards attains stable locomotion at 0.8 m/s and maintains a high recovery success rate while moving (Fig. 4).

Figure 3: External force and height reward curves.
Figure 4: Success rate and foot travel distance across training iterations.

Proper force-guided learning leads to more robust and efficient recovery strategies. The conventional force-curriculum approach fails to achieve standing, while our redesigned force function enables rising but yields poor metrics (Table I). In contrast, our method formulates force as optimizable constraints, driving rapid convergence to low-intervention and robust regions. Assistance is removed after 3k iterations, with stable, rapid force decay (Fig. 3). The policy achieves force-free standing by 3k iterations, stabilizes after 4k with minimal foot adjustment, and maintains high success under commanded speeds (Fig. 4).

Figure 5: The proposed force-guided method is extended to a 23-DOF Unitree humanoid. Our method learns a policy to stand up from arbitrary poses (a). The recovery motion is illustrated via 3D trajectories of key points (b), and a comparative evaluation (c) confirms the policy's smoothness and superiority over other methods.

Bipedal-Wheeled Robot in Simulation

Humanoid Robot in Simulation

Real-World Experiments

To validate the effectiveness of our approach, we deployed the trained policy on our self-developed $\textit{JiaRan}$ robot and conducted real-world tests across various terrains and initial poses. All experimental results demonstrate the robustness and adaptability of our method.

ground getup one

ground getup two

ground getup three

ground getup four

step-terrain getup five

step-terrain getup six

step-terrain getup seven

slope-terrain getup eight

slope-terrain getup nine

slope-terrain getup ten

outdoor getup eleven

outdoor getup twelve

outdoor getup thirteen

outdoor getup fourteen
