Overview Video (Best with 🔊)

Abstract

Achieving human-like dexterity is a longstanding challenge in robotics, in part due to the complexity of planning and control for contact-rich systems. In reinforcement learning (RL), one popular approach has been to use massively parallelized, domain-randomized simulations to learn a policy offline over a vast array of contact conditions, allowing robust sim-to-real transfer. Inspired by recent advances in real-time parallel simulation, this work considers instead the viability of online planning methods for contact-rich manipulation by studying the well-known in-hand cube reorientation task. We propose a simple architecture that employs a sampling-based predictive controller and vision-based pose estimator to search for contact-rich control actions online. We conduct thorough experiments to assess the real-world performance of our method, architectural design choices, and key factors for robustness, demonstrating that our simple sampling-based approach achieves performance comparable to prior RL-based works.

Assorted Contact-Rich Maneuvers

We show a variety of contact-rich maneuvers that DROP can achieve. Some are dynamic and fast; others are measured and slow. Overall, we observe that the motions discovered by a simple sampling-based planner can be quite sophisticated, involving contact with all parts of the hand. To reach a goal, the cube must often be rotated through a complicated composition of moves, which demonstrates the expressiveness of sampling-based predictive controllers.

Curated Full Runs

The following runs all use the CEM sampler, corresponding to the last row of Table I(B). As in the prior work "DeXtreme: Transfer of Agile In-hand Manipulation from Simulation to Reality," we consider a goal rotation reached when the cube is within 0.4 radians of it. To ensure sufficient difficulty, each new goal rotation is sampled uniformly at random over SO(3), subject to being at least 90 degrees away from the previous goal.
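The goal criterion and goal sampling described above can be sketched as follows. This is a minimal illustration, not the authors' code: the quaternion convention (w, x, y, z), the function names, and the use of rejection sampling to enforce the 90-degree separation are all assumptions.

```python
import numpy as np

GOAL_TOL_RAD = 0.4            # success threshold stated in the text
MIN_GOAL_SEP_RAD = np.pi / 2  # new goals lie at least 90 degrees from the last

def random_unit_quat(rng):
    """Sample a rotation uniformly over SO(3) as a unit quaternion (w, x, y, z)."""
    q = rng.standard_normal(4)  # a normalized 4-D Gaussian is uniform on the 3-sphere
    return q / np.linalg.norm(q)

def rotation_distance(q1, q2):
    """Geodesic angle in radians between two unit quaternions."""
    dot = abs(float(np.dot(q1, q2)))  # abs() handles the q / -q double cover
    return 2.0 * np.arccos(min(dot, 1.0))

def goal_reached(q_cube, q_goal, tol=GOAL_TOL_RAD):
    """True when the cube orientation is within tol radians of the goal."""
    return rotation_distance(q_cube, q_goal) < tol

def sample_next_goal(q_prev, rng, min_sep=MIN_GOAL_SEP_RAD):
    """Rejection-sample a uniform goal at least min_sep away from q_prev."""
    while True:
        q = random_unit_quat(rng)
        if rotation_distance(q, q_prev) >= min_sep:
            return q
```

Rejection sampling keeps the accepted goals uniform over the allowed region, and most uniform draws on SO(3) are already more than 90 degrees from any fixed rotation, so the loop terminates quickly.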

The DROP Architecture

Demo Task

Our architecture has two parts.


A sampling-based predictive controller. The controller rolls out forward simulations of the system in parallel to compute an open-loop control spline, then repeatedly replans in closed loop while executing those motions. We show that the simple cross-entropy method (CEM) can achieve remarkably expressive contact-rich motions on the cube rotation task.
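To make the plan-then-replan loop concrete, here is a minimal sketch of CEM-based predictive control. Everything in it is an assumption for illustration: a 1-D double integrator stands in for the parallel physics rollouts, the spline is piecewise linear over four knots, and the sample counts, elite count, and standard deviations are arbitrary rather than the paper's settings.

```python
import numpy as np

def rollout_cost(knots, state, goal, horizon=20, dt=0.05):
    """Toy stand-in for a physics rollout: a 1-D double integrator."""
    knot_times = np.linspace(0.0, horizon * dt, len(knots))
    pos, vel = state
    cost = 0.0
    for k in range(horizon):
        u = np.interp(k * dt, knot_times, knots)  # piecewise-linear control spline
        vel = vel + u * dt
        pos = pos + vel * dt
        cost += (pos - goal) ** 2 + 1e-3 * u ** 2
    return float(cost)

def cem_plan(mean, state, goal, rng, init_std=0.5, num_samples=64,
             num_elites=8, num_iters=2, min_std=0.05):
    """Refine the spline knots with a few CEM iterations."""
    mean = np.array(mean, dtype=float)
    std = np.full(mean.shape, init_std)
    for _ in range(num_iters):
        samples = mean + std * rng.standard_normal((num_samples, mean.size))
        samples[0] = mean  # keep the current plan as a candidate
        # A real controller would evaluate these rollouts in parallel threads.
        costs = np.array([rollout_cost(s, state, goal) for s in samples])
        elites = samples[np.argsort(costs)[:num_elites]]
        mean = elites.mean(axis=0)                      # refit to the elites
        std = np.maximum(elites.std(axis=0), min_std)   # avoid collapsing to zero
    return mean

def run_closed_loop(num_steps=40, dt=0.05, seed=0):
    """Receding-horizon loop: plan, apply the first control, replan."""
    rng = np.random.default_rng(seed)
    state = np.zeros(2)  # [position, velocity]
    goal = 1.0
    mean = np.zeros(4)   # four spline knots, warm-started across replans
    for _ in range(num_steps):
        mean = cem_plan(mean, state, goal, rng)
        u = mean[0]      # execute the start of the spline, then replan
        state[1] += u * dt
        state[0] += state[1] * dt
    return state
```

The sketch warm-starts each plan from the previous mean without time-shifting the knots, which keeps the code short. A fuller implementation would likely shift the spline forward by the elapsed time before resampling.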


A vision-based cube pose estimator. The estimator is composed of three parts. First, a fine-tuned ResNet predicts keypoints corresponding to the corners of the cube from RGBD images. Next, given known camera poses and a pinhole camera model, we use a factor-graph-based fixed-lag smoother to estimate the cube pose corresponding to a set of images. Finally, because the smoother may produce estimates with non-negligible hand-cube penetration, we use a corrector to find a feasible cube pose estimate. The corrector is simply another simulation that maintains its own internal cube and hand states, except that it applies a virtual wrench to its cube that attracts it toward the raw estimate from the smoother. This produces feasible cube states that resemble the smoother output.
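One plausible form for the corrector's virtual wrench is a spring-damper pulling the simulated cube toward the smoother's estimate. The sketch below assumes that form. The gains, the function names, and the (w, x, y, z) quaternion convention are hypothetical, and the paper's actual corrector may differ.

```python
import numpy as np

def quat_mul(a, b):
    """Hamilton product of two quaternions in (w, x, y, z) order."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
    ])

def rotation_error(q_target, q_current):
    """World-frame rotation vector (axis * angle) taking q_current to q_target."""
    q_conj = q_current * np.array([1.0, -1.0, -1.0, -1.0])
    q_err = quat_mul(q_target, q_conj)
    if q_err[0] < 0.0:
        q_err = -q_err  # take the shorter of the two equivalent rotations
    vec_norm = np.linalg.norm(q_err[1:])
    if vec_norm < 1e-12:
        return np.zeros(3)
    angle = 2.0 * np.arctan2(vec_norm, q_err[0])
    return angle * q_err[1:] / vec_norm

def virtual_wrench(pos, quat, lin_vel, ang_vel, pos_est, quat_est,
                   k_pos=50.0, k_rot=1.0, d_lin=1.0, d_ang=0.05):
    """Spring-damper wrench attracting the simulated cube to the raw estimate.

    The gains are hypothetical placeholders, not values from the paper.
    """
    force = k_pos * (pos_est - pos) - d_lin * lin_vel
    torque = k_rot * rotation_error(quat_est, quat) - d_ang * ang_vel
    return force, torque
```

In a setup like this, the wrench would be applied to the cube as an external force at every step of the corrector's simulation. The simulator's contact solver then resists penetration into the hand, so the corrected pose stays feasible while tracking the smoother.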


Findings

Here, we summarize a few interesting findings. For the most fine-grained quantitative details, please see the paper.


DROP performs comparably to RL. Prior works like Dactyl and DeXtreme learn robust cube rotation policies using offline RL and large-scale domain randomization. It is surprising that our simple online planning approach achieves similar performance and is robust enough to perform well in the real world.


The CEM sampler outperforms predictive sampling and iLQR. Prior work has shown that the simple predictive sampling (PS) strategy is surprisingly effective on contact-rich planning tasks. Even though CEM is only slightly more complex, we find that it substantially outperforms PS, as well as the gradient-based iLQR planner, which struggles due to stiff contact dynamics (consistent with many prior observations). We systematically evaluate the robustness of these planners in our paper, and find that CEM is far more robust to both model and estimator error, which may explain its performance.
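The difference between the two sampling strategies comes down to how the sampled rollouts update the plan. The sketch below reflects our understanding of the usual formulations, with hypothetical function names: predictive sampling keeps only the single best sample, while CEM refits a mean and standard deviation to the lowest-cost elites.

```python
import numpy as np

def predictive_sampling_update(samples, costs):
    """PS: the single lowest-cost sample becomes the new nominal plan."""
    return samples[np.argmin(costs)]

def cem_update(samples, costs, num_elites, min_std=1e-3):
    """CEM: refit the sampling distribution to the num_elites best samples."""
    elites = samples[np.argsort(costs)[:num_elites]]
    mean = elites.mean(axis=0)
    std = np.maximum(elites.std(axis=0), min_std)  # keep some exploration
    return mean, std
```

Averaging over several elites makes the CEM update less dependent on any single rollout than the PS update, which is one intuition for why it might tolerate model and estimator error better.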


Performance is sensitive to the number of threads. Unsurprisingly, adding more threads improves performance. However, in this work, we already use a server-grade CPU in order to plan with 120 threads. This motivates future work on real-time parallel contact-rich simulation on GPUs, which could massively boost performance on all contact-rich tasks.

Citation

If you found our work useful, please use the following citation.

@article{li2024_drop,
    title={DROP: Dexterous Reorientation via Online Planning},
    author={Albert H. Li and Preston Culbertson and Vince Kurtz and Aaron D. Ames},
    year={2024},
    journal={arXiv preprint arXiv:2409.14562},
    note={Available at: \url{https://arxiv.org/abs/2409.14562}},
}