For robots to operate in general environments like households, they must be able to perform non-prehensile manipulation actions such as toppling and rolling to manipulate ungraspable objects. However, prior works on non-prehensile manipulation cannot yet generalize across environments with diverse geometries. The main challenge lies in adapting to varying environmental constraints: within a cabinet, the robot must avoid walls and ceilings; to lift objects to the top of a step, the robot must account for the step's pose and extent. While deep reinforcement learning (RL) has demonstrated impressive success in non-prehensile manipulation, accounting for such variability presents a challenge for the generalist policy, as it must learn diverse strategies for each new combination of constraints. To address this, we propose a modular and reconfigurable architecture that adaptively reconfigures network modules based on task requirements. To capture the geometric variability in environments, we extend the contact-based object representation (CORN) to environment geometries, and propose a procedural algorithm for generating diverse environments to train our agent. Taken together, the resulting policy can zero-shot transfer to novel real-world environments and objects despite training entirely within a simulator. We additionally release a simulation-based benchmark featuring nine digital twins of real-world scenes with 353 objects to facilitate non-prehensile manipulation research in realistic domains.
Our framework consists of four main components: a modular policy network architecture trained via RL, contact-based representation pre-training, a procedural domain generation scheme for environment geometries, and a simulation-to-real transfer method for real-world deployment.
HAMNet is a hierarchical architecture in which the upper-level modulation network dynamically constructs the parameters of the lower-level base network. Given the input context (robot, object, scene, and goal), the modulation network (green) produces the base network's parameters θ by composing its modules and applying feature-wise gating. Conditioned on θ, the base network (blue) maps the state inputs and object geometry to actions and values.
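To make this modulation-then-execution pattern concrete, below is a minimal PyTorch sketch of a hypernetwork-style setup in the same spirit: a modulation network emits mixing weights over a bank of per-layer modules plus feature-wise gates, and the base network runs with the composed parameters. The class names, module count, convex-combination composition, and sigmoid gating here are illustrative assumptions, not the released HAMNet architecture.

```python
import torch
import torch.nn as nn


class ModulationNetwork(nn.Module):
    """Maps the task context to per-layer module-mixing weights and feature gates."""

    def __init__(self, ctx_dim: int, num_modules: int, num_layers: int, hidden: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(ctx_dim, hidden), nn.ReLU())
        # One set of module-mixing weights and one feature-wise gate per base layer.
        self.mix_head = nn.Linear(hidden, num_layers * num_modules)
        self.gate_head = nn.Linear(hidden, num_layers * hidden)
        self.num_modules, self.num_layers, self.hidden = num_modules, num_layers, hidden

    def forward(self, ctx: torch.Tensor):
        z = self.encoder(ctx)  # modulation embedding z_T
        mix = self.mix_head(z).view(-1, self.num_layers, self.num_modules).softmax(-1)
        gate = torch.sigmoid(self.gate_head(z)).view(-1, self.num_layers, self.hidden)
        return z, mix, gate


class BaseNetwork(nn.Module):
    """Executes with layer parameters composed from a bank of module weights."""

    def __init__(self, in_dim: int, num_modules: int, num_layers: int, hidden: int, out_dim: int):
        super().__init__()
        dims = [in_dim] + [hidden] * num_layers
        # Module bank: num_modules candidate weight matrices per layer.
        self.banks = nn.ParameterList([
            nn.Parameter(torch.randn(num_modules, dims[i + 1], dims[i]) * 0.02)
            for i in range(num_layers)
        ])
        self.head = nn.Linear(hidden, out_dim)

    def forward(self, x: torch.Tensor, mix: torch.Tensor, gate: torch.Tensor):
        for i, bank in enumerate(self.banks):
            # Compose layer weights as a convex combination of modules, then gate features.
            w = torch.einsum('bm,moi->boi', mix[:, i], bank)
            x = torch.relu(torch.einsum('boi,bi->bo', w, x)) * gate[:, i]
        return self.head(x)


# Toy usage with placeholder dimensions.
mod = ModulationNetwork(ctx_dim=64, num_modules=8, num_layers=3, hidden=256)
base = BaseNetwork(in_dim=32, num_modules=8, num_layers=3, hidden=256, out_dim=7)
z_t, mix, gate = mod(torch.rand(4, 64))
action = base(torch.rand(4, 32), mix, gate)  # (4, 7)
```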
UniCORN is a geometry representation model that learns both local and global embeddings of geometries. During policy training, we use UniCORN as a lightweight and generalizable representation of object and environment geometries. During pretraining, UniCORN is used in a Siamese fashion to embed two point clouds A and B. Afterward, the contact decoder predicts contact between each local patch of A and the global embedding of B. The bottom block illustrates the patchification and tokenization process.
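Below is a rough sketch of this Siamese pretraining setup. The patch grouping (random centers plus nearest neighbors), tokenizer, transformer mixer, mean-pooled global embedding, and the names `patchify`, `GeometryEncoder`, and `ContactDecoder` are placeholders we assume for illustration; they are not the released UniCORN implementation.

```python
import torch
import torch.nn as nn


def patchify(points: torch.Tensor, num_patches: int, patch_size: int):
    """Group a point cloud (B, N, 3) into local patches.

    Placeholder grouping: random centers + nearest neighbors stand in for the
    real patchification."""
    idx = torch.randint(points.shape[1], (points.shape[0], num_patches))
    centers = torch.gather(points, 1, idx[..., None].expand(-1, -1, 3))
    d = torch.cdist(centers, points)                       # (B, P, N)
    nn_idx = d.topk(patch_size, largest=False).indices     # (B, P, K)
    patches = torch.gather(points[:, None].expand(-1, num_patches, -1, -1), 2,
                           nn_idx[..., None].expand(-1, -1, -1, 3))
    return patches - centers[:, :, None], centers          # local coords + centers


class GeometryEncoder(nn.Module):
    """Tokenizes patches and produces local (per-patch) and global embeddings."""

    def __init__(self, dim: int = 128, patch_size: int = 32):
        super().__init__()
        self.tokenizer = nn.Sequential(
            nn.Linear(patch_size * 3 + 3, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.mixer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, 4, batch_first=True), 2)

    def forward(self, patches, centers):
        tok = self.tokenizer(torch.cat([patches.flatten(2), centers], -1))
        local = self.mixer(tok)                 # (B, P, dim) local patch embeddings
        return local, local.mean(1)             # global embedding via pooling


class ContactDecoder(nn.Module):
    """Predicts, for each patch of A, the probability of contact with B."""

    def __init__(self, dim: int = 128):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, local_a, global_b):
        g = global_b[:, None].expand_as(local_a)
        return self.head(torch.cat([local_a, g], -1)).squeeze(-1)  # (B, P) logits


# Siamese pretraining step: the same encoder embeds both point clouds.
encoder, decoder = GeometryEncoder(), ContactDecoder()
cloud_a, cloud_b = torch.rand(2, 512, 3), torch.rand(2, 512, 3)
loc_a, _ = encoder(*patchify(cloud_a, 16, 32))
_, glob_b = encoder(*patchify(cloud_b, 16, 32))
logits = decoder(loc_a, glob_b)  # supervise with per-patch contact labels
```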
We leverage procedural generation to construct diverse environment geometries during our simulation-based policy training. Our pipeline composes different environmental factors, such as walls, ceilings, and plates at different elevations for each axis, to construct geometrically diverse environments. This results in scenes that include real-world-like structures, such as cabinets, baskets, sinks, valleys, countertops, and steps.
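A minimal sketch of what such a compositional sampler might look like, assuming axis-aligned cuboid primitives; the factor probabilities, ranges, and the `Box`/`sample_environment` names are illustrative and not the paper's actual generation parameters.

```python
import random
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned cuboid: center (x, y, z) and full extents (dx, dy, dz)."""
    center: tuple
    size: tuple


def sample_environment(workspace=(0.6, 0.8, 0.5), wall_thickness=0.02):
    """Compose one scene from independent environmental factors.

    Each factor (floor plates at different elevations, side walls, a ceiling)
    is sampled independently; their combinations yield cabinet-, sink-,
    basket-, and step-like geometries.
    """
    wx, wy, wz = workspace
    boxes = []

    # Split the floor along x into two plates with independent elevations (steps / valleys).
    split = random.uniform(0.3, 0.7) * wx
    for x0, x1 in [(0.0, split), (split, wx)]:
        height = random.choice([0.0, random.uniform(0.05, 0.2)])
        boxes.append(Box(((x0 + x1) / 2, wy / 2, height / 2),
                         (x1 - x0, wy, max(height, wall_thickness))))

    # Optionally add side walls (baskets, sinks, cabinets).
    for sign in (-1, 1):
        if random.random() < 0.5:
            boxes.append(Box((wx / 2, wy / 2 + sign * wy / 2, wz / 2),
                             (wx, wall_thickness, wz)))

    # Optionally add a ceiling (cabinet-like scenes).
    if random.random() < 0.3:
        boxes.append(Box((wx / 2, wy / 2, random.uniform(0.3, wz)),
                         (wx, wy, wall_thickness)))

    return boxes


if __name__ == "__main__":
    for box in sample_environment():
        print(box)
```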
In simulation, we find that the joint use of our representation (UniCORN) and architecture (HAMNet) results in a policy that achieves the highest performance while also training the fastest. Overall, training takes only around 2 days on a single NVIDIA RTX 4090 GPU to reach a 75% success rate.
| Domain | Object | Success rate | Domain | Object | Success rate |
|---|---|---|---|---|---|
| Cabinet | Bulldozer | 4/5 | Top of cabinet | Bulldozer | 3/5 |
| | Heart-Box | 3/5 | | Crab | 4/5 |
| Sink | Bulldozer | 5/5 | Basket | Bulldozer | 3/5 |
| | Angled Cup | 4/5 | | Heart-Box | 5/5 |
| Drawer | Bulldozer | 4/5 | Grill | Bulldozer | 5/5 |
| | Pencil case | 3/5 | | Dino | 4/5 |
| Circular bin | Bulldozer | 4/5 | Flat | Bulldozer | 5/5 |
| | Pineapple | 3/5 | | Nutella | 3/5 |
| Suitcase | Bulldozer | 4/5 | **Total** | | 78.9% |
| | Candy Jar | 5/5 | | | |
Our real-world tests validate that our framework zero-shot transfers to unseen real-world dynamics across novel objects and environment geometries. Across 9 different real-world environments featuring factors such as sinks, ceilings, and walls, our robot achieves an overall success rate of 78.9%, despite training solely in synthetic simulated environments.
After training our policy, we analyzed the latent space of the modulation embedding zT via UMAP. We found emergent patterns in how HAMNet determines network configurations: the embeddings form distinct clusters (found by HDBSCAN) that correspond to semantically meaningful manipulation skills, such as lifting, reorientation, reaching, and dropping. Moreover, even within a particular cluster, HAMNet implements fine-grained behavioral nuances, where the spatial layout of the cluster reflects the robot's intended lifting direction or reorientation axis. For a more holistic view of the latent space of zT, also see the overview image.
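For reference, this analysis can be reproduced in spirit with off-the-shelf tooling. The snippet below assumes logged zT embeddings from rollouts and illustrative hyperparameters (neighborhood size, minimum cluster size); whether clustering runs on the raw embeddings or the 2-D projection is an assumption we make here.

```python
import numpy as np
import umap      # pip install umap-learn
import hdbscan   # pip install hdbscan

# z_t: modulation embeddings collected from policy rollouts, shape (num_steps, dim).
# Random data stands in here; in practice these come from logged rollouts.
z_t = np.random.randn(5000, 128).astype(np.float32)

# 2-D projection of the latent space for visualization.
embedding_2d = umap.UMAP(n_neighbors=30, min_dist=0.1).fit_transform(z_t)

# Density-based clustering; a label of -1 marks noise points.
labels = hdbscan.HDBSCAN(min_cluster_size=50).fit_predict(embedding_2d)
print("clusters found:", labels.max() + 1)
```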
We find that distinct co-activation patterns of modules map to distinct manipulation skills. Here, we visualize 9 representative skills and their corresponding module activation maps, where rows denote layers and columns denote modules, with cell values showing activation strength.
We visualize the progression of zT over the course of an episode by colorizing zT by its nearest cluster label. We find that HAMNet selects task-relevant skills in a temporally consistent manner, and switches to context-appropriate skills as the task progresses. For instance, in the second video, after the dropping skill succeeds, the policy switches to reorientation and then translation to align the object to the goal.
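A small sketch of this colorization step, assuming per-cluster centroids have been precomputed from the clustering above; the helper name and the Euclidean nearest-centroid rule are our assumptions.

```python
import numpy as np


def nearest_cluster_labels(z_episode: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Assign each per-step z_T in an episode (T, dim) to its nearest cluster
    centroid (K, dim) by Euclidean distance; the labels index a color per skill."""
    d = np.linalg.norm(z_episode[:, None, :] - centroids[None, :, :], axis=-1)
    return d.argmin(axis=1)
```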
```bibtex
@inproceedings{cho2025hamnet,
  title     = {{Hierarchical and Modular Network on Non-prehensile Manipulation in General Environments}},
  author    = {Cho, Yoonyoung and Han, Junhyek and Han, Jisu and Kim, Beomjoon},
  booktitle = {Robotics: Science and Systems (RSS)},
  year      = {2025}
}
```