Hierarchical and Modular Network on Non-prehensile Manipulation in General Environments

KAIST

HAMNET generalizes across objects and environments with diverse geometries via a modular and reconfigurable architecture.

Abstract

For robots to operate in general environments like households, they must be able to perform non-prehensile manipulation actions such as toppling and rolling to manipulate ungraspable objects. However, prior work on non-prehensile manipulation cannot yet generalize across environments with diverse geometries. The main challenge lies in adapting to varying environmental constraints: within a cabinet, the robot must avoid walls and ceilings; to lift objects to the top of a step, the robot must account for the step's pose and extent. While deep reinforcement learning (RL) has demonstrated impressive success in non-prehensile manipulation, accounting for such variability presents a challenge for a generalist policy, as it must learn diverse strategies for each new combination of constraints. To address this, we propose a modular and reconfigurable architecture that adaptively reconfigures network modules based on task requirements. To capture the geometric variability of environments, we extend the contact-based object representation (CORN) to environment geometries, and propose a procedural algorithm for generating diverse environments to train our agent. The resulting policy can zero-shot transfer to novel real-world environments and objects despite being trained entirely in simulation. We additionally release a simulation-based benchmark featuring nine digital twins of real-world scenes with 353 objects to facilitate non-prehensile manipulation research in realistic domains.

Method

Overall Method

Our framework consists of four main components: a modular policy network architecture trained via RL, contact-based representation pre-training, a procedural domain generation scheme for environment geometries, and a simulation-to-real transfer method for real-world deployment.

Overall Architecture


Hierarchical And Modular Network (HAMNET)

[Figure: HAMNet architecture]

HAMNet is a hierarchical architecture in which an upper-level modulation network dynamically constructs the parameters of a lower-level base network. Based on the input context (robot, object, scene, and goal), the modulation network (green) maps inputs to the base network's parameters θ by composing the modules and passing them through feature-wise gating. Conditioned on θ, the base network (blue) maps the state inputs and object geometry to actions and values.
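To make this concrete, below is a minimal PyTorch sketch of the idea, not our exact implementation: each base-network layer keeps a bank of candidate weight modules, the modulation network emits context-conditioned mixture coefficients that compose them, and a feature-wise gate modulates the resulting features. All names (ModulatedLayer, HAMNetSketch) and dimensions are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ModulatedLayer(nn.Module):
    """One base-network layer whose weights are composed from a module bank.

    Hypothetical sketch: the layer holds n_modules candidate weight matrices;
    the modulation network supplies (a) mixture weights over modules and
    (b) a feature-wise gate applied to the composed output.
    """
    def __init__(self, n_modules: int, d_in: int, d_out: int):
        super().__init__()
        self.weights = nn.Parameter(torch.randn(n_modules, d_out, d_in) * 0.02)
        self.bias = nn.Parameter(torch.zeros(n_modules, d_out))

    def forward(self, x, mix, gate):
        # mix:  (B, n_modules) mixture weights over the module bank
        # gate: (B, d_out)     feature-wise gate from the modulation network
        W = torch.einsum('bm,moi->boi', mix, self.weights)  # composed weights
        b = torch.einsum('bm,mo->bo', mix, self.bias)
        h = torch.einsum('boi,bi->bo', W, x) + b
        # A real policy head would keep the final layer linear; this sketch
        # gates every layer uniformly for brevity.
        return torch.sigmoid(gate) * F.relu(h)

class HAMNetSketch(nn.Module):
    def __init__(self, d_ctx=128, d_state=64, d_hidden=256, d_act=20,
                 n_layers=3, n_modules=8):
        super().__init__()
        dims = [d_state] + [d_hidden] * (n_layers - 1) + [d_act]
        self.layers = nn.ModuleList(
            ModulatedLayer(n_modules, dims[i], dims[i + 1])
            for i in range(n_layers))
        # Upper-level modulation network: context -> per-layer (mix, gate).
        self.heads = nn.ModuleList(
            nn.Linear(d_ctx, n_modules + dims[i + 1]) for i in range(n_layers))
        self.n_modules = n_modules

    def forward(self, ctx, state):
        h = state
        for layer, head in zip(self.layers, self.heads):
            out = head(ctx)
            mix = out[:, :self.n_modules].softmax(dim=-1)
            gate = out[:, self.n_modules:]
            h = layer(h, mix, gate)
        return h  # stands in for the action/value heads of the full model

# Example: a batch of 4 contexts (d_ctx=128) and states (d_state=64).
out = HAMNetSketch()(torch.randn(4, 128), torch.randn(4, 64))  # (4, 20)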

Universal Contact-based Object Representation for Nonprehensile Manipulation (UniCORN)

[Figure: UniCORN architecture]

UniCORN is a geometry representation model that learns both local and global embeddings of geometries. During policy training, we use UniCORN as a lightweight and generalizable representation of object and environment geometries. During pretraining, UniCORN is used in a Siamese fashion to embed two point clouds A and B; the contact decoder then predicts contact between each local patch of A and object B, represented by its global embedding. The bottom block illustrates the patchification and tokenization process.
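The sketch below illustrates this pretraining setup under the same caveat: the encoder, its patchification (naive contiguous chunks here; real pipelines typically sample patch centers and group nearest neighbors), and the decoder head are simplified stand-ins, and the contact labels would come from a collision checker in simulation rather than the random placeholders used below.

import torch
import torch.nn as nn

class PatchEncoder(nn.Module):
    """Shared (Siamese) encoder: split a point cloud into P local patches,
    tokenize each patch, and also produce one global embedding.
    Hypothetical stand-in for the actual UniCORN encoder."""
    def __init__(self, n_patches=16, patch_size=32, d=128):
        super().__init__()
        self.n_patches, self.patch_size = n_patches, patch_size
        self.tokenize = nn.Sequential(nn.Linear(patch_size * 3, d), nn.ReLU(),
                                      nn.Linear(d, d))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), 2)
        self.cls = nn.Parameter(torch.zeros(1, 1, d))  # global token

    def forward(self, pts):                    # pts: (B, P*S, 3)
        B = pts.shape[0]
        patches = pts.view(B, self.n_patches, self.patch_size * 3)
        tok = self.tokenize(patches)           # (B, P, d) local patch tokens
        tok = torch.cat([self.cls.expand(B, -1, -1), tok], dim=1)
        out = self.encoder(tok)
        return out[:, 0], out[:, 1:]           # global (B, d), local (B, P, d)

class ContactDecoder(nn.Module):
    """Predicts, for each local patch of A, whether it contacts object B,
    given only B's global embedding."""
    def __init__(self, d=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, 1))

    def forward(self, local_a, global_b):      # (B, P, d), (B, d)
        g = global_b[:, None].expand(-1, local_a.shape[1], -1)
        return self.mlp(torch.cat([local_a, g], dim=-1)).squeeze(-1)  # (B, P)

# One pretraining step (real labels come from a simulator's collision checker).
enc, dec = PatchEncoder(), ContactDecoder()
cloud_a, cloud_b = torch.randn(2, 512, 3), torch.randn(2, 512, 3)
_, loc_a = enc(cloud_a)                        # Siamese: same encoder for both
glob_b, _ = enc(cloud_b)
logits = dec(loc_a, glob_b)
loss = nn.functional.binary_cross_entropy_with_logits(
    logits, torch.randint(0, 2, logits.shape).float())  # placeholder labels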

Procedural Domain Generation

We leverage procedural generation to construct diverse environment geometries during simulation-based policy training. Our pipeline composes different environmental factors, such as walls, ceilings, and plates at varying elevations along each axis, to construct geometrically diverse environments. This yields scenes containing real-world-like structures such as cabinets, baskets, sinks, valleys, countertops, and steps.
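A toy version of this compositional sampler is sketched below, with made-up ranges and probabilities rather than the paper's actual parameters: each factor (plates, walls, ceiling) is sampled independently, and their combinations yield cabinet-, basket-, sink-, or step-like scenes.

import random
from dataclasses import dataclass

@dataclass
class Box:
    """Axis-aligned cuboid: center (x, y, z) and half-extents."""
    center: tuple
    half_extents: tuple

def sample_environment(workspace=(0.6, 0.8, 0.5), thickness=0.02):
    """Independently sample optional walls, a ceiling, and support plates
    at random elevations, returning cuboid primitives for the simulator."""
    wx, wy, wz = workspace
    parts = [Box((0, 0, -thickness / 2), (wx / 2, wy / 2, thickness / 2))]  # floor

    # Plates at different elevations create steps / counters / valleys.
    for sign in (-1, 1):                      # left / right half along y
        if random.random() < 0.5:
            h = random.uniform(0.05, 0.25)    # plate elevation
            parts.append(Box((0, sign * wy / 4, h / 2), (wx / 2, wy / 4, h / 2)))

    # Side walls; their combinations yield basket-, sink-, or cabinet-like scenes.
    for axis, extent in ((0, wx), (1, wy)):
        for sign in (-1, 1):
            if random.random() < 0.5:
                c = [0.0, 0.0, wz / 2]
                c[axis] = sign * extent / 2
                he = [wx / 2, wy / 2, wz / 2]
                he[axis] = thickness / 2
                parts.append(Box(tuple(c), tuple(he)))

    # Optional ceiling bounds the reachable space from above.
    if random.random() < 0.3:
        parts.append(Box((0, 0, wz), (wx / 2, wy / 2, thickness / 2)))
    return parts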

Results

[Video: compilation of real-world experiments]

Simulation Results

[Figure: simulation results]

In simulation, we find that the joint use of our representation (UniCORN) and architecture (HAMNet) results in a policy that achieves the highest performance while also training fastest. Overall, training takes only around two days on a single NVIDIA RTX 4090 GPU to reach a 75% success rate.

Real-World Results

Domain           Object        Success rate
Cabinet          Bulldozer     4/5
                 Heart-Box     3/5
Sink             Bulldozer     5/5
                 Angled Cup    4/5
Drawer           Bulldozer     4/5
                 Pencil case   3/5
Circular bin     Bulldozer     4/5
                 Pineapple     3/5
Suitcase         Bulldozer     4/5
                 Candy Jar     5/5
Top of cabinet   Bulldozer     3/5
                 Crab          4/5
Basket           Bulldozer     3/5
                 Heart-Box     5/5
Grill            Bulldozer     5/5
                 Dino          4/5
Flat             Bulldozer     5/5
                 Nutella       3/5
Total                          71/90 (78.9%)

Our real-world tests validate that our framework zero-shot transfers to unseen real-world dynamics across novel objects and environment geometries. Across nine real-world environments featuring different factors such as sinks, ceilings, and walls, our robot achieves a 78.9% overall success rate despite being trained solely in synthetic simulated environments.

Emergent Skill Discovery

UMAP analysis reveals emergence of semantically meaningful skills

After training our policy, we analyzed the latent space of the modulation embedding z_T via UMAP. We found emergent patterns in how HAMNet determines network configurations. These patterns form distinct clusters (found by HDBSCAN), corresponding to semantically meaningful manipulation skills such as lifting, reorientation, reaching, and dropping. What's more, even within a particular cluster, we find that HAMNet implements fine-grained behavioral nuances, where the spatial layout of the cluster reflects the robot's intended lifting direction or the reorientation axis.
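This kind of analysis can be reproduced with off-the-shelf tools; below is a sketch using umap-learn and hdbscan, with illustrative hyperparameters and random data standing in for logged embeddings (we do not claim these are the exact settings used; clustering the 2-D projection rather than the original space is also purely an illustrative choice here).

import numpy as np
import umap      # pip install umap-learn
import hdbscan   # pip install hdbscan

# z_T: modulation embeddings collected from policy rollouts, shape (N, D).
# Random data stands in for logged embeddings in this sketch.
z_T = np.random.randn(10000, 64).astype(np.float32)

# 2-D UMAP projection for visualization.
proj = umap.UMAP(n_neighbors=30, min_dist=0.1).fit_transform(z_T)

# Density-based clustering; each cluster is a candidate "skill".
labels = hdbscan.HDBSCAN(min_cluster_size=200).fit_predict(proj)
print(f"{labels.max() + 1} clusters (noise points: {(labels == -1).sum()})")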

For a more holistic view of the latent space of z_T, also see the overview image.

Co-activation patterns of modules map to distinct manipulation skills

[Figure: skill-routing map]

We find that distinct co-activation patterns of modules map to distinct manipulation skills. Here, we visualize 9 representative skills and their corresponding module activation maps, where rows denote layers, columns denote modules, and color indicates activation strength.
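Such a map can be computed from logged mixture weights; a sketch, assuming per-timestep weights of shape (N, n_layers, n_modules) and the cluster labels from above (all data here is randomly generated for illustration):

import numpy as np
import matplotlib.pyplot as plt

# mix[t, l, m]: mixture weight of module m at layer l on timestep t, logged
# during rollouts; labels[t]: skill cluster id. Random stand-ins below.
N, n_layers, n_modules = 10000, 3, 8
mix = np.random.dirichlet(np.ones(n_modules), size=(N, n_layers))
labels = np.random.randint(0, 9, size=N)

fig, axes = plt.subplots(3, 3, figsize=(9, 6))
for skill, ax in enumerate(axes.flat):
    act = mix[labels == skill].mean(axis=0)     # (n_layers, n_modules)
    ax.imshow(act, cmap='viridis', vmin=0, vmax=act.max())
    ax.set_title(f'skill {skill}')
    ax.set_xlabel('module'); ax.set_ylabel('layer')
fig.tight_layout(); plt.show()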

HAMNet learns to select and sequence skills

We visualize the progression of z_T over the course of an episode by colorizing z_T by its nearest cluster label. We find that HAMNet selects task-relevant skills in a temporally consistent manner and switches to context-appropriate skills as the task progresses. For instance, in the second video, after the dropping skill succeeds, the policy switches to reorientation and then translation to align the object with the goal.
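A minimal sketch of this per-timestep labeling, assuming cluster centroids derived from the HDBSCAN clusters above (label_episode is a hypothetical helper, not part of our released code):

import numpy as np

def label_episode(z_episode, centroids):
    """Assign each timestep's modulation embedding to its nearest skill
    centroid, yielding a per-timestep skill sequence to colorize,
    e.g. drop -> reorient -> translate."""
    d = np.linalg.norm(z_episode[:, None] - centroids[None], axis=-1)  # (T, K)
    return d.argmin(axis=1)                                            # (T,)

# Example: a 300-step episode against 9 skill centroids in a 64-D space.
skills = label_episode(np.random.randn(300, 64), np.random.randn(9, 64))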

BibTeX

@inproceedings{cho2025hamnet,
  title={{Hierarchical and Modular Network on Non-prehensile Manipulation in General Environments}},
  author={Cho, Yoonyoung and Han, Junhyek and Han, Jisu and Kim, Beomjoon},
  booktitle={Robotics: Science and Systems (RSS)},
  year={2025}
}