Contact-Rich Vegetable Peeling via Diffusion Policy Learning

Team: Yuchen Yang

Supervisor: Prof. Xinjun Sheng, Prof. Yang Yu

Contact-rich manipulation covers a class of robotic interaction tasks in which precise control must be achieved through sustained physical contact with the environment. Among these tasks, vegetable peeling stands out as a representative and practical example, combining challenging fine manipulation with real-world applicability. It serves as an ideal testbed for exploring robotic strategies that rely heavily on force interaction and compliance.

While vision-based approaches have been widely adopted, their limitations in contact-rich tasks have led to growing interest in multi-modal sensing, particularly the integration of force and tactile feedback. By incorporating force information, robots can significantly enhance stability, adaptability, and success rates when dealing with tasks involving uncertain dynamics and fine surface interactions.

This project focuses on the vegetable peeling task and proposes a complete research framework spanning multi-modal data acquisition, policy learning, control execution, and performance evaluation. Specifically, we:

  • Designed a VR-based multi-modal data collection system and built a task dataset combining visual and force signals;

  • Developed CNN-based and Mamba-based diffusion policy learning models that fuse visual and force information;

  • Designed compliant and adaptive control strategies;

  • Conducted comprehensive experiments to evaluate and compare model performance.

A demonstration video of the final result is below.

1. Design of a VR Teleoperation Data Collection System and Dataset Construction

1.1 Experimental Platform

The experimental platform is the core of data acquisition: it provides comprehensive multimodal sensing and fully supports the execution of complex manipulation tasks. Its key modules are shown in Figure 1.1.

Experimental Platform

Figure 1.1: Experimental Platform

1.2 VR Teleoperation System

The overall architecture of the VR teleoperation system is shown in Figure 1.2. The system consists of four main parts: the VR headset, the cameras, the PC, and the robot. The VR headset receives images from the head camera and renders them on the scene canvas in real time to provide an immersive view, while the VR controller transmits button and pose information. The buttons include a start button, a stop button, and a grasp button. The real-time pose changes of the VR controller are mapped to the desired pose of the robot end effector to realize teleoperation (a minimal sketch of this mapping follows Figure 1.2). The saved data include RGB and depth images from the three cameras (collected at 10 Hz), low-frequency data from the dual-arm robot (collected at 6 Hz), and high-frequency data (collected at 60 Hz). The low-frequency data contain the robot's end-effector pose, joint angles, and gripper state; the high-frequency data contain the end-effector force and torque, end-effector pose, and joint angles.

VR System

Figure 1.2: VR System
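The sketch below illustrates one plausible form of the controller-to-robot pose mapping: the relative motion of the VR controller since the start button was pressed is composed with the end-effector pose captured at that moment. Poses are assumed to be given as position plus quaternion; the frame conventions, scaling, and function names are illustrative rather than the actual implementation.

```python
# Illustrative sketch of the VR-controller-to-robot pose mapping (not the actual code).
# Assumes the robot tracks the *relative* motion of the controller since the
# start button was pressed.
import numpy as np
from scipy.spatial.transform import Rotation as R


def to_matrix(pos, quat_xyzw):
    """Build a 4x4 homogeneous transform from a position and an (x, y, z, w) quaternion."""
    T = np.eye(4)
    T[:3, :3] = R.from_quat(quat_xyzw).as_matrix()
    T[:3, 3] = pos
    return T


def map_controller_to_ee(T_ctrl_now, T_ctrl_start, T_ee_start, scale=1.0):
    """Apply the controller's pose change since 'start' to the robot end effector.

    T_ctrl_now, T_ctrl_start : controller poses in the VR tracking frame
    T_ee_start               : end-effector pose captured when teleoperation started
    scale                    : optional translation scaling for finer control
    """
    # Relative motion of the controller since the start button was pressed.
    T_delta = np.linalg.inv(T_ctrl_start) @ T_ctrl_now
    T_delta[:3, 3] *= scale            # scale translation only
    # Desired end-effector pose: initial pose composed with the (scaled) delta.
    return T_ee_start @ T_delta
```

The grasp button would then toggle the gripper command, while the start and stop buttons gate whether the mapped pose is streamed to the robot.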

1.3 Dataset Construction

Based on the VR teleoperation system, data was collected for the zucchini peeling task, yielding a dataset of 100 high-quality peeling demonstrations. The structure of the raw dataset is shown in Figure 1.3. The image data include color images and depth images (collected at 30 Hz). The low-frequency data include the pose of the robot end effector, the angle of each joint, and the gripper state (collected at 6 Hz). The high-frequency data include the force sensor readings, the robot end-effector pose, and the angle of each joint (collected at 60 Hz).

Raw Dataset

Figure 1.3: Raw Dataset Structure
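Because the three streams are recorded at different rates (roughly 30 Hz, 6 Hz, and 60 Hz), training samples need the streams brought onto a common clock. The sketch below shows one straightforward way to do this by nearest-timestamp matching against the image clock; the timestamps, array layouts, and function names are assumptions for illustration, not the actual preprocessing pipeline.

```python
# Hypothetical sketch: align low- and high-frequency records to the image timestamps
# by nearest-neighbor matching. All names and layouts are illustrative.
import numpy as np


def nearest_index(src_t, query_t):
    """For each query timestamp, return the index of the closest source timestamp."""
    idx = np.searchsorted(src_t, query_t)
    idx = np.clip(idx, 1, len(src_t) - 1)
    left, right = src_t[idx - 1], src_t[idx]
    return np.where(query_t - left < right - query_t, idx - 1, idx)


def align_to_images(img_t, low_t, low_data, high_t, high_data):
    """Resample low-frequency (robot state) and high-frequency (force) data to the image clock."""
    low_aligned = low_data[nearest_index(low_t, img_t)]
    high_aligned = high_data[nearest_index(high_t, img_t)]
    return low_aligned, high_aligned
```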

For subsequent model training, point cloud reconstruction was performed from the RGB and depth images collected by the cameras. The overall process is shown in Figure 1.4.

PCD

Figure 1.4: Point Cloud Reconstruction Process
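At the core of this process is the standard pinhole back-projection of each depth pixel into 3D. A minimal sketch is given below; the intrinsic parameters (fx, fy, cx, cy) and the depth scale are placeholders, and in practice the per-camera clouds would additionally be transformed by the camera extrinsics, merged, and downsampled.

```python
# Minimal sketch of RGB-D back-projection with the pinhole camera model.
# Intrinsics and depth scale are placeholders.
import numpy as np


def depth_to_pointcloud(depth, rgb, fx, fy, cx, cy, depth_scale=1000.0):
    """depth: HxW (e.g. uint16 in mm), rgb: HxWx3; returns Nx3 points and Nx3 colors."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth.astype(np.float32) / depth_scale       # depth in meters
    valid = z > 0                                    # drop missing depth readings
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1)[valid]     # camera-frame XYZ
    colors = rgb[valid] / 255.0
    return points, colors
```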

The data is organized into an HDF5 file in the robomimic format; the overall structure is shown in Figure 1.5.

HDF5

Figure 1.5: HDF5 Dataset Structure
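As a concrete illustration of this layout, the sketch below writes a single demonstration into a robomimic-style HDF5 file with h5py. The observation keys and array shapes are hypothetical placeholders; the actual dataset may use different key names.

```python
# Sketch of packing one demonstration into a robomimic-style HDF5 layout.
# Observation keys and shapes are placeholders, not the actual dataset schema.
import h5py
import numpy as np


def write_demo(path, demo_id, obs_dict, actions):
    """obs_dict: {key: (T, ...) array}, actions: (T, action_dim) array."""
    with h5py.File(path, "a") as f:
        data = f.require_group("data")
        demo = data.create_group(f"demo_{demo_id}")
        demo.attrs["num_samples"] = actions.shape[0]
        demo.create_dataset("actions", data=actions)
        obs = demo.create_group("obs")
        for key, arr in obs_dict.items():
            obs.create_dataset(key, data=arr, compression="gzip")


# Example with placeholder shapes:
# write_demo("peeling.hdf5", 0,
#            {"pointcloud": np.zeros((120, 1024, 3), np.float32),
#             "ee_pose": np.zeros((120, 7), np.float32),
#             "force_torque": np.zeros((120, 6), np.float32)},
#            actions=np.zeros((120, 10), np.float32))
```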

2. Development and Training of Diffusion Policy Models

This study employs a CNN-based Diffusion Policy and a Mamba-based Diffusion Policy (referred to below as Diffusion-C and Diffusion-M); the overall model architecture is shown in Figure 2.1. The two models were trained separately with a batch size of 32 for 1000 epochs, and the checkpoint at epoch 800 was selected for testing.

Model

Figure 2.1: Model Architecture (Adapted from Diffusion Policy and Mamba Policy)
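At inference time, both variants follow the same diffusion-policy recipe: an action sequence is initialized as Gaussian noise and iteratively denoised conditioned on the encoded observations, with the CNN or Mamba backbone acting as the noise predictor. The sketch below shows a plain DDPM-style sampling loop; the `eps_model` interface, the noise schedule, and the number of steps are assumptions and may differ from the actual implementation.

```python
# Illustrative DDPM-style sampling loop for a diffusion policy (not the actual code).
# `eps_model(noisy_actions, t, obs_feat)` stands in for the CNN- or Mamba-based
# noise-prediction backbone conditioned on observation features.
import torch


@torch.no_grad()
def sample_actions(eps_model, obs_feat, horizon, action_dim, n_steps=100, device="cpu"):
    betas = torch.linspace(1e-4, 0.02, n_steps, device=device)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    a = torch.randn(1, horizon, action_dim, device=device)    # start from pure noise
    for t in reversed(range(n_steps)):
        eps = eps_model(a, t, obs_feat)                        # predict the injected noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bar[t])
        mean = (a - coef * eps) / torch.sqrt(alphas[t])        # DDPM posterior mean
        noise = torch.randn_like(a) if t > 0 else torch.zeros_like(a)
        a = mean + torch.sqrt(betas[t]) * noise
    return a                                                   # denoised action sequence
```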

3. Task Execution and Performance Evaluation on Vegetable Peeling

3.1 Compliant Adjustment Strategy

We design a compliant adjustment strategy that combines the pose and the force/torque outputs of the model to achieve better control. The core idea is to adjust the robot's desired pose in real time so as to reduce the error between the actual force/torque and the expected force/torque. The overall flow is shown in Figure 3.1.

Control

Figure 3.1: Compliant Adjustment Flow
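In spirit this is an admittance-like correction: the pose commanded to the robot is nudged in proportion to the wrench error. The sketch below shows a minimal proportional version; the gains, the pose parameterization, and the sign conventions are assumptions, and the actual strategy may use a different formulation.

```python
# Minimal sketch of a proportional compliant adjustment (assumed formulation).
# pose_des: [x, y, z, rx, ry, rz]; wrench_*: [Fx, Fy, Fz, Tx, Ty, Tz].
import numpy as np


def compliant_adjust(pose_des, wrench_des, wrench_meas, kp_force=1e-3, kp_torque=1e-2):
    """Nudge the desired pose so the measured wrench tracks the policy's expected wrench."""
    error = np.asarray(wrench_des, float) - np.asarray(wrench_meas, float)  # expected - actual
    adjusted = np.asarray(pose_des, dtype=float).copy()
    adjusted[:3] += kp_force * error[:3]     # translational correction from the force error
    adjusted[3:] += kp_torque * error[3:]    # rotational correction from the torque error
    return adjusted
```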

Force-following comparison experiments with and without the strategy were conducted for the Diffusion-C and Diffusion-M models, focusing on how well the actual force/torque tracks the expected force/torque once the compliant adjustment strategy is applied (Figures 3.2 and 3.3). In the figures, the curves labeled “1” show the results without the compliant adjustment strategy, and the curves labeled “2” show the results with it.

Force Comparison C

Figure 3.2: Comparison of Force/Torque Curves for Diffusion-C with and Without Compliant Adjustment Strategy

Force Comparison M

Figure 3.3: Comparison of Force/Torque Curves for Diffusion-M with and Without Compliant Adjustment Strategy

3.2 Model Performance Evaluation and Comparison

The following four quantitative indicators were designed:

  • Success Rate: the ratio of successful peeling trials to the total number of trials.
  • Peeling Length: the mean and standard deviation of the peeled length over successfully completed trials.
  • Total Movement Efficiency: 100 divided by the total number of time steps spanning the positioning and peeling phases; its mean and standard deviation are reported. The expression is:

$$ \begin{aligned} \text{motion\_efficiency} &= \frac{100}{\text{time\_step\_total}}\\ \text{time\_step\_total} &= \frac{t_{\text{locate}}+t_{\text{peel}}}{t_{\text{send}}+t_{\text{inference}}+t_{\text{move}}} \end{aligned} $$

  • Peeling Efficiency: the peeled length divided by the number of time steps in the peeling phase, i.e. the peeled length per time step; its mean and standard deviation are reported (a computation sketch follows the formulas). The expression is:

$$ \begin{aligned} \text{peeling\_efficiency} &= \frac{\text{length}}{\text{time\_step\_peel}}\\ \text{time\_step\_peel} &= \frac{t_{\text{peel}}}{t_{\text{send}}+t_{\text{inference}}+t_{\text{move}}} \end{aligned} $$
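The sketch below mirrors these definitions in code: the per-step wall time (send + inference + move) converts the locate and peel durations into step counts, from which the two efficiencies follow. All inputs are placeholders for illustration.

```python
# Sketch of computing the two efficiency indicators from logged durations,
# following the formulas above. Inputs are placeholders.
import numpy as np


def efficiencies(t_locate, t_peel, t_send, t_inference, t_move, length):
    step_duration = t_send + t_inference + t_move             # wall time per control step
    time_step_total = (t_locate + t_peel) / step_duration     # steps over locate + peel phases
    time_step_peel = t_peel / step_duration                   # steps over the peel phase
    motion_efficiency = 100.0 / time_step_total
    peeling_efficiency = length / time_step_peel              # peeled length per time step
    return motion_efficiency, peeling_efficiency


def summarize(values):
    """Mean and standard deviation over successful trials."""
    v = np.asarray(values, dtype=float)
    return v.mean(), v.std()
```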

Stage

Figure 3.4: Peeling Task Stage Division

To fully evaluate the performance of the Diffusion-C and Diffusion-M models with and without the compliant adjustment strategy, we ran 32 peeling trials for each of the four model combinations. Figure 3.5 shows a visual record of a subset of the test results.

Record

Figure 3.5: Partial Peeling Test Results Record

Table 3.1 summarizes the statistics of the four quantitative indicators. Figures 3.6, 3.7, and 3.8 show the distributions of peeling length, total motion efficiency, and peeling efficiency, respectively.

Table 3.1: Comparison of Quantitative Indicators

Indicators
Length

Figure 3.6: Distribution of Peeling Length

Motion Efficiency

Figure 3.7: Distribution of Total Motion Efficiency

Peeling Efficiency

Figure 3.8: Distribution of Peeling Efficiency

From the comparison, we draw the following conclusions:

  • Diffusion-C demonstrates more stable performance, while Diffusion-M yields more visually impressive results but suffers from reduced consistency.
  • Compliant adjustment significantly improves system stability, though it compromises the “brute-force” effects to some extent.
  • Diffusion-C is more time-efficient overall.
  • Diffusion-C produces more conservative position outputs, where incorporating force prediction helps accelerate the peeling process; in contrast, Diffusion-M generates more aggressive motions, and force prediction helps slow down the process and enhance stability.