Flat'n'Fold: A Diverse Multi-Modal Dataset for Garment Perception and Manipulation

University of Glasgow


Abstract


We present Flat'n'Fold, a novel large-scale dataset for garment manipulation that addresses critical gaps in existing datasets. Comprising 1,212 human and 887 robot demonstrations of flattening and folding 44 unique garments across 8 categories, Flat'n'Fold surpasses prior datasets in size, scope, and diversity.

Our dataset uniquely captures the entire manipulation process from crumpled to folded states, providing synchronized multi-view RGB-D images, point clouds, and action data, including hand or gripper positions and rotations.

We quantify the dataset's diversity and complexity relative to existing benchmarks and show that Flat'n'Fold contains natural and diverse real-world human and robot demonstrations in terms of both visual and action information.

To showcase Flat'n'Fold's utility, we establish new benchmarks for grasping point prediction and subtask decomposition. Our evaluation of state-of-the-art models on these tasks reveals significant room for improvement. This underscores Flat'n'Fold's potential to drive advances in robotic perception and manipulation of deformable objects.

Hardware


Hardware Setup. (1) Front camera; (2) Top camera; (3) Side camera; (4) Steam Index VR headset, which serves as the origin of the world frame; (5) HTC Vive tracker; (6) Receiver of the tracker; (7) Pedal; (8) Grasping point, where the yellow line indicates the distance from the center of the tracker to the grasping point; (9) Baxter's gripper, with (A) the gripper in its closed state and (B) in its open state; (10) Baxter's zero-G mode and control buttons. Black numbers indicate hardware used in both human and robot demonstrations; red numbers indicate hardware used only for robot demonstrations; green numbers indicate hardware used only for human demonstrations.
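As a side note, the grasp point can be recovered from the tracker pose by applying a fixed offset expressed in the tracker's local frame. The sketch below assumes a known offset vector and a quaternion in [x, y, z, w] order; the actual offset and conventions used during collection are not specified here.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def grasp_point_from_tracker(tracker_pos, tracker_quat, local_offset):
    """Transform a fixed offset in the tracker's local frame into world coordinates.

    tracker_pos:  (3,) tracker position in the world frame (VR headset origin).
    tracker_quat: (4,) tracker orientation as [x, y, z, w] (assumed convention).
    local_offset: (3,) offset from the tracker center to the grasp point,
                  expressed in the tracker frame (illustrative value).
    """
    rot = R.from_quat(tracker_quat)              # tracker orientation in the world frame
    return np.asarray(tracker_pos) + rot.apply(local_offset)
```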

Dataset Overview

Overview of the experimental setup

Comparison to other datasets


* indicates that only the subset related to deformable objects (or garments) is considered. 'Agent' indicates whether the demonstrations are performed by humans or robots; 'Hum. Action' indicates whether human action data is recorded during human demonstrations; 'Ann.' denotes extra annotations. Flat'n'Fold has a clear advantage in terms of data volume, diversity, and recorded modalities.

Experiments

We first compare the diversity and complexity of Flat'n'Fold to existing datasets. Then, we define two benchmarks for evaluating grasp prediction and sub-task decomposition.

Quantifying the diversity of the dataset

Action Diversity

Vision Diversity

Action information: Complexity is calculated by averaging the variance of positions and rotations over time within each action sequence. Action sequence diversity is measured by uniformly sampling 300 time ticks, calculating variance at each point, and averaging across the sequence.
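A minimal sketch of these two statistics, assuming each demonstration is a NumPy array of per-timestep positions and quaternions. Treating "variance at each point" as variance across demonstrations at each resampled tick is our reading of the description above, and the component-wise quaternion variance is a simplification.

```python
import numpy as np

def action_complexity(seq):
    """Average variance of positions and rotations over time for one sequence.

    seq: (T, 7) array of [x, y, z, qx, qy, qz, qw] per timestep (illustrative layout).
    """
    pos_var = seq[:, :3].var(axis=0).mean()   # per-axis position variance over time
    rot_var = seq[:, 3:].var(axis=0).mean()   # per-component quaternion variance over time
    return (pos_var + rot_var) / 2.0

def action_diversity(sequences, n_ticks=300):
    """Diversity across demonstrations: resample each sequence to n_ticks,
    take the variance across demonstrations at each tick, then average."""
    resampled = []
    for seq in sequences:
        t_old = np.linspace(0.0, 1.0, len(seq))
        t_new = np.linspace(0.0, 1.0, n_ticks)
        # Linearly interpolate each channel onto a common time grid
        resampled.append(np.stack(
            [np.interp(t_new, t_old, seq[:, c]) for c in range(seq.shape[1])], axis=1))
    stacked = np.stack(resampled)              # (N, n_ticks, C)
    return stacked.var(axis=0).mean()          # variance over demos, averaged over ticks/channels
```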

Visual information: Features are extracted from each video using a pre-trained I3D model (Carreira & Zisserman, 2017), and the global standard deviation is calculated to measure diversity.
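A correspondingly simple sketch of the visual measure, assuming one I3D feature vector per demonstration video has already been extracted. Averaging the per-dimension standard deviations into a single scalar is one reasonable interpretation of "global standard deviation".

```python
import numpy as np

def visual_diversity(video_features):
    """Global standard deviation of per-video I3D features.

    video_features: (N, D) array with one feature vector per demonstration video,
    e.g. extracted beforehand with a pre-trained I3D model.
    """
    feats = np.asarray(video_features)
    # Standard deviation of each feature dimension across videos, averaged into one scalar
    return feats.std(axis=0).mean()
```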

Flat'n'Fold provides diverse visual and action data, showcasing its applicability in garment perception and manipulation tasks.

Grasping Point Prediction Benchmark

Human demonstration


Ground truth (red hand), Point-BERT prediction (pink hand), and PointNet++ prediction (yellow hand). On human demonstrations, the models show lower classification accuracy, higher position errors, and lower rotation errors than on robot demonstrations.

Robot demonstration


Ground truth (red gripper), Point-BERT prediction (pink gripper), and PointNet++ prediction (yellow gripper). On robot demonstrations, the models show higher classification accuracy, lower position errors, and higher rotation errors than on human demonstrations.

Quantitative Results for Grasping Point Prediction


We created a sub-dataset of 6,329 human and 5,574 robot annotated point clouds for grasp prediction. Metrics are classification accuracy (left vs. right hand), L1 error for positions, and geodesic error for rotations. As baselines, we use PointNet++ and Point-BERT, each extended with two fully-connected layers that predict hand position (L1 loss), rotation quaternion (geodesic loss), and hand classification (cross-entropy loss). Results show that more training data improves all metrics, but even with the full dataset, both methods still struggle to predict grasps accurately.
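A minimal PyTorch sketch of this kind of prediction head and its losses. The backbone feature dimension, hidden size, equal loss weighting, and the exact geodesic formula are illustrative assumptions rather than the configuration used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraspHead(nn.Module):
    """Two fully-connected layers on top of a point-cloud backbone feature,
    predicting position (3), rotation quaternion (4), and left/right hand logits (2)."""
    def __init__(self, feat_dim=1024, hidden=256):
        super().__init__()
        self.fc1 = nn.Linear(feat_dim, hidden)
        self.fc2 = nn.Linear(hidden, 3 + 4 + 2)

    def forward(self, feat):
        out = self.fc2(F.relu(self.fc1(feat)))
        pos, quat, logits = out[:, :3], out[:, 3:7], out[:, 7:]
        quat = F.normalize(quat, dim=-1)          # project onto unit quaternions
        return pos, quat, logits

def geodesic_quat_loss(q_pred, q_gt, eps=1e-7):
    """Geodesic angle between predicted and ground-truth unit quaternions."""
    dot = (q_pred * q_gt).sum(dim=-1).abs().clamp(-1 + eps, 1 - eps)
    return (2.0 * torch.acos(dot)).mean()

def grasp_loss(pos, quat, logits, pos_gt, quat_gt, hand_gt):
    # Equal weighting of the three terms is an assumption
    return (F.l1_loss(pos, pos_gt)
            + geodesic_quat_loss(quat, quat_gt)
            + F.cross_entropy(logits, hand_gt))
```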

Automated Subtask Decomposition Benchmark

Human demonstration example


Robot demonstration example


Quantitative Results for UVD


We define 'pick' and 'place' actions as subtask boundaries: during the flattening phase only 'pick' actions mark boundaries, while during the folding phase both 'pick' and 'place' actions do. We evaluate with precision, recall, and F1 score. Using the unsupervised subtask decomposition method UVD (Zhang et al., 2024) as a baseline, results show high precision but lower recall, indicating missed subtask boundaries. UVD performs better on human demonstrations than on robot demonstrations, and is less effective on the more varied flattening phase than on the folding phase.
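A minimal sketch of how boundary-level precision, recall, and F1 could be computed from predicted and ground-truth subtask boundaries given as frame indices. The greedy matching and the tolerance window are assumptions; the paper's exact matching protocol may differ.

```python
def boundary_prf(pred, gt, tol=10):
    """Precision/recall/F1 for predicted subtask boundaries (frame indices).

    A prediction counts as a true positive if it lies within `tol` frames of an
    as-yet-unmatched ground-truth boundary (tolerance value is illustrative).
    """
    gt = sorted(gt)
    matched = set()
    tp = 0
    for p in sorted(pred):
        # Greedily match to the nearest unmatched ground-truth boundary within tolerance
        candidates = [(abs(p - g), i) for i, g in enumerate(gt)
                      if i not in matched and abs(p - g) <= tol]
        if candidates:
            matched.add(min(candidates)[1])
            tp += 1
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gt) if gt else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```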

Dataset download

Some samples of the dataset can be downloaded at: Dataset Download. Because of the dataset's large size, it will take some time to upload all the data. We will publish the full dataset as soon as possible.

Acknowledgements

We want to thank Zhuo He, Tanatta Chaichakan, and the Computer Vision and Autonomous Systems (CVAS) research group for insightful discussions and for participating in the data collection for this work.

BibTeX


@misc{zhuang2024flatnfolddiversemultimodaldataset,
  title={Flat'n'Fold: A Diverse Multi-Modal Dataset for Garment Perception and Manipulation},
  author={Lipeng Zhuang and Shiyu Fan and Yingdong Ru and Florent Audonnet and Paul Henderson and Gerardo Aragon-Camarasa},
  year={2024},
  eprint={2409.18297},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2409.18297}
}