Audit of 100 LeRobot Datasets Finds 81% Flawed or Unloadable
A developer audited 100 publicly available LeRobotDataset repositories on the Hugging Face Hub and found that 81% either contained data errors or could not be linted at all. Of the datasets that did load successfully, nearly 19% suffered from a known migration bug where episode-to-frame index boundaries were corrupted during a v2.1-to-v3.0 conversion, causing frames to be silently assigned to the wrong episode during training. A separate floating-point timestamp drift issue, which can cause video decoding to fail mid-training run, was found in about 3% of successfully linted datasets. To address the lack of automated quality checks, the developer released an open-source tool called trajlens that runs 16 validation checks across categories including structural integrity, timestamp consistency, and video decodability. The tool is available via pip and is designed to complete a lint pass on a 100-episode dataset in under 30 seconds, with CI-friendly output formats.
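To make the two failure modes concrete, here is a minimal Python sketch of what such checks might look like. It is not trajlens code and does not use the LeRobot API. The function names, the half-open `(start, end)` boundary representation, and the idea of comparing timestamps to an ideal `i / fps` grid are all assumptions made for illustration.

```python
from typing import List, Tuple


def check_episode_boundaries(
    boundaries: List[Tuple[int, int]], total_frames: int
) -> List[str]:
    """Report gaps, overlaps, or empty ranges in per-episode frame ranges.

    `boundaries` is assumed to hold one half-open (start, end) frame index
    range per episode, in episode order. This layout is hypothetical.
    """
    problems = []
    expected_start = 0
    for ep, (start, end) in enumerate(boundaries):
        # Each episode should begin exactly where the previous one ended.
        if start != expected_start:
            problems.append(
                f"episode {ep}: starts at {start}, expected {expected_start}"
            )
        if end <= start:
            problems.append(f"episode {ep}: empty or inverted range [{start}, {end})")
        expected_start = end
    # The last episode should end at the total frame count.
    if expected_start != total_frames:
        problems.append(
            f"boundaries end at {expected_start}, dataset has {total_frames} frames"
        )
    return problems


def max_timestamp_drift(timestamps: List[float], fps: float) -> float:
    """Largest absolute gap between recorded timestamps and an ideal i / fps grid."""
    if not timestamps:
        return 0.0
    t0 = timestamps[0]
    return max(abs((t - t0) - i / fps) for i, t in enumerate(timestamps))
```

A check like the first one would flag an off-by-one boundary before training starts, rather than letting frames be attributed to the wrong episode. The second returns a drift value that a linter could compare against a tolerance; a suitable tolerance depends on the video decoder and is not specified in the summary above.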
This is an AI-generated summary. ShortSingh links to the original source for the complete article.