Internals
=========

Data Root
---------

Smallpond stores all data in a single directory called the data root.

This directory has the following structure:

.. code-block:: bash

  data_root
  └── 2024-12-11-12-00-28.2cc39990-296f-48a3-8063-78cf6dca460b  # job_time.job_id
      ├── config  # configuration and state
      │   ├── exec_plan.pickle
      │   ├── logical_plan.pickle
      │   └── runtime_ctx.pickle
      ├── log  # logs
      │   ├── graph.png
      │   └── scheduler.log
      ├── queue  # message queue between scheduler and workers
      ├── output  # output data
      ├── staging  # intermediate data
      │   ├── DataSourceTask.000001
      │   ├── EvenlyDistributedPartitionProducerTask.000002
      │   ├── completed_tasks  # output dataset of completed tasks
      │   └── started_tasks  # used for checkpoint
      └── temp  # temporary data
          ├── DataSourceTask.000001
          └── EvenlyDistributedPartitionProducerTask.000002

Failure Recovery
----------------

Smallpond can recover from failure and resume execution from the last checkpoint.
Checkpointing is done at the task level: a task that has completed is not re-executed after a restart. A few tasks, such as `ArrowBatchTask`, also support checkpointing at the batch level.
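
The idea of task-level checkpointing can be illustrated with a minimal, self-contained sketch. This is not Smallpond's implementation: the helper ``run_with_checkpoints``, the ``demo`` function, and the assumption that each finished task leaves one marker file named after the task under ``staging/completed_tasks`` are all hypothetical, chosen only to show how a restarted job can skip work that already finished.

.. code-block:: python

  import tempfile
  from pathlib import Path


  def run_with_checkpoints(tasks, completed_dir):
      """Run (name, callable) pairs in order, skipping tasks with a marker.

      Hypothetical helper: a marker file named after the task stands in
      for whatever Smallpond actually records for a completed task.
      Returns the names of the tasks executed during this call.
      """
      completed_dir = Path(completed_dir)
      completed_dir.mkdir(parents=True, exist_ok=True)
      executed = []
      for name, func in tasks:
          marker = completed_dir / name
          if marker.exists():
              continue  # finished in an earlier run, so skip it
          func()  # may raise; in that case no marker is written
          marker.touch()  # record completion only after success
          executed.append(name)
      return executed


  def demo():
      """Simulate a failed run followed by a resumed run."""
      state = {"fail": True}

      def source():
          pass  # stands in for reading the input dataset

      def partition():
          if state["fail"]:
              raise RuntimeError("simulated worker failure")

      tasks = [
          ("DataSourceTask.000001", source),
          ("EvenlyDistributedPartitionProducerTask.000002", partition),
      ]
      with tempfile.TemporaryDirectory() as root:
          completed = Path(root) / "staging" / "completed_tasks"
          try:
              run_with_checkpoints(tasks, completed)
          except RuntimeError:
              pass  # the first attempt dies in the second task
          after_failure = sorted(p.name for p in completed.iterdir())
          state["fail"] = False  # the fault is gone on the second attempt
          resumed = run_with_checkpoints(tasks, completed)
      return after_failure, resumed


  if __name__ == "__main__":
      print(demo())

In this sketch the first attempt records only ``DataSourceTask.000001`` as completed, so the resumed run executes just the partitioning task. Batch-level checkpointing, which the text mentions for `ArrowBatchTask`, would additionally need to persist progress inside a task and is not modeled here.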