Internals
Data Root
Smallpond stores all data in a single directory called the data root.
This directory has the following structure:
data_root
└── 2024-12-11-12-00-28.2cc39990-296f-48a3-8063-78cf6dca460b  # job_time.job_id
    ├── config  # configuration and state
    │   ├── exec_plan.pickle
    │   ├── logical_plan.pickle
    │   └── runtime_ctx.pickle
    ├── log  # logs
    │   ├── graph.png
    │   └── scheduler.log
    ├── queue  # message queue between scheduler and workers
    ├── output  # output data
    ├── staging  # intermediate data
    │   ├── DataSourceTask.000001
    │   ├── EvenlyDistributedPartitionProducerTask.000002
    │   ├── completed_tasks  # output dataset of completed tasks
    │   └── started_tasks  # used for checkpoint
    └── temp  # temporary data
        ├── DataSourceTask.000001
        └── EvenlyDistributedPartitionProducerTask.000002
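To make the layout concrete, the following is a minimal sketch of how a script outside Smallpond might split a job directory name into its two parts and build the paths of its subdirectories. The helpers `parse_job_dir_name` and `job_paths` are hypothetical names introduced here and are not part of Smallpond's API. The timestamp format is an assumption inferred from the example directory name above.

```python
from datetime import datetime
from pathlib import Path

# Subdirectories shown in the layout above.
JOB_SUBDIRS = ("config", "log", "queue", "output", "staging", "temp")


def parse_job_dir_name(name: str) -> tuple[datetime, str]:
    """Split '<job_time>.<job_id>' into a timestamp and an ID string."""
    job_time_str, _, job_id = name.partition(".")
    # Assumption: job_time is year-month-day-hour-minute-second,
    # as the example directory name suggests.
    job_time = datetime.strptime(job_time_str, "%Y-%m-%d-%H-%M-%S")
    return job_time, job_id


def job_paths(data_root: str, job_dir_name: str) -> dict[str, Path]:
    """Map each known subdirectory name to its path under the job directory."""
    job_dir = Path(data_root) / job_dir_name
    return {sub: job_dir / sub for sub in JOB_SUBDIRS}


name = "2024-12-11-12-00-28.2cc39990-296f-48a3-8063-78cf6dca460b"
job_time, job_id = parse_job_dir_name(name)
paths = job_paths("data_root", name)
print(job_time.isoformat(), job_id)
print(paths["staging"] / "completed_tasks")
```

The sketch only constructs paths and does not touch the file system, so it can be run without an existing data root.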
Failure Recovery
Smallpond can recover from a failure and resume execution from the last checkpoint. Checkpointing is done at the task level. A few tasks, such as ArrowBatchTask, also support checkpointing at the batch level.
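The general idea of task-level checkpointing can be illustrated with a small, generic sketch. This is not Smallpond's implementation: the function `run_with_checkpoints`, the marker-file scheme, and the `completed_markers` directory are all assumptions made for illustration, and the task names are borrowed from the directory listing only as labels. Each task leaves a marker once it succeeds, and a rerun skips every task that already has one.

```python
import tempfile
from pathlib import Path
from typing import Callable


def run_with_checkpoints(
    tasks: dict[str, Callable[[], None]], marker_dir: Path
) -> list[str]:
    """Run tasks in order, skipping any that already have a completion marker.

    Returns the names of the tasks actually executed in this call.
    """
    marker_dir.mkdir(parents=True, exist_ok=True)
    executed = []
    for name, task in tasks.items():
        marker = marker_dir / name
        if marker.exists():
            continue  # finished in an earlier run; do not repeat it
        task()
        marker.touch()  # record completion only after the task succeeds
        executed.append(name)
    return executed


attempts = {"count": 0}


def flaky() -> None:
    """Hypothetical task that fails on its first attempt only."""
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise RuntimeError("simulated failure")


tasks = {
    "DataSourceTask.000001": lambda: None,
    "EvenlyDistributedPartitionProducerTask.000002": flaky,
}

with tempfile.TemporaryDirectory() as tmp:
    markers = Path(tmp) / "completed_markers"  # hypothetical marker directory
    try:
        run_with_checkpoints(tasks, markers)
    except RuntimeError:
        pass  # the first run stops at the failing task
    resumed = run_with_checkpoints(tasks, markers)

print(resumed)
```

On the second call only the task that failed is executed again, because the first task already has its marker. Batch-level checkpointing would apply the same idea at a finer granularity, recording progress within a task instead of only at its end.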