Internals#

Data Root#

Smallpond stores all data in a single directory called data root.

This directory has the following structure:

data_root
└── 2024-12-11-12-00-28.2cc39990-296f-48a3-8063-78cf6dca460b # job_time.job_id
    ├── config  # configuration and state
       ├── exec_plan.pickle
       ├── logical_plan.pickle
       └── runtime_ctx.pickle
    ├── log     # logs
       ├── graph.png
       └── scheduler.log
    ├── queue   # message queue between scheduler and workers
    ├── output  # output data
    ├── staging # intermediate data
       ├── DataSourceTask.000001
       ├── EvenlyDistributedPartitionProducerTask.000002
       ├── completed_tasks  # output dataset of completed tasks
       └── started_tasks    # used for checkpoint
    └── temp    # temporary data
        ├── DataSourceTask.000001
        └── EvenlyDistributedPartitionProducerTask.000002

Failure Recovery#

Smallpond can recover from failure and resume execution from the last checkpoint. Checkpoint is task-level. A few tasks, such as ArrowBatchTask, support checkpointing at the batch level.