smallpond.execution.task.Task

class smallpond.execution.task.Task(ctx: RuntimeContext, input_deps: List[Task], partition_infos: List[PartitionInfo], output_name: str | None = None, output_path: str | None = None, cpu_limit: int = 1, gpu_limit: float = 0, memory_limit: int | None = None)

The base class for all tasks.

Task is the basic unit of work in smallpond. Each task represents a specific operation that takes a series of input datasets and produces an output dataset. Tasks can depend on other tasks, forming a directed acyclic graph (DAG). Tasks can specify resource requirements such as CPU, GPU, and memory limits. Tasks should be idempotent, since a failed task may be retried.
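
The DAG-of-idempotent-tasks model described above can be illustrated with a minimal, self-contained sketch. The `MiniTask` class and its fields are hypothetical stand-ins for illustration, not smallpond APIs:

```python
from typing import Callable, List, Optional


class MiniTask:
    """A toy task: runs its input dependencies, then its own function."""

    def __init__(self, name: str, input_deps: List["MiniTask"],
                 fn: Callable[[list], object]):
        self.name = name
        self.input_deps = input_deps  # edges of the DAG
        self.fn = fn
        self.output: Optional[object] = None
        self.done = False

    def run(self, max_retries: int = 3) -> object:
        if self.done:
            # Idempotent: re-running a finished task just returns its output.
            return self.output
        # Resolve dependencies first (depth-first over the DAG).
        inputs = [dep.run(max_retries) for dep in self.input_deps]
        # Retry on failure -- safe only because tasks are idempotent.
        for attempt in range(max_retries):
            try:
                self.output = self.fn(inputs)
                self.done = True
                return self.output
            except Exception:
                if attempt == max_retries - 1:
                    raise
```

For example, a `sum` task depending on a `load` task produces its output by first running `load`, and calling `run()` twice does not recompute anything.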

Lifetime of a task object:

  • instantiated at planning time on the scheduler node

  • pickled and sent to a worker node

  • initialize() is called to prepare for execution

  • run() is called to execute the task

  • finalize() or cleanup() is called to release resources

  • pickled and sent back to the scheduler node
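
A worker-side driver for the lifecycle above might look like the following sketch. This is an assumption about the control flow implied by the method docstrings (finalize() on success, cleanup() even on exception), not smallpond's actual scheduler code:

```python
class LifecycleTask:
    """Hypothetical task recording which lifecycle hooks were called."""

    def __init__(self):
        self.log = []

    def initialize(self):
        self.log.append("initialize")   # prepare for execution

    def run(self):
        self.log.append("run")          # do the actual work

    def finalize(self):
        self.log.append("finalize")     # called after run() on success

    def cleanup(self):
        self.log.append("cleanup")      # called even if run() raised


def execute(task):
    """Drive one task through its lifecycle on a worker node."""
    task.initialize()
    try:
        task.run()
        task.finalize()   # reached only if run() did not raise
    finally:
        task.cleanup()    # always runs, matching the cleanup() docstring
    return task
```

In the real system the task object would be unpickled before `execute` and pickled again afterwards to return results to the scheduler.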

__init__(ctx: RuntimeContext, input_deps: List[Task], partition_infos: List[PartitionInfo], output_name: str | None = None, output_path: str | None = None, cpu_limit: int = 1, gpu_limit: float = 0, memory_limit: int | None = None) → None

Methods

__init__(ctx, input_deps, partition_infos[, ...])

add_elapsed_time([metric_name])

Start or stop the timer.

adjust_row_group_size(nbytes, num_rows[, ...])

clean_complex_attrs()

clean_output([force])

cleanup()

Called after run() even if there is an exception.

compute_avg_row_size(nbytes, num_rows)

dump()

exec([cq])

finalize()

Called after run() to finalize the processing.

get_partition_info(dimension)

initialize()

Called before run() to prepare for running the task.

inject_fault()

merge_metrics(metrics)

oom([nonzero_exitcode_as_oom])

parquet_kv_metadata_bytes([extra_partitions])

parquet_kv_metadata_str([extra_partitions])

random_float()

random_uint32()

run()

run_on_ray()

Run the task on Ray.

set_memory_limit(soft_limit, hard_limit)

Attributes

ctx

id

node_id

sched_epoch

output_name

output_root

dataset

input_deps

output_deps

perf_metrics

perf_profile

runtime_state

input_datasets

allow_speculative_exec

Whether the task is allowed to be executed speculatively.

any_input_empty

cpu_limit

default_output_name

elapsed_time

exception

exec_cq

exec_id

exec_on_scheduler

fail_count

final_output_abspath

finish_time

gpu_limit

key

local_gpu

Return the first GPU granted to this task.

local_gpu_ranks

Return all GPU ranks granted to this task.

local_rank

Return the first GPU rank granted to this task.

location

memory_limit

numa_node

numpy_random_gen

output

output_dirname

output_filename

partition_dims

partition_infos

partition_infos_as_dict

python_random_gen

random_seed_bytes

ray_dataset_path

The path of a pickle file that contains the output dataset of the task.

ray_marker_path

The path of an empty file used to determine whether the task has been started.

retry_count

runtime_id

runtime_output_abspath

Output data will be produced in this directory at runtime.

self_contained_output

Whether the output of this node is independent of all input nodes.

skip_when_any_input_empty

staging_root

If the task has a special output directory, its runtime output directory will be under it.

start_time

status

temp_abspath

temp_output

Whether the output of this node is stored in a temporary directory.

uniform_failure_prob