Tasks

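import socket
from datetime import datetime

# create a runtime context; data_root is the job's data directory, supplied by the caller
# job_time is passed to match the RuntimeContext(job_id, job_time, data_root, *) signature
# (datetime.now() assumes job_time expects a datetime)
runtime_ctx = RuntimeContext(JobId.new(), datetime.now(), data_root)
runtime_ctx.initialize(socket.gethostname(), cleanup_root=True)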

# create a logical plan
plan = create_logical_plan()

# create an execution plan
planner = Planner(runtime_ctx)
exec_plan = planner.create_exec_plan(plan)

You can then execute the tasks in a scheduler; see Execution.

RuntimeContext

RuntimeContext(job_id, job_time, data_root, *)

The configuration and state for a running job.

JobId([hex, bytes, bytes_le, fields, int, ...])

A unique identifier for a job.

TaskId

A unique identifier for a task.

TaskRuntimeId(id, epoch, retry)

A unique identifier for a task at runtime.

PartitionInfo([index, npartitions, dimension])

Information about a partition of a dataset.

PerfStats(cnt, sum, min, max, avg, p50, p75, ...)

Performance statistics for a task.
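
To make the PerfStats field names concrete, here is a minimal sketch of how such summary statistics could be derived from raw samples. `PerfStatsSketch`, `nearest_rank`, and `summarize` are hypothetical names, not part of the library. The nearest-rank percentile rule is an assumption, and the real PerfStats may compute its percentiles differently.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PerfStatsSketch:
    """Hypothetical stand-in mirroring the first PerfStats fields listed above."""
    cnt: int
    sum: float
    min: float
    max: float
    avg: float
    p50: float
    p75: float


def nearest_rank(sorted_values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of an ascending, non-empty sequence (assumed rule)."""
    rank = max(1, math.ceil(pct / 100.0 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(samples: Sequence[float]) -> PerfStatsSketch:
    """Reduce raw samples (for example, durations in seconds) to summary statistics."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    total = sum(ordered)
    return PerfStatsSketch(
        cnt=len(ordered),
        sum=total,
        min=ordered[0],
        max=ordered[-1],
        avg=total / len(ordered),
        p50=nearest_rank(ordered, 50),
        p75=nearest_rank(ordered, 75),
    )


stats = summarize([4.0, 1.0, 3.0, 2.0])
print(stats)
```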

ExecutionPlan

ExecutionPlan(ctx, root_task, logical_plan)

A directed acyclic graph (DAG) of tasks.
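
As an illustration of what a DAG of tasks implies for execution, the sketch below models tasks as nodes that point at their inputs and walks the graph from the root so that every task appears after its dependencies. `SketchTask` and `execution_order` are hypothetical and are not the library's classes. Only the names `input_deps` and `root_task` are borrowed from the signatures on this page, and a real scheduler may order or parallelize tasks differently.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class SketchTask:
    """Hypothetical task node: a name plus the tasks whose output it consumes."""
    name: str
    input_deps: List["SketchTask"] = field(default_factory=list)


def execution_order(root_task: SketchTask) -> List[str]:
    """Return task names in dependency order using a depth-first post-order walk."""
    ordered: List[str] = []
    visited: Set[int] = set()

    def visit(task: SketchTask) -> None:
        # A shared dependency is reachable through several paths; emit it once.
        if id(task) in visited:
            return
        visited.add(id(task))
        for dep in task.input_deps:
            visit(dep)
        # All inputs have been emitted, so this task can run now.
        ordered.append(task.name)

    visit(root_task)
    return ordered


source = SketchTask("source")
left = SketchTask("left", [source])
right = SketchTask("right", [source])
root = SketchTask("root", [left, right])
print(execution_order(root))
```

The walk assumes the graph is acyclic, as the description states; it does not detect cycles.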

Tasks

Task(ctx, input_deps, partition_infos[, ...])

The base class for all tasks.

ArrowBatchTask(ctx, input_deps, partition_infos)

ArrowComputeTask(ctx, input_deps, ...[, ...])

ArrowStreamTask(ctx, input_deps, partition_infos)

DataSinkTask(ctx, input_deps, ...[, type, ...])

DataSourceTask(ctx, dataset, partition_infos)

EvenlyDistributedPartitionProducerTask(ctx, ...)

HashPartitionArrowTask(ctx, input_deps, ...)

HashPartitionDuckDbTask(ctx, input_deps, ...)

HashPartitionTask(ctx, input_deps, ...[, ...])

LoadPartitionedDataSetProducerTask(ctx, ...)

MergeDataSetsTask(ctx, input_deps, ...[, ...])

PandasBatchTask(ctx, input_deps, partition_infos)

PandasComputeTask(ctx, input_deps, ...[, ...])

PartitionConsumerTask(ctx, input_deps, ...)

PartitionProducerTask(ctx, input_deps, ...)

ProjectionTask(ctx, input_deps, ...)

PythonScriptTask(ctx, input_deps, ...[, ...])

RangePartitionTask(ctx, input_deps, ...)

RepeatPartitionProducerTask(ctx, input_deps, ...)

RootTask(ctx, input_deps, partition_infos)

SplitDataSetTask(ctx, input_deps, ...)

SqlEngineTask(ctx, input_deps, ...[, udfs, ...])

UserDefinedPartitionProducerTask(ctx, ...[, ...])
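
Several of the task names above refer to hash partitioning. The sketch below shows the general idea only: each row is routed to one of `npartitions` buckets by hashing a key column, so rows with equal keys always land in the same partition. The `hash_partition` function, the CRC32 hash, and the row-as-dict representation are all assumptions for illustration and do not describe how HashPartitionTask or its variants are implemented.

```python
import zlib
from typing import Dict, Iterable, List

Row = Dict[str, object]


def hash_partition(rows: Iterable[Row], key: str, npartitions: int) -> List[List[Row]]:
    """Split rows into npartitions lists by a stable hash of row[key]."""
    if npartitions <= 0:
        raise ValueError("npartitions must be positive")
    partitions: List[List[Row]] = [[] for _ in range(npartitions)]
    for row in rows:
        # CRC32 is stable across processes, unlike Python's built-in hash() for str.
        digest = zlib.crc32(str(row[key]).encode("utf-8"))
        partitions[digest % npartitions].append(row)
    return partitions


rows = [
    {"user": "alice", "amount": 10},
    {"user": "bob", "amount": 5},
    {"user": "alice", "amount": 7},
]
for index, part in enumerate(hash_partition(rows, "user", npartitions=2)):
    print(index, part)
```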