smallpond.execution.task.PandasBatchTask

class smallpond.execution.task.PandasBatchTask(ctx: RuntimeContext, input_deps: List[Task], partition_infos: List[PartitionInfo], process_func: Callable[[RuntimeContext, List[RecordBatchReader]], Iterable[Table]] | None = None, background_io_thread=True, streaming_batch_size: int = 122880, secs_checkpoint_interval: int | None = None, parquet_row_group_size: int = 122880, parquet_dictionary_encoding=False, parquet_compression='ZSTD', parquet_compression_level=3, use_duckdb_reader=False, output_name: str | None = None, output_path: str | None = None, cpu_limit: int | None = None, gpu_limit: float | None = None, memory_limit: int | None = None)
__init__(ctx: RuntimeContext, input_deps: List[Task], partition_infos: List[PartitionInfo], process_func: Callable[[RuntimeContext, List[RecordBatchReader]], Iterable[Table]] | None = None, background_io_thread=True, streaming_batch_size: int = 122880, secs_checkpoint_interval: int | None = None, parquet_row_group_size: int = 122880, parquet_dictionary_encoding=False, parquet_compression='ZSTD', parquet_compression_level=3, use_duckdb_reader=False, output_name: str | None = None, output_path: str | None = None, cpu_limit: int | None = None, gpu_limit: float | None = None, memory_limit: int | None = None) → None

Methods

__init__(ctx, input_deps, partition_infos[, ...])

add_elapsed_time([metric_name])

Start or stop the timer.

adjust_row_group_size(nbytes, num_rows[, ...])

clean_complex_attrs()

clean_output([force])

cleanup()

Called after run() even if there is an exception.

compute_avg_row_size(nbytes, num_rows)

create_input_views(conn, input_datasets[, ...])

dump()

dump_output(output_iter)

exec([cq])

exec_query(conn, query_statement[, ...])

finalize()

Called after run() to finalize the processing.

get_partition_info(dimension)

initialize()

Called before run() to prepare for running the task.

inject_fault()

merge_metrics(metrics)

oom([nonzero_exitcode_as_oom])

parquet_kv_metadata_bytes([extra_partitions])

parquet_kv_metadata_str([extra_partitions])

prepare_connection(conn)

process(runtime_ctx, input_dfs)

This method can be overridden in a subclass of ArrowStreamTask.

random_float()

random_uint32()

restore_input_state(runtime_state, input_readers)

run()

run_on_ray()

Run the task on Ray.

set_memory_limit(soft_limit, hard_limit)
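The methods `compute_avg_row_size(nbytes, num_rows)` and `adjust_row_group_size(nbytes, num_rows, ...)` are undocumented here, but the arithmetic they imply is straightforward: estimate the average serialized row size, then cap the row-group row count so one group stays within a byte budget. A stdlib-only sketch of that heuristic; the 128 MiB budget and the clamping behavior are assumptions for illustration, not smallpond's actual constants:

```python
def compute_avg_row_size(nbytes: int, num_rows: int) -> int:
    # Average serialized size of one row, floored at 1 byte to avoid
    # division by zero downstream.
    return max(1, nbytes // max(1, num_rows))


def adjust_row_group_size(nbytes: int, num_rows: int,
                          max_group_rows: int = 122880,
                          target_group_bytes: int = 128 << 20) -> int:
    # Shrink the row-group row count so that a single group stays within
    # the byte budget; never exceed the configured row cap.
    avg = compute_avg_row_size(nbytes, num_rows)
    return max(1, min(max_group_rows, target_group_bytes // avg))
```

For wide rows the byte budget dominates (fewer rows per group); for narrow rows the row cap of 122880 (the documented `parquet_row_group_size` default) is the binding limit.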

Attributes

process_func

background_io_thread

streaming_batch_size

streaming_batch_count

parquet_row_group_size

parquet_row_group_bytes

parquet_dictionary_encoding

parquet_compression

parquet_compression_level

secs_checkpoint_interval

allow_speculative_exec

Whether the task is allowed to be executed by speculative execution.

any_input_empty

compression_level_str

compression_options

compression_type_str

cpu_limit

cpu_overcommit_ratio

ctx

dataset

default_output_name

elapsed_time

enable_temp_directory

exception

exec_cq

exec_id

exec_on_scheduler

fail_count

final_output_abspath

finish_time

gpu_limit

id

input_datasets

input_deps

input_udfs

input_view_index

key

local_gpu

Return the first GPU granted to this task.

local_gpu_ranks

Return all GPU ranks granted to this task.

local_rank

Return the first GPU rank granted to this task.

location

max_batch_size

memory_limit

memory_overcommit_ratio

node_id

numa_node

numpy_random_gen

output

output_deps

output_dirname

output_filename

output_name

output_root

partition_dims

partition_infos

partition_infos_as_dict

perf_metrics

perf_profile

python_random_gen

query_udfs

rand_seed_float

rand_seed_uint32

random_seed_bytes

ray_dataset_path

The path of a pickle file that contains the output dataset of the task.

ray_marker_path

The path of an empty marker file used to determine whether the task has been started.

retry_count

runtime_id

runtime_output_abspath

Output data will be produced in this directory at runtime.

runtime_state

sched_epoch

self_contained_output

Whether the output of this node does not depend on any input nodes.

skip_when_any_input_empty

staging_root

If the task has a special output directory, its runtime output directory will be under it.

start_time

status

temp_abspath

temp_output

Whether the output of this node is stored in a temporary directory.

udfs

uniform_failure_prob