smallpond.execution.task.ArrowComputeTask
- class smallpond.execution.task.ArrowComputeTask(ctx: RuntimeContext, input_deps: List[Task], partition_infos: List[PartitionInfo], process_func: Callable[[RuntimeContext, List[Table]], Table] | None = None, parquet_row_group_size: int = 122880, parquet_dictionary_encoding=False, parquet_compression='ZSTD', parquet_compression_level=3, use_duckdb_reader=False, output_name: str | None = None, output_path: str | None = None, cpu_limit: int | None = None, gpu_limit: float | None = None, memory_limit: int | None = None)
- __init__(ctx: RuntimeContext, input_deps: List[Task], partition_infos: List[PartitionInfo], process_func: Callable[[RuntimeContext, List[Table]], Table] | None = None, parquet_row_group_size: int = 122880, parquet_dictionary_encoding=False, parquet_compression='ZSTD', parquet_compression_level=3, use_duckdb_reader=False, output_name: str | None = None, output_path: str | None = None, cpu_limit: int | None = None, gpu_limit: float | None = None, memory_limit: int | None = None) → None
Methods

__init__(ctx, input_deps, partition_infos[, ...])
add_elapsed_time([metric_name])
    Start or stop the timer.
adjust_row_group_size(nbytes, num_rows[, ...])
clean_complex_attrs()
clean_output([force])
cleanup()
    Called after run() even if there is an exception.
compute_avg_row_size(nbytes, num_rows)
create_input_views(conn, input_datasets[, ...])
dump()
dump_output(output_table)
exec([cq])
exec_query(conn, query_statement[, ...])
finalize()
    Called after run() to finalize the processing.
get_partition_info(dimension)
initialize()
    Called before run() to prepare for running the task.
inject_fault()
merge_metrics(metrics)
oom([nonzero_exitcode_as_oom])
parquet_kv_metadata_bytes([extra_partitions])
parquet_kv_metadata_str([extra_partitions])
prepare_connection(conn)
process(runtime_ctx, input_tables)
    This method can be overridden in a subclass of ArrowComputeTask.
random_float()
random_uint32()
run()
run_on_ray()
    Run the task on Ray.
set_memory_limit(soft_limit, hard_limit)

Attributes
process_func
parquet_row_group_size
parquet_row_group_bytes
parquet_dictionary_encoding
use_duckdb_reader
allow_speculative_exec
    Whether the task is allowed to be executed by speculative execution.
any_input_empty
compression_level_str
compression_options
compression_type_str
cpu_limit
cpu_overcommit_ratio
ctx
dataset
default_output_name
elapsed_time
enable_temp_directory
exception
exec_cq
exec_id
exec_on_scheduler
fail_count
final_output_abspath
finish_time
gpu_limit
id
input_datasets
input_deps
input_udfs
input_view_index
key
local_gpu
    Return the first GPU granted to this task.
local_gpu_ranks
    Return all GPU ranks granted to this task.
local_rank
    Return the first GPU rank granted to this task.
location
memory_limit
memory_overcommit_ratio
node_id
numa_node
numpy_random_gen
output
output_deps
output_dirname
output_filename
output_name
output_root
parquet_compression
parquet_compression_level
partition_dims
partition_infos
partition_infos_as_dict
perf_metrics
perf_profile
python_random_gen
query_udfs
rand_seed_float
rand_seed_uint32
random_seed_bytes
ray_dataset_path
    The path of a pickle file that contains the output dataset of the task.
ray_marker_path
    The path of an empty file that is used to determine if the task has been started.
retry_count
runtime_id
runtime_output_abspath
    Output data will be produced in this directory at runtime.
runtime_state
sched_epoch
self_contained_output
    Whether the output of this node is not dependent on any input nodes.
skip_when_any_input_empty
staging_root
    If the task has a special output directory, its runtime output directory will be under it.
start_time
status
temp_abspath
temp_output
    Whether the output of this node is stored in a temporary directory.
udfs
uniform_failure_prob