smallpond.execution.task.PandasBatchTask

class smallpond.execution.task.PandasBatchTask(ctx: RuntimeContext, input_deps: List[Task], partition_infos: List[PartitionInfo], process_func: Callable[[RuntimeContext, List[RecordBatchReader]], Iterable[Table]] | None = None, background_io_thread=True, streaming_batch_size: int = 122880, secs_checkpoint_interval: int | None = None, parquet_row_group_size: int = 122880, parquet_dictionary_encoding=False, parquet_compression='ZSTD', parquet_compression_level=3, use_duckdb_reader=False, output_name: str | None = None, output_path: str | None = None, cpu_limit: int | None = None, gpu_limit: float | None = None, memory_limit: int | None = None)
__init__(ctx: RuntimeContext, input_deps: List[Task], partition_infos: List[PartitionInfo], process_func: Callable[[RuntimeContext, List[RecordBatchReader]], Iterable[Table]] | None = None, background_io_thread=True, streaming_batch_size: int = 122880, secs_checkpoint_interval: int | None = None, parquet_row_group_size: int = 122880, parquet_dictionary_encoding=False, parquet_compression='ZSTD', parquet_compression_level=3, use_duckdb_reader=False, output_name: str | None = None, output_path: str | None = None, cpu_limit: int | None = None, gpu_limit: float | None = None, memory_limit: int | None = None) → None

Methods

__init__(ctx, input_deps, partition_infos[, ...])

add_elapsed_time([metric_name])

Start or stop the timer.

adjust_row_group_size(nbytes, num_rows[, ...])

clean_complex_attrs()

clean_output([force])

cleanup()

Called after run() even if there is an exception.

compute_avg_row_size(nbytes, num_rows)

create_input_views(conn, input_datasets[, ...])

dump()

dump_output(output_iter)

exec([cq])

exec_query(conn, query_statement[, ...])

finalize()

Called after run() to finalize the processing.

get_partition_info(dimension)

initialize()

Called before run() to prepare for running the task.

inject_fault()

merge_metrics(metrics)

oom([nonzero_exitcode_as_oom])

parquet_kv_metadata_bytes([extra_partitions])

parquet_kv_metadata_str([extra_partitions])

prepare_connection(conn)

process(runtime_ctx, input_dfs)

This method can be overridden in a subclass of ArrowStreamTask.

random_float()

random_uint32()

restore_input_state(runtime_state, input_readers)

run()

run_on_ray()

Run the task on Ray.

set_memory_limit(soft_limit, hard_limit)
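The methods `compute_avg_row_size(nbytes, num_rows)` and `adjust_row_group_size(nbytes, num_rows, ...)` are undocumented here, but the arithmetic they imply is straightforward: estimate the average serialized row size, then cap the row-group row count so one group stays within a byte budget. A stdlib-only sketch of that heuristic; the 128 MiB budget and the clamping behavior are assumptions for illustration, not smallpond's actual constants:

```python
def compute_avg_row_size(nbytes: int, num_rows: int) -> int:
    # Average serialized size of one row, floored at 1 byte to avoid
    # division by zero downstream.
    return max(1, nbytes // max(1, num_rows))


def adjust_row_group_size(nbytes: int, num_rows: int,
                          max_group_rows: int = 122880,
                          target_group_bytes: int = 128 << 20) -> int:
    # Shrink the row-group row count so that a single group stays within
    # the byte budget; never exceed the configured row cap.
    avg = compute_avg_row_size(nbytes, num_rows)
    return max(1, min(max_group_rows, target_group_bytes // avg))
```

For wide rows the byte budget dominates (fewer rows per group); for narrow rows the row cap of 122880 (the documented `parquet_row_group_size` default) is the binding limit.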

Attributes

process_func

background_io_thread

streaming_batch_size

streaming_batch_count

parquet_row_group_size

parquet_row_group_bytes

parquet_dictionary_encoding

parquet_compression

parquet_compression_level

secs_checkpoint_interval

allow_speculative_exec

Whether the task is allowed to be executed by speculative execution.

any_input_empty

compression_level_str

compression_options

compression_type_str

cpu_limit

cpu_overcommit_ratio

ctx

dataset

default_output_name

elapsed_time

enable_temp_directory

exception

exec_cq

exec_id

exec_on_scheduler

fail_count

final_output_abspath

finish_time

gpu_limit

id

input_datasets

input_deps

input_udfs

input_view_index

key

local_gpu

Return the first GPU granted to this task.

local_gpu_ranks

Return all GPU ranks granted to this task.

local_rank

Return the first GPU rank granted to this task.

location

max_batch_size

memory_limit

memory_overcommit_ratio

node_id

numa_node

numpy_random_gen

output

output_deps

output_dirname

output_filename

output_name

output_root

partition_dims

partition_infos

partition_infos_as_dict

perf_metrics

perf_profile

python_random_gen

query_udfs

rand_seed_float

rand_seed_uint32

random_seed_bytes

ray_dataset_path

The path of a pickle file that contains the output dataset of the task.

ray_marker_path

The path of an empty marker file used to determine whether the task has been started.

retry_count

runtime_id

runtime_output_abspath

Output data will be produced in this directory at runtime.

runtime_state

sched_epoch

self_contained_output

Whether the output of this node does not depend on any input nodes.

skip_when_any_input_empty

staging_root

If the task has a special output directory, its runtime output directory will be under it.

start_time

status

temp_abspath

temp_output

Whether the output of this node is stored in a temporary directory.

udfs

uniform_failure_prob