smallpond.logical.node.ShuffleNode

class smallpond.logical.node.ShuffleNode(ctx: Context, input_deps: Tuple[Node, ...], npartitions: int, data_partition_column: str | None = None, *, dimension: str | None = None, nested: bool = False, engine_type: Literal['duckdb', 'arrow'] | None = None, use_parquet_writer: bool = False, hive_partitioning: bool = False, parquet_row_group_size: int | None = None, parquet_dictionary_encoding=False, parquet_compression='ZSTD', parquet_compression_level=3, output_name: str | None = None, output_path: str | None = None, cpu_limit: int | None = None, memory_limit: int | None = None)
__init__(ctx: Context, input_deps: Tuple[Node, ...], npartitions: int, data_partition_column: str | None = None, *, dimension: str | None = None, nested: bool = False, engine_type: Literal['duckdb', 'arrow'] | None = None, use_parquet_writer: bool = False, hive_partitioning: bool = False, parquet_row_group_size: int | None = None, parquet_dictionary_encoding=False, parquet_compression='ZSTD', parquet_compression_level=3, output_name: str | None = None, output_path: str | None = None, cpu_limit: int | None = None, memory_limit: int | None = None) → None

Construct a ShuffleNode. See Node.__init__() for comments on the other parameters.
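A minimal construction sketch follows, assuming `ctx` is an existing smallpond Context and `upstream` is an already-constructed Node from the same logical plan; the parameter values are illustrative only, not recommended defaults:

    from smallpond.logical.node import ShuffleNode

    # Hypothetical plan-building fragment: `ctx` (a Context) and `upstream`
    # (a previously built Node) are assumed to exist elsewhere in the plan.
    shuffle = ShuffleNode(
        ctx,                               # logical-plan Context
        (upstream,),                       # input_deps: tuple of upstream Nodes
        npartitions=16,                    # number of hash partitions
        data_partition_column="part_key",  # column that stores the partition key
        engine_type="duckdb",              # run the partitioning with DuckDB
        parquet_compression="ZSTD",        # defaults shown explicitly for clarity
        parquet_compression_level=3,
    )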

Parameters

npartitions

The number of hash partitions. The number of generated Parquet files is proportional to npartitions.

hash_columns

The hash values are computed from hash_columns.

data_partition_column, optional

The name of the column used to store partition keys.

engine_type, optional

The underlying query engine used for hash partitioning.

random_shuffle, optional

If true, ignore hash_columns and shuffle each row to a random partition.

shuffle_only, optional

If true, ignore hash_columns and shuffle each row to the partition specified by data_partition_column.

drop_partition_column, optional

If true, exclude data_partition_column from the output.

use_parquet_writer, optional

If true, convert partition data to Arrow tables and append them with a Parquet writer. This creates fewer intermediate files but makes partitioning slower.

hive_partitioning, optional

If true, use DuckDB's Hive-partitioned write (see the DuckDB sketch after this parameter list).

parquet_row_group_size, optional

The number of rows stored in each row group of the Parquet file. A large row group size provides more opportunities to compress the data; a small row group size can make row filtering faster and increase concurrency. See https://duckdb.org/docs/data/parquet/tips.html#selecting-a-row_group_size.

parquet_dictionary_encoding, optional

Specifies whether dictionary encoding is applied to all columns or only to selected ones. See use_dictionary in https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html (a pyarrow sketch follows this parameter list).
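The Hive-partitioned write mentioned for hive_partitioning is a standard DuckDB feature. The standalone sketch below uses plain DuckDB with made-up table and column names; it illustrates the output layout, not smallpond's internal implementation:

    import duckdb

    con = duckdb.connect()
    con.execute("CREATE TABLE t AS SELECT i AS id, i % 4 AS part_key FROM range(100) r(i)")
    # Hive-partitioned write: one sub-directory per distinct part_key value,
    # e.g. out/part_key=0/*.parquet, out/part_key=1/*.parquet, ...
    con.execute("COPY t TO 'out' (FORMAT PARQUET, PARTITION_BY (part_key))")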
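The parquet_row_group_size, parquet_dictionary_encoding, and parquet_compression / parquet_compression_level options correspond to standard Parquet writer settings. The standalone pyarrow sketch below (not smallpond code; file and column names are illustrative) shows the same knobs on pyarrow.parquet.ParquetWriter, which is also the kind of appending writer that use_parquet_writer appears to refer to:

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"part_key": [0, 1, 0, 1], "value": ["a", "b", "c", "d"]})

    # Append row groups to a single Parquet file with an explicit writer.
    writer = pq.ParquetWriter(
        "partition-0.parquet",
        table.schema,
        use_dictionary=["value"],  # dictionary-encode only the listed columns
        compression="ZSTD",
        compression_level=3,
    )
    writer.write_table(table, row_group_size=122880)  # rows per row group
    writer.close()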

Methods

__init__(ctx, input_deps, npartitions[, ...])

Construct a ShuffleNode.

add_perf_metrics(name, value)

create_consumer_task(*args, **kwargs)

create_merge_task(*args, **kwargs)

create_producer_task(*args, **kwargs)

create_split_task(*args, **kwargs)

create_task(runtime_ctx, input_deps, ...)

get_perf_stats(name)

slim_copy()

task_factory(task_builder)

Attributes

default_cpu_limit

default_data_partition_column

default_engine_type

default_memory_limit

default_row_group_size

enable_resource_boost

max_card_of_producers_x_consumers

max_num_producer_tasks

num_partitions