smallpond.logical.node.ShuffleNode

class smallpond.logical.node.ShuffleNode(ctx: Context, input_deps: Tuple[Node, ...], npartitions: int, data_partition_column: str | None = None, *, dimension: str | None = None, nested: bool = False, engine_type: Literal['duckdb', 'arrow'] | None = None, use_parquet_writer: bool = False, hive_partitioning: bool = False, parquet_row_group_size: int | None = None, parquet_dictionary_encoding=False, parquet_compression='ZSTD', parquet_compression_level=3, output_name: str | None = None, output_path: str | None = None, cpu_limit: int | None = None, memory_limit: int | None = None)
__init__(ctx: Context, input_deps: Tuple[Node, ...], npartitions: int, data_partition_column: str | None = None, *, dimension: str | None = None, nested: bool = False, engine_type: Literal['duckdb', 'arrow'] | None = None, use_parquet_writer: bool = False, hive_partitioning: bool = False, parquet_row_group_size: int | None = None, parquet_dictionary_encoding=False, parquet_compression='ZSTD', parquet_compression_level=3, output_name: str | None = None, output_path: str | None = None, cpu_limit: int | None = None, memory_limit: int | None = None) → None

Construct a ShuffleNode. See Node.__init__() for comments on the other parameters.
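A minimal construction sketch follows, assuming `ctx` is an existing smallpond Context and `upstream` is an already-constructed Node from the same logical plan; the parameter values are illustrative only, not recommended defaults:

    from smallpond.logical.node import ShuffleNode

    # Hypothetical plan-building fragment: `ctx` (a Context) and `upstream`
    # (a previously built Node) are assumed to exist elsewhere in the plan.
    shuffle = ShuffleNode(
        ctx,                               # logical-plan Context
        (upstream,),                       # input_deps: tuple of upstream Nodes
        npartitions=16,                    # number of hash partitions
        data_partition_column="part_key",  # column that stores the partition key
        engine_type="duckdb",              # run the partitioning with DuckDB
        parquet_compression="ZSTD",        # defaults shown explicitly for clarity
        parquet_compression_level=3,
    )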

Parameters

npartitions

The number of hash partitions. The number of generated Parquet files is proportional to npartitions.

hash_columns

The hash values are computed from hash_columns.

data_partition_column, optional

The name of the column used to store partition keys.

engine_type, optional

The underlying query engine used for hash partitioning.

random_shuffle, optional

If true, ignore hash_columns and shuffle each row to a random partition.

shuffle_only, optional

If true, ignore hash_columns and shuffle each row to the partition specified by data_partition_column.

drop_partition_column, optional

If true, exclude data_partition_column from the output.

use_parquet_writer, optional

If true, convert partition data to Arrow tables and append them with a Parquet writer. This creates fewer intermediate files but makes partitioning slower.

hive_partitioning, optional

If true, use DuckDB's Hive-partitioned write (see the DuckDB sketch after this parameter list).

parquet_row_group_size, optional

The number of rows stored in each row group of the Parquet file. A large row group size provides more opportunities to compress the data; a small row group size can make row filtering faster and increase concurrency. See https://duckdb.org/docs/data/parquet/tips.html#selecting-a-row_group_size.

parquet_dictionary_encoding, optional

Specifies whether dictionary encoding is applied to all columns or only to selected ones. See use_dictionary in https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html (a pyarrow sketch follows this parameter list).
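The Hive-partitioned write mentioned for hive_partitioning is a standard DuckDB feature. The standalone sketch below uses plain DuckDB with made-up table and column names; it illustrates the output layout, not smallpond's internal implementation:

    import duckdb

    con = duckdb.connect()
    con.execute("CREATE TABLE t AS SELECT i AS id, i % 4 AS part_key FROM range(100) r(i)")
    # Hive-partitioned write: one sub-directory per distinct part_key value,
    # e.g. out/part_key=0/*.parquet, out/part_key=1/*.parquet, ...
    con.execute("COPY t TO 'out' (FORMAT PARQUET, PARTITION_BY (part_key))")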
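The parquet_row_group_size, parquet_dictionary_encoding, and parquet_compression / parquet_compression_level options correspond to standard Parquet writer settings. The standalone pyarrow sketch below (not smallpond code; file and column names are illustrative) shows the same knobs on pyarrow.parquet.ParquetWriter, which is also the kind of appending writer that use_parquet_writer appears to refer to:

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"part_key": [0, 1, 0, 1], "value": ["a", "b", "c", "d"]})

    # Append row groups to a single Parquet file with an explicit writer.
    writer = pq.ParquetWriter(
        "partition-0.parquet",
        table.schema,
        use_dictionary=["value"],  # dictionary-encode only the listed columns
        compression="ZSTD",
        compression_level=3,
    )
    writer.write_table(table, row_group_size=122880)  # rows per row group
    writer.close()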

Methods

__init__(ctx, input_deps, npartitions[, ...])

Construct a ShuffleNode.

add_perf_metrics(name, value)

create_consumer_task(*args, **kwargs)

create_merge_task(*args, **kwargs)

create_producer_task(*args, **kwargs)

create_split_task(*args, **kwargs)

create_task(runtime_ctx, input_deps, ...)

get_perf_stats(name)

slim_copy()

task_factory(task_builder)

Attributes

default_cpu_limit

default_data_partition_column

default_engine_type

default_memory_limit

default_row_group_size

enable_resource_boost

max_card_of_producers_x_consumers

max_num_producer_tasks

num_partitions