smallpond.logical.node.ShuffleNode#
- class smallpond.logical.node.ShuffleNode(ctx: Context, input_deps: Tuple[Node, ...], npartitions: int, data_partition_column: str | None = None, *, dimension: str | None = None, nested: bool = False, engine_type: Literal['duckdb', 'arrow'] | None = None, use_parquet_writer: bool = False, hive_partitioning: bool = False, parquet_row_group_size: int | None = None, parquet_dictionary_encoding=False, parquet_compression='ZSTD', parquet_compression_level=3, output_name: str | None = None, output_path: str | None = None, cpu_limit: int | None = None, memory_limit: int | None = None)#
- __init__(ctx: Context, input_deps: Tuple[Node, ...], npartitions: int, data_partition_column: str | None = None, *, dimension: str | None = None, nested: bool = False, engine_type: Literal['duckdb', 'arrow'] | None = None, use_parquet_writer: bool = False, hive_partitioning: bool = False, parquet_row_group_size: int | None = None, parquet_dictionary_encoding=False, parquet_compression='ZSTD', parquet_compression_level=3, output_name: str | None = None, output_path: str | None = None, cpu_limit: int | None = None, memory_limit: int | None = None) → None#
Construct a HashPartitionNode. See Node.__init__() for comments on the other parameters.

Parameters#
- npartitions
The number of hash partitions. The number of generated Parquet files is proportional to npartitions.
- hash_columns
The hash values are computed from hash_columns.
- data_partition_column, optional
The name of the column used to store partition keys.
- engine_type, optional
The underlying query engine used for hash partitioning.
- random_shuffle, optional
If true, ignore hash_columns and shuffle each row to a random partition.
- shuffle_only, optional
If true, ignore hash_columns and shuffle each row to the partition specified in data_partition_column.
- drop_partition_column, optional
If true, exclude data_partition_column from the output.
- use_parquet_writer, optional
If true, convert partition data to Arrow tables and append them with a Parquet writer. This creates fewer intermediate files but makes partitioning slower.
- hive_partitioning, optional
If true, use DuckDB's Hive partitioned write.
- parquet_row_group_size, optional
The number of rows stored in each row group of the Parquet file. A large row group size provides more opportunities to compress the data; a small row group size can make row filtering faster and achieve higher concurrency (see the inspection sketch after this list). See https://duckdb.org/docs/data/parquet/tips.html#selecting-a-row_group_size.
- parquet_dictionary_encoding, optional
Specify whether dictionary encoding should be used for all columns or only for selected columns. See use_dictionary in https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html.
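
As a hedged illustration of the signature above, the sketch below constructs a ShuffleNode by hand. The names ctx and upstream are placeholders for an existing Context and an upstream Node (for example a data source node) that this page does not define, and all option values are illustrative rather than defaults.

```python
from smallpond.logical.node import ShuffleNode

# Sketch only: `ctx` and `upstream` are assumed to be an existing Context and
# an upstream Node built earlier in the logical plan; they are not defined here.
shuffle = ShuffleNode(
    ctx,
    (upstream,),                       # input_deps: tuple of upstream nodes
    npartitions=16,                    # number of hash partitions
    data_partition_column="user_id",   # column that stores the partition keys
    engine_type="duckdb",              # or "arrow"
    parquet_row_group_size=122880,     # rows per row group in the output files
    parquet_dictionary_encoding=True,  # or a list of column names, as in pyarrow's use_dictionary
    parquet_compression="ZSTD",
    parquet_compression_level=3,
)
```

Note that everything after the * in the signature (dimension, engine_type, the parquet_* settings, the limits, and so on) is keyword-only and must be passed by name.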
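
To see how parquet_row_group_size plays out in practice, a file produced by the shuffle can be inspected with pyarrow; the path below is a placeholder for one of the generated output files.

```python
import pyarrow.parquet as pq

# Placeholder path: substitute one of the Parquet files written by the shuffle.
meta = pq.ParquetFile("output/partition_0.parquet").metadata
print("row groups:", meta.num_row_groups)
print("rows in first row group:", meta.row_group(0).num_rows)
```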
Methods

- __init__(ctx, input_deps, npartitions[, ...]): Construct a HashPartitionNode.
- add_perf_metrics(name, value)
- create_consumer_task(*args, **kwargs)
- create_merge_task(*args, **kwargs)
- create_producer_task(*args, **kwargs)
- create_split_task(*args, **kwargs)
- create_task(runtime_ctx, input_deps, ...)
- get_perf_stats(name)
- slim_copy()
- task_factory(task_builder)

Attributes

- default_cpu_limit
- default_data_partition_column
- default_engine_type
- default_memory_limit
- default_row_group_size
- enable_resource_boost
- max_card_of_producers_x_consumers
- max_num_producer_tasks
- num_partitions