smallpond.logical.node.DataSetPartitionNode

smallpond.logical.node.DataSetPartitionNode(ctx: Context, input_deps: Tuple[Node, ...], npartitions: int, *, partition_by_rows=False, random_shuffle=False, data_partition_column=None)

Partition the outputs of input_deps into npartitions partitions.

Parameters

npartitions

The number of partitions. The input files or rows are distributed evenly across the npartitions partitions.

partition_by_rows, optional

Distribute rows, rather than input files, evenly across the npartitions partitions. By default (False), the input is distributed by files.
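The difference between distributing by files and distributing by rows can be illustrated with a small pure-Python sketch. It does not use smallpond and is not its implementation: the helper names split_evenly and split_rows_evenly are hypothetical, and the contiguous assignment order is an assumption. The actual node may order or group its units differently, for example at parquet row group granularity.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_evenly(items: Sequence[T], npartitions: int) -> List[List[T]]:
    """Split items into npartitions contiguous chunks whose sizes differ by at most one."""
    if npartitions <= 0:
        raise ValueError("npartitions must be positive")
    base, extra = divmod(len(items), npartitions)
    chunks: List[List[T]] = []
    start = 0
    for i in range(npartitions):
        size = base + (1 if i < extra else 0)  # the first `extra` chunks get one more item
        chunks.append(list(items[start:start + size]))
        start += size
    return chunks


def split_rows_evenly(
    file_rows: Sequence[Tuple[str, int]], npartitions: int
) -> List[List[Tuple[str, int, int]]]:
    """Split rows across partitions; each entry is (path, start_row, end_row), end exclusive."""
    if npartitions <= 0:
        raise ValueError("npartitions must be positive")
    total = sum(nrows for _, nrows in file_rows)
    base, extra = divmod(total, npartitions)
    partitions: List[List[Tuple[str, int, int]]] = []
    file_idx, offset = 0, 0
    for i in range(npartitions):
        need = base + (1 if i < extra else 0)
        ranges: List[Tuple[str, int, int]] = []
        while need > 0:
            path, nrows = file_rows[file_idx]
            take = min(need, nrows - offset)
            if take > 0:
                ranges.append((path, offset, offset + take))
            offset += take
            need -= take
            if offset == nrows:  # current file exhausted, move on to the next one
                file_idx += 1
                offset = 0
        partitions.append(ranges)
    return partitions


# By files: five files over two partitions.
by_files = split_evenly(["a", "b", "c", "d", "e"], 2)
# -> [['a', 'b', 'c'], ['d', 'e']]

# By rows: ten rows over three partitions, so a single file may be split.
by_rows = split_rows_evenly([("a.parquet", 5), ("b.parquet", 3), ("c.parquet", 2)], 3)
# -> [[('a.parquet', 0, 4)],
#     [('a.parquet', 4, 5), ('b.parquet', 0, 2)],
#     [('b.parquet', 2, 3), ('c.parquet', 0, 2)]]
```

Distributing by files keeps each file whole but can leave partitions unbalanced when file sizes vary. Distributing by rows balances row counts but lets a partition read only part of a file.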

random_shuffle, optional

Randomly shuffle the list of paths, or of parquet row groups if partition_by_rows=True, in the input datasets.
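As a rough illustration of what shuffling the unit list means for the resulting partitions, the sketch below shuffles a list of paths and then deals them out round-robin. It is plain Python, not smallpond code. The function name shuffle_then_deal and the seed parameter are hypothetical (the node's signature has no seed), and the round-robin assignment is an assumption.

```python
import random
from typing import List, Optional, Sequence


def shuffle_then_deal(
    paths: Sequence[str], npartitions: int, seed: Optional[int] = None
) -> List[List[str]]:
    """Shuffle a copy of paths, then assign them to partitions round-robin."""
    rng = random.Random(seed)  # a local generator, so the global random state is untouched
    shuffled = list(paths)
    rng.shuffle(shuffled)
    return [shuffled[i::npartitions] for i in range(npartitions)]


paths = [f"part-{i}.parquet" for i in range(7)]
partitions = shuffle_then_deal(paths, 3, seed=0)
# Every path lands in exactly one partition; only the assignment order changes.
```

Shuffling changes which units end up together, not how many units each partition receives.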

data_partition_column, optional

If specified, partition by files based on the partition keys stored in this column. The default column name used by HashPartitionNode is DATA_PARTITION_COLUMN_NAME.
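Key-based partitioning by files can be pictured as grouping files by the key each one carries. The sketch below is a pure-Python illustration under stated assumptions, not smallpond's implementation: it assumes each file holds a single integer key in the range [0, npartitions) and that key k maps to partition k. The helper name group_files_by_key and the file-to-key mapping passed in are hypothetical. How smallpond actually reads the keys from the column is not described here.

```python
from typing import Dict, List


def group_files_by_key(file_keys: Dict[str, int], npartitions: int) -> List[List[str]]:
    """Place each file in the partition named by its key (assumed to be in [0, npartitions))."""
    partitions: List[List[str]] = [[] for _ in range(npartitions)]
    for path, key in sorted(file_keys.items()):  # sorted for a deterministic order
        if not 0 <= key < npartitions:
            raise ValueError(f"partition key {key} of {path!r} is out of range")
        partitions[key].append(path)
    return partitions


# Hypothetical mapping: the key each file would carry in its partition column.
file_keys = {"f0.parquet": 0, "f1.parquet": 2, "f2.parquet": 0, "f3.parquet": 1}
grouped = group_files_by_key(file_keys, 3)
# -> [['f0.parquet', 'f2.parquet'], ['f3.parquet'], ['f1.parquet']]
```

Unlike the even distribution above, partition sizes here follow the key distribution, so they can be unbalanced.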

Examples

See the unit test test_load_partitioned_datasets in test/test_partition.py.