smallpond.logical.node.DataSetPartitionNode
- smallpond.logical.node.DataSetPartitionNode(ctx: Context, input_deps: Tuple[Node, ...], npartitions: int, *, partition_by_rows=False, random_shuffle=False, data_partition_column=None)
Partition the outputs of input_deps into npartitions partitions.
Parameters
- npartitions
The number of partitions. The input files or rows are evenly distributed across the npartitions partitions.
- partition_by_rows, optional
Evenly distribute rows, rather than input files, across the npartitions partitions. Defaults to False, which distributes by files.
- random_shuffle, optional
Randomly shuffle the list of paths, or of parquet row groups if partition_by_rows=True, in the input datasets. Defaults to False.
- data_partition_column, optional
If specified, partition by files according to the partition keys stored in data_partition_column. The default column name used by HashPartitionNode is DATA_PARTITION_COLUMN_NAME.
Examples
See unit test test_load_partitioned_datasets in test/test_partition.py.
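As a rough illustration of what "evenly distributed" means for npartitions, partition_by_rows and random_shuffle, the standalone sketch below deals a list of units (file paths or rows) round-robin into npartitions lists. It does not use smallpond and is not its implementation: split_evenly, the seed argument and the sample file names are hypothetical, and the assignment order smallpond actually uses may differ.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def split_evenly(
    items: Sequence[T],
    npartitions: int,
    *,
    random_shuffle: bool = False,
    seed: Optional[int] = None,
) -> List[List[T]]:
    """Hypothetical helper: deal items round-robin into npartitions lists.

    Partition sizes differ by at most one item, which is the sense of
    "evenly" assumed in this sketch.
    """
    if npartitions <= 0:
        raise ValueError("npartitions must be positive")
    items = list(items)
    if random_shuffle:
        # Shuffle a copy so the caller's sequence is left untouched.
        random.Random(seed).shuffle(items)
    partitions: List[List[T]] = [[] for _ in range(npartitions)]
    for index, item in enumerate(items):
        partitions[index % npartitions].append(item)
    return partitions


# File-level split (analogous to partition_by_rows=False): the units are paths.
files = [f"part-{i}.parquet" for i in range(5)]
by_files = split_evenly(files, 2)

# Row-level split (analogous to partition_by_rows=True): the units are rows.
# Plain integers stand in for rows here.
rows = list(range(10))
by_rows = split_evenly(rows, 3)

print([len(p) for p in by_files])
print([len(p) for p in by_rows])
```

With file-level splitting, partitions can hold very different numbers of rows when the input files vary in size. Row-level splitting balances row counts, but rows from a single file can end up in different partitions.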