smallpond.dataframe.DataFrame.repartition#

DataFrame.repartition(npartitions: int, hash_by: str | List[str] | None = None, by: str | None = None, by_rows: bool = False, **kwargs) → DataFrame#

Repartition the data into the given number of partitions.

Parameters#

npartitions: The number of partitions the dataset is split and distributed into. The signature above declares no default for this argument, so it must be passed explicitly.
hash_by, optional: If specified, the dataset is repartitioned by the hash of the given column or list of columns.
by, optional: If specified, the dataset is repartitioned by the given column.
by_rows, optional: If True, the dataset is repartitioned by rows instead of by files. Defaults to False.
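
To make the hash_by behaviour concrete, here is a minimal pure-Python sketch of hash partitioning in general. It is an illustration only, not smallpond's implementation: the helper name hash_partition is hypothetical, and zlib.crc32 is an assumed stand-in for whatever hash function the library actually uses.

```python
import zlib
from collections import defaultdict


def hash_partition(rows, key, npartitions):
    """Group rows into at most `npartitions` buckets by a stable hash of row[key].

    Hypothetical helper for illustration; not part of smallpond.
    """
    partitions = defaultdict(list)
    for row in rows:
        # crc32 is an assumed stand-in hash, chosen because it is deterministic.
        index = zlib.crc32(str(row[key]).encode("utf-8")) % npartitions
        partitions[index].append(row)
    return dict(partitions)


rows = [
    {"host": "a.example", "bytes": 10},
    {"host": "b.example", "bytes": 20},
    {"host": "a.example", "bytes": 30},
]

parts = hash_partition(rows, "host", 4)

# Rows that share a key always land in the same partition.
for index, bucket in sorted(parts.items()):
    print(index, [row["host"] for row in bucket])
```

The property that matters is that all rows with the same key value end up together, which is what makes a later per-key aggregation possible without moving data again.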

Examples#

df = df.repartition(10)                 # evenly distributed by files
df = df.repartition(10, by_rows=True)   # evenly distributed by rows
df = df.repartition(10, hash_by='host') # hash partitioned
df = df.repartition(10, by='bucket')    # partitioned by column
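
The difference between distributing by files and by rows can be pictured with a small model. The sketch below is an assumption about the general idea, not smallpond's algorithm: both function names are hypothetical, files are handed out round-robin, and rows are split as evenly as the total allows.

```python
def distribute_by_files(file_row_counts, npartitions):
    """Hand out whole files round-robin; return the row total per partition.

    Hypothetical model for illustration; not part of smallpond.
    """
    totals = [0] * npartitions
    for i, nrows in enumerate(file_row_counts):
        totals[i % npartitions] += nrows
    return totals


def distribute_by_rows(file_row_counts, npartitions):
    """Split the overall row count as evenly as possible across partitions.

    Hypothetical model for illustration; not part of smallpond.
    """
    total = sum(file_row_counts)
    base, extra = divmod(total, npartitions)
    return [base + (1 if i < extra else 0) for i in range(npartitions)]


# Four input files with uneven sizes (row counts are made up).
files = [100, 10, 10, 10]

print(distribute_by_files(files, 2))  # [110, 20]
print(distribute_by_rows(files, 2))   # [65, 65]
```

Under this model, by_rows=True is the better fit when input files differ widely in size, since an even split of files can still leave partitions with very different row counts.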