smallpond.logical.node.PartitionNode

class smallpond.logical.node.PartitionNode(ctx: Context, input_deps: Tuple[Node, ...], npartitions: int, dimension: str | None = None, nested: bool = False, output_name: str | None = None, output_path: str | None = None, cpu_limit: int = 1, memory_limit: int | None = None)

The base class for all partition nodes.

__init__(ctx: Context, input_deps: Tuple[Node, ...], npartitions: int, dimension: str | None = None, nested: bool = False, output_name: str | None = None, output_path: str | None = None, cpu_limit: int = 1, memory_limit: int | None = None) -> None

Partition the outputs of input_deps into npartitions partitions.

Parameters

npartitions

The dataset will be split and distributed across npartitions partitions.

dimension

The unique partition dimension. Required if this is a nested partition.

nested, optional

If true, npartitions subpartitions are created within each existing partition of input_deps.

Examples

See the unit tests in test/test_partition.py; for nested partitioning, see test_nested_partition. For the motivation behind nested partitioning, see Section 5.1 (Partial Partitioning) of [Advanced partitioning techniques for massively distributed computation](https://dl.acm.org/doi/10.1145/2213836.2213839).
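Since PartitionNode is a base class, it is normally used through a concrete subclass inside a logical plan. The following is a minimal sketch, not taken from the smallpond test suite: it assumes the DataSetPartitionNode subclass and the Context, DataSourceNode, and LogicalPlan names from smallpond.logical.node, the ParquetDataSet helper from smallpond.logical.dataset, and that the subclass forwards the dimension and nested keyword arguments to PartitionNode. The input path is hypothetical.

```python
# Minimal sketch of flat and nested partitioning in a logical plan.
# Class names and keyword forwarding are assumptions; verify against
# your smallpond version.
from smallpond.logical.dataset import ParquetDataSet
from smallpond.logical.node import (
    Context,
    DataSourceNode,
    DataSetPartitionNode,
    LogicalPlan,
)

ctx = Context()
# Hypothetical input files.
dataset = ParquetDataSet(["data/*.parquet"])
source = DataSourceNode(ctx, dataset)

# Flat partitioning: split the source outputs into 8 partitions.
partitioned = DataSetPartitionNode(ctx, (source,), npartitions=8)

# Nested partitioning: create 4 subpartitions within each of the 8
# existing partitions. A unique dimension is required when nested=True.
nested = DataSetPartitionNode(
    ctx,
    (partitioned,),
    npartitions=4,
    dimension="inner",
    nested=True,
)

plan = LogicalPlan(ctx, nested)
```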

Methods

__init__(ctx, input_deps, npartitions[, ...])

Partition the outputs of input_deps into npartitions partitions.

add_perf_metrics(name, value)

create_consumer_task(*args, **kwargs)

create_merge_task(*args, **kwargs)

create_producer_task(*args, **kwargs)

create_split_task(*args, **kwargs)

create_task(runtime_ctx, input_deps, ...)

get_perf_stats(name)

slim_copy()

task_factory(task_builder)

Attributes

enable_resource_boost

max_card_of_producers_x_consumers

max_num_producer_tasks

num_partitions