smallpond.logical.node.ProjectionNode#

class smallpond.logical.node.ProjectionNode(ctx: Context, input_dep: Node, columns: List[str] | None = None, generated_columns: List[Literal['filename', 'file_row_number']] | None = None, union_by_name=None)#

Select columns from the output of an input node.

__init__(ctx: Context, input_dep: Node, columns: List[str] | None = None, generated_columns: List[Literal['filename', 'file_row_number']] | None = None, union_by_name=None) None#

Construct a ProjectionNode to select only the given columns from the output of input_dep.

Parameters#

input_dep

The input node whose output columns are selected.

columns, optional

The columns to select. If set to None, all columns are selected.

generated_columns

Auto-generated columns. Supported values: filename, file_row_number.

union_by_name, optional

Unify the columns of different files by name (see https://duckdb.org/docs/data/multiple_files/combining_schemas#union-by-name).
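To make the union-by-name semantics concrete, here is a minimal pure-Python sketch (not the DuckDB implementation): rows from files with different schemas are aligned by column name, and columns missing from a file are filled with None.

```python
def union_by_name(*tables):
    """Combine row dicts from several 'files', aligning columns by name.

    Columns absent from a row are filled with None, mimicking the high-level
    behavior of DuckDB's union_by_name option.
    """
    # Collect every column name across all files, preserving first-seen order.
    columns = []
    for table in tables:
        for row in table:
            for name in row:
                if name not in columns:
                    columns.append(name)
    # Re-emit each row with the full unified schema.
    return [
        {name: row.get(name) for name in columns}
        for table in tables
        for row in table
    ]

file_a = [{"url": "a.com/x", "size": 10}]   # schema: url, size
file_b = [{"url": "b.com/y", "lang": "en"}]  # schema: url, lang
combined = union_by_name(file_a, file_b)
# Every row now has the columns url, size, lang; absent values are None.
```

Without union-by-name, files would instead be combined by column position, which breaks when schemas differ across files.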

Examples#

First, create an ArrowComputeNode that extracts the host from each url.

class ParseUrl(ArrowComputeNode):
    def process(self, runtime_ctx: RuntimeContext, input_tables: List[arrow.Table]) -> arrow.Table:
        assert input_tables[0].column_names == ["url"]  # check that url is the only column in the table
        urls, = input_tables[0].columns
        hosts = [url.as_py().split("/", maxsplit=2)[0] for url in urls]
        return arrow.Table.from_arrays([hosts, urls], names=["host", "url"])
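The host extraction above relies on str.split with maxsplit; a small standalone illustration of that step (assuming scheme-less URLs such as "example.com/path", as in the example above):

```python
def extract_host(url: str) -> str:
    # Split on at most two "/" and keep the leading segment, i.e. the host.
    # Assumes scheme-less URLs; a URL with a "https://" prefix would need
    # the scheme stripped first.
    return url.split("/", maxsplit=2)[0]

print(extract_host("example.com/a/b/c"))  # -> example.com
```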

Suppose the output of data_partitions has several columns. ProjectionNode(…, ["url"]) selects only the url column, so only that column is loaded into the arrow table when feeding data to ParseUrl.

urls_with_host = ParseUrl(ctx, (ProjectionNode(ctx, data_partitions, ["url"]),))

Methods

__init__(ctx, input_dep[, columns, ...])

Construct a ProjectionNode to select only the given columns from the output of input_dep.

add_perf_metrics(name, value)

create_task(*args, **kwargs)

get_perf_stats(name)

slim_copy()

task_factory(task_builder)

Attributes

enable_resource_boost

num_partitions