smallpond.logical.node.ProjectionNode#

class smallpond.logical.node.ProjectionNode(ctx: Context, input_dep: Node, columns: List[str] | None = None, generated_columns: List[Literal['filename', 'file_row_number']] | None = None, union_by_name=None)#

Select columns from the output of an input node.

__init__(ctx: Context, input_dep: Node, columns: List[str] | None = None, generated_columns: List[Literal['filename', 'file_row_number']] | None = None, union_by_name=None) None#

Construct a ProjectionNode to select only the given columns from the output of input_dep.

Parameters#

input_dep

The input node whose output columns are selected.

columns, optional

The columns to select. If set to None, all columns are selected.

generated_columns

Auto-generated columns. Supported values: filename, file_row_number.

union_by_name, optional

Unify the columns of different files by name (see https://duckdb.org/docs/data/multiple_files/combining_schemas#union-by-name).
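To make the union-by-name semantics concrete, here is a minimal pure-Python sketch (not the DuckDB implementation): rows from files with different schemas are aligned by column name, and columns missing from a file are filled with None.

```python
def union_by_name(*tables):
    """Combine row dicts from several 'files', aligning columns by name.

    Columns absent from a row are filled with None, mimicking the high-level
    behavior of DuckDB's union_by_name option.
    """
    # Collect every column name across all files, preserving first-seen order.
    columns = []
    for table in tables:
        for row in table:
            for name in row:
                if name not in columns:
                    columns.append(name)
    # Re-emit each row with the full unified schema.
    return [
        {name: row.get(name) for name in columns}
        for table in tables
        for row in table
    ]

file_a = [{"url": "a.com/x", "size": 10}]   # schema: url, size
file_b = [{"url": "b.com/y", "lang": "en"}]  # schema: url, lang
combined = union_by_name(file_a, file_b)
# Every row now has the columns url, size, lang; absent values are None.
```

Without union-by-name, files would instead be combined by column position, which breaks when schemas differ across files.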

Examples#

First, create an ArrowComputeNode that extracts the host from each url.

class ParseUrl(ArrowComputeNode):
    def process(self, runtime_ctx: RuntimeContext, input_tables: List[arrow.Table]) -> arrow.Table:
        assert input_tables[0].column_names == ["url"]  # check that url is the only column in the table
        urls, = input_tables[0].columns
        hosts = [url.as_py().split("/", maxsplit=2)[0] for url in urls]
        return arrow.Table.from_arrays([hosts, urls], names=["host", "url"])
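The host extraction above relies on str.split with maxsplit; a small standalone illustration of that step (assuming scheme-less URLs such as "example.com/path", as in the example above):

```python
def extract_host(url: str) -> str:
    # Split on at most two "/" and keep the leading segment, i.e. the host.
    # Assumes scheme-less URLs; a URL with a "https://" prefix would need
    # the scheme stripped first.
    return url.split("/", maxsplit=2)[0]

print(extract_host("example.com/a/b/c"))  # -> example.com
```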

Suppose the output of data_partitions has several columns. ProjectionNode(…, ["url"]) selects only the url column, so only that column is loaded into the arrow table when feeding data to ParseUrl.

urls_with_host = ParseUrl(ctx, (ProjectionNode(ctx, data_partitions, ["url"]),))

Methods

__init__(ctx, input_dep[, columns, ...])

Construct a ProjectionNode to select only the given columns from the output of input_dep.

add_perf_metrics(name, value)

create_task(*args, **kwargs)

get_perf_stats(name)

slim_copy()

task_factory(task_builder)

Attributes

enable_resource_boost

num_partitions