smallpond.logical.node.ProjectionNode#
- class smallpond.logical.node.ProjectionNode(ctx: Context, input_dep: Node, columns: List[str] | None = None, generated_columns: List[Literal['filename', 'file_row_number']] | None = None, union_by_name=None)#
Select columns from the output of an input node.
- __init__(ctx: Context, input_dep: Node, columns: List[str] | None = None, generated_columns: List[Literal['filename', 'file_row_number']] | None = None, union_by_name=None) None #
Construct a ProjectionNode that selects only the given columns from the output of input_dep.
Parameters#
- input_dep
The input node whose output columns are selected.
- columns, optional
The columns to select or create. If set to None, all columns are selected.
- generated_columns, optional
Auto-generated columns; supported values: filename, file_row_number.
- union_by_name, optional
Unify the columns of different files by name (see https://duckdb.org/docs/data/multiple_files/combining_schemas#union-by-name).
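The union_by_name option follows DuckDB's semantics: rows from files with differing schemas are combined by column name, and columns absent from a file are filled with NULL. A minimal pure-Python sketch of the idea (the union_rows_by_name helper and dict-based rows are illustrative, not part of the smallpond or DuckDB API):

```python
def union_rows_by_name(batches):
    """Combine row batches with differing columns into one unified schema."""
    # Collect the union of all column names, preserving first-seen order.
    columns = []
    for batch in batches:
        for name in batch[0].keys():
            if name not in columns:
                columns.append(name)
    # Re-emit every row with the full schema, padding absent columns with None.
    return [{name: row.get(name) for name in columns}
            for batch in batches for row in batch]
```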
Examples#
First create an ArrowComputeNode to extract hosts from urls.
```python
class ParseUrl(ArrowComputeNode):
    def process(self, runtime_ctx: RuntimeContext, input_tables: List[arrow.Table]) -> arrow.Table:
        assert input_tables[0].column_names == ["url"]  # check url is the only column in table
        urls, = input_tables[0].columns
        hosts = [url.as_py().split("/", maxsplit=2)[0] for url in urls]
        return arrow.Table.from_arrays([hosts, urls], names=["host", "url"])
```
Suppose there are several columns in the output of data_partitions; ProjectionNode(…, ["url"]) selects the url column. Then only this column is loaded into the arrow table when feeding data to ParseUrl.

```python
urls_with_host = ParseUrl(ctx, (ProjectionNode(ctx, data_partitions, ["url"]),))
```
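The splitting rule used by ParseUrl above can be illustrated on its own: everything before the first "/" is treated as the host, assuming the URLs carry no scheme prefix (extract_host is a hypothetical helper for illustration, not part of the smallpond API):

```python
def extract_host(url: str) -> str:
    # Same rule as the ParseUrl example: the segment before the
    # first "/" is the host; the rest of the path is discarded.
    return url.split("/", maxsplit=2)[0]
```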
Methods

- __init__(ctx, input_dep[, columns, ...]) — Construct a ProjectionNode that selects only the given columns from the output of input_dep.
- add_perf_metrics(name, value)
- create_task(*args, **kwargs)
- get_perf_stats(name)
- slim_copy()
- task_factory(task_builder)

Attributes

- enable_resource_boost
- num_partitions