Nodes#
Nodes represent the fundamental building blocks of a data processing pipeline. Each node encapsulates a specific operation or transformation that can be applied to a dataset. Nodes can be chained together to form a logical plan, which is a directed acyclic graph (DAG) of nodes that represents the overall data processing workflow.
A typical workflow to create a logical plan is as follows:
# Create a global context
ctx = Context()
# Create a dataset
dataset = ParquetDataSet("path/to/dataset/*.parquet")
# Create a data source node
node = DataSourceNode(ctx, dataset)
# Partition the data
node = DataSetPartitionNode(ctx, (node,), npartitions=2)
# Create a SQL engine node to transform the data
node = SqlEngineNode(ctx, (node,), "SELECT * FROM {0}")
# Create a logical plan from the root node
plan = LogicalPlan(ctx, node)
You can then create tasks from the logical plan; see Tasks.
Notable properties of Node:
Nodes are partitioned. Each Node generates a series of tasks, with each task processing one partition of data.
The input and output of a Node are each a series of partitioned Datasets. A Node may write data to shared storage and return a new Dataset, or it may simply recombine the input Datasets.
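For illustration, here is a minimal sketch of that property, reusing only the classes from the example above; the partition count and the SQL text are arbitrary, and import lines are omitted as in the example.
# Split the source dataset into 3 partitions
ctx = Context()
node = DataSourceNode(ctx, ParquetDataSet("path/to/dataset/*.parquet"))
node = DataSetPartitionNode(ctx, (node,), npartitions=3)
# The SQL node runs once per partition, so it should generate 3 tasks,
# and its output is again a series of 3 partitioned datasets
node = SqlEngineNode(ctx, (node,), "SELECT COUNT(*) AS cnt FROM {0}")
plan = LogicalPlan(ctx, node)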
Context#
Global context for each logical plan.
A unique identifier for each node.
LogicalPlan#
The logical plan that defines a directed acyclic computation graph.
Visit the nodes of a logical plan in depth-first order.
Nodes#
The base class for all nodes.
Partition the outputs of input_deps into n partitions.
Run user-defined code to process the input datasets as a series of Arrow tables.
Run Python code to process the input datasets, which have been loaded as Apache Arrow tables.
Run Python code to process the input datasets, which have been loaded as a RecordBatchReader.
Consolidate partitions into larger ones.
Collect the output files of input_deps to output_path.
All inputs of a logical plan are represented as DataSourceNode.
Evenly distribute the output files or rows of input_deps into n partitions.
Partition the outputs of input_deps into n partitions based on the hash values of hash_columns.
Limit the number of rows of the output of an input node.
Load an existing partitioned dataset (only Parquet files are supported).
Run Python code to process the input datasets as a series of pandas DataFrames.
Run Python code to process the input datasets as a single pandas DataFrame.
The base class for all partition nodes.
Select columns from the output of an input node.
Run Python code to process the input datasets with PythonScriptNode.process(...).
Partition the outputs of input_deps into partitions defined by split_points.
Create a new partition dimension by repeating the input_deps.
A virtual node that assembles multiple nodes and outputs nothing.
Run a SQL query against the outputs of input_deps (see the sketch after this list).
Union two or more nodes into one flow of data.
Distribute the output files or rows of input_deps into n partitions based on user code.
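As a further illustration, the sketch below feeds two data sources into a single SQL engine node. It assumes that the placeholders {0}, {1}, ... in the query template refer to the outputs of the input nodes in order, generalizing the single-input "SELECT * FROM {0}" example above; the file paths and column names are hypothetical, and import lines are again omitted.
# Two independent sources feeding one SQL node
ctx = Context()
users = DataSourceNode(ctx, ParquetDataSet("path/to/users/*.parquet"))
orders = DataSourceNode(ctx, ParquetDataSet("path/to/orders/*.parquet"))
# Assumption: {0} refers to the first input node, {1} to the second
node = SqlEngineNode(
    ctx,
    (users, orders),
    "SELECT u.user_id, o.order_id FROM {0} u JOIN {1} o ON u.user_id = o.user_id",
)
plan = LogicalPlan(ctx, node)
Chaining nodes this way is what makes the plan a DAG rather than a single chain: both sources are inputs of the SQL node, and the logical plan is still created from the final (root) node.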