smallpond.logical.dataset.ParquetDataSet

class smallpond.logical.dataset.ParquetDataSet(paths: List[str], root_dir: str | None = '', recursive=False, columns: List[str] | None = None, generated_columns: List[str] | None = None, union_by_name=False)

A set of parquet files.

__init__(paths: List[str], root_dir: str | None = '', recursive=False, columns: List[str] | None = None, generated_columns: List[str] | None = None, union_by_name=False) → None

Construct a dataset from a list of paths.

Parameters

paths

A path or a list of paths or path patterns, e.g. ['data/100.parquet', '/datasetA/*.parquet'].

root_dir, optional

If specified, relative paths in paths are resolved under root_dir.

recursive, optional

Resolve path patterns recursively if true.

columns, optional

Only load the specified columns if not None.

generated_columns, optional

Generated columns of the DuckDB read_parquet function.

union_by_name, optional

Unify the columns of different files by name (see https://duckdb.org/docs/data/multiple_files/combining_schemas#union-by-name).

Methods

__init__(paths[, root_dir, recursive, ...])

Construct a dataset from a list of paths.

create_from(table, output_dir[, filename])

load_partitioned_datasets(npartition, ...[, ...])

Split the dataset into a list of partitioned datasets.

log([num_rows])

Log the dataset to the logger.

merge(datasets)

Merge multiple datasets into a single dataset.

partition_by_files(npartition[, random_shuffle])

Evenly split into npartition datasets by files.

partition_by_rows(npartition[, random_shuffle])

Evenly split the dataset into npartition partitions by rows.

partition_by_size(max_partition_size)

Split the dataset into multiple partitions so that each partition has at most max_partition_size bytes.

remove_empty_files()

Remove empty parquet files from the dataset.

reset([paths, root_dir, recursive])

Reset the dataset's paths, root_dir, and recursive flag. NOTE: all row ranges will be cleared.

sql_query_fragment([filesystem, conn])

Return an SQL fragment that represents the dataset.

to_arrow_table([max_workers, filesystem, conn])

Load the dataset into an Arrow table.

to_batch_reader([batch_size, filesystem, conn])

Return an Arrow record batch reader for the dataset.

to_pandas()

Convert the dataset to a pandas DataFrame.

Attributes

generated_columns

Generated columns of the DuckDB read_parquet function.

absolute_paths

An ordered list of absolute paths resolved from the given file patterns.

columns

The columns to load from the dataset files.

empty

Whether the dataset is empty.

estimated_data_size

The estimated data size in bytes.

num_files

The number of files in the dataset.

num_rows

The number of rows in the dataset.

paths

The paths to the dataset files.

recursive

Whether to resolve path patterns recursively.

resolved_paths

An ordered list of absolute paths of files.

resolved_row_ranges

Row ranges for each parquet file.

root_dir

The root directory of paths.

udfs

union_by_name

Whether to unify the columns of different files by name.