smallpond.logical.dataset.ParquetDataSet
- class smallpond.logical.dataset.ParquetDataSet(paths: List[str], root_dir: str | None = '', recursive=False, columns: List[str] | None = None, generated_columns: List[str] | None = None, union_by_name=False)
A set of parquet files.
- __init__(paths: List[str], root_dir: str | None = '', recursive=False, columns: List[str] | None = None, generated_columns: List[str] | None = None, union_by_name=False) -> None
Construct a dataset from a list of paths.
Parameters
- paths
A path, a list of paths, or a list of path patterns, e.g. ['data/100.parquet', '/datasetA/*.parquet'].
- root_dir, optional
Relative paths in paths are resolved under root_dir if specified.
- recursive, optional
Resolve path patterns recursively if true.
- columns, optional
Only load the specified columns if not None.
- generated_columns, optional
Generated columns of the DuckDB read_parquet function.
- union_by_name, optional
Unify the columns of different files by name (see https://duckdb.org/docs/data/multiple_files/combining_schemas#union-by-name).
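A minimal construction sketch, for illustration only; the file paths, root directory, and column names below are hypothetical:

```python
from smallpond.logical.dataset import ParquetDataSet

# Two explicit files plus a glob pattern; relative paths resolve
# under root_dir when one is given. Paths and columns are hypothetical.
dataset = ParquetDataSet(
    paths=["data/100.parquet", "datasetA/*.parquet"],
    root_dir="/mnt/warehouse",
    columns=["user_id", "score"],  # load only these columns
    union_by_name=True,  # align differing file schemas by column name
)
```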
Methods

| Method | Description |
| --- | --- |
| `__init__(paths[, root_dir, recursive, ...])` | Construct a dataset from a list of paths. |
| `create_from(table, output_dir[, filename])` |  |
| `load_partitioned_datasets(npartition, ...[, ...])` | Split the dataset into a list of partitioned datasets. |
| `log([num_rows])` | Log the dataset to the logger. |
| `merge(datasets)` | Merge multiple datasets into a single dataset. |
| `partition_by_files(npartition[, random_shuffle])` | Evenly split into `npartition` datasets by files. |
| `partition_by_rows(npartition[, random_shuffle])` | Evenly split the dataset into `npartition` partitions by rows. |
| `partition_by_size(max_partition_size)` | Split the dataset into multiple partitions, each holding at most `max_partition_size` bytes. |
| `remove_empty_files()` | Remove empty parquet files from the dataset. |
| `reset([paths, root_dir, recursive])` | NOTE: all row ranges will be cleared. |
| `sql_query_fragment([filesystem, conn])` | Return a SQL fragment that represents the dataset. |
| `to_arrow_table([max_workers, filesystem, conn])` | Load the dataset into an arrow table. |
| `to_batch_reader([batch_size, filesystem, conn])` | Return an arrow record batch reader for the dataset. |
| `to_pandas()` | Convert the dataset to a pandas dataframe. |
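A sketch of how the partitioning and conversion methods compose, assuming `partition_by_rows` returns the partitioned datasets as a sequence (the glob pattern and partition count here are hypothetical):

```python
from smallpond.logical.dataset import ParquetDataSet

dataset = ParquetDataSet(["events/*.parquet"], root_dir="/mnt/warehouse")

# Evenly split the rows across 4 partitions, then materialize each
# partition as an arrow table.
for part in dataset.partition_by_rows(4):
    table = part.to_arrow_table()
    print(table.num_rows)

# Or pull a small dataset straight into pandas.
df = dataset.to_pandas()
```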
Attributes

| Attribute | Description |
| --- | --- |
| `generated_columns` | Generated columns of the DuckDB `read_parquet` function. |
| `absolute_paths` | An ordered list of absolute paths of the given file patterns. |
| `columns` | The columns to load from the dataset files. |
| `empty` | Whether the dataset is empty. |
| `estimated_data_size` | The estimated data size in bytes. |
| `num_files` | The number of files in the dataset. |
| `num_rows` | The number of rows in the dataset. |
| `paths` | The paths to the dataset files. |
| `recursive` | Whether to resolve path patterns recursively. |
| `resolved_paths` | An ordered list of absolute paths of files. |
| `resolved_row_ranges` | Row ranges for each parquet file. |
| `root_dir` | The root directory of `paths`. |
| `udfs` |  |
| `union_by_name` | Whether to unify the columns of different files by name. |
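A short sketch of inspecting these attributes before processing, reusing the hypothetical dataset from the examples above:

```python
# Check the resolved shape of the dataset up front.
if not dataset.empty:
    print(f"files: {dataset.num_files}")
    print(f"rows:  {dataset.num_rows}")
    print(f"size:  ~{dataset.estimated_data_size} bytes")
    print(dataset.resolved_paths)  # absolute paths after pattern expansion
```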