smallpond.logical.dataset.ParquetDataSet

class smallpond.logical.dataset.ParquetDataSet(paths: List[str], root_dir: str | None = '', recursive=False, columns: List[str] | None = None, generated_columns: List[str] | None = None, union_by_name=False)

A set of parquet files.

__init__(paths: List[str], root_dir: str | None = '', recursive=False, columns: List[str] | None = None, generated_columns: List[str] | None = None, union_by_name=False) → None

Construct a dataset from a list of paths.

Parameters

paths

A path or a list of paths or path patterns, e.g. ['data/100.parquet', '/datasetA/*.parquet'].

root_dir, optional

If specified, relative paths in paths are resolved under root_dir.

recursive, optional

Resolve path patterns recursively if true.

columns, optional

Only load the specified columns if not None.

generated_columns, optional

Generated columns of the DuckDB read_parquet function.

union_by_name, optional

Unify the columns of different files by name (see https://duckdb.org/docs/data/multiple_files/combining_schemas#union-by-name).

Methods

__init__(paths[, root_dir, recursive, ...])

Construct a dataset from a list of paths.

create_from(table, output_dir[, filename])

load_partitioned_datasets(npartition, ...[, ...])

Split the dataset into a list of partitioned datasets.

log([num_rows])

Log the dataset to the logger.

merge(datasets)

Merge multiple datasets into a single dataset.

partition_by_files(npartition[, random_shuffle])

Evenly split into npartition datasets by files.

partition_by_rows(npartition[, random_shuffle])

Evenly split the dataset into npartition partitions by rows.

partition_by_size(max_partition_size)

Split the dataset into multiple partitions so that each partition has at most max_partition_size bytes.

remove_empty_files()

Remove empty parquet files from the dataset.

reset([paths, root_dir, recursive])

Reset the dataset's paths, root_dir, and recursive flag. NOTE: all row ranges will be cleared.

sql_query_fragment([filesystem, conn])

Return an SQL fragment that represents the dataset.

to_arrow_table([max_workers, filesystem, conn])

Load the dataset into an Arrow table.

to_batch_reader([batch_size, filesystem, conn])

Return an Arrow record batch reader for the dataset.

to_pandas()

Convert the dataset to a pandas DataFrame.

Attributes

generated_columns

Generated columns of the DuckDB read_parquet function.

absolute_paths

An ordered list of absolute paths resolved from the given file patterns.

columns

The columns to load from the dataset files.

empty

Whether the dataset is empty.

estimated_data_size

The estimated data size in bytes.

num_files

The number of files in the dataset.

num_rows

The number of rows in the dataset.

paths

The paths to the dataset files.

recursive

Whether to resolve path patterns recursively.

resolved_paths

An ordered list of absolute paths of files.

resolved_row_ranges

Row ranges for each parquet file.

root_dir

The root directory of paths.

udfs

union_by_name

Whether to unify the columns of different files by name.