DataFrame
DataFrame is the main class in smallpond. It represents a lazily computed, partitioned data set.
A typical workflow looks like this:
import smallpond

# Initialize a session, then build a lazy pipeline: read, repartition, transform, write.
sp = smallpond.init()
df = sp.read_parquet("path/to/dataset/*.parquet")
df = df.repartition(10)
df = df.map("x + 1")
df.write_parquet("path/to/output")
Initialization

- Initialize smallpond environment.
Loading Data

- Create a DataFrame from a list of local Python objects.
- Create a DataFrame from a pyarrow Table.
- Create a DataFrame from a pandas DataFrame.
- Create a DataFrame from CSV files.
- Create a DataFrame from JSON files.
- Create a DataFrame from Parquet files.
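A minimal sketch of how these loaders fit together. Only read_parquet appears in the workflow above; the other method names (from_items, from_arrow, from_pandas, read_csv, read_json) and the schema arguments are assumptions matching the descriptions, so check the API for exact signatures.

import pandas as pd
import pyarrow as pa
import smallpond

sp = smallpond.init()

# From in-memory objects (method names assumed):
df_items = sp.from_items([{"x": 1}, {"x": 2}, {"x": 3}])
df_arrow = sp.from_arrow(pa.table({"x": [1, 2, 3]}))
df_pandas = sp.from_pandas(pd.DataFrame({"x": [1, 2, 3]}))

# From files on disk; glob patterns select multiple files (schema argument assumed):
df_csv = sp.read_csv("data/*.csv", schema={"x": "int"})
df_json = sp.read_json("data/*.json", schema={"x": "int"})
df_parquet = sp.read_parquet("data/*.parquet")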
Partitioning Data

- Repartition the data into the given number of partitions.
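Repartitioning controls how rows are distributed before later per-partition work. A short sketch; the hash_by and by keyword arguments are assumptions about the repartition signature:

import smallpond

sp = smallpond.init()
df = sp.read_parquet("data/*.parquet")

# Split into 10 partitions by distributing the input files.
df = df.repartition(10)

# Keyword arguments below are assumptions: partition by the hash of a column,
# or by the distinct values of a column.
df = df.repartition(10, hash_by="user_id")
df = df.repartition(10, by="date")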
Transformations

These operations apply a transformation and return a new DataFrame.

- Execute a SQL query on each partition of the input DataFrames.
- Apply a function to each row.
- Apply the given function to batches of data.
- Apply a function to each row and flatten the result.
- Filter out rows that don't satisfy the given predicate.
- Limit the number of rows to the given number.
- Sort rows by the given columns within each partition.
- Randomly shuffle all rows globally.
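All of these are lazy: each call returns a new DataFrame, and nothing runs until a consuming operation is invoked. A sketch chaining several of them; the method names (map, filter, flat_map, limit, partial_sql) and the SQL-expression string arguments are assumptions matching the descriptions above.

import smallpond

sp = smallpond.init()
df = sp.read_parquet("data/*.parquet").repartition(10)

df = df.map("x, x + 1 AS y")        # project each row to new columns
df = df.filter("y % 2 = 0")         # keep rows matching the predicate
df = df.flat_map(lambda row: [{"y": row["y"]}, {"y": -row["y"]}])  # one row in, many rows out
df = df.limit(100)                  # cap the number of rows

# Run a SQL query against each partition of one or more input DataFrames.
df = sp.partial_sql("SELECT y, count(*) AS n FROM {0} GROUP BY y", df)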
Consuming Data

These operations will trigger execution of the lazy transformations performed on this DataFrame.

- Count the number of rows.
- Return up to limit rows.
- Return all rows.
- Convert to a pyarrow Table.
- Convert to a pandas DataFrame.
- Write data to a series of Parquet files under the given path.
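A sketch of the consuming calls; except for write_parquet (shown in the workflow above), the method names (count, take, take_all, to_arrow, to_pandas) are assumptions matching the descriptions.

import smallpond

sp = smallpond.init()
df = sp.read_parquet("data/*.parquet")

print(df.count())            # number of rows; triggers execution
first_rows = df.take(5)      # up to 5 rows as Python objects
all_rows = df.take_all()     # every row

table = df.to_arrow()        # pyarrow.Table
pdf = df.to_pandas()         # pandas.DataFrame

# Persist the result as a set of Parquet files under the output directory.
df.write_parquet("path/to/output")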
Execution

DataFrames are lazily computed. You can use these methods to manually trigger computation.

- Compute the data.
- Check if the data is ready on disk.
- Always recompute the data regardless of whether it's already computed.
- Wait for all DataFrames to be computed.
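A sketch of triggering computation by hand; the method and helper names here (compute, is_computed, recompute, wait) are assumptions matching the descriptions above.

import smallpond

sp = smallpond.init()
df_a = sp.read_parquet("a/*.parquet").map("x + 1")
df_b = sp.read_parquet("b/*.parquet").map("x + 1")

df_a.compute()               # materialize this DataFrame now
print(df_a.is_computed())    # True once its data is ready on disk
df_a.recompute()             # run again even if results already exist

# Block until both DataFrames are computed (helper name assumed).
smallpond.wait(df_a, df_b)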