Getting Started#
Installation#
Python 3.8+ is required.
pip install smallpond
Initialization#
The first step is to initialize the smallpond session:
import smallpond
sp = smallpond.init()
Loading Data#
Create a DataFrame from a set of files:
df = sp.read_parquet("path/to/dataset/*.parquet")
To learn more about loading data, please refer to Loading Data.
Partitioning Data#
For now, smallpond requires you to specify data partitions manually.
df = df.repartition(3) # repartition by files
df = df.repartition(3, by_row=True) # repartition by rows
df = df.repartition(3, hash_by="host") # repartition by hash of column
To learn more about partitioning data, please refer to Partitioning Data.
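To make the three strategies concrete, here is a conceptual sketch in plain Python. It does not use smallpond and is not its implementation. The helper names (`partition_by_files`, `partition_by_rows`, `partition_by_hash`), the round-robin file assignment, and the use of CRC32 as the hash are all assumptions chosen for illustration.

```python
import zlib


def partition_by_files(files, npartitions):
    """Assign whole files to partitions (round-robin is an assumption)."""
    return [files[i::npartitions] for i in range(npartitions)]


def partition_by_rows(rows, npartitions):
    """Split rows into contiguous chunks of near-equal size."""
    size, extra = divmod(len(rows), npartitions)
    parts, start = [], 0
    for i in range(npartitions):
        # The first `extra` partitions take one additional row each.
        end = start + size + (1 if i < extra else 0)
        parts.append(rows[start:end])
        start = end
    return parts


def partition_by_hash(rows, npartitions, key):
    """Send every row with the same key value to the same partition."""
    parts = [[] for _ in range(npartitions)]
    for row in rows:
        # CRC32 is stable across runs, unlike the built-in hash() for strings.
        index = zlib.crc32(str(row[key]).encode("utf-8")) % npartitions
        parts[index].append(row)
    return parts


hosts = ["a.example", "b.example", "c.example", "a.example",
         "b.example", "a.example", "c.example"]
rows = [{"host": host, "bytes": n} for n, host in enumerate(hosts)]

file_parts = partition_by_files(["f0", "f1", "f2", "f3"], 3)
row_parts = partition_by_rows(rows, 3)
hash_parts = partition_by_hash(rows, 3, key="host")
```

Row-based partitioning balances partition sizes, while hash-based partitioning keeps all rows for a given `host` together, which matters when later steps group or join on that column.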
Transforming Data#
Apply Python functions or SQL expressions to transform data.
df = df.map('a + b as c')
df = df.map(lambda row: {'c': row['a'] + row['b']})
To learn more about transforming data, please refer to Transformations.
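The two forms above express the same per-row computation. The following sketch illustrates that equivalence without smallpond. It uses the standard-library `sqlite3` module only as a dependency-free stand-in for a SQL engine, and the table name `t` and the sample rows are made up for the example. It assumes the SQL form yields just the selected expression as column `c`, mirroring the dictionary the function returns.

```python
import sqlite3

rows = [{"a": 1, "b": 2}, {"a": 10, "b": 20}]


def add_columns(row):
    """Function form: take a row as a dict and return the new columns."""
    return {"c": row["a"] + row["b"]}


mapped_by_function = [add_columns(row) for row in rows]

# SQL form: evaluate the expression "a + b AS c" over the same rows.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets each result row convert to a dict
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
conn.executemany("INSERT INTO t VALUES (:a, :b)", rows)
mapped_by_sql = [
    dict(r) for r in conn.execute("SELECT a + b AS c FROM t ORDER BY rowid")
]
conn.close()

print(mapped_by_function)
print(mapped_by_sql)
```

Both lists contain one dictionary per input row with a single key `c`.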
Saving Data#
Save the transformed data to a set of files:
df.write_parquet("path/to/output")
To learn more about saving data, please refer to Consuming Data.
Monitoring#
Smallpond uses Ray Core as the task scheduler. You can use the Ray Dashboard to monitor task execution.
When smallpond starts, it prints the Ray Dashboard URL:
... Started a local Ray instance. View the dashboard at http://127.0.0.1:8008
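If you capture the startup output in a script, a small helper can pull the URL out of that line. This is a sketch that assumes the message keeps the wording shown above. The function name `find_dashboard_url` is hypothetical and not part of smallpond or Ray.

```python
import re

log_line = (
    "... Started a local Ray instance. "
    "View the dashboard at http://127.0.0.1:8008"
)


def find_dashboard_url(line):
    """Return the dashboard URL from a startup log line, or None if absent."""
    match = re.search(r"View the dashboard at (https?://\S+)", line)
    return match.group(1) if match else None


print(find_dashboard_url(log_line))  # prints http://127.0.0.1:8008
```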