Getting Started
===============

Installation
------------

Python 3.8+ is required.

.. code-block:: bash

   pip install smallpond

Initialization
--------------

The first step is to initialize the smallpond session:

.. code-block:: python

   import smallpond

   sp = smallpond.init()

Loading Data
------------

Create a DataFrame from a set of files:

.. code-block:: python

   df = sp.read_parquet("path/to/dataset/*.parquet")

To learn more about loading data, please refer to :ref:`loading_data`.

Partitioning Data
-----------------

Smallpond requires users to manually specify data partitions for now.

.. code-block:: python

   df = df.repartition(3)                 # repartition by files
   df = df.repartition(3, by_row=True)    # repartition by rows
   df = df.repartition(3, hash_by="host") # repartition by hash of column

To learn more about partitioning data, please refer to :ref:`partitioning_data`.

Transforming Data
-----------------

Apply python functions or SQL expressions to transform data.

.. code-block:: python

   df = df.map('a + b as c')
   df = df.map(lambda row: {'c': row['a'] + row['b']})

To learn more about transforming data, please refer to :ref:`transformations`.

Saving Data
-----------

Save the transformed data to a set of files:

.. code-block:: python

   df.write_parquet("path/to/output")

To learn more about saving data, please refer to :ref:`consuming_data`.

Monitoring
----------

Smallpond uses `Ray Core`_ as the task scheduler. You can use `Ray Dashboard`_ to monitor the task execution.

.. _Ray Core: https://docs.ray.io/en/latest/ray-core/walkthrough.html
.. _Ray Dashboard: https://docs.ray.io/en/latest/ray-observability/getting-started.html

When smallpond starts, it will print the Ray Dashboard URL:

.. code-block:: bash

   ... Started a local Ray instance. View the dashboard at http://127.0.0.1:8008