smallpond.init#

smallpond.init(job_id: str | None = None, job_time: float | None = None, job_name: str | None = None, data_root: str | None = None, num_executors: int | None = None, ray_address: str | None = None, bind_numa_node: bool | None = None, platform: str | None = None, _remove_output_root: bool = True, **kwargs) -> Session

Initialize the smallpond environment.

This is the entry point for smallpond:

import smallpond
sp = smallpond.init()

By default, smallpond uses a local Ray cluster as the worker node. To use more worker nodes, specify the num_executors argument:

sp = smallpond.init(num_executors=10)

This creates a task to run the workers.
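Several of the parameters documented below can be passed in a single call. The following is a minimal sketch, assuming smallpond is installed; the helper names build_init_kwargs and start_session are hypothetical and are not part of smallpond.

```python
def build_init_kwargs(num_executors=None, data_root=None, job_name=None):
    """Collect only the options the caller actually set (hypothetical helper).

    Options left as None are omitted, so smallpond.init() can fall back to
    the environment variable or default described under Parameters.
    """
    options = {
        "num_executors": num_executors,
        "data_root": data_root,
        "job_name": job_name,
    }
    return {name: value for name, value in options.items() if value is not None}


def start_session(**overrides):
    """Start a session; smallpond is imported lazily so this file loads without it."""
    import smallpond

    return smallpond.init(**build_init_kwargs(**overrides))
```

Calling start_session(num_executors=10) is then equivalent to smallpond.init(num_executors=10), while start_session() passes no arguments at all.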

Parameters#

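All parameters are optional. If a parameter is not specified, its value is read from the corresponding environment variable (shown in parentheses). If the environment variable is not set either, the default value is used.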
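The lookup order described above (explicit argument, then environment variable, then default) can be illustrated with a small standalone sketch. This is not smallpond's implementation: resolve_option and its cast parameter are hypothetical, and how smallpond actually parses environment values (for example booleans) is not stated here.

```python
import os


def resolve_option(explicit, env_name, default, cast=str, environ=None):
    """Return the explicit argument, else the environment variable, else the default."""
    environ = os.environ if environ is None else environ
    if explicit is not None:
        return explicit
    raw = environ.get(env_name)
    if raw is not None:
        # Environment values are strings, so convert to the expected type.
        return cast(raw)
    return default


# Usage with a fake environment so the real one is left untouched.
fake_env = {"SP_NUM_EXECUTORS": "4"}
from_env = resolve_option(None, "SP_NUM_EXECUTORS", 0, cast=int, environ=fake_env)
from_arg = resolve_option(10, "SP_NUM_EXECUTORS", 0, cast=int, environ=fake_env)
from_default = resolve_option(None, "SP_NUM_EXECUTORS", 0, cast=int, environ={})
```

Here from_env takes the value from the fake environment, from_arg keeps the explicit argument, and from_default falls back to 0 because nothing else is set.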

job_id (SP_JOBID)

Unique job ID. Defaults to a random UUID.

job_time (SP_JOB_TIME)

Job creation time (seconds since epoch). Defaults to the current time.

job_name (SP_JOB_NAME)

Job display name. Defaults to the filename of the current script.

data_root (SP_DATA_ROOT)

The root folder for all files generated at runtime.

num_executors (SP_NUM_EXECUTORS)

The number of executors. Defaults to 0, which means all tasks run on the current node.

ray_address (SP_RAY_ADDRESS)

If specified, connect to an existing Ray cluster at the given address. Otherwise, create a new Ray cluster.

bind_numa_node (SP_BIND_NUMA_NODE)

If true, bind executor processes to NUMA nodes.

memory_allocator (SP_MEMORY_ALLOCATOR)

The memory allocator used by worker processes. Choices: “system”, “jemalloc”, “mimalloc”. Defaults to “mimalloc”.

platform (SP_PLATFORM)

The platform to use. Choices: “mpi”. By default, the environment is detected automatically and the most suitable platform is chosen.

_remove_output_root

If true, remove the “{data_root}/output” directory after the job is finished. Defaults to True. This option exists only for compatibility; users should save outputs with write_parquet.
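The documented defaults for job_id, job_time, and job_name can be approximated with the standard library. This is a sketch under assumptions, not smallpond's own code: default_job_fields is a hypothetical helper, and the exact form smallpond gives these values (for example whether the file extension is kept in the job name) is not specified above.

```python
import os
import sys
import time
import uuid


def default_job_fields(script_path=None):
    """Approximate the documented defaults for the three job fields."""
    script = sys.argv[0] if script_path is None else script_path
    return {
        "job_id": str(uuid.uuid4()),            # a random UUID
        "job_time": time.time(),                # seconds since epoch
        "job_name": os.path.basename(script),   # filename of the script
    }
```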

Spawning a new job#

If the environment variable SP_SPAWN is set to “1”, smallpond spawns a new job to run the current script.
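One way to set SP_SPAWN for a single run, without changing the parent shell, is to launch the script from Python with a modified environment. The sketch below only prepares the command and environment; spawn_request is a hypothetical helper, and the script path is a placeholder.

```python
import os
import sys


def spawn_request(script_path, base_env=None):
    """Build the command and environment for running a script with SP_SPAWN=1.

    Hypothetical helper: it does not start anything itself. Pass the result
    to subprocess.run(command, env=env) to launch the script.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["SP_SPAWN"] = "1"
    command = [sys.executable, script_path]
    return command, env
```

Copying the environment rather than mutating os.environ keeps the setting scoped to the launched script.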