❓ FAQ: Frequently Asked Questions#

Missing something or is something unclear? Please open an issue!

How do I run pipeline.map_async inside a Python script?#

pipeline.map_async returns an AsyncMap. In notebooks you typically await runner.task, but in a plain .py script you can stay synchronous by calling runner.result(), which runs the asynchronous map to completion and returns the ResultDict.

# ruff: noqa: INP001
"""Run `pipeline.map_async` from a regular Python script."""

from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor

from pipefunc import Pipeline, pipefunc


@pipefunc(output_name="y", mapspec="x[i] -> y[i]")
def shift(x: int) -> int:
    """Add one to each element in `x` to create `y`."""
    return x + 1


pipeline = Pipeline([shift])


def main() -> None:
    """Submit the async map and block until it completes, then show sample output."""
    runner = pipeline.map_async(
        inputs={"x": range(100)},
        executor=ThreadPoolExecutor(max_workers=4),
        start=False,
        display_widgets=False,
    )

    result = runner.result()
    print(result["y"].output[:5])


if __name__ == "__main__":
    main()

  • The snippet above lives in docs/source/concepts/map_async_in_script.py so you can copy it as-is for your own scripts.

  • Keep start=False to suppress the automatic start() call that expects a running event loop.

  • Pass display_widgets=False to avoid the notebook-only widget warning in terminal scripts.

  • Swap the executor for SlurmExecutor(...) (or another backend) if you need cluster integration or custom execution behavior.

If you are in an async context (e.g. a FastAPI endpoint), continue to await runner.task instead of calling runner.result().

How is this different from Dask, AiiDA, Luigi, Prefect, Kedro, Apache Airflow, Snakemake, etc.?#

pipefunc fills a unique niche in the Python workflow ecosystem.

Key Differentiators#

What makes pipefunc unique:

  1. Simplicity: Pure Python implementation with minimal dependencies, allowing standard debuggers and profiling tools to work without modification

  2. Flexibility: Easy to modify pipelines and add parameter sweeps with minimal boilerplate

  3. HPC Integration: First-class support for traditional HPC clusters

  4. Resource Management: Fine-grained control over computational resources per function

  5. Development Speed: Rapid prototyping without infrastructure setup

pipefunc is particularly well-suited for scientists and researchers who need to:

  • Quickly prototype and iterate on computational workflows

  • Run parameter sweeps across multiple dimensions

  • Manage varying computational requirements between pipeline steps

  • Work with traditional HPC systems

  • Maintain readable and maintainable Python code

Let’s break down the comparison by category:

Low-Level Parallel Computing Tools (e.g., Dask)#

Dask and pipefunc serve different purposes and can be complementary:

  • Dask provides low-level control over parallelization, letting you decide exactly what and how to parallelize

  • pipefunc automatically handles parallelization based on pipeline structure and mapspec definitions

  • Dask can serve as a computational backend for pipefunc

  • pipefunc provides higher-level abstractions for parameter sweeps without requiring explicit parallel programming

In summary, Dask is a powerful parallel computing library, while pipefunc helps you build and manage scientific workflows with less boilerplate and takes care of parallelization and data saving for you.

Scientific Workflow Tools (e.g., AiiDA, Pydra)#

Compared to scientific workflow managers, pipefunc provides:

  • Lighter weight setup with no external dependencies (unlike AiiDA, which requires a daemon, PostgreSQL, and RabbitMQ).

  • More intuitive Python-native interface with automatic graph construction from function signatures.

  • Simpler debugging as code runs in the same Python process by default.

  • Built-in parameter sweeps with automatic parallelization.

  • Dynamic resource allocation based on input parameters.

Job Schedulers/Runners (e.g., Airflow, Luigi)#

These tools are designed for scheduling and running tasks, often in a distributed environment. They are well-suited for production ETL pipelines and managing dependencies between jobs. Unlike pipefunc, they often rely on serialized data or external storage for data exchange between tasks and require custom implementations for parameter sweeps.

pipefunc vs. Job Schedulers:

  • Focus: pipefunc focuses on creating reusable, composable Python functions within a pipeline. Job schedulers focus on scheduling and executing independent tasks.

  • Complexity: pipefunc is simpler to set up and use for Python-centric workflows. Job schedulers have more features but a steeper learning curve.

  • Flexibility: pipefunc allows for dynamic, data-driven workflows within Python. Job schedulers are more rigid but offer robust scheduling and monitoring.

Data Pipelines (e.g., Kedro, Prefect)#

These tools provide frameworks for building data pipelines with a focus on data engineering best practices, such as modularity, versioning, and testing.

pipefunc vs. Data Pipelines:

  • Structure: pipefunc is less opinionated about project structure than Kedro, which enforces a specific layout. Prefect is more flexible but still geared towards defining data flows.

  • Scope: pipefunc is more focused on the computational aspects of pipelines, while Kedro and Prefect offer more features for data management, versioning, and deployment.

  • Flexibility: pipefunc offers more flexibility in how pipelines are defined and executed, while Kedro and Prefect provide more structure and standardization.

Workflow Definition Languages (e.g., Snakemake)#

Snakemake uses a domain-specific language (DSL) to define workflows as a set of rules with dependencies. It excels at orchestrating diverse tools and scripts, often in separate environments, through a dedicated workflow definition file (Snakefile). Unlike pipefunc, Snakemake primarily works with serialized data and may require custom implementations for parameter sweeps within the Python code.

pipefunc vs. Snakemake:

  • Workflow Definition: pipefunc uses Python code with decorators. Snakemake uses a Snakefile with a specialized syntax.

  • Focus: pipefunc is designed for Python-centric workflows and automatic parallelization within Python. Snakemake is language-agnostic and handles the execution of diverse tools and steps, potentially in different environments.

  • Flexibility: pipefunc offers more flexibility in defining complex logic within Python functions. Snakemake provides a more rigid, rule-based approach.

  • Learning Curve: pipefunc is generally easier to learn for Python users. Snakemake requires understanding its DSL.

pipefunc within Snakemake:

pipefunc can be integrated into a Snakemake workflow. You could have a Snakemake rule that executes a Python script containing a pipefunc pipeline, combining the strengths of both tools.

In essence:

pipefunc provides a simpler, more Pythonic approach for workflows primarily based on Python functions. It excels at streamlining development, reducing boilerplate, and automatically handling parallelization within the familiar Python ecosystem. While other tools may be better suited for production ETL pipelines, managing complex dependencies, or workflows involving diverse non-Python tools, pipefunc is ideal for flexible scientific computing workflows where rapid development and easy parameter exploration are priorities.

How is this different from Hamilton?#

Because the two projects are so frequently compared, Hamilton gets its own section.

pipefunc and Hamilton both generate DAGs from plain Python functions, but they target different pain points.

Where pipefunc leans in

  • Scientific + HPC workflows: Minimal runtime overhead (<10 µs) with mapspec-driven scheduling, executor-agnostic parallelism (any concurrent.futures.Executor, from ProcessPoolExecutor to Dask, ipyparallel, mpi4py, etc.), and first-class support for job schedulers such as SLURM and PBS (see Execution and Parallelism).

  • N-dimensional parameter sweeps: Built-in sweep tooling (pipeline.map) stores intermediate artifacts, supports eager or queued execution, and works with structured outputs like xarray.

  • Fine-grained resource policies: Per-function constraints for CPU, memory, GPUs, wall-time, and custom selectors (Resource Management).

  • Type-aware validation: Type hints are checked when pipelines are constructed (with optional runtime checks for Array[...] outputs), and pipelines can emit Pydantic models for CLIs, agents, or user interfaces (CLI and MCP).

Where Hamilton focuses

  • Column-first dataflows: Decorators such as @extract_columns, @parameterize_extract_columns, and @with_columns specialize in expanding and transforming DataFrame columns while preserving lineage.

  • Data quality & observability: Hamilton’s first-party decorators (@check_output, etc.) and plugins integrate with pandera/pydantic, emit OpenLineage metadata, and surface lineage/telemetry through the Hamilton UI.

  • Adapter-driven execution: Hamilton provides its own GraphAdapter API for Ray, Dask, Spark, async/thread pools, etc. Switching backends means selecting or implementing the matching adapter rather than swapping a standard executor.

  • Structured module layout: Drivers crawl Python modules to assemble the DAG; teams wanting strong conventions appreciate that, while others may find the enforced module boundaries restrictive compared to ad-hoc wiring.

How to choose

  • Reach for pipefunc when you need lightweight orchestration for numerically intensive or simulation-heavy workloads, multi-dimensional sweeps, or custom resource scheduling without leaving pure Python.

  • Reach for Hamilton when your pipelines revolve around DataFrame transformations, you need column-level lineage and validation baked in, or you want UI-driven observability out of the box.

You can mix them too: Hamilton-produced functions can be wrapped with @pipefunc, and pipefunc stages can call Hamilton drivers for teams already invested in Hamilton’s ecosystem.

How to use adaptive with pipefunc?#

This section has been moved to SLURM integration.

How to handle defaults?#

This section has been moved to Function Inputs and Outputs.

How to bind parameters to a fixed value?#

This section has been moved to Function Inputs and Outputs.

How to rename inputs and outputs?#

This section has been moved to Function Inputs and Outputs.

How to handle multiple outputs?#

This section has been moved to Function Inputs and Outputs.

How does type checking work in pipefunc?#

This section has been moved to Type Checking.

What is the difference between pipeline.run and pipeline.map?#

This section has been moved to Parallelism and Execution.

How to use parameter scopes (namespaces)?#

This section has been moved to Parameter Scopes.

How to inspect the Resources inside a PipeFunc?#

This section has been moved to Resource Management.

How to set the Resources dynamically, based on the input arguments?#

This section has been moved to Resource Management.

How to use adaptive with pipefunc?#

This section has been moved to Adaptive integration.

What is the ErrorSnapshot feature in pipefunc?#

This section has been moved to Error Handling.

What is the overhead / efficiency / performance of pipefunc?#

This section has been moved to Overhead and Efficiency.

How to mock functions in a pipeline for testing?#

This section has been moved to Testing.

Mixing executors and storage backends for I/O-bound and CPU-bound work#

This section has been moved to Parallelism and Execution.

Get a function handle for a specific pipeline output (pipeline.func)#

This section has been moved to Function Inputs and Outputs.

dataclasses and pydantic.BaseModel as PipeFunc#

This section has been moved to Function Inputs and Outputs.

What is VariantPipeline and how to use it?#

This section has been moved to Variants.

How to use post-execution hooks?#

This section has been moved to Parallelism and Execution.

How to collect results as a step in my Pipeline?#

This section has been moved to Function Inputs and Outputs.

PipeFuncs with Multiple Outputs of Different Shapes#

This section has been moved to Function Inputs and Outputs.

Simplifying Pipelines#

This section has been moved to Simplifying Pipelines.

Parameter Sweeps#

This section has been moved to Parameter Sweeps.