# ❓ FAQ: Frequently Asked Questions
Missing something or is something unclear? Please open an issue!
## How do I run `pipeline.map_async` inside a Python script?
`pipeline.map_async` returns an `AsyncMap`. In notebooks you typically `await runner.task`, but in plain `.py` scripts you can stay synchronous by calling `runner.result()`, which runs the asynchronous map to completion and returns the `ResultDict`.
```python
# ruff: noqa: INP001
"""Run `pipeline.map_async` from a regular Python script."""
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor

from pipefunc import Pipeline, pipefunc


@pipefunc(output_name="y", mapspec="x[i] -> y[i]")
def shift(x: int) -> int:
    """Add one to each element in `x` to create `y`."""
    return x + 1


pipeline = Pipeline([shift])


def main() -> None:
    """Submit the async map and block until it completes, then show sample output."""
    runner = pipeline.map_async(
        inputs={"x": range(100)},
        executor=ThreadPoolExecutor(max_workers=4),
        start=False,
        display_widgets=False,
    )
    result = runner.result()
    print(result["y"].output[:5])


if __name__ == "__main__":
    main()
```
The snippet above lives in `docs/source/concepts/map_async_in_script.py`, so you can copy it as-is for your own scripts.

- Keep `start=False` to suppress the automatic `start()` call that expects a running event loop.
- Pass `display_widgets=False` to avoid the notebook-only widget warning in terminal scripts.
- Swap the executor for `SlurmExecutor(...)` (or another backend) if you need cluster integration or custom execution behavior.
If you are in an async context (e.g., a FastAPI endpoint), continue to `await runner.task` instead of calling `runner.result()`.
## How is this different from Dask, AiiDA, Luigi, Prefect, Kedro, Apache Airflow, Snakemake, etc.?
pipefunc fills a unique niche in the Python workflow ecosystem.
### Key Differentiators
What makes pipefunc unique:

- **Simplicity**: Pure Python implementation with minimal dependencies, allowing standard debuggers and profiling tools to work without modification
- **Flexibility**: Easy to modify pipelines and add parameter sweeps with minimal boilerplate
- **HPC Integration**: First-class support for traditional HPC clusters
- **Resource Management**: Fine-grained control over computational resources per function
- **Development Speed**: Rapid prototyping without infrastructure setup
pipefunc is particularly well-suited for scientists and researchers who need to:

- Quickly prototype and iterate on computational workflows
- Run parameter sweeps across multiple dimensions
- Manage varying computational requirements between pipeline steps
- Work with traditional HPC systems
- Maintain readable and maintainable Python code
Let’s break the comparison down by category:
### Low-Level Parallel Computing Tools (e.g., Dask)
Dask and pipefunc serve different purposes and can be complementary:

- Dask provides low-level control over parallelization, letting you decide exactly what and how to parallelize.
- `pipefunc` automatically handles parallelization based on pipeline structure and `mapspec` definitions.
- Dask can serve as a computational backend for `pipefunc`.
- `pipefunc` provides higher-level abstractions for parameter sweeps without requiring explicit parallel programming.
In summary, Dask is a powerful parallel computing library, while pipefunc helps you build and manage scientific workflows with less boilerplate and takes care of parallelization and data saving for you.
### Scientific Workflow Tools (e.g., AiiDA, Pydra)
Compared to scientific workflow managers, pipefunc provides:

- Lighter-weight setup with no external dependencies (unlike AiiDA, which requires a daemon, PostgreSQL, and RabbitMQ).
- A more intuitive Python-native interface with automatic graph construction from function signatures.
- Simpler debugging, as code runs in the same Python process by default.
- Built-in parameter sweeps with automatic parallelization.
- Dynamic resource allocation based on input parameters.
### Job Schedulers/Runners (e.g., Airflow, Luigi)
These tools are designed for scheduling and running tasks, often in a distributed environment. They are well-suited for production ETL pipelines and managing dependencies between jobs. Unlike pipefunc, they often rely on serialized data or external storage for data exchange between tasks and require custom implementations for parameter sweeps.
**pipefunc vs. Job Schedulers:**

- **Focus**: `pipefunc` focuses on creating reusable, composable Python functions within a pipeline. Job schedulers focus on scheduling and executing independent tasks.
- **Complexity**: `pipefunc` is simpler to set up and use for Python-centric workflows. Job schedulers have more features but a steeper learning curve.
- **Flexibility**: `pipefunc` allows for dynamic, data-driven workflows within Python. Job schedulers are more rigid but offer robust scheduling and monitoring.
### Data Pipelines (e.g., Kedro, Prefect)
These tools provide frameworks for building data pipelines with a focus on data engineering best practices, such as modularity, versioning, and testing.
**pipefunc vs. Data Pipelines:**

- **Structure**: `pipefunc` is less opinionated about project structure than Kedro, which enforces a specific layout. Prefect is more flexible but still geared towards defining data flows.
- **Scope**: `pipefunc` is more focused on the computational aspects of pipelines, while Kedro and Prefect offer more features for data management, versioning, and deployment.
- **Flexibility**: `pipefunc` offers more flexibility in how pipelines are defined and executed, while Kedro and Prefect provide more structure and standardization.
### Workflow Definition Languages (e.g., Snakemake)
Snakemake uses a domain-specific language (DSL) to define workflows as a set of rules with dependencies. It excels at orchestrating diverse tools and scripts, often in separate environments, through a dedicated workflow definition file (Snakefile).
Unlike pipefunc, Snakemake primarily works with serialized data and may require custom implementations for parameter sweeps within the Python code.
**pipefunc vs. Snakemake:**

- **Workflow Definition**: `pipefunc` uses Python code with decorators. Snakemake uses a `Snakefile` with a specialized syntax.
- **Focus**: `pipefunc` is designed for Python-centric workflows and automatic parallelization within Python. Snakemake is language-agnostic and handles the execution of diverse tools and steps, potentially in different environments.
- **Flexibility**: `pipefunc` offers more flexibility in defining complex logic within Python functions. Snakemake provides a more rigid, rule-based approach.
- **Learning Curve**: `pipefunc` is generally easier to learn for Python users. Snakemake requires understanding its DSL.
**pipefunc within Snakemake:**
pipefunc can be integrated into a Snakemake workflow. You could have a Snakemake rule that executes a Python script containing a pipefunc pipeline, combining the strengths of both tools.
In essence:
pipefunc provides a simpler, more Pythonic approach for workflows primarily based on Python functions. It excels at streamlining development, reducing boilerplate, and automatically handling parallelization within the familiar Python ecosystem. While other tools may be better suited for production ETL pipelines, managing complex dependencies, or workflows involving diverse non-Python tools, pipefunc is ideal for flexible scientific computing workflows where rapid development and easy parameter exploration are priorities.
## How is this different from Hamilton?
Because the two projects are so often compared, Hamilton gets its own section.
pipefunc and Hamilton both generate DAGs from plain Python functions, but they target different pain points.
**Where pipefunc leans in**

- **Scientific + HPC workflows**: Minimal runtime overhead (<10 µs) with `mapspec`-driven scheduling, executor-agnostic parallelism (any `concurrent.futures.Executor`, from `ProcessPoolExecutor` to Dask, ipyparallel, mpi4py, etc.), and first-class support for job schedulers such as SLURM and PBS (see Execution and Parallelism).
- **N-dimensional parameter sweeps**: Built-in sweep tooling (`pipeline.map`) stores intermediate artifacts, supports eager or queued execution, and works with structured outputs like `xarray`.
- **Fine-grained resource policies**: Per-function constraints for CPU, memory, GPUs, wall-time, and custom selectors (Resource Management).
- **Type-aware validation**: Type hints are checked when pipelines are constructed (with optional runtime checks for `Array[...]` outputs), and pipelines can emit Pydantic models for CLIs, agents, or user interfaces (CLI and MCP).
**Where Hamilton focuses**

- **Column-first dataflows**: Decorators such as `@extract_columns`, `@parameterize_extract_columns`, and `@with_columns` specialize in expanding and transforming DataFrame columns while preserving lineage.
- **Data quality & observability**: Hamilton’s first-party decorators (`@check_output`, etc.) and plugins integrate with pandera/pydantic, emit OpenLineage metadata, and surface lineage/telemetry through the Hamilton UI.
- **Adapter-driven execution**: Hamilton provides its own `GraphAdapter` API for Ray, Dask, Spark, async/thread pools, etc. Switching backends means selecting or implementing the matching adapter rather than swapping a standard executor.
- **Structured module layout**: Drivers crawl Python modules to assemble the DAG; teams wanting strong conventions appreciate that, while others may find the enforced module boundaries restrictive compared to ad-hoc wiring.
**How to choose**

- Reach for `pipefunc` when you need lightweight orchestration for numerically intensive or simulation-heavy workloads, multi-dimensional sweeps, or custom resource scheduling without leaving pure Python.
- Reach for Hamilton when your pipelines revolve around DataFrame transformations, you need column-level lineage and validation baked in, or you want UI-driven observability out of the box.
You can mix them too: Hamilton-produced functions can be wrapped with `@pipefunc`, and pipefunc stages can call Hamilton drivers for teams already invested in Hamilton’s ecosystem.
## How to use adaptive with pipefunc?

This section has been moved to SLURM integration.

## How to handle defaults?

This section has been moved to Function Inputs and Outputs.

## How to bind parameters to a fixed value?

This section has been moved to Function Inputs and Outputs.

## How to rename inputs and outputs?

This section has been moved to Function Inputs and Outputs.

## How to handle multiple outputs?

This section has been moved to Function Inputs and Outputs.

## How does type checking work in pipefunc?

This section has been moved to Type Checking.

## What is the difference between `pipeline.run` and `pipeline.map`?

This section has been moved to Parallelism and Execution.

## How to use parameter scopes (namespaces)?

This section has been moved to Parameter Scopes.

## How to inspect the `Resources` inside a `PipeFunc`?

This section has been moved to Resource Management.

## How to set the `Resources` dynamically, based on the input arguments?

This section has been moved to Resource Management.

## How to use adaptive with pipefunc?

This section has been moved to Adaptive integration.

## What is the `ErrorSnapshot` feature in pipefunc?

This section has been moved to Error Handling.

## What is the overhead / efficiency / performance of pipefunc?

This section has been moved to Overhead and Efficiency.

## How to mock functions in a pipeline for testing?

This section has been moved to Testing.

## Mixing executors and storage backends for I/O-bound and CPU-bound work

This section has been moved to Parallelism and Execution.

## Get a function handle for a specific pipeline output (`pipeline.func`)

This section has been moved to Function Inputs and Outputs.

## `dataclasses` and `pydantic.BaseModel` as `PipeFunc`

This section has been moved to Function Inputs and Outputs.

## What is `VariantPipeline` and how to use it?

This section has been moved to Variants.

## How to use post-execution hooks?

This section has been moved to Parallelism and Execution.

## How to collect results as a step in my Pipeline?

This section has been moved to Function Inputs and Outputs.

## PipeFuncs with Multiple Outputs of Different Shapes

This section has been moved to Function Inputs and Outputs.

## Simplifying Pipelines

This section has been moved to Simplifying Pipelines.

## Parameter Sweeps

This section has been moved to Parameter Sweeps.