Skip to main content

Core Concepts

PyPlyne brings clean functional pipes directly to Python. It compiles to Python AST, runs in a normal Python environment, and is built around four ideas:

  • pipelines describe data flow from left to right
  • shapes choose the verb family for each step
  • table expressions are Polars expressions
  • Python objects stay available at the boundaries

Pipelines Read Left To Right

|> sends the value on the left into the call on the right:

result = numbers
|> filter(_ > 1)
|> map(_ * 10)

Conceptually:

[item * 10 for item in numbers if item > 1]

PyPlyne writes that flow in the order you think about it.

Each step receives the current value and returns the next value. Most verbs preserve the same shape, while terminal verbs such as reduce(...) can return a scalar.

Shapes Are The Pipeline Contract

PyPlyne distinguishes two pipeline shapes:

ShapeMental modelMain verbs
seqPython iterables, especially JSON-like records/listsmap, filter, reduce, set_fields, drop_fields, keep_fields
dfa table of rows and columnswhere, mutate, select, group_by, summarize, arrange

Shape annotations go on the right-hand side:

orders = seq [{"item": "coffee", "qty": 3}]
sales = df read_csv("sales.csv")

A shape is not a Python type annotation. It tells PyPlyne how to compile the pipeline that follows, and it normalizes or validates values at runtime. df [...] becomes a Polars DataFrame; seq ... checks that the value is an iterable rather than a single mapping or scalar. If a variable is already known to be seq or df, later pipelines can start from that variable without repeating the annotation:

large_sales = sales |> where(amount > 100)
tip

If a pipeline starts from an arbitrary Python expression or a custom function, annotate the source with seq or df so PyPlyne knows which verb family applies.

Sequence Pipelines Transform Items

Use seq when each step should work item by item. The value can be a list, tuple, generator, API response, or any other Python iterable.

values = seq [1, 2, 3, 4]

doubled = values |> map(_ * 2)
evens = values |> filter(_ % 2 == 0)
total = values |> reduce(_1 + _2)

Inside sequence callbacks, _ is the current item. Numbered placeholders such as _1 and _2 are available for callbacks that receive more than one value.

Sequences of dictionaries are useful for JSON-like records:

orders = seq [
{"item": "coffee", "qty": 3},
{"item": "pens", "qty": 2},
]

restock = orders
|> filter(qty > 1)
|> keep_fields(item)
|> set_fields(buy = item == "pens")

Inside record filter(...) and set_fields(...) expressions, bare names refer to fields on the current record:

rows |> filter(amount > 100)
rows |> set_fields(net = amount - discount)

Think of sequence verbs like compact Python loops: map transforms each item, filter keeps some items, and reduce collapses the sequence to one result.

Table Pipelines Transform Columns

Inside table verbs, bare names are column references:

summary = sales
|> where(amount > 100)
|> group_by(region)
|> summarize(total = sum(amount), rows = count())

amount and region are not Python variables there. They are compiled into Polars expressions and executed by the Polars backend.

Table verbs describe the part of the table they change:

  • where filters rows
  • mutate adds or replaces columns
  • select chooses columns
  • group_by changes the grouping context
  • summarize reduces each group to summary columns
  • arrange sorts rows

This is the same idea Polars uses for expression contexts: an expression does not do work by itself. It runs when placed in a context such as filtering, selecting, mutating, or aggregating.

Execution Runs By Default

PyPlyne materializes table pipelines by default at assignment and expression boundaries. When you run a script, the pipeline runs and you get a concrete result.

Use defer only when you want to keep a lazy Polars plan:

plan = defer sales
|> where(amount > 100)
|> select(region, amount)

Avoid adding df around a lazy plan you want to preserve, because df normalization can collect lazy frames.

collect() remains available for lazy Polars plans, but it is not the normal boundary between PyPlyne and results.

Shape Boundaries Are Explicit

Use explicit helpers to cross between table and sequence workflows:

rows = sales |> to_rows()
table = rows |> to_table()

Conversions can happen inside one pipeline:

reviewed = sales
|> where(amount > 100)
|> to_rows()
|> set_fields(reviewed=True)
|> to_table()
|> arrange(region)

Use table verbs while you want Polars column behavior. Convert to rows when you want Python item behavior. Convert back when you want table behavior again.

Python Is The Host Language

Imports compile to native Python import nodes:

from pathlib import Path
import polars as pl

Python values can enter a PyPlyne pipeline:

from pathlib import Path
from scoring import score_order

root = Path(".")
ranked = seq orders |> map(score_order(_))

The important boundary is expression meaning:

  • sequence callbacks use normal Python semantics
  • table verb expressions use column semantics
  • arbitrary Python calls are best made before a table pipeline, after it, or on rows after to_rows()

You can also run PyPlyne from Python:

from pyplyne import PyPlyneSession

orders = [
{"item": "coffee", "qty": 3},
{"item": "pens", "qty": 2},
]
session = PyPlyneSession({"orders": orders})
session.run("""
restock = seq orders
|> filter(qty > 1)
""")

The session environment can hold Python objects, imported functions, Polars tables, and intermediate PyPlyne results.

Interactive Sessions Keep Context

The REPL and session server keep variables, imports, data, and known shapes alive between snippets:

pyplyne> numbers = seq [1, 2, 3]
pyplyne> numbers |> map(_ * 10)
[10, 20, 30]
pyplyne> _ |> filter(_ > 10)
[20, 30]

In an interactive session, _ has two roles. At the start of a pipeline it is the previous expression result. Inside sequence callbacks it is the current item. If the previous result is scalar, PyPlyne clears the stored shape so later shape-specific verbs do not accidentally use stale context.