Core Concepts
PyPlyne brings clean functional pipes directly to Python. It compiles to Python AST, runs in a normal Python environment, and is built around four ideas:
- pipelines describe data flow from left to right
- shapes choose the verb family for each step
- table expressions are Polars expressions
- Python objects stay available at the boundaries
Pipelines Read Left To Right
|> sends the value on the left into the call on the right:
result = numbers
|> filter(_ > 1)
|> map(_ * 10)
Conceptually:
[item * 10 for item in numbers if item > 1]
PyPlyne writes that flow in the order you think about it.
Each step receives the current value and returns the next value. Most verbs
preserve the same shape, while terminal verbs such as reduce(...) can return
a scalar.
Shapes Are The Pipeline Contract
PyPlyne distinguishes two pipeline shapes:
| Shape | Mental model | Main verbs |
|---|---|---|
seq | Python iterables, especially JSON-like records/lists | map, filter, reduce, set_fields, drop_fields, keep_fields |
df | a table of rows and columns | where, mutate, select, group_by, summarize, arrange |
Shape annotations go on the right-hand side:
orders = seq [{"item": "coffee", "qty": 3}]
sales = df read_csv("sales.csv")
A shape is not a Python type annotation. It tells PyPlyne how to compile the
pipeline that follows, and it normalizes or validates values at runtime.
df [...] becomes a Polars DataFrame; seq ... checks that the value is an
iterable rather than a single mapping or scalar. If a variable is already known
to be seq or df, later pipelines can start from that variable without
repeating the annotation:
large_sales = sales |> where(amount > 100)
If a pipeline starts from an arbitrary Python expression or a custom function,
annotate the source with seq or df so PyPlyne knows which verb family applies.
Sequence Pipelines Transform Items
Use seq when each step should work item by item. The value can be a list,
tuple, generator, API response, or any other Python iterable.
values = seq [1, 2, 3, 4]
doubled = values |> map(_ * 2)
evens = values |> filter(_ % 2 == 0)
total = values |> reduce(_1 + _2)
Inside sequence callbacks, _ is the current item. Numbered placeholders such
as _1 and _2 are available for callbacks that receive more than one value.
Sequences of dictionaries are useful for JSON-like records:
orders = seq [
{"item": "coffee", "qty": 3},
{"item": "pens", "qty": 2},
]
restock = orders
|> filter(qty > 1)
|> keep_fields(item)
|> set_fields(buy = item == "pens")
Inside record filter(...) and set_fields(...) expressions, bare names refer
to fields on the current record:
rows |> filter(amount > 100)
rows |> set_fields(net = amount - discount)
Think of sequence verbs like compact Python loops: map transforms each item,
filter keeps some items, and reduce collapses the sequence to one result.
Table Pipelines Transform Columns
Inside table verbs, bare names are column references:
summary = sales
|> where(amount > 100)
|> group_by(region)
|> summarize(total = sum(amount), rows = count())
amount and region are not Python variables there. They are compiled into
Polars expressions and executed by the Polars backend.
Table verbs describe the part of the table they change:
wherefilters rowsmutateadds or replaces columnsselectchooses columnsgroup_bychanges the grouping contextsummarizereduces each group to summary columnsarrangesorts rows
This is the same idea Polars uses for expression contexts: an expression does not do work by itself. It runs when placed in a context such as filtering, selecting, mutating, or aggregating.
Execution Runs By Default
PyPlyne materializes table pipelines by default at assignment and expression boundaries. When you run a script, the pipeline runs and you get a concrete result.
Use defer only when you want to keep a lazy Polars plan:
plan = defer sales
|> where(amount > 100)
|> select(region, amount)
Avoid adding df around a lazy plan you want to preserve, because df
normalization can collect lazy frames.
collect() remains available for lazy Polars plans, but it is not the normal
boundary between PyPlyne and results.
Shape Boundaries Are Explicit
Use explicit helpers to cross between table and sequence workflows:
rows = sales |> to_rows()
table = rows |> to_table()
Conversions can happen inside one pipeline:
reviewed = sales
|> where(amount > 100)
|> to_rows()
|> set_fields(reviewed=True)
|> to_table()
|> arrange(region)
Use table verbs while you want Polars column behavior. Convert to rows when you want Python item behavior. Convert back when you want table behavior again.
Python Is The Host Language
Imports compile to native Python import nodes:
from pathlib import Path
import polars as pl
Python values can enter a PyPlyne pipeline:
from pathlib import Path
from scoring import score_order
root = Path(".")
ranked = seq orders |> map(score_order(_))
The important boundary is expression meaning:
- sequence callbacks use normal Python semantics
- table verb expressions use column semantics
- arbitrary Python calls are best made before a table pipeline, after it, or on
rows after
to_rows()
You can also run PyPlyne from Python:
from pyplyne import PyPlyneSession
orders = [
{"item": "coffee", "qty": 3},
{"item": "pens", "qty": 2},
]
session = PyPlyneSession({"orders": orders})
session.run("""
restock = seq orders
|> filter(qty > 1)
""")
The session environment can hold Python objects, imported functions, Polars tables, and intermediate PyPlyne results.
Interactive Sessions Keep Context
The REPL and session server keep variables, imports, data, and known shapes alive between snippets:
pyplyne> numbers = seq [1, 2, 3]
pyplyne> numbers |> map(_ * 10)
[10, 20, 30]
pyplyne> _ |> filter(_ > 10)
[20, 30]
In an interactive session, _ has two roles. At the start of a pipeline it is
the previous expression result. Inside sequence callbacks it is the current
item. If the previous result is scalar, PyPlyne clears the stored shape so later
shape-specific verbs do not accidentally use stale context.