Skip to main content

Language Guide

This guide builds up the PyPlyne workflows you will use most often. For a compact list of syntax and verbs, use the Language Reference.

PyPlyne has two main pipeline shapes:

ShapeUse it forMain verbs
seqPython iterables, especially JSON-like records/listsmap, filter, reduce, set_fields, drop_fields, keep_fields
dfPolars-backed table datawhere, mutate, select, group_by, summarize, arrange

The shape determines which verbs are available and how expressions are interpreted.

Start With a Pipeline

The pipeline operator |> sends the value on the left into the call on the right. Read it as "then":

numbers = seq [1, 2, 3, 4, 5, 6]

result = numbers
|> filter(_ % 2 == 0)
|> map(_ * 10)

The assignment stores the final result of the pipeline. In this example, numbers is a sequence, so filter and map use sequence semantics.

Pipelines can also call methods on the current value:

name = "pyplyne" |> .upper()

Shape annotations go on the right-hand side:

values = seq [1, 2, 3]
sales = df read_csv("sales.csv")

Once a variable is known to be seq or df, later pipelines can start from that variable without repeating the annotation:

large_sales = sales |> where(amount > 100)

If a pipeline starts from an arbitrary Python expression or a custom function, annotate the source so PyPlyne knows which verb family applies.

Shape annotations also validate or normalize values at runtime. df [...] turns row dictionaries into a Polars DataFrame. seq ... checks that the value is iterable rather than a single mapping or scalar.

Work With Sequences

Use seq for Python iterables. Sequence verbs evaluate Python expressions over the items in the iterable:

values = seq [1, 2, 3, 4]

doubled = values |> map(_ * 2)
evens = values |> filter(_ % 2 == 0)
total = values |> reduce(_1 + _2)
total_from_zero = values |> reduce(_1 + _2, 0)

The result shape depends on the verb:

  • map(expression) returns one transformed item for each input item.
  • filter(expression) keeps items where the expression is true.
  • reduce(expression) combines the sequence into one scalar value.

Use _ for the common one-argument case:

values |> map(_ * 2)
values |> filter(_ > 2)

Use numbered placeholders when the callback receives more than one argument:

values |> reduce(_1 + _2)

Use arrow lambdas when names make the expression easier to read:

values |> map(x => x * 2)
values |> reduce((total, x) => total + x)

The older fn x: x * 2 spelling is still accepted for compatibility.

For more examples with records, Python objects, functions, and missing fields, see Sequence Patterns.

Work With Records

Sequences of dictionaries are common when working with JSON, APIs, or rows converted from a table. Use sequence verbs to filter the records, then field helpers to add, remove, or project fields:

orders = seq [
{"item": "notebook", "qty": 1, "price": 6},
{"item": "coffee", "qty": 3, "price": 4},
{"item": "pens", "qty": 2, "price": 3},
]

restock = orders
|> filter(qty > 1)
|> set_fields(total = qty * price, status = "buy")
|> drop_fields(price)
|> keep_fields(item, qty, total, status)

Record helpers:

  • set_fields(name = expression, ...) adds or replaces fields.
  • drop_fields(field, ...) removes fields when present.
  • keep_fields(field, ...) projects records to selected fields.

Inside record filter(...) and set_fields(...) expressions, bare names refer to dictionary fields or object attributes on the current record:

orders |> filter(qty > 1)
orders |> set_fields(buy = item == "pens")

If a bare field is missing, comparisons are falsy rather than crashing:

orders |> filter(qty > 1)

Rows without qty are skipped. Arithmetic still requires present values, so set_fields(total = qty * price) raises an error when qty or price is missing.

If you pass a single bare name to filter, PyPlyne treats it as an existing predicate function:

orders |> filter(is_priority)

Inside map(...), use placeholders or explicit lambdas:

orders |> map(_["item"])
orders |> map(order => order["item"])

Work With Tables

Use df for table-shaped data. PyPlyne normalizes table-shaped values into Polars DataFrames and compiles table expressions to Polars expressions:

sales = df [
{"region": "north", "amount": 120, "discount": 10},
{"region": "south", "amount": 80, "discount": 5},
{"region": "north", "amount": 220, "discount": 20},
]

large_sales = sales
|> where(amount > 100)
|> mutate(net = amount - discount)
|> select(region, amount, net)

Inside table verbs, bare names are columns:

sales |> where(amount > 100 and region == "north")

Think of each table verb as a context for column expressions:

VerbUse it to
where(condition)Filter rows by a boolean expression.
mutate(name = expression, ...)Add or replace columns.
select(column, ...)Choose columns.
group_by(column, ...)Group rows before a summary.
summarize(name = aggregation, ...)Aggregate grouped or whole tables.
arrange(column, descending=True)Sort rows.

Table expressions can use arithmetic, comparisons, and boolean logic:

sales
|> where(amount > 100 and discount < 20)
|> mutate(net = amount - discount)
|> arrange(net, descending=True)

Summarize Tables

Use group_by(...) followed by summarize(...) when each group should become one result row:

summary = sales
|> where(amount > 100)
|> mutate(net = amount - discount)
|> group_by(region)
|> summarize(
total = sum(net),
average = mean(net),
smallest = min(net),
largest = max(net),
rows = count(),
)
|> arrange(region)

The aggregation helpers available inside summarize(...) are sum, mean, min, max, and count. A grouped table should be followed immediately by summarize(...) before it is assigned, printed, written, or passed to other verbs.

Without group_by(...), summarize(...) aggregates the whole table:

overall = sales
|> summarize(total = sum(amount), rows = count())

Read and Write Files

Read helpers create table-shaped values. Table writes use Polars writers, and sequence JSON/CSV writes use Python-backed writers. Write helpers return the current value so you can keep piping:

sales = df read_csv("sales.csv")

summary = sales
|> where(amount > 100)
|> group_by(region)
|> summarize(total = sum(amount))

summary
|> write_json("summary.json")
|> write_parquet("summary.parquet")
|> write_csv("summary.csv")

Supported formats are CSV, JSON, Parquet, and Excel. Excel support requires:

uv sync --extra excel

Move Between Tables and Sequences

Use to_rows() to turn a table into a sequence of dictionaries, and to_table() to turn records back into a Polars table:

rows = sales |> to_rows()
table = rows |> to_table()

This is the explicit boundary between table verbs and sequence verbs. A common pattern is to do column-oriented filtering first, cross to records for dictionary-style edits, then return to a table:

reviewed = sales
|> where(amount > 100)
|> to_rows()
|> set_fields(reviewed=True)
|> to_table()
|> arrange(region)

After to_rows(), use sequence verbs. After to_table(), use table verbs.

Run Now or Defer

PyPlyne executes pipelines by default. Assigning a table pipeline gives you a concrete Polars result, even if Polars lazy plans are used internally:

large_sales = sales
|> where(amount > 100)
|> select(region, amount)

Use defer only when you want to keep a lazy query plan:

plan = defer read_csv("sales.csv")
|> where(amount > 100)
|> select(region, amount)

Avoid wrapping a lazy plan in df when you want to keep it lazy, because df normalization can collect lazy frames.

collect() remains available for lazy Polars plans:

result = plan |> collect()

Use Python Objects

Python imports work normally:

from pathlib import Path
import polars as pl

root = Path(".")
sales = df pl.DataFrame([
{"region": "north", "amount": 120},
{"region": "south", "amount": 80},
])

Custom Python functions can be used in sequence pipelines:

from scoring import score_order

ranked = seq orders
|> map(score_order(_))

Custom pipeline functions and method pipes do not change PyPlyne's shape tracking. If a custom function changes a table into rows, or rows into a table, start the next pipeline with an explicit seq or df boundary.

They can also be called before a shape annotation:

orders = seq load_orders()

Know the Two Uses of _

Inside sequence verbs, _ is the current item:

values |> map(_ * 10)

In an interactive session, _ is also the last expression result:

values |> map(_ * 10)
_ |> filter(_ > 10)

In the second line, the left _ is the previous result and the right _ is the current item inside filter:

_ |> filter(_ > 10)
^ ^
last current
result item

The session only stores shape information for _ when the last result is sequence-shaped or table-shaped.

Expression Basics

PyPlyne currently supports:

  • assignments: name = expression
  • shape annotations: name = seq expression, name = df expression
  • lists, tuples, dictionaries, strings, numbers, booleans, and None
  • function calls: name(arg, key=value)
  • attributes and indexing: item.name, item["name"]
  • arithmetic, comparisons, and boolean logic
  • pipelines and method pipes
  • arrow lambdas: x => x * 2
  • placeholder lambdas: _, _1, _2

For exact syntax, see the Language Reference.