Lamella surface distance filtering

In this example we'll use data from EMPIAR-11899. When we originally did this test we used versions of the ribosome and void networks that had been trained with this dataset excluded from the training collection. The current versions were trained on the full collection.

This data was originally generated by Tuijtel et al. who used it to investigate the effect of cryoFIB-milling-induced surface damage on subtomogram averaging quality in detail. We are re-using their data to show how pretrained networks (for void and ribosomes) can be used to achieve the same results.

We'll start with segmentation, with denoised reconstructions already in warp_tiltseries/reconstruction/denoised. There are 207 tomograms in the dataset.

Segmenting ribosomes and void

First, we segment ribosomes and void. For the void segmentation we use the 2D model (which is the one for which validation metrics are shown in the paper) and the maximum degree of test-time augmentation. For ribosomes we'll keep everything at default settings.

easymode segment ribosome --data warp_tiltseries/reconstruction/denoised --output segmented/
easymode segment void --data warp_tiltseries/reconstruction/denoised --2d --tta 8 --output segmented/

Subtomogram averaging of ribosome

We refer to the ribosome STA tutorial for more details on this step. In short, we pick ribosomes using:

easymode pick ribosome --data segmented --size 2000000 --spacing 250

With a minimum blob size of 2e6 Å³ and a minimum 250 Å spacing between particles. After exporting particles at 5 Å/px with WarpTools ts_export_particles and averaging in Relion5 to around 10 Å resolution, the particles should have settled into position.

Measuring the distance to the nearest lamella

Next, we wanted to measure the distance from each ribosome coordinate to the nearest lamella surface. We use the void segmentations (which annotate the regions of the tomogram that are not biological material - i.e., the void outside of the lamella) to identify these surfaces, and Pom to do the measurement. First, set up Pom:

pom initialize
pom add_source --tomograms warp_tiltseries/reconstruction/denoised
pom add_source --segmentations segmented

Then for the measurement (see this page for more details about using Pom), we define a sampler that measures distance (in Å) to the nearest void surface (void segmentation output thresholded at 0.5, or half the maximum network output value). We only take the two largest objects in the void mesh in to account, to avoid small spurious void output blobs within the lamella introducing errors in the measurement.

$ pom contextualize \
    --starfile relion/ribosome/Refine3D/job001/run_data.star \
    --samplers void:0.5:-2 \
    --substitutions .tomostar:_10.00Apx.mrc \
    --out_star ribosomes_pom.star

Contextualizing 27399 particles with 1 samplers:
  1. pomDistVoidT50: distance to void
     (threshold=0.5, keeping only 2 largest blob(s)).

found 207 volumes for feature "void".

sampling context for 207 tomograms with 32 workers.
  100%|████████████████████| 207/207 [01:39<00:00, 2.09it/s]

Splitting the particle set

We now have ribosome_pom.star with all the original particle coordinates and angles, plus a new column called 'pomDistVoidT50' which lists the distance from each particle to what we defined as the nearest lamella surface.

Lamella surface distance histogram

The median distance was 435 Å, so we split the set into two halves with:

starutil ribosomes_pom.star "pomDistVoidT50<435" --output ribosomes_surface.star
starutil ribosomes_pom.star "pomDistVoidT50>435" --output ribosomes_central.star

starutil — a small star file filtering script

starutil is a simple Python script for filtering and manipulating RELION/Warp star files. Save the script below as starutil.py and use it as python starutil.py (or alias it as starutil).

Requires: pip install starfile pandas numpy

#!/usr/bin/env python
import argparse
import operator
import re
import starfile
import sys
import numpy as np
import pandas as pd
from pathlib import Path

WIDTH = 80
HEIGHT = 20

def read_star(star_path):
    """Return (full_dict, particle_key)."""
    data = starfile.read(star_path, always_dict=True)
    if 'particles' in data:
        return data, 'particles'
    keys = list(data.keys())
    if len(keys) == 1:
        return data, keys[0]
    print(f"Available blocks: {keys}")
    sys.exit(1)

def parse_filter(f):
    m = re.match(r'^(\w+)(>=|<=|!=|>|<|=)(.+)$', f)
    if not m:
        sys.exit(f"Bad filter: {f}")
    col, op_str, val = m.groups()
    ops = {'=': operator.eq, '!=': operator.ne, '>': operator.gt,
            '<': operator.lt, '>=': operator.ge, '<=': operator.le}
    try:
        val = float(val)
        if val == int(val):
            val = int(val)
    except ValueError:
        pass
    return col, ops[op_str], val

def parse_substitute(s):
    parts = s.split(':', 2)
    if len(parts) != 3:
        sys.exit(f"Bad substitution (need col:search:replace): {s}")
    return parts

def parse_scale(s):
    parts = s.split(':', 1)
    if len(parts) != 2:
        sys.exit(f"Bad scale (need col:factor): {s}")
    col, factor = parts
    try:
        factor = float(factor)
    except ValueError:
        sys.exit(f"Bad scale factor: {factor}")
    return col, factor

def print_histogram(series, name):
    print(f"\n{'─' * WIDTH}")
    print(f"  {name}  (n={len(series)}, non-null={series.notna().sum()})")
    print(f"{'─' * WIDTH}")

    s = series.dropna()
    if len(s) == 0:
        print("  (all NaN)")
        return

    if pd.api.types.is_numeric_dtype(s):
        _numeric_histogram(s, name)
    else:
        _categorical_histogram(s, name)

def _numeric_histogram(s, name):
    lo, hi = s.min(), s.max()
    mean, std, med = s.mean(), s.std(), s.median()
    print(f"  min={lo:.4g}  max={hi:.4g}  mean={mean:.4g}  std={std:.4g}  median={med:.4g}")
    print()

    if lo == hi:
        print(f"  all values = {lo}")
        return

    is_int = pd.api.types.is_integer_dtype(s) or (s == s.round()).all()

    if is_int:
        lo_i, hi_i = int(lo), int(hi)
        bins = np.arange(lo_i - 0.5, hi_i + 1.5, 1.0)
        counts, _ = np.histogram(s, bins=bins)
        labels = [str(v) for v in range(lo_i, hi_i + 1)]
    else:
        nbins = min(HEIGHT, len(s.unique()))
        counts, edges = np.histogram(s, bins=nbins)
        labels = [f"{edges[i]:.3g}" for i in range(len(counts))]

    max_count = counts.max()
    if max_count == 0:
        return

    label_w = max(len(l) for l in labels)
    bar_w = WIDTH - label_w - 10

    for label, c in zip(labels, counts):
        bar_len = int(c / max_count * bar_w)
        print(f"  {label.rjust(label_w)} │{'█' * bar_len} {c}")

def _categorical_histogram(s, name):
    vc = s.value_counts()
    n_show = min(HEIGHT, len(vc))
    vc = vc.head(n_show)
    max_count = vc.iloc[0]

    label_w = min(30, max(len(str(v)) for v in vc.index))
    bar_w = WIDTH - label_w - 10

    for val, c in vc.items():
        lbl = str(val)[:label_w].ljust(label_w)
        bar_len = int(c / max_count * bar_w)
        bar = '█' * bar_len
        print(f"  {lbl} │{bar} {c}")

    if len(s.value_counts()) > n_show:
        print(f"  ... and {len(s.value_counts()) - n_show} more unique values")

def main():
    parser = argparse.ArgumentParser(description='Star file utility')
    parser.add_argument('star', nargs='+', help='Input star file(s) (multiple for merge)')
    parser.add_argument('filters', nargs='*')
    parser.add_argument('-o', '--output', help='Output star file')
    parser.add_argument('--substitute', action='append', metavar='col:search:replace',
                        help='String substitution in column (repeatable)')
    parser.add_argument('--scale', action='append', metavar='col:factor',
                        help='Multiply numeric column by constant (repeatable)')
    parser.add_argument('--drop', nargs='+', metavar='COL',
                        help='Remove one or more columns from the dataframe')
    parser.add_argument('--histogram', metavar='COL', help='Print value counts for column')
    parser.add_argument('--stats', metavar='COL', nargs='+',
                        help='Print histogram stats for column(s)')
    args = parser.parse_args()

    star_files = []
    filters = []
    for a in args.star + args.filters:
        if Path(a).suffix == '.star':
            star_files.append(a)
        else:
            filters.append(a)

    merging = len(star_files) > 1

    if merging:
        dfs = []
        for sf in star_files:
            all_blocks, pkey = read_star(sf)
            dfs.append(all_blocks[pkey])
            print(f"  {sf}: {len(all_blocks[pkey])} rows")
        df = pd.concat(dfs, ignore_index=True)
        all_blocks = None  # no extra blocks preserved for merge
        print(f"Merged: {len(df)} rows")
    else:
        all_blocks, pkey = read_star(star_files[0])
        df = all_blocks[pkey]
        print(f"Rows: {len(df)}")
        print(f"Columns: {', '.join(df.columns)}")
        if len(all_blocks) > 1:
            print(f"Blocks: {', '.join(all_blocks.keys())} (editing '{pkey}')")

    no_ops = not any([filters, args.substitute, args.scale, args.drop, args.histogram, args.stats, merging])

    if no_ops:
        return

    for f in filters:
        col, op, val = parse_filter(f)
        if col not in df.columns:
            sys.exit(f"Column '{col}' not found")
        df = df[op(df[col], val)].copy()

    for s in (args.substitute or []):
        col, search, replace = parse_substitute(s)
        if col not in df.columns:
            sys.exit(f"Column '{col}' not found")
        df[col] = df[col].astype(str).str.replace(search, replace, regex=False)

    for s in (args.scale or []):
        col, factor = parse_scale(s)
        if col not in df.columns:
            sys.exit(f"Column '{col}' not found")
        df[col] = df[col] * factor

    if args.drop:
        for col in args.drop:
            if col in df.columns:
                df = df.drop(columns=[col])
                print(f"Dropped column: {col}")
            else:
                print(f"Warning: Column '{col}' not found, skipping drop.")

    if args.histogram:
        if args.histogram not in df.columns:
            sys.exit(f"Column '{args.histogram}' not found")
        for val, count in df[args.histogram].value_counts().items():
            print(f"  {val}: {count}")
        return

    if args.stats:
        for col in args.stats:
            if col not in df.columns:
                sys.exit(f"Column '{col}' not found")
            print_histogram(df[col], col)
        return

    print(f"After operations: {len(df)} rows")

    if merging:
        default_out = str(Path(star_files[0]).with_stem(Path(star_files[0]).stem + '_merged'))
    else:
        default_out = str(Path(star_files[0]).with_stem(Path(star_files[0]).stem + '_filtered'))
    output = args.output or default_out

    if merging or all_blocks is None or len(all_blocks) == 1:
        out_data = {'particles': df}
    else:
        all_blocks[pkey] = df
        out_data = all_blocks

    starfile.write(out_data, output, overwrite=True)
    print(f"Written to {output}")

if __name__ == '__main__':
    main()

And as in the earlier ribosome refinement in M, made a population and source and added two species, one using each of the half particle sets. After refinement of the particle poses, CTF, and image warp, this resulted in maps at 4.0 Å for the central 50% of particles and 5.7 Å for the 50% of particles closest to the lamella surface.

Lamella distance refinement results