LHCb Starter Kit 2023: Statistics I

Hans Dembinski, TU Dortmund

About me

  • Working with the LHCb experiment since 2016
  • Previously an astroparticle physicist, working with the Pierre Auger Observatory and IceCube
  • Statistics convener in LHCb (2019-2021, 2024-now)
  • Interests
    • Statistics
    • High-performance computing
    • API design: “Make Interfaces Easy to Use Correctly and Hard to Use Incorrectly” (Scott Meyers)

Scikit-HEP project

  • Provides high-performance data analysis tools in Python
  • Partially funded by NSF grants
  • Several LHCb developers: Eduardo Rodrigues, Chris Burr, myself, …
  • Core values
    • Specialized tools that integrate well with existing Python ecosystem
    • Easy to use for beginners
    • Flexible for power users
    • Good documentation
    • Easy installation

The power of combinatorics

Build everything from basic components that fit together

Image credit: Toni Zaat on Unsplash

Important packages

  • Python scientific stack

    • Numpy: fast computation with arrays
    • Scipy: integration, statistical distributions, special functions, …
    • Matplotlib: scientific plotting
    • Numba: JIT compiler for Python (very fast)
    • Pandas: Convenient processing of tabular data
  • Scikit-HEP Project

    • uproot: fast reading and writing of ROOT Trees
    • particle: get particle properties from PDG IDs
    • boost-histogram: multi-dimensional generalised histograms
    • iminuit: fitting and error computation package
    • resample: Easy bootstrapping to compute uncertainty and bias
    • pyhf: for histogram fitting/limit setting and preserving likelihoods
  • Other packages

    • jacobi: Easy numerical error propagation
    • numba-stats: Fast statistical distributions for fitting
    • tabulate: Convert data tables to various formats (LaTeX, HTML, Markdown, …)
    • scikit-learn: Basic machine learning tools
    • xgboost: Gradient-boosting machine library that won many Kaggle competitions

Install everything with pip install LIBRARY

Working with Jupyter notebooks

  • I recommend using VS Code
    • Very good Jupyter notebook support
    • Remote explorer: edit and run code on remote machines

Prepare computing environment

  • In a terminal, create and enter a virtual environment (you need Python 3.8 or later)

    python3 -m venv .venv
    source .venv/bin/activate
  • Install packages

    python -m pip install --upgrade numba matplotlib uproot boost-histogram iminuit scipy particle numba_stats ipywidgets
  • Download this notebook and example.root from Indico, and put the latter into the same folder

  • Run VS Code in the current folder, then open the notebook inside the editor, OR

  • Run

    python -m pip install ipykernel
    python -m ipykernel install --name "starterkit"
    jupyter notebook lecture_1.ipynb

    Then select starterkit as kernel

uproot

  • uproot provides fast and convenient access to ROOT trees

import uproot

print(f"uproot {uproot.__version__}")
uproot 5.2.2
f = uproot.open("example.root")
event = f["event"]
 
event.show()
name                 | typename                 | interpretation                
---------------------+--------------------------+-------------------------------
trk_len              | int32_t                  | AsDtype('>i4')
mc_trk_len           | int32_t                  | AsDtype('>i4')
trk_imc              | int32_t[]                | AsJagged(AsDtype('>i4'))
trk_px               | float[]                  | AsJagged(AsDtype('>f4'))
trk_py               | float[]                  | AsJagged(AsDtype('>f4'))
trk_pz               | float[]                  | AsJagged(AsDtype('>f4'))
mc_trk_px            | float[]                  | AsJagged(AsDtype('>f4'))
mc_trk_py            | float[]                  | AsJagged(AsDtype('>f4'))
mc_trk_pz            | float[]                  | AsJagged(AsDtype('>f4'))
mc_trk_pid           | int32_t[]                | AsJagged(AsDtype('>i4'))
  • The tree contains fake simulated LHCb events

  • Reconstructed tracks in the simulation: branches starting with trk_*

  • Simulation truth: branches starting with mc_trk_*

  • Some branches are 2D (e.g. trk_px)

    • First index iterates over events
    • Second index iterates over tracks per event
  • Mathematical operators and slicing work on these arrays similarly to Numpy

  • Many numpy functions work on awkward arrays

import numpy as np
np.__version__
'1.24.4'
data = event.arrays()

np.sum(data["trk_len"])
5032

Exercise

  • Use np.sum and an array mask to count how many events have zero tracks
# do exercise here
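  • One possible solution sketch: comparing with zero yields a boolean mask, and np.sum counts the True entries

# count events with zero tracks; data["trk_len"] == 0 is a boolean mask
n_zero = np.sum(data["trk_len"] == 0)
n_zero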
  • For some operations on awkward arrays, we need the awkward library
import awkward as ak
ak.__version__
'2.6.1'
  • For plotting, we use matplotlib
import matplotlib
import matplotlib.pyplot as plt
matplotlib.__version__
'3.8.2'

Exercise

  • Plot the branch trk_px with matplotlib.pyplot.hist
  • Can you figure out from the error message why it does not work?
  • Fix the issue with the function ak.flatten

# do exercise here
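  • One possible solution sketch: plt.hist fails because trk_px is jagged (one list of values per event), so flatten it first with ak.flatten

plt.hist(ak.flatten(data["trk_px"]), bins=50);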

Reading data from a large tree

  • Trees often do not fit into memory (sizes of 100 GB and more are possible)
  • Iterate over chunks instead
  • Chunk-wise processing works very well with histograms

boost-histogram

  • Python frontend to the Boost.Histogram C++ library from the Boost project
  • Very fast and flexible
  • Multi-dimensional histograms and other binned statistics (profiles!)
  • Supports weighted data
  • And much much more, see docs
import boost_histogram as bh
bh.__version__
'1.4.0'
# make an axis
xaxis = bh.axis.Regular(20, -2, 2)

# easy to make several histograms with same binning by reusing axis
h_px = bh.Histogram(xaxis)
h_mc_px = bh.Histogram(xaxis)
# incremental filling, only read branch "trk_px" from tree
for data in event.iterate(["trk_px"]):
    h_px.fill(ak.flatten(data["trk_px"]))
h_px
Histogram(Regular(20, -2, 2), storage=Double()) # Sum: 5032.0
print(h_px)
                 ┌───────────────────────────────────────────────────────────┐
[-inf,   -2) 0   │                                                           │
[  -2, -1.8) 1   │▏                                                          │
[-1.8, -1.6) 3   │▎                                                          │
[-1.6, -1.4) 17  │█▎                                                         │
[-1.4, -1.2) 25  │█▊                                                         │
[-1.2,   -1) 78  │█████▌                                                     │
[  -1, -0.8) 181 │████████████▉                                              │
[-0.8, -0.6) 308 │█████████████████████▊                                     │
[-0.6, -0.4) 453 │████████████████████████████████▏                          │
[-0.4, -0.2) 647 │█████████████████████████████████████████████▉             │
[-0.2,    0) 819 │██████████████████████████████████████████████████████████ │
[   0,  0.2) 782 │███████████████████████████████████████████████████████▍   │
[ 0.2,  0.4) 629 │████████████████████████████████████████████▌              │
[ 0.4,  0.6) 480 │██████████████████████████████████                         │
[ 0.6,  0.8) 323 │██████████████████████▉                                    │
[ 0.8,    1) 159 │███████████▎                                               │
[   1,  1.2) 77  │█████▌                                                     │
[ 1.2,  1.4) 37  │██▋                                                        │
[ 1.4,  1.6) 7   │▌                                                          │
[ 1.6,  1.8) 3   │▎                                                          │
[ 1.8,    2) 3   │▎                                                          │
[   2,  inf) 0   │                                                           │
                 └───────────────────────────────────────────────────────────┘
# plot histogram with plt.stairs
plt.stairs(h_px.values(), h_px.axes[0].edges);

  • Histograms can be cut down to size after filling (see the sketch below)
    • Shrink the axis range with slices and loc
    • Make the binning coarser with rebin
    • See docs of boost-histogram or extended lecture
  • Beyond 1D histograms
    • Supports advanced axis types (e.g. category axis)
    • Supports higher dimensional histograms
    • Supports generalized histograms with binned statistics (aka “profile” in ROOT)
    • See docs of boost-histogram or extended lecture
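A minimal sketch of these operations using UHI slicing (the names h_zoom and h_coarse are illustrative):

# shrink the axis range to [-1, 1) with loc, which locates bins by data coordinate
h_zoom = h_px[bh.loc(-1):bh.loc(1)]
# merge pairs of adjacent bins with rebin
h_coarse = h_px[::bh.rebin(2)]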

Fits

  • Typical analysis workflow (often automated with Snakemake)

    1. Pre-select data
    2. Make histograms or profiles from pre-selected data
    3. Fit histograms or profiles to extract physical parameters
  • Many specialized fitting tools exist for individual purposes, e.g. scipy.optimize.curve_fit

  • Generic libraries: iminuit, zfit, RooFit, pyhf (compared below)

  • Fitting (“estimation” in statistics)

    • Adjust a parametric statistical model to a dataset and find the model parameters which match the dataset best
    • The AI community calls this “learning”
    • Estimate: the result of a fit
  • Dataset: samples \(\{ \vec x_i \}\)

  • Model: Probability density or probability mass function \(f(\vec x; \vec p)\) which depends on unknown parameters \(\vec p\)

  • Conjecture: maximum-likelihood estimate (MLE) \(\hat{\vec p} = \text{argmax}_{\vec p} L(\vec p)\) is optimal, with \(L(\vec p) = \prod_i f(\vec x_i; \vec p)\)

    • Equivalent: minimize negative log-likelihood (NLL), \(\hat{\vec p} = \text{argmin}_{\vec p}(-\ln L(\vec p))\)
  • Limiting case of maximum-likelihood fit: least-squares fit aka chi-square fit
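    • To make this explicit (a standard result): for Gaussian-distributed measurements \(y_i\) with known uncertainties \(\sigma_i\), \(-\ln L(\vec p) = \frac{1}{2}\sum_i \left(\frac{y_i - f(x_i; \vec p)}{\sigma_i}\right)^2 + \text{const}\), so minimizing the NLL is equivalent to minimizing the sum of squares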

iminuit

  • Python frontend to the Minuit2 C++ library, maintained by the ROOT team at CERN
  • Maximum-likelihood and least-squares fits with error estimation
import iminuit
from iminuit import Minuit
iminuit.__version__
'2.25.2'
  • iminuit vs. other packages
    • pyhf
      • Better choice when you need to compute limits and preserve the likelihood
    • zfit and RooFit
      • Help you build statistical models: automatic normalization & convolutions
        • This has to be done “by hand” in iminuit
      • Good for standard tasks, but restrict your freedom
    • iminuit
      • Has good documentation
      • Well-designed API
      • Is very flexible and open
      • Is very fast with numba
  • Typical fit in particle physics: a mass distribution of decay candidates with a signal peak over a smooth background
  • Task: fit the signal fraction z to calculate the signal yield (fraction × number of candidates)
# fast implementations of statistical distributions, API similar to scipy.stats
import numba_stats
from numba_stats import norm, truncexpon
numba_stats.__version__
'1.7.0'
def make_data(size, seed=0):
    # true parameter values used to generate the toy dataset
    z = 0.5      # signal fraction
    mu = 0.5     # mean of the signal peak
    sigma = 0.1  # width of the signal peak
    slope = 0.5  # slope of the exponential background
    s = norm.rvs(mu, sigma, size=int(z * size), random_state=seed)
    b = truncexpon.rvs(0, 1, 0, slope, size=int((1 - z) * size), random_state=seed)
    return np.append(s, b)

x1 = make_data(10_000)
plt.hist(x1, bins=100);

  • Statistical model assumptions
    • The full sample is an additive mixture of a signal sample and a background sample
    • The signal is normally distributed; its pdf is \(\mathcal{N}(\mu, \sigma)\) with parameters \(\mu\) and \(\sigma\)
    • The background follows a truncated exponential distribution
# model pdf
def model1(x, z, mu, sigma, slope):
    s = norm.pdf(x, mu, sigma)
    b = truncexpon.pdf(x, 0.0, 1.0, 0.0, slope)
    return (1 - z) * b + z * s

# we use NLL implementation provided by iminuit for convenience
from iminuit.cost import UnbinnedNLL

nll1 = UnbinnedNLL(x1, model1)
m = Minuit(nll1, z=0.5, mu=0.5, sigma=0.1, slope=1)
# need to set parameter limits or fit might fail
m.limits["z", "mu"] = (0, 1)
m.limits["sigma", "slope"] = (0, None)
m.migrad()
Migrad
FCN = -4741                        Nfcn = 99
EDM = 2.03e-05 (Goal: 0.0002)
Valid Minimum                      Below EDM threshold (goal x 10)
No parameters at limit             Below call limit
Hesse ok                           Covariance accurate

   Name    Value    Hesse Error   Minos Error-   Minos Error+   Limit-   Limit+   Fixed
0  z       0.489    0.009                                       0        1
1  mu      0.4983   0.0020                                      0        1
2  sigma   0.0977   0.0019                                      0
3  slope   0.505    0.016                                       0

        z                   mu                 sigma             slope
z       8.48e-05            -2e-6 (-0.102)     9e-6 (0.535)      -0.06e-3 (-0.395)
mu      -2e-6 (-0.102)      3.94e-06           -0e-6 (-0.124)    -5e-6 (-0.153)
sigma   9e-6 (0.535)        -0e-6 (-0.124)     3.7e-06           -9e-6 (-0.280)
slope   -0.06e-3 (-0.395)   -5e-6 (-0.153)     -9e-6 (-0.280)    0.000268

[matplotlib figure]

Exercise

  • Use the interactive mode via m.interactive() to check how the fit reacts to parameter changes
  • Useful for debugging and for finding starting values by hand
# do exercise here
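  • One possible solution (m.interactive() requires the ipywidgets package, installed earlier):

m.interactive()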
  • You can compute 2D confidence regions with Minuit.mncontour

Exercise

  • Draw the likelihood profile contour of the parameters z and sigma with Minuit.draw_mncontour
  • Compute the matrix of likelihood profile contours with Minuit.draw_mnmatrix

# do exercise here
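One possible solution sketch:

# 2D profile-likelihood contour of z vs. sigma
m.draw_mncontour("z", "sigma")
# matrix of profile-likelihood contours for all parameter pairs
m.draw_mnmatrix()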

Faster fits?

  • Option 1: accelerate NLL computation with numba
  • Option 2: use binned data
x2 = make_data(1_000_000)
  • Option 1
    • The NLL computation is trivially parallelizable: log(pdf) is computed independently for each data point
    • Numba allows us to exploit auto-vectorization (SIMD instructions) and parallel computing on multiple cores
    • See the full lecture for an example of numba-JIT compiling the NLL function; a minimal sketch follows after this list
    • See the benchmark of iminuit + numba vs. RooFit
  • Option 2
    • Often simpler: fit histograms
    • Fitting histograms introduces no bias if done correctly, only a loss in precision
    • Need the model cdf instead of the model pdf: \(F(x; \vec p) = \int_{-\infty}^x f(x'; \vec p) \,\text{d}x'\)
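A minimal sketch of Option 1 (numba-stats distributions are designed to be callable inside numba-compiled functions; the names fast_nll and cost are illustrative):

import numba as nb

# JIT-compiled NLL; parallel=True parallelizes the array expressions and the sum,
# fastmath=True permits SIMD-friendly reordering of floating-point operations
@nb.njit(parallel=True, fastmath=True)
def fast_nll(x, z, mu, sigma, slope):
    s = norm.pdf(x, mu, sigma)
    b = truncexpon.pdf(x, 0.0, 1.0, 0.0, slope)
    return -np.sum(np.log(z * s + (1 - z) * b))

# thin wrapper that binds the dataset; errordef tells Minuit that
# this function is a negative log-likelihood (affects error computation)
def cost(z, mu, sigma, slope):
    return fast_nll(x2, z, mu, sigma, slope)

cost.errordef = Minuit.LIKELIHOOD

m = Minuit(cost, z=0.5, mu=0.5, sigma=0.1, slope=0.5)
m.limits["z", "mu"] = (0, 1)
m.limits["sigma", "slope"] = (0, None)
m.migrad()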
from iminuit.cost import BinnedNLL

# model is now a cdf!
def model3(x, z, mu, sigma, slope):
    s = norm.cdf(x, mu, sigma)
    b = truncexpon.cdf(x, 0.0, 1.0, 0.0, slope)
    return z * s + (1 - z) * b

w, xe = np.histogram(x2, bins=100, range=(0, 1))
nll3 = BinnedNLL(w, xe, model3)
m = Minuit(nll3, z=0.5, mu=0.5, sigma=1, slope=1)
m.migrad()
Migrad
FCN = 94.55 (χ²/ndof = 1.0)        Nfcn = 192
EDM = 1.76e-07 (Goal: 0.0002)
Valid Minimum                      Below EDM threshold (goal x 10)
No parameters at limit             Below call limit
Hesse ok                           Covariance accurate

   Name    Value       Hesse Error   Minos Error-   Minos Error+   Limit-   Limit+   Fixed
0  z       501.0e-3    0.9e-3
1  mu      500.3e-3    0.2e-3
2  sigma   100.06e-3   0.20e-3
3  slope   0.4982      0.0017

        z                    mu                  sigma               slope
z       8.91e-07             -0.02e-6 (-0.103)   0.10e-6 (0.557)     -0.7e-6 (-0.422)
mu      -0.02e-6 (-0.103)    4.03e-08            -0 (-0.125)         -0.05e-6 (-0.155)
sigma   0.10e-6 (0.557)      -0 (-0.125)         3.98e-08            -0.10e-6 (-0.307)
slope   -0.7e-6 (-0.422)     -0.05e-6 (-0.155)   -0.10e-6 (-0.307)   2.72e-06

[matplotlib figure]

Other fits

  • To fit yields (counts) instead of fractions, use ExtendedUnbinnedNLL or ExtendedBinnedNLL
  • To fit (x, y) pairs, use LeastSquares
  • To fit simulated templates, use Template
  • See the documentation for how the models should be constructed in each case; a LeastSquares sketch follows below
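A minimal sketch of a LeastSquares fit to (x, y) pairs (the straight-line model, the toy data, and rng2 are illustrative; 0.1 is the assumed uncertainty per y value):

from iminuit.cost import LeastSquares

# toy data: straight line plus Gaussian noise
rng2 = np.random.default_rng(2)
x = np.linspace(0, 1, 20)
y = 1.0 + 2.0 * x + rng2.normal(0, 0.1, size=len(x))

# model for the y values as a function of x
def line(x, a, b):
    return a + b * x

m = Minuit(LeastSquares(x, y, 0.1, line), a=0, b=1)
m.migrad()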

Template fits

def make_data2(rng, nmc, truth, bins):
    # bin edges and expected bin contents of background (b) and signal (s)
    xe = np.linspace(0, 1, bins + 1)
    b = np.diff(truncexpon.cdf(xe, 0, 1, 0, 1))
    s = np.diff(norm.cdf(xe, 0.5, 0.05))
    # data histogram n and simulated templates t, both Poisson-fluctuated
    n = rng.poisson(b * truth[0]) + rng.poisson(s * truth[1])
    t = np.array([rng.poisson(b * nmc), rng.poisson(s * nmc)])
    return xe, n, t

rng = np.random.default_rng(1)
truth = 750, 250
xe, n, t = make_data2(rng, 200, truth, 15)

_, ax = plt.subplots(1, 2, figsize=(10, 4))
ax[0].stairs(n, xe, color="k", label="data")
ax[0].legend();
ax[1].stairs(t[0], xe, label="background")
ax[1].stairs(t[1], xe, label="signal")
ax[1].legend(title="templates");

from iminuit.cost import Template

c = Template(n, xe, t)
m = Minuit(c, 1, 1)
m.migrad()
Migrad
FCN = 8.352 (χ²/ndof = 0.6)        Nfcn = 132
EDM = 2.75e-07 (Goal: 0.0002)
Valid Minimum                      Below EDM threshold (goal x 10)
No parameters at limit             Below call limit
Hesse ok                           Covariance accurate

   Name   Value   Hesse Error   Minos Error-   Minos Error+   Limit-   Limit+   Fixed
0  x0     770     70                                          0
1  x1     220     40                                          0

      x0                 x1
x0    4.3e+03            -0.8e3 (-0.354)
x1    -0.8e3 (-0.354)    1.32e+03

[matplotlib figure]
  • With Template you can also fit a mix of templates and parametric models
  • Often a template is only needed for the background(s)

Exercise

  • With make_data2, generate templates from 1 000 000 simulated points
  • Fit the data with the Template cost function
  • Compare the parameter uncertainties to the previous fit
# do exercise here
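  • One possible solution sketch (reusing truth and the 15-bin setup from above):

# a larger simulation sample shrinks the uncertainty of the templates,
# which should reduce the uncertainties of the fitted yields
xe, n, t = make_data2(rng, 1_000_000, truth, 15)
c = Template(n, xe, t)
m = Minuit(c, 1, 1)
m.migrad()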