#!pip install scipy matplotlib numpy scikit-learn xgboost
Hans Dembinski
Overview
Caution: Field is riddled with buzzwords
- Machine learning, artificial intelligence, deep learning, belief networks, attention-based networks, long short-term memory, features, hyperparameters …
Anthropomorphized and technical names hide what are usually basic mathematical algorithms
Machine learning: umbrella term for multivariate statistical methods for classification and regression
- Multivariate: algorithms use more than one variable as input, e.g. (x, y, z)
- Statistical: algorithms can handle noisy input and are founded on statistical theory
- Classification: sort samples into categories based on their input variables
- Regression: find a mapping of input vectors to output vectors (fitting)
- Rise of the machines: Why do we need them?
- Humans good at working with 1D or 2D data, because we can visualize it
- Humans bad at working with higher-dimensional data
- Algorithms (“machines”) scale well to high-dimensional data
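To make the two tasks concrete, here is a minimal sketch using scikit-learn (the data and the choice of decision-tree models are made up purely for illustration):

import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))  # 100 samples with 3 input variables (x, y, z)

# classification: map each input vector to a category (here 0 or 1)
y_class = (X[:, 0] + X[:, 1] > 0).astype(int)
clf = DecisionTreeClassifier().fit(X, y_class)

# regression: map each input vector to a continuous output
y_reg = X[:, 0] ** 2 + X[:, 2]
reg = DecisionTreeRegressor().fit(X, y_reg)

print(clf.predict(X[:2]), reg.predict(X[:2]))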
Notable software
- Scikit-Learn: multiple algorithms and utilities
- XGBoost: gradient-boosted decision trees (aka gradient boosting machines)
- Neural networks
  - TensorFlow
  - Keras (high-level interface to TensorFlow)
  - PyTorch
  - …
Learning resources
Internet provides excellent resources for autodidactic learning
- Example code
- Tutorials
- Documentation
All methods can do classification and regression and work with any input
What to use?
- Structured data (i.e. tables of numbers): ensembles of decision trees
  - Good results with default settings
  - Can process continuous and categorical input (male/female, red/green/blue, …)
  - No data preprocessing required
  - Compute feature importance
- Unstructured data (i.e. images, waveforms, text, …): neural networks
  - Deep networks learn to extract features automatically
  - Require data preprocessing
  - Networks difficult to optimize for non-experts
  - Often need very large datasets for training
Some definitions
- sample: one or more numbers which describe the outcome of a stochastic process
- dataset: consists of samples
- features: subset of numbers from each sample which carry interesting information; also numbers computed from each sample
- signal: samples which originate from the stochastic process of interest, typically a particle decay
- background: samples which do not originate from the process of interest
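To make these terms concrete: a dataset is conveniently stored as a matrix with one row per sample and one column per feature. A minimal sketch (the numbers are made up for illustration):

import numpy as np

# dataset of N samples; each row is one sample, each column one feature,
# e.g. (reconstructed mass in GeV, transverse momentum in GeV) of a candidate
dataset = np.array([
    [0.498, 1.2],  # mass close to the K0S mass: likely signal
    [0.455, 0.4],  # mass far from the K0S mass: likely background
    [0.497, 2.1],
])
print(dataset.shape)  # (N, K) = (3, 2)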
Discuss how the typical decay \(K^0_S \to \pi^+ \pi^-\) is reconstructed and why in general there is always some background.
Clarify how the previous terms would be used in the context of this example.
Classification
- Setting: dataset is mix of two or more components, typically signal + one or more backgrounds
- Want only the signal component, so we apply a cut to select signal candidates
- Features are variables which allow us to discriminate between signal and background, for example, the reconstructed mass of the decay candidate
import scipy
import matplotlib
import numpy
import sklearn
import xgboost
print(f"scipy {scipy.__version__}")
print(f"matplotlib {matplotlib.__version__}")
print(f"numpy {numpy.__version__}")
print(f"sklearn {sklearn.__version__}")
print(f"xgboost {xgboost.__version__}")
scipy 1.10.0
matplotlib 3.7.1
numpy 1.21.6
sklearn 1.2.2
xgboost 1.7.4
from scipy.stats import norm
from matplotlib import pyplot as plt
import numpy as np
background = norm(0, 1).rvs(1000, random_state=1)
signal = norm(1, 0.1).rvs(500, random_state=1)

x = np.append(signal, background)
y = np.append(np.ones(len(signal)), np.zeros(len(background)))  # 1 = signal, 0 = background
def plot(x, y):
    signal = x[y == 1]
    background = x[y == 0]
    t, xe = np.histogram(x, bins=100)  # total counts and common bin edges
    s = np.histogram(signal, bins=xe)[0]  # signal counts (t = s + b)
    b = np.histogram(background, bins=xe)[0]  # background counts
    plt.stairs(b, xe, fill=True, label="background")
    # draw the total on top of the background baseline to stack the signal
    plt.stairs(t, xe, baseline=b, fill=True, label="signal")
    plt.xlabel("x")
    plt.legend()
    return xe

plot(x, y);
Exercise:
- Propose a simple selection (“cut”) that keeps most of the signal and removes most of the background
- Is it possible to eliminate the background completely?
Efficiency and purity of a selection
- Efficiency: fraction of signal left after applying the selection
  - Number of true signal samples after cut / number of all true signal samples
- Purity: fraction of signal in the selected sample
  - Number of true signal samples after cut / number of selected samples
- Want to maximize both efficiency and purity
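In formulas, with \(N\) denoting sample counts:

\[
\text{efficiency} = \frac{N_\text{signal, after cut}}{N_\text{signal}},
\qquad
\text{purity} = \frac{N_\text{signal, after cut}}{N_\text{selected}}
\]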
Single-parameter cut
xmin = 0.7

selected_signal = np.sum(x[y == 1] > xmin)
total_signal = np.sum(y == 1)
selected = np.sum(x > xmin)

efficiency = selected_signal / total_signal
purity = selected_signal / selected

plot(x, y)
plt.axvspan(-3.5, xmin, facecolor="k", alpha=0.5)  # shade the rejected region x < xmin
plt.legend(title=f"efficiency={efficiency:.2f}\npurity={purity:.2f}");
xmin = np.linspace(-3, 3, 100)
efficiency = []
purity = []
for xm in xmin:
    selected_signal = np.sum(signal > xm)
    total_signal = len(signal)
    selected = np.sum(x > xm)
    e = selected_signal / total_signal
    p = selected_signal / selected
    efficiency.append(e)
    purity.append(p)

plt.plot(xmin, efficiency, label="efficiency")
plt.plot(xmin, purity, label="purity")
plt.xlabel("xmin")
plt.legend();
plt.scatter(efficiency, purity, c=xmin)
plt.xlim(0, 1)
plt.ylim(0, 1)
plt.xlabel("efficiency")
plt.ylabel("purity")
plt.colorbar().set_label("xmin");
arg = np.argmax(purity)
print("efficiency at max purity", efficiency[arg])
efficiency at max purity 0.904
How to choose the cut?
- Want to maximize efficiency and purity, but what trade-off between the two?
- Often a subjective choice is acceptable, because an exact answer is difficult
- An objective answer is given by a figure-of-merit (FOM)
  - A popular choice is the Punzi FOM
- Most FOMs (including the Punzi FOM) only minimize the statistical uncertainty
- The correct FOM has to minimize the total uncertainty = statistical ⊕ systematic (⊕: sum in quadrature)
- Systematic uncertainty is often difficult to quantify: fall back to the Punzi FOM or a subjective choice
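For reference, the Punzi FOM for a selection threshold \(t\) is

\[
\mathrm{FOM}(t) = \frac{\varepsilon(t)}{a/2 + \sqrt{B(t)}},
\]

where \(\varepsilon(t)\) is the signal efficiency of the cut, \(B(t)\) is the expected number of background events that pass it, and \(a\) is the desired significance in standard deviations.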
BDT-based cut
- We can do better with two cuts, on xmin and xmax (see the sketch after this list)
- The space of cut parameters is then already two-dimensional
- With 2D data:
  - Four rectangular cuts possible (4 cut parameters)
  - Infinitely many non-rectangular cuts possible
- Use a machine instead; it needs to be trained on labeled data
- Use the BDT classifier from XGBoost
  - Gradient boosting machines regularly win Kaggle competitions
  - Well documented, fast, parallelizable, tunable
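Before training the BDT, a quick sketch of the two-sided cut mentioned above (this reuses x and y from the cells above; the cut values are arbitrary):

# two-sided cut: keep samples with xmin < x < xmax
xmin, xmax = 0.7, 1.3
mask = (x > xmin) & (x < xmax)
efficiency = np.sum(mask & (y == 1)) / np.sum(y == 1)
purity = np.sum(mask & (y == 1)) / np.sum(mask)
print(f"efficiency={efficiency:.2f} purity={purity:.2f}")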
from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import train_test_split
# generate training data; ideally we have more of it than actual data
background = norm(0, 1).rvs(100000, random_state=2)
signal = norm(1, 0.1).rvs(50000, random_state=2)
x2 = np.append(signal, background)

# features in general form a matrix [N, K] for N samples and K features
X = x2.reshape((len(x2), 1))
# labels: signal=1, background=0
y = np.append(np.ones_like(signal), np.zeros_like(background))

# ALWAYS do this: split into train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

# actual magic happens here
clf = XGBClassifier(random_state=4)
clf.fit(X_train, y_train);
fig, ax = plt.subplots(2, sharex=True)

plt.sca(ax[0])
xe = plot(X_test[:, 0], y_test)
plt.xlabel(None)  # unset label set by `plot`

plt.sca(ax[1])
score = clf.score(X_test, y_test)
xm = np.linspace(xe[0], xe[-1], 200)
# predict_proba expects a 2D array; column 0 holds the background (class 0)
# probability, column 1 the signal (class 1) probability
pb, ps = clf.predict_proba(xm.reshape(-1, 1)).T

plt.xlabel("x")
plt.plot(xm, pb, "C0-", label="background probability")
plt.plot(xm, ps, "C1-", label="signal probability")
plt.legend(title=f"score={score:.2f}", frameon=False);
Exercise:
- Vary some hyperparameters in XGBClassifier to see how score, runtime, and predicted probability change
  - n_estimators: try e.g. 2, 5, 10
  - max_depth: try e.g. 1, 5, 10
  - learning_rate: try e.g. 0.001, 0.05, 0.5, 1
- BDT predicts signal probability
- We use this to implement our cut: select regions with high signal probability
ps = clf.predict_proba(X_test[y_test == 1])[:, 1]
pb = clf.predict_proba(X_test[y_test == 0])[:, 1]

xe = plt.hist(ps, bins=20, alpha=0.5, label="true signal")[1]
plt.hist(pb, bins=xe, alpha=0.5, label="true background")
plt.legend()
plt.xlabel("predicted signal probability")
plt.ylim(0, 5e3);
xe = plot(X_test[:, 0], y_test)

# chosen cut: signal probability > 0.6
# visualize accepted region (predict_proba expects a 2D array)
xm = np.linspace(xe[0], xe[-1], 200)
mask = clf.predict_proba(xm.reshape(-1, 1))[:, 1] > 0.6
plt.plot(xm, mask * 5000, "k-");
pmin = np.linspace(0, 1, 100)
efficiency = []
purity = []
for pm in pmin:
    mask = clf.predict_proba(X_test)[:, 1] > pm
    selected = np.sum(mask)
    selected_signal = np.sum(y_test[mask])
    total_signal = np.sum(y_test)
    efficiency.append(selected_signal / total_signal)
    with np.errstate(divide="ignore", invalid="ignore"):
        purity.append(selected_signal / selected)

plt.plot(pmin, efficiency, label="efficiency")
plt.plot(pmin, purity, label="purity")
plt.xlabel("pmin")
plt.legend();
plt.scatter(efficiency, purity, c=pmin)
plt.xlim(0, 1)
plt.ylim(0, 1)
plt.xlabel("efficiency")
plt.ylabel("purity")
plt.colorbar().set_label("pmin");
- Physicists like efficiency and purity, but the ROC curve is the standard
  - ROC: receiver operating characteristic
  - ROC plots the true positive rate (fraction of signal that passes the cut) over the false positive rate (fraction of background that passes the cut)
- Scikit-Learn provides a quick way to generate this plot
from sklearn.metrics import RocCurveDisplay
RocCurveDisplay.from_estimator(clf, X_test, y_test);
- area under curve (AUC) is a common metric to compare the performance of two different classifiers
- can also use Scikit-Learn to compute ROC points and AUC and plot these ourselves
from sklearn.metrics import roc_curve, auc
# plot XGBoost curve again
fpr, tpr, thresholds = roc_curve(y_test, clf.predict_proba(X_test)[:, 1])
plt.plot(fpr, tpr, label=f"XGBoost AUC={auc(fpr, tpr):.2f}")

# plot ROC curve for our simple x > 0.7 cut for comparison
fpr2, tpr2, _ = roc_curve(y_test, X_test[:, 0] > 0.7)
plt.plot(fpr2, tpr2, label=f"x > 0.7 AUC={auc(fpr2, tpr2):.2f}")

plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend();
- Exercise: run the scikit-learn classifier comparison example
  - Try adding XGBClassifier into the comparison
  - Play around with the hyperparameters of the classifiers
Regression
Regression (aka fitting): “learn” a function y = f(x) from points (x, y), where both x and y can be vectors
Machines can do this, but we rarely need it in HEP
We usually prefer theoretically motivated models or hand-crafted low-order polynomials
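For contrast, a minimal sketch of the traditional approach: a least-squares fit of a parametric model with scipy.optimize.curve_fit (the model and data here are made up for the sketch):

import numpy as np
from scipy.optimize import curve_fit

def line(x, a, b):
    return a + b * x  # hand-crafted low-order model

rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 1, 50)
y_demo = line(x_demo, 1.0, 2.0) + rng.normal(0, 0.1, len(x_demo))

# curve_fit returns the estimated parameters and their covariance matrix
par, cov = curve_fit(line, x_demo, y_demo)
print(par, np.sqrt(np.diag(cov)))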
Let’s try regression on a one-dimensional function
def model(x, a, b, c, d):
return a + b * np.sin(c * x) * np.exp(- d * x)
truth = 0.2, 1, 2, 0.5

x = np.linspace(0, 10, 1000)
plt.plot(x, model(x, *truth))
plt.xlabel("x")
plt.ylabel("y");
- generate random points on this curve and add some noise to make it more realistic
rng = np.random.default_rng(1)

x = rng.uniform(0, 10, 1000)
y = model(x, 0.2, 1, 2, 0.5)
y += rng.normal(0, 0.05, len(y))

X = np.reshape(x, (-1, 1))
- (X, y) are inputs for regression similar to before, but y is a real number now, not a label (integer)
Neural network
- We use MLPRegressor from Scikit-Learn, a basic neural network
- Neural networks always need preprocessing
- Scikit-Learn offers pipelines to automate this, use them
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
# network with 5 hidden layers with 100 nodes each
regressor = MLPRegressor(
    hidden_layer_sizes=(100,) * 5,
    random_state=1)

# we use basic StandardScaler for preprocessing
pipeline = make_pipeline(StandardScaler(), regressor)
pipeline.fit(X, y);
# compute fitted curve
X2 = np.reshape(np.linspace(0, 10, 1000), (-1, 1))
y2 = pipeline.predict(X2)

plt.plot(x, y, ".")
plt.plot(X2[:, 0], y2, color="r", label="Neural net")
plt.plot(X2[:, 0], model(X2[:, 0], *truth), color="k", label="truth")
plt.legend()
plt.xlabel("x")
plt.ylabel("y");
- Exercise: play around with hyperparameters of MLPRegressor
- What happens if you remove the StandardScaler?
- What performs better, deep or wide networks?
Gaussian Process
- For low-dimensional data, Gaussian Process Regression can work well
- Not easy to set up: need to select suitable kernels and configure them
- Advantage: Predicts uncertainty and automatically provides smooth solution
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
kernel = RBF() + WhiteKernel()
gpr = GaussianProcessRegressor(kernel=kernel)

gpr.fit(X, y)
y3, y3err = gpr.predict(X2, return_std=True)
".")
plt.plot(x, y, 0], model(X2[:, 0], *truth), color="k", label="truth")
plt.plot(X2[:, 0], y2, color="r", label="Neural net")
plt.plot(X2[:, = plt.plot(X2[:, 0], y3, label="Gaussian Process")[0]
l 0], y3 - y3err, y3 + y3err, alpha=0.8, color=l.get_color())
plt.fill_between(X2[:, ; plt.legend()