NROY Sampling Methods: LHS vs Ray-Resample¶
This notebook compares two methods for generating NROY (Not Ruled Out Yet) samples between history matching waves:
lhs: Pure Latin Hypercube rejection sampling — brute force, guaranteed unbiased, but very slow when the NROY region is a tiny fraction of the prior.ray_resample: 4-stage pipeline (LHS → ray sampling → importance sampling → maximin thinning) inspired by the hmer R package. Much faster at low acceptance rates, but could introduce bias.
We'll investigate:
- How much faster is ray_resample?
- Do both methods produce similar NROY distributions?
- Where might ray_resample introduce bias?
import historymatching as hm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time
np.random.seed(42)
%matplotlib inline
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR I0000 00:00:1780534052.011897 1042382 port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`. I0000 00:00:1780534052.087784 1042382 cpu_feature_guard.cc:227] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR I0000 00:00:1780534053.543294 1042382 port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
/software/conda/envs/hm/lib/python3.13/site-packages/gpflow/versions.py:1: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81. import pkg_resources
Setup: A synthetic problem with a small NROY region¶
We create a 5D problem with 4 emulated targets and tight uncertainties, giving an NROY acceptance rate of ~1-5%. This is representative of later waves in history matching where pure LHS rejection gets slow.
The true "good" region is a narrow band where x0 ≈ 0.5 and x1 ≈ 0.3, with weaker constraints on x2-x4. This gives us a known ground truth to compare against.
# 5D problem — easier to visualize, but 4 targets make the NROY small
n_dims = 5
param_bounds = {f"x{i}": (0, 1) for i in range(n_dims)}
ps = hm.ParameterSpace(param_bounds)
# 4 targets with tight uncertainties → small NROY region
obs = hm.ObservationData({
"y1": (0.50, 0.02), # strongly constrains x0
"y2": (0.30, 0.02), # strongly constrains x1
"y3": (0.40, 0.05), # moderately constrains x0+x2
"y4": (0.20, 0.04), # constrains x1+x3
})
# Generate training data: known functions of the parameters
n_train = 500
X_train = pd.DataFrame(
np.random.uniform(0, 1, (n_train, n_dims)),
columns=[f"x{i}" for i in range(n_dims)]
)
noise = 0.01
y_train = pd.DataFrame({
"y1": X_train["x0"] + noise * np.random.randn(n_train),
"y2": X_train["x1"] + noise * np.random.randn(n_train),
"y3": 0.5*(X_train["x0"] + X_train["x2"]) + noise * np.random.randn(n_train),
"y4": 0.3*X_train["x1"] + 0.2*X_train["x3"] + noise * np.random.randn(n_train),
})
print(f"Problem: {n_dims}D, {len(obs.get_all_targets())} targets, {n_train} training points")
for k, (m, s) in obs.get_all_targets().items():
print(f" {k}: target={m}, std={s}")
Problem: 5D, 4 targets, 500 training points y1: target=0.5, std=0.02 y2: target=0.3, std=0.02 y3: target=0.4, std=0.05 y4: target=0.2, std=0.04
# Train one emulator per target
bank = hm.EmulatorBank()
for target in obs.get_all_targets():
em = hm.BayesLinear(X_train, y_train[[target]])
em.train()
bank.add_emulator(1, target, em)
em.test()
r2 = em.emulator_metrics.get("R2", float("nan"))
print(f" {target}: R²={r2:.4f}")
# Check LHS acceptance rate
from historymatching.nroy_sampling import _filter_nroy
lhs = hm.SamplingStrategyFactory.create("lhs")
param_cols = ps.get_parameter_names()
test_candidates = lhs.generate_samples(ps, 50000, seed=0)
test_nroy = _filter_nroy(test_candidates, bank, obs, 3.5, param_cols)
acceptance = len(test_nroy) / len(test_candidates)
print(f"LHS acceptance rate: {acceptance:.2%} ({len(test_nroy)}/50000)")
print("Expected ~1-5% — if much higher, tighten target stds")
y1: R²=0.9989
y2: R²=0.9987
WARNING:historymatching.emulators.bayes_linear:Bayes Linear theta optimization did not converge: ABNORMAL: . The fitted model may still be usable.
y3: R²=0.9969
y4: R²=0.9925
LHS acceptance rate: 1.63% (816/50000) Expected ~1-5% — if much higher, tighten target stds
Comparison: Speed¶
Generate 500 NROY samples with each method and time them.
n_target = 500
# Ground truth: large LHS sample for reference distribution
print("Generating ground truth (large LHS)...")
t0 = time.time()
nroy_truth = hm.generate_nroy_design(
n_points=2000,
parameter_space=ps,
emulator_bank=bank,
observations=obs,
threshold=3.5,
method="lhs",
seed=0,
).samples
t_truth = time.time() - t0
print(f"Ground truth: {len(nroy_truth)} points in {t_truth:.2f}s")
# LHS rejection
print("LHS rejection...")
t0 = time.time()
nroy_lhs = hm.generate_nroy_design(
n_points=n_target,
parameter_space=ps,
emulator_bank=bank,
observations=obs,
threshold=3.5,
method="lhs",
seed=42,
).samples
t_lhs = time.time() - t0
print(f"LHS rejection: {len(nroy_lhs)} points in {t_lhs:.2f}s")
# Ray-resample pipeline
print("Ray-resample...")
t0 = time.time()
nroy_ray = hm.generate_nroy_design(
n_points=n_target,
parameter_space=ps,
emulator_bank=bank,
observations=obs,
threshold=3.5,
method="ray",
seed=42,
).samples
t_ray = time.time() - t0
print(f"Ray-resample: {len(nroy_ray)} points in {t_ray:.2f}s")
print(f"Speedup: {t_lhs/t_ray:.1f}x")
Generating ground truth (large LHS)...
Ground truth: 2000 points in 9.33s LHS rejection...
LHS rejection: 500 points in 2.82s Ray-resample...
Ray-resample: 500 points in 3.56s Speedup: 0.8x
Comparison: Marginal distributions¶
If both methods are unbiased, marginal histograms should look similar. Ray-resample could over-represent boundary regions (from ray sampling) or under-represent tails (from PCA-oriented proposals).
fig, axes = plt.subplots(1, n_dims, figsize=(4*n_dims, 4), sharey=True)
n_bins = 25
for i, param in enumerate(param_cols):
ax = axes[i]
ax.hist(nroy_truth[param], bins=n_bins, alpha=0.3, density=True,
label="truth (2k LHS)", color="gray")
ax.hist(nroy_lhs[param], bins=n_bins, alpha=0.5, density=True,
label="LHS 500", color="tab:blue")
ax.hist(nroy_ray[param], bins=n_bins, alpha=0.5, density=True,
label="ray_resample 500", color="tab:orange")
ax.set_title(param)
if i == 0:
ax.legend(fontsize=8)
fig.suptitle("Marginal distributions: truth (gray) vs LHS (blue) vs ray_resample (orange)", fontsize=14)
fig.tight_layout()
plt.show()
Comparison: Pairwise scatter¶
Check the joint distribution for the most constrained parameters.
# Pairwise scatter for most constrained parameters
truth_ranges = nroy_truth.std()
top_params = truth_ranges.sort_values().head(4).index.tolist()
print(f"Most constrained parameters: {top_params}")
n_top = len(top_params)
fig, axes = plt.subplots(n_top, n_top, figsize=(12, 12))
for i, p1 in enumerate(top_params):
for j, p2 in enumerate(top_params):
ax = axes[i, j]
if i == j:
ax.hist(nroy_truth[p1], bins=20, alpha=0.3, density=True, color="gray")
ax.hist(nroy_lhs[p1], bins=20, alpha=0.5, density=True, color="tab:blue")
ax.hist(nroy_ray[p1], bins=20, alpha=0.5, density=True, color="tab:orange")
else:
ax.scatter(nroy_truth[p2], nroy_truth[p1], alpha=0.15, s=4,
color="gray")
ax.scatter(nroy_ray[p2], nroy_ray[p1], alpha=0.5, s=8,
color="tab:orange", marker="s")
ax.scatter(nroy_lhs[p2], nroy_lhs[p1], alpha=0.5, s=8,
color="tab:blue", marker="o")
if j == 0:
ax.set_ylabel(p1)
if i == n_top - 1:
ax.set_xlabel(p2)
fig.suptitle("Pairwise: truth (gray) vs LHS (blue o) vs ray (orange sq)", fontsize=14)
fig.tight_layout()
plt.show()
Most constrained parameters: ['x0', 'x1', 'x2', 'x3']
Comparison: Summary statistics¶
Quantitative comparison of means, standard deviations, and Kolmogorov-Smirnov test per parameter.
from scipy.stats import ks_2samp
print("KS tests: each method vs ground truth (2000 LHS points)")
print()
print("{:<8} {:<10} {:<10} {:<10} {:<10}".format("Param", "KS(LHS)", "p(LHS)", "KS(ray)", "p(ray)"))
print("-" * 48)
for param in param_cols:
ks_lhs, p_lhs = ks_2samp(nroy_truth[param], nroy_lhs[param])
ks_ray, p_ray = ks_2samp(nroy_truth[param], nroy_ray[param])
flag = " <-- bias?" if p_ray < 0.05 and p_lhs > 0.05 else ""
print("{:<8} {:<10.4f} {:<10.4f} {:<10.4f} {:<10.4f}{}".format(
param, ks_lhs, p_lhs, ks_ray, p_ray, flag))
print()
print("If p(ray) < 0.05 but p(LHS) > 0.05, ray_resample is biased for that parameter.")
print("If both have low p-values, it may just be sampling noise (n=500 vs n=2000).")
KS tests: each method vs ground truth (2000 LHS points) Param KS(LHS) p(LHS) KS(ray) p(ray) ------------------------------------------------ x0 0.0340 0.7364 0.1560 0.0000 <-- bias? x1 0.0290 0.8838 0.2495 0.0000 <-- bias? x2 0.0360 0.6696 0.2340 0.0000 <-- bias? x3 0.0370 0.6359 0.0850 0.0059 <-- bias? x4 0.0350 0.7032 0.4435 0.0000 <-- bias? If p(ray) < 0.05 but p(LHS) > 0.05, ray_resample is biased for that parameter. If both have low p-values, it may just be sampling noise (n=500 vs n=2000).
Investigating potential bias¶
Ray sampling generates points along lines connecting NROY pairs, which could over-represent the "skeleton" of the NROY region. Importance sampling centers proposals on existing points, which could under-explore disconnected regions.
Let's check: what fraction of ray_resample points came from each stage?
# Run ray_resample with verbose logging to see stage contributions
import logging
logging.basicConfig(level=logging.INFO, format='%(message)s')
logging.getLogger('historymatching').setLevel(logging.INFO)
nroy_verbose = hm.generate_nroy_design(
n_points=n_target,
parameter_space=ps,
emulator_bank=bank,
observations=obs,
threshold=3.5,
method='ray',
seed=123, # different seed for variety
).samples
print(f'\nFinal: {len(nroy_verbose)} NROY points')
INFO:historymatching.nroy_sampling: LHS seed: generating 500 candidates...
INFO:historymatching.nroy_sampling: LHS seed: 6/500 (1.2%)
INFO:historymatching.nroy_sampling: RAY MODE: 6 LHS seed points → ray + importance sampling
INFO:historymatching.nroy_sampling: Ray sampling: 200 rays × 500 pts (scaled 83x for shortfall, seeded with 6 points)...
INFO:historymatching.nroy_sampling: Ray sampling: +7227, pool=7233/500 [1.2s]
INFO:historymatching.nroy_sampling: NROY SUMMARY: 500/500 [OK] — sources: LHS=6, ray=7227 [1.8s]
Final: 500 NROY points
Practical recommendations¶
- Early waves (acceptance >10%): Both methods are fast. LHS is simpler.
- Later waves (acceptance <1%):
ray_resampleis much faster. Check marginals against a small LHS sample to verify no major bias. - Maximin thinning helps counteract ray-sampling bias by selecting a space-filling subset from the pool.
- Tuning: Increase
n_linesandpoints_per_lineif the NROY region has complex topology. Increasemaximin_repsfor better thinning. - Diagnostics: Always compare z-scores and parameter marginals across waves to detect systematic drift.
# Example: tuning ray_resample options
nroy_tuned = hm.generate_nroy_design(
n_points=n_target,
parameter_space=ps,
emulator_bank=bank,
observations=obs,
threshold=3.5,
method='ray',
seed=42,
n_lines=40, # more rays for better boundary coverage
points_per_line=100, # denser sampling per ray
maximin_reps=2000, # better space-filling selection
).samples
print(f'Tuned: {len(nroy_tuned)} NROY points')
INFO:historymatching.nroy_sampling: LHS seed: generating 500 candidates...
INFO:historymatching.nroy_sampling: LHS seed: 10/500 (2.0%)
INFO:historymatching.nroy_sampling: RAY MODE: 10 LHS seed points → ray + importance sampling
INFO:historymatching.nroy_sampling: Ray sampling: 200 rays × 500 pts (scaled 50x for shortfall, seeded with 10 points)...
INFO:historymatching.nroy_sampling: Ray sampling: +20881, pool=20891/500 [3.0s]
INFO:historymatching.nroy_sampling: NROY SUMMARY: 500/500 [OK] — sources: LHS=10, ray=20881 [4.0s]
Tuned: 500 NROY points
Summary¶
| Method | Speed | Bias risk | Best for |
|---|---|---|---|
lhs |
Slow at low acceptance | None (ground truth) | Early waves, validation |
ray_resample |
Fast even at <1% acceptance | Possible boundary enrichment | Later waves, production |
The ray_resample method is the default. When LHS acceptance is above 10%
(configurable via lhs_fallback_rate), it automatically falls back to pure LHS
to avoid unnecessary boundary bias.
Known limitations of ray_resample¶
Boundary enrichment: Ray sampling generates points along lines connecting distant NROY pairs, which naturally over-represents the boundary of the NROY region. Maximin thinning partially corrects this, but the pool is still boundary-heavy. This shows up as inflated standard deviations in marginals (see KS test results above).
Assumes contiguous NROY: Ray sampling draws lines between NROY points, assuming the region between them is connected. If the NROY space has disconnected components (e.g. two parameter regimes that both fit the data), rays between points in different islands cross implausible space and contribute few new points. Importance sampling has the same issue — it centers proposals on known points and cannot jump to undiscovered islands. Pure LHS rejection does not have this problem.
Auto-fallback: When LHS acceptance is above the
lhs_fallback_ratethreshold (default 10%),ray_resampleautomatically falls back to pure LHS. This avoids introducing bias when rejection sampling is already fast. The ray/importance stages only kick in when the NROY region is genuinely small and hard to find.
Recommendations¶
- Early waves: The auto-fallback handles this — ray_resample will use pure LHS.
- Later waves: Ray_resample is much faster. Monitor marginals and z-scores for drift.
- Multimodal problems: Use
method="lhs"if you suspect disconnected NROY regions. - Validation: Run both methods on a representative wave and compare with KS tests.