T6 - Using analyzers

Analyzers are objects that do not change the behavior of a simulation, but just report on its internal state, almost always something to do with sim.people. This tutorial takes you through some of the built-in analyzers and gives a brief example of how to build your own.

Click here to open an interactive version of this notebook.

Results by age

By far the most common reason to use an analyzer is to report results by age. The results in sim.results are aggregated over all ages, whereas data on cervical cancers are generally reported by age. Age-specific outputs can be customized using an analyzer to match the age bins of the data. The following example shows how to set this up:

[1]:
import numpy as np
import sciris as sc
import hpvsim as hpv

# Create some parameters, setting beta (per-contact transmission probability) higher
# to create more cancers for illutration
pars = dict(beta=0.5, n_agents=50e3, start=1970, n_years=50, dt=1., location='tanzania')

# Also set initial HPV prevalence to be high, again to generate more cancers
pars['init_hpv_prev'] = {
    'age_brackets'  : np.array([  12,   17,   24,   34,  44,   64,    80, 150]),
    'm'             : np.array([ 0.0, 0.75, 0.9, 0.45, 0.1, 0.05, 0.005, 0]),
    'f'             : np.array([ 0.0, 0.75, 0.9, 0.45, 0.1, 0.05, 0.005, 0]),
}

# Create the age analyzers.
az1 = hpv.age_results(
    result_keys=sc.objdict(
        hpv_prevalence=sc.objdict( # The keys of this dictionary are any results you want by age, and can be any key of sim.results
            timepoints=['2019'], # List the years that you want to generate results for
            edges=np.array([0., 15., 20., 25., 30., 40., 45., 50., 55., 65., 100.]),
        ),
        hpv_incidence=sc.objdict(
            timepoints=['2019'],
            edges=np.array([0., 15., 20., 25., 30., 40., 45., 50., 55., 65., 100.]),
        ),
        total_cancer_incidence=sc.objdict(
            timepoints=['2019'],
            edges=np.array([0.,20.,25.,30.,40.,45.,50.,55.,65.,100.]),
        ),
        cancer_mortality=sc.objdict(
            timepoints=['2019'],
            edges=np.array([0., 20., 25., 30., 40., 45., 50., 55., 65., 100.]),
        )
    )
)

sim = hpv.Sim(pars, genotypes=[16, 18], analyzers=[az1])
sim.run()
a = sim.get_analyzer()
a.plot();
HPVsim 0.4.0 (2022-11-16) — © 2022 by IDM
Loading location-specific demographic data for "tanzania"
Initializing sim with 50000 agents
Loading location-specific data for "tanzania"
/home/docs/checkouts/readthedocs.org/user_builds/institute-for-disease-modeling-hpvsim/envs/latest/lib/python3.9/site-packages/hpvsim/data/loaders.py:209: FutureWarning: The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
  dd = full_df.groupby("Time").sum()["PopTotal"]
  Running 1970.0 ( 0/51) (0.96 s)  ———————————————————— 2%
  Running 1980.0 (10/51) (1.55 s)  ••••———————————————— 22%
  Running 1990.0 (20/51) (2.28 s)  ••••••••———————————— 41%
  Running 2000.0 (30/51) (3.19 s)  ••••••••••••———————— 61%
  Running 2010.0 (40/51) (4.39 s)  ••••••••••••••••———— 80%
  Running 2020.0 (50/51) (6.22 s)  •••••••••••••••••••• 100%

Simulation summary:
       66,936 total infections
       14,726 total cin1s
       17,671 total cin2s
        9,371 total cin3s
       41,768 total cins
        1,071 total cancers
            0 total cancer detections
        1,071 total cancer deaths
            0 total detected cancer deaths
       14,458 total reinfections
       17,939 total reactivations
            0 total hiv infections
         0.06 total hpv incidence (/100)
           48 total cin1 incidence (/100,000)
           58 total cin2 incidence (/100,000)
           31 total cin3 incidence (/100,000)
          136 total cin incidence (/100,000)
            3 total cancer incidence (/100,000)
         0.31 total hpv prevalence (/100)

../_images/tutorials_tut_analyzers_3_3.svg

It’s also possible to plot these results alongside data.

[2]:
az2 = hpv.age_results(
    result_keys=sc.objdict(
        total_cancers=sc.objdict(
            datafile='example_cancer_cases.csv',
        ),
    )
)
sim = hpv.Sim(pars, genotypes=[16, 18], analyzers=[az2])
sim.run()
a = sim.get_analyzer()
a.plot();
Loading location-specific demographic data for "tanzania"
Initializing sim with 50000 agents
Loading location-specific data for "tanzania"
  Running 1970.0 ( 0/51) (0.09 s)  ———————————————————— 2%
/home/docs/checkouts/readthedocs.org/user_builds/institute-for-disease-modeling-hpvsim/envs/latest/lib/python3.9/site-packages/hpvsim/data/loaders.py:209: FutureWarning: The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
  dd = full_df.groupby("Time").sum()["PopTotal"]
  Running 1980.0 (10/51) (0.65 s)  ••••———————————————— 22%
  Running 1990.0 (20/51) (1.37 s)  ••••••••———————————— 41%
  Running 2000.0 (30/51) (2.27 s)  ••••••••••••———————— 61%
  Running 2010.0 (40/51) (3.49 s)  ••••••••••••••••———— 80%
  Running 2020.0 (50/51) (5.29 s)  •••••••••••••••••••• 100%

Simulation summary:
       66,936 total infections
       14,726 total cin1s
       17,671 total cin2s
        9,371 total cin3s
       41,768 total cins
        1,071 total cancers
            0 total cancer detections
        1,071 total cancer deaths
            0 total detected cancer deaths
       14,458 total reinfections
       17,939 total reactivations
            0 total hiv infections
         0.06 total hpv incidence (/100)
           48 total cin1 incidence (/100,000)
           58 total cin2 incidence (/100,000)
           31 total cin3 incidence (/100,000)
          136 total cin incidence (/100,000)
            3 total cancer incidence (/100,000)
         0.31 total hpv prevalence (/100)

../_images/tutorials_tut_analyzers_5_3.svg

These results are not particularly well matched to the data, but we will deal with this in the calibration tutorial later.

Snapshots

Snapshots both take “pictures” of the sim.people object at specified points in time. This is because while most of the information from sim.people is retrievable at the end of the sim from the stored events, it’s much easier to see what’s going on at the time. The following example leverages a snapshot in order to create a figure demonstrating age mixing patterns among sexual contacts:

[3]:
snap = hpv.snapshot(timepoints=['2020'])
sim = hpv.Sim(pars, analyzers=snap)
sim.run()

a = sim.get_analyzer()
people = a.snapshots[0]

# Plot age mixing
import pylab as pl
import matplotlib as mpl
fig, ax = pl.subplots(nrows=1, ncols=1, figsize=(5, 4))

fc = people.contacts['m']['age_f'] # Get the age of female contacts in marital partnership
mc = people.contacts['m']['age_m'] # Get the age of male contacts in marital partnership
h = ax.hist2d(fc, mc, bins=np.linspace(0, 75, 16), density=True, norm=mpl.colors.LogNorm())
ax.set_xlabel('Age of female partner')
ax.set_ylabel('Age of male partner')
fig.colorbar(h[3], ax=ax)
ax.set_title('Marital age mixing')
pl.show();
Loading location-specific demographic data for "tanzania"
No genotypes provided, will assume only simulating HPV16 by default
Initializing sim with 50000 agents
Loading location-specific data for "tanzania"
  Running 1970.0 ( 0/51) (0.07 s)  ———————————————————— 2%
/home/docs/checkouts/readthedocs.org/user_builds/institute-for-disease-modeling-hpvsim/envs/latest/lib/python3.9/site-packages/hpvsim/data/loaders.py:209: FutureWarning: The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
  dd = full_df.groupby("Time").sum()["PopTotal"]
  Running 1980.0 (10/51) (0.88 s)  ••••———————————————— 22%
  Running 1990.0 (20/51) (1.94 s)  ••••••••———————————— 41%
  Running 2000.0 (30/51) (3.20 s)  ••••••••••••———————— 61%
  Running 2010.0 (40/51) (4.69 s)  ••••••••••••••••———— 80%
  Running 2020.0 (50/51) (6.45 s)  •••••••••••••••••••• 100%

Simulation summary:
       44,714 total infections
       13,387 total cin1s
       12,049 total cin2s
        5,890 total cin3s
       31,326 total cins
        1,339 total cancers
            0 total cancer detections
          803 total cancer deaths
            0 total detected cancer deaths
       11,245 total reinfections
       10,710 total reactivations
            0 total hiv infections
         0.07 total hpv incidence (/100)
           44 total cin1 incidence (/100,000)
           39 total cin2 incidence (/100,000)
           19 total cin3 incidence (/100,000)
          102 total cin incidence (/100,000)
            4 total cancer incidence (/100,000)
         0.22 total hpv prevalence (/100)

../_images/tutorials_tut_analyzers_8_3.svg

Age pyramids

Age pyramids, like snapshots, take a picture of the people at a given point in time, and then bin them into age groups by sex. These can also be plotted alongside data:

[4]:
# Create some parameters
pars = dict(n_agents=50e3, start=2000, n_years=30, dt=0.5)

# Make the age pyramid analyzer
age_pyr = hpv.age_pyramid(
    timepoints=['2010', '2020'],
    datafile='south_africa_age_pyramid.csv',
    edges=np.linspace(0, 100, 21))

# Make the sim, run, get the analyzer, and plot
sim = hpv.Sim(pars, location='south africa', analyzers=age_pyr)
sim.run()
a = sim.get_analyzer()
fig = a.plot(percentages=True);
Loading location-specific demographic data for "south africa"
No genotypes provided, will assume only simulating HPV16 by default
Initializing sim with 50000 agents
Loading location-specific data for "south africa"
Dates provided in the age pyramid datafile ({'2020.0', '2010.0', '2000.0', '1990.0'}) are not the same as the age pyramid dates that were requested (['2010.0' '2020.0']).
Plots will only show requested dates, not all dates in the datafile.
  Running 2000.0 ( 0/62) (0.08 s)  ———————————————————— 2%
/home/docs/checkouts/readthedocs.org/user_builds/institute-for-disease-modeling-hpvsim/envs/latest/lib/python3.9/site-packages/hpvsim/data/loaders.py:209: FutureWarning: The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
  dd = full_df.groupby("Time").sum()["PopTotal"]
  Running 2005.0 (10/62) (0.70 s)  •••————————————————— 18%
  Running 2010.0 (20/62) (1.39 s)  ••••••—————————————— 34%
  Running 2015.0 (30/62) (2.13 s)  ••••••••••—————————— 50%
  Running 2020.0 (40/62) (2.91 s)  •••••••••••••——————— 66%
  Running 2025.0 (50/62) (3.73 s)  ••••••••••••••••———— 82%
  Running 2030.0 (60/62) (4.59 s)  •••••••••••••••••••— 98%
Simulation summary:
    1,283,552 total infections
      323,451 total cin1s
      295,487 total cin2s
      198,545 total cin3s
      817,484 total cins
       21,439 total cancers
            0 total cancer detections
       10,254 total cancer deaths
            0 total detected cancer deaths
      366,330 total reinfections
      195,749 total reactivations
            0 total hiv infections
         2.40 total hpv incidence (/100)
          992 total cin1 incidence (/100,000)
          906 total cin2 incidence (/100,000)
          609 total cin3 incidence (/100,000)
        2,507 total cin incidence (/100,000)
           66 total cancer incidence (/100,000)
         5.19 total hpv prevalence (/100)

../_images/tutorials_tut_analyzers_10_3.svg