Features

The phylomodels.features module contains the definition of the features class, as well as internal methods and functions used by this class. The features class is an interface for the definition, computation, and general analyses of features. We may also refer to features as summary statistics.

The ultimate goal of the features module is to provide a standard interface that enables easy and streamlined definition, addition, deletion, and overall management of individual features. When using this module, adding a new feature to a given Python script should take at most one additional line of code. Coding and integrating a new feature should also take only a few minutes.

For example, the basic usage of this module within a Python program should look as follows:

x = readTimeSeries()  # One or more time series to be analyzed
f = features(x)       # Instantiation of features object
f.compute()           # Compute features
y, s = f.get()        # Extract features and statistics as a Pandas DataFrame

Computation of features can be enabled/disabled either by groups or individually:

f.disableFeature(group="all")      # Disable computation of all available
                                   # features
f.enableFeature(name="series_sum") # Enable computation of the feature called
                                   # series_sum
f.compute()                        # Only series_sum is computed

File structure

features.py: Definition of features class
inventory.py: Methods and auxiliary functions for management of features.
testFeatures.py: Test script.
./graphs: Methods and auxiliary functions for features that are obtained from graphs.
./series: Methods and auxiliary functions for features that are obtained from time series.
./statistics: Methods and auxiliary functions for computing statistics on computed features.

Inputs

The instantiation of a features object can receive any of the following inputs, which are attributes of the features class:

x: Set of time series to analyze. This is a pandas DataFrame with m rows and n columns, where m is the number of sequences (i.e., each row is a different time series) and n is the length of each sequence.
xref: Reference time series. This is a pandas DataFrame with 1 row and n columns, where n is defined as in x above.
g: Set of graphs (e.g., trees) to analyze. This is a list of NetworkX data structures (see https://networkx.github.io/).
gref: Reference graph. This is a NetworkX data structure.
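
As a rough illustration of the expected shapes, the sketch below builds x and xref with pandas. The numeric values are made up; only the layout (one time series per row in x, and a single row in xref whose length matches the series in x) comes from the description above. The graph inputs g and gref are omitted here.

```python
import numpy as np
import pandas as pd

m, n = 3, 5  # illustrative sizes: 3 time series, each of length 5

# x: one time series per row
x = pd.DataFrame(np.arange(m * n, dtype=float).reshape(m, n))

# xref: a single reference time series with the same length as the rows of x
xref = pd.DataFrame(np.ones((1, n)))

print(x.shape)     # (3, 5)
print(xref.shape)  # (1, 5)
```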

Core methods

The main methods for the computation and analysis of features are:

compute: Compute all the features and statistics currently enabled (see “Management of features” below for details on enabling/disabling features).
get: Return features and statistics rendered by the compute method.

Management of features

The features class can extract and/or generate any number of features. Features are only computed when the compute method is called. Important notes regarding the management of features:

Available features

Potential features to be extracted or generated should be in an inventory of features. This inventory is an attribute called availableFeatures. availableFeatures is a list of feature cards.

Feature card

A feature card contains a brief description of a feature, as well as information regarding the actual function that computes the feature. It is defined based on the following namedtuple:

import collections

featureCard = collections.namedtuple(
    "featureCard",
    ["id", "name", "description", "subroutine"]
)

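where “id” is an integer, “name” is a string (see naming conventions below), “description” is a brief text description of the feature, and “subroutine” is the name of the function that extracts and/or computes the feature. A “parameters” entry (a list of arguments for said function) is not part of the namedtuple as defined above.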
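
To make the structure concrete, here is a minimal sketch of a small inventory of feature cards and a lookup by name. The card contents and the helper `find_card` are hypothetical and only illustrate how the namedtuple fields can be used; the module builds its own availableFeatures list.

```python
import collections

featureCard = collections.namedtuple(
    "featureCard", ["id", "name", "description", "subroutine"]
)

# Illustrative inventory; descriptions are made up for this example
availableFeatures = [
    featureCard(0, "series_sum", "Sum of each time series", "series_sum"),
    featureCard(1, "graph_nodes", "Number of nodes in each graph", "graph_nodes"),
]

def find_card(name):
    """Return the card whose name matches, or None if it is not in the inventory."""
    for card in availableFeatures:
        if card.name == name:
            return card
    return None

card = find_card("series_sum")
print(card.id, card.name, card.subroutine)
```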

Enabled features

A feature is to be computed only if it is enabled. In general, the compute method loops over availableFeatures and checks if they are enabled. If so, then the corresponding subroutines are called. The status of a feature (enabled/disabled) is maintained in the Boolean attribute enabledFeatures.
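
The loop described above can be sketched as follows. This is not the module's actual compute method: the feature functions work on plain Python lists to keep the example dependency-free, and the `subroutines` dictionary used to resolve a subroutine name to a callable is an assumption made for illustration.

```python
import collections

featureCard = collections.namedtuple(
    "featureCard", ["id", "name", "description", "subroutine"]
)

# Hypothetical feature functions operating on lists of numbers
def series_sum(x, xref):
    return [sum(row) for row in x]

def series_max(x, xref):
    return [max(row) for row in x]

# Assumed mechanism for resolving subroutine names to callables
subroutines = {"series_sum": series_sum, "series_max": series_max}

availableFeatures = [
    featureCard(0, "series_sum", "Sum of each series", "series_sum"),
    featureCard(1, "series_max", "Maximum of each series", "series_max"),
]
enabledFeatures = [True, False]  # one Boolean flag per feature card

def compute(x, xref=None):
    """Call the subroutine of every enabled feature and collect the results."""
    results = {}
    for card, enabled in zip(availableFeatures, enabledFeatures):
        if enabled:
            results[card.name] = subroutines[card.subroutine](x, xref)
    return results

result = compute([[1, 2, 3], [4, 5, 6]])
print(result)  # {'series_sum': [6, 15]}
```

Because series_max is disabled in this sketch, only series_sum appears in the result.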

Methods

enableFeature: Enable the computation/extraction of a feature or group of features.
disableFeature: Disable the computation/extraction of a feature or group of features.
listFeatures: Display list of available features.
listActiveFeatures: Display list of active (i.e., enabled) features.

Designing and integrating new features

Each feature must be computed by a single function. The goal is for each of these functions to be very short, which makes them easy to code, easy to review, and easy to maintain.
Each feature function receives two arguments, namely: input x transformed into a numpy array of size m x n (with m and n defined as indicated above), followed by input xref transformed into a numpy array. Not all the arguments have to be used by a feature function, but the order in which arguments are passed to the function must be maintained.
Each feature function returns a pandas DataFrame. Each row in this DataFrame is the output for the corresponding row/item in the input (e.g., row k of the DataFrame is the output corresponding to row k in x or item k in g).
Each feature function must be in a separate file (typically of just a few lines). The name of the file must be the same as the name of the feature function. It must also follow the naming conventions described below.
A new feature function is integrated into the module by moving its corresponding file to the subfolder ./series or ./graphs, depending on which type of feature it is. In addition, the __init__.py file in the corresponding subfolder must be edited to include a line importing the function. This line looks like this:
```
from [series/graphs] import [function name]
```
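
As an illustration of the rules above, a feature function could look like the following sketch. It is a hypothetical stand-in written for this example, not the module's own series_sum; the output column name is likewise an assumption.

```python
import numpy as np
import pandas as pd

def series_sum(x, xref):
    """Sum of each time series. xref is accepted to keep the argument order, but unused."""
    return pd.DataFrame({"series_sum": np.sum(x, axis=1)})

# Two time series of length 3, already converted to numpy arrays
x = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
xref = np.array([[0.0, 0.0, 0.0]])

out = series_sum(x, xref)
print(out)
```

Row k of the returned DataFrame holds the result for row k of x, as required above.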

Naming conventions: Features should be named as follows:

[type of input]_[transformation of the input]_[other labels]

For example:

series_sum
series_derivative
graph_nodes
graph_degree
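
A name that follows this convention can be split back into its parts, for example to group features by input type. The helper below is a hypothetical sketch and is not part of the module.

```python
def parse_feature_name(name):
    """Split a feature name into input type, transformation, and other labels."""
    parts = name.split("_")
    if len(parts) < 2:
        raise ValueError("expected at least [type of input]_[transformation]")
    return {
        "input": parts[0],
        "transformation": parts[1],
        "labels": parts[2:],
    }

parsed = parse_feature_name("series_derivative")
print(parsed["input"], parsed["transformation"])
```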

Statistics

Statistics describe the behavior of the computed features. For example, the mean statistic finds the average of every column in the DataFrame rendered by the method get.
Management and design of statistics follow the same principles as those described for management and design of features, with a few changes:
- Inventory management is done using the arrays availableStats and enabledStats.
- The inventory management methods are: enableStatistic, disableStatistic, listStatistics, registerStatistic.
- Files that contain the code for computing a statistic must be saved into the ./statistics subfolder.
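
As an example of what a statistic computes, the sketch below averages every column of a small features DataFrame. The function signature and the layout of the returned DataFrame are assumptions made for illustration; the module's own mean statistic may differ.

```python
import pandas as pd

def mean(features):
    """Hypothetical statistic: average of every column of the features DataFrame."""
    # Column means as a one-row DataFrame labeled "mean"
    return features.mean(axis=0).to_frame(name="mean").T

# Made-up feature values for two time series
features = pd.DataFrame({"series_sum": [6.0, 15.0], "series_max": [3.0, 6.0]})

stats = mean(features)
print(stats)
```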

Utilities

The following utility methods are available:

getArguments: Returns x, xref, g, and gref.
listFeatures: Display available features.
listActiveFeatures: Display active features.
printArguments: Display the contents of x, xref, g, and gref.
printFeatures: Display the features dataframe.
printStats: Display statistics.