Features#
The phylomodels.features module contains the definition of the features class, as well as internal methods and functions used by this class. The features class is an interface for the definition, computation, and general analysis of features. We may also refer to features as summary statistics.
The ultimate goal of the features module is to provide a standard interface that enables easy and streamlined definition, addition, deletion, and overall management of individual features. When using this module, adding a new feature to a given Python script should take at most one additional line of code. Coding and integrating a new feature can also be done in a few minutes.
For example, the basic usage of this module within a Python program should look as follows:
x = readTimeSeries() # One or more time series to be analyzed
f = features(x) # Instantiation of features object
f.compute() # Compute features
y, s = f.get() # Extract features and statistics as a Pandas DataFrame
Computation of features can be enabled/disabled either by groups or individually:
f.disableFeature(group="all") # Disable computation of all available
# features
f.enableFeature(name="series_sum") # Enable computation of the feature called
# series_sum
f.compute() # Only series_sum is computed
File structure#
- features.py
Definition of features class
- inventory.py
Methods and auxiliary functions for management of features.
- testFeatures.py
Test script.
- ./graphs
Methods and auxiliary functions for features that are obtained from graphs.
- ./series
Methods and auxiliary functions for features that are obtained from time series.
- ./statistics
Methods and auxiliary functions for computing statistics on computed features.
Inputs#
The instantiation of a features object can receive any of the following inputs, which are attributes of the features class:
- x
Set of time series to analyze. This is a pandas DataFrame with m rows by n columns, where m is the number of sequences (i.e., each row is a different time series) and n is the length of each sequence.
- xref
Reference time series. This is a pandas DataFrame with 1 row and n columns, where n is the length of each sequence, as defined for x above.
- g
Set of graphs (e.g., trees) to analyze. This is a list of NetworkX data structures (see https://networkx.github.io/).
- gref
Reference graph. This is a NetworkX data structure.
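As an illustration of the shapes expected for the time series inputs, the following minimal sketch builds a small x and a matching xref with pandas. The sizes and values are made up for the example, and the graph inputs g and gref are left out.

```python
import numpy as np
import pandas as pd

num_series = 3      # number of time series (rows of x); illustrative value
series_length = 5   # length of each time series (columns of x); illustrative value

# x: one time series per row, one time point per column
x = pd.DataFrame(
    np.arange(num_series * series_length, dtype=float).reshape(num_series, series_length)
)

# xref: a single reference series with the same length as each row of x
xref = pd.DataFrame(np.ones((1, series_length)))

print(x.shape)     # (3, 5)
print(xref.shape)  # (1, 5)
```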
Core methods#
The main methods for the computation and analysis of features are:
- compute
Compute all the features and statistics currently enabled (see “Management of features” below for details on enabling/disabling features).
- get
Return features and statistics rendered by the compute method.
Management of features#
The features class can extract and/or generate any number of features. Features are only computed when the compute method is called. Important notes regarding the management of features:
- Available features
Potential features to be extracted or generated should be in an inventory of features. This inventory is an attribute called availableFeatures. availableFeatures is a list of feature cards.
- Feature card
A feature card contains a brief description of a feature, as well as information regarding the actual function that computes the feature. It is defined based on the following namedtuple:
featureCard = collections.namedtuple( "featureCard", ["id", "name", "description", "subroutine"] )
where “id” is an integer, “name” is a string (see naming conventions below), “description” is a brief text description of the feature, and “subroutine” is the name of the function that extracts and/or computes the feature.
- Enabled features
A feature is to be computed only if it is enabled. In general, the compute method loops over availableFeatures and checks if they are enabled. If so, then the corresponding subroutines are called. The status of a feature (enabled/disabled) is maintained in the Boolean attribute enabledFeatures.
- Methods
- enableFeature
Enable the computation/extraction of a feature or group of features.
- disableFeature
Disable the computation/extraction of a feature or group of features.
- listFeatures
Display list of available features.
- listActiveFeatures
Display list of active (i.e., enabled) features.
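The inventory mechanism described above can be sketched in a few lines. This is not the module's actual implementation. It assumes that enabledFeatures holds one Boolean flag per feature card, that the subroutine field stores a callable rather than a function name, and that plain Python lists stand in for the arrays and DataFrames. The series_max feature and the standalone compute function are hypothetical.

```python
import collections

# Same fields as the featureCard namedtuple described above
featureCard = collections.namedtuple(
    "featureCard", ["id", "name", "description", "subroutine"]
)

# Hypothetical feature subroutines; each receives x and xref, in that order
def series_sum(x, xref):
    return [sum(row) for row in x]

def series_max(x, xref):
    return [max(row) for row in x]

# Inventory of feature cards and a parallel list of enabled/disabled flags
availableFeatures = [
    featureCard(0, "series_sum", "Sum of each series", series_sum),
    featureCard(1, "series_max", "Maximum of each series", series_max),
]
enabledFeatures = [True, False]  # assumption: one Boolean per card

def compute(x, xref=None):
    """Call the subroutine of every enabled feature card."""
    results = {}
    for card, enabled in zip(availableFeatures, enabledFeatures):
        if enabled:
            results[card.name] = card.subroutine(x, xref)
    return results

out = compute([[1, 2, 3], [4, 5, 6]])
print(out)  # {'series_sum': [6, 15]}
```

Keeping the flags separate from the cards means that enabling or disabling a feature only flips a Boolean and never modifies the inventory itself.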
Designing and integrating new features#
Each feature must be computed by a single function. The goal is for each of these functions to be very short, which makes them easy to code, easy to review, and easy to maintain.
Each feature function receives two arguments, namely: input x transformed into a numpy array of size m x n (with m and n defined as indicated above), followed by input xref transformed into a numpy array. Not all the arguments have to be used by a feature function, but the order in which arguments are passed to the function must be maintained.
Each feature function returns a Pandas DataFrame. Each row in this DataFrame is the output for the corresponding row/item in the input (e.g., row k of the DataFrame is the output corresponding to row k in x and g).
Each feature function must be in a separate file (typically of just a few lines). The name of the file must be the same as the name of the feature function. It must also follow the naming conventions described below.
A new feature function is integrated into the module by moving its corresponding file to the subfolder ./series or ./graphs, depending on which type of feature it is. In addition, the __init__.py file in the corresponding subfolder must be edited to include a line importing the function. This line looks like this:
from [series/graphs] import [function name]
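To make the contract for feature functions concrete, here is a minimal sketch of what a file such as series_sum.py might contain. It is an illustration written for this document, not the code shipped with the module; in particular, the output column name is an assumption.

```python
import numpy as np
import pandas as pd

def series_sum(x, xref):
    """Sum of each time series, one output row per row of x.

    x    : numpy array of size m x n (one series per row)
    xref : numpy array with the reference series (unused by this feature)
    """
    # Column name chosen for illustration; the module may label outputs differently
    return pd.DataFrame({"series_sum": np.sum(x, axis=1)})

# Example call with two series of length three
x = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
xref = np.zeros((1, 3))
y = series_sum(x, xref)
print(y)
```

The function accepts xref even though it does not use it, so that the argument order stays the same across all feature functions.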
Naming conventions: Features should be named as follows:
[type of input]_[transformation of the input]_[other labels]
For example:
series_sum
series_derivative
graph_nodes
graph_degree
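A naming check along these lines could be automated. The sketch below assumes that the only input types are series and graph (the two that appear in the examples) and that labels are lowercase alphanumeric. Both the pattern and the follows_convention helper are hypothetical and not part of the module.

```python
import re

# Assumed pattern: a known input type followed by one or more lowercase labels
NAME_PATTERN = re.compile(r"^(series|graph)(_[a-z0-9]+)+$")

def follows_convention(name):
    """Return True if name looks like [type of input]_[transformation]_[labels]."""
    return NAME_PATTERN.match(name) is not None

for name in ["series_sum", "graph_degree", "sum_series", "Series_Sum"]:
    print(name, follows_convention(name))
```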
Statistics#
Statistics describe the behavior of the computed features. For example, the mean statistic finds the average of every column in the DataFrame rendered by the method get.
Management and design of statistics follow the same principles as those described for management and design of features, with a few changes:
- Inventory management is done using the arrays availableStats and enabledStats.
- The inventory management methods are: enableStatistic, disableStatistic, listStatistics, registerStatistic.
- Files that contain the code for computing a statistic must be saved into the ./statistics subfolder.
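As a rough illustration of what a statistic does, the sketch below averages every column of a small features DataFrame. The input values are made up, and the function signature (a DataFrame in, a one-row DataFrame out) is an assumption rather than the interface required by the ./statistics subfolder.

```python
import pandas as pd

# Hypothetical features DataFrame: one row per input series, one column per feature
features_df = pd.DataFrame(
    {"series_sum": [6.0, 15.0], "series_max": [3.0, 6.0]}
)

def mean(df):
    """Average of every column, returned as a one-row DataFrame."""
    return df.mean(axis=0).to_frame().T

stats = mean(features_df)
print(stats)
```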
Utilities#
The following utility methods are available:
- getArguments
Returns x, xref, g, and gref.
- listFeatures
Display available features.
- listActiveFeatures
Display active features.
- printArguments
Display the contents of x, xref, g, and gref.
- printFeatures
Display the features dataframe.
- printStats
Display statistics.