Statistics¶
This chapter describes the statistical functions in the library. The basic statistical functions include routines to compute the mean, variance and standard deviation. More advanced functions allow you to calculate absolute deviations, skewness, and kurtosis as well as the median and arbitrary percentiles. The algorithms use recurrence relations to compute average quantities in a stable way, without large intermediate values that might overflow.
Data Types¶
The functions are available in versions for datasets in the standard
floating-point and integer types. The versions for double precision
floating-point data have the prefix cml_stats
and are declared in
the header file cml/statistics/double.h
. The versions for integer
data have the prefix cml_stats_int
and are declared in the header
file cml/statistics/int.h
. All the functions operate on C
arrays with a stride
parameter specifying the spacing between
elements. The full list of available types is given below,
Prefix | Type |
---|---|
cml_stats | double |
cml_stats_float | float |
cml_stats_long_double | long double |
cml_stats_int | int |
cml_stats_uint | unsigned int |
cml_stats_long | long |
cml_stats_ulong | unsigned long |
cml_stats_short | short |
cml_stats_ushort | unsigned short |
cml_stats_char | char |
cml_stats_uchar | unsigned char |
cml_stats_complex | complex double |
cml_stats_complex_float | complex float |
cml_stats_complex_long_double | complex long double |
Mean, Standard Deviation and Variance¶
-
double
cml_stats_mean
(const double data[], size_t stride, size_t n)¶ This function returns the arithmetic mean of
data
, a dataset of lengthn
with stridestride
. The arithmetic mean, or sample mean, is denoted by and defined as,where are the elements of the dataset
data
. For samples drawn from a gaussian distribution the variance of is .
-
double
cml_stats_variance
(const double data[], size_t stride, size_t n)¶ This function returns the estimated, or sample, variance of
data
, a dataset of lengthn
with stridestride
. The estimated variance is denoted by and is defined by,where are the elements of the dataset
data
. Note that the normalization factor of results from the derivation of as an unbiased estimator of the population variance . For samples drawn from a Gaussian distribution the variance of itself is .This function computes the mean via a call to
cml_stats_mean()
. If you have already computed the mean then you can pass it directly tocml_stats_variance_m()
.
-
double
cml_stats_variance_m
(const double data[], size_t stride, size_t n, double mean)¶ This function returns the sample variance of
data
relative to the given value ofmean
. The function is computed with replaced by the value ofmean
that you supply,
-
double
cml_stats_sd
(const double data[], size_t stride, size_t n)¶ -
double
cml_stats_sd_m
(const double data[], size_t stride, size_t n, double mean)¶ The standard deviation is defined as the square root of the variance. These functions return the square root of the corresponding variance functions above.
-
double
cml_stats_tss
(const double data[], size_t stride, size_t n)¶ -
double
cml_stats_tss_m
(const double data[], size_t stride, size_t n, double mean)¶ These functions return the total sum of squares (TSS) of
data
about the mean. Forcml_stats_tss_m()
the user-supplied value ofmean
is used, and forcml_stats_tss()
it is computed usingcml_stats_mean()
.
-
double
cml_stats_variance_with_fixed_mean
(const double data[], size_t stride, size_t n, double mean)¶ This function computes an unbiased estimate of the variance of
data
when the population meanmean
of the underlying distribution is known a priori. In this case the estimator for the variance uses the factor and the sample mean is replaced by the known population mean ,
-
double
cml_stats_sd_with_fixed_mean
(const double data[], size_t stride, size_t n, double mean)¶ This function calculates the standard deviation of
data
for a fixed population meanmean
. The result is the square root of the corresponding variance function.
Absolute deviation¶
-
double
cml_stats_absdev
(const double data[], size_t stride, size_t n)¶ This function computes the absolute deviation from the mean of
data
, a dataset of lengthn
with stridestride
. The absolute deviation from the mean is defined as,where are the elements of the dataset
data
. The absolute deviation from the mean provides a more robust measure of the width of a distribution than the variance. This function computes the mean ofdata
via a call tocml_stats_mean()
.
-
double
cml_stats_absdev_m
(const double data[], size_t stride, size_t n, double mean)¶ This function computes the absolute deviation of the dataset
data
relative to the given value ofmean
,This function is useful if you have already computed the mean of
data
(and want to avoid recomputing it), or wish to calculate the absolute deviation relative to another value (such as zero, or the median).
Higher moments (skewness and kurtosis)¶
-
double
cml_stats_skew
(const double data[], size_t stride, size_t n)¶ This function computes the skewness of
data
, a dataset of lengthn
with stridestride
. The skewness is defined as,where are the elements of the dataset
data
. The skewness measures the asymmetry of the tails of a distribution.The function computes the mean and estimated standard deviation of
data
via calls tocml_stats_mean()
andcml_stats_sd()
.
-
double
cml_stats_skew_m_sd
(const double data[], size_t stride, size_t n, double mean, double sd)¶ This function computes the skewness of the dataset
data
using the given values of the meanmean
and standard deviationsd
,These functions are useful if you have already computed the mean and standard deviation of
data
and want to avoid recomputing them.
-
double
cml_stats_kurtosis
(const double data[], size_t stride, size_t n)¶ This function computes the kurtosis of
data
, a dataset of lengthn
with stridestride
. The kurtosis is defined as,The kurtosis measures how sharply peaked a distribution is, relative to its width. The kurtosis is normalized to zero for a Gaussian distribution.
-
double
cml_stats_kurtosis_m_sd
(const double data[], size_t stride, size_t n, double mean, double sd)¶ This function computes the kurtosis of the dataset
data
using the given values of the meanmean
and standard deviationsd
,This function is useful if you have already computed the mean and standard deviation of
data
and want to avoid recomputing them.
Autocorrelation¶
-
double
cml_stats_lag1_autocorrelation
(const double data[], const size_t stride, const size_t n)¶ This function computes the lag-1 autocorrelation of the dataset
data
.
-
double
cml_stats_lag1_autocorrelation_m
(const double data[], const size_t stride, const size_t n, const double mean)¶ This function computes the lag-1 autocorrelation of the dataset
data
using the given value of the meanmean
.
Covariance¶
-
double
cml_stats_covariance
(const double data1[], const size_t stride1, const double data2[], const size_t stride2, const size_t n)¶ This function computes the covariance of the datasets
data1
anddata2
which must both be of the same lengthn
.
-
double
cml_stats_covariance_m
(const double data1[], const size_t stride1, const double data2[], const size_t stride2, const size_t n, const double mean1, const double mean2)¶ This function computes the covariance of the datasets
data1
anddata2
using the given values of the means,mean1
andmean2
. This is useful if you have already computed the means ofdata1
anddata2
and want to avoid recomputing them.
Correlation¶
-
double
cml_stats_correlation
(const double data1[], const size_t stride1, const double data2[], const size_t stride2, const size_t n)¶ This function efficiently computes the Pearson correlation coefficient between the datasets
data1
anddata2
which must both be of the same lengthn
.
-
double
cml_stats_spearman
(const double data1[], const size_t stride1, const double data2[], const size_t stride2, const size_t n, double work[])¶ This function computes the Spearman rank correlation coefficient between the datasets
data1
anddata2
which must both be of the same lengthn
. Additional workspace of size 2 *n
is required inwork
. The Spearman rank correlation between vectors and is equivalent to the Pearson correlation between the ranked vectors and , where ranks are defined to be the average of the positions of an element in the ascending order of the values.
Maximum and Minimum values¶
The following functions find the maximum and minimum values of a
dataset (or their indices). If the data contains NaN
-s then a
NaN
will be returned, since the maximum or minimum value is
undefined. For functions which return an index, the location of the
first NaN
in the array is returned.
-
double
cml_stats_max
(const double data[], size_t stride, size_t n)¶ This function returns the maximum value in
data
, a dataset of lengthn
with stridestride
. The maximum value is defined as the value of the element which satisfies for all .If you want instead to find the element with the largest absolute magnitude you will need to apply
fabs()
orabs()
to your data before calling this function.
-
double
cml_stats_min
(const double data[], size_t stride, size_t n)¶ This function returns the minimum value in
data
, a dataset of lengthn
with stridestride
. The minimum value is defined as the value of the element which satisfies for all .If you want instead to find the element with the smallest absolute magnitude you will need to apply
fabs()
orabs()
to your data before calling this function.
-
void
cml_stats_minmax
(double * min, double * max, const double data[], size_t stride, size_t n)¶ This function finds both the minimum and maximum values
min
,max
indata
in a single pass.
-
size_t
cml_stats_max_index
(const double data[], size_t stride, size_t n)¶ This function returns the index of the maximum value in
data
, a dataset of lengthn
with stridestride
. The maximum value is defined as the value of the element which satisfies for all . When there are several equal maximum elements then the first one is chosen.
-
size_t
cml_stats_min_index
(const double data[], size_t stride, size_t n)¶ This function returns the index of the minimum value in
data
, a dataset of lengthn
with stridestride
. The minimum value is defined as the value of the element which satisfies for all . When there are several equal minimum elements then the first one is chosen.
-
void
cml_stats_minmax_index
(size_t * min_index, size_t * max_index, const double data[], size_t stride, size_t n)¶ This function returns the indexes
min_index
,max_index
of the minimum and maximum values indata
in a single pass.
Median and Percentiles¶
The median and percentile functions described in this section operate on sorted data. For convenience we use quantiles, measured on a scale of 0 to 1, instead of percentiles (which use a scale of 0 to 100).
-
double
cml_stats_median_from_sorted_data
(const double sorted_data[], size_t stride, size_t n)¶ This function returns the median value of
sorted_data
, a dataset of lengthn
with stridestride
. The elements of the array must be in ascending numerical order. There are no checks to see whether the data are sorted, so the functioncml_sort()
should always be used first.When the dataset has an odd number of elements the median is the value of element . When the dataset has an even number of elements the median is the mean of the two nearest middle values, elements and . Since the algorithm for computing the median involves interpolation this function always returns a floating-point number, even for integer data types.
-
double
cml_stats_quantile_from_sorted_data
(const double sorted_data[], size_t stride, size_t n, double f)¶ This function returns a quantile value of
sorted_data
, a double-precision array of lengthn
with stridestride
. The elements of the array must be in ascending numerical order. The quantile is determined by thef
, a fraction between 0 and 1. For example, to compute the value of the 75th percentilef
should have the value 0.75.There are no checks to see whether the data are sorted, so the function
cml_sort()
should always be used first.The quantile is found by interpolation, using the formula
where is
floor((n - 1)f)
and is .Thus the minimum value of the array (
data[0*stride]
) is given byf
equal to zero, the maximum value (data[(n-1)*stride]
) is given byf
equal to one and the median value is given byf
equal to 0.5. Since the algorithm for computing quantiles involves interpolation this function always returns a floating-point number, even for integer data types.
References and Further Reading¶
The standard reference for almost any topic in statistics is the multi-volume Advanced Theory of Statistics by Kendall and Stuart.
- Maurice Kendall, Alan Stuart, and J. Keith Ord. The Advanced Theory of Statistics (multiple volumes) reprinted as Kendall’s Advanced Theory of Statistics. Wiley, ISBN 047023380X.
Many statistical concepts can be more easily understood by a Bayesian approach. The following book by Gelman, Carlin, Stern and Rubin gives a comprehensive coverage of the subject.
- Andrew Gelman, John B. Carlin, Hal S. Stern, Donald B. Rubin. Bayesian Data Analysis. Chapman & Hall, ISBN 0412039915.