Welcome to cppyml’s documentation!

cppyml: Python bindings for efficient C++ implementations of selected ML algorithms.

cppyml provides Python programmers with a curated selection of popular ML algorithms implemented in C++. The goal is to provide well-tested, highly optimised implementations.

© 2020 Roman Werpachowski. Available under the GPL v3 license.

All modules listed below should be imported as cppyml.<module name>, e.g. cppyml.linear_regression.

Clustering

Clustering algorithms.

class cppyml.clustering.CentroidsInitialiser

Bases: pybind11_builtins.pybind11_object

Abstract centroids initialiser.

class cppyml.clustering.ClosestCentroid

Bases: cppyml.clustering.ResponsibilitiesInitialiser

Assigns points to closest centroid.

class cppyml.clustering.EM

Bases: pybind11_builtins.pybind11_object

Gaussian Expectation-Maximisation algorithm.

assign_responsibilities(self: cppyml.clustering.EM, x: numpy.ndarray[float64[m, 1]]) → numpy.ndarray[float64[m, 1]]

Given a data point x, calculate each component’s responsibilities for x and return them.

Parameters:
  • x – Data point with correct dimension.
Returns:

Array of components’ responsibilities.

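
As a rough illustration of what a responsibility is, the sketch below computes the posterior component weights for a single one-dimensional point under a two-component Gaussian mixture. It is not the library's implementation; the function names and the mixture parameters are hypothetical.

```python
import math

def gaussian_pdf(x, mean, variance):
    """Density of a 1D Gaussian at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / variance) / math.sqrt(2.0 * math.pi * variance)

def responsibilities_1d(x, mixing_probabilities, means, variances):
    """Posterior probability of each component given x (E-step for one point)."""
    weighted = [p * gaussian_pdf(x, m, v)
                for p, m, v in zip(mixing_probabilities, means, variances)]
    total = sum(weighted)
    return [w / total for w in weighted]

# Hypothetical mixture: two equally weighted, unit-variance components.
resp = responsibilities_1d(0.0, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

The responsibilities always sum to 1; here the point is equidistant from both means, so each component gets the same weight.
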
covariance(self: cppyml.clustering.EM, k: int) → numpy.ndarray[float64[m, n]]

Returns k-th covariance matrix.

Parameters:
  • k – Gaussian component index.
Returns:

2D square matrix with covariance coefficients.

fit(self: cppyml.clustering.EM, data: numpy.ndarray[float64[m, n], flags.c_contiguous]) → bool

Fits the components to the data.

Parameters:
  • data – A 2D array with data points in rows.
Returns:

True if EM algorithm converged.

log_likelihood

Maximised log-likelihood.

means

Fitted means.

mixing_probabilities

Mixing probabilities of components.

number_components

Number of Gaussian components.

responsibilities

Fitted responsibilities.

set_absolute_tolerance(self: cppyml.clustering.EM, absolute_tolerance: float) → None

Sets absolute tolerance.

set_maximise_first(self: cppyml.clustering.EM, maximise_first: bool) → None

Turns on/off doing an initial maximisation step before the E-M iterations.

set_maximum_steps(self: cppyml.clustering.EM, maximum_steps: int) → None

Sets maximum number of iterations.

set_means_initialiser(self: cppyml.clustering.EM, means_initialiser: cppyml.clustering.CentroidsInitialiser) → None

Sets the algorithm to initialise component means.

set_relative_tolerance(self: cppyml.clustering.EM, relative_tolerance: float) → None

Sets relative tolerance.

set_responsibilities_initialiser(self: cppyml.clustering.EM, responsibilities_initialiser: cppyml.clustering.ResponsibilitiesInitialiser) → None

Sets the algorithm to initialise responsibilities for data points.

set_seed(self: cppyml.clustering.EM, seed: int) → None

Sets PRNG seed.

set_verbose(self: cppyml.clustering.EM, verbose: bool) → None

Turns on/off the verbose mode.

class cppyml.clustering.Forgy

Bases: cppyml.clustering.CentroidsInitialiser

Forgy initialisation algorithm.

class cppyml.clustering.KMeans

Bases: pybind11_builtins.pybind11_object

Naive K-Means algorithm.

assign_label(self: cppyml.clustering.KMeans, x: numpy.ndarray[float64[m, 1]]) → Tuple[int, float]

Given a data point x, assigns it to the closest cluster.

Parameters:
  • x – Data point with correct dimension.
Returns:

Cluster label for point x.

centroids

Fitted centroids.

fit(self: cppyml.clustering.KMeans, data: numpy.ndarray[float64[m, n], flags.c_contiguous]) → bool

Fits the components to the data.

Parameters:
  • data – A 2D array with data points in rows.
Returns:

True if the algorithm converged.

inertia

Minimised inertia.
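
For orientation, here is a minimal pure-Python sketch of closest-centroid assignment and of inertia, taken here to be the usual sum of squared distances from each point to its assigned centroid. That the second element of the returned tuple is a squared distance is an assumption made for this illustration, not something the signature above confirms; all names below are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def assign_label(x, centroids):
    """Return (index of the closest centroid, squared distance to it)."""
    distances = [squared_distance(x, c) for c in centroids]
    label = min(range(len(centroids)), key=distances.__getitem__)
    return label, distances[label]

def inertia(data, centroids):
    """Sum of squared distances from each point to its closest centroid."""
    return sum(assign_label(x, centroids)[1] for x in data)

# Hypothetical centroids and data points.
centroids = [(0.0, 0.0), (10.0, 10.0)]
data = [(1.0, 0.0), (0.0, 2.0), (9.0, 10.0)]
labels = [assign_label(x, centroids)[0] for x in data]
total = inertia(data, centroids)
```
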

labels

Fitted labels.

number_clusters

Number of clusters.

set_absolute_tolerance(self: cppyml.clustering.KMeans, absolute_tolerance: float) → None

Sets absolute tolerance.

set_centroids_initialiser(self: cppyml.clustering.KMeans, centroids_initialiser: cppyml.clustering.CentroidsInitialiser) → None

Sets the algorithm to initialise cluster centroids.

set_maximum_steps(self: cppyml.clustering.KMeans, maximum_steps: int) → None

Sets maximum number of iterations.

set_number_initialisations(self: cppyml.clustering.KMeans, centroids_initialiser: int) → None

Sets number of initialisations to try, to find the clusters with lowest inertia.

set_seed(self: cppyml.clustering.KMeans, seed: int) → None

Sets the PRNG seed.

set_verbose(self: cppyml.clustering.KMeans, verbose: bool) → None

Turns on/off the verbose mode.

class cppyml.clustering.KPP

Bases: cppyml.clustering.CentroidsInitialiser

KMeans++ initialisation algorithm.

class cppyml.clustering.RandomPartition

Bases: cppyml.clustering.CentroidsInitialiser

Random Partition initialisation algorithm.

class cppyml.clustering.ResponsibilitiesInitialiser

Bases: pybind11_builtins.pybind11_object

Abstract responsibilities initialiser.

Decision trees

Decision tree algorithms.

class cppyml.decision_trees.ClassificationTree

Bases: pybind11_builtins.pybind11_object

Classification tree

cost_complexity(self: cppyml.decision_trees.ClassificationTree, alpha: float) → float

Calculates cost-complexity for given alpha.
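
A small sketch of the textbook (CART) cost-complexity measure, total leaf error plus alpha times the number of leaf nodes, may help when reading the properties below. Whether the library normalises the error in the same way is an assumption; the numbers are made up.

```python
def cost_complexity(total_leaf_error, number_leaf_nodes, alpha):
    """Textbook cost-complexity: error term plus a penalty per leaf node."""
    return total_leaf_error + alpha * number_leaf_nodes

# Hypothetical candidates described as (total leaf error, number of leaf nodes).
full_tree = (2.0, 8)
pruned_tree = (3.0, 3)
alpha = 0.5

# A larger alpha favours smaller trees even if their leaf error is higher.
best = min([full_tree, pruned_tree],
           key=lambda tree: cost_complexity(tree[0], tree[1], alpha))
```
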

number_leaf_nodes

Number of leaf nodes.

number_lowest_split_nodes

Number of lowest split nodes.

number_nodes

Number of nodes.

original_error

Original error.

total_leaf_error

Total leaf error.

class cppyml.decision_trees.RegressionTree

Bases: pybind11_builtins.pybind11_object

Regression tree

cost_complexity(self: cppyml.decision_trees.RegressionTree, alpha: float) → float

Calculates cost-complexity for given alpha.

number_leaf_nodes

Number of leaf nodes.

number_lowest_split_nodes

Number of lowest split nodes.

number_nodes

Number of nodes.

original_error

Original error.

total_leaf_error

Total leaf error.

cppyml.decision_trees.classification_tree(X: numpy.ndarray[float64[m, n], flags.c_contiguous], y: numpy.ndarray[float64[m, 1]], max_split_levels: int = 100, min_split_size: int = 10, alphas: List[float] = [1e-06, 1e-05, 0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0], num_folds: int = 10) → Tuple[cppyml.decision_trees.ClassificationTree, float, float]

Grows a classification tree with pruning.

Parameters:
  • X – Independent variables (row-wise) with shape N x D.
  • y – Dependent variable (vector) with length N.
  • max_split_levels – Maximum number of split nodes on the way to any leaf node.
  • min_split_size – Minimum sample size which can be split (at least 2).
  • alphas – Candidate alphas (non-negative) for pruning to be selected by cross-validation. If this vector is empty, no pruning is done. If it has just one element, this value is used for pruning. If it has more than one, the one with smallest k-fold cross-validation test error is used. Defaults to [1E-6, 1E-5, …, 10, 100].
  • num_folds – Number of folds for cross-validation. Ignored if cross-validation is not done.
Returns:

Tuple of trained decision tree, chosen alpha (NaN if no pruning was done) and minimum cross-validation test error (NaN if no cross-validation was done).

cppyml.decision_trees.classification_tree_accuracy(tree: cppyml.decision_trees.ClassificationTree, X: numpy.ndarray[float64[m, n], flags.f_contiguous], y: numpy.ndarray[float64[m, 1]]) → float

Calculates classification tree accuracy on (X, y) data.

Parameters:
  • tree – Classification tree instance.
  • X – Independent variables (row-wise) with shape N x D.
  • y – Dependent variable (vector) with length N.
Returns:

Classification accuracy.
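
Classification accuracy is assumed here to mean the fraction of data points whose predicted label equals the true label. The helper below is a hypothetical stand-in that takes predictions directly instead of a tree.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions equal to the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Three of the four hypothetical predictions are correct.
score = accuracy([0.0, 1.0, 1.0, 2.0], [0.0, 1.0, 2.0, 2.0])
```
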

cppyml.decision_trees.regression_tree(X: numpy.ndarray[float64[m, n], flags.c_contiguous], y: numpy.ndarray[float64[m, 1]], max_split_levels: int = 100, min_split_size: int = 10, alphas: List[float] = [1e-06, 1e-05, 0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0], num_folds: int = 10) → Tuple[cppyml.decision_trees.RegressionTree, float, float]

Grows a regression tree with pruning.

Parameters:
  • X – Independent variables (row-wise) with shape N x D.
  • y – Dependent variable (vector) with length N.
  • max_split_levels – Maximum number of split nodes on the way to any leaf node.
  • min_split_size – Minimum sample size which can be split (at least 2).
  • alphas – Candidate alphas (non-negative) for pruning to be selected by cross-validation. If this vector is empty, no pruning is done. If it has just one element, this value is used for pruning. If it has more than one, the one with smallest k-fold cross-validation test error is used. Defaults to [1E-6, 1E-5, …, 10, 100].
  • num_folds – Number of folds for cross-validation. Ignored if cross-validation is not done.
Returns:

Tuple of trained decision tree, chosen alpha (NaN if no pruning was done) and minimum cross-validation test error (NaN if no cross-validation was done).

cppyml.decision_trees.regression_tree_mean_squared_error(tree: cppyml.decision_trees.RegressionTree, X: numpy.ndarray[float64[m, n], flags.f_contiguous], y: numpy.ndarray[float64[m, 1]]) → float

Calculates regression tree mean squared error on (X, y) data.

Parameters:
  • tree – Regression tree instance.
  • X – Independent variables (row-wise) with shape N x D.
  • y – Dependent variable (vector) with length N.
Returns:

Mean squared error.

Linear regression

Linear regression algorithms.

class cppyml.linear_regression.LassoRegressionResult

Bases: cppyml.linear_regression.Result

Result of a (multivariate) Lasso regression with intercept.

Intercept is the last coefficient in beta.

var_y is calculated using dof as the denominator.

beta

Fitted coefficients of the model y_i = beta’^T X_i, followed by beta0.

effective_dof

Effective number of residual degrees of freedom: N - tr [ X^T (X * X^T + lambda * I)^{-1} X ] - 1.

class cppyml.linear_regression.MultivariateOLSResult

Bases: cppyml.linear_regression.Result

Result of multivariate Ordinary Least Squares regression.

The cov property assumes independent Gaussian error terms.

beta

Fitted coefficients of the model y_i = beta^T X_i.

cov

Covariance matrix of beta coefficients.

class cppyml.linear_regression.RecursiveMultivariateOLS

Bases: pybind11_builtins.pybind11_object

Given a stream of pairs (X_i, y_i), updates the least-squares estimate for beta solving the equations

y_0 = X_0^T * beta + e_0
y_1 = X_1^T * beta + e_1
…

Based on https://cpb-us-w2.wpmucdn.com/sites.gatech.edu/dist/2/436/files/2017/07/22-notes-6250-f16.pdf
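
To show the idea of a recursive update without any matrix algebra, the sketch below handles the special case D = 1 (one coefficient, no intercept) and feeds one data point at a time. The class and its attribute names are hypothetical and are not the class documented here.

```python
class RecursiveScalarOLS:
    """Recursive least squares for the model y_i = x_i * beta + e_i (D = 1)."""

    def __init__(self):
        self.n = 0
        self.beta = 0.0
        self._p = None  # inverse of the sum of x_i^2 seen so far

    def update(self, x, y):
        if self.n == 0:
            if x == 0.0:
                raise ValueError("first sample must have non-zero x")
            self._p = 1.0 / (x * x)
            self.beta = y / x
        else:
            # Gain applied to the prediction error of the new point.
            gain = self._p * x / (1.0 + x * self._p * x)
            self.beta += gain * (y - x * self.beta)
            self._p -= gain * x * self._p
        self.n += 1

xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]
model = RecursiveScalarOLS()
for x_i, y_i in zip(xs, ys):
    model.update(x_i, y_i)

# Batch least-squares estimate on the same data, for comparison.
batch_beta = sum(x_i * y_i for x_i, y_i in zip(xs, ys)) / sum(x_i * x_i for x_i in xs)
```

After every update the recursive estimate agrees with the batch estimate computed on all points seen so far, up to rounding.
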

beta

Current beta estimate. If n == 0, returns an empty array.

d

Dimension of data points. If n == 0, returns 0.

n

Number of data points seen so far.

update(self: cppyml.linear_regression.RecursiveMultivariateOLS, X: numpy.ndarray[float64[m, n], flags.c_contiguous], y: numpy.ndarray[float64[m, 1]]) → None

Updates the beta estimate with a new sample.

Parameters:
  • X – N x D matrix of X values, with data points in rows.
  • y – Y vector with length N.
Throws:
ValueError: If n == 0 (i.e., (X, y) is the first sample) and N < D.
class cppyml.linear_regression.Result

Bases: pybind11_builtins.pybind11_object

adjusted_r2

1 - fraction of variance unexplained relative to the base model. Uses sample variances. Equal to 1 - rss * (n - 1) / tss / dof.

dof

Number of residual degrees of freedom (e.g. n - 2 or n - 1 for univariate regression with or without intercept).

n

Number of data points.

r2

1 - fraction of variance unexplained relative to the base model. Equal to 1 - rss / tss.

rss

Residual sum of squares: sum_{i=1}^N (hat{y}_i - y_i)^2.

tss

Total sum of squares: sum_{i=1}^N (y_i - N^{-1} sum_{j=1}^N y_j)^2.

var_y

Estimated variance of observations Y, equal to rss / dof.
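
The relations between these properties can be checked by hand. The sketch below evaluates them for made-up observations and predictions; the helper name and the choice dof = 2 are assumptions for illustration only.

```python
def regression_summary(y, y_hat, dof):
    """Compute rss, tss, r2, adjusted_r2 and var_y from the formulas above."""
    n = len(y)
    mean_y = sum(y) / n
    rss = sum((yh - yi) ** 2 for yh, yi in zip(y_hat, y))
    tss = sum((yi - mean_y) ** 2 for yi in y)
    return {
        "rss": rss,
        "tss": tss,
        "r2": 1.0 - rss / tss,
        "adjusted_r2": 1.0 - rss * (n - 1) / tss / dof,
        "var_y": rss / dof,
    }

# Hypothetical observations and fitted values.
y = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.5, 1.5, 3.5, 3.5]
summary = regression_summary(y, y_hat, dof=2)
```
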

class cppyml.linear_regression.RidgeRegressionResult

Bases: cppyml.linear_regression.Result

Result of a (multivariate) ridge regression with intercept.

Intercept is the last coefficient in beta.

var_y is calculated using dof as the denominator.

beta

Fitted coefficients of the model y_i = beta’^T X_i, followed by beta0.

cov

Covariance matrix of (beta’, beta0) coefficients.

effective_dof

Effective number of residual degrees of freedom: N - tr [ X^T (X * X^T + lambda * I)^{-1} X ] - 1.

class cppyml.linear_regression.UnivariateOLSResult

Bases: cppyml.linear_regression.Result

Result of univariate Ordinary Least Squares regression (with or without intercept).

The following properties assume independent Gaussian error terms: var_slope, var_intercept and cov_slope_intercept.

cov_slope_intercept

Estimated covariance of the slope and the intercept.

intercept

Constant added to slope * X when predicting Y.

slope

Coefficient multiplying X values when predicting Y.

var_intercept

Estimated variance of the intercept.

var_slope

Estimated variance of the slope.

cppyml.linear_regression.lasso(X: numpy.ndarray[float64[m, n], flags.c_contiguous], y: numpy.ndarray[float64[m, 1]], lambda: float, do_standardise: bool = False) → cppyml.linear_regression.LassoRegressionResult

Carries out multivariate Lasso regression with intercept.

Given X and y, finds beta’ and beta0 minimising || y - beta’^T X - beta0 ||^2 + lambda * || beta’ ||_1.

R2 is always calculated with respect to a model returning the average y. The matrix X is assumed to be standardised unless do_standardise is set to True. Does not calculate the covariance matrix for estimated coefficients.

Parameters:
  • X – X matrix (shape N x D, with D <= N), with data points in rows.
  • y – Y vector with length N.
  • lambda – Regularisation strength (the lambda multiplying the penalty term in the objective above).
  • do_standardise – Whether to automatically subtract the mean from each column in X and divide it by its standard deviation (defaults to False).
Returns:

Instance of LassoRegressionResult. If do_standardise was True, the beta vector will be rescaled and shifted to original X units and origins.
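
The rescaling and shifting mentioned here follows from substituting z_j = (x_j - mean_j) / sd_j into the fitted model. The sketch below shows that algebra; the function name and the numbers are hypothetical, and the library's internal conventions may differ.

```python
def unstandardise(beta_std, beta0_std, means, sds):
    """Map coefficients fitted on standardised features back to original units."""
    beta = [b / s for b, s in zip(beta_std, sds)]
    beta0 = beta0_std - sum(b * m for b, m in zip(beta, means))
    return beta, beta0

# Hypothetical feature statistics and coefficients fitted on standardised data.
means = [2.0, -1.0]
sds = [0.5, 4.0]
beta_std = [1.0, 2.0]
beta0_std = 3.0
beta, beta0 = unstandardise(beta_std, beta0_std, means, sds)

# Both parametrisations must give the same prediction for any point.
x = [3.0, 1.0]
z = [(xj - m) / s for xj, m, s in zip(x, means, sds)]
prediction_std = sum(b * zj for b, zj in zip(beta_std, z)) + beta0_std
prediction_orig = sum(b * xj for b, xj in zip(beta, x)) + beta0
```
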

cppyml.linear_regression.multivariate(X: numpy.ndarray[float64[m, n], flags.c_contiguous], y: numpy.ndarray[float64[m, 1]], add_ones: bool = False) → cppyml.linear_regression.MultivariateOLSResult

Carries out multivariate linear regression.

R2 is always calculated with respect to a model returning the average Y. If fitting with intercept is desired, include a column of 1’s in the X values or set the parameter add_ones to True.

Parameters:
  • X – X matrix (shape N x D, with D <= N), with data points in rows.
  • y – Y vector with length N.
  • add_ones – Whether to automatically add a column of 1’s at the end of X (optional, defaults to False).
Returns:

Instance of MultivariateOLSResult.

cppyml.linear_regression.press(*args, **kwargs)

Overloaded function.

  1. press(X: numpy.ndarray[float64[m, n], flags.c_contiguous], y: numpy.ndarray[float64[m, 1]], regularisation: str = 'none', reg_lambda: float = 0.0) -> float

Calculates the PRESS statistic (Predicted Residual Error Sum of Squares).

See https://en.wikipedia.org/wiki/PRESS_statistic for details.

NOTE: Training data will be standardised internally if using regularisation.

Parameters:
  • X – X matrix (shape N x D, with D <= N), with data points in rows. Unstandardised.
  • y – Y vector with length N.
  • regularisation – Type of regularisation: “none” or “ridge”. Defaults to “none”.
  • reg_lambda – Non-negative regularisation strength. Defaults to 0. Ignored if regularisation == “none”.
Returns:

Value of the PRESS statistic.

cppyml.linear_regression.press_univariate(x: numpy.ndarray[float64[m, 1]], y: numpy.ndarray[float64[m, 1]], with_intercept: bool = True) → float

Calculates the PRESS statistic (Predicted Residual Error Sum of Squares) for univariate regression.

See https://en.wikipedia.org/wiki/PRESS_statistic for details.

Parameters:
  • x – X vector with length N.
  • y – Y vector with same length as x.
  • with_intercept – Whether the regression is with intercept or not (defaults to True).
Returns:

Value of the PRESS statistic.
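
The definition of PRESS can be made concrete with a brute-force sketch for a univariate fit with intercept: leave each point out, refit, and sum the squared prediction errors. The second function uses the standard leverage identity for ordinary least squares to get the same value from a single fit. Both functions are illustrations with hypothetical names, not the library's code.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def press_leave_one_out(x, y):
    """PRESS by brute force: refit without point i, then predict point i."""
    total = 0.0
    for i in range(len(x)):
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        total += (y[i] - (slope * x[i] + intercept)) ** 2
    return total

def press_from_leverages(x, y):
    """PRESS from a single fit, using residual_i / (1 - leverage_i)."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope, intercept = fit_line(x, y)
    total = 0.0
    for xi, yi in zip(x, y):
        leverage = 1.0 / n + (xi - mean_x) ** 2 / sxx
        total += ((yi - (slope * xi + intercept)) / (1.0 - leverage)) ** 2
    return total

# Made-up data; both routes should agree up to rounding.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [0.1, 0.9, 2.2, 2.8, 4.1]
press_a = press_leave_one_out(x, y)
press_b = press_from_leverages(x, y)
```
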

cppyml.linear_regression.ridge(X: numpy.ndarray[float64[m, n], flags.c_contiguous], y: numpy.ndarray[float64[m, 1]], lambda: float, do_standardise: bool = False) → cppyml.linear_regression.RidgeRegressionResult

Carries out multivariate ridge regression with intercept.

Given X and y, finds beta’ and beta0 minimising || y - beta’^T X - beta0 ||^2 + lambda * || beta’ ||^2.

R2 is always calculated with respect to a model returning the average y. The matrix X is assumed to be standardised unless do_standardise is set to True.

Parameters:
  • X – X matrix (shape N x D, with D <= N), with data points in rows.
  • y – Y vector with length N.
  • lambda – Regularisation strength (the lambda multiplying the penalty term in the objective above).
  • do_standardise – Whether to automatically subtract the mean from each column in X and divide it by its standard deviation (defaults to False).
Returns:

Instance of RidgeRegressionResult. If do_standardise was True, the beta vector will be rescaled and shifted to original X units and origins, and the cov matrix will be transformed accordingly.
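
For a single feature, the objective above has a closed-form minimiser, which makes the shrinking effect of lambda easy to see. The sketch below assumes exactly that objective with an unpenalised intercept and no standardisation; the function name is hypothetical.

```python
def ridge_univariate(x, y, lam):
    """Minimise sum_i (y_i - slope * x_i - intercept)^2 + lam * slope^2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / (sxx + lam)          # lam > 0 shrinks the slope towards 0
    intercept = mean_y - slope * mean_x  # intercept is not penalised
    return slope, intercept

x = [0.0, 1.0, 2.0]
y = [0.0, 1.0, 2.0]
ols_slope, ols_intercept = ridge_univariate(x, y, lam=0.0)
ridge_slope, ridge_intercept = ridge_univariate(x, y, lam=2.0)
```

With lam = 0 the result is the ordinary least-squares line; with lam = 2 the slope is halved for this data and the intercept moves to keep the line passing through the point of means.
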

cppyml.linear_regression.univariate(x: numpy.ndarray[float64[m, 1]], y: numpy.ndarray[float64[m, 1]]) → cppyml.linear_regression.UnivariateOLSResult

Carries out univariate (aka simple) linear regression with intercept.

The R2 coefficient is calculated with respect to a model returning the average Y, and is equal to Corr(X, Y)^2:

R2 = 1 - sum_{i=1}^n (y_i - hat{y}_i)^2 / sum_{i=1}^n (y_i - avg(Y))^2.

Parameters:
  • x – X vector.
  • y – Y vector. x and y must have the same length, which must be at least 2.
Returns:

Instance of UnivariateOLSResult.
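
The identity R2 = Corr(X, Y)^2 for a fit with intercept can be verified numerically. The sketch below uses the closed-form slope and intercept on made-up data; the function name is hypothetical.

```python
import math

def univariate_ols(x, y):
    """Return slope, intercept, R2 and the sample correlation of x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    r2 = 1.0 - rss / syy
    corr = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r2, corr

slope, intercept, r2, corr = univariate_ols([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
```
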

cppyml.linear_regression.univariate_regular(x0: float, dx: float, y: numpy.ndarray[float64[m, 1]]) → cppyml.linear_regression.UnivariateOLSResult

Carries out univariate (aka simple) linear regression with intercept on regularly spaced points.

The R2 coefficient is calculated with respect to a model returning the average Y, and is equal to Corr(X, Y)^2:

R2 = 1 - sum_{i=1}^n (y_i - hat{y}_i)^2 / sum_{i=1}^n (y_i - avg(Y))^2.

Parameters:
  • x0 – First X value.
  • dx – Positive X increment.
  • y – Y vector with length not less than 2.
Returns:

Instance of UnivariateOLSResult.

cppyml.linear_regression.univariate_without_intercept(x: numpy.ndarray[float64[m, 1]], y: numpy.ndarray[float64[m, 1]]) → cppyml.linear_regression.UnivariateOLSResult

Carries out univariate (aka simple) linear regression without intercept.

The R2 coefficient is calculated with respect to a model returning the average Y:

R2 = 1 - sum_{i=1}^n (y_i - hat{y}_i)^2 / sum_{i=1}^n (y_i - avg(Y))^2.

Parameters:
  • x – X vector.
  • y – Y vector. x and y must have the same length, which must be at least 1.
Returns:

Instance of UnivariateOLSResult with intercept, var_intercept and cov_slope_intercept set to 0.
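
One consequence of measuring R2 against the average-Y model is that a fit through the origin can score below zero when it is worse than simply predicting the mean. The sketch below shows this on made-up data, using the closed-form slope sum(x_i * y_i) / sum(x_i^2); the function name is hypothetical.

```python
def ols_without_intercept(x, y):
    """Fit y = slope * x and return (slope, R2 relative to the average of y)."""
    slope = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
    rss = sum((yi - slope * xi) ** 2 for xi, yi in zip(x, y))
    mean_y = sum(y) / len(y)
    tss = sum((yi - mean_y) ** 2 for yi in y)
    return slope, 1.0 - rss / tss

# Decreasing data cannot be described well by a line through the origin.
slope, r2 = ols_without_intercept([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```
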

Logistic regression

Logistic regression algorithms.

class cppyml.logistic_regression.ConjugateGradientLogisticRegression

Bases: cppyml.logistic_regression.LogisticRegression

Binomial logistic regression algorithm.

Implements the conjugate gradient algorithm described in Thomas P. Minka, “A comparison of numerical optimizers for logistic regression”.

class cppyml.logistic_regression.LogisticRegression

Bases: pybind11_builtins.pybind11_object

absolute_tolerance

Absolute tolerance for fitted weights

fit(self: cppyml.logistic_regression.LogisticRegression, X: numpy.ndarray[float64[m, n], flags.c_contiguous], y: numpy.ndarray[float64[m, 1]]) → cppyml.logistic_regression.Result

Fits the model and returns the result.

If fitting with intercept is desired, include a column of 1’s in the X values.

Parameters:
  • X – N x D matrix of X values, with data points in rows.
  • y – Y vector with length N. Values should be -1 or 1.
Returns:

Instance of Result.

lam

Regularisation parameter: inverse variance of the Gaussian prior for w.

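
With labels in {-1, 1} and a zero-mean Gaussian prior on w with inverse variance lam, the quantity being minimised is, up to an additive constant, the negative log-posterior sketched below. This is the standard formulation and is assumed, not confirmed, to match the library's objective; the function name and data are hypothetical.

```python
import math

def penalised_negative_log_likelihood(w, X, y, lam):
    """sum_i log(1 + exp(-y_i * w.x_i)) + 0.5 * lam * ||w||^2."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        margin = y_i * sum(w_j * x_ij for w_j, x_ij in zip(w, x_i))
        total += math.log1p(math.exp(-margin))
    return total + 0.5 * lam * sum(w_j * w_j for w_j in w)

# Made-up data; the second column of 1's plays the role of an intercept.
X = [[1.0, 1.0], [2.0, 1.0], [-1.0, 1.0]]
y = [1.0, 1.0, -1.0]
loss_at_zero = penalised_negative_log_likelihood([0.0, 0.0], X, y, lam=0.1)
loss_at_guess = penalised_negative_log_likelihood([1.0, 0.0], X, y, lam=0.1)
```

At w = 0 every point contributes log(2); weights that separate the classes lower the first term, while lam penalises large weights.
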
maximum_steps

Maximum number of steps allowed

relative_tolerance

Relative tolerance for fitted weights

set_absolute_tolerance(self: cppyml.logistic_regression.LogisticRegression, absolute_tolerance: float) → None

Sets absolute tolerance for weight convergence.

set_lam(self: cppyml.logistic_regression.LogisticRegression, lam: float) → None

Sets the regularisation parameter.

set_maximum_steps(self: cppyml.logistic_regression.LogisticRegression, maximum_steps: int) → None

Sets maximum number of steps.

set_relative_tolerance(self: cppyml.logistic_regression.LogisticRegression, relative_tolerance: float) → None

Sets relative tolerance for weight convergence.

class cppyml.logistic_regression.Result

Bases: pybind11_builtins.pybind11_object

converged

Whether the fit converged.

predict(self: cppyml.logistic_regression.Result, X: numpy.ndarray[float64[m, n], flags.c_contiguous]) → numpy.ndarray[float64[m, 1]]

Predicts labels for features X given w. Returns the predicted label vector.
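
A plausible reading of this method, given labels in {-1, 1}, is that each row of X is scored by its dot product with w and labelled by the sign of that score. The sketch below follows that reading; the treatment of a score of exactly zero is an assumption, and the function name is hypothetical.

```python
def predict_labels(w, X):
    """Label each row of X as 1.0 if w.x >= 0, otherwise -1.0."""
    labels = []
    for x_i in X:
        score = sum(w_j * x_ij for w_j, x_ij in zip(w, x_i))
        labels.append(1.0 if score >= 0.0 else -1.0)
    return labels

# Hypothetical weights and two data points in rows.
predicted = predict_labels([1.0, -1.0], [[2.0, 1.0], [0.0, 3.0]])
```
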

steps_taken

Number of steps taken to converge.

w

Fitted coefficients of the LR model.
