Welcome to cppyml’s documentation!¶

cppyml: Python bindings for efficient C++ implementations of selected ML algorithms.

cppyml provides Python programmers with a curated selection of popular ML algorithms implemented in C++. The goal is to provide well-tested, highly optimised implementations.

All modules listed below should be imported as cppyml.<module name>, e.g. cppyml.linear_regression.

Clustering¶

Clustering algorithms.

class cppyml.clustering.CentroidsInitialiser¶

Bases: pybind11_builtins.pybind11_object

Abstract centroids initialiser.

class cppyml.clustering.ClosestCentroid¶

Bases: cppyml.clustering.ResponsibilitiesInitialiser

Assigns points to closest centroid.

class cppyml.clustering.EM¶

Bases: pybind11_builtins.pybind11_object

Gaussian Expectation-Maximisation algorithm.

assign_responsibilities(self: cppyml.clustering.EM, x: numpy.ndarray[float64[m, 1]]) → numpy.ndarray[float64[m, 1]]¶

Given a data point x, calculate each component’s responsibilities for x and return them.

Parameters:	x – Data point with correct dimension.
Returns:	Array of components’ responsibilities.
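As a rough illustration of what a responsibility is, the sketch below computes the posterior probability of each component for a single point in the one-dimensional case. It assumes the standard Gaussian mixture formula (mixing probability times component density, normalised over components); the helper names are hypothetical and are not part of cppyml.

```python
import math


def gaussian_pdf(x, mean, variance):
    """Density of a 1D Gaussian with the given mean and variance at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / variance) / math.sqrt(2.0 * math.pi * variance)


def responsibilities(x, mixing_probabilities, means, variances):
    """Posterior probability that each component generated the point x."""
    weighted = [p * gaussian_pdf(x, m, v)
                for p, m, v in zip(mixing_probabilities, means, variances)]
    total = sum(weighted)
    return [w / total for w in weighted]


# Two equally weighted components placed symmetrically around x = 0.
resp = responsibilities(0.0, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

The returned values are non-negative and sum to 1, one entry per component.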

covariance(self: cppyml.clustering.EM, k: int) → numpy.ndarray[float64[m, n]]¶

Returns k-th covariance matrix.

Parameters:	k – Gaussian component index.
Returns:	2D square matrix with covariance coefficients.

fit(self: cppyml.clustering.EM, data: numpy.ndarray[float64[m, n], flags.c_contiguous]) → bool¶

Fits the components to the data.

Parameters:	data – A 2D array with data points in rows.
Returns:	True if EM algorithm converged.

log_likelihood¶: Maximised log-likelihood.

means¶: Fitted means.

mixing_probabilities¶: Mixing probabilities of components.

number_components¶: Number of Gaussian components.

responsibilities¶: Fitted responsibilities.

set_absolute_tolerance(self: cppyml.clustering.EM, absolute_tolerance: float) → None¶: Sets absolute tolerance.

set_maximise_first(self: cppyml.clustering.EM, maximise_first: bool) → None¶: Turns on/off doing an initial maximisation step before the E-M iterations.

set_maximum_steps(self: cppyml.clustering.EM, maximum_steps: int) → None¶: Sets maximum number of iterations.

set_means_initialiser(self: cppyml.clustering.EM, means_initialiser: cppyml.clustering.CentroidsInitialiser) → None¶: Sets the algorithm to initialise component means.

set_relative_tolerance(self: cppyml.clustering.EM, relative_tolerance: float) → None¶: Sets relative tolerance.

set_responsibilities_initialiser(self: cppyml.clustering.EM, responsibilities_initialiser: cppyml.clustering.ResponsibilitiesInitialiser) → None¶: Sets the algorithm to initialise responsibilities for data points.

set_seed(self: cppyml.clustering.EM, seed: int) → None¶: Sets PRNG seed.

set_verbose(self: cppyml.clustering.EM, verbose: bool) → None¶: Turns on/off the verbose mode.

class cppyml.clustering.Forgy¶

Bases: cppyml.clustering.CentroidsInitialiser

Forgy initialisation algorithm.

class cppyml.clustering.KMeans¶

Bases: pybind11_builtins.pybind11_object

Naive K-Means algorithm.

assign_label(self: cppyml.clustering.KMeans, x: numpy.ndarray[float64[m, 1]]) → Tuple[int, float]¶

Given a data point x, assigns it to the closest cluster.

Parameters:	x – Data point with correct dimension.
Returns:	Cluster label for point x.
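A minimal pure-Python sketch of closest-centroid assignment follows. The helper closest_centroid is hypothetical. The meaning of the float in the tuple returned by assign_label is not described here; the sketch returns the squared Euclidean distance to the chosen centroid purely as an assumption.

```python
def closest_centroid(x, centroids):
    """Return (index of the closest centroid, squared distance to it)."""
    best_label, best_dist2 = -1, float("inf")
    for label, centroid in enumerate(centroids):
        dist2 = sum((xi - ci) ** 2 for xi, ci in zip(x, centroid))
        if dist2 < best_dist2:
            best_label, best_dist2 = label, dist2
    return best_label, best_dist2


label, dist2 = closest_centroid([0.9, 1.2], [[0.0, 0.0], [1.0, 1.0]])
```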

centroids¶: Fitted centroids.

fit(self: cppyml.clustering.KMeans, data: numpy.ndarray[float64[m, n], flags.c_contiguous]) → bool¶

Fits the components to the data.

Parameters:	data – A 2D array with data points in rows.
Returns:	True if the algorithm converged.

inertia¶: Minimised inertia.

labels¶: Fitted labels.

number_clusters¶: Number of clusters.

set_absolute_tolerance(self: cppyml.clustering.KMeans, absolute_tolerance: float) → None¶: Sets absolute tolerance.

set_centroids_initialiser(self: cppyml.clustering.KMeans, centroids_initialiser: cppyml.clustering.CentroidsInitialiser) → None¶: Sets the algorithm to initialise cluster centroids.

set_maximum_steps(self: cppyml.clustering.KMeans, maximum_steps: int) → None¶: Sets maximum number of iterations.

set_number_initialisations(self: cppyml.clustering.KMeans, centroids_initialiser: int) → None¶: Sets number of initialisations to try, to find the clusters with lowest inertia.

set_seed(self: cppyml.clustering.KMeans, seed: int) → None¶: Sets the PRNG seed.

set_verbose(self: cppyml.clustering.KMeans, verbose: bool) → None¶: Turns on/off the verbose mode.

class cppyml.clustering.KPP¶

Bases: cppyml.clustering.CentroidsInitialiser

KMeans++ initialisation algorithm.
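For orientation, the usual k-means++ seeding scheme can be sketched as below: the first centroid is drawn uniformly from the data, and each further centroid is drawn with probability proportional to the squared distance to the nearest centroid chosen so far. This is a textbook sketch with a hypothetical helper name, not a description of how cppyml implements KPP.

```python
import random


def kmeans_pp_init(data, k, seed=0):
    """Pick k initial centroids from data using k-means++ seeding."""
    rng = random.Random(seed)
    centroids = [rng.choice(data)]
    while len(centroids) < k:
        # Squared distance from each point to its nearest chosen centroid.
        dist2 = [min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
                 for p in data]
        # Draw the next centroid with probability proportional to dist2.
        centroids.append(rng.choices(data, weights=dist2, k=1)[0])
    return centroids


points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
init = kmeans_pp_init(points, 2)
```

The sketch assumes the data contain at least k distinct points; otherwise all weights become zero and random.choices raises an error.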

class cppyml.clustering.RandomPartition¶

Bases: cppyml.clustering.CentroidsInitialiser

Random Partition initialisation algorithm.

class cppyml.clustering.ResponsibilitiesInitialiser¶

Bases: pybind11_builtins.pybind11_object

Abstract responsibilities initialiser.

Decision trees¶

Decision tree algorithms.

class cppyml.decision_trees.ClassificationTree¶

Bases: pybind11_builtins.pybind11_object

Classification tree.

cost_complexity(self: cppyml.decision_trees.ClassificationTree, alpha: float) → float¶: Calculates cost-complexity for given alpha.

number_leaf_nodes¶: Number of leaf nodes.

number_lowest_split_nodes¶: Number of lowest split nodes.

number_nodes¶: Number of nodes.

original_error¶: Original error.

total_leaf_error¶: Total leaf error.
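The exact formula behind cost_complexity is not given here. Assuming the standard CART definition (total leaf error plus alpha times the number of leaf nodes), the trade-off can be sketched as follows; the function below is a hypothetical stand-alone helper, not the cppyml method.

```python
def cost_complexity(total_leaf_error, number_leaf_nodes, alpha):
    """Standard CART cost-complexity: leaf error plus a per-leaf penalty."""
    return total_leaf_error + alpha * number_leaf_nodes


# A large tree with low error versus a small tree with higher error.
big_tree = cost_complexity(total_leaf_error=4.0, number_leaf_nodes=10, alpha=0.5)
small_tree = cost_complexity(total_leaf_error=6.0, number_leaf_nodes=3, alpha=0.5)
```

With alpha = 0.5 the smaller tree has the lower cost-complexity (7.5 against 9.0), which is why a larger alpha favours heavier pruning.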

class cppyml.decision_trees.RegressionTree¶

Bases: pybind11_builtins.pybind11_object

Regression tree.

cost_complexity(self: cppyml.decision_trees.RegressionTree, alpha: float) → float¶: Calculates cost-complexity for given alpha.

number_leaf_nodes¶: Number of leaf nodes.

number_lowest_split_nodes¶: Number of lowest split nodes.

number_nodes¶: Number of nodes.

original_error¶: Original error.

total_leaf_error¶: Total leaf error.

cppyml.decision_trees.classification_tree(X: numpy.ndarray[float64[m, n], flags.c_contiguous], y: numpy.ndarray[float64[m, 1]], max_split_levels: int = 100, min_split_size: int = 10, alphas: List[float] = [1e-06, 1e-05, 0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0], num_folds: int = 10) → Tuple[cppyml.decision_trees.ClassificationTree, float, float]¶

Grows a classification tree with pruning.

Parameters:	X – Independent variables (row-wise) with shape N x D. y – Dependent variable (vector) with length N. max_split_levels – Maximum number of split nodes on the way to any leaf node. min_split_size – Minimum sample size which can be split (at least 2). alphas – Candidate alphas (non-negative) for pruning to be selected by cross-validation. If this vector is empty, no pruning is done. If it has just one element, this value is used for pruning. If it has more than one, the one with smallest k-fold cross-validation test error is used. Defaults to [1E-6, 1E-5, …, 10, 100]. num_folds – Number of folds for cross-validation. Ignored if cross-validation is not done.
Returns:	Tuple of trained decision tree, chosen alpha (NaN if no pruning was done) and minimum cross-validation test error (NaN if no cross-validation was done).
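The rules for the alphas argument can be sketched in plain Python as below. Here cv_error is a hypothetical callable that maps an alpha to its k-fold cross-validation test error; the sketch only mirrors the selection logic described above and assumes that a single candidate alpha is used without cross-validation.

```python
import math


def select_alpha(alphas, cv_error):
    """Return (chosen alpha, minimum cross-validation error) per the rules above."""
    if not alphas:
        return math.nan, math.nan      # no pruning, no cross-validation
    if len(alphas) == 1:
        return alphas[0], math.nan     # pruning without cross-validation
    errors = [cv_error(a) for a in alphas]
    best = min(range(len(alphas)), key=errors.__getitem__)
    return alphas[best], errors[best]


# Toy error curve with its minimum at alpha = 1.
chosen, error = select_alpha([0.1, 1.0, 10.0], lambda a: (a - 1.0) ** 2)
```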

cppyml.decision_trees.classification_tree_accuracy(tree: cppyml.decision_trees.ClassificationTree, X: numpy.ndarray[float64[m, n], flags.f_contiguous], y: numpy.ndarray[float64[m, 1]]) → float¶

Calculates classification tree accuracy on (X, y) data.

Parameters:	tree – Classification tree instance. X – Independent variables (row-wise) with shape N x D. y – Dependent variable (vector) with length N.
Returns:	Classification accuracy.

cppyml.decision_trees.regression_tree(X: numpy.ndarray[float64[m, n], flags.c_contiguous], y: numpy.ndarray[float64[m, 1]], max_split_levels: int = 100, min_split_size: int = 10, alphas: List[float] = [1e-06, 1e-05, 0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0], num_folds: int = 10) → Tuple[cppyml.decision_trees.RegressionTree, float, float]¶

Grows a regression tree with pruning.

Parameters:	X – Independent variables (row-wise) with shape N x D. y – Dependent variable (vector) with length N. max_split_levels – Maximum number of split nodes on the way to any leaf node. min_split_size – Minimum sample size which can be split (at least 2). alphas – Candidate alphas (non-negative) for pruning to be selected by cross-validation. If this vector is empty, no pruning is done. If it has just one element, this value is used for pruning. If it has more than one, the one with smallest k-fold cross-validation test error is used. Defaults to [1E-6, 1E-5, …, 10, 100]. num_folds – Number of folds for cross-validation. Ignored if cross-validation is not done.
Returns:	Tuple of trained decision tree, chosen alpha (NaN if no pruning was done) and minimum cross-validation test error (NaN if no cross-validation was done).

cppyml.decision_trees.regression_tree_mean_squared_error(tree: cppyml.decision_trees.RegressionTree, X: numpy.ndarray[float64[m, n], flags.f_contiguous], y: numpy.ndarray[float64[m, 1]]) → float¶

Calculates regression tree mean squared error on (X, y) data.

Parameters:	tree – Regression tree instance. X – Independent variables (row-wise) with shape N x D. y – Dependent variable (vector) with length N.
Returns:	Mean squared error.

Linear regression¶

Linear regression algorithms.

class cppyml.linear_regression.LassoRegressionResult¶

Bases: cppyml.linear_regression.Result

Result of a (multivariate) Lasso regression with intercept.

Intercept is the last coefficient in beta.

var_y is calculated using dof as the denominator.

beta¶: Fitted coefficients of the model y_i = beta’^T X_i, followed by beta0.

effective_dof¶: Effective number of residual degrees of freedom: N - tr [ X^T (X * X^T + lambda * I)^{-1} X ] - 1.

class cppyml.linear_regression.MultivariateOLSResult¶

Bases: cppyml.linear_regression.Result

Result of multivariate Ordinary Least Squares regression.

The cov property assumes independent Gaussian error terms.

beta¶: Fitted coefficients of the model y_i = beta^T X_i.

cov¶: Covariance matrix of beta coefficients.

class cppyml.linear_regression.RecursiveMultivariateOLS¶

Bases: pybind11_builtins.pybind11_object

Given a stream of pairs (X_i, y_i), updates the least-squares estimate for beta solving the equations

y_0 = X_0^T * beta + e_0
y_1 = X_1^T * beta + e_1
…

Based on https://cpb-us-w2.wpmucdn.com/sites.gatech.edu/dist/2/436/files/2017/07/22-notes-6250-f16.pdf

beta¶: Current beta estimate. If n == 0, returns an empty array.

d¶: Dimension of data points. If n == 0, returns 0.

n¶: Number of data points seen so far.

update(self: cppyml.linear_regression.RecursiveMultivariateOLS, X: numpy.ndarray[float64[m, n], flags.c_contiguous], y: numpy.ndarray[float64[m, 1]]) → None¶

Updates the beta estimate with a new sample.

Parameters:	X – N x D matrix of X values, with data points in rows. y – Y vector with length N.

Raises:	ValueError – If n == 0 (i.e., (X, y) is the first sample) and N < D.
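The idea of updating a least-squares estimate from a stream can be sketched with the classic recursive least-squares step for one sample at a time. This is an illustration only: the helper rls_update is hypothetical, and the sketch starts from a large multiple of the identity for P (a common textbook initialisation) instead of requiring the first batch to have N >= D as the class above does.

```python
def rls_update(beta, P, x, y):
    """One recursive least-squares step for a single sample (x, y).

    P approximates the inverse of the accumulated X^T X matrix.
    """
    d = len(x)
    Px = [sum(P[i][j] * x[j] for j in range(d)) for i in range(d)]
    denom = 1.0 + sum(x[i] * Px[i] for i in range(d))
    gain = [v / denom for v in Px]
    residual = y - sum(beta[i] * x[i] for i in range(d))
    new_beta = [beta[i] + gain[i] * residual for i in range(d)]
    # P is symmetric, so x^T P equals (P x)^T.
    new_P = [[P[i][j] - gain[i] * Px[j] for j in range(d)] for i in range(d)]
    return new_beta, new_P


beta = [0.0, 0.0]
P = [[1e6, 0.0], [0.0, 1e6]]
# Noise-free samples generated from y = 2 * x1 + 3 * x2.
samples = [([1.0, 0.0], 2.0), ([0.0, 1.0], 3.0), ([1.0, 1.0], 5.0), ([2.0, 1.0], 7.0)]
for x, y in samples:
    beta, P = rls_update(beta, P, x, y)
```

After the four samples, beta is close to (2, 3); the small remaining difference comes from the finite initial value of P.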

class cppyml.linear_regression.Result¶

Bases: pybind11_builtins.pybind11_object

adjusted_r2¶: 1 - fraction of variance unexplained relative to the base model. Uses sample variances. Equal to 1 - rss * (n - 1) / tss / dof.

dof¶: Number of residual degrees of freedom (e.g. n - 2 or n - 1 for univariate regression with or without intercept).

n¶: Number of data points.

r2¶: 1 - fraction of variance unexplained relative to the base model. Equal to 1 - rss / tss.

rss¶: Residual sum of squares: sum_{i=1}^N (hat{y}_i - y_i)^2.

tss¶: Total sum of squares: sum_{i=1}^N (y_i - N^{-1} sum_{j=1}^N y_j)^2.

var_y¶: Estimated variance of observations Y, equal to rss / dof.
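The relationships between these properties can be checked with a short pure-Python sketch that evaluates the formulas above on observed values y, predictions y_hat and a given number of residual degrees of freedom. The function name and the dictionary layout are hypothetical.

```python
def regression_summary(y, y_hat, dof):
    """Evaluate rss, tss, r2, adjusted_r2 and var_y from the formulas above."""
    n = len(y)
    mean_y = sum(y) / n
    rss = sum((p - t) ** 2 for p, t in zip(y_hat, y))
    tss = sum((t - mean_y) ** 2 for t in y)
    return {
        "rss": rss,
        "tss": tss,
        "r2": 1.0 - rss / tss,
        "adjusted_r2": 1.0 - rss * (n - 1) / tss / dof,
        "var_y": rss / dof,
    }


# Four points and a univariate fit with intercept, hence dof = n - 2 = 2.
stats = regression_summary([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8], dof=2)
```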

class cppyml.linear_regression.RidgeRegressionResult¶

Bases: cppyml.linear_regression.Result

Result of a (multivariate) ridge regression with intercept.

Intercept is the last coefficient in beta.

var_y is calculated using dof as the denominator.

beta¶: Fitted coefficients of the model y_i = beta’^T X_i, followed by beta0.

cov¶: Covariance matrix of (beta’, beta0) coefficients.

effective_dof¶: Effective number of residual degrees of freedom: N - tr [ X^T (X * X^T + lambda * I)^{-1} X ] - 1.

class cppyml.linear_regression.UnivariateOLSResult¶

Bases: cppyml.linear_regression.Result

Result of univariate Ordinary Least Squares regression (with or without intercept).

The following properties assume independent Gaussian error terms: var_slope, var_intercept and cov_slope_intercept.

cov_slope_intercept¶: Estimated covariance of the slope and the intercept.

intercept¶: Constant added to slope * X when predicting Y.

slope¶: Coefficient multiplying X values when predicting Y.

var_intercept¶: Estimated variance of the intercept.

var_slope¶: Estimated variance of the slope.

cppyml.linear_regression.lasso(X: numpy.ndarray[float64[m, n], flags.c_contiguous], y: numpy.ndarray[float64[m, 1]], lambda: float, do_standardise: bool = False) → cppyml.linear_regression.LassoRegressionResult¶

Carries out multivariate Lasso regression with intercept.

Given X and y, finds beta’ and beta0 minimising || y - beta’^T X - beta0 ||^2 + lambda * || beta’ ||_1.

R2 is always calculated with respect to a model returning average y. The matrix X is assumed to be standardised unless do_standardise is set to True. Does not calculate the covariance matrix for estimated coefficients.

Parameters:	X – X matrix (shape N x D, with D <= N), with data points in rows. y – Y vector with length N. lambda – Regularisation strength multiplying || beta’ ||_1. do_standardise – Whether to automatically subtract the mean from each column of X and divide it by its standard deviation (defaults to False).
Returns:	Instance of LassoRegressionResult. If do_standardise was True, the beta vector will be rescaled and shifted to original X units and origins.
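For a single centred feature, the Lasso objective above has a closed-form minimiser given by soft-thresholding, which the sketch below evaluates. It is a one-dimensional illustration with hypothetical helper names; it says nothing about the solver cppyml uses for the multivariate problem.

```python
def soft_threshold(z, t):
    """Shrink z towards zero by t, returning 0 inside [-t, t]."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_1d(x, y, lam):
    """Minimise sum_i (y_i - beta * x_i - beta0)^2 + lam * |beta| for centred x."""
    n = len(x)
    mean_y = sum(y) / n
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * (yi - mean_y) for xi, yi in zip(x, y))
    beta = soft_threshold(sxy, lam / 2.0) / sxx
    return beta, mean_y  # (slope, intercept)


x = [-1.0, 0.0, 1.0]
y = [1.0, 3.0, 5.0]
unpenalised = lasso_1d(x, y, 0.0)
shrunk = lasso_1d(x, y, 4.0)
zeroed = lasso_1d(x, y, 10.0)
```

With lam = 0 the slope is the ordinary least-squares value 2; with lam = 4 it shrinks to 1; with lam = 10 it is set exactly to 0, while the intercept stays at the mean of y.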

cppyml.linear_regression.multivariate(X: numpy.ndarray[float64[m, n], flags.c_contiguous], y: numpy.ndarray[float64[m, 1]], add_ones: bool = False) → cppyml.linear_regression.MultivariateOLSResult¶

Carries out multivariate linear regression.

R2 is always calculated with respect to a model returning average Y. If fitting with intercept is desired, include a column of 1’s in the X values or set the parameter add_ones to True.

Parameters:	X – X matrix (shape N x D, with D <= N), with data points in rows. y – Y vector with length N. add_ones – Whether to automatically add a column of 1’s at the end of X (optional, defaults to False).
Returns:	Instance of MultivariateOLSResult.

cppyml.linear_regression.press(*args, **kwargs)¶

Overloaded function.

press(X: numpy.ndarray[float64[m, n], flags.c_contiguous], y: numpy.ndarray[float64[m, 1]], regularisation: str = 'none', reg_lambda: float = 0.0) -> float

Calculates the PRESS statistic (Predicted Residual Error Sum of Squares).

See https://en.wikipedia.org/wiki/PRESS_statistic for details.

NOTE: Training data will be standardised internally if using regularisation.

Parameters:	X – X matrix (shape N x D, with D <= N), with data points in rows. Unstandardised. y – Y vector with length N. regularisation – Type of regularisation: “none” or “ridge”. Defaults to “none”. reg_lambda – Non-negative regularisation strength. Defaults to 0. Ignored if regularisation == “none”.
Returns:	Value of the PRESS statistic.

cppyml.linear_regression.press_univariate(x: numpy.ndarray[float64[m, 1]], y: numpy.ndarray[float64[m, 1]], with_intercept: bool = True) → float¶

Calculates the PRESS statistic (Predicted Residual Error Sum of Squares) for univariate regression.

See https://en.wikipedia.org/wiki/PRESS_statistic for details.

Parameters:	x – X vector with length N. y – Y vector with same length as x. with_intercept – Whether the regression is with intercept or not (defaults to True).
Returns:	Value of the PRESS statistic.
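By definition, PRESS is the sum of squared leave-one-out prediction errors. The brute-force sketch below refits a line with each point left out in turn; it is meant to make the definition concrete, uses hypothetical helper names, and is far less efficient than a closed-form implementation would be.

```python
def fit_line(x, y, with_intercept=True):
    """Least-squares slope and intercept for the points (x, y)."""
    if with_intercept:
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        sxx = sum((xi - mx) ** 2 for xi in x)
        slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
        return slope, my - slope * mx
    slope = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
    return slope, 0.0


def loo_press(x, y, with_intercept=True):
    """Sum of squared prediction errors, each from a fit that omits that point."""
    total = 0.0
    for i in range(len(x)):
        x_rest = x[:i] + x[i + 1:]
        y_rest = y[:i] + y[i + 1:]
        slope, intercept = fit_line(x_rest, y_rest, with_intercept)
        total += (y[i] - (slope * x[i] + intercept)) ** 2
    return total


press_exact = loo_press([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
press_peak = loo_press([0.0, 1.0, 2.0], [0.0, 1.0, 0.0])
```

For perfectly linear data the statistic is zero; for the three-point example the left-out errors are 2, 1 and 2, so PRESS is 9.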

cppyml.linear_regression.ridge(X: numpy.ndarray[float64[m, n], flags.c_contiguous], y: numpy.ndarray[float64[m, 1]], lambda: float, do_standardise: bool = False) → cppyml.linear_regression.RidgeRegressionResult¶

Carries out multivariate ridge regression with intercept.

Given X and y, finds beta’ and beta0 minimising || y - beta’^T X - beta0 ||^2 + lambda * || beta’ ||^2.

R2 is always calculated with respect to a model returning average y. The matrix X is assumed to be standardised unless do_standardise is set to True.

Parameters:	X – X matrix (shape N x D, with D <= N), with data points in rows. y – Y vector with length N. lambda – Regularisation strength multiplying || beta’ ||^2. do_standardise – Whether to automatically subtract the mean from each column of X and divide it by its standard deviation (defaults to False).
Returns:	Instance of RidgeRegressionResult. If do_standardise was True, the beta vector will be rescaled and shifted to original X units and origins, and the cov matrix will be transformed accordingly.
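For a single centred feature, the ridge objective above can be minimised in closed form, which makes the shrinkage effect of lambda easy to see. The sketch below also evaluates the effective_dof expression for this one-feature case, where the trace reduces to sxx / (sxx + lam). The helper name is hypothetical and the multivariate solver in cppyml is not shown.

```python
def ridge_1d(x, y, lam):
    """Minimise sum_i (y_i - beta * x_i - beta0)^2 + lam * beta^2 for centred x."""
    n = len(x)
    mean_y = sum(y) / n
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / (sxx + lam)
    # One-feature version of N - tr[X^T (X X^T + lam I)^{-1} X] - 1.
    effective_dof = n - sxx / (sxx + lam) - 1
    return beta, mean_y, effective_dof


x = [-1.0, 0.0, 1.0]
y = [1.0, 3.0, 5.0]
ols_like = ridge_1d(x, y, 0.0)
shrunk = ridge_1d(x, y, 2.0)
```

With lam = 0 the result matches ordinary least squares (slope 2, one residual degree of freedom); with lam = 2 the slope shrinks to 1 and the effective number of residual degrees of freedom rises to 1.5.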

cppyml.linear_regression.univariate(x: numpy.ndarray[float64[m, 1]], y: numpy.ndarray[float64[m, 1]]) → cppyml.linear_regression.UnivariateOLSResult¶

Carries out univariate (aka simple) linear regression with intercept.

R2 coefficient is calculated with respect to a model returning average Y, and is equal to Corr(X, Y)^2: R2 = 1 - sum_{i=1}^n (y_i - hat{y}_i)^2 / sum_{i=1}^n (y_i - avg(Y))^2.

Parameters:	x – X vector. y – Y vector. x and y must have same length not less than 2.
Returns:	Instance of UnivariateOLSResult.
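The quantities reported for a univariate fit with intercept follow from textbook closed-form expressions, sketched below in pure Python. The variance and covariance formulas assume independent Gaussian error terms with variance estimated as rss / (n - 2); the function name and dictionary keys are hypothetical and only mirror the property names of UnivariateOLSResult.

```python
import math


def univariate_ols(x, y):
    """Closed-form simple linear regression with intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    # With n == 2 there are no residual degrees of freedom left.
    var_y = rss / (n - 2) if n > 2 else math.nan
    return {
        "slope": slope,
        "intercept": intercept,
        "var_slope": var_y / sxx,
        "var_intercept": var_y * (1.0 / n + mx * mx / sxx),
        "cov_slope_intercept": -mx * var_y / sxx,
    }


fit = univariate_ols([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 8.0])
```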

cppyml.linear_regression.univariate_regular(x0: float, dx: float, y: numpy.ndarray[float64[m, 1]]) → cppyml.linear_regression.UnivariateOLSResult¶

Carries out univariate (aka simple) linear regression with intercept on regularly spaced points.

R2 coefficient is calculated with respect to a model returning average Y, and is equal to Corr(X, Y)^2: R2 = 1 - sum_{i=1}^n (y_i - hat{y}_i)^2 / sum_{i=1}^n (y_i - avg(Y))^2.

Parameters:	x0 – First X value. dx – Positive X increment. y – Y vector with length not less than 2.
Returns:	Instance of UnivariateOLSResult.

cppyml.linear_regression.univariate_without_intercept(x: numpy.ndarray[float64[m, 1]], y: numpy.ndarray[float64[m, 1]]) → cppyml.linear_regression.UnivariateOLSResult¶

Carries out univariate (aka simple) linear regression without intercept.

The R2 coefficient is calculated with respect to a model returning average Y: R2 = 1 - sum_{i=1}^n (y_i - hat{y}_i)^2 / sum_{i=1}^n (y_i - avg(Y))^2.

Parameters:	x – X vector. y – Y vector. x and y must have same length not less than 1.
Returns:	Instance of UnivariateOLSResult with intercept, var_intercept and cov_slope_intercept set to 0.

Logistic regression¶

Logistic regression algorithms.

class cppyml.logistic_regression.ConjugateGradientLogisticRegression¶

Bases: cppyml.logistic_regression.LogisticRegression

Binomial logistic regression algorithm.

Implements the conjugate gradient algorithm described in Thomas P. Minka, “A comparison of numerical optimizers for logistic regression”.

class cppyml.logistic_regression.LogisticRegression¶

Bases: pybind11_builtins.pybind11_object

absolute_tolerance¶: Absolute tolerance for fitted weights.

fit(self: cppyml.logistic_regression.LogisticRegression, X: numpy.ndarray[float64[m, n], flags.c_contiguous], y: numpy.ndarray[float64[m, 1]]) → cppyml.logistic_regression.Result¶

Fits the model and returns the result.

If fitting with intercept is desired, include a column of 1’s in the X values.

Parameters:	X – N x D matrix of X values, with data points in rows. y – Y vector with length N. Values should be -1 or 1.
Returns:	Instance of Result.

lam¶: Regularisation parameter: inverse variance of the Gaussian prior for w.
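To show how lam enters the fit, the sketch below evaluates the regularised objective for labels in {-1, 1}, assuming the formulation used in Minka's paper: the sum of log(1 + exp(-y_i * w^T x_i)) plus (lam / 2) * ||w||^2. The function is a hypothetical helper, and any constant scaling applied inside cppyml may differ.

```python
import math


def objective(w, X, y, lam):
    """Negative log-posterior of w under a Gaussian prior with precision lam."""
    loss = 0.0
    for xi, yi in zip(X, y):
        margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
        # log(1 + exp(-margin)); adequate for moderate margins.
        loss += math.log1p(math.exp(-margin))
    return loss + 0.5 * lam * sum(wj * wj for wj in w)


X = [[1.0, 1.0], [-1.0, 1.0]]   # second column of 1's acts as the intercept
y = [1.0, -1.0]
at_zero = objective([0.0, 0.0], X, y, lam=1.0)
at_fit = objective([1.0, 0.0], X, y, lam=0.0)
```

At w = 0 every point contributes log 2 regardless of lam; moving w in the direction that separates the two points lowers the unregularised objective.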

maximum_steps¶: Maximum number of steps allowed.

relative_tolerance¶: Relative tolerance for fitted weights.

set_absolute_tolerance(self: cppyml.logistic_regression.LogisticRegression, absolute_tolerance: float) → None¶: Sets absolute tolerance for weight convergence.

set_lam(self: cppyml.logistic_regression.LogisticRegression, lam: float) → None¶: Sets the regularisation parameter.

set_maximum_steps(self: cppyml.logistic_regression.LogisticRegression, maximum_steps: int) → None¶: Sets maximum number of steps.

set_relative_tolerance(self: cppyml.logistic_regression.LogisticRegression, relative_tolerance: float) → None¶: Sets relative tolerance for weight convergence.

class cppyml.logistic_regression.Result¶

Bases: pybind11_builtins.pybind11_object

converged¶: Whether the algorithm converged.

predict(self: cppyml.logistic_regression.Result, X: numpy.ndarray[float64[m, n], flags.c_contiguous]) → numpy.ndarray[float64[m, 1]]¶: Predicts labels for features X given w. Returns the predicted label vector.

steps_taken¶: Number of steps taken to converge.

w¶: Fitted coefficients of the LR model.
