Welcome to cppyml’s documentation!
cppyml: Python bindings for efficient C++ implementations of selected ML algorithms.
cppyml provides Python programmers with a curated selection of popular ML algorithms implemented in C++. The goal is to provide well-tested, highly optimised implementations.
© 2020 Roman Werpachowski. Available under the GPL v3 license.
All modules listed below should be imported as cppyml.<module name>, e.g. cppyml.linear_regression.
Clustering
Clustering algorithms.
-
class cppyml.clustering.CentroidsInitialiser
Bases: pybind11_builtins.pybind11_object
Abstract centroids initialiser.
-
class cppyml.clustering.ClosestCentroid
Bases: cppyml.clustering.ResponsibilitiesInitialiser
Assigns points to closest centroid.
-
class cppyml.clustering.EM
Bases: pybind11_builtins.pybind11_object
Gaussian Expectation-Maximisation algorithm.
-
assign_responsibilities
(self: cppyml.clustering.EM, x: numpy.ndarray[float64[m, 1]]) → numpy.ndarray[float64[m, 1]]¶ Given a data point x, calculate each component’s responsibilities for x and return them.
Parameters: x – Data point with correct dimension. Returns: Array of components’ responsibilities.
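For a Gaussian mixture, the responsibility of component k for a point x is the textbook posterior probability: the mixing probability times the component density at x, normalised over all components. The following is a minimal pure-Python sketch of that formula for one-dimensional components. It does not call cppyml, and the function names and the restriction to scalar variances are assumptions made for brevity.

```python
import math


def gaussian_pdf(x, mean, variance):
    """Density of a 1D Gaussian with the given mean and variance."""
    norm = math.sqrt(2.0 * math.pi * variance)
    return math.exp(-0.5 * (x - mean) ** 2 / variance) / norm


def responsibilities(x, mixing_probabilities, means, variances):
    """Posterior probability of each mixture component given the point x."""
    weighted = [
        p * gaussian_pdf(x, m, v)
        for p, m, v in zip(mixing_probabilities, means, variances)
    ]
    total = sum(weighted)
    return [w / total for w in weighted]


# A point halfway between two equally weighted, equally wide components
# is shared equally between them.
r = responsibilities(0.0, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

The returned values are non-negative and sum to 1, matching the interpretation of the array returned by assign_responsibilities.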
-
covariance
(self: cppyml.clustering.EM, k: int) → numpy.ndarray[float64[m, n]]¶ Returns k-th covariance matrix.
Parameters: k – Gaussian component index. Returns: 2D square matrix with covariance coefficients.
-
fit
(self: cppyml.clustering.EM, data: numpy.ndarray[float64[m, n], flags.c_contiguous]) → bool¶ Fits the components to the data.
Parameters: data – A 2D array with data points in rows. Returns: True if EM algorithm converged.
-
log_likelihood
¶ Maximised log-likelihood.
-
means
¶ Fitted means.
-
mixing_probabilities
¶ Mixing probabilities of components.
-
number_components
¶ Number of Gaussian components.
-
responsibilities
¶ Fitted responsibilities.
-
set_absolute_tolerance
(self: cppyml.clustering.EM, absolute_tolerance: float) → None¶ Sets absolute tolerance.
-
set_maximise_first
(self: cppyml.clustering.EM, maximise_first: bool) → None¶ Turns on/off doing an initial maximisation step before the E-M iterations.
-
set_maximum_steps
(self: cppyml.clustering.EM, maximum_steps: int) → None¶ Sets maximum number of iterations.
-
set_means_initialiser
(self: cppyml.clustering.EM, means_initialiser: cppyml.clustering.CentroidsInitialiser) → None¶ Sets the algorithm to initialise component means.
-
set_relative_tolerance
(self: cppyml.clustering.EM, relative_tolerance: float) → None¶ Sets relative tolerance.
-
set_responsibilities_initialiser
(self: cppyml.clustering.EM, responsibilities_initialiser: cppyml.clustering.ResponsibilitiesInitialiser) → None¶ Sets the algorithm to initialise responsibilities for data points.
-
set_seed
(self: cppyml.clustering.EM, seed: int) → None¶ Sets PRNG seed.
-
set_verbose
(self: cppyml.clustering.EM, verbose: bool) → None¶ Turns on/off the verbose mode.
-
-
class cppyml.clustering.Forgy
Bases: cppyml.clustering.CentroidsInitialiser
Forgy initialisation algorithm.
-
class cppyml.clustering.KMeans
Bases: pybind11_builtins.pybind11_object
Naive K-Means algorithm.
-
assign_label
(self: cppyml.clustering.KMeans, x: numpy.ndarray[float64[m, 1]]) → Tuple[int, float]¶ Given a data point x, assigns it to the closest cluster.
Parameters: x – Data point with correct dimension. Returns: Cluster label for point x.
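Assigning a point to the closest cluster can be sketched in pure Python as a nearest-centroid search. The signature above returns a Tuple[int, float], but the meaning of the float is not documented here; the sketch assumes it is the squared Euclidean distance to the chosen centroid. The function name and the list-of-lists layout are illustrative only.

```python
def closest_centroid(x, centroids):
    """Return (label, squared distance) of the centroid nearest to x.

    The second tuple element is an assumption: the reference text does not
    say what the float returned by KMeans.assign_label represents.
    """
    best_label = -1
    best_distance = float("inf")
    for label, centroid in enumerate(centroids):
        distance = sum((xi - ci) ** 2 for xi, ci in zip(x, centroid))
        if distance < best_distance:
            best_label = label
            best_distance = distance
    return best_label, best_distance


label, distance = closest_centroid([0.9, 1.2], [[0.0, 0.0], [1.0, 1.0]])
```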
-
centroids
¶ Fitted centroids.
-
fit
(self: cppyml.clustering.KMeans, data: numpy.ndarray[float64[m, n], flags.c_contiguous]) → bool¶ Fits the components to the data.
Parameters: data – A 2D array with data points in rows. Returns: True if the algorithm converged.
-
inertia
¶ Minimised inertia.
-
labels
¶ Fitted labels.
-
number_clusters
¶ Number of clusters.
-
set_absolute_tolerance
(self: cppyml.clustering.KMeans, absolute_tolerance: float) → None¶ Sets absolute tolerance.
-
set_centroids_initialiser
(self: cppyml.clustering.KMeans, centroids_initialiser: cppyml.clustering.CentroidsInitialiser) → None¶ Sets the algorithm to initialise cluster centroids.
-
set_maximum_steps
(self: cppyml.clustering.KMeans, maximum_steps: int) → None¶ Sets maximum number of iterations.
-
set_number_initialisations
(self: cppyml.clustering.KMeans, centroids_initialiser: int) → None¶ Sets number of initialisations to try, to find the clusters with lowest inertia.
-
set_seed
(self: cppyml.clustering.KMeans, seed: int) → None¶ Sets the PRNG seed.
-
set_verbose
(self: cppyml.clustering.KMeans, verbose: bool) → None¶ Turns on/off the verbose mode.
-
-
class cppyml.clustering.KPP
Bases: cppyml.clustering.CentroidsInitialiser
KMeans++ initialisation algorithm.
-
class cppyml.clustering.RandomPartition
Bases: cppyml.clustering.CentroidsInitialiser
Random Partition initialisation algorithm.
-
class cppyml.clustering.ResponsibilitiesInitialiser
Bases: pybind11_builtins.pybind11_object
Abstract responsibilities initialiser.
Decision trees
Decision tree algorithms.
-
class cppyml.decision_trees.ClassificationTree
Bases: pybind11_builtins.pybind11_object
Classification tree.
-
cost_complexity
(self: cppyml.decision_trees.ClassificationTree, alpha: float) → float¶ Calculates cost-complexity for given alpha.
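In the standard CART formulation, the cost-complexity of a tree is its total leaf error plus alpha times the number of leaf nodes. The sketch below assumes cppyml follows that definition, which this reference does not state explicitly, and uses plain numbers in place of the total_leaf_error and number_leaf_nodes properties.

```python
def cost_complexity(total_leaf_error, number_leaf_nodes, alpha):
    """Standard CART cost-complexity: error plus a penalty per leaf.

    Assumption: this is the textbook definition, not a transcription of
    the library's implementation.
    """
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    return total_leaf_error + alpha * number_leaf_nodes


value = cost_complexity(total_leaf_error=12.0, number_leaf_nodes=5, alpha=0.1)
```

A larger alpha penalises leaves more heavily, so pruning with a larger alpha favours smaller trees.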
-
number_leaf_nodes
¶ Number of leaf nodes.
-
number_lowest_split_nodes
¶ Number of lowest split nodes.
-
number_nodes
¶ Number of nodes.
-
original_error
¶ Original error.
-
total_leaf_error
¶ Total leaf error.
-
-
class cppyml.decision_trees.RegressionTree
Bases: pybind11_builtins.pybind11_object
Regression tree.
-
cost_complexity
(self: cppyml.decision_trees.RegressionTree, alpha: float) → float¶ Calculates cost-complexity for given alpha.
-
number_leaf_nodes
¶ Number of leaf nodes.
-
number_lowest_split_nodes
¶ Number of lowest split nodes.
-
number_nodes
¶ Number of nodes.
-
original_error
¶ Original error.
-
total_leaf_error
¶ Total leaf error.
-
-
cppyml.decision_trees.
classification_tree
(X: numpy.ndarray[float64[m, n], flags.c_contiguous], y: numpy.ndarray[float64[m, 1]], max_split_levels: int = 100, min_split_size: int = 10, alphas: List[float] = [1e-06, 1e-05, 0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0], num_folds: int = 10) → Tuple[cppyml.decision_trees.ClassificationTree, float, float]¶ Grows a classification tree with pruning.
Parameters: - X – Independent variables (row-wise) with shape N x D.
- y – Dependent variable (vector) with length N.
- max_split_levels – Maximum number of split nodes on the way to any leaf node.
- min_split_size – Minimum sample size which can be split (at least 2).
- alphas – Candidate alphas (non-negative) for pruning to be selected by cross-validation. If this vector is empty, no pruning is done. If it has just one element, this value is used for pruning. If it has more than one, the one with smallest k-fold cross-validation test error is used. Defaults to [1E-6, 1E-5, …, 10, 100].
- num_folds – Number of folds for cross-validation. Ignored if cross-validation is not done.
Returns: Tuple of trained decision tree, chosen alpha (NaN if no pruning was done) and minimum cross-validation test error (NaN if no cross-validation was done).
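The three cases described for alphas (empty, one element, several elements) can be summarised in a short pure-Python sketch. The cv_error callable is hypothetical: it stands in for whatever k-fold cross-validation the library performs internally and is not part of the cppyml API.

```python
import math


def choose_alpha(alphas, cv_error):
    """Return (chosen alpha, cross-validation error) per the documented rules.

    cv_error is a hypothetical callable mapping an alpha to its k-fold
    cross-validation test error.
    """
    if not alphas:
        # No candidates: no pruning and no cross-validation.
        return math.nan, math.nan
    if len(alphas) == 1:
        # A single candidate is used directly, without cross-validation.
        return alphas[0], math.nan
    errors = [cv_error(alpha) for alpha in alphas]
    best = min(range(len(alphas)), key=errors.__getitem__)
    return alphas[best], errors[best]


# Toy error curve with its minimum at alpha = 0.01.
alpha, error = choose_alpha([1e-3, 1e-2, 1e-1], lambda a: (a - 0.01) ** 2)
```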
-
cppyml.decision_trees.
classification_tree_accuracy
(tree: cppyml.decision_trees.ClassificationTree, X: numpy.ndarray[float64[m, n], flags.f_contiguous], y: numpy.ndarray[float64[m, 1]]) → float¶ Calculates classification tree accuracy on (X, y) data.
Parameters: - tree – Classification tree instance.
- X – Independent variables (row-wise) with shape N x D.
- y – Dependent variable (vector) with length N.
Returns: Classification accuracy.
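Classification accuracy is the fraction of data points whose predicted label equals the true label. The sketch below computes it from two plain lists; how the tree produces its predictions is not covered in this reference, so the predicted labels are simply passed in.

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label matches the actual one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one data point is required")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


score = accuracy([0, 1, 1, 0], [0, 1, 0, 0])
```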
-
cppyml.decision_trees.
regression_tree
(X: numpy.ndarray[float64[m, n], flags.c_contiguous], y: numpy.ndarray[float64[m, 1]], max_split_levels: int = 100, min_split_size: int = 10, alphas: List[float] = [1e-06, 1e-05, 0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0], num_folds: int = 10) → Tuple[cppyml.decision_trees.RegressionTree, float, float]¶ Grows a regression tree with pruning.
Parameters: - X – Independent variables (row-wise) with shape N x D.
- y – Dependent variable (vector) with length N.
- max_split_levels – Maximum number of split nodes on the way to any leaf node.
- min_split_size – Minimum sample size which can be split (at least 2).
- alphas – Candidate alphas (non-negative) for pruning to be selected by cross-validation. If this vector is empty, no pruning is done. If it has just one element, this value is used for pruning. If it has more than one, the one with smallest k-fold cross-validation test error is used. Defaults to [1E-6, 1E-5, …, 10, 100].
- num_folds – Number of folds for cross-validation. Ignored if cross-validation is not done.
Returns: Tuple of trained decision tree, chosen alpha (NaN if no pruning was done) and minimum cross-validation test error (NaN if no cross-validation was done).
-
cppyml.decision_trees.
regression_tree_mean_squared_error
(tree: cppyml.decision_trees.RegressionTree, X: numpy.ndarray[float64[m, n], flags.f_contiguous], y: numpy.ndarray[float64[m, 1]]) → float¶ Calculates regression tree mean squared error on (X, y) data.
Parameters: - tree – Regression tree instance.
- X – Independent variables (row-wise) with shape N x D.
- y – Dependent variable (vector) with length N.
Returns: Mean squared error.
Linear regression
Linear regression algorithms.
-
class cppyml.linear_regression.LassoRegressionResult
Bases: cppyml.linear_regression.Result
Result of a (multivariate) Lasso regression with intercept.
Intercept is the last coefficient in beta.
var_y is calculated using dof as the denominator.
-
beta
¶ Fitted coefficients of the model y_i = beta’^T X_i, followed by beta0.
-
effective_dof
¶ Effective number of residual degrees of freedom: N - tr [ X^T (X * X^T + lambda * I)^{-1} X ] - 1.
-
-
class cppyml.linear_regression.MultivariateOLSResult
Bases: cppyml.linear_regression.Result
Result of multivariate Ordinary Least Squares regression.
The cov property assumes independent Gaussian error terms.
-
beta
¶ Fitted coefficients of the model y_i = beta^T X_i.
-
cov
¶ Covariance matrix of beta coefficients.
-
-
class cppyml.linear_regression.RecursiveMultivariateOLS
Bases: pybind11_builtins.pybind11_object
Given a stream of pairs (X_i, y_i), updates the least-squares estimate for beta solving the equations
y_0 = X_0^T * beta + e_0
y_1 = X_1^T * beta + e_1
…
Based on https://cpb-us-w2.wpmucdn.com/sites.gatech.edu/dist/2/436/files/2017/07/22-notes-6250-f16.pdf
-
beta
¶ Current beta estimate. If n == 0, returns an empty array.
-
d
¶ Dimension of data points. If n == 0, returns 0.
-
n
¶ Number of data points seen so far.
-
update
(self: cppyml.linear_regression.RecursiveMultivariateOLS, X: numpy.ndarray[float64[m, n], flags.c_contiguous], y: numpy.ndarray[float64[m, 1]]) → None¶ Updates the beta estimate with a new sample.
Parameters: - X – N x D matrix of X values, with data points in rows.
- y – Y vector with length N.
- Throws:
- ValueError: If n == 0 (i.e., (X, y) is the first sample) and N < D.
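The idea of updating a least-squares estimate as new samples arrive can be illustrated with a scalar (D = 1) recursive least squares sketch in pure Python. This is a simplification under stated assumptions, not the library's implementation: the class name, the scalar restriction and the handling of the first sample are all illustrative.

```python
class RecursiveScalarOLS:
    """Recursive least squares for the model y_i = x_i * beta + e_i."""

    def __init__(self):
        self.n = 0        # number of data points seen so far
        self.beta = 0.0   # current estimate
        self._p = 0.0     # inverse of the running sum of x_i^2

    def update(self, xs, ys):
        """Fold a batch of (x, y) pairs into the estimate, one pair at a time."""
        for x, y in zip(xs, ys):
            if self.n == 0:
                if x == 0.0:
                    raise ValueError("first sample must determine beta")
                self._p = 1.0 / (x * x)
                self.beta = y / x
            else:
                # Scalar Sherman-Morrison update of the inverse of sum(x^2).
                self._p = self._p / (1.0 + x * self._p * x)
                # Correct beta by the gain times the prediction error.
                self.beta += self._p * x * (y - x * self.beta)
            self.n += 1


xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.9]
model = RecursiveScalarOLS()
model.update(xs[:2], ys[:2])
model.update(xs[2:], ys[2:])
batch_beta = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
```

After both updates, model.beta agrees with the batch estimate computed from all four points at once, which is the property a recursive estimator is meant to preserve.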
-
-
class cppyml.linear_regression.Result
Bases: pybind11_builtins.pybind11_object
-
adjusted_r2
¶ 1 - fraction of variance unexplained relative to the base model. Uses sample variances. Equal to 1 - rss * (n - 1) / tss / dof.
-
dof
¶ Number of residual degrees of freedom (e.g. n - 2 or n - 1 for univariate regression with or without intercept).
-
n
¶ Number of data points.
-
r2
¶ 1 - fraction of variance unexplained relative to the base model. Equal to 1 - rss / tss.
-
rss
¶ Residual sum of squares: sum_{i=1}^N (hat{y}_i - y_i)^2.
-
tss
¶ Total sum of squares: sum_{i=1}^N (y_i - N^{-1} sum_{j=1}^N y_j)^2.
-
var_y
¶ Estimated variance of observations Y, equal to rss / dof.
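The relationships between rss, tss, r2, adjusted_r2 and var_y stated above can be checked with a small pure-Python sketch. It takes observed values, fitted values and the residual degrees of freedom as plain inputs; the function name and the dictionary layout are illustrative and not part of cppyml.

```python
def regression_summary(y, y_hat, dof):
    """Compute the summary statistics from the formulas given for Result."""
    n = len(y)
    mean_y = sum(y) / n
    rss = sum((fitted - observed) ** 2 for fitted, observed in zip(y_hat, y))
    tss = sum((observed - mean_y) ** 2 for observed in y)
    return {
        "n": n,
        "rss": rss,
        "tss": tss,
        "r2": 1.0 - rss / tss,
        "adjusted_r2": 1.0 - rss * (n - 1) / tss / dof,
        "var_y": rss / dof,
    }


# Four observations and a fit with two residual degrees of freedom.
stats = regression_summary([1.0, 2.0, 3.0, 4.0], [1.5, 1.5, 3.5, 3.5], dof=2)
```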
-
-
class cppyml.linear_regression.RidgeRegressionResult
Bases: cppyml.linear_regression.Result
Result of a (multivariate) ridge regression with intercept.
Intercept is the last coefficient in beta.
var_y is calculated using dof as the denominator.
-
beta
¶ Fitted coefficients of the model y_i = beta’^T X_i, followed by beta0.
-
cov
¶ Covariance matrix of (beta’, beta0) coefficients.
-
effective_dof
¶ Effective number of residual degrees of freedom: N - tr [ X^T (X * X^T + lambda * I)^{-1} X ] - 1.
-
-
class cppyml.linear_regression.UnivariateOLSResult
Bases: cppyml.linear_regression.Result
Result of univariate Ordinary Least Squares regression (with or without intercept).
The following properties assume independent Gaussian error terms: var_slope, var_intercept and cov_slope_intercept.
-
cov_slope_intercept
¶ Estimated covariance of the slope and the intercept.
-
intercept
¶ Constant added to slope * X when predicting Y.
-
slope
¶ Coefficient multiplying X values when predicting Y.
-
var_intercept
¶ Estimated variance of the intercept.
-
var_slope
¶ Estimated variance of the slope.
-
-
cppyml.linear_regression.
lasso
(X: numpy.ndarray[float64[m, n], flags.c_contiguous], y: numpy.ndarray[float64[m, 1]], lambda: float, do_standardise: bool = False) → cppyml.linear_regression.LassoRegressionResult¶ Carries out multivariate Lasso regression with intercept.
Given X and y, finds beta’ and beta0 minimising || y - beta’^T X - beta0 ||^2 + lambda * || beta’ ||_1.
R2 is always calculated with respect to a model returning the average y. The matrix X is assumed to be standardised unless do_standardise is set to True. Does not calculate the covariance matrix for estimated coefficients.
Parameters: - X – X matrix (shape N x D, with D <= N), with data points in rows.
- y – Y vector with length N.
- do_standardise – Whether to automatically subtract the mean from each column (feature) in X and divide it by its standard deviation (defaults to False).
Returns: Instance of LassoRegressionResult. If do_standardise was True, the beta vector will be rescaled and shifted to original X units and origins.
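For a single feature, the objective || y - beta’ x - beta0 ||^2 + lambda * | beta’ | has a closed-form minimiser based on soft-thresholding, which makes the effect of lambda easy to see. The sketch below is pure Python and covers only that one-dimensional case; it is not how cppyml solves the general problem. The parameter is named lam because lambda is a reserved word in Python.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold, clipping at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_1d(x, y, lam):
    """Minimise sum (y_i - slope * x_i - intercept)^2 + lam * |slope|."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Setting the subgradient to zero gives slope * sxx = S(sxy, lam / 2).
    slope = soft_threshold(sxy, lam / 2.0) / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 2.0, 5.0]
slope_small, _ = lasso_1d(x, y, lam=1.0)
slope_large, intercept_large = lasso_1d(x, y, lam=20.0)
```

With a large enough lambda the slope is exactly zero and the intercept reduces to the mean of y, which is the sparsity-inducing behaviour that distinguishes Lasso from ridge regression.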
-
cppyml.linear_regression.
multivariate
(X: numpy.ndarray[float64[m, n], flags.c_contiguous], y: numpy.ndarray[float64[m, 1]], add_ones: bool = False) → cppyml.linear_regression.MultivariateOLSResult¶ Carries out multivariate linear regression.
R2 is always calculated with respect to a model returning the average Y. If fitting with intercept is desired, include a column of 1’s in the X values or set the parameter add_ones to True.
Parameters: - X – X matrix (shape N x D, with D <= N), with data points in rows.
- y – Y vector with length N.
- add_ones – Whether to automatically add a column of 1’s at the end of X (optional, defaults to False).
Returns: Instance of MultivariateOLSResult.
-
cppyml.linear_regression.
press
(*args, **kwargs)¶ Overloaded function.
- press(X: numpy.ndarray[float64[m, n], flags.c_contiguous], y: numpy.ndarray[float64[m, 1]], regularisation: str = 'none', reg_lambda: float = 0.0) -> float
Calculates the PRESS statistic (Predicted Residual Error Sum of Squares).
See https://en.wikipedia.org/wiki/PRESS_statistic for details.
NOTE: Training data will be standardised internally if using regularisation.
Parameters: - X – X matrix (shape N x D, with D <= N), with data points in rows. Unstandardised.
- y – Y vector with length N.
- regularisation – Type of regularisation: “none” or “ridge”. Defaults to “none”.
- reg_lambda – Non-negative regularisation strength. Defaults to 0. Ignored if regularisation == “none”.
Returns: Value of the PRESS statistic.
-
cppyml.linear_regression.
press_univariate
(x: numpy.ndarray[float64[m, 1]], y: numpy.ndarray[float64[m, 1]], with_intercept: bool = True) → float¶ Calculates the PRESS statistic (Predicted Residual Error Sum of Squares) for univariate regression.
See https://en.wikipedia.org/wiki/PRESS_statistic for details.
Parameters: - x – X vector with length N.
- y – Y vector with same length as x.
- with_intercept – Whether the regression is with intercept or not (defaults to True).
Returns: Value of the PRESS statistic.
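The PRESS statistic is the sum of squared leave-one-out prediction errors: each point is predicted by a model fitted to all the other points. The brute-force pure-Python sketch below follows that definition directly for univariate regression. It is meant to clarify what is being computed, and the helper names are illustrative; cppyml may well use a faster closed-form shortcut.

```python
def fit_line(x, y, with_intercept=True):
    """Least-squares slope and intercept for paired samples x and y."""
    if not with_intercept:
        slope = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
        return slope, 0.0
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def press_brute_force(x, y, with_intercept=True):
    """Sum of squared prediction errors with each point held out in turn."""
    total = 0.0
    for i in range(len(x)):
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        slope, intercept = fit_line(x_train, y_train, with_intercept)
        total += (y[i] - (slope * x[i] + intercept)) ** 2
    return total


press_value = press_brute_force([0.0, 1.0, 2.0], [0.0, 1.0, 0.0])
```

Because every prediction is made out of sample, PRESS is never smaller than the ordinary residual sum of squares of the full fit.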
-
cppyml.linear_regression.
ridge
(X: numpy.ndarray[float64[m, n], flags.c_contiguous], y: numpy.ndarray[float64[m, 1]], lambda: float, do_standardise: bool = False) → cppyml.linear_regression.RidgeRegressionResult¶ Carries out multivariate ridge regression with intercept.
Given X and y, finds beta’ and beta0 minimising || y - beta’^T X - beta0 ||^2 + lambda * || beta’ ||^2.
R2 is always calculated with respect to a model returning the average y. The matrix X is assumed to be standardised unless do_standardise is set to True.
Parameters: - X – X matrix (shape N x D, with D <= N), with data points in rows.
- y – Y vector with length N.
- do_standardise – Whether to automatically subtract the mean from each column (feature) in X and divide it by its standard deviation (defaults to False).
Returns: Instance of RidgeRegressionResult. If do_standardise was True, the beta vector will be rescaled and shifted to original X units and origins, and the cov matrix will be transformed accordingly.
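For a single feature, the ridge objective || y - beta’ x - beta0 ||^2 + lambda * beta’^2 can be minimised in closed form, which shows how lambda shrinks the slope while leaving the intercept unpenalised. The pure-Python sketch below covers only this one-dimensional case and does not call cppyml; the parameter is named lam because lambda is a reserved word in Python.

```python
def ridge_1d(x, y, lam):
    """Minimise sum (y_i - slope * x_i - intercept)^2 + lam * slope^2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # The penalty adds lam to the denominator of the OLS slope.
    slope = sxy / (sxx + lam)
    # The intercept is not penalised, so the fit passes through the means.
    intercept = mean_y - slope * mean_x
    return slope, intercept


x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 2.0, 5.0]
ols_slope, ols_intercept = ridge_1d(x, y, lam=0.0)
ridge_slope, ridge_intercept = ridge_1d(x, y, lam=5.0)
```

Setting lam to 0 recovers the ordinary least-squares fit, and increasing it moves the slope continuously towards zero without ever reaching it.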
-
cppyml.linear_regression.
univariate
(x: numpy.ndarray[float64[m, 1]], y: numpy.ndarray[float64[m, 1]]) → cppyml.linear_regression.UnivariateOLSResult¶ Carries out univariate (aka simple) linear regression with intercept.
The R2 coefficient is calculated with respect to a model returning average Y, and is equal to Corr(X, Y)^2:
R2 = 1 - sum_{i=1}^n (y_i - hat{y}_i)^2 / sum_{i=1}^n (y_i - avg(Y))^2.
Parameters: - x – X vector.
- y – Y vector. x and y must have same length not less than 2.
Returns: Instance of UnivariateOLSResult.
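The statement that R2 equals Corr(X, Y)^2 for simple regression with intercept can be verified numerically. The pure-Python sketch below fits the line with the usual closed-form expressions and computes both quantities; it is independent of cppyml and the function name is illustrative.

```python
import math


def univariate_ols(x, y):
    """Return (slope, intercept, r2, correlation) for a fit with intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    r2 = 1.0 - rss / syy
    correlation = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r2, correlation


slope, intercept, r2, correlation = univariate_ols(
    [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 2.0, 5.0]
)
```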
-
cppyml.linear_regression.
univariate_regular
(x0: float, dx: float, y: numpy.ndarray[float64[m, 1]]) → cppyml.linear_regression.UnivariateOLSResult¶ Carries out univariate (aka simple) linear regression with intercept on regularly spaced points.
The R2 coefficient is calculated with respect to a model returning average Y, and is equal to Corr(X, Y)^2:
R2 = 1 - sum_{i=1}^n (y_i - hat{y}_i)^2 / sum_{i=1}^n (y_i - avg(Y))^2.
Parameters: - x0 – First X value.
- dx – Positive X increment.
- y – Y vector with length not less than 2.
Returns: Instance of UnivariateOLSResult.
-
cppyml.linear_regression.
univariate_without_intercept
(x: numpy.ndarray[float64[m, 1]], y: numpy.ndarray[float64[m, 1]]) → cppyml.linear_regression.UnivariateOLSResult¶ Carries out univariate (aka simple) linear regression without intercept.
The R2 coefficient is calculated with respect to a model returning average Y:
R2 = 1 - sum_{i=1}^n (y_i - hat{y}_i)^2 / sum_{i=1}^n (y_i - avg(Y))^2.
Parameters: - x – X vector.
- y – Y vector. x and y must have same length not less than 1.
Returns: Instance of UnivariateOLSResult with intercept, var_intercept and cov_slope_intercept set to 0.
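Without an intercept, the least-squares slope is sum(x_i * y_i) / sum(x_i^2). Because R2 is still measured against a model returning the average Y, it can become negative when a line through the origin fits worse than that constant. The pure-Python sketch below illustrates both points; it does not call cppyml and the function name is illustrative.

```python
def ols_without_intercept(x, y):
    """Return (slope, r2) for the model y_i = slope * x_i + e_i."""
    slope = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
    mean_y = sum(y) / len(y)
    rss = sum((b - slope * a) ** 2 for a, b in zip(x, y))
    tss = sum((b - mean_y) ** 2 for b in y)
    return slope, 1.0 - rss / tss


# Data that really pass through the origin: a perfect fit.
good_slope, good_r2 = ols_without_intercept([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# Nearly constant data far from the origin: worse than predicting the mean.
bad_slope, bad_r2 = ols_without_intercept([1.0, 2.0, 3.0], [5.0, 5.0, 6.0])
```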
Logistic regression
Logistic regression algorithms.
-
class cppyml.logistic_regression.ConjugateGradientLogisticRegression
Bases: cppyml.logistic_regression.LogisticRegression
Binomial logistic regression algorithm.
Implements the conjugate gradient algorithm described in Thomas P. Minka, “A comparison of numerical optimizers for logistic regression”.
-
class cppyml.logistic_regression.LogisticRegression
Bases: pybind11_builtins.pybind11_object
-
absolute_tolerance
¶ Absolute tolerance for fitted weights
-
fit
(self: cppyml.logistic_regression.LogisticRegression, X: numpy.ndarray[float64[m, n], flags.c_contiguous], y: numpy.ndarray[float64[m, 1]]) → cppyml.logistic_regression.Result¶ Fits the model and returns the result.
If fitting with intercept is desired, include a column of 1’s in the X values.
Parameters: - X – N x D matrix of X values, with data points in rows.
- y – Y vector with length N. Values should be -1 or 1.
Returns: Instance of Result.
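Two input conventions stated above are easy to get wrong: labels must be -1 or 1 rather than 0 or 1, and an intercept requires an explicit column of 1’s in X. The pure-Python sketch below shows one way to prepare data accordingly, using nested lists for clarity. The helper names are illustrative, and placing the 1’s in the last column is an arbitrary choice for this example.

```python
def to_signed_labels(labels):
    """Map 0/1 class labels to the -1/1 convention."""
    signed = []
    for label in labels:
        if label not in (0, 1):
            raise ValueError("expected labels 0 or 1, got %r" % (label,))
        signed.append(2.0 * label - 1.0)
    return signed


def add_ones_column(rows):
    """Append a constant 1.0 to every data point (row) for the intercept."""
    return [list(row) + [1.0] for row in rows]


features = [[0.5, 1.5], [2.0, -1.0], [3.5, 0.0]]
X = add_ones_column(features)
y = to_signed_labels([0, 1, 1])
```

The resulting X has shape N x (D + 1) with data points in rows; before being passed to fit it would still need converting to a C-contiguous float64 NumPy array, as the signature requires.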
-
lam
¶ Regularisation parameter: inverse variance of the Gaussian prior for w.
-
maximum_steps
¶ Maximum number of steps allowed
-
relative_tolerance
¶ Relative tolerance for fitted weights
-
set_absolute_tolerance
(self: cppyml.logistic_regression.LogisticRegression, absolute_tolerance: float) → None¶ Sets absolute tolerance for weight convergence.
-
set_lam
(self: cppyml.logistic_regression.LogisticRegression, lam: float) → None¶ Sets the regularisation parameter.
-
set_maximum_steps
(self: cppyml.logistic_regression.LogisticRegression, maximum_steps: int) → None¶ Sets maximum number of steps.
-
set_relative_tolerance
(self: cppyml.logistic_regression.LogisticRegression, relative_tolerance: float) → None¶ Sets relative tolerance for weight convergence.
-
-
class cppyml.logistic_regression.Result
Bases: pybind11_builtins.pybind11_object
-
converged
¶ Whether the algorithm converged.
-
predict
(self: cppyml.logistic_regression.Result, X: numpy.ndarray[float64[m, n], flags.c_contiguous]) → numpy.ndarray[float64[m, 1]]¶ Predicts labels for features X given w. Returns the predicted label vector.
-
steps_taken
¶ Number of steps taken to converge.
-
w
¶ Fitted coefficients of the LR model.
-