Predictive and prescriptive analytics to bridge the gap between segmentation and prediction for real-world applications
In many real-world machine learning tasks, the population being studied is often diverse and heterogeneous. This variability presents unique challenges, particularly in regression and classification tasks where a single, generalized model may fail to capture important nuances within the data. For example, segmenting customers in marketing campaigns, estimating the sales of a new product using data from comparable products, or diagnosing a patient with limited medical history based on similar cases all highlight the need for models that can adapt to different subpopulations.
This idea of segmentation is not new. Models like k-Nearest Neighbors or Decision Trees already implicitly leverage the idea of dividing the input space into regions that share substantially similar properties. However, these approaches are often heuristic and do not explicitly optimize for both clustering and prediction simultaneously.
In this article, we approach this challenge from an optimization perspective, following the literature on Predictive and Prescriptive Analytics ([8]). Specifically, we focus on the task of joint clustering and prediction, which seeks to segment the data into clusters while simultaneously fitting a predictive model within each cluster. This approach has gained attention for its ability to bridge the gap between data-driven decision-making and actionable insights, and to extract more information from data than other traditional methods (see [2] for instance).
After presenting some theoretical insights on Clustering and Regression from recent literature, we introduce a novel classification method (Cluster While Classify) and show its superior performance in low-data environments.
1.1 Original optimization problem
We first start by formulating the problem of optimal clustering and regression — jointly — to achieve the best fitting and prediction performance. Some formal notations and assumptions:
- Data has the form (X, Y), where X = (xᵢ) are the features and Y is the target.
- We assume a clustering with k clusters — k can be defined later — and introduce the binary variables zᵢⱼ equal to 1 if the i-th data point is assigned to cluster j, 0 otherwise.
- We assume a class of regression models (fⱼ) (e.g. linear models), parametrized by (θⱼ), and their loss function L. Note that each θⱼ is specific to regression model fⱼ.
As this is ultimately a regression problem, the goal is to find the set of parameters (i.e. parameters for each regression model θⱼ as well as the additional cluster assignment variables zᵢⱼ) minimizing the loss function L:
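The objective appears as an image in the original article; a plausible reconstruction under the notation above (hard one-hot assignments, per-point losses summed over clusters) is:

```latex
\min_{\theta,\, z}\ \sum_{i=1}^{n} \sum_{j=1}^{k} z_{ij}\, L\big(f_j(x_i;\theta_j),\, y_i\big)
\quad \text{s.t.} \quad \sum_{j=1}^{k} z_{ij} = 1 \ \ \forall i, \qquad z_{ij} \in \{0,1\}
```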
1.2 Suboptimality of Cluster Then Regress:
One of the most natural approaches — and one used in numerous practical applications of clustering and regression analyses — is the naive Cluster Then Regress (CTR) approach, i.e. first running clustering and then fitting a regression model on the static result of this clustering. It is known to be suboptimal: in particular, errors propagate from the clustering step to the regression step, and erroneous assignments can have significant consequences on performance.
We will show this suboptimality mathematically. When running a CTR approach, we first assign the clusters, and then fit the k regression models with the cluster assignments held static. This translates to the following nested optimization:
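The nested problem is also shown as an image in the original; a sketch consistent with the text (clustering first minimizes a Total Intra-Cluster variance criterion, then the regressions are fitted with assignments fixed) would be:

```latex
\min_{\theta}\ \sum_{i=1}^{n} \sum_{j=1}^{k} z^{*}_{ij}\, L\big(f_j(x_i;\theta_j),\, y_i\big)
\qquad \text{where} \qquad Z^{*} = (z^{*}_{ij}) \in \arg\min_{Z \in \mathcal{Z}}\ \mathrm{TIC}(Z)
```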
With TIC a measure of Total Intra-Cluster Variance. Given that Z is included in ({0, 1})ⁿ, we see that the CTR approach solves a problem that is actually more constrained than the original one (i.e. further constraining the (zᵢⱼ) to lie in Z rather than being free in ({0, 1})ⁿ). Consequently, this yields a suboptimal result for the original optimization.
1.3 Cluster While Regress: an approximate solution to the original optimization
Unfortunately, trying to directly solve the original optimization presented in section 1.1 can be intractable in practice (a mixed-integer optimization problem, with potential non-linearity coming from the choice of regression models). [1] presents a fast and simple — but approximate — solution to jointly learn the optimal cluster assignments and regression models: doing it iteratively. In practice, Cluster While Regress (CWR) proceeds as follows (a short code sketch follows the list):
- At iteration i, consider the cluster assignments as static and calibrate the k regression models
- Then consider the regression models as static and choose the cluster assignments that minimize the total loss
- Repeat the previous two steps until the cluster assignments no longer change
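A minimal sketch of this alternating procedure, assuming linear sub-models and a squared-error loss (initialization and stopping details differ in [1]):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def cluster_while_regress(X, y, k, n_iter=50, seed=0):
    """Alternate between fitting k regressions and reassigning each point to the
    model with the lowest per-point loss, until assignments stop changing."""
    rng = np.random.default_rng(seed)
    assign = rng.integers(0, k, size=len(y))                 # random initial clusters
    models = [LinearRegression() for _ in range(k)]
    for _ in range(n_iter):
        # Step 1: assignments static -> calibrate one regression per cluster
        for j in range(k):
            mask = assign == j
            X_j, y_j = (X[mask], y[mask]) if mask.any() else (X, y)  # fallback if a cluster is empty
            models[j].fit(X_j, y_j)
        # Step 2: models static -> reassign each point to the model minimizing its loss
        losses = np.column_stack([(y - m.predict(X)) ** 2 for m in models])
        new_assign = losses.argmin(axis=1)
        if np.array_equal(new_assign, assign):               # convergence: assignments unchanged
            break
        assign = new_assign
    return models, assign
```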
Besides the iterative nature of this method, it presents a key difference with the CTR approach: clustering and regression optimize for the same objective function.
Applying the previous reasoning to classification, we have two different routes:
- Rewriting a new model from scratch, i.e. Cluster While Classify
- Using CWR on the log odds for a Logistic Regression approach — see appendix
2.1 Formulating Cluster While Classify:
A few modifications need to be made to the objective problem, namely the loss function L, which becomes a classification loss. For simplicity, we will focus on binary classification, but this formulation can easily be extended.
A popular loss function for binary classification is the binary cross-entropy loss:
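The formula itself is missing from the extracted text; for a single observation with label y and predicted probability p, the standard binary cross-entropy is:

```latex
L(p, y) = -\big(y \log(p) + (1 - y)\log(1 - p)\big)
```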
Where p is the prediction of the classification model parametrized by θ, in terms of the probability of being in class 1.
Introducing the clustering into this loss yields the following optimization model:
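A plausible reconstruction of this objective, writing pⱼ(xᵢ; θⱼ) for the probability predicted by the j-th classifier:

```latex
\min_{\theta,\, z}\ -\sum_{i=1}^{n} \sum_{j=1}^{k} z_{ij} \Big( y_i \log p_j(x_i;\theta_j) + (1 - y_i)\log\big(1 - p_j(x_i;\theta_j)\big) \Big)
\quad \text{s.t.} \quad \sum_{j=1}^{k} z_{ij} = 1, \ \ z_{ij} \in \{0,1\}
```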
Similarly to CWR, we can find an approximate solution to this problem through the same algorithm, i.e. iteratively fitting the clustering and classification steps until convergence.
2.2 Application to Logistic Regression:
In this specific case, the probabilities are of the form:
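For a per-cluster logistic regression (with the intercept absorbed into θⱼ), these would be the usual sigmoid probabilities:

```latex
p_j(x_i;\theta_j) = \sigma(\theta_j^\top x_i) = \frac{1}{1 + e^{-\theta_j^\top x_i}}
```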
Injecting this formula into the objective function of the optimization problem gives:
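Substituting the sigmoid into the cross-entropy objective, a standard simplification gives (with the constraints on the zᵢⱼ unchanged from above):

```latex
\min_{\theta,\, z}\ \sum_{i=1}^{n} \sum_{j=1}^{k} z_{ij} \Big( \log\big(1 + e^{-\theta_j^\top x_i}\big) + (1 - y_i)\, \theta_j^\top x_i \Big)
```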
2.3 Model inference:
Inference with both CWR and CWC models can be done with the following process, described in detail in [1]:
- Infer cluster assignment: fit a multiclass classification model on the data points, with labels being the final cluster assignments. Use this classification model to assign probabilities of belonging to each cluster.
- Prediction: for a given data point, the probability of being in a given class becomes the weighted sum of the probabilities given by each fitted model. This comes from the law of total probability:
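The formula is missing from the extracted text; spelled out, the decomposition reads:

```latex
P(Y_i = 1 \mid X_i) = \sum_{j=1}^{k} P(Y_i = 1 \mid X_i,\, i \in \text{Cluster}_j)\ P(i \in \text{Cluster}_j \mid X_i)
```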
Where P(Yᵢ = 1 | Xᵢ, i ∈ Clusterⱼ) is given by the j-th fitted classification model and P(i ∈ Clusterⱼ | Xᵢ) comes from the cluster assignment classifier.
Generalization to non-integer weights relaxes the integer constraint on the z variables. This corresponds to the case of an algorithm allowing for (probabilistic) assignment to multiple clusters, e.g. Soft K-Means — in this case, assignments become weights between 0 and 1.
The fitting and inference processes are very similar to before, the sole differences being during the fitting phase: calibrating the regression / classification models on each cluster is replaced with calibrating weighted regressions (e.g. Weighted Least Squares) or weighted classifications (e.g. Weighted Logistic Regression — see [4] for an example), with weight matrices Wⱼ = Diag(zᵢⱼ), with i ranging over all the indices such that zᵢⱼ > 0. Note that, unlike methods such as Weighted Least Squares, the weights here are given when fitting the regression.
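In scikit-learn terms, the weighted-fitting step could look like the sketch below, assuming soft assignment weights z of shape (n_samples, k); the function name is hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_weighted_classifiers(X, y, z, C=1.0):
    """Fit one weighted logistic regression per cluster, using the soft assignment
    weights z[:, j] as per-sample weights (points with zero weight are dropped,
    mirroring W_j = Diag(z_ij) restricted to indices with z_ij > 0)."""
    k = z.shape[1]
    models = []
    for j in range(k):
        mask = z[:, j] > 0                      # keep only points with non-zero weight
        clf = LogisticRegression(C=C)           # L2-regularized by default
        clf.fit(X[mask], y[mask], sample_weight=z[mask, j])
        models.append(clf)
    return models
```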
This generalization has two direct implications for the problem at hand:
- Being a less constrained optimization problem, it will naturally yield a better solution, i.e. a lower in-sample loss than the integer-constrained version
- It will be more prone to overfitting, therefore bringing the need for increased regularization
[1] already included a regularization term for the regression coefficients, which corresponds to having regularized fⱼ models: in the case of a Linear Regression, this means for instance that fⱼ is a LASSO or a Ridge rather than a simple OLS.
Yet the proposal here is different, as we suggest additional regularization, this time penalizing the non-zero zᵢⱼ: the rationale is that we want to limit the number of models involved in the fitting / inference of a given data point, to reduce noise and degrees of freedom and thus prevent overfitting.
In practice, we add a new set of binary variables (bᵢⱼ) equal to 1 if zᵢⱼ > 0 and 0 otherwise. This can be written as linear constraints using the big-M method:
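The constraints themselves are missing from the extracted text; since the relaxed zᵢⱼ lie in [0, 1], M = 1 already suffices and the linking constraints could be written as:

```latex
z_{ij} \leq M\, b_{ij}, \qquad b_{ij} \in \{0, 1\}, \qquad M = 1
```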
All in all, we have the two optimization models:
Generalized Cluster While Regress:
Generalized Cluster While Classify:
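Both formulations appear as images in the original article; a plausible reconstruction of the generalized CWC — the generalized CWR is analogous, with the regression loss L in place of the cross-entropy — combines the relaxed objective with a penalty on the bᵢⱼ (λ denotes the penalty weight, a name assumed here):

```latex
\begin{aligned}
\min_{\theta,\, z,\, b}\quad & -\sum_{i=1}^{n} \sum_{j=1}^{k} z_{ij} \Big( y_i \log p_j(x_i;\theta_j) + (1 - y_i)\log\big(1 - p_j(x_i;\theta_j)\big) \Big) + \lambda \sum_{i=1}^{n} \sum_{j=1}^{k} b_{ij} \\
\text{s.t.}\quad & \sum_{j=1}^{k} z_{ij} = 1, \qquad 0 \leq z_{ij} \leq b_{ij}, \qquad b_{ij} \in \{0, 1\}
\end{aligned}
```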
These problems can be solved efficiently with first-order methods or cutting planes — see [3] for details.
We evaluate these methods on 3 different benchmark datasets to illustrate 3 key aspects of their behavior and performance:
- Propensity to overfit
- Better performance when data is imbalanced or asymmetric — i.e. higher penalties in cases of false positives or false negatives
- Better performance in low-data settings
Some implementation details:
- Given that all the methods presented are agnostic to the type of classification model used, we will assume the same classifier across the board to ensure a fair comparison. For simplicity, we choose Logistic Regression with L2 regularization (which is the default setting in Scikit-Learn).
- For Cluster Then Classify (CTC), we use the K-Means clustering algorithm. We choose the number of clusters that maximizes the silhouette score of the clustering.
- For Cluster While Classify (CWC), we choose the number of clusters by cross-validation, i.e. the number of clusters that maximizes the AUC of the ROC curve on a validation dataset. We then re-fit the chosen model on both the train and validation datasets. If the optimal number of clusters is 2, we opt for CWC with integer weights for parsimony.
- Inference for CTC and CWC is done using the Model Inference process presented earlier, i.e. a weighted sum of the probabilities predicted by each sub-model; see the sketch below.
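A minimal sketch of this inference step, assuming a fitted cluster-assignment classifier and a list of per-cluster classifiers (scikit-learn style; the function name is hypothetical, and cluster_clf is assumed to output cluster probabilities in the same order as sub_models):

```python
import numpy as np

def predict_proba_mixture(X, cluster_clf, sub_models):
    """P(y=1 | x) = sum_j P(y=1 | x, cluster j) * P(cluster j | x)."""
    cluster_probs = cluster_clf.predict_proba(X)                                    # shape (n, k)
    class_probs = np.column_stack([m.predict_proba(X)[:, 1] for m in sub_models])  # shape (n, k)
    return (cluster_probs * class_probs).sum(axis=1)                                # shape (n,)
```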
4.1 UCI Diabetes 130 dataset
The Diabetes 130-US Hospitals dataset (1999–2008) ([5]) contains information about diabetes patients admitted to 130 hospitals across the US over a 9-year period. The goal of the classification task is to predict whether a given diabetes patient will be readmitted. We will simplify the classes into 2 — readmitted or not — instead of 3 (readmitted after less than 30 days, readmitted after more than 30 days, not readmitted). We will also consider a subset of 20,000 data points instead of the full 100,000 instances for faster training.
4.2 UCI MAGIC Gamma Telescope dataset
The MAGIC Gamma Telescope dataset ([6]) contains data from an observatory aimed at classifying high-energy cosmic ray events as either gamma rays (signal) or hadrons (background). A specificity of this dataset is the non-symmetric nature of errors: given the higher cost of false positives (misclassifying hadrons as gamma rays), accuracy is not a suitable metric. Instead, performance is evaluated using the ROC curve and AUC, with a focus on maintaining a false positive rate (FPR) below 20% — as explained in [6].
4.3 UCI Parkinsons dataset
The Parkinson's dataset ([7]) contains data collected from voice recordings of 195 individuals, including both people with Parkinson's disease and healthy controls. The dataset is used to classify the presence or absence of Parkinson's based on features extracted from speech signals. A key challenge of this dataset is the low number of data points, which makes generalization with traditional ML methods difficult. We can diagnose this generalization challenge and overfitting by comparing the performance numbers on train vs test sets.
The study of baseline and joint clustering and classification approaches demonstrates that the choice of method depends significantly on the characteristics of the data and the problem setting — in short, there is no one-size-fits-all model.
Our findings highlight key distinctions between the approaches studied across various scenarios:
- In traditional settings, i.e. large datasets, numerous features and balanced outcomes, conventional machine learning models generally perform well. The added step of clustering offers minor benefits, but the potential for overfitting with methods like CWC may lead to worse performance on unseen data.
- In non-traditional settings with asymmetric error penalties, where false positives or false negatives carry unequal costs, methods like CWC show some advantage. By tailoring predictions to cluster-specific dynamics, CWC seems to align better with the loss function's priorities.
- In low-data environments, the benefits of joint clustering and prediction become particularly pronounced. While traditional models and CTC approaches often struggle with overfitting due to insufficient data, CWC outperforms them by extracting more information from what is available. Its iterative optimization framework allows better generalization and robustness in these challenging scenarios.
CWR on the log odds for Logistic Regression
Starting with the log odds of Logistic Regression in the CWR form:
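The expression is missing from the extracted text; under the notation above, the CWR form of the log odds would be:

```latex
\log \frac{P(Y_i = 1 \mid X_i)}{1 - P(Y_i = 1 \mid X_i)} = \sum_{j=1}^{k} z_{ij}\, \theta_j^\top x_i
```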
This yields the probabilities:
Reinjecting these expressions into the likelihood function of Logistic Regression:
And the log-likelihood:
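A sketch of the resulting log-likelihood (to be maximized), assuming hard assignments so that exactly one zᵢⱼ is active per data point:

```latex
\ell(\theta, z) = \sum_{i=1}^{n} \sum_{j=1}^{k} z_{ij} \Big( y_i \log \sigma(\theta_j^\top x_i) + (1 - y_i)\log\big(1 - \sigma(\theta_j^\top x_i)\big) \Big)
```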
This yields the same objective function as CWC when constraining the zᵢⱼ to be binary variables.
References:
[1] L. Baardman, I. Levin, G. Perakis, D. Singhvi, Leveraging Comparables for New Product Sales Forecasting (2018), Wiley
[2] L. Baardman, R. Cristian, G. Perakis, D. Singhvi, O. Skali Lami, L. Thayaparan, The role of optimization in some recent advances in data-driven decision-making (2023), Springer Nature
[3] D. Bertsimas, J. Dunn, Machine Learning Under a Modern Optimization Lens (2021), Dynamic Ideas
[4] G. Zeng, A comprehensive study of coefficient signs in weighted logistic regression (2024), Heliyon
[5] J. Clore, K. Cios, J. DeShazo, B. Strack, Diabetes 130-US Hospitals for Years 1999–2008 [Dataset] (2014), UCI Machine Learning Repository (CC BY 4.0)
[6] R. Bock, MAGIC Gamma Telescope [Dataset] (2004), UCI Machine Studying Repository (CC BY 4.0)
[7] M. Little, Parkinsons [Dataset] (2007), UCI Machine Learning Repository (CC BY 4.0)
[8] D. Bertsimas, N. Kallus, From Predictive to Prescriptive Analytics (2019), INFORMS