Welcome to my series on Causal AI, where we will explore the integration of causal reasoning into machine learning models. Expect to explore a number of practical applications across different business contexts.
In the last article we covered safeguarding demand forecasting with causal graphs. Today, we turn our attention to powering experiments using CUPED and double machine learning.
If you missed the last article on safeguarding demand forecasting, check it out here:
In this article, we evaluate whether CUPED and double machine learning can enhance the effectiveness of your experiments. We will use a case study to explore the following areas:
- The building blocks of experimentation: Hypothesis testing, power analysis, bootstrapping.
- What is CUPED and how can it help power experiments?
- What are the conceptual similarities between CUPED and double machine learning?
- When should we use double machine learning rather than CUPED?
The full notebook can be found here:
Background
You’ve recently joined the experimentation team at a leading online retailer known for its vast product catalog and dynamic user base. The data science team has deployed an advanced recommender system designed to enhance user experience and drive sales. This system integrates in real time with the retailer’s platform and involves significant infrastructure and engineering costs.
The finance team is eager to understand the system’s financial impact, specifically how much additional revenue it generates compared to a baseline scenario without recommendations. To evaluate the recommender system’s effectiveness, you plan to conduct a randomized controlled experiment.
Data-generating process: Pre-experiment
We start by creating some pre-experiment data. The data-generating process we use has the following characteristics:
- 3 observed covariates related to the recency (x_recency), frequency (x_frequency) and value (x_value) of previous sales.
- 1 unobserved covariate, the user’s monthly income (u_income).
- A complex relationship between covariates is used to estimate our target metric, sales value:
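Read off the data-generating code in this section, the relationship can be written as follows, with each covariate on its unscaled 0 to 1 range and ε drawn from a standard normal distribution. The symbols are shorthand introduced here for illustration:

```latex
y = \max\left(0,\; 1.5\,u_{\text{income}} + 2.5\,x_{\text{recency}} + x_{\text{frequency}}^{3} + x_{\text{value}}^{2} + x_{\text{recency}}\,x_{\text{frequency}} + \varepsilon\right)
```

Every variable, including y, is then multiplied by 1000 for interpretation.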
The Python code below is used to create the pre-experiment data:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
np.random.seed(123)
n = 10000 # Set number of observations
p = 4 # Set number of pre-experiment covariates
# Create pre-experiment covariates
X = np.random.uniform(size=n * p).reshape((n, -1))
# Nuisance parameters
b = (
    1.5 * X[:, 0] +
    2.5 * X[:, 1] +
    X[:, 2] ** 3 +
    X[:, 3] ** 2 +
    X[:, 1] * X[:, 2]
)
# Create some noise
noise = np.random.normal(size=n)
# Calculate outcome (floored at zero so sales value is never negative)
y = np.maximum(b + noise, 0)
# Scale variables for interpretation
df_pre = pd.DataFrame({"noise": noise * 1000,
                       "u_income": X[:, 0] * 1000,
                       "x_recency": X[:, 1] * 1000,
                       "x_frequency": X[:, 2] * 1000,
                       "x_value": X[:, 3] * 1000,
                       "y_value": y * 1000
                       })
# Visualise target metric
sns.histplot(df_pre['y_value'], bins=30, kde=False)
plt.xlabel('Sales Value')
plt.ylabel('Frequency')
plt.title('Sales Value')
plt.show()
Before we get onto CUPED, I thought it would be worthwhile covering some foundational knowledge on experimentation.
Hypothesis testing
Hypothesis testing helps determine if observed differences in an experiment are statistically significant or just random noise. In our experiment, we divide users into two groups:
- Control Group: Receives no recommendations.
- Treatment Group: Receives personalised recommendations from the system.
We define our hypotheses as follows:
- Null Hypothesis (H₀): The recommender system does not affect revenue. Any observed differences are due to chance.
- Alternative Hypothesis (Hₐ): The recommender system increases revenue. Users receiving recommendations generate significantly more revenue compared to those who don’t.
To assess the hypotheses you will be comparing the mean revenue in the control and treatment group. However, there are a few things to be aware of:
- Type I error (False positive): If the experiment concludes that the recommender system significantly increases revenue when in reality, it has no effect.
- Type II error (False negative, probability Beta): If the experiment finds no significant increase in revenue from the recommender system when in reality, it does lead to a meaningful increase.
- Significance Level (Alpha): If you set the significance level to 0.05, you are accepting a 5% chance of incorrectly concluding that the recommender system improves revenue when it doesn’t (false positive).
- Power (1 - Beta): Achieving a power of 0.80 means you have an 80% chance of detecting a significant increase in revenue due to the recommender system if it truly has an effect. A higher power reduces the risk of false negatives.
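To make these four quantities concrete, here is a minimal simulation sketch. It assumes normally distributed revenue with a made-up mean of 100 and standard deviation of 20, and uses a two-sided t-test from SciPy; the helper `rejection_rate` and all the numbers are illustrative assumptions, not part of the case study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def rejection_rate(true_lift, n=500, n_sims=2000, alpha=0.05):
    """Share of simulated experiments in which a two-sided t-test rejects H0."""
    rejections = 0
    for _ in range(n_sims):
        # Hypothetical revenue per user in each group
        control = rng.normal(loc=100.0, scale=20.0, size=n)
        treatment = rng.normal(loc=100.0 + true_lift, scale=20.0, size=n)
        _, p_value = stats.ttest_ind(treatment, control)
        rejections += int(p_value < alpha)
    return rejections / n_sims

# H0 is true: every rejection is a false positive (Type I error)
type_1_rate = rejection_rate(true_lift=0.0)
# Ha is true: every rejection is a correct detection, so this is the power
power = rejection_rate(true_lift=4.0)

print(f"Type I error rate: {type_1_rate:.3f}")
print(f"Power: {power:.3f}, Type II error rate: {1 - power:.3f}")
```

With no true lift, the rejection rate should sit close to alpha. With a true lift, the rejection rate is the power and its complement is the Type II error rate.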
As you start to think about designing the experiment, you set some initial goals:
- You want to reliably detect the effect: making sure you balance the risk of detecting a non-existent effect against the risk of not detecting a real effect.
- As quickly as possible: Finance are on your case!
- Keeping the sample size as cost efficient as possible: the business case from the data science team suggests the system is going to drive a large increase in revenue, so they don’t want the control group being too big.
But how can you meet these goals? Let’s delve into power analysis next!
Power analysis
When we talk about powering experiments, we are usually referring to the process of determining the minimum sample size needed to detect an effect of a certain size with a given confidence. There are 3 components to power analysis:
- Effect size: The difference between the mean value of H₀ and Hₐ. We generally need to make sensible assumptions around this based on understanding what matters to the business/industry we are operating within.
- Significance level: The probability of incorrectly concluding there is an effect when there isn’t, typically set at 0.05.
- Power: The probability of correctly detecting an effect when there is one, typically set at 0.80.
I found the intuition behind these quite hard to grasp at first, but visualising it can really help. So let’s give it a try! The key areas are where H₀ and Hₐ cross over. See if it helps you tie together the components discussed above…
A larger sample size leads to a smaller standard error. With a smaller standard error, the sampling distributions of H₀ and Hₐ become narrower and overlap less. This reduced overlap makes it easier to detect a difference, leading to higher power.
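The link between sample size, standard error and power can be sketched numerically. The snippet below uses a normal approximation to the two-sample test and a hypothetical standardised effect size of 0.1; both are assumptions made for illustration rather than values from the case study.

```python
import numpy as np
from scipy.stats import norm

effect_size = 0.1  # hypothetical Cohen's d, for illustration only
alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)  # two-sided critical value

powers = {}
for n_per_group in [250, 500, 1000, 2000, 4000]:
    # Standard error of the standardised difference in means (equal group sizes)
    standard_error = np.sqrt(2 / n_per_group)
    # Normal approximation to power, ignoring the negligible opposite tail
    powers[n_per_group] = norm.cdf(effect_size / standard_error - z_crit)
    print(f"n={n_per_group:>5}  SE={standard_error:.4f}  power={powers[n_per_group]:.2f}")
```

Each doubling of the sample size shrinks the standard error by a factor of the square root of 2, and power climbs accordingly.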
The function below shows how we can use the statsmodels Python package to carry out a power analysis:
from typing import Union
import pandas as pd
import numpy as np
import statsmodels.stats.power as smp
def power_analysis(metric: Union[np.ndarray, pd.Series], exp_perc_change: float, alpha: float = 0.05, power: float = 0.80) -> int:
    '''
    Perform a power analysis to determine the minimum sample size required for a given metric.
    Args:
        metric (np.ndarray or pd.Series): Array or Series containing the metric values for the control group.
        exp_perc_change (float): The expected percentage change in the metric for the test group.
        alpha (float, optional): The significance level for the test. Defaults to 0.05.
        power (float, optional): The desired power of the test. Defaults to 0.80.
    Returns:
        int: The minimum sample size required for each group to detect the expected percentage change with the specified power and significance level.
    Raises:
        ValueError: If `metric` is not a NumPy array or pandas Series.
    '''
    # Validate input types
    if not isinstance(metric, (np.ndarray, pd.Series)):
        raise ValueError("metric should be a NumPy array or pandas Series.")
    # Calculate statistics
    control_mean = metric.mean()
    control_std = np.std(metric, ddof=1) # Use ddof=1 for sample standard deviation
    test_mean = control_mean * (1 + exp_perc_change)
    test_std = control_std # Assume the test group has the same standard deviation as the control group
    # Calculate (Cohen's d) effect size
    mean_diff = control_mean - test_mean
    pooled_std = np.sqrt((control_std**2 + test_std**2) / 2)
    effect_size = abs(mean_diff / pooled_std) # Cohen's d should be positive
    # Run power analysis (solves for the per-group sample size)
    analysis = smp.TTestIndPower()
    sample_size = round(analysis.solve_power(effect_size=effect_size, alpha=alpha, power=power))
    print(f"Control mean: {round(control_mean, 3)}")
    print(f"Control std: {round(control_std, 3)}")
    print(f"Min sample size: {sample_size}")
    return sample_size
So let’s test it out with our pre-experiment data!
exp_perc_change = 0.05 # Set the expected percentage change in the chosen metric caused by the treatment
min_sample_size = power_analysis(df_pre["y_value"], exp_perc_change)
We can see that given the distribution of our target metric, we would need a sample size of 1,645 to detect an increase of 5%.
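As a rough cross-check on figures like this, the per-group sample size for a two-sided, two-sample test has a well-known normal-approximation formula: n ≈ 2(z₁₋α/₂ + z_power)² / d², where d is Cohen's d. The sketch below implements it; `approx_sample_size` is a hypothetical helper, and because statsmodels solves the t-distribution version, its answer will typically be slightly larger.

```python
import math
from scipy.stats import norm

def approx_sample_size(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Normal-approximation sample size per group for a two-sided two-sample test."""
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value for the significance level
    z_power = norm.ppf(power)          # quantile matching the desired power
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)

for d in [0.05, 0.1, 0.2]:
    print(f"d={d}: n per group ~ {approx_sample_size(d)}")
```

Halving the effect size roughly quadruples the required sample size, which is why anything that raises the standardised effect size (such as reducing variance) is so valuable.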
Data-generating process: Experimental data
Rather than rush into setting up the experiment, you decide to take the pre-experiment data and simulate the experiment.
The following function randomly selects users to be treated and applies a treatment effect. At the end of the function we record the mean difference before and after the treatment was applied as well as the true ATE (average treatment effect):
def exp_data_generator(t_perc_change, t_samples):
    # Create copy of pre-experiment data ready to manipulate into experiment data
    df_exp = df_pre.reset_index(drop=True)
    # Calculate the initial treatment effect
    treatment_effect = round((df_exp["y_value"] * (t_perc_change)).mean(), 2)
    # Create treatment column
    treated_indices = np.random.choice(df_exp.index, size=t_samples, replace=False)
    df_exp["treatment"] = 0
    df_exp.loc[treated_indices, "treatment"] = 1
    # Treatment effect (float column so the effect can be assigned without a dtype change)
    df_exp["treatment_effect"] = 0.0
    df_exp.loc[df_exp["treatment"] == 1, "treatment_effect"] = treatment_effect
    # Apply treatment effect
    df_exp["y_value_exp"] = df_exp["y_value"]
    df_exp.loc[df_exp["treatment"] == 1, "y_value_exp"] = df_exp["y_value"] + df_exp["treatment_effect"]
    # Calculate mean diff before treatment
    mean_t0_pre = df_exp[df_exp["treatment"] == 0]["y_value"].mean()
    mean_t1_pre = df_exp[df_exp["treatment"] == 1]["y_value"].mean()
    mean_diff_pre = round(mean_t1_pre - mean_t0_pre)
    # Calculate mean diff after treatment
    mean_t0_post = df_exp[df_exp["treatment"] == 0]["y_value_exp"].mean()
    mean_t1_post = df_exp[df_exp["treatment"] == 1]["y_value_exp"].mean()
    mean_diff_post = round(mean_t1_post - mean_t0_post)
    # Calculate ATE
    treatment_effect = round(df_exp[df_exp["treatment"]==1]["treatment_effect"].mean())
    print(f"Diff-in-means before treatment: {mean_diff_pre}")
    print(f"Diff-in-means after treatment: {mean_diff_post}")
    print(f"ATE: {treatment_effect}")
    return df_exp
We can feed through the minimum sample size we previously calculated:
np.random.seed(123)
df_exp_1 = exp_data_generator(exp_perc_change, min_sample_size)
Let’s start by inspecting the data we created for treated users to help you understand what the function is doing:
Next let’s take a look at the results which the function prints:
Interesting, we see that after we select users to be treated, but before we treat them, there is already a difference in means. This difference is due to chance. It means that when we look at the difference after users are treated, we don’t correctly estimate the ATE (average treatment effect). We will come back to this point when we cover CUPED.
Next let’s explore a more sophisticated way of making an inference than just taking the difference in means…
Bootstrapping
Bootstrapping is a powerful statistical technique that involves resampling data with replacement. These resampled datasets, called bootstrap samples, help us estimate the variability of a statistic (like the mean or median) from our original data. This is particularly attractive when it comes to experimentation as it enables us to calculate confidence intervals. Let’s walk through it step by step using a simple example…
You have run an experiment with a control and treatment group each made up of 1k users.
- Create bootstrap samples: Randomly select (with replacement) 1k users from the control and then the treatment group. This gives us 1 bootstrap sample for control and one for treatment.
- Repeat this process n times (e.g. 10k times).
- For each pair of bootstrap samples calculate the mean difference between control and treatment.
- We now have a distribution (made up of the mean difference between 10k bootstrap samples) which we can use to calculate confidence intervals.
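The steps above can be sketched by hand with NumPy. The two groups below are hypothetical normally distributed revenues with made-up means and spread, used only to show the mechanics of a percentile bootstrap:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical revenue per user for two groups of 1k users (illustrative numbers only)
control = rng.normal(loc=100.0, scale=30.0, size=1000)
treatment = rng.normal(loc=105.0, scale=30.0, size=1000)

n_resamples = 10_000
boot_diffs = np.empty(n_resamples)
for i in range(n_resamples):
    # Resample each group with replacement, keeping the original group size
    control_sample = rng.choice(control, size=control.size, replace=True)
    treatment_sample = rng.choice(treatment, size=treatment.size, replace=True)
    # Store the difference in means for this pair of bootstrap samples
    boot_diffs[i] = treatment_sample.mean() - control_sample.mean()

# Percentile confidence interval from the bootstrap distribution
ci_lower, ci_upper = np.percentile(boot_diffs, [2.5, 97.5])
observed_diff = treatment.mean() - control.mean()
print(f"Observed diff: {observed_diff:.2f}")
print(f"95% CI: [{ci_lower:.2f}, {ci_upper:.2f}]")
```

The width of the interval reflects the standard error of the difference in means, so anything that reduces the variance of the metric will narrow it.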
Applying it to our case study
Let’s use our case study to illustrate how it works. Below we use the SciPy stats Python package to help calculate bootstrap confidence intervals:
from typing import Union
import pandas as pd
import numpy as np
from scipy import stats
def mean_diff(group_a: Union[np.ndarray, pd.Series], group_b: Union[np.ndarray, pd.Series]) -> float:
    '''
    Calculate the difference in means between two groups.
    Args:
        group_a (Union[np.ndarray, pd.Series]): The first group of data points.
        group_b (Union[np.ndarray, pd.Series]): The second group of data points.
    Returns:
        float: The difference between the mean of group_a and the mean of group_b.
    '''
    return np.mean(group_a) - np.mean(group_b)
def bootstrapping(df: pd.DataFrame, adjusted_metric: str, n_resamples: int = 10000) -> np.ndarray:
    '''
    Perform bootstrap resampling on the adjusted metric of two groups in the dataframe to estimate the mean difference and confidence intervals.
    Args:
        df (pd.DataFrame): The dataframe containing the data. Must include a 'treatment' column indicating group membership.
        adjusted_metric (str): The name of the column in the dataframe representing the metric to be resampled.
        n_resamples (int, optional): The number of bootstrap resamples to perform. Defaults to 10000.
    Returns:
        np.ndarray: The array of bootstrap resampled mean differences.
    '''
    # Separate the data into two groups based on the 'treatment' column
    group_a = df[df["treatment"] == 1][adjusted_metric]
    group_b = df[df["treatment"] == 0][adjusted_metric]
    # Perform bootstrap resampling
    res = stats.bootstrap((group_a, group_b), statistic=mean_diff, n_resamples=n_resamples, method='percentile')
    ci = res.confidence_interval
    # Extract the bootstrap distribution and confidence intervals
    bootstrap_means = res.bootstrap_distribution
    bootstrap_ci_lb = round(ci.low)
    bootstrap_ci_ub = round(ci.high)
    bootstrap_mean = round(np.mean(bootstrap_means))
    print(f"Bootstrap confidence interval lower bound: {bootstrap_ci_lb}")
    print(f"Bootstrap confidence interval upper bound: {bootstrap_ci_ub}")
    print(f"Bootstrap mean diff: {bootstrap_mean}")
    return bootstrap_means
When we run it for our case study data we can see that we now have some confidence intervals:
bootstrap_og_1 = bootstrapping(df_exp_1, "y_value_exp")
Our ground truth ATE is 143 (the actual treatment effect from our experiment data generator function), which falls within our confidence intervals. However, it’s worth noting that the mean difference hasn’t changed (it’s still 93, as before when we simply calculated the mean difference of control and treatment), and the pre-treatment difference is still there.
So what if we wanted to come up with narrower confidence intervals? And is there any way we can deal with the pre-treatment differences? This leads us nicely into CUPED…
Background
CUPED (controlled experiments using pre-experiment data) is a powerful technique, developed by researchers at Microsoft, for improving the accuracy of experiments. The original paper is an insightful read for anyone interested in experimentation:
https://ai.stanford.edu/~ronnyk/2009controlledExperimentsOnTheWebSurvey.pdf
The core idea of CUPED is to use data collected before your experiment begins to reduce the variance in your target metric. By doing so, you can make your experiment more sensitive, which has two major benefits:
- You can detect smaller effects with the same sample size.
- You can detect the same effect with a smaller sample size.
Think of it like removing the “background noise” so you can see the “signal” more clearly.
Variance, standard deviation, standard error
When you read about CUPED you may hear people talk about it reducing the variance, standard deviation or standard error. If you are anything like me, you might find yourself forgetting how these are related, so before we go any further let’s recap on this!
- Variance: Variance measures the average squared deviation of each data point from the mean, reflecting the overall spread or dispersion within a dataset.
- Standard Deviation: Standard deviation is the square root of variance, representing the typical distance of each data point from the mean, and providing a more interpretable measure of spread.
- Standard Error: Standard error quantifies the precision of the sample mean as an estimate of the population mean, calculated as the standard deviation divided by the square root of the sample size.
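A few lines of NumPy show how the three quantities chain together. The sample below is hypothetical (400 draws with a made-up mean of 50 and standard deviation of 10):

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.normal(loc=50.0, scale=10.0, size=400)  # hypothetical metric values

variance = np.var(sample, ddof=1)            # average squared deviation from the mean
std_dev = np.sqrt(variance)                  # back in the same units as the metric
std_error = std_dev / np.sqrt(sample.size)   # precision of the sample mean

print(f"variance={variance:.1f}, std={std_dev:.2f}, se={std_error:.3f}")
```

Because the standard error is derived from the variance, reducing the variance of the metric reduces the standard deviation and the standard error with it.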
How does CUPED work?
To understand how CUPED works, let’s break it down…
Pre-experiment covariate: In the lightest implementation of CUPED, the pre-experiment covariate would be the target metric measured in a time period before the experiment. So if your target metric was sales value, your covariate could be each customer’s sales value 4 weeks prior to the experiment.
It’s important that your covariate is correlated with your target metric and that it is unaffected by the treatment. This is why we would usually use pre-treatment data from the control group.
Regression adjustment: Linear regression is used to model the relationship between the covariate (measured before the experiment) and the target metric (measured during the experiment period). We can then calculate the CUPED adjusted target metric by removing the influence of the covariate:
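In the standard single-covariate formulation of CUPED, which is assumed here to be the form the text refers to, the adjusted metric is:

```latex
Y_i^{\text{cuped}} = Y_i - \theta\,(X_i - \bar{X}), \qquad \theta = \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)}
```

Here Y is the target metric, X is the pre-experiment covariate, X̄ is its mean, and θ is the slope from regressing Y on X.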
It’s worth noting that taking away the mean of the covariate is done to centre the outcome variable around the mean to make it interpretable when compared to the original target metric.
Variance reduction: After the regression adjustment the variance in our target metric has reduced. Lower variance means that the differences between the control and treatment group are easier to detect, thus increasing the statistical power of the experiment.
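Here is a minimal single-covariate sketch of that adjustment on hypothetical data (pre-period sales as the covariate, experiment-period sales as the target; all numbers are made up). It assumes the usual choice of theta as the regression slope, in which case the variance shrinks by a factor of 1 minus the squared correlation between covariate and target:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5000

# Hypothetical data: pre-period sales (covariate) and experiment-period sales (target)
x_pre = rng.normal(loc=100.0, scale=20.0, size=n)
y = 0.8 * x_pre + rng.normal(loc=0.0, scale=10.0, size=n)

# theta is the OLS slope of y on x_pre
theta = np.cov(x_pre, y, ddof=1)[0, 1] / np.var(x_pre, ddof=1)

# Remove the part of y explained by the centred covariate
y_cuped = y - theta * (x_pre - x_pre.mean())

rho = np.corrcoef(x_pre, y)[0, 1]
var_ratio = np.var(y_cuped, ddof=1) / np.var(y, ddof=1)
print(f"Mean before: {y.mean():.2f}, after: {y_cuped.mean():.2f}")
print(f"Variance ratio: {var_ratio:.3f}, 1 - rho^2: {1 - rho ** 2:.3f}")
```

The mean is unchanged because the covariate is centred before it is subtracted, while the variance drops in line with how strongly the covariate correlates with the target.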
Applying it to our case study
Let’s use our case study to illustrate how it works. Below we code CUPED up in a function:
from typing import Union
import pandas as pd
import numpy as np
import statsmodels.api as sm
def cuped(df: pd.DataFrame, pre_covariates: Union[str, list], target_metric: str) -> pd.Series:
    '''
    Implements the CUPED (Controlled Experiments Using Pre-Experiment Data) technique to adjust the target metric
    by removing predictable variation using pre-experiment covariates. This reduces the variance of the metric and
    increases the statistical power of the experiment.
    Args:
        df (pd.DataFrame): The input DataFrame containing both the pre-experiment covariates and the target metric.
        pre_covariates (Union[str, list]): The column name(s) in the DataFrame corresponding to the pre-experiment covariates used for the adjustment.
        target_metric (str): The column name in the DataFrame representing the metric to be adjusted.
    Returns:
        pd.Series: A pandas Series containing the CUPED-adjusted target metric.
    '''
    # Accept a single column name as well as a list of names
    if isinstance(pre_covariates, str):
        pre_covariates = [pre_covariates]
    # Fit control model using pre-experiment covariates
    control_group = df[df['treatment'] == 0]
    X_control = control_group[pre_covariates]
    X_control = sm.add_constant(X_control)
    y_control = control_group[target_metric]
    model_control = sm.OLS(y_control, X_control).fit()
    # Compute residuals and adjust target metric
    X_all = df[pre_covariates]
    X_all = sm.add_constant(X_all)
    residuals = df[target_metric].to_numpy().flatten() - model_control.predict(X_all)
    # Add back the prediction at the covariate means so the metric stays on its original scale
    adjustment_term = model_control.params['const'] + sum(model_control.params[covariate] * df[pre_covariates].mean()[covariate] for covariate in pre_covariates)
    adjusted_target = residuals + adjustment_term
    return adjusted_target
When we apply it to our case study data and compare the adjusted target metric to the original target metric, we see that the variance has reduced:
# Apply CUPED
pre_covariates = ["x_recency", "x_frequency", "x_value"]
target_metric = ["y_value_exp"]
df_exp_1["adjusted_target"] = cuped(df_exp_1, pre_covariates, target_metric)
# Plot results
plt.figure(figsize=(10, 6))
sns.kdeplot(data=df_exp_1[df_exp_1['treatment'] == 0], x="adjusted_target", hue="treatment", fill=True, palette="Set1", label="Adjusted Value")
sns.kdeplot(data=df_exp_1[df_exp_1['treatment'] == 0], x="y_value_exp", hue="treatment", fill=True, palette="Set2", label="Original Value")
plt.title("Distribution of Value by Original vs CUPED")
plt.xlabel("Value")
plt.ylabel("Density")
plt.legend(title="Distribution")
plt.show()
Does it reduce the standard error?
Now we have applied CUPED and reduced the variance, let’s run our bootstrapping function to see what impact it has:
bootstrap_cuped_1 = bootstrapping(df_exp_1, "adjusted_target")
If you compare this to our previous result using the original target metric you see that the confidence intervals are narrower:
bootstrap_1 = pd.DataFrame({
    'original': bootstrap_og_1,
    'cuped': bootstrap_cuped_1
})
# Plot the KDE plots
plt.figure(figsize=(10, 6))
sns.kdeplot(bootstrap_1['original'], fill=True, label='Original', color='blue')
sns.kdeplot(bootstrap_1['cuped'], fill=True, label='CUPED', color='orange')
# Add mean lines
plt.axvline(bootstrap_1['original'].mean(), color='blue', linestyle='--', linewidth=1)
plt.axvline(bootstrap_1['cuped'].mean(), color='orange', linestyle='--', linewidth=1)
plt.axvline(round(df_exp_1[df_exp_1["treatment"]==1]["treatment_effect"].mean(), 3), color='green', linestyle='--', linewidth=1, label='Treatment effect')
# Customise the plot
plt.title('Distribution of Value by Original vs CUPED')
plt.xlabel('Value')
plt.ylabel('Density')
plt.legend()
# Show the plot
plt.show()
The bootstrap difference in means also moves closer to the ground truth treatment effect. This is because CUPED is also very effective at dealing with pre-existing differences between the control and treatment group.
Does it reduce the minimum sample size?
The next question is: does it reduce the minimum sample size we need? Well, let’s find out!
treatment_effect_1 = round(df_exp_1[df_exp_1["treatment"]==1]["treatment_effect"].mean(), 2)
cuped_sample_size = power_analysis(df_exp_1[df_exp_1['treatment'] == 0]['adjusted_target'], treatment_effect_1 / df_exp_1[df_exp_1['treatment'] == 0]['adjusted_target'].mean())
The minimum sample size needed has reduced from 1,645 to 901. Both Finance and the Data Science team are going to be pleased as we can run the experiment for a shorter time period with a smaller control sample!
Background
When I first read about CUPED, I thought of double machine learning and the similarities. If you aren’t familiar with double machine learning, check out my article from earlier in the series:
Pay attention to the first stage outcome model in double machine learning:
- Outcome model (de-noising): Machine learning model used to estimate the outcome using just the control features. The outcome model residuals are then calculated.
This is conceptually very similar to what we are doing with CUPED!
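To make the parallel concrete, here is a hand-rolled sketch of the residual-on-residual idea behind double machine learning. It is not EconML's implementation: the data are simulated with a made-up true effect of 0.5, and plain linear regressions with scikit-learn's `cross_val_predict` stand in for the first stage models.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(3)
n = 4000

# Hypothetical experiment: three covariates, randomised treatment, true effect of 0.5
X = rng.uniform(size=(n, 3))
T = rng.integers(0, 2, size=n).astype(float)
true_effect = 0.5
Y = 2.0 * X[:, 0] + X[:, 1] ** 2 + true_effect * T + rng.normal(scale=1.0, size=n)

# First stage: out-of-fold predictions of the outcome and of the treatment from X
y_res = Y - cross_val_predict(LinearRegression(), X, Y, cv=5)
t_res = T - cross_val_predict(LinearRegression(), X, T, cv=5)

# Second stage: regress the outcome residuals on the treatment residuals
ate_hat = (t_res @ y_res) / (t_res @ t_res)

print(f"Variance of Y: {Y.var():.3f}, variance of outcome residuals: {y_res.var():.3f}")
print(f"Estimated ATE: {ate_hat:.3f} (true effect: {true_effect})")
```

The outcome residuals play the same role as the CUPED-adjusted metric: the variation explained by the covariates has been stripped out before the treatment effect is estimated.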
How does it compare to CUPED?
Let’s feed through our case study data and see if we get a similar result:
from econml.dml import LinearDML
# Train DML model
dml = LinearDML(discrete_treatment=False)
dml.fit(df_exp_1[target_metric].to_numpy().ravel(), T=df_exp_1['treatment'].to_numpy().ravel(), X=df_exp_1[pre_covariates], W=None)
ate_dml = round(dml.ate(df_exp_1[pre_covariates]))
ate_dml_lb = round(dml.ate_interval(df_exp_1[pre_covariates])[0])
ate_dml_ub = round(dml.ate_interval(df_exp_1[pre_covariates])[1])
print(f'DML confidence interval lower bound: {ate_dml_lb}')
print(f'DML confidence interval upper bound: {ate_dml_ub}')
print(f'DML ate: {ate_dml}')
We get an almost identical result!
When we plot the residuals we can see that the variance is reduced like in CUPED (although we don’t add the mean to scale for interpretation):
# Fit outcome model using pre-experiment covariates
X_all = df_exp_1[pre_covariates]
X_all = sm.add_constant(X_all)
y_all = df_exp_1[target_metric]
outcome_model = sm.OLS(y_all, X_all).fit()
# Compute residuals
df_exp_1['outcome_residuals'] = df_exp_1[target_metric].to_numpy().flatten() - outcome_model.predict(X_all)
# Plot results
plt.figure(figsize=(10, 6))
sns.kdeplot(data=df_exp_1[df_exp_1['treatment'] == 0], x="outcome_residuals", hue="treatment", fill=True, palette="Set1", label="Adjusted Target")
sns.kdeplot(data=df_exp_1[df_exp_1['treatment'] == 0], x="y_value_exp", hue="treatment", fill=True, palette="Set2", label="Original Value")
plt.title("Distribution of Value by Original vs DML")
plt.xlabel("Value")
plt.ylabel("Density")
plt.legend(title="Distribution")
plt.show()
“So what?” I hear you ask!
Firstly, I think it’s an interesting observation for anyone using double machine learning: the first stage outcome model helps reduce the variance, and therefore we should get similar benefits to CUPED.
Secondly, it raises the question of when each method is appropriate. Let’s close things off by covering this question…
There are several reasons why it may make sense to tend towards CUPED:
- It’s easier to understand.
- It’s simpler to implement.
- It’s one model rather than three, meaning you have fewer challenges with overfitting.
However, there are a couple of exceptions where double machine learning outperforms CUPED:
- Biased treatment assignment: When the treatment assignment is biased, for example when you are using observational data, double machine learning can deal with this. My article from earlier in the series builds on this:
- Heterogeneous treatment effects: When you want to understand effects at an individual level, for example finding out who it is worth sending discounts to, double machine learning can help with this. There is a good case study which illustrates this in my earlier article on optimising treatment strategies:
Today we did a whistle-stop tour of experimentation, covering hypothesis testing, power analysis and bootstrapping. We then explored how CUPED can reduce the standard error and increase the power of our experiments. Finally, we touched on its similarities to double machine learning and discussed when each method should be used. There are a few additional key points which are worth mentioning in terms of CUPED:
- We don’t have to use linear regression: if we have multiple covariates, maybe some with non-linear relationships, we could use a machine learning technique like boosting.
- If we do go down the route of using a machine learning technique, we need to make sure not to overfit the data.
- Some careful thought should go into when to run CUPED: are you going to run it before you start your experiment and then run a power analysis to determine your reduced sample size? Or are you just going to run it after your experiment to reduce the standard error?
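As a sketch of the first two points, the adjustment below swaps linear regression for gradient boosting and uses out-of-fold predictions as one possible guard against overfitting. The data are hypothetical, and the model settings and scikit-learn helpers are choices made for this illustration rather than a recommendation from the case study:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(11)
n = 2000

# Hypothetical pre-experiment covariates with a non-linear link to the target
X_pre = rng.uniform(size=(n, 3))
y = 3.0 * X_pre[:, 0] ** 3 + 2.0 * X_pre[:, 1] * X_pre[:, 2] + rng.normal(scale=0.5, size=n)

# Out-of-fold predictions: each row is predicted by a model that never saw it,
# so an overfitted model cannot shrink the variance artificially
model = GradientBoostingRegressor(n_estimators=50, max_depth=2, random_state=0)
y_hat = cross_val_predict(model, X_pre, y, cv=5)

# Remove the predictable part, re-centred so the mean of the metric is unchanged
y_adjusted = y - (y_hat - y_hat.mean())

var_ratio = y_adjusted.var() / y.var()
print(f"Mean before: {y.mean():.3f}, after: {y_adjusted.mean():.3f}")
print(f"Variance ratio after adjustment: {var_ratio:.3f}")
```

In a real experiment the covariates would still need to be measured before treatment, exactly as with the linear version.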