It might be surprising, but in this article I want to talk about the knapsack problem, a classic optimisation problem that has been studied for over a century. According to Wikipedia, the problem is defined as follows:
Given a set of items, each with a weight and a value, determine which items to include in the collection so that the total weight is less than or equal to a given limit and the total value is as large as possible.
While product analysts may not physically pack knapsacks, the underlying mathematical model is highly relevant to many of our tasks. There are numerous real-world applications of the knapsack problem in product analytics. Here are a few examples:
- Marketing campaigns: The marketing team has a limited budget and capacity to run campaigns across different channels and regions. Their goal is to maximise a KPI, such as the number of new users or revenue, while adhering to the existing constraints.
- Retail space optimisation: A retailer with limited physical space in their stores seeks to optimise product placement to maximise revenue.
- Product launch prioritisation: When launching a new product, the operations team's capacity might be limited, requiring prioritisation of specific markets.
Such tasks are quite common, and many analysts encounter them regularly. So, in this article, I will explore different approaches to solving the problem, ranging from naive, simple methods to more advanced techniques such as linear programming.
Another reason I chose this topic is that linear programming is one of the most powerful and popular tools in prescriptive analytics, a type of analysis that focuses on providing stakeholders with actionable options to make informed decisions. As such, I believe it's an essential skill for any analyst to have in their toolkit.
Let's dive straight into the case we'll be exploring. Imagine we're part of a marketing team planning activities for the upcoming month. Our objective is to maximise key performance indicators (KPIs), such as the number of acquired users and revenue, while operating within a limited marketing budget.
We've estimated the expected outcomes of various marketing activities across different countries and channels. Here is the data we have:
- country: the market where we can run promotional activities;
- channel: the acquisition method, such as social networks or influencer campaigns;
- users: the expected number of users acquired within a month of the promo campaign;
- cs_contacts: the incremental Customer Support contacts generated by the new users;
- marketing_spending: the investment required for the activity;
- revenue: the first-year LTV generated from acquired customers.
Note that the dataset is synthetic and randomly generated, so don't try to infer any market-related insights from it.
First, I calculated the high-level statistics to get a view of the numbers.
Let's determine the optimal set of marketing activities that maximises revenue while staying within the $30M marketing budget.
At first glance, the problem may seem straightforward: we could calculate all possible combinations of marketing activities and select the optimal one. However, it would be a challenging task.
With 62 segments, there are 2⁶² possible combinations, as each segment can either be included or excluded. This results in approximately 4.6*10¹⁸ combinations, an astronomical number.
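The size of the search space is easy to verify directly:

```python
# each of the 62 segments is either in or out of the plan,
# so the number of possible subsets is 2 to the power of 62
print(2 ** 62)  # 4611686018427387904, roughly 4.6 * 10**18
```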
To better understand the computational feasibility, let's take a smaller subset of 15 segments and estimate the time required per iteration.
import itertools
import pandas as pd
import tqdm

# reading data
df = pd.read_csv('marketing_campaign_estimations.csv', sep='\t')
df['segment'] = df.country + ' - ' + df.channel

# calculating combinations
combinations = []
segments = list(df.segment.values)[:15]
print('number of segments: ', len(segments))
for num_items in range(len(segments) + 1):
    combinations.extend(
        itertools.combinations(segments, num_items)
    )
print('number of combinations: ', len(combinations))

tmp = []
for selected in tqdm.tqdm(combinations):
    tmp_df = df[df.segment.isin(selected)]
    tmp.append(
        {
            'selected_segments': ', '.join(selected),
            'users': tmp_df.users.sum(),
            'cs_contacts': tmp_df.cs_contacts.sum(),
            'marketing_spending': tmp_df.marketing_spending.sum(),
            'revenue': tmp_df.revenue.sum()
        }
    )

# number of segments: 15
# number of combinations: 32768
It took approximately 4 seconds to process 15 segments, which means we can handle around 7,000 iterations per second. Using this estimate, let's calculate the execution time for the full set of 62 segments.
2**62 / 7000 / 3600 / 24 / 365
# 20890800.6
Using brute force, it would take around 20.9 million years to answer our question, clearly not a feasible option.
Execution time is entirely determined by the number of segments. Removing just one segment cuts the time in half. With this in mind, let's explore possible ways to merge segments.
As usual, there are more small segments than big ones, so merging them is a logical step. However, it's important to note that this approach reduces accuracy, since multiple segments are aggregated into one. Despite this, it could still deliver a solution that is "good enough".
To simplify, let's merge all segments that contribute less than 0.1% of revenue.
df['share_of_revenue'] = df.revenue / df.revenue.sum() * 100
df['segment_group'] = list(map(
    lambda x, y: x if y >= 0.1 else 'other',
    df.segment,
    df.share_of_revenue
))

print(df[df.segment_group == 'other'].share_of_revenue.sum())
# 0.53
print(df.segment_group.nunique())
# 52
With this approach, we merge ten segments into one, representing 0.53% of total revenue (the potential margin of error). With 52 segments remaining, we could obtain the solution in just 20.4K years. While this is a significant improvement, it's still not sufficient.
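The 20.4K-years figure follows from the same iterations-per-second estimate as before, now applied to 52 segments:

```python
# 2**52 subsets at ~7,000 iterations per second,
# converted from seconds into years
years = 2 ** 52 / 7000 / 3600 / 24 / 365
print(round(years))  # 20401, i.e. roughly 20.4K years
```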
You may consider other heuristics tailored to your specific task. For instance, if your constraint is a ratio (e.g., contact rate = CS contacts / users ≤ 5%), you could group all segments where the constraint holds on its own, since the optimal solution will include all of them. In our case, however, I don't see any additional ways to reduce the number of segments, so brute force seems impractical.
That said, if the number of combinations is relatively small and brute force can be executed within a reasonable time, it can be an ideal approach. It's simple to develop and provides accurate results.
Since brute force is not feasible for calculating all combinations, let's consider a simpler algorithm to tackle this problem.
One possible approach is to focus on the top-performing segments. We can evaluate segment performance by calculating revenue per dollar spent, then sort all activities by this ratio and select the top performers that fit within the marketing budget. Let's implement it.
df['revenue_per_spend'] = df.revenue / df.marketing_spending
df = df.sort_values('revenue_per_spend', ascending=False)
df['spend_cumulative'] = df.marketing_spending.cumsum()
selected_df = df[df.spend_cumulative <= 30000000]
print(selected_df.shape[0])
# 48
print(selected_df.revenue.sum() / 1000000)
# 107.92
With this approach, we selected 48 activities and got $107.92M in revenue.
Unfortunately, although the logic seems reasonable, it is not the optimal solution for maximising revenue. Let's look at a simple example with just three marketing activities.
Using the top-markets approach, we would select France and achieve $68M in revenue. However, by choosing two other markets instead, we could achieve significantly better results: $97.5M. The key point is that our algorithm optimises not only for maximum revenue but also, implicitly, for minimising the number of selected segments. Therefore, this approach will not always yield the best results, especially considering its inability to account for multiple constraints.
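This failure mode is easy to reproduce on a toy dataset. In the sketch below, the revenues match the example above, while the spends ($20M, $15M, $15M) and the $30M budget are my own illustrative assumptions, chosen so that the greedy rule picks France first and then runs out of budget:

```python
import itertools

# hypothetical activities: (name, marketing spend in $M, revenue in $M)
activities = [("France", 20, 68.0), ("Market B", 15, 50.0), ("Market C", 15, 47.5)]
budget = 30

# greedy: sort by revenue per dollar spent, take while the budget allows
greedy_picked, spent = [], 0
for name, spend, revenue in sorted(activities, key=lambda a: a[2] / a[1], reverse=True):
    if spent + spend <= budget:
        greedy_picked.append(name)
        spent += spend
greedy_revenue = sum(r for n, s, r in activities if n in greedy_picked)

# brute force: best feasible subset of the three activities
best_revenue = max(
    sum(a[2] for a in combo)
    for r in range(len(activities) + 1)
    for combo in itertools.combinations(activities, r)
    if sum(a[1] for a in combo) <= budget
)

print(greedy_picked, greedy_revenue)  # ['France'] 68.0
print(best_revenue)                   # 97.5
```

France has the best revenue-per-dollar ratio, so greedy takes it and can no longer afford anything else, while the optimal plan skips France entirely.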
Since all the simple approaches have failed, we must return to fundamentals and explore the theory behind this problem. Fortunately, the knapsack problem has been studied for many years, and we can apply optimisation techniques to solve it in seconds rather than years.
The issue we’re making an attempt to unravel is an instance of Integer Programming, which is definitely a subdomain of Linear Programming.
We’ll focus on this shortly, however first, let’s align on the important thing ideas of the optimization course of. Every optimisation drawback consists of:
- Determination variables: Parameters that may be adjusted within the mannequin, sometimes representing the levers or choices we need to make.
- Goal operate: The goal variable we goal to maximise or reduce. It goes with out saying that it should depend upon the choice variables.
- Constraints: Situations positioned on the choice variables that outline their potential values. For instance, guaranteeing the staff can not work a unfavourable variety of hours.
With these primary ideas in thoughts, we are able to outline Linear Programming as a state of affairs the place the next situations maintain:
- The target operate is linear.
- All constraints are linear.
- Determination variables are real-valued.
Integer Programming is similar to Linear Programming, with one key distinction: some or all resolution variables have to be integers. Whereas this will likely seem to be a minor change, it considerably impacts the answer strategy, requiring extra complicated strategies than these utilized in Linear Programming. One widespread method is branch-and-bound. We gained’t dive deeper into the idea right here, however you’ll be able to at all times discover extra detailed explanations on-line.
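To make the difference concrete, here is a minimal, self-contained sketch (pure Python, toy numbers of my own, not from our dataset). The 0/1 version is solved exactly by brute force, while the LP relaxation, which allows fractional decisions, is solved greedily; branch-and-bound solvers use exactly such relaxations as bounds while searching for the integer optimum:

```python
import itertools

# toy knapsack: three items with values and weights, capacity 50
values = [60, 100, 120]
weights = [10, 20, 30]
capacity = 50
n = len(values)

# exact 0/1 answer: brute force over all subsets
best_int = max(
    sum(values[i] for i in combo)
    for r in range(n + 1)
    for combo in itertools.combinations(range(n), r)
    if sum(weights[i] for i in combo) <= capacity
)

# LP relaxation: decisions may be fractional, so greedily fill
# the capacity by value-per-weight ratio (optimal for the relaxed problem)
remaining, best_lp = capacity, 0.0
for i in sorted(range(n), key=lambda i: values[i] / weights[i], reverse=True):
    take = min(1.0, remaining / weights[i])
    best_lp += take * values[i]
    remaining -= take * weights[i]

print(best_int)  # 220
print(best_lp)   # ~240: an upper bound the integer solution cannot reach
```

The relaxation takes two items fully plus two-thirds of the third, which no 0/1 selection can match; the gap between 240 and 220 is what the integer machinery has to close.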
For linear optimisation, I prefer the widely used Python package PuLP. However, there are other options available, such as Python MIP or Pyomo. Let's install PuLP via pip.
! pip install pulp
Now, it’s time to outline our job as a mathematical optimisation drawback. There are the next steps for it:
- Outline the set of resolution variables (levers we are able to modify).
- Align on the target operate (a variable that we are going to be optimising for).
- Formulate constraints (the situations that should maintain true throughout optimisations).
Let’s undergo the steps one after the other. However first, we have to create the issue object and set the target — maximization in our case.
from pulp import *
problem = LpProblem("Marketing_campaign", LpMaximize)
The next step is defining the decision variables, the parameters we can change during optimisation. Our main decision is whether to run each marketing campaign or not, so we can model it as a set of binary variables (0 or 1), one per segment. Let's do it with the PuLP library.
segments = range(df.shape[0])
selected = LpVariable.dicts("Selected", segments, cat="Binary")
After that, it's time to align on the objective function. As discussed, we want to maximise revenue. Total revenue is the sum of revenue across all selected segments (those where decision_variable = 1). Therefore, we can define the objective as the sum of each segment's expected revenue multiplied by its binary decision variable.
problem += lpSum(
    selected[i] * list(df['revenue'].values)[i]
    for i in segments
)
The final step is to add constraints. Let's start with a simple one: our marketing spending must be below $30M.
problem += lpSum(
    selected[i] * df['marketing_spending'].values[i]
    for i in segments
) <= 30 * 10**6
Hint: you can print(problem) to double-check the objective function and constraints.
Now that we’ve outlined every little thing, we are able to run the optimization and analyze the outcomes.
drawback.clear up()
It takes lower than a second to run the optimization, a big enchancment in comparison with the hundreds of years that brute drive would require.
Result - Optimal solution found
Objective value: 110162662.21000001
Enumerated nodes: 4
Total iterations: 76
Time (CPU seconds): 0.02
Time (Wallclock seconds): 0.02
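Rather than eyeballing the solver log, you can check the status programmatically via LpStatus. A minimal, self-contained sketch (using a tiny throwaway problem, since the full dataset isn't needed to show the pattern):

```python
from pulp import LpMaximize, LpProblem, LpStatus, LpVariable, value

# a one-variable toy problem, just to demonstrate the status check
prob = LpProblem("status_check", LpMaximize)
x = LpVariable("x", cat="Binary")
prob += x  # objective: maximise x
status = prob.solve()

print(LpStatus[status])       # "Optimal" when a proven optimum was found
print(value(prob.objective))  # 1.0
```

The same two lines work on our problem object; statuses other than "Optimal" (e.g. "Infeasible") signal that the constraints need a second look.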
Let’s save the outcomes of the mannequin execution — the choice variables indicating whether or not every phase was chosen or not — into our dataframe.
df['selected'] = checklist(map(lambda x: x.worth(), chosen.values()))
print(df[df.selected == 1].income.sum()/10**6)
# 110.16
It really works like magic, permitting you to acquire the answer rapidly. Moreover, word that we achieved increased income in comparison with our naive strategy: $110.16M versus $107.92M.
We’ve examined integer programming with a easy instance that includes only one constraint, however we are able to prolong it additional. For example, we are able to add extra constraints for our CS contacts to make sure that our Operations staff can deal with the demand in a wholesome approach:
- The variety of extra CS contacts ≤ 5K
- Contact fee (CS contacts/customers) ≤ 0.042
# define the problem
problem_v2 = LpProblem("Marketing_campaign_v2", LpMaximize)

# decision variables
segments = range(df.shape[0])
selected = LpVariable.dicts("Selected", segments, cat="Binary")

# objective function
problem_v2 += lpSum(
    selected[i] * list(df['revenue'].values)[i]
    for i in segments
)

# constraints
problem_v2 += lpSum(
    selected[i] * df['marketing_spending'].values[i]
    for i in segments
) <= 30 * 10**6

problem_v2 += lpSum(
    selected[i] * df['cs_contacts'].values[i]
    for i in segments
) <= 5000

problem_v2 += lpSum(
    selected[i] * df['cs_contacts'].values[i]
    for i in segments
) <= 0.042 * lpSum(
    selected[i] * df['users'].values[i]
    for i in segments
)

# run the optimisation
problem_v2.solve()
The code is straightforward, with the only tricky part being the transformation of the ratio constraint into linear form: sum(cs_contacts) / sum(users) ≤ 0.042 is equivalent to sum(cs_contacts) ≤ 0.042 * sum(users) (as long as the selected users total is positive), which is what the last constraint encodes.
Another possible constraint you might consider is limiting the number of selected options, for example, to 10. This constraint can be quite helpful in prescriptive analytics, for instance, when you need to select the top-N most impactful focus areas.
# define the problem
problem_v3 = LpProblem("Marketing_campaign_v3", LpMaximize)

# decision variables
segments = range(df.shape[0])
selected = LpVariable.dicts("Selected", segments, cat="Binary")

# objective function
problem_v3 += lpSum(
    selected[i] * list(df['revenue'].values)[i]
    for i in segments
)

# constraints
problem_v3 += lpSum(
    selected[i] * df['marketing_spending'].values[i]
    for i in segments
) <= 30 * 10**6

problem_v3 += lpSum(
    selected[i] for i in segments
) <= 10

# run the optimisation
problem_v3.solve()

df['selected'] = list(map(lambda x: x.value(), selected.values()))
print(df.selected.sum())
# 10
Another possible way to tweak our problem is to change the objective function. We've been optimising for revenue, but imagine we want to maximise both revenue and new users at the same time. For that, we can slightly change our objective function.
Let's consider the best approach. We could calculate the plain sum of revenue and new users and aim to maximise it. However, since revenue is, on average, 1000 times higher, the results would be skewed toward maximising revenue. To make the metrics more comparable, we can normalise both revenue and users by their total sums. Then, we can define the objective function as a weighted sum of these ratios. I will use equal weights (0.5) for both metrics, but you can adjust the weights to give more value to one of them.
# define the problem
problem_v4 = LpProblem("Marketing_campaign_v4", LpMaximize)

# decision variables
segments = range(df.shape[0])
selected = LpVariable.dicts("Selected", segments, cat="Binary")

# objective function
problem_v4 += (
    0.5 * lpSum(
        selected[i] * df['revenue'].values[i] / df['revenue'].sum()
        for i in segments
    )
    + 0.5 * lpSum(
        selected[i] * df['users'].values[i] / df['users'].sum()
        for i in segments
    )
)

# constraints
problem_v4 += lpSum(
    selected[i] * df['marketing_spending'].values[i]
    for i in segments
) <= 30 * 10**6

# run the optimisation
problem_v4.solve()

df['selected'] = list(map(lambda x: x.value(), selected.values()))
We got an optimal objective value of 0.6131, with revenue at $104.36M and 136.37K new users.
That’s it! We’ve realized how you can use integer programming to unravel numerous optimisation issues.
You could find the complete code on GitHub.
On this article, we explored totally different strategies for fixing the knapsack drawback and its analogues in product analytics.
- We started with a brute-force strategy however rapidly realized it will take an unreasonable period of time.
- Subsequent, we tried utilizing widespread sense by naively choosing the top-performing segments, however this strategy yielded incorrect outcomes.
- Lastly, we turned to Integer Programming, studying how you can translate our product duties into optimization fashions and clear up them successfully.
With this, I hope you’ve gained one other invaluable analytical software on your toolkit.
Thank you a large number for studying this text. I hope this text was insightful for you. In case you have any follow-up questions or feedback, please depart them within the feedback part.
All the photographs are produced by the creator until in any other case said.