Small data and the power of mathematical modelling using Google OR-Tools
Introduction
There is a huge demand from companies to collect more and more data. This surge has led companies to forget what can be done with small data and discrete optimisation. The latest advancement in machine learning field has eclipsed other AI domains which are quite powerful in what they can accomplish. In chess for instance, a board has 64 squares and 32 pieces leads to billions of possible combinations and potential strategies to win the game.
Similarly to chess, companies might be holding a few amount of data but still can, through computational power and optimization techniques, be enabled to choose the best strategies among many different ones that can emanate from this data.
Machine Learning and Big Data
Machine learning is a field of computer science that aims to teach computers how to learn and act without being explicitly programmed. More specifically, it is an approach to data analysis that involves the use of models that allow programs to be learned through experience. Machine learning involves algorithms that adjust their models to improve their ability to make predictions. Machine Learning if used correctly can be very powerful but still it has some cons such as:
- Depend on large amount of data.
- Data must be of good quality.
- Interpretation of results might be very difficult.
- Probability of having a biased model.
Discrete optimization does not have the cons shown above but the model must be generated explicitly.
Google OR-Tools
OR-Tools is an open source software suite for optimization, tuned for tackling the world’s toughest problems in vehicle routing, flows, integer and linear programming, and constraint programming. After modeling your problem in the programming language of your choice, you can use many solvers to solve it.
Knapsack Problem
The Knapsack problem is a very interesting optimization problem. Imagine that you are going camping soon in the Australian outback. There is many items needed but unfortunately you can only select some. Each item has a certain value and size or volume and using this info you have to make an objective decision on which items are most valuable.
Problem Abstraction
Models mostly operates independently of the concrete world. Systems and real world problems have in general an actual representation which is too complex to model in all its details. The model must undergo a kind of translation to scaled down “abstract” version. In this context this problem we do not delve in the details on how to fit the items in the backpack but assume of the summation of the chosen items volume is less than the backpack volume then they fit in it. Furthermore, A ball could be represented by a sphere or a cube for instance and details such as color, texture, or make are not needed.
Problem Formulation
We are going to formulate the problem by defining the set of variables needed then linking those variables using linear equations. These set of equations will define the problem objective and constraints and then the built in solver could be used to search through he space defined by the equations structure to find the optimal solution.
Data
The following table defines all parameters and variables needed to write the equations. We associate with each object a variable x. A variable can take different values and in this specific case we either take the whole object or not. The value associated with each object is subjective and could be changed according to the location of the camping site and one’s personal preference. The volume is an approximate figure of the percentage space taken from the bag.
Mathematical Model
In this model, each variable is either 1 or 0. A value of 1 denotes that we are going to put the item in the bag otherwise we are going to leave it behind. The the items taken is represented by sum of the product. A multiplication by zero insures the item volume is not considered. While a multiplication by one will insure the whole volume of the item is taken into consideration. The objective is simply to maximize the summation of the value for each item taken. The value is represented by p and multiplying p by variable x will insure that the value of the item left behind will not be considered.
Python Code
To implement the mathematical model programmatically we use Google OR-Tools https://developers.google.com/optimization. The class optimization defines a blueprint from which many different problems can be formulated. In this class we define variables, objectives, constraints to model the problem. The solver uses the structure defined by the constraints to optimize the problem and return an optimal solution if the constraints can be satisfied.
from ortools.linear_solver import pywraplp
class Optimisation:
def __init__(self, n):
self.solver = pywraplp.Solver.CreateSolver("SCIP")
def parmeters(self):
pass
def variables(self):
pass
def constraints(self):
pass
def objective(self):
pass
def solve(self):
status = self.solver.Solve()
if status == pywraplp.Solver.OPTIMAL:
pass
def test(self):
pass
if __name__ == '__main__':
optim = Optimisation(4)
optim.parmeters()
optim.variables()
optim.constraints()
optim.objective()
optim.solve()
optim.test()
We use the data and model described above and generate the following code. Notice here that we have added a parameter function to distinguish between variables and values. The code is kind of self explanatory and uses list generation extensively.
from ortools.linear_solver import pywraplp
class Optimisation:
def __init__(self, n):
self.solver = pywraplp.Solver.CreateSolver("SCIP")
self.n = n #number of items
def parmeters(self):
self.p=[30,35,80,90] #value for each item
self.v=[40,20,30,40] #volume taken by each item
def variables(self):
self.x = {} # 1 or 0
for i in range(self.n):
self.x[i] = self.solver.IntVar(0, 1, "x" + str(i + 1))
def constraints(self):
self.solver.Add(sum([self.v[i]*self.x[i] for i in range(self.n)])<=100)
def objective(self):
self.solver.Maximize(sum([self.p[i]*self.x[i] for i in range(self.n)]))
def solve(self):
status = self.solver.Solve()
if status == pywraplp.Solver.OPTIMAL:
print("Solution:")
for i in range(self.n):
print(self.x[i].solution_value())
def test(self):
pass
if __name__ == '__main__':
optim = Optimisation(4)
optim.parmeters()
optim.variables()
optim.constraints()
optim.objective()
optim.solve()
optim.test()
Solution
The solver uses “SCIP” which is a kind of solver that can deal with integer variables. In this case all variable are integer and binary. In this case the solution tells us to select all items but the first one (the ball).
Problem solved in 10.000000 milliseconds
Problem solved in 0 iterations
Problem solved in 1 branch-and-bound nodes
Solution:
x1=0
x2=1
x3=1
x4=1
Bigger problem size
You can solve this problem with 100 or 1000 of items and still it would be considered small but non trivial especially if you start adding some other constraints to model different scenarios and complexity that comes with real world problems.
Application of the Knapsack Problem in Finance
A direct application of the knapsack problem in finance is to replace items by projects. The model would be to select a subset of these projects within budget, time and risk while maximizing return on investment.
Conclusion
There is a lot of very interesting problems that can be solved using small amount of data. Small amount of data does not mean that the problem is easy and will not give big value to a company. Think about a Rubik Cube and the number of people that can solve it on the spot. While it fits directly in the hand, it is still quite a challenging puzzle to solve.