Gregor Godbersen : Kaggle Christmas Challenge 2019
Published

Kaggle Christmas Challenge 2019

Note that this entry was originally published 4 years ago. It may present outdated technologies or knowlege.

This year I participated in the annual Kaggle Christmas Challenge and will illustrate my approaches in this post since the submission deadline has passed. My final model reaches the optimal solution within 37 hours using a linear programming min-cost-flow formulation. A 3% optimality gap is reached within 5 hours.

Problem Introduction

In this year’s problem, we are asked to optimally schedule visitors for Santa’s Workshop in the 100 days before Christmas. Five thousand families have each submitted ten date preferences and must be scheduled so that the number of visitors for each day falls between 125 and 300. If we assign a family to their less desired preferences, we have to offer larger and larger gifts as a penalty. Additionally an accounting cost,

c=dD(Nd125)400Nd(12+NdN(d1)50)c = \sum_{d \in \mathcal{D}} \frac{(N_d - 125)}{400} {N_d}^{\left( \frac{1}{2} + \frac{\vert N_d - N_{\left(d-1\right)} \vert }{50} \right)}

, based on the number of visitors at each day NdN_d applies. A full problem description is available on the competition page.

The accounting cost is where the main complexity comes from as it is intentionally designed to be non-linear and non-convex which presents a challenge for optimization models.

Attempts

We will first try to ignore, then limit and finally linearize the accounting costs and show the gains in solution quality along the way.

1. Ignoring the Accounting Costs

As an initial attempt we will just ignore the accounting costs and find the assignment penalty optimal solution. Using the sets of families F\mathcal{F} and days D\mathcal{D}, we can formulate the following mixed integer optimization problem:

minimize dDfFxdfcdfsubject to:fFsfxdf125,dDfFsfxdf300,dDdDxdf=1,fF\begin{aligned} &\text{minimize } \sum_{d \in \mathcal{D}} \sum_{f \in \mathcal{F}} x_{df} \cdot c_{df} & \\ &\text{subject to:} & \\ &\sum_{f \in \mathcal{F}} s_f x_{df} & \geq 125 \quad, \forall d \in \mathcal{D} \\ &\sum_{f \in \mathcal{F}} s_f x_{df} & \leq 300 \quad, \forall d \in \mathcal{D} \\ &\sum_{d \in \mathcal{D}} x_{df} & = 1 \quad, \forall f \in \mathcal{F} \\ \end{aligned}

With decision variable xdf=1x_{df} = 1 when family ff is assigned to day dd, and xdf=0x_{df} = 0 otherwise. As parameters we are given sfs_f as the size of family ff and can precompute cdfc_{df} as the penalty cost incurred when assigning family ff to day dd based on their preference list.

This can be visualized as the partial day-family matrix shown below, where for every column (=family) we need to pick precisely one day, and for every row (=day), we must select between PminP^{\min} and PmaxP^{\max} people while minimizing the sum of our selected assignment penalties cdfc_{df}.

illustration of the assignment problem
illustration of the assignment problem

This model can be easily solved to optimality within seconds using modern solver software. When we evaluate the solution using the full cost function, the costs are enormous, much higher than the provided sample solution: 15 583 511 274. It is evident that just optimizing for the assignment penalty costs is not enough.

In the optimal assignment chart, we can identify a preference for weekends and the last days before Christmas.

Assignment chart for the basic formulation
Assignment chart for the basic formulation

2. Limiting the Accounting Costs

For a single day we can reformulate the accounting cost function to (Nd125)400Nd(12+Δd50)\frac{(N_d - 125)}{400} {N_d}^{( \frac{1}{2} + \frac{\vert\Delta_d\vert}{50} )}, with NdN_d as the number of visitors assigned to the day and Δd\Delta_d as the difference in visitors from the previous day. When plotting the function we can see that it is particularly sensitive to a change in Δd\Delta_d and is more forgiving at lower NdN_d values:

Accounting costs for a single day dependent on NdN_d and Δd\Delta_d, costs capped at 2000

As a first measure, we will now extend our first formulation by limiting the delta allowed between two days by a fixed value Δmax\Delta^{\max}. We linearize the the absolute value function and add the following constraints:

fF(sfx(d1),f)+ΔdΔd+= fFsfxd,f,dDΔd+ Δmax,dDΔd Δmax,dDΔd,Δd+ 0,dD\begin{aligned} &\sum_{f \in \mathcal{F}} \left( s_f x_{(d-1),f} \right) + \Delta_d^{-} - \Delta_d^{+} & = &\ \sum_{f \in \mathcal{F}} s_f x_{d,f} \quad, \forall d \in \mathcal{D} \\ & \Delta_d^{+} & \leq &\ \Delta^{\max} \quad, \forall d \in \mathcal{D} \\ & \Delta_d^{-} & \leq &\ \Delta^{\max} \quad, \forall d \in \mathcal{D} \\ & \Delta_d^{-}, \Delta_d^{+} & \geq &\ 0 \quad, \forall d \in \mathcal{D} \\ \end{aligned}

By experimenting with different values, we determine Δmax=30\Delta^{\max} = 30 to yield the best results. The optimal solution for this model is 77 347.70, which is found after approx. 200 seconds, showing a huge improvement over the previous cost of 15 583 511 274.

In the assignment profile we can see that the peaks have been smoothed out:

 Assignment chart for the basic limit formulation
Assignment chart for the basic limit formulation

We can further improve the result by replacing the fixed limit using a function dependent on NdN_d as a decrease to low visitor levels is hardly penalized. We revisit the cost function plot and draw an approximate piecewise linear function where the accounting costs exceed 120.

Accounting costs for a single day dependent on NdN_d and Δd\Delta_d, costs capped at 400, piecewise linear function through costs of approx. 120 drawn dashed.

We can now limit the Δd\Delta_d depending on NdN_d in the linear program. Optimizing the model using the piecewise linear function for a time limit of one hour results in a further improvement to 73 338.75. In the assignment graph, we can see that the ramp-up then -down pattern in the peaks to the right has been replaced with going straight back to 125, a move for which we don’t pay any accounting costs.

Assignment chart for the piecewise linear formulation
Assignment chart for the piecewise linear formulation

Solving the computationally expensive piecewise linear problem formulation to optimality would likely improve the result slightly but wouldn’t lead to a global optimum when applying the full cost function. We need to incorporate the accounting cost function in the objective value instead of just limiting the solution space.

3. Linearizing the Accounting Costs

If we were able to linearize the accounting costs, the full problem could be approached using linear programming. Since the accounting cost is somewhat equivalent to a state transition cost, a graph representation of the problem looks suitable. Based on states we could be in for a single day (=the visitor levels), we can transition to a new state for the next day. The accounting costs can then be expressed as a transition cost.

To begin our linearization we first discretize the possible visitor levels for each day. As our visitor limits are between 125 and 300 we get a total of (300125)100=175000(300-125)*100 = 175\,000 possible states over all dates with transition arcs between them. More formally we define the problem as directed graph G=(V,A)G = (\mathcal{V},\mathcal{A}), with set V\mathcal{V} containing a vertex (d,l)(d,l) for every day dDd \in \mathcal{D} and visitor level lLl \in \mathcal{L}. The arc set A\mathcal{A} contains an arc ((d,l),(d+1,m))((d,l),(d+1,m)) for every day dDd \in \mathcal{D}, current visitor level lLl \in \mathcal{L} and next visitor level mLm \in \mathcal{L}. We add source and sink nodes SS and FF respectively. For each arc (i,j)(i,j) we define a cost parameter cij=(j125)400j(12+ji50)\overline{c}_{ij} = \frac{(j - 125)}{400} {j}^{( \frac{1}{2} + \frac{\vert j - i \vert }{50} )} based on the accounting cost formula. A partial graph with a limited amount nodes is illustrated below:

Partial transition graph with 3 days and
visitor levels of 126-128. Exemplary valid path with edge costs shown in blue.
Partial transition graph with 3 days and visitor levels of 126-128. Exemplary valid path with edge costs shown in blue.

The lowest accounting costs can now be found using the shortest path from the start to the finish node using the costs cij\overline{c}_{ij} as distance measure. Arcs that exceed a large cost limit can be removed from the graph as they are doubtful to be used. There are no cycles, so this is possible in polynomial time using a dynamic programming approach. However, since we need to integrate the family assignment problem with the accounting cost, we have to choose a min-cost-flow formulation for linear programming instead. We start with the initial family assignment model containing the following constraints:

fFsfxdf125,dDfFsfxdf300,dDdDxdf= 1,fF\begin{aligned} &\sum_{f \in \mathcal{F}} s_f x_{df} & \geq 125 \quad, \forall d \in \mathcal{D} \\ &\sum_{f \in \mathcal{F}} s_f x_{df} & \leq 300 \quad, \forall d \in \mathcal{D} \\ &\sum_{d \in \mathcal{D}} x_{df} & =\ 1 \quad, \forall f \in \mathcal{F} \\ \end{aligned}

We now introduce a binary variable aija_{ij} indicating if we traverse arc (i,j)(i,j) (with i and j each representing a day-level vertex). Sets δi+\mathcal{\delta}_i^{+} and δi\mathcal{\delta}_i^{-} contain incoming and outgoing arcs of an vertex ii respectively. Using these sets, we can add flow conservation constraints to the model, which enforce that we only have an outflow from a node if we had an inflow and that we have outgoing and incoming flows for the source and sink.

jδiaijjδi+aji= 0,iV,i{S,F}jδiaij= 1,i=Sjδi+aji= 1,i=F\begin{aligned} &\sum_{j \in \mathcal{\delta}_i^{-}} a_{ij} - \sum_{j \in \mathcal{\delta}_i^{+}} a_{ji} & = &\ 0 \quad, \forall i \in V, i \notin \{S,F\} \\ &\sum_{j \in \mathcal{\delta}_i^{-}} a_{ij} & = &\ 1 \quad, i = S \\ &\sum_{j \in \mathcal{\delta}_i^{+}} a_{ji} & = &\ 1 \quad, i = F \\ \end{aligned}

Having both the family assignment and the accounting problem within the same model, the only missing step is connecting the two. We introduce linking constraints that require that we assign ll visitors on the day dd when there is an incoming flow to vertex (d,l)(d,l).

fFsfxdf=lLjδ(d,l)+laj,(d,l),dD\begin{aligned} &\sum_{f \in \mathcal{F}} s_f x_{df} & = & \sum_{l \in \mathcal{L}} \sum_{j \in \delta^{+}_{(d,l)}} l \cdot a_{j,(d,l)} \quad, \forall d \in \mathcal{D} \\ \end{aligned}

Together with the sum of the family assignment penalties and the accounting cost the full model is then:

minimize dDfFxdfcdf+(i,j)Aaijcijsubject to:fFsfxdf 125,dDfFsfxdf 300,dDdDxdf= 1,fFjδiaijjδi+aji= 0,iV,i{S,F}jδiaij= 1,i=Sjδi+aji= 1,i=FfFsfxdf= lLjδi+laj,(d,l),dD\begin{aligned} &\text{minimize } \sum_{d \in \mathcal{D}} \sum_{f \in \mathcal{F}} x_{df} \cdot c_{df} & + & \sum_{(i,j) \in \mathcal{A}} a_{ij} \cdot \overline{c}_{ij} \\ &\text{subject to:} & \\ &\sum_{f \in \mathcal{F}} s_f x_{df} & \geq &\ 125 \quad, \forall d \in \mathcal{D} \\ &\sum_{f \in \mathcal{F}} s_f x_{df} & \leq &\ 300 \quad, \forall d \in \mathcal{D} \\ &\sum_{d \in \mathcal{D}} x_{df} & = &\ 1 \quad, \forall f \in \mathcal{F} \\ &\sum_{j \in \delta_i^{-}} a_{ij} - \sum_{j \in \delta_i^{+}} a_{ji} & = &\ 0 \quad, \forall i \in \mathcal{V}, i \notin \{S,F\} \\ &\sum_{j \in \delta_i^{-}} a_{ij} & = &\ 1 \quad, i = S \\ &\sum_{j \in \delta_i^{+}} a_{ji} & = &\ 1 \quad, i = F \\ &\sum_{f \in \mathcal{F}} s_f x_{df} & = &\ \sum_{l \in \mathcal{L}} \sum_{j \in \delta_i^{+}} l \cdot a_{j,(d,l)} \quad, \forall d \in \mathcal{D} \\ \end{aligned}

Solving this model requires roughly 12GB of memory due to a large number of arc variables but can be completed within 37 hours on a standard quad-core PC using Gurobi as a solver. A 3% optimality gap was reached after 5 hours and lowered to 1% after 12 hours.

The final (and optimal) total cost identified is 68 888.04343.

Assignment chart for the optimal solution
Assignment chart for the optimal solution

In the assignment chart, we see the same development as seen when changing to the piecewise linear model, just much more pronounced. We have several peaks with slow up-ramps and then a direct return to the minimum visitor-level 125.

Final Remarks

I was impressed by how well Gurobi was handling the sizeable min-cost-flow subproblem with 175 000 vertices and 3 million arcs. It likely has special optimizations that detect the min-cost-flow problem part and applies tailored heuristics to it. This is definitively something to remember when modeling complex problems.

Outside of the Kaggle Challenge, even the relatively simple piecewise linear function approach from attempt two would have been reasonably good, having just a 6% gap to the optimal solution. On Kaggle even the optimal solution was only good enough for a silver medal as I was quite late to the party. Definitely a fun challenge, though, and I’m looking forward to next year’s edition.