# Kaggle Christmas Challenge 2019

This year I participated in the annual Kaggle Christmas Challenge and will illustrate my approaches in this post since the submission deadline has passed. My final model reaches the optimal solution within 37 hours using a linear programming min-cost-flow formulation. A 3% optimality gap is reached within 5 hours.

In this year’s problem, we are asked to optimally schedule visitors for Santa’s Workshop in the 100 days before Christmas. Five thousand families have each submitted ten date preferences and must be scheduled so that the number of visitors for each day falls between 125 and 300. If we assign a family to their less desired preferences, we have to offer larger and larger gifts as a penalty. Additionally an accounting cost, $c = \sum_{d \in \mathcal{D}} \frac{(N_d - 125)}{400} {N_d}^{\left( \frac{1}{2} + \frac{\vert N_d - N_{\left(d-1\right)} \vert }{50} \right)}$

, based on the number of visitors at each day $N_d$ applies. A full problem description is available on the competition page.

The accounting cost is where the main complexity comes from as it is intentionally designed to be non-linear and non-convex which presents a challenge for optimization models. We will first try to ignore, then limit and finally linearize the accounting costs and show the gains in solution quality along the way.

## Attempt 1: Ignoring the Accounting Costs

As an initial attempt we will just ignore the accounting costs and find the assignment penalty optimal solution. Using the sets of families $\mathcal{F}$ and days $\mathcal{D}$, we can formulate the following mixed integer optimization problem:

$\begin{aligned} &\text{minimize } \sum_{d \in \mathcal{D}} \sum_{f \in \mathcal{F}} x_{df} \cdot c_{df} & \\ &\text{subject to:} & \\ &\sum_{f \in \mathcal{F}} s_f x_{df} & \geq 125 \quad, \forall d \in \mathcal{D} \\ &\sum_{f \in \mathcal{F}} s_f x_{df} & \leq 300 \quad, \forall d \in \mathcal{D} \\ &\sum_{d \in \mathcal{D}} x_{df} & = 1 \quad, \forall f \in \mathcal{F} \\ \end{aligned}$

With decision variable $x_{df} = 1$ when family $f$ is assigned to day $d$, and $x_{df} = 0$ otherwise. As parameters we are given $s_f$ as the size of family $f$ and can precompute $c_{df}$ as the penalty cost incurred when assigning family $f$ to day $d$ based on their preference list.

This can be visualized as the partial day-family matrix shown below, where for every column (=family) we need to pick precisely one day, and for every row (=day), we must select between $P^{\min}$ and $P^{\max}$ people while minimizing the sum of our selected assignment penalties $c_{df}$.

This model can be easily solved to optimality within seconds using modern solver software. When we evaluate the solution using the full cost function, the costs are enormous, much higher than the provided sample solution: **15 583 511 274**. It is evident that just optimizing for the assignment penalty costs is not enough.

In the optimal assignment chart, we can identify a preference for weekends and the last days before Christmas.

## Attempt 2: Limiting the Accounting Costs

For a single day we can reformulate the accounting cost function to $\frac{(N_d - 125)}{400} {N_d}^{( \frac{1}{2} + \frac{\vert\Delta_d\vert}{50} )}$, with $N_d$ as the number of visitors assigned to the day and $\Delta_d$ as the difference in visitors from the previous day. When plotting the function we can see that it is particularly sensitive to a change in $\Delta_d$ and is more forgiving at lower $N_d$ values:

As a first measure, we will now extend our first formulation by limiting the delta allowed between two days by a fixed value $\Delta^{\max}$. We linearize the the absolute value function and add the following constraints:

$\begin{aligned} &\sum_{f \in \mathcal{F}} \left( s_f x_{(d-1),f} \right) + \Delta_d^{-} - \Delta_d^{+} & = &\ \sum_{f \in \mathcal{F}} s_f x_{d,f} \quad, \forall d \in \mathcal{D} \\ & \Delta_d^{+} & \leq &\ \Delta^{\max} \quad, \forall d \in \mathcal{D} \\ & \Delta_d^{-} & \leq &\ \Delta^{\max} \quad, \forall d \in \mathcal{D} \\ & \Delta_d^{-}, \Delta_d^{+} & \geq &\ 0 \quad, \forall d \in \mathcal{D} \\ \end{aligned}$

By experimenting with different values, we determine $\Delta^{\max} = 30$ to yield the best results. The optimal solution for this model is **77 347.70**, which is found after approx. 200 seconds, showing a huge improvement over the previous cost of *15 583 511 274*.

In the assignment profile we can see that the peaks have been smoothed out:

We can further improve the result by replacing the fixed limit using a function dependent on $N_d$ as a decrease to low visitor levels is hardly penalized. We revisit the cost function plot and draw an approximate piecewise linear function where the accounting costs exceed 120.

We can now limit the $\Delta_d$ depending on $N_d$ in the linear program. Optimizing the model using the piecewise linear function for a time limit of one hour results in a further improvement to **73 338.75**. In the assignment graph, we can see that the ramp-up then -down pattern in the peaks to the right has been replaced with going straight back to 125, a move for which we don’t pay any accounting costs.

Solving the computationally expensive piecewise linear problem formulation to optimality would likely improve the result slightly but wouldn’t lead to a global optimum when applying the full cost function. We need to incorporate the accounting cost function in the objective value instead of just limiting the solution space.

## Attempt 3: Linearizing the Accounting Costs

If we were able to linearize the accounting costs, the full problem could be approached using linear programming. Since the accounting cost is somewhat equivalent to a state transition cost, a graph representation of the problem looks suitable. Based on states we could be in for a single day (=the visitor levels), we can transition to a new state for the next day. The accounting costs can then be expressed as a transition cost.

To begin our linearization we first discretize the possible visitor levels for each day. As our visitor limits are between 125 and 300 we get a total of $(300-125)*100 = 175\,000$ possible states over all dates with transition arcs between them. More formally we define the problem as directed graph $G = (\mathcal{V},\mathcal{A})$, with set $\mathcal{V}$ containing a vertex $(d,l)$ for every day $d \in \mathcal{D}$ and visitor level $l \in \mathcal{L}$. The arc set $\mathcal{A}$ contains an arc $((d,l),(d+1,m))$ for every day $d \in \mathcal{D}$, current visitor level $l \in \mathcal{L}$ and next visitor level $m \in \mathcal{L}$. We add source and sink nodes $S$ and $F$ respectively. For each arc $(i,j)$ we define a cost parameter $\overline{c}_{ij} = \frac{(j - 125)}{400} {j}^{( \frac{1}{2} + \frac{\vert j - i \vert }{50} )}$ based on the accounting cost formula. A partial graph with a limited amount nodes is illustrated below:

The lowest accounting costs can now be found using the shortest path from the start to the finish node using the costs $\overline{c}_{ij}$ as distance measure. Arcs that exceed a large cost limit can be removed from the graph as they are doubtful to be used. There are no cycles, so this is possible in polynomial time using a dynamic programming approach. However, since we need to integrate the family assignment problem with the accounting cost, we have to choose a min-cost-flow formulation for linear programming instead. We start with the initial family assignment model containing the following constraints:

$\begin{aligned} &\sum_{f \in \mathcal{F}} s_f x_{df} & \geq 125 \quad, \forall d \in \mathcal{D} \\ &\sum_{f \in \mathcal{F}} s_f x_{df} & \leq 300 \quad, \forall d \in \mathcal{D} \\ &\sum_{d \in \mathcal{D}} x_{df} & =\ 1 \quad, \forall f \in \mathcal{F} \\ \end{aligned}$

We now introduce a binary variable $a_{ij}$ indicating if we traverse arc $(i,j)$ (with i and j each representing a day-level vertex). Sets $\mathcal{\delta}_i^{+}$ and $\mathcal{\delta}_i^{-}$ contain incoming and outgoing arcs of an vertex $i$ respectively. Using these sets, we can add flow conservation constraints to the model, which enforce that we only have an outflow from a node if we had an inflow and that we have outgoing and incoming flows for the source and sink.

$\begin{aligned} &\sum_{j \in \mathcal{\delta}_i^{-}} a_{ij} - \sum_{j \in \mathcal{\delta}_i^{+}} a_{ji} & = &\ 0 \quad, \forall i \in V, i \notin \{S,F\} \\ &\sum_{j \in \mathcal{\delta}_i^{-}} a_{ij} & = &\ 1 \quad, i = S \\ &\sum_{j \in \mathcal{\delta}_i^{+}} a_{ji} & = &\ 1 \quad, i = F \\ \end{aligned}$

Having both the family assignment and the accounting problem within the same model, the only missing step is connecting the two. We introduce linking constraints that require that we assign $l$ visitors on the day $d$ when there is an incoming flow to vertex $(d,l)$.

$\begin{aligned} &\sum_{f \in \mathcal{F}} s_f x_{df} & = & \sum_{l \in \mathcal{L}} \sum_{j \in \delta^{+}_{(d,l)}} l \cdot a_{j,(d,l)} \quad, \forall d \in \mathcal{D} \\ \end{aligned}$

Together with the sum of the family assignment penalties and the accounting cost the full model is then:

$\begin{aligned} &\text{minimize } \sum_{d \in \mathcal{D}} \sum_{f \in \mathcal{F}} x_{df} \cdot c_{df} & + & \sum_{(i,j) \in \mathcal{A}} a_{ij} \cdot \overline{c}_{ij} \\ &\text{subject to:} & \\ &\sum_{f \in \mathcal{F}} s_f x_{df} & \geq &\ 125 \quad, \forall d \in \mathcal{D} \\ &\sum_{f \in \mathcal{F}} s_f x_{df} & \leq &\ 300 \quad, \forall d \in \mathcal{D} \\ &\sum_{d \in \mathcal{D}} x_{df} & = &\ 1 \quad, \forall f \in \mathcal{F} \\ &\sum_{j \in \delta_i^{-}} a_{ij} - \sum_{j \in \delta_i^{+}} a_{ji} & = &\ 0 \quad, \forall i \in \mathcal{V}, i \notin \{S,F\} \\ &\sum_{j \in \delta_i^{-}} a_{ij} & = &\ 1 \quad, i = S \\ &\sum_{j \in \delta_i^{+}} a_{ji} & = &\ 1 \quad, i = F \\ &\sum_{f \in \mathcal{F}} s_f x_{df} & = &\ \sum_{l \in \mathcal{L}} \sum_{j \in \delta_i^{+}} l \cdot a_{j,(d,l)} \quad, \forall d \in \mathcal{D} \\ \end{aligned}$

Solving this model requires roughly 12GB of memory due to a large number of arc variables but can be completed within 37 hours on a standard quad-core PC using Gurobi as a solver. A 3% optimality gap was reached after 5 hours and lowered to 1% after 12 hours.

The final (and optimal) total cost identified is **68 888.04343**.

In the assignment chart, we see the same development as seen when changing to the piecewise linear model, just much more pronounced. We have several peaks with slow up-ramps and then a direct return to the minimum visitor-level 125.

## Final Remarks

I was impressed by how well Gurobi was handling the sizeable min-cost-flow subproblem with 175 000 vertices and 3 million arcs. It likely has special optimizations that detect the min-cost-flow problem part and applies tailored heuristics to it. This is definitively something to remember when modeling complex problems.

Outside of the Kaggle Challenge, even the relatively simple piecewise linear function approach from attempt two would have been reasonably good, having just a 6% gap to the optimal solution. On Kaggle even the optimal solution was only good enough for a silver medal as I was quite late to the party. Definitely a fun challenge, though, and I’m looking forward to next year’s edition.