## Finding the right balance between exploitation and exploration

Making decisions under uncertainty is a common challenge faced by professionals in various fields, including data science and asset management. Asset managers face this problem when selecting among multiple execution algorithms to carry out their trades. The allocation of orders among algorithms resembles the multi-armed bandit problem that gamblers face when deciding which slot machines to play, as they must determine the number of times to play each machine, the order in which to play them, and whether to continue with the current machine or switch to another. In this article, we describe how an asset manager can best distribute orders among available algorithms based on realized execution cost.

## Dummy example

For each order, we take an action *a* to allocate it to one of *K* algorithms.

The value of action *a* is the expected execution cost for the algorithm.
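In the standard bandit notation of Sutton & Barto (2018), this value can be written as:

```latex
q(a) \doteq \mathbb{E}\left[\, C_t \mid A_t = a \,\right]
```

where *Cₜ* is the realized execution cost of the order at step *t* and *Aₜ* is the algorithm it was allocated to.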

Suppose that *K = 3* and that each of the three algorithms has a different expected execution cost.
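As a concrete sketch of this setup (the mean costs and noise level below are illustrative assumptions, not the article's actual figures):

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative assumption: three algorithms with these mean execution
# costs (e.g. in basis points) and a common noise level.
TRUE_MEANS = [2.0, 2.5, 3.0]
COST_STD = 1.0

def execute_order(a: int) -> float:
    """Sample a realized execution cost for algorithm a."""
    return rng.normal(TRUE_MEANS[a], COST_STD)
```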

If you knew the action values a priori, the problem would be trivial to solve: you would always select the algorithm with the lowest expected execution cost. Suppose now that we start allocating orders among the three algorithms as shown in Figure 1.

We still do not know the action values with certainty, but we do have estimates after some time *t*:
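A natural estimate is the sample average of the realized costs of each algorithm:

```latex
Q_t(a) \doteq \frac{\sum_{i=1}^{t-1} C_i \,\mathbb{1}[A_i = a]}{N_t(a)}
```

where *Nₜ(a)* denotes the number of orders allocated to algorithm *a* before time *t*.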

We can for instance construct the empirical distribution of the execution cost¹ for each algorithm, as shown in Figure 2.

Allocating all orders to the algorithm with the lowest expected execution cost may appear to be the best approach. However, doing so would prevent us from gathering information on the performance of the other algorithms. This illustrates the classical multi-armed bandit dilemma:

- Exploit the information that has already been learned
- Explore to learn which actions give the best outcomes

The objective is to **minimize the average execution cost** after allocating *N* orders:
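With *Cₜ* denoting the realized execution cost of the *t*-th order, one way to write this objective is:

```latex
\min \; \frac{1}{N} \sum_{t=1}^{N} C_t
```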

## Solving the problem using policies

To solve the problem, we need an action selection policy that tells us how to allocate each order based on current information *S*. We can define a policy as a map from *S* to *a*:
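Formally, this mapping can be sketched as:

```latex
\pi : S \rightarrow a, \qquad A_t = \pi(S_t)
```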

We discuss the most well-known policies² for the multi-armed bandit problem, which can be classified into the following categories:

- **Semi-uniform strategies:** *Greedy & ε-greedy*
- **Probability matching strategies:** *Upper-Confidence-Bound & Thompson sampling*

## Greedy

The *greedy approach* allocates all orders to the action with the lowest estimated value. This policy always exploits current knowledge to minimize immediate execution cost:
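A minimal sketch of greedy selection with an incremental sample-average update (the function names are my own):

```python
import numpy as np

def greedy(Q: np.ndarray) -> int:
    """Allocate the next order to the action with the lowest
    estimated execution cost (ties broken by lowest index)."""
    return int(np.argmin(Q))

def update(Q: np.ndarray, N: np.ndarray, a: int, cost: float) -> None:
    """Incremental sample-average update of the estimate for action a."""
    N[a] += 1
    Q[a] += (cost - Q[a]) / N[a]
```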

## ε-Greedy

The *ε-greedy approach* behaves greedily most of the time, but with probability *ε* selects randomly among the suboptimal actions:
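A sketch of this variant, which explores only among the non-greedy actions (other formulations explore uniformly over all actions):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q: np.ndarray, eps: float = 0.1) -> int:
    """Exploit the lowest-cost estimate, but with probability eps
    pick uniformly among the other (non-greedy) actions."""
    best = int(np.argmin(Q))
    if rng.random() < eps:
        others = [a for a in range(len(Q)) if a != best]
        return int(rng.choice(others))
    return best
```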

An advantage of this policy is that the value estimates converge in the limit, so the greedy choice eventually identifies the optimal action (although it is still exploited only with probability *1 − ε*).

## Upper-Confidence-Bound

The *Upper-Confidence-Bound (UCB) approach* selects the action with the lowest action value *minus* an exploration term that shrinks with the number of times the trading algorithm has been used, i.e. with *Nt(a)*. The approach thus selects among the non-greedy actions according to their potential for actually being optimal, taking into account the uncertainty in their estimates:
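Since we minimize cost, this is really a *lower*-confidence bound; a sketch, with `c` an assumed exploration weight:

```python
import numpy as np

def ucb(Q: np.ndarray, N: np.ndarray, t: int, c: float = 2.0) -> int:
    """Lower-confidence-bound selection for cost minimization:
    try every action once, then take argmin of Q minus a bonus
    that shrinks as an action is used more often."""
    if np.any(N == 0):
        return int(np.argmin(N))  # play each untried action first
    bonus = c * np.sqrt(np.log(t) / N)
    return int(np.argmin(Q - bonus))
```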

## Thompson Sampling

The *Thompson Sampling approach*, as proposed by Thompson (1933), assumes a known initial distribution over the action values and updates the distribution after each order allocation³. The approach selects actions according to their posterior probability of being the best action:
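A minimal sketch assuming Gaussian posteriors with known observation noise (a modeling choice of mine, not prescribed by the article):

```python
import numpy as np

rng = np.random.default_rng(1)

def thompson(means: np.ndarray, variances: np.ndarray) -> int:
    """Draw one plausible mean cost per action from its posterior
    and allocate the order to the lowest sampled cost."""
    draws = rng.normal(means, np.sqrt(variances))
    return int(np.argmin(draws))

def posterior_update(mean: float, var: float, cost: float,
                     noise_var: float = 1.0) -> tuple:
    """Conjugate Normal update for one action's posterior,
    given a newly observed execution cost."""
    precision = 1.0 / var + 1.0 / noise_var
    new_var = 1.0 / precision
    new_mean = new_var * (mean / var + cost / noise_var)
    return new_mean, new_var
```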

## Evaluating policies

In practice, policies are commonly evaluated on *regret*, which is the deviation from the optimal solution:

where *μ\** is the minimal execution cost mean:

Actions are a direct consequence of the policy, and we can therefore also define regret as a function of the chosen policy:
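In the notation used here, a common way to write the cumulative regret of a policy *π* after *N* orders is:

```latex
\rho_N(\pi) \doteq \mathbb{E}\left[\sum_{t=1}^{N} C_t\right] - N \mu^*,
\qquad \mu^* \doteq \min_a q(a)
```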

In Figure 3, we simulate the regret for the aforementioned policies in the dummy example. We observe that the *Upper-Confidence-Bound approach *and *Thompson sampling approach* perform best.
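A self-contained sketch of such a regret simulation (the mean costs and noise level are illustrative assumptions, and only greedy and ε-greedy are shown for brevity):

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative assumption: three algorithms with these mean costs.
MEANS = np.array([2.0, 2.5, 3.0])
BEST = MEANS.min()
N_ORDERS = 2000

def run(policy) -> np.ndarray:
    """Allocate N_ORDERS orders with `policy` and return cumulative regret."""
    Q = np.zeros(3)   # estimated mean cost per algorithm
    N = np.zeros(3)   # allocation counts
    regret = np.zeros(N_ORDERS)
    total = 0.0
    for t in range(N_ORDERS):
        a = policy(Q, N, t)
        cost = rng.normal(MEANS[a], 1.0)
        N[a] += 1
        Q[a] += (cost - Q[a]) / N[a]   # incremental sample average
        total += MEANS[a] - BEST       # expected regret of this allocation
        regret[t] = total
    return regret

greedy = lambda Q, N, t: int(np.argmin(Q))
eps_greedy = lambda Q, N, t: (int(rng.integers(3)) if rng.random() < 0.1
                              else int(np.argmin(Q)))

r_greedy, r_eps = run(greedy), run(eps_greedy)
```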

## Allocating orders? Embrace uncertainty!

The dummy example simulation results strongly indicate that relying solely on a *greedy approach* may not yield optimal outcomes. It is, therefore, crucial to incorporate and measure the uncertainty in the execution cost estimates when developing an order allocation strategy.

## Footnotes

¹ To ensure comparability of the empirical distribution of the execution cost, we need to either allocate similar orders or use order-agnostic cost metrics for evaluation.

² In situations where an algorithm's execution cost depends on the order characteristics, contextual bandits are a more suitable option. For an introduction to this approach, we recommend Section 2.9 of Sutton & Barto (2018).

³ We strongly suggest Russo et al. (2018) as an outstanding resource to learn about Thompson sampling.

## Additional resources

The following tutorials / lectures were personally very helpful for my understanding of multi-armed bandit problems.

## Industry

## Academia

## References

[1] Sutton, R. S., & Barto, A. G. (2018). *Reinforcement learning: An introduction*. MIT Press.

[2] Russo, D. J., Van Roy, B., Kazerouni, A., Osband, I., & Wen, Z. (2018). A tutorial on Thompson sampling. *Foundations and Trends® in Machine Learning*, *11*(1), 1–96.

[3] Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. *Biometrika*, *25*(3/4), 285–294.

[4] Thompson, W. R. (1935). On the theory of apportionment. *American Journal of Mathematics*, *57*(2), 450–456.

[5] Eckles, D., & Kaptein, M. (2014). Thompson sampling with the online bootstrap. *arXiv preprint arXiv:1410.4009*.