July 24, 2012 —

The famous problem is called the “multi-armed bandit.” It was first put forward as a statistical challenge in the 1930's. A gambler is in a casino facing at a row of slot machines. Each machine has a different payout. The gambler wants to find out which ones to play for maximum return, and do so with minimum losses, losses defined by a mathematical quantity called “regret.”

## Yi Gai |

The approach followed by Yi Gai, an Oakley Fellow and senior Ph.D. student in the Ming Hsieh Department of Electrical Engineering, her advisor Associate Professor Bhaskar Krishnamachari and his colleague Assistant Professor Rahul Jain improves on existing strategies for a computer pulling the levers for structured problems where the number of possible arms is very large.

Their article, entitled “Combinatorial Network Optimization With Unknown Variables: Multi-Armed Bandits With Linear Rewards and Individual Observations” is now in the IEEE/ACM Transactions on Networking website and will appear in a forthcoming print edition.

For instance, you have just moved to a new city and wish to find the fastest way to get to work from home. Each day you try a new path and observe a random delay on each road-stretch corresponding to the traffic that day. What’s the fastest way to find the fastest? The challenge for learning in this problem is that the number of possible paths may be exponentially large.

## Bhaskar Krishnamachari |

Previously, investigators had imagined a gambler embarking on a serial approach, playing each arm in turn, one or many times, tabulating the results and using this to determine which arm to play the next time. In the context of the shortest-path problem, this corresponds to treating each path as an arm, which results in a large cost of learning.

The USC researchers developed a new approach that can dramatically reduce this cost, providing a way to learn the average shortest delay path very quickly.

Unlike the previous approach, which treats each path as being independent, the USC modeling technique uses the observations of the delay on each individual road stretch on a given traversed path to simultaneously update information about the many paths that overlap with it.

The paper spells out the strategy and the benefits. “The key idea behind this algorithm is store and use observations for each random variable, rather than for each action as a whole. Since the same random variable can be observed while operating different actions this allows information gained from operation of one action to make decisions about a dependent action.” This new approach can be applied to any problem where the total cost to be minimized or reward to be maximized can be expressed as a linear combination of the average of the underlying random variables.

## Rahul Jain |

The benefit: the older single play analytics quickly become expensive — regret increases exponentially as the number of underlying unknown variables increase. But the USC multiple analyses have regret increasing only as a polynomial function of the underlying variables, the first such algorithm for these problems to do so. Crucially, it requires only a linear increase in the storage requirements for intermediate calculations in contrast to the older approaches that require storage to grow exponentially as problem size – the number of levers – increases.

The algorithm developed by USC researchers (referred to as LLR, for "Learning with Linear Rewards") has broad applicability in many settings involving smart networks, which rapidly assemble varying configurations of connections in response to changes in traffic conditions and bandwidth availability. In particular, the researchers have demonstrated its utility in improving the throughput of mobile wireless devices and computing low delay paths for packet routing on the Internet.

This work was supported in part by the U.S. Army Research Laboratory under the Network Science Collaborative Technology Alliance Agreement No. W911NF-09-2-0053 and the U.S. National Science Foundation under Award No. CNS-1049541. The work of Jain was supported by the AFOSR under Grant FA9550-10-1-0307 and the NSF under CAREER Award CNS-0954116.