For large uncertainties, approaches based on offline robust planning tend to yield overly conservative decisions. This motivates the study of methods that mitigate uncertainties adaptively by learning the unknown environment during online interactions.
Fast online adaptation: Compared with learning in computer vision and natural language processing, online decision-making under uncertainty typically operates in a “small data” regime: the amount of data that can be collected by interacting with the environment is far more limited. We have studied how to accelerate online adaptation to model uncertainty, i.e., uncertainty in the transition dynamics, by leveraging side information or useful structure in the problem. One such situation is when an inaccurate model is available; there, we have demonstrated the benefits of using the inaccurate model to guide model-free reinforcement learning, as sketched below. Another situation is when a hypothesis set of possible models is provided.
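One common instantiation of the first situation, shown here as a minimal sketch rather than our exact algorithm, is to warm-start model-free learning from a plan computed on the inaccurate model: run value iteration on an approximate transition model `P_hat` and use the resulting Q-function to initialize tabular Q-learning. The `env` interface (`reset`/`step`) below is a hypothetical placeholder.

```python
import numpy as np

def value_iteration(P_hat, R, gamma=0.95, iters=500):
    """Plan on the (inaccurate) model P_hat to get a rough Q-function.
    P_hat[s, a, s'] is the approximate transition probability; R[s, a] the reward."""
    S, A, _ = P_hat.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        V = Q.max(axis=1)
        Q = R + gamma * P_hat @ V  # contract over s'
    return Q

def q_learning_with_prior(env, Q_init, episodes=200, alpha=0.1,
                          gamma=0.95, eps=0.1):
    """Model-free Q-learning warm-started from the model-based estimate."""
    Q = Q_init.copy()
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # eps-greedy exploration around the (initially model-guided) Q
            a = (np.random.randint(Q.shape[1]) if np.random.rand() < eps
                 else int(Q[s].argmax()))
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```

Even when `P_hat` is biased, the warm start tends to place early exploration near sensible behavior, and the model-free updates correct the residual model error over time.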
Online learning against learning opponents: Multi-agent environments pose an additional challenge to online learning. Aside from the planning (ego) agent, other self-interested agents may also adjust their strategies over time according to their own learning rules, and these rules, which are typically unknown to the ego agent, must be accounted for when designing online learning algorithms. We view the unknown learning rule as a nonparametric, uncertain dynamical system and use tools from control theory to guide the design of learning algorithms. In particular, we have studied two representative cases: two-player zero-sum games, and two-player Stackelberg games in which the (unknown) follower learns much faster than the leader.
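For the zero-sum case, the control-theoretic viewpoint can be illustrated with a small sketch (not the algorithm from our work): the ego agent runs projected gradient ascent on its payoff while the opponent's update rule, here multiplicative weights, acts as an unknown dynamical system observed only through play. The payoff matrix `A` and step sizes are illustrative.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(v - theta, 0.0)

def play_zero_sum(A, T=5000, eta_ego=0.05, eta_opp=0.1):
    """Ego (row player, maximizer) runs projected gradient ascent against
    an opponent whose update rule (here: multiplicative weights on the
    column player's losses) is unknown to the ego agent."""
    m, n = A.shape
    x = np.full(m, 1.0 / m)   # ego mixed strategy
    y = np.full(n, 1.0 / n)   # opponent mixed strategy
    x_avg = np.zeros(m)
    for _ in range(T):
        # Ego only observes its payoff gradient A @ y, not the rule for y.
        x = project_simplex(x + eta_ego * (A @ y))
        # Opponent (minimizer) adapts via multiplicative weights.
        y *= np.exp(-eta_opp * (A.T @ x))
        y /= y.sum()
        x_avg += x
    return x_avg / T, y
```

For matching pennies, `A = [[1, -1], [-1, 1]]`, the time-averaged ego strategy drifts toward the (0.5, 0.5) equilibrium even though the ego agent never sees the opponent's update rule.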
In a competitive multi-agent setting, the planning agent can counteract her opponent by creating disinformation through deception. The disinformation misleads the opponent and prevents it from deploying strategies that exploit any information advantage.
Capability deception: In a competitive setting, the planning agent may gain an advantage by capability deception, where the agent chooses to hide her capabilities in order to create a false image of weakness. Doing so can make her opponent (incorrectly) exploit the misperceived weakness and choose a suboptimal strategy against the planning agent’s true capabilities. We have studied the problem of planning with action deception in sequential decision-making, where the planning agent may choose to initially hide certain actions, only to use (and hence reveal) them when it is beneficial.
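The trade-off behind action deception can be illustrated with a simplified, single-agent surrogate; the actual problem involves an opponent best-responding to the agent's perceived capabilities, which this sketch abstracts away. The idea: plan with only the visible actions, and reveal a hidden action at a state once the value gained from the full action set there exceeds a threshold. The functions and the threshold rule below are hypothetical.

```python
import numpy as np

def q_values(P, R, actions, gamma=0.95, iters=500):
    """Q-function for the MDP restricted to a subset of actions.
    P[s, a, s'] are transition probabilities; R[s, a] rewards."""
    S = P.shape[0]
    Q = np.zeros((S, len(actions)))
    for _ in range(iters):
        V = Q.max(axis=1)
        Q = R[:, actions] + gamma * P[:, actions, :] @ V
    return Q

def deceptive_policy(P, R, hidden, gamma=0.95, threshold=0.0):
    """Illustrative rule: act with the visible actions only, and reveal
    hidden actions at state s once the value gained from the full
    action set at s exceeds `threshold`."""
    A = P.shape[1]
    visible = [a for a in range(A) if a not in hidden]
    Q_vis = q_values(P, R, visible, gamma)
    Q_all = q_values(P, R, list(range(A)), gamma)

    def act(s, revealed):
        # Advantage of the full capability set over the "weak" image.
        advantage = Q_all[s].max() - Q_vis[s].max()
        if revealed or advantage > threshold:
            return int(Q_all[s].argmax()), True   # use (and reveal) hidden actions
        return visible[int(Q_vis[s].argmax())], False

    return act
```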
Reward manipulation: Another way to exploit an unknown opponent is to manipulate the perceived reward of the opponent. In cyber defense, for instance, this can be done by introducing additional computer hosts camouflaged as valid targets in the eyes of potential attackers. When the unmanipulated reward of the opponent is known, we show that an optimal manipulation strategy can be computed by solving a mixed-integer linear program. We further extend the results to settings in which the unmanipulated reward is partially or completely unknown.
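As a hedged illustration of the known-reward case, consider a one-shot setting in which the opponent attacks the target with the highest perceived reward, and the planner allocates a budget of decoys that each inflate a target's perceived reward by one unit. A standard big-M encoding of the opponent's best response then yields a small MILP. The formulation below (using PuLP) is a sketch of this style of program, not the exact one from our work, and it breaks ties optimistically in the planner's favor.

```python
import pulp

def optimal_decoy_allocation(r, u, budget, M=1000):
    """Hypothetical MILP sketch: choose integer decoy boosts d[i] so that
    the opponent's best response z maximizes the planner's utility u[i].
    r[i]: opponent's unmanipulated reward for target i; budget: total decoys."""
    n = len(r)
    prob = pulp.LpProblem("reward_manipulation", pulp.LpMaximize)
    d = [pulp.LpVariable(f"d_{i}", lowBound=0, upBound=budget,
                         cat="Integer") for i in range(n)]
    z = [pulp.LpVariable(f"z_{i}", cat="Binary") for i in range(n)]
    # Planner's utility when the opponent best-responds to perceived rewards.
    prob += pulp.lpSum(u[i] * z[i] for i in range(n))
    prob += pulp.lpSum(z) == 1        # opponent attacks exactly one target
    prob += pulp.lpSum(d) <= budget   # decoy budget
    # Big-M constraints: z[i] = 1 only if target i has the highest
    # perceived reward r[i] + d[i].
    for i in range(n):
        for j in range(n):
            if i != j:
                prob += r[i] + d[i] >= r[j] + d[j] - M * (1 - z[i])
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [int(v.value()) for v in d], [int(v.value()) for v in z]
```

When the unmanipulated rewards `r` are only partially known, one would replace them with set-valued or distributional estimates, which is the direction of the extensions mentioned above.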