# Thermodynamics as a theory of decision-making with information-processing costs

Pedro A. Ortega, Daniel A. Braun

## Abstract

Perfectly rational decision-makers maximize expected utility, but crucially ignore the resource costs incurred when determining optimal actions. Here, we propose a thermodynamically inspired formalization of bounded rational decision-making where information processing is modelled as state changes in thermodynamic systems that can be quantified by differences in free energy. By optimizing a free energy, bounded rational decision-makers trade off expected utility gains and information-processing costs measured by the relative entropy. As a result, the bounded rational decision-making problem can be rephrased in terms of well-known variational principles from statistical physics. In the limit when computational costs are ignored, the maximum expected utility principle is recovered. We discuss links to existing decision-making frameworks and applications to human decision-making experiments that are at odds with expected utility theory. Since most of the mathematical machinery can be borrowed from statistical physics, the main contribution is to re-interpret the formalism of thermodynamic free-energy differences in terms of bounded rational decision-making and to discuss its relationship to human decision-making experiments.

## 1. Introduction

In everyday life, decision-makers often have to make fast and frugal choices that are constrained by limited resources such as time, money, food, knowledge or computational effort [1–4]. Classic theories of decision-making generally ignore information-processing constraints by assuming that perfectly rational decision-makers always pick the option with maximum return, irrespective of the effort or the resources it might take to find or compute the optimal action [5–7]. Unlike perfectly rational decision-makers, bounded rational decision-makers are subject to limited information-processing resources. Starting with Simon [8–10], bounded rationality has been studied extensively in psychology, economics, political science, industrial organization, computer science and artificial intelligence research [11–18]. Numerous experiments in behavioural economics have shown that humans are bounded rational and systematically deviate from perfect rationality [19]. Here, we develop a thermodynamic model of bounded rational decision-making that can explain some of these deviations.

Previous attempts to apply thermodynamics and statistical physics to the problem of bounded rational decision-making [20–23] have focused on the Boltzmann distribution, thereby stipulating an analogy between the concepts of energy and utility: just as physical systems tend to pick states with low energy, decision-makers tend to pick states with high utility. Being perfectly rational then corresponds to physical systems with zero temperature, in which all probability mass is concentrated on the lowest energy state. In particular, quantal response equilibrium (QRE) models of bounded rationality typically assume bounded rational players whose choice probabilities are given by the Boltzmann distribution and whose rationality is determined by a temperature parameter [20,21]. Boltzmann-like stochastic choice rules have also been extensively studied in the psychological and econometric literature [24,25], in particular in the form of logit choice models going back to Luce [26], McFadden [27], Meginnis [28] and Fudenberg & Kreps [29]. In the machine learning and reinforcement learning literature [30], Boltzmann-like choice rules are known as softmax rules and are used for stochastic sampling of actions in the context of the exploration–exploitation dilemma.

In statistical physics, it is well known that the Boltzmann distribution satisfies a variational principle in the free energy $F = U - TS$, which instantiates a trade-off between the internal energy $U$ and the entropic cost $S$ [31]. These two terms have been previously related to utility and information-processing costs in QRE models of bounded rational decision-making [20–23]. In this article, we generalize these previous models of bounded rationality based on the duality between information and utility [32–34]: instead of considering absolute free energies $F$, we consider differences in free energy $\Delta F$ between an initial state and a final state corresponding to the situation before and after the deliberation associated with the decision-making process. Considering energy differences rather than absolute energies is not only physically meaningful, but it also accounts for the fact that human decision-makers have been shown to consider changes in value rather than absolute value, which is one of the cornerstones of prospect theory [35]. We will show that this seemingly innocuous extension leads to a substantial generalization that allows definition of a certainty-equivalent concept that is closely related to the physical concept of work. The simple Boltzmann distribution is still contained as a special case in the general class of exponential family distributions that satisfy a generalized variational principle in the free-energy difference. Intriguingly, this variational principle can be applied to both action and perception. As special cases, it allows not only the derivation of a number of decision-making frameworks, including expected utility theory [5–7], but also the formulation of a variational principle for (approximate) Bayesian inference, which has recently been suggested to underlie self-organizing systems [36,37].

Below we will first expand the thermodynamic intuition, then relate the thermodynamic quantities to decision-making variables and show how to apply this decision-making framework to actual decision-making experiments.

## 2. Thermodynamic intuition

In the following, we conceive information processing as changes in information states, assuming that information states can be represented as probability distributions. The archetypical example is the updating of beliefs from observations in the shape of Bayesian inference: the initial information state is given by a prior distribution, which is transformed by applying a likelihood function to a posterior distribution representing the final information state [38]. Importantly, the same idea can also be applied to the case of acting. The initial information state then corresponds to a prior or default policy before deliberation takes place, for example, given by a uniform distribution over the set of available actions. Applying a likelihood function corresponds to a deliberation process that concentrates probability mass on desirable actions. The final information state corresponds to the posterior distribution over actions, and includes the delta function over the single best action as a special case.

Bounded rationality in the case of acting comes into play when only a certain amount of change in information state can be afforded. To quantify the cost of changing information states, we can employ a thermodynamic illustration (figure 1a,b). Assume that we want to physically represent a probability distribution by means of an ideal gas particle in a box of volume $V$ with diathermal walls immersed in a heat bath at constant temperature $T$. Assume further that we want to update our information state to a new distribution over a reduced volume $V' = cV$ with $0 < c < 1$. This transformation requires physical work given by the free-energy difference

$$\Delta F = F' - F = -kT \ln\frac{V'}{V} = -kT \ln c, \tag{2.1}$$

where $k > 0$ is the Boltzmann constant and $kT$ can be interpreted as the conversion factor between one unit of information and one unit of energy for a molecule-in-a-box device [39–43]. Depending on what physical system we use to represent the distributions, this conversion factor could be higher, thus making information processing more expensive.

Figure 1.

The molecule-in-a-box device. (a) Initially, a molecule moves freely within a space of volume V delimited by two pistons. The compartments A and B correspond to the two logical states of the device. (b) Representing a different distribution with reduced uncertainty over the two states requires work by the pistons. (c) A molecule in a box with multiple compartments. (d) When changing the individual compartment volumes, the free-energy difference in each compartment can be related to the total work by equation (2.2).

While this toy example illustrates the cost of changing information states, it hides the dependence on the underlying information structure given by the partitioning of possible microstates. Therefore, we consider a modified example by introducing multiple compartments (figure 1c,d). Let each compartment $x$ have an initial volume $V(x)$ that is changed to $V'(x)$ such that $V = \sum_x V(x)$ and $V' = \sum_x V'(x)$. The initial probability of being in state $x$ is $p_0(x) = V(x)/V$ and the final probability is $p(x) = V'(x)/V'$. The free-energy difference can then be expressed as follows:

$$\Delta F = -kT \ln\frac{V'}{V} = \sum_x p(x)\,\Delta F(x) + kT \sum_x p(x) \ln\frac{p(x)}{p_0(x)}, \tag{2.2}$$

where $\Delta F(x) = -kT \ln\big(V'(x)/V(x)\big)$ represents the free-energy difference of compartment $x$. Importantly, it can be seen that the free-energy difference consists of two terms: the averaged free-energy contributions of the individual compartments and a cost term that measures the information-theoretic distance between the initial and final information state, which is then converted into units of energy.
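As a quick numerical check of the decomposition in equation (2.2), the following sketch computes the total free-energy difference $-kT\ln(V'/V)$ and compares it with the sum of the averaged compartment term and the relative-entropy term. The three compartment volumes, the function name and the choice $kT = 1$ are illustrative assumptions, not values from the text.

```python
import math

def free_energy_decomposition(v_init, v_final, kT=1.0):
    """Return (total, averaged compartment term, information cost) for eq. (2.2).

    v_init and v_final are hypothetical compartment volumes V(x) and V'(x).
    """
    V, V_new = sum(v_init), sum(v_final)
    p0 = [v / V for v in v_init]        # initial probabilities p0(x) = V(x)/V
    p = [v / V_new for v in v_final]    # final probabilities p(x) = V'(x)/V'
    total = -kT * math.log(V_new / V)   # Delta F of the whole box
    # Average of the per-compartment free-energy differences Delta F(x).
    averaged = sum(
        px * (-kT * math.log(vf / vi))
        for px, vi, vf in zip(p, v_init, v_final)
    )
    # Relative entropy between final and initial state, converted to energy units.
    information_cost = kT * sum(px * math.log(px / qx) for px, qx in zip(p, p0))
    return total, averaged, information_cost

total, averaged, information_cost = free_energy_decomposition(
    [1.0, 2.0, 3.0], [0.5, 0.5, 2.0]
)
print(total, averaged + information_cost)
```

The two printed numbers should agree up to rounding, and the information cost is non-negative because it is a relative entropy.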

### (a) The free-energy difference

As illustrated in figure 2, the isothermal transformations discussed in the two examples can be characterized in a general and abstract manner as follows.

• — An initial information state is given by a prior distribution $p_0(x)$ over some states $x \in \mathcal{X}$. Each state $x$ has associated to it an initial energy potential $\phi_0(x)$ such that $p_0(x) = (1/Z_0)\,e^{-\alpha\phi_0(x)}$ with $Z_0 = \sum_x e^{-\alpha\phi_0(x)}$.

• — A transformation is applied by adding a new potential Δϕ(x).

• — The final information state is represented by a distribution q(x) under the constraint of the combined potential ϕ(x)=ϕ0(x)+Δϕ(x).

Figure 2.

Representing a decision-maker as a thermodynamic system. The behaviour of a decision-maker exposed to a gain Δϕ=−U can be expressed as a change from an initial cost potential ϕ0 to a final cost potential ϕ, where ϕ=ϕ0−U. The choice or belief probabilities of the decision-maker change according to (3.2) from p0 to p. (Online version in colour.)

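The parameter $\alpha$ indicates the inverse temperature. The free-energy difference between the initial and final information state can be computed as follows:

$$\Delta F[q] = F[q] - F_0 = \sum_x q(x)\phi(x) + \frac{1}{\alpha}\sum_x q(x)\ln q(x) + \frac{1}{\alpha}\ln Z_0 = \sum_x q(x)\,\Delta\phi(x) + \frac{1}{\alpha}\sum_x q(x)\ln\frac{q(x)}{p_0(x)},$$

where we have used the equality $\phi_0(x) = -\frac{1}{\alpha}\ln p_0(x) - \frac{1}{\alpha}\ln Z_0$. As described in standard textbooks on statistical physics [31], the free energy $F[q]$ attains its minimum for the equilibrium distribution $p(x) = (1/Z_\phi)\,e^{-\alpha\phi(x)}$ with $Z_\phi = \sum_x e^{-\alpha\phi(x)}$. Since $\Delta F[q]$ and $F[q]$ differ only by a constant, the same minimizer is obtained for the free-energy difference. However, the free-energy difference allows us to write the equilibrium distribution in the equivalent form $p(x) = (1/Z)\,p_0(x)\,e^{-\alpha\Delta\phi(x)}$ with $Z = \sum_x p_0(x)\,e^{-\alpha\Delta\phi(x)}$ to separate the two essential ingredients: the prior information state $p_0(x)$ and the transforming potential difference $\Delta\phi(x)$. Accordingly, the free-energy difference can always be thought of as a trade-off between the energy $\sum_x q(x)\,\Delta\phi(x)$ and the information divergence measured by $\sum_x q(x)\ln\big(q(x)/p_0(x)\big)$. In §3, we are going to exploit this trade-off to model bounded rational decision-makers that reason about utility gains $U(x) := -\Delta\phi(x)$.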

## 3. Thermodynamics of decision-making

In decision theory, preferences between alternative outcomes are usually represented by a real-valued function $U$ over the set of outcomes $\mathcal{X}$, called the utility function. Among other things, this requires that the preferences between any two elements of $\mathcal{X}$ can be established and that these preferences are stable and transitive [5–7]. Given a choice over the whole set $\mathcal{X}$, a perfectly rational decision-maker will consistently choose the best outcome $x^* = \arg\max_x U(x)$, presupposing that such a unique optimum exists. However, if the set $\mathcal{X}$ is very large and the available resources to search this set are very limited, then it might not always be possible for a bounded rational decision-maker to find the best option.

As a specific example of such a search, consider drawing balls labelled by numbers $U$ from an urn with replacement such that the drawing process can be described by a probability $\mu(U)$. After drawing once, twice, thrice, etc., we observe a time series $U_1, U_2, U_3, \ldots$ of independent and identically distributed data, and our task is to find the maximum by keeping track of the largest number seen so far, which after $m$ draws is $\max\{U_1, \ldots, U_m\}$. The cumulative distribution function of choosing $v$ after $m$ draws is given by $F_m(v) = F_0(v)^m$, where $F_0$ is the cumulative distribution function of $\mu$ [44]. In the continuous limit, the associated density is given by $p_m(v) = \frac{d}{dv}F_0(v)^m = m\,F_0(v)^{m-1}\mu(v)$. Similar to other evidence accumulation schemes [45], we can then compute the log odds between any two random outcomes $v$ and $v'$ as

$$\ln\frac{p_m(v)}{p_m(v')} = \ln\frac{\mu(v)}{\mu(v')} + (m-1)\ln\frac{F_0(v)}{F_0(v')},$$

where $p_m(v)$ can be rewritten as an exponential family distribution $p_m(v) \propto \mu(v)\,e^{\beta V(v)}$ with energy levels $V(v) = \ln F_0(v)$ and the parameter $\beta = m - 1$. As $\beta$ plays the role of the number of draws $m$, it might be considered as a measure of the information-processing resource. Accordingly, the more resources $\beta$ a decision-maker spends, the more he resembles a perfectly rational decision-maker that chooses the maximum without fail, whereas for any finite value of $\beta$, some uncertainty about the maximum remains.
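The urn example can be checked numerically. The sketch below assumes, purely for illustration, that the numbers are drawn from a unit-rate exponential distribution, so that $\mu(v) = e^{-v}$ and $F_0(v) = 1 - e^{-v}$. It compares the log odds obtained from the density of the maximum of $m$ draws with the log odds of the exponential-family form $\mu(v)\,e^{\beta V(v)}$, where $V(v) = \ln F_0(v)$ and $\beta = m - 1$. All function names are hypothetical.

```python
import math

def cdf(v):
    """F0(v) for the assumed unit-rate exponential distribution."""
    return 1.0 - math.exp(-v)

def density(v):
    """mu(v), the density belonging to F0."""
    return math.exp(-v)

def density_of_max(v, m, h=1e-6):
    """Central-difference derivative of F0(v)**m, the density of the maximum of m draws."""
    return (cdf(v + h) ** m - cdf(v - h) ** m) / (2.0 * h)

def exponential_family_weight(v, m):
    """Unnormalised weight mu(v) * exp(beta * V(v)) with V = ln F0 and beta = m - 1."""
    beta = m - 1
    return density(v) * math.exp(beta * math.log(cdf(v)))

m = 5
v, v_prime = 0.7, 2.0
log_odds_max = math.log(density_of_max(v, m) / density_of_max(v_prime, m))
log_odds_family = math.log(
    exponential_family_weight(v, m) / exponential_family_weight(v_prime, m)
)
print(log_odds_max, log_odds_family)
```

The normalization constant $m$ cancels in the log odds, which is why the unnormalised exponential-family weights suffice for the comparison.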

In general, the boundedness parameter β can be thought of as a Lagrange multiplier in a constrained optimization problem. In QRE models [20–23], the Lagrange multiplier is used to constrain the entropy or mean cost, whereas here we apply it to express a constraint on the relative entropy or Kullback–Leibler (KL) divergence. By replacing the thermodynamic energy potential Δϕ(x) of §2a with the economic utility gain U(x)=−Δϕ(x), we can formulate the following variational principle in the negative free-energy difference.

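Variational principle. Given an initial information state $p_0(x)$, a resource parameter $\beta$ and a utility gain $U(x)$, the negative free-energy difference $-\Delta F = F_0 - F$ between the initial and final information state

$$-\Delta F[q] = \sum_x q(x)\,U(x) - \frac{1}{\beta}\sum_x q(x)\ln\frac{q(x)}{p_0(x)} \tag{3.1}$$

is maximized by the equilibrium distribution $q(x) = p(x)$, where

$$p(x) = \frac{1}{Z}\,p_0(x)\,e^{\beta U(x)}, \qquad Z = \sum_x p_0(x)\,e^{\beta U(x)}, \tag{3.2}$$

such that

$$-\Delta F[p] = \frac{1}{\beta}\ln Z = \frac{1}{\beta}\ln\sum_x p_0(x)\,e^{\beta U(x)}. \tag{3.3}$$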
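A minimal numerical sketch of the variational principle, assuming a hypothetical three-outcome problem with $\beta > 0$: it evaluates the objective (expected utility gain minus $1/\beta$ times the relative entropy to $p_0$) at the equilibrium distribution $p(x) \propto p_0(x)\,e^{\beta U(x)}$ and at randomly drawn alternative distributions $q$. The prior, utilities, resource parameter and function names are illustrative assumptions.

```python
import math
import random

def equilibrium(p0, U, beta):
    """Equilibrium distribution p(x) = p0(x) exp(beta U(x)) / Z and its normaliser Z."""
    weights = [p * math.exp(beta * u) for p, u in zip(p0, U)]
    Z = sum(weights)
    return [w / Z for w in weights], Z

def neg_free_energy_difference(q, p0, U, beta):
    """Expected utility gain minus (1/beta) times the KL divergence of q from p0."""
    gain = sum(qi * ui for qi, ui in zip(q, U))
    kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, p0) if qi > 0)
    return gain - kl / beta

p0 = [0.5, 0.3, 0.2]   # assumed prior choice probabilities
U = [1.0, 2.0, 0.0]    # assumed utility gains
beta = 1.5             # assumed (positive) resource parameter

p, Z = equilibrium(p0, U, beta)
optimum = neg_free_energy_difference(p, p0, U, beta)

# No randomly drawn alternative distribution should beat the equilibrium.
rng = random.Random(0)
best_alternative = -math.inf
for _ in range(1000):
    raw = [rng.random() + 1e-12 for _ in p0]
    norm = sum(raw)
    q = [r / norm for r in raw]
    best_alternative = max(
        best_alternative, neg_free_energy_difference(q, p0, U, beta)
    )

print(optimum, math.log(Z) / beta, best_alternative)
```

The first two printed values coincide, illustrating that the optimum equals the log-partition function divided by $\beta$, while the third stays below them.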

Just like its physical counterpart [31], this variational principle can be regarded in two ways: (i) as a minimum (relative) entropy principle where the expected utility gain is fixed or (ii) as a maximum utility gain principle where the (relative) entropy is fixed. The first interpretation provides a principle for estimation in the context of observer modelling under model uncertainty, where utility gains (or losses) can induce deviations from the probabilistic belief p0. The second interpretation provides a principle for bounded rational decision-making in the case of actor modelling, where the relative entropy constrains the information-processing capacity of the decision-maker. This naturally leads to a trade-off between utility gains and information-processing costs.

### (a) Choice and belief probabilities

In line with the twofold interpretation of the variational principle, the distribution (3.2) can be interpreted both as an action or observation probability. In the case of actions, $p_0$ represents the a priori choice probability of the agent that is refined to the choice probability $p$ when evaluating the imposed gain (or loss) $U$. The allowed change in probability depends on the resource parameter $\beta$ and corresponds to the search that is necessary to evaluate the gains (or losses). In the case of observations, $p_0$ represents the a priori belief of the agent given by a probabilistic model, which is then distorted because of the presence of anticipated gains (or losses) $U$ that are evaluated by the holder of the belief. This way, the agent can anticipate a collaborative ($\beta > 0$) or adversarial ($\beta < 0$) environment of (assumed) rationality $\beta$. For different values of $\beta$, the distribution (3.2) has the following limits:

$$\lim_{\beta\to\infty} p(x) = \delta(x - x^*_{\max}), \quad x^*_{\max} = \arg\max_x U(x); \qquad \lim_{\beta\to 0} p(x) = p_0(x); \qquad \lim_{\beta\to-\infty} p(x) = \delta(x - x^*_{\min}), \quad x^*_{\min} = \arg\min_x U(x).$$

In the case of actions, the three limits imply the following. The limit $\beta\to\infty$ corresponds to the perfectly rational actor that infallibly selects the action that maximizes gain $U(x)$ or minimizes loss $-U(x)$. The limit $\beta\to 0$ is an actor without resources that simply selects his action according to his prior $p_0$. The limit $\beta\to-\infty$ corresponds to an actor that is perfectly ‘anti-rational’ and always selects the action with the worst outcome. In the case of observations, the three limits correspond to an extremely optimistic observer ($\beta\to\infty$) who believes in a cooperative environment that can be essentially regarded as an extension of the agent, an extremely pessimistic observer ($\beta\to-\infty$) who anticipates only the worst-case scenario, and a risk-neutral Bayesian observer ($\beta\to 0$) who simply relies on the probabilistic model $p_0$.

### (b) The certainty equivalent

In economics, the certainty equivalent measures the value of a risky gamble in terms of a risk-free equivalent such that a decision-maker would be indifferent between the risky and the risk-free option. In statistical physics [31], the free-energy difference $\Delta F = \Delta E - Q = W$ measures the amount of available work $W$ by subtracting the heat $Q$ from the total change in energy $\Delta E$. The crucial physical intuition is that heat is a form of energy that we have uncertainty about, for example, we do not know the exact trajectories of all gas particles at a given temperature. This uncertainty implies that we do not have full control over the objects and consequently cannot make use of all the energy [43]. Work, on the other hand, is a pure form of energy uncontaminated by uncertainty and therefore fully transformable and usable. Economically speaking, the physical concept of work, and therefore also the difference in free energy, measures the certainty equivalent of an energy difference that is contaminated by uncertainty. In the following, we generalize the use of the free-energy difference to ascribe a certainty-equivalent value to gambles that are characterized by utility gains (or losses) $U$ and an initial information state $p_0$. As can be seen from (3.3), the free-energy difference between equilibrium distributions is given by the log-partition function, i.e. the logarithm of the normalization constant, $-\Delta F = \frac{1}{\beta}\ln Z$. For different values of $\beta$, the certainty equivalent takes the following limits:

$$\lim_{\beta\to\infty}\frac{1}{\beta}\ln Z = \max_x U(x), \qquad \lim_{\beta\to 0}\frac{1}{\beta}\ln Z = \sum_x p_0(x)\,U(x), \qquad \lim_{\beta\to-\infty}\frac{1}{\beta}\ln Z = \min_x U(x).$$

Again, the case $\beta\to\infty$ corresponds to the perfectly rational actor (or the extremely optimistic observer), the case $\beta\to-\infty$ corresponds to the perfectly ‘anti-rational’ actor (or the extremely pessimistic observer) and the case $\beta\to 0$ corresponds to the actor that has no resources (or the risk-neutral observer) such that the best one can expect is the expected gain or loss. For illustration, see figure 3.
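The three limits of the certainty equivalent $\frac{1}{\beta}\ln\sum_x p_0(x)\,e^{\beta U(x)}$ can be illustrated with a small sketch. The prior and utilities below are hypothetical, and very large or very small finite values of $\beta$ stand in for the limits. The exponent is shifted before exponentiation, a standard log-sum-exp precaution against overflow.

```python
import math

def certainty_equivalent(p0, U, beta):
    """(1/beta) * ln sum_x p0(x) exp(beta U(x)); the expected utility for beta == 0."""
    if beta == 0.0:
        return sum(p * u for p, u in zip(p0, U))
    shift = max(beta * u for u in U)  # log-sum-exp shift for numerical stability
    total = sum(p * math.exp(beta * u - shift) for p, u in zip(p0, U))
    return (shift + math.log(total)) / beta

p0 = [0.5, 0.2, 0.3]   # assumed prior
U = [1.0, 2.0, 0.5]    # assumed utility gains

ce_rational = certainty_equivalent(p0, U, 1e4)      # approaches max U
ce_neutral = certainty_equivalent(p0, U, 1e-6)      # approaches the expected utility
ce_pessimistic = certainty_equivalent(p0, U, -1e4)  # approaches min U
expected_utility = certainty_equivalent(p0, U, 0.0)

print(ce_pessimistic, ce_neutral, ce_rational)
```

For intermediate values of $\beta$ the certainty equivalent lies between the worst and the best outcome and increases with $\beta$.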

Figure 3.

(a) Negative free-energy difference $-\Delta F$ versus the resource parameter $\beta$. The resource parameter allows modelling decision-makers with bounded resources, either when generating their own actions ($\beta_3 > 0$) or when anticipating an adversarial environment ($\beta_1 < 0$). The negative free-energy difference corresponds to the certainty equivalent. (b) Distribution over the outcomes depending on the resource parameter $\beta$. For large positive $\beta$, the distribution concentrates on the outcome with maximum gain $\max_x U(x)$. For large negative $\beta$, the distribution concentrates on the worst outcome with gain $\min_x U(x)$. For $\beta_2 = 0$, the outcomes follow the given distribution $p_0$. (Online version in colour.)

### (c) Sequential decision-making

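The formalism for bounded rational decision-making can also be extended to sequential decision-making where a vector of random variables $x_{\le T} = x_1 x_2 \cdots x_T$ has to be determined by consecutively drawing from distributions $p(x_t \mid x_{<t})$ that depend on the history $x_{<t} = x_1 \cdots x_{t-1}$ [46]. Each history $x_{<t}$ defines a decision node with respect to the variable $x_t$ and can be characterized by an initial information state $p_0(x_t \mid x_{<t})$, a utility gain $U(x_t \mid x_{<t})$ and a Lagrange multiplier $\beta(x_{<t})$. The latter allows assigning node-specific information-processing resources. The negative free-energy difference $-\Delta F = F_0 - F$ between the initial information state and the final information state $q(x_{\le T})$ is then

$$-\Delta F[q] = \sum_{x_1} q(x_1 \mid x_{<1})\Big[U(x_1 \mid x_{<1}) - \tfrac{1}{\beta(x_{<1})}\ln\tfrac{q(x_1 \mid x_{<1})}{p_0(x_1 \mid x_{<1})} + \cdots + \sum_{x_T} q(x_T \mid x_{<T})\Big[U(x_T \mid x_{<T}) - \tfrac{1}{\beta(x_{<T})}\ln\tfrac{q(x_T \mid x_{<T})}{p_0(x_T \mid x_{<T})}\Big]\cdots\Big]. \tag{3.4}$$

This negative free-energy difference has a nested structure where the latest time step forms the innermost variational problem that can be solved first, and all other variational problems of the previous time steps can be solved recursively by working backwards in time. This leads to a recursive solution for the equilibrium distribution, where $Z(x_{\le T}) = 1$ (so that the last term in the bracket below vanishes at $t = T$) and where for all $t \le T$,

$$p(x_t \mid x_{<t}) = \frac{1}{Z(x_{<t})}\,p_0(x_t \mid x_{<t})\exp\Big\{\beta(x_{<t})\Big[U(x_t \mid x_{<t}) + \frac{1}{\beta(x_{\le t})}\ln Z(x_{\le t})\Big]\Big\} \tag{3.5}$$

with the certainty equivalent

$$\frac{1}{\beta(x_{<t})}\ln Z(x_{<t}) = \frac{1}{\beta(x_{<t})}\ln\sum_{x_t} p_0(x_t \mid x_{<t})\exp\Big\{\beta(x_{<t})\Big[U(x_t \mid x_{<t}) + \frac{1}{\beta(x_{\le t})}\ln Z(x_{\le t})\Big]\Big\} \tag{3.6}$$

such that $-\Delta F[p] = \frac{1}{\beta(x_{<1})}\ln Z(x_{<1})$, where $x_{<1}$ denotes the empty history. Depending on how the resource parameters $\beta(x_{<t})$ are set, a range of different decision-making schemes can be recovered.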

• — KL control. For state-dependent loss functions $U(x_t)$, Markov probabilities $p_0(x_t \mid x_{t-1})$ and uniform $\beta(x_{<t}) = \beta$, the KL-control framework [47–50] can be recovered, where (3.6) simplifies to a recursion equivalent to z-iteration [49,51].

• — Optimal control. For $\beta(x_{<t}) \to \infty$ at all action nodes and $\beta(x_{<t}) \to 0$ at all observation nodes, (3.6) simplifies to the Bellman optimality equations [52] of the perfectly rational decision-maker in a stochastic environment.

• — Risk-sensitive control. For $\beta(x_{<t}) \to \infty$ at all action nodes and $\beta(x_{<t}) = \beta \neq 0$ at all observation nodes, the framework of risk-sensitive control [53] can be recovered, where a decision-maker allows for model uncertainty in his observational beliefs.

• — Robust control. For $\beta(x_{<t}) \to \infty$ at all action nodes and $\beta(x_{<t}) \to -\infty$ at all observation nodes, robust control and minimax games can be recovered, where in either case the decision-maker makes worst-case assumptions about his environment [54,55].

Robust control and minimax problems have long been known to be related to risk-sensitive control [56,57]. Risk-sensitive KL control was previously studied in [58]. See [33,51] for more details on the relationships between these frameworks.
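The backward recursion for the certainty equivalent can be sketched on a small decision tree. The example below is a hypothetical two-step problem: an action node chooses between a risky and a safe option, each followed by an observation node with two equally likely outcomes. The tree encoding, utilities and function names are assumptions made for illustration. Leaves contribute a certainty equivalent of zero, which plays the role of the boundary condition $Z = 1$ at the final step, and the limits $\beta\to\pm\infty$ and $\beta = 0$ are handled explicitly as max, min and expectation.

```python
import math

def soft_value(beta, branches):
    """(1/beta) ln sum p0 exp(beta v), with the limits beta = +/-inf and beta = 0."""
    support = [(p, v) for p, v in branches if p > 0]
    if beta == math.inf:
        return max(v for _, v in support)
    if beta == -math.inf:
        return min(v for _, v in support)
    if beta == 0.0:
        return sum(p * v for p, v in support)
    shift = max(beta * v for _, v in support)  # log-sum-exp shift
    total = sum(p * math.exp(beta * v - shift) for p, v in support)
    return (shift + math.log(total)) / beta

def certainty_equivalent(node):
    """Solve the nested problem backwards: the innermost nodes are evaluated first."""
    if node is None:  # leaf: no future value
        return 0.0
    branches = [
        (p0, gain + certainty_equivalent(child))
        for p0, gain, child in node["branches"]
    ]
    return soft_value(node["beta"], branches)

def two_step_problem(beta_action, beta_observation):
    """Hypothetical tree: each branch is (prior probability, utility gain, child node)."""
    risky = {"beta": beta_observation,
             "branches": [(0.5, 0.0, None), (0.5, 10.0, None)]}
    safe = {"beta": beta_observation,
            "branches": [(0.5, 4.0, None), (0.5, 4.5, None)]}
    return {"beta": beta_action,
            "branches": [(0.5, 0.0, risky), (0.5, 0.0, safe)]}

optimal = certainty_equivalent(two_step_problem(math.inf, 0.0))
risk_sensitive = certainty_equivalent(two_step_problem(math.inf, -1.0))
robust = certainty_equivalent(two_step_problem(math.inf, -math.inf))
print(optimal, risk_sensitive, robust)
```

In this toy problem the optimal-control setting prefers the risky option (expected gain 5), whereas the robust setting prefers the safe option (worst-case gain 4); the risk-sensitive value lies in between.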

## 4. Application to decision-making experiments

In the following, we will discuss the thermodynamic model of bounded rational decision-making in relation to four different kinds of decision-making experiments on humans that have shown systematic deviations from expected utility theory: (i) the Ellsberg experiment, (ii) the Allais experiment, (iii) perceptual decision-making experiments and (iv) risk-sensitive sensorimotor integration experiments. The first two experiments come from behavioural economics, whereas the last two experiments are taken from behavioural neuroscience.

### (a) Application to Ellsberg’s paradox

In his famous experiment, Ellsberg showed that humans systematically violate expected utility theory because they are averse to ambiguity, that is, model uncertainty as opposed to known risk uncertainty [59]. Ambiguity aversion can be modelled by variational preferences [60] that can be straightforwardly incorporated by a bounded rational decision-maker. Consider, for example, an urn that contains 90 balls, 30 of which are red, and the remaining 60 balls are black or yellow in an unknown proportion. While the uncertainty with respect to red constitutes a known risk, we have ambiguity regarding the proportion of black and yellow, as there are many possible distributions. We are now presented with the following gambles:

• — Gamble 1: receive $100 if a red ball is drawn.

• — Gamble 2: receive $100 if a black ball is drawn.

• — Gamble 3: receive $100 if a red or yellow ball is drawn.

• — Gamble 4: receive $100 if a black or yellow ball is drawn.

Most people prefer 1 over 2 and 4 over 3, a pattern that is inconsistent with expected utility theory because there is no single probability $\lambda$ for the proportion of black balls that can explain both choices. This can be seen as follows. In the following, we assume $U(\$0) = 0$ to simplify the argument, although it would, in general, suffice to assume $U(\$100) > U(\$0)$. Preferring 1 over 2 implies $EU_1 > EU_2$, that is, in our case

$$\tfrac{1}{3}\,U(\$100) > \lambda\,U(\$100),$$

which implies $\lambda < \tfrac{1}{3}$. Preferring 4 over 3, however, implies $EU_4 > EU_3$, that is,

$$\tfrac{2}{3}\,U(\$100) > (1 - \lambda)\,U(\$100),$$

which implies $\lambda > \tfrac{1}{3}$, and therefore leads to a contradiction. The choice pattern is, however, consistent with a decision-maker that is averse to ambiguity. Such an ambiguity-averse decision-maker can be constructed by assuming an adversarial environment with rationality $\beta < 0$. The crucial observation is that this leads to a distortion of our beliefs about the hidden variable $\lambda$ that takes on different values for all possible scenarios, that is, all the different proportions of black and yellow. We can then construct the certainty equivalent $CE_i$ of gamble $i$ as follows:

$$CE_i = \frac{1}{\beta}\ln\int d\lambda\; p_0(\lambda)\exp\Big\{\beta\sum_x p_0(x \mid \lambda)\,U_i(x)\Big\},$$

where $U_i(x)$ is the utility that gamble $i$ pays when colour $x$ is drawn, $p_0(x{=}R \mid \lambda) = \tfrac{1}{3}$, $p_0(x{=}B \mid \lambda) = \lambda$ and $p_0(x{=}Y \mid \lambda) = \tfrac{2}{3} - \lambda$. Assuming a uniform prior probability density for $\lambda \in [0, \tfrac{2}{3}]$ and assuming $U(\$0) = 0$, one can easily check that $CE_1 > CE_2$ and $CE_3 < CE_4$.
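The ambiguity-averse preference pattern can be reproduced numerically under stated assumptions. The sketch below assumes the standard three-colour Ellsberg gambles (gamble 1 pays on red, 2 on black, 3 on red or yellow, 4 on black or yellow), a uniform prior over the probability $\lambda \in [0, 2/3]$ of drawing black, $U(\$100) = 1$, $U(\$0) = 0$, and an adversarial environment with the arbitrary value $\beta = -3$. The certainty equivalent is taken as $\frac{1}{\beta}\ln$ of the prior average of $\exp\{\beta\,\mathbb{E}[U_i \mid \lambda]\}$, evaluated with a midpoint rule; this specific functional form is an assumption of the sketch.

```python
import math

# Utility paid by each gamble for each colour, with U($100) = 1 and U($0) = 0.
GAMBLES = {
    1: {"R": 1.0, "B": 0.0, "Y": 0.0},
    2: {"R": 0.0, "B": 1.0, "Y": 0.0},
    3: {"R": 1.0, "B": 0.0, "Y": 1.0},
    4: {"R": 0.0, "B": 1.0, "Y": 1.0},
}

def expected_utility(payoff, lam):
    """Expected utility of a gamble given the probability lam of drawing black."""
    probs = {"R": 1.0 / 3.0, "B": lam, "Y": 2.0 / 3.0 - lam}
    return sum(probs[colour] * payoff[colour] for colour in probs)

def certainty_equivalent(payoff, beta, steps=10000):
    """(1/beta) ln of the uniform-prior average of exp(beta * E[U | lam])."""
    width = 2.0 / 3.0
    average = 0.0
    for k in range(steps):
        lam = (k + 0.5) * width / steps  # midpoint rule on [0, 2/3]
        average += math.exp(beta * expected_utility(payoff, lam)) / steps
    return math.log(average) / beta

beta = -3.0  # assumed adversarial rationality
ce = {i: certainty_equivalent(payoff, beta) for i, payoff in GAMBLES.items()}
print(ce)
```

Gambles 1 and 4 do not depend on $\lambda$, so their certainty equivalents equal their expected utilities of 1/3 and 2/3, whereas the ambiguous gambles 2 and 3 are valued below their prior-averaged expected utilities.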

### (b) Application to Allais’ paradox

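Allais’ experiment reveals another systematic violation of expected utility theory, which stipulates that the addition or removal of common consequences to two lotteries should not reverse a decision-maker’s preferences [61]. Yet, such reversals are frequently observed in gambles like the following [62,63]:

• — Gamble 1: $2500 with probability 0.33, $2400 with probability 0.66, $0 with probability 0.01.

• — Gamble 2: $2400 with certainty.

• — Gamble 3: $2500 with probability 0.33, $0 with probability 0.67.

• — Gamble 4: $2400 with probability 0.34, $0 with probability 0.66.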

Most people prefer 2 over 1 and 3 over 4, even though gambles 1 and 2 can be obtained from gambles 3 and 4 by adding the common consequence of $0.66 \cdot \$2400$. This can be seen as follows. We will again assume that $U(\$0) = 0$ to simplify the argument. Preferring 3 over 4 implies $EU_3 > EU_4$, that is,

$$0.33\,U(\$2500) > 0.34\,U(\$2400).$$

However, preferring 2 over 1 implies $EU_2 > EU_1$, that is,

$$U(\$2400) > 0.33\,U(\$2500) + 0.66\,U(\$2400),$$

which implies $0.34\,U(\$2400) > 0.33\,U(\$2500)$, and leads to a contradiction.
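The contradiction can also be checked by brute force. The sketch below assumes the commonly used version of the Allais gambles (gamble 1: $2500 with probability 0.33, $2400 with 0.66, $0 with 0.01; gamble 2: $2400 for sure; gamble 3: $2500 with 0.33, $0 with 0.67; gamble 4: $2400 with 0.34, $0 with 0.66), fixes $U(\$0) = 0$ and $U(\$2500) = 1$, and scans a grid of candidate values for $U(\$2400)$. The grid and the function names are illustrative.

```python
# Each gamble is a list of (payoff in dollars, probability) pairs.
GAMBLES = {
    1: [(2500, 0.33), (2400, 0.66), (0, 0.01)],
    2: [(2400, 1.0)],
    3: [(2500, 0.33), (0, 0.67)],
    4: [(2400, 0.34), (0, 0.66)],
}

def expected_utility(gamble, utility):
    """Expected utility of a gamble under a payoff -> utility mapping."""
    return sum(prob * utility[payoff] for payoff, prob in gamble)

def explains_observed_pattern(u_2400):
    """True if some expected-utility maximiser prefers 2 over 1 AND 3 over 4."""
    utility = {0: 0.0, 2400: u_2400, 2500: 1.0}
    eu = {i: expected_utility(g, utility) for i, g in GAMBLES.items()}
    return eu[2] > eu[1] and eu[3] > eu[4]

# Scan candidate utilities for $2400 between U($0) = 0 and U($2500) = 1.
candidates = [k / 1000.0 for k in range(1001)]
consistent = [u for u in candidates if explains_observed_pattern(u)]
print(consistent)
```

The list of consistent utilities is empty: preferring 2 over 1 requires $0.34\,U(\$2400) > 0.33$, whereas preferring 3 over 4 requires the opposite inequality.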

Usually, this reversal is explained by distortions of the probabilities in cumulative prospect theory [35,64]. An alternative explanation of the reversal has been proposed in [65] by introducing a non-vanishing weighting function $a(x)$ in a generalized quasi-linear mean model [65–68] such that the certainty equivalent has the form $f^{-1}\big(\sum_x p_0(x)\,a(x)\,f(U(x)) \,/\, \sum_x p_0(x)\,a(x)\big)$ for a strictly monotonic function $f$. Such a weighting function could also be constructed for the negative free-energy certainty equivalent by defining a generalized difference

$$-\Delta F = \frac{1}{\beta}\ln\frac{\sum_x p_0(x)\,a(x)\,e^{\beta U(x)}}{\sum_x p_0(x)\,a(x)}, \tag{4.1}$$

where $a(x) = e^{\beta V(x)}$ introduces a bias. In the example, by setting $U(\$2500) = 1$, $U(\$0) = 0$ and $V(\$2500) = V(\$0) = 0$, the reversal can be achieved by suitably setting the remaining parameters $U(\$2400)$ and $V(\$2400)$ and assuming again $\beta < 0$. The reversal cannot be explained without the biasing term $a(x)$, unless one assumes a change of $\beta$ between the two choices. The biasing term can be regarded both as biasing the probabilities $p_0$ and as distorting the utility gains $U$. However, as in prospect theory, there is no explanation why this biasing term should occur from a normative point of view.

### (c) Application to perceptual decision-making

In perceptual decision-making experiments, subjects typically have to judge on very short time scales whether one stimulus is bigger or brighter than another, whether a stimulus is moving up or down, left or right, and so forth. Among the most widespread perceptual decision-making experiments are the random-dot motion experiments [69], where a cloud of points moves on a screen. While a certain percentage of points moves coherently either up or down, the remainder of the points move in random directions. The task is to decide whether the coherent movement is up or down. Depending on the level of coherence, this decision can be easy or difficult, which is reflected in subjects’ reaction times, a proxy for their computational resources.

Choice probabilities and reaction times in random-dot motion tasks are usually modelled by diffusion-to-bound models [45]. In the basic model, there is a binary choice between option A and option B associated with values $U_A$ and $U_B$, respectively, such that the difference in value is given by $\mu = U_A - U_B$. Crucially, the difference in value is not known a priori and at each moment of time, there is only a noisy sample of $\mu$ available. By observing many samples, however, more evidence can be accumulated, which is modelled by a diffusion process where $\mu$ corresponds to the drift and $\sigma$ to the diffusion constant. The process starts at zero and ends when it hits one of two bounds $\theta$ or $-\theta$, where each bound corresponds to one of the two options. This leads to a speed–accuracy trade-off: the further away the bounds, the more probable it is to reach the bound that corresponds to the option with a higher value, but the longer it will take, as many little diffusion steps will be needed to reach a distant bound.

If we assume a prior probability $p_0$ for choosing A, and hence $1 - p_0$ for the prior choice probability of B, we can incorporate this information through the setting of the bound $\theta_A$ associated with option A. This bound can have either a positive or a negative sign, which means that choosing A could correspond to crossing the upper ($\theta_A > 0$) or lower ($\theta_A < 0$) bound. The resulting probability $P(A)$ of deciding for option A, with $P(B) = 1 - P(A)$, corresponds exactly to the formulae suggested to describe bounded rational decision-making; for two options, (3.2) reads $P(A) = p_0\,e^{\beta U_A} / \big(p_0\,e^{\beta U_A} + (1 - p_0)\,e^{\beta U_B}\big)$. It can also be shown that for large $\beta$, the average duration $\langle T \rangle$ of the diffusion is approximately proportional to $\beta$. Consequently, the resource parameter $\beta$ can also be seen as a proxy of the average computation time.
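For the special case of a uniform prior and symmetric bounds, the correspondence can be sketched with standard results for a Brownian motion with drift $\mu$ and diffusion constant $\sigma$ that starts at zero between absorbing bounds $\pm\theta$: the probability of hitting the upper bound is $1/(1 + e^{-2\mu\theta/\sigma^2})$ and the mean hitting time is $(\theta/\mu)\tanh(\mu\theta/\sigma^2)$. Identifying $\beta = 2\theta/\sigma^2$ is an assumption of this sketch and not necessarily the parametrization intended in the text; the numerical values are arbitrary.

```python
import math

def prob_upper_bound(mu, sigma, theta):
    """Probability that drifted Brownian motion from 0 hits +theta before -theta."""
    return 1.0 / (1.0 + math.exp(-2.0 * mu * theta / sigma ** 2))

def mean_decision_time(mu, sigma, theta):
    """Mean time until one of the bounds +/-theta is hit (requires mu != 0)."""
    return (theta / mu) * math.tanh(mu * theta / sigma ** 2)

def softmax_choice(u_a, u_b, beta):
    """Bounded rational choice probability of A under a uniform prior."""
    return math.exp(beta * u_a) / (math.exp(beta * u_a) + math.exp(beta * u_b))

u_a, u_b, sigma, theta = 1.0, 0.5, 1.0, 20.0  # assumed illustrative values
mu = u_a - u_b                                # drift = difference in value
beta = 2.0 * theta / sigma ** 2               # assumed identification of beta

p_diffusion = prob_upper_bound(mu, sigma, theta)
p_softmax = softmax_choice(u_a, u_b, beta)
time_per_beta = mean_decision_time(mu, sigma, theta) / beta
print(p_diffusion, p_softmax, time_per_beta)
```

Under these assumptions the two choice probabilities coincide, and for distant bounds the mean decision time grows linearly in $\beta$ with slope $\sigma^2/(2\mu)$.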

### (d) Application to sensorimotor control

Sensorimotor behaviours can often be described by optimality principles that take into account performance criteria such as energy requirements, endpoint accuracy, smoothness of movement trajectories and other task-related criteria [70,71]. Therefore, choosing how to move an effector can also be considered as a decision-making problem [72]. While many previous studies have investigated the optimization of expected movement costs to describe behaviour, a number of recent studies have found risk-sensitive deviations from expected utility theory in sensorimotor control [73–77].

As already outlined in §3c, risk-sensitive decision-makers optimize a stress function of the form $\frac{1}{\beta}\ln\mathbb{E}\big[e^{\beta U}\big]$, where the utility $U = -C(u)$ is expressed as a cost $C(u)$ that depends on a control command $u$ [53]. Risk-sensitive decision-makers do not simply maximize the expectation of the utility, but also consider higher-order cumulants. A risk-averse decision-maker ($\beta < 0$) discounts variability, whereas a risk-seeking decision-maker ($\beta > 0$) adds value to the expected utility in the face of variability. In terms of bounded rationality, one could regard the environment also as a bounded rational agent of rationality $\beta$ that can be either adversarial or collaborative. In fact, a collaborative environment of rationality $\beta$ is mathematically equivalent to a decision-maker with rationality $\beta$ and can therefore be regarded as an extension of the decision-maker. If the real environment is partially unknown, risk-sensitivity can be used as a tool to consider model uncertainty. Boundedness in this case is the lack of information about the adequacy of the model. Thus, risk sensitivity can bias the beliefs about the environment optimistically (collaborative environment) or pessimistically (adversarial environment).
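The role of higher-order cumulants can be made concrete with the small-$\beta$ expansion $\frac{1}{\beta}\ln\mathbb{E}[e^{\beta U}] \approx \mathbb{E}[U] + \frac{\beta}{2}\,\mathrm{Var}[U]$. The sketch below checks this expansion on a hypothetical three-outcome utility distribution; the probabilities, utilities and function name are illustrative, and the convention of working with a utility $U$ (the negative of a cost) is an assumption.

```python
import math

def stress_function(probs, utilities, beta):
    """(1/beta) * ln E[exp(beta U)] for a discrete utility distribution."""
    return math.log(
        sum(p * math.exp(beta * u) for p, u in zip(probs, utilities))
    ) / beta

probs = [0.2, 0.5, 0.3]       # assumed outcome probabilities
utilities = [0.0, 1.0, 2.0]   # assumed utilities (negative costs)

mean = sum(p * u for p, u in zip(probs, utilities))
variance = sum(p * (u - mean) ** 2 for p, u in zip(probs, utilities))

for beta in (-0.01, 0.01):
    exact = stress_function(probs, utilities, beta)
    second_order = mean + 0.5 * beta * variance
    print(beta, exact, second_order)

risk_averse = stress_function(probs, utilities, -1.0)
risk_seeking = stress_function(probs, utilities, 1.0)
```

For negative $\beta$ the variance term lowers the value below the expected utility, and for positive $\beta$ it raises it, in line with the risk-averse and risk-seeking readings above.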

Both the effect of sensitivity to variance in the utility and the distortion of beliefs as a consequence of model uncertainty have been recently reported. In [74], subjects had to choose between hitting different targets whose size could be varied. Succeeding or failing to hit a target required a second movement with varying effort. By these two variations, ‘motor lotteries’ with different degrees of mean and variance in the motor effort could be created. For equal means, subjects were sensitive to this variance. In [77], subjects had to integrate prior information about the position of a target with noisy feedback information. The beliefs about the target position were indicated with a robot handle that required different effort levels for indicating different beliefs. Crucially, in the absence of uncertainty about the target position, these effort levels did not affect behaviour. However, the more uncertainty subjects had about the target position, the more they tended to deviate from the Bayes optimal belief—a signature of model uncertainty.

## 5. Discussion

In the proposed thermodynamic interpretation of bounded rationality, decision-makers can be thought of as thermodynamic systems abstractly represented by probability distributions. When information processing takes place, these distributions change. Physically, one can imagine the change in distribution as a consequence of imposing a new energy potential. The expected difference in the potential corresponds to a utility gain in economic choice. However, changing states is also costly. In thermodynamic systems, the KL divergence provides a natural measure for the costs that arise due to the changes in the probability distributions. The resulting trade-off between utility gains and resource costs provides a variational principle for bounded rational decision-making in the shape of a negative free-energy difference. The adequacy of this framework can be demonstrated in a number of decision-making experiments from behavioural economics and neuroscience.

The variational principle in the negative free-energy difference generalizes previous QRE models of bounded rationality, which assume bounded rational players whose choice probabilities are given by the Boltzmann distribution [20–23], and whose temperature parameters can be interpreted as a strategic choice of preferences [78]. The QRE model can be obtained as a special case of the model presented here where all prior probabilities are assumed to be uniform. However, unlike in the QRE model, these prior probabilities have to be explicitly stated when considering the difference in free energy rather than the free energy itself. These prior probabilities are crucial when defining the certainty equivalent in terms of a finite log-partition sum that ranges from minimum to maximum and includes the expected utility as a limit case. As the certainty equivalent corresponds to physical work, this also allows relating bounded rational decision-making to thermodynamic processes. The distinction between a prior policy and a utility is fundamental to the notion of bounded rationality proposed in this article, which ultimately allows extending the explanatory power of bounded rationality beyond the realm of QRE models.
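The limiting behaviour of the certainty equivalent can be illustrated with a few lines of code. The sketch assumes the certainty equivalent is (1/β) log Σ p0(x) exp(βU(x)) and that β may take either sign; the helper name `certainty_equivalent` and the numbers are hypothetical.

```python
import math

def certainty_equivalent(prior, utility, beta):
    """(1/beta) * log sum_x p0(x) exp(beta * U(x)), computed stably."""
    # Shift by the dominant utility so that every exponent is <= 0.
    shift = max(utility) if beta > 0 else min(utility)
    s = sum(p0 * math.exp(beta * (u - shift)) for p0, u in zip(prior, utility))
    return shift + math.log(s) / beta

prior = [0.5, 0.3, 0.2]
utility = [1.0, 2.0, 3.0]
expected_utility = sum(p0 * u for p0, u in zip(prior, utility))

ce_near_zero = certainty_equivalent(prior, utility, 1e-6)    # close to the expected utility
ce_large_pos = certainty_equivalent(prior, utility, 200.0)   # approaches max U
ce_large_neg = certainty_equivalent(prior, utility, -200.0)  # approaches min U
```

The prior enters only through the weights inside the logarithm, which is why the value stays finite and bounded by the smallest and largest utility for any choice of β.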

### (a) Bounded rationality

Simon [8] proposed in the 1950s that bounded rational decision-makers do not commit to an unlimited optimization by searching for the absolute best option. Rather, they follow a strategy of satisficing, i.e. they settle for an option that is good enough in some sense. Since then, it has been debated whether satisficing decision-makers can be described as bounded rational decision-makers that act optimally under resource constraints or whether optimization is a wrong concept altogether [16]. If decision-makers did indeed explicitly attempt to solve such a constrained optimization problem, this would lead to an infinite regress, and the paradoxical situation that a bounded rational decision-maker would have to solve a more complex (i.e. constrained) optimization problem than a perfectly rational decision-maker.

To resolve this paradox, the bounded rational decision-maker must not be able to reason about his constraints. He just searches randomly for the best option, until his resources run out. An observer will then be able to assign a probability distribution to the decision-maker’s choices and investigate how this probability distribution changes depending on the available resources. Consider, for example, an anytime algorithm that will compute a solution more and more precisely the more time it has at its disposal. As one does not want to wait forever for an answer, the anytime computation will be interrupted at some point where one assumes that the answer is going to be good enough. This concept of satisficing can be used to interpret equation (3.2) in terms of an anytime search, as illustrated with the example of finding the maximum by drawing from an urn with replacement.
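The anytime-search picture can be made concrete with a small exact calculation. The sketch below assumes a searcher who draws n options with replacement from the prior and keeps the best one seen; it is an illustration of the idea and not necessarily the construction used in the article. Utilities are assumed to be distinct, and all names are hypothetical.

```python
def best_of_n_distribution(prior, utility, n):
    """Distribution of the option kept after n draws with replacement,
    when the searcher keeps the highest-utility draw (distinct utilities assumed)."""
    order = sorted(range(len(utility)), key=lambda i: utility[i])
    dist = [0.0] * len(utility)
    cdf_below = 0.0
    for i in order:
        cdf_upto = cdf_below + prior[i]
        # P(best draw is option i) = P(all draws <= i) - P(all draws < i)
        dist[i] = cdf_upto ** n - cdf_below ** n
        cdf_below = cdf_upto
    return dist

def expected_utility(p, utility):
    return sum(pi * u for pi, u in zip(p, utility))

prior = [0.5, 0.3, 0.2]
utility = [1.0, 2.0, 3.0]

# More search time (larger n) shifts probability mass towards better options.
budgets = (1, 2, 5, 20)
values = [expected_utility(best_of_n_distribution(prior, utility, n), utility)
          for n in budgets]
```

With a single draw the searcher simply reproduces the prior; as the budget grows, the observer's distribution over choices concentrates on the maximum, without the searcher ever reasoning about its own constraints.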

### (b) Information theory in control and game theory

As already outlined in §3c, a number of studies have recently suggested the use of the relative entropy as a cost function for control [48–50,79]. In [48,49], the transition probabilities of a Markov decision process are controlled directly, and the control costs are given by the KL divergence of the manipulated state transition probabilities with respect to a baseline distribution that describes the passive dynamics. In the sequential decision-making setup proposed in §3c, this KL control corresponds to the case where all random variables are action variables with boundedness parameter β. The stochasticity in this case, however, is not thought to arise from environmental passive dynamics as in the KL-control literature, but rather is a direct consequence of bounded rational control in a (possibly) deterministic environment. The continuous case of KL control relies on the formalism of path integrals [47,80], but essentially the same ideas can be applied [51]. This has inspired the formulation of optimal control problems as inference problems [50,81].

Previously, Saridis [82] has framed optimal and adaptive control as entropy minimization problems. Statistical physics has also served as an inspiration to a number of other studies, for example, to an information-theoretic approach to interactive learning [83], to the use of information theory to approximate joint strategies in games with bounded rational players [22] and to the problem of optimal search [84,85], where the utility losses correspond directly to search effort. Recently, Tishby & Polani [86] applied information-theoretic reasoning to understand information flow in the action–observation cycle. The contribution of our study is to extend previous models of bounded rationality by exploiting a thermodynamically motivated variational principle that trades off utility gains and information-processing costs and to apply this principle to human decision-making experiments. In the future, it will be interesting to relate the information-processing resource costs of bounded rational agents to more traditional notions of resource costs in computer science such as space and time requirements of algorithms [87].

### (c) Variational preferences

In the economic literature, the KL divergence has appeared in the context of multiplier preference models that can deal with model uncertainty [55]. In particular, it has been proposed that a bound on the KL divergence could be used to indicate how much of a deviation from a proposed model p0 is allowed when computing robust decision strategies that work under a range of models in the neighbourhood of p0. In variational preference models [60], this is generalized to models of the form $f \succeq g \iff \min_p \left( \int u(f)\,\mathrm{d}p + c(p) \right) \geq \min_p \left( \int u(g)\,\mathrm{d}p + c(p) \right)$, where c(p) can be interpreted as an ambiguity index that can explain effects of ambiguity aversion. The thermodynamic certainty equivalent of work, computed as the log-partition sum, also falls within this preference model. However, an important difference is that the choice in a thermodynamic system is not deterministic with respect to the certainty equivalent, but stochastic following a generalized Boltzmann distribution. Owing to this stochasticity of the choice behaviour itself, the thermodynamic model can be linked to both bounded rationality and model uncertainty, whereas variational preference models have so far concentrated on explaining effects of ambiguity aversion and model uncertainty.
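One way to see why the log-partition sum fits the variational preference template is the standard Gibbs variational identity. The notation here is an assumption made for illustration: θ > 0 is a robustness parameter, U(x) a utility and p0 the reference model.

$$
\min_{p} \left\{ \sum_x p(x)\,U(x) + \frac{1}{\theta} \sum_x p(x) \log \frac{p(x)}{p_0(x)} \right\}
= -\frac{1}{\theta} \log \sum_x p_0(x)\, e^{-\theta U(x)},
\qquad
p^{*}(x) \propto p_0(x)\, e^{-\theta U(x)} .
$$

Reading the scaled KL divergence as the ambiguity index, c(p) = (1/θ) KL(p ‖ p0), the right-hand side is a log-partition sum with a negative inverse temperature β = −θ, that is, (1/β) log Σ p0(x) exp(βU(x)).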

### (d) Stochastic choice

Stochastic choice rules have extensively been studied in the psychological and econometric literature, in particular logit choice models based on the Boltzmann distribution [24,25]. The literature on Boltzmann distributions for decision-making goes back to Luce [26], extending through McFadden [27], Meginnis [28] and Fudenberg & Kreps [29]. Luce [26] has studied stochastic choice rules of the form $p(x_i) = w_i / \sum_j w_j$, which includes the Boltzmann distribution and the softmax rule known in the reinforcement learning literature [30]. McFadden [27] has shown that such distributions can arise, for example, when utilities are contaminated with additive noise following an extreme value distribution. While stochastic choice models are generally accepted to account for human choices better than their deterministic counterparts [88–90], they have also been strongly criticized, especially for a property known as independence of irrelevant alternatives (IIA). Similar to the independence axiom in expected utility theory, IIA implies that the ratio of two choice probabilities does not depend on the presence of a third irrelevant alternative in the choice set. What distinguishes the free-energy equations from the above choice rules is that stochastic choice behaviour is described by a generalized exponential family distribution of the form $p(x) \propto p_0(x)\, e^{\beta U(x)}$. Changing the choice set might in general also change the prior p0(x), but more importantly, it might also change the resource parameter β.
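The IIA property and the way a context-dependent resource parameter can break it are easy to demonstrate on a toy choice set. In the sketch below, the option labels, utilities and the assumption that β drops from 2.0 to 1.0 when a third option is added are all hypothetical.

```python
import math

def choice_probs(prior, utility, beta):
    """Prior-weighted Boltzmann choice rule over the options in `utility`."""
    weights = {x: prior[x] * math.exp(beta * utility[x]) for x in utility}
    z = sum(weights.values())
    return {x: w / z for x, w in weights.items()}

u_two = {"a": 1.0, "b": 0.5}
u_three = {"a": 1.0, "b": 0.5, "c": 0.8}
uniform_two = {x: 1.0 / 2 for x in u_two}
uniform_three = {x: 1.0 / 3 for x in u_three}

# Plain logit rule: uniform prior and a fixed beta, so IIA holds.
p2 = choice_probs(uniform_two, u_two, beta=2.0)
p3 = choice_probs(uniform_three, u_three, beta=2.0)
ratio_two = p2["a"] / p2["b"]
ratio_three = p3["a"] / p3["b"]

# Hypothetical context effect: the larger choice set leaves fewer resources
# per option, modelled here as a smaller beta. The ratio now changes.
p3_ctx = choice_probs(uniform_three, u_three, beta=1.0)
ratio_ctx = p3_ctx["a"] / p3_ctx["b"]
```

With a fixed β and a uniform prior the ratio between options a and b is unaffected by option c, whereas letting β depend on the choice set changes the ratio even though the utilities of a and b are untouched.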

### (e) Variational Bayes and free-energy principle

It is well known that (approximate) Bayesian inference satisfies a variational principle in the free energy [91]. Given a prior p0(h) over a latent variable h, a likelihood model p(y | h) explaining observation y, and a distribution q(h) to approximate the posterior p(h | y), the free-energy difference is extremized for the Bayesian posterior q(h)=p(h | y) under the particular choice of utility $U(h) = \log p(y \mid h)$ (taking β = 1), which minimizes informational surprise. Variational Bayes methods often rewrite the free energy as $F[q] = \sum_h q(h) \log \frac{q(h)}{p(h \mid y)} - \log p(y)$, which provides a bound on the evidence, $F[q] \geq -\log p(y)$. Thus, approximate Bayesian inference can be achieved by minimizing free-energy differences, where the boundedness consists of being restricted to model class q. This variational Bayes approach has recently also been proposed as a theoretical framework to understand brain function [36,37], where perception is modelled as variational Bayesian inference over hidden causes of observations. In this case, however, the likelihood model p(y | h,a) also depends on the chosen action a. According to [36,37], action and perception then consist of choosing a and q, respectively, so as to minimize free energy, thereby minimizing surprise.
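The variational characterization of the posterior can be verified on a discrete toy model. The sketch assumes a utility equal to the log-likelihood and β = 1, so that the free energy is the KL divergence from the prior minus the expected log-likelihood; the prior and likelihood values are made up for illustration.

```python
import math

# Assumed toy model: three hypotheses and one fixed observation y.
prior = [0.6, 0.3, 0.1]          # p0(h)
likelihood = [0.1, 0.5, 0.9]     # p(y | h) for the observed y

evidence = sum(p0 * l for p0, l in zip(prior, likelihood))        # p(y)
posterior = [p0 * l / evidence for p0, l in zip(prior, likelihood)]

def free_energy(q):
    """KL(q || p0) minus the expected log-likelihood under q."""
    return sum(qi * (math.log(qi / p0) - math.log(l))
               for qi, p0, l in zip(q, prior, likelihood) if qi > 0)

f_posterior = free_energy(posterior)        # attains the bound -log p(y)
f_prior = free_energy(prior)                # looser: ignores the data
f_uniform = free_energy([1 / 3, 1 / 3, 1 / 3])
surprise = -math.log(evidence)
```

Any q other than the posterior gives a free energy strictly above the surprise, and the gap equals the KL divergence from q to the posterior. This is what makes free-energy minimization over a restricted family an approximation to exact inference.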

## 6. Conclusion

Thermodynamics provides a framework for bounded rationality that can be both descriptive and prescriptive. It can be descriptive in the sense that it describes behaviour that is clearly sub-optimal from the point of view of a perfectly rational decision-maker with infinite resources. It can be prescriptive in the sense that it prescribes how a bounded rational actor should behave optimally given resource constraints formalized by β. Ultimately, it might even be possible to connect computational processes of bounded rational decision-makers with real physical processes, for example, by relating the generated entropy to energy requirements [40]. Finally, the thermodynamic model of bounded rationality suggests a notion of intelligence that is closely related to the process of evolution. It was already mentioned that bounded rational controllers of the form (3.2) share their structure with Bayes’ rule, which in turn shares its structural form with discrete replicator dynamics that model evolutionary processes [92]. In such evolutionary processes, a population of samples is pushed through a fitness function (likelihood, gain function) that biases the distribution of the population, thereby transforming a distribution p0 to a new distribution p. In this picture, different hypotheses x compete for probability mass over subsequent iterations, favouring those x that have a lower-than-average cost. Just as the evolutionary random processes of variation and selection created intelligent organisms on a phylogenetic time scale, similar random processes might underlie (bounded) intelligent behaviour in individuals on an ontogenetic time scale.
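The structural link to replicator dynamics can be sketched in a few lines. The example assumes a fitness function of the form exp(βU(x)), so that one replicator step reweights the current distribution by fitness relative to the population average; the function name and numbers are hypothetical.

```python
import math

def replicator_step(p, utility, beta):
    """One discrete replicator update with fitness exp(beta * U(x))."""
    fitness = [math.exp(beta * u) for u in utility]
    mean_fitness = sum(pi * f for pi, f in zip(p, fitness))
    # Options with above-average fitness gain probability mass.
    return [pi * f / mean_fitness for pi, f in zip(p, fitness)]

prior = [0.5, 0.3, 0.2]
utility = [1.0, 2.0, 3.0]
beta = 0.5
steps = 4

p = prior
history = [p]
for _ in range(steps):
    p = replicator_step(p, utility, beta)
    history.append(p)

# Iterating the update accumulates resources: the result matches a single
# Boltzmann-type update of the prior with parameter steps * beta.
one_shot = replicator_step(prior, utility, steps * beta)
```

Under this assumption, several selection rounds with a small β are equivalent to one round with a proportionally larger β, which gives one reading of how repeated variation and selection can accumulate into sharper choices.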

## Acknowledgements

This study was supported by the DFG, Emmy Noether grant no. BR4164/1-1.