
Synthesis of Provably Correct Autonomy Protocols
for Shared Control

Murat Cubuktepe, Nils Jansen, Mohammed Alsiekh, Ufuk Topcu


Abstract—We develop algorithms for the synthesis of shared control protocols subject to probabilistic temporal logic specifications. More specifically, we introduce a modeling formalism in which both a human and an autonomy protocol can issue commands to a robot towards performing a certain task. These commands are blended into a joint input to the robot. We model the interaction between the human and the robot as a Markov decision process that incorporates the potential randomness in decisions caused by factors such as complexity of the task specifications and imperfect interfaces. Using inverse reinforcement learning, we obtain an abstraction of the human's behavior as a so-called randomized strategy. We design the autonomy protocol to ensure that the robot behavior—resulting from the blending of the commands—satisfies given safety and performance specifications in probabilistic temporal logic. Additionally, the resulting strategies generate behavior as similar to the behavior induced by the human's commands as possible. We formulate the underlying problem as a quasiconvex optimization problem, which is solved by checking feasibility for a number of linear programming problems. We assess the applicability and the scalability of the approach through case studies involving autonomous wheelchair navigation and unmanned aerial vehicle mission planning.

I. INTRODUCTION

We study the problem of shared control, where a robot is to execute a task according to the goals of a human operator while adhering to additional safety and performance requirements. Applications of such human-robot interaction include remotely operated semi-autonomous wheelchairs [13], robotic teleoperation [16], and human-in-the-loop unmanned aerial vehicle mission planning [9]. Typically, the human operator issues a command through an input interface, which maps the command directly to an action for the robot. The problem is that a sequence of such actions may fail to accomplish the task at hand, due to limitations of the interface or failure of the human to comprehend the complexity of the problem. Therefore, a so-called autonomy protocol provides assistance for the human in order to complete the task according to the given requirements.

At the heart of the shared control problem is the design of an autonomy protocol. In the literature, there are two main directions, based on either switching the control authority between the human and the autonomy protocol [26], or on blending their commands towards joint inputs for the robot [7], [15].

M. Cubuktepe and U. Topcu are with the Department of Aerospace Engineering and Engineering Mechanics, University of Texas at Austin, 201 E 24th St, Austin, TX 78712, USA. Nils Jansen is with the Department of Software Science, Radboud University Nijmegen, Comeniuslaan 4, 6525 HP Nijmegen, Netherlands. Mohammed Alsiekh is with the Institute for Computational Engineering and Sciences, University of Texas at Austin, 201 E 24th St, Austin, TX 78712, USA. email: ({mcubuktepe,malsiekh,utopcu}@utexas.edu, [email protected]).

One approach to switching the authority first determines the desired goal of the human operator with high confidence, and then assists towards exactly this goal [8], [18]. In [12], switching the control authority between the human and the autonomy protocol ensures satisfaction of specifications that are formally expressed in temporal logic. In general, switching the authority may decrease the human's satisfaction, as operators usually prefer to retain as much control as possible [17].

The second direction is to provide another command in addition to the one of the human operator. To introduce a more flexible trade-off between the human's control authority and the level of autonomous assistance, both commands are then blended to form a joint input for the robot. A blending function determines the emphasis that is put on the autonomy protocol in the blending, that is, it regulates the amount of assistance provided to the human. Switching of authority can be seen as a special case of blending, as the blending function may assign full control to the autonomy protocol or to the human. In general, putting more emphasis on the autonomy protocol in blending leads to greater accuracy in accomplishing the task [6], [7], [20]. However, as before, humans may prefer to retain control of the robot [16], [17]. None of the existing blending approaches provide formal correctness guarantees that go beyond statistical confidence bounds. Correctness here refers to ensuring safety and optimizing performance according to the given requirements. Our goal is the design of an autonomy protocol that admits formal correctness while rendering the robot behavior as close to the human's inputs as possible.

We model the behavior of the robot as a Markov decision process (MDP) [23], which captures the robot's actions inside a potentially stochastic environment. Problem formulations with MDPs typically focus on maximizing an expected reward (or, minimizing the expected cost). However, such formulations may not be sufficient to ensure safety or performance guarantees in a task that includes a human operator. Therefore, we design the autonomy protocol to ensure that the resulting behavior is formally verified with respect to probabilistic temporal logic specifications. Such verification problems have been extensively studied for MDPs [2].

Take as an example a scenario involving a semi-autonomous wheelchair [13] whose navigation has to account for a randomly moving autonomous vacuum cleaner, see Fig. 1. The wheelchair needs to navigate to the exit of a room, and the vacuum cleaner moves in the room according to a probabilistic transition function. The goal of the wheelchair is to reach the exit without crashing into the vacuum cleaner. The human may not fully perceive the motion of the vacuum cleaner. Note that the human's commands, depicted with the solid red line in Fig. 1(a),

Fig. 1. A wheelchair in a shared control setting. (a) Autonomy perspective; (b) human perspective.

may cause the wheelchair to crash into the vacuum cleaner. The autonomy protocol provides another set of commands, indicated by the solid red line in Fig. 1(b), to carry out the task safely without crashing. However, the autonomy protocol's commands deviate highly from the commands of the human. The two sets of commands are then blended into a new set of commands, depicted by the dashed red line in Fig. 1(b). The blended commands perform the task safely while generating behavior as similar to the behavior induced by the human's commands as possible.

In [15], we formulated the problem of designing the autonomy protocol as a nonlinear programming problem. However, solving nonlinear programs is generally intractable [3]. Therefore, we proposed a greedy algorithm that iteratively repairs the human strategy such that the specifications are satisfied without guaranteeing optimality, based on [22]. Here, we propose an alternative approach for the blending of the two strategies. We follow the approach of repairing the strategy of the human to compute an autonomy protocol. We ensure that the resulting behavior induced by the repaired strategy (1) deviates minimally from the human strategy and (2) satisfies given temporal logic specifications after blending. We formally define the problem as a quasiconvex optimization problem, which can be solved efficiently by checking feasibility of a number of convex optimization problems [4].

A human may be uncertain about which command to issue in order to accomplish a task. Moreover, a typical interface used to parse the human's commands, such as a brain-computer interface, is inherently imperfect. To capture such uncertainties and imperfections in the human's decisions, we introduce randomness to the commands issued by humans. Note that it may not be possible to blend two different deterministic commands: if the human's command is "up" and the autonomy protocol's command is "right", we cannot blend these two commands into another deterministic command. By introducing randomness to the commands of the human and the autonomy protocol, we therefore ensure that the blending is always well-defined. In what follows, we call a formal interpretation of a sequence of the human's commands the human strategy, and the sequence of commands issued by the autonomy protocol the autonomy strategy.

The question remains how to obtain the human strategy in the first place. It may be unrealistic that a human can comprehend an MDP that models a realistic scenario. To this end, we create a virtual simulation environment that captures the behavior of the MDP. We ask humans to participate in two case studies to collect data about typical human behavior. We use inverse reinforcement learning to obtain a formal interpretation of the human's inputs as a strategy [1], [28]. In our first case study, we model a typical shared control scenario based on autonomous wheelchair navigation [13]. In our second case study, we consider an unmanned aerial vehicle mission planning scenario, where the human operator is to patrol certain regions while keeping away from enemy aerial vehicles.

In summary, the main contribution of this paper is to synthesize the autonomy protocol such that the resulting blended or repaired strategy meets all given specifications while only minimally deviating from the human strategy. We introduce all formal foundations that we need in Section II. We provide an overview of the general shared control concept in Section III. We present the shared control synthesis problem and provide a solution based on linear programming in Section IV. We indicate the applicability and scalability of our approach on experiments in Section V and draw a short conclusion and critique of our approach in Section VI.

II. PRELIMINARIES

In this section, we introduce the required formal models and
specifications that we use to synthesize the autonomy protocol,
and we give a short example illustrating the main concepts.

1) Markov Decision Processes: A probability distribution over a finite set $X$ is a function $\mu\colon X \to [0,1] \subseteq \mathbb{R}$ with $\sum_{x \in X} \mu(x) = \mu(X) = 1$. The set of all distributions on $X$ is $Distr(X)$.

Definition 1 (MDP). A Markov decision process (MDP) $\mathcal{M} = (S, s_I, \mathcal{A}, \mathcal{P})$ has a finite set $S$ of states, an initial state $s_I \in S$, a finite set $\mathcal{A}$ of actions, and a transition probability function $\mathcal{P}\colon S \times \mathcal{A} \to Distr(S)$.

MDPs have nondeterministic choices of actions at the states; the successors are determined probabilistically via the associated probability distribution. We assume that all actions are available at each state and that the MDP contains no deadlock states. A cost function $C\colon S \times \mathcal{A} \to \mathbb{R}_{\geq 0}$ associates a cost to transitions. If there is only a single action available at each state, the MDP reduces to a discrete-time Markov chain (MC).

To define a probability measure and expected cost on MDPs,
so-called strategies resolve the choices of actions.

Definition 2 (Strategy). A randomized strategy for an MDP $\mathcal{M}$ is a function $\sigma\colon S \to Distr(\mathcal{A})$. If $\sigma(s, \alpha) = 1$ for some $\alpha \in \mathcal{A}$ and $\sigma(s, \alpha') = 0$ for all $\alpha' \in \mathcal{A} \setminus \{\alpha\}$, the strategy is deterministic. The set of all strategies over $\mathcal{M}$ is $Str^{\mathcal{M}}$.

Resolving all the nondeterminism for an MDP $\mathcal{M}$ with a strategy $\sigma \in Str^{\mathcal{M}}$ yields an induced Markov chain $\mathcal{M}^{\sigma}$.

Fig. 2. MDP $\mathcal{M}$ with target state $s_2$ and the MC induced by the strategy $\sigma_{unif}$. (a) MDP $\mathcal{M}$; (b) induced MC.

Definition 3 (Induced MC). For an MDP $\mathcal{M} = (S, s_I, \mathcal{A}, \mathcal{P})$ and a strategy $\sigma \in Str^{\mathcal{M}}$, the MC induced by $\mathcal{M}$ and $\sigma$ is $\mathcal{M}^{\sigma} = (S, s_I, \mathcal{A}, \mathcal{P}^{\sigma})$, where

$$\mathcal{P}^{\sigma}(s, s') = \sum_{\alpha \in \mathcal{A}(s)} \sigma(s, \alpha) \cdot \mathcal{P}(s, \alpha, s') \quad \text{for all } s, s' \in S.$$

The occupancy measure of a strategy $\sigma$ gives the expected number of times each action is taken at each state of an MDP. In our formulation, we use the occupancy measure of a strategy to compute an autonomy protocol.

Definition 4 (Occupancy Measure). The occupancy measure $x_{\sigma}$ of a strategy $\sigma$ is defined as

$$x_{\sigma}(s, \alpha) = \mathbb{E}\left[\sum_{t=0}^{\infty} P(s_t = s, \alpha_t = \alpha)\right] \qquad (1)$$

where $s_t$ and $\alpha_t$ denote the state and action of the MDP at time step $t$. The occupancy measure $x_{\sigma}(s, \alpha)$ is the expected number of times the action $\alpha$ is taken at state $s$ under the strategy $\sigma$.
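To make Definition 4 concrete, the short simulation below estimates $x_{\sigma}(s, \alpha)$ by averaging state-action visit counts over sampled runs. The dictionary-based MDP layout, the stopping condition at absorbing states, and the number of runs are assumptions made for this illustration only, not part of the formal development.

```python
import random

def estimate_occupancy(P, sigma, s_init, absorbing, runs=10000):
    """Monte Carlo estimate of x_sigma(s, a), the expected number of times action a
    is taken in state s (Definition 4). P[s][a] is a list of (successor, probability)
    pairs and sigma[s][a] is an action probability. Runs stop at absorbing states,
    where no further actions are counted."""
    counts = {}
    for _ in range(runs):
        s = s_init
        while s not in absorbing:
            actions, weights = zip(*sigma[s].items())
            a = random.choices(actions, weights=weights)[0]   # sample a ~ sigma(s, .)
            counts[(s, a)] = counts.get((s, a), 0) + 1
            succs, probs = zip(*P[s][a])
            s = random.choices(succs, weights=probs)[0]       # sample s' ~ P(s, a, .)
    return {sa: c / runs for sa, c in counts.items()}
```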

2) Specifications: A probabilistic reachability specification $P_{\leq\lambda}(\lozenge T)$ with threshold $\lambda \in [0,1] \cap \mathbb{Q}$ and set of target states $T \subseteq S$ restricts the probability to reach $T$ from $s_I$ in $\mathcal{M}$ to be at most $\lambda$. Similarly, expected cost properties $E_{\leq\kappa}(\lozenge G)$ restrict the expected cost to reach the set $G \subseteq S$ of goal states by an upper bound $\kappa \in \mathbb{Q}$. Intuitively, the bad states in $T$ should only be reached with probability at most $\lambda$ (safety specification), while the expected cost to reach the goal states in $G$ should be below $\kappa$ (performance specification). We also use until properties of the form $P_{\geq\lambda}(\neg T\; \mathcal{U}\; G)$, which require that the probability of reaching $G$ while not reaching $T$ beforehand is at least $\lambda$.

The synthesis problem is to find one particular strategy $\sigma$ for an MDP $\mathcal{M}$ such that given specifications $\phi_1, \dots, \phi_n$ are satisfied in the induced MC $\mathcal{M}^{\sigma}$, written $\sigma \models \phi_1, \dots, \phi_n$.

Example 1. Fig. 2(a) depicts an MDP $\mathcal{M}$ with initial state $s_0$. States $s_0$ and $s_1$ have choices between actions $a$ or $b$ and $c$ or $d$, respectively. Action $a$, for example, has a probabilistic choice between $s_1$ and $s_3$ with probabilities 0.6 and 0.4. For the self-loops at $s_2$, $s_3$, and $s_4$ we omit actions.

For a safety specification $\phi = P_{\leq 0.21}(\lozenge s_2)$, the deterministic strategy $\sigma_1 \in Str^{\mathcal{M}}$ with $\sigma_1(s_0, a) = 1$ and $\sigma_1(s_1, c) = 1$ induces the probability 0.36 of reaching $s_2$. Therefore, the specification is not satisfied, see the induced MC in

Fig. 3. Shared control architecture: the human strategy and the autonomy strategy are blended, via the blending function $b$, into a blended strategy whose commands the robot executes, subject to the formal model $\mathcal{M}_r$ and the specifications $\phi_1, \dots, \phi_n$.

Fig. 2(b). Likewise, the randomized strategy $\sigma_{unif} \in Str^{\mathcal{M}}$ with $\sigma_{unif}(s_0, a) = \sigma_{unif}(s_0, b) = 0.5$ and $\sigma_{unif}(s_1, c) = \sigma_{unif}(s_1, d) = 0.5$ violates the specification, as the probability of reaching $s_2$ is 0.25. However, the deterministic strategy $\sigma_{safe} \in Str^{\mathcal{M}}$ with $\sigma_{safe}(s_0, b) = 1$ and $\sigma_{safe}(s_1, d) = 1$ induces the probability 0.16, thus $\sigma_{safe} \models \phi$.
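As a small worked illustration of Definitions 2 and 3 and of Example 1, the plain-Python sketch below builds the MDP of Fig. 2(a) (the exact transition probabilities are read off the figure and should be treated as an assumption), induces the MC of Definition 3 for a given strategy, and computes the probability of reaching the target state $s_2$ by iterating the reachability fixed point.

```python
# MDP of Example 1: P[state][action] = list of (successor, probability) pairs.
P = {
    "s0": {"a": [("s1", 0.6), ("s3", 0.4)], "b": [("s1", 0.4), ("s3", 0.6)]},
    "s1": {"c": [("s2", 0.6), ("s4", 0.4)], "d": [("s2", 0.4), ("s4", 0.6)]},
    "s2": {"loop": [("s2", 1.0)]}, "s3": {"loop": [("s3", 1.0)]}, "s4": {"loop": [("s4", 1.0)]},
}

def induced_mc(P, sigma):
    """Definition 3: P_sigma(s, s') = sum_a sigma(s, a) * P(s, a, s')."""
    mc = {s: {} for s in P}
    for s, actions in P.items():
        for a, succs in actions.items():
            for s2, p in succs:
                mc[s][s2] = mc[s].get(s2, 0.0) + sigma[s].get(a, 0.0) * p
    return mc

def reach_prob(mc, target, iters=100):
    """Probability of eventually reaching `target`, via fixed-point iteration."""
    x = {s: 1.0 if s == target else 0.0 for s in mc}
    for _ in range(iters):
        x = {s: 1.0 if s == target else sum(p * x[t] for t, p in mc[s].items()) for s in mc}
    return x

sigma_unif = {"s0": {"a": 0.5, "b": 0.5}, "s1": {"c": 0.5, "d": 0.5},
              "s2": {"loop": 1.0}, "s3": {"loop": 1.0}, "s4": {"loop": 1.0}}
print(reach_prob(induced_mc(P, sigma_unif), "s2")["s0"])   # ~0.25 > 0.21, so the spec is violated
```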

III. CONCEPTUAL DESCRIPTION OF SHARED CONTROL

We now detail the general shared control concept adopted in this paper. Consider the setting in Fig. 3. As inputs, we have a set of task specifications, a model $\mathcal{M}_r$ of the robot behavior, and a blending function $b$. The given robot task is described by certain performance and safety specifications $\phi_1, \dots, \phi_n$. For example, it may not be safe to take the shortest route because there may be too many obstacles in that route. In order to satisfy performance considerations, the robot should prefer to take the shortest route possible while not violating the safety specifications. We model the behavior of the robot inside a stochastic environment as an MDP $\mathcal{M}_r$.

In general, a human issues a set of commands for the robot to execute. It is unrealistic that a human can comprehend an MDP that models a realistic scenario in the first place. Indeed, a human will likely have difficulties interpreting a large number of possibilities and the associated probabilities of paths and payoffs [11], and it may be impractical for the human to provide the human strategy to the autonomy protocol directly, due to the possibly large state space of the MDP. Therefore, we compute a human strategy $\sigma_h$ as an abstraction of a sequence of the human's commands, which we obtain using inverse reinforcement learning [1], [28].

We design an autonomy protocol that provides another strategy $\sigma_a$, which we call the autonomy strategy. Then, we blend the two strategies according to the blending function $b$ into the blended strategy $\sigma_{ha}$. The blending function reflects preference over the human strategy or the autonomy strategy. We ensure that the blended strategy deviates minimally from the human strategy.

At runtime, we can then blend commands of the human with commands of the autonomy strategy. The resulting "blended" commands induce the same behavior as the blended strategy $\sigma_{ha}$, and the specifications are satisfied. This procedure requires checking feasibility of a number of linear programming problems, which can be expensive, but the resulting protocol is very efficient to execute at run time.

The shared control synthesis problem is then the synthesis of the repaired strategy such that, for the repaired strategy $\sigma_{ha}$, it holds that $\sigma_{ha} \models \phi_1, \dots, \phi_n$ while deviating minimally from $\sigma_h$. The deviation between the human strategy $\sigma_h$ and the repaired strategy $\sigma_{ha}$ is measured by the maximal difference between the two strategies.

IV. SYNTHESIS OF THE AUTONOMY PROTOCOL

A. Strategy blending

Given the human strategy $\sigma_h \in Str^{\mathcal{M}_r}$ and the autonomy strategy $\sigma_a \in Str^{\mathcal{M}_r}$, a blending function computes a weighted composition of the two strategies by favoring one or the other strategy in each state of the MDP [16], [6], [7].

It has been argued that blending reflects the confidence in how well the autonomy protocol is able to assist with respect to the human's task [7]. Put differently, the blending function should assign low confidence to the actions of the human if those actions may lead to a violation of the specifications. Recall Fig. 1 and the example in the introduction. In the cells of the gridworld where some actions may result in a collision with the vacuum cleaner with high probability, it makes sense to assign a high weight to the autonomy strategy.

We pick the blending function as a state-dependent function that weighs the confidence in both the human strategy and the autonomy strategy at each state of an MDP $\mathcal{M}_r$ [16], [6], [7].

Definition 5 (Linear blending). Given an MDP $\mathcal{M}_r = (S, s_I, \mathcal{A}, \mathcal{P})$, two strategies $\sigma_h, \sigma_a \in Str^{\mathcal{M}_r}$, and a blending function $b\colon S \to [0,1]$, the blended strategy $\sigma_{ha} \in Str^{\mathcal{M}_r}$ is given, for all states $s \in S$ and actions $\alpha \in \mathcal{A}$, by

$$\sigma_{ha}(s, \alpha) = b(s) \cdot \sigma_h(s, \alpha) + (1 - b(s)) \cdot \sigma_a(s, \alpha).$$

Note that the blended strategy $\sigma_{ha}$ is a well-defined randomized strategy. For each $s \in S$, the value $b(s)$ represents the "weight" of $\sigma_h$ at $s$, meaning how much emphasis the blending function puts on the human strategy at state $s$. For example, referring back to Fig. 1, the critical cells of the gridworld correspond to certain states of the MDP $\mathcal{M}_r$. At these states, we assign a very low confidence in the human strategy. For instance, at such a state $s \in S$ we might have $b(s) = 0.1$, meaning the blended strategy at state $s$ puts more emphasis on the autonomy strategy $\sigma_a$.
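For illustration, a minimal sketch of Definition 5 in plain Python; the nested-dictionary strategy layout and the example weights are assumptions of this sketch.

```python
def blend(sigma_h, sigma_a, b):
    """Definition 5: sigma_ha(s, a) = b(s) * sigma_h(s, a) + (1 - b(s)) * sigma_a(s, a)."""
    sigma_ha = {}
    for s in sigma_h:
        w = b(s)  # weight (confidence) on the human strategy at state s
        actions = set(sigma_h[s]) | set(sigma_a.get(s, {}))
        sigma_ha[s] = {a: w * sigma_h[s].get(a, 0.0) + (1.0 - w) * sigma_a.get(s, {}).get(a, 0.0)
                       for a in actions}
    return sigma_ha

# A critical state gets low confidence in the human strategy, as in the example above.
b = lambda s: 0.1 if s == "critical" else 0.8
sigma_h = {"critical": {"up": 1.0, "right": 0.0}}
sigma_a = {"critical": {"up": 0.0, "right": 1.0}}
print(blend(sigma_h, sigma_a, b))   # {'critical': {'up': 0.1, 'right': 0.9}}
```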

B. Strategy repair

In this section, we describe our approach to the shared control synthesis problem. We formulate the specific problem of repairing the human strategy based on quasiconvex programming, which can be solved by checking feasibility of a number of convex (here even linear) programming problems. We show that the repaired strategy satisfies the task specifications while deviating minimally from the human strategy. Finally, we compute the autonomy strategy $\sigma_a$ that may then be subject to blending.

We start by formalizing the concept of strategy perturbation. Then, we introduce the dual linear programming (LP) formulation to synthesize a strategy in an MDP. Finally, we present an LP formulation of the strategy repair problem and show that the solution indeed satisfies the specifications while deviating minimally from the human's strategy.

1) Perturbation of strategies: As mentioned in the introduction, the blended strategy should deviate minimally from the human strategy. To measure the quantity of such a deviation, we introduce the concept of perturbation, which was used in [5]. To modify a (randomized) strategy, we employ an additive perturbation that increases or decreases the probabilities of action choices in each state. We also ensure that for each state, the resulting strategy is a well-defined distribution over the actions.

Definition 6 (Strategy perturbation). Given an MDP $\mathcal{M}_r$ and a strategy $\sigma \in Str^{\mathcal{M}_r}$, an (additive) perturbation $\delta$ is a function $\delta\colon S \times \mathcal{A} \to [-1, 1]$ with

$$\sum_{\alpha \in \mathcal{A}} \delta(s, \alpha) = 0 \quad \forall s \in S.$$

The perturbation value at state $s$ for action $\alpha$ is $\delta(s, \alpha)$. Overloading the notation, the perturbed strategy $\delta(\sigma)$ is given by

$$\delta(\sigma)(s, \alpha) = \sigma(s, \alpha) + \delta(s, \alpha) \quad \forall s \in S,\ \forall \alpha \in \mathcal{A}. \qquad (2)$$
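A minimal sketch of Definition 6, again over the nested-dictionary strategy layout assumed above: apply an additive perturbation and check that the result is still a well-defined distribution at every state.

```python
def perturb(sigma, delta, tol=1e-9):
    """Return the perturbed strategy delta(sigma) of Eq. (2), checking that the
    perturbation sums to zero per state and that all probabilities stay in [0, 1]."""
    out = {}
    for s, probs in sigma.items():
        assert abs(sum(delta.get(s, {}).get(a, 0.0) for a in probs)) <= tol
        out[s] = {a: p + delta.get(s, {}).get(a, 0.0) for a, p in probs.items()}
        assert all(-tol <= q <= 1.0 + tol for q in out[s].values())
    return out

sigma_h = {"s0": {"a": 0.8, "b": 0.2}}
delta   = {"s0": {"a": -0.3, "b": 0.3}}
print(perturb(sigma_h, delta))   # {'s0': {'a': 0.5, 'b': 0.5}} up to floating-point rounding
```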

2) Dual linear programming formulation for MDPs: In this section, we recall the LP formulation to compute a policy that maximizes the probability of satisfying a reachability specification $P_{\geq\lambda}(\lozenge T)$ in an MDP [24], [10].

The LP formulation builds on the set $Var$ of variables including the following:
• $x(s, \alpha) \in [0, \infty)$ for each $s \in S \setminus T$ and $\alpha \in \mathcal{A}$ defines the occupancy measure of a state-action pair.

$$\text{maximize} \quad \sum_{s \in S \setminus T} \sum_{\alpha \in \mathcal{A}} \beta(s, \alpha)\, x(s, \alpha) \qquad (3)$$

subject to

$$\forall s \in S \setminus T.\quad \sum_{\alpha \in \mathcal{A}} x(s, \alpha) = \sum_{s' \in S \setminus T} \sum_{\alpha \in \mathcal{A}} \mathcal{P}(s', \alpha, s)\, x(s', \alpha) + \iota_s \qquad (4)$$

$$\sum_{s \in S \setminus T} \sum_{\alpha \in \mathcal{A}} \beta(s, \alpha)\, x(s, \alpha) \geq \lambda \qquad (5)$$

where $\beta(s, \alpha) = \sum_{s' \in T} \mathcal{P}(s, \alpha, s')$, and $\iota_s = 1$ if $s = s_I$ and $\iota_s = 0$ if $s \neq s_I$.
For any optimal solution $x$ to the LP in (3)–(5),

$$\sigma(s, \alpha) = \frac{x(s, \alpha)}{\sum_{\alpha' \in \mathcal{A}} x(s, \alpha')} \qquad (6)$$

is an optimal strategy, and $x$ is the occupancy measure of $\sigma$; see [24] and [10] for details.

After finding an optimal strategy from the LP in (3)–(5), we can compute the probability of satisfying the specification by

$$\sum_{s \in S \setminus T} \sum_{\alpha \in \mathcal{A}} \beta(s, \alpha)\, x(s, \alpha). \qquad (7)$$
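As an illustration of (3)–(6), the following sketch sets up the LP with the gurobipy interface to the Gurobi solver used later in Section V. The dictionary layout of the MDP, the helper and variable names, and the assumption that every action is available in every state are choices made for this sketch only.

```python
import gurobipy as gp
from gurobipy import GRB

def solve_reachability_lp(states, actions, P, s_init, T, lam):
    """Sketch of the LP (3)-(5): maximize the probability of reaching T, using
    occupancy-measure variables x(s, a) for all s outside the target set T."""
    m = gp.Model("reachability_lp")
    S = [s for s in states if s not in T]
    x = m.addVars(S, actions, lb=0.0, name="x")                    # x(s, a) >= 0

    # beta(s, a): probability of moving from (s, a) directly into T.
    beta = {(s, a): sum(p for (s2, p) in P[s][a] if s2 in T) for s in S for a in actions}
    reach = gp.quicksum(beta[s, a] * x[s, a] for s in S for a in actions)

    m.setObjective(reach, GRB.MAXIMIZE)                            # objective (3)
    for s in S:                                                    # flow constraints (4)
        inflow = gp.quicksum(p * x[s2, a] for s2 in S for a in actions
                             for (s3, p) in P[s2][a] if s3 == s)
        m.addConstr(x.sum(s, "*") == inflow + (1.0 if s == s_init else 0.0))
    m.addConstr(reach >= lam)                                      # threshold constraint (5)

    m.optimize()
    if m.status != GRB.OPTIMAL:
        return None
    # Recover a strategy from the occupancy measure via (6); unreached states get zero rows.
    return {s: {a: x[s, a].X / max(sum(x[s, a2].X for a2 in actions), 1e-12)
                for a in actions} for s in S}
```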

3) Strategy repair using quasiconvex programming: Given the human strategy $\sigma_h \in Str^{\mathcal{M}_r}$, the aim of the autonomy protocol is to compute the blended strategy, or repaired strategy, $\sigma_{ha}$ that induces behavior similar to the human strategy while satisfying the specifications. We compute the repaired strategy by perturbing the human strategy as introduced in Definition 6, i.e., we perturb the human strategy $\sigma_h$ into the repaired strategy $\sigma_{ha}$ by

$$\sigma_{ha}(s, \alpha) = \sigma_h(s, \alpha) + \delta(s, \alpha) \quad \forall s \in S,\ \forall \alpha \in \mathcal{A}. \qquad (8)$$

We cannot add this constraint to the LP in (3)–(5) directly, because the variables of the LP are the occupancy measures of the strategy. To perturb the strategy, we use the definition of the occupancy measure to translate the constraint in (8) into the constraint

$$\forall s \in S.\ \forall \alpha \in \mathcal{A}.\quad \frac{x_{ha}(s, \alpha)}{\sum_{\alpha' \in \mathcal{A}} x_{ha}(s, \alpha')} = \sigma_h(s, \alpha) + \delta(s, \alpha) \qquad (9)$$

or, equivalently, into the constraint

$$\forall s \in S.\ \forall \alpha \in \mathcal{A}.\quad x_{ha}(s, \alpha) = \sum_{\alpha' \in \mathcal{A}} x_{ha}(s, \alpha') \left(\sigma_h(s, \alpha) + \delta(s, \alpha)\right). \qquad (10)$$

Note that the constraint in (10) is not a linear constraint because of the multiplication of the perturbation variables $\delta$ and the occupancy measure variables $x$. Since we are interested in minimizing the maximal deviation, we can assign a common variable $\bar{\delta} \in [0, 1]$ for all state-action pairs to put an upper bound on the deviation by

$$\forall s \in S.\ \forall \alpha \in \mathcal{A}.\quad \left| x_{ha}(s, \alpha) - \sum_{\alpha' \in \mathcal{A}} x_{ha}(s, \alpha')\, \sigma_h(s, \alpha) \right| \leq \bar{\delta} \sum_{\alpha' \in \mathcal{A}} x_{ha}(s, \alpha'). \qquad (11)$$

The constraint in (11) is also not a linear constraint; in fact, it is a quadratic constraint due to the multiplication of $\bar{\delta}$ and $x_{ha}$. However, the set of feasible policies scales monotonically with $\bar{\delta}$, i.e., more policies are feasible with increasing $\bar{\delta}$. Note that for a fixed $\bar{\delta}$, the above constraint is linear, and therefore it is a convex constraint.

We now show that the shared control synthesis problem can be formulated as a quasiconvex optimization problem (QCP). The QCP formulation incorporates the occupancy variables for the repaired strategy $\sigma_{ha}$ and an upper bound on the perturbation of the human strategy. We show the formulation for a single reachability specification $\phi = P_{\geq\lambda}(\lozenge T)$. Then, we discuss how to add additional safety and expected cost specifications to the QCP formulation.

The QCP formulation builds on the set $Var$ of variables including the following:
• $x_{ha}(s, \alpha) \in [0, \infty)$ for each $s \in S \setminus T$ and $\alpha \in \mathcal{A}$ defines the occupancy measure of a state-action pair for the repaired strategy.
• $\bar{\delta} \in [0, 1]$ gives the upper bound on the perturbation of the strategy $\sigma_h$.

The shared control synthesis problem can then be formulated as the following QCP:

$$\text{minimize} \quad \bar{\delta} \qquad (12)$$

subject to

$$\forall s \in S \setminus T.\quad \sum_{\alpha \in \mathcal{A}} x_{ha}(s, \alpha) = \sum_{s' \in S \setminus T} \sum_{\alpha \in \mathcal{A}} \mathcal{P}(s', \alpha, s)\, x_{ha}(s', \alpha) + \iota_s \qquad (13)$$

$$\sum_{s \in S \setminus T} \sum_{\alpha \in \mathcal{A}} \beta(s, \alpha)\, x_{ha}(s, \alpha) \geq \lambda \qquad (14)$$

$$\forall s \in S.\ \forall \alpha \in \mathcal{A}.\quad \left| x_{ha}(s, \alpha) - \sum_{\alpha' \in \mathcal{A}} x_{ha}(s, \alpha')\, \sigma_h(s, \alpha) \right| \leq \bar{\delta} \sum_{\alpha' \in \mathcal{A}} x_{ha}(s, \alpha'). \qquad (15)$$

Let us walk through the optimization problem (12)–(15). We minimize the infinity norm of the perturbation in (12). The constraints in (13) ensure that the strategy $\sigma_{ha}$ resulting from the occupancy variables $x_{ha}$ is well-defined. The constraint in (14) ensures that the probability of reaching the set of target states $T$ is at least $\lambda$, so that the specification $P_{\geq\lambda}(\lozenge T)$ is satisfied. In (15), we perturb the human strategy $\sigma_h$ into the repaired strategy $\sigma_{ha}$ as in Definition 6 by putting an upper bound on the maximal perturbation.

The problem in (12)–(15) is not a convex optimization problem because of the nonconvex constraint in (15). However, for a fixed $\bar{\delta}$, the problem in (12)–(15) is an LP, and the set of feasible policies scales monotonically in $\bar{\delta}$. Therefore, the problem in (12)–(15) is a quasiconvex optimization problem. We can solve the QCP in (12)–(15) by employing a bisection method on $\bar{\delta}$. A method to solve quasiconvex optimization problems is given in [4, Algorithm 4.1]. We adapt it as follows:

Algorithm 1: Bisection method for the quasiconvex optimization problem.

given $l = 0$, $u = 1$, tolerance $\epsilon > 0$.
repeat
  1. Set $\bar{\delta} := (l + u)/2$.
  2. Solve the convex feasibility problem in (13)–(15) for the fixed $\bar{\delta}$.
  3. If the problem in (13)–(15) is feasible, set $u := \bar{\delta}$; else set $l := \bar{\delta}$.
until $u - l \leq \epsilon$.

Algorithm 1 requires exactly $\lceil \log_2(1/\epsilon) \rceil$ iterations, and we can compute a policy that induces a minimal deviation up to any given accuracy $\epsilon$.

Theorem 1. The strategy obtained from the QCP in (12)–(15) satisfies the task specifications and is optimal with respect to the maximal deviation.

Proof. From a satisfying assignment to the constraints in (12)–(15), we can compute a policy that satisfies the specification

using (6). Using Algorithm 1, we can compute the repaired policy $\sigma_{ha}$ that deviates minimally from the human strategy $\sigma_h$, up to accuracy $\epsilon$, in $\lceil \log_2(1/\epsilon) \rceil$ iterations. Therefore, by solving the QCP (12)–(15), we can compute an optimal policy for the shared control synthesis problem. $\square$

We note that the solution of the QCP in (12)–(15) is the minimally deviating repaired strategy $\sigma_{ha}$ obtained by strategy repair. On the one hand, reference [15] considers repairing the strategy with a greedy approach; that approach requires solving a possibly unbounded number of LPs to compute a feasible strategy that is not necessarily optimal. On the other hand, we only need to check feasibility of a bounded number of LPs to compute an optimal strategy. Note that we do not compute the autonomy strategy $\sigma_a$ directly with the QCP in (12)–(15). After computing the repaired strategy $\sigma_{ha}$, we compute the autonomy strategy $\sigma_a$ according to Definition 5.
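The bisection of Algorithm 1 only needs a feasibility oracle for the LP (13)–(15) with a fixed $\bar{\delta}$. The sketch below assumes such a hypothetical helper repair_feasible (for instance built with Gurobi as in the earlier sketch) and also shows the recovery of the autonomy strategy by inverting Definition 5, under the assumption that $b(s) < 1$ at every state.

```python
def bisect_repair(repair_feasible, eps=1e-3):
    """Algorithm 1: bisection on the maximal perturbation delta_bar in [0, 1]."""
    lo, hi, best = 0.0, 1.0, None
    while hi - lo > eps:                        # ceil(log2(1/eps)) iterations
        delta_bar = (lo + hi) / 2.0
        sol = repair_feasible(delta_bar)        # solution of (13)-(15), or None if infeasible
        if sol is not None:
            hi, best = delta_bar, sol           # feasible: tighten the upper bound
        else:
            lo = delta_bar                      # infeasible: raise the lower bound
    return hi, best

def autonomy_strategy(sigma_ha, sigma_h, b):
    """Invert Definition 5: sigma_a = (sigma_ha - b(s)*sigma_h) / (1 - b(s)), assuming b(s) < 1."""
    return {s: {a: (sigma_ha[s][a] - b(s) * sigma_h[s][a]) / (1.0 - b(s))
                for a in sigma_ha[s]} for s in sigma_ha}
```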

4) Additional specifications: The QCP in (12)–(15) computes an optimal policy for a single reachability specification. Suppose that, in addition to the reachability specification $\phi = P_{\geq\lambda}(\lozenge T)$, we are given another reachability specification $\phi' = P_{\leq\lambda'}(\lozenge T')$ with $T' \neq T$. We can handle this specification by adding

$$\sum_{s \in S \setminus T'} \sum_{\alpha \in \mathcal{A}} \beta'(s, \alpha)\, x(s, \alpha) \leq \lambda' \qquad (16)$$

to the QCP in (12)–(15), where $\beta'(s, \alpha) = \sum_{s' \in T'} \mathcal{P}(s, \alpha, s')$. The constraint in (16) ensures that the probability of reaching $T'$ is at most $\lambda'$.

We handle an expected cost specification $E_{\leq\kappa}(\lozenge G)$ for $G \subseteq S$ by adding

$$\sum_{s \in S \setminus G} \sum_{\alpha \in \mathcal{A}} C(s, \alpha)\, x(s, \alpha) \leq \kappa \qquad (17)$$

to the QCP in (12)–(15). Similar to the probability computation for reachability specifications, the expected cost to reach $G$ is given by

$$\sum_{s \in S \setminus G} \sum_{\alpha \in \mathcal{A}} C(s, \alpha)\, x(s, \alpha). \qquad (18)$$
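Continuing the hypothetical Gurobi sketch from Section IV-B2 (the names m, x, P, S, and actions are carried over from there; T2, lam2, C, G, and kappa are assumed to be given), the additional constraints (16) and (17) are each a single linear constraint over the same occupancy variables:

```python
import gurobipy as gp

def add_safety_and_cost_constraints(m, x, P, S, actions, T2, lam2, C, G, kappa):
    """Add constraint (16) (reach the bad states T2 with probability at most lam2) and
    constraint (17) (expected cost to the goal set G at most kappa) to an existing
    occupancy-measure model m with variables x[s, a], as in the earlier LP sketch."""
    beta2 = {(s, a): sum(p for (s2, p) in P[s][a] if s2 in T2) for s in S for a in actions}
    m.addConstr(gp.quicksum(beta2[s, a] * x[s, a]
                            for s in S for a in actions if s not in T2) <= lam2)    # (16)
    m.addConstr(gp.quicksum(C[s, a] * x[s, a]
                            for s in S for a in actions if s not in G) <= kappa)    # (17)
```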

V. CASE STUDY AND EXPERIMENTS

Defining a formal synthesis approach to the shared control
scenario requires an abstract representation of the human’s
desires as a strategy. We now discuss how such strategies may
be obtained using inverse reinforcement learning and report
on case study results.

A. Experimental setting

We consider two scenarios, the first of which is the wheelchair scenario from Fig. 1. We model the wheelchair scenario inside an interactive Python environment.

In the second scenario, we use a tool called AMASE¹, which is used to simulate missions with multiple unmanned aerial vehicles (UAVs). Its graphical user interface allows humans to send commands to one or multiple vehicles at run time. It includes three main programs: a simulator, a data playback tool, and a scenario setup tool.

We use the probabilistic model checker PRISM [19] to verify whether the computed strategies satisfy the specification. We use the LP solver Gurobi [14] to check the feasibility of the LP problems given in Section IV. We also implemented the greedy approach for strategy repair from [15].

B. Data collection

We asked five participants to accomplish tasks in the wheelchair scenario. The goal is to move the wheelchair to a target cell in the gridworld while never occupying the same cell as the moving obstacle. Similarly, three participants performed the surveillance task in the AMASE environment.

From the data obtained from each participant, we compute an individual randomized human strategy $\sigma_h$ via maximum-entropy inverse reinforcement learning (MEIRL) [28]. Reference [16] uses inverse reinforcement learning to reason about human behavior in a shared control scenario (though without any formal guarantees) from the human's demonstrations. In [25], inverse reinforcement learning is used to distinguish human intents with respect to different tasks. We show the workflow of the case study in Fig. 4.

In our setting, we denote each sample as one particular command of the participant, and we assume that the participant issues commands in order to satisfy the specification. Under this assumption, we can bound the probability of a possible deviation from the actual intent with respect to the number of samples using Hoeffding's inequality for the resulting strategy; see [27] for details. Using these bounds, we can determine the required number of commands to obtain an approximation of a typical human behavior. The probability of a possible deviation from the human behavior is given by $\mathcal{O}(\exp(-n\epsilon^2))$, where $n$ is the number of commands from the human and $\epsilon$ is the upper bound on the deviation between the probability of satisfying the specification with the true human strategy and the probability obtained with the strategy that is computed by inverse reinforcement learning. For example, to ensure an upper bound $\epsilon = 0.05$ on this deviation with probability 0.99, the required number of samples is 1060.
¹ https://github.com/afrl-rq/OpenAMASE

Fig. 4. The setting of the case study for the shared control simulation.


We design the blending function as follows: at states where the human strategy yields a lower probability of satisfying the specification, we assign a lower confidence in the human strategy. Using this function, we create the autonomy strategy $\sigma_a$ and pass it (together with the blending function) back to the environment. Note that the blended strategy $\sigma_{ha}$ satisfies the specification by Theorem 1.
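The paper does not commit to a specific functional form for this confidence assignment. Purely as an illustration, one simple choice consistent with the description is to let $b(s)$ increase with the per-state probability of satisfying the specification under the human strategy; the linear form, the clipping bounds, and the names below are assumptions of this sketch.

```python
def blending_from_satisfaction(sat_prob_h, b_min=0.1, b_max=0.9):
    """Map per-state satisfaction probabilities under the human strategy
    (sat_prob_h[s] in [0, 1], e.g. computed with a model checker) to a confidence b(s)."""
    return lambda s: b_min + (b_max - b_min) * sat_prob_h[s]
```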

C. Gridworld

The size of the gridworld in Fig. 1 is variable, and we generate a number of randomly moving (e.g., the vacuum cleaner) and stationary obstacles. An agent (e.g., the wheelchair) moves in the gridworld according to the commands from a human. For the gridworld scenario, we construct an MDP whose states represent the positions of the agent and the obstacles, and whose actions induce changes in the position of the agent.

The safety specification requires that the agent reaches a target cell without crashing into an obstacle, with at least a certain probability $\lambda \in [0, 1]$; formally, $P_{\geq\lambda}(\neg crash\; \mathcal{U}\; target)$.

First, we report results for one particular participant in a gridworld scenario with an 8×8 grid and one moving obstacle. The resulting MDP has 2304 states. We compute the human strategy using MEIRL, with features (i.e., the components of the human's cost function) given by the distance to the obstacle and the distance to the goal state.

We instantiate the safety specification with $\lambda = 0.7$, which means the target should be reached with a probability of at least 0.7. The human strategy $\sigma_h$ induces a probability of 0.546 of satisfying the specification; that is, it does not satisfy the specification.

We compute the repaired strategy $\sigma_{ha}$ using the greedy approach and the QCP approach, and both strategies satisfy the specification by inducing a probability of satisfying the specification larger than $\lambda$. On the one hand, the maximum deviation between the human strategy $\sigma_h$ and $\sigma_{ha}$ is 0.15 with the LP approach, which implies that the strategies of the human and the autonomy protocol deviate by at most 15% for all states and actions. On the other hand, the maximum deviation between the human strategy $\sigma_h$ and the blended strategy $\sigma_{ha}$ is 0.03 with the QCP approach. These results show that the QCP approach computes a blended strategy that is more similar to the human strategy than the one computed by the LP approach.

Fig. 5. Graphical representation of the obtained human strategy $\sigma_h$ (a), blended strategy $\sigma_{ha}$ (b), and autonomy strategy $\sigma_a$ (c) in the grid.

TABLE I
SCALABILITY RESULTS FOR THE GRIDWORLD EXAMPLE.

grid     states   trans.    LP synth. (s)   $\bar{\delta}_{LP}$   QCP synth. (s)   $\bar{\delta}_{QCP}$
8×8      2,304    36,864    14.12           0.15                  31.49            0.03
10×10    3,600    57,600    23.80           0.24                  44.61            0.04
12×12    14,400   230,400   250.78          0.33                  452.27           0.05

To finally assess the scalability of our approach, consider Table I. We generated MDPs for gridworlds of different sizes and with different numbers of obstacles. We list the number of states in the MDP (labeled "states") and the number of transitions (labeled "trans."). We report the time that the synthesis process took with the LP approach and the QCP approach (labeled "LP synth." and "QCP synth."), which includes the time to solve the LPs or QCPs, measured in seconds; for the LP approach it also includes the model checking times using PRISM. To represent the optimality of the synthesis, we list the maximal deviation from the human strategy for the LP and QCP approaches (labeled "$\bar{\delta}_{LP}$" and "$\bar{\delta}_{QCP}$"). In all of the examples, we observe that the QCP approach yields autonomy strategies with less deviation from the human strategy while having computation times similar to the LP approach.

D. UAV mission planning

Similar to the gridworld scenario, for the AMASE scenario we generate an MDP whose states correspond to the positions of the agents and the obstacles. Consider the AMASE scenario in Fig. 6. In this scenario, the specification, or mission, of the agent (blue UAV) is to keep surveilling the green regions (labeled w1, w2, w3) while avoiding restricted operating zones (labeled ROZ1, ROZ2) and enemy agents (purple and green UAVs). As we consider reachability problems, we asked the participants to visit the regions in a sequence, i.e., visiting the first region, then the second, and then the third region. After visiting the third region, the task is to visit the first region again to continue the surveillance.

For example, if the last visited region is w3, then the safety specification in this scenario is $P_{\geq\lambda}((\neg crash \wedge \neg ROZ)\; \mathcal{U}\; target)$, where $ROZ$ denotes visiting one of the ROZ areas and $target$ denotes visiting w1.

We synthesize the autonomy protocol for the AMASE scenario with two enemy agents, which induces an MDP with 15625 states. We use the same blending function and the same threshold $\lambda = 0.7$ as in the gridworld example. The features used to compute the human strategy with MEIRL are given by the distance to the closest ROZ, to the enemy agents, and to the target region.

Fig. 6. An example of an AMASE scenario: (a) the AMASE simulator; (b) the GUI of AMASE.

The human strategy $\sigma_h$ induces a probability of 0.163 of satisfying the specification; as in the gridworld example, it violates the specification. Similar to the gridworld example, we compute the repaired strategy $\sigma_{ha}$ with the LP and the QCP approaches, and both strategies satisfy the specification. On the one hand, the maximum deviation between $\sigma_h$ and $\sigma_{ha}$ is 0.42 with the LP approach, which means the strategies of the human and the autonomy protocol differ significantly in some states of the MDP. On the other hand, the QCP approach yields a repaired strategy $\sigma_{ha}$ that is more similar to the human strategy $\sigma_h$, with a maximum deviation of 0.06. The synthesis time with the LP approach is 481.31 seconds and with the QCP approach 749.18 seconds, showing the trade-off between the two approaches: the LP approach computes a feasible solution slightly faster, but the resulting blended strategy may be less similar to the human strategy than the one computed by the QCP approach.

VI. CONCLUSION AND CRITIQUE

We introduced a formal approach to synthesize an autonomy
protocol in a shared control setting subject to probabilistic
temporal logic specifications. The proposed approach utilizes
inverse reinforcement learning to compute an abstraction of
a human’s behavior as a randomized strategy in a Markov
decision process. We designed an autonomy protocol to
ensure that the resulting robot behavior satisfies safety and
performance specifications in probabilistic temporal logic. We
also ensured that the resulting robot behavior is as similar to the
behavior induced by the human’s commands as possible. We
synthesized the robot behavior using quasiconvex programming.
We showed the practical usability of our approach through
case studies involving autonomous wheelchair navigation and
unmanned aerial vehicle planning.

There are a number of limitations and possible extensions of the proposed approach. First of all, we computed a globally optimal strategy by bisection, which requires checking feasibility of a number of linear programming problems. A convex formulation of the shared control synthesis problem would make computing the globally optimal strategy more efficient.

We assumed that the human's commands are consistent throughout the whole execution, i.e., the human issues each command in order to satisfy the specification. This assumption also implies that the human does not take the robot's assistance into account while providing commands, and in particular does not adapt their strategy to the assistance. It may be possible to extend the approach to handle non-consistent commands by utilizing additional side information, such as the task specifications.

Finally, in order to generalize the proposed approach to other task domains, it is worth exploring transfer learning techniques [21]. Such techniques would allow us to handle different scenarios without having to relearn the human strategy from the human's commands.

REFERENCES

[1] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In ICML, page 1. ACM, 2004.

[2] Christel Baier and Joost-Pieter Katoen. Principles of Model Checking. The MIT Press, 2008.

[3] Mihir Bellare and Phillip Rogaway. The complexity of approximating a nonlinear program. In Complexity in Numerical Optimization, pages 16–32. World Scientific, 1993.

[4] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.

[5] Taolue Chen, Yuan Feng, David S. Rosenblum, and Guoxin Su. Perturbation analysis in verification of discrete-time Markov chains. In CONCUR, volume 8704 of LNCS, pages 218–233. Springer, 2014.

[6] Anca D. Dragan and Siddhartha S. Srinivasa. Formalizing assistive teleoperation. In Robotics: Science and Systems, 2012.

[7] Anca D. Dragan and Siddhartha S. Srinivasa. A policy-blending formalism for shared control. I. J. Robotic Res., 32(7):790–805, 2013.

[8] Andrew Fagg, Michael Rosenstein, Robert Platt, and Roderic Grupen. Extracting user intent in mixed initiative teleoperator control. In Intelligent Systems Technical Conference, page 6309, 2004.

[9] Lu Feng, Clemens Wiltsche, Laura Humphrey, and Ufuk Topcu. Synthesis of human-in-the-loop control protocols for autonomous systems. IEEE Transactions on Automation Science and Engineering, 13(2):450–462, 2016.

[10] Vojtěch Forejt, Marta Kwiatkowska, Gethin Norman, David Parker, and Hongyang Qu. Quantitative multi-objective verification for probabilistic systems. In TACAS, pages 112–127. Springer, 2011.

[11] Roland Fryer and Matthew O. Jackson. A categorical model of cognition and biased decision making. The B.E. Journal of Theoretical Economics, 8(1).

[12] Jie Fu and Ufuk Topcu. Synthesis of shared autonomy policies with temporal logic specifications. IEEE Transactions on Automation Science and Engineering, 13(1):7–17, 2016.

[13] F. Galán, M. Nuttin, E. Lew, P. W. Ferrez, G. Vanacker, J. Philips, and J. del R. Millán. A brain-actuated wheelchair: Asynchronous and non-invasive brain-computer interfaces for continuous control of robots. Clinical Neurophysiology, 119(9):2159–2169, 2008.

[14] Gurobi Optimization, Inc. Gurobi optimizer reference manual. http://www.gurobi.com, 2013.

[15] Nils Jansen, Murat Cubuktepe, and Ufuk Topcu. Synthesis of shared control protocols with provable safety and performance guarantees. In ACC, pages 1866–1873. IEEE, 2017.

[16] Shervin Javdani, J. Andrew Bagnell, and Siddhartha Srinivasa. Shared autonomy via hindsight optimization. In Robotics: Science and Systems, 2015.

[17] Dae-Jin Kim, Rebekah Hazlett-Knudsen, Heather Culver-Godfrey, Greta Rucks, Tara Cunningham, David Portee, John Bricout, Zhao Wang, and Aman Behal. How autonomy impacts performance and satisfaction: Results from a study with spinal cord injured subjects using an assistive robot. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, 42(1):2–14, 2012.

[18] Jonathan Kofman, Xianghai Wu, Timothy J. Luu, and Siddharth Verma. Teleoperation of a robot manipulator using a vision-based human-robot interface. IEEE Transactions on Industrial Electronics, 52(5):1206–1219, 2005.

[19] Marta Kwiatkowska, Gethin Norman, and David Parker. PRISM 4.0: Verification of probabilistic real-time systems. In CAV, volume 6806 of LNCS, pages 585–591. Springer, 2011.

[20] Adam Leeper, Kaijen Hsiao, Matei Ciocarlie, Leila Takayama, and David Gossow. Strategies for human-in-the-loop robotic grasping. In HRI, pages 1–8. IEEE, 2012.

[21] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.

[22] Shashank Pathak, Erika Ábrahám, Nils Jansen, Armando Tacchella, and Joost-Pieter Katoen. A greedy approach for the efficient repair of stochastic models. In NFM, volume 9058 of LNCS, pages 295–309. Springer, 2015.

[23] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley and Sons, 1994.

[24] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley and Sons, 2014.

[25] Constantin A. Rothkopf and Dana H. Ballard. Modular inverse reinforcement learning for visuomotor behavior. Biological Cybernetics, 107(4):477–490, 2013.

[26] Jian Shen, Javier Ibanez-Guzman, Teck Chew Ng, and Boon Seng Chew. A collaborative-shared control system with safe obstacle avoidance capability. In Robotics, Automation and Mechatronics, volume 1, pages 119–123. IEEE, 2004.

[27] Brian D. Ziebart. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. 2010.

[28] Brian D. Ziebart, Andrew L. Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. 2008.
