# Geometric and algorithmic issues in control and learning

## PDE-AI projet kickoff meeting - Jussieu, January 23-24 2024

## Nice team

Jean-Baptiste Caillau, UniCA (PR) - *Optimal control*

Thibaud Kloczko, Inria (IR) - *Scientitic computing, dev*

Ludovic Rifford, UniCA (PR) - *Control, SR geometry, transport*

Samuel Vaiter, CNRS (CR) - *Optimisation, machine learning*

## Theme 1 – The dynamics of Neural Networks training

*Objective 1.2 – Control and machine learning: there and back again*(participants: Strasbourg, Nice)**Task: Control for neural network analysis**(1 postdoc)*Objective 1.3 – Scalable solvers and softwares*(participants: U. Paris Cité, Nice)**Task: Automatic differentiation and control**(1 postdoc + 1 engineer)

## ResNets as discretised linear control systems

After Agrachev (SISSA), Sarychev (Florence), Scagliotti (TUM) *et al*:

- Agrachev, A. A.; Sarychev, A. V. Control in the Spaces of Ensembles of Points.
*SIAM J. Control Optim.***58**(2020), no. 3, 1579–1596 - Agrachev, A. A.; Sarychev, A. V. Control on the Manifolds of Mappings with a View to the Deep Learning.
*J. Dyn. Control Syst.***28**(2021), 989–1008 - A Scagliotti. Deep Learning approximation of diffeomorphisms via linear-control systems.
*MCRF***13**(2023), no. 3, 1226-1257

*ResNets* are compositions of nonlinear mappings

$$ \Phi = \Phi_M \circ \cdots \circ \Phi_1 $$

where \(M\) is the depth of the neural network and where each building block is of the form (additional term *wrt.* non-residual networks)

$$ \Phi_l(x) = x + \sigma(W_lx+b_l). $$

For the approximation properties of such compositions see, *e.g.*,

- Yarotsky, D. Error bounds for approximations with deep ReLU networks.
*Neural Networks***94**(2017), 103-114

in the case of non-residual networks. This composition can also be interpretated as the explicit Euler discretisation of the *Neural ODE*

$$ \dot{x}(t) = \sigma(W(t)x(t)+b(t)), $$

where \(W\) and \(b\) are now functions of time (continuum of layers), *controls*. Point of view developed, *e.g.* in Tabuada & Gharesifard:

- Tabuada, P.; Gharesifard, B. Universal Approximation Power of Deep Neural Networks via Nonlinear Control Theory. arXiv:007.06007 (2020)

See also *constructive* approach for Lipschitz (ReLU-like) activation function in references below:

- Li, Q.; Lin, T; Shen, Z. Deep learning via dynamical systems: An approximation perspective.
*J. Eur. Math. Soc.***25**(2023), 1671–1709 - Ruiz-Balet, D.; Zuazua, E. Neural ode control for classification, approximation and transport.
*SIAM Review***65**(2022), no. 3, 735-773

**Alternative point of view**: nonlinear in the data \(x\) but *linear* in the parameters:

$$ \Phi_l(x) = x + G(x)u_l. $$

Composition now interpretated as the discretisation of

$$ \dot{x}(t) = G(x(t))u(t) = \sum_{i=1}^m u_i(t)F_i(x(t)) $$

where the *smooth* vector fields \( F_1,\ldots,F_m \) are the columns of the nonlinear function \(G\).

Ability to learn data (finite or *continuum*): controllability properties of the control system for *ensembles*.

## Ensembles

**Definition.** For \(\Theta\) compact subset of \(\mathbf{R}^n\) (set of possibly infinite indices of the data), define an ensemble as a continuous *injective* map from \(\Theta\) to \(\mathbf{R}^n\). Denote \(\mathscr{E}_\Theta\) the set of ensembles.

**Example.** For \(|\Theta| = N < \infty\) finite, ensemble = open subset of \( (\mathbf{R}^n)^N \) of pairwise distinct vectors: \( (x_1,\ldots,x_N) \in (\mathbf{R}^n)^{(N)} \).

## Exact controllability

For an admissible control \(u\) in \( L^2([0,1],\mathbf{R}^m) \) (+ growth conditions on vector fields), define the time \(1\) flow \( \Phi_u \) of the controlled system

$$ \dot{x}(t) = \sum_{i=1}^m u_i(t)F_i(x(t)) \qquad (1) $$

mapping an initial condition \(x_0\) to \( \Phi_u(x_0) := x(1,u,x_0) \).

**Definition.** The control system (1) is said to be controllable on \(\mathscr{E}_\Theta\) if, for any ensembles \(\gamma_0\), \(\gamma_f\), there exists an admissible control \(u\) *s.t.*

$$ \Phi_u \circ \gamma_0 = \gamma_f. $$

**Example.** For \(|\Theta| = N\) finite and any ensembles \( (x^0_1,\ldots,x^0_N) \), \( (x^f_1,\ldots,x^f_N) \) in \( (\mathbf{R}^n)^{(N)} \), controllability means that there exists an admissible control \(u\) *s.t.*

$$ \Phi_u(x^0_j) = x^f_j,\quad j=1,\ldots,N. $$

**Theorem.** (Agrachev-Sarychev'2020) For \(k \geq 2\) controls, any \(N \geq 1\) and \(k\) sufficiently large, the set of vector fields \( F_1,\ldots,F_m \) *s.t.* controllability holds on \(\mathscr{E}_\Theta\) with \( |\Theta| = N\), is residual in \(\mathscr{C}^k(\mathbf{R}^n)\).

**Remark.** Typical control-geometric proof: for *generic* vector fields \( F_1,\ldots,F_m \), the family of folds \( F^{(N)}_1,\ldots,F^{(N)}_m \) is bracket generating on \( (\mathbf{R}^n)^{(N)}\):

$$ \text{Lie}_{x^{(N)}} \lbrace F^{(N)}_1,\ldots,F^{(N)}_m \rbrace = (\mathbf{R}^n)^N,\quad x^{(N)} \in (\mathbf{R}^n)^{(N)}, $$

where the *fold* of vector field \(F\) is \(F^{(N)}(x_1,\ldots,x_N) := (F(x_1),\ldots,F(x_N)) \). Controllability then ensured by Chow-Rashewsky.

## Approximate reachability

**Definition.** Let \(\gamma_0\) and \(\gamma_f\) be two ensembles in \(\mathscr{E}_\Theta\). Then \(\gamma_f\) is said to be \(\mathscr{C}^0\)-*approximately reachable* from \(\gamma_0\) by control system (1) if, for any \(\varepsilon > 0\), there exists an admissible control \(u\) *s.t.*

$$ \sup_{\theta \in \Theta} |\Phi_u(\gamma_0(\theta))-\gamma_f(\theta)| \leq \varepsilon. $$

**Remark.** As a flow, any such \( \Phi_u \) is diffeotopic to identity since

$$ x(0,u,\cdot) = \text{Id},\quad x(1,u,\cdot) = \Phi_u. $$

So if \(\gamma_f\) is reachable from \(\gamma_0\), the two ensembles must be diffeotopic.

**Definition.** The family \(F_1,\ldots,F_m\) satisfies the *(Lie algebra) strong approximation property* if there exists \(k \geq 1\) *s.t.*, for any \(\mathscr{C}^k\) vector field \(X\) and for any compact \(K \subset \mathbf{R}^n\), there is \(\delta > 0\) *s.t.*

$$ \inf \lbrace \max_{x \in K} |X(x)-Y(x)| \text{ with } Y \in \text{Lie} \lbrace F_1,\ldots,F_m \rbrace,\ \Vert Y \Vert_{1,K} \leq \delta \rbrace = 0. $$

**Remark.** The strong approximation property implies that, for every \(N \geq 1\), the folds of \(F_1,\ldots,F_m\) are bracket generating on \((\mathbf{R}^n)^{(N)}\). So (exact) controllability holds for finite sets \(\Theta\).

**Example.** The family below satisfies the strong approximation property (\(n \geq 2)\):

$$ F_i(x) = \frac{\partial}{\partial x_i},\quad G_i(x) = e^{-|x|^2} \frac{\partial}{\partial x_i},\quad i=1,\ldots,n. $$

**Theorem.** (Agrachev-Sarychev'2021) Let \(\gamma_0\) and \(\gamma_f\) be two *diffeotopic* ensembles in \(\mathscr{E}_\Theta\). Under the strong approximation property, \(\gamma_f\) is \(\mathscr{C}^0\)-*approximately reachable* from \(\gamma_0\) by control system (1).

**Remark.** The proof relies on the possibility of approximating uniformly on a given compact \(K\) a diffeomorphism \(\Psi\) (diffeotopic to identity) by a flow \(\Phi_u\) for some admissible control \(u\):

$$ \sup_{x \in K} |\Phi_u(x)-\Psi(x)| \leq \varepsilon. $$

But when \(\Psi\) is *not* a flow, controls cannot remain bounded when \(\varepsilon \to 0\) (for instance when one triangulates \(K\) with a finite number \(N\) of points and uses ensemble controllability: overfitting when \(N \to \infty\)).

## Going further

Use optimal control to penalise / regularise according to

$$ \frac{1}{N}\sum_{j=1}^N \text{loss}( \Phi_u(x^0_j)-\Psi(x^0_j) ) + \frac{\alpha}{2} \Vert u \Vert_2^2 \to \min \qquad (2) $$

where \((x^0_1,\ldots,x^0_N) \in (\mathbf{R}^n)^{(N)}\) is the training set of values in \(K\) for the diffeomorphism \(\Psi\).

- \(\Gamma\)-convergence results (Scagliotti'2022) when \(N \to \infty\) towards

$$ \int_K \text{loss}( \Phi_u(x)-\Psi(x) )\ \mathrm{d}\mu + \frac{\alpha}{2} \Vert u \Vert_2^2,\quad \mu = \lim_{N \to \infty} (1/N) \sum_{j=1}^N \delta_{x^0_j} $$

- numerical methods: gradient flow on (2), indirect method (Pontrjagin maximum principle)
- need for direct / indirect optimal control solvers & AD (automatic differentiation): control-toolbox
- basic example (direct method)
- advanced example (direct + indirect methods)