Geometric and algorithmic issues in control and learning

PDE-AI project kickoff meeting - Jussieu, January 23-24, 2024


Nice team

Jean-Baptiste Caillau, UniCA (PR) - Optimal control

Thibaud Kloczko, Inria (IR) - Scientific computing, dev

Ludovic Rifford, UniCA (PR) - Control, sub-Riemannian (SR) geometry, transport

Samuel Vaiter, CNRS (CR) - Optimisation, machine learning

Theme 1 – The dynamics of Neural Networks training

  • Objective 1.2 – Control and machine learning: there and back again (participants: Strasbourg, Nice)
  • Task: Control for neural network analysis (1 postdoc)
  • Objective 1.3 – Scalable solvers and software (participants: U. Paris Cité, Nice)
  • Task: Automatic differentiation and control (1 postdoc + 1 engineer)

ResNets as discretised linear control systems

After Agrachev (SISSA), Sarychev (Florence), Scagliotti (TUM) et al.:

ResNets are compositions of nonlinear mappings

$$ \Phi = \Phi_M \circ \cdots \circ \Phi_1 $$

where \(M\) is the depth of the neural network and where each building block is of the form (the skip term \(x\) is the addition wrt. non-residual networks)

$$ \Phi_l(x) = x + \sigma(W_lx+b_l). $$
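As a minimal illustration, a NumPy sketch of such a composition; the depth \(M\), width \(n\), activation \(\sigma = \tanh\) and random parameters are arbitrary choices made for the example, not taken from the references.

```python
import numpy as np

def residual_block(x, W, b, sigma=np.tanh):
    """One building block Phi_l(x) = x + sigma(W x + b)."""
    return x + sigma(W @ x + b)

def resnet(x, weights, biases, sigma=np.tanh):
    """Composition Phi = Phi_M o ... o Phi_1."""
    for W, b in zip(weights, biases):
        x = residual_block(x, W, b, sigma)
    return x

# Toy usage: depth M = 4, state dimension n = 3, random parameters.
rng = np.random.default_rng(0)
M, n = 4, 3
weights = [0.1 * rng.standard_normal((n, n)) for _ in range(M)]
biases = [0.1 * rng.standard_normal(n) for _ in range(M)]
print(resnet(np.ones(n), weights, biases))
```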

For the approximation properties of such compositions in the case of non-residual networks, see, e.g., the references below.

This composition can also be interpreted as the explicit Euler discretisation of the Neural ODE

$$ \dot{x}(t) = \sigma(W(t)x(t)+b(t)), $$

where \(W\) and \(b\) are now functions of time (a continuum of layers) acting as controls. This point of view is developed, e.g., in Tabuada & Gharesifard.

See also the constructive approach for Lipschitz (ReLU-like) activation functions in the references below.
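To make the Euler correspondence concrete, here is a minimal sketch assuming piecewise-constant \(W(t)\), \(b(t)\) on a uniform grid of \([0,1]\); each Euler step is a residual block, up to the step factor \(h = 1/M\) in front of the residual term.

```python
import numpy as np

def neural_ode_euler(x0, Ws, bs, sigma=np.tanh):
    """Explicit Euler discretisation of x'(t) = sigma(W(t) x(t) + b(t)) on [0, 1],
    with piecewise-constant W, b (one pair per layer, step h = 1/M)."""
    h = 1.0 / len(Ws)
    x = np.array(x0, dtype=float)
    for W, b in zip(Ws, bs):
        x = x + h * sigma(W @ x + b)     # a residual block, up to the step factor h
    return x
```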

Alternative point of view: nonlinear in the data \(x\) but linear in the parameters:

$$ \Phi_l(x) = x + G(x)u_l. $$

The composition is now interpreted as the discretisation of

$$ \dot{x}(t) = G(x(t))u(t) = \sum_{i=1}^m u_i(t)F_i(x(t)) $$

where the smooth vector fields \( F_1,\ldots,F_m \) are the columns of the matrix-valued map \(G\).
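A sketch of a building block in this parametrisation, with two hand-picked smooth fields \(F_1, F_2\) on \(\mathbf{R}^2\) (these fields and the numerical values are assumptions made for the example; any smooth family would do).

```python
import numpy as np

def F1(x):
    return np.array([1.0, 0.0])                       # constant field, d/dx_1

def F2(x):
    return np.exp(-x @ x) * np.array([0.0, 1.0])      # localised field

def G(x):
    """n x m matrix whose columns are the fields F_1(x), ..., F_m(x)."""
    return np.column_stack([F1(x), F2(x)])

def control_affine_block(x, u):
    """Phi_l(x) = x + G(x) u_l: nonlinear in the data x, linear in the parameters u_l."""
    return x + G(x) @ u

print(control_affine_block(np.array([0.5, -0.3]), np.array([1.0, 2.0])))
```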

The ability to learn data (a finite set or a continuum) translates into controllability properties of the control system on ensembles.

Ensembles

Definition. For \(\Theta\) a compact subset of \(\mathbf{R}^n\) (the set of indices of the data, possibly infinite), define an ensemble as a continuous injective map from \(\Theta\) to \(\mathbf{R}^n\). Denote by \(\mathscr{E}_\Theta\) the set of ensembles.

Example. For \(|\Theta| = N < \infty\) finite, an ensemble identifies with a point of the open subset \( (\mathbf{R}^n)^{(N)} \subset (\mathbf{R}^n)^N \) of tuples of pairwise distinct vectors: \( (x_1,\ldots,x_N) \in (\mathbf{R}^n)^{(N)} \).
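For finite \(\Theta\) an ensemble is thus just a tuple of pairwise distinct points; a small illustrative helper checking membership in \((\mathbf{R}^n)^{(N)}\):

```python
import numpy as np
from itertools import combinations

def is_finite_ensemble(points, tol=1e-12):
    """True if the rows of `points` are pairwise distinct, i.e. lie in (R^n)^(N)."""
    return all(np.linalg.norm(p - q) > tol for p, q in combinations(points, 2))

print(is_finite_ensemble(np.array([[0.0, 0.0], [1.0, 0.0]])))   # True
print(is_finite_ensemble(np.array([[0.0, 0.0], [0.0, 0.0]])))   # False
```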

Exact controllability

For an admissible control \(u\) in \( L^2([0,1],\mathbf{R}^m) \) (+ growth conditions on vector fields), define the time \(1\) flow \( \Phi_u \) of the controlled system

$$ \dot{x}(t) = \sum_{i=1}^m u_i(t)F_i(x(t)) \qquad (1) $$

mapping an initial condition \(x_0\) to \( \Phi_u(x_0) := x(1,u,x_0) \).

Definition. The control system (1) is said to be controllable on \(\mathscr{E}_\Theta\) if, for any ensembles \(\gamma_0\), \(\gamma_f\), there exists an admissible control \(u\) s.t.

$$ \Phi_u \circ \gamma_0 = \gamma_f. $$

Example. For \(|\Theta| = N\) finite and any ensembles \( (x^0_1,\ldots,x^0_N) \), \( (x^f_1,\ldots,x^f_N) \) in \( (\mathbf{R}^n)^{(N)} \), controllability means that there exists an admissible control \(u\) s.t.

$$ \Phi_u(x^0_j) = x^f_j,\quad j=1,\ldots,N. $$
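A sketch of the time-1 flow for piecewise-constant controls, integrated by explicit Euler; the constant fields and control values below are arbitrary assumptions for the example. With such constant fields the flow is a mere translation, the same for every ensemble point, which already shows why matching a general finite ensemble requires richer families (cf. the bracket conditions below); solving for a matching control is the controllability question itself and is not attempted here.

```python
import numpy as np

def F1(x): return np.array([1.0, 0.0])     # d/dx_1
def F2(x): return np.array([0.0, 1.0])     # d/dx_2

def flow(x0, u_traj, fields=(F1, F2), steps_per_piece=50):
    """Time-1 flow Phi_u of x'(t) = sum_i u_i(t) F_i(x(t)): explicit Euler,
    piecewise-constant control (one row of u_traj per subinterval of [0, 1])."""
    x = np.array(x0, dtype=float)
    h = 1.0 / (len(u_traj) * steps_per_piece)
    for u in u_traj:
        for _ in range(steps_per_piece):
            x = x + h * sum(ui * F(x) for ui, F in zip(u, fields))
    return x

# Two control pieces on [0, 1/2] and [1/2, 1]; with these constant fields the flow
# is the translation by the integral of u, hence the same for every ensemble point.
u_traj = np.array([[1.0, 2.0], [3.0, 0.0]])
print(flow([0.0, 0.0], u_traj))            # ≈ [2.0, 1.0]
print(flow([1.0, 1.0], u_traj))            # ≈ [3.0, 2.0]
```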

Theorem. (Agrachev-Sarychev'2020) For \(m \geq 2\) controls, any \(N \geq 1\) and \(k\) sufficiently large, the set of vector fields \( F_1,\ldots,F_m \) s.t. controllability holds on \(\mathscr{E}_\Theta\) with \( |\Theta| = N\) is residual in \(\mathscr{C}^k(\mathbf{R}^n)\).

Remark. Typical control-geometric proof: for generic vector fields \( F_1,\ldots,F_m \), the family of folds \( F^{(N)}_1,\ldots,F^{(N)}_m \) is bracket generating on \( (\mathbf{R}^n)^{(N)}\):

$$ \text{Lie}_{x^{(N)}} \lbrace F^{(N)}_1,\ldots,F^{(N)}_m \rbrace = (\mathbf{R}^n)^N,\quad x^{(N)} \in (\mathbf{R}^n)^{(N)}, $$

where the fold of a vector field \(F\) is \(F^{(N)}(x_1,\ldots,x_N) := (F(x_1),\ldots,F(x_N)) \). Controllability is then ensured by the Chow-Rashevskii theorem.
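A numerical sanity check of this condition (illustrative only): the bracket \([F,G](x) = DG(x)F(x) - DF(x)G(x)\) and the \(N\)-fold are computed below with finite-difference Jacobians, for two hand-picked fields (an assumption for the example).

```python
import numpy as np

def jac(F, x, eps=1e-6):
    """Finite-difference Jacobian of a vector field F at x."""
    n = len(x)
    J = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n); e[j] = eps
        J[:, j] = (F(x + e) - F(x - e)) / (2 * eps)
    return J

def bracket(F, G):
    """Lie bracket [F, G](x) = DG(x) F(x) - DF(x) G(x)."""
    return lambda x: jac(G, x) @ F(x) - jac(F, x) @ G(x)

def fold(F, N):
    """N-fold of F: F^(N)(x_1, ..., x_N) = (F(x_1), ..., F(x_N)) on (R^n)^N."""
    return lambda X: np.concatenate([F(x) for x in np.split(X, N)])

F = lambda x: np.array([1.0, 0.0])                     # constant field d/dx_1
G = lambda x: np.exp(-x @ x) * np.array([0.0, 1.0])    # localised field
print(bracket(F, G)(np.array([0.5, -0.3])))            # nonzero: a new direction
print(fold(G, 2)(np.array([0.5, -0.3, 1.0, 2.0])))     # fold evaluated on (R^2)^2
```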

Approximate reachability

Definition. Let \(\gamma_0\) and \(\gamma_f\) be two ensembles in \(\mathscr{E}_\Theta\). Then \(\gamma_f\) is said to be \(\mathscr{C}^0\)-approximately reachable from \(\gamma_0\) by control system (1) if, for any \(\varepsilon > 0\), there exists an admissible control \(u\) s.t.

$$ \sup_{\theta \in \Theta} |\Phi_u(\gamma_0(\theta))-\gamma_f(\theta)| \leq \varepsilon. $$

Remark. As a flow, any such \( \Phi_u \) is diffeotopic to identity since

$$ x(0,u,\cdot) = \text{Id},\quad x(1,u,\cdot) = \Phi_u. $$

So if \(\gamma_f\) is reachable from \(\gamma_0\), the two ensembles must be diffeotopic.

Definition. The family \(F_1,\ldots,F_m\) satisfies the (Lie algebra) strong approximation property if there exists \(k \geq 1\) s.t., for any \(\mathscr{C}^k\) vector field \(X\) and for any compact \(K \subset \mathbf{R}^n\), there is \(\delta > 0\) s.t.

$$ \inf \lbrace \max_{x \in K} |X(x)-Y(x)| \text{ with } Y \in \text{Lie} \lbrace F_1,\ldots,F_m \rbrace,\ \Vert Y \Vert_{1,K} \leq \delta \rbrace = 0. $$

Remark. The strong approximation property implies that, for every \(N \geq 1\), the folds of \(F_1,\ldots,F_m\) are bracket generating on \((\mathbf{R}^n)^{(N)}\). So (exact) controllability holds for finite sets \(\Theta\).

Example. The family below satisfies the strong approximation property (\(n \geq 2\)):

$$ F_i(x) = \frac{\partial}{\partial x_i},\quad G_i(x) = e^{-|x|^2} \frac{\partial}{\partial x_i},\quad i=1,\ldots,n. $$
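For instance, bracketing a constant field with a localised one already produces position-dependent fields,

$$ [F_j, G_i](x) = \frac{\partial}{\partial x_j}\left(e^{-|x|^2}\right)\frac{\partial}{\partial x_i} = -2\,x_j\,e^{-|x|^2}\,\frac{\partial}{\partial x_i}, $$

and further brackets with the \(F_k\) keep differentiating the Gaussian coefficient, producing polynomial-times-Gaussian fields; this gives a hint of why the family is rich enough for the approximation property.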

Theorem. (Agrachev-Sarychev'2021) Let \(\gamma_0\) and \(\gamma_f\) be two diffeotopic ensembles in \(\mathscr{E}_\Theta\). Under the strong approximation property, \(\gamma_f\) is \(\mathscr{C}^0\)-approximately reachable from \(\gamma_0\) by control system (1).

Remark. The proof relies on the possibility of approximating a diffeomorphism \(\Psi\) (diffeotopic to the identity) uniformly on a given compact \(K\) by the flow \( \Phi_u \) of some admissible control \(u\):

$$ \sup_{x \in K} |\Phi_u(x)-\Psi(x)| \leq \varepsilon. $$

But when \(\Psi\) is not itself a flow, the controls cannot remain bounded as \(\varepsilon \to 0\) (for instance, when one triangulates \(K\) with a finite number \(N\) of points and uses ensemble controllability: overfitting as \(N \to \infty\)).

Going further

Use optimal control to penalise / regularise according to

$$ \frac{1}{N}\sum_{j=1}^N \text{loss}( \Phi_u(x^0_j)-\Psi(x^0_j) ) + \frac{\alpha}{2} \Vert u \Vert_2^2 \to \min \qquad (2) $$

where \((x^0_1,\ldots,x^0_N) \in (\mathbf{R}^n)^{(N)}\) is the training set of values in \(K\) for the diffeomorphism \(\Psi\).
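A minimal NumPy sketch of (2) with piecewise-constant controls, a quadratic loss, and plain gradient descent with finite-difference gradients (the "gradient flow" approach mentioned below). The fields (the family \(F_i, G_i\) above), the target \(\Psi\) and all hyper-parameters are assumptions made for the example.

```python
import numpy as np

def fields(x):
    """Rows are the fields F_1, F_2, G_1, G_2 of the family above (n = 2, m = 4)."""
    g = np.exp(-x @ x)
    return np.array([[1.0, 0.0], [0.0, 1.0], [g, 0.0], [0.0, g]])

def flow(x0, U, steps_per_piece=10):
    """Time-1 flow of x' = sum_i u_i(t) F_i(x), piecewise-constant control U (L x m)."""
    x = np.array(x0, dtype=float)
    h = 1.0 / (len(U) * steps_per_piece)
    for u in U:
        for _ in range(steps_per_piece):
            x = x + h * (u @ fields(x))
    return x

def J(U, X0, Psi, alpha=1e-2):
    """Objective (2): empirical risk (quadratic loss) + (alpha/2) ||u||_2^2."""
    risk = np.mean([np.sum((flow(x, U) - Psi(x)) ** 2) for x in X0])
    return risk + 0.5 * alpha * np.sum(U ** 2) / len(U)

rng = np.random.default_rng(0)
X0 = rng.uniform(-1.0, 1.0, size=(10, 2))                         # training points in K
Psi = lambda x: x + 0.3 * np.array([np.sin(x[1]), np.sin(x[0])])  # target diffeomorphism

U = np.zeros((5, 4))                       # L = 5 pieces, m = 4 controls
lr, eps = 1.0, 1e-5
for _ in range(30):                        # explicit gradient descent on (2)
    grad = np.zeros_like(U)
    for idx in np.ndindex(*U.shape):
        dU = np.zeros_like(U); dU[idx] = eps
        grad[idx] = (J(U + dU, X0, Psi) - J(U - dU, X0, Psi)) / (2 * eps)
    U -= lr * grad
print(J(U, X0, Psi))
```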

  • \(\Gamma\)-convergence results (Scagliotti'2022) when \(N \to \infty\) towards

$$ \int_K \text{loss}( \Phi_u(x)-\Psi(x) )\ \mathrm{d}\mu + \frac{\alpha}{2} \Vert u \Vert_2^2,\quad \mu = \lim_{N \to \infty} (1/N) \sum_{j=1}^N \delta_{x^0_j} $$

  • numerical methods: gradient flow on (2), indirect method (Pontryagin maximum principle); a generic direct-method sketch is given after this list
  • need for direct / indirect optimal control solvers & AD (automatic differentiation): control-toolbox
  • basic example (direct method)
  • advanced example (direct + indirect methods)
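For reference, here is a generic SciPy sketch of the direct approach on a toy problem (discretise the control, then optimise the resulting finite-dimensional problem). It only illustrates the method, not the control-toolbox API; the problem data are arbitrary assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Direct method on a toy problem: steer x' = u from x(0) = 0 to x_f = 1 on [0, 1],
# minimising |x(1) - x_f|^2 + (alpha/2) ||u||_2^2, with piecewise-constant control.
L, alpha, xf = 20, 0.1, 1.0

def cost(u):
    x = 0.0
    for ul in u:                      # explicit Euler with step 1/L
        x += ul / L
    return (x - xf) ** 2 + 0.5 * alpha * np.sum(u ** 2) / L

res = minimize(cost, np.zeros(L))     # BFGS with numerical gradients by default
print(res.x)                          # nearly constant control, as the PMP predicts here
```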