Transcript of "Deep Learning For Embedded Security Evaluation"
Emmanuel PROUFF
Joint work with Ryad Benadjila, Eleonora Cagli (CEA LETI), Cécile Dumas (CEA LETI), Houssem Maghrebi (UL), Loïc Masure (CEA LETI), Thibault Portigliatti (ex SAFRAN), Rémi Strullu, and Adrian Thillard
ANSSI (French Network and Information Security Agency)
June 17, 2019
June 2019, Summer School, Sibenik, Croatia| E. Prouff| 0/18
Learning| Countermeasures| Machine Learning| Building a Community| Conclusions|
Illustration| Template Attacks|
Probability distribution function (pdf) of the electromagnetic emanations of a cryptographic processing, shown successively for the secrets k = 1, 2, 3, and 4 (one pdf per key value).
1/18 Emmanuel PROUFF - ANSSI / Invited Talk PANDA 2018
Context: Target Device vs. Clone Device

[On Clone Device] For every k, estimate the pdf of ~X | K = k (here k = 1, 2, 3, 4).
[On Target Device] Estimate the pdf of ~X (k = ?).
[Key recovery] Compare the pdf estimates.

2/18 Emmanuel PROUFF - ANSSI / Invited Talk PANDA 2018
Side Channel Attacks (Classical Approach)

Notations
~X  observation of the device behaviour
P   public input of the processing
Z   target (a cryptographic sensitive variable Z = f(P, K))

Goal: make inference on Z from the observation ~X, i.e. estimate Pr[Z | ~X].

Template Attacks
Profiling phase (using profiling traces acquired under known Z):
I manage the de-synchronization problem
I perform a (mandatory) dimensionality reduction ε
I estimate Pr[ε(~X) | Z = z] by simple distributions, for each value of z

Attack phase (N attack traces ~x_i, e.g. with known plaintexts p_i):
I log-likelihood score for each key hypothesis k:

  d_k = Σ_{i=1}^{N} log Pr[ε(~X) = ε(~x_i) | Z = f(p_i, k)]

3/18 Emmanuel PROUFF - ANSSI / Invited Talk PANDA 2018
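The two phases of a template attack can be sketched in a few lines. This is a minimal illustration under simplifying assumptions (a diagonal Gaussian per class, a crude common variance estimate, and a toy leakage model), not the setup used in the talk:

```python
import numpy as np

def profile_templates(traces, z_labels, n_classes):
    """Profiling phase: estimate Pr[X | Z = z] with one Gaussian per class."""
    means = np.array([traces[z_labels == z].mean(axis=0) for z in range(n_classes)])
    # Total variance as a crude shared noise estimate; a pooled
    # within-class covariance would be more faithful to real templates.
    var = traces.var(axis=0) + 1e-9
    return means, var

def log_likelihood(trace, mean, var):
    """log Pr[X = x | Z = z] under a diagonal Gaussian model."""
    return -0.5 * np.sum((trace - mean) ** 2 / var + np.log(2 * np.pi * var))

def attack(traces, plaintexts, f, means, var, n_keys):
    """Attack phase: d_k = sum_i log Pr[X = x_i | Z = f(p_i, k)]."""
    d = np.zeros(n_keys)
    for k in range(n_keys):
        for x, p in zip(traces, plaintexts):
            d[k] += log_likelihood(x, means[f(p, k)], var)
    return d
```

The best key candidate is then `argmax(d)`, exactly as in the score d_k above.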
Defensive Mechanisms

Misaligning Countermeasures
I random delays, clock jittering, ...
I in theory: assumed to be insufficient to provide security on their own
I in practice: one of the main issues for evaluators
⟹ Need for efficient resynchronization techniques

Masking Countermeasure
I each key-dependent internal state element is randomly split into 2 shares
I the crypto algorithm is adapted so that the shares are always manipulated at different times
I the adversary needs to recover information on the two shares to recover K
⟹ Need for efficient methods to recover tuples of leakage samples that jointly depend on the target secret

4/18 Emmanuel PROUFF - ANSSI / Invited Talk PANDA 2018
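The splitting into two shares can be sketched as first-order Boolean masking of a byte. This is an illustrative sketch of the principle, not the countermeasure of any specific implementation:

```python
import secrets

def mask(z):
    """Split a sensitive byte z into two shares (m, z ^ m).

    Each share taken alone is uniformly distributed, so a single
    leakage sample depending on only one share carries no
    information on z; an attacker must combine leakage on both.
    """
    m = secrets.randbelow(256)   # fresh random mask per execution
    return m, z ^ m

def unmask(share0, share1):
    """Recombine the shares: m ^ (z ^ m) = z."""
    return share0 ^ share1
```

The masked algorithm computes on the shares separately and only the recombination (or a joint statistic on both leakage samples) reveals z.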
Introduction| Convolutional Neural Networks| Training of Models|
Motivating Conclusions

Now:
I preprocessing to prepare the data (traces resynchronisation, selection of PoIs)
I strong hypotheses on the statistical dependency (e.g. Gaussian approximation)
I characterization to extract the information (e.g. Maximum Likelihood)

The proposed perspective:
I train algorithms to directly extract the information, in place of the preprocessing, the strong statistical hypotheses, and the characterization steps

5/18 Emmanuel PROUFF - ANSSI / Invited Talk PANDA 2018
Side Channel Attacks with a Classifier

Notations
~X  side channel trace
Z   target (a cryptographic sensitive variable Z = f(P, K))

Goal: make inference on Z from the observation ~X, i.e. estimate Pr[Z | ~X].

Template Attacks → Machine Learning Side Channel Attacks

Training phase (using training traces acquired under known Z), as one integrated approach:
I manage the de-synchronization problem
I mandatory dimensionality reduction
I construct a classifier F s.t. F(~x)[z] = y ≈ Pr[Z = z | ~X = ~x]

Attack phase (N attack traces ~x_i, e.g. with known plaintexts p_i):
I log-likelihood score for each key hypothesis k:

  d_k = Σ_{i=1}^{N} log F(~x_i)[f(p_i, k)]

6/18 Emmanuel PROUFF - ANSSI / Invited Talk PANDA 2018
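Given the classifier's per-trace probability vectors, the score d_k above is a direct computation. In this sketch `probs` stands in for the outputs F(~x_i) of any trained classifier, and `f` for the target function Z = f(P, K):

```python
import numpy as np

def key_scores(probs, plaintexts, f, n_keys):
    """d_k = sum_i log F(x_i)[f(p_i, k)].

    probs: (N, n_classes) array, probs[i, z] approximates Pr[Z = z | X = x_i].
    """
    eps = 1e-40  # guard against log(0) when the classifier outputs hard zeros
    d = np.zeros(n_keys)
    idx = np.arange(len(plaintexts))
    for k in range(n_keys):
        z = np.array([f(p, k) for p in plaintexts])  # hypothesized targets
        d[k] = np.log(probs[idx, z] + eps).sum()
    return d
```

As before, `argmax(d)` gives the best-ranked key hypothesis.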
Machine Learning Approach
Overview of the Machine Learning Methodology

Human effort:
I choose a class of algorithms (e.g. Neural Networks)
I choose a model to fit + tune its hyper-parameters (e.g. MLP, ConvNet)

Automatic training:
I automatic tuning of the trainable parameters to fit the data (e.g. Stochastic Gradient Descent)

7/18 Emmanuel PROUFF - ANSSI / Invited Talk PANDA 2018
Convolutional Neural Networks
An answer to translation-invariance

[Figure: a classifier mapping an input image to class probabilities (Horse / Dog / Cat), then the same idea applied to a side channel trace x, outputting Pr[Z | X = x] for Z = 0, 1; the classification is expected to be invariant to translations of the input.]

It is important to make the translation-invariance of the data explicit.
Convolutional Neural Networks: share weights across space.

8/18 Emmanuel PROUFF - ANSSI / Invited Talk PANDA 2018
Basic Example

Figure: convolutional filtering with W = 2, n_filter = 4, stride = 1, padding = "same"; max-pooling layer with W = stride = 3.

9/18 Emmanuel PROUFF - ANSSI / Invited Talk PANDA 2018
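The two layers of the figure can be mimicked in plain NumPy. One caveat: for an even window W the "same" padding convention differs between frameworks; padding one extra sample on the right, as done here, is an assumption:

```python
import numpy as np

def conv1d_same(x, filters):
    """1D convolutional filtering, stride 1, 'same' zero-padding.

    x: (length,) trace; filters: (n_filter, W).
    Returns an (n_filter, length) array of feature maps.
    """
    n_filter, W = filters.shape
    pad_left = (W - 1) // 2
    pad_right = W - 1 - pad_left          # extra sample goes on the right
    xp = np.pad(x, (pad_left, pad_right))
    return np.array([[np.dot(xp[t:t + W], f) for t in range(len(x))]
                     for f in filters])

def maxpool1d(y, W):
    """Max pooling with window W and stride W (non-overlapping)."""
    L = (y.shape[1] // W) * W             # drop an incomplete final window
    return y[:, :L].reshape(y.shape[0], -1, W).max(axis=2)
```

With the figure's parameters (W = 2, 4 filters, then pooling with W = stride = 3) a trace of length 6 yields 4 feature maps of length 6, pooled down to length 2.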
Example: masked manipulation of a sensitive datum Z

Figure: Deep Learning behaviour against a masked datum (shown over three successive figures, slides 10/18 to 12/18).

10-12/18 Emmanuel PROUFF - ANSSI / Invited Talk PANDA 2018
Training of Neural Networks
Trading side-channel expertise for deep-learning expertise ... or huge computational power!

Training
I aims at finding the parameters of the function modelling the dependency between the target value and the leakage
I the search is done by solving a minimization problem with respect to some metric (aka the loss function)
I the training algorithm itself has training hyper-parameters: the number of iterations (aka epochs) of the minimization procedure, and the number of input traces (aka batch size) treated during a single iteration
I the trained model has architecture hyper-parameters: the size of the layers, the nature of the layers, the number of layers, etc.

Tricky Points
Finding sound hyper-parameters is the main issue in Deep Learning: it requires a good understanding of the underlying structure of the data and/or access to substantial computational power.

13/18 Emmanuel PROUFF - ANSSI / Invited Talk PANDA 2018
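The two training hyper-parameters named above, epochs and batch size, appear explicitly in a bare-bones stochastic gradient descent loop. The model parameters and loss gradient are placeholders for whatever network and loss function are chosen; this is a sketch of the procedure, not the talk's actual training setup:

```python
import numpy as np

def train(params, traces, labels, loss_grad, epochs=100, batch_size=64, lr=0.01):
    """Minimal SGD: `epochs` passes over the data, `batch_size` traces per step.

    loss_grad(params, batch_x, batch_y) must return the gradient of the
    chosen loss with respect to params, averaged over the batch.
    """
    n = len(traces)
    rng = np.random.default_rng(0)
    for _ in range(epochs):                       # one epoch = one full pass
        order = rng.permutation(n)                # shuffle between epochs
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # one mini-batch
            params = params - lr * loss_grad(params, traces[idx], labels[idx])
    return params
```

For example, with a squared-error gradient the loop recovers the weights of a noiseless linear model in a few dozen epochs.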
Creation of an Open Database for Training and Testing

ANSSI recently published:
I source code of secure AES-128 implementations for public 8-bit architectures (https://github.com/ANSSI-FR/secAES-ATmega8515)
  I first version: first-order masking + processing in random order
  I second version: affine masking + processing in random order (plus other minor tricks)
I databases of electromagnetic leakages (https://github.com/ANSSI-FR/ASCAD)
I example scripts for the training and testing of models in SCA contexts

Goals
I enable fair and easy benchmarking
I initiate discussions and exchanges on the application of DL to SCA
I create a community of contributors on this subject

14/18 Emmanuel PROUFF - ANSSI / Invited Talk PANDA 2018
Comparisons with State-Of-the-Art Methods
15/18 Emmanuel PROUFF - ANSSI / Invited Talk PANDA 2018
Learning| Countermeasures| Machine Learning| Building a Community| Conclusions|
ASCAD Open Data-Base| Results|
Feedback & Open Issues
Feedback
- The number of epochs for training is between 100 and 1000
- Model architectures are relatively complex (more than 10 layers)
- Training databases must be large
- Significant processing capacity is required (several GPUs, large RAM, etc.)
- Cross-validation is important
Open Issues
- Models are trained to recover manipulated values (e.g. S-box outputs) but not the key itself
- Current loss functions measure the accuracy of pdf estimations but not the efficiency of the resulting attack
- Adaptation to obtain (very) efficient key enumeration algorithms
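The first open issue has a standard workaround on the attack side: the network's per-trace probabilities over the S-box output are aggregated into a maximum-likelihood score per key hypothesis. A minimal sketch, with a hypothetical 4-bit S-box and synthetic model outputs standing in for a trained network:

```python
import math

# Toy 4-bit S-box (illustration only; AES uses a 256-entry S-box).
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]

def key_scores(plaintexts, model_probs):
    """Combine per-trace S-box-output probabilities into per-key
    log-likelihood scores (maximum-likelihood key ranking).

    model_probs[i][v] is the model's estimated probability that
    trace i corresponds to S-box output value v.
    """
    scores = [0.0] * len(SBOX)
    for p, probs in zip(plaintexts, model_probs):
        for k in range(len(SBOX)):
            v = SBOX[p ^ k]  # value the S-box would output under key k
            scores[k] += math.log(max(probs[v], 1e-12))  # guard log(0)
    return scores

# Usage: a well-trained (synthetic) model on 3 traces with true key 0x7.
true_key = 0x7
plaintexts = [0x0, 0x3, 0xA]
model_probs = []
for p in plaintexts:
    row = [0.01] * 16               # small floor on the wrong values
    row[SBOX[p ^ true_key]] = 0.85  # mass on the value actually leaked
    model_probs.append(row)

scores = key_scores(plaintexts, model_probs)
best_key = max(range(16), key=lambda k: scores[k])  # recovers 0x7
```

The open issue remains that the training loss optimizes the per-trace pdf estimates, not this aggregated key score, so a model with mediocre accuracy can still yield a good or bad key rank depending on how its errors distribute.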
Recent Results on DL for SCA Evaluation
- DL for leakage identification after a successful attack: HettwerGehrerGuneysu2019, PerinEgeWoudenberg, MasureDumasProuff2019
- Application to asymmetric cryptography running on defensive hardware: CarboneConinCornelieDassanceDufresneDumasProuffVenelli2019
- Various studies to explain (and/or improve) the behaviour of deep-learning analysis in the context of side-channel attacks: PicekHeuserJovicBhasinRegazzoni19, KimPicekHeuserBhasinHanjalic19, etc.
Conclusions
- State-of-the-art Template Attacks separate resynchronization/dimensionality reduction from characterization
- Deep Learning provides an integrated approach that extracts information directly from raw data (no preprocessing)
- Many recent results validate the practical interest of the Machine Learning approach
- We are at the very beginning and are still discovering how efficient Deep Learning can be
- New needs:
  - big databases for training,
  - platforms to enable comparisons and benchmarking,
  - creation of an open community "ML for Embedded Security Analysis",
  - exchanges with the Machine Learning community,
  - understanding the efficiency of the current countermeasures
Thank You! Questions?