
The Alan Turing Institute Internship Programme 2018

Contents

Project 1 – An interdisciplinary approach to programming by example
Project 2 – Algorithms for automatic detection of new word meanings from social media to understand language and social dynamics
Project 3 – High performance, large-scale regression
Project 4 – Design, analysis and applications of efficient algorithms for graph-based modelling
Project 5 – Privacy-aware neural network classification & training
Project 6 – Clustering signed networks and time series data
Project 7 – Uncovering hidden cooperation in democratic institutions
Project 8 – Deep learning for object tracking over occlusion
Project 9 – Listening to the crowd: Data science to understand the British Museum visitors


Project 1 – An interdisciplinary approach to programming by example

Project Goal

To compare approaches to the versatile idea of ‘programming by example’, which is relevant in various fields and contexts, and to design new interdisciplinary techniques.

Project Supervisors

Adria Gascon (Research Fellow, The Alan Turing Institute, University of Edinburgh)

Nathanaël Fijalkow (Research Fellow, The Alan Turing Institute, University of Warwick)

Brooks Paige (Research Fellow, The Alan Turing Institute, University of Cambridge)

Project Description

Programming by example is a very natural and simple approach to programming: instead of

writing a program, give the computer a desired set of inputs and outputs, and hope that the program will write itself from these examples. In general, nothing prevents the

computer from relying on training data, initiating an interactive dialogue with the user to

resolve uncertainties, or even relying on the Internet, e.g. StackOverflow, to produce a

solution that realises the user’s intent.

A typical application is an Excel sheet: you write 2, 4, 6 and click "continue", hoping that the computer will output 8, 10, 12... Another application is robotics, where programming by example is often called programming by demonstration. The goal there is to teach robots complicated behaviours, not by hardcoding them, which would be too costly and complicated, but by showing a few examples and asking the robot to imitate them.
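To make the flavour concrete, here is a purely illustrative toy sketch (in Python; all names are ours, not any system discussed below): it searches a tiny space of affine rules f(n) = a*n + b for one consistent with the given examples, then uses the learned rule to continue the sequence.

```python
# A toy "programming by example" engine: given the first terms of a
# sequence, search a small space of affine rules f(n) = a*n + b that
# reproduce them, then use the found rule to continue the sequence.

def synthesise_affine_rule(examples):
    """examples: list of (index, value) pairs, e.g. [(0, 2), (1, 4), (2, 6)]."""
    for a in range(-10, 11):
        for b in range(-10, 11):
            if all(a * n + b == v for n, v in examples):
                return lambda n, a=a, b=b: a * n + b
    return None  # no rule in this tiny hypothesis space fits

rule = synthesise_affine_rule([(0, 2), (1, 4), (2, 6)])
if rule is not None:
    print([rule(n) for n in range(3, 6)])  # -> [8, 10, 12]
```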

Automated program synthesis, namely having programs write correct programs, is a problem

with a rich history in computer science that dates back to the origins of the field itself. In

particular, the simple paradigm of “Programming by Example” has been independently

developed within several subfields (at least formal verification, programming languages, and

learning) under different names and with different approaches. This project is about understanding the tradeoffs between these techniques, comparing them, and possibly devising one to beat them all.

Programming by example can be seen as a concrete framework for program synthesis. In synthesis, the program is specified by a high-level specification, for instance a logical formula. The special case where only inputs and outputs are given is nonetheless pertinent in synthesis (see, for example, https://dspace.mit.edu/openaccess-disseminate/1721.1/90876). Adria Gascon has long experience in synthesis, in particular using SMT solvers. This will be one of the approaches to look at.
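For a taste of the SMT-based route, a minimal sketch (assuming the Z3 Python bindings from the z3-solver package; restricting to linear functions is our simplification) asks the solver for coefficients of a linear function agreeing with every example:

```python
# Sketch: synthesise a linear function f(x) = a*x + b from input-output
# examples by handing the constraints to an SMT solver.
# Requires the z3-solver package.
from z3 import Ints, Solver, sat

a, b = Ints("a b")
examples = [(1, 2), (2, 4), (3, 6)]  # (input, output) pairs

s = Solver()
for x, y in examples:
    s.add(a * x + b == y)  # f must agree with every example

if s.check() == sat:
    m = s.model()
    print(f"f(x) = {m[a]}*x + {m[b]}")  # -> f(x) = 2*x + 0
```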

Programming by example can be attempted by neural networks and probabilistic inference.

There is some recent work in this direction which attempts to solve the program induction

problem directly (see for instance https://arxiv.org/abs/1703.04990), as well as work which

adopts deep learning as a way to provide assistance to SMT solvers (e.g.

https://arxiv.org/abs/1611.01989). Brooks Paige is familiar with such approaches. This will

be a second approach to look at.

Programming by example can be seen as an automaton learning task. In this scenario, the

goal is to learn a weighted automaton, which is a simple recursive finite-state machine

outputting real numbers. There are powerful methods for learning weighted automata, for

instance through spectral techniques. Nathanaël Fijalkow has worked on these questions.

This will be a third approach to look at.
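For illustration, here is what a weighted automaton computes, with hand-chosen matrices rather than learned ones: the weight of a word is a product of transition matrices sandwiched between an initial and a final vector. The automaton below counts occurrences of the letter 'a'. (Spectral learning recovers such matrices from data, roughly via singular value decompositions of Hankel matrices.)

```python
import numpy as np

# A weighted automaton assigns each word w = w1...wn the real number
# f(w) = alpha^T * A[w1] * ... * A[wn] * beta.
alpha = np.array([1.0, 0.0])                  # initial weights
beta = np.array([0.0, 1.0])                   # final weights
A = {
    "a": np.array([[1.0, 1.0], [0.0, 1.0]]),  # transition matrix for 'a'
    "b": np.array([[1.0, 0.0], [0.0, 1.0]]),  # identity: 'b' changes nothing
}

def weight(word):
    v = alpha
    for letter in word:
        v = v @ A[letter]
    return float(v @ beta)

print(weight("abab"))  # -> 2.0, the number of 'a's in the word
```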

Besides studying their formal guarantees, we plan to empirically evaluate our algorithms,

and hence the project will involve a significant amount of coding.

Number of Students on Project: 2

Internship Person Specification

Essential Skills and Knowledge

• Interest in theoretical computer science in general

• Interest in various computational models: automata, neural networks

• Interest in programming languages

• Interest in interdisciplinarity (within maths and computer science), as the different techniques to be understood and compared are rather diverse

• Coding skills

Desired Skills and Knowledge

• Previous experience in SMT solving

• Previous experience in NNs

• Previous experience in automata learning

Return to Contents


Project 2 – Algorithms for automatic detection of new word meanings from social media to understand language and social dynamics

Project Goal

To develop computational methods for identifying the emergence of new word meanings using social media data, to advance understanding of cultural and linguistic interaction online, and to improve natural language processing tools.

Project Supervisors

Barbara McGillivray (Research Fellow, The Alan Turing Institute, University of Cambridge)

Dong Nguyen (Research Fellow, The Alan Turing Institute, University of Edinburgh)

Scott Hale (Turing Fellow, The Alan Turing Institute, University of Oxford)

Project Description

This project focuses on developing a system for identifying new word meanings as they

emerge in language, focussing on words entering English from different languages and

changes in their polarity (e.g., from neutral to negative or offensive). An example is the word

kaffir, which, starting from a neutral meaning, has acquired an offensive use as a racial or

religious insult. The proposed research furthers the state of the art in Natural Language

Processing (NLP) by developing better tools for processing language data semantically, and

has impact on important social science questions.

Language evolves constantly through social interactions. New words appear, others become

obsolete, and others acquire new meanings. Social scientists and linguists are interested in

investigating the mechanisms driving these changes. For instance, analysing the meaning of

loanwords from foreign languages using social media data helps us understand the precise

sense of what is communicated, how people interact online, and the extent to which social

media facilitate cross-cultural exchanges. In the case of offensive language, understanding

the mechanisms by which it is propagated can inform the design of collaborative online

platforms and provide recommendations to limit offensive language where this is desired.


Detecting new meanings of words is also crucial to improve the accuracy of NLP tools for

downstream tasks, for example in the estimation of the "polarity" of words in sentiment

analysis (e.g. sick has recently acquired a positive meaning of 'excellent' alongside the

original meaning of 'ill'). Work to date has mostly focused on changes over longer time

periods (cf., e.g., Hamilton et al. 2016). For instance, awful in texts from the 1850s was a

synonym of 'solemn' and nowadays stands for 'terrible'. New data on language use and new

data science methods allow for studying this change at finer timescales and higher

resolutions. In addition to social media, online collaborative dictionaries like Urban Dictionary

are excellent sources for studying language change as it happens; they are constantly

updated and the threshold for including new material is lower than for traditional dictionaries.

The meaning of words in state-of-the-art NLP algorithms is often expressed by vectors in a low-dimensional space, where geometric closeness stands for semantic similarity. These vectors

are usually fed into neural architectures built for specific tasks. The proposed project aims at

capturing meaning change on a fine-grained, short time scale. We will use the algorithm

developed by Hamilton et al. (2016), who used it to identify new meanings using Google

Books. We will train in-house vectors on multilingual Twitter data collected from 2011 to

2017. Through this process we will identify meaning change candidates and evaluate them

against the dictionary data by focussing on analysing the factors that drive foreign words to

enter the English language and to change their polarity. In doing so, we will shed light on the

extent to which the detected meaning changes are driven by linguistically internal rather than

external (e.g. social, technological, etc.) factors.
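One standard way to compare vectors trained on different time slices, in the spirit of Hamilton et al. (2016), is orthogonal Procrustes alignment followed by a cosine-distance "displacement" score. A minimal NumPy sketch, with random matrices standing in for the real embeddings:

```python
import numpy as np

def procrustes_align(X_old, X_new):
    """Rotate X_new's embedding space onto X_old's (rows = shared vocabulary).
    Classic orthogonal Procrustes: W = argmin ||X_new @ W - X_old||_F over
    orthogonal W, solved via an SVD."""
    U, _, Vt = np.linalg.svd(X_new.T @ X_old)
    return X_new @ (U @ Vt)

def semantic_displacement(v_old, v_new):
    """Cosine distance between a word's aligned vectors across time slices;
    large values flag candidates for meaning change."""
    cos = v_old @ v_new / (np.linalg.norm(v_old) * np.linalg.norm(v_new))
    return 1.0 - cos

# Toy usage with random stand-ins for two time slices' embedding matrices:
rng = np.random.default_rng(0)
X_2011, X_2017 = rng.normal(size=(1000, 50)), rng.normal(size=(1000, 50))
X_2017_aligned = procrustes_align(X_2011, X_2017)
print(semantic_displacement(X_2011[0], X_2017_aligned[0]))
```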

The original contributions of this research are:

• The development of an NLP system for detecting meaning change occurring in a

relatively short time period, so as to further the state of the art in NLP.

• The design of an evaluation framework which compares automatically derived

candidates for meaning change against dictionary data.

• The analysis of subsets of such candidates to answer social science questions about

the dynamics of human behaviour online.

The specific tasks of this project are:

a) Implement existing algorithms for identifying words that acquire new meanings as

they appear in the English language using social media data from Twitter collected

over a multiyear period (2011-2017).


b) Validate candidate words from Task (a) against Urban Dictionary and other

dictionaries.

c) Evaluate word meaning change in areas such as foreign loanwords and polarity

change, and address research questions regarding cultural and linguistic exchanges

online, as well as the creation and propagation of offensive language online.

d) Prepare an article to be submitted to a journal or conference.

Number of Students on Project: 2

Internship Person Specification

Essential Skills and Knowledge

• All interns will need to have advanced NLP skills, an interest in linguistics, and experience working with large datasets and cloud computing. At least one of the interns should have some social data science experience.

Desired Skills and Knowledge

• Experience developing R packages, although training can be provided

Return to Contents


Project 3 – High performance, large-scale regression

Project Goal

To investigate distributed, scalable approaches to the standard statistical task of high-dimensional regression with very large amounts of data, with the ultimate goal of informing current best practice in terms of algorithms, architectures and implementations.

Project Supervisors

Anthony Lee (Research Fellow, The Alan Turing Institute, University of Cambridge)

Rajen Shah (Turing Fellow, The Alan Turing Institute, University of Cambridge)

Yi Yu (University of Bristol)

Project Description

The ultimate goal is to critically understand how different, readily available, large-scale

regression algorithms/software and frameworks perform on distributed systems, and to isolate

both computational and statistical performance issues. A specific challenging dataset will

also be included to add additional focus, and there is the opportunity to investigate more

sophisticated, but less readily-available algorithms for comparison.

This project aligns to the Institute’s strategic priorities in establishing leadership and

providing guidance for common data analysis tasks at scale. It can feed into a larger data

science at scale software programme around performance and usability, which it is hoped

will be developed in 2018.

Phases:

First phase: benchmark and profile available approaches on the Cray Urika-GX, and

potentially other architectures, for a scalable example class of models with carefully chosen

characteristics. Different regimes can be explored where there are substantial effects on

performance.

Second phase: use the benchmarks and profiling information to identify which, if any,

recently proposed approaches to large-scale regression may improve performance, with the

advice of Yi Yu and Rajen Shah.

Third phase: apply the skills and software developed to a large and challenging data set.
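As a flavour of the first phase, a baseline distributed fit with one readily available framework might look like the following PySpark sketch (the dataset path is hypothetical, and timing a single fit is of course only the crudest form of benchmarking):

```python
# Minimal sketch: time a distributed ridge-regression fit in Spark MLlib.
# Assumes a cluster with PySpark available; the dataset path is hypothetical.
import time
from pyspark.sql import SparkSession
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("regression-benchmark").getOrCreate()
data = spark.read.format("libsvm").load("hdfs:///data/regression.libsvm")

model = LinearRegression(regParam=0.1, elasticNetParam=0.0)  # ridge penalty
start = time.time()
fitted = model.fit(data)
print(f"fit took {time.time() - start:.1f}s, "
      f"RMSE {fitted.summary.rootMeanSquaredError:.3f}")
spark.stop()
```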


Throughout the project, documentation will be written to enable other data scientists to

perform large-scale regressions with greater ease, and understand the implications of using

different architectures, frameworks, algorithms, and implementations.

This project is supported by Cray Computing.

Number of Students on Project: 2

Internship Person Specification

Essential Skills and Knowledge

• Familiarity with a cluster computing framework for data science / machine learning,

e.g., Spark

• Basic statistical understanding of regression

Desirable Skills and Knowledge

• Some experience with high-performance computing

Return to Contents


Project 4 – Design, analysis and applications of efficient algorithms for graph-based modelling

Project Goal

To develop fast and efficient numerical methods for optimization problems on graphs,

making use of continuum (large data) limits in order to develop multi-scale methods, with

real-world applications in medical imaging and time series data.

Project Supervisors

Matthew Thorpe (University of Cambridge)

Kostas Zygalakis (Turing Fellow, The Alan Turing Institute, University of Edinburgh)

Carola-Bibiane Schönlieb (Turing Fellow, The Alan Turing Institute, University of Cambridge)

Elizabeth Soilleux (University of Cambridge)

Mihai Cucuringu (Research Fellow, The Alan Turing Institute, University of Oxford)

Project Description

Many machine learning methods use a graphical representation of data in order to capture its geometry in the absence of a physical model. If we consider the problem of classifying a large data set, say 10^7 data points, then one common approach is spectral clustering. The idea behind spectral clustering is to project the data onto a small number of discriminating directions where the data should naturally separate into classes. In practice one uses the eigenvectors of the graph Laplacian as directions and then uses off-the-shelf methods such as k-means for the clustering. More importantly, this methodology easily extends to the semi-supervised learning context.
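A minimal sketch of the pipeline just described, using dense matrices for simplicity (NumPy/SciPy/scikit-learn):

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_clustering(W, k):
    """W: symmetric non-negative affinity matrix. Embed each point via the
    k eigenvectors of the unnormalised graph Laplacian with smallest
    eigenvalues, then cluster the embedded points with k-means."""
    L = np.diag(W.sum(axis=1)) - W      # graph Laplacian L = D - W
    _, vecs = eigh(L)                   # eigenvalues in ascending order
    return KMeans(n_clusters=k, n_init=10).fit_predict(vecs[:, :k])

# Toy usage: two well-separated point clouds, Gaussian-kernel affinities.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
labels = spectral_clustering(np.exp(-D2), k=2)
```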

A bottleneck in the above approach is in the computation of eigenvectors of the graph

Laplacian. The dimension of the graph Laplacian equals the number of data points, and the eigenvector computation therefore becomes infeasible for large data sets. Our approach is to use continuum (large

data) limits of the graph Laplacian to approximate the discrete problem with a continuum

PDE problem. We can then use standard methods to discretise the continuum PDE problem

on a potentially much coarser scale compared to the original discrete problem. In particular,

instead of computing eigenvectors of the graph Laplacian, one would compute eigenfunctions of the continuum limit of the graph Laplacian and use these instead. This

should remove the bottleneck in spectral clustering methods for large data. The approach is

amenable to multi-scale methods, in particular by computing coarse approximations and

iteratively refining using known scaling results.

The project will start by implementing modifications of existing algorithms; in particular, we will replace bottlenecks such as computing eigenvalues with an approximation based on continuum limits. Once we have a working algorithm we aim to take the project further by developing classification algorithms for diagnosing coeliac disease from medical images. In particular, using our algorithms, we aim to improve on the current state-of-the-art method of diagnosing coeliac disease (microscopic examination of biopsies), which is inaccurate, with around 20% misclassification.

Number of Students on Project: 1

Internship Person Specification

Essential Skills and Knowledge

• Good scientific computing skills, preferably in MATLAB or Python

• Competence in basic linear algebra

• Some functional analysis and PDEs

• Strong communication skills

Desirable Skills and Knowledge

• Experience with implementing Bayesian methods

Return to Contents


Project 5 – Privacy-aware neural network classification & training

Project Goal

To invent new encrypted methods for neural network training and classification.

Project Supervisors

Matt Kusner (Research Fellow, The Alan Turing Institute, University of Warwick)

Adria Gascon (Research Fellow, The Alan Turing Institute, University of Warwick)

Varun Kanade (Turing Fellow, The Alan Turing Institute, University of Oxford)

Project Description

Neural networks crucially rely on significant amounts of data to achieve state-of-the-art

accuracy. This makes paradigms such as cloud computing and learning on distributed

datasets appealing. In the former setting, computation and storage are efficiently outsourced

to a trusted computing party, e.g. Azure, while in the latter, the computation of accurate

models is enabled by aggregating data from several sources.

However, for regulatory and/or ethical reasons, data cannot always be shared. For instance, many hospitals may have overlapping patient statistics which, if aggregated, could produce highly accurate classifiers; doing so, however, may compromise highly personal data.

This kind of privacy concern prevents useful analysis on sensitive data. To tackle this issue,

privacy-preserving data analysis is an emerging area involving several disciplines such as

statistics, computer science, cryptography, and systems security. Although privacy in data

analysis is not a solved problem, many theoretical and engineering breakthroughs have

made privacy-enhancing technologies such as homomorphic encryption, multi-party

computation, and differential privacy into approaches of practical interest.

However, such generic techniques do not scale to input sizes required for training accurate

deep learning models, and custom approaches carefully combining them are necessary to

Page 13: The Alan Turing Institute Internship Programme 2018 · PDF filesocial dynamics. Project Goal ... with working with large datasets and cloud computing. At least one of the interns should

13

overcome scalability issues. Recent work on sparsifying neural networks and discretising the

weights used when training neural networks would be suitable avenues to enable application

of modern encryption techniques. However, issues such as highly non-linear activation

functions and the requirement for current methods to keep track of some high-precision

parameters may inhibit direct application.
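To illustrate why additively homomorphic schemes mesh well with linear layers but not with non-linear activations, here is a minimal sketch assuming the third-party python-paillier package (phe): an encrypted dot product is straightforward, while a ReLU on the result is not.

```python
# Sketch: an encrypted linear layer under Paillier encryption.
# Requires the `phe` (python-paillier) package. Ciphertext additions and
# multiplications by plaintext scalars are supported; non-linearities are not.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair(n_length=1024)

x = [0.5, -1.2, 3.0]                 # private input vector
w = [0.1, 0.4, -0.2]                 # public (plaintext) weights

enc_x = [public_key.encrypt(v) for v in x]
enc_out = sum(wi * xi for wi, xi in zip(w, enc_x))   # encrypted dot product

print(private_key.decrypt(enc_out))  # -> approximately -1.03
# A ReLU or sigmoid on enc_out is impossible without decrypting or switching
# cryptographic primitives -- hence the low-precision, simple-activation
# training procedures targeted by this project.
```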

The project will focus on both these aspects:

• Designing training procedures that use only low-precision weights and simple activation

functions.

• Adapting cryptographic primitives, such as those used in homomorphic encryption and

multi-party computation, to enable private training with these modified training procedures.

The ultimate goal of the project is to integrate both of these aspects into an implementation

of a provably privacy-preserving system for Neural Network Classification & Training.

Number of Students on Project: 2

Internship Person Specification

Essential Skills and Knowledge

• Interest in theoretical aspects of computer science

• Knowledge of public-key cryptography (RSA, Paillier, GSW)

• Knowledge of ML and NN (Residual Networks, Convolutional Networks)

• Experience in implementing secure and/or data analysis systems

• Experience in implementing distributed systems

Desired Skills and Knowledge

• Experience implementing cryptographic protocols

• Experience implementing multi-party computation protocols

Return to Contents


Project 6 – Clustering signed networks and time series data

Project Goal

To implement and compare several recent algorithms, and potentially develop new ones, for

clustering signed networks, with a focus on correlation matrices arising from real-world

multivariate time series data sets.

Project Supervisors

Mihai Cucuringu (Research Fellow, The Alan Turing Institute, University of Oxford)

Hemant Tyagi (Research Fellow, The Alan Turing Institute, University of Edinburgh)

Project Description

Clustering is one of the most widely used techniques in data analysis, and aims to identify

groups of nodes that exhibit similar features. Spectral clustering methods have become a

fundamental tool with a broad range of applications in areas including network science,

machine learning and data mining. The analysis of signed networks - with negative weights

denoting dissimilarity or distance between a pair of nodes in a network - has become an

increasingly important research topic in recent times. Examples include social networks that

contain both friend and foe links, and shopping bipartite networks that encode like and

dislike relationships between users and products. When analysing time series data, the most

popular measure of linear dependence between variables is the Pearson correlation taking

values in [−1, 1], and clustering such correlation matrices is important in certain applications.

This proposal will develop k-way clustering in signed weighted graphs, motivated by social

balance theory, where the task of clustering aims to decompose the network into disjoint

groups. These will be such that individuals within the same group are connected by as many positive edges as possible, while those from different groups are connected by as many negative edges as possible.

We expect that the low-dimensional embeddings obtained via the various approaches we

will investigate could be of independent interest in the context of robust dimensionality

reduction in multivariate time series analysis. Of particular interest is learning nonlinear

mappings from time series data which are able to exploit (even weak) temporal correlations inherent in sequential data, with the end goal of improving out-of-sample prediction. We will

focus on a subset of the following problems.

(1) Signed Network Embedding via a Generalized Eigenproblem. This approach is

inspired by recent work that relies on a generalised eigenvalue formulation which can be

solved extremely fast due to recent developments in Laplacian linear system solvers, making

the approach scalable to networks with millions of nodes.
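As a simple baseline against which the approaches here can be compared, the following sketch does signed spectral clustering via one standard signed Laplacian (an illustration only, not the generalised eigenproblem formulation itself), applied to a correlation matrix of the kind described above:

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def signed_spectral_clustering(W, k):
    """W: symmetric signed weight matrix (e.g. a correlation matrix with
    zeroed diagonal). Embeds nodes via the k eigenvectors of the signed
    Laplacian L = D_bar - W, where D_bar is built from |W|, then k-means."""
    L = np.diag(np.abs(W).sum(axis=1)) - W
    _, vecs = eigh(L)                       # eigenvalues in ascending order
    return KMeans(n_clusters=k, n_init=10).fit_predict(vecs[:, :k])

# Toy usage: a signed network from pairwise Pearson correlations of series.
rng = np.random.default_rng(0)
series = rng.normal(size=(20, 500))         # 20 time series, 500 observations
W = np.corrcoef(series)
np.fill_diagonal(W, 0.0)
print(signed_spectral_clustering(W, k=3))
```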

(2) Signed clustering via Semidefinite Programming (SDP). This approach relies on a

semidefinite programming-based formulation, inspired by recent work in the context of

community detection in sparse networks. We solve the SDP efficiently via

a Burer-Monteiro approach, and extract clusters via minimal spanning tree-based clustering.

(3) An MBO scheme. Another direction relates to graph-based diffuse interface models

utilizing the Ginzburg-Landau functionals, based on an adaptation of the classic numerical

Merriman-Bence-Osher (MBO) scheme for minimizing such graph-based functionals. The

latter approach bears the advantage that it can easily incorporate labelled data, in the

context of semi-supervised clustering.

(4) Another research direction is along the lines of clustering time series using Fréchet

distance. The existing algorithm in the literature is quite complicated and not directly

implementable in practice. It essentially involves a pre-processing step where each time

series is replaced with its lower complexity version via its “signature”. This leads to a faster

algorithm for clustering (in theory). The approach via signatures could prove powerful, and

one could consider forming the signature via randomized sampling of the “segments” of the

time series.

(5) Graph motifs. This approach relies on extending recent work on clustering the

motif/graphlet adjacency matrix, as proposed recently in a Science paper by Benson, Gleich,

and Leskovec.

(6) Spectrum-based deep nets. A recent approach in the literature focuses on fraud

detection in signed graphs with very few labelled training sample points. This problem and

its setup are very similar to the topic of an ongoing research grant, “Accenture and Turing alliance for Data Science”, which uses network analysis tools for fraud detection and could benefit

from any algorithmic developments that would take place during the internship. The

approach proposes a novel framework that combines deep neural networks and

spectral graph analysis, by relying on the low-dimensional spectral coordinates (extracted

by our approaches (1) - (5) detailed above) as input to deep neural networks, making the

latter computationally feasible to train.

Approaches (1), (2) and (3) already have working MATLAB implementations available, which could be built upon and compared to (4) and (5). Time permitting, (6) can also be explored.

There will be freedom to pursue any subset of the above topics that align best with the

candidates’ background and maximise the chances of a publication.

A strong emphasis will be placed on assessing the performance of the algorithms on real-world, publicly available data sets arising in economic data science, meteorology, medical monitoring or finance.

Number of Students on Project: 2

Internship Person Specification

Essential Skills and Knowledge

• Both students should be familiar with the same programming language (either R, Python, or MATLAB)

• Solid knowledge of linear algebra and algorithms

• Familiarity with basic machine learning tools such as clustering, linear regression and

PCA

Desirable Skills and Knowledge

• Basic familiarity with spectral methods, optimization, nonlinear dimensionality reduction, graph theory, model selection, LASSO/ridge regression, SVMs, NNs

Return to Contents


Project 7 – Uncovering hidden cooperation in democratic institutions

Project Goal

To generalise the method of Vote-Trading Networks, previously developed to study hidden

cooperation in the US Congress, to a wider set of democratic institutions, developing a

research programme in the measurement and characterisation of hidden cooperation on a

large scale.

Project Supervisors

Omar A Guerrero (Research Fellow, The Alan Turing Institute, University College London)

Ulrich Matter (University of St Gallen)

Dong Nguyen (Research Fellow, The Alan Turing Institute, University of Edinburgh)

Project Description

The project aims at improving our understanding of cooperation in democratic institutions. In

particular, it will shed new light on cooperative behaviour that is intentionally ‘hidden’. An

example of such hidden cooperation is when two legislators agree to support each other’s

favourite bills, despite their ideological preferences, and/or despite such support being

disapproved of by their respective voters or campaign donors. This kind of behaviour is key to

the passage or blockage of critical legislation; however, we know little about it due to its

unobservable nature. The objective of this project is to exploit newly available big data on

voting behaviour from different institutional contexts and state-of-the-art methods from data

science, in order to develop two distinct research papers with clear policy implications for the

design and evaluation of political institutions.

Political institutions, such as parliaments and congresses, shape the life of every democratic

society. Hence, understanding how legislative decisions arise from hidden agreements has

direct implications on the guidelines that governments follow when conducting policy

interventions. Moreover, decision making by voting is common in areas other than legislative

law-making. It is prevalent in courts, international organizations, as well as in board rooms of

private enterprises.

The supervisors have collected comprehensive data sets on two institutions: the US Supreme Court and the United Nations General Assembly. Each intern will work on one institution, using the data provided by the supervisors and, sometimes, collecting

complementary data (through web scraping). The work conducted on the two institutions will

share a set of tools and methods, but also have unique requirements. In order to streamline

the workflow, the internship will be structured in three phases. Every week, there will be a

group meeting where each intern will give a presentation of his or her progress. This will be

an opportunity to share ideas, questions, challenges and solutions that the interns have

experienced. It will also serve to evaluate progress and adjust goals and objectives. In

addition, the documentation of their progress will be the basis for a final report to be handed

in during the last week.

Phase 1: Introduction (1 to 1.5 weeks)

The interns will receive an introduction to the topic of cooperation in social systems, with a

particular focus on political institutions and situations in which cooperation is intentionally

hidden, such as vote trading, and, hence, unobservable in real-world data. Some specifics

about this phase are the following:

• Introduction to vote trading in democratic institutions, its societal relevance, evidence,

measurements and challenges.

• Introduction to web scraping and text mining.

• Tutorial on network science.

• Tutorial on stochastic and agent-based models.

• Tutorial on the Vote-Trading Networks framework.

Phase 2: Work with Data (3 to 4 weeks)

In this phase, the interns will conduct independent work to prepare their datasets and perform

statistical analysis to understand their structure. The supervisors will provide the ‘core’ datasets,

which will then be processed, pruned and analysed by the interns. Preparation work varies

depending on the project. The intern working with US Supreme Court data will apply natural

language processing (NLP) techniques to a large set of raw text documents, and then, match

the extracted information to voting records. Given the nature of the problems related to NLP,

this work could require substantially more time than the UN project. Hence, the goals and

timelines for this project will be adjusted according to progress. The intern working with UN

data will extend a web scraper, previously developed by the supervisors, in order to download

data from the UN Library on resolutions, and match it to voting data from the UN General

Assembly.


Once the data sets have been prepared, the interns will conduct statistical analysis. This will help the group gain a better understanding of the composition of the population, its characteristics, voting patterns, voting outcomes, etc.
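As one concrete artefact of this statistical analysis, a pairwise voting-agreement matrix can be computed directly from a long-format vote table; the pandas sketch below uses hypothetical column names and toy data. (Measuring hidden vote trading itself requires the full Vote-Trading Networks machinery, not just agreement rates.)

```python
# Sketch: pairwise voting-agreement rates from a long-format vote table.
# Columns are hypothetical: one row per (voter, motion), vote in {1, -1}.
import pandas as pd

votes = pd.DataFrame({
    "voter":  ["A", "B", "C", "A", "B", "C", "A", "B", "C"],
    "motion": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "vote":   [1, 1, -1, -1, -1, 1, 1, -1, 1],
})

# Pivot to a voter x motion matrix, then count agreements pair by pair.
M = votes.pivot(index="voter", columns="motion", values="vote")
agreement = pd.DataFrame(
    {v: M.eq(M.loc[v], axis=1).mean(axis=1) for v in M.index}
)
print(agreement)  # agreement[i][j] = share of motions on which i and j agree
```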

Phase 3: Computational Analysis (rest of the internship)

In this phase, the interns will bring together their understanding of institutions, ideas behind

hidden cooperation, data sets and computational methods. The interns will write up their

results, with the goal of publishing two distinct research articles.

Number of Students on Project: 2

Internship Person Specification

Essential Skills and Knowledge

• Knowledgeable in the Python and/or R programming language

• Familiar with statistical concepts such as random variables, probability distributions

and hypothesis testing

• Experience working with empirical micro-data

Desirable Skills and Knowledge

• Knowledgeable about political institutions and economic behaviour

• Familiar with complexity science and complex networks

• Familiar with agent-based modelling and Monte Carlo simulation

Return to Contents


Project 8 – Deep learning for object tracking over occlusion

Project Goal

To use deep learning to discover occluded objects in an image.

Project Supervisors

Vaishak Belle (Turing Fellow, The Alan Turing Institute, University of Edinburgh)

Chris Russell (Turing Fellow, The Alan Turing Institute, University of Surrey)

Brooks Paige (Research Fellow, The Alan Turing Institute, University of Cambridge)

Project Description

Numerous applications in data science require us to parse unstructured data in an

automated fashion. However, many of the models used to do so are not human-interpretable. Given the

increasing need for explainable machine learning, an inherent challenge is whether

interpretable representations can be learned from data.

Consider the application of object tracking. Classically, algorithms simply track the changing

positions of objects across frames. But in many complex applications, ranging from robotics

to satellite images to security, objects get occluded and thus disappear from the

observational viewpoint. The first task here is then to learn semantic representations for

concepts such as "inside", "behind" and "contained in."

The first supervisor (V. Belle) has written a few papers on using probabilistic programming languages to define such occlusion models -- in the sense of instantiating them as graphical models -- and on using that construction in particle filtering (PF) and decision-theoretic planning problems.

However, the main barrier to success here was that these occlusion models needed to be defined carefully by hand, which makes them difficult to deploy in new contexts. The

main challenge of this internship is to take steps towards automating the learning of these occlusion models directly from data.

Specifically, the idea is to jointly train a state estimation model -- a particle filter (PF) -- with a background vision segmentation model, so that we can predict the next position of an occluded object. The second supervisor (C. Russell) has extensive experience in vision and segmentation and will serve as the principal point of contact at the ATI for the interns. (The first supervisor will also visit regularly during the initial stages.) We will focus on using variational autoencoders, recurrent neural nets or other relevant deep learning architectures such as sum-product networks to enable the learning of semantic representations. For instantiating deep learning architectures, B. Paige will contribute his recent approaches to integrating the learning framework with PyTorch and/or Pyro, the latter recently released by Uber.

For the data, we plan on using two kinds of data sets. From the object tracking community, we

will be using tracking videos and clips to annotate occluded objects to train the models.

(Russell's working relationship with our new partner QMUL gives us direct access to their

tracking expertise, and sports and commuter tracking datasets. In consultation with them, we

intend to apply PF-RNN to these problems.)

The expected outcome is the following: a learned model M such that, for any clip C in which an object O gets occluded at some point, a query about the position of O against M would correctly identify that O is occluded and where its position is, based on the velocity of O's movement and its position the last time it was visible.
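To ground the state-estimation half of the proposal, here is a generic bootstrap particle filter sketch with a constant-velocity motion model, in which occlusion is handled by simply skipping the reweighting step when no observation arrives (a textbook construction, not the probabilistic-program or PF-RNN formulations mentioned above):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000                                   # number of particles
particles = rng.normal(0.0, 1.0, (N, 4))   # state: [x, y, vx, vy]
weights = np.full(N, 1.0 / N)

def step(particles, weights, obs, dt=1.0, q=0.1, r=0.5):
    """One filter step; obs is an (x, y) measurement or None when occluded."""
    # Predict: constant-velocity dynamics plus process noise.
    particles[:, :2] += dt * particles[:, 2:]
    particles += rng.normal(0.0, q, particles.shape)
    if obs is not None:
        # Update: reweight by a Gaussian likelihood of the observation.
        d2 = ((particles[:, :2] - obs) ** 2).sum(axis=1)
        weights = weights * np.exp(-0.5 * d2 / r**2)
        weights /= weights.sum()
        # Resample to avoid weight degeneracy.
        idx = rng.choice(N, size=N, p=weights)
        particles, weights = particles[idx], np.full(N, 1.0 / N)
    # When occluded (obs is None) the prior simply propagates: the estimate
    # keeps moving at the object's last inferred velocity.
    return particles, weights

for t, obs in enumerate([(1.0, 1.0), (2.0, 2.1), None, None, (5.1, 4.9)]):
    particles, weights = step(particles, weights,
                              np.array(obs) if obs else None)
    print(t, (particles[:, :2] * weights[:, None]).sum(axis=0))  # estimate
```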

Number of Students on Project: 2

Internship Person Specification

Essential Skills and Knowledge

• Background in machine learning and deep learning

• Preferably background in handling image data

• Background in sum-product networks or PyTorch, Pyro, etc. would be beneficial

Return to Contents


Project 9 – Listening to the crowd: Data science to understand the British Museum visitors

Project Goal

To analyse and understand the British Museum visitors’ behaviour and feedback, using different sets of data including the Trip Advisor feedback, the Wifi access and “intelligent counting” data, and methods such as natural language processing and time series analysis.

Project Supervisors

Taha Yasseri (Turing Fellow, The Alan Turing Institute, University of Oxford)

Coline Cuau (British Museum)

Harrison Pim (British Museum)

Project Description

There is more to The British Museum than Egyptian mummies and the Rosetta Stone - more

than 6 million people walk through the doors each year, travelling from every corner of the

globe to see the Museum's collection and get a better understanding of their shared

histories. Those visitors offer us a unique test bed for data science and real-world testing at

scale.

In order to address some of the challenges of welcoming such a large number of visitors, the

British Museum is constantly gathering feedback and information about the visiting

experience. Research about visitors informs decisions made by teams around the Museum

and helps the Museum evolve along with its audience. The tools at the Museum’s disposal include direct feedback channels (such as email or comment cards), “intelligent counting” data, wifi data, audio guide data, social media conversations, satisfaction surveys, on-site

observation and conversations on online review sites such as Trip Advisor.

Trip Advisor reviews are one of the largest and richest qualitative datasets the Museum has

access to. On average, over 1,000 visitors review their visit on the platform every month.

These reviews are written in over 10 languages by visitors from all parts of the world, and

historical data stretches back over two years. In these comments, visitors discuss the positive and negative aspects of their visits, make recommendations to others, and rate their

satisfaction. The data set is an opportunity for the Museum to learn more about its visitors, to

understand what the most talked about topics are, and which factors have the biggest

impact on satisfaction.

This research project aims to dig into a rich set of qualitative data, uncovering actionable

insights which will have a real impact on the Museum. The research will have an immediate

and tangible effect and will help the organisation improve the visiting experience currently on

offer at the Museum. The Museum is currently undergoing pivotal strategic change, and the

insights will also feed into future iterations of the display and audience strategies. As far as

we know, the British Museum is the first institution of its kind to take a programmatic

approach to this kind of qualitative data. This pioneering research could potentially impact

the rest of the cultural sector and show the way to a new method of evaluation and visitor

research.

Some of the questions we hope to answer with this data are:

• Understanding satisfaction – what it means, how it affects propensity to recommend,

and which aspects of a visit have the biggest impact on overall satisfaction.

• Analysing the different topics talked about in different languages. Do positive and

negative experiences vary according to language?

• Analysing which parts of the collection or objects visitors talk about the most, and

how feedback differs from one area of the Museum to another.

• Tracking comments regarding a variety of key topics, and understanding how they

relate to one another (tours and talks, audio guides, access, facilities, queues,

overcrowding…).

• Understanding and anticipating external factors which might impact decisions made

to visit (economy, weather, security concerns, strikes, politics…).

The Museum has recently set up a partnership with Trip Advisor, which gives us access to

the reviews in an XML format. This file includes the date and URL of the reviews, as well as

their title, score, language and full review text. The Museum could take a manual approach

to tagging and analysing reviews, but we believe that more insight can be generated through

computational approaches.


The proposed research project will therefore involve heavy use of modern Natural Language

Processing (NLP) techniques. The complete corpus of review text consists of approximately

7,500,000 words in 50,000 distinct reviews. Recent advances in machine learning and NLP

provide a wide range of potential approaches to the subject, but suggested methods include (the first two are sketched after this list):

• Topic modelling

• Clustering/classifying reviews by topic or sentiment

• word2vec style approaches to training/using word embeddings

• Automating the tagging of new reviews

• Time series analysis and principal component analysis
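As a starting point for the first two suggestions, a minimal topic-modelling sketch using scikit-learn's LDA on a bag-of-words matrix (the four reviews are invented stand-ins for the Trip Advisor corpus):

```python
# Sketch: LDA topic modelling of review text with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

reviews = [
    "wonderful collection but the queues were very long",
    "the rosetta stone was amazing, audio guide well worth it",
    "too crowded near the mummies, hard to see anything",
    "great free museum, friendly staff and helpful tours",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(reviews)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Print the top words of each learned topic.
terms = vec.get_feature_names_out()
for t, dist in enumerate(lda.components_):
    top = dist.argsort()[::-1][:5]
    print(f"topic {t}:", ", ".join(terms[i] for i in top))
```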

Number of Students on Project: 2

Internship Person Specification

Essential Skills and Knowledge

• Familiarity with large scale data analysis

• Experience in scientific programming (R or Python)

• Interest in natural language processing techniques

• Interest in or past experience with advanced statistical methods such as time series

analysis and PCA

Desirable Skills and Knowledge

• Interest in culture and museums and familiarity with context of the project

Return to Contents