I am a PhD Candidate at the Oxford-Man Institute of Quantitative Finance and Oxford e-Research Centre, co-supervised by Janet B. Pierrehumbert and Stefan Zohren. My research focuses on incorporating Natural Language Processing into Time Series Forecasting to deliver more accurate predictions of explicit data or events. I am very interested in the ability of NLP to improve current forecasting models and am keen to speak with anyone who shares this interest. The Grand Union DTP, one of the Economic and Social Research Council’s (ESRC) Doctoral Training Partnerships, supports my research through an AQM studentship.
Alongside my studies, I am part of the GB Rowing Development Squad, rowing for Leander Club and Oxford University Boat Club. Having won the U23 and Junior World Championships, I am training to take the step up to Senior level.
PhD in Quantitative Finance, 2021 - present
University of Oxford
MEng in Engineering, Entrepreneurship and Management, 2017 - 2021
University of Oxford
We present a novel approach incorporating transformer-based language models into infectious disease modelling. Text-derived features are quantified by tracking high-density clusters of sentence-level representations of Reddit posts within specific US states' COVID-19 subreddits. We benchmark these clustered embedding features against features extracted from other high-quality datasets. In a threshold-classification task, we show that they outperform all other feature types at predicting upward trend signals, a significant result for infectious disease modelling in areas where epidemiological data is unreliable. Subsequently, in a time-series forecasting task, we fully utilise the predictive power of the caseload and compare the relative strengths of different supplementary datasets as covariate feature sets in a transformer-based time-series model.
This paper evaluates the ability to predict COVID-19 caseloads in local areas using the text of geographically specific subreddits, in conjunction with other features. The problem is constructed as a binary classification task on whether the caseload change exceeds a threshold. We find that including Reddit features, alongside other informative resources, improves the models’ performance in predicting COVID-19 cases. In addition, we show that Reddit features used on their own can act as a strong alternative data source for predicting a short-term rise in caseload, since they perform well, are readily available, and update instantaneously.
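As a rough illustration of the binary threshold framing described above, the sketch below labels each time step by whether the caseload rises by more than a given fraction over a fixed horizon. The relative-change definition, the function name `threshold_labels`, and the `horizon` and `threshold` values are assumptions made for this example only, not the settings used in the paper.

```python
def threshold_labels(cases, horizon=1, threshold=0.1):
    """Return 1 where the relative caseload change over `horizon` steps
    exceeds `threshold`, else 0. All parameters are illustrative."""
    labels = []
    for t in range(len(cases) - horizon):
        current, future = cases[t], cases[t + horizon]
        if current == 0:
            # Relative change is undefined from zero; as an assumption,
            # treat any rise from zero as exceeding the threshold.
            labels.append(1 if future > 0 else 0)
            continue
        change = (future - current) / current
        labels.append(1 if change > threshold else 0)
    return labels


# Hypothetical weekly case counts for one area.
weekly_cases = [100, 105, 130, 120, 150]
labels = threshold_labels(weekly_cases, horizon=1, threshold=0.1)
print(labels)
```

Labels built this way could then serve as targets for any classifier trained on text-derived or other features.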