Alex Hill's industrial placement at IBM Research

PhD student Alex Hill undertook an industrial placement at IBM Research in Daresbury, Cheshire. He studied rare-event surrogate modelling with supervisor Dr Małgorzata Zimoń, and discusses his experience in the blog post below.

I first heard about the possibility of undertaking an internship position at IBM Research Daresbury through the project coordinator of the LIV.DAT CDT. Immediately I was drawn to the position - it was the perfect environment to apply the data skills that I had developed throughout my PhD studies, as well as to learn first-hand from leading researchers in the private sector. I applied, and was grateful to reach the interview stage with a researcher within the Engineering team. She outlined the potential project - comparing surrogate modelling methods for simulations in an engineering setting. This was a great match with my field - astronomy is beset with a deluge of possible cosmogonies (parameterisations of the makeup of the universe), each of which would, in an ideal world, require a bespoke simulation to compare with reality. Surrogate modelling is increasingly being viewed as a solution to the computational challenges that this imposes, so as a budding simulator I saw it as my duty to jump at the opportunity to work alongside one of IBM’s foremost experts on the subject. I fortunately gave the impression of competence and was offered a position. Hurray! Full-steam ahead.

The first hurdle to overcome in getting started at IBM Daresbury was in fact getting to Daresbury. The travel gods have not yet seen fit to imbue Daresbury with a train station, nor to install a convenient bus route from Liverpool. Cycling was the way forward: between Runcorn station and Daresbury there is a lovely canal with many wading birds and swans, whose morning dashes across the path kept things interesting. Further entertainment was in ‘racing’ a fellow commuter each morning, who hopefully remained unaware of this arrangement. Based in the Cheshire countryside on the site of the Daresbury Nuclear Laboratory, Sci-Tech Daresbury retains part of the feeling of a university campus despite the many companies based there.

Throughout the first week my fellow interns and I took introductory classes in the tools of our trade, including git, shell scripting and high performance computing. It was one of those instances where you find yourself wishing that these lessons had come years earlier, a sure sign of their worth. For the first month or so I worked to familiarise myself with the literature and practical methodology surrounding surrogate modelling. This involved reading papers, watching online lectures, and attending study sessions that my supervisor kindly organised. The surrogate methods we employed in our research were Polynomial Chaos Expansion and Gaussian Process Emulation. Briefly - the former is a method for approximating a simulator as a linear combination of basis functions, while the latter models the simulator output as a Gaussian process, whose mean and covariance functions (the covariance determining the strength of correlation between model outputs) are updated with training runs. These methods originate from different disciplines, yet address similar challenges. The particular motivation of our research was rare event modelling - where the simulation we attempt to model takes input values sampled from a Pareto distribution. These distributions are extremely heavy-tailed: the bulk of the samples will have low values, though a non-zero sampling probability continues up to infinity. Our challenge was to produce a surrogate which not only reproduced the expected output statistics of the simulation - i.e. closely approximated the simulator at low values - but also accurately reproduced the simulator outputs along the tail. These two considerations are often in tension with each other, and posed our main challenge.
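To make the "linear combination of basis functions" idea concrete, here is a minimal sketch rather than the code used in the project. It assumes a toy one-dimensional `simulator` (a hypothetical stand-in for an expensive simulation) with an input on [-1, 1], and uses Legendre polynomials, the usual basis choice for a uniformly distributed input. The Pareto-distributed inputs described above would need a different treatment, which this sketch does not attempt.

```python
import numpy as np
from numpy.polynomial import legendre


def simulator(x):
    """Hypothetical stand-in for an expensive simulation (assumption)."""
    return np.sin(3.0 * x) + 0.5 * x**2


# Training inputs: 8 Gauss-Legendre quadrature nodes on [-1, 1].
nodes, _weights = legendre.leggauss(8)
training_outputs = simulator(nodes)

# Fit the coefficients of a degree-7 Legendre expansion to the training runs.
coeffs = legendre.legfit(nodes, training_outputs, deg=7)


def surrogate(x):
    """Cheap approximation: a weighted sum of Legendre basis functions."""
    return legendre.legval(x, coeffs)


# Compare surrogate and simulator on a dense grid of unseen inputs.
x_test = np.linspace(-1.0, 1.0, 201)
max_error = np.max(np.abs(surrogate(x_test) - simulator(x_test)))
print(f"largest surrogate error on the test grid: {max_error:.2e}")
```

Once the coefficients are fitted from eight simulator runs, the surrogate can be evaluated at as many inputs as needed without calling the simulator again, which is the whole appeal.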

I fell into the routine of early mornings and set working hours - typically a rare occurrence for a PhD student. I was more than compensated by the company of my fellow interns and the IBM staff: to a person, I found them friendly, driven and engaged in exciting work. We interns were immediately made to feel welcome, both within our research groups and the wider IBM community. It was reassuring to find a working environment outside of university-based academia where the virtues of that life were replicated in the private sector. As a PhD graduate in waiting, I may well find myself casting an eye towards openings at IBM and other such companies for job prospects in the future.

As we saw with Bong Joon-ho’s Parasite, great stories often have an unexpected twist at the halfway mark. Sadly, such was the case with this internship. As my supervisor and I were making strong progress on our research and on a report to be submitted to a client, the severity of Covid-19 was beginning to become all too clear. The work from home request came early; I grabbed my computing equipment and one last biscuit from the break room, and headed for my new office. As nice as my dinner table is - a steal from a second-hand store, wooden, with white Yorkshire roses inlaid around the edges - it doesn’t hold a candle to Daresbury as an office, and the company isn’t quite as varied. It took some adjustment, but thankfully I could rely on endless patience and understanding from my supervisor and other collaborators, who may also have been struggling with constant proximity to a comfy bed and a panic-stocked kitchen.

Life, uh, finds a way, and I soon got into the swing of things. Our report was submitted shortly after lockdown, the key finding of which was the decomposition of the Pareto input distribution into ‘peak’ and ‘tail’ components, followed by the creation of two separate surrogates which are used in tandem to approximate the full model. The necessity of this followed our selection of training data inputs: quadrature nodes, which tended to cluster at the lower and upper extremes of our input range. We found significant improvement in the approximation of the simulator output across the support range, as well as in the approximation of output statistics. My attention then shifted to the ‘Curse of Dimensionality’, which sounds like a discarded Harry Potter title but actually refers to the unfortunate fact that as the number of input variables of a simulation increases, the computational cost of approximating it increases more dramatically still. We aimed to optimise the selection of training inputs in higher dimensions, as well as to visualise and quantify uncertainty in the simulation and surrogate outputs. This last point was what I focussed on for most of the last month, mainly on the idea of functional box plots and band depths. When you have a simulation that takes in input variables sampled from a given distribution function, it essentially maps them onto an output distribution function. This is easy enough to visualise for one output value, but for a time-series simulation it becomes more challenging. Functional box plots and band depth (e.g. Ross T. Whitaker et al. 2013) allow you to visualise and quantify the range of expected behaviour of a given curve within an ensemble. I had made good progress in applying this to more complicated simulations for the client, as well as to epidemiological models for Covid-19, when the internship drew to a close.
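For the curious, band depth can be sketched in a few lines. What follows is a rough illustration under my own assumptions, not the project code: it uses the modified band depth with bands formed from pairs of curves, and a made-up ensemble of five sine curves shifted vertically. The function name `modified_band_depth` and the toy ensemble are hypothetical.

```python
from itertools import combinations

import numpy as np


def modified_band_depth(curves):
    """Depth of each curve in an ensemble of shape (n_curves, n_times).

    For every pair of curves, the band is the region between their
    pointwise minimum and maximum. A curve's depth is the average
    fraction of time points at which it lies inside such a band.
    """
    curves = np.asarray(curves, dtype=float)
    n_curves = curves.shape[0]
    depth = np.zeros(n_curves)
    for i, j in combinations(range(n_curves), 2):
        lower = np.minimum(curves[i], curves[j])
        upper = np.maximum(curves[i], curves[j])
        inside = (curves >= lower) & (curves <= upper)
        depth += inside.mean(axis=1)
    n_pairs = n_curves * (n_curves - 1) / 2
    return depth / n_pairs


# Toy ensemble (assumption): one sine curve shifted by five offsets.
t = np.linspace(0.0, 1.0, 50)
offsets = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
ensemble = np.sin(2.0 * np.pi * t) + offsets[:, None]

depth = modified_band_depth(ensemble)
median_index = int(np.argmax(depth))  # the deepest curve acts as the median

# The envelope of the deepest half of the curves plays the role of the "box".
deepest = np.argsort(depth)[::-1][: int(np.ceil(len(depth) / 2))]
box_lower = ensemble[deepest].min(axis=0)
box_upper = ensemble[deepest].max(axis=0)
```

The deepest curve stands in for the median, and the envelope of the deepest half of the ensemble is the functional analogue of the box in an ordinary box plot. The pairwise loop scales quadratically with the number of curves, so a large ensemble would call for something smarter.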

Following my closing presentation over WebEx, I began to get my affairs in order and say goodbye to my colleagues. However, I haven’t quite moved on to the great blue yonder: I continue to work with my IBM supervisor on our research, and hopefully will do so for a while yet. The sense of an ending that arises from changing the laptop you work on is not the same as saying goodbye to a roomful of people. We interns have been invited back to Daresbury for a proper goodbye and a trip to the pub, which I am greatly looking forward to. I fully enjoyed my time at IBM, and only wish that I could have experienced the full person-to-person programme envisaged. Despite the challenges, it has changed the way I view approaches to coding, the potential for cross-subject collaboration, and the possibilities in working outside academia.