Since I left the organised education industry, my PhD thesis has not been readily available on the web. It’s about time I gave it a proper home.
So: my thesis can now be downloaded from here. (This is a PostScript file converted to PDF, and might look a bit ‘scratchy’ in places. IIRC back then we only had 10 dpi fonts.)
I just found another version, if you prefer, which is a scanned image from the middle-class finishing school down the road. It seems to have been added in late 2012.
I also realised that it was almost 15 years ago to the day that I passed my PhD viva (with minor changes). That seems a long time ago. What do I think about the whole thing from the distance of almost a sixth of a century? Basically, I’m still quite pleased with it. Quite a lot of what I said still seems relevant nowadays, particularly questioning the goals of wider NLP and trying to understand the value we’re creating (or not). I’ll go into more detail in another post.
But for now, here’s the abstract.
This research addresses the question, "how do we evaluate systems like LOLITA?" LOLITA is the Natural Language Processing (NLP) system under development at the University of Durham. It is intended as a platform for building NL applications. We are therefore interested in questions of evaluation for such general NLP systems. The thesis has two parts.
The first, and main, part concerns the participation of LOLITA in the Sixth Message Understanding Conference (MUC-6). The MUC-relevant portion of LOLITA is described in detail. The adaptation of LOLITA for MUC-6 is discussed, including work undertaken by the author. Performance on a specimen article is analysed qualitatively, and in detail, with anonymous comparisons to competitors' output. We also examine current LOLITA performance. A template comparison tool was implemented to aid these analyses. The overall scores are then considered. A methodology for analysis is discussed, and a comparison made with current scores. The comparison tool is used to analyse how systems performed relative to each other. One method, Correctness Analysis, was particularly interesting. It provides a characterisation of task difficulty, and indicates how systems approached a task. Finally, MUC-6 is analysed. In particular, we consider the methodology and ways of interpreting the results. Several criticisms of MUC-6 are made, along with suggestions for future MUC-style events.
The second part considers evaluation from the point of view of general systems. A literature review shows a lack of serious work on this aspect of evaluation. A first-principles discussion of evaluation, starting from a view of NL systems as a particular kind of software, raises several interesting points for single-task evaluation. No evaluations could be suggested for general systems; their value was seen as primarily economic. That is, we are unable to analyse their linguistic capability directly.