A REPORTING GUIDELINE FOR MEDICAL APPLICATIONS OF ARTIFICIAL NEURAL NETWORKS

H.C.E. McGowan, M. Stevenson, M. Frize
Department of Electrical Engineering, University of New Brunswick, PO Box 4400, Fredericton, NB E3B 5A3

ABSTRACT

Given the interdisciplinary nature of projects which employ Artificial Neural Networks (ANNs) to estimate medical outcomes, and the wide spectrum of journals in which the results are published, there exists a need to establish a guideline for reporting the details of these experiments. In this paper, one model will be proposed and discussed. Once a standard guideline is adopted, comparisons of the work of different researchers will be greatly facilitated. While such a guideline need not be followed to the letter in order to publish a "good" paper in this field, incorporating many of the suggested details into publications would ensure that the overall performance of each ANN is clearly documented and that the results are not difficult to reproduce.

INTRODUCTION

The introduction of Artificial Neural Network (ANN) technology into the commercial market, which followed a series of advances in research in the late 1980s, has resulted in ANN software packages being widely available to scientists and engineers of all disciplines. Specialists in many fields other than neural network research quickly recognized the vast potential of a "new" non-linear modeling technique and many now incorporate ANNs into their work.

The field of medicine has not been immune to this trend. Some doctors and nurses, whose work involves developing models of disease by performing traditional statistical analyses of vast and complex databases of medical information, have initiated research projects aimed at determining the usefulness of ANNs in estimating medical outcomes. In a flurry of research published in the early 1990s, ANNs were used to identify everything from heart attacks [Baxt, 1990; Baxt, 1991] to microcalcifications on mammographic x-rays [Wu, 1993], with varying degrees of success.

Since the results of the first experiments of this type were published, there has been little discussion of how one determines whether or not a particular ANN is useful in the medical domain and what (if any) techniques can be used to improve the performance level of such an ANN. To date, researchers with various levels of ANN expertise have been working on a wide variety of databases, conducting isolated experiments to determine the best values for a large number of ANN parameters, and rarely reporting anything other than a wealth of similar outcomes. Comparing the work in this field is often difficult and usually not particularly helpful in answering questions regarding the resiliency of a particular ANN architecture or algorithm when it is used on a medical database. As a result, progress in this field has been slow. However, if researchers were to report (as a minimum) a standard set of parameters each time the results of an experiment are documented, progress would be enhanced because this information could be used to pinpoint the types of ANNs and ANN techniques which yield the best results for a particular problem of interest.

MOTIVATION

In some cases, it is difficult to glean critical information from the details published in the literature about an ANN which has been used to estimate a particular medical outcome. Some of the more serious omissions include: critical details of the dataset used in the ANN experiments (such as a priori knowledge of the various outcomes in the dataset); details of ANN performance (such as the number of epochs completed and the stopping criteria which were used); and performance comparisons with the results obtained using traditional statistical benchmarks and models (such as an estimated Bayes-type minimum distance classifier and/or regression models). In very exceptional circumstances, no evidence was presented to indicate that the ANN model was ever applied to a set of test data.

In order to rectify this, it is proposed that the following guideline, which sets forth a standard set of parameters to be reported in all papers documenting the use of ANNs to estimate medical outcomes, be adopted by all researchers in this field. The proposed guideline consists of three distinct categories and is based largely on questions which arose during an extensive literature survey, although it also incorporates the ideas and suggestions of our own research group. To avoid singling out a particular author (or group of authors), references to papers which raised questions have not been included. The guideline is neither exhaustive, in that it does not include every possible ANN detail, nor a minimum requirement: a paper need not include every detail of the guideline to be considered "complete", since it is understood that some of the information, such as comparisons to regression models, may not even be available. It is hoped, nonetheless, that this basic guide will motivate researchers to carefully consider what details of their work to report.

THE PROPOSED REPORTING STANDARD

The three reporting categories suggested are: a) The Dataset, b) ANN Details, c) ANN Results and Statistical Comparisons. The parameters which it would be useful to include are described in the sections which follow.

a) The Dataset

To answer the questions:

How large is the dataset? How large are the training and test sets?
How many variables are there? What type of variables are they? What range of values does each variable encompass? Were the variable values scaled before they were presented to the ANN?
What classification rate can be achieved by classifying every patient as belonging to the class with the highest a priori probability (a constant predictor)?

The documentation should include:

the number of patients in the training and test sets
the number of variables, variable names, types and value ranges (e.g. the variable HR = heart rate takes on continuous values from 40 to 120), and information describing any variable scaling/normalization procedures
a priori class probabilities for each outcome variable (for the whole dataset, training and test sets)
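
The dataset items above can be computed with very little code. The following is a minimal sketch, not a prescribed procedure: the function names, the 0/1 labels and the min-max scaling scheme are assumptions introduced for illustration, and the labels are made-up values rather than patient data.

```python
from collections import Counter


def class_priors(labels):
    """Return the a priori probability of each class in a list of labels."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}


def constant_predictor_rate(train_labels, test_labels):
    """Classification rate on the test set when every patient is assigned
    to the class with the highest a priori probability in the training set."""
    majority = Counter(train_labels).most_common(1)[0][0]
    hits = sum(1 for y in test_labels if y == majority)
    return hits / len(test_labels)


def min_max_scale(values, lo, hi):
    """Linearly map the reported value range [lo, hi] onto [0, 1].
    This is one possible scaling scheme, chosen here as an assumption."""
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical outcome labels: 1 = outcome present, 0 = outcome absent.
train = [0, 0, 0, 1, 0, 1, 0, 0]
test = [0, 1, 0, 0]

print("training-set priors:", class_priors(train))
print("constant predictor rate on test set:", constant_predictor_rate(train, test))
# Heart rate (HR) values scaled using the 40 to 120 range from the example above.
print("scaled HR:", min_max_scale([40, 80, 120], lo=40, hi=120))
```

Reporting the priors separately for the whole dataset, the training set and the test set lets a reader check that the constant predictor rate quoted for the test set is consistent with the class balance.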

b) ANN Details

To answer the questions:

How was the ANN trained?
What type of ANN was used and how large was it?
What were the initial parameter values?

The documentation should include:

the training algorithm (e.g. standard backpropagation) and the stopping criteria
details of the architecture (e.g. a feedforward ANN with 40 inputs, a hidden layer containing 15 nodes, and one output) and the type of transfer functions employed
values of ANN parameters such as learning rate, momentum factors, maximum values for initial weights, etc.
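
One way to make sure none of these details is omitted is to keep them in a single record that is filled in for each experiment. The sketch below is only an illustration: the class name `ANNReport`, its field names, and every value other than the 40-15-1 architecture and the backpropagation algorithm mentioned above are hypothetical placeholders.

```python
from dataclasses import dataclass, asdict


@dataclass
class ANNReport:
    """Hypothetical record of the ANN details suggested for reporting."""
    training_algorithm: str
    stopping_criteria: str
    n_inputs: int
    hidden_layers: tuple  # number of nodes in each hidden layer
    n_outputs: int
    transfer_function: str
    learning_rate: float
    momentum: float
    max_initial_weight: float

    def n_weights(self):
        """Number of adjustable weights, including one bias per node,
        assuming a fully connected feedforward architecture."""
        sizes = (self.n_inputs, *self.hidden_layers, self.n_outputs)
        return sum((a + 1) * b for a, b in zip(sizes, sizes[1:]))


# Architecture taken from the example above; all other values are illustrative.
report = ANNReport(
    training_algorithm="standard backpropagation",
    stopping_criteria="illustrative: stop after a fixed number of epochs",
    n_inputs=40,
    hidden_layers=(15,),
    n_outputs=1,
    transfer_function="illustrative: logistic sigmoid",
    learning_rate=0.1,
    momentum=0.9,
    max_initial_weight=0.5,
)

for name, value in asdict(report).items():
    print(f"{name}: {value}")
print("adjustable weights:", report.n_weights())
```

Stating the number of adjustable weights next to the size of the training set gives readers a quick sense of how heavily parameterized the ANN is relative to the available data.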

c) ANN Results and Statistical Comparisons

To answer the questions:

What is the best predictive performance obtained using the ANN? How was it determined?
How well does the ANN generalize?
How well does the ANN predict: true positives? true negatives? false positives? false negatives? How does varying the classification threshold change these results?
How does the ANN performance compare to chance?
Does the ANN predict outcome better than a constant predictor? How does ANN performance compare to that of an estimated Bayes-type classifier? Does it come close to minimizing the probability of classification error?
Does the ANN perform as well as or better than traditional statistical models (such as regression)?
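
As a rough illustration of the Bayes-type benchmark raised in these questions, the sketch below implements a nearest-class-mean classifier with Euclidean distance. This is an assumption about which variant is meant: that rule coincides with the Bayes rule only under restrictive conditions (equal priors and Gaussian classes sharing an isotropic covariance), and the data and function names are invented for the example.

```python
import math


def class_means(X, y):
    """Estimate the mean feature vector of each class from training data."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def predict(means, row):
    """Assign a pattern to the class whose estimated mean is nearest."""
    return min(means, key=lambda c: math.dist(row, means[c]))


def classification_rate(means, X, y):
    """Fraction of patterns whose predicted class matches the true class."""
    return sum(predict(means, r) == t for r, t in zip(X, y)) / len(y)


# Hypothetical, already-scaled patterns with two variables each.
X_train = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
y_train = [0, 0, 1, 1]
X_test = [[0.0, 0.3], [1.0, 0.7], [0.6, 0.6]]
y_test = [0, 1, 0]

means = class_means(X_train, y_train)
print("test classification rate:", classification_rate(means, X_test, y_test))
```

The same test set should be used for this benchmark and for the ANN so that the two classification rates are directly comparable.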

The documentation should include:

the classification rate on the test set and training set, number of epochs completed and the author's definition of an epoch
curves showing classification rate and MSE during training (for both the training and test sets)
the sensitivity and specificity at the operating point (it would also be appropriate to include the false alarm rate, although this can be calculated from the sensitivity and specificity)
ROC curves for both the training and test sets for each output and the area under each curve
a comparison of ANN results to the a priori probabilities for each outcome variable
a comparison of ANN results to those obtained by estimating a Bayes-type minimum-distance classifier [Duda and Hart, 1973]
a comparison of ANN results to those obtained using any statistical models of the same dataset
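
The threshold-dependent quantities listed above can be derived from the ANN output values and the true outcomes. The following is a minimal sketch under stated assumptions: a single output in which a higher value means "outcome present", both classes present in the data, the false alarm rate taken to be 1 minus the specificity, and the area under the ROC curve estimated with the trapezoidal rule. The scores and labels are invented for the example.

```python
def confusion_counts(scores, labels, threshold):
    """Count true/false positives and negatives at a classification threshold."""
    tp = fp = tn = fn = 0
    for s, y in zip(scores, labels):
        predicted_positive = s >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def sensitivity_specificity(scores, labels, threshold):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp, fp, tn, fn = confusion_counts(scores, labels, threshold)
    return tp / (tp + fn), tn / (tn + fp)


def roc_points(scores, labels):
    """(false alarm rate, sensitivity) pairs, one per distinct threshold."""
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp, fp, tn, fn = confusion_counts(scores, labels, t)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points


def area_under_curve(points):
    """Trapezoidal estimate of the area under a piecewise-linear ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical ANN outputs on a test set and the corresponding true outcomes.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

sens, spec = sensitivity_specificity(scores, labels, threshold=0.5)
print("sensitivity:", sens, "specificity:", spec, "false alarm rate:", 1 - spec)
print("area under ROC curve:", area_under_curve(roc_points(scores, labels)))
```

Running the same functions on the training-set outputs gives the second ROC curve the guideline asks for, and the gap between the two areas is one indication of how well the ANN generalizes.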

CONCLUSION

Based on a consideration of the current literature, it is apparent that there is a need to establish a guideline for reporting the results of experiments in which ANNs have been used to estimate medical outcomes. It is hoped that the model suggested in this paper will act as a catalyst for meaningful discussion on how best to compare and evaluate the results of all such experiments, and eventually lead to a standard which will be accepted by all researchers in this field.

ACKNOWLEDGEMENTS

This work was completed with the assistance of an NSERC PGS-A Scholarship and MRC Grant CGAA-45088.

REFERENCES

[Baxt, 1990] Baxt, W.G. Use of an Artificial Neural Network for Data Analysis in Clinical Decision-Making: The Diagnosis of Acute Coronary Occlusion. Neural Computation, 2, 480-489: 1990.

[Baxt, 1991] Baxt, W.G. Use of an Artificial Neural Network for the Diagnosis of Myocardial Infarction. Annals of Internal Medicine, 115, 843-848: 1991.

[Duda and Hart, 1973] Duda, R.O. and Hart, P.E. Pattern Classification and Scene Analysis. New York: John Wiley and Sons, Ltd., 1973.

[Wu, 1993] Wu, Y. et al. Artificial Neural Networks in Mammography: Application to Decision Making in the Diagnosis of Breast Cancer. Radiology, 187, 81-87: 1993.