Data Formats for ROC Curve Fitting
John Eng, M.D.
The Russell H. Morgan Department of Radiology and Radiological Science
Johns Hopkins University, Baltimore, Maryland, USA
Please send any bugs, questions, comments, or suggestions to
electronic mail will be answered.
General Comments
For all three formats below, the activity being evaluated involves an individual classifying cases as either
"positive" or "negative". In addition, the individual specifies his/her level of confidence in
the classification of each case on an ordinal rating scale (1, 2, 3, ...). The meaning
of the rating scale depends on the particular data format, as described below. For
each format, the number of categories in the rating scale must be entered after the data format
on the main calculation Web page.
The data for each format is organized as multiple lines of numbers. The numbers on each line may
be separated by any number of spaces or tab characters. This character format is often called
"ASCII" format and can be exported and imported by many spreadsheet and database programs.
Format 1
In this data format, each line represents one case. On each line, there are two numbers. The first number
is either "0" or "1", depending on whether the case is truly positive ("1") or truly negative ("0"). The
second number is an integer (1, 2, 3, ...) representing the confidence rating for that case. For
example, on a 6-point rating scale, the categories would have the following meaning:
1 - Definitely negative
2 - Probably negative
3 - Possibly negative
4 - Possibly positive
5 - Probably positive
6 - Definitely positive
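As an illustration, the following is a minimal Python sketch of how Format 1 data could be read and checked before it is pasted into the calculation page. The function name parse_format1 and the sample data are hypothetical and are not part of JROCFIT. Skipping blank lines is an assumption, since the page does not say how they are handled.

```python
def parse_format1(text, n_categories):
    """Parse Format 1 text into a list of (truth, rating) tuples.

    Hypothetical helper, not part of JROCFIT.
    """
    cases = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()  # splits on any run of spaces or tabs
        if not fields:
            continue  # assumption: blank lines are ignored
        if len(fields) != 2:
            raise ValueError(f"line {line_no}: expected 2 numbers, got {len(fields)}")
        truth, rating = int(fields[0]), int(fields[1])
        if truth not in (0, 1):
            raise ValueError(f"line {line_no}: first number must be 0 or 1")
        if not 1 <= rating <= n_categories:
            raise ValueError(f"line {line_no}: rating must be between 1 and {n_categories}")
        cases.append((truth, rating))
    return cases


# Made-up sample data on a 6-point scale; separators mix spaces and tabs.
sample = "1 6\n0\t2\n1   4\n0 1\n"
cases = parse_format1(sample, n_categories=6)
print(cases)
```

Validating each line first makes it easier to find a mistyped truth value or an out-of-range rating.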
Format 2
This data format allows the calculation of sensitivity, specificity, and overall accuracy in
addition to the ROC curve. As in the previous data format, each line represents data from one
case. Each line has five fields. The first field is either "0" or "1", depending on whether the
case is truly positive ("1") or truly negative ("0"). The second field is a text string indicating
the true location of the abnormality, if one is present. If the case is truly negative, then
specify "none" as the location. The third field is either "0" or "1", depending on whether the
individual judged the case to be positive ("1") or negative ("0"). The fourth field is a text
string indicating where the individual thought the abnormality was, if he/she judged the case to
be positive. If the individual judged the case to be negative, then "none" should be specified as
the location. The fifth field is the level of confidence (1, 2, 3, ...) the individual associated
with his/her response. Since positivity and negativity are specified separately, the rating scale
indicates only the degree of confidence in the response, whether positive or negative. For
example, on a 3-point rating scale, the categories would have the following meaning:
1 - Low confidence (that case is positive or negative)
2 - Moderate confidence (that case is positive or negative)
3 - High confidence (that case is positive or negative)
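To make the five fields concrete, the following Python sketch reads Format 2 lines and computes sensitivity, specificity, and overall accuracy from the first and third fields. It is only an illustration: the names Case, parse_format2, and summarize are hypothetical, the sample data are made up, and the sketch assumes that each location is a single word without spaces and that the location fields do not enter into the three statistics. The page does not state how JROCFIT itself uses the locations.

```python
from collections import namedtuple

# Hypothetical record type for one Format 2 line.
Case = namedtuple("Case", "truth true_loc response resp_loc confidence")


def parse_format2(text):
    """Parse Format 2 text into a list of Case records."""
    cases = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()  # assumption: locations contain no spaces
        if not fields:
            continue  # assumption: blank lines are ignored
        if len(fields) != 5:
            raise ValueError(f"line {line_no}: expected 5 fields, got {len(fields)}")
        truth, true_loc, response, resp_loc, confidence = fields
        cases.append(Case(int(truth), true_loc, int(response), resp_loc, int(confidence)))
    return cases


def summarize(cases):
    """Compute case-level statistics, ignoring locations (an assumption)."""
    n_pos = sum(1 for c in cases if c.truth == 1)
    n_neg = sum(1 for c in cases if c.truth == 0)
    true_pos = sum(1 for c in cases if c.truth == 1 and c.response == 1)
    true_neg = sum(1 for c in cases if c.truth == 0 and c.response == 0)
    return {
        "sensitivity": true_pos / n_pos,
        "specificity": true_neg / n_neg,
        "accuracy": (true_pos + true_neg) / len(cases),
    }


# Made-up sample data on a 3-point confidence scale.
sample = (
    "1 lung 1 lung 3\n"
    "1 liver 0 none 1\n"
    "0 none 0 none 2\n"
    "0 none 1 lung 1\n"
    "0 none 0 none 3\n"
)
stats = summarize(parse_format2(sample))
print(stats)
```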
Format 3
This data format differs substantially from Formats 1 and 2. In this data format, there are always
exactly two lines of numbers, regardless of how many cases there are. Each line has one number for
each category in the rating scale. Therefore, there are 2 rows and N columns, where N is the number
of categories in the rating scale. The first row represents negative cases, and the second row
represents positive cases. The first column represents the first rating category, the second column
represents the second rating category, and so on. The number in each "cell" is the number of times
that rating category was used for the negative or positive cases. For example, if the individual
responded with a confidence rating of "4" in 10 of the positive cases, then the 4th number in the
2nd row would be "10". The meaning of the rating categories is the same as in Format 1 above. Data
Format 3 is the same as that used by the original ROCFIT program, but it is less convenient than
the others because data are not usually collected in this form.
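Because data are usually collected case by case, it may help to see how per-case ratings can be tallied into the two-row table. The Python sketch below is only an illustration: the function name to_format3 and the sample cases are hypothetical. It takes (truth, rating) pairs as described for Format 1 and writes the negative cases in the first row and the positive cases in the second.

```python
def to_format3(cases, n_categories):
    """Tally (truth, rating) pairs into the two-line Format 3 table."""
    negative_counts = [0] * n_categories  # first row: truly negative cases
    positive_counts = [0] * n_categories  # second row: truly positive cases
    for truth, rating in cases:
        row = positive_counts if truth == 1 else negative_counts
        row[rating - 1] += 1  # rating k is counted in column k
    return "\n".join(
        " ".join(str(count) for count in row)
        for row in (negative_counts, positive_counts)
    )


# Made-up sample: six cases rated on a 6-point scale.
cases = [(1, 6), (0, 2), (1, 4), (0, 1), (1, 4), (0, 2)]
table = to_format3(cases, n_categories=6)
print(table)
# prints:
# 1 2 0 0 0 0
# 0 0 0 2 0 1
```

In this sample, the 4th number in the 2nd row is "2" because two of the positive cases received a rating of "4".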
(Page last updated: 2/11/2001)