Data Formats for ROC Analysis
John Eng, M.D.
Russell H. Morgan Department of Radiology and Radiological Science
Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
 
Please send any bug reports, questions, comments, or suggestions by email. All email will be answered.



General Comments

For all data formats below, the activity being evaluated involves an individual classifying cases as either "positive" or "negative". In addition, the individual specifies his/her level of confidence in the classification of each case according to an ordinal rating scale (1, 2, 3, ...). The meaning of the ordinal rating scale depends on the particular data format, as described below. For each format (except Format 5), the number of categories in the rating scale must be entered after the data format on the main calculation Web page. Format 5 allows the rating scale to be based on continuously distributed values instead of integers.

The data for each format are organized as multiple lines of numbers (Format 2 also includes text fields). The values on each line may be separated by any number of spaces or tab characters. This plain-text format is often called "ASCII" format and can be exported and imported by many spreadsheet and database programs.
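
As an illustration of the separator rule above, the following Python sketch splits pasted data into rows of fields. It is not JROCFIT's own parser: the function name parse_lines is hypothetical, and skipping blank lines is an assumption.

```python
def parse_lines(text):
    """Split pasted data into rows of whitespace-separated fields."""
    rows = []
    for line in text.splitlines():
        # str.split() with no argument splits on any run of spaces or tabs
        fields = line.split()
        if fields:  # assumption: blank lines are ignored
            rows.append(fields)
    return rows


# Three cases separated by a single space, a tab, and several spaces
example = "1 5\n0\t2\n1    6\n"
rows = parse_lines(example)
```

Fields are kept as strings here so that the same helper works for formats that mix numbers and text.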

To see examples of each data format, click "Paste Example Data" on the main JROCFIT page.


Format 1:  Ordinal Rating Scale

In this data format, each line represents one case. On each line, there are two numbers. The first number is either "0" or "1", depending on whether the case is truly positive ("1") or truly negative ("0"). The second number is an integer (1, 2, 3, ...) representing the confidence rating for that case, with higher numbers indicating greater positivity. For example, in a 6-point rating scale, the categories would have the following meaning:

     1 - Definitely negative
     2 - Probably negative
     3 - Possibly negative
     4 - Possibly positive
     5 - Probably positive
     6 - Definitely positive
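
A short Python sketch of reading and checking Format 1 data may make the layout concrete. The sample cases are made up for illustration, and parse_format1 is a hypothetical helper, not part of JROCFIT.

```python
def parse_format1(text, n_categories):
    """Return a list of (truth, rating) pairs from Format 1 text."""
    cases = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # assumption: blank lines are ignored
        if len(fields) != 2:
            raise ValueError(f"line {line_no}: expected 2 numbers, got {len(fields)}")
        truth, rating = int(fields[0]), int(fields[1])
        if truth not in (0, 1):
            raise ValueError(f"line {line_no}: truth must be 0 or 1")
        if not 1 <= rating <= n_categories:
            raise ValueError(f"line {line_no}: rating must be 1..{n_categories}")
        cases.append((truth, rating))
    return cases


# Made-up data: six cases rated on a 6-point scale
sample = """\
1 6
1 4
0 2
0 1
1 5
0 3
"""
cases = parse_format1(sample, n_categories=6)
```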


Format 2:  Binary Response with Confidence Rating

This data format allows the calculation of sensitivity, specificity, and overall accuracy in addition to the ROC curve. As in the previous data format, each line represents data from one case. Each line has five fields:

     Field 1 - Truth: "1" if the case is truly positive, "0" if it is truly negative
     Field 2 - True location: a text string giving the true location of the abnormality, or "none" if the case is truly negative
     Field 3 - Response: "1" if the individual thought the case is positive, "0" if negative
     Field 4 - Response location: a text string giving where the individual thought there is an abnormality, or "none" if the individual thought the case is negative
     Field 5 - Confidence: the level of confidence (1, 2, 3, ...) the individual associated with his/her response

If the task does not involve location, then specify "none" (or any other string, as long as it is identical everywhere) for all values in the second and fourth fields. Since positivity and negativity are specified separately, the rating scale indicates degree of confidence, whether positive or negative. For example, in a 3-point rating scale, the categories would have the following meaning:

     1 - Low confidence (that case is pos or neg)
     2 - Moderate confidence (that case is pos or neg)
     3 - High confidence (that case is pos or neg)
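
The following Python sketch shows how sensitivity, specificity, and overall accuracy could be computed from Format 2 data. It rests on two assumptions that are not stated above: location strings contain no spaces, and a positive response to a truly positive case counts as correct only when the two location strings match. The data and the function name summarize_format2 are hypothetical, and JROCFIT's own scoring may differ.

```python
def summarize_format2(text):
    """Compute sensitivity, specificity, and accuracy from Format 2 text."""
    tp = fn = tn = fp = 0
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # assumption: blank lines are ignored
        truth, true_loc, response, resp_loc, _confidence = fields
        truth, response = int(truth), int(response)
        if truth == 1:
            # Assumption: a hit needs a positive response AND a matching location
            if response == 1 and resp_loc == true_loc:
                tp += 1
            else:
                fn += 1
        elif response == 0:
            tn += 1
        else:
            fp += 1
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Made-up data: three truly positive and three truly negative cases
sample = """\
1 left 1 left 3
1 right 1 left 2
1 left 0 none 1
0 none 0 none 3
0 none 1 right 1
0 none 0 none 2
"""
summary = summarize_format2(sample)
```

The sketch assumes the data contain at least one positive and one negative case; otherwise the divisions fail.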


Format 3:  Frequency Table

This data format is very different from Formats 1 and 2. There are always exactly two lines of numbers, regardless of how many cases there are. Each line has one number for each category in the rating scale, so the data form 2 rows and N columns, where N is the number of categories in the rating scale. The first row represents negative cases, and the second row represents positive cases. The first column represents the first rating category, the second column represents the second rating category, and so on. The number in each "cell" is the number of cases of that type (negative or positive) that were assigned that rating. For example, if the individual responded with a confidence rating of "4" in 10 of the positive cases, then the 4th number in the 2nd row would be "10". The meaning of the rating categories is the same as in Format 1 above. Format 3 is the same as that used by the original ROCFIT program, but it is less convenient to use than the others because data are not usually collected in this form.
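
Because data are usually collected one case at a time, it can help to see how per-case ratings (as in Format 1) collapse into this 2-row table. The Python sketch below uses made-up cases; the function names are hypothetical.

```python
def frequency_table(cases, n_categories):
    """Count how often each rating was used for negative and positive cases."""
    negatives = [0] * n_categories
    positives = [0] * n_categories
    for truth, rating in cases:
        row = positives if truth == 1 else negatives
        row[rating - 1] += 1  # rating k goes in column k (index k - 1)
    return negatives, positives


def format3_text(negatives, positives):
    """Render the table as two lines: negative cases first, then positive."""
    return " ".join(map(str, negatives)) + "\n" + " ".join(map(str, positives))


# Made-up (truth, rating) pairs on a 6-point scale
cases = [(1, 6), (1, 4), (0, 2), (0, 1), (1, 5), (0, 3), (1, 4)]
neg, pos = frequency_table(cases, n_categories=6)
table = format3_text(neg, pos)
```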


Format 4:  Cumulative Hits vs. False Alarms

This data format is commonly found in psychology research. The first row contains two values that specify the number of positive and negative cases, respectively. This row is followed by N additional rows, where N is the number of categories in the confidence rating scale. Each of these N additional rows contains two values: the cumulative hit rate and the cumulative false-alarm rate for one rating category. These additional rows are in order of increasing confidence. More precisely, the kth additional row gives the proportion of all positive cases (hit rate) and of all negative cases (false-alarm rate) that were assigned a confidence rating at least as high as the kth lowest confidence rating. By definition, both values in the first additional row are equal to 1.0.
    Note concerning rounding errors.  Rounding errors in the hit and false-alarm rates may cause the total number of cases stated in JROCFIT's program output to differ slightly from what was specified in the input data. To prevent this, specify the hit and false-alarm rates with as much numerical precision as is available.
    If you do not know the number of cases.  If you do not know the total number of positive and negative cases, you can arbitrarily specify a large number for each (e.g., 1000). JROCFIT will still be able to fit a ROC curve, but the standard deviations and confidence limits will be meaningless.
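
The relationship between a frequency table (as in Format 3) and the cumulative rates of Format 4 can be sketched in Python as follows. The counts are made up and the function names are hypothetical. The reverse step, counts_from_rates, is only an assumed reconstruction that shows where rounding can enter; it is not taken from JROCFIT.

```python
def format4_rows(negatives, positives):
    """Build Format 4 rows from per-category counts (lowest rating first)."""
    n_pos, n_neg = sum(positives), sum(negatives)
    rows = [(n_pos, n_neg)]  # first row: positive count, then negative count
    for k in range(len(positives)):
        # Proportion of cases rated at least as high as category k + 1
        hit_rate = sum(positives[k:]) / n_pos
        false_alarm_rate = sum(negatives[k:]) / n_neg
        rows.append((hit_rate, false_alarm_rate))
    return rows


def counts_from_rates(rates, n_cases):
    """Assumed inverse: recover per-category counts from cumulative rates."""
    counts = []
    for k, rate in enumerate(rates):
        next_rate = rates[k + 1] if k + 1 < len(rates) else 0.0
        # Rounding to whole cases is where imprecise rates change the totals
        counts.append(round((rate - next_rate) * n_cases))
    return counts


# Made-up counts on a 4-point scale
negatives = [10, 5, 3, 2]
positives = [1, 4, 5, 10]
rows = format4_rows(negatives, positives)
hit_rates = [hit for hit, _ in rows[1:]]
recovered = counts_from_rates(hit_rates, n_cases=rows[0][0])
```

With full-precision rates the recovered counts match the original ones. If the rates were first rounded to one or two decimal places, the recovered counts (and therefore their total) could differ.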


Format 5:  Continuous Rating Scale

This data format is essentially the same as Format 1, except that the rating scale is a continuous distribution of values rather than a set of integer categories. Higher values indicate greater positivity. This format may also be used when the rating scale has a large number of discrete values (as a rough guide, perhaps more than 10).
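
The choice between Format 1 and Format 5 can be expressed as a small Python check. This is only a sketch of the rule of thumb above: the function name suggest_format is hypothetical, and the default threshold of 10 reflects the tentative figure in the text, not a fixed limit.

```python
def suggest_format(ratings, max_categories=10):
    """Return 1 for a short integer scale, otherwise 5."""
    values = set(ratings)
    all_integers = all(float(v).is_integer() for v in values)
    # Few distinct integer values: treat as ordinal categories (Format 1)
    if all_integers and len(values) <= max_categories:
        return 1
    # Non-integer values or many distinct values: treat as continuous (Format 5)
    return 5


ordinal_choice = suggest_format([1, 2, 3, 3, 6])
continuous_choice = suggest_format([0.12, 3.4, 2.2])
many_levels_choice = suggest_format(range(1, 51))
```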
 
 (Content last updated: 23 March 2017)