
The NIST StRD Benchmarks

This article appears by courtesy of
Michael Kozluk

Extracted from:
"SOFTWARE TESTING EXCEL SPREADSHEET PROGRAM" , technical note AFM/TN-CA-01.0 (Draft) Michael J. Kozluk, Feb 2002, Kinectrics Inc.

The Statistical Engineering and Mathematical and Computational Sciences Divisions of the Information Technology Laboratory (National Institute of Standards and Technology) have released a number of benchmark datasets for assessing the numerical accuracy of statistical software. The Statistical Reference Datasets (StRD) were designed explicitly to assist researchers in benchmarking statistical software packages. A link to the StRD is maintained on the NIST website for applied mathematics, statistics, and computational science{1}.

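The Statistical Reference Datasets include both generated and "real-world" data. The StRD{2} are divided into four suites of numerical benchmarks for statistical software: univariate summary statistics, analysis of variance, linear regression, and nonlinear least-squares regression.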

For each suite of tests there are several problems in each of three difficulty levels: low, average, and high. These benchmarks were specifically designed so that reliable algorithms implemented in double precision would produce acceptable results for all four suites.

In the case of linear procedures the certified values were calculated using multiple precision calculations (accurate to 500 digits). Data were read in exactly as multiple precision numbers and all calculations were made with this very high precision. For the nonlinear least-squares regression benchmarks, two different algorithms with different implementations were used with 128-bit precision to calculate the certified values. Both of these algorithms can, in double precision (64-bit), return 10 accurate digits for each of the nonlinear benchmarks. In addition, multiple profiles of the least-squares surface were used to ensure that a global minimum had been attained.

The certified values for the benchmarks are provided to 15 digits for linear procedures and 11 digits for nonlinear procedures.
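
The multiple-precision approach described above can be imitated on a small scale with Python's standard decimal module. The sketch below is only an illustration: the three data values are hypothetical, the function name certified_mean_and_sd is invented here, and it is not the procedure or software that NIST actually used.

```python
from decimal import Decimal, getcontext

# Work with 500 significant digits, mirroring the precision quoted above.
getcontext().prec = 500

def certified_mean_and_sd(values):
    """Return (mean, sample standard deviation) as Decimal numbers.

    `values` must be strings so that every datum is read in exactly
    and never passes through binary floating point.
    """
    data = [Decimal(v) for v in values]
    n = Decimal(len(data))
    mean = sum(data) / n
    sum_of_squares = sum((x - mean) ** 2 for x in data)
    sd = (sum_of_squares / (n - 1)).sqrt()
    return mean, sd

# Hypothetical dataset: a large offset with a very small spread.
values = ["10000000000000.2", "10000000000000.1", "10000000000000.3"]
mean, sd = certified_mean_and_sd(values)

# Report the results to 15 significant digits after the leading one.
print(f"{mean:.15E}")
print(f"{sd:.15E}")
```

Passing the data as strings is the important design choice: converting them to double precision first would introduce representation error before the high-precision arithmetic even starts.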

Background

This section gives a brief overview of topics related to software testing. Specifically discussed are the four types of errors associated with numerical computations, the metric to be used for measuring accuracy, and what constitutes an acceptable level of accuracy for engineering assessments.

Types of Errors

As described in Reference [2], there are four types of errors in numerical computations.

  1. A computer represents numbers in binary form using finite precision. For example, by the IEEE-754 standard for computer arithmetic (which is implemented in the hardware of nearly every computer) single precision has about six or seven digits of accuracy, while double precision has about 15 or 16 digits of accuracy. Often, a given floating-point number cannot be represented exactly in binary form. For example, the binary representation of the decimal number 0.1 is 0.000110011001100..., in which the block 0011 repeats indefinitely.
  2. Rounding error is a function of hardware, and is primarily due to finite precision. One type of rounding error is cancellation error, which occurs when two nearly equal numbers are subtracted from each other, leaving only the rightmost digits. Successive rounding errors in a series of calculations do not cancel but, rather, accumulate, and the bound on the total error is proportional to the number of calculations. Sometimes this total error only affects the rightmost digits of the final answer. Sometimes the total error can be so large as to completely swamp the answer, resulting in no accurate digits.
  3. Truncation error is a function of software and may be considered an approximation error. Many mathematical functions can be represented as an infinite series expansion. Assuming infinite precision, the difference between the true value and that achieved by summing a finite number of terms is truncation error.
  4. Use of an "unstable" algorithm is a function of software. In general a number of different algorithms can be used to compute a given mathematical quantity. Each has its own strengths and weaknesses, and the algorithm is often chosen based on considerations of speed or accuracy.
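
The four error types listed above can each be reproduced in a few lines of double-precision arithmetic. The following is a minimal sketch, not taken from Reference [2]: the helper names and the four-point dataset are hypothetical, chosen only to make each effect visible.

```python
import math
from decimal import Decimal

# 1. Representation: 0.1 has no finite binary expansion, so the stored
#    double is only the nearest representable value.
stored_tenth = Decimal(0.1)            # exact value of the double nearest 0.1
representation_error = abs(stored_tenth - Decimal("0.1"))

# 2. Cancellation: subtracting nearly equal numbers leaves only the
#    rightmost (least reliable) digits.
x = 1e-15
cancelled = (1.0 + x) - 1.0            # mathematically equal to x
cancellation_rel_error = abs(cancelled - x) / x

# 3. Truncation: summing a finite number of terms of an infinite series.
def exp_series(t, n_terms):
    """Partial sum of the Taylor series for exp(t)."""
    return sum(t ** k / math.factorial(k) for k in range(n_terms))

truncation_error = abs(math.exp(1.0) - exp_series(1.0, 5))

# 4. Unstable algorithm: the one-pass "calculator" variance formula versus
#    the two-pass formula, on data with a large mean and a small spread.
def variance_one_pass(data):
    n = len(data)
    return (sum(v * v for v in data) - sum(data) ** 2 / n) / (n - 1)

def variance_two_pass(data):
    n = len(data)
    mean = sum(data) / n
    return sum((v - mean) ** 2 for v in data) / (n - 1)

data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]   # true sample variance is 30

print("representation error:", representation_error)
print("cancellation relative error:", cancellation_rel_error)
print("truncation error (5 terms):", truncation_error)
print("one-pass variance:", variance_one_pass(data))
print("two-pass variance:", variance_two_pass(data))
```

Both variance functions evaluate the same mathematical quantity. Only the one-pass form subtracts two numbers of order 10^18 that differ by 90, which is why it loses every significant digit on this data while the two-pass form does not.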

Performance Index

Reference [2] uses the number of correct significant digits in the estimated value as a performance index. This is an absolute measure of the accuracy of an estimate and is easier to apply than the relative performance index proposed in Reference [5], which measures the number of figures of accuracy lost by the test software over and above what software based on an optimally stable algorithm would produce. The simpler log relative error (LRE) is used as the basis for the performance index in this technical note. Figure 1 is a flowchart for determining the LRE. The LRE is the negative of the base-10 logarithm of the relative error:

$$ \mathrm{LRE} = -\log_{10}\left( \frac{|q - c|}{|c|} \right) $$

where:

  LRE is the log relative error,
  q is the estimated value,
  c is the correct (certified) value.

The LRE is a measure of the number of correct significant digits only when the estimated value is "close" to the exact value. Therefore, each estimated quantity must be compared to its certified value to make sure that they differ by a factor of less than two; otherwise the LRE for the estimated quantity is zero. In the event that the estimated value equals the correct value, the LRE is undefined, in which case it should be set equal to the number of digits in the certified value. In those cases where the correct value is zero, the relative error cannot be formed, and the base-10 logarithm of the absolute error (LAE) should be used instead:

$$ \mathrm{LAE} = -\log_{10}\left( |q - c| \right) $$

See the algorithm flowchart (Figure 1) for details.
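
The rules in the preceding paragraph can be collected into a single function. The sketch below follows the prose only, since the flowchart itself is not reproduced here. The function name, the reading of "a factor of less than two" as a ratio strictly between 0.5 and 2, and the final clamping of the result to the range from zero to the number of certified digits are all assumptions made for illustration.

```python
import math

def log_relative_error(q, c, certified_digits=15):
    """Approximate number of correct significant digits in estimate q.

    q is the estimated value and c the certified value.
    `certified_digits` is the number of digits to which c is certified
    (15 for linear and 11 for nonlinear procedures, per the text).
    """
    if q == c:
        # LRE is undefined for an exact match: report all certified digits.
        return float(certified_digits)
    if c == 0.0:
        # Certified value is zero: use the log absolute error instead.
        result = -math.log10(abs(q - c))
    else:
        # Assumed reading of "differ by a factor of less than two".
        ratio = q / c
        if not (0.5 < ratio < 2.0):
            return 0.0
        result = -math.log10(abs(q - c) / abs(c))
    # Assumption: clamp to [0, certified_digits] so that the score never
    # claims more digits than the certified value provides.
    return min(max(result, 0.0), float(certified_digits))

# Hypothetical estimates compared against a certified value of 1.0.
print(log_relative_error(1.001, 1.0))   # roughly three correct digits
print(log_relative_error(3.0, 1.0))     # off by more than a factor of two
print(log_relative_error(1.0, 1.0))     # exact agreement
```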

Conditioning of data

The conditioning of the data consists of subtracting the sample mean from all of the values in the sample. In the case of the univariate summary statistics and the analysis of variance procedures, the results for the raw and conditioned benchmarks are mathematically identical, but use of the conditioned data results in a problem that is numerically better conditioned. In the case of the linear regression procedures, the results of the raw and conditioned benchmarks are not mathematically identical. In this case it is necessary to evaluate the regression parameters (in terms of the sample mean and the conditioned regression parameters) for the conditioned data benchmarks so that they can be compared with the raw data benchmarks and the certified values for the dataset.
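
For a straight-line fit, the back-transformation described above is short enough to write out. The sketch below assumes that both the predictor and the response are conditioned, and that the model is y = b0 + b1*x. The data and the function names fit_line and condition are hypothetical, and the technical note's own spreadsheet procedure may differ.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = b0 + b1*x; returns (b0, b1)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_mean - b1 * x_mean, b1

def condition(values):
    """Subtract the sample mean; returns (conditioned values, mean)."""
    mean = sum(values) / len(values)
    return [v - mean for v in values], mean

# Hypothetical data lying exactly on the line y = 3 + 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [5.0, 7.0, 9.0, 11.0, 13.0]

xc, x_mean = condition(x)
yc, y_mean = condition(y)
a0, a1 = fit_line(xc, yc)      # parameters of the conditioned problem

# Map back to the raw-data parameters before comparing them with the
# certified values: the slope is unchanged, the intercept is shifted.
b1 = a1
b0 = y_mean + a0 - a1 * x_mean

print("conditioned fit:", a0, a1)
print("raw-data parameters:", b0, b1)
```

The conditioned intercept a0 is zero here, which is not the certified intercept of the raw problem. This is the sense in which the two sets of regression results are not mathematically identical until the sample means are put back.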

Acceptable Accuracy

As described in Reference [2], the LRE from low-difficulty linear procedures should be at least nine. This is because a decent implementation of a decent algorithm for a linear procedure should, with well-behaved data, return ten accurate digits in double precision. This is not to suggest that the user needs ten digits of accuracy. Rather, the idea is that a program that cannot deliver ten digits of accuracy for an easy benchmark is more likely to deliver inaccurate results in practice.

Producing correct results for all of the benchmark problems does not imply that the software will do the same for actual datasets. It does, however, provide some assurance, because the benchmark problems are known to yield incorrect results in some software programs.

In practice, the author considers four digits of accuracy sufficient for the majority of engineering assessments. The important issue is that analysts must know of, understand, and appreciate the implications of any and all limitations in the software packages that they use.

References
..........

[2] McCullough, B.D.

Assessing the Reliability of Statistical Software: Part I
The American Statistician, November 1998, Volume 52, pages 358-366
This reference is available from the NIST-StRD website{2}.

The paper proposes a set of intermediate-level tests (benchmarks) focussing on three areas: estimation, both linear and nonlinear; random number generation; and statistical distributions (e.g. for calculating probabilities). The complete methodology is described in detail. Also, convenient methods for summarizing the results of the testing are proposed. The performance measure used is an absolute measure of accuracy, calculated using the base-10 logarithm of the relative error. The estimation capabilities are assessed using the recently released NIST Statistical Reference Datasets. The output of the random number generator is assessed using statistical tests for randomness, while the accuracy of statistical distributions is assessed by comparing the results to those of specialized statistical software programs.

.......

[5] Cook, H.R., Cox, M.G., Dainton, M.P., Harris, P.M.

A Methodology for Testing Spreadsheets and Other Packages used in Metrology
Report to the National Measurement System Policy Unit, Department of Trade and Industry, from the UK Software Support for Metrology Programme, NPL Report CISE 25/99, September 1999
This report is a part of the deliverable for Project 2.1 of the first Software Support for Metrology (SSfM) program and is available from the NPL-SSfM website{3}.

This report documents a general methodology for testing the numerical accuracy of scientific software. The basis of the approach is the design and use of reference datasets and corresponding reference results to be used in black-box testing. Reference datasets and results are to be generated in a manner consistent with the functional specification of the problem addressed by the test software. Datasets corresponding to problems with various "degrees-of-difficulty", or with application-specific properties, may be produced. The results returned by the software for the reference data are compared objectively with the reference results. The objective comparison is performed using quality metrics that account for the key aspects of the problem. In addition, complementary tests of numerical accuracy that do not require the use of reference datasets are used. These include consistency and continuity checks, spot checks against tabulated values, and checks of solution characteristics. The proposed performance measure indicates the number of figures of accuracy lost by the test software over and above what software based on an optimally stable algorithm would produce.

.......

{1} http://math.nist.gov/

The National Institute of Standards and Technology (NIST) website for applied mathematics, statistics, and computational science. This website maintains a link to the Statistical Reference Datasets (StRD).

{2} http://www.intl.nist.gov/div898/strd/

The National Institute of Standards and Technology (NIST) website for Statistical Reference Datasets (StRD).

{3} http://www.npl.co.uk/ssfm/

The National Physical Laboratory (NPL) website for Software Support for Metrology (SSfM). Part of the mandate for the SSfM program is the testing of spreadsheets and other software programs. A series of reports is available documenting the methodology used and the results of testing the Microsoft Excel 95 spreadsheet program.

 

 
