This article appears by courtesy of |
Extracted from: |
The Statistical Engineering Division and the Mathematical and Computational Sciences Division of the Information Technology Laboratory (National Institute of Standards and Technology) have released a number of benchmark datasets for assessing the numerical accuracy of statistical software. The Statistical Reference Datasets (StRD) were designed explicitly to assist researchers in benchmarking statistical software packages. A link to the StRD is maintained on NIST’s website for applied mathematics, statistics, and computational science{1}.
The Statistical Reference Datasets include both generated and "real-world" data. The StRD{2} are divided into four suites of numerical benchmarks for statistical software: univariate summary statistics, analysis of variance, linear regression, and nonlinear regression.
For each suite of tests there are several problems in each of three difficulty levels: low, average, and high. These benchmarks were specifically designed so that reliable algorithms implemented in double precision would produce acceptable results for all four suites.
In the case of linear procedures the certified values were calculated using multiple precision calculations (accurate to 500 digits). Data were read in exactly as multiple precision numbers and all calculations were made with this very high precision. For the nonlinear least-squares regression benchmarks, two different algorithms with different implementations were used with 128-bit precision to calculate the certified values. Both of these algorithms can, in double precision (64-bit), return 10 accurate digits for each of the nonlinear benchmarks. In addition, multiple profiles of the least-squares surface were used to ensure that a global minimum had been attained.
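The idea of reading the data exactly and carrying very high precision through every step can be illustrated with Python's standard `decimal` module. This is only a sketch of the principle, not the procedure NIST used: the function name, the choice of statistics (sample mean and standard deviation), and the default of 15 reported digits are assumptions made for illustration.

```python
from decimal import Decimal, localcontext


def high_precision_mean_and_sd(raw_values, working_digits=500, report_digits=15):
    """Compute the sample mean and standard deviation with many working digits.

    raw_values is a list of decimal strings, so each value is read in
    exactly and no binary rounding occurs on input.
    (Hypothetical helper, for illustration only.)
    """
    with localcontext() as ctx:
        ctx.prec = working_digits  # every operation keeps this many digits
        data = [Decimal(s) for s in raw_values]  # exact conversion from text
        n = Decimal(len(data))
        mean = sum(data, Decimal(0)) / n
        sum_sq_dev = sum(((v - mean) ** 2 for v in data), Decimal(0))
        sd = (sum_sq_dev / (n - 1)).sqrt()

    with localcontext() as ctx:
        ctx.prec = report_digits  # round only once, at the very end
        return +mean, +sd  # unary plus rounds to the context precision


if __name__ == "__main__":
    mean, sd = high_precision_mean_and_sd(["1", "2", "3", "4", "5"])
    print("mean:", mean)
    print("sd:  ", sd)
```

Rounding only in the final step means the reported digits are limited by the reporting precision rather than by error accumulated during the calculation.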
The certified values for the benchmarks are provided to 15 digits for linear procedures and 11 digits for nonlinear procedures.
Background
This section gives a brief overview of topics related to software testing. Specifically discussed are the four types of errors associated with numerical computations, the metric to be used for measuring accuracy, and what constitutes an acceptable level of accuracy for engineering assessments.
Types of Errors
As described in Reference [2], there are four types of errors in numerical computations.
Performance Index
Reference [2] uses the number of correct significant digits in the estimated value as a performance index. This is an absolute measure of the accuracy of an estimate and is easier to apply than the relative performance index proposed in Reference [5], which measures the number of figures of accuracy lost by the test software over and above what software based on an optimally stable algorithm would produce. The simpler log relative error (LRE) is used as the basis for the performance index in this technical note. Figure 1 is a flowchart for determining the LRE. The LRE is defined as the negative base-10 logarithm of the relative error:
$$\mathrm{LRE} = -\log_{10}\left(\frac{|q - c|}{|c|}\right)$$
where:
LRE is the log relative error.
q is the estimated value.
c is the correct (certified) value.
The LRE is a measure of the number of correct significant digits only when the estimated value is "close" to the exact value. Therefore, each estimated quantity must be compared to its certified value to make sure that they differ by a factor of less than two; otherwise the LRE for the estimated quantity is zero. In the event that the estimated value equals the correct value, the LRE is undefined, in which case it should be set equal to the number of digits in the certified value. In those cases where the correct value is zero, the relative error is undefined and the negative base-10 logarithm of the absolute error, the log absolute error (LAE), should be used instead:
$$\mathrm{LAE} = -\log_{10}\left(|q - c|\right) = -\log_{10}|q| \quad \text{when } c = 0$$
See the algorithm flowchart (Figure 1) for details.
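The rules described above can be sketched in Python as follows. The function name, the default of 15 certified digits, and the clamping of the score to the range from zero to the number of certified digits are assumptions made for illustration; the flowchart in Figure 1 remains the authoritative description.

```python
import math


def log_relative_error(q, c, certified_digits=15):
    """Return the LRE of estimate q against certified value c.

    Hypothetical helper: certified_digits is the number of digits given
    for the certified value (15 is assumed here as a default).
    """
    # A non-finite estimate has no correct digits.
    if not math.isfinite(q):
        return 0.0

    # Exact agreement: the logarithm is undefined, so report the number
    # of digits in the certified value.
    if q == c:
        return float(certified_digits)

    if c == 0:
        # Certified value is zero: fall back to the log absolute error.
        score = -math.log10(abs(q))
    else:
        # The estimate must be within a factor of two of the certified
        # value (and have the same sign), otherwise the LRE is zero.
        ratio = q / c
        if not (0.5 < ratio < 2.0):
            return 0.0
        score = -math.log10(abs(q - c) / abs(c))

    # Assumption: clamp to [0, certified_digits], since more digits than
    # the certified value carries cannot be verified.
    return min(max(score, 0.0), float(certified_digits))


if __name__ == "__main__":
    print(log_relative_error(1.0001, 1.0))  # roughly four correct digits
    print(log_relative_error(3.0, 1.0))     # off by more than a factor of two
```

An estimate that agrees with the certified value to about four significant digits scores close to 4, while one that is off by more than a factor of two scores 0.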
Conditioning of Data
The conditioning of the data consists of subtracting the sample mean from all of the values in the sample. In the case of the univariate summary statistics and the analysis of variance procedures, the results for the raw and conditioned benchmarks are mathematically identical, but use of the conditioned data results in a problem that is numerically better conditioned. In the case of the linear regression procedures, the results of the raw and conditioned benchmarks are not mathematically identical. In this case it is necessary to evaluate the regression parameters (in terms of the sample mean and the conditioned regression parameters) for the conditioned data benchmarks so that they can be compared with the raw data benchmarks and the certified values for the dataset.
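For a straight-line fit, the back-transformation from conditioned to raw regression parameters can be sketched as below. This is an illustrative example only: the function names and the sample data are hypothetical, and it assumes that both the predictor and the response are conditioned by subtracting their sample means. If the conditioned fit is y - ybar = a0 + a1 (x - xbar), then the raw-data slope is a1 and the raw-data intercept is a0 + ybar - a1 xbar.

```python
def condition(values):
    """Subtract the sample mean; return (mean, conditioned values)."""
    mean = sum(values) / len(values)
    return mean, [v - mean for v in values]


def fit_line_normal_equations(x, y):
    """Least-squares line y = b0 + b1*x from uncentered normal equations.

    This textbook formula is prone to cancellation when the data have a
    large mean relative to their spread.
    """
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    b1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b0 = (sy - b1 * sx) / n
    return b0, b1


def fit_line_conditioned(x, y):
    """Fit on conditioned data, then express the result in raw-data terms."""
    xbar, xc = condition(x)
    ybar, yc = condition(y)
    a0, a1 = fit_line_normal_equations(xc, yc)
    b1 = a1                      # the slope is unchanged by centering
    b0 = a0 + ybar - a1 * xbar   # the intercept must be shifted back
    return b0, b1


if __name__ == "__main__":
    # Hypothetical data: large offset, small spread, exact line y = 3 + 2x.
    x = [1.0e8 + i for i in range(5)]
    y = [3.0 + 2.0 * xi for xi in x]
    print("raw fit:        ", fit_line_normal_equations(x, y))
    print("conditioned fit:", fit_line_conditioned(x, y))
```

Only after this back-transformation can the parameters from the conditioned benchmark be compared with the raw-data results and the certified values.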
Acceptable Accuracy
As described in Reference [2], the LRE from low-difficulty linear procedures should be at least nine. This is because a decent implementation of a decent algorithm for a linear procedure should, with well-behaved data, return ten accurate digits in double precision. This is not to suggest that the user needs ten digits of accuracy. Rather, the idea is that a program that cannot deliver ten digits of accuracy for an easy benchmark is more likely to deliver inaccurate results in practice.
Producing correct results for all of the benchmark problems does not imply that the software will do the same for actual datasets. It does, however, provide some degree of assurance that the software does provide correct results for benchmark problems known to yield incorrect results for some software programs.
In practice, the author considers that four digits of accuracy are sufficient for the majority of engineering assessments. The important issue is that analysts must know of, understand, and appreciate the implications of any and all limitations in the software packages that they use.
References
[2] McCullough, B.D.
Assessing the Reliability of Statistical Software: Part I
The paper proposes a set of intermediate-level tests (benchmarks) focussing on three areas: estimation, both linear and nonlinear; random number generation; and statistical distributions (e.g. for calculating probabilities). The complete methodology is described in detail. Also, convenient methods for summarizing the results of the testing are proposed. The performance measure used is an absolute measure of accuracy, calculated using the base-10 logarithm of the relative error. The estimation capabilities are assessed using the recently released NIST Statistical Reference Datasets. The output of the random number generator is assessed using statistical tests for randomness, while the accuracy of statistical distributions is assessed by comparing the results to those of specialized statistical software programs.
.......
[5] Cook, H.R., Cox, M.G., Dainton, M.P., Harris, P.M.
A Methodology for Testing Spreadsheets and Other Packages used in Metrology
This report documents a general methodology for testing the numerical accuracy of scientific software. The basis of the approach is the design and use of reference datasets and corresponding reference results to be used in black-box testing. Reference datasets and results are to be generated in a manner consistent with the functional specification of the problem addressed by the test software. Datasets corresponding to problems with various "degrees-of-difficulty", or with application-specific properties, may be produced. The results returned by the software for the reference data are compared objectively with the reference results. The objective comparison is performed using quality metrics that account for the key aspects of the problem. In addition, complementary tests of numerical accuracy that do not require the use of reference datasets are used. These include consistency and continuity checks, spot checks against tabulated values, and checks of solution characteristics. The proposed performance measure indicates the number of figures of accuracy lost by the test software over and above what software based on an optimally stable algorithm would produce.
.......
{1}
http://math.nist.gov/
The National Institute of Standards and Technology (NIST) website for applied mathematics, statistics, and computational science. This website maintains a link to the Statistical Reference Datasets (StRD).
{2}
http://www.intl.nist.gov/div898/strd/
The National Institute of Standards and Technology (NIST) website for Statistical Reference Datasets (StRD).
{3}
http://www.npl.co.uk/ssfm/
The National Physical Laboratory (NPL) website for Software Support for Metrology (SSfM). Part of the mandate for the SSfM program is the testing of spreadsheets and other software programs. A series of reports is available that documents the methodology used and the results of testing the Microsoft Excel 95 spreadsheet program.