SMD : Help : Normalization Help

Background

"Normalization" refers to computational data transformations intended to remove certain systematic biases from microarray data, such as dye effects, intensity dependence, and spatial or print-tip effects. (In this context, it doesn't necessarily have anything to do with the normal or Gaussian distribution.) A wide variety of normalization approaches have been proposed and employed in the literature. Each technique relies on a set of assumptions about the ideal form of the data, and attempts to make the data consistent with that ideal form by computational manipulation. The database makes available several different methods for "location" normalization: total intensity normalization, M-A loess normalization, and two-dimensional loess normalization. These techniques all adjust the average of the data, either globally or stratified by intensity, print-tip, and/or position. "Scale" normalization is also available in conjunction with the loess methods; this adjusts the range of the data.

"Background correction" refers to adjustments to the data intended to remove nonbiological contributions ("background") to the measured signal. The most common approach for two-color data is to approximate the background at each feature or spot by measuring the local, off-spot signal intensity in either channel, and subtract that value from the foreground (on-spot) signal. Many other estimates of background may be applied, ranging from ignoring background contributions altogether, to complex model-based calculations. Background corrections are applied prior to normalization, and will affect the outcome of normalization. The database offers a variety of background correction options, detailed below.

This document briefly describes the normalization and background correction options available within the database, and how to use them.

Default Total Intensity Normalization

Total intensity normalization relies on the assumption that most genes do not respond to experimental conditions, and so the average log ratio on the array should be zero. Note that this may not be a safe assumption for your data! A single, global, multiplicative adjustment is performed so that the average log ratio is zero for well-measured spots. All spots are normalized using the same constant, regardless of whether they were used in the calculation. In the database, the adjustment is applied to channel 2: the normalized channel 2 intensity of every spot is its raw channel 2 intensity divided by the normalization constant.
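The calculation described above can be sketched in a few lines of Python. This is an illustration only, not the database's code. It assumes that the ratio is channel 2 divided by channel 1, that the "average" is the arithmetic mean of the log ratios, and that a hypothetical list named good marks the well-measured spots.

```python
import math

def total_intensity_constant(ch1, ch2, good):
    """Return the constant that centers the mean log ratio of good spots at zero.

    Assumes ratio = channel 2 / channel 1; 'good' is a hypothetical list of
    booleans marking the spots used for the calculation.
    """
    logs = [math.log(r / g)
            for g, r, ok in zip(ch1, ch2, good)
            if ok and g > 0 and r > 0]
    if not logs:
        raise ValueError("no usable spots for normalization")
    # The exponential of the mean log ratio is the geometric mean of the ratios.
    return math.exp(sum(logs) / len(logs))

def normalize_channel2(ch2, constant):
    """Divide every channel 2 intensity by the same constant."""
    return [r / constant for r in ch2]

ch1 = [100.0, 200.0, 400.0, 50.0]
ch2 = [200.0, 400.0, 800.0, 500.0]
good = [True, True, True, False]   # the last spot is excluded from the estimate

constant = total_intensity_constant(ch1, ch2, good)
normalized = normalize_channel2(ch2, constant)  # applied to all spots, good or not
```

Note that the excluded spot is still divided by the constant, which matches the statement that all spots are normalized whether or not they contributed to the calculation.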

The normalization constant may be supplied by the user, or may be calculated by the database's software. If calculated by the database, the first step is to select good spots on which to base the normalization. Two methods are available. Both begin by discarding flagged spots.

Complex Normalization Options

Several more complex normalization options are provided by the marray package from Bioconductor (Gentleman et al., 2004), which runs in the R statistical computing environment.

Three location normalization options are provided:

Loess calculations depend critically on the "span," a value between 0 and 1 that specifies the fraction of the data to include in each local estimate, and thus the degree of smoothing. The value specified for the span (default 0.4 in the database) will influence the results of loess normalization, sometimes significantly. At the time of this writing there are no generally accepted methods for choosing an optimal span parameter.
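To make the role of the span concrete, the following Python sketch fits a simplified loess curve to an M-A plot and subtracts it. It is not the marray implementation: it assumes M = log2(channel 2 / channel 1) and A = the mean of the two log2 intensities, uses tricube weights with a local linear fit, and omits the robustness iterations that full loess implementations usually perform. All function names are hypothetical.

```python
import math

def tricube(u):
    """Tricube kernel weight for a scaled distance u (zero at u >= 1)."""
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0

def loess_fit(x, y, span=0.4):
    """Local linear fit at every x[i], using the nearest span * n points."""
    n = len(x)
    k = max(3, min(n, math.ceil(span * n)))
    fitted = []
    for x0 in x:
        # The k points closest to x0 form the local neighbourhood.
        nearest = sorted(range(n), key=lambda j: abs(x[j] - x0))[:k]
        dmax = max(abs(x[j] - x0) for j in nearest) or 1.0
        w = [tricube(abs(x[j] - x0) / dmax) for j in nearest]
        sw = sum(w)
        xb = sum(wi * x[j] for wi, j in zip(w, nearest)) / sw
        yb = sum(wi * y[j] for wi, j in zip(w, nearest)) / sw
        sxx = sum(wi * (x[j] - xb) ** 2 for wi, j in zip(w, nearest))
        sxy = sum(wi * (x[j] - xb) * (y[j] - yb) for wi, j in zip(w, nearest))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(yb + slope * (x0 - xb))
    return fitted

def ma_loess_normalize(ch1, ch2, span=0.4):
    """Return log ratios with the intensity-dependent trend removed."""
    m = [math.log2(r / g) for g, r in zip(ch1, ch2)]
    a = [0.5 * math.log2(r * g) for g, r in zip(ch1, ch2)]
    trend = loess_fit(a, m, span)
    return [mi - ti for mi, ti in zip(m, trend)]
```

A larger span pulls more points into each local fit and gives a smoother curve; a smaller span follows the data more closely and may start removing real signal.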

The normalization calculation may be "stratified" by print-tip (sector). This will cause the normalization to be performed separately on each sector. This is generally appropriate for pin-printed microarrays, in which print-tip effects are common; stratification by print-tip will eliminate much of the print-tip effect on the data. In the database, if you select two-dimensional normalization (above), spots will automatically be stratified by print-tip. Stratification is not available for the default normalization; use the marray median adjustment instead.
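As an illustration of stratification, the sketch below applies a median adjustment separately within each sector. It is a minimal Python example under stated assumptions, not the marray code: the inputs are log ratios and a parallel list of sector labels, and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import median

def median_center_by_sector(log_ratios, sectors):
    """Subtract each sector's own median log ratio from the spots in that sector."""
    groups = defaultdict(list)
    for m, s in zip(log_ratios, sectors):
        groups[s].append(m)
    centers = {s: median(values) for s, values in groups.items()}
    return [m - centers[s] for m, s in zip(log_ratios, sectors)]

log_ratios = [0.5, 0.7, 0.9, -0.2, 0.0, 0.2]
sectors = [1, 1, 1, 2, 2, 2]
centered = median_center_by_sector(log_ratios, sectors)
```

Here sector 1 has a median of 0.7 and sector 2 a median of 0.0, so after adjustment both sectors are centered at zero even though they started with different offsets.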

"Scale" normalization adjusts the range of data, rather than the center of the distribution. This makes data more comparable across arrays, by eliminating differences in the range of response to conditions. Of course, this may not be appropriate; it is generally advised only when the absolute scale of response is not relevant (or not well measured). The database supports division of all values by the median absolute deviation (MAD) of the array (or sector if print-tip stratification is selected). This may be combined with location normalization (median adjustment, intensity-dependent loess, or 2-D loess functions only), in which case scale normalization will be performed following location normalization.

Options for Background Correction

The database offers a variety of methods for background correction, from the simple to the complex. The background estimates provided by your feature extraction software (usually local, off-spot intensity) will always be preserved and available in the database. If you choose to apply one of the options detailed here, a second version of your data will be computed and made available to you, using the selected background correction method. At this time you are limited to a single additional version of any given array; if you subsequently select a different background correction method, the first one will be overwritten. Both versions of the data will be normalized using the same method (whichever one you select).

In most of the methods described below, background is assumed to be additive in each channel. That is,

Ratio = (Red_foreground - Red_background) / (Green_foreground - Green_background)
The estimated background in each channel (Red_background and Green_background) is subtracted from the measured foreground intensity in order to produce the net or corrected intensity. Multiplicative effects are generally left to subsequent normalization calculations to correct.
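A minimal Python sketch of the additive correction follows. Returning None when either net intensity is not positive is an assumption made for this example, since the ratio is then undefined or meaningless; it is not a description of how the database handles such spots.

```python
def corrected_ratio(red_fg, red_bg, green_fg, green_bg):
    """Background-subtracted red/green ratio for one spot."""
    red_net = red_fg - red_bg
    green_net = green_fg - green_bg
    # Assumption: a non-positive net intensity yields no ratio.
    if red_net <= 0 or green_net <= 0:
        return None
    return red_net / green_net

ratio = corrected_ratio(1200.0, 200.0, 700.0, 200.0)   # (1000 / 500)
```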

LIMMA options

Several methods of background correction are offered through the LIMMA package from Bioconductor (Gentleman et al., 2004), which runs in the R statistical computing environment. For more information, see the documentation on those websites, and the following manuscripts:

These options are:

Model-based options

Additionally, the database offers a complex background estimate, based on a preliminary model for background effects fit to non-hybridizing control spots. If the print (array design) in question does not contain non-hybridizing controls in sufficient numbers, you may instead elect to apply this method using the lowest-intensity spots in each sector as if they were non-hybridizing controls. In this case the algorithm will pick, in each sector (block, or print-tip group) and in each channel independently, the spots at or below the 5th percentile of intensity in that channel that are above the 5th percentile of intensity in the other channel. This is intended to guarantee that actual features are selected, rather than mis-gridded locations that do not actually contain DNA.
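The selection rule for low-intensity spots can be sketched as below. This Python example is an illustration only: the percentile definition (linear interpolation between order statistics) and all names are assumptions, and the database's own implementation may differ in those details.

```python
from collections import defaultdict

def percentile(values, p):
    """Percentile by linear interpolation between order statistics (an assumption)."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * p / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def pick_low_intensity_spots(red, green, sectors, p=5.0):
    """Return (red_picks, green_picks): spot indices used as pseudo-controls."""
    by_sector = defaultdict(list)
    for i, s in enumerate(sectors):
        by_sector[s].append(i)
    red_picks, green_picks = [], []
    for idx in by_sector.values():
        red_cut = percentile([red[i] for i in idx], p)
        green_cut = percentile([green[i] for i in idx], p)
        for i in idx:
            # Low in this channel, but not low in the other channel.
            if red[i] <= red_cut and green[i] > green_cut:
                red_picks.append(i)
            if green[i] <= green_cut and red[i] > red_cut:
                green_picks.append(i)
    return red_picks, green_picks
```

A spot that is low in both channels is picked for neither, which is the intended safeguard against locations that contain no DNA.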

The algorithm uses a generalized additive model (see Hastie, T. and Tibshirani, R. (1990) Generalized Additive Models. London: Chapman and Hall) to fit the measured foreground intensities, in each channel independently, to a combination of three terms: the print-tip which printed the spot (treated as a factor); a smooth interpolation of measured background intensities; and a smooth, two-dimensional surface intended to capture global effects. The model is expressed as

Foreground(non-hybridizing controls) ~ print-tip + smooth(local background) + smooth(X, Y)
where the smooth functions are locally linear. The model is fit to non-hybridizing control spots (or low-intensity spots), and then evaluated at each spot on the array in order to estimate the background intensity at each spot.

In each channel, 90% of the non-hybridizing control spots (or low-intensity spots) are used to train the model, and 10% are used to test it. For each channel, two root-mean-square errors (RMSE) are computed on the test spots: one between the measured foreground and the model's background estimate, and one between the measured foreground and the local off-spot intensity measurement (local background). Both values are recorded in the protocol information for the array for later inspection.
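The train/test split and the error measure can be sketched in Python as follows. The random shuffling with a fixed seed and the function names are assumptions made for this example; how the database actually assigns spots to the two groups is not described here.

```python
import math
import random

def split_controls(indices, test_fraction=0.1, seed=0):
    """Shuffle control-spot indices and hold out a fraction for testing."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)

def rmse(observed, predicted):
    """Root-mean-square error between two equal-length sequences."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))

train, test = split_controls(range(50))

# For the test spots, compare the measured foreground with each estimate.
foreground = [10.0, 10.0]
model_estimate = [11.0, 9.0]
local_background = [13.0, 7.0]
model_rmse = rmse(foreground, model_estimate)
local_rmse = rmse(foreground, local_background)
```

Comparing the two errors indicates whether the model tracks the foreground of the held-out control spots more closely than the local off-spot measurement does.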

How to renormalize experiments

Arrays may be renormalized one at a time by following the "Select normalization options" link while editing the experiment. To renormalize all arrays in an arraylist, follow the Batch Renormalize Data link in the list of all programs. Only GenePix, ScanAlyze, and SpotReader data may be renormalized within the database. Agilent and Affymetrix software provide other options for normalization prior to loading into the database.


Please send comments or questions to: array@genome.stanford.edu