Help : Normalization Help
Background
"Normalization" refers to computational data transformations intended
to remove certain systematic biases from microarray data, such as dye
effects, intensity dependence, and spatial or print-tip effects. (In this
context, it doesn't necessarily have anything to do with the normal or
Gaussian distribution.) A wide variety of normalization approaches
have been proposed and employed in the literature. Each technique
relies on a set of assumptions about the ideal form of the data, and
attempts to make the data consistent with that ideal form by
computational manipulation. The database makes
available several different methods for "location" normalization:
total intensity normalization, M-A loess normalization, and
two-dimensional loess normalization. These techniques all adjust the
average of the data, either globally or stratified by intensity,
print-tip, and/or position. "Scale" normalization is also
available in conjunction with the loess methods; this adjusts the
range of the data.
"Background correction" refers to adjustments to the data intended to
remove nonbiological contributions ("background") to the measured
signal. The most common approach for two-color data is to approximate
the background at each feature or spot by measuring the local,
off-spot signal intensity in either channel, and subtract that value
from the foreground (on-spot) signal. Many other estimates of
background may be applied, ranging from ignoring background
contributions altogether, to complex model-based calculations.
Background corrections are applied prior to normalization, and will
affect the outcome of normalization. The database offers a variety of
background correction options, detailed below.
This document briefly describes the normalization and background
correction options available within the database, and how to use them.
Default Total Intensity Normalization
Total intensity normalization relies on the assumption that most genes
do not respond to experimental conditions, and so the average log
ratio on the array should be zero. Note that this may not be a
safe assumption for your data! A single, global, multiplicative
adjustment is performed so that the average log ratio is zero for well
measured spots.
All spots are normalized using the same constant, regardless of
whether they were used in the calculation. In
the database, this is performed by computing normalized values for all channel
2 intensities, by dividing the raw values by the normalization
constant.
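The calculation described above can be sketched as follows. This is a minimal illustration, not the database's actual code: it assumes the log ratio is log2(channel 2 / channel 1), and the function names and sample intensities are hypothetical. The constant is the geometric mean of the ratios over the well-measured spots, so dividing every channel 2 intensity by it brings the mean log ratio of those spots to zero.

```python
import math

def normalization_constant(ch1, ch2, well_measured):
    """Geometric mean of ch2/ch1 over the well-measured spots.

    Dividing channel 2 by this constant makes the mean log2 ratio of
    the selected spots equal to zero. Intensities must be positive.
    """
    logs = [math.log2(ch2[i] / ch1[i]) for i in well_measured]
    return 2 ** (sum(logs) / len(logs))

def normalize_channel2(ch2, constant):
    """Apply the same constant to every spot, selected or not."""
    return [value / constant for value in ch2]

# Illustrative intensities for four spots; spot 3 is treated as poorly
# measured and is excluded from the calculation of the constant.
ch1 = [100.0, 200.0, 400.0, 50.0]
ch2 = [200.0, 400.0, 800.0, 10.0]
good = [0, 1, 2]

constant = normalization_constant(ch1, ch2, good)
ch2_norm = normalize_channel2(ch2, constant)
mean_log_ratio = sum(math.log2(ch2_norm[i] / ch1[i]) for i in good) / len(good)
```

Note that spot 3 is still divided by the constant even though it did not contribute to it, as described above.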
The normalization constant may be supplied by the user, or may be
calculated by the database's software. If calculated by the database,
the first step is to select good spots on which to base the
normalization. Two methods are available. Both begin by discarding flagged spots.
- The default "computed" normalization
procedure then selects non-flagged spots
for which the foreground intensity is well above background:
- If ScanAlyze data: Both CH1GTB2 and CH2GTB2 (the fraction of spot
pixels greater than 1.5 times the background in channels 1 and 2,
respectively) are greater than a threshold value.
- If GenePix or SpotReader data: Both % > B532+1SD and % > B635+1SD
(the percentage of spot pixels with intensities more than one standard
deviation above the background pixel intensity in channels 1 and 2,
respectively) are greater than a threshold value.
The threshold value is initially set to > 0.65. If fewer than 10% of
the spots in the print pass these criteria, the program uses > 0.60
instead. If fewer than 10% of the spots in the print pass the 0.60
threshold, the program uses > 0.55. All spots that pass the 0.55
threshold are used in the normalization calculation, regardless of how
many there are. As soon as at least 10% of the spots pass a threshold,
the program uses those passing spots in the calculation and does not
try a lower threshold value.
- The "using regression correlation" method
selects non-flagged spots for which
the pixel-to-pixel regression correlation is > 0.6.
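The stepwise threshold used by the default "computed" procedure can be sketched as below. This is an illustration under stated assumptions rather than the database's implementation: the function name `select_spots` is hypothetical, the quality values are assumed to be expressed as fractions between 0 and 1, and "10% of the spots" is taken relative to all spots in the print.

```python
def select_spots(quality_ch1, quality_ch2, flagged,
                 thresholds=(0.65, 0.60, 0.55), min_fraction=0.10):
    """Return indices of non-flagged spots passing the first adequate threshold.

    quality_ch1 and quality_ch2 hold the per-spot fraction of pixels
    above background in each channel. Thresholds are tried from the
    strictest to the most lenient; the last one is accepted no matter
    how few spots pass it.
    """
    n_spots = len(flagged)
    passing = []
    for threshold in thresholds:
        passing = [i for i in range(n_spots)
                   if not flagged[i]
                   and quality_ch1[i] > threshold
                   and quality_ch2[i] > threshold]
        if len(passing) >= min_fraction * n_spots:
            break  # enough spots: do not try a lower threshold
    return passing

# Illustrative print with 20 spots: only one spot passes 0.65 in both
# channels (5% of the print), so the 0.60 threshold is tried next.
quality_ch1 = [0.70, 0.62, 0.63] + [0.30] * 17
quality_ch2 = [0.80, 0.64, 0.61] + [0.40] * 17
flagged = [False] * 20
selected = select_spots(quality_ch1, quality_ch2, flagged)
```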
Complex Normalization Options
Several more complex normalization options are provided by the
marray package for BioConductor (Gentleman et al., 2004), which runs
in the R statistical computing software.
Three location normalization options are provided:
- Median adjustment. This is essentially the same as the database's
default total intensity normalization, but no spot filtering is
performed.
Log ratios are adjusted globally such that the median log ratio is
zero; the database back-calculates normalized channel 2 intensities from the
normalized log-ratios.
- Intensity dependent normalization using local estimation. See
Yang
et al., 2002 and help documents on the BioConductor website for detailed
explanations of this approach. In essence, a smooth best-fit curve is
calculated for the dependence of log-ratio (M) on overall intensity
(A: log(base 2) of the geometric mean of the channel intensities).
Normalized log ratios are then given as the residuals from this curve
(and in the database, normalized channel 2 intensities are back-calculated from
the normalized log-ratios). Local estimation ("loess"), a regression
calculation weighted toward similar (in overall intensity value)
spots, is used to calculate the curve.
[Figure: Intensity-dependent loess normalization. Panels: Before Normalization; After Global M-A Loess Normalization.]
- Two-dimensional normalization using local estimation. The same
type of loess calculation is performed (see above references),
computing a smooth surface that gives the dependence of log-ratios on
spatial position across the microarray slide. Normalized log ratios
are given as residuals from this surface (essentially flattening it)
to eliminate spatial dependence in the data. In the database,
normalized channel 2 intensities are back-calculated from the
normalized log ratios. Spots are automatically stratified by
print-tip if you select this option (see below).
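The quantities used by the location options above (M, A, and the back-calculated channel 2 intensity) can be illustrated with the simplest case, median adjustment. This is a minimal sketch, not the marray code: the function names are hypothetical, and it assumes that M is log2(channel 2 / channel 1) and that channel 1 is left unchanged when channel 2 is back-calculated.

```python
import math
import statistics

def ma_values(ch1, ch2):
    """Return (M, A): log2 ratio and mean log2 intensity for each spot."""
    m = [math.log2(r / g) for g, r in zip(ch1, ch2)]
    a = [0.5 * (math.log2(r) + math.log2(g)) for g, r in zip(ch1, ch2)]
    return m, a

def median_adjust(m):
    """Shift all log ratios so that their median is zero."""
    center = statistics.median(m)
    return [value - center for value in m]

def back_calculate_ch2(ch1, m_norm):
    """Channel 2 intensities implied by normalized log ratios."""
    return [g * 2 ** value for g, value in zip(ch1, m_norm)]

# Illustrative intensities: log ratios of 1, 2 and 3 before adjustment.
ch1 = [100.0, 100.0, 100.0]
ch2 = [200.0, 400.0, 800.0]

m, a = ma_values(ch1, ch2)
m_norm = median_adjust(m)                 # median log ratio becomes zero
ch2_norm = back_calculate_ch2(ch1, m_norm)
```

The loess options follow the same pattern, except that the value subtracted from each M is read from a fitted curve or surface rather than being a single median.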
Loess calculations depend critically on the "span," a value between 0
and 1 that specifies the amount of data to include in each local
estimate, and thus the degree of smoothing. The value specified for
the span (default 0.4 in the database) will influence the results of loess
normalization, sometimes significantly. At the time of this writing
there are no generally-accepted methods for choosing an optimal span parameter.
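To make the role of the span concrete, here is a deliberately simplified loess sketch: a local linear fit with tricube weights over the nearest span * n points. It is not the routine used by marray; R's loess offers further options (polynomial degree, robust fitting) that are omitted here, and the name `loess_fit` is hypothetical.

```python
import math

def loess_fit(x, y, span=0.4):
    """Local linear fit evaluated at every x, using the nearest span*n points."""
    n = len(x)
    k = min(n, max(3, math.ceil(span * n)))  # points used in each local fit
    fitted = []
    for x0 in x:
        # the k points closest to x0 along the x axis
        nearest = sorted(range(n), key=lambda i: abs(x[i] - x0))[:k]
        max_dist = max(abs(x[i] - x0) for i in nearest) or 1.0
        # tricube weights: 1 at x0, falling to 0 at the farthest neighbour
        w = [(1 - (abs(x[i] - x0) / max_dist) ** 3) ** 3 for i in nearest]
        sw = sum(w)
        mx = sum(wi * x[i] for wi, i in zip(w, nearest)) / sw
        my = sum(wi * y[i] for wi, i in zip(w, nearest)) / sw
        sxx = sum(wi * (x[i] - mx) ** 2 for wi, i in zip(w, nearest))
        sxy = sum(wi * (x[i] - mx) * (y[i] - my) for wi, i in zip(w, nearest))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

# Illustrative data: the log ratio M drifts linearly with intensity A.
a = [float(i) for i in range(20)]
m = [0.1 * v - 1.0 for v in a]
trend = loess_fit(a, m, span=0.4)
m_norm = [mi - ti for mi, ti in zip(m, trend)]  # residuals from the curve
```

A larger span includes more spots in each local fit and gives a smoother curve; a smaller span follows the data more closely and removes more of the local structure, which may include real signal.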
The normalization calculation may be "stratified" by print-tip
(sector). This will cause the normalization to be performed
separately on each sector. This is generally appropriate for
pin-printed microarrays, in which print-tip effects are common;
stratification by print-tip will eliminate much of the print-tip
effect on the data. In the database, if you select two-dimensional
normalization (above), spots will automatically be stratified by
print-tip. Stratification is not available for the default
normalization; use the marray median adjustment instead.
"Scale" normalization adjusts the range of data, rather than the
center of the distribution. This makes data more comparable across
arrays, by eliminating differences in the range of response to
conditions. Of course, this may not be appropriate; it is generally
advised only when the absolute scale of response is not relevant (or
not well measured). The database supports division of all values by
the median absolute deviation (MAD) of the array (or sector if
print-tip stratification is selected). This may be combined with
location normalization (median adjustment, intensity-dependent loess,
or 2-D loess functions only), in which case scale normalization will
be performed following location normalization.
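Print-tip stratification and MAD scaling can be sketched together as below. This is an illustration only: the function names are hypothetical, and the MAD is computed here as the plain median of absolute deviations from the median, without the consistency constant that R's mad() applies by default, so absolute values may differ from what the database stores.

```python
import statistics
from collections import defaultdict

def group_by_sector(sectors):
    """Map each sector label to the list of spot indices it contains."""
    groups = defaultdict(list)
    for index, sector in enumerate(sectors):
        groups[sector].append(index)
    return groups

def stratified_median_adjust(m, sectors):
    """Median-centre the log ratios separately within each sector."""
    out = list(m)
    for indices in group_by_sector(sectors).values():
        center = statistics.median([m[i] for i in indices])
        for i in indices:
            out[i] = m[i] - center
    return out

def mad(values):
    """Median absolute deviation from the median (no scaling constant)."""
    center = statistics.median(values)
    return statistics.median([abs(v - center) for v in values])

def stratified_mad_scale(m, sectors):
    """Divide each log ratio by the MAD of its own sector."""
    out = list(m)
    for indices in group_by_sector(sectors).values():
        scale = mad([m[i] for i in indices])
        if scale == 0:
            raise ValueError("MAD is zero; cannot scale this sector")
        for i in indices:
            out[i] = m[i] / scale
    return out

# Illustrative log ratios from two print-tip sectors with different
# centres and spreads. Location is adjusted first, then scale.
m = [1.0, 2.0, 3.0, 10.0, 12.0, 14.0]
sectors = [1, 1, 1, 2, 2, 2]
located = stratified_median_adjust(m, sectors)
scaled = stratified_mad_scale(located, sectors)
```

Passing the same label for every spot reduces both functions to their whole-array (unstratified) form.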
Options for Background Correction
The database offers a variety of methods for background correction,
from the simple to the complex. The background estimates provided by
your feature extraction software (usually local, off-spot intensity)
will always be preserved and available in the database. If you choose
to apply one of the options detailed here, a second version of your
data will be computed and made available to you, using the selected
background correction method. At this time you are limited to a
single additional version of any given array; if you subsequently
select a different background correction method, the first one will be
overwritten. Both versions of the data will be normalized using the
same method (whichever one you select).
In most of the methods described below, background is assumed to be
additive in each channel. That is,
Ratio = (Red_foreground - Red_background) / (Green_foreground - Green_background)
The estimated background in each channel (Red_background and
Green_background) is subtracted from the measured foreground intensity
in order to produce the net or corrected intensity. Multiplicative
effects are generally left to subsequent normalization calculations to
correct.
LIMMA options
Several methods of background correction are offered by the LIMMA
package for BioConductor (Gentleman et al., 2004), which runs in the R
statistical computing software. For more information, see the
documentation on those websites, and the following manuscripts:
- Smyth, G. K., and Speed, T. P. (2003). Normalization of cDNA microarray
data. Methods 31, 265-273
- Smyth, G. K. (2005). Limma: linear models for microarray data. In:
Bioinformatics and Computational Biology Solutions using R and
Bioconductor, R. Gentleman, V. Carey, S. Dudoit, R. Irizarry, W. Huber
(eds.), Springer, New York, pages 397-420
These options are:
- Zero: Do no background correction whatsoever, using the
unadjusted foreground intensity measurements instead.
- Moving minimum: For each spot, consider the local, off-spot
measurements of signal intensity for the spot and its eight neighbors
(forming a square around the spot). The minimum off-spot intensity is
used as the background estimate for the spot, and subtracted from its
foreground intensity.
- Half: Local, off-spot signal intensity is subtracted from
each spot's foreground intensity as usual, unless the net value is
non-positive. In the latter case, the net intensity is set to 0.5
(rounded to 1 in the database, since intensities are treated as
integers).
- Minimum: Local, off-spot signal intensity is subtracted from
each spot's foreground intensity as usual, unless the net value is
non-positive. In the latter case, the net intensity is set to
one-half the minimum positive net intensity.
- Edwards: Local, off-spot signal intensity is subtracted from
each spot's foreground intensity, with a log-linear interpolation of
lower-intensity spots designed to produce positive net values. See Edwards
(2003).
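The simpler options in this list can be sketched for a single channel as follows. This is an illustration of the rules as described above, not the LIMMA code: the function names are hypothetical, the Edwards method is omitted, and the handling of spots at the edge of the grid in the moving minimum is an assumption.

```python
def correct_background(foreground, background, method="subtract"):
    """Return net intensities for one channel.

    method is one of "zero", "subtract", "half" or "minimum".
    """
    if method == "zero":
        return list(foreground)            # no correction at all
    net = [f - b for f, b in zip(foreground, background)]
    if method == "subtract":
        return net
    if method == "half":
        # non-positive values become 0.5 (the database rounds this to 1)
        return [v if v > 0 else 0.5 for v in net]
    if method == "minimum":
        positives = [v for v in net if v > 0]
        if not positives:
            raise ValueError("no positive net intensities on this array")
        floor = min(positives) / 2
        return [v if v > 0 else floor for v in net]
    raise ValueError("unknown method: " + method)

def moving_minimum(background_grid):
    """Replace each background value by the minimum over the spot and its
    neighbours (a 3x3 window, clipped at the edges of the grid)."""
    rows, cols = len(background_grid), len(background_grid[0])
    return [[min(background_grid[rr][cc]
                 for rr in range(max(0, r - 1), min(rows, r + 2))
                 for cc in range(max(0, c - 1), min(cols, c + 2)))
             for c in range(cols)]
            for r in range(rows)]

# Illustrative values: the third spot is dimmer than its local background.
foreground = [500.0, 120.0, 80.0]
background = [100.0, 100.0, 100.0]
net_half = correct_background(foreground, background, "half")
net_minimum = correct_background(foreground, background, "minimum")
smoothed = moving_minimum([[5, 9, 9], [9, 9, 9], [9, 9, 1]])
```

For the moving minimum option, the smoothed grid would take the place of the local background before subtraction.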
Model-based options
Additionally, the database offers a complex background estimate, based
on a preliminary model for background effects fit to non-hybridizing
control spots. If the print (array design) in question does not
contain non-hybridizing controls in sufficient numbers, you may
instead elect to apply this method using the lowest-intensity spots in
each sector as if they were non-hybridizing controls. In this case the
algorithm will pick, in each sector (block, or print-tip group), the
spots in the 5th percentile of intensity and below, in each channel
independently, that are above the 5th percentile of intensity in the
other channel. This is intended to guarantee that actual features are
selected, rather than mis-gridded locations that do not actually
contain DNA.
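The selection of low-intensity spots can be sketched as below. Several details are assumptions made for illustration: the percentile is computed by the nearest-rank rule (the database may use a different definition), a separate set of spots is returned for each channel's model, and the function names are hypothetical.

```python
import math
from collections import defaultdict

def percentile(values, pct):
    """Nearest-rank percentile (one of several common definitions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def pick_pseudo_controls(ch1, ch2, sectors, pct=5):
    """Per sector, pick spots that are dim in one channel but not the other.

    Returns (controls_ch1, controls_ch2): spot indices to use when
    fitting the channel 1 and channel 2 background models.
    """
    groups = defaultdict(list)
    for index, sector in enumerate(sectors):
        groups[sector].append(index)
    controls_ch1, controls_ch2 = [], []
    for indices in groups.values():
        cut1 = percentile([ch1[i] for i in indices], pct)
        cut2 = percentile([ch2[i] for i in indices], pct)
        for i in indices:
            if ch1[i] <= cut1 and ch2[i] > cut2:
                controls_ch1.append(i)
            if ch2[i] <= cut2 and ch1[i] > cut1:
                controls_ch2.append(i)
    return controls_ch1, controls_ch2

# Illustrative sector of 20 spots: spot 0 is dim only in channel 1 and
# spot 1 is dim only in channel 2.
ch1 = [10.0] + [100.0 + i for i in range(1, 20)]
ch2 = [150.0, 5.0] + [100.0 + i for i in range(2, 20)]
sectors = [1] * 20
controls_ch1, controls_ch2 = pick_pseudo_controls(ch1, ch2, sectors)
```

A location that is dim in both channels fails the second condition for each channel and is therefore never selected.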
The algorithm uses a generalized additive model (see Hastie, T. and
Tibshirani, R. (1990) Generalized Additive Models. London:
Chapman and Hall) to fit the measured foreground intensities, in each
channel independently, to a combination of three terms: the print-tip
which printed the spot (treated as a factor); a smooth interpolation
of measured background intensities; and a smooth, two-dimensional
surface intended to capture global effects. The model is expressed as
Foreground(non-hybridizing controls) ~ print-tip + smooth(local background) + smooth(X, Y)
where the smooth functions are locally linear. The model is fit to
non-hybridizing control spots (or low-intensity spots), and then
evaluated at each spot on the array in order to estimate the
background intensity at each spot.
In each channel, 90% of the non-hybridizing control spots (or
low-intensity spots) are used to train the model, and 10% are used to
test it. For the test spots, two root mean-square errors (RMSE) are
computed in each channel: measured foreground versus the model's
background estimate, and measured foreground versus the local off-spot
intensity measurement (local background). Both values are recorded in
the protocol information for the array for later inspection.
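The train/test split and the two RMSE values can be sketched as follows. The model fit itself is not shown; the random split, the function names, and the sample numbers are assumptions made for illustration.

```python
import math
import random

def split_controls(indices, test_fraction=0.10, seed=0):
    """Randomly split control spot indices into (train, test) lists."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(test_fraction * len(shuffled)))
    return shuffled[n_test:], shuffled[:n_test]

def rmse(observed, estimated):
    """Root mean-square error between two equally long sequences."""
    squared = [(o - e) ** 2 for o, e in zip(observed, estimated)]
    return math.sqrt(sum(squared) / len(squared))

# 50 hypothetical control spots: 45 for training, 5 for testing.
train, test = split_controls(range(50))

# Illustrative values for three test spots in one channel.
foreground = [110.0, 95.0, 102.0]         # measured on control spots
model_estimate = [108.0, 97.0, 100.0]     # background predicted by the model
local_background = [100.0, 100.0, 100.0]  # off-spot measurement

rmse_model = rmse(foreground, model_estimate)
rmse_local = rmse(foreground, local_background)
```

Because control spots should carry no specific signal, their foreground is itself a measure of background, so a lower RMSE for the model than for the local measurement suggests the model is the better estimate on that array.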
How to renormalize experiments
Arrays may be renormalized one at a time by following the "Select
normalization options" link while editing the experiment. To
renormalize all arrays in an arraylist, follow the Batch Renormalize
Data link in the list of all programs. Only GenePix,
ScanAlyze, and SpotReader data may be renormalized within the database.
Agilent and Affymetrix software provide other options for
normalization prior to loading into the database.
Please send comments or questions to:
array@genome.stanford.edu