CNAmet R package

How do I preprocess my copy-number data?

Copy-number data should be

  1. Normalized,
  2. Segmented, and
  3. Called (binarized).

With Agilent CGH arrays, we use lowess normalization followed by circular binary segmentation (R package DNAcopy). Other segmentation algorithms such as GLAD work as well. Calling depends more on your data: we usually binarize the data by setting the calling threshold for copy-number alteration at two times the standard deviation of normalized control data, or slightly less when no control data is available. You can also use R packages such as CGHcall or FastCall (TASSO R package), and then binarize their output.
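The thresholding step described above can be sketched as follows. This is a minimal illustration, not CNAmet code: the function name call_gains_losses, the symmetric threshold around zero, and the separate gain and loss vectors are assumptions made for the example. It expects segmented log2 ratios centered at zero.

```python
from statistics import stdev


def call_gains_losses(segment_means, control_values, n_sd=2.0):
    """Binarize segmented log2 ratios into gain and loss calls.

    Assumption: the threshold is n_sd times the sample standard
    deviation of the normalized control data, applied symmetrically
    around zero.
    """
    threshold = n_sd * stdev(control_values)
    # 1 marks an alteration, 0 marks no alteration.
    gains = [1 if value >= threshold else 0 for value in segment_means]
    losses = [1 if value <= -threshold else 0 for value in segment_means]
    return gains, losses


# Hypothetical example values: three segment means and four control measurements.
gains, losses = call_gains_losses([0.5, -0.4, 0.05], [-0.1, 0.1, -0.1, 0.1])
print(gains, losses)
```

Lowering n_sd makes the calls more permissive, which corresponds to choosing a slightly smaller threshold when no control data is available.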

With the Illumina 450k methylation array, we filter out probes that match multiple loci and then normalize with, e.g., the R packages lumi or minfi (Bioconductor also contains easy-to-use annotation packages for the 450k array). The copy-number state of a locus is estimated from the sum of the unmethylated (U) and methylated (M) probe intensities, which we then segment with DNAcopy. If the profiles are too noisy, we combine signals from windows of 10 adjacent probes by taking the median of their unsegmented values, and then segment the data.
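The two numeric steps above (summing U and M, then smoothing noisy profiles with window medians) can be sketched like this. The function names are hypothetical, and the sketch assumes that probes are already sorted by genomic position within one chromosome and that windows do not overlap; the text does not say whether the windows are overlapping or not.

```python
from statistics import median


def total_intensity(unmethylated, methylated):
    """Per-probe copy-number signal: sum of the U and M intensities."""
    return [u + m for u, m in zip(unmethylated, methylated)]


def window_medians(values, window=10):
    """Median of each run of `window` adjacent probes.

    Assumption: non-overlapping windows; the last window may hold
    fewer than `window` probes.
    """
    return [
        median(values[start:start + window])
        for start in range(0, len(values), window)
    ]


# Hypothetical example: 25 probes collapse to 3 window medians.
signal = total_intensity(list(range(25)), [0] * 25)
print(window_medians(signal))
```

The resulting window medians would then be passed to the segmentation step in place of the per-probe values.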

For Affymetrix SNP arrays, we have extracted copy numbers using crlmm, and then segmented the data with DNAcopy.

In addition, make sure your copy-number data uses the same genome build as your expression data and annotations. You can use the UCSC LiftOver application to map loci from one genome build to another.

How do I preprocess my methylation data?

Methylation array data should be normalized and binarized. Several R packages exist for methylation array normalization (e.g., lumi and minfi). We filter out probes that contain known SNPs or do not map uniquely to the genome. There are multiple ways to binarize the values. If there are control samples, we subtract, for each probe, the median measurement of the control set from each sample, and then define a cutoff at, say, two-fold. If there is no control population, we have obtained good results by calling the first decile hypomethylated and the tenth decile hypermethylated.
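Both binarization strategies can be sketched for a single probe as follows. Several details are assumptions made for the example rather than statements about CNAmet: the function names, the use of log2-scale values (so that a two-fold cutoff corresponds to a difference of 1), and the choice to compute deciles per probe across samples.

```python
from statistics import median, quantiles


def call_against_controls(sample_values, control_values, cutoff=1.0):
    """Call one probe using the median of the control set as baseline.

    Assumption: values are on a log2 scale, so cutoff=1.0 means two-fold.
    """
    baseline = median(control_values)
    hypo = [1 if v - baseline <= -cutoff else 0 for v in sample_values]
    hyper = [1 if v - baseline >= cutoff else 0 for v in sample_values]
    return hypo, hyper


def call_by_deciles(sample_values):
    """Call one probe without controls: lowest decile is hypomethylated,
    highest decile is hypermethylated.

    Assumption: deciles are computed across samples for this probe.
    """
    cuts = quantiles(sample_values, n=10, method="inclusive")
    low, high = cuts[0], cuts[-1]
    hypo = [1 if v <= low else 0 for v in sample_values]
    hyper = [1 if v >= high else 0 for v in sample_values]
    return hypo, hyper


# Hypothetical example: 20 samples measured at one probe.
hypo, hyper = call_by_deciles([float(i) for i in range(1, 21)])
print(sum(hypo), sum(hyper))
```

With tied values, the decile-based calls can mark more than one tenth of the samples, because ties at the cut point are all included.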

Instead of Beta values, we recommend quantifying methylation with M values, where M = log2[Beta / (1 - Beta)]. M values are more normally distributed and therefore allow a wider range of statistical methods to be used in the analysis.
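The conversion can be written directly from the formula above. In this sketch the function names and the small offset eps are assumptions: the offset only guards against Beta values of exactly 0 or 1, for which M would be infinite.

```python
import math


def beta_to_m(beta, eps=1e-6):
    """Convert a Beta value in [0, 1] to an M value: log2(Beta / (1 - Beta)).

    Assumption: Beta is clamped to [eps, 1 - eps] to avoid division by
    zero and log2(0).
    """
    clamped = min(max(beta, eps), 1.0 - eps)
    return math.log2(clamped / (1.0 - clamped))


def m_to_beta(m):
    """Inverse transform: Beta = 2^M / (1 + 2^M)."""
    return 2.0 ** m / (1.0 + 2.0 ** m)


# Beta = 0.5 (half methylated) maps to M = 0.
print(beta_to_m(0.5))
```

Beta values above 0.5 give positive M values and Beta values below 0.5 give negative ones, so M is symmetric around the half-methylated state.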

Web site copyright Systems Biology Laboratory, University of Helsinki.
CNAmet copyright © 2010-2013 Riku Louhimo.
Page last updated 22 October 2012