1. You can add the PennCNV directory to the PATH environment variable of your operating system, so that all PennCNV scripts can be executed directly by typing the name of the command.
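A minimal sketch of this step for a Bourne-style shell such as bash. The installation directory $HOME/PennCNV is an assumption and should be replaced with the actual location of the PennCNV scripts:

```shell
# Assumed install location; replace with the directory that holds detect_cnv.pl.
PENNCNV_DIR="$HOME/PennCNV"

# Append the directory to PATH for the current shell session.
# Adding this line to ~/.bashrc would make the change persistent.
export PATH="$PATH:$PENNCNV_DIR"

# Check whether the shell can now locate the main script.
command -v detect_cnv.pl || echo "detect_cnv.pl not found in $PENNCNV_DIR"
```

If the script is found, its full path is printed; otherwise the fallback message indicates that the assumed directory is wrong.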
Fig. 3 Plot of LRR and BAF values of two CNV calls. (a) LRR and BAF values of a deletion (CN = 1) are shown in upper and lower panels, respectively. (b) LRR and BAF values of a duplication (CN = 3) are shown in upper and lower panels, respectively. The red dots represent the markers inside the CNV calls
2. If you have problems installing PennCNV on your operating system, the cause is probably an incompatibility between PennCNV’s khmm module and certain Perl installations. To solve this issue, you can use perlbrew to install a different version of Perl (such as 5.14.2); for example, the command “perlbrew install perl-5.14.2 --as perl-5.14.2-PIC -Accflags=-fPIC” installs Perl 5.14.2.
If you are using Windows, we recommend that you first download and install 32-bit Perl 5.8.8 and then use PennCNV directly. In this case, there is no need for compilation because the .dll files for Perl 5.8.8 are already compiled and provided in the PennCNV package.
3. The Penn-Affy workflow can be adapted to other SNP array platforms. For example, Glessner et al. applied the Penn-Affy workflow to the Perlegen 600K platform [36]. The generate_affy_geno_cluster.pl program in the Penn-Affy package requires four input files: a genotype call file, a confidence file that contains the confidence values of the genotype calls, a signal intensity file that contains normalized signal intensities of the A and B alleles, and a location file that contains the genomic locations of markers (e.g., a PFB file, described in Subheading 3.2.3).
For Affymetrix arrays, the first three files can be generated by Affymetrix Power Tools. Users of other platforms can generate the required data values using their platform-specific tools and then reformat the data into the file formats as described above.
The signal intensity values can be transformed into log-scale.
After generating the four input files, users can generate the canonical cluster file using generate_affy_geno_cluster.pl and then generate the LRR and BAF values using normalize_affy_geno_cluster.pl (see Subheading 3.2.1, steps 4 and 5).
4. We can use the following commands to download and unzip the example data set:
mkdir raw_data
cd raw_data
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE15nnn/GSE15826/suppl/GSE15826_RAW.tar
tar xf GSE15826_RAW.tar
gunzip *.gz
5. On a typical modern computer, the command should take less than 1 day for 1000–2000 CEL files. It is very important to check that the APT programs have finished completely before proceeding to the next steps. Check the LOG files to see whether they report success.
6. We need to use at least 500 CEL files to generate a high-quality clustering file. If only a few CEL files are available, users can skip this step and use the default canonical clustering file in the PennCNV-Affy package for the identical array (if available), but in this case the CNV calls may be less reliable. Examples of such clustering files are: hapmap.genocluster for Genome-Wide SNP Array 6.0, agre.genocluster for Genome-Wide SNP Array 5.0, and affy500k.nsp.genocluster/affy500k.sty.genocluster for Mapping 500K Array Set.
7. If the sex of some CEL files is not known, you do not need to include them in the cel_sex_file. The birdseed.report.txt file that was generated in the previous step contains a field named computed_gender. Therefore, we can use the following command to generate the cel_sex_file:
cut -f 1-2 birdseed.report.txt | grep male > cel_sex_file
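Note that the pattern "male" also matches "female", so the command above keeps both sexes and drops every other line. A stricter variant can be sketched with awk, which compares the second column exactly. The toy report below uses made-up values and assumes a tab-delimited file whose second column holds male, female, or unknown; real report files may contain additional header lines and columns:

```shell
# Toy report with made-up values, written to a temporary directory.
tmpdir=$(mktemp -d)
{
    printf 'cel_files\tcomputed_gender\tcall_rate\n'
    printf 'sample1.CEL\tmale\t98.5\n'
    printf 'sample2.CEL\tfemale\t99.1\n'
    printf 'sample3.CEL\tunknown\t91.0\n'
} > "$tmpdir/birdseed.report.txt"

# Keep the first two columns of rows whose second column is exactly male or female.
awk -F '\t' '$2 == "male" || $2 == "female" { print $1 "\t" $2 }' \
    "$tmpdir/birdseed.report.txt" > "$tmpdir/cel_sex_file"

cat "$tmpdir/cel_sex_file"
```

In this toy case the header line and the sample with unknown sex are excluded, leaving two rows.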
8. For some reference genomes, the text-format gc5Base file is not officially provided by UCSC. In this case, we can prepare the gc5Base file by the following steps.
Step 1, download two tools provided by UCSC:
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/faToTwoBit
chmod +x faToTwoBit
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/hgGcPercent
chmod +x hgGcPercent
faToTwoBit and hgGcPercent are binary files precompiled by UCSC and are free for academic, nonprofit, and personal use.
A license may be required for commercial use.
Step 2, convert the reference FASTA file to .2bit file (assuming the reference file is hg38.fa):
./faToTwoBit hg38.fa hg38.2bit
Step 3, generate GC content file in Wiggle format:
./hgGcPercent -wigOut -doGaps -file=stdout -win=5120 hg38 hg38.2bit > hg38.gc.wig
Step 4, generate gc5Base.txt file using the script provided in PennCNV/gc_file directory:
PennCNV/gc_file/wig2gc5base hg38.gc.wig > hg38.gc5Base.txt
9. By default, only autosomal CNVs are detected; the -chrx argument can be used to generate CNV calls on (and only on) chromosome X. CNV calling for chrX is slightly different from that for autosomes. It is highly recommended to use the -sexfile argument to supply gender annotation for all genotyped samples. The sexfile is a two-column file, with the first column being signal file names and the second column being either “male” or “female”. Table 10 shows an example of a sexfile.
perl ../detect_cnv.pl -test -hmm example.hmm -pfb example.pfb -log example.rawcnv.log -out example.rawcnv -list inputlist -chrx -sexfile sexfile.txt
If the sex of a sample is not provided in the sexfile, or if -sexfile is not specified, PennCNV will try to predict the gender of the sample based on the BAF heterozygosity rate of chrX markers, but such predictions may not be reliable for some arrays. Next, PennCNV will adjust the LRR values such that females have a median LRR of 0 and males have a median LRR at the same value as that for CN = 1 in the HMM file. After this step, CNV calling is applied in a similar way as for autosomes.
10. As of June 2008, the -medianadjust argument is turned on by default in the program to reduce false positive duplication calls for problematic samples. The effect is that the BAF_median measure for all samples is automatically adjusted to be 0.5.
Users can turn off the argument by specifying -nomedianadjust. This is important when calling CNVs on a signal intensity file that contains data only on a specific genomic region rather than a whole genome.
11. If we have multiple trio families, we can generate a listfile, which contains three file names per line (i.e., one family per line), to process multiple trios simultaneously. It is important that the signal intensity file names in the command line (or in the listfile) are identical to the file names listed in the fifth column of the CNV call file (e.g., example.rawcnv) so that PennCNV can recognize the correct signal intensity file of each call.
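The listfile and the file name check can be sketched as follows. The family identifiers, the file naming scheme, the tab delimiter, and the toy CNV calls are all assumptions made for illustration; only the position of the file name in the fifth column of the call file comes from the description above:

```shell
tmpdir=$(mktemp -d)

# Hypothetical family IDs and naming scheme (<family>.<member>.txt);
# one trio per line, tab-separated (the delimiter is an assumption).
for fam in fam1 fam2; do
    printf '%s.father.txt\t%s.mother.txt\t%s.offspring.txt\n' "$fam" "$fam" "$fam"
done > "$tmpdir/listfile"

# Toy CNV call file with made-up calls; the fifth column is the file name.
cat > "$tmpdir/example.rawcnv" <<'EOF'
chr1:100-200 numsnp=5 length=101 state2,cn=1 fam1.father.txt startsnp=rs1 endsnp=rs5
chr2:300-900 numsnp=9 length=601 state5,cn=3 fam2.offspring.txt startsnp=rs7 endsnp=rs15
EOF

# Collect the file names used in the call file and in the listfile.
awk '{ print $5 }' "$tmpdir/example.rawcnv" | sort -u > "$tmpdir/called_names"
tr '\t' '\n' < "$tmpdir/listfile" | sort -u > "$tmpdir/listed_names"

# Names that appear in the call file but not in the listfile indicate a mismatch.
comm -23 "$tmpdir/called_names" "$tmpdir/listed_names" > "$tmpdir/unmatched_names"
cat "$tmpdir/unmatched_names"
```

An empty unmatched_names file means that every call in the call file can be linked to a signal intensity file named in the listfile.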
12. If the family has two children, then the -quartet argument can be used for CNV calling. Accordingly, four file names should be supplied in the command line, or given in each line of the list file, representing father, mother, child 1, and child 2, respectively. PennCNV cannot generate calls on a pair of parents and three or more children; instead, the user needs to split the family into trios and quartets for CNV calling, and then combine the CNV calls together into consensus calls.
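One possible way to split a larger family is sketched below: children are paired into quartets in the order given, and a leftover child forms a trio. The file names, the output list names, and the pairing order are arbitrary assumptions; the resulting calls still need to be combined into consensus calls afterwards:

```shell
father=father.txt
mother=mother.txt
# Hypothetical signal intensity files of three children.
set -- child1.txt child2.txt child3.txt

tmpdir=$(mktemp -d)
: > "$tmpdir/quartet_list"
: > "$tmpdir/trio_list"

# Pair consecutive children into quartets (father, mother, child 1, child 2).
while [ "$#" -ge 2 ]; do
    printf '%s\t%s\t%s\t%s\n' "$father" "$mother" "$1" "$2" >> "$tmpdir/quartet_list"
    shift 2
done

# A single remaining child is analyzed as a trio (father, mother, child).
if [ "$#" -eq 1 ]; then
    printf '%s\t%s\t%s\n' "$father" "$mother" "$1" >> "$tmpdir/trio_list"
fi

cat "$tmpdir/quartet_list" "$tmpdir/trio_list"
```

For the three children assumed here, this yields one quartet line (child1.txt and child2.txt) and one trio line (child3.txt).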
Table 10
An example of sexfile
father.txt male
mother.txt female
offspring.txt male
13. The joint CNV calling algorithm only supports trio families. For complex nuclear families, it is better to use the -trio and -quartet operations described in Subheading 3.3.2.
References
1. Feuk L, Carson AR, Scherer SW (2006) Structural variation in the human genome. Nat Rev Genet 7(2):85–97. https://doi.org/10.1038/nrg1767
2. Zarrei M, MacDonald JR, Merico D et al (2015) A copy number variation map of the human genome. Nat Rev Genet 16(3):172–183. https://doi.org/10.1038/nrg3871
3. Sudmant PH, Rausch T, Gardner EJ et al (2015) An integrated map of structural variation in 2,504 human genomes. Nature 526(7571):75–81. https://doi.org/10.1038/nature15394
4. Mills RE, Walter K, Stewart C et al (2011) Mapping copy number variation by population-scale genome sequencing. Nature 470(7332):59–65. https://doi.org/10.1038/nature09708
5. Zhang F, Gu W, Hurles ME et al (2009) Copy number variation in human health, disease, and evolution. Annu Rev Genomics Hum Genet 10:451–481. https://doi.org/10.1146/annurev.genom.9.081307.164217
6. Girirajan S, Campbell CD, Eichler EE (2011) Human copy number variation and complex genetic disease. Annu Rev Genet 45:203–226. https://doi.org/10.1146/annurev-genet-102209-163544
7. Weischenfeldt J, Symmons O, Spitz F et al (2013) Phenotypic impact of genomic structural variation: insights from and for human disease. Nat Rev Genet 14(2):125–138. https://doi.org/10.1038/nrg3373
8. Watson CT, Marques-Bonet T, Sharp AJ et al (2014) The genetics of microdeletion and microduplication syndromes: an update. Annu Rev Genomics Hum Genet 15:215–244. https://doi.org/10.1146/annurev-genom-091212-153408
9. Zack TI, Schumacher SE, Carter SL et al (2013) Pan-cancer patterns of somatic copy number alteration. Nat Genet 45(10):1134–1140. https://doi.org/10.1038/ng.2760
10. Beroukhim R, Mermel CH, Porter D et al (2010) The landscape of somatic copy-number alteration across human cancers. Nature 463(7283):899–905. https://doi.org/10.1038/nature08822
11. Carter NP (2007) Methods and strategies for analyzing copy number variation using DNA microarrays. Nat Genet 39(7 Suppl):S16–S21. https://doi.org/10.1038/ng2028
12. Pinto D, Darvishi K, Shi X et al (2011) Comprehensive assessment of array-based platforms and calling algorithms for detection of copy number variants. Nat Biotechnol 29(6):512–520. https://doi.org/10.1038/nbt.1852
13. Venkatraman ES, Olshen AB (2007) A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics 23(6):657–663. https://doi.org/10.1093/bioinformatics/btl646
14. Olshen AB, Venkatraman ES, Lucito R et al (2004) Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5(4):557–572. https://doi.org/10.1093/biostatistics/kxh008
15. Price TS, Regan R, Mott R et al (2005) SW-ARRAY: a dynamic programming solution for the identification of copy-number changes in genomic DNA using array comparative genome hybridization data. Nucleic Acids Res 33(11):3455–3464. https://doi.org/10.1093/nar/gki643
16. Cooper GM, Zerr T, Kidd JM et al (2008) Systematic assessment of copy number variant detection via genome-wide SNP genotyping. Nat Genet 40(10):1199–1203. https://doi.org/10.1038/ng.236
17. Peiffer DA, Le JM, Steemers FJ et al (2006) High-resolution genomic profiling of chromosomal aberrations using Infinium whole-genome genotyping. Genome Res 16(9):1136–1148. https://doi.org/10.1101/gr.5402306
18. Wang K, Li M, Hadley D et al (2007) PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res 17(11):1665–1674. https://doi.org/10.1101/gr.6861907
19. Colella S, Yau C, Taylor JM et al (2007) QuantiSNP: an Objective Bayes Hidden-Markov Model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Res 35(6):2013–2025. https://doi.org/10.1093/nar/gkm076
20. Zhang X, Du R, Li S et al (2014) Evaluation of copy number variation detection for a SNP array platform. BMC Bioinformatics 15:50. https://doi.org/10.1186/1471-2105-15-50
21. Marenne G, Rodriguez-Santiago B, Closas MG et al (2011) Assessment of copy number variation using the Illumina Infinium 1M SNP-array: a comparison of methodological approaches in the Spanish Bladder Cancer/EPICURO study. Hum Mutat 32(2):240–248. https://doi.org/10.1002/humu.21398
22. Dellinger AE, Saw SM, Goh LK et al (2010) Comparative analyses of seven algorithms for copy number variant identification from single nucleotide polymorphism arrays. Nucleic Acids Res 38(9):e105. https://doi.org/10.1093/nar/gkq040
23. Sanders SJ, He X, Willsey AJ et al (2015) Insights into autism spectrum disorder genomic architecture and biology from 71 risk loci. Neuron 87(6):1215–1233. https://doi.org/10.1016/j.neuron.2015.09.016
24. Huang AY, Yu D, Davis LK et al (2017) Rare copy number variants in NRXN1 and CNTN6 increase risk for tourette syndrome. Neuron 94(6):1101–1111 e1107. https://doi.org/10.1016/j.neuron.2017.06.010
25. Marshall CR, Howrigan DP, Merico D et al (2017) Contribution of copy number variants to schizophrenia from a genome-wide study of 41,321 subjects. Nat Genet 49(1):27–35. https://doi.org/10.1038/ng.3725
26. Elia J, Glessner JT, Wang K et al (2011) Genome-wide copy number variation study associates metabotropic glutamate receptor gene networks with attention deficit hyperactivity disorder. Nat Genet 44(1):78–84. https://doi.org/10.1038/ng.1013
27. Green EK, Rees E, Walters JT et al (2016) Copy number variation in bipolar disorder. Mol Psychiatry 21(1):89–93. https://doi.org/10.1038/mp.2014.174
28. Rucker JJ, Tansey KE, Rivera M et al (2016) Phenotypic association analyses with copy number variation in recurrent depressive disorder. Biol Psychiatry 79(4):329–336. https://doi.org/10.1016/j.biopsych.2015.02.025
29. Glessner JT, Li J, Hakonarson H (2013) ParseCNV integrative copy number variation association software with quality tracking. Nucleic Acids Res 41(5):e64. https://doi.org/10.1093/nar/gks1346
30. McCarroll SA, Kuruvilla FG, Korn JM et al (2008) Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat Genet 40(10):1166–1174. https://doi.org/10.1038/ng.238
31. Korn JM, Kuruvilla FG, McCarroll SA et al (2008) Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat Genet 40(10):1253–1260. https://doi.org/10.1038/ng.237
32. Staaf J, Vallon-Christersson J, Lindgren D et al (2008) Normalization of Illumina Infinium whole-genome SNP data improves copy number estimates and allelic intensity ratios. BMC Bioinformatics 9:409. https://doi.org/10.1186/1471-2105-9-409
33. Diskin SJ, Li M, Hou C et al (2008) Adjustment of genomic waves in signal intensities from whole-genome SNP genotyping platforms. Nucleic Acids Res 36(19):e126. https://doi.org/10.1093/nar/gkn556
34. Wang K, Chen Z, Tadesse MG et al (2008) Modeling genetic inheritance of copy number variations. Nucleic Acids Res 36(21):e138. https://doi.org/10.1093/nar/gkn641
35. Mace A, Tuke MA, Beckmann JS et al (2016) New quality measure for SNP array based CNV detection. Bioinformatics 32(21):3298–3305. https://doi.org/10.1093/bioinformatics/btw477
36. Glessner JT, Wang K, Sleiman PM et al (2010) Duplication of the SLIT3 locus on 5q35.1 predisposes to major depressive disorder. PLoS One 5(12):e15463. https://doi.org/10.1371/journal.pone.0015463
Derek M. Bickhart (ed.), Copy Number Variants: Methods and Protocols, Methods in Molecular Biology, vol. 1833, https://doi.org/10.1007/978-1-4939-8666-8_2, © Springer Science+Business Media, LLC, part of Springer Nature 2018