PennCNV-ExomeSeq

PennCNV Copy Number Variant Call Detection in Exome Sequencing Software

Decompress command: tar -zxvf PennCNV-ExomeSeq_5.tgz

Linux bash environment is best, but on Windows Cygwin will also work.

Having trouble? Bugs? Email Joseph Glessner: $Description: C:\Documents and Settings\joe.CHOP\Local Settings\Temporary Internet Files\Content.Word\New Picture (40).bmp$

Read and Cite the paper: Glessner J.T. et al. Penncnv-Exomeseq: Genotype Improves Copy Number Variant Detection in Exome Sequencing. Austin journal of proteomics bioinformatics & genomics. 2015;2(1).

Software Integrated by PennCNV-ExomeSeq:

GATK UnifiedGenotyper OR GATK HaplotypeCaller

XHMM

PennCNV

ParseCNV

Overview

Format the GATK VCF (BAF) and the XHMM RD (LRR) for PennCNV.
The sequencing BAF is normalized with clustering and high standard deviation metric probe exclusion.
PennCNV-ExomeSeq delivers an average of 36 calls per sample whereas XHMM delivers 10 with 2 overlapping.
Even after normalization, the sequencing derived BAF and LRR are noisier (higher SD) than the SNP array distribution.
The BAF can also be used to annotate XHMM calls to filter out false-positives.

Raw depth from whole genome sequencing also supported through ClusterDepthToLRR_2.pl.

$Description: P:\Joe\ParseCNVDataModeling\BafLrrReference.png$

Protocol

Now the steps are all contained in a single Pipeline script!

perl PennCNV-ExomeSeq_Pipeline.pl ugVcfs.txt data/DATA.PCA_normalized.filtered.sample_zscores.RD_TEST.txt

Assuming one sample GATK unifiedgenotyper .vcfs (if multi sample do: grep -v ^## group1.DATA.Geno > group1.DATA.Geno.Clean; perl kcolumn.pl group1.DATA.Geno.Clean split 1 -head 9 --out test --name_by_header)

creates .het and .hom SNP CHR POS raw BAF files from GATK Unified Genotyper single sample VCFs

perl VcfToBaf.pl ugVcfs.txt

creates LRR scan_region definition files from XHMM ZPCARD output file

perl RDToLrr.pl DATA.PCA_normalized.filtered.sample_zscores.RD.txt

cat all het baf files and sort for compliePfb_catSortBafLrrs

cat 1-04316-01.ug.vcf.baf.het 1-04316-02.ug.vcf.baf.het 1-04316.ug.vcf.baf.het | sort -k 3,3 -n | sort -k 2,2 -n -s > AllBafsExpVal05_PosChrSorted.txt

cat all hom baf files and sort for compliePfb_catSortBafLrrs

cat 1-04316-01.ug.vcf.baf.hom 1-04316-02.ug.vcf.baf.hom 1-04316.ug.vcf.baf.hom | sort -k 3,3 -n | sort -k 2,2 -n -s > AllBafsExpVal1_PosChrSorted.txt

compliePfb and note SD and N for hets

perl compilePfb_catSortBafLrrs.pl AllBafsExpVal05_PosChrSorted.txt > AllBafsExpVal05_PosChrSorted_AvgSdN.txt

compliePfb and note SD and N for homs

perl compilePfb_catSortBafLrrs.pl AllBafsExpVal1_PosChrSorted.txt > AllBafsExpVal1_PosChrSorted_AvgSdN.txt

Normalize Baf To 0.5

perl NormalizeBafTo05.pl AllBafsExpVal05_PosChrSorted_AvgSdN.txt 1-04316-01.ug.vcf.baf.het > 1-04316-01.ug.vcf.baf.het.norm

perl NormalizeBafTo05.pl AllBafsExpVal05_PosChrSorted_AvgSdN.txt 1-04316-02.ug.vcf.baf.het > 1-04316-02.ug.vcf.baf.het.norm

perl NormalizeBafTo05.pl AllBafsExpVal05_PosChrSorted_AvgSdN.txt 1-04316.ug.vcf.baf.het > 1-04316.ug.vcf.baf.het.norm

Normalize Baf To 1

perl NormalizeBafTo1.pl AllBafsExpVal1_PosChrSorted_AvgSdN.txt 1-04316-01.ug.vcf.baf.hom > 1-04316-01.ug.vcf.baf.hom.norm

perl NormalizeBafTo1.pl AllBafsExpVal1_PosChrSorted_AvgSdN.txt 1-04316-02.ug.vcf.baf.hom > 1-04316-02.ug.vcf.baf.hom.norm

perl NormalizeBafTo1.pl AllBafsExpVal1_PosChrSorted_AvgSdN.txt 1-04316.ug.vcf.baf.hom > 1-04316.ug.vcf.baf.hom.norm

cat Bafs, remove GL

cat 1-04316-01.ug.vcf.baf.het.norm 1-04316-01.ug.vcf.baf.hom.norm | grep -v chrGL > 1-04316-01.ug.vcf.baf

cat 1-04316-02.ug.vcf.baf.het.norm 1-04316-02.ug.vcf.baf.hom.norm | grep -v chrGL > 1-04316-02.ug.vcf.baf

cat 1-04316.ug.vcf.baf.het.norm 1-04316.ug.vcf.baf.hom.norm | grep -v chrGL > 1-04316.ug.vcf.baf

cat het hom Bafs and sort for compliePfb_catSortBafLrrs

cat 1-04316-01.ug.vcf.baf 1-04316-02.ug.vcf.baf 1-04316.ug.vcf.baf | sort -k 3,3 -n | sort -k 2,2 -n -s > AllBafsExpVal051_PosChrSorted.txt

compliePfb to use with PennCNV using het hom Bafs

perl compilePfb_catSortBafLrrs.pl AllBafsExpVal051_PosChrSorted.txt | awk ‘{print $1”\t”$2”\t”$3”\t”$4}’ > AllBafsExpVal051_PosChrSorted.pfb

Filter het PFB for low SD and high N probes (adjust N based on sample size and distribution)

perl PfbAvgSdN_Filter.pl AllBafsExpVal05_PosChrSorted_AvgSdN.txt 0.1 2 > AllBafsExpVal05_PosChrSorted_AvgSdN_LowSdHighCount.txt

Filter PFB to use with PennCNV for low SD and high N probes

perl FilterBafLrrsForLowSD.pl AllBafsExpVal05_PosChrSorted_AvgSdN_LowSdHighCount.txt AllBafsExpVal051_PosChrSorted.pfb > AllBafsExpVal051_PosChrSorted_LowSdHighCount.pfb

Download GC file if needed

wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/gc5Base.txt.gz; gunzip gc5Base.txt.gz; sort -k 2,2 -k 3,3n < gc5Base.txt > gc5Base_hg19_sorted.txt

make .gcmodel

perl cal_gc_snp.pl gc5Base_hg19_sorted.txt AllBafsExpVal051_PosChrSorted_LowSdHighCount.pfb -output AllBafsExpVal051_PosChrSorted_LowSdHighCount.gcmodel

make .baflrr

perl scan_region.pl 1-04316.ug.vcf.baf 1-04316.lrr -knowngene -expandmax 5m > 1-04316.baflrr; sed -i 1i"Name\tChr\tPosition\tsample.B Allele Freq\tsample.Log R Ratio\tDistance" 1-04316.baflrr

perl scan_region.pl 1-04316-01.ug.vcf.baf 1-04316-01.lrr -knowngene -expandmax 5m > 1-04316-01.baflrr; sed -i 1i"Name\tChr\tPosition\tsample.B Allele Freq\tsample.Log R Ratio\tDistance" 1-04316-01.baflrr

perl scan_region.pl 1-04316-02.ug.vcf.baf 1-04316-02.lrr -knowngene -expandmax 5m > 1-04316-02.baflrr; sed -i 1i"Name\tChr\tPosition\tsample.B Allele Freq\tsample.Log R Ratio\tDistance" 1-04316-02.baflrr

run PennCNV

perl detect_cnv.pl -test -hmm hhall.hmm -pfb AllBafsExpVal051_PosChrSorted_LowSdHighCount.pfb -gcmodel AllBafsExpVal05And1_PosChrSorted_LowSdHighCount.gcmodel --listfile NonZeroBafLrrs.txt --conf -output PCGCExomeSeqNorm_LowSdHighCount.rawcnv -logfile PCGCExomeSeqNorm_LowSdHighCount.log

Evaluate Output

Optional for visualize_cnv.pl on clean set of probes

FilterBafLrrsForLowSD.pl

visualize_cnv

perl visualize_cnv.pl -format plot -signal 1-00140.baflrr LargeXHMMCalls.rawcnv

compare PennCNV-ExomeSeq with XHMM calls or other calls

ParseCNV TestConcordance.pl

make GT scan_region definition to annotate XHMM with number of homs and hets in CNV calls

VcfToGT_ForScanRegion.pl

make per sample rawcnv for scan_region query

SplitRawcnvBySample.pl

annotate GTs in XHMM rawcnv

perl scan_region.pl 1-01513-02_xhmm.rawcnv 1-01513-02.ug.vcf.gt -knowngene -expandmax 5m > 1-01513-02_xhmm.gt.rawcnv

optional filter for baf lrr exact overlap matches

perl CountGTsInRawcnvScanRegion.pl 1-01513-02_xhmm.gt.rawcnv >> batch1-6_DATA.vcf.SortSIDs.CountHomHets.rawcnv

grep -P "\t0$|Name" 1-04316.baflrr

XHMM DATA.PCA_normalized.filtered.sample_zscores.RD.txt

Matrix	chr1:69091-70008	chr1:861308-861407	chr1:865535-865716	chr1:866395-866494
1-00015	-2.16424	-0.50958	0.163431	-0.00344
1-00030-01	0.10668	-0.29709	-0.41101	-0.11729
1-00030-02	0.096985	0.119738	-0.42443	-0.04269
1-00030	1.504317	0.039196	0.414124	0.109203
1-00034-01	2.575126	-0.46299	0.501768	-1.79689
1-00034-02	-2.99311	0.767077	-0.18072	-0.93344
1-00034	0.539372	1.384102	-0.63583	1.72244
1-00041-01	-2.51376	-0.27845	-0.78048	1.245275
1-00041-02	0.26529	0.025955	0.049794	-0.89568

RDToLRR.pl -> 1-00015.lrr

-2.16424	chr1	+	69091	70008
-0.50958	chr1	+	861308	861407
0.163431	chr1	+	865535	865716
-0.00344	chr1	+	866395	866494

GATK UnifiedGenotyper VCF

#CHROM

POS

REF

ALT

QUAL

FILTER

INFO

FORMAT

HG00113

HG00114

HG00116

HG00117

16448899

89.95

AC=2;AF=0.100;AN=20;BaseQRankSum=-6.004;DP=2233;Dels=0.00;FS=0.000;HaplotypeScore=0.8069;InbreedingCoeff=-0.1091;MLEAC=2;MLEAF=0.100;MQ=5.67;MQ0=1899;MQRankSum=-2.342;QD=0.18;ReadPosRankSum=0.320

GT:AD:DP:GQ:PL

0/0:135,0:135:39:0,39,314

0/0:249,0:249:69:0,69,542

0/0:250,0:250:36:0,36,293

0/0:249,0:249:18:0,18,142

16449050

11839.64

AC=8;AF=0.400;AN=20;BaseQRankSum=-16.159;DP=2450;Dels=0.00;FS=2.561;HaplotypeScore=3.3000;InbreedingCoeff=-0.2500;MLEAC=8;MLEAF=0.400;MQ=17.86;MQ0=867;MQRankSum=0.889;QD=6.96;ReadPosRankSum=1.779

GT:AD:DP:GQ:PL

0/1:152,96:237:99:1569,0,1267

0/1:138,112:238:99:1932,0,1857

1/1:63,186:238:99:3705,429,0

0/0:249,1:249:99:0,387,3504

17072483

45337.58

AC=19;AF=0.950;AN=20;BaseQRankSum=-6.181;DP=1970;Dels=0.00;FS=17.920;HaplotypeScore=3.5190;InbreedingCoeff=-0.0526;MLEAC=19;MLEAF=0.950;MQ=38.38;MQ0=160;MQRankSum=-12.168;QD=23.01;ReadPosRankSum=2.653

GT:AD:DP:GQ:PL

1/1:0,249:249:99:6064,595,0

1/1:0,142:142:99:4243,364,0

0/1:129,117:234:99:2280,0,3330

1/1:0,183:183:99:4462,415,0

17072621

3408.58

AC=1;AF=0.050;AN=20;BaseQRankSum=-5.234;DP=1814;Dels=0.00;FS=5.716;HaplotypeScore=4.6771;InbreedingCoeff=-0.0526;MLEAC=1;MLEAF=0.050;MQ=53.06;MQ0=1;MQRankSum=5.820;QD=13.91;ReadPosRankSum=-1.783

GT:AD:DP:GQ:PL

0/0:250,0:250:99:0,731,8779

0/0:147,0:147:99:0,427,5744

0/0:206,0:206:99:0,590,7615

0/0:142,1:142:99:0,406,5264

17073066

11622.58

AC=19;AF=0.950;AN=20;BaseQRankSum=-2.450;DP=426;Dels=0.00;FS=4.926;HaplotypeScore=2.4288;InbreedingCoeff=-0.0526;MLEAC=19;MLEAF=0.950;MQ=54.80;MQ0=2;MQRankSum=-0.589;QD=27.28;ReadPosRankSum=0.149

GT:AD:DP:GQ:PL

1/1:1,81:81:99:2292,228,0

1/1:0,43:43:99:1393,114,0

0/1:19,23:41:99:557,0,446

1/1:0,28:28:72:785,72,0

17579761

629.58

AC=1;AF=0.050;AN=20;BaseQRankSum=2.742;DP=750;Dels=0.00;FS=2.102;HaplotypeScore=2.2944;InbreedingCoeff=-0.0526;MLEAC=1;MLEAF=0.050;MQ=59.13;MQ0=0;MQRankSum=-0.597;QD=12.85;ReadPosRankSum=0.246

GT:AD:DP:GQ:PL

0/0:147,0:147:99:0,421,5140

0/0:48,0:48:99:0,132,1802

0/0:63,0:63:99:0,169,2196

0/0:37,0:37:99:0,102,1367

17589209

2206.75

AC=6;AF=0.300;AN=20;BaseQRankSum=-6.530;DP=313;Dels=0.00;FS=2.652;HaplotypeScore=1.5979;InbreedingCoeff=-0.4286;MLEAC=6;MLEAF=0.300;MQ=58.12;MQ0=0;MQRankSum=0.147;QD=12.61;ReadPosRankSum=-0.302

GT:AD:DP:GQ:PL

0/1:11,11:22:99:293,0,318

0/0:30,0:30:87:0,87,1109

0/0:36,0:36:87:0,87,1098

0/1:22,16:37:99:415,0,676

VcfToBaf.pl -> HG00113.baf

chr22:16448899	22	16448899	0
chr22:16449050	22	16449050	0.405
chr22:17072483	22	17072483	1
chr22:17072621	22	17072621	0
chr22:17073066	22	17073066	1
chr22:17579761	22	17579761	0
chr22:17589209	22	17589209	0.5

PennCNV signal intensity baf/lrr file

Name	Chr	Position	sample.B Allele Freq	sample.Log R Ratio
MitoA10045G	M	10045	0	0.643956
MitoA10551G	M	10551	0	0.313524
MitoA11252G	M	11252	0	0.541707
MitoA11468G	M	11468	0	0.625628
MitoA11813G	M	11813	0	0.54349
MitoA12309G	M	12309	0	0.54341
MitoA13106G	M	13106	1	-0.06665
MitoA13264G	M	13264	0.020319	0.014483
MitoA13781G	M	13781	0	-0.58875

PennCNV rawcnv file

Chromosome Start Stop	NumSNP	Length	CN	SampleID	StartSNP	EndSNP	Confidence
chr18:61308108-61326761	numsnp=9	length=18654	state2,cn=1	1-00015	startsnp=chr18_61308108	endsnp=chr18_61326761	conf=99
chr7:142045523-142104578	numsnp=4	length=59056	state2,cn=1	1-00015	startsnp=chr7_142045523	endsnp=chr7_142104578	conf=91
chr7:57193723-57194425	numsnp=2	length=703	state2,cn=1	1-00015	startsnp=chr7_57193723	endsnp=chr7_57194425	conf=51
chrX:64956657-64959755	numsnp=5	length=3099	state5,cn=3	1-00015	startsnp=chrX_64956657	endsnp=chrX_64959755	conf=47
chr1:161487765-161519634	numsnp=5	length=31870	state5,cn=3	1-00030	startsnp=chr1_161487765	endsnp=chr1_161519634	conf=99
chr18:14784300-14797724	numsnp=7	length=13425	state5,cn=3	1-00030	startsnp=chr18_14784300	endsnp=chr18_14797724	conf=74
chr19:43674219-43698746	numsnp=5	length=24528	state2,cn=1	1-00030	startsnp=chr19_43674219	endsnp=chr19_43698746	conf=99
chr2:38956698-38972352	numsnp=5	length=15655	state5,cn=3	1-00030	startsnp=chr2_38956698	endsnp=chr2_38972352	conf=95
chrX:105875855-105937638	numsnp=12	length=61784	state5,cn=3	1-00030	startsnp=chrX_105875855	endsnp=chrX_105937638	conf=32

Version History

8/4/16 Fifth version posted: VcfToBaf.pl speed optimized and SnpCallRate option added in case vcf filter column too permissive.

2/26/16 Fourth version posted: BAF normalization to 0.5 bug fixed and optimized. Use bedtools if installed to speed up.

7/7/15 Third version posted: Pipeline script runs all the protocol scripts. Handle multi-sample and single-sample VCFs.

10/17/14 Second version posted: Dynamically assign GT and AD column numbers based on FORMAT column

9/17/14 First version posted