PennCNV Copy Number Variant Call Detection in Exome Sequencing Software

Download

Decompress command: tar -zxvf PennCNV-ExomeSeq_5.tgz

Linux bash environment is best, but on Windows Cygwin will also work.

Having trouble? Bugs? Email Joseph Glessner: Description: C:\Documents and Settings\joe.CHOP\Local Settings\Temporary Internet Files\Content.Word\New Picture (40).bmp

 

Read and Cite the paper: Glessner J.T. et al. Penncnv-Exomeseq: Genotype Improves Copy Number Variant Detection in Exome Sequencing. Austin journal of proteomics bioinformatics & genomics. 2015;2(1).

 

Software Integrated by PennCNV-ExomeSeq:

GATK UnifiedGenotyper OR GATK HaplotypeCaller

XHMM

PennCNV

ParseCNV

 

Overview

 

Format the GATK VCF (BAF) and the XHMM RD (LRR) for PennCNV.
The sequencing BAF is normalized with clustering and high standard deviation metric probe exclusion.
PennCNV-ExomeSeq delivers an average of 36 calls per sample whereas XHMM delivers 10 with 2 overlapping.
Even after normalization, the sequencing derived BAF and LRR are noisier (higher SD) than the SNP array distribution.
The BAF can also be used to annotate XHMM calls to filter out false-positives.

Raw depth from whole genome sequencing also supported through ClusterDepthToLRR_2.pl.

 

Description: P:\Joe\ParseCNVDataModeling\BafLrrReference.png

 

Protocol

Now the steps are all contained in a single Pipeline script!

perl PennCNV-ExomeSeq_Pipeline.pl ugVcfs.txt data/DATA.PCA_normalized.filtered.sample_zscores.RD_TEST.txt

 

Assuming one sample GATK unifiedgenotyper .vcfs (if multi sample do: grep -v ^## group1.DATA.Geno > group1.DATA.Geno.Clean; perl kcolumn.pl group1.DATA.Geno.Clean split 1 -head 9 --out test --name_by_header)

creates .het and .hom SNP CHR POS raw BAF files from GATK Unified Genotyper single sample VCFs

perl VcfToBaf.pl ugVcfs.txt

creates LRR scan_region definition files from XHMM ZPCARD output file

perl RDToLrr.pl DATA.PCA_normalized.filtered.sample_zscores.RD.txt

cat all het baf files and sort for compliePfb_catSortBafLrrs

cat 1-04316-01.ug.vcf.baf.het 1-04316-02.ug.vcf.baf.het 1-04316.ug.vcf.baf.het | sort -k 3,3 -n | sort -k 2,2 -n -s > AllBafsExpVal05_PosChrSorted.txt

cat all hom baf files and sort for compliePfb_catSortBafLrrs

cat 1-04316-01.ug.vcf.baf.hom 1-04316-02.ug.vcf.baf.hom 1-04316.ug.vcf.baf.hom | sort -k 3,3 -n | sort -k 2,2 -n -s > AllBafsExpVal1_PosChrSorted.txt

compliePfb and note SD and N for hets

perl compilePfb_catSortBafLrrs.pl AllBafsExpVal05_PosChrSorted.txt > AllBafsExpVal05_PosChrSorted_AvgSdN.txt

compliePfb and note SD and N for homs

perl compilePfb_catSortBafLrrs.pl AllBafsExpVal1_PosChrSorted.txt > AllBafsExpVal1_PosChrSorted_AvgSdN.txt

Normalize Baf To 0.5

perl NormalizeBafTo05.pl AllBafsExpVal05_PosChrSorted_AvgSdN.txt 1-04316-01.ug.vcf.baf.het > 1-04316-01.ug.vcf.baf.het.norm

perl NormalizeBafTo05.pl AllBafsExpVal05_PosChrSorted_AvgSdN.txt 1-04316-02.ug.vcf.baf.het > 1-04316-02.ug.vcf.baf.het.norm

perl NormalizeBafTo05.pl AllBafsExpVal05_PosChrSorted_AvgSdN.txt 1-04316.ug.vcf.baf.het > 1-04316.ug.vcf.baf.het.norm

Normalize Baf To 1

perl NormalizeBafTo1.pl AllBafsExpVal1_PosChrSorted_AvgSdN.txt 1-04316-01.ug.vcf.baf.hom > 1-04316-01.ug.vcf.baf.hom.norm

perl NormalizeBafTo1.pl AllBafsExpVal1_PosChrSorted_AvgSdN.txt 1-04316-02.ug.vcf.baf.hom > 1-04316-02.ug.vcf.baf.hom.norm

perl NormalizeBafTo1.pl AllBafsExpVal1_PosChrSorted_AvgSdN.txt 1-04316.ug.vcf.baf.hom > 1-04316.ug.vcf.baf.hom.norm

cat Bafs, remove GL

cat 1-04316-01.ug.vcf.baf.het.norm 1-04316-01.ug.vcf.baf.hom.norm | grep -v chrGL > 1-04316-01.ug.vcf.baf

cat 1-04316-02.ug.vcf.baf.het.norm 1-04316-02.ug.vcf.baf.hom.norm | grep -v chrGL > 1-04316-02.ug.vcf.baf

cat 1-04316.ug.vcf.baf.het.norm 1-04316.ug.vcf.baf.hom.norm | grep -v chrGL > 1-04316.ug.vcf.baf

cat het hom Bafs and sort for compliePfb_catSortBafLrrs

cat 1-04316-01.ug.vcf.baf 1-04316-02.ug.vcf.baf 1-04316.ug.vcf.baf | sort -k 3,3 -n | sort -k 2,2 -n -s > AllBafsExpVal051_PosChrSorted.txt

compliePfb to use with PennCNV using het hom Bafs

perl compilePfb_catSortBafLrrs.pl AllBafsExpVal051_PosChrSorted.txt | awk Ô{print $1Ó\tÓ$2Ó\tÓ$3Ó\tÓ$4}Õ > AllBafsExpVal051_PosChrSorted.pfb

Filter het PFB for low SD and high N probes (adjust N based on sample size and distribution)

perl PfbAvgSdN_Filter.pl AllBafsExpVal05_PosChrSorted_AvgSdN.txt 0.1 2 > AllBafsExpVal05_PosChrSorted_AvgSdN_LowSdHighCount.txt

Filter PFB to use with PennCNV for low SD and high N probes

perl FilterBafLrrsForLowSD.pl AllBafsExpVal05_PosChrSorted_AvgSdN_LowSdHighCount.txt AllBafsExpVal051_PosChrSorted.pfb > AllBafsExpVal051_PosChrSorted_LowSdHighCount.pfb

Download GC file if needed

wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/gc5Base.txt.gz; gunzip gc5Base.txt.gz; sort -k 2,2 -k 3,3n < gc5Base.txt > gc5Base_hg19_sorted.txt

make .gcmodel

perl cal_gc_snp.pl gc5Base_hg19_sorted.txt AllBafsExpVal051_PosChrSorted_LowSdHighCount.pfb -output AllBafsExpVal051_PosChrSorted_LowSdHighCount.gcmodel

make .baflrr

perl scan_region.pl 1-04316.ug.vcf.baf 1-04316.lrr -knowngene -expandmax 5m > 1-04316.baflrr; sed -i 1i"Name\tChr\tPosition\tsample.B Allele Freq\tsample.Log R Ratio\tDistance" 1-04316.baflrr

perl scan_region.pl 1-04316-01.ug.vcf.baf 1-04316-01.lrr -knowngene -expandmax 5m > 1-04316-01.baflrr; sed -i 1i"Name\tChr\tPosition\tsample.B Allele Freq\tsample.Log R Ratio\tDistance" 1-04316-01.baflrr

perl scan_region.pl 1-04316-02.ug.vcf.baf 1-04316-02.lrr -knowngene -expandmax 5m > 1-04316-02.baflrr; sed -i 1i"Name\tChr\tPosition\tsample.B Allele Freq\tsample.Log R Ratio\tDistance" 1-04316-02.baflrr

run PennCNV

perl detect_cnv.pl -test -hmm hhall.hmm -pfb AllBafsExpVal051_PosChrSorted_LowSdHighCount.pfb -gcmodel AllBafsExpVal05And1_PosChrSorted_LowSdHighCount.gcmodel --listfile NonZeroBafLrrs.txt --conf -output PCGCExomeSeqNorm_LowSdHighCount.rawcnv -logfile PCGCExomeSeqNorm_LowSdHighCount.log

 

Evaluate Output

 

Optional for visualize_cnv.pl on clean set of probes

FilterBafLrrsForLowSD.pl

visualize_cnv

perl visualize_cnv.pl -format plot -signal 1-00140.baflrr  LargeXHMMCalls.rawcnv

compare PennCNV-ExomeSeq with XHMM calls or other calls

ParseCNV TestConcordance.pl

make GT scan_region definition to annotate XHMM with number of homs and hets in CNV calls

VcfToGT_ForScanRegion.pl

make per sample rawcnv for scan_region query

SplitRawcnvBySample.pl

annotate GTs in XHMM rawcnv

perl scan_region.pl 1-01513-02_xhmm.rawcnv 1-01513-02.ug.vcf.gt -knowngene -expandmax 5m > 1-01513-02_xhmm.gt.rawcnv

optional filter for baf lrr exact overlap matches

perl CountGTsInRawcnvScanRegion.pl 1-01513-02_xhmm.gt.rawcnv >> batch1-6_DATA.vcf.SortSIDs.CountHomHets.rawcnv

grep -P "\t0$|Name" 1-04316.baflrr

 

 

XHMM DATA.PCA_normalized.filtered.sample_zscores.RD.txt

Matrix

chr1:69091-70008

chr1:861308-861407

chr1:865535-865716

chr1:866395-866494

1-00015

-2.16424

-0.50958

0.163431

-0.00344

1-00030-01

0.10668

-0.29709

-0.41101

-0.11729

1-00030-02

0.096985

0.119738

-0.42443

-0.04269

1-00030

1.504317

0.039196

0.414124

0.109203

1-00034-01

2.575126

-0.46299

0.501768

-1.79689

1-00034-02

-2.99311

0.767077

-0.18072

-0.93344

1-00034

0.539372

1.384102

-0.63583

1.72244

1-00041-01

-2.51376

-0.27845

-0.78048

1.245275

1-00041-02

0.26529

0.025955

0.049794

-0.89568

 

RDToLRR.pl  -> 1-00015.lrr

-2.16424

chr1

+

69091

70008

-0.50958

chr1

+

861308

861407

0.163431

chr1

+

865535

865716

-0.00344

chr1

+

866395

866494

 

GATK UnifiedGenotyper VCF

#CHROM

POS

ID

REF

ALT

QUAL

FILTER

INFO

FORMAT

HG00113

HG00114

HG00116

HG00117

22

16448899

.

G

A

89.95

.

AC=2;AF=0.100;AN=20;BaseQRankSum=-6.004;DP=2233;Dels=0.00;FS=0.000;HaplotypeScore=0.8069;InbreedingCoeff=-0.1091;MLEAC=2;MLEAF=0.100;MQ=5.67;MQ0=1899;MQRankSum=-2.342;QD=0.18;ReadPosRankSum=0.320

GT:AD:DP:GQ:PL

0/0:135,0:135:39:0,39,314

0/0:249,0:249:69:0,69,542

0/0:250,0:250:36:0,36,293

0/0:249,0:249:18:0,18,142

22

16449050

.

G

A

11839.64

.

AC=8;AF=0.400;AN=20;BaseQRankSum=-16.159;DP=2450;Dels=0.00;FS=2.561;HaplotypeScore=3.3000;InbreedingCoeff=-0.2500;MLEAC=8;MLEAF=0.400;MQ=17.86;MQ0=867;MQRankSum=0.889;QD=6.96;ReadPosRankSum=1.779

GT:AD:DP:GQ:PL

0/1:152,96:237:99:1569,0,1267

0/1:138,112:238:99:1932,0,1857

1/1:63,186:238:99:3705,429,0

0/0:249,1:249:99:0,387,3504

22

17072483

.

A

G

45337.58

.

AC=19;AF=0.950;AN=20;BaseQRankSum=-6.181;DP=1970;Dels=0.00;FS=17.920;HaplotypeScore=3.5190;InbreedingCoeff=-0.0526;MLEAC=19;MLEAF=0.950;MQ=38.38;MQ0=160;MQRankSum=-12.168;QD=23.01;ReadPosRankSum=2.653

GT:AD:DP:GQ:PL

1/1:0,249:249:99:6064,595,0

1/1:0,142:142:99:4243,364,0

0/1:129,117:234:99:2280,0,3330

1/1:0,183:183:99:4462,415,0

22

17072621

.

T

G

3408.58

.

AC=1;AF=0.050;AN=20;BaseQRankSum=-5.234;DP=1814;Dels=0.00;FS=5.716;HaplotypeScore=4.6771;InbreedingCoeff=-0.0526;MLEAC=1;MLEAF=0.050;MQ=53.06;MQ0=1;MQRankSum=5.820;QD=13.91;ReadPosRankSum=-1.783

GT:AD:DP:GQ:PL

0/0:250,0:250:99:0,731,8779

0/0:147,0:147:99:0,427,5744

0/0:206,0:206:99:0,590,7615

0/0:142,1:142:99:0,406,5264

22

17073066

.

A

G

11622.58

.

AC=19;AF=0.950;AN=20;BaseQRankSum=-2.450;DP=426;Dels=0.00;FS=4.926;HaplotypeScore=2.4288;InbreedingCoeff=-0.0526;MLEAC=19;MLEAF=0.950;MQ=54.80;MQ0=2;MQRankSum=-0.589;QD=27.28;ReadPosRankSum=0.149

GT:AD:DP:GQ:PL

1/1:1,81:81:99:2292,228,0

1/1:0,43:43:99:1393,114,0

0/1:19,23:41:99:557,0,446

1/1:0,28:28:72:785,72,0

22

17579761

.

G

A

629.58

.

AC=1;AF=0.050;AN=20;BaseQRankSum=2.742;DP=750;Dels=0.00;FS=2.102;HaplotypeScore=2.2944;InbreedingCoeff=-0.0526;MLEAC=1;MLEAF=0.050;MQ=59.13;MQ0=0;MQRankSum=-0.597;QD=12.85;ReadPosRankSum=0.246

GT:AD:DP:GQ:PL

0/0:147,0:147:99:0,421,5140

0/0:48,0:48:99:0,132,1802

0/0:63,0:63:99:0,169,2196

0/0:37,0:37:99:0,102,1367

22

17589209

.

C

T

2206.75

.

AC=6;AF=0.300;AN=20;BaseQRankSum=-6.530;DP=313;Dels=0.00;FS=2.652;HaplotypeScore=1.5979;InbreedingCoeff=-0.4286;MLEAC=6;MLEAF=0.300;MQ=58.12;MQ0=0;MQRankSum=0.147;QD=12.61;ReadPosRankSum=-0.302

GT:AD:DP:GQ:PL

0/1:11,11:22:99:293,0,318

0/0:30,0:30:87:0,87,1109

0/0:36,0:36:87:0,87,1098

0/1:22,16:37:99:415,0,676

 

VcfToBaf.pl -> HG00113.baf

chr22:16448899

22

16448899

0

chr22:16449050

22

16449050

0.405

chr22:17072483

22

17072483

1

chr22:17072621

22

17072621

0

chr22:17073066

22

17073066

1

chr22:17579761

22

17579761

0

chr22:17589209

22

17589209

0.5

 

PennCNV signal intensity baf/lrr file

Name

Chr

Position

sample.B Allele Freq

sample.Log R Ratio

MitoA10045G

M

10045

0

0.643956

MitoA10551G

M

10551

0

0.313524

MitoA11252G

M

11252

0

0.541707

MitoA11468G

M

11468

0

0.625628

MitoA11813G

M

11813

0

0.54349

MitoA12309G

M

12309

0

0.54341

MitoA13106G

M

13106

1

-0.06665

MitoA13264G

M

13264

0.020319

0.014483

MitoA13781G

M

13781

0

-0.58875

 

PennCNV rawcnv file

Chromosome Start Stop

NumSNP

Length

CN

SampleID

StartSNP

EndSNP

Confidence

chr18:61308108-61326761

numsnp=9

length=18654

state2,cn=1

1-00015

startsnp=chr18_61308108

endsnp=chr18_61326761

conf=99

chr7:142045523-142104578

numsnp=4

length=59056

state2,cn=1

1-00015

startsnp=chr7_142045523

endsnp=chr7_142104578

conf=91

chr7:57193723-57194425

numsnp=2

length=703

state2,cn=1

1-00015

startsnp=chr7_57193723

endsnp=chr7_57194425

conf=51

chrX:64956657-64959755

numsnp=5

length=3099

state5,cn=3

1-00015

startsnp=chrX_64956657

endsnp=chrX_64959755

conf=47

chr1:161487765-161519634

numsnp=5

length=31870

state5,cn=3

1-00030

startsnp=chr1_161487765

endsnp=chr1_161519634

conf=99

chr18:14784300-14797724

numsnp=7

length=13425

state5,cn=3

1-00030

startsnp=chr18_14784300

endsnp=chr18_14797724

conf=74

chr19:43674219-43698746

numsnp=5

length=24528

state2,cn=1

1-00030

startsnp=chr19_43674219

endsnp=chr19_43698746

conf=99

chr2:38956698-38972352

numsnp=5

length=15655

state5,cn=3

1-00030

startsnp=chr2_38956698

endsnp=chr2_38972352

conf=95

chrX:105875855-105937638

numsnp=12

length=61784

state5,cn=3

1-00030

startsnp=chrX_105875855

endsnp=chrX_105937638

conf=32

 

Version History

8/4/16 Fifth version posted: VcfToBaf.pl speed optimized and SnpCallRate option added in case vcf filter column too permissive.

2/26/16 Fourth version posted: BAF normalization to 0.5 bug fixed and optimized. Use bedtools if installed to speed up.

7/7/15 Third version posted: Pipeline script runs all the protocol scripts. Handle multi-sample and single-sample VCFs.

10/17/14 Second version posted: Dynamically assign GT and AD column numbers based on FORMAT column

9/17/14 First version posted