PennCNV Copy Number
Variant Call Detection in Exome Sequencing Software
Decompress command: tar -zxvf PennCNV-ExomeSeq_5.tgz
Linux bash environment
is best, but on Windows Cygwin will also work.
Having trouble? Bugs?
Email Joseph Glessner:
Read
and Cite the paper: Glessner J.T. et al. Penncnv-Exomeseq: Genotype
Improves Copy Number Variant Detection in Exome Sequencing. Austin journal of
proteomics bioinformatics & genomics. 2015;2(1).
Software
Integrated by PennCNV-ExomeSeq:
GATK
UnifiedGenotyper OR GATK
HaplotypeCaller
Overview
Format
the GATK VCF (BAF) and the XHMM RD (LRR) for PennCNV.
The sequencing BAF is normalized with clustering and high standard deviation
metric probe exclusion.
PennCNV-ExomeSeq delivers an average of 36 calls per sample whereas XHMM
delivers 10 with 2 overlapping.
Even after normalization, the sequencing derived BAF and LRR are noisier
(higher SD) than the SNP array distribution.
The BAF can also be used to annotate XHMM calls to filter out false-positives.
Raw
depth from whole genome sequencing also supported through ClusterDepthToLRR_2.pl.
Protocol
Now
the steps are all contained in a single Pipeline script!
perl PennCNV-ExomeSeq_Pipeline.pl
ugVcfs.txt data/DATA.PCA_normalized.filtered.sample_zscores.RD_TEST.txt
Assuming one sample GATK unifiedgenotyper .vcfs (if multi sample do: grep -v ^## group1.DATA.Geno > group1.DATA.Geno.Clean; perl kcolumn.pl group1.DATA.Geno.Clean split 1 -head 9 --out test --name_by_header)
creates .het and .hom SNP CHR
POS raw BAF files from GATK Unified Genotyper single sample VCFs
perl VcfToBaf.pl ugVcfs.txt
creates LRR scan_region
definition files from XHMM ZPCARD output file
perl RDToLrr.pl DATA.PCA_normalized.filtered.sample_zscores.RD.txt
cat all het baf files and sort for
compliePfb_catSortBafLrrs
cat 1-04316-01.ug.vcf.baf.het
1-04316-02.ug.vcf.baf.het 1-04316.ug.vcf.baf.het | sort -k 3,3 -n | sort -k 2,2
-n -s > AllBafsExpVal05_PosChrSorted.txt
cat
all hom baf files and sort for compliePfb_catSortBafLrrs
cat 1-04316-01.ug.vcf.baf.hom
1-04316-02.ug.vcf.baf.hom 1-04316.ug.vcf.baf.hom | sort -k 3,3 -n | sort -k 2,2
-n -s > AllBafsExpVal1_PosChrSorted.txt
compliePfb
and note SD and N for hets
perl compilePfb_catSortBafLrrs.pl AllBafsExpVal05_PosChrSorted.txt >
AllBafsExpVal05_PosChrSorted_AvgSdN.txt
compliePfb and note SD and N for homs
perl compilePfb_catSortBafLrrs.pl AllBafsExpVal1_PosChrSorted.txt > AllBafsExpVal1_PosChrSorted_AvgSdN.txt
Normalize Baf To 0.5
perl NormalizeBafTo05.pl AllBafsExpVal05_PosChrSorted_AvgSdN.txt 1-04316-01.ug.vcf.baf.het
> 1-04316-01.ug.vcf.baf.het.norm
perl NormalizeBafTo05.pl AllBafsExpVal05_PosChrSorted_AvgSdN.txt 1-04316-02.ug.vcf.baf.het > 1-04316-02.ug.vcf.baf.het.norm
perl NormalizeBafTo05.pl AllBafsExpVal05_PosChrSorted_AvgSdN.txt
1-04316.ug.vcf.baf.het > 1-04316.ug.vcf.baf.het.norm
Normalize Baf To 1
perl NormalizeBafTo1.pl AllBafsExpVal1_PosChrSorted_AvgSdN.txt 1-04316-01.ug.vcf.baf.hom
> 1-04316-01.ug.vcf.baf.hom.norm
perl NormalizeBafTo1.pl AllBafsExpVal1_PosChrSorted_AvgSdN.txt 1-04316-02.ug.vcf.baf.hom > 1-04316-02.ug.vcf.baf.hom.norm
perl NormalizeBafTo1.pl AllBafsExpVal1_PosChrSorted_AvgSdN.txt 1-04316.ug.vcf.baf.hom
> 1-04316.ug.vcf.baf.hom.norm
cat Bafs, remove GL
cat 1-04316-01.ug.vcf.baf.het.norm
1-04316-01.ug.vcf.baf.hom.norm | grep -v chrGL > 1-04316-01.ug.vcf.baf
cat 1-04316-02.ug.vcf.baf.het.norm
1-04316-02.ug.vcf.baf.hom.norm | grep -v chrGL > 1-04316-02.ug.vcf.baf
cat 1-04316.ug.vcf.baf.het.norm 1-04316.ug.vcf.baf.hom.norm
| grep -v chrGL > 1-04316.ug.vcf.baf
cat het hom Bafs and sort
for compliePfb_catSortBafLrrs
cat 1-04316-01.ug.vcf.baf 1-04316-02.ug.vcf.baf
1-04316.ug.vcf.baf | sort -k 3,3 -n | sort -k 2,2 -n -s > AllBafsExpVal051_PosChrSorted.txt
compliePfb
to use with PennCNV using het hom Bafs
perl compilePfb_catSortBafLrrs.pl AllBafsExpVal051_PosChrSorted.txt |
awk Ô{print $1Ó\tÓ$2Ó\tÓ$3Ó\tÓ$4}Õ > AllBafsExpVal051_PosChrSorted.pfb
Filter het PFB for low SD and high N
probes (adjust N based on sample size and distribution)
perl PfbAvgSdN_Filter.pl AllBafsExpVal05_PosChrSorted_AvgSdN.txt 0.1 2 > AllBafsExpVal05_PosChrSorted_AvgSdN_LowSdHighCount.txt
Filter PFB to use with PennCNV for low
SD and high N probes
perl FilterBafLrrsForLowSD.pl AllBafsExpVal05_PosChrSorted_AvgSdN_LowSdHighCount.txt
AllBafsExpVal051_PosChrSorted.pfb
> AllBafsExpVal051_PosChrSorted_LowSdHighCount.pfb
Download GC file if needed
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/gc5Base.txt.gz; gunzip gc5Base.txt.gz; sort -k 2,2 -k 3,3n < gc5Base.txt > gc5Base_hg19_sorted.txt
make .gcmodel
perl cal_gc_snp.pl gc5Base_hg19_sorted.txt AllBafsExpVal051_PosChrSorted_LowSdHighCount.pfb -output AllBafsExpVal051_PosChrSorted_LowSdHighCount.gcmodel
make .baflrr
perl scan_region.pl 1-04316.ug.vcf.baf 1-04316.lrr
-knowngene
-expandmax 5m > 1-04316.baflrr; sed -i
1i"Name\tChr\tPosition\tsample.B Allele Freq\tsample.Log R
Ratio\tDistance" 1-04316.baflrr
perl scan_region.pl 1-04316-01.ug.vcf.baf 1-04316-01.lrr
-knowngene
-expandmax 5m > 1-04316-01.baflrr; sed -i
1i"Name\tChr\tPosition\tsample.B Allele Freq\tsample.Log R
Ratio\tDistance" 1-04316-01.baflrr
perl scan_region.pl 1-04316-02.ug.vcf.baf 1-04316-02.lrr
-knowngene
-expandmax 5m > 1-04316-02.baflrr; sed -i
1i"Name\tChr\tPosition\tsample.B Allele Freq\tsample.Log R
Ratio\tDistance" 1-04316-02.baflrr
run PennCNV
perl
detect_cnv.pl -test -hmm hhall.hmm -pfb AllBafsExpVal051_PosChrSorted_LowSdHighCount.pfb
-gcmodel AllBafsExpVal05And1_PosChrSorted_LowSdHighCount.gcmodel
--listfile NonZeroBafLrrs.txt --conf -output PCGCExomeSeqNorm_LowSdHighCount.rawcnv
-logfile PCGCExomeSeqNorm_LowSdHighCount.log
Evaluate Output
Optional for visualize_cnv.pl on
clean set of probes
FilterBafLrrsForLowSD.pl
visualize_cnv
perl visualize_cnv.pl -format plot -signal 1-00140.baflrr LargeXHMMCalls.rawcnv
compare PennCNV-ExomeSeq with
XHMM calls or other calls
ParseCNV
TestConcordance.pl
make GT scan_region definition
to annotate XHMM with number of homs and hets in CNV calls
VcfToGT_ForScanRegion.pl
make per sample rawcnv for
scan_region query
SplitRawcnvBySample.pl
annotate GTs in XHMM rawcnv
perl
scan_region.pl 1-01513-02_xhmm.rawcnv 1-01513-02.ug.vcf.gt -knowngene
-expandmax 5m > 1-01513-02_xhmm.gt.rawcnv
optional filter for baf lrr
exact overlap matches
perl
CountGTsInRawcnvScanRegion.pl 1-01513-02_xhmm.gt.rawcnv >>
batch1-6_DATA.vcf.SortSIDs.CountHomHets.rawcnv
grep -P "\t0$|Name" 1-04316.baflrr
XHMM DATA.PCA_normalized.filtered.sample_zscores.RD.txt
Matrix |
chr1:69091-70008 |
chr1:861308-861407 |
chr1:865535-865716 |
chr1:866395-866494 |
1-00015 |
-2.16424 |
-0.50958 |
0.163431 |
-0.00344 |
1-00030-01 |
0.10668 |
-0.29709 |
-0.41101 |
-0.11729 |
1-00030-02 |
0.096985 |
0.119738 |
-0.42443 |
-0.04269 |
1-00030 |
1.504317 |
0.039196 |
0.414124 |
0.109203 |
1-00034-01 |
2.575126 |
-0.46299 |
0.501768 |
-1.79689 |
1-00034-02 |
-2.99311 |
0.767077 |
-0.18072 |
-0.93344 |
1-00034 |
0.539372 |
1.384102 |
-0.63583 |
1.72244 |
1-00041-01 |
-2.51376 |
-0.27845 |
-0.78048 |
1.245275 |
1-00041-02 |
0.26529 |
0.025955 |
0.049794 |
-0.89568 |
RDToLRR.pl -> 1-00015.lrr
-2.16424 |
chr1 |
+ |
69091 |
70008 |
-0.50958 |
chr1 |
+ |
861308 |
861407 |
0.163431 |
chr1 |
+ |
865535 |
865716 |
-0.00344 |
chr1 |
+ |
866395 |
866494 |
GATK
UnifiedGenotyper VCF
#CHROM |
POS |
ID |
REF |
ALT |
QUAL |
FILTER |
INFO |
FORMAT |
HG00113 |
HG00114 |
HG00116 |
HG00117 |
22 |
16448899 |
. |
G |
A |
89.95 |
. |
AC=2;AF=0.100;AN=20;BaseQRankSum=-6.004;DP=2233;Dels=0.00;FS=0.000;HaplotypeScore=0.8069;InbreedingCoeff=-0.1091;MLEAC=2;MLEAF=0.100;MQ=5.67;MQ0=1899;MQRankSum=-2.342;QD=0.18;ReadPosRankSum=0.320 |
GT:AD:DP:GQ:PL |
0/0:135,0:135:39:0,39,314 |
0/0:249,0:249:69:0,69,542 |
0/0:250,0:250:36:0,36,293 |
0/0:249,0:249:18:0,18,142 |
22 |
16449050 |
. |
G |
A |
11839.64 |
. |
AC=8;AF=0.400;AN=20;BaseQRankSum=-16.159;DP=2450;Dels=0.00;FS=2.561;HaplotypeScore=3.3000;InbreedingCoeff=-0.2500;MLEAC=8;MLEAF=0.400;MQ=17.86;MQ0=867;MQRankSum=0.889;QD=6.96;ReadPosRankSum=1.779 |
GT:AD:DP:GQ:PL |
0/1:152,96:237:99:1569,0,1267 |
0/1:138,112:238:99:1932,0,1857 |
1/1:63,186:238:99:3705,429,0 |
0/0:249,1:249:99:0,387,3504 |
22 |
17072483 |
. |
A |
G |
45337.58 |
. |
AC=19;AF=0.950;AN=20;BaseQRankSum=-6.181;DP=1970;Dels=0.00;FS=17.920;HaplotypeScore=3.5190;InbreedingCoeff=-0.0526;MLEAC=19;MLEAF=0.950;MQ=38.38;MQ0=160;MQRankSum=-12.168;QD=23.01;ReadPosRankSum=2.653 |
GT:AD:DP:GQ:PL |
1/1:0,249:249:99:6064,595,0 |
1/1:0,142:142:99:4243,364,0 |
0/1:129,117:234:99:2280,0,3330 |
1/1:0,183:183:99:4462,415,0 |
22 |
17072621 |
. |
T |
G |
3408.58 |
. |
AC=1;AF=0.050;AN=20;BaseQRankSum=-5.234;DP=1814;Dels=0.00;FS=5.716;HaplotypeScore=4.6771;InbreedingCoeff=-0.0526;MLEAC=1;MLEAF=0.050;MQ=53.06;MQ0=1;MQRankSum=5.820;QD=13.91;ReadPosRankSum=-1.783 |
GT:AD:DP:GQ:PL |
0/0:250,0:250:99:0,731,8779 |
0/0:147,0:147:99:0,427,5744 |
0/0:206,0:206:99:0,590,7615 |
0/0:142,1:142:99:0,406,5264 |
22 |
17073066 |
. |
A |
G |
11622.58 |
. |
AC=19;AF=0.950;AN=20;BaseQRankSum=-2.450;DP=426;Dels=0.00;FS=4.926;HaplotypeScore=2.4288;InbreedingCoeff=-0.0526;MLEAC=19;MLEAF=0.950;MQ=54.80;MQ0=2;MQRankSum=-0.589;QD=27.28;ReadPosRankSum=0.149 |
GT:AD:DP:GQ:PL |
1/1:1,81:81:99:2292,228,0 |
1/1:0,43:43:99:1393,114,0 |
0/1:19,23:41:99:557,0,446 |
1/1:0,28:28:72:785,72,0 |
22 |
17579761 |
. |
G |
A |
629.58 |
. |
AC=1;AF=0.050;AN=20;BaseQRankSum=2.742;DP=750;Dels=0.00;FS=2.102;HaplotypeScore=2.2944;InbreedingCoeff=-0.0526;MLEAC=1;MLEAF=0.050;MQ=59.13;MQ0=0;MQRankSum=-0.597;QD=12.85;ReadPosRankSum=0.246 |
GT:AD:DP:GQ:PL |
0/0:147,0:147:99:0,421,5140 |
0/0:48,0:48:99:0,132,1802 |
0/0:63,0:63:99:0,169,2196 |
0/0:37,0:37:99:0,102,1367 |
22 |
17589209 |
. |
C |
T |
2206.75 |
. |
AC=6;AF=0.300;AN=20;BaseQRankSum=-6.530;DP=313;Dels=0.00;FS=2.652;HaplotypeScore=1.5979;InbreedingCoeff=-0.4286;MLEAC=6;MLEAF=0.300;MQ=58.12;MQ0=0;MQRankSum=0.147;QD=12.61;ReadPosRankSum=-0.302 |
GT:AD:DP:GQ:PL |
0/1:11,11:22:99:293,0,318 |
0/0:30,0:30:87:0,87,1109 |
0/0:36,0:36:87:0,87,1098 |
0/1:22,16:37:99:415,0,676 |
VcfToBaf.pl -> HG00113.baf
chr22:16448899 |
22 |
16448899 |
0 |
chr22:16449050 |
22 |
16449050 |
0.405 |
chr22:17072483 |
22 |
17072483 |
1 |
chr22:17072621 |
22 |
17072621 |
0 |
chr22:17073066 |
22 |
17073066 |
1 |
chr22:17579761 |
22 |
17579761 |
0 |
chr22:17589209 |
22 |
17589209 |
0.5 |
PennCNV signal
intensity baf/lrr file
Name |
Chr |
Position |
sample.B Allele Freq |
sample.Log R Ratio |
MitoA10045G |
M |
10045 |
0 |
0.643956 |
MitoA10551G |
M |
10551 |
0 |
0.313524 |
MitoA11252G |
M |
11252 |
0 |
0.541707 |
MitoA11468G |
M |
11468 |
0 |
0.625628 |
MitoA11813G |
M |
11813 |
0 |
0.54349 |
MitoA12309G |
M |
12309 |
0 |
0.54341 |
MitoA13106G |
M |
13106 |
1 |
-0.06665 |
MitoA13264G |
M |
13264 |
0.020319 |
0.014483 |
MitoA13781G |
M |
13781 |
0 |
-0.58875 |
PennCNV rawcnv
file
Chromosome Start Stop |
NumSNP |
Length |
CN |
SampleID |
StartSNP |
EndSNP |
Confidence |
chr18:61308108-61326761 |
numsnp=9 |
length=18654 |
state2,cn=1 |
1-00015 |
startsnp=chr18_61308108 |
endsnp=chr18_61326761 |
conf=99 |
chr7:142045523-142104578 |
numsnp=4 |
length=59056 |
state2,cn=1 |
1-00015 |
startsnp=chr7_142045523 |
endsnp=chr7_142104578 |
conf=91 |
chr7:57193723-57194425 |
numsnp=2 |
length=703 |
state2,cn=1 |
1-00015 |
startsnp=chr7_57193723 |
endsnp=chr7_57194425 |
conf=51 |
chrX:64956657-64959755 |
numsnp=5 |
length=3099 |
state5,cn=3 |
1-00015 |
startsnp=chrX_64956657 |
endsnp=chrX_64959755 |
conf=47 |
chr1:161487765-161519634 |
numsnp=5 |
length=31870 |
state5,cn=3 |
1-00030 |
startsnp=chr1_161487765 |
endsnp=chr1_161519634 |
conf=99 |
chr18:14784300-14797724 |
numsnp=7 |
length=13425 |
state5,cn=3 |
1-00030 |
startsnp=chr18_14784300 |
endsnp=chr18_14797724 |
conf=74 |
chr19:43674219-43698746 |
numsnp=5 |
length=24528 |
state2,cn=1 |
1-00030 |
startsnp=chr19_43674219 |
endsnp=chr19_43698746 |
conf=99 |
chr2:38956698-38972352 |
numsnp=5 |
length=15655 |
state5,cn=3 |
1-00030 |
startsnp=chr2_38956698 |
endsnp=chr2_38972352 |
conf=95 |
chrX:105875855-105937638 |
numsnp=12 |
length=61784 |
state5,cn=3 |
1-00030 |
startsnp=chrX_105875855 |
endsnp=chrX_105937638 |
conf=32 |
Version History
8/4/16 Fifth version posted: VcfToBaf.pl speed optimized and SnpCallRate option added in case vcf filter column too permissive.
2/26/16 Fourth version posted: BAF normalization to 0.5 bug fixed and optimized. Use bedtools if installed to speed up.
7/7/15 Third version posted: Pipeline script runs all the protocol scripts. Handle multi-sample and single-sample VCFs.
10/17/14 Second version posted: Dynamically assign GT and AD column numbers based on FORMAT column
9/17/14 First version posted