Table of Contents
Imputando con 1000 genome e impute2
La base de datos a usar ya ha pasado el QC asi que me voy a saltar esa parte.
Reorientado SNPs
Primeramente me bajo de aqui: http://www.ncbi.nlm.nih.gov/projects/SNP/snp_viewBatch.cgi?sbid=33767 el Assay correspondiente al chip utilizado.
El archivo es algo asi (mas un encabezado que le he quitado):
ss# loc_snp_id allele samplesize rs# ss2rs_orien chr chr_pos contig_acc contig_pos rs2genome_orien assembly weight ss66315736 SNP_A-1883517 C/T 270 rs41498448 0 1 779743 NT_004350.19 258375 1 GRCh37.p10 3 ss66317030 SNP_A-1886933 A/G 270 rs2980300 1 1 785989 NT_004350.19 264621 0 GRCh37.p10 1 ss66323101 SNP_A-1902458 A/C 270 rs10907175 0 1 1130727 NT_004350.19 609359 0 GRCh37.p10 1 ss66416875 SNP_A-2131660 C/T 270 rs2887286 1 1 1156131 NT_004350.19 634763 1 GRCh37.p10 1 ss66516979 SNP_A-4221087 G/T 270 rs307378 0 1 1268847 NT_004350.19 747479 0 GRCh37.p10 1 ss66316151 SNP_A-1884606 A/G 270 rs7540231 0 1 1506035 NT_004350.19 984667 0 GRCh37.p10 1 ss66458131 SNP_A-2235839 A/G 270 rs2272908 1 1 1721479 NT_004350.19 1200111 0 GRCh37.p10 1 ss66451308 SNP_A-2218153 A/G 270 rs12141314 0 1 1753686 NT_004350.19 1232318 0 GRCh37.p10 1
Primero selecciono los SNPs que estan en reverse (1 en la sexta columna)
$ awk '{if($6==1) print $5}' snpBatch_AFFY_33767 > list_snp_rw_oq1.txt
y los invierto con plink
$ plink --noweb --bfile admurcia --flip list_snp_rw_oq1.txt --make-bed --out admurciav36f @----------------------------------------------------------@ | PLINK! | v1.07 | 10/Aug/2009 | |----------------------------------------------------------| | (C) 2009 Shaun Purcell, GNU General Public License, v2 | |----------------------------------------------------------| | For documentation, citation & bug-report instructions: | | http://pngu.mgh.harvard.edu/purcell/plink/ | @----------------------------------------------------------@ Skipping web check... [ --noweb ] Writing this text to log file [ admurciav36f.log ] Analysis started: Thu Jan 2 13:41:21 2014 Options in effect: --noweb --bfile admurcia --flip list_snp_rw_oq1.txt --make-bed --out admurciav36f Reading map (extended format) from [ admurcia.bim ] 198429 markers to be included from [ admurcia.bim ] Reading pedigree information from [ admurcia.fam ] 1088 individuals read from [ admurcia.fam ] 1088 individuals with nonmissing phenotypes Assuming a disease phenotype (1=unaff, 2=aff, 0=miss) Missing phenotype value is also -9 319 cases, 769 controls and 0 missing 511 males, 577 females, and 0 of unspecified sex Reading genotype bitfile from [ admurcia.bed ] Detected that binary PED file is v1.00 SNP-major mode Reading SNPs to flip strand from [ list_snp_rw_oq1.txt ] Flipped strand of 299983 SNPs Before frequency and genotyping pruning, there are 198429 SNPs 1088 founders and 0 non-founders found 1130 heterozygous haploid genotypes; set to missing Writing list of heterozygous haploid genotypes to [ admurciav36f.hh ] Total genotyping rate in remaining individuals is 0.993951 0 SNPs failed missingness test ( GENO > 1 ) 0 SNPs failed frequency test ( MAF < 0 ) After frequency and genotyping pruning, there are 198429 SNPs After filtering, 319 cases, 769 controls and 0 missing After filtering, 511 males, 577 females, and 0 of unspecified sex Writing pedigree information to [ admurciav36f.fam ] Writing map (extended format) information to [ admurciav36f.bim ] Writing genotype bitfile to [ admurciav36f.bed ] Using (default) SNP-major mode Analysis finished: Thu Jan 2 13:41:31 2014
Ahora voy a convertirlo en .ped y .map
$ plink --noweb --bfile admurciav36f --recode --tab --out admurciav36f @----------------------------------------------------------@ | PLINK! | v1.07 | 10/Aug/2009 | |----------------------------------------------------------| | (C) 2009 Shaun Purcell, GNU General Public License, v2 | |----------------------------------------------------------| | For documentation, citation & bug-report instructions: | | http://pngu.mgh.harvard.edu/purcell/plink/ | @----------------------------------------------------------@ Skipping web check... [ --noweb ] Writing this text to log file [ admurciav36f.log ] Analysis started: Thu Jan 2 13:50:09 2014 Options in effect: --noweb --bfile admurciav36f --recode --tab --out admurciav36f Reading map (extended format) from [ admurciav36f.bim ] 198429 markers to be included from [ admurciav36f.bim ] Reading pedigree information from [ admurciav36f.fam ] 1088 individuals read from [ admurciav36f.fam ] 1088 individuals with nonmissing phenotypes Assuming a disease phenotype (1=unaff, 2=aff, 0=miss) Missing phenotype value is also -9 319 cases, 769 controls and 0 missing 511 males, 577 females, and 0 of unspecified sex Reading genotype bitfile from [ admurciav36f.bed ] Detected that binary PED file is v1.00 SNP-major mode Before frequency and genotyping pruning, there are 198429 SNPs 1088 founders and 0 non-founders found 1130 heterozygous haploid genotypes; set to missing Writing list of heterozygous haploid genotypes to [ admurciav36f.hh ] Total genotyping rate in remaining individuals is 0.993951 0 SNPs failed missingness test ( GENO > 1 ) 0 SNPs failed frequency test ( MAF < 0 ) After frequency and genotyping pruning, there are 198429 SNPs After filtering, 319 cases, 769 controls and 0 missing After filtering, 511 males, 577 females, and 0 of unspecified sex Writing recoded ped file to [ admurciav36f.ped ] Writing new map file to [ admurciav36f.map ] Analysis finished: Thu Jan 2 13:51:24 2014
Construcción 37/hg19
Así que sigo. Ahora hay que hacer el cambio a construccion 37/hg19
Ahora, a partir del .map, creo el siguiente archivo:
#Columna Información #1 # Cromosoma #2 Posición de inicio (Columna3-1) #3 Posición final #4 Nombre del snp (rs...) # Reemplazar chr23 por chrX # chr24 por chrY # chr25 por chrX
$ awk '{print "chr"$1"\t"($4-1)"\t"$4"\t"$2}' admurciav36f.map > v36.txt
y subo v36.txt a http://genome.ucsc.edu/cgi-bin/hgLiftOver
Obtengo algo así:
Successfully converted 194109 records: View Conversions Conversion failed on 4320 records. Display failure file Explain failure messages Failed input regions: #Deleted in new chr3 195567371 195567372 rs4974514 #Deleted in new chr4 32575207 32575208 rs11096892 #Deleted in new chr4 40025321 40025322 rs4512036 #Deleted in new chr4 103677906 103677907 rs230525 #Deleted in new chr4 103951974 103951975 rs223413 #Deleted in new chr4 176359371 176359372 rs1711984 #Deleted in new chr5 17739883 17739884 rs1849834 #Deleted in new chr5 97752918 97752919 rs2124263 .......
De aqui obtengo dos archivos, los convertidos: hglft_genome_1156_bb5460.bed y los errores: hglft_genome_1156_bb5460.err. Primeramente tengo que eliminar los convertidos.
$ grep -v "#Deleted in new" hglft_genome_1156_bb5460.err | awk {'print $4'} > delete_snps.txt $ plink --noweb --bfile admurciav36f --exclude delete_snps.txt --make-bed --out admurciav36f_nofailures $ plink --noweb --bfile admurciav36f_nofailures --recode --tab --out admurciav37
Luego convierto el archivo de snps convertidos en un .map. Lo que tengo es,
$ head hglft_genome_1156_bb5460.bed chr1 785988 785989 rs2980300 chr1 1130726 1130727 rs10907175 chr1 1268846 1268847 rs307378 chr1 1506034 1506035 rs7540231 chr1 1721478 1721479 rs2272908 chr1 1759025 1759026 rs9786963 chr1 1759053 1759054 rs10907187 chr1 1823921 1823922 rs6688000 chr1 2104980 2104981 rs262641 chr1 2105877 2105878 rs3128309
y haciendo
$ sed 's/^chr//' hglft_genome_1156_bb5460.bed | awk {'print $1"\t"$4"\t0\t"$3'} > v37.map
Obtengo,
$ head v37.map 1 rs2980300 0 785989 1 rs10907175 0 1130727 1 rs307378 0 1268847 1 rs7540231 0 1506035 1 rs2272908 0 1721479 1 rs9786963 0 1759026 1 rs10907187 0 1759054 1 rs6688000 0 1823922 1 rs262641 0 2104981 1 rs3128309 0 2105878
Por ultimo tomo este .map y obtengo mi archivo final.
$ cp admurciav37.ped v37.ped $ plink --noweb --file v37 --make-bed --out v37
Prephasing
Primero voy a separar por cromosomas la base de datos que he obtenido. Esto es muy rapido.
$ parallel plink --noweb --bfile v37 --chr {} --recode --make-bed --out v37chr{}\.unphased ::: {1..22..1}
Ahora empieza lo divertido. Con shapeit revisamos las inconsistencias entre los SNPs de nuestra base de datos y 1000Genome,
$ parallel shapeit -check -B v37chr{}.unphased --input-ref /nas/data/GNlib/1000Genome/ALL_1000G_phase1integrated_v3_impute/ALL_1000G_phase1integrated_v3_chr{}_impute.hap.gz /nas/data/GNlib/1000Genome/ALL_1000G_phase1integrated_v3_impute/ALL_1000G_phase1integrated_v3_chr{}_impute.legend.gz /nas/data/GNlib/1000Genome/ALL_1000G_phase1integrated_v3_impute/ALL_1000G_phase1integrated_v3.sample --output-log v37chr{}.alignments ::: {1..22..1}
Con esto obtengo para cada cromosoma (XX):
- v37chrXX.alignments.log
- v37chrXX.alignments.snp.strand (incluye los errores encontrados por incompatibilidad de alelos strand y SNPs que se encuentran en nuestro gwas que no se encuentran en el panel de referencia missing)
- v37chrXX.alignments.snap.strand.exclude (posiciones cromosómicas de todos los SNPs que dan error)
Primeramente voy a intentar hacer un flip con los archivos ue dan errores,
$ for x in {1..22..1}; do awk '{if($1=="strand") print $3}' v37chr$x.alignments.snp.strand > strand$x; done $ parallel plink --noweb --bfile v37chr{}.unphased --flip strand{} --make-bed --out v37chr{}f.unphased ::: {1..22..1}
Ahora miro los errores que no se han solucionado con el flip y los excluyo de la base de datos.
$ parallel shapeit -check -B v37chr{}f.unphased --input-ref /nas/data/GNlib/1000Genome/ALL_1000G_phase1integrated_v3_impute/ALL_1000G_phase1integrated_v3_chr{}_impute.hap.gz /nas/data/GNlib/1000Genome/ALL_1000G_phase1integrated_v3_impute/ALL_1000G_phase1integrated_v3_chr{}_impute.legend.gz /nas/data/GNlib/1000Genome/ALL_1000G_phase1integrated_v3_impute/ALL_1000G_phase1integrated_v3.sample --output-log v37chr{}f.alignments ::: {1..22..1} $ for x in {1..22..1}; do awk '{print $3}' v37chr${x}f.alignments.snp.strand > exclude${x}; done $ parallel plink --bfile v37chr{}f.unphased --exclude exclude{} --make-bed --out admurchr{}.unphased ::: {1..22..1}
Y felizmente ahora vamos a hacer el pre-phasing,
$ screen bash $ for x in {1..22..1}; do shapeit --input-bed admurchr${x}.unphased --input-map /nas/data/GNlib/1000Genome/ALL_1000G_phase1integrated_v3_impute/genetic_map_chr${x}_combined_b37.txt --thread 30 --effective-size 11418 -O admurchr${x}.haps admurchr${x}.sample --output-log admurchr${x}.log; done
Imputando
Tras esto ya estamos preparados para imputar. la orden general para imputar la ventana de 15000000 a 16999999 en el cromosoma 22 es algo así,
impute2 -m /nas/data/GNlib/1000Genome/ALL_1000G_phase1integrated_v3_impute/genetic_map_chr22_combined_b37.txt -known_haps_g admurchr22.haps -h /nas/data/GNlib/1000Genome/ALL_1000G_phase1integrated_v3_impute/ALL_1000G_phase1integrated_v3_chr22_impute.hap.gz -l /nas/data/GNlib/1000Genome/ALL_1000G_phase1integrated_v3_impute/ALL_1000G_phase1integrated_v3_chr22_impute.legend.gz -Ne 20000 -int 15000000 16999999 -o admur_chr22.15000000-16999999.impute2 –allow_large_regions –seed 367946
Lo que he hecho es un sencillo script en perl que me ejecute secuencialmente la imputacion en todas las ventanas de un cromosoma. Como se fabrican estas ventanas es algo oscuro para mi asi que las he copiado y pegado.
- imputechr.pl
#!/usr/bin/perl # Copyright 2013 O. Sotolongo <osotolongo@fundacioace.com> use strict; use warnings; my @regs = ( [qw (100000 2100000 4100000 6100000 8100000 10100000 12100000 14100000 16100000 18100000 20100000 22100000 24100000 26100000 28100000 30100000 32100000 34100000 36100000 38100000 40100000 42100000 44100000 46100000 48100000 50100000 52100000 54100000 56100000 58100000 60100000 62100000 64100000 66100000 68100000 70100000 72100000 74100000 76100000 78100000 80100000 82100000 84100000 86100000 88100000 90100000 92100000 94100000 96100000 98100000 100100000 102100000 104100000 106100000 108100000 110100000 112100000 114100000 116100000 118100000 120100000 122100000 124100000 126100000 128100000 130100000 132100000 134100000 136100000 138100000 140100000 142100000 144100000 146100000 148100000 150100000 152100000 154100000 156100000 158100000 160100000 162100000 164100000 166100000 168100000 170100000 172100000 174100000 176100000 178100000 180100000 182100000 184100000 186100000 188100000 190100000 192100000 194100000 196100000 198100000 200100000 202100000 204100000 206100000 208100000 210100000 212100000 214100000 216100000 218100000 220100000 222100000 224100000 226100000 228100000 230100000 232100000 234100000 236100000 238100000 240100000 242100000 244100000 246100000 248100000)], [qw (10000 2010000 4010000 6010000 8010000 10010000 12010000 14010000 16010000 18010000 20010000 22010000 24010000 26010000 28010000 30010000 32010000 34010000 36010000 38010000 40010000 42010000 44010000 46010000 48010000 50010000 52010000 54010000 56010000 58010000 60010000 62010000 64010000 66010000 68010000 70010000 72010000 74010000 76010000 78010000 80010000 82010000 84010000 86010000 88010000 90010000 92010000 94010000 96010000 98010000 100010000 102010000 104010000 106010000 108010000 110010000 112010000 114010000 116010000 118010000 120010000 122010000 124010000 126010000 128010000 130010000 132010000 134010000 136010000 138010000 140010000 142010000 144010000 146010000 148010000 150010000 152010000 154010000 156010000 158010000 160010000 162010000 164010000 166010000 168010000 170010000 172010000 174010000 176010000 178010000 180010000 182010000 184010000 186010000 188010000 190010000 192010000 194010000 196010000 198010000 200010000 202010000 204010000 206010000 208010000 210010000 212010000 214010000 216010000 218010000 220010000 222010000 224010000 226010000 228010000 230010000 232010000 234010000 236010000 238010000 240010000 242010000)], [qw (10000 2010000 4010000 6010000 8010000 10010000 12010000 14010000 16010000 18010000 20010000 22010000 24010000 26010000 28010000 30010000 32010000 34010000 36010000 38010000 40010000 42010000 44010000 46010000 48010000 50010000 52010000 54010000 56010000 58010000 60010000 62010000 64010000 66010000 68010000 70010000 72010000 74010000 76010000 78010000 80010000 82010000 84010000 86010000 88010000 90010000 92010000 94010000 96010000 98010000 100010000 102010000 104010000 106010000 108010000 110010000 112010000 114010000 116010000 118010000 120010000 122010000 124010000 126010000 128010000 130010000 132010000 134010000 136010000 138010000 140010000 142010000 144010000 146010000 148010000 150010000 152010000 154010000 156010000 158010000 160010000 162010000 164010000 166010000 168010000 170010000 172010000 174010000 176010000 178010000 180010000 182010000 184010000 186010000 188010000 190010000 192010000 194010000 196010000)], [qw (10000 2010000 4010000 6010000 8010000 10010000 12010000 14010000 16010000 18010000 20010000 22010000 24010000 26010000 28010000 30010000 32010000 34010000 36010000 38010000 40010000 42010000 44010000 46010000 48010000 50010000 52010000 54010000 56010000 58010000 60010000 62010000 64010000 66010000 68010000 70010000 72010000 74010000 76010000 78010000 80010000 82010000 84010000 86010000 88010000 90010000 92010000 94010000 96010000 98010000 100010000 102010000 104010000 106010000 108010000 110010000 112010000 114010000 116010000 118010000 120010000 122010000 124010000 126010000 128010000 130010000 132010000 134010000 136010000 138010000 140010000 142010000 144010000 146010000 148010000 150010000 152010000 154010000 156010000 158010000 160010000 162010000 164010000 166010000 168010000 170010000 172010000 174010000 176010000 178010000 180010000 182010000 184010000 186010000 188010000 190010000)], [qw (10000 2010000 4010000 6010000 8010000 10010000 12010000 14010000 16010000 18010000 20010000 22010000 24010000 26010000 28010000 30010000 32010000 34010000 36010000 38010000 40010000 42010000 44010000 46010000 48010000 50010000 52010000 54010000 56010000 58010000 60010000 62010000 64010000 66010000 68010000 70010000 72010000 74010000 76010000 78010000 80010000 82010000 84010000 86010000 88010000 90010000 92010000 94010000 96010000 98010000 100010000 102010000 104010000 106010000 108010000 110010000 112010000 114010000 116010000 118010000 120010000 122010000 124010000 126010000 128010000 130010000 132010000 134010000 136010000 138010000 140010000 142010000 144010000 146010000 148010000 150010000 152010000 154010000 156010000 158010000 160010000 162010000 164010000 166010000 168010000 170010000 172010000 174010000 176010000 178010000 180010000)], [qw (100000 2100000 4100000 6100000 8100000 10100000 12100000 14100000 16100000 18100000 20100000 22100000 24100000 26100000 28100000 30100000 32100000 34100000 36100000 38100000 40100000 42100000 44100000 46100000 48100000 50100000 52100000 54100000 56100000 58100000 60100000 62100000 64100000 66100000 68100000 70100000 72100000 74100000 76100000 78100000 80100000 82100000 84100000 86100000 88100000 90100000 92100000 94100000 96100000 98100000 100100000 102100000 104100000 106100000 108100000 110100000 112100000 114100000 116100000 118100000 120100000 122100000 124100000 126100000 128100000 130100000 132100000 134100000 136100000 138100000 140100000 142100000 144100000 146100000 148100000 150100000 152100000 154100000 156100000 158100000 160100000 162100000 164100000 166100000 168100000 170100000)], [qw (10000 2010000 4010000 6010000 8010000 10010000 12010000 14010000 16010000 18010000 20010000 22010000 24010000 26010000 28010000 30010000 32010000 34010000 36010000 38010000 40010000 42010000 44010000 46010000 48010000 50010000 52010000 54010000 56010000 58010000 60010000 62010000 64010000 66010000 68010000 70010000 72010000 74010000 76010000 78010000 80010000 82010000 84010000 86010000 88010000 90010000 92010000 94010000 96010000 98010000 100010000 102010000 104010000 106010000 108010000 110010000 112010000 114010000 116010000 118010000 120010000 122010000 124010000 126010000 128010000 130010000 132010000 134010000 136010000 138010000 140010000 142010000 144010000 146010000 148010000 150010000 152010000 154010000 156010000 158010000)], [qw (100000 2100000 4100000 6100000 8100000 10100000 12100000 14100000 16100000 18100000 20100000 22100000 24100000 26100000 28100000 30100000 32100000 34100000 36100000 38100000 40100000 42100000 44100000 46100000 48100000 50100000 52100000 54100000 56100000 58100000 60100000 62100000 64100000 66100000 68100000 70100000 72100000 74100000 76100000 78100000 80100000 82100000 84100000 86100000 88100000 90100000 92100000 94100000 96100000 98100000 100100000 102100000 104100000 106100000 108100000 110100000 112100000 114100000 116100000 118100000 120100000 122100000 124100000 126100000 128100000 130100000 132100000 134100000 136100000 138100000 140100000 142100000 144100000 146100000)], [qw (100000 2100000 4100000 6100000 8100000 10100000 12100000 14100000 16100000 18100000 20100000 22100000 24100000 26100000 28100000 30100000 32100000 34100000 36100000 38100000 40100000 42100000 44100000 46100000 48100000 50100000 52100000 54100000 56100000 58100000 60100000 62100000 64100000 66100000 68100000 70100000 72100000 74100000 76100000 78100000 80100000 82100000 84100000 86100000 88100000 90100000 92100000 94100000 96100000 98100000 100100000 102100000 104100000 106100000 108100000 110100000 112100000 114100000 116100000 118100000 120100000 122100000 124100000 126100000 128100000 130100000 132100000 134100000 136100000 138100000 140100000)], [qw (10000 2010000 4010000 6010000 8010000 10010000 12010000 14010000 16010000 18010000 20010000 22010000 24010000 26010000 28010000 30010000 32010000 34010000 36010000 38010000 40010000 42010000 44010000 46010000 48010000 50010000 52010000 54010000 56010000 58010000 60010000 62010000 64010000 66010000 68010000 70010000 72010000 74010000 76010000 78010000 80010000 82010000 84010000 86010000 88010000 90010000 92010000 94010000 96010000 98010000 100010000 102010000 104010000 106010000 108010000 110010000 112010000 114010000 116010000 118010000 120010000 122010000 124010000 126010000 128010000 130010000 132010000 134010000)], [qw (100000 2100000 4100000 6100000 8100000 10100000 12100000 14100000 16100000 18100000 20100000 22100000 24100000 26100000 28100000 30100000 32100000 34100000 36100000 38100000 40100000 42100000 44100000 46100000 48100000 50100000 52100000 54100000 56100000 58100000 60100000 62100000 64100000 66100000 68100000 70100000 72100000 74100000 76100000 78100000 80100000 82100000 84100000 86100000 88100000 90100000 92100000 94100000 96100000 98100000 100100000 102100000 104100000 106100000 108100000 110100000 112100000 114100000 116100000 118100000 120100000 122100000 124100000 126100000 128100000 130100000 132100000 134100000)], [qw (100000 2100000 4100000 6100000 8100000 10100000 12100000 14100000 16100000 18100000 20100000 22100000 24100000 26100000 28100000 30100000 32100000 34100000 36100000 38100000 40100000 42100000 44100000 46100000 48100000 50100000 52100000 54100000 56100000 58100000 60100000 62100000 64100000 66100000 68100000 70100000 72100000 74100000 76100000 78100000 80100000 82100000 84100000 86100000 88100000 90100000 92100000 94100000 96100000 98100000 100100000 102100000 104100000 106100000 108100000 110100000 112100000 114100000 116100000 118100000 120100000 122100000 124100000 126100000 128100000 130100000 132100000)], [qw (18000000 20000000 22000000 24000000 26000000 28000000 30000000 32000000 34000000 36000000 38000000 40000000 42000000 44000000 46000000 48000000 50000000 52000000 54000000 56000000 58000000 60000000 62000000 64000000 66000000 68000000 70000000 72000000 74000000 76000000 78000000 80000000 82000000 84000000 86000000 88000000 90000000 92000000 94000000 96000000 98000000 100000000 102000000 104000000 106000000 108000000 110000000 112000000 114000000)], [qw (18000000 20000000 22000000 24000000 26000000 28000000 30000000 32000000 34000000 36000000 38000000 40000000 42000000 44000000 46000000 48000000 50000000 52000000 54000000 56000000 58000000 60000000 62000000 64000000 66000000 68000000 70000000 72000000 74000000 76000000 78000000 80000000 82000000 84000000 86000000 88000000 90000000 92000000 94000000 96000000 98000000 100000000 102000000 104000000 106000000)], [qw (19000000 21000000 23000000 25000000 27000000 29000000 31000000 33000000 35000000 37000000 39000000 41000000 43000000 45000000 47000000 49000000 51000000 53000000 55000000 57000000 59000000 61000000 63000000 65000000 67000000 69000000 71000000 73000000 75000000 77000000 79000000 81000000 83000000 85000000 87000000 89000000 91000000 93000000 95000000 97000000 99000000 101000000)], [qw (10000 2010000 4010000 6010000 8010000 10010000 12010000 14010000 16010000 18010000 20010000 22010000 24010000 26010000 28010000 30010000 32010000 34010000 36010000 38010000 40010000 42010000 44010000 46010000 48010000 50010000 52010000 54010000 56010000 58010000 60010000 62010000 64010000 66010000 68010000 70010000 72010000 74010000 76010000 78010000 80010000 82010000 84010000 86010000 88010000 90010000)], [qw (1000 2001000 4001000 6001000 8001000 10001000 12001000 14001000 16001000 18001000 20001000 22001000 24001000 26001000 28001000 30001000 32001000 34001000 36001000 38001000 40001000 42001000 44001000 46001000 48001000 50001000 52001000 54001000 56001000 58001000 60001000 62001000 64001000 66001000 68001000 70001000 72001000 74001000 76001000 78001000 80001000)], [qw (10000 2010000 4010000 6010000 8010000 10010000 12010000 14010000 16010000 18010000 20010000 22010000 24010000 26010000 28010000 30010000 32010000 34010000 36010000 38010000 40010000 42010000 44010000 46010000 48010000 50010000 52010000 54010000 56010000 58010000 60010000 62010000 64010000 66010000 68010000 70010000 72010000 74010000 76010000 78010000)], [qw (100000 2100000 4100000 6100000 8100000 10100000 12100000 14100000 16100000 18100000 20100000 22100000 24100000 26100000 28100000 30100000 32100000 34100000 36100000 38100000 40100000 42100000 44100000 46100000 48100000 50100000 52100000 54100000 56100000 58100000)], [qw (10000 2010000 4010000 6010000 8010000 10010000 12010000 14010000 16010000 18010000 20010000 22010000 24010000 26010000 28010000 30010000 32010000 34010000 36010000 38010000 40010000 42010000 44010000 46010000 48010000 50010000 52010000 54010000 56010000 58010000 60010000 62010000)], [qw (10000000 12000000 14000000 16000000 18000000 20000000 22000000 24000000 26000000 28000000 30000000 32000000 34000000 36000000 38000000 40000000 42000000 44000000 46000000 48000000)], [qw (15000000 17000000 19000000 21000000 23000000 25000000 27000000 29000000 31000000 33000000 35000000 37000000 39000000 41000000 43000000 45000000 47000000 49000000 51000000)], ); my $window = 1999999; my $libdir = '/nas/data/GNlib/1000Genome/ALL_1000G_phase1integrated_v3_impute/'; my $chr = shift; die "Must supply chromosome!\n" unless $chr; my $base = $regs[$chr-1]; for my $i (0 .. $#{$base}) { my $w1 = $base->[$i]; my $w2 = $w1 + $window; my $order = "impute2 -m ".$libdir."genetic_map_chr".$chr."_combined_b37.txt -known_haps_g admurchr".$chr.".haps -h ".$libdir."ALL_1000G_phase1integrated_v3_chr".$chr."_impute.hap.gz -l ".$libdir."ALL_1000G_phase1integrated_v3_chr".$chr."_impute.legend.gz -Ne 20000 -int ".$w1." ".$w2." -o admur_chr".$chr.".".$w1."-".$w2.".impute2 –allow_large_regions –seed 367946"; print "$order\n"; system($order); }
Ahora ejecutar esto ya es muy sencillo
parallel imputechr.pl {} ::: {1..22..1}
No esta muy optimizado ya que utilizo solo 22 procesadores, los cromosomas no estan balanceados, etc pero tampoco creo que valga la pena romperse la cabeza con esto.
Tras esto hay que pegarlos archivos .impute2 pero voy a probar con uno a ver que pasa,
$ wc -l admur_chr22.*.impute2 2163 admur_chr22.15000000-16999999.impute2 26326 admur_chr22.17000000-18999999.impute2 23261 admur_chr22.19000000-20999999.impute2 23821 admur_chr22.21000000-22999999.impute2 29967 admur_chr22.23000000-24999999.impute2 30049 admur_chr22.25000000-26999999.impute2 27506 admur_chr22.27000000-28999999.impute2 24557 admur_chr22.29000000-30999999.impute2 26377 admur_chr22.31000000-32999999.impute2 30715 admur_chr22.33000000-34999999.impute2 29381 admur_chr22.35000000-36999999.impute2 27769 admur_chr22.37000000-38999999.impute2 25536 admur_chr22.39000000-40999999.impute2 24608 admur_chr22.41000000-42999999.impute2 31071 admur_chr22.43000000-44999999.impute2 32923 admur_chr22.45000000-46999999.impute2 36053 admur_chr22.47000000-48999999.impute2 36468 admur_chr22.49000000-50999999.impute2 3052 admur_chr22.51000000-52999999.impute2 491603 total $ cat admur_chr22.*.impute2 > admur_chr22.gen $ wc -l admur_chr22.gen 491603 admur_chr22.gen
Parece que va bien, pero cuidado, si se hace un cat indiscriminado no va a respetar el orden de los archivos,
$ wc -l admur_chr1.*.impute2 20908 admur_chr1.100000-2099999.impute2 27413 admur_chr1.100100000-102099999.impute2 28473 admur_chr1.10100000-12099999.impute2 28073 admur_chr1.102100000-104099999.impute2 30590 admur_chr1.104100000-106099999.impute2 29565 admur_chr1.106100000-108099999.impute2 23361 admur_chr1.108100000-110099999.impute2 27206 admur_chr1.110100000-112099999.impute2 26169 admur_chr1.112100000-114099999.impute2 25630 admur_chr1.114100000-116099999.impute2 26798 admur_chr1.116100000-118099999.impute2 27182 admur_chr1.118100000-120099999.impute2 9486 admur_chr1.120100000-122099999.impute2 17754 admur_chr1.12100000-14099999.impute2 31730 admur_chr1.14100000-16099999.impute2 5863 admur_chr1.144100000-146099999.impute2 15432 admur_chr1.146100000-148099999.impute2 8211 admur_chr1.148100000-150099999.impute2 22784 admur_chr1.150100000-152099999.impute2 27905 admur_chr1.152100000-154099999.impute2 21489 admur_chr1.154100000-156099999.impute2 27749 admur_chr1.156100000-158099999.impute2 28988 admur_chr1.158100000-160099999.impute2 26522 admur_chr1.160100000-162099999.impute2 24375 admur_chr1.16100000-18099999.impute2 28623 admur_chr1.162100000-164099999.impute2 29061 admur_chr1.164100000-166099999.impute2 29108 admur_chr1.166100000-168099999.impute2 28942 admur_chr1.168100000-170099999.impute2 27559 admur_chr1.170100000-172099999.impute2 24188 admur_chr1.172100000-174099999.impute2 26726 admur_chr1.174100000-176099999.impute2 25819 admur_chr1.176100000-178099999.impute2 26084 admur_chr1.178100000-180099999.impute2 27678 admur_chr1.180100000-182099999.impute2 30910 admur_chr1.18100000-20099999.impute2 27082 admur_chr1.182100000-184099999.impute2 26659 admur_chr1.184100000-186099999.impute2 27929 admur_chr1.186100000-188099999.impute2 31630 admur_chr1.188100000-190099999.impute2 26320 admur_chr1.190100000-192099999.impute2 27083 admur_chr1.192100000-194099999.impute2 30094 admur_chr1.194100000-196099999.impute2 26517 admur_chr1.196100000-198099999.impute2 26585 admur_chr1.198100000-200099999.impute2 27000 admur_chr1.200100000-202099999.impute2 26731 admur_chr1.20100000-22099999.impute2 26190 admur_chr1.202100000-204099999.impute2 24815 admur_chr1.204100000-206099999.impute2 21534 admur_chr1.206100000-208099999.impute2 29269 admur_chr1.208100000-210099999.impute2 28422 admur_chr1.2100000-4099999.impute2 27718 admur_chr1.210100000-212099999.impute2 .......
así que tengo que hacerlo mas o menos o mano. Por suerte tengo el script perl anterior, asi que voy a reutilizarlo.
- catchr.pl
#!/usr/bin/perl # Copyright 2013 O. Sotolongo <osotolongo@fundacioace.com> use strict; use warnings; my @regs = ( [qw (100000 2100000 4100000 6100000 8100000 10100000 12100000 14100000 16100000 18100000 20100000 22100000 24100000 26100000 28100000 30100000 32100000 34100000 36100000 38100000 40100000 42100000 44100000 46100000 48100000 50100000 52100000 54100000 56100000 58100000 60100000 62100000 64100000 66100000 68100000 70100000 72100000 74100000 76100000 78100000 80100000 82100000 84100000 86100000 88100000 90100000 92100000 94100000 96100000 98100000 100100000 102100000 104100000 106100000 108100000 110100000 112100000 114100000 116100000 118100000 120100000 122100000 124100000 126100000 128100000 130100000 132100000 134100000 136100000 138100000 140100000 142100000 144100000 146100000 148100000 150100000 152100000 154100000 156100000 158100000 160100000 162100000 164100000 166100000 168100000 170100000 172100000 174100000 176100000 178100000 180100000 182100000 184100000 186100000 188100000 190100000 192100000 194100000 196100000 198100000 200100000 202100000 204100000 206100000 208100000 210100000 212100000 214100000 216100000 218100000 220100000 222100000 224100000 226100000 228100000 230100000 232100000 234100000 236100000 238100000 240100000 242100000 244100000 246100000 248100000)], [qw (10000 2010000 4010000 6010000 8010000 10010000 12010000 14010000 16010000 18010000 20010000 22010000 24010000 26010000 28010000 30010000 32010000 34010000 36010000 38010000 40010000 42010000 44010000 46010000 48010000 50010000 52010000 54010000 56010000 58010000 60010000 62010000 64010000 66010000 68010000 70010000 72010000 74010000 76010000 78010000 80010000 82010000 84010000 86010000 88010000 90010000 92010000 94010000 96010000 98010000 100010000 102010000 104010000 106010000 108010000 110010000 112010000 114010000 116010000 118010000 120010000 122010000 124010000 126010000 128010000 130010000 132010000 134010000 136010000 138010000 140010000 142010000 144010000 146010000 148010000 150010000 152010000 154010000 156010000 158010000 160010000 162010000 164010000 166010000 168010000 170010000 172010000 174010000 176010000 178010000 180010000 182010000 184010000 186010000 188010000 190010000 192010000 194010000 196010000 198010000 200010000 202010000 204010000 206010000 208010000 210010000 212010000 214010000 216010000 218010000 220010000 222010000 224010000 226010000 228010000 230010000 232010000 234010000 236010000 238010000 240010000 242010000)], [qw (10000 2010000 4010000 6010000 8010000 10010000 12010000 14010000 16010000 18010000 20010000 22010000 24010000 26010000 28010000 30010000 32010000 34010000 36010000 38010000 40010000 42010000 44010000 46010000 48010000 50010000 52010000 54010000 56010000 58010000 60010000 62010000 64010000 66010000 68010000 70010000 72010000 74010000 76010000 78010000 80010000 82010000 84010000 86010000 88010000 90010000 92010000 94010000 96010000 98010000 100010000 102010000 104010000 106010000 108010000 110010000 112010000 114010000 116010000 118010000 120010000 122010000 124010000 126010000 128010000 130010000 132010000 134010000 136010000 138010000 140010000 142010000 144010000 146010000 148010000 150010000 152010000 154010000 156010000 158010000 160010000 162010000 164010000 166010000 168010000 170010000 172010000 174010000 176010000 178010000 180010000 182010000 184010000 186010000 188010000 190010000 192010000 194010000 196010000)], [qw (10000 2010000 4010000 6010000 8010000 10010000 12010000 14010000 16010000 18010000 20010000 22010000 24010000 26010000 28010000 30010000 32010000 34010000 36010000 38010000 40010000 42010000 44010000 46010000 48010000 50010000 52010000 54010000 56010000 58010000 60010000 62010000 64010000 66010000 68010000 70010000 72010000 74010000 76010000 78010000 80010000 82010000 84010000 86010000 88010000 90010000 92010000 94010000 96010000 98010000 100010000 102010000 104010000 106010000 108010000 110010000 112010000 114010000 116010000 118010000 120010000 122010000 124010000 126010000 128010000 130010000 132010000 134010000 136010000 138010000 140010000 142010000 144010000 146010000 148010000 150010000 152010000 154010000 156010000 158010000 160010000 162010000 164010000 166010000 168010000 170010000 172010000 174010000 176010000 178010000 180010000 182010000 184010000 186010000 188010000 190010000)], [qw (10000 2010000 4010000 6010000 8010000 10010000 12010000 14010000 16010000 18010000 20010000 22010000 24010000 26010000 28010000 30010000 32010000 34010000 36010000 38010000 40010000 42010000 44010000 46010000 48010000 50010000 52010000 54010000 56010000 58010000 60010000 62010000 64010000 66010000 68010000 70010000 72010000 74010000 76010000 78010000 80010000 82010000 84010000 86010000 88010000 90010000 92010000 94010000 96010000 98010000 100010000 102010000 104010000 106010000 108010000 110010000 112010000 114010000 116010000 118010000 120010000 122010000 124010000 126010000 128010000 130010000 132010000 134010000 136010000 138010000 140010000 142010000 144010000 146010000 148010000 150010000 152010000 154010000 156010000 158010000 160010000 162010000 164010000 166010000 168010000 170010000 172010000 174010000 176010000 178010000 180010000)], [qw (100000 2100000 4100000 6100000 8100000 10100000 12100000 14100000 16100000 18100000 20100000 22100000 24100000 26100000 28100000 30100000 32100000 34100000 36100000 38100000 40100000 42100000 44100000 46100000 48100000 50100000 52100000 54100000 56100000 58100000 60100000 62100000 64100000 66100000 68100000 70100000 72100000 74100000 76100000 78100000 80100000 82100000 84100000 86100000 88100000 90100000 92100000 94100000 96100000 98100000 100100000 102100000 104100000 106100000 108100000 110100000 112100000 114100000 116100000 118100000 120100000 122100000 124100000 126100000 128100000 130100000 132100000 134100000 136100000 138100000 140100000 142100000 144100000 146100000 148100000 150100000 152100000 154100000 156100000 158100000 160100000 162100000 164100000 166100000 168100000 170100000)], [qw (10000 2010000 4010000 6010000 8010000 10010000 12010000 14010000 16010000 18010000 20010000 22010000 24010000 26010000 28010000 30010000 32010000 34010000 36010000 38010000 40010000 42010000 44010000 46010000 48010000 50010000 52010000 54010000 56010000 58010000 60010000 62010000 64010000 66010000 68010000 70010000 72010000 74010000 76010000 78010000 80010000 82010000 84010000 86010000 88010000 90010000 92010000 94010000 96010000 98010000 100010000 102010000 104010000 106010000 108010000 110010000 112010000 114010000 116010000 118010000 120010000 122010000 124010000 126010000 128010000 130010000 132010000 134010000 136010000 138010000 140010000 142010000 144010000 146010000 148010000 150010000 152010000 154010000 156010000 158010000)], [qw (100000 2100000 4100000 6100000 8100000 10100000 12100000 14100000 16100000 18100000 20100000 22100000 24100000 26100000 28100000 30100000 32100000 34100000 36100000 38100000 40100000 42100000 44100000 46100000 48100000 50100000 52100000 54100000 56100000 58100000 60100000 62100000 64100000 66100000 68100000 70100000 72100000 74100000 76100000 78100000 80100000 82100000 84100000 86100000 88100000 90100000 92100000 94100000 96100000 98100000 100100000 102100000 104100000 106100000 108100000 110100000 112100000 114100000 116100000 118100000 120100000 122100000 124100000 126100000 128100000 130100000 132100000 134100000 136100000 138100000 140100000 142100000 144100000 146100000)], [qw (100000 2100000 4100000 6100000 8100000 10100000 12100000 14100000 16100000 18100000 20100000 22100000 24100000 26100000 28100000 30100000 32100000 34100000 36100000 38100000 40100000 42100000 44100000 46100000 48100000 50100000 52100000 54100000 56100000 58100000 60100000 62100000 64100000 66100000 68100000 70100000 72100000 74100000 76100000 78100000 80100000 82100000 84100000 86100000 88100000 90100000 92100000 94100000 96100000 98100000 100100000 102100000 104100000 106100000 108100000 110100000 112100000 114100000 116100000 118100000 120100000 122100000 124100000 126100000 128100000 130100000 132100000 134100000 136100000 138100000 140100000)], [qw (10000 2010000 4010000 6010000 8010000 10010000 12010000 14010000 16010000 18010000 20010000 22010000 24010000 26010000 28010000 30010000 32010000 34010000 36010000 38010000 40010000 42010000 44010000 46010000 48010000 50010000 52010000 54010000 56010000 58010000 60010000 62010000 64010000 66010000 68010000 70010000 72010000 74010000 76010000 78010000 80010000 82010000 84010000 86010000 88010000 90010000 92010000 94010000 96010000 98010000 100010000 102010000 104010000 106010000 108010000 110010000 112010000 114010000 116010000 118010000 120010000 122010000 124010000 126010000 128010000 130010000 132010000 134010000)], [qw (100000 2100000 4100000 6100000 8100000 10100000 12100000 14100000 16100000 18100000 20100000 22100000 24100000 26100000 28100000 30100000 32100000 34100000 36100000 38100000 40100000 42100000 44100000 46100000 48100000 50100000 52100000 54100000 56100000 58100000 60100000 62100000 64100000 66100000 68100000 70100000 72100000 74100000 76100000 78100000 80100000 82100000 84100000 86100000 88100000 90100000 92100000 94100000 96100000 98100000 100100000 102100000 104100000 106100000 108100000 110100000 112100000 114100000 116100000 118100000 120100000 122100000 124100000 126100000 128100000 130100000 132100000 134100000)], [qw (100000 2100000 4100000 6100000 8100000 10100000 12100000 14100000 16100000 18100000 20100000 22100000 24100000 26100000 28100000 30100000 32100000 34100000 36100000 38100000 40100000 42100000 44100000 46100000 48100000 50100000 52100000 54100000 56100000 58100000 60100000 62100000 64100000 66100000 68100000 70100000 72100000 74100000 76100000 78100000 80100000 82100000 84100000 86100000 88100000 90100000 92100000 94100000 96100000 98100000 100100000 102100000 104100000 106100000 108100000 110100000 112100000 114100000 116100000 118100000 120100000 122100000 124100000 126100000 128100000 130100000 132100000)], [qw (18000000 20000000 22000000 24000000 26000000 28000000 30000000 32000000 34000000 36000000 38000000 40000000 42000000 44000000 46000000 48000000 50000000 52000000 54000000 56000000 58000000 60000000 62000000 64000000 66000000 68000000 70000000 72000000 74000000 76000000 78000000 80000000 82000000 84000000 86000000 88000000 90000000 92000000 94000000 96000000 98000000 100000000 102000000 104000000 106000000 108000000 110000000 112000000 114000000)], [qw (18000000 20000000 22000000 24000000 26000000 28000000 30000000 32000000 34000000 36000000 38000000 40000000 42000000 44000000 46000000 48000000 50000000 52000000 54000000 56000000 58000000 60000000 62000000 64000000 66000000 68000000 70000000 72000000 74000000 76000000 78000000 80000000 82000000 84000000 86000000 88000000 90000000 92000000 94000000 96000000 98000000 100000000 102000000 104000000 106000000)], [qw (19000000 21000000 23000000 25000000 27000000 29000000 31000000 33000000 35000000 37000000 39000000 41000000 43000000 45000000 47000000 49000000 51000000 53000000 55000000 57000000 59000000 61000000 63000000 65000000 67000000 69000000 71000000 73000000 75000000 77000000 79000000 81000000 83000000 85000000 87000000 89000000 91000000 93000000 95000000 97000000 99000000 101000000)], [qw (10000 2010000 4010000 6010000 8010000 10010000 12010000 14010000 16010000 18010000 20010000 22010000 24010000 26010000 28010000 30010000 32010000 34010000 36010000 38010000 40010000 42010000 44010000 46010000 48010000 50010000 52010000 54010000 56010000 58010000 60010000 62010000 64010000 66010000 68010000 70010000 72010000 74010000 76010000 78010000 80010000 82010000 84010000 86010000 88010000 90010000)], [qw (1000 2001000 4001000 6001000 8001000 10001000 12001000 14001000 16001000 18001000 20001000 22001000 24001000 26001000 28001000 30001000 32001000 34001000 36001000 38001000 40001000 42001000 44001000 46001000 48001000 50001000 52001000 54001000 56001000 58001000 60001000 62001000 64001000 66001000 68001000 70001000 72001000 74001000 76001000 78001000 80001000)], [qw (10000 2010000 4010000 6010000 8010000 10010000 12010000 14010000 16010000 18010000 20010000 22010000 24010000 26010000 28010000 30010000 32010000 34010000 36010000 38010000 40010000 42010000 44010000 46010000 48010000 50010000 52010000 54010000 56010000 58010000 60010000 62010000 64010000 66010000 68010000 70010000 72010000 74010000 76010000 78010000)], [qw (100000 2100000 4100000 6100000 8100000 10100000 12100000 14100000 16100000 18100000 20100000 22100000 24100000 26100000 28100000 30100000 32100000 34100000 36100000 38100000 40100000 42100000 44100000 46100000 48100000 50100000 52100000 54100000 56100000 58100000)], [qw (10000 2010000 4010000 6010000 8010000 10010000 12010000 14010000 16010000 18010000 20010000 22010000 24010000 26010000 28010000 30010000 32010000 34010000 36010000 38010000 40010000 42010000 44010000 46010000 48010000 50010000 52010000 54010000 56010000 58010000 60010000 62010000)], [qw (10000000 12000000 14000000 16000000 18000000 20000000 22000000 24000000 26000000 28000000 30000000 32000000 34000000 36000000 38000000 40000000 42000000 44000000 46000000 48000000)], [qw (15000000 17000000 19000000 21000000 23000000 25000000 27000000 29000000 31000000 33000000 35000000 37000000 39000000 41000000 43000000 45000000 47000000 49000000 51000000)], ); my $window = 1999999; my $chr = shift; die "Must supply chromosome!\n" unless $chr; my $base = $regs[$chr-1]; my $order = "cat "; for my $i (0 .. $#{$base}) { my $w1 = $base->[$i]; my $w2 = $w1 + $window; $order .= "admur_chr".$chr.".".$w1."-".$w2.".impute2 "; } $order .= "> admur_chr".$chr.".gen"; system($order);
y ahora si terminamos de una vez,
$ for x in {1..22..1}; do catchr.pl $x; done
Esto nos deja con los archivos .gen y .sample por cada cromosoma.