genetica:bioinf_process:fastqc

Differences

This shows you the differences between two versions of the page.

genetica:bioinf_process:fastqc [2015/03/12 13:35]
vifehe
genetica:bioinf_process:fastqc [2020/08/04 10:58] (current)
Line 164:
 Because examining each file is time-consuming, I've created a couple of scripts to extract the information we are interested in:
  
-<code summarizing_fastq_stats.sh>
-
-#!/bin/bash
-
-# summarizing_fastq_stats.sh is a bash program made by vifehe to summarize the per-module statuses reported in FastQC's summary.txt files
-
-#The summary output is:
-#[vifehe@detritus 1_paraparesia_fastaq]$ cat paraparesia_fastq_QC/SN7570192_15190_P4H11_L5150_1_sequence.fq_fastqc/summary.txt
-#PASS   Basic Statistics        SN7570192_15190_P4H11_L5150_1_sequence.fq.gz 1
-#PASS   Per base sequence quality       SN7570192_15190_P4H11_L5150_1_sequence.fq.gz 2
-#PASS   Per sequence quality scores     SN7570192_15190_P4H11_L5150_1_sequence.fq.gz 3
-#PASS   Per base sequence content       SN7570192_15190_P4H11_L5150_1_sequence.fq.gz 4
-#PASS   Per base GC content     SN7570192_15190_P4H11_L5150_1_sequence.fq.gz 5
-#WARN   Per sequence GC content SN7570192_15190_P4H11_L5150_1_sequence.fq.gz 6
-#PASS   Per base N content      SN7570192_15190_P4H11_L5150_1_sequence.fq.gz 7
-#PASS   Sequence Length Distribution    SN7570192_15190_P4H11_L5150_1_sequence.fq.gz 8
-#WARN   Sequence Duplication Levels     SN7570192_15190_P4H11_L5150_1_sequence.fq.gz 9
-#PASS   Overrepresented sequences       SN7570192_15190_P4H11_L5150_1_sequence.fq.gz 10
-#PASS   Kmer Content    SN7570192_15190_P4H11_L5150_1_sequence.fq.gz 11
-
-idir=fastqcRawdata_P1
-ofile=${idir}.sumstats
-
-# Write the legend and the column header, overwriting any previous output so that re-runs do not append duplicate headers
-printf "#BS = Basic statistics\n#PBSQ = Per base sequence quality\n#PSQS = Per sequence quality scores\n#PBSC = Per base sequence content\n#bGC = Per base GC content\n#sGC = Per sequence GC content\n#bN = Per base N content\n#SLD = Sequence Length Distribution\n#SDL = Sequence Duplication Levels\n#OS = Overrepresented sequences\n#KC = Kmer Content\nSample\tBS\tPBSQ\tPSQS\tPBSC\tbGC\tsGC\tbN\tSLD\tSDL\tOS\tKC\n" > "$ofile"
-
-for x in "$idir"/*.fq_fastqc/summary.txt
-do
-    echo "$x"
-    sample=$(echo "$x" | awk -F "/" '{print $2}' | awk -F "_" '{print $4"-"$5}') # this should output L5150-1
-    echo "$sample"
-    # Line N of summary.txt holds module N; its first tab-separated field is the status (PASS/WARN/FAIL)
-    basic_stats=$(sed -n '1p' "$x" | awk -F "\t" '{print $1}') # status of Basic Statistics
-    echo "$basic_stats"
-    per_base_seq_qual=$(sed -n '2p' "$x" | awk -F "\t" '{print $1}') # status of Per base sequence quality
-    echo "$per_base_seq_qual"
-    per_seq_qual_scores=$(sed -n '3p' "$x" | awk -F "\t" '{print $1}') # status of Per sequence quality scores
-    echo "$per_seq_qual_scores"
-    per_base_seq_content=$(sed -n '4p' "$x" | awk -F "\t" '{print $1}') # status of Per base sequence content
-    echo "$per_base_seq_content"
-    per_base_GC_content=$(sed -n '5p' "$x" | awk -F "\t" '{print $1}') # status of Per base GC content
-    echo "$per_base_GC_content"
-    per_seq_GC_content=$(sed -n '6p' "$x" | awk -F "\t" '{print $1}') # status of Per sequence GC content
-    echo "$per_seq_GC_content"
-    per_base_N_content=$(sed -n '7p' "$x" | awk -F "\t" '{print $1}') # status of Per base N content
-    echo "$per_base_N_content"
-    seq_length_distr=$(sed -n '8p' "$x" | awk -F "\t" '{print $1}') # status of Sequence Length Distribution
-    echo "$seq_length_distr"
-    seq_dupl_level=$(sed -n '9p' "$x" | awk -F "\t" '{print $1}') # status of Sequence Duplication Levels
-    echo "$seq_dupl_level"
-    overrepresented=$(sed -n '10p' "$x" | awk -F "\t" '{print $1}') # status of Overrepresented sequences
-    echo "$overrepresented"
-    kmer_content=$(sed -n '11p' "$x" | awk -F "\t" '{print $1}') # status of Kmer Content
-    echo "$kmer_content"
-
-    printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" "$sample" "$basic_stats" "$per_base_seq_qual" "$per_seq_qual_scores" "$per_base_seq_content" "$per_base_GC_content" "$per_seq_GC_content" "$per_base_N_content" "$seq_length_distr" "$seq_dupl_level" "$overrepresented" "$kmer_content" >> "$ofile"
-done
-
-</code>
+[[genetica:bioinf_process:fastqc:script1]] - to summarize output from summary.txt
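For reference, the same per-sample summary can probably be produced more compactly by letting awk collect the first column of each summary.txt. This is only a minimal sketch and not the content of the script1 page: the function name `summarize_fastqc_status` is hypothetical, and it assumes the usual FastQC layout of one `*_fastqc` directory per input file, each holding a tab-separated summary.txt with the status in the first column.

```bash
#!/bin/bash
# Hypothetical helper (an illustration, not the wiki's script1):
# print one tab-separated row per FastQC summary.txt found under a directory.
summarize_fastqc_status() {
    local idir=$1 f sample
    for f in "$idir"/*_fastqc/summary.txt; do
        [ -e "$f" ] || continue    # the glob matched nothing
        # Use the FastQC output directory name, minus "_fastqc", as the sample label.
        sample=$(basename "$(dirname "$f")")
        sample=${sample%_fastqc}
        # Column 1 of summary.txt is the status; append the statuses in file order.
        awk -F '\t' -v s="$sample" '
            { row = row "\t" $1 }
            END { print s row }
        ' "$f"
    done
}
```

Something like `summarize_fastqc_status fastqcRawdata_P1 > fastqcRawdata_P1.sumstats` would then write the table. Because the statuses are taken in file order, the number of columns follows whatever modules the installed FastQC version reports instead of being fixed at eleven.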
  
 +[[genetica:bioinf_process:FASTQC:script2]] - to summarize output from fastqc_data.txt
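The content of script2 is not shown in this comparison, so the following is only a rough sketch of what summarizing fastqc_data.txt could look like. It assumes the standard FastQC report layout, where the `>>Basic Statistics` module lists tab-separated `Measure`/`Value` pairs such as `Filename`, `Total Sequences`, `Sequence length` and `%GC`, and ends with `>>END_MODULE`. The function name `fastqc_basic_stats` and the output column names are hypothetical.

```bash
#!/bin/bash
# Hypothetical helper (an illustration, not the wiki's script2):
# tabulate a few Basic Statistics values from each fastqc_data.txt under a directory.
fastqc_basic_stats() {
    local idir=$1 f
    printf 'Filename\tTotal_Sequences\tSequence_length\tGC_percent\n'
    for f in "$idir"/*_fastqc/fastqc_data.txt; do
        [ -e "$f" ] || continue    # the glob matched nothing
        awk -F '\t' '
            /^>>Basic Statistics/           { in_mod = 1; next }   # module starts
            /^>>END_MODULE/ && in_mod       { exit }               # module ends; END still runs
            in_mod && $1 == "Filename"        { name  = $2 }
            in_mod && $1 == "Total Sequences" { total = $2 }
            in_mod && $1 == "Sequence length" { len   = $2 }
            in_mod && $1 == "%GC"             { gc    = $2 }
            END { print name "\t" total "\t" len "\t" gc }
        ' "$f"
    done
}
```

Calling `fastqc_basic_stats fastqcRawdata_P1` prints a header line followed by one row per report. Other modules could be handled the same way by matching a different `>>` module header.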
genetica/bioinf_process/fastqc.1426167332.txt.gz · Last modified: 2020/08/04 10:48 (external edit)