User Tools

Site Tools


genetica:bioinf_process:fastqc

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revisionPrevious revision
Next revision
Previous revision
genetica:bioinf_process:fastqc [2015/03/12 11:51] vifehegenetica:bioinf_process:fastqc [2023/01/02 15:31] (current) osotolongo
Line 1: Line 1:
- 
 The evaluation of data quality of the __fatsq__ files is the first one of several Quality Control **(QC)** steps in the analysis of Next Generation Sequnecing (NGS) data. For such purpose we will use the software [[http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ | FastQC]]. The evaluation of data quality of the __fatsq__ files is the first one of several Quality Control **(QC)** steps in the analysis of Next Generation Sequnecing (NGS) data. For such purpose we will use the software [[http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ | FastQC]].
  
Line 21: Line 20:
  
  
-__Example of running FastQC in one of our samples:__+==Example of running FastQC in one of our samples:==
  
-Bonn's fastq files are stored at directory: Bonn_0_fastq, uncer different folders according to its plate of origin, hence:+Bonn's fastq files are stored at directory: Bonn_0_fastq, under different folders according to its plate of origin, hence:
  
-''[vifehe@detritus bonn_data]$ ls Bonn_0_fastq/+<code> 
 +[vifehe@detritus bonn_data]$ ls Bonn_0_fastq/
 P1_001-040     P1_041-080          P1_081-095 P1_001-040     P1_041-080          P1_081-095
 P2_001-040     P2_041-080          P2_081-095  P2_001-040     P2_041-080          P2_081-095 
 P3_001-040     P3_041-080          P3_081-095 P3_001-040     P3_041-080          P3_081-095
-P4_001-040     P4_081-095          P4_041-080  P5_001-017''+P4_001-040     P4_081-095          P4_041-080  P5_001-017 
 +</code>
  
 # we create the directory where we will save FastQC output: # we create the directory where we will save FastQC output:
Line 36: Line 37:
  
 # and we further create directories for each of the plates # and we further create directories for each of the plates
 +<code>
 +[vifehe@detritus Bonn_0_fastq]$ cd fastqcRawdata
 +[vifehe@detritus fastqcRawdata]$ touch fastqcRawdata_P1 fastqcRawdata_P2 fastqcRawdata_P3 fastqcRawdata_P4
  
-''[vifehe@detritus fastqcRawdata]$ touch fastqcRawdata_P1'' +</code> 
-''[vifehe@detritus fastqcRawdata]$ touch fastqcRawdata_P2'' +# to run fastqc on a single file, return to folder where we have our vcf files 
-''[vifehe@detritus fastqcRawdata]$ touch fastqcRawdata_P3'' +<code> 
-''[vifehe@detritus fastqcRawdata]$ touch fastqcRawdata_P4''+[vifehe@detritus fastqcRawdata]$ cd .. 
 +[vifehe@detritus Bonn_0_fastq]$ cd P1_001-040 
 +#to see just the first two files 
 +[vifehe@detritus P1_001-040]$ ls | head -n2     
 +SN7640211_14074_P1A01_MND1014_1_sequence.fq.gz 
 +SN7640211_14074_P1A01_MND1014_2_sequence.fq.gz 
 +[vifehe@detritus P1_001-040]$ fastqc -o ../fastqcRawdata/fastqcRawdata_P1 Started analysis of SN7640211_14074_P1A01_MND1014_1_sequence.fq.gz 
 +Started analysis of SN7640211_14074_P1A01_MND1014_1_sequence.fq.gz  # started at 13:57 
 +Approx 5% complete for SN7640211_14074_P1A01_MND1014_1_sequence.fq.gz 
 +Approx 10% complete for SN7640211_14074_P1A01_MND1014_1_sequence.fq.gz 
 +Approx 15% complete for SN7640211_14074_P1A01_MND1014_1_sequence.fq.gz 
 +Approx 20% complete for SN7640211_14074_P1A01_MND1014_1_sequence.fq.gz 
 +Approx 25% complete for SN7640211_14074_P1A01_MND1014_1_sequence.fq.gz 
 +Approx 30% complete for SN7640211_14074_P1A01_MND1014_1_sequence.fq.gz 
 +Approx 35% complete for SN7640211_14074_P1A01_MND1014_1_sequence.fq.gz 
 +Approx 40% complete for SN7640211_14074_P1A01_MND1014_1_sequence.fq.gz 
 +Approx 45% complete for SN7640211_14074_P1A01_MND1014_1_sequence.fq.gz 
 +Approx 50% complete for SN7640211_14074_P1A01_MND1014_1_sequence.fq.gz 
 +Approx 55% complete for SN7640211_14074_P1A01_MND1014_1_sequence.fq.gz 
 +Approx 60% complete for SN7640211_14074_P1A01_MND1014_1_sequence.fq.gz 
 +Approx 65% complete for SN7640211_14074_P1A01_MND1014_1_sequence.fq.gz 
 +Approx 70% complete for SN7640211_14074_P1A01_MND1014_1_sequence.fq.gz 
 +Approx 75% complete for SN7640211_14074_P1A01_MND1014_1_sequence.fq.gz 
 +Approx 80% complete for SN7640211_14074_P1A01_MND1014_1_sequence.fq.gz 
 +Approx 85% complete for SN7640211_14074_P1A01_MND1014_1_sequence.fq.gz 
 +Approx 90% complete for SN7640211_14074_P1A01_MND1014_1_sequence.fq.gz 
 +Approx 95% complete for SN7640211_14074_P1A01_MND1014_1_sequence.fq.gz 
 +Approx 100% complete for SN7640211_14074_P1A01_MND1014_1_sequence.fq.gz 
 +Analysis complete for SN7640211_14074_P1A01_MND1014_1_sequence.fq.gz # finished at 14:09 
 +# the process per sample takes about 12 minutes 
 +#this can be run in a loop 
 +[vifehe@detritus P1_001-040]$  for x in P1_001-040/*.gz; do fastqc -o ../fastqcRawdata/fastqcRawdata_P1/ $x; done 
 +# to examine the output  
 +[vifehe@detritus P1_001-040]$ cd ../fastqcRawdata/fastqcRawdata_P1
  
 +#the program has created a folder named like the sequence and another compressed folder
 +[vifehe@detritus fastqcRawdata_P1]$ ls | head -n2 
 +SN7640211_14074_P1A01_MND1014_1_sequence.fq_fastqc
 +SN7640211_14074_P1A01_MND1014_1_sequence.fq_fastqc.zip
  
 +#list the contents of the folder created
 +[vifehe@detritus fastqcRawdata_P1]$ cd SN7640211_14074_P1A01_MND1014_1_sequence.fq_fastqc
 +[vifehe@detritus SN7640211_14074_P1A01_MND1014_1_sequence.fq_fastqc]$ ls
 +fastqc_data.txt  fastqc_report.html  Icons  Images  summary.txt
  
-[vifehe@detritus bonn_data]$ cat Bonn_0_fastq/fastqcRawdata/fastqcRawdata_P1/SN7640211_14074_P1A01_MND1014_1_sequence.fq_fastqc_folder/summary.txt +#examine Summary.txt output 
 +[vifehe@detritus SN7640211_14074_P1A01_MND1014_1_sequence.fq_fastqc]$ cat summary.txt
 PASS Basic Statistics SN7640211_14074_P1A01_MND1014_1_sequence.fq.gz PASS Basic Statistics SN7640211_14074_P1A01_MND1014_1_sequence.fq.gz
 PASS Per base sequence quality SN7640211_14074_P1A01_MND1014_1_sequence.fq.gz PASS Per base sequence quality SN7640211_14074_P1A01_MND1014_1_sequence.fq.gz
Line 57: Line 103:
 PASS Kmer Content SN7640211_14074_P1A01_MND1014_1_sequence.fq.gz PASS Kmer Content SN7640211_14074_P1A01_MND1014_1_sequence.fq.gz
  
 +# examine fastqc_data.txt output
 +[vifehe@detritus SN7640211_14074_P1A01_MND1014_1_sequence.fq_fastqc]$ more -n20  fastqc_data.txt
 +##FastQC 0.10.1
 +>>Basic Statistics pass
 +#Measure Value
 +Filename SN7640211_14074_P1A01_MND1014_1_sequence.fq.gz
 +File type Conventional base calls
 +Encoding Sanger / Illumina 1.9
 +Total Sequences 44012752
 +Filtered Sequences 0
 +Sequence length 101
 +%GC 49
 +>>END_MODULE
 +>>Per base sequence quality pass
 +#Base Mean Median Lower Quartile Upper Quartile 10th Percentile 90th Percentile
 +1 31.64506284451379 33.0 31.0 34.0 28.0 34.0
 +2 31.880190722906853 34.0 31.0 34.0 28.0 34.0
 +3 31.972653289210363 34.0 31.0 34.0 28.0 34.0
 +4 35.39369340049448 37.0 35.0 37.0 32.0 37.0
 +5 35.09201710449735 37.0 35.0 37.0 32.0 37.0
 +6 35.08697933726116 37.0 35.0 37.0 32.0 37.0
 +7 35.06162818448617 37.0 35.0 37.0 32.0 37.0
 +....
 +....
 +....
 +>>Sequence Duplication Levels warn
 +#Total Duplicate Percentage 33.90859348891959
 +#Duplication Level Relative count
 +1 100.0
 +2 29.769972680482393
 +3 10.634235430848415
 +4 4.525832792477321
 +5 1.99009856157457
 +6 1.1335335758380753
 +7 0.6905347682369101
 +8 0.447526904442063
 +9 0.296590343078804
 +10++ 1.4482363062804704
 +>>END_MODULE
 +>>Overrepresented sequences pass
 +>>END_MODULE
 +>>Kmer Content pass
 +>>END_MODULE
 +
 +
 +# Explore html file
 +[vifehe@detritus SN7640211_14074_P1A01_MND1014_1_sequence.fq_fastqc]$ firefox fastqc_report.html 
 +# this opens the file in firefox in which the following pictures can be seen
 +
 +</code>
 +
 +{{:genetica:bioinf_process:per_base_quality.png?200|per base quality}} {{:genetica:bioinf_process:per_sequence_quality.png?200 | per sequence quality}} {{:genetica:bioinf_process:per_base_sequence_content.png?200 | per base sequence content}} 
 +{{:genetica:bioinf_process:per_base_gc_content.png?200 | per base gc content}}
 +{{:genetica:bioinf_process:per_sequence_gc_content.png?200 |per sequence GC content}}
 +{{:genetica:bioinf_process:per_base_n_content.png?200 |per base N content}}{{:genetica:bioinf_process:sequence_length_distribution.png?200 |sequence length distribution}} {{:genetica:bioinf_process:duplication_levels.png?200 | sequence duplication levels}}
 +
 +
 +----
  
 +Because examining each file is time consuming, I've created a couple of scripts with which we can extract the information of our interest:
  
 +[[genetica:bioinf_process:fastqc:script1]] - to summarize output from summary.txt
  
 +[[genetica:bioinf_process:FASTQC:script2]] - to summarize output from fastqc_data.txt
genetica/bioinf_process/fastqc.1426161080.txt.gz · Last modified: 2020/08/04 10:48 (external edit)