The Old Flux Forum
The forum is locked, please visit the new forum at

http://fluxcapacitor.wikidot.com/forum:start

micha.

The Old Flux Forum

Go to the New FLUX CAPACITOR FORUM
 
Back to Flux Capacitor HomepageBack to Flux Capacitor Homepage  HomeHome  CalendarCalendar  FAQFAQ  SearchSearch  MemberlistMemberlist  UsergroupsUsergroups  RegisterRegister  Log inLog in  
Search
 
 

Display results as :
 
Rechercher Advanced Search
Latest topics
» A Dual Sequencing Error Model
Wed May 27, 2009 3:10 pm by micha

» FLUX CAPACITOR version 1.1 build 20090525
Mon May 25, 2009 3:03 pm by Admin

» more potential bugs
Wed May 06, 2009 2:37 am by micha

» Custom gene expression levels
Wed May 06, 2009 2:29 am by McFly

» LPSOLVE installation
Wed May 06, 2009 2:04 am by micha

» What can I do if I lack a reference genome of my species for ab-initio gene prediction?
Thu Mar 19, 2009 11:53 am by Admin

» What can I do if I have short sequence reads, but they are not aligned to the genome?
Thu Mar 19, 2009 11:52 am by Admin

» What do I do if I have reference transcript sequences, but they are not aligned to the genome?
Thu Mar 19, 2009 11:50 am by Admin

» What do I do if I don't have a reference transcriptome for my species?
Thu Mar 19, 2009 11:49 am by Admin

Social bookmarking
Social bookmarking Digg  Social bookmarking Delicious  Social bookmarking Reddit  Social bookmarking Stumbleupon  Social bookmarking Slashdot  Social bookmarking Furl  Social bookmarking Yahoo  Social bookmarking Google  Social bookmarking Blinklist  Social bookmarking Blogmarks  Social bookmarking Technorati  

Bookmark and share the address of Flux on your social bookmarking website

Bookmark and share the address of The Old Flux Forum on your social bookmarking website

Share | 
 

 A Dual Sequencing Error Model

View previous topic View next topic Go down 
AuthorMessage
micha



Posts: 10
Join date: 2009-05-06
Age: 36
Location: Barcelona

PostSubject: A Dual Sequencing Error Model   Sun May 10, 2009 2:22 pm

Hi,

I am currently integrating sequencing error models into the FLUX SIMULATOR and here are my current thoughts on a widely applicable error model that distinguishes mainly two types of errors.

Error types: NGS machines commonly use light of certain wavelengths (i.e., colors) in order to distinguish the bases that are incorporated as sequencing-by-synthesis progresses. In each step (i.e., cycle), usually all 4 colors are measured and the base-caller takes decision which base to call based on characteristics of the 4 measurements. Additionally, synchronizing problems incur that require for comparing measurments of subsequent cycles with each other. Thus, we can roughly distinguish two error types of different nature:


  1. substitutions: something went wrong with the measurement resulting in a wrong base call. The hetorogenous nature of "accidents" across different sequencing platforms makes determenistic modelling difficult. However, we can estimate empirically characteristics of such "accidents" from a given run - e.g., from the control lane in Solexa, or an arbitrary comparable run - once the reads have been aligned to a reference. These attributes are:
    - number of "accidents": mutation errors are clustered by the same cycles they affect, assuming that they have been provoked by the same source.
    - probability of an "accident": relative proportion of reads that support a certain accident compared to all reads
    - severeness of an "accident": distribution of quality scores assigned to nucleotides affected by the "accident"
    Latter "severeness" can be turned into error probabilities for wrong base calls as usually quality scores are logarithmic scores of probabilities. In a simplified model, substitutions afterwards can be distributed according to a general crosstalk matrix of the dyes that have been adopted in the experiment. Under this model, substitution errors are independent of their surrounding sequence, but common regions with problems are observed in different subsets of the reads. To get a visual impression of Solexa "accidents", see here.

  2. duplications/deletions: these include phase synchronizing problems in general, comprising "phasing", "pre-phasing", "empty cycle", "homo-polymers", etc. ... The common attribute is that synchronization has been lost, so we observe the replacement of a substring with a substring of different length. Probabilities clearly depend on both, the original and the replacement string. However, as phasing errors are expected to not extend across many cycles - given that in such cases the read alignment problem would grow very hard and the chance of higher similarities to random other locations would grow - naturally limits the parameter space. In contrast to the mutations, duplications/deleletions are modelled to occur independent of their position in a read, but dependent on their sequence neighborhood.


Finally, also the distribution of qualities for base-calls without "accident" is to be modelled. Here, I plan to derive empiric estimates for each base, assuming independence of the preceeding/subsequent bases in the sequencing process.

micha.
Back to top Go down
View user profile http://www.sammeth.net
David Horner



Posts: 1
Join date: 2009-05-11

PostSubject: re. A Dual Sequencing Error Model   Mon May 11, 2009 8:54 am

It looks good. The only thing I'd say is that I don't think this approach will translate directly to SOLiD sequencing simulation .... but for now that doesn't seem to be the objective, and for my work SOLiD simulation is not part of the objective.

Cheers

David
Back to top Go down
View user profile
micha



Posts: 10
Join date: 2009-05-06
Age: 36
Location: Barcelona

PostSubject: Re: A Dual Sequencing Error Model   Mon May 18, 2009 11:06 am

Hi,

there is a preliminary version of the error model including the positional error models as discussed so far. It is included in the 1.2a release bundle of the FLUX SIMULATOR. As soon as possible I will try to write some documentation on the new (and old) features.

Please tell me what you think

micha.
Back to top Go down
View user profile http://www.sammeth.net
micha



Posts: 10
Join date: 2009-05-06
Age: 36
Location: Barcelona

PostSubject: Position error model, quick instructions   Mon May 18, 2009 12:07 pm



A: in order to enable fastaQ output (C), you need an error model. The model can be estimated either from an alignment or already be established in form of an .err file (format to described elsewhere).

B: If you parse the error profile from an alignment, you may provide a quality threshold on bases that are considered as unproblematic. Every basecall below this threshold is considered as part of an "accident" and correspondingly included in one of the error models.

C: once you have an error model you can activate fastaq output additonal to the bed files

D: The error model splits "accidents" according to their start and extension in the time scale of the experiment. E.g., an accident that is observed from cycle 5 to 7 has length 3, a different "accident" of length 3 could be observed from cycle 30 to 32. These "accidents" may overlap, so the accident of (start,extension) = (5,7) is separated from an accident (5,8).

E: The error models measures the substitution rate for each nucleotide with each other, respectively with "N" (black) split accross the different quality levels. If no substitution rate could be estimated for a certain quality, it gets interpolated from its clostest datapoints.

F: distribution of accident locations along the readlength. Red are cumulative starts, blue are cumulative ends and black shows all accidents overlapping a certain position.

G: distribution of qualities along the read. Red are the minimum, yellow the 1st quartile, green the median, blue the 3rd quartile and black the maximum value that has been recorded for each read position.
Back to top Go down
View user profile http://www.sammeth.net
lisa



Posts: 1
Join date: 2009-05-27

PostSubject: .err file format description   Wed May 27, 2009 12:56 pm

Hello,

is there any kind of file format description for the .err files other than the code?

Regards,
Lisa
Back to top Go down
View user profile
micha



Posts: 10
Join date: 2009-05-06
Age: 36
Location: Barcelona

PostSubject: Re: A Dual Sequencing Error Model   Wed May 27, 2009 3:10 pm

Hi Lisa,

you are right, a documentation of the error file - and what how the program uses them - is more than over-due. The basic idea is to store different probability distributions of the single error models as cumulative distribution functions (CDF) over a discrete value space (e.g., quality values, substitution symbols). As by their nature the number series in a CDF have to be monotonously increasing with (at least) the last value of a series being 1.0. From an informatical point of view, the format is not very exciting - organized in blocks the data is presented in tokens separated by whitespaces. Here, I will describe the format on the basis of the example myLittleRun.err which is included in the demo subfolder of the FLUX SIMULATOR bundle from the download page.



  1. Block describing the general model

    Code:
    #MODEL minQual maxQual tholdQual nrInstances
    p(minQual) p(minQual+1) ... p(maxQual-1) p(maxQual)


    expression
    (example)
    verbose explanation
    #MODELtag introducing the model description block
    minQual
    (-40)
    minimum quality: the minimum value for qualities in the described error models. Currently exclusively integer quality models (as Illumina and phred qualities) are addressed. Therefore, subsequent CDFs over quality spectra have all the length (maxQual - minQual + 1).
    maxQual
    (40)
    maximum quality: highest value of the quality spectrum, an integer - see above.
    tholdQual
    (.)
    the threshold quality: level below which below which all base-calls have been considered "problematic" or "accident", regardless whether the corresponding base had been called correctly or not. If none such threshold has been applied, tholdQual should be set to "."
    nrInstances
    (916311)
    number of instances: on how many observations (i.e., reads) the error model has been estimated on
    p(minQual),
    ...,
    p(maxQual)
    CDF over qualities of "unproblematic" base calls. A base call is considered as unproblematic iff it is (i) correct and (ii) equal or above the level specified by tholdQual.


  2. Blocks defining the quality crosstalk table

    Code:
    #CROSSTALK letter                                                         
    minQual p(A) p(C) p(G) p(N) p(T)                                           
    (minQual+1) p(A) p(C) p(G) p(N) p(T)                                     
    ...                                                                 
    (maxQual-1) p(A) p(C) p(G) p(N) p(T)                                       
    maxQual p(A) p(C) p(G) p(N) p(T)


    #CROSSTALKtag that introduces a crosstalk description block
    letter
    (A)
    Symbol, for which the crosstalk is specified as the observed substitution rates broken down by quality levels.
    minQual
    ...
    maxQual

    (-40,...,40)
    quality level for the following observed substitution rates p(X) apply.
    p(A),p(C),
    p(G),p(N),
    p(T)
    CDF of the symbol specified by letter to be substituted by A, C, G, N, or T as observed for the quality level in this line.


  3. Blocks defining position-based error models

    Code:
    # PositionErrorProfile start length baseProb
    start p(minQual) p(minQual+1) ... p(maxQual-1) p(maxQual)
    (start+1) p(minQual) p(minQual+1) ... p(maxQual-1) p(maxQual)
    ...
    (start+length-1) p(minQual) p(minQual+1) ... p(maxQual-1) p(maxQual)

    #PositionErrorProfiletag that introduces position error profile block
    start
    (26)
    first position in the read affected by this error model (1-based)
    length
    (11)
    extension of the "problem" captured in this error profile. Consequently, the last position affected is (start+length-1).
    baseProb
    (6.875394925958544E-5)
    probability as fraction of reads that shared this problem in the observed dataset. Multiplying this probability with the value nrInstances in the #MODEL block recasts the number of instances in which this error has been observed.


I will write the next days a more detailed descriptio of how the FLUX SIMULATOR program applies the model in order to randomly determine substitutions and qualities for the reads. Everything then will be included in the next download package, which I plan to put online after removal of the issues up to FLUX-5, see bugtracker.

See you,

micha.
Back to top Go down
View user profile http://www.sammeth.net
 

A Dual Sequencing Error Model

View previous topic View next topic Back to top 
Page 1 of 1

 Similar topics

-
» Media Disconnected Networking Error
» BackTrack 5 r1 rtl8187 SIOCSIFFLAGS Unknown error 132
» All A class models Error code 29
» More model test papers for ICSE 2009
» XAMPP Error: Warning: file_get_contents(lang.tmp)

Permissions in this forum:You cannot reply to topics in this forum
The Old Flux Forum :: Flux Simulator :: Discussion-