| | | A Dual Sequencing Error Model | |
| | Author | Message |
|---|
micha

Posts: 10 Join date: 2009-05-06 Age: 36 Location: Barcelona
 | Subject: A Dual Sequencing Error Model Sun May 10, 2009 2:22 pm | |
| Hi, I am currently integrating sequencing error models into the FLUX SIMULATOR and here are my current thoughts on a widely applicable error model that distinguishes mainly two types of errors. Error types: NGS machines commonly use light of certain wavelengths (i.e., colors) in order to distinguish the bases that are incorporated as sequencing-by-synthesis progresses. In each step (i.e., cycle), usually all 4 colors are measured and the base-caller takes decision which base to call based on characteristics of the 4 measurements. Additionally, synchronizing problems incur that require for comparing measurments of subsequent cycles with each other. Thus, we can roughly distinguish two error types of different nature:
- substitutions: something went wrong with the measurement resulting in a wrong base call. The hetorogenous nature of "accidents" across different sequencing platforms makes determenistic modelling difficult. However, we can estimate empirically characteristics of such "accidents" from a given run - e.g., from the control lane in Solexa, or an arbitrary comparable run - once the reads have been aligned to a reference. These attributes are:
- number of "accidents": mutation errors are clustered by the same cycles they affect, assuming that they have been provoked by the same source. - probability of an "accident": relative proportion of reads that support a certain accident compared to all reads - severeness of an "accident": distribution of quality scores assigned to nucleotides affected by the "accident" Latter "severeness" can be turned into error probabilities for wrong base calls as usually quality scores are logarithmic scores of probabilities. In a simplified model, substitutions afterwards can be distributed according to a general crosstalk matrix of the dyes that have been adopted in the experiment. Under this model, substitution errors are independent of their surrounding sequence, but common regions with problems are observed in different subsets of the reads. To get a visual impression of Solexa "accidents", see here.
- duplications/deletions: these include phase synchronizing problems in general, comprising "phasing", "pre-phasing", "empty cycle", "homo-polymers", etc. ... The common attribute is that synchronization has been lost, so we observe the replacement of a substring with a substring of different length. Probabilities clearly depend on both, the original and the replacement string. However, as phasing errors are expected to not extend across many cycles - given that in such cases the read alignment problem would grow very hard and the chance of higher similarities to random other locations would grow - naturally limits the parameter space. In contrast to the mutations, duplications/deleletions are modelled to occur independent of their position in a read, but dependent on their sequence neighborhood.
Finally, also the distribution of qualities for base-calls without "accident" is to be modelled. Here, I plan to derive empiric estimates for each base, assuming independence of the preceeding/subsequent bases in the sequencing process. micha. |
|  | | David Horner
Posts: 1 Join date: 2009-05-11
 | Subject: re. A Dual Sequencing Error Model Mon May 11, 2009 8:54 am | |
| It looks good. The only thing I'd say is that I don't think this approach will translate directly to SOLiD sequencing simulation .... but for now that doesn't seem to be the objective, and for my work SOLiD simulation is not part of the objective.
Cheers
David |
|  | | micha

Posts: 10 Join date: 2009-05-06 Age: 36 Location: Barcelona
 | Subject: Re: A Dual Sequencing Error Model Mon May 18, 2009 11:06 am | |
| Hi,
there is a preliminary version of the error model including the positional error models as discussed so far. It is included in the 1.2a release bundle of the FLUX SIMULATOR. As soon as possible I will try to write some documentation on the new (and old) features.
Please tell me what you think
micha. |
|  | | micha

Posts: 10 Join date: 2009-05-06 Age: 36 Location: Barcelona
 | Subject: Position error model, quick instructions Mon May 18, 2009 12:07 pm | |
| A: in order to enable fastaQ output (C), you need an error model. The model can be estimated either from an alignment or already be established in form of an .err file (format to described elsewhere). B: If you parse the error profile from an alignment, you may provide a quality threshold on bases that are considered as unproblematic. Every basecall below this threshold is considered as part of an "accident" and correspondingly included in one of the error models. C: once you have an error model you can activate fastaq output additonal to the bed files D: The error model splits "accidents" according to their start and extension in the time scale of the experiment. E.g., an accident that is observed from cycle 5 to 7 has length 3, a different "accident" of length 3 could be observed from cycle 30 to 32. These "accidents" may overlap, so the accident of (start,extension) = (5,7) is separated from an accident (5,8). E: The error models measures the substitution rate for each nucleotide with each other, respectively with "N" (black) split accross the different quality levels. If no substitution rate could be estimated for a certain quality, it gets interpolated from its clostest datapoints. F: distribution of accident locations along the readlength. Red are cumulative starts, blue are cumulative ends and black shows all accidents overlapping a certain position. G: distribution of qualities along the read. Red are the minimum, yellow the 1st quartile, green the median, blue the 3rd quartile and black the maximum value that has been recorded for each read position. |
|  | | lisa
Posts: 1 Join date: 2009-05-27
 | Subject: .err file format description Wed May 27, 2009 12:56 pm | |
| Hello,
is there any kind of file format description for the .err files other than the code?
Regards, Lisa |
|  | | micha

Posts: 10 Join date: 2009-05-06 Age: 36 Location: Barcelona
 | Subject: Re: A Dual Sequencing Error Model Wed May 27, 2009 3:10 pm | |
| Hi Lisa, you are right, a documentation of the error file - and what how the program uses them - is more than over-due. The basic idea is to store different probability distributions of the single error models as cumulative distribution functions (CDF) over a discrete value space (e.g., quality values, substitution symbols). As by their nature the number series in a CDF have to be monotonously increasing with (at least) the last value of a series being 1.0. From an informatical point of view, the format is not very exciting - organized in blocks the data is presented in tokens separated by whitespaces. Here, I will describe the format on the basis of the example myLittleRun.err which is included in the demo subfolder of the FLUX SIMULATOR bundle from the download page.
Block describing the general model
| Code: | #MODEL minQual maxQual tholdQual nrInstances p(minQual) p(minQual+1) ... p(maxQual-1) p(maxQual) |
expression (example) | verbose explanation | | #MODEL | tag introducing the model description block | minQual (-40) | minimum quality: the minimum value for qualities in the described error models. Currently exclusively integer quality models (as Illumina and phred qualities) are addressed. Therefore, subsequent CDFs over quality spectra have all the length (maxQual - minQual + 1). | maxQual (40) | maximum quality: highest value of the quality spectrum, an integer - see above. | tholdQual (.) | the threshold quality: level below which below which all base-calls have been considered "problematic" or "accident", regardless whether the corresponding base had been called correctly or not. If none such threshold has been applied, tholdQual should be set to "." | nrInstances (916311) | number of instances: on how many observations (i.e., reads) the error model has been estimated on | p(minQual), ..., p(maxQual) | CDF over qualities of "unproblematic" base calls. A base call is considered as unproblematic iff it is (i) correct and (ii) equal or above the level specified by tholdQual. |
Blocks defining the quality crosstalk table
| Code: | #CROSSTALK letter minQual p(A) p(C) p(G) p(N) p(T) (minQual+1) p(A) p(C) p(G) p(N) p(T) ... (maxQual-1) p(A) p(C) p(G) p(N) p(T) maxQual p(A) p(C) p(G) p(N) p(T) |
| #CROSSTALK | tag that introduces a crosstalk description block | letter (A) | Symbol, for which the crosstalk is specified as the observed substitution rates broken down by quality levels. | minQual ... maxQual (-40,...,40) | quality level for the following observed substitution rates p(X) apply. | p(A),p(C), p(G),p(N), p(T) | CDF of the symbol specified by letter to be substituted by A, C, G, N, or T as observed for the quality level in this line. | |
Blocks defining position-based error models
| Code: | # PositionErrorProfile start length baseProb start p(minQual) p(minQual+1) ... p(maxQual-1) p(maxQual) (start+1) p(minQual) p(minQual+1) ... p(maxQual-1) p(maxQual) ... (start+length-1) p(minQual) p(minQual+1) ... p(maxQual-1) p(maxQual) |
| #PositionErrorProfile | tag that introduces position error profile block | start (26) | first position in the read affected by this error model (1-based) | length (11) | extension of the "problem" captured in this error profile. Consequently, the last position affected is (start+length-1). | baseProb (6.875394925958544E-5) | probability as fraction of reads that shared this problem in the observed dataset. Multiplying this probability with the value nrInstances in the #MODEL block recasts the number of instances in which this error has been observed. |
I will write the next days a more detailed descriptio of how the FLUX SIMULATOR program applies the model in order to randomly determine substitutions and qualities for the reads. Everything then will be included in the next download package, which I plan to put online after removal of the issues up to FLUX-5, see bugtracker. See you, micha. |
|  | | | | A Dual Sequencing Error Model | |
|
Similar topics |  |
|
| | Permissions in this forum: | You cannot reply to topics in this forum
| |
| |
| |