Hidden Markov Models (HMMs) have many applications in bioinformatics. They are, for example, used to search for patterns in a sequence. Here pattern refers to particular chain of characters arranged in a particular sequence e.g. TATA box or CpG islands.
Patterns can be deterministic or non-deterministic at initial inspection. For example, traffic lights follow a predictable pattern. Yellow follows green and red follows yellow. Weather, however, does not follow a predictable pattern in most parts of the planet. A sunny day can be followed by a rainy day, cloudy day or even another sunny day.
It is necessary to learn markov chains before one can understand hidden markov models.
Suppose we are looking for CpG islands in a sequence. If we are using a probabilistic model, we would want a model where the probability of a symbol depends on the previous symbol. Markov chains is such a model.
A markov chain is a set of states connect by arrows called transitions. Each transition has a probability parameter associated with it. The probability parameter on an arrow from C to G represents the probability of a G following a C.

A finite markov chain is an integer stochastic process, consisting of:
For DNA,

This matrix represents transition probabilities, M = ast. Note that the sum of each vector (row of matrix) is 1. Transition probability is represented as follows:

Note that this is simply conditional probability P(t|s), the probability of s occurring given that t has occurred. A key property of markov chains is that the probability of each symbol xi depends only on the preceding symbol xi-1 rather than the entire sequence. [1]

This equation shows that we need to specify the P(x1), the probability of starting in a particular state in addition to specifying the transition probabilities. We now add two begin state and end state to our model in ensure that the beginning and end are modeled. β is the begin state and ε = end state.
P(x1 = s) = aβs
P(ε|xL = t) = atε
We do not need to associate any probability to the begin and end states. They can just serve as points where transitions begin and end. The end state is useful in modeling distribution of lengths of sequences. The distribution over lengths decays exponentially.
In human genomes the pair CG often transforms to (methyl-C) G which often transforms to TG. Hence the pair CG appears less than expected from what is expected from the independent frequencies of C and G alone. Due to biological reasons, this process is sometimes suppressed in short stretches of genomes such as in the start regions of many genes. These areas are called CpG islands.
To be able to discriminate between CpG island and non-CpG islands, we need to model strings with and without CpG islands as Markov Chains over the same states {A,C,G,T} but different transition probabilities:
We produce a matrix for the + model and another for the - model. To use these models for discrimination, we calculate the log-odds ratio:

This would produce another matrix which should ideally discriminate between CpG islands and others by positive and negative scores. If the ratio is greater than 1, CpG island is more likely.
Using markov chains, we had to build two models, + and -. Now we would like to use one model to do both. To do this, we would need to add both the + and - probabilities into one model. Thus we would end up with two states corresponding to each nucleotide symbol. To void this confusion, we would rename our states from A, C, G, T to A+, C+, G+, T+ and A-, C-, G-, T-.
The transition probabilities in this model are so that within each group they are close to the transition probabilities of the original component model, but there is also a small but finite chance of switching into the other component. Overall there is more chance of switching from ‘+’ to ‘-’ than vice versa, so if left to run free, the model will spend more of its time in the ‘-’ non-island states than in the island states. [1]
A hidden Markov model (HMM) is a statistical model where the system being modeled is assumed to be a Markov process with unknown parameters, and the challenge is to determine the hidden parameters from the observable parameters. The extracted model parameters can then be used to perform further analysis.
In a regular Markov model, the state is directly visible to the observer, and therefore the state transition probabilities are the only parameters. In a hidden Markov model, the state is not directly visible, but variables influenced by the state are visible. Each state has a probability distribution over the possible output tokens. Therefore the sequence of tokens generated by an HMM gives some information about the sequence of states. The essential difference between a Markov chain and a hidden Markov chain is that while there is a 1-1 correspondence between states and symbols in Markov chains, the same isn't true for hidden Markov chains.
Definition: A HMM is a triplet M = (S, Q, T) where:
We now need to distinguish the sequence of states from the sequence of symbols. The path π is a sequence of states. The path itself follows a simple Markov chain, so the probability of a state depends only on the previous state. As in the markov chain model, we can define the state transition probabilities in terms of π:

The probability of going to l given that we are at k. Since we have decoupled the symbols b from the states k, there is no longer a 1-1 correspondence between states and symbols. Thus we must introduce a new set of parameters for the model:

ek(b) is the probability that the symbol b is seen in state k. These are known as emission probabilities.

Where for our convenience we denote π0 = begin and πL+1 = end.
First a state π1 is chosen according to the probabilities a0i. In that state an emission is emitted according to the distribution eπ1 for that state. Then a new state π2 is chosen according to the transition probabilities aπ1i and so forth.
P(X,π) is the joint probability of an observed sequence X and a state sequence π.
In practice, this is not very useful since very often we do not know the path. So we have to estimate the path either by finding the most likely path or using an a posteriori distribution over states.
There are 3 canonical problems associated with HMMs:
Although it is no longer possible to tell what state the system is in by looking at the corresponding symbol, it is often the sequence of underlying states that we are interested in. Decoding is the act of finding out the meaning of a sequence by considering the underlying states. A commonly used method is a dynamic programming algorithm, Viterbi.
In general, many different states can give rise to any particular sequence of symbols. For example the following 3 states give result in CGCG sequence of symbols.
1. C+, G+, C+, G+
2. C-, G-, C-, G-
3. C+, G-, C+. G-
The probability of the first is larger than the second which is larger than the third. So if we are to choose only one path, it is likely that the first one will be chosen.
We can calculate the most probable path in a hidden Markov model using a dynamic programming algorithm.

The most probable path &pi* can be found recursively. Suppose the probability Vk(i) of the most probable path ending in state k with observation i is known for all states k. Then these probabilities can be calculated for observation xi+1 as:

All sequences have to start in state 0 (the begin state), so the initial condition is that V0(0) = 1. By keeping pointers backwards, the actual state sequence can be found by backtracking.

We use logs to convert products to sums to avoid underflow problems.
Viterbi algorithm makes three assumptions:
Since many different state paths can give rise to the same sequence x, we must add the probabilities for all possible paths to obtain the full probability of x,

The number of possible paths π increases exponentially with the length of the sequence, so evaluation by enumerating all paths in not practical. Approximation or enumeration is unnecessary as the full probability can itself be calculated by a similar dynamic programming procedure to the Viterbi algorithm, replacing the maximization steps with sums. This is called the forward algorithm.
The quantity corresponding to the Viterbi variable Vk(i) in the forward algorithm is:
fk(i) = (x1,...xi,π = k)
Note: This post is a summary of chapter 3 of Durbin.