annotate hmm/hmm.py @ 3:26d9c0308fcf

updated/added from ecclerig version
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Mon, 09 Mar 2020 17:35:28 +0000
parents
children
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
3
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1 # Natural Language Toolkit: Hidden Markov Model
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
2 #
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
3 # Copyright (C) 2001-2015 NLTK Project
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
4 # Author: Trevor Cohn <tacohn@csse.unimelb.edu.au>
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
5 # Philip Blunsom <pcbl@csse.unimelb.edu.au>
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
6 # Tiago Tresoldi <tiago@tresoldi.pro.br> (fixes)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
7 # Steven Bird <stevenbird1@gmail.com> (fixes)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
8 # Joseph Frazee <jfrazee@mail.utexas.edu> (fixes)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
9 # Steven Xu <xxu@student.unimelb.edu.au> (fixes)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
10 # URL: <http://nltk.org/>
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
11 # For license information, see LICENSE.TXT
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
12
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
13 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
14 Hidden Markov Models (HMMs) largely used to assign the correct label sequence
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
15 to sequential data or assess the probability of a given label and data
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
16 sequence. These models are finite state machines characterised by a number of
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
17 states, transitions between these states, and output symbols emitted while in
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
18 each state. The HMM is an extension to the Markov chain, where each state
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
19 corresponds deterministically to a given event. In the HMM the observation is
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
20 a probabilistic function of the state. HMMs share the Markov chain's
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
21 assumption, being that the probability of transition from one state to another
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
22 only depends on the current state - i.e. the series of states that led to the
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
23 current state are not used. They are also time invariant.
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
24
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
25 The HMM is a directed graph, with probability weighted edges (representing the
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
26 probability of a transition between the source and sink states) where each
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
27 vertex emits an output symbol when entered. The symbol (or observation) is
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
28 non-deterministically generated. For this reason, knowing that a sequence of
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
29 output observations was generated by a given HMM does not mean that the
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
30 corresponding sequence of states (and what the current state is) is known.
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
31 This is the 'hidden' in the hidden markov model.
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
32
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
33 Formally, a HMM can be characterised by:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
34
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
35 - the output observation alphabet. This is the set of symbols which may be
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
36 observed as output of the system.
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
37 - the set of states.
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
38 - the transition probabilities *a_{ij} = P(s_t = j | s_{t-1} = i)*. These
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
39 represent the probability of transition to each state from a given state.
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
40 - the output probability matrix *b_i(k) = P(X_t = o_k | s_t = i)*. These
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
41 represent the probability of observing each symbol in a given state.
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
42 - the initial state distribution. This gives the probability of starting
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
43 in each state.
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
44
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
45 To ground this discussion, take a common NLP application, part-of-speech (POS)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
46 tagging. An HMM is desirable for this task as the highest probability tag
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
47 sequence can be calculated for a given sequence of word forms. This differs
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
48 from other tagging techniques which often tag each word individually, seeking
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
49 to optimise each individual tagging greedily without regard to the optimal
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
50 combination of tags for a larger unit, such as a sentence. The HMM does this
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
51 with the Viterbi algorithm, which efficiently computes the optimal path
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
52 through the graph given the sequence of words forms.
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
53
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
54 In POS tagging the states usually have a 1:1 correspondence with the tag
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
55 alphabet - i.e. each state represents a single tag. The output observation
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
56 alphabet is the set of word forms (the lexicon), and the remaining three
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
57 parameters are derived by a training regime. With this information the
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
58 probability of a given sentence can be easily derived, by simply summing the
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
59 probability of each distinct path through the model. Similarly, the highest
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
60 probability tagging sequence can be derived with the Viterbi algorithm,
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
61 yielding a state sequence which can be mapped into a tag sequence.
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
62
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
63 This discussion assumes that the HMM has been trained. This is probably the
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
64 most difficult task with the model, and requires either MLE estimates of the
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
65 parameters or unsupervised learning using the Baum-Welch algorithm, a variant
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
66 of EM.
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
67
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
68 For more information, please consult the source code for this module,
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
69 which includes extensive demonstration code.
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
70 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
71 from __future__ import print_function, unicode_literals, division
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
72
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
73 import re
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
74 import itertools
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
75
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
76 try:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
77 import numpy as np
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
78 except ImportError:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
79 pass
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
80
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
81 from nltk.probability import (FreqDist, ConditionalFreqDist,
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
82 ConditionalProbDist, DictionaryProbDist,
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
83 DictionaryConditionalProbDist,
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
84 LidstoneProbDist, MutableProbDist,
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
85 MLEProbDist, RandomProbDist)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
86 from nltk.metrics import accuracy
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
87 from nltk.util import LazyMap, unique_list
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
88 from nltk.compat import python_2_unicode_compatible, izip, imap
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
89 from nltk.tag.api import TaggerI
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
90
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
91
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
92 _TEXT = 0 # index of text in a tuple
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
93 _TAG = 1 # index of tag in a tuple
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
94
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
95 def _identity(labeled_symbols):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
96 return labeled_symbols
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
97
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
98 @python_2_unicode_compatible
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
99 class HiddenMarkovModelTagger(TaggerI):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
100 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
101 Hidden Markov model class, a generative model for labelling sequence data.
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
102 These models define the joint probability of a sequence of symbols and
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
103 their labels (state transitions) as the product of the starting state
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
104 probability, the probability of each state transition, and the probability
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
105 of each observation being generated from each state. This is described in
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
106 more detail in the module documentation.
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
107
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
108 This implementation is based on the HMM description in Chapter 8, Huang,
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
109 Acero and Hon, Spoken Language Processing and includes an extension for
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
110 training shallow HMM parsers or specialized HMMs as in Molina et.
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
111 al, 2002. A specialized HMM modifies training data by applying a
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
112 specialization function to create a new training set that is more
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
113 appropriate for sequential tagging with an HMM. A typical use case is
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
114 chunking.
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
115
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
116 :param symbols: the set of output symbols (alphabet)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
117 :type symbols: seq of any
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
118 :param states: a set of states representing state space
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
119 :type states: seq of any
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
120 :param transitions: transition probabilities; Pr(s_i | s_j) is the
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
121 probability of transition from state i given the model is in
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
122 state_j
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
123 :type transitions: ConditionalProbDistI
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
124 :param outputs: output probabilities; Pr(o_k | s_i) is the probability
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
125 of emitting symbol k when entering state i
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
126 :type outputs: ConditionalProbDistI
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
127 :param priors: initial state distribution; Pr(s_i) is the probability
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
128 of starting in state i
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
129 :type priors: ProbDistI
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
130 :param transform: an optional function for transforming training
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
131 instances, defaults to the identity function.
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
132 :type transform: callable
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
133 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
134 def __init__(self, symbols, states, transitions, outputs, priors,
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
135 transform=_identity):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
136 self._symbols = unique_list(symbols)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
137 self._states = unique_list(states)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
138 self._transitions = transitions
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
139 self._outputs = outputs
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
140 self._priors = priors
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
141 self._cache = None
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
142 self._transform = transform
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
143
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
144 @classmethod
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
145 def _train(cls, labeled_sequence, test_sequence=None,
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
146 unlabeled_sequence=None, transform=_identity,
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
147 estimator=None, **kwargs):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
148
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
149 if estimator is None:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
150 def estimator(fd, bins):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
151 return LidstoneProbDist(fd, 0.1, bins)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
152
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
153 labeled_sequence = LazyMap(transform, labeled_sequence)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
154 symbols = unique_list(word for sent in labeled_sequence
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
155 for word, tag in sent)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
156 tag_set = unique_list(tag for sent in labeled_sequence
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
157 for word, tag in sent)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
158
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
159 trainer = HiddenMarkovModelTrainer(tag_set, symbols)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
160 hmm = trainer.train_supervised(labeled_sequence, estimator=estimator)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
161 hmm = cls(hmm._symbols, hmm._states, hmm._transitions, hmm._outputs,
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
162 hmm._priors, transform=transform)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
163
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
164 if test_sequence:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
165 hmm.test(test_sequence, verbose=kwargs.get('verbose', False))
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
166
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
167 if unlabeled_sequence:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
168 max_iterations = kwargs.get('max_iterations', 5)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
169 hmm = trainer.train_unsupervised(unlabeled_sequence, model=hmm,
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
170 max_iterations=max_iterations)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
171 if test_sequence:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
172 hmm.test(test_sequence, verbose=kwargs.get('verbose', False))
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
173
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
174 return hmm
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
175
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
176 @classmethod
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
177 def train(cls, labeled_sequence, test_sequence=None,
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
178 unlabeled_sequence=None, **kwargs):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
179 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
180 Train a new HiddenMarkovModelTagger using the given labeled and
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
181 unlabeled training instances. Testing will be performed if test
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
182 instances are provided.
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
183
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
184 :return: a hidden markov model tagger
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
185 :rtype: HiddenMarkovModelTagger
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
186 :param labeled_sequence: a sequence of labeled training instances,
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
187 i.e. a list of sentences represented as tuples
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
188 :type labeled_sequence: list(list)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
189 :param test_sequence: a sequence of labeled test instances
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
190 :type test_sequence: list(list)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
191 :param unlabeled_sequence: a sequence of unlabeled training instances,
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
192 i.e. a list of sentences represented as words
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
193 :type unlabeled_sequence: list(list)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
194 :param transform: an optional function for transforming training
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
195 instances, defaults to the identity function, see ``transform()``
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
196 :type transform: function
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
197 :param estimator: an optional function or class that maps a
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
198 condition's frequency distribution to its probability
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
199 distribution, defaults to a Lidstone distribution with gamma = 0.1
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
200 :type estimator: class or function
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
201 :param verbose: boolean flag indicating whether training should be
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
202 verbose or include printed output
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
203 :type verbose: bool
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
204 :param max_iterations: number of Baum-Welch interations to perform
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
205 :type max_iterations: int
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
206 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
207 return cls._train(labeled_sequence, test_sequence,
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
208 unlabeled_sequence, **kwargs)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
209
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
210 def probability(self, sequence):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
211 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
212 Returns the probability of the given symbol sequence. If the sequence
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
213 is labelled, then returns the joint probability of the symbol, state
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
214 sequence. Otherwise, uses the forward algorithm to find the
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
215 probability over all label sequences.
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
216
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
217 :return: the probability of the sequence
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
218 :rtype: float
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
219 :param sequence: the sequence of symbols which must contain the TEXT
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
220 property, and optionally the TAG property
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
221 :type sequence: Token
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
222 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
223 return 2**(self.log_probability(self._transform(sequence)))
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
224
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
225 def log_probability(self, sequence):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
226 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
227 Returns the log-probability of the given symbol sequence. If the
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
228 sequence is labelled, then returns the joint log-probability of the
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
229 symbol, state sequence. Otherwise, uses the forward algorithm to find
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
230 the log-probability over all label sequences.
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
231
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
232 :return: the log-probability of the sequence
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
233 :rtype: float
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
234 :param sequence: the sequence of symbols which must contain the TEXT
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
235 property, and optionally the TAG property
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
236 :type sequence: Token
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
237 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
238 sequence = self._transform(sequence)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
239
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
240 T = len(sequence)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
241
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
242 if T > 0 and sequence[0][_TAG]:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
243 last_state = sequence[0][_TAG]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
244 p = self._priors.logprob(last_state) + \
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
245 self._output_logprob(last_state, sequence[0][_TEXT])
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
246 for t in range(1, T):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
247 state = sequence[t][_TAG]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
248 p += self._transitions[last_state].logprob(state) + \
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
249 self._output_logprob(state, sequence[t][_TEXT])
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
250 last_state = state
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
251 return p
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
252 else:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
253 alpha = self._forward_probability(sequence)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
254 p = logsumexp2(alpha[T-1])
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
255 return p
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
256
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
257 def tag(self, unlabeled_sequence):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
258 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
259 Tags the sequence with the highest probability state sequence. This
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
260 uses the best_path method to find the Viterbi path.
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
261
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
262 :return: a labelled sequence of symbols
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
263 :rtype: list
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
264 :param unlabeled_sequence: the sequence of unlabeled symbols
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
265 :type unlabeled_sequence: list
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
266 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
267 unlabeled_sequence = self._transform(unlabeled_sequence)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
268 return self._tag(unlabeled_sequence)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
269
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
270 def _tag(self, unlabeled_sequence):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
271 path = self._best_path(unlabeled_sequence)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
272 return list(izip(unlabeled_sequence, path))
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
273
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
274 def _output_logprob(self, state, symbol):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
275 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
276 :return: the log probability of the symbol being observed in the given
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
277 state
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
278 :rtype: float
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
279 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
280 return self._outputs[state].logprob(symbol)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
281
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
282 def _create_cache(self):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
283 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
284 The cache is a tuple (P, O, X, S) where:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
285
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
286 - S maps symbols to integers. I.e., it is the inverse
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
287 mapping from self._symbols; for each symbol s in
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
288 self._symbols, the following is true::
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
289
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
290 self._symbols[S[s]] == s
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
291
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
292 - O is the log output probabilities::
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
293
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
294 O[i,k] = log( P(token[t]=sym[k]|tag[t]=state[i]) )
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
295
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
296 - X is the log transition probabilities::
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
297
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
298 X[i,j] = log( P(tag[t]=state[j]|tag[t-1]=state[i]) )
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
299
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
300 - P is the log prior probabilities::
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
301
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
302 P[i] = log( P(tag[0]=state[i]) )
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
303 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
304 if not self._cache:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
305 N = len(self._states)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
306 M = len(self._symbols)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
307 P = np.zeros(N, np.float32)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
308 X = np.zeros((N, N), np.float32)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
309 O = np.zeros((N, M), np.float32)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
310 for i in range(N):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
311 si = self._states[i]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
312 P[i] = self._priors.logprob(si)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
313 for j in range(N):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
314 X[i, j] = self._transitions[si].logprob(self._states[j])
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
315 for k in range(M):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
316 O[i, k] = self._output_logprob(si, self._symbols[k])
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
317 S = {}
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
318 for k in range(M):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
319 S[self._symbols[k]] = k
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
320 self._cache = (P, O, X, S)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
321
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
322 def _update_cache(self, symbols):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
323 # add new symbols to the symbol table and repopulate the output
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
324 # probabilities and symbol table mapping
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
325 if symbols:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
326 self._create_cache()
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
327 P, O, X, S = self._cache
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
328 for symbol in symbols:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
329 if symbol not in self._symbols:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
330 self._cache = None
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
331 self._symbols.append(symbol)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
332 # don't bother with the work if there aren't any new symbols
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
333 if not self._cache:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
334 N = len(self._states)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
335 M = len(self._symbols)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
336 Q = O.shape[1]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
337 # add new columns to the output probability table without
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
338 # destroying the old probabilities
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
339 O = np.hstack([O, np.zeros((N, M - Q), np.float32)])
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
340 for i in range(N):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
341 si = self._states[i]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
342 # only calculate probabilities for new symbols
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
343 for k in range(Q, M):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
344 O[i, k] = self._output_logprob(si, self._symbols[k])
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
345 # only create symbol mappings for new symbols
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
346 for k in range(Q, M):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
347 S[self._symbols[k]] = k
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
348 self._cache = (P, O, X, S)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
349
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
350 def reset_cache(self):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
351 self._cache = None
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
352
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
353 def best_path(self, unlabeled_sequence):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
354 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
355 Returns the state sequence of the optimal (most probable) path through
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
356 the HMM. Uses the Viterbi algorithm to calculate this part by dynamic
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
357 programming.
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
358
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
359 :return: the state sequence
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
360 :rtype: sequence of any
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
361 :param unlabeled_sequence: the sequence of unlabeled symbols
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
362 :type unlabeled_sequence: list
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
363 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
364 unlabeled_sequence = self._transform(unlabeled_sequence)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
365 return self._best_path(unlabeled_sequence)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
366
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
367 def _best_path(self, unlabeled_sequence):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
368 T = len(unlabeled_sequence)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
369 N = len(self._states)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
370 self._create_cache()
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
371 self._update_cache(unlabeled_sequence)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
372 P, O, X, S = self._cache
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
373
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
374 V = np.zeros((T, N), np.float32)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
375 B = -np.ones((T, N), np.int)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
376
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
377 V[0] = P + O[:, S[unlabeled_sequence[0]]]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
378 for t in range(1, T):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
379 for j in range(N):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
380 vs = V[t-1, :] + X[:, j]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
381 best = np.argmax(vs)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
382 V[t, j] = vs[best] + O[j, S[unlabeled_sequence[t]]]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
383 B[t, j] = best
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
384
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
385 current = np.argmax(V[T-1,:])
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
386 sequence = [current]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
387 for t in range(T-1, 0, -1):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
388 last = B[t, current]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
389 sequence.append(last)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
390 current = last
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
391
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
392 sequence.reverse()
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
393 return list(map(self._states.__getitem__, sequence))
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
394
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
395 def best_path_simple(self, unlabeled_sequence):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
396 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
397 Returns the state sequence of the optimal (most probable) path through
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
398 the HMM. Uses the Viterbi algorithm to calculate this part by dynamic
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
399 programming. This uses a simple, direct method, and is included for
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
400 teaching purposes.
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
401
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
402 :return: the state sequence
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
403 :rtype: sequence of any
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
404 :param unlabeled_sequence: the sequence of unlabeled symbols
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
405 :type unlabeled_sequence: list
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
406 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
407 unlabeled_sequence = self._transform(unlabeled_sequence)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
408 return self._best_path_simple(unlabeled_sequence)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
409
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
410 def _best_path_simple(self, unlabeled_sequence):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
411 T = len(unlabeled_sequence)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
412 N = len(self._states)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
413 V = np.zeros((T, N), np.float64)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
414 B = {}
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
415
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
416 # find the starting log probabilities for each state
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
417 symbol = unlabeled_sequence[0]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
418 for i, state in enumerate(self._states):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
419 V[0, i] = self._priors.logprob(state) + \
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
420 self._output_logprob(state, symbol)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
421 B[0, state] = None
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
422
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
423 # find the maximum log probabilities for reaching each state at time t
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
424 for t in range(1, T):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
425 symbol = unlabeled_sequence[t]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
426 for j in range(N):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
427 sj = self._states[j]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
428 best = None
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
429 for i in range(N):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
430 si = self._states[i]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
431 va = V[t-1, i] + self._transitions[si].logprob(sj)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
432 if not best or va > best[0]:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
433 best = (va, si)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
434 V[t, j] = best[0] + self._output_logprob(sj, symbol)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
435 B[t, sj] = best[1]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
436
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
437 # find the highest probability final state
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
438 best = None
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
439 for i in range(N):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
440 val = V[T-1, i]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
441 if not best or val > best[0]:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
442 best = (val, self._states[i])
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
443
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
444 # traverse the back-pointers B to find the state sequence
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
445 current = best[1]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
446 sequence = [current]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
447 for t in range(T-1, 0, -1):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
448 last = B[t, current]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
449 sequence.append(last)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
450 current = last
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
451
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
452 sequence.reverse()
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
453 return sequence
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
454
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
455 def random_sample(self, rng, length):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
456 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
457 Randomly sample the HMM to generate a sentence of a given length. This
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
458 samples the prior distribution then the observation distribution and
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
459 transition distribution for each subsequent observation and state.
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
460 This will mostly generate unintelligible garbage, but can provide some
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
461 amusement.
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
462
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
463 :return: the randomly created state/observation sequence,
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
464 generated according to the HMM's probability
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
465 distributions. The SUBTOKENS have TEXT and TAG
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
466 properties containing the observation and state
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
467 respectively.
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
468 :rtype: list
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
469 :param rng: random number generator
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
470 :type rng: Random (or any object with a random() method)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
471 :param length: desired output length
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
472 :type length: int
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
473 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
474
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
475 # sample the starting state and symbol prob dists
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
476 tokens = []
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
477 state = self._sample_probdist(self._priors, rng.random(), self._states)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
478 symbol = self._sample_probdist(self._outputs[state],
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
479 rng.random(), self._symbols)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
480 tokens.append((symbol, state))
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
481
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
482 for i in range(1, length):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
483 # sample the state transition and symbol prob dists
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
484 state = self._sample_probdist(self._transitions[state],
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
485 rng.random(), self._states)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
486 symbol = self._sample_probdist(self._outputs[state],
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
487 rng.random(), self._symbols)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
488 tokens.append((symbol, state))
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
489
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
490 return tokens
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
491
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
492 def _sample_probdist(self, probdist, p, samples):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
493 cum_p = 0
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
494 for sample in samples:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
495 add_p = probdist.prob(sample)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
496 if cum_p <= p <= cum_p + add_p:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
497 return sample
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
498 cum_p += add_p
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
499 raise Exception('Invalid probability distribution - '
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
500 'does not sum to one')
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
501
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
502 def entropy(self, unlabeled_sequence):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
503 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
504 Returns the entropy over labellings of the given sequence. This is
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
505 given by::
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
506
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
507 H(O) = - sum_S Pr(S | O) log Pr(S | O)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
508
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
509 where the summation ranges over all state sequences, S. Let
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
510 *Z = Pr(O) = sum_S Pr(S, O)}* where the summation ranges over all state
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
511 sequences and O is the observation sequence. As such the entropy can
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
512 be re-expressed as::
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
513
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
514 H = - sum_S Pr(S | O) log [ Pr(S, O) / Z ]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
515 = log Z - sum_S Pr(S | O) log Pr(S, 0)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
516 = log Z - sum_S Pr(S | O) [ log Pr(S_0) + sum_t Pr(S_t | S_{t-1}) + sum_t Pr(O_t | S_t) ]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
517
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
518 The order of summation for the log terms can be flipped, allowing
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
519 dynamic programming to be used to calculate the entropy. Specifically,
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
520 we use the forward and backward probabilities (alpha, beta) giving::
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
521
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
522 H = log Z - sum_s0 alpha_0(s0) beta_0(s0) / Z * log Pr(s0)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
523 + sum_t,si,sj alpha_t(si) Pr(sj | si) Pr(O_t+1 | sj) beta_t(sj) / Z * log Pr(sj | si)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
524 + sum_t,st alpha_t(st) beta_t(st) / Z * log Pr(O_t | st)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
525
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
526 This simply uses alpha and beta to find the probabilities of partial
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
527 sequences, constrained to include the given state(s) at some point in
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
528 time.
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
529 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
530 unlabeled_sequence = self._transform(unlabeled_sequence)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
531
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
532 T = len(unlabeled_sequence)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
533 N = len(self._states)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
534
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
535 alpha = self._forward_probability(unlabeled_sequence)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
536 beta = self._backward_probability(unlabeled_sequence)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
537 normalisation = logsumexp2(alpha[T-1])
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
538
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
539 entropy = normalisation
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
540
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
541 # starting state, t = 0
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
542 for i, state in enumerate(self._states):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
543 p = 2**(alpha[0, i] + beta[0, i] - normalisation)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
544 entropy -= p * self._priors.logprob(state)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
545 #print 'p(s_0 = %s) =' % state, p
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
546
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
547 # state transitions
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
548 for t0 in range(T - 1):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
549 t1 = t0 + 1
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
550 for i0, s0 in enumerate(self._states):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
551 for i1, s1 in enumerate(self._states):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
552 p = 2**(alpha[t0, i0] + self._transitions[s0].logprob(s1) +
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
553 self._outputs[s1].logprob(
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
554 unlabeled_sequence[t1][_TEXT]) +
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
555 beta[t1, i1] - normalisation)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
556 entropy -= p * self._transitions[s0].logprob(s1)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
557 #print 'p(s_%d = %s, s_%d = %s) =' % (t0, s0, t1, s1), p
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
558
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
559 # symbol emissions
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
560 for t in range(T):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
561 for i, state in enumerate(self._states):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
562 p = 2**(alpha[t, i] + beta[t, i] - normalisation)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
563 entropy -= p * self._outputs[state].logprob(
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
564 unlabeled_sequence[t][_TEXT])
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
565 #print 'p(s_%d = %s) =' % (t, state), p
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
566
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
567 return entropy
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
568
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
569 def point_entropy(self, unlabeled_sequence):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
570 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
571 Returns the pointwise entropy over the possible states at each
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
572 position in the chain, given the observation sequence.
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
573 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
574 unlabeled_sequence = self._transform(unlabeled_sequence)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
575
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
576 T = len(unlabeled_sequence)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
577 N = len(self._states)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
578
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
579 alpha = self._forward_probability(unlabeled_sequence)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
580 beta = self._backward_probability(unlabeled_sequence)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
581 normalisation = logsumexp2(alpha[T-1])
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
582
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
583 entropies = np.zeros(T, np.float64)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
584 probs = np.zeros(N, np.float64)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
585 for t in range(T):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
586 for s in range(N):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
587 probs[s] = alpha[t, s] + beta[t, s] - normalisation
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
588
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
589 for s in range(N):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
590 entropies[t] -= 2**(probs[s]) * probs[s]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
591
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
592 return entropies
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
593
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
594 def _exhaustive_entropy(self, unlabeled_sequence):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
595 unlabeled_sequence = self._transform(unlabeled_sequence)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
596
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
597 T = len(unlabeled_sequence)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
598 N = len(self._states)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
599
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
600 labellings = [[state] for state in self._states]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
601 for t in range(T - 1):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
602 current = labellings
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
603 labellings = []
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
604 for labelling in current:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
605 for state in self._states:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
606 labellings.append(labelling + [state])
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
607
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
608 log_probs = []
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
609 for labelling in labellings:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
610 labeled_sequence = unlabeled_sequence[:]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
611 for t, label in enumerate(labelling):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
612 labeled_sequence[t] = (labeled_sequence[t][_TEXT], label)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
613 lp = self.log_probability(labeled_sequence)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
614 log_probs.append(lp)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
615 normalisation = _log_add(*log_probs)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
616
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
617 #ps = zeros((T, N), float64)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
618 #for labelling, lp in zip(labellings, log_probs):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
619 #for t in range(T):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
620 #ps[t, self._states.index(labelling[t])] += \
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
621 # 2**(lp - normalisation)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
622
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
623 #for t in range(T):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
624 #print 'prob[%d] =' % t, ps[t]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
625
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
626 entropy = 0
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
627 for lp in log_probs:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
628 lp -= normalisation
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
629 entropy -= 2**(lp) * lp
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
630
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
631 return entropy
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
632
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
633 def _exhaustive_point_entropy(self, unlabeled_sequence):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
634 unlabeled_sequence = self._transform(unlabeled_sequence)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
635
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
636 T = len(unlabeled_sequence)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
637 N = len(self._states)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
638
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
639 labellings = [[state] for state in self._states]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
640 for t in range(T - 1):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
641 current = labellings
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
642 labellings = []
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
643 for labelling in current:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
644 for state in self._states:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
645 labellings.append(labelling + [state])
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
646
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
647 log_probs = []
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
648 for labelling in labellings:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
649 labelled_sequence = unlabeled_sequence[:]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
650 for t, label in enumerate(labelling):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
651 labelled_sequence[t] = (labelled_sequence[t][_TEXT], label)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
652 lp = self.log_probability(labelled_sequence)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
653 log_probs.append(lp)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
654
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
655 normalisation = _log_add(*log_probs)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
656
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
657 probabilities = _ninf_array((T,N))
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
658
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
659 for labelling, lp in zip(labellings, log_probs):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
660 lp -= normalisation
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
661 for t, label in enumerate(labelling):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
662 index = self._states.index(label)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
663 probabilities[t, index] = _log_add(probabilities[t, index], lp)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
664
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
665 entropies = np.zeros(T, np.float64)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
666 for t in range(T):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
667 for s in range(N):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
668 entropies[t] -= 2**(probabilities[t, s]) * probabilities[t, s]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
669
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
670 return entropies
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
671
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
672 def _transitions_matrix(self):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
673 """ Return a matrix of transition log probabilities. """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
674 trans_iter = (self._transitions[sj].logprob(si)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
675 for sj in self._states
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
676 for si in self._states)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
677
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
678 transitions_logprob = np.fromiter(trans_iter, dtype=np.float64)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
679 N = len(self._states)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
680 return transitions_logprob.reshape((N, N)).T
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
681
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
682 def _outputs_vector(self, symbol):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
683 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
684 Return a vector with log probabilities of emitting a symbol
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
685 when entering states.
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
686 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
687 out_iter = (self._output_logprob(sj, symbol) for sj in self._states)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
688 return np.fromiter(out_iter, dtype=np.float64)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
689
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
690 def _forward_probability(self, unlabeled_sequence):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
691 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
692 Return the forward probability matrix, a T by N array of
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
693 log-probabilities, where T is the length of the sequence and N is the
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
694 number of states. Each entry (t, s) gives the probability of being in
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
695 state s at time t after observing the partial symbol sequence up to
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
696 and including t.
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
697
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
698 :param unlabeled_sequence: the sequence of unlabeled symbols
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
699 :type unlabeled_sequence: list
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
700 :return: the forward log probability matrix
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
701 :rtype: array
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
702 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
703 T = len(unlabeled_sequence)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
704 N = len(self._states)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
705 alpha = _ninf_array((T, N))
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
706
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
707 transitions_logprob = self._transitions_matrix()
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
708
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
709 # Initialization
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
710 symbol = unlabeled_sequence[0][_TEXT]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
711 for i, state in enumerate(self._states):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
712 alpha[0, i] = self._priors.logprob(state) + \
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
713 self._output_logprob(state, symbol)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
714
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
715 # Induction
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
716 for t in range(1, T):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
717 symbol = unlabeled_sequence[t][_TEXT]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
718 output_logprob = self._outputs_vector(symbol)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
719
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
720 for i in range(N):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
721 summand = alpha[t-1] + transitions_logprob[i]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
722 alpha[t, i] = logsumexp2(summand) + output_logprob[i]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
723
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
724 return alpha
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
725
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
726 def _backward_probability(self, unlabeled_sequence):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
727 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
728 Return the backward probability matrix, a T by N array of
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
729 log-probabilities, where T is the length of the sequence and N is the
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
730 number of states. Each entry (t, s) gives the probability of being in
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
731 state s at time t after observing the partial symbol sequence from t
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
732 .. T.
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
733
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
734 :return: the backward log probability matrix
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
735 :rtype: array
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
736 :param unlabeled_sequence: the sequence of unlabeled symbols
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
737 :type unlabeled_sequence: list
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
738 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
739 T = len(unlabeled_sequence)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
740 N = len(self._states)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
741 beta = _ninf_array((T, N))
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
742
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
743 transitions_logprob = self._transitions_matrix().T
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
744
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
745 # initialise the backward values;
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
746 # "1" is an arbitrarily chosen value from Rabiner tutorial
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
747 beta[T-1, :] = np.log2(1)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
748
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
749 # inductively calculate remaining backward values
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
750 for t in range(T-2, -1, -1):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
751 symbol = unlabeled_sequence[t+1][_TEXT]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
752 outputs = self._outputs_vector(symbol)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
753
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
754 for i in range(N):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
755 summand = transitions_logprob[i] + beta[t+1] + outputs
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
756 beta[t, i] = logsumexp2(summand)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
757
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
758 return beta
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
759
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
760 def test(self, test_sequence, verbose=False, **kwargs):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
761 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
762 Tests the HiddenMarkovModelTagger instance.
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
763
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
764 :param test_sequence: a sequence of labeled test instances
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
765 :type test_sequence: list(list)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
766 :param verbose: boolean flag indicating whether training should be
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
767 verbose or include printed output
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
768 :type verbose: bool
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
769 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
770
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
771 def words(sent):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
772 return [word for (word, tag) in sent]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
773
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
774 def tags(sent):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
775 return [tag for (word, tag) in sent]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
776
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
777 def flatten(seq):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
778 return list(itertools.chain(*seq))
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
779
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
780 test_sequence = self._transform(test_sequence)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
781 predicted_sequence = list(imap(self._tag, imap(words, test_sequence)))
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
782
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
783 if verbose:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
784 for test_sent, predicted_sent in izip(test_sequence, predicted_sequence):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
785 print('Test:',
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
786 ' '.join('%s/%s' % (token, tag)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
787 for (token, tag) in test_sent))
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
788 print()
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
789 print('Untagged:',
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
790 ' '.join("%s" % token for (token, tag) in test_sent))
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
791 print()
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
792 print('HMM-tagged:',
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
793 ' '.join('%s/%s' % (token, tag)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
794 for (token, tag) in predicted_sent))
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
795 print()
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
796 print('Entropy:',
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
797 self.entropy([(token, None) for
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
798 (token, tag) in predicted_sent]))
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
799 print()
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
800 print('-' * 60)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
801
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
802 test_tags = flatten(imap(tags, test_sequence))
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
803 predicted_tags = flatten(imap(tags, predicted_sequence))
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
804
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
805 acc = accuracy(test_tags, predicted_tags)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
806 count = sum(len(sent) for sent in test_sequence)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
807 print('accuracy over %d tokens: %.2f' % (count, acc * 100))
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
808
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
809 def __repr__(self):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
810 return ('<HiddenMarkovModelTagger %d states and %d output symbols>'
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
811 % (len(self._states), len(self._symbols)))
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
812
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
813
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
814 class HiddenMarkovModelTrainer(object):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
815 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
816 Algorithms for learning HMM parameters from training data. These include
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
817 both supervised learning (MLE) and unsupervised learning (Baum-Welch).
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
818
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
819 Creates an HMM trainer to induce an HMM with the given states and
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
820 output symbol alphabet. A supervised and unsupervised training
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
821 method may be used. If either of the states or symbols are not given,
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
822 these may be derived from supervised training.
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
823
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
824 :param states: the set of state labels
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
825 :type states: sequence of any
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
826 :param symbols: the set of observation symbols
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
827 :type symbols: sequence of any
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
828 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
829 def __init__(self, states=None, symbols=None):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
830 self._states = (states if states else [])
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
831 self._symbols = (symbols if symbols else [])
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
832
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
833 def train(self, labeled_sequences=None, unlabeled_sequences=None,
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
834 **kwargs):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
835 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
836 Trains the HMM using both (or either of) supervised and unsupervised
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
837 techniques.
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
838
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
839 :return: the trained model
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
840 :rtype: HiddenMarkovModelTagger
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
841 :param labelled_sequences: the supervised training data, a set of
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
842 labelled sequences of observations
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
843 :type labelled_sequences: list
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
844 :param unlabeled_sequences: the unsupervised training data, a set of
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
845 sequences of observations
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
846 :type unlabeled_sequences: list
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
847 :param kwargs: additional arguments to pass to the training methods
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
848 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
849 assert labeled_sequences or unlabeled_sequences
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
850 model = None
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
851 if labeled_sequences:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
852 model = self.train_supervised(labeled_sequences, **kwargs)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
853 if unlabeled_sequences:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
854 if model: kwargs['model'] = model
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
855 model = self.train_unsupervised(unlabeled_sequences, **kwargs)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
856 return model
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
857
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
858
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
859 def _baum_welch_step(self, sequence, model, symbol_to_number):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
860 N = len(model._states)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
861 M = len(model._symbols)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
862 T = len(sequence)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
863
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
864 # compute forward and backward probabilities
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
865 alpha = model._forward_probability(sequence)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
866 beta = model._backward_probability(sequence)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
867
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
868 # find the log probability of the sequence
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
869 lpk = logsumexp2(alpha[T-1])
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
870
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
871 A_numer = _ninf_array((N, N))
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
872 B_numer = _ninf_array((N, M))
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
873 A_denom = _ninf_array(N)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
874 B_denom = _ninf_array(N)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
875
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
876 transitions_logprob = model._transitions_matrix().T
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
877
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
878 for t in range(T):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
879 symbol = sequence[t][_TEXT] # not found? FIXME
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
880 next_symbol = None
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
881 if t < T - 1:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
882 next_symbol = sequence[t+1][_TEXT] # not found? FIXME
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
883 xi = symbol_to_number[symbol]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
884
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
885 next_outputs_logprob = model._outputs_vector(next_symbol)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
886 alpha_plus_beta = alpha[t] + beta[t]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
887
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
888 if t < T - 1:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
889 numer_add = transitions_logprob + next_outputs_logprob + \
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
890 beta[t+1] + alpha[t].reshape(N, 1)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
891 A_numer = np.logaddexp2(A_numer, numer_add)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
892 A_denom = np.logaddexp2(A_denom, alpha_plus_beta)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
893 else:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
894 B_denom = np.logaddexp2(A_denom, alpha_plus_beta)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
895
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
896 B_numer[:,xi] = np.logaddexp2(B_numer[:,xi], alpha_plus_beta)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
897
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
898 return lpk, A_numer, A_denom, B_numer, B_denom
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
899
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
900 def train_unsupervised(self, unlabeled_sequences, update_outputs=True,
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
901 **kwargs):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
902 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
903 Trains the HMM using the Baum-Welch algorithm to maximise the
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
904 probability of the data sequence. This is a variant of the EM
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
905 algorithm, and is unsupervised in that it doesn't need the state
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
906 sequences for the symbols. The code is based on 'A Tutorial on Hidden
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
907 Markov Models and Selected Applications in Speech Recognition',
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
908 Lawrence Rabiner, IEEE, 1989.
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
909
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
910 :return: the trained model
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
911 :rtype: HiddenMarkovModelTagger
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
912 :param unlabeled_sequences: the training data, a set of
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
913 sequences of observations
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
914 :type unlabeled_sequences: list
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
915
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
916 kwargs may include following parameters:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
917
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
918 :param model: a HiddenMarkovModelTagger instance used to begin
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
919 the Baum-Welch algorithm
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
920 :param max_iterations: the maximum number of EM iterations
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
921 :param convergence_logprob: the maximum change in log probability to
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
922 allow convergence
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
923 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
924
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
925 # create a uniform HMM, which will be iteratively refined, unless
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
926 # given an existing model
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
927 model = kwargs.get('model')
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
928 if not model:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
929 priors = RandomProbDist(self._states)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
930 transitions = DictionaryConditionalProbDist(
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
931 dict((state, RandomProbDist(self._states))
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
932 for state in self._states))
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
933 outputs = DictionaryConditionalProbDist(
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
934 dict((state, RandomProbDist(self._symbols))
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
935 for state in self._states))
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
936 model = HiddenMarkovModelTagger(self._symbols, self._states,
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
937 transitions, outputs, priors)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
938
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
939 self._states = model._states
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
940 self._symbols = model._symbols
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
941
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
942 N = len(self._states)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
943 M = len(self._symbols)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
944 symbol_numbers = dict((sym, i) for i, sym in enumerate(self._symbols))
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
945
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
946 # update model prob dists so that they can be modified
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
947 # model._priors = MutableProbDist(model._priors, self._states)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
948
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
949 model._transitions = DictionaryConditionalProbDist(
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
950 dict((s, MutableProbDist(model._transitions[s], self._states))
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
951 for s in self._states))
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
952
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
953 if update_outputs:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
954 model._outputs = DictionaryConditionalProbDist(
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
955 dict((s, MutableProbDist(model._outputs[s], self._symbols))
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
956 for s in self._states))
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
957
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
958 model.reset_cache()
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
959
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
960 # iterate until convergence
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
961 converged = False
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
962 last_logprob = None
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
963 iteration = 0
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
964 max_iterations = kwargs.get('max_iterations', 1000)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
965 epsilon = kwargs.get('convergence_logprob', 1e-6)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
966
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
967 while not converged and iteration < max_iterations:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
968 A_numer = _ninf_array((N, N))
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
969 B_numer = _ninf_array((N, M))
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
970 A_denom = _ninf_array(N)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
971 B_denom = _ninf_array(N)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
972
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
973 logprob = 0
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
974 for sequence in unlabeled_sequences:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
975 sequence = list(sequence)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
976 if not sequence:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
977 continue
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
978
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
979 (lpk, seq_A_numer, seq_A_denom,
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
980 seq_B_numer, seq_B_denom) = self._baum_welch_step(sequence, model, symbol_numbers)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
981
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
982 # add these sums to the global A and B values
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
983 for i in range(N):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
984 A_numer[i] = np.logaddexp2(A_numer[i], seq_A_numer[i]-lpk)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
985 B_numer[i] = np.logaddexp2(B_numer[i], seq_B_numer[i]-lpk)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
986
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
987 A_denom = np.logaddexp2(A_denom, seq_A_denom-lpk)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
988 B_denom = np.logaddexp2(B_denom, seq_B_denom-lpk)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
989
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
990 logprob += lpk
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
991
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
992 # use the calculated values to update the transition and output
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
993 # probability values
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
994 for i in range(N):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
995 logprob_Ai = A_numer[i] - A_denom[i]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
996 logprob_Bi = B_numer[i] - B_denom[i]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
997
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
998 # We should normalize all probabilities (see p.391 Huang et al)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
999 # Let sum(P) be K.
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1000 # We can divide each Pi by K to make sum(P) == 1.
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1001 # Pi' = Pi/K
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1002 # log2(Pi') = log2(Pi) - log2(K)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1003 logprob_Ai -= logsumexp2(logprob_Ai)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1004 logprob_Bi -= logsumexp2(logprob_Bi)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1005
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1006 # update output and transition probabilities
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1007 si = self._states[i]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1008
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1009 for j in range(N):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1010 sj = self._states[j]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1011 model._transitions[si].update(sj, logprob_Ai[j])
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1012
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1013 if update_outputs:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1014 for k in range(M):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1015 ok = self._symbols[k]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1016 model._outputs[si].update(ok, logprob_Bi[k])
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1017
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1018 # Rabiner says the priors don't need to be updated. I don't
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1019 # believe him. FIXME
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1020
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1021 # test for convergence
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1022 if iteration > 0 and abs(logprob - last_logprob) < epsilon:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1023 converged = True
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1024
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1025 print('iteration', iteration, 'logprob', logprob)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1026 iteration += 1
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1027 last_logprob = logprob
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1028
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1029 return model
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1030
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1031 def train_supervised(self, labelled_sequences, estimator=None):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1032 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1033 Supervised training maximising the joint probability of the symbol and
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1034 state sequences. This is done via collecting frequencies of
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1035 transitions between states, symbol observations while within each
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1036 state and which states start a sentence. These frequency distributions
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1037 are then normalised into probability estimates, which can be
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1038 smoothed if desired.
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1039
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1040 :return: the trained model
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1041 :rtype: HiddenMarkovModelTagger
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1042 :param labelled_sequences: the training data, a set of
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1043 labelled sequences of observations
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1044 :type labelled_sequences: list
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1045 :param estimator: a function taking
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1046 a FreqDist and a number of bins and returning a CProbDistI;
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1047 otherwise a MLE estimate is used
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1048 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1049
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1050 # default to the MLE estimate
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1051 if estimator is None:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1052 estimator = lambda fdist, bins: MLEProbDist(fdist)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1053
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1054 # count occurrences of starting states, transitions out of each state
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1055 # and output symbols observed in each state
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1056 known_symbols = set(self._symbols)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1057 known_states = set(self._states)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1058
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1059 starting = FreqDist()
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1060 transitions = ConditionalFreqDist()
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1061 outputs = ConditionalFreqDist()
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1062 for sequence in labelled_sequences:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1063 lasts = None
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1064 for token in sequence:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1065 state = token[_TAG]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1066 symbol = token[_TEXT]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1067 if lasts is None:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1068 starting[state] += 1
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1069 else:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1070 transitions[lasts][state] += 1
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1071 outputs[state][symbol] += 1
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1072 lasts = state
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1073
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1074 # update the state and symbol lists
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1075 if state not in known_states:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1076 self._states.append(state)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1077 known_states.add(state)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1078
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1079 if symbol not in known_symbols:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1080 self._symbols.append(symbol)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1081 known_symbols.add(symbol)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1082
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1083 # create probability distributions (with smoothing)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1084 N = len(self._states)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1085 pi = estimator(starting, N)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1086 A = ConditionalProbDist(transitions, estimator, N)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1087 B = ConditionalProbDist(outputs, estimator, len(self._symbols))
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1088
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1089 return HiddenMarkovModelTagger(self._symbols, self._states, A, B, pi)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1090
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1091
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1092 def _ninf_array(shape):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1093 res = np.empty(shape, np.float64)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1094 res.fill(-np.inf)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1095 return res
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1096
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1097
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1098 def logsumexp2(arr):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1099 max_ = arr.max()
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1100 return np.log2(np.sum(2**(arr - max_))) + max_
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1101
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1102
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1103 def _log_add(*values):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1104 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1105 Adds the logged values, returning the logarithm of the addition.
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1106 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1107 x = max(values)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1108 if x > -np.inf:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1109 sum_diffs = 0
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1110 for value in values:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1111 sum_diffs += 2**(value - x)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1112 return x + np.log2(sum_diffs)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1113 else:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1114 return x
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1115
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1116
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1117 def _create_hmm_tagger(states, symbols, A, B, pi):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1118 def pd(values, samples):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1119 d = dict(zip(samples, values))
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1120 return DictionaryProbDist(d)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1121
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1122 def cpd(array, conditions, samples):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1123 d = {}
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1124 for values, condition in zip(array, conditions):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1125 d[condition] = pd(values, samples)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1126 return DictionaryConditionalProbDist(d)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1127
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1128 A = cpd(A, states, states)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1129 B = cpd(B, states, symbols)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1130 pi = pd(pi, states)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1131 return HiddenMarkovModelTagger(symbols=symbols, states=states,
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1132 transitions=A, outputs=B, priors=pi)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1133
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1134
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1135 def _market_hmm_example():
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1136 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1137 Return an example HMM (described at page 381, Huang et al)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1138 """
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1139 states = ['bull', 'bear', 'static']
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1140 symbols = ['up', 'down', 'unchanged']
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1141 A = np.array([[0.6, 0.2, 0.2], [0.5, 0.3, 0.2], [0.4, 0.1, 0.5]], np.float64)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1142 B = np.array([[0.7, 0.1, 0.2], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]], np.float64)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1143 pi = np.array([0.5, 0.2, 0.3], np.float64)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1144
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1145 model = _create_hmm_tagger(states, symbols, A, B, pi)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1146 return model, states, symbols
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1147
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1148
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1149 def demo():
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1150 # demonstrates HMM probability calculation
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1151
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1152 print()
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1153 print("HMM probability calculation demo")
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1154 print()
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1155
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1156 model, states, symbols = _market_hmm_example()
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1157
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1158 print('Testing', model)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1159
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1160 for test in [['up', 'up'], ['up', 'down', 'up'],
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1161 ['down'] * 5, ['unchanged'] * 5 + ['up']]:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1162
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1163 sequence = [(t, None) for t in test]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1164
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1165 print('Testing with state sequence', test)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1166 print('probability =', model.probability(sequence))
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1167 print('tagging = ', model.tag([word for (word,tag) in sequence]))
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1168 print('p(tagged) = ', model.probability(sequence))
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1169 print('H = ', model.entropy(sequence))
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1170 print('H_exh = ', model._exhaustive_entropy(sequence))
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1171 print('H(point) = ', model.point_entropy(sequence))
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1172 print('H_exh(point)=', model._exhaustive_point_entropy(sequence))
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1173 print()
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1174
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1175 def load_pos(num_sents):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1176 from nltk.corpus import brown
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1177
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1178 sentences = brown.tagged_sents(categories='news')[:num_sents]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1179
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1180 tag_re = re.compile(r'[*]|--|[^+*-]+')
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1181 tag_set = set()
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1182 symbols = set()
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1183
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1184 cleaned_sentences = []
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1185 for sentence in sentences:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1186 for i in range(len(sentence)):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1187 word, tag = sentence[i]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1188 word = word.lower() # normalize
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1189 symbols.add(word) # log this word
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1190 # Clean up the tag.
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1191 tag = tag_re.match(tag).group()
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1192 tag_set.add(tag)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1193 sentence[i] = (word, tag) # store cleaned-up tagged token
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1194 cleaned_sentences += [sentence]
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1195
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1196 return cleaned_sentences, list(tag_set), list(symbols)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1197
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1198 def demo_pos():
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1199 # demonstrates POS tagging using supervised training
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1200
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1201 print()
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1202 print("HMM POS tagging demo")
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1203 print()
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1204
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1205 print('Training HMM...')
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1206 labelled_sequences, tag_set, symbols = load_pos(20000)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1207 trainer = HiddenMarkovModelTrainer(tag_set, symbols)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1208 hmm = trainer.train_supervised(labelled_sequences[10:],
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1209 estimator=lambda fd, bins: LidstoneProbDist(fd, 0.1, bins))
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1210
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1211 print('Testing...')
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1212 hmm.test(labelled_sequences[:10], verbose=True)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1213
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1214 def _untag(sentences):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1215 unlabeled = []
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1216 for sentence in sentences:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1217 unlabeled.append([(token[_TEXT], None) for token in sentence])
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1218 return unlabeled
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1219
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1220 def demo_pos_bw(test=10, supervised=20, unsupervised=10, verbose=True,
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1221 max_iterations=5):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1222 # demonstrates the Baum-Welch algorithm in POS tagging
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1223
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1224 print()
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1225 print("Baum-Welch demo for POS tagging")
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1226 print()
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1227
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1228 print('Training HMM (supervised, %d sentences)...' % supervised)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1229
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1230 sentences, tag_set, symbols = load_pos(test + supervised + unsupervised)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1231
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1232 symbols = set()
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1233 for sentence in sentences:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1234 for token in sentence:
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1235 symbols.add(token[_TEXT])
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1236
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1237 trainer = HiddenMarkovModelTrainer(tag_set, list(symbols))
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1238 hmm = trainer.train_supervised(sentences[test:test+supervised],
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1239 estimator=lambda fd, bins: LidstoneProbDist(fd, 0.1, bins))
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1240
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1241 hmm.test(sentences[:test], verbose=verbose)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1242
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1243 print('Training (unsupervised, %d sentences)...' % unsupervised)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1244 # it's rather slow - so only use 10 samples by default
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1245 unlabeled = _untag(sentences[test+supervised:])
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1246 hmm = trainer.train_unsupervised(unlabeled, model=hmm,
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1247 max_iterations=max_iterations)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1248 hmm.test(sentences[:test], verbose=verbose)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1249
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1250 def demo_bw():
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1251 # demo Baum Welch by generating some sequences and then performing
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1252 # unsupervised training on them
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1253
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1254 print()
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1255 print("Baum-Welch demo for market example")
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1256 print()
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1257
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1258 model, states, symbols = _market_hmm_example()
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1259
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1260 # generate some random sequences
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1261 training = []
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1262 import random
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1263 rng = random.Random()
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1264 rng.seed(0)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1265 for i in range(10):
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1266 item = model.random_sample(rng, 5)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1267 training.append([(i[0], None) for i in item])
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1268
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1269 # train on those examples, starting with the model that generated them
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1270 trainer = HiddenMarkovModelTrainer(states, symbols)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1271 hmm = trainer.train_unsupervised(training, model=model,
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1272 max_iterations=1000)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1273
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1274
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1275 if __name__ == "__main__":
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1276 import doctest
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1277 doctest.testmod(optionflags=doctest.NORMALIZE_WHITESPACE)
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1278
26d9c0308fcf updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1279