Mercurial > hg > python
annotate hmm/hmm.py @ 3:26d9c0308fcf
updated/added from ecclerig version
author | Henry S. Thompson <ht@inf.ed.ac.uk> |
---|---|
date | Mon, 09 Mar 2020 17:35:28 +0000 |
parents | |
children |
rev | line source |
---|---|
3
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1 # Natural Language Toolkit: Hidden Markov Model |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
2 # |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
3 # Copyright (C) 2001-2015 NLTK Project |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
4 # Author: Trevor Cohn <tacohn@csse.unimelb.edu.au> |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
5 # Philip Blunsom <pcbl@csse.unimelb.edu.au> |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
6 # Tiago Tresoldi <tiago@tresoldi.pro.br> (fixes) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
7 # Steven Bird <stevenbird1@gmail.com> (fixes) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
8 # Joseph Frazee <jfrazee@mail.utexas.edu> (fixes) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
9 # Steven Xu <xxu@student.unimelb.edu.au> (fixes) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
10 # URL: <http://nltk.org/> |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
11 # For license information, see LICENSE.TXT |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
12 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
13 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
14 Hidden Markov Models (HMMs) largely used to assign the correct label sequence |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
15 to sequential data or assess the probability of a given label and data |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
16 sequence. These models are finite state machines characterised by a number of |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
17 states, transitions between these states, and output symbols emitted while in |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
18 each state. The HMM is an extension to the Markov chain, where each state |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
19 corresponds deterministically to a given event. In the HMM the observation is |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
20 a probabilistic function of the state. HMMs share the Markov chain's |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
21 assumption, being that the probability of transition from one state to another |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
22 only depends on the current state - i.e. the series of states that led to the |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
23 current state are not used. They are also time invariant. |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
24 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
25 The HMM is a directed graph, with probability weighted edges (representing the |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
26 probability of a transition between the source and sink states) where each |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
27 vertex emits an output symbol when entered. The symbol (or observation) is |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
28 non-deterministically generated. For this reason, knowing that a sequence of |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
29 output observations was generated by a given HMM does not mean that the |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
30 corresponding sequence of states (and what the current state is) is known. |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
31 This is the 'hidden' in the hidden markov model. |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
32 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
33 Formally, a HMM can be characterised by: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
34 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
35 - the output observation alphabet. This is the set of symbols which may be |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
36 observed as output of the system. |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
37 - the set of states. |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
38 - the transition probabilities *a_{ij} = P(s_t = j | s_{t-1} = i)*. These |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
39 represent the probability of transition to each state from a given state. |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
40 - the output probability matrix *b_i(k) = P(X_t = o_k | s_t = i)*. These |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
41 represent the probability of observing each symbol in a given state. |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
42 - the initial state distribution. This gives the probability of starting |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
43 in each state. |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
44 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
45 To ground this discussion, take a common NLP application, part-of-speech (POS) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
46 tagging. An HMM is desirable for this task as the highest probability tag |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
47 sequence can be calculated for a given sequence of word forms. This differs |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
48 from other tagging techniques which often tag each word individually, seeking |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
49 to optimise each individual tagging greedily without regard to the optimal |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
50 combination of tags for a larger unit, such as a sentence. The HMM does this |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
51 with the Viterbi algorithm, which efficiently computes the optimal path |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
52 through the graph given the sequence of words forms. |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
53 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
54 In POS tagging the states usually have a 1:1 correspondence with the tag |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
55 alphabet - i.e. each state represents a single tag. The output observation |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
56 alphabet is the set of word forms (the lexicon), and the remaining three |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
57 parameters are derived by a training regime. With this information the |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
58 probability of a given sentence can be easily derived, by simply summing the |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
59 probability of each distinct path through the model. Similarly, the highest |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
60 probability tagging sequence can be derived with the Viterbi algorithm, |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
61 yielding a state sequence which can be mapped into a tag sequence. |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
62 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
63 This discussion assumes that the HMM has been trained. This is probably the |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
64 most difficult task with the model, and requires either MLE estimates of the |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
65 parameters or unsupervised learning using the Baum-Welch algorithm, a variant |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
66 of EM. |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
67 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
68 For more information, please consult the source code for this module, |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
69 which includes extensive demonstration code. |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
70 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
71 from __future__ import print_function, unicode_literals, division |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
72 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
73 import re |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
74 import itertools |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
75 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
76 try: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
77 import numpy as np |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
78 except ImportError: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
79 pass |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
80 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
81 from nltk.probability import (FreqDist, ConditionalFreqDist, |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
82 ConditionalProbDist, DictionaryProbDist, |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
83 DictionaryConditionalProbDist, |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
84 LidstoneProbDist, MutableProbDist, |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
85 MLEProbDist, RandomProbDist) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
86 from nltk.metrics import accuracy |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
87 from nltk.util import LazyMap, unique_list |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
88 from nltk.compat import python_2_unicode_compatible, izip, imap |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
89 from nltk.tag.api import TaggerI |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
90 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
91 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
92 _TEXT = 0 # index of text in a tuple |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
93 _TAG = 1 # index of tag in a tuple |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
94 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
95 def _identity(labeled_symbols): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
96 return labeled_symbols |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
97 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
98 @python_2_unicode_compatible |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
99 class HiddenMarkovModelTagger(TaggerI): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
100 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
101 Hidden Markov model class, a generative model for labelling sequence data. |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
102 These models define the joint probability of a sequence of symbols and |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
103 their labels (state transitions) as the product of the starting state |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
104 probability, the probability of each state transition, and the probability |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
105 of each observation being generated from each state. This is described in |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
106 more detail in the module documentation. |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
107 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
108 This implementation is based on the HMM description in Chapter 8, Huang, |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
109 Acero and Hon, Spoken Language Processing and includes an extension for |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
110 training shallow HMM parsers or specialized HMMs as in Molina et. |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
111 al, 2002. A specialized HMM modifies training data by applying a |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
112 specialization function to create a new training set that is more |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
113 appropriate for sequential tagging with an HMM. A typical use case is |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
114 chunking. |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
115 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
116 :param symbols: the set of output symbols (alphabet) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
117 :type symbols: seq of any |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
118 :param states: a set of states representing state space |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
119 :type states: seq of any |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
120 :param transitions: transition probabilities; Pr(s_i | s_j) is the |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
121 probability of transition from state i given the model is in |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
122 state_j |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
123 :type transitions: ConditionalProbDistI |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
124 :param outputs: output probabilities; Pr(o_k | s_i) is the probability |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
125 of emitting symbol k when entering state i |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
126 :type outputs: ConditionalProbDistI |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
127 :param priors: initial state distribution; Pr(s_i) is the probability |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
128 of starting in state i |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
129 :type priors: ProbDistI |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
130 :param transform: an optional function for transforming training |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
131 instances, defaults to the identity function. |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
132 :type transform: callable |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
133 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
134 def __init__(self, symbols, states, transitions, outputs, priors, |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
135 transform=_identity): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
136 self._symbols = unique_list(symbols) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
137 self._states = unique_list(states) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
138 self._transitions = transitions |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
139 self._outputs = outputs |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
140 self._priors = priors |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
141 self._cache = None |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
142 self._transform = transform |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
143 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
144 @classmethod |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
145 def _train(cls, labeled_sequence, test_sequence=None, |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
146 unlabeled_sequence=None, transform=_identity, |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
147 estimator=None, **kwargs): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
148 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
149 if estimator is None: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
150 def estimator(fd, bins): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
151 return LidstoneProbDist(fd, 0.1, bins) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
152 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
153 labeled_sequence = LazyMap(transform, labeled_sequence) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
154 symbols = unique_list(word for sent in labeled_sequence |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
155 for word, tag in sent) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
156 tag_set = unique_list(tag for sent in labeled_sequence |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
157 for word, tag in sent) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
158 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
159 trainer = HiddenMarkovModelTrainer(tag_set, symbols) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
160 hmm = trainer.train_supervised(labeled_sequence, estimator=estimator) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
161 hmm = cls(hmm._symbols, hmm._states, hmm._transitions, hmm._outputs, |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
162 hmm._priors, transform=transform) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
163 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
164 if test_sequence: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
165 hmm.test(test_sequence, verbose=kwargs.get('verbose', False)) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
166 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
167 if unlabeled_sequence: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
168 max_iterations = kwargs.get('max_iterations', 5) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
169 hmm = trainer.train_unsupervised(unlabeled_sequence, model=hmm, |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
170 max_iterations=max_iterations) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
171 if test_sequence: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
172 hmm.test(test_sequence, verbose=kwargs.get('verbose', False)) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
173 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
174 return hmm |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
175 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
176 @classmethod |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
177 def train(cls, labeled_sequence, test_sequence=None, |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
178 unlabeled_sequence=None, **kwargs): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
179 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
180 Train a new HiddenMarkovModelTagger using the given labeled and |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
181 unlabeled training instances. Testing will be performed if test |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
182 instances are provided. |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
183 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
184 :return: a hidden markov model tagger |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
185 :rtype: HiddenMarkovModelTagger |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
186 :param labeled_sequence: a sequence of labeled training instances, |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
187 i.e. a list of sentences represented as tuples |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
188 :type labeled_sequence: list(list) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
189 :param test_sequence: a sequence of labeled test instances |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
190 :type test_sequence: list(list) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
191 :param unlabeled_sequence: a sequence of unlabeled training instances, |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
192 i.e. a list of sentences represented as words |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
193 :type unlabeled_sequence: list(list) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
194 :param transform: an optional function for transforming training |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
195 instances, defaults to the identity function, see ``transform()`` |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
196 :type transform: function |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
197 :param estimator: an optional function or class that maps a |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
198 condition's frequency distribution to its probability |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
199 distribution, defaults to a Lidstone distribution with gamma = 0.1 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
200 :type estimator: class or function |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
201 :param verbose: boolean flag indicating whether training should be |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
202 verbose or include printed output |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
203 :type verbose: bool |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
204 :param max_iterations: number of Baum-Welch interations to perform |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
205 :type max_iterations: int |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
206 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
207 return cls._train(labeled_sequence, test_sequence, |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
208 unlabeled_sequence, **kwargs) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
209 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
210 def probability(self, sequence): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
211 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
212 Returns the probability of the given symbol sequence. If the sequence |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
213 is labelled, then returns the joint probability of the symbol, state |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
214 sequence. Otherwise, uses the forward algorithm to find the |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
215 probability over all label sequences. |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
216 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
217 :return: the probability of the sequence |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
218 :rtype: float |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
219 :param sequence: the sequence of symbols which must contain the TEXT |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
220 property, and optionally the TAG property |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
221 :type sequence: Token |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
222 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
223 return 2**(self.log_probability(self._transform(sequence))) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
224 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
225 def log_probability(self, sequence): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
226 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
227 Returns the log-probability of the given symbol sequence. If the |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
228 sequence is labelled, then returns the joint log-probability of the |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
229 symbol, state sequence. Otherwise, uses the forward algorithm to find |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
230 the log-probability over all label sequences. |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
231 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
232 :return: the log-probability of the sequence |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
233 :rtype: float |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
234 :param sequence: the sequence of symbols which must contain the TEXT |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
235 property, and optionally the TAG property |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
236 :type sequence: Token |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
237 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
238 sequence = self._transform(sequence) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
239 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
240 T = len(sequence) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
241 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
242 if T > 0 and sequence[0][_TAG]: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
243 last_state = sequence[0][_TAG] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
244 p = self._priors.logprob(last_state) + \ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
245 self._output_logprob(last_state, sequence[0][_TEXT]) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
246 for t in range(1, T): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
247 state = sequence[t][_TAG] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
248 p += self._transitions[last_state].logprob(state) + \ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
249 self._output_logprob(state, sequence[t][_TEXT]) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
250 last_state = state |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
251 return p |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
252 else: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
253 alpha = self._forward_probability(sequence) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
254 p = logsumexp2(alpha[T-1]) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
255 return p |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
256 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
257 def tag(self, unlabeled_sequence): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
258 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
259 Tags the sequence with the highest probability state sequence. This |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
260 uses the best_path method to find the Viterbi path. |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
261 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
262 :return: a labelled sequence of symbols |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
263 :rtype: list |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
264 :param unlabeled_sequence: the sequence of unlabeled symbols |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
265 :type unlabeled_sequence: list |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
266 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
267 unlabeled_sequence = self._transform(unlabeled_sequence) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
268 return self._tag(unlabeled_sequence) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
269 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
270 def _tag(self, unlabeled_sequence): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
271 path = self._best_path(unlabeled_sequence) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
272 return list(izip(unlabeled_sequence, path)) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
273 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
274 def _output_logprob(self, state, symbol): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
275 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
276 :return: the log probability of the symbol being observed in the given |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
277 state |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
278 :rtype: float |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
279 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
280 return self._outputs[state].logprob(symbol) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
281 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
282 def _create_cache(self): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
283 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
284 The cache is a tuple (P, O, X, S) where: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
285 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
286 - S maps symbols to integers. I.e., it is the inverse |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
287 mapping from self._symbols; for each symbol s in |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
288 self._symbols, the following is true:: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
289 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
290 self._symbols[S[s]] == s |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
291 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
292 - O is the log output probabilities:: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
293 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
294 O[i,k] = log( P(token[t]=sym[k]|tag[t]=state[i]) ) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
295 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
296 - X is the log transition probabilities:: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
297 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
298 X[i,j] = log( P(tag[t]=state[j]|tag[t-1]=state[i]) ) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
299 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
300 - P is the log prior probabilities:: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
301 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
302 P[i] = log( P(tag[0]=state[i]) ) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
303 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
304 if not self._cache: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
305 N = len(self._states) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
306 M = len(self._symbols) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
307 P = np.zeros(N, np.float32) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
308 X = np.zeros((N, N), np.float32) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
309 O = np.zeros((N, M), np.float32) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
310 for i in range(N): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
311 si = self._states[i] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
312 P[i] = self._priors.logprob(si) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
313 for j in range(N): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
314 X[i, j] = self._transitions[si].logprob(self._states[j]) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
315 for k in range(M): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
316 O[i, k] = self._output_logprob(si, self._symbols[k]) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
317 S = {} |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
318 for k in range(M): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
319 S[self._symbols[k]] = k |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
320 self._cache = (P, O, X, S) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
321 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
322 def _update_cache(self, symbols): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
323 # add new symbols to the symbol table and repopulate the output |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
324 # probabilities and symbol table mapping |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
325 if symbols: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
326 self._create_cache() |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
327 P, O, X, S = self._cache |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
328 for symbol in symbols: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
329 if symbol not in self._symbols: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
330 self._cache = None |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
331 self._symbols.append(symbol) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
332 # don't bother with the work if there aren't any new symbols |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
333 if not self._cache: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
334 N = len(self._states) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
335 M = len(self._symbols) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
336 Q = O.shape[1] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
337 # add new columns to the output probability table without |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
338 # destroying the old probabilities |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
339 O = np.hstack([O, np.zeros((N, M - Q), np.float32)]) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
340 for i in range(N): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
341 si = self._states[i] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
342 # only calculate probabilities for new symbols |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
343 for k in range(Q, M): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
344 O[i, k] = self._output_logprob(si, self._symbols[k]) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
345 # only create symbol mappings for new symbols |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
346 for k in range(Q, M): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
347 S[self._symbols[k]] = k |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
348 self._cache = (P, O, X, S) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
349 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
350 def reset_cache(self): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
351 self._cache = None |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
352 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
353 def best_path(self, unlabeled_sequence): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
354 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
355 Returns the state sequence of the optimal (most probable) path through |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
356 the HMM. Uses the Viterbi algorithm to calculate this part by dynamic |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
357 programming. |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
358 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
359 :return: the state sequence |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
360 :rtype: sequence of any |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
361 :param unlabeled_sequence: the sequence of unlabeled symbols |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
362 :type unlabeled_sequence: list |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
363 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
364 unlabeled_sequence = self._transform(unlabeled_sequence) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
365 return self._best_path(unlabeled_sequence) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
366 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
367 def _best_path(self, unlabeled_sequence): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
368 T = len(unlabeled_sequence) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
369 N = len(self._states) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
370 self._create_cache() |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
371 self._update_cache(unlabeled_sequence) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
372 P, O, X, S = self._cache |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
373 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
374 V = np.zeros((T, N), np.float32) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
375 B = -np.ones((T, N), np.int) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
376 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
377 V[0] = P + O[:, S[unlabeled_sequence[0]]] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
378 for t in range(1, T): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
379 for j in range(N): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
380 vs = V[t-1, :] + X[:, j] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
381 best = np.argmax(vs) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
382 V[t, j] = vs[best] + O[j, S[unlabeled_sequence[t]]] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
383 B[t, j] = best |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
384 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
385 current = np.argmax(V[T-1,:]) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
386 sequence = [current] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
387 for t in range(T-1, 0, -1): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
388 last = B[t, current] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
389 sequence.append(last) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
390 current = last |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
391 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
392 sequence.reverse() |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
393 return list(map(self._states.__getitem__, sequence)) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
394 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
395 def best_path_simple(self, unlabeled_sequence): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
396 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
397 Returns the state sequence of the optimal (most probable) path through |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
398 the HMM. Uses the Viterbi algorithm to calculate this part by dynamic |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
399 programming. This uses a simple, direct method, and is included for |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
400 teaching purposes. |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
401 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
402 :return: the state sequence |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
403 :rtype: sequence of any |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
404 :param unlabeled_sequence: the sequence of unlabeled symbols |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
405 :type unlabeled_sequence: list |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
406 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
407 unlabeled_sequence = self._transform(unlabeled_sequence) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
408 return self._best_path_simple(unlabeled_sequence) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
409 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
410 def _best_path_simple(self, unlabeled_sequence): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
411 T = len(unlabeled_sequence) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
412 N = len(self._states) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
413 V = np.zeros((T, N), np.float64) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
414 B = {} |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
415 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
416 # find the starting log probabilities for each state |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
417 symbol = unlabeled_sequence[0] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
418 for i, state in enumerate(self._states): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
419 V[0, i] = self._priors.logprob(state) + \ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
420 self._output_logprob(state, symbol) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
421 B[0, state] = None |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
422 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
423 # find the maximum log probabilities for reaching each state at time t |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
424 for t in range(1, T): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
425 symbol = unlabeled_sequence[t] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
426 for j in range(N): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
427 sj = self._states[j] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
428 best = None |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
429 for i in range(N): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
430 si = self._states[i] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
431 va = V[t-1, i] + self._transitions[si].logprob(sj) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
432 if not best or va > best[0]: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
433 best = (va, si) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
434 V[t, j] = best[0] + self._output_logprob(sj, symbol) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
435 B[t, sj] = best[1] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
436 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
437 # find the highest probability final state |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
438 best = None |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
439 for i in range(N): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
440 val = V[T-1, i] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
441 if not best or val > best[0]: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
442 best = (val, self._states[i]) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
443 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
444 # traverse the back-pointers B to find the state sequence |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
445 current = best[1] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
446 sequence = [current] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
447 for t in range(T-1, 0, -1): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
448 last = B[t, current] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
449 sequence.append(last) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
450 current = last |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
451 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
452 sequence.reverse() |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
453 return sequence |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
454 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
455 def random_sample(self, rng, length): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
456 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
457 Randomly sample the HMM to generate a sentence of a given length. This |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
458 samples the prior distribution then the observation distribution and |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
459 transition distribution for each subsequent observation and state. |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
460 This will mostly generate unintelligible garbage, but can provide some |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
461 amusement. |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
462 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
463 :return: the randomly created state/observation sequence, |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
464 generated according to the HMM's probability |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
465 distributions. The SUBTOKENS have TEXT and TAG |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
466 properties containing the observation and state |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
467 respectively. |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
468 :rtype: list |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
469 :param rng: random number generator |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
470 :type rng: Random (or any object with a random() method) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
471 :param length: desired output length |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
472 :type length: int |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
473 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
474 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
475 # sample the starting state and symbol prob dists |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
476 tokens = [] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
477 state = self._sample_probdist(self._priors, rng.random(), self._states) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
478 symbol = self._sample_probdist(self._outputs[state], |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
479 rng.random(), self._symbols) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
480 tokens.append((symbol, state)) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
481 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
482 for i in range(1, length): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
483 # sample the state transition and symbol prob dists |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
484 state = self._sample_probdist(self._transitions[state], |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
485 rng.random(), self._states) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
486 symbol = self._sample_probdist(self._outputs[state], |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
487 rng.random(), self._symbols) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
488 tokens.append((symbol, state)) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
489 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
490 return tokens |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
491 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
492 def _sample_probdist(self, probdist, p, samples): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
493 cum_p = 0 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
494 for sample in samples: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
495 add_p = probdist.prob(sample) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
496 if cum_p <= p <= cum_p + add_p: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
497 return sample |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
498 cum_p += add_p |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
499 raise Exception('Invalid probability distribution - ' |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
500 'does not sum to one') |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
501 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
502 def entropy(self, unlabeled_sequence): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
503 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
504 Returns the entropy over labellings of the given sequence. This is |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
505 given by:: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
506 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
507 H(O) = - sum_S Pr(S | O) log Pr(S | O) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
508 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
509 where the summation ranges over all state sequences, S. Let |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
510 *Z = Pr(O) = sum_S Pr(S, O)}* where the summation ranges over all state |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
511 sequences and O is the observation sequence. As such the entropy can |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
512 be re-expressed as:: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
513 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
514 H = - sum_S Pr(S | O) log [ Pr(S, O) / Z ] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
515 = log Z - sum_S Pr(S | O) log Pr(S, 0) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
516 = log Z - sum_S Pr(S | O) [ log Pr(S_0) + sum_t Pr(S_t | S_{t-1}) + sum_t Pr(O_t | S_t) ] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
517 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
518 The order of summation for the log terms can be flipped, allowing |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
519 dynamic programming to be used to calculate the entropy. Specifically, |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
520 we use the forward and backward probabilities (alpha, beta) giving:: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
521 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
522 H = log Z - sum_s0 alpha_0(s0) beta_0(s0) / Z * log Pr(s0) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
523 + sum_t,si,sj alpha_t(si) Pr(sj | si) Pr(O_t+1 | sj) beta_t(sj) / Z * log Pr(sj | si) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
524 + sum_t,st alpha_t(st) beta_t(st) / Z * log Pr(O_t | st) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
525 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
526 This simply uses alpha and beta to find the probabilities of partial |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
527 sequences, constrained to include the given state(s) at some point in |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
528 time. |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
529 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
530 unlabeled_sequence = self._transform(unlabeled_sequence) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
531 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
532 T = len(unlabeled_sequence) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
533 N = len(self._states) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
534 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
535 alpha = self._forward_probability(unlabeled_sequence) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
536 beta = self._backward_probability(unlabeled_sequence) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
537 normalisation = logsumexp2(alpha[T-1]) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
538 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
539 entropy = normalisation |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
540 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
541 # starting state, t = 0 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
542 for i, state in enumerate(self._states): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
543 p = 2**(alpha[0, i] + beta[0, i] - normalisation) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
544 entropy -= p * self._priors.logprob(state) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
545 #print 'p(s_0 = %s) =' % state, p |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
546 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
547 # state transitions |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
548 for t0 in range(T - 1): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
549 t1 = t0 + 1 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
550 for i0, s0 in enumerate(self._states): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
551 for i1, s1 in enumerate(self._states): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
552 p = 2**(alpha[t0, i0] + self._transitions[s0].logprob(s1) + |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
553 self._outputs[s1].logprob( |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
554 unlabeled_sequence[t1][_TEXT]) + |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
555 beta[t1, i1] - normalisation) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
556 entropy -= p * self._transitions[s0].logprob(s1) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
557 #print 'p(s_%d = %s, s_%d = %s) =' % (t0, s0, t1, s1), p |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
558 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
559 # symbol emissions |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
560 for t in range(T): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
561 for i, state in enumerate(self._states): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
562 p = 2**(alpha[t, i] + beta[t, i] - normalisation) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
563 entropy -= p * self._outputs[state].logprob( |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
564 unlabeled_sequence[t][_TEXT]) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
565 #print 'p(s_%d = %s) =' % (t, state), p |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
566 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
567 return entropy |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
568 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
569 def point_entropy(self, unlabeled_sequence): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
570 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
571 Returns the pointwise entropy over the possible states at each |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
572 position in the chain, given the observation sequence. |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
573 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
574 unlabeled_sequence = self._transform(unlabeled_sequence) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
575 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
576 T = len(unlabeled_sequence) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
577 N = len(self._states) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
578 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
579 alpha = self._forward_probability(unlabeled_sequence) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
580 beta = self._backward_probability(unlabeled_sequence) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
581 normalisation = logsumexp2(alpha[T-1]) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
582 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
583 entropies = np.zeros(T, np.float64) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
584 probs = np.zeros(N, np.float64) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
585 for t in range(T): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
586 for s in range(N): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
587 probs[s] = alpha[t, s] + beta[t, s] - normalisation |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
588 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
589 for s in range(N): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
590 entropies[t] -= 2**(probs[s]) * probs[s] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
591 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
592 return entropies |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
593 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
594 def _exhaustive_entropy(self, unlabeled_sequence): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
595 unlabeled_sequence = self._transform(unlabeled_sequence) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
596 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
597 T = len(unlabeled_sequence) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
598 N = len(self._states) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
599 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
600 labellings = [[state] for state in self._states] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
601 for t in range(T - 1): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
602 current = labellings |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
603 labellings = [] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
604 for labelling in current: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
605 for state in self._states: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
606 labellings.append(labelling + [state]) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
607 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
608 log_probs = [] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
609 for labelling in labellings: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
610 labeled_sequence = unlabeled_sequence[:] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
611 for t, label in enumerate(labelling): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
612 labeled_sequence[t] = (labeled_sequence[t][_TEXT], label) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
613 lp = self.log_probability(labeled_sequence) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
614 log_probs.append(lp) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
615 normalisation = _log_add(*log_probs) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
616 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
617 #ps = zeros((T, N), float64) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
618 #for labelling, lp in zip(labellings, log_probs): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
619 #for t in range(T): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
620 #ps[t, self._states.index(labelling[t])] += \ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
621 # 2**(lp - normalisation) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
622 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
623 #for t in range(T): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
624 #print 'prob[%d] =' % t, ps[t] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
625 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
626 entropy = 0 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
627 for lp in log_probs: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
628 lp -= normalisation |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
629 entropy -= 2**(lp) * lp |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
630 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
631 return entropy |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
632 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
633 def _exhaustive_point_entropy(self, unlabeled_sequence): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
634 unlabeled_sequence = self._transform(unlabeled_sequence) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
635 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
636 T = len(unlabeled_sequence) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
637 N = len(self._states) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
638 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
639 labellings = [[state] for state in self._states] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
640 for t in range(T - 1): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
641 current = labellings |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
642 labellings = [] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
643 for labelling in current: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
644 for state in self._states: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
645 labellings.append(labelling + [state]) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
646 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
647 log_probs = [] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
648 for labelling in labellings: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
649 labelled_sequence = unlabeled_sequence[:] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
650 for t, label in enumerate(labelling): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
651 labelled_sequence[t] = (labelled_sequence[t][_TEXT], label) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
652 lp = self.log_probability(labelled_sequence) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
653 log_probs.append(lp) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
654 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
655 normalisation = _log_add(*log_probs) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
656 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
657 probabilities = _ninf_array((T,N)) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
658 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
659 for labelling, lp in zip(labellings, log_probs): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
660 lp -= normalisation |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
661 for t, label in enumerate(labelling): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
662 index = self._states.index(label) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
663 probabilities[t, index] = _log_add(probabilities[t, index], lp) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
664 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
665 entropies = np.zeros(T, np.float64) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
666 for t in range(T): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
667 for s in range(N): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
668 entropies[t] -= 2**(probabilities[t, s]) * probabilities[t, s] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
669 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
670 return entropies |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
671 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
672 def _transitions_matrix(self): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
673 """ Return a matrix of transition log probabilities. """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
674 trans_iter = (self._transitions[sj].logprob(si) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
675 for sj in self._states |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
676 for si in self._states) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
677 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
678 transitions_logprob = np.fromiter(trans_iter, dtype=np.float64) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
679 N = len(self._states) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
680 return transitions_logprob.reshape((N, N)).T |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
681 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
682 def _outputs_vector(self, symbol): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
683 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
684 Return a vector with log probabilities of emitting a symbol |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
685 when entering states. |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
686 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
687 out_iter = (self._output_logprob(sj, symbol) for sj in self._states) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
688 return np.fromiter(out_iter, dtype=np.float64) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
689 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
690 def _forward_probability(self, unlabeled_sequence): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
691 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
692 Return the forward probability matrix, a T by N array of |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
693 log-probabilities, where T is the length of the sequence and N is the |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
694 number of states. Each entry (t, s) gives the probability of being in |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
695 state s at time t after observing the partial symbol sequence up to |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
696 and including t. |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
697 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
698 :param unlabeled_sequence: the sequence of unlabeled symbols |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
699 :type unlabeled_sequence: list |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
700 :return: the forward log probability matrix |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
701 :rtype: array |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
702 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
703 T = len(unlabeled_sequence) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
704 N = len(self._states) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
705 alpha = _ninf_array((T, N)) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
706 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
707 transitions_logprob = self._transitions_matrix() |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
708 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
709 # Initialization |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
710 symbol = unlabeled_sequence[0][_TEXT] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
711 for i, state in enumerate(self._states): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
712 alpha[0, i] = self._priors.logprob(state) + \ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
713 self._output_logprob(state, symbol) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
714 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
715 # Induction |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
716 for t in range(1, T): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
717 symbol = unlabeled_sequence[t][_TEXT] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
718 output_logprob = self._outputs_vector(symbol) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
719 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
720 for i in range(N): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
721 summand = alpha[t-1] + transitions_logprob[i] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
722 alpha[t, i] = logsumexp2(summand) + output_logprob[i] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
723 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
724 return alpha |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
725 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
726 def _backward_probability(self, unlabeled_sequence): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
727 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
728 Return the backward probability matrix, a T by N array of |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
729 log-probabilities, where T is the length of the sequence and N is the |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
730 number of states. Each entry (t, s) gives the probability of being in |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
731 state s at time t after observing the partial symbol sequence from t |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
732 .. T. |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
733 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
734 :return: the backward log probability matrix |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
735 :rtype: array |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
736 :param unlabeled_sequence: the sequence of unlabeled symbols |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
737 :type unlabeled_sequence: list |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
738 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
739 T = len(unlabeled_sequence) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
740 N = len(self._states) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
741 beta = _ninf_array((T, N)) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
742 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
743 transitions_logprob = self._transitions_matrix().T |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
744 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
745 # initialise the backward values; |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
746 # "1" is an arbitrarily chosen value from Rabiner tutorial |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
747 beta[T-1, :] = np.log2(1) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
748 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
749 # inductively calculate remaining backward values |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
750 for t in range(T-2, -1, -1): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
751 symbol = unlabeled_sequence[t+1][_TEXT] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
752 outputs = self._outputs_vector(symbol) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
753 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
754 for i in range(N): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
755 summand = transitions_logprob[i] + beta[t+1] + outputs |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
756 beta[t, i] = logsumexp2(summand) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
757 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
758 return beta |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
759 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
760 def test(self, test_sequence, verbose=False, **kwargs): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
761 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
762 Tests the HiddenMarkovModelTagger instance. |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
763 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
764 :param test_sequence: a sequence of labeled test instances |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
765 :type test_sequence: list(list) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
766 :param verbose: boolean flag indicating whether training should be |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
767 verbose or include printed output |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
768 :type verbose: bool |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
769 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
770 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
771 def words(sent): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
772 return [word for (word, tag) in sent] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
773 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
774 def tags(sent): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
775 return [tag for (word, tag) in sent] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
776 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
777 def flatten(seq): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
778 return list(itertools.chain(*seq)) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
779 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
780 test_sequence = self._transform(test_sequence) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
781 predicted_sequence = list(imap(self._tag, imap(words, test_sequence))) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
782 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
783 if verbose: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
784 for test_sent, predicted_sent in izip(test_sequence, predicted_sequence): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
785 print('Test:', |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
786 ' '.join('%s/%s' % (token, tag) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
787 for (token, tag) in test_sent)) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
788 print() |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
789 print('Untagged:', |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
790 ' '.join("%s" % token for (token, tag) in test_sent)) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
791 print() |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
792 print('HMM-tagged:', |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
793 ' '.join('%s/%s' % (token, tag) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
794 for (token, tag) in predicted_sent)) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
795 print() |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
796 print('Entropy:', |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
797 self.entropy([(token, None) for |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
798 (token, tag) in predicted_sent])) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
799 print() |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
800 print('-' * 60) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
801 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
802 test_tags = flatten(imap(tags, test_sequence)) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
803 predicted_tags = flatten(imap(tags, predicted_sequence)) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
804 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
805 acc = accuracy(test_tags, predicted_tags) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
806 count = sum(len(sent) for sent in test_sequence) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
807 print('accuracy over %d tokens: %.2f' % (count, acc * 100)) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
808 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
809 def __repr__(self): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
810 return ('<HiddenMarkovModelTagger %d states and %d output symbols>' |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
811 % (len(self._states), len(self._symbols))) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
812 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
813 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
814 class HiddenMarkovModelTrainer(object): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
815 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
816 Algorithms for learning HMM parameters from training data. These include |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
817 both supervised learning (MLE) and unsupervised learning (Baum-Welch). |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
818 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
819 Creates an HMM trainer to induce an HMM with the given states and |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
820 output symbol alphabet. A supervised and unsupervised training |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
821 method may be used. If either of the states or symbols are not given, |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
822 these may be derived from supervised training. |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
823 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
824 :param states: the set of state labels |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
825 :type states: sequence of any |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
826 :param symbols: the set of observation symbols |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
827 :type symbols: sequence of any |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
828 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
829 def __init__(self, states=None, symbols=None): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
830 self._states = (states if states else []) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
831 self._symbols = (symbols if symbols else []) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
832 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
833 def train(self, labeled_sequences=None, unlabeled_sequences=None, |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
834 **kwargs): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
835 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
836 Trains the HMM using both (or either of) supervised and unsupervised |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
837 techniques. |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
838 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
839 :return: the trained model |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
840 :rtype: HiddenMarkovModelTagger |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
841 :param labelled_sequences: the supervised training data, a set of |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
842 labelled sequences of observations |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
843 :type labelled_sequences: list |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
844 :param unlabeled_sequences: the unsupervised training data, a set of |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
845 sequences of observations |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
846 :type unlabeled_sequences: list |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
847 :param kwargs: additional arguments to pass to the training methods |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
848 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
849 assert labeled_sequences or unlabeled_sequences |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
850 model = None |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
851 if labeled_sequences: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
852 model = self.train_supervised(labeled_sequences, **kwargs) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
853 if unlabeled_sequences: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
854 if model: kwargs['model'] = model |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
855 model = self.train_unsupervised(unlabeled_sequences, **kwargs) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
856 return model |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
857 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
858 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
859 def _baum_welch_step(self, sequence, model, symbol_to_number): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
860 N = len(model._states) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
861 M = len(model._symbols) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
862 T = len(sequence) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
863 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
864 # compute forward and backward probabilities |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
865 alpha = model._forward_probability(sequence) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
866 beta = model._backward_probability(sequence) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
867 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
868 # find the log probability of the sequence |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
869 lpk = logsumexp2(alpha[T-1]) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
870 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
871 A_numer = _ninf_array((N, N)) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
872 B_numer = _ninf_array((N, M)) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
873 A_denom = _ninf_array(N) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
874 B_denom = _ninf_array(N) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
875 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
876 transitions_logprob = model._transitions_matrix().T |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
877 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
878 for t in range(T): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
879 symbol = sequence[t][_TEXT] # not found? FIXME |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
880 next_symbol = None |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
881 if t < T - 1: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
882 next_symbol = sequence[t+1][_TEXT] # not found? FIXME |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
883 xi = symbol_to_number[symbol] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
884 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
885 next_outputs_logprob = model._outputs_vector(next_symbol) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
886 alpha_plus_beta = alpha[t] + beta[t] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
887 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
888 if t < T - 1: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
889 numer_add = transitions_logprob + next_outputs_logprob + \ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
890 beta[t+1] + alpha[t].reshape(N, 1) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
891 A_numer = np.logaddexp2(A_numer, numer_add) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
892 A_denom = np.logaddexp2(A_denom, alpha_plus_beta) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
893 else: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
894 B_denom = np.logaddexp2(A_denom, alpha_plus_beta) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
895 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
896 B_numer[:,xi] = np.logaddexp2(B_numer[:,xi], alpha_plus_beta) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
897 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
898 return lpk, A_numer, A_denom, B_numer, B_denom |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
899 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
900 def train_unsupervised(self, unlabeled_sequences, update_outputs=True, |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
901 **kwargs): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
902 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
903 Trains the HMM using the Baum-Welch algorithm to maximise the |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
904 probability of the data sequence. This is a variant of the EM |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
905 algorithm, and is unsupervised in that it doesn't need the state |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
906 sequences for the symbols. The code is based on 'A Tutorial on Hidden |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
907 Markov Models and Selected Applications in Speech Recognition', |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
908 Lawrence Rabiner, IEEE, 1989. |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
909 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
910 :return: the trained model |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
911 :rtype: HiddenMarkovModelTagger |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
912 :param unlabeled_sequences: the training data, a set of |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
913 sequences of observations |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
914 :type unlabeled_sequences: list |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
915 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
916 kwargs may include following parameters: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
917 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
918 :param model: a HiddenMarkovModelTagger instance used to begin |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
919 the Baum-Welch algorithm |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
920 :param max_iterations: the maximum number of EM iterations |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
921 :param convergence_logprob: the maximum change in log probability to |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
922 allow convergence |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
923 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
924 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
925 # create a uniform HMM, which will be iteratively refined, unless |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
926 # given an existing model |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
927 model = kwargs.get('model') |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
928 if not model: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
929 priors = RandomProbDist(self._states) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
930 transitions = DictionaryConditionalProbDist( |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
931 dict((state, RandomProbDist(self._states)) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
932 for state in self._states)) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
933 outputs = DictionaryConditionalProbDist( |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
934 dict((state, RandomProbDist(self._symbols)) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
935 for state in self._states)) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
936 model = HiddenMarkovModelTagger(self._symbols, self._states, |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
937 transitions, outputs, priors) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
938 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
939 self._states = model._states |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
940 self._symbols = model._symbols |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
941 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
942 N = len(self._states) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
943 M = len(self._symbols) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
944 symbol_numbers = dict((sym, i) for i, sym in enumerate(self._symbols)) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
945 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
946 # update model prob dists so that they can be modified |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
947 # model._priors = MutableProbDist(model._priors, self._states) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
948 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
949 model._transitions = DictionaryConditionalProbDist( |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
950 dict((s, MutableProbDist(model._transitions[s], self._states)) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
951 for s in self._states)) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
952 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
953 if update_outputs: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
954 model._outputs = DictionaryConditionalProbDist( |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
955 dict((s, MutableProbDist(model._outputs[s], self._symbols)) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
956 for s in self._states)) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
957 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
958 model.reset_cache() |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
959 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
960 # iterate until convergence |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
961 converged = False |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
962 last_logprob = None |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
963 iteration = 0 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
964 max_iterations = kwargs.get('max_iterations', 1000) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
965 epsilon = kwargs.get('convergence_logprob', 1e-6) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
966 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
967 while not converged and iteration < max_iterations: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
968 A_numer = _ninf_array((N, N)) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
969 B_numer = _ninf_array((N, M)) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
970 A_denom = _ninf_array(N) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
971 B_denom = _ninf_array(N) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
972 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
973 logprob = 0 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
974 for sequence in unlabeled_sequences: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
975 sequence = list(sequence) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
976 if not sequence: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
977 continue |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
978 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
979 (lpk, seq_A_numer, seq_A_denom, |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
980 seq_B_numer, seq_B_denom) = self._baum_welch_step(sequence, model, symbol_numbers) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
981 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
982 # add these sums to the global A and B values |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
983 for i in range(N): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
984 A_numer[i] = np.logaddexp2(A_numer[i], seq_A_numer[i]-lpk) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
985 B_numer[i] = np.logaddexp2(B_numer[i], seq_B_numer[i]-lpk) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
986 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
987 A_denom = np.logaddexp2(A_denom, seq_A_denom-lpk) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
988 B_denom = np.logaddexp2(B_denom, seq_B_denom-lpk) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
989 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
990 logprob += lpk |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
991 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
992 # use the calculated values to update the transition and output |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
993 # probability values |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
994 for i in range(N): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
995 logprob_Ai = A_numer[i] - A_denom[i] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
996 logprob_Bi = B_numer[i] - B_denom[i] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
997 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
998 # We should normalize all probabilities (see p.391 Huang et al) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
999 # Let sum(P) be K. |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1000 # We can divide each Pi by K to make sum(P) == 1. |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1001 # Pi' = Pi/K |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1002 # log2(Pi') = log2(Pi) - log2(K) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1003 logprob_Ai -= logsumexp2(logprob_Ai) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1004 logprob_Bi -= logsumexp2(logprob_Bi) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1005 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1006 # update output and transition probabilities |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1007 si = self._states[i] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1008 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1009 for j in range(N): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1010 sj = self._states[j] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1011 model._transitions[si].update(sj, logprob_Ai[j]) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1012 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1013 if update_outputs: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1014 for k in range(M): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1015 ok = self._symbols[k] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1016 model._outputs[si].update(ok, logprob_Bi[k]) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1017 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1018 # Rabiner says the priors don't need to be updated. I don't |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1019 # believe him. FIXME |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1020 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1021 # test for convergence |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1022 if iteration > 0 and abs(logprob - last_logprob) < epsilon: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1023 converged = True |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1024 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1025 print('iteration', iteration, 'logprob', logprob) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1026 iteration += 1 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1027 last_logprob = logprob |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1028 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1029 return model |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1030 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1031 def train_supervised(self, labelled_sequences, estimator=None): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1032 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1033 Supervised training maximising the joint probability of the symbol and |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1034 state sequences. This is done via collecting frequencies of |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1035 transitions between states, symbol observations while within each |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1036 state and which states start a sentence. These frequency distributions |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1037 are then normalised into probability estimates, which can be |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1038 smoothed if desired. |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1039 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1040 :return: the trained model |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1041 :rtype: HiddenMarkovModelTagger |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1042 :param labelled_sequences: the training data, a set of |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1043 labelled sequences of observations |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1044 :type labelled_sequences: list |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1045 :param estimator: a function taking |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1046 a FreqDist and a number of bins and returning a CProbDistI; |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1047 otherwise a MLE estimate is used |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1048 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1049 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1050 # default to the MLE estimate |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1051 if estimator is None: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1052 estimator = lambda fdist, bins: MLEProbDist(fdist) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1053 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1054 # count occurrences of starting states, transitions out of each state |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1055 # and output symbols observed in each state |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1056 known_symbols = set(self._symbols) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1057 known_states = set(self._states) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1058 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1059 starting = FreqDist() |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1060 transitions = ConditionalFreqDist() |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1061 outputs = ConditionalFreqDist() |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1062 for sequence in labelled_sequences: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1063 lasts = None |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1064 for token in sequence: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1065 state = token[_TAG] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1066 symbol = token[_TEXT] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1067 if lasts is None: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1068 starting[state] += 1 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1069 else: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1070 transitions[lasts][state] += 1 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1071 outputs[state][symbol] += 1 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1072 lasts = state |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1073 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1074 # update the state and symbol lists |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1075 if state not in known_states: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1076 self._states.append(state) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1077 known_states.add(state) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1078 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1079 if symbol not in known_symbols: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1080 self._symbols.append(symbol) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1081 known_symbols.add(symbol) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1082 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1083 # create probability distributions (with smoothing) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1084 N = len(self._states) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1085 pi = estimator(starting, N) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1086 A = ConditionalProbDist(transitions, estimator, N) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1087 B = ConditionalProbDist(outputs, estimator, len(self._symbols)) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1088 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1089 return HiddenMarkovModelTagger(self._symbols, self._states, A, B, pi) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1090 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1091 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1092 def _ninf_array(shape): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1093 res = np.empty(shape, np.float64) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1094 res.fill(-np.inf) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1095 return res |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1096 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1097 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1098 def logsumexp2(arr): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1099 max_ = arr.max() |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1100 return np.log2(np.sum(2**(arr - max_))) + max_ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1101 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1102 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1103 def _log_add(*values): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1104 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1105 Adds the logged values, returning the logarithm of the addition. |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1106 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1107 x = max(values) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1108 if x > -np.inf: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1109 sum_diffs = 0 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1110 for value in values: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1111 sum_diffs += 2**(value - x) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1112 return x + np.log2(sum_diffs) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1113 else: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1114 return x |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1115 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1116 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1117 def _create_hmm_tagger(states, symbols, A, B, pi): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1118 def pd(values, samples): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1119 d = dict(zip(samples, values)) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1120 return DictionaryProbDist(d) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1121 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1122 def cpd(array, conditions, samples): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1123 d = {} |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1124 for values, condition in zip(array, conditions): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1125 d[condition] = pd(values, samples) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1126 return DictionaryConditionalProbDist(d) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1127 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1128 A = cpd(A, states, states) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1129 B = cpd(B, states, symbols) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1130 pi = pd(pi, states) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1131 return HiddenMarkovModelTagger(symbols=symbols, states=states, |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1132 transitions=A, outputs=B, priors=pi) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1133 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1134 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1135 def _market_hmm_example(): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1136 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1137 Return an example HMM (described at page 381, Huang et al) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1138 """ |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1139 states = ['bull', 'bear', 'static'] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1140 symbols = ['up', 'down', 'unchanged'] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1141 A = np.array([[0.6, 0.2, 0.2], [0.5, 0.3, 0.2], [0.4, 0.1, 0.5]], np.float64) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1142 B = np.array([[0.7, 0.1, 0.2], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]], np.float64) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1143 pi = np.array([0.5, 0.2, 0.3], np.float64) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1144 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1145 model = _create_hmm_tagger(states, symbols, A, B, pi) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1146 return model, states, symbols |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1147 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1148 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1149 def demo(): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1150 # demonstrates HMM probability calculation |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1151 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1152 print() |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1153 print("HMM probability calculation demo") |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1154 print() |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1155 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1156 model, states, symbols = _market_hmm_example() |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1157 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1158 print('Testing', model) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1159 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1160 for test in [['up', 'up'], ['up', 'down', 'up'], |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1161 ['down'] * 5, ['unchanged'] * 5 + ['up']]: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1162 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1163 sequence = [(t, None) for t in test] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1164 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1165 print('Testing with state sequence', test) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1166 print('probability =', model.probability(sequence)) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1167 print('tagging = ', model.tag([word for (word,tag) in sequence])) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1168 print('p(tagged) = ', model.probability(sequence)) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1169 print('H = ', model.entropy(sequence)) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1170 print('H_exh = ', model._exhaustive_entropy(sequence)) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1171 print('H(point) = ', model.point_entropy(sequence)) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1172 print('H_exh(point)=', model._exhaustive_point_entropy(sequence)) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1173 print() |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1174 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1175 def load_pos(num_sents): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1176 from nltk.corpus import brown |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1177 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1178 sentences = brown.tagged_sents(categories='news')[:num_sents] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1179 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1180 tag_re = re.compile(r'[*]|--|[^+*-]+') |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1181 tag_set = set() |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1182 symbols = set() |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1183 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1184 cleaned_sentences = [] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1185 for sentence in sentences: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1186 for i in range(len(sentence)): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1187 word, tag = sentence[i] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1188 word = word.lower() # normalize |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1189 symbols.add(word) # log this word |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1190 # Clean up the tag. |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1191 tag = tag_re.match(tag).group() |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1192 tag_set.add(tag) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1193 sentence[i] = (word, tag) # store cleaned-up tagged token |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1194 cleaned_sentences += [sentence] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1195 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1196 return cleaned_sentences, list(tag_set), list(symbols) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1197 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1198 def demo_pos(): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1199 # demonstrates POS tagging using supervised training |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1200 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1201 print() |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1202 print("HMM POS tagging demo") |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1203 print() |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1204 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1205 print('Training HMM...') |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1206 labelled_sequences, tag_set, symbols = load_pos(20000) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1207 trainer = HiddenMarkovModelTrainer(tag_set, symbols) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1208 hmm = trainer.train_supervised(labelled_sequences[10:], |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1209 estimator=lambda fd, bins: LidstoneProbDist(fd, 0.1, bins)) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1210 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1211 print('Testing...') |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1212 hmm.test(labelled_sequences[:10], verbose=True) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1213 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1214 def _untag(sentences): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1215 unlabeled = [] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1216 for sentence in sentences: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1217 unlabeled.append([(token[_TEXT], None) for token in sentence]) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1218 return unlabeled |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1219 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1220 def demo_pos_bw(test=10, supervised=20, unsupervised=10, verbose=True, |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1221 max_iterations=5): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1222 # demonstrates the Baum-Welch algorithm in POS tagging |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1223 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1224 print() |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1225 print("Baum-Welch demo for POS tagging") |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1226 print() |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1227 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1228 print('Training HMM (supervised, %d sentences)...' % supervised) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1229 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1230 sentences, tag_set, symbols = load_pos(test + supervised + unsupervised) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1231 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1232 symbols = set() |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1233 for sentence in sentences: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1234 for token in sentence: |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1235 symbols.add(token[_TEXT]) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1236 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1237 trainer = HiddenMarkovModelTrainer(tag_set, list(symbols)) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1238 hmm = trainer.train_supervised(sentences[test:test+supervised], |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1239 estimator=lambda fd, bins: LidstoneProbDist(fd, 0.1, bins)) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1240 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1241 hmm.test(sentences[:test], verbose=verbose) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1242 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1243 print('Training (unsupervised, %d sentences)...' % unsupervised) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1244 # it's rather slow - so only use 10 samples by default |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1245 unlabeled = _untag(sentences[test+supervised:]) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1246 hmm = trainer.train_unsupervised(unlabeled, model=hmm, |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1247 max_iterations=max_iterations) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1248 hmm.test(sentences[:test], verbose=verbose) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1249 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1250 def demo_bw(): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1251 # demo Baum Welch by generating some sequences and then performing |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1252 # unsupervised training on them |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1253 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1254 print() |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1255 print("Baum-Welch demo for market example") |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1256 print() |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1257 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1258 model, states, symbols = _market_hmm_example() |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1259 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1260 # generate some random sequences |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1261 training = [] |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1262 import random |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1263 rng = random.Random() |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1264 rng.seed(0) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1265 for i in range(10): |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1266 item = model.random_sample(rng, 5) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1267 training.append([(i[0], None) for i in item]) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1268 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1269 # train on those examples, starting with the model that generated them |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1270 trainer = HiddenMarkovModelTrainer(states, symbols) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1271 hmm = trainer.train_unsupervised(training, model=model, |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1272 max_iterations=1000) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1273 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1274 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1275 if __name__ == "__main__": |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1276 import doctest |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1277 doctest.testmod(optionflags=doctest.NORMALIZE_WHITESPACE) |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1278 |
26d9c0308fcf
updated/added from ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1279 |