Mercurial > hg > ooxml
annotate parse.py @ 71:54bb53434887
begin work on decoder that allows identifiers as keys and values
author | Henry S. Thompson <ht@markup.co.uk> |
---|---|
date | Wed, 28 Jun 2017 19:08:30 +0100 |
parents | 793496d2d165 |
children |
rev | line source |
---|---|
51
793496d2d165
allow command line input
Henry S. Thompson <ht@markup.co.uk>
parents:
35
diff
changeset
|
1 #!/usr/bin/python2.7 |
0
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
2 ''' Tokenise Excel formulae |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
3 Starting from regexps and concat rules in |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
4 https://sites.google.com/site/e90e50/random-topics/tool-for-parsing-formulas-in-excel |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
5 and |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
6 parser_formule_with_textbox_v01_2003.xla |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
7 linked to therein''' |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
8 |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
9 import sys,re |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
10 |
1
20424d7e99e4
handle sheet names more carefully
Henry S. Thompson <ht@markup.co.uk>
parents:
0
diff
changeset
|
11 cw = "=\-+*/();:,.$<>^!" |
0
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
12 #cWW = "[=\-+*/();:,.$<>^]" |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
13 |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
14 sListSeparator = "," |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
15 sRowSeparator = ";" |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
16 |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
17 # If comma is decimal symbol, then \ is used in place of comma in Array. |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
18 # If semi-colon is decimal symbol, then \ is used in place of semi-colon |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
19 # in Array. (Semi-colon is row separator) |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
20 # See https://www.ablebits.com/office-addins-blog/2015/02/25/array-formulas-functions-excel/ |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
21 |
1
20424d7e99e4
handle sheet names more carefully
Henry S. Thompson <ht@markup.co.uk>
parents:
0
diff
changeset
|
22 pats=["\"[^\"]*\"", |
0
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
23 "\{[^}]+}", |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
24 sListSeparator, |
1
20424d7e99e4
handle sheet names more carefully
Henry S. Thompson <ht@markup.co.uk>
parents:
0
diff
changeset
|
25 "[^" + cw + "]+(?:\.[^" + cw + "]+)*\\(", |
0
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
26 "\\)", |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
27 "^=|\\(", |
35
e500d7c18aad
Fixed confusion wrt gen vs. num, nature of @ format (id=49)
Henry S. Thompson <ht@markup.co.uk>
parents:
2
diff
changeset
|
28 "'[^']+'", #((?:'[^']+')|(?:\[[0-9]+\][^!]*)) |
2
263a1935d37d
more aggressive - (non-intl) splitting of refs and names,
Henry S. Thompson <ht@markup.co.uk>
parents:
1
diff
changeset
|
29 "\$?[A-Z]+\$?[0-9]+", |
263a1935d37d
more aggressive - (non-intl) splitting of refs and names,
Henry S. Thompson <ht@markup.co.uk>
parents:
1
diff
changeset
|
30 "[a-zA-Z_\\\\][a-zA-Z0-9._]*", |
0
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
31 "."] |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
32 |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
33 ## They perform the following tasks, in order: |
1
20424d7e99e4
handle sheet names more carefully
Henry S. Thompson <ht@markup.co.uk>
parents:
0
diff
changeset
|
34 ## 1. Represents a text (delimited by double quotes) |
0
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
35 ## 2. Represents a constant matrix |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
36 ## 3. Represents a list (function parameter) separator |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
37 ## 4. Represents a function name followed by an opening parenthesis |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
38 ## 5. Represents a closing parenthesis |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
39 ## 6. Represents the beginning of the formula or an opening |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
40 ## parenthesis (not part of a function) |
1
20424d7e99e4
handle sheet names more carefully
Henry S. Thompson <ht@markup.co.uk>
parents:
0
diff
changeset
|
41 ## 7. A sheet name (delimited by single quotes) and a ! |
2
263a1935d37d
more aggressive - (non-intl) splitting of refs and names,
Henry S. Thompson <ht@markup.co.uk>
parents:
1
diff
changeset
|
42 ## 8. A cell reference |
263a1935d37d
more aggressive - (non-intl) splitting of refs and names,
Henry S. Thompson <ht@markup.co.uk>
parents:
1
diff
changeset
|
43 ## 9. A name |
263a1935d37d
more aggressive - (non-intl) splitting of refs and names,
Henry S. Thompson <ht@markup.co.uk>
parents:
1
diff
changeset
|
44 ## 10. Each characters not matched by the previous patterns |
1
20424d7e99e4
handle sheet names more carefully
Henry S. Thompson <ht@markup.co.uk>
parents:
0
diff
changeset
|
45 |
2
263a1935d37d
more aggressive - (non-intl) splitting of refs and names,
Henry S. Thompson <ht@markup.co.uk>
parents:
1
diff
changeset
|
46 # As it stands this combines operators with following refs or vars |
263a1935d37d
more aggressive - (non-intl) splitting of refs and names,
Henry S. Thompson <ht@markup.co.uk>
parents:
1
diff
changeset
|
47 # Should we try ref="\$?[A-Z]+\$?[0-9]+" and var=[a-zA-Z_\\][a-zA-Z0-9._]* ? |
263a1935d37d
more aggressive - (non-intl) splitting of refs and names,
Henry S. Thompson <ht@markup.co.uk>
parents:
1
diff
changeset
|
48 # (suitably int'lised) |
0
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
49 |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
50 tokPat=re.compile("("+(")|(".join(pats))+")",re.IGNORECASE) |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
51 |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
52 f="""=IF(I$1>$C$2,VLOOKUP($C52,GP_input!$C$187:$CM$196,VLOOKUP($C$9,lists!$A$34:$B$39,2,FALSE)+BC$4,FALSE),SUMIF('BEX2011'!$C$32:$C$5000,$B$9&"sl"&$B52&$C52&$B53,'BEX2001'!Q$32))""" |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
53 |
51
793496d2d165
allow command line input
Henry S. Thompson <ht@markup.co.uk>
parents:
35
diff
changeset
|
54 if len(sys.argv)>1: |
793496d2d165
allow command line input
Henry S. Thompson <ht@markup.co.uk>
parents:
35
diff
changeset
|
55 f=sys.argv[1] |
793496d2d165
allow command line input
Henry S. Thompson <ht@markup.co.uk>
parents:
35
diff
changeset
|
56 |
0
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
57 l=tokPat.findall(f) |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
58 |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
59 ## Visual basic code: |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
60 ## set M = RE.Execute(s) |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
61 ## s = "" |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
62 ## For Each SM In M |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
63 ## Set SB = SM.SubMatches |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
64 ## If Len(SB(0) & SB(6)) Then |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
65 ## t = SB(0) & SB(6) [HST doesn't understand why atoms |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
66 ## and strings are concatenated] |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
67 ## ElseIf Len(SB(1)) Then |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
68 ## t = Array_Const_Wrap(SB(1), sRowSeparator) & vbCr |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
69 ## ElseIf Len(SB(2) & SB(5)) Then |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
70 ## t = SB(2) & SB(5) & vbCr |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
71 ## ElseIf Len(SB(3)) Then |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
72 ## t = vbCr & SB(3) & vbCr |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
73 ## ElseIf Len(SB(4)) Then |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
74 ## t = vbCr & SB(4) |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
75 ## End If |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
76 ## s = s & t |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
77 ## Next |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
78 |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
79 def mergeMatches(l): |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
80 res="" |
1
20424d7e99e4
handle sheet names more carefully
Henry S. Thompson <ht@markup.co.uk>
parents:
0
diff
changeset
|
81 rtype=None |
0
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
82 cur="" |
2
263a1935d37d
more aggressive - (non-intl) splitting of refs and names,
Henry S. Thompson <ht@markup.co.uk>
parents:
1
diff
changeset
|
83 for txt,cm,sep,ofun,close,opn,sheet,ref,var,misc in l: |
0
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
84 if txt is not '': |
1
20424d7e99e4
handle sheet names more carefully
Henry S. Thompson <ht@markup.co.uk>
parents:
0
diff
changeset
|
85 if res is not '' and rtype is not 1: |
20424d7e99e4
handle sheet names more carefully
Henry S. Thompson <ht@markup.co.uk>
parents:
0
diff
changeset
|
86 yield res |
20424d7e99e4
handle sheet names more carefully
Henry S. Thompson <ht@markup.co.uk>
parents:
0
diff
changeset
|
87 rtype=1 |
20424d7e99e4
handle sheet names more carefully
Henry S. Thompson <ht@markup.co.uk>
parents:
0
diff
changeset
|
88 res='' |
0
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
89 res+=txt |
1
20424d7e99e4
handle sheet names more carefully
Henry S. Thompson <ht@markup.co.uk>
parents:
0
diff
changeset
|
90 rtype=1 |
0
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
91 continue |
2
263a1935d37d
more aggressive - (non-intl) splitting of refs and names,
Henry S. Thompson <ht@markup.co.uk>
parents:
1
diff
changeset
|
92 if misc is not '': |
1
20424d7e99e4
handle sheet names more carefully
Henry S. Thompson <ht@markup.co.uk>
parents:
0
diff
changeset
|
93 if res is not '' and rtype is not 2: |
20424d7e99e4
handle sheet names more carefully
Henry S. Thompson <ht@markup.co.uk>
parents:
0
diff
changeset
|
94 yield res |
20424d7e99e4
handle sheet names more carefully
Henry S. Thompson <ht@markup.co.uk>
parents:
0
diff
changeset
|
95 rtype=2 |
20424d7e99e4
handle sheet names more carefully
Henry S. Thompson <ht@markup.co.uk>
parents:
0
diff
changeset
|
96 res='' |
2
263a1935d37d
more aggressive - (non-intl) splitting of refs and names,
Henry S. Thompson <ht@markup.co.uk>
parents:
1
diff
changeset
|
97 res+=misc |
1
20424d7e99e4
handle sheet names more carefully
Henry S. Thompson <ht@markup.co.uk>
parents:
0
diff
changeset
|
98 rtype=2 |
0
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
99 continue |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
100 if cm is not '': |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
101 cur=cm |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
102 elif sep is not '': |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
103 cur=sep |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
104 elif ofun is not '': |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
105 cur=ofun |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
106 elif close is not '': |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
107 cur=close |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
108 elif opn is not '': |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
109 cur=opn |
2
263a1935d37d
more aggressive - (non-intl) splitting of refs and names,
Henry S. Thompson <ht@markup.co.uk>
parents:
1
diff
changeset
|
110 elif sheet is not '': |
263a1935d37d
more aggressive - (non-intl) splitting of refs and names,
Henry S. Thompson <ht@markup.co.uk>
parents:
1
diff
changeset
|
111 cur=sheet |
263a1935d37d
more aggressive - (non-intl) splitting of refs and names,
Henry S. Thompson <ht@markup.co.uk>
parents:
1
diff
changeset
|
112 elif ref is not '': |
263a1935d37d
more aggressive - (non-intl) splitting of refs and names,
Henry S. Thompson <ht@markup.co.uk>
parents:
1
diff
changeset
|
113 cur=ref |
263a1935d37d
more aggressive - (non-intl) splitting of refs and names,
Henry S. Thompson <ht@markup.co.uk>
parents:
1
diff
changeset
|
114 elif var is not '': |
263a1935d37d
more aggressive - (non-intl) splitting of refs and names,
Henry S. Thompson <ht@markup.co.uk>
parents:
1
diff
changeset
|
115 cur=var |
0
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
116 if res is not '': |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
117 yield res |
1
20424d7e99e4
handle sheet names more carefully
Henry S. Thompson <ht@markup.co.uk>
parents:
0
diff
changeset
|
118 rtype=None |
0
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
119 res='' |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
120 yield cur |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
121 |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
122 toks=list(mergeMatches(l)) |
2
263a1935d37d
more aggressive - (non-intl) splitting of refs and names,
Henry S. Thompson <ht@markup.co.uk>
parents:
1
diff
changeset
|
123 print (''.join(toks))==f |