Mercurial > hg > ooxml
annotate parse.py @ 40:ac6d1ca099f7
simplified rect output, refs (not refs2) working again
author | Henry S. Thompson <ht@markup.co.uk> |
---|---|
date | Tue, 02 May 2017 09:20:13 +0100 |
parents | e500d7c18aad |
children | 793496d2d165 |
rev | line source |
---|---|
0
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
1 ''' Tokenise Excel formulae |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
2 Starting from regexps and concat rules in |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
3 https://sites.google.com/site/e90e50/random-topics/tool-for-parsing-formulas-in-excel |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
4 and |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
5 parser_formule_with_textbox_v01_2003.xla |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
6 linked to therein''' |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
7 |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
8 import sys,re |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
9 |
1
20424d7e99e4
handle sheet names more carefully
Henry S. Thompson <ht@markup.co.uk>
parents:
0
diff
changeset
|
10 cw = "=\-+*/();:,.$<>^!" |
0
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
11 #cWW = "[=\-+*/();:,.$<>^]" |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
12 |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
13 sListSeparator = "," |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
14 sRowSeparator = ";" |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
15 |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
16 # If comma is decimal symbol, then \ is used in place of comma in Array. |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
17 # If semi-colon is decimal symbol, then \ is used in place of semi-colon |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
18 # in Array. (Semi-colon is row separator) |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
19 # See https://www.ablebits.com/office-addins-blog/2015/02/25/array-formulas-functions-excel/ |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
20 |
1
20424d7e99e4
handle sheet names more carefully
Henry S. Thompson <ht@markup.co.uk>
parents:
0
diff
changeset
|
21 pats=["\"[^\"]*\"", |
0
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
22 "\{[^}]+}", |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
23 sListSeparator, |
1
20424d7e99e4
handle sheet names more carefully
Henry S. Thompson <ht@markup.co.uk>
parents:
0
diff
changeset
|
24 "[^" + cw + "]+(?:\.[^" + cw + "]+)*\\(", |
0
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
25 "\\)", |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
26 "^=|\\(", |
35
e500d7c18aad
Fixed confusion wrt gen vs. num, nature of @ format (id=49)
Henry S. Thompson <ht@markup.co.uk>
parents:
2
diff
changeset
|
27 "'[^']+'", #((?:'[^']+')|(?:\[[0-9]+\][^!]*)) |
2
263a1935d37d
more aggressive - (non-intl) splitting of refs and names,
Henry S. Thompson <ht@markup.co.uk>
parents:
1
diff
changeset
|
28 "\$?[A-Z]+\$?[0-9]+", |
263a1935d37d
more aggressive - (non-intl) splitting of refs and names,
Henry S. Thompson <ht@markup.co.uk>
parents:
1
diff
changeset
|
29 "[a-zA-Z_\\\\][a-zA-Z0-9._]*", |
0
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
30 "."] |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
31 |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
32 ## They perform the following tasks, in order: |
1
20424d7e99e4
handle sheet names more carefully
Henry S. Thompson <ht@markup.co.uk>
parents:
0
diff
changeset
|
33 ## 1. Represents a text (delimited by double quotes) |
0
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
34 ## 2. Represents a constant matrix |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
35 ## 3. Represents a list (function parameter) separator |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
36 ## 4. Represents a function name followed by an opening parenthesis |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
37 ## 5. Represents a closing parenthesis |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
38 ## 6. Represents the beginning of the formula or an opening |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
39 ## parenthesis (not part of a function) |
1
20424d7e99e4
handle sheet names more carefully
Henry S. Thompson <ht@markup.co.uk>
parents:
0
diff
changeset
|
40 ## 7. A sheet name (delimited by single quotes) and a ! |
2
263a1935d37d
more aggressive - (non-intl) splitting of refs and names,
Henry S. Thompson <ht@markup.co.uk>
parents:
1
diff
changeset
|
41 ## 8. A cell reference |
263a1935d37d
more aggressive - (non-intl) splitting of refs and names,
Henry S. Thompson <ht@markup.co.uk>
parents:
1
diff
changeset
|
42 ## 9. A name |
263a1935d37d
more aggressive - (non-intl) splitting of refs and names,
Henry S. Thompson <ht@markup.co.uk>
parents:
1
diff
changeset
|
43 ## 10. Each characters not matched by the previous patterns |
1
20424d7e99e4
handle sheet names more carefully
Henry S. Thompson <ht@markup.co.uk>
parents:
0
diff
changeset
|
44 |
2
263a1935d37d
more aggressive - (non-intl) splitting of refs and names,
Henry S. Thompson <ht@markup.co.uk>
parents:
1
diff
changeset
|
45 # As it stands this combines operators with following refs or vars |
263a1935d37d
more aggressive - (non-intl) splitting of refs and names,
Henry S. Thompson <ht@markup.co.uk>
parents:
1
diff
changeset
|
46 # Should we try ref="\$?[A-Z]+\$?[0-9]+" and var=[a-zA-Z_\\][a-zA-Z0-9._]* ? |
263a1935d37d
more aggressive - (non-intl) splitting of refs and names,
Henry S. Thompson <ht@markup.co.uk>
parents:
1
diff
changeset
|
47 # (suitably int'lised) |
0
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
48 |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
49 tokPat=re.compile("("+(")|(".join(pats))+")",re.IGNORECASE) |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
50 |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
51 f="""=IF(I$1>$C$2,VLOOKUP($C52,GP_input!$C$187:$CM$196,VLOOKUP($C$9,lists!$A$34:$B$39,2,FALSE)+BC$4,FALSE),SUMIF('BEX2011'!$C$32:$C$5000,$B$9&"sl"&$B52&$C52&$B53,'BEX2001'!Q$32))""" |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
52 |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
53 l=tokPat.findall(f) |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
54 |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
55 ## Visual basic code: |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
56 ## set M = RE.Execute(s) |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
57 ## s = "" |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
58 ## For Each SM In M |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
59 ## Set SB = SM.SubMatches |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
60 ## If Len(SB(0) & SB(6)) Then |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
61 ## t = SB(0) & SB(6) [HST doesn't understand why atoms |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
62 ## and strings are concatenated] |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
63 ## ElseIf Len(SB(1)) Then |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
64 ## t = Array_Const_Wrap(SB(1), sRowSeparator) & vbCr |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
65 ## ElseIf Len(SB(2) & SB(5)) Then |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
66 ## t = SB(2) & SB(5) & vbCr |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
67 ## ElseIf Len(SB(3)) Then |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
68 ## t = vbCr & SB(3) & vbCr |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
69 ## ElseIf Len(SB(4)) Then |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
70 ## t = vbCr & SB(4) |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
71 ## End If |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
72 ## s = s & t |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
73 ## Next |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
74 |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
75 def mergeMatches(l): |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
76 res="" |
1
20424d7e99e4
handle sheet names more carefully
Henry S. Thompson <ht@markup.co.uk>
parents:
0
diff
changeset
|
77 rtype=None |
0
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
78 cur="" |
2
263a1935d37d
more aggressive - (non-intl) splitting of refs and names,
Henry S. Thompson <ht@markup.co.uk>
parents:
1
diff
changeset
|
79 for txt,cm,sep,ofun,close,opn,sheet,ref,var,misc in l: |
0
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
80 if txt is not '': |
1
20424d7e99e4
handle sheet names more carefully
Henry S. Thompson <ht@markup.co.uk>
parents:
0
diff
changeset
|
81 if res is not '' and rtype is not 1: |
20424d7e99e4
handle sheet names more carefully
Henry S. Thompson <ht@markup.co.uk>
parents:
0
diff
changeset
|
82 yield res |
20424d7e99e4
handle sheet names more carefully
Henry S. Thompson <ht@markup.co.uk>
parents:
0
diff
changeset
|
83 rtype=1 |
20424d7e99e4
handle sheet names more carefully
Henry S. Thompson <ht@markup.co.uk>
parents:
0
diff
changeset
|
84 res='' |
0
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
85 res+=txt |
1
20424d7e99e4
handle sheet names more carefully
Henry S. Thompson <ht@markup.co.uk>
parents:
0
diff
changeset
|
86 rtype=1 |
0
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
87 continue |
2
263a1935d37d
more aggressive - (non-intl) splitting of refs and names,
Henry S. Thompson <ht@markup.co.uk>
parents:
1
diff
changeset
|
88 if misc is not '': |
1
20424d7e99e4
handle sheet names more carefully
Henry S. Thompson <ht@markup.co.uk>
parents:
0
diff
changeset
|
89 if res is not '' and rtype is not 2: |
20424d7e99e4
handle sheet names more carefully
Henry S. Thompson <ht@markup.co.uk>
parents:
0
diff
changeset
|
90 yield res |
20424d7e99e4
handle sheet names more carefully
Henry S. Thompson <ht@markup.co.uk>
parents:
0
diff
changeset
|
91 rtype=2 |
20424d7e99e4
handle sheet names more carefully
Henry S. Thompson <ht@markup.co.uk>
parents:
0
diff
changeset
|
92 res='' |
2
263a1935d37d
more aggressive - (non-intl) splitting of refs and names,
Henry S. Thompson <ht@markup.co.uk>
parents:
1
diff
changeset
|
93 res+=misc |
1
20424d7e99e4
handle sheet names more carefully
Henry S. Thompson <ht@markup.co.uk>
parents:
0
diff
changeset
|
94 rtype=2 |
0
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
95 continue |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
96 if cm is not '': |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
97 cur=cm |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
98 elif sep is not '': |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
99 cur=sep |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
100 elif ofun is not '': |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
101 cur=ofun |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
102 elif close is not '': |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
103 cur=close |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
104 elif opn is not '': |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
105 cur=opn |
2
263a1935d37d
more aggressive - (non-intl) splitting of refs and names,
Henry S. Thompson <ht@markup.co.uk>
parents:
1
diff
changeset
|
106 elif sheet is not '': |
263a1935d37d
more aggressive - (non-intl) splitting of refs and names,
Henry S. Thompson <ht@markup.co.uk>
parents:
1
diff
changeset
|
107 cur=sheet |
263a1935d37d
more aggressive - (non-intl) splitting of refs and names,
Henry S. Thompson <ht@markup.co.uk>
parents:
1
diff
changeset
|
108 elif ref is not '': |
263a1935d37d
more aggressive - (non-intl) splitting of refs and names,
Henry S. Thompson <ht@markup.co.uk>
parents:
1
diff
changeset
|
109 cur=ref |
263a1935d37d
more aggressive - (non-intl) splitting of refs and names,
Henry S. Thompson <ht@markup.co.uk>
parents:
1
diff
changeset
|
110 elif var is not '': |
263a1935d37d
more aggressive - (non-intl) splitting of refs and names,
Henry S. Thompson <ht@markup.co.uk>
parents:
1
diff
changeset
|
111 cur=var |
0
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
112 if res is not '': |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
113 yield res |
1
20424d7e99e4
handle sheet names more carefully
Henry S. Thompson <ht@markup.co.uk>
parents:
0
diff
changeset
|
114 rtype=None |
0
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
115 res='' |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
116 yield cur |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
117 |
ddd5f7539abc
reconstruct the original more or less
Henry Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
118 toks=list(mergeMatches(l)) |
2
263a1935d37d
more aggressive - (non-intl) splitting of refs and names,
Henry S. Thompson <ht@markup.co.uk>
parents:
1
diff
changeset
|
119 print (''.join(toks))==f |