annotate notes.txt @ 64:823ac978f4ab
scrappy first pass w/o auto features
author | Henry S. Thompson <ht@markup.co.uk> |
---|---|
date | Mon, 12 Jun 2017 16:46:45 +0200 |
parents | adeb9575b273 |
Tokenisation patterns, derived from parse.py, derived from
 https://sites.google.com/site/e90e50/random-topics/tool-for-parsing-formulas-in-excel
and
 parser_formule_with_textbox_v01_2003.xla
linked to therein

 1  ("[^"]*")   q
      A text (delimited by double quotes)
 2  (\{[^}]+})   m
      A constant matrix
 3  (,)   c
      A list (function parameter) separator
 4  ([^=\-+*/();:,.$<>^!]+(?:\.[^=\-+*/();:,.$<>^!]+)*\()   f
      A function name followed by an opening parenthesis
 5  ([)])   p
      A closing parenthesis
 6  (^=|\()   l
      The beginning of the formula or an opening
      parenthesis (not part of a function)
 7  ((?:(?:'[^']+')|(?:\[[0-9]+\][^!]*)|(?:[a-zA-Z_][a-zA-Z0-9._]*)!))   n
      A sheet name (either delimited by single quotes, or
      bracketed number plus optional string,
      or simple name (syntax is a _guess_))
 8  (\$?[A-Z]+\$?[0-9]+)   s or r
      A cell reference
 9  ([a-zA-Z_\\][a-zA-Z0-9._]*)   v
      A name (boolean constant or a variable -- anything else?)
 10 (.)   x
      Single characters not matched by the previous patterns
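
A minimal sketch of how the pattern table above could drive a tokeniser, assuming the patterns are tried in the listed order at each position and the first match wins. The function name tokenise, the label 's' standing in for "s or r", and the (label, text) output shape are assumptions for illustration, not necessarily what parse.py does.

```python
import re

# Patterns copied from the table above, in priority order.
# 's' stands in for the "s or r" label (an assumption).
PATTERNS = [
    (r'("[^"]*")', 'q'),
    (r'(\{[^}]+})', 'm'),
    (r'(,)', 'c'),
    (r'([^=\-+*/();:,.$<>^!]+(?:\.[^=\-+*/();:,.$<>^!]+)*\()', 'f'),
    (r'([)])', 'p'),
    (r'(^=|\()', 'l'),
    (r"((?:(?:'[^']+')|(?:\[[0-9]+\][^!]*)|(?:[a-zA-Z_][a-zA-Z0-9._]*)!))", 'n'),
    (r'(\$?[A-Z]+\$?[0-9]+)', 's'),
    (r'([a-zA-Z_\\][a-zA-Z0-9._]*)', 'v'),
    (r'(.)', 'x'),
]
COMPILED = [(re.compile(p), label) for p, label in PATTERNS]


def tokenise(formula):
    """Return a list of (label, text) pairs; first matching pattern wins."""
    tokens, pos = [], 0
    while pos < len(formula):
        for regex, label in COMPILED:
            m = regex.match(formula, pos)
            if m:
                tokens.append((label, m.group(1)))
                pos = m.end()
                break
        else:
            # Only reachable for characters '.' cannot match (newlines).
            raise ValueError('cannot tokenise at offset %d' % pos)
    return tokens


print(tokenise('=SUM(A1:B2,"x")'))
```

If I read the grouping in pattern 7 right, the trailing ! belongs only to the last alternative, so a quoted sheet name matches without its !, which then falls through to x.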
----------
You can't depend on
  <f si="..." t="shared"/>
That is, it's _true_, but you can have a table with shared formulae
that doesn't use it.  Compare M17:T28 (see below, uses shared) and
C17:J28 (mostly no shared) in sample4

Looks like a sweep-and-copy-{right,down} results in the
_new_ cells covered showing as 'shared':

 [ ][1][1][1][1]...
 [2][2][2][2][2]...
 [2][2][2][2][2]...
 ...

Presumably that one was right-then-down; down-then-right would give a
slightly different pattern
--------
Thinking about a pipeline...
 1) convert all variable references into (verbose!) elts:
    <!ELEMENT R EMPTY>
    <!ATTLIST R
       ac CDATA #IMPLIED
       rc CDATA #IMPLIED
       ar CDATA #IMPLIED
       rr CDATA #IMPLIED>

    where e.g. ac is 'absolute column'
    'D6' --> <R rc='D' rr='6'/>
    and
    '$E5' --> <R ac='E' rr='5'/>
    No, in fact -- absolute vs. 'variable' isn't relevant for our purposes.
    What we probably _do_ want is to add to every reference a _relative_
    version, i.e. +/-columnDelta, +/-rowDelta
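
A small sketch of the relative version of a reference, assuming the deltas are taken from the cell holding the formula (the "home" cell) to the referenced cell, and that $ markers are simply ignored. The names col_to_num and relative are made up for illustration.

```python
import re

REF = re.compile(r'\$?([A-Z]+)\$?([0-9]+)$')


def col_to_num(col):
    """Column letters to a 1-based number: A -> 1, Z -> 26, AA -> 27."""
    n = 0
    for ch in col:
        n = n * 26 + (ord(ch) - ord('A') + 1)
    return n


def relative(ref, home):
    """(columnDelta, rowDelta) from the home cell to the referenced cell."""
    rcol, rrow = REF.match(ref).groups()
    hcol, hrow = REF.match(home).groups()
    return (col_to_num(rcol) - col_to_num(hcol), int(rrow) - int(hrow))


print(relative('D6', 'C3'))
print(relative('$E5', 'G5'))
```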
--------
Identifying dates is . . . tedious.  They will be ints or floats (?),
with s="<int>", where the int is a 0-origin index into the list of
  <xf...numFmtId="<bin>".../>
children of <cellXfs> in styles.xml, and bin is a built-in date format
code, see 18.8.30 numFmt (Number Format) in ISO/IEC 29500-1:2016(E) ==
C071691e.pdf  DONE
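
A sketch of that lookup, assuming the built-in date/time format ids are 14-22 and 45-47 (my reading of the numFmt table) and that custom formats are ignored. The inline styles.xml fragment and the name is_builtin_date are invented for illustration.

```python
import xml.etree.ElementTree as ET

NS = {'m': 'http://schemas.openxmlformats.org/spreadsheetml/2006/main'}

# Assumption: built-in numFmtId values that denote dates or times.
BUILTIN_DATE_IDS = set(range(14, 23)) | {45, 46, 47}

# Hypothetical, heavily cut-down styles.xml.
STYLES = """<styleSheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <cellXfs count="2">
    <xf numFmtId="0"/>
    <xf numFmtId="14"/>
  </cellXfs>
</styleSheet>"""


def is_builtin_date(styles_root, s):
    """s is the cell's s="..." attribute: a 0-origin index into cellXfs."""
    xfs = styles_root.findall('m:cellXfs/m:xf', NS)
    return int(xfs[int(s)].get('numFmtId')) in BUILTIN_DATE_IDS


root = ET.fromstring(STYLES)
print(is_builtin_date(root, '1'), is_builtin_date(root, '0'))
```

Custom formats would still need their format code inspected; this only covers the built-in case.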
---------
Decided to distinguish between type (num, date, str, err, ...) and
class (cur(rency), others to come?).  If non-standard code, just record
that.
The current pipe has two main steps, followed by an optional
prettifying step:
  format.xsl (extracts type={bool,date,num,str,err}
                       class={cur,[nothing else yet]}
                       code={raw format code if not recognised})
  rect.xsl (fills in gaps, cuts down size, using only bdnse for
            <t>[ype] with attrs c[lass]={c,...} and [co]d[e]=...)
   For now, just using first letters of type, class  DONE
----------
Hmm, looking at real data (kenneth_lay__19506), I see _lots_ of cells
with (numerical) formats, but no content.  Where do I throw those
away?  Can throw away empty _rows_ in rect.xsl, but for _cells_ have
to wait for ascii.xsl or html.xsl.  But only copy the type in rect if
there was content before.  DONE
-----------
Using attributes to hold space-separated lists, as in
refs.xsl output, is risky!  Fixed, see below.
-----------
Not handling variables as references FIXED.  Not catching external
references to variables FIXED (as externals).  Not catching naked [n]! as external
references FIXED
Solo local vars are recursively dereferenced
The definition table is in workbook.xml definedNames/definedName[@name=$name]/.
Sheet name to filename mapping for locals is in workbook.xml
 sheets/sheet[@name=$sname]/@sheetId
These appear in definedName, single-quoted if (iff?) the sheet name has spaces
 (or other specials?)
??? Variables on l or r of ranges are just looked up: if they are complex
no recursion is done: the _semantics_ of this case are not clear to
me, need a real-life example...
Variables whose value is itself a range are not being handled FIXED
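
A sketch of the definedName and sheet lookups described above, assuming the name's value has the form sheet!ref and that quoting is just a pair of surrounding single quotes. The workbook fragment, the name Rate and the function lookup are invented for illustration.

```python
import xml.etree.ElementTree as ET

NS = {'m': 'http://schemas.openxmlformats.org/spreadsheetml/2006/main'}

# Hypothetical, cut-down workbook.xml.
WORKBOOK = """<workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <sheets>
    <sheet name="Input Data" sheetId="1"/>
    <sheet name="Summary" sheetId="2"/>
  </sheets>
  <definedNames>
    <definedName name="Rate">'Input Data'!$B$2</definedName>
  </definedNames>
</workbook>"""


def lookup(root, name):
    """Return (sheetId, ref) for a defined name, or None if it is not defined."""
    for dn in root.findall('m:definedNames/m:definedName', NS):
        if dn.get('name') == name:
            sheet, _, ref = dn.text.rpartition('!')
            sheet = sheet.strip("'")  # assumption: quotes only wrap the name
            for sh in root.findall('m:sheets/m:sheet', NS):
                if sh.get('name') == sheet:
                    return sh.get('sheetId'), ref
            return None, ref
    return None


print(lookup(ET.fromstring(WORKBOOK), 'Rate'))
```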
-----------
Switch to default namespace in order to reduce size and improve
readability, and to elements instead of attributes  DONE
-----------
Should put another step after refs.xsl to compute a map from
distinct-values of all targets to all the cells which use them
DONE.
Now provides inverted rel. ref info so that static input columns can
be identified from their users.
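
The inversion itself is small; a sketch, assuming the refs step yields a mapping from each formula cell to the cells it mentions (the dict shape and the name invert are assumptions, the real step works on XML):

```python
from collections import defaultdict


def invert(refs):
    """Map each distinct target cell to the list of cells that use it."""
    users = defaultdict(list)
    for cell, targets in refs.items():
        for target in sorted(set(targets)):  # distinct values per cell
            users[target].append(cell)
    return dict(users)


print(invert({'C1': ['A1', 'B1'], 'C2': ['A2', 'B1']}))
```

A target with many users and no formula of its own is then a candidate static input.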

Likewise ranges -- range endpoints in place, @@ what about the things themselves?

That really does mean we should move to elts for
each ref or range, since at this point we want to compute vector
representation as well DONE, so we can identify projections

Slightly irritating that we'll have to serialise this as XML and then
re-build it later...
-----------
Overgenerating in kenneth_lay__19506: e.g. <e:ref c="E9" er="[1]!'.SPX' '.SPX'!"/>
 from <f>[1]!'.SPX'</f>
Hmm.  This cell displays in Excel as REUTER|IDN!.SPX
The indirections work as follows:
 in workbook.xml:
   <externalReferences>
    <externalReference r:id="rId3"/>
    <externalReference r:id="rId4"/>
   </externalReferences>
 in _rels/workbook.xml.rels
   <Relationship Id="rId3" Target="externalLinks/externalLink1.xml" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/externalLink"/>
 in externalLinks/externalLink1.xml
   <ddeLink ddeService="REUTER" ddeTopic="IDN"...
    <ddeItems>
     ...
     <ddeItem advise="1" name=".SPX">
      <values>
       <value>
        <val>1264.96</val>
       </value>
      </values>
     </ddeItem>
Whew!
FIXED
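
A sketch of following that chain in code, assuming [n] is a 1-based index into the externalReference list. The three parts are minimal reconstructions of the fragments above (closing tags and namespace declarations filled in so they parse); PARTS and resolve are invented names.

```python
import xml.etree.ElementTree as ET

MAIN = 'http://schemas.openxmlformats.org/spreadsheetml/2006/main'
REL = 'http://schemas.openxmlformats.org/officeDocument/2006/relationships'
PKG = 'http://schemas.openxmlformats.org/package/2006/relationships'
NS = {'m': MAIN}

# Hypothetical minimal parts, keyed by path relative to xl/.
PARTS = {
    'workbook.xml': f'''<workbook xmlns="{MAIN}" xmlns:r="{REL}">
 <externalReferences>
  <externalReference r:id="rId3"/>
  <externalReference r:id="rId4"/>
 </externalReferences></workbook>''',
    '_rels/workbook.xml.rels': f'''<Relationships xmlns="{PKG}">
 <Relationship Id="rId3" Target="externalLinks/externalLink1.xml" Type="{REL}/externalLink"/>
</Relationships>''',
    'externalLinks/externalLink1.xml': f'''<externalLink xmlns="{MAIN}">
 <ddeLink ddeService="REUTER" ddeTopic="IDN"><ddeItems>
  <ddeItem advise="1" name=".SPX"><values><value><val>1264.96</val></value></values></ddeItem>
 </ddeItems></ddeLink></externalLink>''',
}


def resolve(parts, index, item):
    """Follow [index]!'item' through workbook.xml, the rels file and the link part."""
    wb = ET.fromstring(parts['workbook.xml'])
    ext = wb.findall('m:externalReferences/m:externalReference', NS)[index - 1]
    rid = ext.get('{%s}id' % REL)
    rels = ET.fromstring(parts['_rels/workbook.xml.rels'])
    target = next(r.get('Target') for r in rels if r.get('Id') == rid)
    link = ET.fromstring(parts[target]).find('m:ddeLink', NS)
    for dde_item in link.iter('{%s}ddeItem' % MAIN):
        if dde_item.get('name') == item:
            label = '%s|%s!%s' % (link.get('ddeService'), link.get('ddeTopic'), item)
            return label, dde_item.find('.//m:val', NS).text
    return None


print(resolve(PARTS, 1, '.SPX'))
```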
----------
http://upcommons.upc.edu/bitstream/handle/2117/100584/KDIR_2016_47_CR.pdf
[downloaded]
uses appearance a lot.  That needs to be harvested from styles.xml
The kenneth_lay enron sample has _403_ numbered formats...
----------
Tried the largest sheet from the largest .xlsx I could find:
fuse1k/'benjamin_rogers__1002__NYISO Price Information version 2'.xlsx
-rw-r--r-- 1 ht None  6273325 Apr  3 16:22 '../benjamin_rogers__1002__NYISO Price Information version 2.xlsx'
-rw-r--r-- 1 ht None 23221149 Jan  1  1980 xl/worksheets/sheet3.xml

> lxcount xl/worksheets/sheet3.xml | sort -k2nr
*Total* 1230217
c        596032
v        595876
f         19201
row       18985
col         106

<dimension ref="A1:DY18985"/>

Blew java out of the water :-(
java.lang.OutOfMemoryError: Java heap space

Need to try again with more memory, if I remember how...

The raw result is going to have 18985 x 102 == 2 million cells ==
(assuming average cell size of 30 bytes and row overhead of 20)
(* 18985 (+ 20 (* 102 30))) == 58,473,800 bytes, which is big but tolerable...
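
The same estimate spelled out, using only the figures above (30 bytes per cell and 20 bytes per row are the stated guesses, not measurements):

```python
ROWS, COLS = 18985, 102
CELL_BYTES, ROW_OVERHEAD = 30, 20

cells = ROWS * COLS
total_bytes = ROWS * (ROW_OVERHEAD + COLS * CELL_BYTES)

print(cells)        # 1936470, i.e. roughly 2 million
print(total_bytes)  # 58473800
```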
----------------
sample4 html reveals several problems:
  mistaken content based on class bug, e.g. B4 is 'a'  FIXED
  highlighted cells are being labelled as cur, e.g. B61 in
  output of format.xsl  FIXED
-----------
Need to rethink variable handling...
Is all we really need a normalised formula computation?:
  1) recursively replace variables;
  2) convert all simple refs to new CR string normal form:
     crnf ::= col row
      col ::= abs | rel
      row ::= abs | rel
      abs ::= '\xAA' xs:positiveInteger
      rel ::= '\xAE' ( ( '-' xs:positiveInteger ) | xs:nonNegativeInteger )
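
A sketch of step 2 under the grammar above, assuming a $ marks an absolute part (kept as its 1-based number) and anything else is relative to the cell holding the formula. The names crnf and col_to_num are invented; only the two marker characters come from the grammar.

```python
import re

ABS, REL = '\xaa', '\xae'  # marker characters from the grammar above
REF = re.compile(r'(\$?)([A-Z]+)(\$?)([0-9]+)$')


def col_to_num(col):
    """Column letters to a 1-based number: A -> 1, Z -> 26, AA -> 27."""
    n = 0
    for ch in col:
        n = n * 26 + (ord(ch) - ord('A') + 1)
    return n


def crnf(ref, home):
    """Normal form of ref as seen from the formula's home cell."""
    col_abs, col, row_abs, row = REF.match(ref).groups()
    _, home_col, _, home_row = REF.match(home).groups()
    c, r = col_to_num(col), int(row)
    col_part = ABS + str(c) if col_abs else REL + str(c - col_to_num(home_col))
    row_part = ABS + str(r) if row_abs else REL + str(r - int(home_row))
    return col_part + row_part


print(repr(crnf('$E5', 'C3')))
print(repr(crnf('A1', 'C3')))
```

Two formulae copied down a column then normalise to the same string, which is what makes them comparable.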
-----------
Would <c c=COL [fi= [si=]]> be sufficient?
  fi an index into _all_ functions, si original index into explicitly
  shared functions -- note that the same fi may appear with multiple
  si, see discussion back at the top of this doc.
Brute force for now -- rect sees shared table, computes CRNF
Not good enough -- <f> in shared table can't be used as is, need to
rebuild ref names relative to each new home.  FIXED
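
A sketch of rebuilding a ref name at a new home from its normal form, assuming the '\xAA' (absolute) and '\xAE' (relative) markers and a column part followed by a row part. denormalise, num_to_col and col_to_num are invented names.

```python
import re

ABS, REL = '\xaa', '\xae'
PART = re.compile('([\xaa\xae])(-?[0-9]+)')


def col_to_num(col):
    n = 0
    for ch in col:
        n = n * 26 + (ord(ch) - ord('A') + 1)
    return n


def num_to_col(n):
    """1-based column number back to letters: 1 -> A, 27 -> AA."""
    letters = ''
    while n > 0:
        n, rem = divmod(n - 1, 26)
        letters = chr(ord('A') + rem) + letters
    return letters


def denormalise(nf, home):
    """Rebuild the A1-style name that nf denotes when placed at home."""
    (ckind, cval), (rkind, rval) = PART.findall(nf)
    m = re.match(r'([A-Z]+)([0-9]+)$', home)
    home_col, home_row = col_to_num(m.group(1)), int(m.group(2))
    col = '$' + num_to_col(int(cval)) if ckind == ABS else num_to_col(home_col + int(cval))
    row = '$' + rval if rkind == ABS else str(home_row + int(rval))
    return col + row


print(denormalise('\xae-2\xae-2', 'C3'), denormalise('\xae-2\xae-2', 'D4'))
```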
-----------
Picking colours to label regions, e.g. with similar formulae:
 http://stackoverflow.com/questions/470690/how-to-automatically-generate-n-distinct-colors
Start with just top-n, limited to 22 from Kelly
    #FFB300, # Vivid Yellow
    #803E75, # Strong Purple
    #FF6800, # Vivid Orange
    #A6BDD7, # Very Light Blue
    #C10020, # Vivid Red
    #CEA262, # Grayish Yellow
    #817066, # Medium Gray
    # The following don't work well for people with defective color vision
    #007D34, # Vivid Green
    #F6768E, # Strong Purplish Pink
    #00538A, # Strong Blue
    #FF7A5C, # Strong Yellowish Pink
    #53377A, # Strong Violet
    #FF8E00, # Vivid Orange Yellow
    #B32851, # Strong Purplish Red
    #F4C800, # Vivid Greenish Yellow
    #7F180D, # Strong Reddish Brown
    #93AA00, # Vivid Yellowish Green
    #593315, # Deep Yellowish Brown
    #F13A13, # Vivid Reddish Orange
    #232C16, # Dark Olive Green
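
A sketch of the top-n idea, assuming each formula cell already carries a normalised formula string and colours are handed out in order of frequency. Only the first seven palette entries are reproduced; colour_top_n and the dict input are invented for illustration.

```python
from collections import Counter

# First seven entries of the list above.
KELLY = ['#FFB300', '#803E75', '#FF6800', '#A6BDD7',
         '#C10020', '#CEA262', '#817066']


def colour_top_n(formula_by_cell, palette=KELLY):
    """Colour cells whose formula is among the len(palette) most frequent."""
    counts = Counter(formula_by_cell.values())
    top = counts.most_common(len(palette))
    colour_of = {nf: palette[i] for i, (nf, _) in enumerate(top)}
    return {cell: colour_of[nf]
            for cell, nf in formula_by_cell.items() if nf in colour_of}


cells = {'A1': 'x', 'A2': 'x', 'A3': 'x', 'B1': 'y', 'B2': 'y', 'C1': 'z'}
print(colour_top_n(cells, KELLY[:2]))
```

Cells whose formula falls outside the top n are left out of the result, i.e. uncoloured.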
3e9a3e51627e
explicit form match working, but shared still needs work
Henry S. Thompson <ht@markup.co.uk>
parents:
45
diff
changeset
|
228 ------------ |
58
adeb9575b273
add inverted rel pointers back to referencing from referenced
Henry S. Thompson <ht@markup.co.uk>
parents:
55
diff
changeset
|
229 @@ string identity, to say nothing of actual value, is lost -- fix? |
adeb9575b273
add inverted rel pointers back to referencing from referenced
Henry S. Thompson <ht@markup.co.uk>
parents:
55
diff
changeset
|
230 @@ row/column/both spans [what?] |
64
823ac978f4ab
scrappy first pass w/o auto features
Henry S. Thompson <ht@markup.co.uk>
parents:
58
diff
changeset
|
231 |
823ac978f4ab
scrappy first pass w/o auto features
Henry S. Thompson <ht@markup.co.uk>
parents:
58
diff
changeset
|
232 Now using up to 4 border colours to reflect incoming refs |
233 @@ sort these before clipping to 4 to reflect frequency |
234 @@ use vertical layering in the cell to get the borders |
235 more evident when a background colour is present? Already |
236 happening, just a bit hard to see, need a 1px space? |
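A minimal sketch of the "sort before clipping" idea from the note above, assuming the incoming refs have already been mapped to one colour string each. The names `border_colours` and `MAX_BORDERS` are hypothetical and do not come from the project code.

```python
from collections import Counter

MAX_BORDERS = 4  # the notes clip to at most 4 border colours


def border_colours(incoming_colours, limit=MAX_BORDERS):
    """Return up to `limit` distinct colours, most frequent first.

    Ties are broken by first appearance, so the result is deterministic.
    """
    counts = Counter(incoming_colours)
    first_seen = {}
    for index, colour in enumerate(incoming_colours):
        first_seen.setdefault(colour, index)
    ranked = sorted(counts, key=lambda c: (-counts[c], first_seen[c]))
    return ranked[:limit]
```

Sorting on the negated count and then on first appearance means the clipped list always keeps the colours with the most incoming refs, rather than whichever four happened to be seen first.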
52
9bb415e0adc9
try to fix error processin odd REUTER|IDN\!'.SPX' external ref
Henry S. Thompson <ht@markup.co.uk>
parents:
50
diff
changeset
|
237 ------ |
54
191c95187e87
working now on enron1k, uli1/sheet1
Henry S. Thompson <ht@markup.co.uk>
parents:
52
diff
changeset
|
238 enron1k/kenneth_lay__19506 contains this formula: |
52
9bb415e0adc9
try to fix error processin odd REUTER|IDN\!'.SPX' external ref
Henry S. Thompson <ht@markup.co.uk>
parents:
50
diff
changeset
|
239 |
240 <f>[1]!'.SPX'</f> |
241 |
54
191c95187e87
working now on enron1k, uli1/sheet1
Henry S. Thompson <ht@markup.co.uk>
parents:
52
diff
changeset
|
242 which crashed tokenise/rnf. FIXED: works now, and also with |
243 <f>[1]!'AES,DIVIDEND'</f> (where _are_ these coming from???) |
244 Also, with enlarged memory, now runs on uli1/sheet1 |
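A rough sketch of a pattern that would recognise the `[n]!'item'` shape of the two formulas above as a single unit. The regex and the names `EXT_REF` and `parse_external_ref` are assumptions for illustration, not the pattern actually used in parse.py.

```python
import re

# Hypothetical pattern: a bracketed external-link index, '!', then a
# single-quoted item name in which a doubled quote stands for a literal quote.
EXT_REF = re.compile(r"\[(?P<index>\d+)\]!'(?P<item>(?:[^']|'')*)'")


def parse_external_ref(formula):
    """Return (index, item) if the whole formula is an external ref, else None."""
    m = EXT_REF.match(formula)
    if m is None or m.end() != len(formula):
        return None
    return int(m.group('index')), m.group('item').replace("''", "'")
```

Matching the quoted item as one unit before any comma pattern is tried would keep `AES,DIVIDEND` from being split at the comma.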
52
9bb415e0adc9
try to fix error processin odd REUTER|IDN\!'.SPX' external ref
Henry S. Thompson <ht@markup.co.uk>
parents:
50
diff
changeset
|
245 |
246 Changes intended to fix this fixed a bug (?) which wasn't properly |
247 merging e.g. +3 -- no examples of larger numbers available to check |
54
191c95187e87
working now on enron1k, uli1/sheet1
Henry S. Thompson <ht@markup.co.uk>
parents:
52
diff
changeset
|
248 with... We are now getting e.g. <x>2.509+0.482+0.238</x> |
249 in enron1k/kenneth_lay__19506 |
52
9bb415e0adc9
try to fix error processin odd REUTER|IDN\!'.SPX' external ref
Henry S. Thompson <ht@markup.co.uk>
parents:
50
diff
changeset
|
250 |
54
191c95187e87
working now on enron1k, uli1/sheet1
Henry S. Thompson <ht@markup.co.uk>
parents:
52
diff
changeset
|
251 We could _either_ add a class of operators, or a class of numbers? |
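One possible reading of the "class of numbers" option, as a sketch only: fold a sign into the following number only when the sign is unary (first token, or directly after another operator other than a closing parenthesis). The token classes and the name `tokenise` here are hypothetical and much smaller than the real pattern set.

```python
import re

NUMBER = r'\d+(?:\.\d*)?|\.\d+'
TOKEN = re.compile(r'\s*(?:(?P<num>%s)|(?P<op>[-+*/^&=<>(),]))' % NUMBER)


def tokenise(expr):
    """Split expr into (kind, text) pairs, merging unary signs into numbers."""
    expr = expr.strip()
    tokens = []
    pos = 0
    while pos < len(expr):
        m = TOKEN.match(expr, pos)
        if m is None:
            raise ValueError('cannot tokenise %r at offset %d' % (expr, pos))
        pos = m.end()
        if m.group('num') is None:
            tokens.append(('op', m.group('op')))
            continue
        text = m.group('num')
        # The previous token is a unary sign if it is '+' or '-' and is
        # either the first token or follows an operator other than ')'.
        if (tokens and tokens[-1][0] == 'op' and tokens[-1][1] in '+-'
                and (len(tokens) == 1
                     or (tokens[-2][0] == 'op' and tokens[-2][1] != ')'))):
            text = tokens.pop()[1] + text
        tokens.append(('num', text))
    return tokens
```

Under this rule `+3` becomes one number token, while `2.509+0.482+0.238` stays as three numbers separated by two operators.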
252 |
55
05cf88c20cc5
sample I/O pair, with range annotations
Henry S. Thompson <ht@markup.co.uk>
parents:
54
diff
changeset
|
253 ============== |
254 Python : sys.version_info(major=2, minor=7, micro=13, releaselevel='final', serial=0) |
255 lxml.etree : (3, 6, 4, 0) |
256 libxml used : (2, 9, 2) |
257 libxml compiled : (2, 9, 2) |
258 libxslt used : (1, 1, 29) |
259 libxslt compiled : (1, 1, 29) |
260 |
261 testa works |
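A report in the shape shown above can be produced with the version tuples that lxml exposes on `lxml.etree`. This is a sketch, not necessarily how the reports in these notes were generated. The function name `version_report` is made up here, and the fallback for a missing lxml is an addition for convenience.

```python
import sys


def version_report():
    """Return the interpreter and lxml/libxml2/libxslt versions as text lines."""
    lines = ['%-20s: %r' % ('Python', sys.version_info)]
    try:
        from lxml import etree
    except ImportError:
        lines.append('%-20s: not installed' % 'lxml.etree')
        return lines
    for label, value in [
        ('lxml.etree', etree.LXML_VERSION),
        ('libxml used', etree.LIBXML_VERSION),
        ('libxml compiled', etree.LIBXML_COMPILED_VERSION),
        ('libxslt used', etree.LIBXSLT_VERSION),
        ('libxslt compiled', etree.LIBXSLT_COMPILED_VERSION),
    ]:
        lines.append('%-20s: %r' % (label, value))
    return lines
```

Joining the returned lines with newlines and printing them on each machine gives reports that can be compared side by side against whether testa works.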
262 ---------- |
263 Python : sys.version_info(major=2, minor=7, micro=13, releaselevel='final', serial=0) |
264 lxml.etree : (3, 7, 3, 0) |
265 libxml used : (2, 9, 2) |
266 libxml compiled : (2, 9, 2) |
267 libxslt used : (1, 1, 29) |
268 libxslt compiled : (1, 1, 29) |
269 |
270 testa works |
271 --------- |
272 Python : sys.version_info(major=2, minor=7, micro=13, releaselevel='final', serial=0) |
273 lxml.etree : (3, 7, 3, 0) |
274 libxml used : (2, 9, 4) |
275 libxml compiled : (2, 9, 4) |
276 libxslt used : (1, 1, 29) |
277 libxslt compiled : (1, 1, 29) |
278 |
279 testa fails |
280 ----------- |
281 Python : sys.version_info(major=2, minor=7, micro=13, releaselevel='final', serial=0) |
282 lxml.etree : (3, 7, 3, 0) |
283 libxml used : (2, 9, 3) |
284 libxml compiled : (2, 9, 3) |
285 libxslt used : (1, 1, 29) |
286 libxslt compiled : (1, 1, 29) |
287 |
288 testa works |