Mercurial > hg > ooxml
annotate notes.txt @ 34:93fd0d532754
fix bug in refs wrt e.g. [1]!.SGX,
adapt html and ascii to new-format refs,
move a2n and n2a into separate files for re-use
author | Henry S. Thompson <ht@markup.co.uk> |
---|---|
date | Wed, 12 Apr 2017 21:35:04 +0100 |
parents | 16eff0d30d4d |
children | e500d7c18aad |
rev | line source |
---|---|
3
2c115aefde6b
beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
1 You can't depend on |
2c115aefde6b
beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
2 <f si="..." t="shared"/> |
2c115aefde6b
beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
3 That is, it's _true_, but you can have a table with shared formulae |
2c115aefde6b
beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
4 that doesn't use it. Compare M17:T28 (see below, uses shared) and |
2c115aefde6b
beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
5 C17:J28 (mostly no shared) in sample4 |
2c115aefde6b
beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
6 |
2c115aefde6b
beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
7 Looks like the result of a sweep-and-copy-{right,down} results in the |
2c115aefde6b
beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
8 _new_ cells covered showing as 'shared': |
2c115aefde6b
beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
9 |
2c115aefde6b
beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
10 [ ][1][1][1][1]... |
2c115aefde6b
beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
11 [2][2][2][2][2]... |
2c115aefde6b
beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
12 [2][2][2][2][2]... |
2c115aefde6b
beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
13 ... |
2c115aefde6b
beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
14 |
2c115aefde6b
beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
15 Presumably that one was right-then-down, down-then-right would give a |
2c115aefde6b
beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
16 slightly different pattern |
2c115aefde6b
beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
17 -------- |
2c115aefde6b
beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
18 Thinking about a pipeline... |
2c115aefde6b
beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
19 1) convert all variable references into (verbose!) elts: |
2c115aefde6b
beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
20 <!ELEMENT R EMPTY> |
2c115aefde6b
beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
21 <!ATTLIST R |
2c115aefde6b
beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
22 ac CDATA IMPLIED |
2c115aefde6b
beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
23 rc CDATA IMPLIED |
2c115aefde6b
beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
24 ar CDATA IMPLIED |
2c115aefde6b
beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
25 rr CDATA IMPLIED> |
2c115aefde6b
beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
26 |
2c115aefde6b
beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
27 where e.g. ac is 'absolute column' |
2c115aefde6b
beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
28 'D6' --> <R rc='D' rr='6'/> |
2c115aefde6b
beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
29 and |
2c115aefde6b
beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
30 '$E5' --> <R ac='E' rr='5'/> |
27
8309dcfce613
preparing for variable deref
Henry S. Thompson <ht@markup.co.uk>
parents:
24
diff
changeset
|
31 No, in fact -- absolute vs. 'variable' isn't relevant for our purposes. |
8309dcfce613
preparing for variable deref
Henry S. Thompson <ht@markup.co.uk>
parents:
24
diff
changeset
|
32 What we probably _do_ want is to add to every reference a _relative_ |
8309dcfce613
preparing for variable deref
Henry S. Thompson <ht@markup.co.uk>
parents:
24
diff
changeset
|
33 version, i.e. +/-columnDelta, +/-rowDelta |
3
2c115aefde6b
beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
34 -------- |
2c115aefde6b
beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
35 Identifying dates is . . . tedious. They will be ints or floats (?), |
2c115aefde6b
beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
36 with s="<int>", where the int is a 0-origin index into the list of |
2c115aefde6b
beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
37 <xf...numFmtId="<bin>".../> |
2c115aefde6b
beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
38 children of <cellXfs> in styles.xml, and bin is a built-in date format |
2c115aefde6b
beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
39 code, see 18.8.30 numFmt (Number Format) in ISO/IEC 29500-1:2016(E) == |
27
8309dcfce613
preparing for variable deref
Henry S. Thompson <ht@markup.co.uk>
parents:
24
diff
changeset
|
40 C071691e.pdf DONE |
10
01e80c7a9575
simple ascii type matrix output working
Henry S. Thompson <ht@markup.co.uk>
parents:
3
diff
changeset
|
41 --------- |
01e80c7a9575
simple ascii type matrix output working
Henry S. Thompson <ht@markup.co.uk>
parents:
3
diff
changeset
|
42 Decided to distinguish between type (num, date, str, err, ...) and |
27
8309dcfce613
preparing for variable deref
Henry S. Thompson <ht@markup.co.uk>
parents:
24
diff
changeset
|
43 class (cur(rency), others to come?). If non-standard code, just record |
8309dcfce613
preparing for variable deref
Henry S. Thompson <ht@markup.co.uk>
parents:
24
diff
changeset
|
44 that. |
8309dcfce613
preparing for variable deref
Henry S. Thompson <ht@markup.co.uk>
parents:
24
diff
changeset
|
45 The current pipe has two main steps, followed by an optional |
8309dcfce613
preparing for variable deref
Henry S. Thompson <ht@markup.co.uk>
parents:
24
diff
changeset
|
46 prettifying step: |
8309dcfce613
preparing for variable deref
Henry S. Thompson <ht@markup.co.uk>
parents:
24
diff
changeset
|
47 format.xsl (extracts type={bool,date,num,str,err} |
8309dcfce613
preparing for variable deref
Henry S. Thompson <ht@markup.co.uk>
parents:
24
diff
changeset
|
48 class={cur,[nothing else yet]} |
8309dcfce613
preparing for variable deref
Henry S. Thompson <ht@markup.co.uk>
parents:
24
diff
changeset
|
49 code={raw format code if not recognised} |
8309dcfce613
preparing for variable deref
Henry S. Thompson <ht@markup.co.uk>
parents:
24
diff
changeset
|
50 rect.xsl (fills in gaps, cuts down size, using only bdnse for |
8309dcfce613
preparing for variable deref
Henry S. Thompson <ht@markup.co.uk>
parents:
24
diff
changeset
|
51 <t>[ype] with attrs c[lass]={c,...} and [co]d[e]=... |
8309dcfce613
preparing for variable deref
Henry S. Thompson <ht@markup.co.uk>
parents:
24
diff
changeset
|
52 For now, just using first letters of type, class DONE |
16
2bbd067529b6
improve efficiency, detect blank rows, don't type empty cells
Henry S. Thompson <ht@markup.co.uk>
parents:
10
diff
changeset
|
53 ---------- |
2bbd067529b6
improve efficiency, detect blank rows, don't type empty cells
Henry S. Thompson <ht@markup.co.uk>
parents:
10
diff
changeset
|
54 Hmm, looking at real data (kenneth_lay__19506), I see _lots_ of cells |
2bbd067529b6
improve efficiency, detect blank rows, don't type empty cells
Henry S. Thompson <ht@markup.co.uk>
parents:
10
diff
changeset
|
55 with (numerical) formats, but no content. Where do I throw those |
2bbd067529b6
improve efficiency, detect blank rows, don't type empty cells
Henry S. Thompson <ht@markup.co.uk>
parents:
10
diff
changeset
|
56 away? Can throw away empty _rows_ in rect.xsl, but for _cells_ have |
2bbd067529b6
improve efficiency, detect blank rows, don't type empty cells
Henry S. Thompson <ht@markup.co.uk>
parents:
10
diff
changeset
|
57 to wait for ascii.xsl or html.xsl. But only copy type in in rect if |
27
8309dcfce613
preparing for variable deref
Henry S. Thompson <ht@markup.co.uk>
parents:
24
diff
changeset
|
58 there was content before. DONE |
22
ca98c74a7cb1
towards var handling, no lookup yet
Henry S. Thompson <ht@markup.co.uk>
parents:
16
diff
changeset
|
59 ----------- |
23 | 60 Using attributes to hold space-separated lists is risky, as in |
24
87e0d620deea
switch to elements from attributes and default namespace
Henry S. Thompson <ht@markup.co.uk>
parents:
23
diff
changeset
|
61 refs.xsl output, is risky! Fixed, see below. |
23 | 62 ----------- |
30
16eff0d30d4d
tidied dereferencing, added simple (no recursion) coverage for variables in ranges
Henry S. Thompson <ht@markup.co.uk>
parents:
28
diff
changeset
|
63 Not handling variables as references FIXED. Not catching external |
16eff0d30d4d
tidied dereferencing, added simple (no recursion) coverage for variables in ranges
Henry S. Thompson <ht@markup.co.uk>
parents:
28
diff
changeset
|
64 references to variables FIXED (as externals). Not catching naked [n]! as external |
16eff0d30d4d
tidied dereferencing, added simple (no recursion) coverage for variables in ranges
Henry S. Thompson <ht@markup.co.uk>
parents:
28
diff
changeset
|
65 references FIXED |
16eff0d30d4d
tidied dereferencing, added simple (no recursion) coverage for variables in ranges
Henry S. Thompson <ht@markup.co.uk>
parents:
28
diff
changeset
|
66 Solo local vars are recursively dereferenced |
22
ca98c74a7cb1
towards var handling, no lookup yet
Henry S. Thompson <ht@markup.co.uk>
parents:
16
diff
changeset
|
67 The definition table is in workbook.xml definedNames/definedName[@name=$name]/. |
ca98c74a7cb1
towards var handling, no lookup yet
Henry S. Thompson <ht@markup.co.uk>
parents:
16
diff
changeset
|
68 Sheet name to filename mapping for locals is in workbook.xml sheets/sheet[@name=$sname]/@sheetId |
30
16eff0d30d4d
tidied dereferencing, added simple (no recursion) coverage for variables in ranges
Henry S. Thompson <ht@markup.co.uk>
parents:
28
diff
changeset
|
69 Variables on l or r of ranges are just looked up: if they are complex |
16eff0d30d4d
tidied dereferencing, added simple (no recursion) coverage for variables in ranges
Henry S. Thompson <ht@markup.co.uk>
parents:
28
diff
changeset
|
70 no recursion is done: the _semantics_ of this case are not clear to |
16eff0d30d4d
tidied dereferencing, added simple (no recursion) coverage for variables in ranges
Henry S. Thompson <ht@markup.co.uk>
parents:
28
diff
changeset
|
71 me, need a real-life example... |
23 | 72 ----------- |
24
87e0d620deea
switch to elements from attributes and default namespace
Henry S. Thompson <ht@markup.co.uk>
parents:
23
diff
changeset
|
73 Switch to default namespace in order to reduce size and improve |
27
8309dcfce613
preparing for variable deref
Henry S. Thompson <ht@markup.co.uk>
parents:
24
diff
changeset
|
74 readability, and to elements instead of attributes DONE |
23 | 75 ----------- |
76 Should put another step after refs.xsl to compute a map from | |
77 distinct-values of all targets to all the cells which use them | |
27
8309dcfce613
preparing for variable deref
Henry S. Thompson <ht@markup.co.uk>
parents:
24
diff
changeset
|
78 (likewise ranges) DONE. That really does mean we should move to elts for |
23 | 79 each ref or range, since at this point we want to compute vector |
30
16eff0d30d4d
tidied dereferencing, added simple (no recursion) coverage for variables in ranges
Henry S. Thompson <ht@markup.co.uk>
parents:
28
diff
changeset
|
80 representation as well DONE, so we can identify projections |
23 | 81 |
82 Slightly irritating that we'll have to serialise this as XML and then | |
83 re-build it later... | |
84 ----------- | |
85 Overgenerating in kenneth_lay__19506: e.g. <e:ref c="E9" er="[1]!'.SPX' '.SPX'!"/> | |
86 from <f>[1]!'.SPX'</f> | |
87 Hmm. This cell displays in Excel as REUTERS|IDN!.SPX | |
88 The indirections work as follows: | |
89 in workbook.xml: | |
90 <externalReferences> | |
91 <externalReference r:id="rId3"/> | |
92 <externalReference r:id="rId4"/> | |
93 </externalReferences> | |
94 in _rels/workbook.xml.rels | |
95 <Relationship Id="rId3" Target="externalLinks/externalLink1.xml" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/externalLink"/> | |
96 in externalLinks/externalLink1.xml | |
97 <ddeLink ddeService="REUTER" ddeTopic="IDN"... | |
98 <ddeItems> | |
99 ... | |
100 <ddeItem advise="1" name=".SPX"> | |
101 <values> | |
102 <value> | |
103 <val>1264.96</val> | |
104 </value> | |
105 </values> | |
106 </ddeItem> | |
107 Whew! | |
30
16eff0d30d4d
tidied dereferencing, added simple (no recursion) coverage for variables in ranges
Henry S. Thompson <ht@markup.co.uk>
parents:
28
diff
changeset
|
108 FIXED |
23 | 109 ---------- |
28
c56a2e6990bd
convert tokenisation to a function, so can make recursive
Henry S. Thompson <ht@markup.co.uk>
parents:
27
diff
changeset
|
110 http://upcommons.upc.edu/bitstream/handle/2117/100584/KDIR_2016_47_CR.pdf |
c56a2e6990bd
convert tokenisation to a function, so can make recursive
Henry S. Thompson <ht@markup.co.uk>
parents:
27
diff
changeset
|
111 [downloaded] |
c56a2e6990bd
convert tokenisation to a function, so can make recursive
Henry S. Thompson <ht@markup.co.uk>
parents:
27
diff
changeset
|
112 uses appearance a lot. That needs to be harvested from styles.xml |
c56a2e6990bd
convert tokenisation to a function, so can make recursive
Henry S. Thompson <ht@markup.co.uk>
parents:
27
diff
changeset
|
113 The kenneth_lay enron sample has _403_ numbered formats... |
30
16eff0d30d4d
tidied dereferencing, added simple (no recursion) coverage for variables in ranges
Henry S. Thompson <ht@markup.co.uk>
parents:
28
diff
changeset
|
114 ---------- |
23 | 115 Tried the largest sheet from the largest .xlsx I could find: |
116 fuse1k/'benjamin_rogers__1002__NYISO Price Information version 2'.xlsx | |
117 -rw-r--r-- 1 ht None 6273325 Apr 3 16:22 '../benjamin_rogers__1002__NYISO Price Information version 2.xlsx' | |
118 -rw-r--r-- 1 ht None 23221149 Jan 1 1980 xl/worksheets/sheet3.xml | |
119 | |
120 > lxcount xl/worksheets/sheet3.xml | sort -k2nr | |
121 *Total* 1230217 | |
122 c 596032 | |
123 v 595876 | |
124 f 19201 | |
125 row 18985 | |
126 col 106 | |
127 | |
128 <dimension ref="A1:DY18985"/> | |
129 | |
130 Blew java out of the water :-( | |
131 java.lang.OutOfMemoryError: Java heap space | |
132 | |
133 Need to try again with more memory, if I remember how... | |
134 | |
135 The raw result is going to have 18985 x 102 == 2 million cells == | |
136 (assuming average cell size of 30 bytes and row overhead of 20 (* | |
137 18985 (+ 20 (* 102 30))) 58,473,800 bytes, which is big but tolerable... | |
138 ---------------- | |
139 Back to ranges - | |
140 |