annotate notes.txt @ 28:c56a2e6990bd
convert tokenisation to a function, so can make recursive
author | Henry S. Thompson <ht@markup.co.uk> |
---|---|
date | Tue, 11 Apr 2017 12:28:44 +0100 |
parents | 8309dcfce613 |
children | 16eff0d30d4d |
rev | line source |
---|---|
3
2c115aefde6b
beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
1 You can't depend on |
2 <f si="..." t="shared"/> |
3 That is, it's _true_, but you can have a table with shared formulae |
4 that doesn't use it. Compare M17:T28 (see below, uses shared) and |
5 C17:J28 (mostly no shared) in sample4 |
6 |
7 Looks like a sweep-and-copy-{right,down} results in the |
8 _new_ cells covered showing as 'shared': |
9 |
10 [ ][1][1][1][1]... |
11 [2][2][2][2][2]... |
12 [2][2][2][2][2]... |
13 ... |
14 |
15 Presumably that one was right-then-down, down-then-right would give a |
16 slightly different pattern |
17 -------- |
18 Thinking about a pipeline... |
19 1) convert all variable references into (verbose!) elts: |
20 <!ELEMENT R EMPTY> |
21 <!ATTLIST R |
22 ac CDATA #IMPLIED |
23 rc CDATA #IMPLIED |
24 ar CDATA #IMPLIED |
25 rr CDATA #IMPLIED> |
26 |
27 where e.g. ac is 'absolute column' |
28 'D6' --> <R rc='D' rr='6'/> |
29 and |
30 '$E5' --> <R ac='E' rr='5'/> |
27
8309dcfce613
preparing for variable deref
Henry S. Thompson <ht@markup.co.uk>
parents:
24
diff
changeset
|
31 No, in fact -- absolute vs. 'variable' isn't relevant for our purposes. |
32 What we probably _do_ want is to add to every reference a _relative_ |
33 version, i.e. +/-columnDelta, +/-rowDelta |
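A minimal sketch of what the _relative_ version could look like, in Python rather than in the XSLT pipeline itself. The function names `col_to_index` and `relative_ref` are made up for illustration, and only simple single-cell references (no sheet prefix, no ranges) are assumed:

```python
import re

# Matches a single-cell reference such as D6, $E5 or $E$5 (assumed form).
_REF = re.compile(r"^\$?([A-Z]+)\$?([0-9]+)$")

def col_to_index(letters):
    """Convert column letters to a 1-based index: A -> 1, Z -> 26, AA -> 27."""
    n = 0
    for ch in letters:
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n

def relative_ref(ref, home):
    """Return (columnDelta, rowDelta) from the cell `home` to the cell `ref`.

    The '$' markers are ignored, since absolute vs. 'variable' notation
    is not relevant here.
    """
    rc, rr = _REF.match(ref).groups()
    hc, hr = _REF.match(home).groups()
    return (col_to_index(rc) - col_to_index(hc), int(rr) - int(hr))
```

For example, a formula in E8 that mentions D6 would get the delta pair (-1, -2).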
3 | 34 --------
35 Identifying dates is . . . tedious. They will be ints or floats (?), |
36 with s="<int>", where the int is a 0-origin index into the list of |
37 <xf...numFmtId="<bin>".../> |
38 children of <cellXfs> in styles.xml, and bin is a built-in date format |
39 code, see 18.8.30 numFmt (Number Format) in ISO/IEC 29500-1:2016(E) == |
27 | 40 C071691e.pdf DONE
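A rough sketch of the lookup described above, in Python rather than XSLT. It assumes that the built-in date and time format ids are 14-22 and 45-47 (my reading of the table in 18.8.30), and it deliberately ignores custom formats defined under <numFmts>. The name `date_style_indexes` is hypothetical:

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}

# Assumption: built-in ids 14-22 and 45-47 are the date/time formats.
BUILTIN_DATE_IDS = set(range(14, 23)) | {45, 46, 47}

def date_style_indexes(styles_xml):
    """Return the 0-origin <cellXfs> indexes whose numFmtId is a built-in date code.

    A cell with s="<int>" is then a date candidate if <int> is in this set.
    """
    root = ET.fromstring(styles_xml)
    xfs = root.findall("m:cellXfs/m:xf", NS)
    return {i for i, xf in enumerate(xfs)
            if int(xf.get("numFmtId", "0")) in BUILTIN_DATE_IDS}
```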
10
01e80c7a9575
simple ascii type matrix output working
Henry S. Thompson <ht@markup.co.uk>
parents:
3
diff
changeset
|
41 --------- |
42 Decided to distinguish between type (num, date, str, err, ...) and |
27 | 43 class (cur(rency), others to come?). If non-standard code, just record
44 that. |
45 The current pipe has two main steps, followed by an optional |
46 prettifying step: |
47 format.xsl (extracts type={bool,date,num,str,err} |
48 class={cur,[nothing else yet]} |
49 code={raw format code if not recognised} |
50 rect.xsl (fills in gaps, cuts down size, using only bdnse for |
51 <t>[ype] with attrs c[lass]={c,...} and [co]d[e]=... |
52 For now, just using first letters of type, class DONE |
16
2bbd067529b6
improve efficiency, detect blank rows, don't type empty cells
Henry S. Thompson <ht@markup.co.uk>
parents:
10
diff
changeset
|
53 ---------- |
54 Hmm, looking at real data (kenneth_lay__19506), I see _lots_ of cells |
55 with (numerical) formats, but no content. Where do I throw those |
56 away? Can throw away empty _rows_ in rect.xsl, but for _cells_ have |
57 to wait for ascii.xsl or html.xsl. But only copy type in rect if |
27 | 58 there was content before. DONE
22
ca98c74a7cb1
towards var handling, no lookup yet
Henry S. Thompson <ht@markup.co.uk>
parents:
16
diff
changeset
|
59 ----------- |
23 | 60 Using attributes to hold space-separated lists is risky, as in |
24
87e0d620deea
switch to elements from attributes and default namespace
Henry S. Thompson <ht@markup.co.uk>
parents:
23
diff
changeset
|
61 refs.xsl output! Fixed, see below. |
23 | 62 ----------- |
22 | 63 Not handling variables as references. Not catching external
64 references to variables. Not catching naked [n]! as external |
65 references. |
66 Fixed, but not dereferenced vars |
67 The definition table is in workbook.xml definedNames/definedName[@name=$name]/. |
68 Sheet name to filename mapping for locals is in workbook.xml sheets/sheet[@name=$sname]/@sheetId |
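A small sketch of the definition-table lookup, in Python rather than XSLT, assuming workbook.xml is available as a string. The name `defined_name` is made up for illustration, and sheet-scoped names (those carrying a localSheetId attribute) are not treated specially:

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}

def defined_name(workbook_xml, name):
    """Return the text of definedNames/definedName[@name=name], or None."""
    wb = ET.fromstring(workbook_xml)
    for dn in wb.iterfind("m:definedNames/m:definedName", NS):
        if dn.get("name") == name:
            return dn.text
    return None
```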
23 | 69 ----------- |
24 | 70 Switch to default namespace in order to reduce size and improve
27 | 71 readability, and to elements instead of attributes DONE
23 | 72 ----------- |
73 Should put another step after refs.xsl to compute a map from | |
74 distinct-values of all targets to all the cells which use them | |
27 | 75 (likewise ranges) DONE. That really does mean we should move to elts for
23 | 76 each ref or range, since at this point we want to compute vector |
77 representation as well, so we can identify projections | |
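The map from targets to the cells which use them is just an inverted index. A minimal Python sketch, assuming the refs.xsl output has already been flattened into (cell, target) pairs; that input shape and the name `targets_to_users` are assumptions, not what the pipeline actually produces:

```python
from collections import defaultdict

def targets_to_users(refs):
    """Invert (cell, target) pairs into {target: sorted list of cells using it}."""
    users = defaultdict(set)
    for cell, target in refs:
        users[target].add(cell)
    return {target: sorted(cells) for target, cells in users.items()}
```

Duplicated pairs collapse, so each target maps to the distinct cells that mention it.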
78 | |
79 Slightly irritating that we'll have to serialise this as XML and then | |
80 re-build it later... | |
81 ----------- | |
82 Overgenerating in kenneth_lay__19506: e.g. <e:ref c="E9" er="[1]!'.SPX' '.SPX'!"/> | |
83 from <f>[1]!'.SPX'</f> | |
84 Hmm. This cell displays in Excel as REUTERS|IDN!.SPX | |
85 The indirections work as follows: | |
86 in workbook.xml: | |
87 <externalReferences> | |
88 <externalReference r:id="rId3"/> | |
89 <externalReference r:id="rId4"/> | |
90 </externalReferences> | |
91 in _rels/workbook.xml.rels | |
92 <Relationship Id="rId3" Target="externalLinks/externalLink1.xml" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/externalLink"/> | |
93 in externalLinks/externalLink1.xml | |
94 <ddeLink ddeService="REUTER" ddeTopic="IDN"... | |
95 <ddeItems> | |
96 ... | |
97 <ddeItem advise="1" name=".SPX"> | |
98 <values> | |
99 <value> | |
100 <val>1264.96</val> | |
101 </value> | |
102 </values> | |
103 </ddeItem> | |
104 Whew! | |
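The chain of indirections above can be sketched in Python as two small steps. This assumes that the [n] in a formula is a 1-based index into the <externalReference> children in document order, and that the three XML parts are already available as strings; the function names are hypothetical:

```python
import xml.etree.ElementTree as ET

MAIN = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
REL = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG = "http://schemas.openxmlformats.org/package/2006/relationships"

def external_link_target(workbook_xml, rels_xml, n):
    """Follow [n] -> externalReference/@r:id -> Relationship/@Target.

    Assumption: n is 1-based over <externalReference> in document order.
    Returns None if the relationship id is not found in the .rels part.
    """
    wb = ET.fromstring(workbook_xml)
    ext = wb.findall(f"{{{MAIN}}}externalReferences/{{{MAIN}}}externalReference")
    rid = ext[n - 1].get(f"{{{REL}}}id")
    for rel in ET.fromstring(rels_xml).iter(f"{{{PKG}}}Relationship"):
        if rel.get("Id") == rid:
            return rel.get("Target")
    return None

def dde_label(link_xml):
    """Return 'service|topic' from the <ddeLink> in an externalLink part."""
    dde = ET.fromstring(link_xml).find(f"{{{MAIN}}}ddeLink")
    return f'{dde.get("ddeService")}|{dde.get("ddeTopic")}'
```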
105 ---------- | |
28
c56a2e6990bd
convert tokenisation to a function, so can make recursive
Henry S. Thompson <ht@markup.co.uk>
parents:
27
diff
changeset
|
106 http://upcommons.upc.edu/bitstream/handle/2117/100584/KDIR_2016_47_CR.pdf |
107 [downloaded] |
108 uses appearance a lot. That needs to be harvested from styles.xml |
109 The kenneth_lay enron sample has _403_ numbered formats... |
110 |
23 | 111 Tried the largest sheet from the largest .xlsx I could find: |
112 fuse1k/'benjamin_rogers__1002__NYISO Price Information version 2'.xlsx | |
113 -rw-r--r-- 1 ht None 6273325 Apr 3 16:22 '../benjamin_rogers__1002__NYISO Price Information version 2.xlsx' | |
114 -rw-r--r-- 1 ht None 23221149 Jan 1 1980 xl/worksheets/sheet3.xml | |
115 | |
116 > lxcount xl/worksheets/sheet3.xml | sort -k2nr | |
117 *Total* 1230217 | |
118 c 596032 | |
119 v 595876 | |
120 f 19201 | |
121 row 18985 | |
122 col 106 | |
123 | |
124 <dimension ref="A1:DY18985"/> | |
125 | |
126 Blew java out of the water :-( | |
127 java.lang.OutOfMemoryError: Java heap space | |
128 | |
129 Need to try again with more memory, if I remember how... | |
130 | |
131 The raw result is going to have 18985 x 102 == about 2 million cells. | |
132 Assuming an average cell size of 30 bytes and a row overhead of 20, that is | |
133 (* 18985 (+ 20 (* 102 30))) == 58,473,800 bytes, which is big but tolerable... | |
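The same back-of-envelope estimate as a Python function, so other sheet sizes can be plugged in. The 30-byte cell and 20-byte row overhead are the guesses made above, not measurements, and `estimated_bytes` is a name invented here:

```python
def estimated_bytes(rows, cols, cell_bytes=30, row_overhead=20):
    """Estimate output size: each row costs its overhead plus cols * cell_bytes."""
    return rows * (row_overhead + cols * cell_bytes)

print(estimated_bytes(18985, 102))  # 58473800
```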
134 ---------------- | |
135 Back to ranges - | |
136 |