annotate notes.txt @ 28:c56a2e6990bd

convert tokenisation to a function, so can make recursive
author Henry S. Thompson <ht@markup.co.uk>
date Tue, 11 Apr 2017 12:28:44 +0100
parents 8309dcfce613
children 16eff0d30d4d
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
3
2c115aefde6b beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
1 You can't depend on
2c115aefde6b beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
2 <f si="..." t="shared"/>
2c115aefde6b beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
3 That is, it's _true_, but you can have a table with shared formulae
2c115aefde6b beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
4 that doesn't use it. Compare M17:T28 (see below, uses shared) and
2c115aefde6b beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
5 C17:J28 (mostly no shared) in sample4
2c115aefde6b beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
6
2c115aefde6b beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
7 Looks like the result of a sweep-and-copy-{right,down} results in the
2c115aefde6b beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
8 _new_ cells covered showing as 'shared':
2c115aefde6b beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
9
2c115aefde6b beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
10 [ ][1][1][1][1]...
2c115aefde6b beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
11 [2][2][2][2][2]...
2c115aefde6b beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
12 [2][2][2][2][2]...
2c115aefde6b beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
13 ...
2c115aefde6b beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
14
2c115aefde6b beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
15 Presumably that one was right-then-down, down-then-right would give a
2c115aefde6b beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
16 slightly different pattern
2c115aefde6b beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
17 --------
2c115aefde6b beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
18 Thinking about a pipeline...
2c115aefde6b beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
19 1) convert all variable references into (verbose!) elts:
2c115aefde6b beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
20 <!ELEMENT R EMPTY>
2c115aefde6b beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
21 <!ATTLIST R
2c115aefde6b beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
22 ac CDATA IMPLIED
2c115aefde6b beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
23 rc CDATA IMPLIED
2c115aefde6b beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
24 ar CDATA IMPLIED
2c115aefde6b beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
25 rr CDATA IMPLIED>
2c115aefde6b beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
26
2c115aefde6b beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
27 where e.g. ac is 'absolute column'
2c115aefde6b beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
28 'D6' --> <R rc='D' rr='6'/>
2c115aefde6b beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
29 and
2c115aefde6b beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
30 '$E5' --> <R ac='E' rr='5'/>
27
8309dcfce613 preparing for variable deref
Henry S. Thompson <ht@markup.co.uk>
parents: 24
diff changeset
31 No, in fact -- absolute vs. 'variable' isn't relevant for our purposes.
8309dcfce613 preparing for variable deref
Henry S. Thompson <ht@markup.co.uk>
parents: 24
diff changeset
32 What we probably _do_ want is to add to every reference a _relative_
8309dcfce613 preparing for variable deref
Henry S. Thompson <ht@markup.co.uk>
parents: 24
diff changeset
33 version, i.e. +/-columnDelta, +/-rowDelta
3
2c115aefde6b beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
34 --------
2c115aefde6b beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
35 Identifying dates is . . . tedious. They will be ints or floats (?),
2c115aefde6b beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
36 with s="<int>", where the int is a 0-origin index into the list of
2c115aefde6b beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
37 <xf...numFmtId="<bin>".../>
2c115aefde6b beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
38 children of <cellXfs> in styles.xml, and bin is a built-in date format
2c115aefde6b beginning work on elaboration of worksheets
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
39 code, see 18.8.30 numFmt (Number Format) in ISO/IEC 29500-1:2016(E) ==
27
8309dcfce613 preparing for variable deref
Henry S. Thompson <ht@markup.co.uk>
parents: 24
diff changeset
40 C071691e.pdf DONE
10
01e80c7a9575 simple ascii type matrix output working
Henry S. Thompson <ht@markup.co.uk>
parents: 3
diff changeset
41 ---------
01e80c7a9575 simple ascii type matrix output working
Henry S. Thompson <ht@markup.co.uk>
parents: 3
diff changeset
42 Decided to distinguish between type (num, date, str, err, ...) and
27
8309dcfce613 preparing for variable deref
Henry S. Thompson <ht@markup.co.uk>
parents: 24
diff changeset
43 class (cur(rency), others to come?). If non-standard code, just record
8309dcfce613 preparing for variable deref
Henry S. Thompson <ht@markup.co.uk>
parents: 24
diff changeset
44 that.
8309dcfce613 preparing for variable deref
Henry S. Thompson <ht@markup.co.uk>
parents: 24
diff changeset
45 The current pipe has two main steps, followed by an optional
8309dcfce613 preparing for variable deref
Henry S. Thompson <ht@markup.co.uk>
parents: 24
diff changeset
46 prettifying step:
8309dcfce613 preparing for variable deref
Henry S. Thompson <ht@markup.co.uk>
parents: 24
diff changeset
47 format.xsl (extracts type={bool,date,num,str,err}
8309dcfce613 preparing for variable deref
Henry S. Thompson <ht@markup.co.uk>
parents: 24
diff changeset
48 class={cur,[nothing else yet]}
8309dcfce613 preparing for variable deref
Henry S. Thompson <ht@markup.co.uk>
parents: 24
diff changeset
49 code={raw format code if not recognised}
8309dcfce613 preparing for variable deref
Henry S. Thompson <ht@markup.co.uk>
parents: 24
diff changeset
50 rect.xsl (fills in gaps, cuts down size, using only bdnse for
8309dcfce613 preparing for variable deref
Henry S. Thompson <ht@markup.co.uk>
parents: 24
diff changeset
51 <t>[ype] with attrs c[lass]={c,...} and [co]d[e]=...
8309dcfce613 preparing for variable deref
Henry S. Thompson <ht@markup.co.uk>
parents: 24
diff changeset
52 For now, just using first letters of type, class DONE
16
2bbd067529b6 improve efficiency, detect blank rows, don't type empty cells
Henry S. Thompson <ht@markup.co.uk>
parents: 10
diff changeset
53 ----------
2bbd067529b6 improve efficiency, detect blank rows, don't type empty cells
Henry S. Thompson <ht@markup.co.uk>
parents: 10
diff changeset
54 Hmm, looking at real data (kenneth_lay__19506), I see _lots_ of cells
2bbd067529b6 improve efficiency, detect blank rows, don't type empty cells
Henry S. Thompson <ht@markup.co.uk>
parents: 10
diff changeset
55 with (numerical) formats, but no content. Where do I throw those
2bbd067529b6 improve efficiency, detect blank rows, don't type empty cells
Henry S. Thompson <ht@markup.co.uk>
parents: 10
diff changeset
56 away? Can throw away empty _rows_ in rect.xsl, but for _cells_ have
2bbd067529b6 improve efficiency, detect blank rows, don't type empty cells
Henry S. Thompson <ht@markup.co.uk>
parents: 10
diff changeset
57 to wait for ascii.xsl or html.xsl. But only copy type in in rect if
27
8309dcfce613 preparing for variable deref
Henry S. Thompson <ht@markup.co.uk>
parents: 24
diff changeset
58 there was content before. DONE
22
ca98c74a7cb1 towards var handling, no lookup yet
Henry S. Thompson <ht@markup.co.uk>
parents: 16
diff changeset
59 -----------
23
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
60 Using attributes to hold space-separated lists is risky, as in
24
87e0d620deea switch to elements from attributes and default namespace
Henry S. Thompson <ht@markup.co.uk>
parents: 23
diff changeset
61 refs.xsl output, is risky! Fixed, see below.
23
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
62 -----------
22
ca98c74a7cb1 towards var handling, no lookup yet
Henry S. Thompson <ht@markup.co.uk>
parents: 16
diff changeset
63 Not handling variables as references. Not catching external
ca98c74a7cb1 towards var handling, no lookup yet
Henry S. Thompson <ht@markup.co.uk>
parents: 16
diff changeset
64 references to variables. Not catching naked [n]! as external
ca98c74a7cb1 towards var handling, no lookup yet
Henry S. Thompson <ht@markup.co.uk>
parents: 16
diff changeset
65 references.
ca98c74a7cb1 towards var handling, no lookup yet
Henry S. Thompson <ht@markup.co.uk>
parents: 16
diff changeset
66 Fixed, but not dereferenced vars
ca98c74a7cb1 towards var handling, no lookup yet
Henry S. Thompson <ht@markup.co.uk>
parents: 16
diff changeset
67 The definition table is in workbook.xml definedNames/definedName[@name=$name]/.
ca98c74a7cb1 towards var handling, no lookup yet
Henry S. Thompson <ht@markup.co.uk>
parents: 16
diff changeset
68 Sheet name to filename mapping for locals is in workbook.xml sheets/sheet[@name=$sname]/@sheetId
23
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
69 -----------
24
87e0d620deea switch to elements from attributes and default namespace
Henry S. Thompson <ht@markup.co.uk>
parents: 23
diff changeset
70 Switch to default namespace in order to reduce size and improve
27
8309dcfce613 preparing for variable deref
Henry S. Thompson <ht@markup.co.uk>
parents: 24
diff changeset
71 readability, and to elements instead of attributes DONE
23
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
72 -----------
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
73 Should put another step after refs.xsl to compute a map from
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
74 distinct-values of all targets to all the cells which use them
27
8309dcfce613 preparing for variable deref
Henry S. Thompson <ht@markup.co.uk>
parents: 24
diff changeset
75 (likewise ranges) DONE. That really does mean we should move to elts for
23
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
76 each ref or range, since at this point we want to compute vector
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
77 representation as well, so we can identify projections
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
78
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
79 Slightly irritating that we'll have to serialise this as XML and then
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
80 re-build it later...
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
81 -----------
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
82 Overgenerating in kenneth_lay__19506: e.g. <e:ref c="E9" er="[1]!'.SPX' '.SPX'!"/>
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
83 from <f>[1]!'.SPX'</f>
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
84 Hmm. This cell displays in Excel as REUTERS|IDN!.SPX
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
85 The indirections work as follows:
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
86 in workbook.xml:
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
87 <externalReferences>
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
88 <externalReference r:id="rId3"/>
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
89 <externalReference r:id="rId4"/>
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
90 </externalReferences>
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
91 in _rels/workbook.xml.rels
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
92 <Relationship Id="rId3" Target="externalLinks/externalLink1.xml" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/externalLink"/>
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
93 in externalLinks/externalLink1.xml
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
94 <ddeLink ddeService="REUTER" ddeTopic="IDN"...
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
95 <ddeItems>
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
96 ...
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
97 <ddeItem advise="1" name=".SPX">
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
98 <values>
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
99 <value>
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
100 <val>1264.96</val>
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
101 </value>
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
102 </values>
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
103 </ddeItem>
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
104 Whew!
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
105 ----------
28
c56a2e6990bd convert tokenisation to a function, so can make recursive
Henry S. Thompson <ht@markup.co.uk>
parents: 27
diff changeset
106 http://upcommons.upc.edu/bitstream/handle/2117/100584/KDIR_2016_47_CR.pdf
c56a2e6990bd convert tokenisation to a function, so can make recursive
Henry S. Thompson <ht@markup.co.uk>
parents: 27
diff changeset
107 [downloaded]
c56a2e6990bd convert tokenisation to a function, so can make recursive
Henry S. Thompson <ht@markup.co.uk>
parents: 27
diff changeset
108 uses appearance a lot. That needs to be harvested from styles.xml
c56a2e6990bd convert tokenisation to a function, so can make recursive
Henry S. Thompson <ht@markup.co.uk>
parents: 27
diff changeset
109 The kenneth_lay enron sample has _403_ numbered formats...
c56a2e6990bd convert tokenisation to a function, so can make recursive
Henry S. Thompson <ht@markup.co.uk>
parents: 27
diff changeset
110
23
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
111 Tried the largest sheet from the largest .xlsx I could find:
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
112 fuse1k/'benjamin_rogers__1002__NYISO Price Information version 2'.xlsx
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
113 -rw-r--r-- 1 ht None 6273325 Apr 3 16:22 '../benjamin_rogers__1002__NYISO Price Information version 2.xlsx'
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
114 -rw-r--r-- 1 ht None 23221149 Jan 1 1980 xl/worksheets/sheet3.xml
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
115
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
116 > lxcount xl/worksheets/sheet3.xml | sort -k2nr
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
117 *Total* 1230217
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
118 c 596032
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
119 v 595876
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
120 f 19201
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
121 row 18985
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
122 col 106
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
123
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
124 <dimension ref="A1:DY18985"/>
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
125
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
126 Blew java out of the water :-(
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
127 java.lang.OutOfMemoryError: Java heap space
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
128
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
129 Need to try again with more memory, if I remember how...
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
130
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
131 The raw result is going to have 18985 x 102 == 2 million cells ==
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
132 (assuming average cell size of 30 bytes and row overhead of 20 (*
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
133 18985 (+ 20 (* 102 30))) 58,473,800 bytes, which is big but tolerable...
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
134 ----------------
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
135 Back to ranges -
bfa38afaea63 change to default ns
Henry S. Thompson <ht@markup.co.uk>
parents: 22
diff changeset
136