view notes.txt @ 26:d2ca3ea1f5ae

plumb in and use inverted ref table (via key)
author Henry S. Thompson <ht@markup.co.uk>
date Fri, 07 Apr 2017 18:42:47 +0100
parents 87e0d620deea
children 8309dcfce613
line wrap: on
line source

You can't depend on 
  <f si="..." t="shared"/>
 That is, it's _true_, but you can have a table with shared formulae
  that doesn't use it.  Compare M17:T28 (see below, uses shared) and
  C17:J28 (mostly no shared) in sample4

Looks like the result of a sweep-and-copy-{right,down} results in the
_new_ cells covered showing as 'shared':

         [ ][1][1][1][1]...
         [2][2][2][2][2]...
         [2][2][2][2][2]...
         ...

Presumably that one was right-then-down, down-then-right would give a
slightly different pattern
--------
Thinking about a pipeline...
 1) convert all variable references into (verbose!) elts:
     <!ELEMENT R EMPTY>
     <!ATTLIST R
        ac CDATA IMPLIED
        rc CDATA IMPLIED
        ar CDATA IMPLIED
        rr CDATA IMPLIED>

     where e.g. ac is 'absolute column'
    'D6' --> <R rc='D' rr='6'/>
     and
    '$E5' --> <R ac='E' rr='5'/>
--------
Identifying dates is . . . tedious.  They will be ints or floats (?),
with s="<int>", where the int is a 0-origin index into the list of
<xf...numFmtId="<bin>".../>
children of <cellXfs> in styles.xml, and bin is a built-in date format
code, see 18.8.30 numFmt (Number Format) in ISO/IEC 29500-1:2016(E) ==
C071691e.pdf
---------
Decided to distinguish between type (num, date, str, err, ...) and
class (cur, others to come?).  If non-standard code, just record that.
----------
Hmm, looking at real data (kenneth_lay__19506), I see _lots_ of cells
with (numerical) formats, but no content.  Where do I throw those
away?  Can throw away empty _rows_ in rect.xsl, but for _cells_ have
to wait for ascii.xsl or html.xsl.  But only copy type in in rect if
there was content before.
-----------
Using attributes to hold space-separated lists is risky, as in
refs.xsl output, is risky!  Fixed, see below.
-----------
Not handling variables as references.  Not catching external
references to variables.  Not catching naked [n]! as external
references.
 Fixed, but not dereferenced vars
 The definition table is in workbook.xml definedNames/definedName[@name=$name]/.
  Sheet name to filename mapping for locals is in workbook.xml sheets/sheet[@name=$sname]/@sheetId
-----------
Switch to default namespace in order to reduce size and improve
readability, and to elements instead of attributes
-----------
Should put another step after refs.xsl to compute a map from
distinct-values of all targets to all the cells which use them
(likewise ranges).  That really does mean we should move to elts for
each ref or range, since at this point we want to compute vector
representation as well, so we can identify projections

Slightly irritating that we'll have to serialise this as XML and then
re-build it later...
-----------
 Overgenerating in kenneth_lay__19506: e.g. <e:ref c="E9" er="[1]!'.SPX' '.SPX'!"/>
      from <f>[1]!'.SPX'</f>
 Hmm.  This cell displays in Excel as REUTERS|IDN!.SPX
       The indirections work as follows:
       in workbook.xml:
        <externalReferences>
         <externalReference r:id="rId3"/>
         <externalReference r:id="rId4"/>
        </externalReferences>
       in _rels/workbook.xml.rels
        <Relationship Id="rId3" Target="externalLinks/externalLink1.xml" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/externalLink"/>
       in externalLinks/externalLink1.xml
        <ddeLink ddeService="REUTER" ddeTopic="IDN"...
         <ddeItems>
	  ...
	  <ddeItem advise="1" name=".SPX">
	  <values>
	  <value>
	  <val>1264.96</val>
	  </value>
	  </values>
	  </ddeItem>
       Whew!
----------
Tried the largest sheet from the largest .xlsx I could find:
  fuse1k/'benjamin_rogers__1002__NYISO Price Information version 2'.xlsx
   -rw-r--r-- 1 ht None  6273325 Apr  3 16:22 '../benjamin_rogers__1002__NYISO Price Information version 2.xlsx'
   -rw-r--r-- 1 ht None 23221149 Jan  1  1980  xl/worksheets/sheet3.xml

  > lxcount xl/worksheets/sheet3.xml | sort -k2nr
  *Total* 1230217
  c       596032
  v       595876
  f       19201
  row     18985
  col     106

    <dimension ref="A1:DY18985"/>

Blew java out of the water :-(
  java.lang.OutOfMemoryError: Java heap space

Need to try again with more memory, if I remember how...

The raw result is going to have 18985 x 102 == 2 million cells ==
(assuming average cell size of 30 bytes and row overhead of 20 (*
18985 (+ 20 (* 102 30))) 58,473,800 bytes, which is big but tolerable...
----------------
Back to ranges -