view notes.txt @ 66:d0edaceb04b6

first time to Sonra, Kostas
author Henry S. Thompson <ht@markup.co.uk>
date Mon, 12 Jun 2017 17:08:07 +0200
parents 823ac978f4ab
children
line wrap: on
line source


Tokenisation patterns, derived from parse.py, derived from 
 https://sites.google.com/site/e90e50/random-topics/tool-for-parsing-formulas-in-excel
   and
 parser_formule_with_textbox_v01_2003.xla
   linked to therein

1 ("[^"]*") q
  A text (delimited by double quotes) 
2 (\{[^}]+}) m
  A constant matrix
3 (,) c
  A list (function parameter) separator
4 ([^=\-+*/();:,.$&lt;>^!]+(?:\.[^=\-+*/();:,.$&lt;>^!]+)*\() f
  A function name followed by an opening parenthesis
5 ([)]) p
  A closing parenthesis
6 (^=|\() l
  The beginning of the formula or an opening
         parenthesis (not part of a function)
7 ((?:(?:'[^']+')|(?:\[[0-9]+\][^!]*)|(?:[a-zA-Z_][a-zA-Z0-9._]*)!)) n
  A sheet name (either delimited by single quotes, or
                bracketed number plus optional string,
                or simple name (syntax is a _guess_)) 
8 (\$?[A-Z]+\$?[0-9]+) s or r
  A cell reference
9 ([a-zA-Z_\\][a-zA-Z0-9._]*) v
  A name (boolean constant or a variable -- anything else?)
10 (.) x
  Single characters not matched by the previous patterns
----------
You can't depend on 
  <f si="..." t="shared"/>
 That is, it's _true_, but you can have a table with shared formulae
  that doesn't use it.  Compare M17:T28 (see below, uses shared) and
  C17:J28 (mostly no shared) in sample4

Looks like the result of a sweep-and-copy-{right,down} results in the
_new_ cells covered showing as 'shared':

         [ ][1][1][1][1]...
         [2][2][2][2][2]...
         [2][2][2][2][2]...
         ...

Presumably that one was right-then-down, down-then-right would give a
slightly different pattern
--------
Thinking about a pipeline...
 1) convert all variable references into (verbose!) elts:
     <!ELEMENT R EMPTY>
     <!ATTLIST R
        ac CDATA IMPLIED
        rc CDATA IMPLIED
        ar CDATA IMPLIED
        rr CDATA IMPLIED>

     where e.g. ac is 'absolute column'
    'D6' --> <R rc='D' rr='6'/>
     and
    '$E5' --> <R ac='E' rr='5'/>
No, in fact -- absolute vs. 'variable' isn't relevant for our purposes.
What we probably _do_ want is to add to every reference a _relative_
version, i.e. +/-columnDelta, +/-rowDelta
--------
Identifying dates is . . . tedious.  They will be ints or floats (?),
with s="<int>", where the int is a 0-origin index into the list of
<xf...numFmtId="<bin>".../>
children of <cellXfs> in styles.xml, and bin is a built-in date format
code, see 18.8.30 numFmt (Number Format) in ISO/IEC 29500-1:2016(E) ==
C071691e.pdf  DONE
---------
Decided to distinguish between type (num, date, str, err, ...) and
class (cur(rency), others to come?).  If non-standard code, just record
that.
  The current pipe has two main steps, followed by an optional
  prettifying step:
    format.xsl (extracts type={bool,date,num,str,err}
                         class={cur,[nothing else yet]}
                         code={raw format code if not recognised}
    rect.xsl   (fills in gaps, cuts down size, using only bdnse for
                <t>[ype] with attrs c[lass]={c,...} and [co]d[e]=...
 For now, just using first letters of type, class DONE
----------
Hmm, looking at real data (kenneth_lay__19506), I see _lots_ of cells
with (numerical) formats, but no content.  Where do I throw those
away?  Can throw away empty _rows_ in rect.xsl, but for _cells_ have
to wait for ascii.xsl or html.xsl.  But only copy type in in rect if
there was content before. DONE
-----------
Using attributes to hold space-separated lists is risky, as in
refs.xsl output, is risky!  Fixed, see below.
-----------
Not handling variables as references FIXED.  Not catching external
references to variables FIXED (as externals).  Not catching naked [n]! as external
references FIXED
 Solo local vars are recursively dereferenced
 The definition table is in workbook.xml definedNames/definedName[@name=$name]/.
  Sheet name to filename mapping for locals is in workbook.xml
    sheets/sheet[@name=$sname]/@sheetId
    These appear in definedName, single-quoted if (iff?) the sheet name has spaces
      (or other specials?)
 ??? Variables on l or r of ranges are just looked up: if they are complex
  no recursion is done: the _semantics_ of this case are not clear to
  me, need a real-life example... 
 Variables whose value is itself a range are not being handled FIXED
-----------
Switch to default namespace in order to reduce size and improve
readability, and to elements instead of attributes DONE
-----------
Should put another step after refs.xsl to compute a map from
distinct-values of all targets to all the cells which use them
DONE.
Now provides inverted rel. ref info so that static input columns can
be identified from their users.

Likewise ranges --  range endpoints in place, @@ what about the things themselves?

That really does mean we should move to elts for
each ref or range, since at this point we want to compute vector
representation as well DONE, so we can identify projections

Slightly irritating that we'll have to serialise this as XML and then
re-build it later...
-----------
 Overgenerating in kenneth_lay__19506: e.g. <e:ref c="E9" er="[1]!'.SPX' '.SPX'!"/>
      from <f>[1]!'.SPX'</f>
 Hmm.  This cell displays in Excel as REUTERS|IDN!.SPX
       The indirections work as follows:
       in workbook.xml:
        <externalReferences>
         <externalReference r:id="rId3"/>
         <externalReference r:id="rId4"/>
        </externalReferences>
       in _rels/workbook.xml.rels
        <Relationship Id="rId3" Target="externalLinks/externalLink1.xml" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/externalLink"/>
       in externalLinks/externalLink1.xml
        <ddeLink ddeService="REUTER" ddeTopic="IDN"...
         <ddeItems>
	  ...
	  <ddeItem advise="1" name=".SPX">
	  <values>
	  <value>
	  <val>1264.96</val>
	  </value>
	  </values>
	  </ddeItem>
       Whew!
FIXED
----------
http://upcommons.upc.edu/bitstream/handle/2117/100584/KDIR_2016_47_CR.pdf
[downloaded]
uses appearance a lot.  That needs to be harvested from styles.xml
The kenneth_lay enron sample has _403_ numbered formats...
----------
Tried the largest sheet from the largest .xlsx I could find:
  fuse1k/'benjamin_rogers__1002__NYISO Price Information version 2'.xlsx
   -rw-r--r-- 1 ht None  6273325 Apr  3 16:22 '../benjamin_rogers__1002__NYISO Price Information version 2.xlsx'
   -rw-r--r-- 1 ht None 23221149 Jan  1  1980  xl/worksheets/sheet3.xml

  > lxcount xl/worksheets/sheet3.xml | sort -k2nr
  *Total* 1230217
  c       596032
  v       595876
  f       19201
  row     18985
  col     106

    <dimension ref="A1:DY18985"/>

Blew java out of the water :-(
  java.lang.OutOfMemoryError: Java heap space

Need to try again with more memory, if I remember how...

The raw result is going to have 18985 x 102 == 2 million cells ==
(assuming average cell size of 30 bytes and row overhead of 20 (*
18985 (+ 20 (* 102 30))) 58,473,800 bytes, which is big but tolerable...
----------------
sample4 html reveals several problems: mistaken content based on class
                                        bug, e.g. B4 is 'a' FIXED
                                       highlighted cells are being
                                        labelled as cur, e.g. B61 in
                                        output of format.xsl FIXED
-----------
Need to rethink variable handling...
Is all we really need a normalised formula computation?:
 1) recursively replace variables;
 2) convert all simple refs to new CR string normal form: 
    crnf ::= col row
    col ::= abs | rel
    row ::= abs | rel
    abs ::= '\xAA' xs:positiveInteger
    rel ::= '\xAE' ( ( '-' xs:positiveInteger ) | xs:nonNegativeInteger )
-----------
Would <c c=COL [fi= [si=]]> be sufficient?
 fi an index into _all_ functions, si original index into explicitly
    shared functions -- note that the same fi may appear with multiple
    si, see discussion back at the top of this doc.
 Brute force for now -- rect sees shared table, computes CRNF
 Not good enough -- <f> in shared table can't be used as is, need to
  rebuild ref names relative to each new home. FIXED
-----------
Picking colours to label regions, e.g. with similar formulae:
  http://stackoverflow.com/questions/470690/how-to-automatically-generate-n-distinct-colors
Start with just top-n, limited to 22 from Kelly
    #FFB300, # Vivid Yellow
    #803E75, # Strong Purple
    #FF6800, # Vivid Orange
    #A6BDD7, # Very Light Blue
    #C10020, # Vivid Red
    #CEA262, # Grayish Yellow
    #817066, # Medium Gray
    # The following don't work well for people with defective color vision
    #007D34, # Vivid Green
    #F6768E, # Strong Purplish Pink
    #00538A, # Strong Blue
    #FF7A5C, # Strong Yellowish Pink
    #53377A, # Strong Violet
    #FF8E00, # Vivid Orange Yellow
    #B32851, # Strong Purplish Red
    #F4C800, # Vivid Greenish Yellow
    #7F180D, # Strong Reddish Brown
    #93AA00, # Vivid Yellowish Green
    #593315, # Deep Yellowish Brown
    #F13A13, # Vivid Reddish Orange
    #232C16, # Dark Olive Green
------------
@@ string identity, to say nothing of actual value, is lost -- fix?
@@ row/column/both spans [what?]

Now using up to 4 border colours to reflect incoming refs
 @@ sort these before clipping to 4 to reflect frequency
 @@ use vertical layering in the cell to get the borders
    more evident when a background colour is present?  Already
    happening, just a bit hard to see, need a 1px space?
------
enron1k/kenneth_lay__19506 contains this formula:

  <f>[1]!'.SPX'</f>

which crashes tokenise/rnf FIXED works now, and with
<f>[1]!'AES,DIVIDEND' (where _are_ these coming from???)
Als, with enlarged memory, now runs on uli1/sheet1

Changes intended to fix this fixed a bug (?) which wasn't properly
merging e.g. +3 -- no examples of larger numbers available to check
with...  We are now getting e.g. <x>2.509+0.482+0.238</x>
in enron1k/kenneth_lay__19506

We could _either_ add a class of operators, or a class of numbers?

==============
Python              : sys.version_info(major=2, minor=7, micro=13, releaselevel='final', serial=0)
lxml.etree          : (3, 6, 4, 0)
libxml used         : (2, 9, 2)
libxml compiled     : (2, 9, 2)
libxslt used        : (1, 1, 29)
libxslt compiled    : (1, 1, 29)

testa works
----------
Python              : sys.version_info(major=2, minor=7, micro=13, releaselevel='final', serial=0)
lxml.etree          : (3, 7, 3, 0)
libxml used         : (2, 9, 2)
libxml compiled     : (2, 9, 2)
libxslt used        : (1, 1, 29)
libxslt compiled    : (1, 1, 29)

testa works
---------
Python              : sys.version_info(major=2, minor=7, micro=13, releaselevel='final', serial=0)
lxml.etree          : (3, 7, 3, 0)
libxml used         : (2, 9, 4)
libxml compiled     : (2, 9, 4)
libxslt used        : (1, 1, 29)
libxslt compiled    : (1, 1, 29)

testa fails
-----------
Python              : sys.version_info(major=2, minor=7, micro=13, releaselevel='final', serial=0)
lxml.etree          : (3, 7, 3, 0)
libxml used         : (2, 9, 3)
libxml compiled     : (2, 9, 3)
libxslt used        : (1, 1, 29)
libxslt compiled    : (1, 1, 29)

testa works