Mercurial > hg > ooxml
view notes.txt @ 44:5b2333e59d91
updated html, ascii output to new raw doctype
author | Henry S. Thompson <ht@markup.co.uk> |
---|---|
date | Sun, 07 May 2017 11:56:02 +0100 |
parents | 4c6a341e75da |
children | 6ed900e8cc61 |
line wrap: on
line source
Tokenisation patterns, derived from parse.py, derived from https://sites.google.com/site/e90e50/random-topics/tool-for-parsing-formulas-in-excel and parser_formule_with_textbox_v01_2003.xla linked to therein 1 ("[^"]*") q A text (delimited by double quotes) 2 (\{[^}]+}) m A constant matrix 3 (,) c A list (function parameter) separator 4 ([^=\-+*/();:,.$<>^!]+(?:\.[^=\-+*/();:,.$<>^!]+)*\() f A function name followed by an opening parenthesis 5 ([)]) p A closing parenthesis 6 (^=|\() l The beginning of the formula or an opening parenthesis (not part of a function) 7 ((?:(?:'[^']+')|(?:\[[0-9]+\][^!]*)|(?:[a-zA-Z_][a-zA-Z0-9._]*)!)) n A sheet name (either delimited by single quotes, or bracketed number plus optional string, or simple name (syntax is a _guess_)) 8 (\$?[A-Z]+\$?[0-9]+) s or r A cell reference 9 ([a-zA-Z_\\][a-zA-Z0-9._]*) v A name (boolean constant or a variable -- anything else?) 10 (.) x Single characters not matched by the previous patterns ---------- You can't depend on <f si="..." t="shared"/> That is, it's _true_, but you can have a table with shared formulae that doesn't use it. Compare M17:T28 (see below, uses shared) and C17:J28 (mostly no shared) in sample4 Looks like the result of a sweep-and-copy-{right,down} results in the _new_ cells covered showing as 'shared': [ ][1][1][1][1]... [2][2][2][2][2]... [2][2][2][2][2]... ... Presumably that one was right-then-down, down-then-right would give a slightly different pattern -------- Thinking about a pipeline... 1) convert all variable references into (verbose!) elts: <!ELEMENT R EMPTY> <!ATTLIST R ac CDATA IMPLIED rc CDATA IMPLIED ar CDATA IMPLIED rr CDATA IMPLIED> where e.g. ac is 'absolute column' 'D6' --> <R rc='D' rr='6'/> and '$E5' --> <R ac='E' rr='5'/> No, in fact -- absolute vs. 'variable' isn't relevant for our purposes. What we probably _do_ want is to add to every reference a _relative_ version, i.e. +/-columnDelta, +/-rowDelta -------- Identifying dates is . . . tedious. They will be ints or floats (?), with s="<int>", where the int is a 0-origin index into the list of <xf...numFmtId="<bin>".../> children of <cellXfs> in styles.xml, and bin is a built-in date format code, see 18.8.30 numFmt (Number Format) in ISO/IEC 29500-1:2016(E) == C071691e.pdf DONE --------- Decided to distinguish between type (num, date, str, err, ...) and class (cur(rency), others to come?). If non-standard code, just record that. The current pipe has two main steps, followed by an optional prettifying step: format.xsl (extracts type={bool,date,num,str,err} class={cur,[nothing else yet]} code={raw format code if not recognised} rect.xsl (fills in gaps, cuts down size, using only bdnse for <t>[ype] with attrs c[lass]={c,...} and [co]d[e]=... For now, just using first letters of type, class DONE ---------- Hmm, looking at real data (kenneth_lay__19506), I see _lots_ of cells with (numerical) formats, but no content. Where do I throw those away? Can throw away empty _rows_ in rect.xsl, but for _cells_ have to wait for ascii.xsl or html.xsl. But only copy type in in rect if there was content before. DONE ----------- Using attributes to hold space-separated lists is risky, as in refs.xsl output, is risky! Fixed, see below. ----------- Not handling variables as references FIXED. Not catching external references to variables FIXED (as externals). Not catching naked [n]! as external references FIXED Solo local vars are recursively dereferenced The definition table is in workbook.xml definedNames/definedName[@name=$name]/. Sheet name to filename mapping for locals is in workbook.xml sheets/sheet[@name=$sname]/@sheetId These appear in definedName, single-quoted if (iff?) the sheet name has spaces (or other specials?) ??? Variables on l or r of ranges are just looked up: if they are complex no recursion is done: the _semantics_ of this case are not clear to me, need a real-life example... @@ Variables whose value is itself a range are not being handled ----------- Switch to default namespace in order to reduce size and improve readability, and to elements instead of attributes DONE ----------- Should put another step after refs.xsl to compute a map from distinct-values of all targets to all the cells which use them DONE. Likewise ranges @@ That really does mean we should move to elts for each ref or range, since at this point we want to compute vector representation as well DONE, so we can identify projections Slightly irritating that we'll have to serialise this as XML and then re-build it later... ----------- Overgenerating in kenneth_lay__19506: e.g. <e:ref c="E9" er="[1]!'.SPX' '.SPX'!"/> from <f>[1]!'.SPX'</f> Hmm. This cell displays in Excel as REUTERS|IDN!.SPX The indirections work as follows: in workbook.xml: <externalReferences> <externalReference r:id="rId3"/> <externalReference r:id="rId4"/> </externalReferences> in _rels/workbook.xml.rels <Relationship Id="rId3" Target="externalLinks/externalLink1.xml" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/externalLink"/> in externalLinks/externalLink1.xml <ddeLink ddeService="REUTER" ddeTopic="IDN"... <ddeItems> ... <ddeItem advise="1" name=".SPX"> <values> <value> <val>1264.96</val> </value> </values> </ddeItem> Whew! FIXED ---------- http://upcommons.upc.edu/bitstream/handle/2117/100584/KDIR_2016_47_CR.pdf [downloaded] uses appearance a lot. That needs to be harvested from styles.xml The kenneth_lay enron sample has _403_ numbered formats... ---------- Tried the largest sheet from the largest .xlsx I could find: fuse1k/'benjamin_rogers__1002__NYISO Price Information version 2'.xlsx -rw-r--r-- 1 ht None 6273325 Apr 3 16:22 '../benjamin_rogers__1002__NYISO Price Information version 2.xlsx' -rw-r--r-- 1 ht None 23221149 Jan 1 1980 xl/worksheets/sheet3.xml > lxcount xl/worksheets/sheet3.xml | sort -k2nr *Total* 1230217 c 596032 v 595876 f 19201 row 18985 col 106 <dimension ref="A1:DY18985"/> Blew java out of the water :-( java.lang.OutOfMemoryError: Java heap space Need to try again with more memory, if I remember how... The raw result is going to have 18985 x 102 == 2 million cells == (assuming average cell size of 30 bytes and row overhead of 20 (* 18985 (+ 20 (* 102 30))) 58,473,800 bytes, which is big but tolerable... ---------------- sample4 html reveals several problems: mistaken content based on class bug, e.g. B4 is 'a' FIXED highlighted cells are being labelled as cur, e.g. B61 in output of format.xsl FIXED ----------- Need to rethink variable handling... Is all we really need a normalised formula computation: 1) recursively replace variables; 2) convert all simple refs to new CR string normal form: crnf ::= col row col ::= abs | rel row ::= abs | rel abs ::= '\xAA' xs:positiveInteger rel ::= '\xAE' ( ( '-' xs:positiveInteger ) | xs:nonNegativeInteger )