annotate master/src/wecu/sac_schemes.py @ 67:13182e98a1ab
use sorted insertion into tuple list for properties
author | Henry S. Thompson <ht@markup.co.uk> |
---|---|
date | Thu, 04 Jun 2020 17:58:10 +0000 |
parents | b04870ab3035 |
children |
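changesets referenced in the annotation (all by Henry S. Thompson <ht@markup.co.uk>):
rev | changeset | parent | summary |
---|---|---|---|
61 | cfaf5223b071 | | trying to get my own mapper working |
62 | 892e1c0240e1 | 61 | added more robust (I hope) error handling, |
63 | d46c8b12fc04 | 62 | support multiple approaches to key combination, use local files to collect results |
66 | b04870ab3035 | 63 | don't over-count duplicate URIs in multiple properties, produce composite keys instead |
67 | 13182e98a1ab | 66 | use sorted insertion into tuple list for properties |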
#!/usr/bin/python3
'''Assumes export PYTHONIOENCODING=utf-8 has been done if necessary

Usage: uz ...wat.gz | sac_schemes.py [-d] [altStorageScheme]

where altStorageScheme if present selects an alternative approach to storing triple counts:
  [absent]: three nested dictionaries
  1: one dictionary indexed by 4-tuple
  2: one dictionary indexed by ".".join(keys)'''

import sys, json, regex

print(sys.argv,file=sys.stderr)

if len(sys.argv)>1 and sys.argv[1]=='-d':
    sys.argv.pop(1)
    dictRes=True
else:
    dictRes=False

META_PATH=['Envelope', 'Payload-Metadata', 'HTTP-Response-Metadata']

PATHS={'hdr':['Headers'],
       'head':['HTML-Metadata','Head'],
       'body':['HTML-Metadata','Links']}

SCHEME=regex.compile('(<?[a-zA-Z][a-zA-Z0-9+.-]*):')
URN=regex.compile('(<?urn:[a-z][a-z0-9+.-]*):',regex.I)

EMPTY=''

D={}

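The SCHEME and URN patterns are applied in the same order by each of the pp functions: a value counts only if SCHEME matches at its start, and a URN match, when there is one, replaces the plain scheme match. A minimal sketch of that order follows. It assumes the standard-library re module in place of the regex package (the two patterns use no regex-only syntax), and scheme_of is a hypothetical helper name that does not appear in the script.

```python
import re

# Same pattern text as SCHEME and URN in the script, compiled with the
# standard-library re module instead of regex (assumption for portability).
SCHEME = re.compile('(<?[a-zA-Z][a-zA-Z0-9+.-]*):')
URN = re.compile('(<?urn:[a-z][a-z0-9+.-]*):', re.I)

def scheme_of(value):
    """Hypothetical helper: return the scheme-like prefix of value, or None."""
    m = SCHEME.match(value)
    if m is None:
        return None
    # A URN match is more specific, so it overrides the plain scheme match.
    n = URN.match(value)
    if n is not None:
        m = n
    return m.group(1)

if __name__ == '__main__':
    for sample in ('http://example.org/', 'urn:isbn:0451450523',
                   '<mailto:someone@example.org>', 'text/html; charset=UTF-8'):
        print(repr(sample), '->', scheme_of(sample))
```

With these patterns an optional leading '<' is kept as part of the captured prefix, and a value such as a MIME type, which has no colon directly after its leading name, is not counted at all.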
def insert(e,tbt):
    '''insert something into a trivial pair-impl of a list,
    not balanced!'''
    if isinstance(tbt,tuple):
        assert not isinstance(tbt[0],tuple)
        if e<=tbt[0]:
            return (e,tbt)
        else:
            return (tbt[0],insert(e,tbt[1]))
    elif e<=tbt:
        return (e,tbt)
    else:
        return (tbt,e)

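To make the pair representation concrete, here is a small sketch of how repeated calls to insert build a sorted, right-nested chain, and how such a chain can be unwound the way dump_tuple does when it prints keys joined by '&'. The insert function restates the logic of the one in the script (without its assertion); flatten is a hypothetical helper, and the key names are made up for illustration.

```python
def insert(e, tbt):
    """Sorted insertion into a right-nested pair chain, as in the script."""
    if isinstance(tbt, tuple):
        if e <= tbt[0]:
            return (e, tbt)
        return (tbt[0], insert(e, tbt[1]))
    if e <= tbt:
        return (e, tbt)
    return (tbt, e)

def flatten(tbt):
    """Hypothetical helper: list the elements of a pair chain in order."""
    out = []
    while isinstance(tbt, tuple):
        out.append(tbt[0])
        tbt = tbt[1]
    out.append(tbt)  # the last element of a chain is a bare value
    return out

# Start from a single bare key and add three more (illustrative names).
chain = 'href'
for key in ('src', 'content', 'path'):
    chain = insert(key, chain)

if __name__ == '__main__':
    print(chain)                     # ('content', ('href', ('path', 'src')))
    print('&'.join(flatten(chain)))  # content&href&path&src
```

Because the chain is kept sorted, the same set of keys yields the same composite key whatever order the keys were met in, which is what lets the counts for that set accumulate under one dictionary entry.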
def walk(o,f,r,path=None):
    '''Apply f to every key+leaf of a json object reached via path in region r'''
    if isinstance(o,dict):
        for k,v in o.items():
            if isinstance(v,dict):
                walk(v,f,r,(path,k))
            elif isinstance(v,(list,tuple)):
                walked=False
                for i in v:
                    if isinstance(i,dict):
                        if (not walked) and (i is not v[0]):
                            # NB: key is not bound inside walk, so it has to exist as a global
                            print('oops',key,path,k,i,file=sys.stderr)
                        walked=True
                        walk(i,f,r,(path,k))
                    elif walked:
                        print('oops2',key,path,k,i,file=sys.stderr)
                if not walked:
                    f(v,k,path,r)
            else:
                kk=f(v,k,path,r,o)
                if kk is not None:
                    #print(v,D,kk,file=sys.stderr)
                    if v in D:
                        # same value already seen under another key: merge the keys
                        (rr,pp,jj,ss)=D[v]
                        D[v]=(rr,pp,insert(k,jj),ss)
                    else:
                        D[v]=kk
        if D:
            # count each pending (possibly composite) key once, then reset
            for kk in D.values():
                res[kk]=res.get(kk,0)+1
            D.clear()
    elif isinstance(o,(list,tuple)):
        for i in o:
            walk(i,f,r,path)

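walk records its position by wrapping the current path in a new pair, (path,k), each time it descends, so a path is a left-nested chain that ends in None. The sketch below unwinds such a chain into a list of keys. path_names is a hypothetical helper introduced only for illustration (it is not the qq function mentioned in the script's comments, which is not shown here), and the key names are made up.

```python
def path_names(path):
    """Hypothetical helper: unwind a (parent_path, key) chain into a key list."""
    names = []
    while path is not None:
        path, key = path   # peel off the innermost key
        names.append(key)
    names.reverse()        # keys were collected innermost first
    return names

# Paths as walk would build them after one and two descents (made-up keys).
p1 = (None, 'Head')
p2 = (p1, 'Link')

if __name__ == '__main__':
    print(path_names(None))  # []
    print(path_names(p1))    # ['Head']
    print(path_names(p2))    # ['Head', 'Link']
```

The pp functions do not unwind the chain in this general way: they assert that p[0] is None and take p[1], which is why their comments warn that only very short paths are supported.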
def pp(v,k,p,r,parent=None):
    '''Handle a leaf value v, with key k in parent, under path p from r
    Uses nested dictionaries'''
    if isinstance(v,str):
        m=SCHEME.match(v)
        if m is not None:
            n=URN.match(v)
            if n is not None:
                m=n
            s=m.group(1)
            # The following assumes paths are always either length 1 or length 2!!!
            #  by open-coding rather than using qq(p)
            if p is not None:
                assert p[0] is None
                p=p[1]
            d=res[r].setdefault(p,dict())
            d=d.setdefault(k,dict())
            d[s]=d.get(s,0)+1

def pp_tuple(v,k,p,r,parent=None):
    '''Handle a leaf value v, with key k in parent, under path p from r
    Uses one dict and 4-tuple'''
    if isinstance(v,str):
        m=SCHEME.match(v)
        if m is not None:
            n=URN.match(v)
            if n is not None:
                m=n
            s=m.group(1)
            # The following assumes paths are always either length 1 or length 2!!!
            #  by open-coding rather than using qq(p)
            if p is not None:
                assert p[0] is None
                p=p[1]
            if parent is None:
                # no parent dict to merge within, so count the key directly
                # (kk was not bound in this function; build it here)
                kk=(r,p,k,s)
                res[kk]=res.get(kk,0)+1
            else:
                return (r,p,k,s)


SEP='\x00'
DOT='.'

def pp_concat(v,k,p,r,parent=None):
    '''Handle a leaf value v, with key k in parent, under path p from r
    Uses one dict and one string'''
    if isinstance(v,str):
        m=SCHEME.match(v)
        if m is not None:
            n=URN.match(v)
            if n is not None:
                m=n
            s=m.group(1)
            # The following assumes paths are always either length 1 or length 2!!!
            #  by open-coding rather than using qq(p)
            if p is None:
                p=EMPTY
            else:
                assert p[0] is None
                p=p[1]
            k=SEP.join((r,p,k,s))
            res[k]=res.get(k,0)+1

def dump(res):
    for r in res.keys():
        rv=res[r]
        for p in rv.keys():
            pv=rv[p]
            for k,v in pv.items():
                for s,c in v.items():
                    print(r,end=EMPTY)
                    if p is None:
                        print(EMPTY,end='\t')
                    else:
                        print('.',p,sep=EMPTY,end='\t')
                    print(k,end='\t')
                    print(s,c,sep='\t')

def dump_tuple(res):
    for (r,p,k,s),c in res.items():
        print(r,end=EMPTY)
        # The following assumes paths are always either length 1 or length 2!!!
        #  by open-coding rather than using qq(p)
        if p is None:
            print(EMPTY,end='\t')
        else:
            print(DOT,p,sep=EMPTY,end='\t')
        while isinstance(k,tuple):
            print(k[0],end='&')
            k=k[1]
        print(k,end='\t')
        print(s,c,sep='\t')

def dump_concat(res):
    for ks,c in res.items():
        (r,p,k,s)=ks.split(SEP)
        print(r,end=EMPTY)
        # The following assumes paths are always either length 1 or length 2!!!
        #  by open-coding rather than using qq(p)
        if p==EMPTY:
            print(EMPTY,end='\t')
        else:
            print('.',p,sep=EMPTY,end='\t')
        print(k,end='\t')
        print(s,c,sep='\t')

67
189 if len(sys.argv)>=2: |
63
190 res=dict() |
191 if sys.argv[1]=='1': |
66
b04870ab3035
don't over-count duplicate URIs in multiple properties, produce composite keys instead
Henry S. Thompson <ht@markup.co.uk>
parents:
63
diff
changeset
|
192 print('using tuple',file=sys.stderr) |
63
193 pp=pp_tuple |
194 dump=dump_tuple |
67
195 elif sys.argv[1]=='2': |
66
196 print('using concat',file=sys.stderr) |
63
197 pp=pp_concat |
198 dump=dump_concat |
199 else: |
66
200 print('using nested',file=sys.stderr) |
63
201 res=dict((r,dict()) for r in PATHS.keys()) |
62
892e1c0240e1
added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents:
61
diff
changeset
|
202 |
203 def main(): |
63
204 global n # for debugging |
62
205 n=0 |
206 for l in sys.stdin: |
207 if l[0]=='{' and '"WARC-Type":"response"' in l: |
208 j=json.loads(l) |
209 n+=1 |
210 for s in META_PATH: |
211 j=j[s] |
212 for k,v in PATHS.items(): |
213 p=j |
214 try: |
215 for s in v: |
216 p=p[s] |
217 except KeyError as e: |
218 continue |
219 walk(p,pp,k) |
220 |
221 print(n,file=sys.stderr) |
61
cfaf5223b071
trying to get my own mapper working
Henry S. Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
222 |
63
223 if dictRes: |
224 print('res=',end=EMPTY) |
225 from pprint import pprint |
226 pprint(res) |
227 else: |
228 dump(res) |
62
229 |
230 def qq(p): |
231 if p is None: |
232 sys.stdout.write('\t') |
233 else: |
234 qq1(p[0]) |
235 print(p[1],end='\t') |
236 |
237 def qq1(p): |
238 if p is None: |
239 return |
240 else: |
241 qq1(p[0]) |
242 print(p[1],end='.') |
243 |
244 if __name__=="__main__": |
245 main() |
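
A minimal sketch of the output formatting done by qq/qq1 (lines 230-242) and by the composite-key loop (lines 170-173), written to return strings so the behaviour can be checked without capturing stdout. The names format_path and format_key are hypothetical and not part of sac_schemes.py. The data shapes are assumptions read off the code: a path is taken to be either None or a pair (parent, name), and a composite key is taken to be a right-nested pair such as ('x', ('y', 'z')).

```python
def format_path(p):
    """Return the dotted form of a nested-pair path, followed by a tab.

    Assumption: a path is None (empty) or a pair (parent, name), where
    parent is itself a path. This mirrors what qq/qq1 print.
    """
    names = []
    while p is not None:
        parent, name = p
        names.append(str(name))
        p = parent
    # Names were collected leaf-first, so reverse before joining.
    return '.'.join(reversed(names)) + '\t'


def format_key(k):
    """Join the parts of a composite key with '&', followed by a tab.

    Assumption: composite keys are right-nested pairs, as suggested by
    the isinstance(k, tuple) loop in the listing.
    """
    parts = []
    while isinstance(k, tuple):
        parts.append(str(k[0]))
        k = k[1]
    parts.append(str(k))
    return '&'.join(parts) + '\t'


if __name__ == '__main__':
    print(repr(format_path(((None, 'a'), 'b'))))
    print(repr(format_key(('x', ('y', 'z')))))
```

Building the string iteratively avoids the recursion in qq1 while keeping the same separators: '.' between path steps, '&' between key parts, and a trailing tab as the column delimiter.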