annotate master/src/wecu/sac_schemes.py @ 66:b04870ab3035

don't over-count duplicate URIs in multiple properties, produce composite keys instead
author Henry S. Thompson <ht@markup.co.uk>
date Thu, 04 Jun 2020 16:10:55 +0000
parents d46c8b12fc04
children 13182e98a1ab
#!/usr/bin/python3
'''Assumes export PYTHONIOENCODING=utf-8 has been done if necessary

Usage: uz ...wat.gz | sac_schemes.py [-d] [altStorageScheme]

where altStorageScheme, if present, selects an alternative approach to storing triple counts:
  [absent]: three nested dictionaries
  1: one dictionary indexed by 4-tuple
  2: one dictionary indexed by SEP.join(keys)'''
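The three storage schemes named in the docstring can be compared on a toy record. This is a minimal sketch rather than the script's own code: the helper names `count_nested`, `count_tuple` and `count_concat`, and the sample region/path/key/scheme values, are invented for illustration.

```python
SEP = '\x00'  # separator assumed for the string-key variant

def count_nested(res, r, p, k, s):
    """Nested dictionaries: res[r][p][k][s] holds the count."""
    d = res.setdefault(r, {}).setdefault(p, {}).setdefault(k, {})
    d[s] = d.get(s, 0) + 1

def count_tuple(res, r, p, k, s):
    """One flat dictionary keyed by the 4-tuple (r, p, k, s)."""
    key = (r, p, k, s)
    res[key] = res.get(key, 0) + 1

def count_concat(res, r, p, k, s):
    """One flat dictionary keyed by the four fields joined with SEP."""
    key = SEP.join((r, p, k, s))
    res[key] = res.get(key, 0) + 1

nested, by_tuple, by_string = {}, {}, {}
for _ in range(2):
    # hypothetical observation: an https URI under head.Link, property 'url'
    count_nested(nested, 'head', 'Link', 'url', 'https')
    count_tuple(by_tuple, 'head', 'Link', 'url', 'https')
    count_concat(by_string, 'head', 'Link', 'url', 'https')
```

All three hold the same information; they differ only in how the key is built and how much dictionary nesting is needed to reach a count.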

import sys, json, regex

if len(sys.argv)>1 and sys.argv[1]=='-d':
  sys.argv.pop(1)
  dictRes=True
else:
  dictRes=False

META_PATH=['Envelope', 'Payload-Metadata', 'HTTP-Response-Metadata']

PATHS={'hdr':['Headers'],
       'head':['HTML-Metadata','Head'],
       'body':['HTML-Metadata','Links']}

SCHEME=regex.compile('(<?[a-zA-Z][a-zA-Z0-9+.-]*):')
URN=regex.compile('(<?urn:[a-z][a-z0-9+.-]*):',regex.I)

EMPTY=''

D={}

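How the two patterns cooperate may be easier to see in isolation. The sketch below is an assumption-laden illustration: it uses the standard-library `re` module in place of the third-party `regex` module so that it runs without extra installs, and the helper name `scheme_of` and the sample strings are hypothetical.

```python
import re

# Same pattern text as SCHEME and URN above, compiled with the stdlib re module
SCHEME = re.compile('(<?[a-zA-Z][a-zA-Z0-9+.-]*):')
URN = re.compile('(<?urn:[a-z][a-z0-9+.-]*):', re.I)

def scheme_of(v):
    """Return the scheme prefix of v, or None if v does not start with one.

    For URNs the namespace identifier is kept as part of the result,
    so 'urn:isbn:...' yields 'urn:isbn' rather than just 'urn'.
    """
    m = SCHEME.match(v)
    if m is None:
        return None
    n = URN.match(v)
    if n is not None:
        m = n
    return m.group(1)

samples = ['https://example.org/', 'urn:isbn:0451450523',
           '<mailto:someone@example.org>', '/relative/path']
results = [scheme_of(v) for v in samples]
```

Note that an optional leading `<` is captured as part of the scheme, which keeps angle-bracketed values distinguishable from bare ones in the counts.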
def walk(o,f,r,path=None):
  '''Apply f to every key+leaf of a json object reached via path in region r'''
  if isinstance(o,dict):
    for k,v in o.items():
      if isinstance(v,dict):
        walk(v,f,r,(path,k))
      elif isinstance(v,(list,tuple)):
        walked=False
        for i in v:
          if isinstance(i,dict):
            if (not walked) and (i is not v[0]):
              print('oops',key,path,k,i,file=sys.stderr)
            walked=True
            walk(i,f,r,(path,k))
          elif walked:
            print('oops2',key,path,k,i,file=sys.stderr)
        if not walked:
          f(v,k,path,r)
      else:
        kk=f(v,k,path,r,o)
        if kk is not None:
          #print(v,D,kk,file=sys.stderr)
          if v in D:
            # same value already seen under another key: build a composite key
            (rr,pp,jj,ss)=D[v]
            D[v]=(rr,pp,(jj,k),ss)
          else:
            D[v]=kk
    if D:
      # count each distinct value once, under its (possibly composite) key
      for kk in D.values():
        res[kk]=res.get(kk,0)+1
      D.clear()
  elif isinstance(o,(list,tuple)):
    for i in o:
      walk(i,f,r,path)

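The duplicate handling in `walk` can be sketched on its own. The function below is a simplified, hypothetical stand-in (`composite_counts` is not part of the script): it takes the leaves of a single JSON object as `(value, (region, path, key, scheme))` pairs and counts each distinct value once, folding the keys of repeated values into a nested tuple.

```python
def composite_counts(leaves):
    """Count each distinct value once, merging the keys it appears under."""
    seen = {}
    for value, (r, p, k, s) in leaves:
        if value in seen:
            # value already recorded: nest the earlier key with the new one
            rr, pp, jj, ss = seen[value]
            seen[value] = (rr, pp, (jj, k), ss)
        else:
            seen[value] = (r, p, k, s)
    counts = {}
    for key in seen.values():
        counts[key] = counts.get(key, 0) + 1
    return counts

# Hypothetical leaves: the same URI appears under both 'href' and 'url'
leaves = [('http://example.org/a', ('body', 'Links', 'href', 'http')),
          ('http://example.org/a', ('body', 'Links', 'url', 'http')),
          ('http://example.org/b', ('body', 'Links', 'href', 'http'))]
counts = composite_counts(leaves)
```

Without the merge step the first URI would add one to both the `href` and the `url` counts; with it, the URI contributes a single count under the composite key `('href', 'url')`.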
def pp(v,k,p,r,parent=None):
  '''Handle a leaf value v, with key k in parent, under path p from r
     Uses nested dictionaries'''
  if isinstance(v,str):
    m=SCHEME.match(v)
    if m is not None:
      n=URN.match(v)
      if n is not None:
        m=n
      s=m.group(1)
      # The following assumes paths are always either length 1 or length 2!!!
      #  by open-coding rather than using qq(p)
      if p is not None:
        assert p[0] is None
        p=p[1]
      d=res[r].setdefault(p,dict())
      d=d.setdefault(k,dict())
      d[s]=d.get(s,0)+1

def pp_tuple(v,k,p,r,parent=None):
  '''Handle a leaf value v, with key k in parent, under path p from r
     Uses one dict and 4-tuple'''
  if isinstance(v,str):
    m=SCHEME.match(v)
    if m is not None:
      n=URN.match(v)
      if n is not None:
        m=n
      s=m.group(1)
      # The following assumes paths are always either length 1 or length 2!!!
      #  by open-coding rather than using qq(p)
      if p is not None:
        assert p[0] is None
        p=p[1]
      if parent is None:
        # no enclosing object in which to merge duplicates: count directly
        kk=(r,p,k,s)
        res[kk]=res.get(kk,0)+1
      else:
        return (r,p,k,s)


SEP='\x00'
DOT='.'

def pp_concat(v,k,p,r,parent=None):
  '''Handle a leaf value v, with key k in parent, under path p from r
     Uses one dict and one string'''
  if isinstance(v,str):
    m=SCHEME.match(v)
    if m is not None:
      n=URN.match(v)
      if n is not None:
        m=n
      s=m.group(1)
      # The following assumes paths are always either length 1 or length 2!!!
      #  by open-coding rather than using qq(p)
      if p is None:
        p=EMPTY
      else:
        assert p[0] is None
        p=p[1]
      k=SEP.join((r,p,k,s))
      res[k]=res.get(k,0)+1

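The string-key scheme only works if the separator never occurs inside a field. The sketch below illustrates that point with hypothetical field values; that this is the reason the script joins on the NUL character is an assumption, not something the source states.

```python
SEP = '\x00'

# Hypothetical fields; note the dot inside the third one
fields = ('head', 'Metas', 'og.url', 'https')

packed = SEP.join(fields)
unpacked = tuple(packed.split(SEP))    # splits back into exactly four fields

dotted = '.'.join(fields).split('.')   # a dot separator yields five pieces
```

Because `dump_concat` unpacks each key into exactly four names, a separator that can appear in the data would raise a `ValueError` at dump time.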
def dump(res):
  for r in res.keys():
    rv=res[r]
    for p in rv.keys():
      pv=rv[p]
      for k,v in pv.items():
        for s,c in v.items():
          print(r,end=EMPTY)
          if p is None:
            print(EMPTY,end='\t')
          else:
            print('.',p,sep=EMPTY,end='\t')
          print(k,end='\t')
          print(s,c,sep='\t')

def dump_tuple(res):
  for (r,p,k,s),c in res.items():
    print(r,end=EMPTY)
    # The following assumes paths are always either length 1 or length 2!!!
    #  by open-coding rather than using qq(p)
    if p is None:
      print(EMPTY,end='\t')
    else:
      print(DOT,p,sep=EMPTY,end='\t')
    # composite keys are nested pairs: unwind them, most recent key first
    while isinstance(k,tuple):
      print(k[1],end='&')
      k=k[0]
    print(k,end='\t')
    print(s,c,sep='\t')

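The `while` loop in `dump_tuple` unwinds a composite key built as nested pairs. A minimal sketch of the same traversal, returning a string instead of printing (the name `flatten_key` and the sample keys are hypothetical):

```python
def flatten_key(k):
    """Render a possibly nested key such as (('href', 'url'), 'path') as text.

    Each nesting level is a pair (earlier_keys, newest_key), so the newest
    key comes out first and the original key last.
    """
    parts = []
    while isinstance(k, tuple):
        parts.append(k[1])
        k = k[0]
    parts.append(k)
    return '&'.join(parts)

simple = flatten_key('href')
merged = flatten_key((('href', 'url'), 'path'))
```

Here `merged` is `'path&url&href'`: the keys appear in reverse order of discovery, which matches the order in which `dump_tuple` prints them.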
def dump_concat(res):
  for ks,c in res.items():
    (r,p,k,s)=ks.split(SEP)
    print(r,end=EMPTY)
    # The following assumes paths are always either length 1 or length 2!!!
    #  by open-coding rather than using qq(p)
    if p==EMPTY:
      print(EMPTY,end='\t')
    else:
      print('.',p,sep=EMPTY,end='\t')
    print(k,end='\t')
    print(s,c,sep='\t')

173 if len(sys.argv)==2:
174     res=dict()
175     if sys.argv[1]=='1':
66
176         print('using tuple',file=sys.stderr)
63
177         pp=pp_tuple
178         dump=dump_tuple
179     else:
66
180         print('using concat',file=sys.stderr)
63
181         pp=pp_concat
182         dump=dump_concat
183 else:
66
184     print('using nested',file=sys.stderr)
63
185     res=dict((r,dict()) for r in PATHS.keys())
62
186
187 def main():
63
188     global n # for debugging
62
189     n=0
190     for l in sys.stdin:
191         if l[0]=='{' and '"WARC-Type":"response"' in l:
192             j=json.loads(l)
193             n+=1
194             for s in META_PATH:
195                 j=j[s]
196             for k,v in PATHS.items():
197                 p=j
198                 try:
199                     for s in v:
200                         p=p[s]
201                 except KeyError as e:
202                     continue
203                 walk(p,pp,k)
204
205     print(n,file=sys.stderr)
61
206
63
207     if dictRes:
208         print('res=',end=EMPTY)
209         from pprint import pprint
210         pprint(res)
211     else:
212         dump(res)
62
213
214 def qq(p):
215     if p is None:
216         sys.stdout.write('\t')
217     else:
218         qq1(p[0])
219         print(p[1],end='\t')
220
221 def qq1(p):
222     if p is None:
223         return
224     else:
225         qq1(p[0])
226         print(p[1],end='.')
227
228 if __name__=="__main__":
229     main()
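
The qq/qq1 pair above appears to print a path stored as a chain of (parent, label) pairs: ancestors are joined with '.', and the final label is followed by a tab. The sketch below is not part of sac_schemes.py. It assumes that pair-chain shape (inferred only from the p[0] / p[1] accesses) and uses a hypothetical helper, path_to_string, that returns the dotted path as a string rather than writing to stdout.

```python
def path_to_string(p):
    """Render a (parent, label) chain as 'a.b.c'.

    Hypothetical helper for illustration; the chain shape is an
    assumption based on how qq/qq1 index their argument.
    """
    labels = []
    while p is not None:
        parent, label = p          # each node is (parent-or-None, label)
        labels.append(str(label))
        p = parent
    # Labels were collected leaf-first, so reverse to get root-first order.
    return '.'.join(reversed(labels))


# Chain for the path a -> b -> c: the outermost pair holds the last label.
chain = (((None, 'a'), 'b'), 'c')
print(path_to_string(chain))
```

An iterative walk is used here instead of the recursion in qq1, so very long chains do not run into Python's recursion limit; the order of labels is the same.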