annotate master/src/wecu/sac_schemes.py @ 67:13182e98a1ab

use sorted insertion into tuple list for properties
author Henry S. Thompson <ht@markup.co.uk>
date Thu, 04 Jun 2020 17:58:10 +0000
parents b04870ab3035
children
rev   line source
61
cfaf5223b071 trying to get my own mapper working
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
1 #!/usr/bin/python3
63
d46c8b12fc04 support multiple approaches to key combination, use local files to collect results
Henry S. Thompson <ht@markup.co.uk>
parents: 62
diff changeset
2 '''Assumes export PYTHONIOENCODING=utf-8 has been done if necessary
3
4 Usage: uz ...wat.gz | sac_schemes.py [-d] [altStorageScheme]
5
6 where altStorageScheme, if present, selects an alternative approach to storing triple counts:
7   [absent]: three nested dictionaries
8   1: one dictionary indexed by 4-tuple
9   2: one dictionary indexed by SEP.join(keys), where SEP is a NUL character'''
61
cfaf5223b071 trying to get my own mapper working
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
10
11 import sys, json, regex
12
67
13182e98a1ab use sorted insertion into tuple list for properties
Henry S. Thompson <ht@markup.co.uk>
parents: 66
diff changeset
13 print(sys.argv,file=sys.stderr)
14
63
d46c8b12fc04 support multiple approaches to key combination, use local files to collect results
Henry S. Thompson <ht@markup.co.uk>
parents: 62
diff changeset
15 if len(sys.argv)>1 and sys.argv[1]=='-d':
16   sys.argv.pop(1)
17   dictRes=True
18 else:
19   dictRes=False
20
61
cfaf5223b071 trying to get my own mapper working
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
21 META_PATH=['Envelope', 'Payload-Metadata', 'HTTP-Response-Metadata']
22
23 PATHS={'hdr':['Headers'],
24        'head':['HTML-Metadata','Head'],
25        'body':['HTML-Metadata','Links']}
26
27 SCHEME=regex.compile('(<?[a-zA-Z][a-zA-Z0-9+.-]*):')
28 URN=regex.compile('(<?urn:[a-z][a-z0-9+.-]*):',regex.I)
29
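The two patterns above are combined later in the file: SCHEME is tried first, and a URN match, when there is one, replaces it so that the namespace identifier is kept. A minimal sketch of that lookup follows. It assumes the standard-library `re` module in place of `regex` (the patterns use nothing `regex`-specific), and `scheme_of` is a hypothetical helper name that does not appear in sac_schemes.py.

```python
import re

# Same patterns as SCHEME and URN in the listing, compiled with the
# standard-library re module (an assumption, to avoid a third-party import).
SCHEME = re.compile(r'(<?[a-zA-Z][a-zA-Z0-9+.-]*):')
URN = re.compile(r'(<?urn:[a-z][a-z0-9+.-]*):', re.I)

def scheme_of(value):
    """Return the scheme prefix of value, or None if it has none.

    scheme_of is a hypothetical helper, not part of sac_schemes.py.
    """
    m = SCHEME.match(value)
    if m is None:
        return None
    n = URN.match(value)  # prefer the longer urn:<nid> form when it matches
    if n is not None:
        m = n
    return m.group(1)

for sample in ('https://example.org/', 'urn:isbn:0451450523',
               '<mailto:someone@example.org>', '/relative/path'):
    print(repr(sample), '->', scheme_of(sample))
```

Note that the optional leading `<` is part of the captured group, so a value written as `<mailto:...>` is counted under `<mailto` rather than `mailto`.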
63
d46c8b12fc04 support multiple approaches to key combination, use local files to collect results
Henry S. Thompson <ht@markup.co.uk>
parents: 62
diff changeset
30 EMPTY=''
31
66
b04870ab3035 don't over-count duplicate URIs in multiple properties, produce composite keys instead
Henry S. Thompson <ht@markup.co.uk>
parents: 63
diff changeset
32 D={}
33
67
13182e98a1ab use sorted insertion into tuple list for properties
Henry S. Thompson <ht@markup.co.uk>
parents: 66
diff changeset
34 def insert(e,tbt):
35   '''insert something into a trivial pair-impl of a list,
36   not balanced!'''
37   if isinstance(tbt,tuple):
38     assert not isinstance(tbt[0],tuple)
39     if e<=tbt[0]:
40       return (e,tbt)
41     else:
42       return (tbt[0],insert(e,tbt[1]))
43   elif e<=tbt:
44     return (e,tbt)
45   else:
46     return (tbt,e)
47
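The following sketch shows what insert builds: a right-nested chain of pairs, kept in sorted order, starting from a bare string. It restates the function from the listing (without the assertion) so that it runs on its own. The property names are made-up sample data.

```python
def insert(e, tbt):
    """Sorted insertion into a right-nested pair list such as ('a', ('b', 'c'))."""
    if isinstance(tbt, tuple):
        if e <= tbt[0]:
            return (e, tbt)
        return (tbt[0], insert(e, tbt[1]))
    if e <= tbt:
        return (e, tbt)
    return (tbt, e)

# Start from a single property name and add three more, in arbitrary order.
key = 'href'
for name in ('title', 'alt', 'src'):
    key = insert(name, key)
print(key)  # ('alt', ('href', ('src', 'title')))

# The same names arriving in a different order produce the same structure,
# so the result can serve as a canonical, hashable dictionary key.
other = 'src'
for name in ('alt', 'title', 'href'):
    other = insert(name, other)
print(other == key)  # True
```

Because each insertion walks the chain from the front, building a key from n names costs O(n^2) comparisons in the worst case, which is presumably what "not balanced!" warns about. That is unlikely to matter when n is the number of properties sharing one URI.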
62
892e1c0240e1 added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents: 61
diff changeset
48 def walk(o,f,r,path=None):
66
b04870ab3035 don't over-count duplicate URIs in multiple properties, produce composite keys instead
Henry S. Thompson <ht@markup.co.uk>
parents: 63
diff changeset
49   '''Apply f to every key+leaf of a json object reached via path in region r'''
61
cfaf5223b071 trying to get my own mapper working
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
50   if isinstance(o,dict):
51     for k,v in o.items():
52       if isinstance(v,dict):
62
892e1c0240e1 added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents: 61
diff changeset
53         walk(v,f,r,(path,k))
66
b04870ab3035 don't over-count duplicate URIs in multiple properties, produce composite keys instead
Henry S. Thompson <ht@markup.co.uk>
parents: 63
diff changeset
54       elif isinstance(v,(list,tuple)):
61
cfaf5223b071 trying to get my own mapper working
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
55         walked=False
56         for i in v:
57           if isinstance(i,dict):
58             if (not walked) and (i is not v[0]):
62
892e1c0240e1 added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents: 61
diff changeset
59               print('oops',path,k,i,file=sys.stderr)
61
cfaf5223b071 trying to get my own mapper working
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
60             walked=True
62
892e1c0240e1 added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents: 61
diff changeset
61             walk(i,f,r,(path,k))
61
cfaf5223b071 trying to get my own mapper working
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
62           elif walked:
62
892e1c0240e1 added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents: 61
diff changeset
63             print('oops2',path,k,i,file=sys.stderr)
61
cfaf5223b071 trying to get my own mapper working
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
64         if not walked:
62
892e1c0240e1 added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents: 61
diff changeset
65           f(v,k,path,r)
61
cfaf5223b071 trying to get my own mapper working
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
66       else:
66
b04870ab3035 don't over-count duplicate URIs in multiple properties, produce composite keys instead
Henry S. Thompson <ht@markup.co.uk>
parents: 63
diff changeset
67         kk=f(v,k,path,r,o)
68         if kk is not None:
69           #print(v,D,kk,file=sys.stderr)
70           if v in D:
71             (rr,pp,jj,ss)=D[v]
67
13182e98a1ab use sorted insertion into tuple list for properties
Henry S. Thompson <ht@markup.co.uk>
parents: 66
diff changeset
72             D[v]=(rr,pp,insert(k,jj),ss)
66
b04870ab3035 don't over-count duplicate URIs in multiple properties, produce composite keys instead
Henry S. Thompson <ht@markup.co.uk>
parents: 63
diff changeset
73           else:
74             D[v]=kk
75     if D:
76       for kk in D.values():
77         res[kk]=res.get(kk,0)+1
78       D.clear()
79   elif isinstance(o,(list,tuple)):
61
cfaf5223b071 trying to get my own mapper working
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
80     for i in o:
62
892e1c0240e1 added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents: 61
diff changeset
81       walk(i,f,r,path)
61
cfaf5223b071 trying to get my own mapper working
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
82
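The leaf branch of walk collects, per JSON object, the properties that carry the same value and counts that value once under a combined key, instead of once per property. The sketch below isolates that idea under simplifying assumptions: it ignores region, path and scheme, and it uses a flat sorted tuple where the listing uses nested pairs. `count_once_per_uri` and the sample object are hypothetical.

```python
from collections import Counter

def count_once_per_uri(obj, counts):
    """Count each distinct string value of obj once, keyed by the sorted
    combination of the property names that carry it.

    Hypothetical simplification of the leaf branch of walk(): the real code
    also records region, path and scheme, and builds nested pairs.
    """
    seen = {}
    for prop, value in obj.items():
        if isinstance(value, str):
            seen.setdefault(value, []).append(prop)
    for props in seen.values():
        counts[tuple(sorted(props))] += 1

counts = Counter()
count_once_per_uri({'href': 'http://a/', 'url': 'http://a/', 'src': 'http://b/'},
                   counts)
for key, n in sorted(counts.items()):
    print('&'.join(key), n)
```

Here `http://a/` contributes a single count under `href&url` rather than one count each under `href` and `url`.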
66
b04870ab3035 don't over-count duplicate URIs in multiple properties, produce composite keys instead
Henry S. Thompson <ht@markup.co.uk>
parents: 63
diff changeset
83 def pp(v,k,p,r,parent=None):
84   '''Handle a leaf value v, with key k in parent, under path p from r
85   Uses nested dictionaries'''
63
d46c8b12fc04 support multiple approaches to key combination, use local files to collect results
Henry S. Thompson <ht@markup.co.uk>
parents: 62
diff changeset
86   if isinstance(v,str):
87     m=SCHEME.match(v)
88     if m is not None:
89       n=URN.match(v)
90       if n is not None:
91         m=n
92       s=m.group(1)
93       # The following assumes paths are always either length 1 or length 2!!!
94       # by open-coding rather than using qq(p)
95       if p is not None:
96         assert p[0] is None
97         p=p[1]
98       d=res[r].setdefault(p,dict())
99       d=d.setdefault(k,dict())
100       d[s]=d.get(s,0)+1
101
66
b04870ab3035 don't over-count duplicate URIs in multiple properties, produce composite keys instead
Henry S. Thompson <ht@markup.co.uk>
parents: 63
diff changeset
102 def pp_tuple(v,k,p,r,parent=None):
103   '''Handle a leaf value v, with key k in parent, under path p from r
104   Uses one dict and 4-tuple'''
63
d46c8b12fc04 support multiple approaches to key combination, use local files to collect results
Henry S. Thompson <ht@markup.co.uk>
parents: 62
diff changeset
105   if isinstance(v,str):
106     m=SCHEME.match(v)
107     if m is not None:
108       n=URN.match(v)
109       if n is not None:
110         m=n
111       s=m.group(1)
112       # The following assumes paths are always either length 1 or length 2!!!
113       # by open-coding rather than using qq(p)
114       if p is not None:
115         assert p[0] is None
116         p=p[1]
66
b04870ab3035 don't over-count duplicate URIs in multiple properties, produce composite keys instead
Henry S. Thompson <ht@markup.co.uk>
parents: 63
diff changeset
117       if parent is None:
118         res[(r,p,k,s)]=res.get((r,p,k,s),0)+1
119       else:
120         return (r,p,k,s)
121
63
d46c8b12fc04 support multiple approaches to key combination, use local files to collect results
Henry S. Thompson <ht@markup.co.uk>
parents: 62
diff changeset
122
123 SEP='\x00'
124 DOT='.'
125
66
b04870ab3035 don't over-count duplicate URIs in multiple properties, produce composite keys instead
Henry S. Thompson <ht@markup.co.uk>
parents: 63
diff changeset
126 def pp_concat(v,k,p,r,parent=None):
127   '''Handle a leaf value v, with key k in parent, under path p from r
128   Uses one dict and one string'''
61
cfaf5223b071 trying to get my own mapper working
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
129   if isinstance(v,str):
130     m=SCHEME.match(v)
131     if m is not None:
132       n=URN.match(v)
133       if n is not None:
134         m=n
62
892e1c0240e1 added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents: 61
diff changeset
135       s=m.group(1)
63
d46c8b12fc04 support multiple approaches to key combination, use local files to collect results
Henry S. Thompson <ht@markup.co.uk>
parents: 62
diff changeset
136       # The following assumes paths are always either length 1 or length 2!!!
137       # by open-coding rather than using qq(p)
138       if p is None:
139         p=EMPTY
140       else:
141         assert p[0] is None
142         p=p[1]
143       k=SEP.join((r,p,k,s))
144       res[k]=res.get(k,0)+1
145
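The three handlers above (pp, pp_tuple, pp_concat) record the same information in three shapes. As a rough side-by-side illustration, the sketch below counts a few made-up (region, path, key, scheme) entries using all three layouts. The sample values and the variable names `nested`, `by_tuple` and `by_string` are assumptions for the example only.

```python
SEP = '\x00'

# Hypothetical sample data: (region, path, key, scheme)
triples = [('body', 'Links', 'href', 'http'),
           ('body', 'Links', 'href', 'http'),
           ('head', 'Link', 'url', 'https')]

nested, by_tuple, by_string = {}, {}, {}
for r, p, k, s in triples:
    # Default layout: nested dictionaries with the count at the leaf.
    leaf = nested.setdefault(r, {}).setdefault(p, {}).setdefault(k, {})
    leaf[s] = leaf.get(s, 0) + 1
    # Scheme 1: one dictionary keyed by the 4-tuple.
    by_tuple[(r, p, k, s)] = by_tuple.get((r, p, k, s), 0) + 1
    # Scheme 2: one dictionary keyed by the SEP-joined string.
    joined = SEP.join((r, p, k, s))
    by_string[joined] = by_string.get(joined, 0) + 1

print(nested['body']['Links']['href']['http'])
print(by_tuple[('body', 'Links', 'href', 'http')])
# Splitting on SEP recovers the four components of a joined key.
print([tuple(key.split(SEP)) for key in by_string])
```

Joining on a NUL character keeps the split unambiguous as long as none of the four components contains NUL. A dot, by contrast, can occur inside a scheme name such as `z39.50`.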
146 def dump(res):
147   for r in res.keys():
148     rv=res[r]
149     for p in rv.keys():
150       pv=rv[p]
151       for k,v in pv.items():
152         for s,c in v.items():
153           print(r,end=EMPTY)
154           if p is None:
155             print(EMPTY,end='\t')
156           else:
157             print('.',p,sep=EMPTY,end='\t')
158           print(k,end='\t')
159           print(s,c,sep='\t')
160
161 def dump_tuple(res):
162   for (r,p,k,s),c in res.items():
163     print(r,end=EMPTY)
164     # The following assumes paths are always either length 1 or length 2!!!
165     # by open-coding rather than using qq(p)
166     if p is None:
167       print(EMPTY,end='\t')
168     else:
169       print(DOT,p,sep=EMPTY,end='\t')
66
b04870ab3035 don't over-count duplicate URIs in multiple properties, produce composite keys instead
Henry S. Thompson <ht@markup.co.uk>
parents: 63
diff changeset
170     while isinstance(k,tuple):
67
13182e98a1ab use sorted insertion into tuple list for properties
Henry S. Thompson <ht@markup.co.uk>
parents: 66
diff changeset
171       print(k[0],end='&')
172       k=k[1]
63
d46c8b12fc04 support multiple approaches to key combination, use local files to collect results
Henry S. Thompson <ht@markup.co.uk>
parents: 62
diff changeset
173     print(k,end='\t')
174     print(s,c,sep='\t')
175
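The while loop in dump_tuple unwinds a composite key, a right-nested pair chain of property names, and prints the names separated by `&`. The same unwinding can be sketched as a function that returns the string instead of printing it. `flatten` is a hypothetical name that does not occur in the source.

```python
def flatten(k, joiner='&'):
    """Turn a right-nested pair chain like ('alt', ('href', 'title'))
    into 'alt&href&title'; a plain string is returned unchanged.

    flatten is a hypothetical helper mirroring the loop in dump_tuple.
    """
    parts = []
    while isinstance(k, tuple):
        parts.append(k[0])
        k = k[1]
    parts.append(k)
    return joiner.join(parts)

print(flatten(('alt', ('href', 'title'))))  # alt&href&title
print(flatten('href'))                      # href
```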
d46c8b12fc04 support multiple approaches to key combination, use local files to collect results
Henry S. Thompson <ht@markup.co.uk>
parents: 62
diff changeset
176 def dump_concat(res):
d46c8b12fc04 support multiple approaches to key combination, use local files to collect results
Henry S. Thompson <ht@markup.co.uk>
parents: 62
diff changeset
177     for ks,c in res.items():
d46c8b12fc04 support multiple approaches to key combination, use local files to collect results
Henry S. Thompson <ht@markup.co.uk>
parents: 62
diff changeset
178         (r,p,k,s)=ks.split(SEP)
d46c8b12fc04 support multiple approaches to key combination, use local files to collect results
Henry S. Thompson <ht@markup.co.uk>
parents: 62
diff changeset
179         print(r,end=EMPTY)
d46c8b12fc04 support multiple approaches to key combination, use local files to collect results
Henry S. Thompson <ht@markup.co.uk>
parents: 62
diff changeset
180         # NB: the output below is open-coded rather than using qq(p),
d46c8b12fc04 support multiple approaches to key combination, use local files to collect results
Henry S. Thompson <ht@markup.co.uk>
parents: 62
diff changeset
181         # so it assumes paths are always of length 1 or 2!
d46c8b12fc04 support multiple approaches to key combination, use local files to collect results
Henry S. Thompson <ht@markup.co.uk>
parents: 62
diff changeset
182         if p==EMPTY:
d46c8b12fc04 support multiple approaches to key combination, use local files to collect results
Henry S. Thompson <ht@markup.co.uk>
parents: 62
diff changeset
183             print(EMPTY,end='\t')
d46c8b12fc04 support multiple approaches to key combination, use local files to collect results
Henry S. Thompson <ht@markup.co.uk>
parents: 62
diff changeset
184         else:
d46c8b12fc04 support multiple approaches to key combination, use local files to collect results
Henry S. Thompson <ht@markup.co.uk>
parents: 62
diff changeset
185             print('.',p,sep=EMPTY,end='\t')
d46c8b12fc04 support multiple approaches to key combination, use local files to collect results
Henry S. Thompson <ht@markup.co.uk>
parents: 62
diff changeset
186         print(k,end='\t')
d46c8b12fc04 support multiple approaches to key combination, use local files to collect results
Henry S. Thompson <ht@markup.co.uk>
parents: 62
diff changeset
187         print(s,c,sep='\t')
d46c8b12fc04 support multiple approaches to key combination, use local files to collect results
Henry S. Thompson <ht@markup.co.uk>
parents: 62
diff changeset
188
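To illustrate the row format that dump_concat emits, here is a minimal sketch that builds the same tab-separated line as a string instead of printing it piece by piece. It assumes EMPTY is the empty string; the SEP value, the helper name concat_row and the field values are hypothetical, since the real definitions sit earlier in the file.

```python
SEP = '\x1f'  # hypothetical separator; the real SEP is defined earlier in the file
EMPTY = ''    # assumed to be the empty string

def concat_row(ks, c):
    """Return the row dump_concat would print for one entry, without the newline."""
    (r, p, k, s) = ks.split(SEP)
    # Root and optional sub-path are joined with a dot, as in the open-coded prints
    path = r if p == EMPTY else r + '.' + p
    return '\t'.join((path, k, s, str(c)))

# Made-up keys: one with a two-step path, one with a one-step path
rows = [concat_row(SEP.join(('root', 'sub', 'key', 'https')), 3),
        concat_row(SEP.join(('root', EMPTY, 'key', 'http')), 1)]
```

Like the original, this only covers paths of length 1 or 2, because the key carries exactly one root field and one optional sub-path field.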
67
13182e98a1ab use sorted insertion into tuple list for properties
Henry S. Thompson <ht@markup.co.uk>
parents: 66
diff changeset
189 if len(sys.argv)>=2:
63
d46c8b12fc04 support multiple approaches to key combination, use local files to collect results
Henry S. Thompson <ht@markup.co.uk>
parents: 62
diff changeset
190     res=dict()
d46c8b12fc04 support multiple approaches to key combination, use local files to collect results
Henry S. Thompson <ht@markup.co.uk>
parents: 62
diff changeset
191     if sys.argv[1]=='1':
66
b04870ab3035 don't over-count duplicate URIs in multiple properties, produce composite keys instead
Henry S. Thompson <ht@markup.co.uk>
parents: 63
diff changeset
192         print('using tuple',file=sys.stderr)
63
d46c8b12fc04 support multiple approaches to key combination, use local files to collect results
Henry S. Thompson <ht@markup.co.uk>
parents: 62
diff changeset
193         pp=pp_tuple
d46c8b12fc04 support multiple approaches to key combination, use local files to collect results
Henry S. Thompson <ht@markup.co.uk>
parents: 62
diff changeset
194         dump=dump_tuple
67
13182e98a1ab use sorted insertion into tuple list for properties
Henry S. Thompson <ht@markup.co.uk>
parents: 66
diff changeset
195     elif sys.argv[1]=='2':
66
b04870ab3035 don't over-count duplicate URIs in multiple properties, produce composite keys instead
Henry S. Thompson <ht@markup.co.uk>
parents: 63
diff changeset
196         print('using concat',file=sys.stderr)
63
d46c8b12fc04 support multiple approaches to key combination, use local files to collect results
Henry S. Thompson <ht@markup.co.uk>
parents: 62
diff changeset
197         pp=pp_concat
d46c8b12fc04 support multiple approaches to key combination, use local files to collect results
Henry S. Thompson <ht@markup.co.uk>
parents: 62
diff changeset
198         dump=dump_concat
d46c8b12fc04 support multiple approaches to key combination, use local files to collect results
Henry S. Thompson <ht@markup.co.uk>
parents: 62
diff changeset
199 else:
66
b04870ab3035 don't over-count duplicate URIs in multiple properties, produce composite keys instead
Henry S. Thompson <ht@markup.co.uk>
parents: 63
diff changeset
200     print('using nested',file=sys.stderr)
63
d46c8b12fc04 support multiple approaches to key combination, use local files to collect results
Henry S. Thompson <ht@markup.co.uk>
parents: 62
diff changeset
201     res=dict((r,dict()) for r in PATHS.keys())
62
892e1c0240e1 added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents: 61
diff changeset
202
892e1c0240e1 added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents: 61
diff changeset
203 def main():
63
d46c8b12fc04 support multiple approaches to key combination, use local files to collect results
Henry S. Thompson <ht@markup.co.uk>
parents: 62
diff changeset
204     global n # for debugging
62
892e1c0240e1 added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents: 61
diff changeset
205     n=0
892e1c0240e1 added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents: 61
diff changeset
206     for l in sys.stdin:
892e1c0240e1 added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents: 61
diff changeset
207         if l[0]=='{' and '"WARC-Type":"response"' in l:
892e1c0240e1 added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents: 61
diff changeset
208             j=json.loads(l)
892e1c0240e1 added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents: 61
diff changeset
209             n+=1
892e1c0240e1 added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents: 61
diff changeset
210             for s in META_PATH:
892e1c0240e1 added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents: 61
diff changeset
211                 j=j[s]
892e1c0240e1 added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents: 61
diff changeset
212             for k,v in PATHS.items():
892e1c0240e1 added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents: 61
diff changeset
213                 p=j
892e1c0240e1 added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents: 61
diff changeset
214                 try:
892e1c0240e1 added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents: 61
diff changeset
215                     for s in v:
892e1c0240e1 added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents: 61
diff changeset
216                         p=p[s]
892e1c0240e1 added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents: 61
diff changeset
217                 except KeyError: # this record lacks that path
892e1c0240e1 added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents: 61
diff changeset
218                     continue
892e1c0240e1 added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents: 61
diff changeset
219                 walk(p,pp,k)
892e1c0240e1 added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents: 61
diff changeset
220
892e1c0240e1 added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents: 61
diff changeset
221     print(n,file=sys.stderr)
61
cfaf5223b071 trying to get my own mapper working
Henry S. Thompson <ht@markup.co.uk>
parents:
diff changeset
222
63
d46c8b12fc04 support multiple approaches to key combination, use local files to collect results
Henry S. Thompson <ht@markup.co.uk>
parents: 62
diff changeset
223     if dictRes:
d46c8b12fc04 support multiple approaches to key combination, use local files to collect results
Henry S. Thompson <ht@markup.co.uk>
parents: 62
diff changeset
224         print('res=',end=EMPTY)
d46c8b12fc04 support multiple approaches to key combination, use local files to collect results
Henry S. Thompson <ht@markup.co.uk>
parents: 62
diff changeset
225         from pprint import pprint
d46c8b12fc04 support multiple approaches to key combination, use local files to collect results
Henry S. Thompson <ht@markup.co.uk>
parents: 62
diff changeset
226         pprint(res)
d46c8b12fc04 support multiple approaches to key combination, use local files to collect results
Henry S. Thompson <ht@markup.co.uk>
parents: 62
diff changeset
227     else:
d46c8b12fc04 support multiple approaches to key combination, use local files to collect results
Henry S. Thompson <ht@markup.co.uk>
parents: 62
diff changeset
228         dump(res)
62
892e1c0240e1 added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents: 61
diff changeset
229
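The inner try block in main() follows a sequence of keys into the parsed JSON record and skips the entry when any key is missing. A minimal sketch of that lookup in isolation, assuming each PATHS value is a sequence of dictionary keys; the function name follow and the sample record are hypothetical, not taken from the script.

```python
def follow(record, steps):
    """Descend through nested dicts along steps; return None if a key is missing."""
    node = record
    try:
        for step in steps:
            node = node[step]
    except KeyError:
        # Mirrors the 'continue' in main(): a record without this path is skipped
        return None
    return node

# Made-up record standing in for one parsed JSON line
record = {'meta': {'head': {'link': 'x'}}}
found = follow(record, ('meta', 'head', 'link'))
missing = follow(record, ('meta', 'body'))
```

As in the original, only KeyError is caught, so a step that lands on a non-dict value (a list or a string, say) would still raise.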
892e1c0240e1 added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents: 61
diff changeset
230 def qq(p):
892e1c0240e1 added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents: 61
diff changeset
231     if p is None:
892e1c0240e1 added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents: 61
diff changeset
232         sys.stdout.write('\t')
892e1c0240e1 added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents: 61
diff changeset
233     else:
892e1c0240e1 added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents: 61
diff changeset
234         qq1(p[0])
892e1c0240e1 added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents: 61
diff changeset
235         print(p[1],end='\t')
892e1c0240e1 added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents: 61
diff changeset
236
892e1c0240e1 added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents: 61
diff changeset
237 def qq1(p):
892e1c0240e1 added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents: 61
diff changeset
238     if p is None:
892e1c0240e1 added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents: 61
diff changeset
239         return
892e1c0240e1 added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents: 61
diff changeset
240     else:
892e1c0240e1 added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents: 61
diff changeset
241         qq1(p[0])
892e1c0240e1 added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents: 61
diff changeset
242         print(p[1],end='.')
892e1c0240e1 added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents: 61
diff changeset
243
892e1c0240e1 added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents: 61
diff changeset
244 if __name__=="__main__":
892e1c0240e1 added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents: 61
diff changeset
245     main()
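Judging from qq and qq1, a path appears to be a linked structure of (parent, label) pairs with None at the root, printed root-first with dots between labels and a tab at the end. A minimal sketch of the same rendering as a pure function, under that assumption; path_string is a hypothetical name, and the trailing tab is left to the caller.

```python
def path_string(p):
    """Join the labels of a linked (parent, label) path, root first, with dots."""
    labels = []
    while p is not None:
        labels.append(str(p[1]))  # str() matches what print() would do
        p = p[0]
    # Labels were collected leaf-first, so reverse before joining
    return '.'.join(reversed(labels))

one_step = path_string((None, 'a'))
two_step = path_string(((None, 'a'), 'b'))
empty = path_string(None)
```

Unlike the open-coded printing in dump_concat, this handles paths of any length, and the iterative loop avoids recursion for very deep paths.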