annotate master/src/wecu/sac_schemes.py @ 66:b04870ab3035
don't over-count duplicate URIs in multiple properties, produce composite keys instead
author | Henry S. Thompson <ht@markup.co.uk> |
---|---|
date | Thu, 04 Jun 2020 16:10:55 +0000 |
parents | d46c8b12fc04 |
children | 13182e98a1ab |
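rev | changeset | description |
---|---|---|
61 | cfaf5223b071 | trying to get my own mapper working |
62 | 892e1c0240e1 | added more robust (I hope) error handling, |
63 | d46c8b12fc04 | support multiple approaches to key combination, use local files to collect results |
66 | b04870ab3035 | don't over-count duplicate URIs in multiple properties, produce composite keys instead |
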
#!/usr/bin/python3
'''Assumes export PYTHONIOENCODING=utf-8 has been done if necessary

Usage: uz ...wat.gz | sac_schemes.py [-d] [altStorageScheme]

where altStorageScheme if present selects an alternative approach to storing triple counts:
  [absent]: three nested dictionaries
  1: one dictionary indexed by 4-tuple
  2: one dictionary indexed by SEP.join(keys)'''
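
The three storage schemes named in the docstring can be compared on a single observation. This is only a sketch: the sample region, path, key and scheme values are made up for illustration, and the variable names are not taken from the script.

```python
# One (region, path, key, scheme) observation stored three ways.
# The sample values below are hypothetical.
region, path, key, scheme = 'head', 'Link', 'url', 'https'

# [absent]: nested dictionaries, region -> path -> key -> scheme -> count
nested = {region: {}}
by_key = nested[region].setdefault(path, {}).setdefault(key, {})
by_key[scheme] = by_key.get(scheme, 0) + 1

# 1: one dictionary indexed by a 4-tuple
by_tuple = {}
t = (region, path, key, scheme)
by_tuple[t] = by_tuple.get(t, 0) + 1

# 2: one dictionary indexed by a single joined string
SEP = '\x00'
by_string = {}
joined = SEP.join(t)
by_string[joined] = by_string.get(joined, 0) + 1
```

All three hold the same count; they differ only in how many dictionary lookups an update needs and in how the key has to be unpacked again when dumping.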

import sys, json, regex

if len(sys.argv)>1 and sys.argv[1]=='-d':
  sys.argv.pop(1)
  dictRes=True
else:
  dictRes=False

META_PATH=['Envelope', 'Payload-Metadata', 'HTTP-Response-Metadata']

PATHS={'hdr':['Headers'],
       'head':['HTML-Metadata','Head'],
       'body':['HTML-Metadata','Links']}

SCHEME=regex.compile('(<?[a-zA-Z][a-zA-Z0-9+.-]*):')
URN=regex.compile('(<?urn:[a-z][a-z0-9+.-]*):',regex.I)

EMPTY=''

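
A rough illustration of what the two patterns above pick out of a string value. As an assumption for self-containment it uses the standard-library `re` module in place of the third-party `regex` module (these two patterns use no `regex`-only syntax), and `scheme_of` is a hypothetical helper name, not a function in the script.

```python
import re

SCHEME = re.compile('(<?[a-zA-Z][a-zA-Z0-9+.-]*):')
URN = re.compile('(<?urn:[a-z][a-z0-9+.-]*):', re.I)

def scheme_of(value):
    """Return the scheme prefix of value, or None if it does not start with one.

    For URNs the namespace identifier is kept as part of the result.
    """
    m = SCHEME.match(value)
    if m is None:
        return None
    n = URN.match(value)
    if n is not None:
        m = n  # prefer the longer urn:<nid> prefix
    return m.group(1)
```

With these definitions `scheme_of('https://example.org/')` gives `'https'`, `scheme_of('urn:isbn:0451450523')` gives `'urn:isbn'`, and a relative reference such as `'/relative/path'` gives `None`. An optional leading `<` is retained, so `'<mailto:a@b.c>'` yields `'<mailto'`.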
D={}

def walk(o,f,r,path=None):
  '''Apply f to every key+leaf of a json object reached via path in region r'''
  if isinstance(o,dict):
    for k,v in o.items():
      if isinstance(v,dict):
        walk(v,f,r,(path,k))
      elif isinstance(v,(list,tuple)):
        walked=False
        for i in v:
          if isinstance(i,dict):
            if (not walked) and (i is not v[0]):
              print('oops',path,k,i,file=sys.stderr)
            walked=True
            walk(i,f,r,(path,k))
          elif walked:
            print('oops2',path,k,i,file=sys.stderr)
        if not walked:
          f(v,k,path,r)
      else:
        kk=f(v,k,path,r,o)
        if kk is not None:
          #print(v,D,kk,file=sys.stderr)
          if v in D:
            # same value already seen under another key: build a composite key
            (rr,pp,jj,ss)=D[v]
            D[v]=(rr,pp,(jj,k),ss)
          else:
            D[v]=kk
    if D:
      for kk in D.values():
        res[kk]=res.get(kk,0)+1
      D.clear()
  elif isinstance(o,(list,tuple)):
    for i in o:
      walk(i,f,r,path)
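
The composite-key step in `walk` can be looked at in isolation. The following is a simplified sketch, not the script's own code: `count_composite` is a hypothetical name, the scheme test is a plain `split(':')` stand-in for the `SCHEME`/`URN` match, and the sample leaves are invented.

```python
def count_composite(leaves, region, path):
    """Count schemes for the (key, value) leaves of one parent object.

    A value that appears under several keys is counted once, under a
    nested-tuple key such as ('href', 'url').
    """
    seen = {}  # value -> (region, path, key-or-composite, scheme)
    for key, value in leaves:
        if ':' not in value:
            continue  # simplified stand-in for the SCHEME/URN match
        scheme = value.split(':', 1)[0]
        if value in seen:
            r, p, k, s = seen[value]
            seen[value] = (r, p, (k, key), s)
        else:
            seen[value] = (region, path, key, scheme)
    counts = {}
    for composite in seen.values():
        counts[composite] = counts.get(composite, 0) + 1
    return counts

leaves = [('href', 'https://example.org/'),
          ('url', 'https://example.org/'),
          ('src', 'http://example.com/x.png')]
counts = count_composite(leaves, 'body', 'A')
```

Here the duplicated URI contributes a single count under the key `('href', 'url')` rather than one count per property, which appears to be the point of the "don't over-count" change.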

def pp(v,k,p,r,parent=None):
  '''Handle a leaf value v, with key k in parent, under path p from r
  Uses nested dictionaries'''
  if isinstance(v,str):
    m=SCHEME.match(v)
    if m is not None:
      n=URN.match(v)
      if n is not None:
        m=n
      s=m.group(1)
      # The following assumes paths are always either length 1 or length 2!!!
      #  by open-coding rather than using qq(p)
      if p is not None:
        assert p[0] is None
        p=p[1]
      d=res[r].setdefault(p,dict())
      d=d.setdefault(k,dict())
      d[s]=d.get(s,0)+1

def pp_tuple(v,k,p,r,parent=None):
  '''Handle a leaf value v, with key k in parent, under path p from r
  Uses one dict and 4-tuple'''
  if isinstance(v,str):
    m=SCHEME.match(v)
    if m is not None:
      n=URN.match(v)
      if n is not None:
        m=n
      s=m.group(1)
      # The following assumes paths are always either length 1 or length 2!!!
      #  by open-coding rather than using qq(p)
      if p is not None:
        assert p[0] is None
        p=p[1]
      if parent is None:
        # no parent supplied, so count this key immediately
        kk=(r,p,k,s)
        res[kk]=res.get(kk,0)+1
      else:
        # let the caller (walk) merge duplicates before counting
        return (r,p,k,s)


SEP='\x00'
DOT='.'

def pp_concat(v,k,p,r,parent=None):
  '''Handle a leaf value v, with key k in parent, under path p from r
  Uses one dict and one string'''
  if isinstance(v,str):
    m=SCHEME.match(v)
    if m is not None:
      n=URN.match(v)
      if n is not None:
        m=n
      s=m.group(1)
      # The following assumes paths are always either length 1 or length 2!!!
      #  by open-coding rather than using qq(p)
      if p is None:
        p=EMPTY
      else:
        assert p[0] is None
        p=p[1]
      k=SEP.join((r,p,k,s))
      res[k]=res.get(k,0)+1

def dump(res):
  for r in res.keys():
    rv=res[r]
    for p in rv.keys():
      pv=rv[p]
      for k,v in pv.items():
        for s,c in v.items():
          print(r,end=EMPTY)
          if p is None:
            print(EMPTY,end='\t')
          else:
            print('.',p,sep=EMPTY,end='\t')
          print(k,end='\t')
          print(s,c,sep='\t')

def dump_tuple(res):
  for (r,p,k,s),c in res.items():
    print(r,end=EMPTY)
    # The following assumes paths are always either length 1 or length 2!!!
    #  by open-coding rather than using qq(p)
    if p is None:
      print(EMPTY,end='\t')
    else:
      print(DOT,p,sep=EMPTY,end='\t')
    while isinstance(k,tuple):
      print(k[1],end='&')
      k=k[0]
    print(k,end='\t')
    print(s,c,sep='\t')
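
The `while isinstance(k,tuple)` loop in `dump_tuple` unwinds a left-nested composite key from the outside in. A small sketch of the same unwinding that returns a string instead of printing; `composite_label` is a hypothetical name used only here.

```python
def composite_label(k):
    """Flatten a left-nested key such as (('a', 'b'), 'c') into 'c&b&a'."""
    parts = []
    while isinstance(k, tuple):
        parts.append(k[1])  # most recently added key comes first
        k = k[0]
    parts.append(k)         # innermost, first-seen key comes last
    return '&'.join(parts)

label = composite_label((('href', 'url'), 'rel'))
```

Because the nesting grows on the left, the keys come out in reverse order of discovery: the example produces `rel&url&href`, and a plain string key is returned unchanged.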

def dump_concat(res):
  for ks,c in res.items():
    (r,p,k,s)=ks.split(SEP)
    print(r,end=EMPTY)
    # The following assumes paths are always either length 1 or length 2!!!
    #  by open-coding rather than using qq(p)
    if p==EMPTY:
      print(EMPTY,end='\t')
    else:
      print('.',p,sep=EMPTY,end='\t')
    print(k,end='\t')
    print(s,c,sep='\t')
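
`pp_concat` and `dump_concat` rely on joining and splitting on `SEP` being inverse operations. A minimal round-trip sketch, assuming none of the four fields ever contains a NUL character; `encode` and `decode` are hypothetical helper names, not part of the script.

```python
SEP = '\x00'
EMPTY = ''

def encode(r, p, k, s):
    """Pack region, path, key and scheme into one string key."""
    return SEP.join((r, EMPTY if p is None else p, k, s))

def decode(ks):
    """Unpack a key produced by encode; an empty path comes back as None."""
    r, p, k, s = ks.split(SEP)
    return (r, None if p == EMPTY else p, k, s)

packed = encode('hdr', None, 'Location', 'https')
unpacked = decode(packed)
```

The 4-way unpacking in `decode` raises `ValueError` if a field did contain the separator, which is why a character unlikely to occur in header or HTML data is a reasonable choice for `SEP`.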

if len(sys.argv)==2:
  res=dict()
  if sys.argv[1]=='1':
    print('using tuple',file=sys.stderr)
    pp=pp_tuple
    dump=dump_tuple
  else:
    print('using concat',file=sys.stderr)
    pp=pp_concat
    dump=dump_concat
else:
  print('using nested',file=sys.stderr)
  res=dict((r,dict()) for r in PATHS.keys())
62
892e1c0240e1
added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents:
61
diff
changeset
|
186 |
187 def main(): |
63
d46c8b12fc04
support multiple approaches to key combination, use local files to collect results
Henry S. Thompson <ht@markup.co.uk>
parents:
62
diff
changeset
|
188     global n # for debugging |
62
892e1c0240e1
added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents:
61
diff
changeset
|
189     n=0 |
190     for l in sys.stdin: |
191         if l[0]=='{' and '"WARC-Type":"response"' in l: |
192             j=json.loads(l) |
193             n+=1 |
194             for s in META_PATH: |
195                 j=j[s] |
196             for k,v in PATHS.items(): |
197                 p=j |
198                 try: |
199                     for s in v: |
200                         p=p[s] |
201                 except KeyError as e: |
202                     continue |
203                 walk(p,pp,k) |
204 |
205     print(n,file=sys.stderr) |
61
cfaf5223b071
trying to get my own mapper working
Henry S. Thompson <ht@markup.co.uk>
parents:
diff
changeset
|
206 |
63
d46c8b12fc04
support multiple approaches to key combination, use local files to collect results
Henry S. Thompson <ht@markup.co.uk>
parents:
62
diff
changeset
|
207     if dictRes: |
208         print('res=',end=EMPTY) |
209         from pprint import pprint |
210         pprint(res) |
211     else: |
212         dump(res) |
62
892e1c0240e1
added more robust (I hope) error handling,
Henry S. Thompson <ht@markup.co.uk>
parents:
61
diff
changeset
|
213 |
214 def qq(p): |
215     if p is None: |
216         sys.stdout.write('\t') |
217     else: |
218         qq1(p[0]) |
219         print(p[1],end='\t') |
220 |
221 def qq1(p): |
222     if p is None: |
223         return |
224     else: |
225         qq1(p[0]) |
226         print(p[1],end='.') |
227 |
228 if __name__=="__main__": |
229     main() |
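
A minimal sketch of the lookup pattern used in main() (source lines 196-203), where each entry of PATHS is a sequence of keys followed through the parsed JSON record, and a record that lacks any key along the way is skipped. The helper name `follow` and the sample record are hypothetical and not part of sac_schemes.py. Unlike the original loop, which catches only KeyError, this sketch also catches TypeError so that a non-dictionary intermediate value is treated as a missing path.

```python
import json


def follow(obj, path):
    """Index obj with each key in path in turn.

    Returns the value reached, or None if any step is missing.
    (Hypothetical helper, for illustration only.)
    """
    try:
        for step in path:
            obj = obj[step]
    except (KeyError, TypeError):
        # KeyError: a dictionary along the way lacks the key.
        # TypeError: an intermediate value is not subscriptable by this key.
        return None
    return obj


# Sample record with invented field names, standing in for one WAT line.
record = json.loads('{"outer": {"inner": {"leaf": "value"}}}')

print(follow(record, ["outer", "inner", "leaf"]))
print(follow(record, ["outer", "absent", "leaf"]))
```

Returning None conflates a missing path with a JSON null at the end of the path. That is acceptable when only present, non-null values are of interest, but a sentinel object would be needed to tell the two cases apart.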