annotate bin/cdx2tsv.py @ 143:ddff993994be

too clever by half, keys won't work in parallel for e.g. media types
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Wed, 20 Oct 2021 15:47:55 +0000
parents a76cc0df2754
children 66d17f7410f2
rev   line source
120
d0b544e53dda for use in processing CC index files
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1 #!/usr/bin/env python3
121
863ea87be6bb support field edit
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 120
diff changeset
2 '''Extract named fields, with optional post-processing, in order,
3 from a Common Crawl index row
4
5 Field specs on command line are either an atom or
6 a tuple of atom and expression with
7 free variable f
8 For example
9 > cdx2tsv.py mime '(url,f.split(":",maxsplit=1)[0])' < xyzzy.cdx
10 will output media type and URI scheme'''
120
d0b544e53dda for use in processing CC index files
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
11 import json,sys
12
135
a76cc0df2754 add usage/help info
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 121
diff changeset
13 if len(sys.argv)==1 or sys.argv[1].startswith('-'):
14   print("""Reads index lines from stdin and extracts values from json dict part
15
16 Usage: cdx2tsv.py fieldspecs...
17
18 fieldspec is either a name or a quoted python tuple of a name and
19 an expression with free variable f which will be evaluated with f
20 having the field value.
21
22 For example
23 cdx2tsv.py mime '(url,f.split(":",maxsplit=1)[0])' < xyzzy.cdx
24 will output media type and URI scheme""",
25         file=sys.stderr)
26   sys.exit(1)
27
120
d0b544e53dda for use in processing CC index files
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
28 fields=sys.argv[1:]
29
121
863ea87be6bb support field edit
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 120
diff changeset
30 fields=[((lambda x,y:(x,eval("lambda f:%s"%y)))(*(f[1:-1].split(',',maxsplit=1))) if f[0]=='(' else f) for f in fields]
120
d0b544e53dda for use in processing CC index files
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
31 for l in sys.stdin:
32   (key,stamp,jj)=l.rstrip().split(' ',maxsplit=2)
33   ja=json.loads(jj)
121
863ea87be6bb support field edit
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 120
diff changeset
34   print('\t'.join((ja.get(f,'NA') if isinstance(f,str) else
35                   ((f[1](ja[f[0]]) if f[0] in ja else 'NA'))) for f in (ja.keys() if fields==["*"] else fields)))
36
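
The field-spec handling in the listing above is compact, so here is a minimal sketch of the same idea, spelled out step by step. It is an illustration only, not the author's code: the helper names `parse_fieldspec` and `extract` and the sample index line are hypothetical, and the sample row is made up rather than taken from a real Common Crawl index.

```python
import json

def parse_fieldspec(spec):
    """Return a field name, or a (name, function) pair for a '(name,expr)' spec."""
    if spec.startswith('('):
        # Strip the parentheses, then split only at the first comma so that
        # commas inside the expression are preserved.
        name, expr = spec[1:-1].split(',', maxsplit=1)
        # eval runs arbitrary code: acceptable only for trusted command-line input.
        return (name, eval("lambda f: %s" % expr))
    return spec

def extract(line, specs):
    """Extract the requested fields from one index line as a list of strings."""
    # An index line is assumed to be: key, timestamp, JSON dict, space-separated.
    key, stamp, payload = line.rstrip().split(' ', maxsplit=2)
    row = json.loads(payload)
    out = []
    for spec in specs:
        if isinstance(spec, str):
            out.append(row.get(spec, 'NA'))
        else:
            name, fn = spec
            out.append(fn(row[name]) if name in row else 'NA')
    return out

# Hypothetical index line, invented for illustration.
sample = 'org,example)/ 20211020154755 {"url": "https://example.org/", "mime": "text/html"}'
specs = [parse_fieldspec(s)
         for s in ['mime', '(url,f.split(":",maxsplit=1)[0])', 'status']]
print('\t'.join(extract(sample, specs)))
```

With the made-up row, the three specs yield the media type, the URI scheme, and `NA` for the absent `status` field, separated by tabs.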