annotate bin/cdx2tsv.py @ 120:d0b544e53dda
for use in processing CC index files
author | Henry S. Thompson <ht@inf.ed.ac.uk> |
---|---|
date | Mon, 28 Jun 2021 14:01:41 +0000 |
parents | |
children | 863ea87be6bb |
rev | line source |
---|---|
1 #!/usr/bin/env python3 |
2 '''Extract named fields, in order, from a Common Crawl index row''' |
3 import json,sys |
4 |
5 fields=sys.argv[1:] |
6 |
7 for l in sys.stdin: |
8   (key,stamp,jj)=l.rstrip().split(' ',maxsplit=2) |
9   ja=json.loads(jj) |
10   print(ja.keys(),file=sys.stderr)  # debug aid; stderr keeps it out of the TSV |
11   print('\t'.join(ja.get(f,'NA') for f in (ja.keys() if fields==["*"] else fields)) |
12 ) |
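
The per-row logic of the script can be tried in isolation. The following is a minimal sketch, not part of the repository: the function name `cdx_row_to_tsv` and the sample index row are made up for illustration, and it assumes (as the script does) that every requested JSON value is a string.

```python
import json

def cdx_row_to_tsv(line, fields):
    """Return the requested JSON fields of one index line, tab-separated."""
    # An index line is: key, timestamp, JSON object, separated by single spaces.
    # maxsplit=2 keeps any spaces inside the JSON part intact.
    key, stamp, jj = line.rstrip().split(' ', maxsplit=2)
    ja = json.loads(jj)
    # "*" selects every field present in the row, in the row's own order
    names = ja.keys() if fields == ["*"] else fields
    # Fields missing from the row are reported as NA
    return '\t'.join(ja.get(f, 'NA') for f in names)

# Hypothetical index row, invented for this example
sample = ('org,example)/ 20210628140141 '
          '{"url": "https://example.org/", "mime": "text/html", "status": "200"}')

print(cdx_row_to_tsv(sample, ["status", "mime", "languages"]))
```

With the script itself, the equivalent call would be along the lines of `bin/cdx2tsv.py status mime languages < index_file`, where `index_file` stands for any uncompressed index file.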