view bin/cdx2tsv.py @ 143:ddff993994be

too clever by half, keys won't work in parallel for e.g. media types
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Wed, 20 Oct 2021 15:47:55 +0000
parents a76cc0df2754
children 66d17f7410f2

#!/usr/bin/env python3
'''Extract named fields, with optional post-processing, in order,
   from a Common Crawl index row

   Field specs on the command line are either an atom or a tuple of
   an atom and an expression with free variable f

   For example
    > cdx2tsv.py mime '(url,f.split(":",maxsplit=1)[0])' < xyzzy.cdx
   will output media type and URI scheme'''
import json,sys

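# Show usage if there are no arguments or the first one looks like an option.
# (Indexing sys.argv[1][1] would raise IndexError on a one-character spec such as '*'.)
if len(sys.argv)==1 or sys.argv[1].startswith('-'):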
  print("""Reads index lines from stdin and extracts values from json dict part

  Usage: cdx2tsv.py fieldspecs...

  fieldspec is either a name or a quoted python tuple of a name and
    an expression with free variable f which will be evaluated with f
    having the field value.

  For example
    cdx2tsv.py mime '(url,f.split(":",maxsplit=1)[0])' < xyzzy.cdx
  will output media type and URI scheme""",
        file=sys.stderr)
  sys.exit(1)

fields=sys.argv[1:]

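# Turn each '(name,expr)' spec into a (name, function-of-f) pair;
#  bare names are kept as plain strings.
# NB: expr is passed to eval, so only run this with trusted arguments.
fields=[((lambda x,y:(x,eval("lambda f:%s"%y)))(*(f[1:-1].split(',',maxsplit=1))) if f[0]=='(' else f) for f in fields]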
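for l in sys.stdin:
  # Each index line is: SURT key, timestamp, JSON dict
  (key,stamp,jj)=l.rstrip().split(' ',maxsplit=2)
  ja=json.loads(jj)
  # A lone '*' spec dumps every value in the row's own key order,
  #  so columns are not guaranteed to line up from row to row.
  # str() guards against non-string values, which join would reject.
  print('\t'.join((str(ja.get(f,'NA')) if isinstance(f,str) else
                   (str(f[1](ja[f[0]])) if f[0] in ja else 'NA')) for f in (ja.keys() if fields==["*"] else fields)))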