annotate bin/lmh_warc.py @ 61:f182d09ad1cd

whole working
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Fri, 14 Jul 2023 12:08:09 +0100
parents 11a886a84a49
children b14187ccfb46
changesets
 55  11a886a84a49  finds multiples  (Henry S. Thompson <ht@inf.ed.ac.uk>, parents: 42)
 42  689a0e311cd2  make warc.py a library, separate out testing  (Henry S. Thompson <ht@inf.ed.ac.uk>)

rev  line  source
 55     1  import re,swarc,sys
 42     2  TUPAT=re.compile(b'^WARC-Target-URI: (.*?)\r',re.MULTILINE)
 42     3  LMPAT=re.compile(b'^Last-Modified: (.*?)\r',re.MULTILINE)
 42     4
 55     5  OUT=open(sys.stdout.fileno(),'wb')
 55     6
 42     7  def showmeLMH(wtype,buf,part):
 42     8    global URI
 42     9    if part==1:
 42    10      if (m:=TUPAT.search(buf)):
 42    11        URI=m[1]
 42    12      else:
 42    13        raise ValueError(b"No target URI in %s ??"%buf)
 42    14    else:
 55    15      mm=LMPAT.findall(buf)
 42    16      OUT.write(URI)
 55    17      if mm:
 55    18        for m in mm:
 55    19          OUT.write(b'\t')
 55    20          OUT.write(m)
 42    21      OUT.write(b'\n')
 42    22
 55    23  swarc.warc(sys.argv[1],showmeLMH,[b'response'],parts=3)
 42    24
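
The two patterns in the listing can be tried out without swarc. The sketch below is not part of lmh_warc.py: the sample header bytes and the helper name format_line are invented for illustration, and it assumes, as the branch structure of showmeLMH suggests, that the callback sees the WARC header first (part==1) and the HTTP header afterwards. It builds the same tab-separated line that the script writes to OUT: the target URI, then every Last-Modified value found, then a newline.

```python
import re

# Same patterns as in the listing
TUPAT = re.compile(b'^WARC-Target-URI: (.*?)\r', re.MULTILINE)
LMPAT = re.compile(b'^Last-Modified: (.*?)\r', re.MULTILINE)

# Hypothetical sample data, invented for this example
warc_header = (b'WARC/1.0\r\n'
               b'WARC-Type: response\r\n'
               b'WARC-Target-URI: http://example.com/\r\n'
               b'\r\n')
http_header = (b'HTTP/1.1 200 OK\r\n'
               b'Last-Modified: Fri, 14 Jul 2023 11:08:09 GMT\r\n'
               b'Content-Type: text/html\r\n'
               b'\r\n')

def format_line(warc_header, http_header):
    """Return URI, then one tab-prefixed field per Last-Modified header, then newline."""
    m = TUPAT.search(warc_header)
    if m is None:
        raise ValueError('No target URI in WARC header')
    # findall returns the captured group of every match, so repeated
    # Last-Modified headers each contribute one field
    fields = [m[1]] + LMPAT.findall(http_header)
    return b'\t'.join(fields) + b'\n'

line = format_line(warc_header, http_header)
print(line)
```

When the HTTP header carries no Last-Modified line, findall returns an empty list and the result is just the URI followed by a newline, which matches the behaviour of the "if mm:" guard in the listing.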