annotate bin/lmh_warc.py @ 64:b14187ccfb46

revert to just showing first LM
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Wed, 19 Jul 2023 13:19:42 +0100
parents 11a886a84a49
children 120d90b47d74
changesets (author of each: Henry S. Thompson <ht@inf.ed.ac.uk>)
rev   changeset     parents  description
42    689a0e311cd2           make warc.py a library, separate out testing
55    11a886a84a49  42       finds multiples
64    b14187ccfb46  55       revert to just showing first LM

rev   line source
64       1 #!/usr/bin/env python3
64       2
64       3 import re,warc,sys
42       4 TUPAT=re.compile(b'^WARC-Target-URI: (.*?)\r',re.MULTILINE)
42       5 LMPAT=re.compile(b'^Last-Modified: (.*?)\r',re.MULTILINE)
42       6
55       7 OUT=open(sys.stdout.fileno(),'wb')
55       8
42       9 def showmeLMH(wtype,buf,part):
42      10   global URI
42      11   if part==1:
42      12     if (m:=TUPAT.search(buf)):
42      13       URI=m[1]
42      14     else:
42      15       raise ValueError(b"No target URI in %s ??"%buf)
42      16   else:
64      17     mm=LMPAT.search(buf)
42      18     OUT.write(URI)
55      19     if mm:
64      20       OUT.write(b'\t')
64      21       OUT.write(mm[1])
42      22     OUT.write(b'\n')
42      23
64      24 warc.warc(sys.argv[1],showmeLMH,[b'response'],parts=3)
42      25
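
The script above pulls one header value out of each of two byte buffers and writes them as a tab-separated line. As a rough illustration of what the two patterns match, here is a minimal, self-contained sketch. The sample header bytes, the example.org URI, and the helpers `first_header` and `tsv_line` are all invented for illustration. The repository's own `warc` module is not used, and nothing here is meant to describe how it delivers parts to the callback.

```python
import re

# Same patterns as in lmh_warc.py.
TUPAT = re.compile(b'^WARC-Target-URI: (.*?)\r', re.MULTILINE)
LMPAT = re.compile(b'^Last-Modified: (.*?)\r', re.MULTILINE)

# Invented sample data: a WARC header block and an HTTP header block,
# both with CRLF line endings.
WARC_HEADERS = (b'WARC/1.0\r\n'
                b'WARC-Type: response\r\n'
                b'WARC-Target-URI: http://example.org/page\r\n'
                b'Content-Length: 1234\r\n'
                b'\r\n')
HTTP_HEADERS = (b'HTTP/1.1 200 OK\r\n'
                b'Last-Modified: Tue, 04 Jul 2023 10:00:00 GMT\r\n'
                b'Last-Modified: Wed, 05 Jul 2023 11:00:00 GMT\r\n'
                b'Content-Type: text/html\r\n'
                b'\r\n')

def first_header(pat, buf):
    """Hypothetical helper: value of the first match of pat in buf, or None."""
    m = pat.search(buf)
    return m[1] if m else None

def tsv_line(warc_headers, http_headers):
    """Build a URI<TAB>Last-Modified line, leaving out the tab and the
    date when the HTTP headers carry no Last-Modified field."""
    uri = first_header(TUPAT, warc_headers)
    if uri is None:
        raise ValueError("No target URI in %r" % warc_headers)
    lm = first_header(LMPAT, http_headers)
    return uri + (b'\t' + lm if lm is not None else b'') + b'\n'

line = tsv_line(WARC_HEADERS, HTTP_HEADERS)
no_lm = tsv_line(WARC_HEADERS, b'HTTP/1.1 200 OK\r\n\r\n')
```

Because `search` returns only the first match, the second `Last-Modified` header in the sample is ignored, which is consistent with the "revert to just showing first LM" commit message. The non-greedy `(.*?)\r` stops at the carriage return, so the captured value carries no line terminator.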