annotate pdfCrawl.py @ 5:bd1db1ed4c25

found on ecclerig
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Mon, 09 Mar 2020 17:39:38 +0000
parents 2d7c91f89f6b
children
rev   line source
0
fee51ab07d09 blanket publication of all existing python files in lib/python on maritain
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1 import PyPDF2 as pyPdf, sys
2
3 f = open(sys.argv[1],'rb')
4
5 pdf = pyPdf.PdfFileReader(f)
6 pgs = pdf.getNumPages()
7 key = '/Annots'
8 uri = '/URI'
9 ank = '/A'
10
11 #print pdf.getNamedDestinations()
12
13 for pg in range(pgs):
14     print '#',pg
15     p = pdf.getPage(pg)
16     o = p.getObject()
4
2d7c91f89f6b later ecclerig version
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 1
diff changeset
17     print o
0
18     if o.has_key(key):
19         ann = o[key]
4
20         print key,len(ann),ann
21         i=0
0
22         for a in ann:
23             u = a.getObject()
4
24             if u[ank].has_key(uri):
25                 try:
26                     print i,u[ank][uri]
27                 except UnicodeEncodeError:
28                     print i,map(ord,u[ank][uri])
29             i+=1