lurid3/notes.txt @ 46:49672e9b4c1c -- "unpackz.py working"
author:   Henry S. Thompson <ht@inf.ed.ac.uk>
date:     Tue, 01 Oct 2024 16:00:22 +0100
parents:  737c61f98cbf
children: fbdaede4155a
See old_notes.txt for all older notes on Common Crawl data processing,
starting from Azure via Turing and then LURID and LURID2.

Installed /beegfs/common_crawl/CC-MAIN-2024-33/cdx

>: cd results/CC-MAIN-2024-33/cdx/
>: cut -f 2 counts.tsv | btot
2,793,986,828

State of play wrt data -- see status.xlsx

[In trying to tabulate the date ranges of the crawls, I found that the
WARC timestamp is sometimes bogus:

>: fgrep ' 2009' CC-MAIN-2018-34/cdx/cluster.idx
net,tyredeyes)/robots.txt 20090201191318 cdx-00230.gz 160573468 198277 920675
>: zgrep '^net,tyredeyes)/robots.txt' CC-MAIN-2018-34/cdx/warc/cdx-00230.gz
net,tyredeyes)/robots.txt 20090201191318 {"url": "http://tyredeyes.net/robots.txt", "mime": "text/html", "mime-detected": "text/html", "status": "301", "digest": "QH732FYSV7UM34JYWVYMB7EZGR2CYM6B", "length": "582", "offset": "1224614", "filename": "crawl-data/CC-MAIN-2018-34/segments/1534221215075.58/robotstxt/CC-MAIN-20180819090604-20180819110604-00558.warc.gz"}
net,tyredeyes)/robots.txt 20090201191319 {"url": "http://www.tyredeyes.net/robots.txt", "mime": "text/plain", "mime-detected": "text/plain", "status": "200", "digest": "PSX5IZU4B4SIXGNDKXCVFH75Q27VHUTJ", "length": "549", "offset": "2069841", "filename": "crawl-data/CC-MAIN-2018-34/segments/1534221215075.58/robotstxt/CC-MAIN-20180819090604-20180819110604-00485.warc.gz"}

This happens in 2019-35 as well :-(

>: fgrep ' 20181023' CC-MAIN-2019-35/cdx/cluster.idx
com,gyshbsh)/robots.txt 20181023022000 cdx-00078.gz 356340085 162332 315406
>: zgrep ' 20181023' CC-MAIN-2019-35/cdx/warc/cdx-00078.gz
com,gyshbsh)/robots.txt 20181023022000 {"url": "http://gyshbsh.com/robots.txt", "mime": "text/html", "mime-detected": "text/html", "status": "301", "digest": "3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ", "length": "529", "offset": "638892", "filename": "crawl-data/CC-MAIN-2019-35/segments/1566027315618.73/robotstxt/CC-MAIN-20190820200701-20190820222701-00120.warc.gz"}
...
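Such bogus stamps can be flagged mechanically. A minimal sketch (not part of the toolchain here; `bogus_entries` is a hypothetical helper, assuming cluster.idx lines start with "SURT timestamp" as in the fgrep output above):

```python
from datetime import datetime

def bogus_entries(cluster_idx_lines, crawl_year):
    """Yield (surt, timestamp) for entries whose 14-digit capture
    timestamp falls outside the nominal crawl year."""
    for line in cluster_idx_lines:
        surt, ts = line.split()[:2]
        if datetime.strptime(ts, '%Y%m%d%H%M%S').year != crawl_year:
            yield surt, ts

lines = ['net,tyredeyes)/robots.txt 20090201191318 cdx-00230.gz 160573468 198277 920675',
         'com,example)/ 20180819090604 cdx-00000.gz 0 1 2']
print(list(bogus_entries(lines, 2018)))
# → [('net,tyredeyes)/robots.txt', '20090201191318')]
```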
Tabulate all the date ranges for the WARC files we have:

>: for d in {2017-30,2019-35,2020-34,2021-25,2023-40,2023-50}; do printf "%s\t" $d; (ls CC-MAIN-$d/*.{?,??}/orig/warc | fgrep .gz | cut -f 3,4 -d - | sort -u | tee /dev/fd/3 | head -1 ) 3> >( tail -1 ) | tr '\n' '\t'; echo; done | cut -f 1,2,4 -d - | sed 's/-20/ 20/;s/.$//' | tr ' ' '\t' > dates.tsv
>: for d in {2018-30,2018-34}; do printf "%s\t" $d; (ls CC-MAIN-$d/{*.?,*.??} | fgrep warc.gz | cut -f 3,4 -d - | sort -u | tee /dev/fd/3 | { sleep 10 ; head -1 ; } ) 3> >( tail -1 ) | tr '\n' '\t'; echo; done >> dates.tsv
>: for d in 2019-18; do printf "%s\t" $d; (ls CC-MAIN-$d/{*.?,*.??} | fgrep warc.gz | cut -f 3,4 -d - | sort -u | head -1); done
2019-18 20190418101243-20190418122248
>: for d in 2019-18; do printf "%s\t" $d; (ls CC-MAIN-$d/{*.?,*.??} | fgrep warc.gz | cut -f 3,4 -d - | sort -u | tail -1); done
2019-18 20190426153423-20190426175423
>: echo 2019-18 20190418101243-20190418122248 20190426153423-20190426175423 >> dates.tsv
>: pwd
/beegfs/common_crawl/CC-MAIN-2016-30/cdx/warc
>: echo {000..299} | tr ' ' '\n' | parallel -j 10 'uz cdx-00{}.gz | cut -f 2 -d " " | sort -u > /tmp/hst/{}'
>: sort -mu /tmp/hst/??? > /tmp/hst/all
>: wc -l /tmp/hst/all
679686 /tmp/hst/all
>: head -1 /tmp/hst/all
20160723090435
>: tail -1 /tmp/hst/all
20160731110639
>: cd ../../..
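The per-file `sort -u` outputs are combined with `sort -mu` (merge already-sorted inputs, dropping duplicates). The same step in Python terms, as an illustrative sketch only (`merged_unique` is a hypothetical name):

```python
import heapq

def merged_unique(sorted_iterables):
    """Equivalent of `sort -mu`: merge pre-sorted inputs, emit each value once."""
    prev = None
    for item in heapq.merge(*sorted_iterables):
        if item != prev:   # inputs are sorted, so duplicates are adjacent
            yield item
            prev = item

print(list(merged_unique([['20160723', '20160725'], ['20160724', '20160725']])))
# → ['20160723', '20160724', '20160725']
```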
>: echo 2016-30 20160723090435 20160731110639 >> dates.tsv

Tweaked and sorted in xemacs:

2016-30 20160723090435 20160731110639
2017-30 20170720121902 20170729132938
2018-30 20180715183800 20180723184955
2018-34 20180814062251 20180822085454
2019-18 20190418101243 20190426175423
2019-35 20190817102624 20190826111356
2020-34 20200803083123 20200815214756
2021-25 20210612103920 20210625145905
2023-40 20230921073711 20231005042006
2023-50 20231128083443 20231212000408

Added to status.xlsx in shortened form, with number of days:
8 9 8 8 8 9 12 13 15 15

Fill a gap by downloading 2022-33:

>: for s in 0; do ~/bin/getcc_multi.aws CC-MAIN-2022-33 $s 5; done > /tmp/hst/get_22-33_0.log &
130 minutes...
>: for s in 1; do ~/bin/getcc_multi.aws CC-MAIN-2022-33 $s 10; done > /tmp/hst/get_22-33_1.log &
59 minutes

Another day to get to a quarter?

>: for s in {2..23}; do ~/bin/getcc_multi.aws CC-MAIN-2022-33 $s 10; done > /tmp/hst/get_22-33_2-23.log &

And finally 2015-35. Fetched in just 2 chunks, 0-9 and 10-99, e.g.

>: for s in {10..99}; do ~/bin/getcc_multi.aws CC-MAIN-2015-35 $s 10; done > /tmp/hst/get_15-35_10-99.log &

Much smaller.
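The day counts were derived from the first/last timestamps per crawl. A sketch of the arithmetic (the hand-tallied list may use a different convention, e.g. inclusive calendar days, so some entries can differ by one or two):

```python
from datetime import datetime

def span_days(first, last):
    """Whole days between two 14-digit WARC timestamps."""
    fmt = '%Y%m%d%H%M%S'
    return (datetime.strptime(last, fmt) - datetime.strptime(first, fmt)).days

print(span_days('20160723090435', '20160731110639'))  # → 8  (2016-30)
```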
Compare 2023-40, with 900 files per segment:

>: lss */orig/warc/*-0023?.* | cut -f 5 -d ' ' | stats
n = 1000
min = 1.14775e+09
max = 1.26702e+09
sum = 1.20192e+12
mean = 1.20192e+09
sd = 2.26049e+07

with 2015-35, with 353 files per segment:

>: lss */orig/warc/*-0023?-* | cut -f 5 -d ' ' | stats
n = 1000
min = 1.66471e+08
max = 9.6322e+08
sum = 9.19222e+11
mean = 9.19222e+08
sd = 8.20542e+07

The min files all come from segment 1440644060633.7, whose files are _all_ small:

>: uz *00123-*.gz | wc -l
12,759,931

Compare to 1440644060103.8:

>: zcat *00123-*.gz | wc -l
75,806,738

Mystery.

Also faster. Compare 2022-33:

>: fgrep -h BST /tmp/hst/get_22-33_{2-23,24-49,50-99}.log | cut -f 1-7 -d ' ' | while read s; do if read e; then echo $((($(date --date="$e" +%s) - $(date --date="$s" +%s)) / 60)); fi; done | stats
n   min  max  mean  sd
98  19   256  75.1  25.2

with 2015-35:

>: fgrep -h BST /tmp/hst/get_15-35_{0-9,10-99}.log | cut -f 1-7 -d ' ' | while read s; do if read e; then echo $((($(date --date="$e" +%s) - $(date --date="$s" +%s)) / 60)); fi; done | stats
n    min  max  mean  sd
100  15   40   32.6  2.9

>: echo {000..299} | tr ' ' '\n' | parallel -j 10 'uz cdx-00{}.gz | cut -f 2 -d " " | sort -u > /tmp/hst/2015_{}' &
>: sort --parallel=10 -mu /tmp/hst/2015_??? > /tmp/hst/2015_all
>: head -1 /tmp/hst/2015_all
20150827191534
>: tail -1 /tmp/hst/2015_all
20150905180914
>: wc -l /tmp/hst/2015_all
698128 /tmp/hst/2015_all

What about wet files -- do they include text from pdfs? What about truncated pdfs?

>: time for s in 0; do ~/bin/getcc_wet_multi.aws CC-MAIN-2019-35 $s 10; done > /tmp/hst/get_wet_19-35_0.log &
real 26m3.049s
user 0m1.225s
sys  0m1.310s

In the segment 0 cdx file (!)
we find 3747 probable truncations:

>: zgrep -a '"mime-detected": "application/pdf", ' cdx.gz > /tmp/hst/2019-35_seg0_pdf.idx
>: wc -l /tmp/hst/2019-35_seg0_pdf.idx
42345 /tmp/hst/2019-35_seg0_pdf.idx
>: egrep -a '"length": "10....."' /tmp/hst/2019-35_seg0_pdf.idx > /tmp/hst/2019-35_seg0_long_pdf.idx &
>: wc -l < /tmp/hst/2019-35_seg0_long_pdf.idx
3747

Of which 70 are in file 0:

>: egrep -a '.-00000\.' /tmp/hst/2019-35_seg0_pdf.idx > /tmp/hst/2019-35_seg0_file0_pdf.idx
>: wc -l /tmp/hst/2019-35_seg0_file0_pdf.idx
70 /tmp/hst/2019-35_seg0_file0_pdf.idx

In segment 0 file 0 we find 70 application/pdf Content-Type headers:

>: ix.py -h -w -x </tmp/hst/2019-35_seg0_file0_pdf.idx | egrep '^(WARC-Target-URI:|Content-Length:) ' | cut -f 2 -d ' ' | tr -d '\r' | while read l1; do read uri; read l2; printf '%s\t%s\t%s\n' $l1 $l2 "$uri"; done > ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv
>: wc -l < ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv
70
>: head -3 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv

Of which 14 are truncated:

>: fgrep -c 1048576 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv
14

E.g.

>: fgrep 1048576 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | head -3
1049051 1048576 https://en.heks.ch/sites/default/files/documents/2017-09/HEKS_EPER_Mission_Statement_2016_e.pdf
1049469 1048576 https://bmcmicrobiol.biomedcentral.com/track/pdf/10.1186/s12866-017-0951-4
1048824 1048576 https://citydocs.fcgov.com/?action=cao-cases&cmd=convert&docid=3332339

Are any of the pdfs in the corresponding wet file? Yes, 2:

>: cut -f 3 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | fgrep -af - <(uz 1566027313501.0/orig/wet/*-00000.warc.wet.gz)
WARC-Target-URI: http://bdds.deux-sevres.com/recherche/simple/Editeur/2/Belfond/vignette?format=pdf
WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D00

Is it in fact corresponding?
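The two greps (detected-PDF, then 7-digit length starting "10") can be combined in one pass over the cdx. A sketch (`long_pdfs` is a hypothetical helper, assuming cdx lines are "SURT timestamp {json}" as shown above; note the cdx "length" is the compressed record length, so this is the same heuristic as the egrep, not an exact truncation test):

```python
import json

def long_pdfs(cdx_lines):
    """Yield (url, length) for detected-PDF entries whose compressed
    record length matches the '"length": "10....."' pattern,
    i.e. 1,000,000-1,099,999 bytes."""
    for line in cdx_lines:
        meta = json.loads(line.split(' ', 2)[2])
        if (meta.get('mime-detected') == 'application/pdf'
                and 1_000_000 <= int(meta['length']) < 1_100_000):
            yield meta['url'], int(meta['length'])

sample = ['x)/a 20190817000000 {"url": "http://x/a.pdf", "mime-detected": "application/pdf", "length": "1049051"}',
          'x)/b 20190817000001 {"url": "http://x/b", "mime-detected": "text/html", "length": "1049051"}',
          'x)/c 20190817000002 {"url": "http://x/c.pdf", "mime-detected": "application/pdf", "length": "582"}']
print(list(long_pdfs(sample)))  # → [('http://x/a.pdf', 1049051)]
```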
>: diff -bw <(uz 1566027313501.0/orig/warc/*-00000.warc.gz | egrep -a '^WARC-Target-URI: ' | uniq | head -1000) <(uz 1566027313501.0/orig/wet/*-00000.warc.wet.gz | egrep -a '^WARC-Target-URI: ' | head -1000) | egrep -c '^<'
19

So, yes, mostly: ~2% are missing.

Just checking the search:

>: cut -f 3 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | fgrep -af - <(uz 1566027313501.0/orig/warc/*-00000.warc.gz) | wc -l
210

Correct.

So, what pdfs make it into the WET:

>: cut -f 3 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | fgrep -af - <(uz 1566027313501.0/orig/wet/*-00000.warc.wet.gz) > ~/results/CC-MAIN-2019-35/s0_file0_pdf.txt
>: wc -l < ~/results/CC-MAIN-2019-35/s0_file0_pdf.txt
2
>: cut -f 2 -d ' ' ~/results/CC-MAIN-2019-35/s0_file0_pdf.txt | tr -d '\r' | fgrep -f - ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv
11588 10913 http://bdds.deux-sevres.com/recherche/simple/Editeur/2/Belfond/vignette?format=pdf
1048979 1048576 https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005

Here's the short one:

WARC/1.0
WARC-Type: response
WARC-Date: 2019-08-17T22:40:17Z
WARC-Record-ID: <urn:uuid:ea98167b-c42a-4233-b57e-994aa627e38a>
Content-Length: 11588
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: <urn:uuid:f689f8d0-24f3-4824-9a38-4f3fee422a4e>
WARC-Concurrent-To: <urn:uuid:2d51c956-0012-4d78-affc-8f57fe9d2e15>
WARC-IP-Address: 92.175.114.24
WARC-Target-URI: http://bdds.deux-sevres.com/recherche/simple/Editeur/2/Belfond/vignette?format=pdf
WARC-Payload-Digest: sha1:7VVIUDQ4Q6XKNOAURYU4VTMRSZNPHDQA
WARC-Block-Digest: sha1:OSTWXLV772XNHS22T4UBSCSJAAXM2J6T
WARC-Identified-Payload-Type: application/pdf

HTTP/1.1 200 OK
Cache-Control: must-revalidate, post-check=0, pre-check=0,no-cache
Pragma: public,no-cache
Content-Type: application/pdf",text/html; charset=utf-8
X-Crawler-Content-Encoding: gzip
Expires: 0
Server:
X-Powered-By:
Set-Cookie: 166d74d734106ba68b20ea303011f622=301619e3fe31ecb98c8473f0ff5f35a2; path=/
Content-Disposition: attachment;
filename="Médiathèque départementale des Deux-Sèvres - Résultats de la recherche Belfond.pdf"
Content-Transfer-Encoding: binary
P3P: CP="NOI ADM DEV PSAi COM NAV OUR OTRo STP IND DEM"
X-Content-Encoded-By:
X-Powered-By:
Date: Sat, 17 Aug 2019 22:40:16 GMT
X-Crawler-Content-Length: 5448
Content-Length: 10913

%PDF-1.7
%<E2><E3><CF><D3>
7 0 obj
<< /Type /Page /Parent 1 0 R /LastModified (D:20190818004016+02'00') /Resources 2 0 R /MediaBox [0.000000 0.000000 595.276000 841.890000] /CropBox [0.000000 0.000000 595.276000 841.890000] /BleedBox [0.000000 0.000000 595.276000 841.890000] /TrimBox [0.000000 0.000000 595.276000 841.890000] /ArtBox [0.000000 0.000000 595.276000 841.890000] /Contents 8 0 R /Rotate 0 /Group << /Type /Group /S /Transparency /CS /DeviceRGB >> /PZ 1 >>
endobj
8 0 obj

>: uz 1566027313501.0/orig/warc/*-00000.warc.gz | tail -n +1823434 | tail -n +24 | head -c 20000 > ~/results/CC-MAIN-2019-35/mediatheque.pdf
>: ps2ascii mediatheque.pdf
Médiathèque départementale des Deux-Sèvres - Résultats de la recherche Belfond
Médiathèque départementale des Deux-Sèvres - Résultats de la recherche Belfond
A charge de revanche Titre : Auteur : Grippando, James (1958-....)
... etc., three pages, no errors

>: uz 1566027313501.0/orig/warc/*-00000.warc.gz | fgrep -an https://museum.wrap.gov.tw/GetFile4.ashx
38896837:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005
38896858:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005
38904590:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005
>: uz 1566027313501.0/orig/warc/*-00000.warc.gz | tail -n +38896858 | egrep -an '^%%EOF'
27:%%EOF
1114658:%%EOF
1313299:%%EOF

Hunh?
>: uz 1566027313501.0/orig/warc/*-00000.warc.gz | tail -n +38896858 | egrep -an '^(%%EOF|WARC)' | head -30
1:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005
2:WARC-Payload-Digest: sha1:SZ53DQQHENC7DDN7GQ5IS7VMEPAXAMBE
3:WARC-Block-Digest: sha1:QTKJA6A7445Z7264K2YAFBUUM2OYH2T2
4:WARC-Truncated: length
5:WARC-Identified-Payload-Type: application/pdf
27:%%EOF
7725:WARC/1.0
7726:WARC-Type: metadata
7727:WARC-Date: 2019-08-17T22:59:14Z
7728:WARC-Record-ID: <urn:uuid:77df2747-e567-45d3-8646-3069ae9a9f25>
7731:WARC-Warcinfo-ID: <urn:uuid:f689f8d0-24f3-4824-9a38-4f3fee422a4e>
7732:WARC-Concurrent-To: <urn:uuid:eceb4adc-d81e-4497-82fe-eea61ce171f4>
7733:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005
7739:WARC/1.0

OK, so indeed truncated after 7700 lines or so...

>: uz 1566027313501.0/orig/warc/*-00000.warc.gz | tail -n +38896858 | tail -n +21 | head -c 1048576 > ~/results/CC-MAIN-2019-35/museum.pdf
>: ps2ascii ~/results/CC-MAIN-2019-35/museum.pdf
**** Error: An error occurred while reading an XREF table.
**** The file has been damaged.

Look in big_pdf?

====Modify the original CC indexer to write new indices including lastmod=====

Looks like WarcRecordWriter.write, in src/nutch-cc/src/java/org/commoncrawl/util/WarcRecordWriter, is what needs to be edited to include the LastModified date.

To rebuild nutch-cc, particularly to recompile jar files after editing anything:

>: cd $HHOME/src/nutch-cc
>: ant

Fixed deprecation bug in WarcCdxWriter.java

Modified src/java/org/commoncrawl/util/WarcCdxWriter.java to include lastmod

Can run just one test, which should allow testing this:

>: ant test-core -Dtestcase='TestWarcRecordWriter'

The logic is tricky, and there's no easy way in. Basically, tools/WarcExport.java launches a hadoop job based on a hadoop-runnable WarcExport instance.
Hadoop will in due course call ExportReducer.reduce, which will create an instance of WarcCapture "for each page capture", and call ExportMapper.context.write with that instance (via some configuration magic with the hadoop job Context). That in turn uses (more magic) WarcOutputFormat.getRecordWriter, which (finally!) calls write(the capture) on a previously created WarcRecordWriter instance.

So to fake a test case, I need to build
 1) a WarcRecordWriter instance
 2) a WarcCapture instance
and then invoke (1).write((2)).

Got that working, although I still can't figure out where in the normal flow the metadata entry for Response.CONTENT_TYPE gets set.

Now, add a test that takes a stream of WARC Response extracts and rewrites their index entries:

>: head -8804 <(uz /beegfs/common_crawl/CC-MAIN-2019-35/cdx/warc/cdx-00150.gz) | tail -10 | ix.py -h -w -x > /tmp/hst/headers.txt
>: cp /tmp/hst/headers.txt src/test/org/commoncrawl/util/
>: shuf /tmp/hst/headers.txt > src/test/org/commoncrawl/util/headers_mixed.txt

Won't quite work :-( How do we reconstruct the Warc filename, offset and length from the original index?

Well, we can find the .warc.gz record boundaries! Thanks to https://stackoverflow.com/a/37042747/2595465

>: ~/lib/python/unpackz.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz > /tmp/hst/recs.txt

Nearly working: got 1/3rd of the way through a single WARC and then failed:

>: n=0 && ~/lib/python/unpackz.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz 2>/tmp/hst/tt.txt | while read o l; do echo $((n+=1)); echo $o $l >> /tmp/hst/r3a; ix.py $l $o CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz | wc -l; done
...
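unpackz.py itself isn't reproduced here, but the core trick from that StackOverflow answer can be sketched with zlib's decompressobj, whose unused_data marks where one gzip member ends and the next begins. This whole-buffer version is O(n²) and for illustration only; unpackz.py streams through a fixed-size buffer instead, which is where the boundary bug described next crept in:

```python
import gzip, zlib

def member_spans(data):
    """Yield (offset, length) of each gzip member in a concatenated
    stream such as a .warc.gz (Common Crawl: one member per record)."""
    offset = 0
    while offset < len(data):
        d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)  # expect a gzip header
        d.decompress(data[offset:])
        # unused_data is whatever followed this member's end-of-stream marker
        length = len(data) - offset - len(d.unused_data)
        yield offset, length
        offset += length

two = gzip.compress(b'record one') + gzip.compress(b'record two')
print(list(member_spans(two)))  # two (offset, length) pairs covering the stream
```

Each (offset, length) pair can then be handed to ix.py-style extraction, exactly the triples the pipelines below consume.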
20
10215 CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz
Process fail: Compressed file ended before the end-of-stream marker was reached, input: length=10762, offset=60784640, file=/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz

>: head -10217 /tmp/hst/r3a | tail -4
60784173 467
60784640 10762
60795402 463
60795865 460
>: ix.py 467 60784173 CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz | fgrep Target
WARC-Target-URI: http://drycarerestoration.co/corner-furniture-piece/unique-corner-decoration-pieces-or-corner-furniture-pieces-corner-corner-furniture-piece-corner-furniture-pieces-bedroom/
>: zcat /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/cdx/warc/cdx.gz
...
co,drycarerestoration)/corner-furniture-piece/unique-corner-decoration-pieces-or-corner-furniture-pieces-corner-corner-furniture-piece-corner-furniture-pieces-bedroom 20190819020224 {"url": "http://drycarerestoration.co/corner-furniture-piece/unique-corner-decoration-pieces-or-corner-furniture-pieces-corner-corner-furniture-piece-corner-furniture-pieces-bedroom/", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "DTKGJL45XQDXUS7PTXPYR6POMPLG46RZ", "length": "2570", "offset": "60784640", "filename": "crawl-data/CC-MAIN-2019-35/segments/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz", "charset": "UTF-8", "languages": "eng"}
>: ix.py 2570 60784640 CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz | less
>: echo $((10762 - 2570))
8192

Ah, the error I was dreading :-( I _think_ this happens when an individual record ends exactly on an 8K boundary.
Yes:

>: echo $((60784640 % 8192))
0

Even with buffer 1MB:

21
160245 CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz
Process fail: Compressed file ended before the end-of-stream marker was reached, input: length=8415, offset=1059033915, file=/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz
0
160246

>: tail -60 /tmp/hst/r3b | head -20
1059013061 423
1059013484 7218
1059020702 425
1059021127 424
1059021551 11471
1059033022 426
1059033448 467
1059033915 8415

Argh. This is at the _same_ point (before 51 fails before EOF). Ah, maybe that's the point -- this is the last read before EOF, and it's not a full buffer!

>: ix.py 467 1059033448 CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz | less
...
WARC-Target-URI: https://zowiecarrpsychicmedium.com/tag/oracle/

Reran with more instrumentation, took at least all day:

>: n=0 && ~/lib/python/unpackz.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz 2> /tmp/hst/r3e_err.txt | while read o l; do echo $((n+=1)); echo $o $l >> /tmp/hst/r3e_val; ix.py $l $o CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz | wc -l; done > /tmp/hst/r3e_log 2>&1
>: wc -l /tmp/hst/r3e_err.txt
160296 /tmp/hst/r3e_err.txt
>: tail -60 /tmp/hst/r3e_err.txt | cat -n | grep -C2 True\ True
     7  b 28738 28738 28312 426 False False
     8  b 28312 28312 27845 467 False False
     9  b 27845 378162 369747 8415 True True   < this is the first hit in the last (partial) block
    10  b 369747 369747 369312 435 False True
    11  b 369312 369312 368878 434 False True
>: tail -55 /tmp/hst/r3e_val | head -3
1059033022 426
1059033448 467
1059033915 8415
>: dd ibs=1 skip=1059033022 count=426 if=/beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz of=/dev/stdout | uz -t
...
426 bytes copied, 0.00468243 s, 91.0 kB/s
>: dd ibs=1 skip=1059033448 count=467 if=/beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz of=/dev/stdout | uz -t
...
467 bytes copied, 0.00382692 s, 122 kB/s
>: dd ibs=1 skip=1059033915 count=8415 if=/beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz of=/dev/stdout | uz -t
igzip: Error (null) does not contain a complete gzip file
...
8415 bytes (8.4 kB, 8.2 KiB) copied, 0.00968889 s, 869 kB/s

So, tried one change, to use the actual size rather than BUFSIZE at one point; seems to work now:

>: time ~/lib/python/unpackz.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz 2> /tmp/hst/r3f_err.txt | tee /tmp/hst/r3f_val | while read l o; do printf '%s\t%s\t%s\n' $l $o 'CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz'; done 2>&1 | tee /tmp/hst/r3f_log | ix.py -w | egrep -c '^WARC/1\.0'
160296

real 3m48.393s
user 0m47.997s
sys  0m26.641s

>: tail /tmp/hst/r3f_val
10851 1059370472
475 1059381323
444 1059381798
22437 1059382242
447 1059404679
506 1059405126
15183 1059405632
471 1059420815
457 1059421286
17754 1059421743
>: wc -l /tmp/hst/*_val
    171 /tmp/hst/r3d_val
 160297 /tmp/hst/r3e_val
 160296 /tmp/hst/r3f_val
 320764 total
>: uz /tmp/hst/head.warc.gz | egrep -c '^WARC/1\.0.$'
171
>: tail -n 3 /tmp/hst/*_val
==> /tmp/hst/r3d_val <==
454 1351795
414 1352249
0 1352663
[so the 171 above is bogus, and we're missing one]
==> /tmp/hst/r3e_val <==
1059393441 457
1059393898 17754
0
[likewise bogus, so see below]
==> /tmp/hst/r3f_val <==
471 1059420815
457 1059421286
17754 1059421743
[better, but still one missing]
>: uz /tmp/hst/head.warc.gz | egrep '^WARC-Type: ' | tee >(wc -l 1>&2) | tail -4
WARC-Type: response
WARC-Type: metadata
WARC-Type: request
WARC-Type: response
[missing]
171
>: ls -lt /tmp/hst/*_val
-rw-r--r-- 1 hst dc007    1977 Sep 29 09:27 /tmp/hst/r3d_val
-rw-r--r-- 1 hst dc007 2319237 Sep 28 14:28 /tmp/hst/r3f_val
-rw-r--r-- 1 hst dc007 2319238 Sep 27 19:41 /tmp/hst/r3e_val
>: ls -l ~/lib/python/unpackz.py
-rwxr-xr-x 1 hst dc007 1821 Sep 28 15:13 .../dc007/hst/lib/python/unpackz.py

So e and f are stale, rerun:

>: time ~/lib/python/unpackz.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz 2>/tmp/hst/r3f_err.txt | tee /tmp/hst/r3f_val | while read l o; do printf '%s\t%s\t%s\n' $l $o 'CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz'; done |& tee /tmp/hst/r3f_log | ix.py -w | egrep '^WARC-Type: ' | tail -4 &
>: Reading length, offset, filename tab-delimited triples from stdin...
WARC-Type: response
WARC-Type: metadata
WARC-Type: request
WARC-Type: response

real 3m49.760s
user 0m47.180s
sys  0m32.218s

So missing the final metadata...

Back to head.warc.gz, with debug info:

>: n=0 && ~/lib/python/unpackz.py /tmp/hst/head.warc.gz 2>/tmp/hst/ttd.txt | while read l o; do echo $((n+=1)); echo $l $o >> /tmp/hst/r3d_val; dd ibs=1 skip=$o count=$l if=/tmp/hst/head.warc.gz of=/dev/stdout 2>/tmp/hst/r3d_ido | uz -t; done >/tmp/hst/r3d_log 2>&1
>: tail -2 /tmp/hst/r3d_log
171
igzip: Error invalid gzip header found for file (null)
>: tail -n 3 /tmp/hst/ttd.txt /tmp/hst/r3d_val
==> /tmp/hst/ttd.txt <==
b 9697 9697 9243 454 False True
b 9243 9243 8829 414 False True
n 8829
==> /tmp/hst/r3d_val <==
454 1351795
414 1352249
0 1352663
>: cat -n /tmp/hst/r3f_val | head -172 | tail -4
   169  454 1351795
   170  414 1352249
   171  8829 1352663
   172  446 1361492

Fixed, maybe:

>: tail -n 3 /tmp/hst/r3d_log /tmp/hst/r3d_val
==> /tmp/hst/r3d_log <==
169
170
171
==> /tmp/hst/r3d_val <==
454 1351795
414 1352249
8829 1352663

Yes!
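The dd | uz -t spot-checks amount to: seek to the offset, read length bytes, and gunzip them as one complete member. A minimal Python equivalent (`read_record` is a hypothetical helper standing in for what ix.py does with its length/offset/filename arguments):

```python
import gzip

def read_record(path, offset, length):
    """Extract and decompress one WARC record, given the (offset, length)
    of its gzip member within the .warc.gz file."""
    with open(path, 'rb') as f:
        f.seek(offset)
        return gzip.decompress(f.read(length))
```

gzip.decompress raises if the member is incomplete, which is the same failure mode as igzip's "does not contain a complete gzip file" on the bad 8415-byte span.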
>: time ~/lib/python/unpackz.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz 2>/tmp/hst/r3f_err | tee /tmp/hst/r3f_val | while read l o; do printf '%s\t%s\t%s\n' $l $o 'CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz'; done |& tee /tmp/hst/r3f_log | ix.py -w | egrep '^WARC-Type: ' | tail -4
Reading length, offset, filename tab-delimited triples from stdin...
WARC-Type: metadata
WARC-Type: request
WARC-Type: response
WARC-Type: metadata

real 3m26.042s
user 0m44.167s
sys  0m24.716s

>: tail -n 3 /tmp/hst/r3f*
==> /tmp/hst/r3f_err <==

==> /tmp/hst/r3f_val <==
457 1059421286
17754 1059421743
425 1059439497