lurid3/notes.txt @ 52:8dffb8aa33da
prelim consistency check with published lmh-augmented cdx
author    Henry S. Thompson <ht@inf.ed.ac.uk>
date      Thu, 10 Oct 2024 17:44:58 +0100
parents   dc24bb6e524f
children  d533894173d0
See old_notes.txt for all older notes on Common Crawl data processing, starting from Azure via Turing and then LURID and LURID2.

Installed /beegfs/common_crawl/CC-MAIN-2024-33/cdx

>: cd results/CC-MAIN-2024-33/cdx/
>: cut -f 2 counts.tsv | btot
2,793,986,828

State of play wrt data -- see status.xlsx

[In trying to tabulate the date ranges of the crawls, I found that the WARC timestamp is sometimes bogus:

>: fgrep ' 2009' CC-MAIN-2018-34/cdx/cluster.idx
net,tyredeyes)/robots.txt 20090201191318 cdx-00230.gz 160573468 198277 920675
>: zgrep '^net,tyredeyes)/robots.txt' CC-MAIN-2018-34/cdx/warc/cdx-00230.gz
net,tyredeyes)/robots.txt 20090201191318 {"url": "http://tyredeyes.net/robots.txt", "mime": "text/html", "mime-detected": "text/html", "status": "301", "digest": "QH732FYSV7UM34JYWVYMB7EZGR2CYM6B", "length": "582", "offset": "1224614", "filename": "crawl-data/CC-MAIN-2018-34/segments/1534221215075.58/robotstxt/CC-MAIN-20180819090604-20180819110604-00558.warc.gz"}
net,tyredeyes)/robots.txt 20090201191319 {"url": "http://www.tyredeyes.net/robots.txt", "mime": "text/plain", "mime-detected": "text/plain", "status": "200", "digest": "PSX5IZU4B4SIXGNDKXCVFH75Q27VHUTJ", "length": "549", "offset": "2069841", "filename": "crawl-data/CC-MAIN-2018-34/segments/1534221215075.58/robotstxt/CC-MAIN-20180819090604-20180819110604-00485.warc.gz"}

This happens in 2019-35 as well :-(

>: fgrep ' 20181023' CC-MAIN-2019-35/cdx/cluster.idx
com,gyshbsh)/robots.txt 20181023022000 cdx-00078.gz 356340085 162332 315406
>: zgrep ' 20181023' CC-MAIN-2019-35/cdx/warc/cdx-00078.gz
com,gyshbsh)/robots.txt 20181023022000 {"url": "http://gyshbsh.com/robots.txt", "mime": "text/html", "mime-detected": "text/html", "status": "301", "digest": "3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ", "length": "529", "offset": "638892", "filename": "crawl-data/CC-MAIN-2019-35/segments/1566027315618.73/robotstxt/CC-MAIN-20190820200701-20190820222701-00120.warc.gz"}
...

Tabulate all the date ranges for the WARC files we have:

>: for d in {2017-30,2019-35,2020-34,2021-25,2023-40,2023-50}; do printf "%s\t" $d; (ls CC-MAIN-$d/*.{?,??}/orig/warc | fgrep .gz | cut -f 3,4 -d - | sort -u |tee /dev/fd/3 | head -1 ) 3> >( tail -1 ) | tr '\n' '\t'; echo; done | cut -f 1,2,4 -d - | sed 's/-20/ 20/;s/.$//' | tr ' ' '\t' > dates.tsv
>: for d in {2018-30,2018-34}; do printf "%s\t" $d; (ls CC-MAIN-$d/{*.?,*.??} | fgrep warc.gz | cut -f 3,4 -d - | sort -u |tee /dev/fd/3 | { sleep 10 ; head -1 ; } ) 3> >( tail -1 ) | tr '\n' '\t'; echo; done >> dates.tsv
>: for d in 2019-18; do printf "%s\t" $d; (ls CC-MAIN-$d/{*.?,*.??} | fgrep warc.gz | cut -f 3,4 -d - | sort -u | head -1); done
2019-18 20190418101243-20190418122248
>: for d in 2019-18; do printf "%s\t" $d; (ls CC-MAIN-$d/{*.?,*.??} | fgrep warc.gz | cut -f 3,4 -d - | sort -u | tail -1); done
2019-18 20190426153423-20190426175423
>: echo 2019-18 20190418101243-20190418122248 20190426153423-20190426175423 >> dates.tsv
>: pwd
/beegfs/common_crawl/CC-MAIN-2016-30/cdx/warc
>: echo {000..299} | tr ' ' '\n' | parallel -j 10 'uz cdx-00{}.gz | cut -f 2 -d " " | sort -u > /tmp/hst/{}'
>: sort -mu /tmp/hst/??? > /tmp/hst/all
>: wc -l /tmp/hst/all
679686 /tmp/hst/all
>: head -1 /tmp/hst/all
20160723090435
>: tail -1 /tmp/hst/all
20160731110639
>: cd ../../..
>: echo 2016-30 20160723090435 20160731110639 >> dates.tsv

tweaked and sorted in xemacs:

2016-30 20160723090435 20160731110639
2017-30 20170720121902 20170729132938
2018-30 20180715183800 20180723184955
2018-34 20180814062251 20180822085454
2019-18 20190418101243 20190426175423
2019-35 20190817102624 20190826111356
2020-34 20200803083123 20200815214756
2021-25 20210612103920 20210625145905
2023-40 20230921073711 20231005042006
2023-50 20231128083443 20231212000408

Added to status.xlsx in shortened form, with number of days: 8 9 8 8 8 9 12 13 15 15

Fill a gap by downloading 2022-33:

>: for s in 0; do ~/bin/getcc_multi.aws CC-MAIN-2022-33 $s 5; done > /tmp/hst/get_22-33_0.log &
130 minutes...
>: for s in 1; do ~/bin/getcc_multi.aws CC-MAIN-2022-33 $s 10; done > /tmp/hst/get_22-33_1.log &
59 minutes

Another day to get to a quarter?

>: for s in {2..23}; do ~/bin/getcc_multi.aws CC-MAIN-2022-33 $s 10; done > /tmp/hst/get_22-33_2-23.log &

And finally 2015-35.  Fetched in just 2 chunks, 0-9 and 10-99, e.g.

>: for s in {10..99}; do ~/bin/getcc_multi.aws CC-MAIN-2015-35 $s 10; done > /tmp/hst/get_15-35_10-99.log &

Much smaller.  Compare 2023-40, with 900 files per segment:

>: lss */orig/warc/*-0023?.* | cut -f 5 -d ' ' | stats
n = 1000
min = 1.14775e+09
max = 1.26702e+09
sum = 1.20192e+12
mean = 1.20192e+09
sd = 2.26049e+07

with 2015-35, with 353 files per segment:

>: lss */orig/warc/*-0023?-* | cut -f 5 -d ' ' | stats
n = 1000
min = 1.66471e+08
max = 9.6322e+08
sum = 9.19222e+11
mean = 9.19222e+08
sd = 8.20542e+07

The min files all come from segment 1440644060633.7, whose files are _all_ small:

>: uz *00123-*.gz | wc -l
12,759,931

Compare to 1440644060103.8:

>: zcat *00123-*.gz | wc -l
75,806,738

Mystery.  Also faster.  Compare 2022-33:

>: fgrep -h BST /tmp/hst/get_22-33_{2-23,24-49,50-99}.log | cut -f 1-7 -d ' ' | while read s; do if read e; then echo $((($(date --date="$e" +%s) - $(date --date="$s" +%s)) / 60)); fi; done | stats
n	min	max	mean	sd
98	19	256	75.1	25.2

with 2015-35:

>: fgrep -h BST /tmp/hst/get_15-35_{0-9,10-99}.log | cut -f 1-7 -d ' ' | while read s; do if read e; then echo $((($(date --date="$e" +%s) - $(date --date="$s" +%s)) / 60)); fi; done | stats
n	min	max	mean	sd
100	15	40	32.6	2.9

>: echo {000..299} | tr ' ' '\n' | parallel -j 10 'uz cdx-00{}.gz | cut -f 2 -d " " | sort -u > /tmp/hst/2015_{}' &
>: sort --parallel=10 -mu /tmp/hst/2015_??? > /tmp/hst/2015_all
>: head -1 /tmp/hst/2015_all
20150827191534
>: tail -1 /tmp/hst/2015_all
20150905180914
>: wc -l /tmp/hst/2015_all
698128 /tmp/hst/2015_all

What about wet files -- do they include text from pdfs?  What about truncated pdfs?

>: time for s in 0; do ~/bin/getcc_wet_multi.aws CC-MAIN-2019-35 $s 10; done > /tmp/hst/get_wet_19-35_0.log &

real	26m3.049s
user	0m1.225s
sys	0m1.310s

In the segment 0 cdx file (!) we find 3747 probable truncations:

>: zgrep -a '"mime-detected": "application/pdf", ' cdx.gz > /tmp/hst/2019-35_seg0_pdf.idx
>: wc -l /tmp/hst/2019-35_seg0_pdf.idx
42345 /tmp/hst/2019-35_seg0_pdf.idx
>: egrep -a '"length": "10....."' /tmp/hst/2019-35_seg0_pdf.idx > /tmp/hst/2019-35_seg0_long_pdf.idx &
>: wc -l < /tmp/hst/2019-35_seg0_long_pdf.idx
3747

Of which 70 are in file 0:
>: egrep -a '.-00000\.' /tmp/hst/2019-35_seg0_pdf.idx > /tmp/hst/2019-35_seg0_file0_pdf.idx
>: wc -l /tmp/hst/2019-35_seg0_file0_pdf.idx
70 /tmp/hst/2019-35_seg0_file0_pdf.idx

In segment 0 file 0 we find 70 application/pdf Content-Type headers:

>: ix.py -h -w -x </tmp/hst/2019-35_seg0_file0_pdf.idx |egrep '^(WARC-Target-URI:|Content-Length:) '|cut -f 2 -d ' ' |tr -d '\r'|while read l1; do read uri; read l2; printf '%s\t%s\t%s\n' $l1 $l2 "$uri"; done > ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv
>: wc -l < ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv
70
>: head -3 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv

Of which 14 are truncated:

>: fgrep -c 1048576 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv
14

E.g.

>: fgrep 1048576 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | head -3
1049051	1048576	https://en.heks.ch/sites/default/files/documents/2017-09/HEKS_EPER_Mission_Statement_2016_e.pdf
1049469	1048576	https://bmcmicrobiol.biomedcentral.com/track/pdf/10.1186/s12866-017-0951-4
1048824	1048576	https://citydocs.fcgov.com/?action=cao-cases&cmd=convert&docid=3332339

Are any of the pdfs in the corresponding wet file?  Yes, 2:

>: cut -f 3 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | fgrep -af - <(uz 1566027313501.0/orig/wet/*-00000.warc.wet.gz)
WARC-Target-URI: http://bdds.deux-sevres.com/recherche/simple/Editeur/2/Belfond/vignette?format=pdf
WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D00

Is it in fact corresponding?

>: diff -bw <(uz 1566027313501.0/orig/warc/*-00000.warc.gz | egrep -a '^WARC-Target-URI: ' | uniq | head -1000) <(uz 1566027313501.0/orig/wet/*-00000.warc.wet.gz | egrep -a '^WARC-Target-URI: ' | head -1000)|egrep -c '^<'
19

So, yes, mostly; about 2% are missing.

Just checking the search:

>: cut -f 3 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | fgrep -af - <(uz 1566027313501.0/orig/warc/*-00000.warc.gz) | wc -l
210

Correct.

So, what pdfs make it into the WET?

>: cut -f 3 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | fgrep -af - <(uz 1566027313501.0/orig/wet/*-00000.warc.wet.gz) > ~/results/CC-MAIN-2019-35/s0_file0_pdf.txt
>: wc -l < ~/results/CC-MAIN-2019-35/s0_file0_pdf.txt
2
>: cut -f 2 -d ' ' ~/results/CC-MAIN-2019-35/s0_file0_pdf.txt | tr -d '\r' | fgrep -f - ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv
11588	10913	http://bdds.deux-sevres.com/recherche/simple/Editeur/2/Belfond/vignette?format=pdf
1048979	1048576	https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005

Here's the short one:

WARC/1.0
WARC-Type: response
WARC-Date: 2019-08-17T22:40:17Z
WARC-Record-ID: <urn:uuid:ea98167b-c42a-4233-b57e-994aa627e38a>
Content-Length: 11588
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: <urn:uuid:f689f8d0-24f3-4824-9a38-4f3fee422a4e>
WARC-Concurrent-To: <urn:uuid:2d51c956-0012-4d78-affc-8f57fe9d2e15>
WARC-IP-Address: 92.175.114.24
WARC-Target-URI: http://bdds.deux-sevres.com/recherche/simple/Editeur/2/Belfond/vignette?format=pdf
WARC-Payload-Digest: sha1:7VVIUDQ4Q6XKNOAURYU4VTMRSZNPHDQA
WARC-Block-Digest: sha1:OSTWXLV772XNHS22T4UBSCSJAAXM2J6T
WARC-Identified-Payload-Type: application/pdf

HTTP/1.1 200 OK
Cache-Control: must-revalidate, post-check=0, pre-check=0,no-cache
Pragma: public,no-cache
Content-Type: application/pdf",text/html; charset=utf-8
X-Crawler-Content-Encoding: gzip
Expires: 0
Server:
X-Powered-By:
Set-Cookie: 166d74d734106ba68b20ea303011f622=301619e3fe31ecb98c8473f0ff5f35a2; path=/
Content-Disposition: attachment; filename="Médiathèque départementale des Deux-Sèvres - Résultats de la recherche Belfond.pdf"
Content-Transfer-Encoding: binary
P3P: CP="NOI ADM DEV PSAi COM NAV OUR OTRo STP IND DEM"
X-Content-Encoded-By:
X-Powered-By:
Date: Sat, 17 Aug 2019 22:40:16 GMT
X-Crawler-Content-Length: 5448
Content-Length: 10913

%PDF-1.7
%<E2><E3><CF><D3>
7 0 obj
<< /Type /Page /Parent 1 0 R /LastModified (D:20190818004016+02'00') /Resources 2 0 R /MediaBox [0.000000 0.000000 595.276000 841.890000] /CropBox [0.000000 0.000000 595.276000 841.890000] /BleedBox [0.000000 0.000000 595.276000 841.890000] /TrimBox [0.000000 0.000000 595.276000 841.890000] /ArtBox [0.000000 0.000000 595.276000 841.890000] /Contents 8 0 R /Rotate 0 /Group << /Type /Group /S /Transparency /CS /DeviceRGB >> /PZ 1 >>
endobj
8 0 obj

>: uz 1566027313501.0/orig/warc/*-00000.warc.gz|tail -n +1823434 | tail -n +24 | head -c 20000 > ~/results/CC-MAIN-2019-35/mediatheque.pdf
>: ps2ascii mediatheque.pdf
Médiathèque départementale des Deux-Sèvres - Résultats de la recherche Belfond
Médiathèque départementale des Deux-Sèvres - Résultats de la recherche Belfond
A charge de revanche
Titre :
Auteur : Grippando, James (1958-....)
...
etc., three pages, no errors

>: uz 1566027313501.0/orig/warc/*-00000.warc.gz|fgrep -an https://museum.wrap.gov.tw/GetFile4.ashx
38896837:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005
38896858:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005
38904590:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005
>: uz 1566027313501.0/orig/warc/*-00000.warc.gz|tail -n +38896858 | egrep -an '^%%EOF'
27:%%EOF
1114658:%%EOF
1313299:%%EOF

Hunh?

>: uz 1566027313501.0/orig/warc/*-00000.warc.gz|tail -n +38896858 | egrep -an '^(%%EOF|WARC)' | head -30
1:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005
2:WARC-Payload-Digest: sha1:SZ53DQQHENC7DDN7GQ5IS7VMEPAXAMBE
3:WARC-Block-Digest: sha1:QTKJA6A7445Z7264K2YAFBUUM2OYH2T2
4:WARC-Truncated: length
5:WARC-Identified-Payload-Type: application/pdf
27:%%EOF
7725:WARC/1.0
7726:WARC-Type: metadata
7727:WARC-Date: 2019-08-17T22:59:14Z
7728:WARC-Record-ID: <urn:uuid:77df2747-e567-45d3-8646-3069ae9a9f25>
7731:WARC-Warcinfo-ID: <urn:uuid:f689f8d0-24f3-4824-9a38-4f3fee422a4e>
7732:WARC-Concurrent-To: <urn:uuid:eceb4adc-d81e-4497-82fe-eea61ce171f4>
7733:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005
7739:WARC/1.0

OK, so indeed truncated after 7700 lines or so...

>: uz 1566027313501.0/orig/warc/*-00000.warc.gz|tail -n +38896858 | tail -n +21 | head -c 1048576 > ~/results/CC-MAIN-2019-35/museum.pdf
>: ps2ascii ~/results/CC-MAIN-2019-35/museum.pdf
   **** Error: An error occurred while reading an XREF table.
   **** The file has been damaged.

Look in big_pdf?

=====Modify the original CC indexer to write new indices including lastmod=====

Looks like WarcRecordWriter.write, in src/nutch-cc/src/java/org/commoncrawl/util/WarcRecordWriter, is what needs to be edited to include the Last-Modified date.

To rebuild nutch-cc, particularly to recompile jar files after editing anything:

>: cd $HHOME/src/nutch-cc
>: ant

Fixed deprecation bug in WarcCdxWriter.java.

Modified src/java/org/commoncrawl/util/WarcCdxWriter.java to include lastmod.

Can run just one test, which should allow testing this:

>: ant test-core -Dtestcase='TestWarcRecordWriter'

Logic is tricky, and there's no easy way in.  Basically, tools/WarcExport.java launches a Hadoop job based on a hadoop-runnable WarcExport instance.
Hadoop will in due course call ExportReducer.reduce, which will create an instance of WarcCapture "for each page capture", and call ExportMapper.context.write with that instance (via some configuration magic with the Hadoop job Context).  That in turn uses (more magic) WarcOutputFormat.getRecordWriter, which (finally!) calls a previously created WarcRecordWriter instance.write(the capture).

So to fake a test case, I need to build
 1) a WarcRecordWriter instance
 2) a WarcCapture instance
and then invoke 1.write(2).

Got that working, although I still can't figure out where in the normal flow the metadata entry for Response.CONTENT_TYPE gets set.

Now, add a test that takes a stream of WARC Response extracts and rewrites their index entries:

>: head -8804 <(uz /beegfs/common_crawl/CC-MAIN-2019-35/cdx/warc/cdx-00150.gz)|tail -10| ix.py -h -w -x > /tmp/hst/headers.txt
>: cp /tmp/hst/headers.txt src/test/org/commoncrawl/util/
>: shuf /tmp/hst/headers.txt > src/test/org/commoncrawl/util/headers_mixed.txt

Won't quite work :-(  How do we reconstruct the WARC filename, offset and length from the original index?

Well, we can find a .warc.gz's records!  Thanks to https://stackoverflow.com/a/37042747/2595465

>: ~/lib/python/unpackz.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz > /tmp/hst/recs.txt

Nearly working, got 1/3rd of the way through a single WARC and then failed:

>: n=0 && ~/lib/python/unpackz.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz 2>/tmp/hst/tt.txt|while read o l; do echo $((n+=1)); echo $o $l >> /tmp/hst/r3a; ix.py $l $o CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz| wc -l; done
...
20
10215
CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz Process fail: Compressed file ended before the end-of-stream marker was reached, input: length=10762, offset=60784640, file=/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz

>: head -10217 /tmp/hst/r3a | tail -4
60784173 467
60784640 10762
60795402 463
60795865 460
>: ix.py 467 60784173 CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz|fgrep Target
WARC-Target-URI: http://drycarerestoration.co/corner-furniture-piece/unique-corner-decoration-pieces-or-corner-furniture-pieces-corner-corner-furniture-piece-corner-furniture-pieces-bedroom/
>: zcat /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/cdx/warc/cdx.gz
...
co,drycarerestoration)/corner-furniture-piece/unique-corner-decoration-pieces-or-corner-furniture-pieces-corner-corner-furniture-piece-corner-furniture-pieces-bedroom 20190819020224 {"url": "http://drycarerestoration.co/corner-furniture-piece/unique-corner-decoration-pieces-or-corner-furniture-pieces-corner-corner-furniture-piece-corner-furniture-pieces-bedroom/", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "DTKGJL45XQDXUS7PTXPYR6POMPLG46RZ", "length": "2570", "offset": "60784640", "filename": "crawl-data/CC-MAIN-2019-35/segments/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz", "charset": "UTF-8", "languages": "eng"}
>: ix.py 2570 60784640 CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz|less
>: echo $((10762 - 2570))
8192

Ah, the error I was dreading :-(  I _think_ this happens when an individual record ends exactly on an 8K boundary.
Yes:

>: echo $((60784640 % 8192))
0

Even with a 1MB buffer:

21
160245
CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz Process fail: Compressed file ended before the end-of-stream marker was reached, input: length=8415, offset=1059033915, file=/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz
0
160246

>: tail -60 /tmp/hst/r3b|head -20
1059013061 423
1059013484 7218
1059020702 425
1059021127 424
1059021551 11471
1059033022 426
1059033448 467
1059033915 8415

Argh.  This is at the _same_ point (before 51 fails before EOF).

Ah, maybe that's the point -- this is the last read before EOF, and it's not a full buffer!

>: ix.py 467 1059033448 CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz|less
...
WARC-Target-URI: https://zowiecarrpsychicmedium.com/tag/oracle/

Reran with more instrumentation, took at least all day:

>: n=0 && ~/lib/python/unpackz.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz 2> /tmp/hst/r3e_err.txt | while read o l; do echo $((n+=1)); echo $o $l >> /tmp/hst/r3e_val; ix.py $l $o CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz | wc -l; done > /tmp/hst/r3e_log 2>&1
>: wc -l /tmp/hst/r3e_err.txt
160296 /tmp/hst/r3e_err.txt
>: tail -60 /tmp/hst/r3e_err.txt|cat -n | grep -C2 True\ True
     7	b 28738 28738 28312 426 False False
     8	b 28312 28312 27845 467 False False
     9	b 27845 378162 369747 8415 True True    < this is the first hit, the last (partial) block
    10	b 369747 369747 369312 435 False True
    11	b 369312 369312 368878 434 False True
>: tail -55 /tmp/hst/r3e_val | head -3
1059033022 426
1059033448 467
1059033915 8415
>: dd ibs=1 skip=1059033022 count=426 if=/beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz of=/dev/stdout | uz -t
...
426 bytes copied, 0.00468243 s, 91.0 kB/s
sing<3411>: dd ibs=1 skip=1059033448 count=467 if=/beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz of=/dev/stdout | uz -t
...
467 bytes copied, 0.00382692 s, 122 kB/s
sing<3412>: dd ibs=1 skip=1059033915 count=8415 if=/beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz of=/dev/stdout | uz -t
igzip: Error (null) does not contain a complete gzip file
...
8415 bytes (8.4 kB, 8.2 KiB) copied, 0.00968889 s, 869 kB/s

So, tried one change, to use the actual size rather than BUFSIZE at one point, and it seems to work now:

>: time ~/lib/python/unpackz.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz 2> /tmp/hst/r3f_err.txt | tee /tmp/hst/r3f_val | while read l o; do printf '%s\t%s\t%s\n' $l $o 'CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz'; done 2>&1 | tee /tmp/hst/r3f_log | ix.py -w | egrep -c '^WARC/1\.0'
160296

real	3m48.393s
user	0m47.997s
sys	0m26.641s

>: tail /tmp/hst/r3f_val
10851 1059370472
475 1059381323
444 1059381798
22437 1059382242
447 1059404679
506 1059405126
15183 1059405632
471 1059420815
457 1059421286
17754 1059421743
>: wc -l /tmp/hst/*_val
    171 /tmp/hst/r3d_val
 160297 /tmp/hst/r3e_val
 160296 /tmp/hst/r3f_val
 320764 total
>: uz /tmp/hst/head.warc.gz |egrep -c '^WARC/1\.0.$'
171
>: tail -n 3 /tmp/hst/*_val
==> /tmp/hst/r3d_val <==
454 1351795
414 1352249
0 1352663
  [so the 171 above is bogus, and we're missing one]

==> /tmp/hst/r3e_val <==
1059393441 457
1059393898 17754
0
  [likewise bogus, so see below]

==> /tmp/hst/r3f_val <==
471 1059420815
457 1059421286
17754 1059421743
  [better, but still one missing]

>: uz /tmp/hst/head.warc.gz |egrep '^WARC-Type: ' | tee >(wc -l 1>&2) | tail -4
WARC-Type: response
WARC-Type: metadata
WARC-Type: request
WARC-Type: response
  [missing]
171
>: ls -lt /tmp/hst/*_val
-rw-r--r-- 1 hst dc007    1977 Sep 29 09:27 /tmp/hst/r3d_val
-rw-r--r-- 1 hst dc007 2319237 Sep 28 14:28 /tmp/hst/r3f_val
-rw-r--r-- 1 hst dc007 2319238 Sep 27 19:41 /tmp/hst/r3e_val
>: ls -l ~/lib/python/unpackz.py
-rwxr-xr-x 1 hst dc007 1821 Sep 28 15:13 .../dc007/hst/lib/python/unpackz.py

So e and f are stale, rerun:

>: time ~/lib/python/unpackz.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz 2>/tmp/hst/r3f_err.txt| tee /tmp/hst/r3f_val|while read l o; do printf '%s\t%s\t%s\n' $l $o 'CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz' ;done |& tee /tmp/hst/r3f_log |ix.py -w |egrep '^WARC-Type: ' | tail -4 &
>: Reading length, offset, filename tab-delimited triples from stdin...
WARC-Type: response
WARC-Type: metadata
WARC-Type: request
WARC-Type: response

real	3m49.760s
user	0m47.180s
sys	0m32.218s

So missing the final metadata...

Back to head.warc.gz, with debug info:

>: n=0 && ~/lib/python/unpackz.py /tmp/hst/head.warc.gz 2>/tmp/hst/ttd.txt|while read l o; do echo $((n+=1)); echo $l $o >> /tmp/hst/r3d_val; dd ibs=1 skip=$o count=$l if=/tmp/hst/head.warc.gz of=/dev/stdout 2>/tmp/hst/r3d_ido| uz -t ; done >/tmp/hst/r3d_log 2>&1
>: tail -2 /tmp/hst/r3d_log
171
igzip: Error invalid gzip header found for file (null)
>: tail -n 3 /tmp/hst/ttd.txt /tmp/hst/r3d_val
==> /tmp/hst/ttd.txt <==
b 9697 9697 9243 454 False True
b 9243 9243 8829 414 False True
n 8829

==> /tmp/hst/r3d_val <==
454 1351795
414 1352249
0 1352663
>: cat -n /tmp/hst/r3f_val | head -172 | tail -4
   169	454 1351795
   170	414 1352249
   171	8829 1352663
   172	446 1361492

Fixed, maybe:

>: tail -n 3 /tmp/hst/r3d_log /tmp/hst/r3d_val
==> /tmp/hst/r3d_log <==
169
170
171

==> /tmp/hst/r3d_val <==
454 1351795
414 1352249
8829 1352663

Yes!
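For reference, a minimal sketch of the trick unpackz.py relies on (per the StackOverflow answer cited above): walk a multi-member .warc.gz with zlib, using unused_data/eof to find where each gzip member (i.e. each WARC record) starts and ends.  This is not the real unpackz.py, just an illustration; the function name and the 1MB buffer are assumptions.  A member that ends exactly at a read boundary (the 8K case above) and a final short read both fall out of the same bookkeeping:

    import zlib

    def gzip_members(path, bufsize=1024 * 1024):
        """Yield (offset, length) for each gzip member in a concatenated .gz file."""
        with open(path, 'rb') as f:
            d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)   # one member at a time, gzip format
            member_start = 0     # file offset where the current member begins
            pos = 0              # file offset of data[0]
            while True:
                data = f.read(bufsize)
                if not data:
                    break                        # EOF; an unfinished member here means truncation
                while data:
                    d.decompress(data)           # decompressed bytes discarded, we only want offsets
                    if not d.eof:                # member continues into the next read
                        pos += len(data)
                        break
                    consumed = len(data) - len(d.unused_data)
                    end = pos + consumed         # first byte after this member
                    yield member_start, end - member_start
                    # unused_data may be empty (member ended exactly at the read boundary)
                    # or may already hold the start of the next member
                    data = d.unused_data
                    pos = member_start = end
                    d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)

    # usage, matching the offset/length pairs the loops above consume:
    #   for offset, length in gzip_members('CC-MAIN-...-00558.warc.gz'):
    #       print(length, offset)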
>: time ~/lib/python/unpackz.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz 2>/tmp/hst/r3f_err| tee /tmp/hst/r3f_val|while read l o; do printf '%s\t%s\t%s\n' $l $o 'CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz' ;done |& tee /tmp/hst/r3f_log |ix.py -w |egrep '^WARC-Type: ' | tail -4
Reading length, offset, filename tab-delimited triples from stdin...
WARC-Type: metadata
WARC-Type: request
WARC-Type: response
WARC-Type: metadata

real	3m26.042s
user	0m44.167s
sys	0m24.716s

>: tail -n 3 /tmp/hst/r3f*
==> /tmp/hst/r3f_err <==

==> /tmp/hst/r3f_val <==
457 1059421286
17754 1059421743
425 1059439497

Doubling the buffer size doesn't speed it up:

>: time ~/lib/python/unpackz.py -b $((2 * 1024 * 1024)) /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz 2>/tmp/hst/r3g_err| tee /tmp/hst/r3g_val|while read l o; do printf '%s\t%s\t%s\n' $l $o 'CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz' ;done |& tee /tmp/hst/r3g_log |ix.py -w |egrep '^WARC-Type: ' | tail -4
Reading length, offset, filename tab-delimited triples from stdin...
WARC-Type: metadata
WARC-Type: request
WARC-Type: response
WARC-Type: metadata

real	3m34.519s
user	0m52.312s
sys	0m24.875s

Tried using FileIO.readinto([a fixed buffer]), but it didn't immediately work.  Abandoned, because I still don't understand how zlib.decompress works at all...

Time to convert unpackz to a library which takes a callback as an alternative to an output file -- Done.

Without using the callback, the timing and structure for what we need for the re-indexing task look encouraging:

>: time ~/lib/python/cc/unpackz.py -o /dev/stdout /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz |egrep -aA20 '^WARC-Type: response' | cut -f 1 -d ' ' | egrep -a '^WARC-' |sus | tee >(wc -l 1>&2)
  52468 WARC-Block-Digest:
  52468 WARC-Concurrent-To:
  52468 WARC-Date:
  52468 WARC-Identified-Payload-Type:
  52468 WARC-IP-Address:
  52468 WARC-Payload-Digest:
  52468 WARC-Record-ID:
  52468 WARC-Target-URI:
  52468 WARC-Type:
  52468 WARC-Warcinfo-ID:
    236 WARC-Truncated:
11

real	0m20.308s
user	0m19.720s
sys	0m4.505s

Whole thing, with no pre-filtering:

>: time ~/lib/python/cc/unpackz.py -o /dev/stdout /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz | cut -f 1 -d ' ' | egrep -a '^(WARC-|Content-|Last-Modified)' |sus | tee >(wc -l 1>&2)
 211794 Content-Length:
 211162 Content-Type:
 159323 WARC-Target-URI:
 159311 WARC-Warcinfo-ID:
 159301 WARC-Record-ID:
 159299 WARC-Date:
 159297 WARC-Type:
 105901 WARC-Concurrent-To:
 105896 WARC-IP-Address:
  52484 WARC-Block-Digest:
  52484 WARC-Identified-Payload-Type:
  52482 WARC-Payload-Digest:
   9239 Last-Modified:
   3941 Content-Language:
   2262 Content-Security-Policy:
    642 Content-language:
    326 Content-Security-Policy-Report-Only:
    238 WARC-Truncated:
    114 Content-Disposition:
    352 Content-*:
      1 WARC-Filename:
42

real	0m30.896s
user	0m37.335s
sys	0m7.542s

First 51 lines after each WARC-Type: response:

>: time ~/lib/python/cc/unpackz.py -o /dev/stdout /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz |egrep -aA50 '^WARC-Type: response' | cut -f 1 -d ' ' | egrep -a '^(WARC-|Content-|Last-Modified)' |sus | tee >(wc -l 1>&2)
 106775 Content-Length:
 106485 Content-Type:
  55215 WARC-Type:
  55123 WARC-Date:
  54988 WARC-Record-ID:
  54551 WARC-Warcinfo-ID:
  54246 WARC-Target-URI:
  54025 WARC-Concurrent-To:
  52806 WARC-IP-Address:
  52468 WARC-Block-Digest:
  52468 WARC-Identified-Payload-Type:
  52468 WARC-Payload-Digest:
   9230 Last-Modified:
   3938 Content-Language:
   2261 Content-Security-Policy:
    639 Content-language:
    324 Content-Security-Policy-Report-Only:
    236 WARC-Truncated:
    114 Content-Disposition:
    342 Content-*:
41

real	0m21.483s
user	0m22.372s
sys	0m5.400s

So, not worth the risk, let's try Python:

>: time ~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz|wc -l
9238

real	0m25.426s
user	0m23.201s
sys	0m0.711s

Looks good, but why 9238 instead of 9239???

>: ~/lib/python/cc/unpackz.py -o /dev/stdout /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz | egrep -a '^Last-Modified: ' > /tmp/hst/lmo.tsv

Argh.  Serious bug in unpackz: it wasn't handling cross-buffer-boundary records correctly.  Fixed.  Redoing the above...

No pre-filter:

>: uz /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz|egrep -c '^WARC/1\.0.$'
160297
>: time ~/lib/python/cc/unpackz.py -o /dev/stdout /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz | cut -f 1 -d ' ' | egrep -a '^(WARC-|Content-|Last-Modified)' |sus | tee >(wc -l 1>&2)
 213719 Content-Length:
 213088 Content-Type:
 160297 WARC-Date:
 160297 WARC-Record-ID:
 160297 WARC-Type:
 160296 WARC-Target-URI:
 160296 WARC-Warcinfo-ID:
 106864 WARC-Concurrent-To:
 106864 WARC-IP-Address:
  53432 WARC-Block-Digest:    [consistent with 160297 == (3 * 53432) + 1]
  53432 WARC-Identified-Payload-Type:
  53432 WARC-Payload-Digest:
   9430 Last-Modified:
   4006 Content-Language:
   2325 Content-Security-Policy:
    653 Content-language:
    331 Content-Security-Policy-Report-Only:
    298 WARC-Truncated:
    128 Content-Disposition:
     83 Content-Location:
     67 Content-type:
     51 Content-MD5:
     45 Content-Script-Type:
     42 Content-Style-Type:
     31 Content-Transfer-Encoding:
     13 Content-disposition:
      8 Content-Md5:
      5 Content-Description:
      5 Content-script-type:
      5 Content-style-type:
      3 Content-transfer-encoding:
      2 Content-Encoding-handler:
      1 Content-DocumentTitle:
      1 Content-Hash:
      1 Content-ID:
      1 Content-Legth:
      1 Content-length:
      1 Content-Range:
      1 Content-Secure-Policy:
      1 Content-security-policy:
      1 Content-Type-Options:
      1 WARC-Filename:
42

real	0m28.876s
user	0m35.703s
sys	0m6.976s

>: ~/lib/python/cc/unpackz.py -o /dev/stdout /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz | egrep -a '^Last-Modified: ' > /tmp/hst/lmo.tsv
>: wc -l /tmp/hst/lmo.tsv
9430 /tmp/hst/lmo.tsv
>: time ~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz > /tmp/hst/lm.tsv

real	0m17.191s
user	0m15.739s
sys	0m0.594s

>: wc -l /tmp/hst/lm.tsv
9423 /tmp/hst/lm.tsv
>: diff <(sed 's/^Last-Modified: //' /tmp/hst/lmo.tsv | tr -d '\r') <(cut -f 3 /tmp/hst/lm.tsv)
853d852
< Mon, 19 Aug 2019 01:46:49 GMT
4058d4056
< Tue, 03 Nov 2015 21:31:18 GMT<br />
4405d4402
< Mon, 19 Aug 2019 01:54:52 GMT
5237,5238d5233
< 3
< Asia/Amman
7009d7003
< Mon, 19 Aug 2019 02:34:20 GMT
9198d9191
< Mon, 19 Aug 2019 02:14:49 GMT

All good.
The only implausible case is

< Mon, 19 Aug 2019 01:54:52 GMT

which turns out to be a case of two Last-Modified headers in the same response record's HTTP headers.  RFCs 2616 and 7230 rule it out, but neither specifies a recovery, so first-wins is as good as anything, and indeed RFC 6797 specifies that.

Start looking at how we do the merge of the cdx_extras.py output with the existing index.

Try it with the existing _per segment_ index we have for 2019-35.

Assuming we have to key on segment plus offset, as reconstructing the proper index key is such a pain / buggy / is going to change with the year.

Stay with segment 49:

>: uz cdx.gz |wc -l
29,870,307
>: time uz cdx.gz|egrep -ao ' "length": "[0-9]*", "offset": "[0-9]*"' |wc
29,870,307 119,481,228 1,241,098,122    [words = 4 * 29,870,307]

So no bogons, not _too_ surprising :-)

Bad news is it's a _big_ file:

>: ls -lh cdx.gz
-rw-r--r-- 1 hst dc007 2.0G Mar 18  2021 cdx.gz

So it's not viable to paste offset on as a key and then sort on the command line, or to load it into Python and do the work there...

Do it per warc file and then merge?

>: time uz cdx.gz |fgrep -a warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz | sort -n -t\" -k28,28 > /tmp/hst/558.warc.cdx

real	0m23.494s
user	0m14.541s
sys	0m9.158s

>: wc -l /tmp/hst/558.warc.cdx
53432 /tmp/hst/558.warc.cdx
>: echo $((600 * 53432))
32,059,200

So, 600 of those, plus approx. the same again for extracting; that probably _is_ doable in Python, not more than 10 hours total, assuming internal sort and external merge is not too expensive...

For each segment, suppose we pull out 60 groups of 10 target files:

>: time uz cdx.gz |egrep -a warc/CC-MAIN-2019[^-]*-2019[^-]*-0000..warc.gz > /tmp/hst/0000.warc.cdx

real	0m42.129s
user	0m35.147s
sys	0m9.140s

>: wc -l /tmp/hst/0000.warc.cdx
533150

Key it with offset and sort:

>: time egrep -ao ' "length": "[0-9]*", "offset": "[0-9]*"' /tmp/hst/0000.warc.cdx | cut -f 5 -d ' ' | tr -d \" > /tmp/hst/0000_offsets

real	0m5.578s
user	0m5.593s
sys	0m0.265s

>: time paste /tmp/hst/0000_offsets /tmp/hst/0000.warc.cdx |sort -nk1,1 | cut -f 2 > /tmp/hst/0000_sorted.warc.cdx

real	0m4.185s
user	0m2.001s
sys	0m1.334s

>: time seq 0 9 | parallel -j 10 "~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-*-*-0000'{}'.warc.gz > /tmp/hst/lm_0000'{}'.tsv"

real	0m24.610s
user	2m54.146s
sys	0m10.226s

>: head /tmp/hst/lm_00000.tsv
9398	16432	Mon, 19 Aug 2019 02:44:15 GMT
20796	26748	Tue, 16 Jul 2019 04:39:09 GMT
4648	340633	Fri, 07 Dec 2018 09:05:59 GMT
3465	357109	Sun, 18 Aug 2019 11:48:23 GMT
7450	914189	Mon, 19 Aug 2019 02:50:08 GMT
...
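cdx_extras.py itself isn't reproduced in these notes; as a hedged sketch of what it evidently has to do, given the runs above: the same member-walking as before, but keeping the decompressed record and pulling the first Last-Modified out of the HTTP header block of each response record (so Last-Modified-looking strings in payloads, like the Asia/Amman case above, don't count), and emitting the length, offset, Last-Modified TSV shown by the head above.  Function name and details are illustrative only:

    import re
    import sys
    import zlib

    LASTMOD = re.compile(rb'^Last-Modified:\s*(.*?)\s*$', re.M)

    def lastmod_tsv(path, bufsize=1024 * 1024, out=sys.stdout):
        """Emit length<TAB>offset<TAB>Last-Modified for each response record."""
        with open(path, 'rb') as f:
            d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
            start = pos = 0
            text = b''                              # decompressed current member
            while True:
                data = f.read(bufsize)
                if not data:
                    break
                while data:
                    text += d.decompress(data)
                    if not d.eof:
                        pos += len(data)
                        break
                    end = pos + len(data) - len(d.unused_data)
                    if b'WARC-Type: response' in text[:512]:       # crude record-type check
                        # restrict the search to WARC + HTTP headers (up to the second blank line)
                        hdr_end = text.find(b'\r\n\r\n', text.find(b'\r\n\r\n') + 4)
                        m = LASTMOD.search(text, 0, hdr_end if hdr_end > 0 else len(text))
                        if m:                                       # first header wins
                            print(end - start, start, m.group(1).decode('latin-1'),
                                  sep='\t', file=out)
                    data, pos, start, text = d.unused_data, end, end, b''
                    d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)

    if __name__ == '__main__':
        lastmod_tsv(sys.argv[1])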
sing<3956>: fgrep '"length": "9398", "offset": "16432"' /tmp/hst/0000_sorted.warc.cdx
com,roommeme,0401a)/index.phtml?channel=&op=&p=140&put=show&r2= 20190819024416 {"url": "http://0401a.roommeme.com/index.phtml?PUT=SHOW&R2=&OP=&P=140&CHANNEL=", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "5DNDVX5HQBOOBHISSCOI4UBVMUL63L36", "length": "9398", "offset": "16432", "filename": "crawl-data/CC-MAIN-2019-35/segments/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00000.warc.gz", "charset": "Big5", "languages": "zho"}

Bingo.

So, the Python code is pretty straightforward:
  open the 10 individual lm-*.tsv outputs into an array,
  initialise a 10-elt array with the first line of each and another with its offset,
  record the fileno(s) of the lowest offset,
  then iterate:
    read cdx lines and write unchanged until offset = lowest
    merge line from fileno and output
    remove fileno from list of matches
    read and store a new line for fileno [handle EOF]
    if list of matches is empty, redo setting of lowest
  Resort the result by actual key
(see the sketch at the end of these notes)

Meanwhile, get a whole test set:

sbatch --output=slurm_aug_cdx_49_10-599-out --time=01:00:00 --ntasks=10 -c 36 --exclusive $HOME/bin/runme.sh -m 00 59 $PWD -t 18 -b 'export resdir=CC-MAIN-2019-35/aug_cdx/49
export DEC=$xarg' "export PYTHONPATH=./lib/python/cc:$PYTHONPATH
seq 0 9 | parallel -j 10 \"~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-*-*-00\${DEC}'{}'.warc.gz > \$resdir/00\${DEC}'{}'.tsv\""

Actually finished 360 in the hour, leaving:

sbatch --output=slurm_aug_cdx_49_360-599-out --time=01:00:00 --ntasks=10 -c 36 --exclusive $HOME/bin/runme.sh -m 36 59 $PWD -t 18 -b 'export resdir=CC-MAIN-2019-35/aug_cdx/49
export DEC=$xarg' "export PYTHONPATH=./lib/python/cc:$PYTHONPATH
seq 0 9 | parallel -j 10 \"~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-*-*-00\${DEC}'{}'.warc.gz > \$resdir/00\${DEC}'{}'.tsv\""

But something is wrong, the number of jobs is all wrong:

5>: fgrep -c parallel slurm_aug_cdx_49_0-359-out
741
sing<4046>: ls -lt CC-MAIN-2019-35/aug_cdx/49/|wc -l
372

Every file is being produced twice.  Took me a while to figure out my own code :-(

>: sbatch --output=slurm_aug_cdx_49_360-599-out --time=01:00:00 --ntasks=10 -c 36 --exclusive $HOME/bin/runme.sh -m 49 49 $PWD -t 18 -b 'export resdir=CC-MAIN-2019-35/aug_cdx/$xarg
export SEG=$xarg
share_by_task.sh -f "%03g\n" -s 360 599 $n $task > /tmp/hst_$task' -i 'cat /tmp/hst_$task' 'export PYTHONPATH=./lib/python/cc:$PYTHONPATH
~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/*.$SEG/orig/warc/CC-MAIN-*-*-00${arg}.warc.gz > $resdir/00${arg}.tsv'

Oops, only 560, not 600.

Took 3.5 minutes for 200, so call it 10 for 560, so do 6 more in an hour:

>: sbatch --output=slurm_aug_cdx_50-55_out --time=01:00:00 --ntasks=10 -c 36 --exclusive $HOME/bin/runme.sh -m 50 55 $PWD -t 18 -b 'export resdir=CC-MAIN-2019-35/aug_cdx/$xarg
mkdir -p $resdir
export SEG=$xarg
share_by_task.sh -f "%03g\n" -s 360 599 $n $task > /tmp/hst_$task' -i 'cat /tmp/hst_$task' 'export PYTHONPATH=./lib/python/cc:$PYTHONPATH
~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/*.$SEG/orig/warc/CC-MAIN-*-*-00${arg}.warc.gz > $resdir/00${arg}.tsv'

>: tail slurm_aug_cdx_50-55_out
...
Wed Oct 9 22:25:47 BST 2024
Finished 55
>: head -1 slurm_aug_cdx_50-55_out
Wed Oct 9 21:29:43 BST
  [56:04 elapsed]
>: du -s CC-MAIN-2019-35/aug_cdx
1,902,916

Not bad, so order 20GB for the whole thing.

Next step: compare to my existing cdx with timestamps.

First check looks about right:

[cd .../warc_lmhx]
>: seq --format='%03g' 0 299 > /tmp/hst/cdx_nums
>: parallel -j 20 -a /tmp/hst/cdx_nums 'uz idx/cdx-00{}.gz | egrep -o "\"filename\": \"crawl-data/CC-MAIN-2019-35/segments/[^.]*[.]50.*\"lastmod\":" | sed "s/^.*-00//;s/^\(...\).*/\1/"| sus > /tmp/hst/checkseg_50_{}'

[cd .../aug_cdx/50]
>: wc -l 00123.tsv
9333
>: egrep -h '123$' /tmp/hst/checkseg_50_??? | acut 1 | btot
9300
>: wc -l 00400.tsv
9477 00400.tsv
>: egrep -h '400$' /tmp/hst/checkseg_50_??? | acut 1 | btot
9439

Difference is presumably that the bogus timestamps aren't in the augmented cdx as shipped.
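Sketch of the merge planned above ("the Python code is pretty straightforward"), with one simplification flagged as such: rather than a cursor per lm-*.tsv, the ten per-warc-file TSVs in a 10-file group (about 9,400 rows each) are small enough to load into a dict keyed on (warc file number, offset), which also sidesteps offset collisions between different files in the same cdx slice.  The offset-sorted cdx is then streamed through once, with a "lastmod" field added on a match (the key matches the published augmented cdx checked just above); the re-sort by the real index key still has to happen afterwards.  Names are assumptions, none of this code exists yet:

    import glob
    import json
    import re
    import sys

    FILENO = re.compile(r'-(\d{5})\.warc\.gz$')          # ...-00000.warc.gz

    def load_lastmods(tsv_glob):
        """Load lm_000NN.tsv files (length<TAB>offset<TAB>Last-Modified) into
        a dict keyed on (warc file number, offset)."""
        table = {}
        for path in glob.glob(tsv_glob):
            fileno = int(path[-9:-4])                    # assumes .../lm_000NN.tsv naming
            with open(path) as f:
                for row in f:
                    length, offset, lastmod = row.rstrip('\n').split('\t')
                    table[(fileno, int(offset))] = lastmod
        return table

    def augment(cdx_lines, table, out=sys.stdout):
        """Copy cdx lines through, adding "lastmod" to the JSON blob on a match."""
        for line in cdx_lines:
            key, ts, blob = line.rstrip('\n').split(' ', 2)
            rec = json.loads(blob)
            fileno = int(FILENO.search(rec['filename']).group(1))
            lastmod = table.get((fileno, int(rec['offset'])))
            if lastmod is not None:
                rec['lastmod'] = lastmod
            print(key, ts, json.dumps(rec, ensure_ascii=False), file=out)

    if __name__ == '__main__':
        # e.g. merge_lastmod.py '/tmp/hst/lm_0000?.tsv' /tmp/hst/0000_sorted.warc.cdx
        with open(sys.argv[2]) as cdx:
            augment(cdx, load_lastmods(sys.argv[1]))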