comparison lurid3/notes.txt @ 47:fbdaede4155a

cdx_extras and unpackz.py working
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Thu, 03 Oct 2024 18:16:05 +0100
parents 49672e9b4c1c
children f688c437180b
46:49672e9b4c1c 47:fbdaede4155a
552 ==> /tmp/hst/r3f_val <==
553 457 1059421286
554 17754 1059421743
555 425 1059439497
556
557 Doubling the buffer size (to 2MB) doesn't speed things up
558 >: time ~/lib/python/unpackz.py -b $((2 * 1024 * 1024)) /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz 2>/tmp/hst/r3g_err| tee /tmp/hst/r3g_val|while read l o; do printf '%s\t%s\t%s\n' $l $o 'CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz' ;done |& tee /tmp/hst/r3g_log |ix.py -w |egrep '^WARC-Type: ' | tail -4
559 Reading length, offset, filename tab-delimited triples from stdin...
560 WARC-Type: metadata
561 WARC-Type: request
562 WARC-Type: response
563 WARC-Type: metadata
564
565 real 3m34.519s
566 user 0m52.312s
567 sys 0m24.875s
568
569 Tried using FileIO.readinto([a fixed buffer]), but it didn't immediately
570 work. Abandoned because I still don't understand how zlib.decompress
571 works at all...
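For reference, a minimal sketch of the zlib mechanics involved, assuming the input is a plain concatenation of gzip members (as .warc.gz files are). The name members() is made up for illustration and this is not the unpackz.py code. It works on an in-memory bytes object, so it shows what eof and unused_data do rather than how to read a big file efficiently.

```python
import gzip
import zlib


def members(data):
    """Yield (length, offset, payload) for each gzip member in data (bytes)."""
    offset = 0
    while offset < len(data):
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
        d = zlib.decompressobj(16 + zlib.MAX_WBITS)
        # Slicing copies the tail each time: fine for a sketch, slow for real files
        payload = d.decompress(data[offset:])
        # Once the member's trailer has been read, d.eof is True and the
        # bytes that follow the member are left untouched in d.unused_data
        if not d.eof:
            raise ValueError("truncated gzip member at offset %d" % offset)
        length = len(data) - offset - len(d.unused_data)
        yield length, offset, payload
        offset += length


# Two members back to back, standing in for two WARC records
blob = gzip.compress(b"first record\n") + gzip.compress(b"second record\n")
for length, offset, payload in members(blob):
    print(length, offset, payload)
```

The (length, offset) pairs are the same kind of values as the two columns in /tmp/hst/r3f_val above.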
572
573 Time to convert unpackz to a library which takes a callback
574 as an alternative to an output file -- Done
575
576 W/o using the callback, timing and structure for what we need for
577 the re-indexing task look encouraging:
578 >: time ~/lib/python/cc/unpackz.py -o /dev/stdout /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz |egrep -aA20 '^WARC-Type: response' | cut -f 1 -d ' ' | egrep -a '^WARC-' |sus | tee >(wc -l 1>&2)
579 52468 WARC-Block-Digest:
580 52468 WARC-Concurrent-To:
581 52468 WARC-Date:
582 52468 WARC-Identified-Payload-Type:
583 52468 WARC-IP-Address:
584 52468 WARC-Payload-Digest:
585 52468 WARC-Record-ID:
586 52468 WARC-Target-URI:
587 52468 WARC-Type:
588 52468 WARC-Warcinfo-ID:
589 236 WARC-Truncated:
590 11
591
592 real 0m20.308s
593 user 0m19.720s
594 sys 0m4.505s
595
596 Whole thing, with no pre-filtering:
597
598 >: time ~/lib/python/cc/unpackz.py -o /dev/stdout /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz | cut -f 1 -d ' ' | egrep -a '^(WARC-|Content-|Last-Modified)' |sus | tee >(wc -l 1>&2)
599 211794 Content-Length:
600 211162 Content-Type:
601 159323 WARC-Target-URI:
602 159311 WARC-Warcinfo-ID:
603 159301 WARC-Record-ID:
604 159299 WARC-Date:
605 159297 WARC-Type:
606 105901 WARC-Concurrent-To:
607 105896 WARC-IP-Address:
608 52484 WARC-Block-Digest:
609 52484 WARC-Identified-Payload-Type:
610 52482 WARC-Payload-Digest:
611 9239 Last-Modified:
612 3941 Content-Language:
613 2262 Content-Security-Policy:
614 642 Content-language:
615 326 Content-Security-Policy-Report-Only:
616 238 WARC-Truncated:
617 114 Content-Disposition:
618 352 Content-*:
619 1 WARC-Filename:
620 42
621
622 real 0m30.896s
623 user 0m37.335s
624 sys 0m7.542s
625
626 First 51 lines of each WARC-Type: response record (egrep -A50):
627
628 >: time ~/lib/python/cc/unpackz.py -o /dev/stdout /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz |egrep -aA50 '^WARC-Type: response' | cut -f 1 -d ' ' | egrep -a '^(WARC-|Content-|Last-Modified)' |sus | tee >(wc -l 1>&2)
629 106775 Content-Length:
630 106485 Content-Type:
631 55215 WARC-Type:
632 55123 WARC-Date:
633 54988 WARC-Record-ID:
634 54551 WARC-Warcinfo-ID:
635 54246 WARC-Target-URI:
636 54025 WARC-Concurrent-To:
637 52806 WARC-IP-Address:
638 52468 WARC-Block-Digest:
639 52468 WARC-Identified-Payload-Type:
640 52468 WARC-Payload-Digest:
641 9230 Last-Modified:
642 3938 Content-Language:
643 2261 Content-Security-Policy:
644 639 Content-language:
645 324 Content-Security-Policy-Report-Only:
646 236 WARC-Truncated:
647 114 Content-Disposition:
648 342 Content-*:
649 41
650
651 real 0m21.483s
652 user 0m22.372s
653 sys 0m5.400s
654
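655 So pre-filtering saves only ~9s real (21.5s vs 30.9s) and drops some headers (9230 vs 9239 Last-Modified): not worth the risk, let's try Python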
656
657 >: time ~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz|wc -l
658 9238
659
660 real 0m25.426s
661 user 0m23.201s
662 sys 0m0.711s
663
664 Looks good, but why 9238 instead of 9239???
665
666 >: ~/lib/python/cc/unpackz.py -o /dev/stdout /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz | egrep -a '^Last-Modified: ' > /tmp/hst/lmo.tsv
667
668 Argh. Serious bug in unpackz, wasn't handling cross-buffer-boundary
669 records correctly. Fixed. Redoing the above...
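A sketch of one way to get the boundary case right when reading through a fixed buffer, assuming the same multi-member gzip layout. member_spans() and its bufsize parameter are hypothetical names, not the actual unpackz.py fix. The point is that a decompressor which has not reached eof is simply kept and fed the next chunk, and leftover bytes after a member go to a fresh decompressor.

```python
import gzip
import io
import zlib


def member_spans(stream, bufsize=1024 * 1024):
    """Yield (length, offset) per gzip member, reading stream in fixed chunks."""
    buf = bytearray(bufsize)
    view = memoryview(buf)
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)
    start = 0      # file offset at which the current member began
    consumed = 0   # compressed bytes actually used by zlib so far
    while True:
        n = stream.readinto(buf)
        if not n:
            break
        chunk = bytes(view[:n])
        while chunk:
            d.decompress(chunk)  # payload discarded here, only spans wanted
            if d.eof:
                # Member ended inside this chunk; the rest is in unused_data
                consumed += len(chunk) - len(d.unused_data)
                yield consumed - start, start
                start = consumed
                chunk = d.unused_data
                d = zlib.decompressobj(16 + zlib.MAX_WBITS)
            else:
                # Member continues past the buffer boundary: keep the same
                # decompressor and feed it the next chunk
                consumed += len(chunk)
                chunk = b""
    if consumed != start:
        raise ValueError("input ends inside a gzip member")


# A 16-byte buffer forces every member to straddle several reads
blob = gzip.compress(b"x" * 200) + gzip.compress(b"y" * 300)
print(list(member_spans(io.BytesIO(blob), bufsize=16)))
```

A cheap regression check for this kind of bug is to run the same input with several buffer sizes and require identical output.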
670
671 No pre-filter:
672 >: uz /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz|egrep -c '^WARC/1\.0.$'
673 160297
674
675 >: time ~/lib/python/cc/unpackz.py -o /dev/stdout /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz | cut -f 1 -d ' ' | egrep -a '^(WARC-|Content-|Last-Modified)' |sus | tee >(wc -l 1>&2)
676
677 213719 Content-Length:
678 213088 Content-Type:
679 160297 WARC-Date:
680 160297 WARC-Record-ID:
681 160297 WARC-Type:
682 160296 WARC-Target-URI:
683 160296 WARC-Warcinfo-ID:
684 106864 WARC-Concurrent-To:
685 106864 WARC-IP-Address:
686 53432 WARC-Block-Digest: [consistent with 160297 == (3 * 53432) + 1]
687 53432 WARC-Identified-Payload-Type:
688 53432 WARC-Payload-Digest:
689 9430 Last-Modified:
690 4006 Content-Language:
691 2325 Content-Security-Policy:
692 653 Content-language:
693 331 Content-Security-Policy-Report-Only:
694 298 WARC-Truncated:
695 128 Content-Disposition:
696 83 Content-Location:
697 67 Content-type:
698 51 Content-MD5:
699 45 Content-Script-Type:
700 42 Content-Style-Type:
701 31 Content-Transfer-Encoding:
702 13 Content-disposition:
703 8 Content-Md5:
704 5 Content-Description:
705 5 Content-script-type:
706 5 Content-style-type:
707 3 Content-transfer-encoding:
708 2 Content-Encoding-handler:
709 1 Content-DocumentTitle:
710 1 Content-Hash:
711 1 Content-ID:
712 1 Content-Legth:
713 1 Content-length:
714 1 Content-Range:
715 1 Content-Secure-Policy:
716 1 Content-security-policy:
717 1 Content-Type-Options:
718 1 WARC-Filename:
719 42
720
721 real 0m28.876s
722 user 0m35.703s
723 sys 0m6.976s
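The consistency note against WARC-Block-Digest, spelled out. This assumes every capture contributes exactly three records and the file has a single extra record at the front (the warcinfo one). That is my reading of the counts, not something checked record by record.

```python
responses = 53432        # WARC-Block-Digest count from the table above
total_records = 160297   # WARC-Type count, one per record

# Three records per capture plus one extra record for the file
assert 3 * responses + 1 == total_records

# WARC-Target-URI and WARC-Warcinfo-ID appear on all but one record
assert total_records - 1 == 160296

# WARC-Concurrent-To and WARC-IP-Address appear on two of the three
# records in each capture
assert 2 * responses == 106864
```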
724
725 >: ~/lib/python/cc/unpackz.py -o /dev/stdout /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz | egrep -a '^Last-Modified: ' > /tmp/hst/lmo.tsv
726 >: wc -l /tmp/hst/lmo.tsv
727 9430 /tmp/hst/lmo.tsv
728 >: time ~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz > /tmp/hst/lm.tsv
729
730 real 0m17.191s
731 user 0m15.739s
732 sys 0m0.594s
733 >: wc -l /tmp/hst/lm.tsv
734 9423 /tmp/hst/lm.tsv
735
736 >: diff <(sed 's/^Last-Modified: //' /tmp/hst/lmo.tsv | tr -d '\r') <(cut -f 3 /tmp/hst/lm.tsv)
737 853d852
738 < Mon, 19 Aug 2019 01:46:49 GMT
739 4058d4056
740 < Tue, 03 Nov 2015 21:31:18 GMT<br />
741 4405d4402
742 < Mon, 19 Aug 2019 01:54:52 GMT
743 5237,5238d5233
744 < 3
745 < Asia/Amman
746 7009d7003
747 < Mon, 19 Aug 2019 02:34:20 GMT
748 9198d9191
749 < Mon, 19 Aug 2019 02:14:49 GMT
750
751 All good. The only implausible case is
752 < Mon, 19 Aug 2019 01:54:52 GMT
753 which turns out to be a case of two Last-Modified headers in the
754 same response record's HTTP headers. RFCs 2616 and 7230 rule it
755 out but neither specifies a recovery, so first-wins is as good as
756 anything, and indeed RFC 6797 specifies that for repeated Strict-Transport-Security headers.
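A minimal sketch of the first-wins rule, assuming the caller has already isolated the HTTP header block of a response record. first_last_modified() is a hypothetical helper, not the cdx_extras.py code, and the case-insensitive name match is my assumption (HTTP field names are case-insensitive, whereas the egrep above is not). Because only the header block is scanned, body lines that happen to start with 'Last-Modified: ' are never seen.

```python
def first_last_modified(header_block):
    """Return the first Last-Modified value in an HTTP header block, or None.

    header_block is the bytes between the status line and the blank line
    that ends the headers.
    """
    for line in header_block.split(b"\r\n"):
        name, sep, value = line.partition(b":")
        if sep and name.strip().lower() == b"last-modified":
            # First occurrence wins; any later duplicate is ignored
            return value.strip().decode("latin-1")
    return None


# Made-up header block with a duplicated header, for illustration only
headers = (
    b"Content-Type: text/html\r\n"
    b"Last-Modified: Thu, 01 Jan 2015 00:00:00 GMT\r\n"
    b"Last-Modified: Fri, 02 Jan 2015 00:00:00 GMT\r\n"
)
print(first_last_modified(headers))
```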
757