comparison lurid3/notes.txt @ 47:fbdaede4155a
cdx_extras and unpackz.py working
author | Henry S. Thompson <ht@inf.ed.ac.uk> |
---|---|
date | Thu, 03 Oct 2024 18:16:05 +0100 |
parents | 49672e9b4c1c |
children | f688c437180b |
46:49672e9b4c1c | 47:fbdaede4155a |
---|---|
552 ==> /tmp/hst/r3f_val <== | 552 ==> /tmp/hst/r3f_val <== |
553 457 1059421286 | 553 457 1059421286 |
554 17754 1059421743 | 554 17754 1059421743 |
555 425 1059439497 | 555 425 1059439497 |
556 | 556 |
557 Doubling buffer size doesn't speed up | |
558 >: time ~/lib/python/unpackz.py -b $((2 * 1024 * 1024)) /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz 2>/tmp/hst/r3g_err| tee /tmp/hst/r3g_val|while read l o; do printf '%s\t%s\t%s\n' $l $o 'CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz' ;done |& tee /tmp/hst/r3g_log |ix.py -w |egrep '^WARC-Type: ' | tail -4 | |
559 Reading length, offset, filename tab-delimited triples from stdin... | |
560 WARC-Type: metadata | |
561 WARC-Type: request | |
562 WARC-Type: response | |
563 WARC-Type: metadata | |
564 | |
565 real 3m34.519s | |
566 user 0m52.312s | |
567 sys 0m24.875s | |
568 | |
569 Tried using FileIO.readinto([a fixed buffer]), but didn't immediately | |
570 work. Abandoned because I still don't understand how zlib.decompress | |
571 works at all... | |
572 | |
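For reference, a minimal sketch of how zlib's streaming interface treats a multi-member gzip file such as a .warc.gz. This is not the code in unpackz.py; the name `split_members` and the sample payloads are made up for illustration. `zlib.decompressobj(wbits=31)` decodes one gzip member; when that member ends, `eof` becomes true and any input beyond it is left in `unused_data`, to be handed to a fresh decompressor.

```python
import gzip
import zlib

def split_members(blob):
    """Return [(offset, compressed_length, payload)] for each gzip member in blob."""
    members = []
    offset = 0
    while offset < len(blob):
        d = zlib.decompressobj(wbits=31)  # 31 = 16 + 15: expect a gzip header
        payload = d.decompress(blob[offset:])
        payload += d.flush()
        if not d.eof:
            raise ValueError("truncated gzip member at offset %d" % offset)
        # Bytes after the end of this member were not consumed
        length = len(blob) - offset - len(d.unused_data)
        members.append((offset, length, payload))
        offset += length
    return members

# Two independently compressed members, concatenated as in a .warc.gz
first = gzip.compress(b"WARC/1.0\r\nWARC-Type: request\r\n\r\n")
second = gzip.compress(b"WARC/1.0\r\nWARC-Type: response\r\n\r\n")
members = split_members(first + second)
```

Each tuple carries the offset and compressed length of one member, which is the same kind of (length, offset) pair the pipeline above reads with `while read l o`.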
573 Time to convert unpackz to a library which takes a callback | |
574 alternative to an output file -- Done | |
575 | |
576 W/o using callback, timing and structure for what we need for | |
577 re-indexing task looks encouraging: | |
578 >: time ~/lib/python/cc/unpackz.py -o /dev/stdout /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz |egrep -aA20 '^WARC-Type: response' | cut -f 1 -d ' ' | egrep -a '^WARC-' |sus | tee >(wc -l 1>&2) | |
579 52468 WARC-Block-Digest: | |
580 52468 WARC-Concurrent-To: | |
581 52468 WARC-Date: | |
582 52468 WARC-Identified-Payload-Type: | |
583 52468 WARC-IP-Address: | |
584 52468 WARC-Payload-Digest: | |
585 52468 WARC-Record-ID: | |
586 52468 WARC-Target-URI: | |
587 52468 WARC-Type: | |
588 52468 WARC-Warcinfo-ID: | |
589 236 WARC-Truncated: | |
590 11 | |
591 | |
592 real 0m20.308s | |
593 user 0m19.720s | |
594 sys 0m4.505s | |
595 | |
596 Whole thing, with no pre-filtering: | |
597 | |
598 >: time ~/lib/python/cc/unpackz.py -o /dev/stdout /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz | cut -f 1 -d ' ' | egrep -a '^(WARC-|Content-|Last-Modified)' |sus | tee >(wc -l 1>&2) | |
599 211794 Content-Length: | |
600 211162 Content-Type: | |
601 159323 WARC-Target-URI: | |
602 159311 WARC-Warcinfo-ID: | |
603 159301 WARC-Record-ID: | |
604 159299 WARC-Date: | |
605 159297 WARC-Type: | |
606 105901 WARC-Concurrent-To: | |
607 105896 WARC-IP-Address: | |
608 52484 WARC-Block-Digest: | |
609 52484 WARC-Identified-Payload-Type: | |
610 52482 WARC-Payload-Digest: | |
611 9239 Last-Modified: | |
612 3941 Content-Language: | |
613 2262 Content-Security-Policy: | |
614 642 Content-language: | |
615 326 Content-Security-Policy-Report-Only: | |
616 238 WARC-Truncated: | |
617 114 Content-Disposition: | |
618 352 Content-*: | |
619 1 WARC-Filename: | |
620 42 | |
621 | |
622 real 0m30.896s | |
623 user 0m37.335s | |
624 sys 0m7.542s | |
625 | |
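A rough Python equivalent of the counting pipeline above, as a sketch only. It assumes `sus` is a local alias that counts duplicate lines and sorts by frequency (the notes do not define it); `count_header_names` is a hypothetical name and the sample lines are made up.

```python
import re
from collections import Counter

# Same pattern as the egrep in the pipeline above
HEADER = re.compile(rb"^(WARC-|Content-|Last-Modified)")

def count_header_names(lines):
    """Count the first space-delimited field of each matching line."""
    counts = Counter()
    for line in lines:
        name = line.rstrip(b"\r\n").split(b" ", 1)[0]  # like cut -f 1 -d ' '
        if HEADER.match(name):
            counts[name] += 1
    return counts

sample = [
    b"WARC/1.0\r\n",
    b"WARC-Type: response\r\n",
    b"Content-Length: 457\r\n",
    b"HTTP/1.1 200 OK\r\n",
    b"Content-Length: 120\r\n",
    b"Last-Modified: Mon, 19 Aug 2019 01:46:49 GMT\r\n",
]
counts = count_header_names(sample)
```

`counts.most_common()` then gives the descending listing, and `len(counts)` corresponds to the final `wc -l` figure.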
626 First 51 after WARC-Type: response | |
627 | |
628 >: time ~/lib/python/cc/unpackz.py -o /dev/stdout /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz |egrep -aA50 '^WARC-Type: response' | cut -f 1 -d ' ' | egrep -a '^(WARC-|Content-|Last-Modified)' |sus | tee >(wc -l 1>&2) | |
629 106775 Content-Length: | |
630 106485 Content-Type: | |
631 55215 WARC-Type: | |
632 55123 WARC-Date: | |
633 54988 WARC-Record-ID: | |
634 54551 WARC-Warcinfo-ID: | |
635 54246 WARC-Target-URI: | |
636 54025 WARC-Concurrent-To: | |
637 52806 WARC-IP-Address: | |
638 52468 WARC-Block-Digest: | |
639 52468 WARC-Identified-Payload-Type: | |
640 52468 WARC-Payload-Digest: | |
641 9230 Last-Modified: | |
642 3938 Content-Language: | |
643 2261 Content-Security-Policy: | |
644 639 Content-language: | |
645 324 Content-Security-Policy-Report-Only: | |
646 236 WARC-Truncated: | |
647 114 Content-Disposition: | |
648 342 Content-*: | |
649 41 | |
650 | |
651 real 0m21.483s | |
652 user 0m22.372s | |
653 sys 0m5.400s | |
654 | |
655 So, pre-filtering saves only ~9s and loses some headers: not worth the risk, let's try Python | |
656 | |
657 >: time ~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz|wc -l | |
658 9238 | |
659 | |
660 real 0m25.426s | |
661 user 0m23.201s | |
662 sys 0m0.711s | |
663 | |
664 Looks good, but why 9238 instead of 9239??? | |
665 | |
666 >: ~/lib/python/cc/unpackz.py -o /dev/stdout /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz | egrep -a '^Last-Modified: ' > /tmp/hst/lmo.tsv | |
667 | |
668 Argh. Serious bug in unpackz, wasn't handling cross-buffer-boundary | |
669 records correctly. Fixed. Redoing the above... | |
670 | |
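A minimal sketch of one way to keep member boundaries straight when a compressed record straddles two reads. It is a guess at the shape of the problem, not the actual fix in unpackz.py; `iter_members` and its `bufsize` parameter are hypothetical names.

```python
import gzip
import io
import zlib

def iter_members(stream, bufsize=1024 * 1024):
    """Yield (offset, compressed_length, payload) per gzip member in stream."""
    offset = 0                 # file offset where the current member starts
    pending = b""              # bytes read but not yet consumed by a member
    while True:
        d = zlib.decompressobj(wbits=31)
        parts = []
        consumed = 0
        while not d.eof:
            if not pending:
                pending = stream.read(bufsize)
                if not pending:
                    break      # end of input
            parts.append(d.decompress(pending))
            # Everything fed in was used except what is left in unused_data
            consumed += len(pending) - len(d.unused_data)
            pending = d.unused_data
        if not d.eof:
            if consumed:
                raise ValueError("truncated gzip member at offset %d" % offset)
            return             # clean end of file between members
        yield offset, consumed, b"".join(parts)
        offset += consumed

# Made-up records; the result should not depend on the buffer size
records = [b"WARC/1.0\r\nWARC-Type: %s\r\n\r\n" % t
           for t in (b"request", b"response", b"metadata")]
blob = b"".join(gzip.compress(r) for r in records)
small = list(iter_members(io.BytesIO(blob), bufsize=7))
large = list(iter_members(io.BytesIO(blob)))
```

Running the same input through a tiny buffer and a large one and comparing the results is a cheap regression check for this class of bug.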
671 No pre-filter: | |
672 >: uz /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz|egrep -c '^WARC/1\.0.$' | |
673 160297 | |
674 | |
675 >: time ~/lib/python/cc/unpackz.py -o /dev/stdout /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz | cut -f 1 -d ' ' | egrep -a '^(WARC-|Content-|Last-Modified)' |sus | tee >(wc -l 1>&2) | |
676 | |
677 213719 Content-Length: | |
678 213088 Content-Type: | |
679 160297 WARC-Date: | |
680 160297 WARC-Record-ID: | |
681 160297 WARC-Type: | |
682 160296 WARC-Target-URI: | |
683 160296 WARC-Warcinfo-ID: | |
684 106864 WARC-Concurrent-To: | |
685 106864 WARC-IP-Address: | |
686 53432 WARC-Block-Digest: [consistent with 160297 == (3 * 53432) + 1] | |
687 53432 WARC-Identified-Payload-Type: | |
688 53432 WARC-Payload-Digest: | |
689 9430 Last-Modified: | |
690 4006 Content-Language: | |
691 2325 Content-Security-Policy: | |
692 653 Content-language: | |
693 331 Content-Security-Policy-Report-Only: | |
694 298 WARC-Truncated: | |
695 128 Content-Disposition: | |
696 83 Content-Location: | |
697 67 Content-type: | |
698 51 Content-MD5: | |
699 45 Content-Script-Type: | |
700 42 Content-Style-Type: | |
701 31 Content-Transfer-Encoding: | |
702 13 Content-disposition: | |
703 8 Content-Md5: | |
704 5 Content-Description: | |
705 5 Content-script-type: | |
706 5 Content-style-type: | |
707 3 Content-transfer-encoding: | |
708 2 Content-Encoding-handler: | |
709 1 Content-DocumentTitle: | |
710 1 Content-Hash: | |
711 1 Content-ID: | |
712 1 Content-Legth: | |
713 1 Content-length: | |
714 1 Content-Range: | |
715 1 Content-Secure-Policy: | |
716 1 Content-security-policy: | |
717 1 Content-Type-Options: | |
718 1 WARC-Filename: | |
719 42 | |
720 | |
721 real 0m28.876s | |
722 user 0m35.703s | |
723 sys 0m6.976s | |
724 | |
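The arithmetic behind the record counts above, spelled out as a sketch. It assumes (an interpretation, not stated in the notes) that each capture contributes a request, a response and a metadata record, and that the file holds a single warcinfo record; the numbers are copied from the listing.

```python
captures = 53432             # WARC-Block-Digest count
records = 160297             # WARC-Type count, same as the WARC/1.0 line count
with_target_uri = 160296     # WARC-Target-URI count
with_concurrent_to = 106864  # WARC-Concurrent-To count

# Three records per capture plus one warcinfo record
assert records == 3 * captures + 1
# Every record except the warcinfo record has a target URI
assert with_target_uri == records - 1
# Two of the three records per capture carry WARC-Concurrent-To
assert with_concurrent_to == 2 * captures
```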
725 >: ~/lib/python/cc/unpackz.py -o /dev/stdout /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz | egrep -a '^Last-Modified: ' > /tmp/hst/lmo.tsv | |
726 >: wc -l /tmp/hst/lmo.tsv | |
727 9430 /tmp/hst/lmo.tsv | |
728 >: time ~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz > /tmp/hst/lm.tsv | |
729 | |
730 real 0m17.191s | |
731 user 0m15.739s | |
732 sys 0m0.594s | |
733 >: wc -l /tmp/hst/lm.tsv | |
734 9423 /tmp/hst/lm.tsv | |
735 | |
736 >: diff <(sed 's/^Last-Modified: //' /tmp/hst/lmo.tsv | tr -d '\r') <(cut -f 3 /tmp/hst/lm.tsv) | |
737 853d852 | |
738 < Mon, 19 Aug 2019 01:46:49 GMT | |
739 4058d4056 | |
740 < Tue, 03 Nov 2015 21:31:18 GMT<br /> | |
741 4405d4402 | |
742 < Mon, 19 Aug 2019 01:54:52 GMT | |
743 5237,5238d5233 | |
744 < 3 | |
745 < Asia/Amman | |
746 7009d7003 | |
747 < Mon, 19 Aug 2019 02:34:20 GMT | |
748 9198d9191 | |
749 < Mon, 19 Aug 2019 02:14:49 GMT | |
750 | |
751 All good. The only implausible case is | |
752 < Mon, 19 Aug 2019 01:54:52 GMT | |
753 which turns out to be a case of two Last-Modified headers in the | |
754 same response record's HTTP headers. RFCs 2616 and 7230 rule it | |
755 out but neither specifies a recovery, so first-wins is as good as | |
756 anything, and indeed RFC 6797 specifies that for its own header. | |
757 |
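A minimal sketch of the first-wins rule, assuming the HTTP header block of a response record is available as bytes. `first_header` is a hypothetical name, the sample block is constructed for illustration, and this is not the code in cdx_extras.py.

```python
def first_header(http_headers, name):
    """Return the value of the first header called name, or None.

    http_headers is the raw header block (status line plus header lines,
    CRLF-terminated); name is matched case-insensitively.
    """
    wanted = name.lower().encode("ascii") + b":"
    for line in http_headers.split(b"\r\n")[1:]:   # skip the status line
        if not line:
            break                                   # blank line ends the headers
        if line.lower().startswith(wanted):
            return line[len(wanted):].strip().decode("latin-1")
    return None

# Constructed example: a duplicated header plus a look-alike line in the body
block = (b"HTTP/1.1 200 OK\r\n"
         b"Content-Type: text/html\r\n"
         b"Last-Modified: Mon, 19 Aug 2019 01:54:52 GMT\r\n"
         b"last-modified: Mon, 19 Aug 2019 01:54:53 GMT\r\n"
         b"\r\n"
         b"Last-Modified: not a header, just body text\r\n")
value = first_header(block, "Last-Modified")
```

Stopping at the blank line also keeps body text that happens to start with `Last-Modified: ` out of the result, which a plain egrep over the whole record would count.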