comparison lurid3/notes.txt @ 73:1283a574260d

working on cross-check of warc2cdb
author Henry Thompson <ht@markup.co.uk>
date Sun, 09 Mar 2025 19:59:50 +0000
parents 7901ce4a39e3
children ff6ef190f901
72:7901ce4a39e3 73:1283a574260d
1628 >: uz idx/cdx-00101.gz | wc
1629 14681147 314748509 7090034893
1630 >: uz idx/cdx-00101.gz | cmp - /tmp/hst/cdb-00101
1631 >: echo $?
1632 0
1633
1634 OK, now to fill up 2023-40 so we can do a real trial.
1635
1636 Cythonised lmh.py and warc.py, still running about 14 seconds per file
1637 Don't actually need these yet, as 2023-40 has a full set of lmh
1638 outputs, but not many ks.tsv:
1639 >: ls */ks.tsv
1640 0/ks.tsv 15/ks.tsv 4/ks.tsv 68/ks.tsv
1641 12/ks.tsv 36/ks.tsv 56/ks.tsv best_two_by_nl1/ks.tsv
1642
1643 Note that to populate cdbs we don't need most of the logic in
1644 sort_date.py, because none of the sorting and merging is needed!
1645
1646 Hmm. In fact, just go straight to cdb_in.txt
1647
1648 >: time python3 -c 'import sys,warc2cdb; sys.exit(warc2cdb.main(*sys.argv[1:]))' 2023-40 15 0 /tmp/hst 2>/tmp/hst/15/errs
1649 /tmp/hst/15/lmh.cdb_in
1650
1651 real 223m28.517s
1652 user 168m34.751s
1653 sys 12m6.020s
1654
1655 This compares to 273 minutes for 806 files in the cleanup of segment
1656 95 (see end of old_notes). The best full segment was 96, i.e. 279
1657 minutes for 900 files. And those figures are for lmh.py only.
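
To put the older figures on the same footing as the "seconds per file" estimate above, a quick conversion sketch follows. Only the two (minutes, files) pairs quoted in the notes are used; the helper name is made up for illustration.

```python
def seconds_per_file(minutes: float, files: int) -> float:
    """Average wall-clock seconds spent per input file."""
    return minutes * 60 / files

# Figures quoted in the notes (lmh.py only)
seg95 = seconds_per_file(273, 806)   # cleanup of segment 95
seg96 = seconds_per_file(279, 900)   # best full segment, 96

print(f"segment 95 cleanup: {seg95:.1f} s/file")
print(f"segment 96:         {seg96:.1f} s/file")
```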
1658
1659 Are results the same?
1660 Old:
1661 >: wc -l *.tsv
1662 8247 ks_errs.tsv
1663 5324616 ks.tsv
1664 New:
1665 >: wc -l *
1666 9147 errs
1667 5324617 lmh.cdb_in
1668
1669 I _think_ that's good -- there's one line per input file added to the error
1670 output now (9147 - 8247 = 900), and one blank line at the end of the cdb_in file.
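
The trailing blank line is what the cdbmake input format expects: each record is `+klen,dlen:key->data` followed by a newline, and the whole input ends with one extra newline. The sketch below assumes lmh.cdb_in follows that format (the sed patterns used further down suggest it does) and that keys are timestamp plus URI with an epoch-seconds value; the function names and the sample pair are hypothetical.

```python
def cdb_record(key: bytes, value: bytes) -> bytes:
    """One record in cdbmake input format; lengths are byte counts."""
    return b"+%d,%d:%s->%s\n" % (len(key), len(value), key, value)

def cdb_input(pairs) -> bytes:
    """All records, then the single empty line that ends the input."""
    return b"".join(cdb_record(k, v) for k, v in pairs) + b"\n"

# Hypothetical sample pair, not taken from the real data
sample = [(b"20230922034234https://example.org/", b"1695354153")]
data = cdb_input(sample)
print(data.decode())
```

With N records the output has N + 1 lines, which matches the one-line difference between ks.tsv and lmh.cdb_in.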
1671 Old:
1672 >: cut -f 2- ks_errs.tsv |sus
1673 5420 cannot unpack non-iterable NoneType object
1674 2670 list index out of range
1675 113 year 641471 is out of range
1676 20 year 641474 is out of range
1677 9 hour must be in 0..23
1678 9 year 642435 is out of range
1679 3 year 642437 is out of range
1680 1 year 4262575 is out of range
1681 1 year 641769 is out of range
1682 1 year 642434 is out of range
1683 New:
1684 >: cut -f 2- <(fgrep -v beegfs /tmp/hst/15/errs) | cut -f 1-5 -d ' ' |sus
1685 8090 Invalid date value or format
1686 113 year 641471 is out of
1687 20 year 641474 is out of
1688 9 hour must be in 0..23
1689 9 year 642435 is out of
1690 3 year 642437 is out of
1691 1 year 4262575 is out of
1692 1 year 641769 is out of
1693 1 year 642434 is out of
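
A quick cross-check of the two tallies, using only the counts shown above: the two generic Python exceptions in the old output should add up to the new "Invalid date value or format" count, and the remaining difference in total error lines should be the per-file lines filtered out by `fgrep -v beegfs`.

```python
# Counts copied from the two listings above
old_generic = {
    "cannot unpack non-iterable NoneType object": 5420,
    "list index out of range": 2670,
}
new_generic = {"Invalid date value or format": 8090}

# The 'year ... out of range' / 'hour must be in 0..23' rows, same in both
shared = [113, 20, 9, 9, 3, 1, 1, 1]

old_total = sum(old_generic.values()) + sum(shared)
new_total = sum(new_generic.values()) + sum(shared)

# Lines in the new errs file not accounted for by real errors
per_file_lines = 9147 - new_total

print(old_total, new_total, per_file_lines)
```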
1694
1695 But this fails to fix the case where the URI in the original WARC is not
1696 %-encoded:
1697 >: diff -bw <(cut -f2- ks.tsv| sed 's/ //;s/\.0$//'|sort -k1,1) <(sed 's/^\+[0-9,]*://;s/->\([0-9]*\)$/ \1/' /tmp/hst/15/lmh.cdb_in_nopercent | sort -k1,1)|head
1698 4360c4361
1699 < 20230922034234https://www.hansaton.com/zh-hk/\u6D88\u8CBB\u8005.html 1695354153
1700 ---
1701 > 20230922034234https://www.hansaton.com/zh-hk/消費者.html 1695354153
1702 6574c6575
1703 < 20230922034248https://blog.arabtherapy.com/\u0639\u0644\u0627\u062C-\u0627\u0644\u062E\u062C\u0644-\u0627\u0644\u0627\u062C\u062A\u0645\u0627\u0639\u064A/ 1695354168
1704 ---
1705 > 20230922034248https://blog.arabtherapy.com/علاج-الخجل-الاجتماعي/ 1695354168
1706
1707 Could try fixing that at lookup time
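
A minimal sketch of what lookup-time normalisation might look like, assuming the stored keys are %-encoded UTF-8 and the query URI may arrive either as raw Unicode or with `\uXXXX` escapes as in the old ks.tsv. This is not what warc2cdb does; the function names are hypothetical, and `\u` escapes outside the BMP (surrogate pairs) are not handled.

```python
import re
from urllib.parse import quote

_U_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")

def unescape_u(s: str) -> str:
    """Turn literal \\uXXXX sequences into the characters they name."""
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), s)

def percent_encode(uri: str) -> str:
    """%-encode non-ASCII as UTF-8, leaving reserved characters and
    any existing %-escapes untouched (so the call is idempotent)."""
    return quote(uri, safe=":/?#[]@!$&'()*+,;=%")

def lookup_key(timestamp: str, uri: str) -> str:
    """Assumed key layout: 14-digit timestamp followed by the URI."""
    return timestamp + percent_encode(unescape_u(uri))

old_form = r"https://www.hansaton.com/zh-hk/\u6D88\u8CBB\u8005.html"
new_form = "https://www.hansaton.com/zh-hk/消費者.html"
print(lookup_key("20230922034234", old_form))
print(lookup_key("20230922034234", new_form))
```

Both spellings from the diff above map to the same key, which is the property a lookup-time fix would need.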
1708
1709 Fixed in warc2cdb, seems OK for timing:
1710 /tmp/hst/15/lmh.cdb_in
1711
1712 real 224m32.815s
1713 user 169m25.216s
1714 sys 12m7.867s
1715
1716 Lengths are right, difference with ks looks better, try the real
1717 thing:
1718
1719 >: cd /beegfs/common_crawl/CC-MAIN-2023-40/cdx/warc
1720 >: ls | parallel -j 10 "uz '{}' |cut -f 2,4,18 -d ' '|fgrep .15/warc/CC-MAIN | cut -f 1,2 -d ' ' | sort > '/tmp/hst/15/orig_{#}.cdx'" &
1721 >: sort -m *.cdx | sponge all.cdx
1722
1723 Nope:
1724 >: diff <(sed 's/->[0-9]*$/",/' test.cdb_in) all.cdx |less
1725 >: fgrep -n dryelf.com test.cdb_in all.cdx
1726 test.cdb_in:6710:20230922034248 "http://www.dryelf.com/files/DryCookieInstall.exe",
1727 test.cdb_in:1600193:20230922043436 "http://www.dryelf.com/files/DryLottoInstall.exe",
1728 all.cdx:12190045:20230922044250 "http://www.dryelf.com/downloads.asp?Sort=0&Cat=3",
1729 all.cdx:25401353:20230922053834 "http://www.dryelf.com/downloads.asp?Sort=2&Cat=0",
1730
1731 Stupid -- mime-detected has a space in it...
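
Splitting on spaces with `cut` shifts every later field as soon as one JSON value contains a space. A sketch of a more robust extraction, assuming each CDX line is `SURT timestamp {json}` with `url` and `filename` keys in the JSON part: split off only the first two fields and parse the rest as JSON. The sample line below is made up for illustration, and the function names are hypothetical.

```python
import json

def parse_cdx_line(line: str):
    """Split into SURT key, timestamp and the parsed JSON record."""
    surt, timestamp, blob = line.rstrip("\n").split(" ", 2)
    return surt, timestamp, json.loads(blob)

def segment_entries(lines, marker=".15/warc/CC-MAIN"):
    """Yield 'timestamp url' for records whose filename matches marker."""
    for line in lines:
        _, timestamp, rec = parse_cdx_line(line)
        if marker in rec.get("filename", ""):
            yield f"{timestamp} {rec['url']}"

# Made-up record with a space inside mime-detected
sample = (
    'com,example)/a 20230922034248 {"url": "http://example.com/a", '
    '"mime": "application/octet-stream", '
    '"mime-detected": "application/x-msdownload; format=pe32", '
    '"status": "200", '
    '"filename": "crawl-data/CC-MAIN-2023-40/segments/'
    '1695233506000.15/warc/CC-MAIN-x.warc.gz"}\n'
)
result = list(segment_entries([sample]))
print(result)
```

The output would still need sorting before it can be compared against the cdb_in side.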
1732 ================
1733
1734 Try it with the existing _per segment_ index we have for 2019-35
1735
1736 Assuming we have to key on segment / file and offset, as reconstructing the