comparison lurid3/notes.txt @ 73:1283a574260d
working on cross-check of warc2cdb
author | Henry Thompson <ht@markup.co.uk> |
---|---|
date | Sun, 09 Mar 2025 19:59:50 +0000 |
parents | 7901ce4a39e3 |
children | ff6ef190f901 |
72:7901ce4a39e3 | 73:1283a574260d |
---|---|
1628 >: uz idx/cdx-00101.gz | wc | 1628 >: uz idx/cdx-00101.gz | wc |
1629 14681147 314748509 7090034893 | 1629 14681147 314748509 7090034893 |
1630 >: uz idx/cdx-00101.gz | cmp - /tmp/hst/cdb-00101 | 1630 >: uz idx/cdx-00101.gz | cmp - /tmp/hst/cdb-00101 |
1631 >: echo $? | 1631 >: echo $? |
1632 0 | 1632 0 |
1633 | |
1634 OK, now to fill up 2023-40 so we can do a real trial. | |
1635 | |
1636 Cythonised lmh.py and warc.py, still running about 14 seconds per file | |
1637 Don't actually need these yet, as 2023-40 has a full set of lmh | |
1638 outputs, but not many ks.tsv: | |
1639 >: ls */ks.tsv | |
1640 0/ks.tsv 15/ks.tsv 4/ks.tsv 68/ks.tsv | |
1641 12/ks.tsv 36/ks.tsv 56/ks.tsv best_two_by_nl1/ks.tsv | |
1642 | |
1643 Note that to populate cdbs, we don't need most of the logic in | |
1644 sort_date.py, because none of the sorting and merging is needed! | |
1645 | |
1646 Hmm. In fact, just go straight to cdb_in.txt | |
1647 | |
1648 >: time python3 -c 'import sys,warc2cdb; sys.exit(warc2cdb.main(*sys.argv[1:]))' 2023-40 15 0 /tmp/hst 2>/tmp/hst/15/errs | |
1649 /tmp/hst/15/lmh.cdb_in | |
1650 | |
1651 real 223m28.517s | |
1652 user 168m34.751s | |
1653 sys 12m6.020s | |
1654 | |
1655 This compares to 273 minutes for 806 files in the cleanup of segment | |
1656 95, see end of old_notes. Best full segment was 96, i.e. 279 for | |
1657 900. And those figures are for lmh.py only | |
1658 | |
1659 Are results the same? | |
1660 Old: | |
1661 >: wc -l *.tsv | |
1662 8247 ks_errs.tsv | |
1663 5324616 ks.tsv | |
1664 New: | |
1665 >: wc -l * | |
1666 9147 errs | |
1667 5324617 lmh.cdb_in | |
1668 | |
1669 I _think_ that's good -- there's one line per input file added to the | |
1670 error output now, and one blank line at the end of the cdb_in file. | |
1671 Old: | |
1672 >: cut -f 2- ks_errs.tsv |sus | |
1673 5420 cannot unpack non-iterable NoneType object | |
1674 2670 list index out of range | |
1675 113 year 641471 is out of range | |
1676 20 year 641474 is out of range | |
1677 9 hour must be in 0..23 | |
1678 9 year 642435 is out of range | |
1679 3 year 642437 is out of range | |
1680 1 year 4262575 is out of range | |
1681 1 year 641769 is out of range | |
1682 1 year 642434 is out of range | |
1683 New: | |
1684 >: cut -f 2- <(fgrep -v beegfs /tmp/hst/15/errs) | cut -f 1-5 -d ' ' |sus | |
1685 8090 Invalid date value or format | |
1686 113 year 641471 is out of | |
1687 20 year 641474 is out of | |
1688 9 hour must be in 0..23 | |
1689 9 year 642435 is out of | |
1690 3 year 642437 is out of | |
1691 1 year 4262575 is out of | |
1692 1 year 641769 is out of | |
1693 1 year 642434 is out of | |
1694 | |
1695 But, failed to fix the case where the URI in the original WARC is not | |
1696 %-encoded: | |
1697 >: diff -bw <(cut -f2- ks.tsv| sed 's/ //;s/\.0$//'|sort -k1,1) <(sed 's/^\+[0-9,]*://;s/->\([0-9]*\)$/ \1/' /tmp/hst/15/lmh.cdb_in_nopercent | sort -k1,1)|head | |
1698 4360c4361 | |
1699 < 20230922034234https://www.hansaton.com/zh-hk/\u6D88\u8CBB\u8005.html 1695354153 | |
1700 --- | |
1701 > 20230922034234https://www.hansaton.com/zh-hk/消費者.html 1695354153 | |
1702 6574c6575 | |
1703 < 20230922034248https://blog.arabtherapy.com/\u0639\u0644\u0627\u062C-\u0627\u0644\u062E\u062C\u0644-\u0627\u0644\u0627\u062C\u062A\u0645\u0627\u0639\u064A/ 1695354168 | |
1704 --- | |
1705 > 20230922034248https://blog.arabtherapy.com/علاج-الخجل-الاجتماعي/ 1695354168 | |
1706 | |
1707 Could try fixing that at lookup time | |
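
One way the lookup-time fix mentioned above might look, as a minimal sketch rather than what warc2cdb actually does: map both spellings seen in the diff (the `\uXXXX` escapes from ks.tsv and the raw Unicode from lmh.cdb_in) to a single %-encoded key before comparing or looking up. The names `unescape_u`, `percent_encode` and `lookup_key` are hypothetical, and UTF-8 is assumed as the target encoding. Surrogate pairs in `\uXXXX` form are not handled.

```python
import re
from urllib.parse import quote

# Assumption: escapes are always exactly four hex digits, as in the diff above.
_U_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")

# RFC 3986 reserved characters plus '%', so that escapes already present
# in a URI are left alone instead of being encoded a second time.
_SAFE = ":/?#[]@!$&'()*+,;=%"


def unescape_u(uri: str) -> str:
    """Turn literal \\uXXXX sequences into the characters they name."""
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), uri)


def percent_encode(uri: str) -> str:
    """%-encode every non-ASCII character as UTF-8 bytes."""
    return quote(uri, safe=_SAFE)


def lookup_key(uri: str) -> str:
    """Hypothetical normal form shared by both spellings of a URI."""
    return percent_encode(unescape_u(uri))


if __name__ == "__main__":
    old = r"https://www.hansaton.com/zh-hk/\u6D88\u8CBB\u8005.html"
    new = "https://www.hansaton.com/zh-hk/消費者.html"
    print(lookup_key(old))
    print(lookup_key(old) == lookup_key(new))
```

Because `%` is kept in the safe set, applying `lookup_key` to an already-encoded URI leaves it unchanged, so it should be harmless to call on keys that are already in normal form.
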
1708 | |
1709 Fixed in warc2cdb, seems OK for timing: | |
1710 /tmp/hst/15/lmh.cdb_in | |
1711 | |
1712 real 224m32.815s | |
1713 user 169m25.216s | |
1714 sys 12m7.867s | |
1715 | |
1716 Lengths are right, difference with ks looks better, try the real | |
1717 thing: | |
1718 | |
1719 >: cd /beegfs/common_crawl/CC-MAIN-2023-40/cdx/warc | |
1720 >: ls | parallel -j 10 "uz '{}' |cut -f 2,4,18 -d ' '|fgrep .15/warc/CC-MAIN | cut -f 1,2 -d ' ' | sort > '/tmp/hst/15/orig_{#}.cdx'" & | |
1721 >: sort -m *.cdx | sponge all.cdx | |
1722 | |
1723 Nope: | |
1724 >: diff <(sed 's/->[0-9]*$/",/' test.cdb_in) all.cdx |less | |
1725 >: fgrep -n dryelf.com test.cdb_in all.cdx | |
1726 test.cdb_in:6710:20230922034248 "http://www.dryelf.com/files/DryCookieInstall.exe", | |
1727 test.cdb_in:1600193:20230922043436 "http://www.dryelf.com/files/DryLottoInstall.exe", | |
1728 all.cdx:12190045:20230922044250 "http://www.dryelf.com/downloads.asp?Sort=0&Cat=3", | |
1729 all.cdx:25401353:20230922053834 "http://www.dryelf.com/downloads.asp?Sort=2&Cat=0", | |
1730 | |
1731 Stupid -- mime-detected has a space in it... | |
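
The failure above comes from picking fields by position with `cut -d ' '`: a space inside any JSON value shifts every later field number. A hedged alternative is to split off only the first two space-separated fields and parse the rest as JSON. The sketch below assumes the usual Common Crawl index line layout (SURT key, timestamp, JSON object with `url` and `filename` keys); the function name `cdx_fields`, the sample line, its `mime-detected` value and its segment id are all made up for illustration.

```python
import json

# Same segment filter as the fgrep in the pipeline above.
SEGMENT_MARKER = ".15/warc/CC-MAIN"

# Made-up sample line: the mime-detected value and the segment id are
# illustrative only, chosen so that a JSON value contains a space.
SAMPLE = (
    'com,dryelf)/files/drycookieinstall.exe 20230922034248 '
    '{"url": "http://www.dryelf.com/files/DryCookieInstall.exe", '
    '"mime": "application/x-msdownload", '
    '"mime-detected": "application/x-dosexec; format=pe", '
    '"filename": "crawl-data/CC-MAIN-2023-40/segments/'
    '0000000000000.15/warc/CC-MAIN-sample.warc.gz"}'
)


def cdx_fields(line, marker=SEGMENT_MARKER):
    """Return (timestamp, url) for lines in the wanted segment, else None."""
    # Split at most twice, so spaces inside the JSON payload are harmless.
    _key, timestamp, payload = line.rstrip("\n").split(" ", 2)
    record = json.loads(payload)
    if marker not in record.get("filename", ""):
        return None
    return timestamp, record["url"]


if __name__ == "__main__":
    print(cdx_fields(SAMPLE))
```

This is slower than `cut`, so it may only be worth it as a cross-check on a sample, or as a fallback for lines whose field count differs from the norm.
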
1633 ================ | 1732 ================ |
1634 | 1733 |
1635 Try it with the existing _per segment_ index we have for 2019-35 | 1734 Try it with the existing _per segment_ index we have for 2019-35 |
1636 | 1735 |
1637 Assuming we have to key on segment / file and offset, as reconstructing the | 1736 Assuming we have to key on segment / file and offset, as reconstructing the |