changeset 62:bc0bdb649c08
tried cdb, slower by 2 OoM
author | Henry S. Thompson <ht@inf.ed.ac.uk> |
---|---|
date | Fri, 03 Jan 2025 13:35:14 +0000 |
parents | e6bab0972142 |
children | 663e55844c1d |
files | lurid3/notes.txt |
diffstat | 1 files changed, 79 insertions(+), 0 deletions(-) |
--- a/lurid3/notes.txt Thu Jan 02 18:55:11 2025 +0000
+++ b/lurid3/notes.txt Fri Jan 03 13:35:14 2025 +0000
@@ -949,6 +949,85 @@
   >: seq 2 9 | parallel -j 8 'time ~/lib/python/cc/lmh/test_hash.py -r 1 -p results/CC-MAIN-2019-35/warc_lmhx/ks_{}0-{}9.pickle -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_{}0-{}9.tsv'
 Slightly slower when running in parallel, but done
+
+Try using cdb file, has to be one seg per file because of 4GB limit of
+cdb C-code
+
+  >: ~/lib/python/cc/lmh/ks2cdb.py -f <(head -5236974 results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv) -c results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb_in
+  5236974
+  11.358113696798682
+  >: time cdbmake results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb results/CC-MAIN-2019-35/warc_lmhx/ks_0.tmp < results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb_in
+
+  real 0m4.981s
+  user 0m2.865s
+  sys 0m1.886s
+  >: ls -lh results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb_in
+  -rw-r--r-- 1 hst dc007 574M Jan 3 11:34 results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb_in
+  >: ls -lh results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb
+  -rw-r--r-- 1 hst dc007 642M Jan 3 11:34 results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb
+  >: ls -lh results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.{tsv,pickle}
+  -rw-r--r-- 1 hst dc007 5.4G Jan 2 13:41 results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.pickle
+  -rw-r--r-- 1 hst dc007 8.8G Oct 3 2023 results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv
+  >: cdbdump <results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb | head
+  +52,10:20190817225913http://85.163.4.0:9999/faces/login.jsp->1566082752
+  +58,10:20190817232933https://1.1.1.1/?source=https://www.tuppu.fi->1560198539
+  +34,10:20190822010151http://163.30.193.1/->1256804738
+  +64,10:20190825182530http://45.70.224.1:20026/index.php?tipo=minhaconta->1565613324
+  +32,10:20190818210532http://50.63.77.1/->1366940302
+  +41,10:20190825182606http://119.145.170.10:2160/->1497240281
+  +43,10:20190825142846http://71.43.189.10/dermorph/->1564555978
+  +53,10:20190825143020http://71.43.189.10/dermorph/about.html->1564555936
+  +55,10:20190825142857http://71.43.189.10/dermorph/contact.html->1564748606
+  +54,10:20190825142842http://71.43.189.10/dermorph/guided.html->1563634912
+  sing<4795>: cdbget 20190825142846http://71.43.189.10/dermorph/ <results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb
+  1564555978
+  >: cdbstats <results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb
+  records 5226422
+  d0 3920023
+  d1 748992
+  d2 269375
+  d3 124011
+  d4 64304
+  d5 36623
+  d6 21765
+  d7 13737
+  d8 8712
+  d9 5866
+  >9 13014
+  >: cdbtest <results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb
+  found: 5226422
+  different record: 0
+  bad length: 0
+  not found: 0
+  untested: 10552
+
+All good, _but_
+  >>> r=cdblib.Reader.from_file_path('results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb')
+  >>> r.get(b'xyzzy')
+  >>> r.get(b'20190825142846http://71.43.189.10/dermorph/')
+  b'1564555978'
+  >>> from timeit import Timer
+  >>> t=Timer("r.get(b'20190825142846http://71.43.189.10/dermorph/')",
+               globals=globals())
+  >>> t.timeit(100000)
+  0.30162674374878407
+  >>> t.timeit(100000)
+  0.30807616002857685
+  >>> t.timeit(100000)
+  0.30170613154768944
+
+So, 100 x slower than a dict :-(. Just checking that...
+  >>> t = Timer("b'20190825142846http://71.43.189.10/dermorph/' in d",globals={'d':d})
+  >>> r=cdblib.Reader.from_file_path('results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb')
+  >>> r.get(b'20190825142846http://71.43.189.10/dermorph/')
+  b'1564555978'
+  >>> s = Timer("r.get(b'20190825142846http://71.43.189.10/dermorph/')",
+  ...           globals=globals())
+  >>> (t.repeat(5,100000),s.repeat(5,100000))
+  [0.005662968382239342, 0.005780909210443497, 0.005478940904140472, 0.005713008344173431, 0.005547545850276947]
+  [0.30250774696469307, 0.303345350548625, 0.3002819549292326, 0.30161340720951557, 0.30262864381074905])
+
+so, forget cdb
 ================
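
Reviewer note on the "100 x slower" / "2 OoM" figures: the sketch below only redoes the arithmetic on the numbers already recorded in the final t.repeat/s.repeat output above, it does not re-run anything. The variable and function names are made up for illustration. Bear in mind that t times a membership test (`in d`) while s times `r.get(...)`, so the two are not quite the same operation.

```python
from math import log10
from statistics import mean

# Timings copied from the last t.repeat / s.repeat output in the notes:
# total seconds for 100000 lookups, five repeats each.
N_LOOKUPS = 100_000
dict_secs = [0.005662968382239342, 0.005780909210443497,
             0.005478940904140472, 0.005713008344173431,
             0.005547545850276947]
cdb_secs = [0.30250774696469307, 0.303345350548625,
            0.3002819549292326, 0.30161340720951557,
            0.30262864381074905]


def per_lookup_ns(totals, n=N_LOOKUPS):
    """Mean cost of one lookup in nanoseconds (hypothetical helper)."""
    return mean(totals) / n * 1e9


dict_ns = per_lookup_ns(dict_secs)
cdb_ns = per_lookup_ns(cdb_secs)
ratio = cdb_ns / dict_ns

print(f"dict: {dict_ns:.0f} ns/lookup")
print(f"cdb:  {cdb_ns:.0f} ns/lookup")
print(f"ratio: {ratio:.1f}x, i.e. {log10(ratio):.2f} orders of magnitude")
```

On these numbers the cdb reader comes out a bit over 50 times slower than the dict, so nearer 1.7 orders of magnitude than a full 2. That does not change the conclusion.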