comparison lurid3/notes.txt @ 51:dc24bb6e524f
done cdx_aux for segments 49--55 of 2019-35
author | Henry S. Thompson <ht@inf.ed.ac.uk> |
---|---|
date | Wed, 09 Oct 2024 22:55:27 +0100 |
parents | 5556c04c7597 |
children | 8dffb8aa33da |
50:5556c04c7597 | 51:dc24bb6e524f |
---|---|
872 741 | 872 741 |
873 sing<4046>: ls -lt CC-MAIN-2019-35/aug_cdx/49/|wc -l | 873 sing<4046>: ls -lt CC-MAIN-2019-35/aug_cdx/49/|wc -l |
874 372 | 874 372 |
875 | 875 |
876 Every file is being produced twice. | 876 Every file is being produced twice. |
877 | |
878 Took me a while to figure out my own code :-( | |
879 | |
880 >: sbatch --output=slurm_aug_cdx_49_360-599-out --time=01:00:00 --ntasks=10 -c 36 --exclusive $HOME/bin/runme.sh -m 49 49 $PWD -t 18 -b 'export resdir=CC-MAIN-2019-35/aug_cdx/$xarg | |
881 export SEG=$xarg | |
882 share_by_task.sh -f "%03g\n" -s 360 599 $n $task > /tmp/hst_$task' -i 'cat /tmp/hst_$task' 'export PYTHONPATH=./lib/python/cc:$PYTHONPATH | |
883 ~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/*.$SEG/orig/warc/CC-MAIN-*-*-00${arg}.warc.gz > $resdir/00${arg}.tsv' | |
884 | |
885 Oops, only 560 files per segment, not 600 | |
886 | |
887 Took 3.5 minutes for 200 files, so call it 10 minutes for 560, so do 6 more | |
888 segments in an hour: | |
889 | |
890 >: sbatch --output=slurm_aug_cdx_50-55_out --time=01:00:00 --ntasks=10 -c 36 --exclusive $HOME/bin/runme.sh -m 50 55 $PWD -t 18 -b 'export resdir=CC-MAIN-2019-35/aug_cdx/$xarg | |
891 mkdir -p $resdir | |
892 export SEG=$xarg | |
893 share_by_task.sh -f "%03g\n" -s 360 599 $n $task > /tmp/hst_$task' -i 'cat /tmp/hst_$task' 'export PYTHONPATH=./lib/python/cc:$PYTHONPATH | |
894 ~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/*.$SEG/orig/warc/CC-MAIN-*-*-00${arg}.warc.gz > $resdir/00${arg}.tsv' | |
895 | |
896 >: tail slurm_aug_cdx_50-55_out | |
897 ... | |
898 Wed Oct 9 22:25:47 BST 2024 Finished 55 | |
899 >: head -1 slurm_aug_cdx_50-55_out | |
900 Wed Oct 9 21:29:43 BST | |
901 56:04 | |
902 | |
903 >: du -s CC-MAIN-2019-35/aug_cdx | |
904 1,902,916 | |
905 | |
906 Not bad, so order 20GB for the whole thing (assuming du is counting 1K blocks) | |
907 | |
908 Next step, compare to my existing cdx with timestamp |
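
The timing arithmetic in the notes above can be re-checked with a small POSIX shell sketch. The helper names `to_secs` and `elapsed` are illustrative only and are not part of the tooling used in the notes. The sketch assumes both clock times fall on the same day, and takes the figures (200 files in 3.5 minutes, 560 files per segment, start 21:29:43, finish 22:25:47) directly from the log.

```shell
# Illustrative helpers only; these names do not come from the notes' tooling.

# Convert HH:MM:SS to seconds since midnight.
to_secs() {
  old_ifs=$IFS
  IFS=:
  set -- $1            # split the argument on ':' into $1 $2 $3
  IFS=$old_ifs
  # Drop one leading zero so values such as "09" are not parsed as octal.
  echo $(( ${1#0} * 3600 + ${2#0} * 60 + ${3#0} ))
}

# Elapsed time between two same-day clock times, printed as MM:SS.
elapsed() {
  secs=$(( $(to_secs "$2") - $(to_secs "$1") ))
  printf '%02d:%02d\n' $((secs / 60)) $((secs % 60))
}

# Estimate from the notes: 200 files took 3.5 minutes, a segment has 560 files.
# Work in tenths of a minute to stay within integer arithmetic.
per_segment_tenths=$(( 35 * 560 / 200 ))
printf 'estimated minutes per segment: %d.%d\n' \
  $((per_segment_tenths / 10)) $((per_segment_tenths % 10))
printf 'segments per hour: %d\n' $(( 600 / per_segment_tenths ))

# Actual wall-clock time of the 50-55 run, from the head/tail of the slurm log.
elapsed 21:29:43 22:25:47
```

This gives 9.8 minutes per segment, 6 segments per hour, and an elapsed time of 56:04, which agrees with the figures recorded in the notes.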