comparison lurid3/notes.txt @ 51:dc24bb6e524f

done cdx_aux for segments 49--55 of 2019-35
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Wed, 09 Oct 2024 22:55:27 +0100
parents 5556c04c7597
children 8dffb8aa33da
50:5556c04c7597 51:dc24bb6e524f
872 741
873 sing<4046>: ls -lt CC-MAIN-2019-35/aug_cdx/49/|wc -l
874 372
875
876 Every file is being produced twice.
877
878 Took me a while to figure out my own code :-(
879
880 >: sbatch --output=slurm_aug_cdx_49_360-599-out --time=01:00:00 --ntasks=10 -c 36 --exclusive $HOME/bin/runme.sh -m 49 49 $PWD -t 18 -b 'export resdir=CC-MAIN-2019-35/aug_cdx/$xarg
881 export SEG=$xarg
882 share_by_task.sh -f "%03g\n" -s 360 599 $n $task > /tmp/hst_$task' -i 'cat /tmp/hst_$task' 'export PYTHONPATH=./lib/python/cc:$PYTHONPATH
883 ~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/*.$SEG/orig/warc/CC-MAIN-*-*-00${arg}.warc.gz > $resdir/00${arg}.tsv'
884
885 Oops, there are only 560 files per segment, not 600
886
887 Took 3.5 minutes for 200 files, so call it 10 minutes for all 560, so 6 more segments fit in an
888 hour:
889
890 >: sbatch --output=slurm_aug_cdx_50-55_out --time=01:00:00 --ntasks=10 -c 36 --exclusive $HOME/bin/runme.sh -m 50 55 $PWD -t 18 -b 'export resdir=CC-MAIN-2019-35/aug_cdx/$xarg
891 mkdir -p $resdir
892 export SEG=$xarg
893 share_by_task.sh -f "%03g\n" -s 360 599 $n $task > /tmp/hst_$task' -i 'cat /tmp/hst_$task' 'export PYTHONPATH=./lib/python/cc:$PYTHONPATH
894 ~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/*.$SEG/orig/warc/CC-MAIN-*-*-00${arg}.warc.gz > $resdir/00${arg}.tsv'
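
The timing estimate behind this job can be checked with a few lines of Python. This is only a sketch: it assumes run time scales linearly with the number of files and that segments are processed one after another, and the function names are made up for illustration.

```python
# Sanity check of the timing estimate, using only figures from the notes.

def minutes_for(files, sample_files=200, sample_minutes=3.5):
    """Scale the measured sample time linearly to `files` files (assumed linear)."""
    return sample_minutes * files / sample_files

def segments_per_budget(budget_minutes, minutes_per_segment):
    """Whole segments that fit in the budget when run one after another."""
    return int(budget_minutes // minutes_per_segment)

per_segment = minutes_for(560)
print(f"{per_segment:.1f} minutes per segment")
print(segments_per_budget(60, per_segment), "segments per hour")
```

With 9.8 minutes per segment there is very little slack in a one-hour limit for six segments, so a longer --time would be the safer choice.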
895
896 >: tail slurm_aug_cdx_50-55_out
897 ...
898 Wed Oct 9 22:25:47 BST 2024 Finished 55
899 >: head -1 slurm_aug_cdx_50-55_out
900 Wed Oct 9 21:29:43 BST
901 56:04
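
The elapsed time can be recomputed from the two timestamps in the log. A minimal sketch, assuming both timestamps fall on the same day; the helper names are hypothetical.

```python
from datetime import datetime

def elapsed(start, end, fmt="%H:%M:%S"):
    """Difference between two same-day clock times."""
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)

def as_mm_ss(delta):
    """Format a timedelta as minutes:seconds."""
    total = int(delta.total_seconds())
    return f"{total // 60}:{total % 60:02d}"

run = elapsed("21:29:43", "22:25:47")
print(as_mm_ss(run))                                # total for segments 50-55
print(f"{run.total_seconds() / 6 / 60:.1f} minutes per segment")
```

Dividing by the six segments gives a per-segment figure to compare against the 10-minute estimate.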
902
903 >: du -s CC-MAIN-2019-35/aug_cdx
904 1,902,916
905
906 Not bad, so order 20GB for the whole thing
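
A rough extrapolation of the du figure, as a sketch only. It rests on three assumptions not stated in the notes: that du is reporting 1 KiB blocks, that aug_cdx currently holds just segments 49 to 55, and that the crawl has 100 segments of similar size.

```python
# Extrapolate total output size from the du figure in the notes.
DU_KIB = 1_902_916      # du -s output, assumed to be in 1 KiB blocks
SEGMENTS_DONE = 7       # assumption: only segments 49-55 are present
TOTAL_SEGMENTS = 100    # assumption: 100 segments in the crawl

def extrapolate_gib(du_kib, done, total):
    """Scale the measured size linearly to all segments, in GiB."""
    return du_kib / done * total / 1024**2

print(f"{extrapolate_gib(DU_KIB, SEGMENTS_DONE, TOTAL_SEGMENTS):.1f} GiB")
```

If the directory already holds more segments than assumed here, the total comes out proportionally smaller.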
907
908 Next step, compare to my existing cdx with timestamp