comparison lurid3/notes.txt @ 50:5556c04c7597

all of 49?
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Wed, 09 Oct 2024 09:43:07 +0100
parents deeac8a0a682
children dc24bb6e524f
49:deeac8a0a682 50:5556c04c7597
850 remove fileno from list of matches
851 read and store a new line for fileno [handle EOF]
852 if list of matches is empty, redo setting of lowest
853
854 Re-sort the result by actual key
855
856 Meanwhile, get a whole test set:
857 sbatch --output=slurm_aug_cdx_49_10-599-out --time=01:00:00 --ntasks=10 -c 36 --exclusive $HOME/bin/runme.sh -m 00 59 $PWD -t 18 -b 'export resdir=CC-MAIN-2019-35/aug_cdx/49
858 export DEC=$xarg' "export PYTHONPATH=./lib/python/cc:$PYTHONPATH
859 seq 0 9 | parallel -j 10 \"~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-*-*-00\${DEC}'{}'.warc.gz > \$resdir/00\${DEC}'{}'.tsv\""
860
861 Actually finished 360 in the hour.
862
863 Leaving 360-599 still to do:
864
865 sbatch --output=slurm_aug_cdx_49_360-599-out --time=01:00:00 --ntasks=10 -c 36 --exclusive $HOME/bin/runme.sh -m 36 59 $PWD -t 18 -b 'export resdir=CC-MAIN-2019-35/aug_cdx/49
866 export DEC=$xarg' "export PYTHONPATH=./lib/python/cc:$PYTHONPATH
867 seq 0 9 | parallel -j 10 \"~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-*-*-00\${DEC}'{}'.warc.gz > \$resdir/00\${DEC}'{}'.tsv\""
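
To make the expected file count concrete, here is a minimal sketch that lists the output names a run should produce. It assumes that runme.sh -m A B sets DEC (via xarg) to every two-digit value from A to B, which the notes suggest but do not show; expected_outputs is a hypothetical helper, not part of runme.sh or cdx_extras.py.

```shell
# Hypothetical helper: print the .tsv names one run should produce.
# Assumption: DEC takes every value from $1 to $2 (as two digits), and
# the inner "seq 0 9" supplies the final digit, as in 00${DEC}{}.tsv.
expected_outputs() {
  for dec in $(seq "$1" "$2"); do
    for digit in $(seq 0 9); do
      printf '00%02d%d.tsv\n' "$dec" "$digit"
    done
  done
}

# Count the files the second run (-m 36 59) should add
expected_outputs 36 59 | wc -l
```

Under that assumption the second run covers 24 values of DEC times 10 digits, i.e. 240 files, 00360.tsv through 00599.tsv, and the full run (0 to 59) covers 600.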
868
869 But something is wrong. The number of jobs is way off:
870
871 5>: fgrep -c parallel slurm_aug_cdx_49_0-359-out
872 741
873 sing<4046>: ls -lt CC-MAIN-2019-35/aug_cdx/49/|wc -l
874 372
875
876 Every file is being produced twice.
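
The check above can be wrapped up as a rough sketch. It assumes, as the fgrep does, that each launched job leaves one line containing "parallel" in the Slurm log; jobs_vs_files is a hypothetical helper. It counts regular files directly, which avoids the extra "total" line that ls -l adds to the wc -l figure.

```shell
# Hypothetical helper: print "<jobs> <files>" for a Slurm log and a
# result directory, so a 2:1 ratio is easy to spot.
# Assumption: one log line containing "parallel" per launched job.
jobs_vs_files() {
  log=$1
  resdir=$2
  # grep -c exits non-zero on zero matches but still prints 0
  njobs=$(grep -F -c parallel "$log" || true)
  # Count regular files only, not subdirectories
  nfiles=$(find "$resdir" -maxdepth 1 -type f | wc -l | tr -d ' ')
  echo "$njobs $nfiles"
}
```

For example, jobs_vs_files slurm_aug_cdx_49_0-359-out CC-MAIN-2019-35/aug_cdx/49 would print the two counts side by side.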