changeset 50:5556c04c7597

all of 49?
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Wed, 09 Oct 2024 09:43:07 +0100
parents deeac8a0a682
children dc24bb6e524f
files lurid3/notes.txt
diffstat 1 files changed, 22 insertions(+), 0 deletions(-)
--- a/lurid3/notes.txt	Fri Oct 04 21:41:53 2024 +0100
+++ b/lurid3/notes.txt	Wed Oct 09 09:43:07 2024 +0100
@@ -852,3 +852,25 @@
   if list of matches is empty, redo setting of lowest
 
 Resort the result by actual key
+
+Meanwhile, get a whole test set:
+sbatch --output=slurm_aug_cdx_49_10-599-out --time=01:00:00 --ntasks=10 -c 36 --exclusive $HOME/bin/runme.sh -m 00 59 $PWD -t 18 -b 'export resdir=CC-MAIN-2019-35/aug_cdx/49
+export DEC=$xarg' "export PYTHONPATH=./lib/python/cc:$PYTHONPATH
+seq 0 9 | parallel -j 10 \"~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-*-*-00\${DEC}'{}'.warc.gz > \$resdir/00\${DEC}'{}'.tsv\""
+
+Actually finished 360 in the hour.
+
+Leaving
+
+sbatch --output=slurm_aug_cdx_49_360-599-out --time=01:00:00 --ntasks=10 -c 36 --exclusive $HOME/bin/runme.sh -m 36 59 $PWD -t 18 -b 'export resdir=CC-MAIN-2019-35/aug_cdx/49
+export DEC=$xarg' "export PYTHONPATH=./lib/python/cc:$PYTHONPATH
+seq 0 9 | parallel -j 10 \"~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-*-*-00\${DEC}'{}'.warc.gz > \$resdir/00\${DEC}'{}'.tsv\""
+
+But something is wrong. The number of jobs is all wrong:
+  
+  5>: fgrep -c parallel slurm_aug_cdx_49_0-359-out
+  741
+  sing<4046>: ls -lt CC-MAIN-2019-35/aug_cdx/49/|wc -l
+  372
+
+Every file is being produced twice.
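
One way to check what a run actually produced is to generate the expected output names and compare them with the contents of resdir. This is a minimal sketch, not part of the notes themselves. It assumes the names follow the 00${DEC}{}.tsv pattern from the commands above and that DEC is a zero-padded two-digit value; how runme.sh expands -m is not shown here, so that padding is an assumption. expected_names and check_resdir are hypothetical helper names.

```shell
# Hypothetical helper: print the expected .tsv names for DEC values LO..HI.
# Pass plain integers (0 35, not 00 35), since printf reads 08 as bad octal.
expected_names() {
  dec=$1
  while [ "$dec" -le "$2" ]; do
    for d in 0 1 2 3 4 5 6 7 8 9; do
      # Assumed pattern: literal 00, two-digit DEC, then one digit from seq 0 9
      printf '00%02d%d.tsv\n' "$dec" "$d"
    done
    dec=$((dec + 1))
  done
}

# Hypothetical helper: compare directory DIR against the names for LO..HI.
# Unindented lines are expected but missing; tab-indented lines are
# present but not expected.
check_resdir() {
  tmp=$(mktemp)
  expected_names "$2" "$3" | LC_ALL=C sort > "$tmp"
  ls "$1" | LC_ALL=C sort | LC_ALL=C comm -3 "$tmp" -
  rm -f "$tmp"
}
```

For the first batch this would be run as check_resdir CC-MAIN-2019-35/aug_cdx/49 0 35. Note also that ls -l prints a leading "total" line for a directory, so the 372 lines counted above correspond to 371 entries.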