log

age author description
16 months ago Henry S. Thompson sic
16 months ago Henry S. Thompson get in/out file management working right
16 months ago Henry S. Thompson refactor to provide for buffer overflow fix
16 months ago Henry S. Thompson bug-fix wrt 1st time,
16 months ago Henry S. Thompson make extra file info optional
16 months ago Henry S. Thompson forget parallel, just do (default 2) parallel single threads
16 months ago Henry S. Thompson add missing makedir
16 months ago Henry S. Thompson now does one named segment only
16 months ago Henry S. Thompson resurrect parallel fetch
16 months ago Henry S. Thompson convert to single thread,
16 months ago Henry S. Thompson avoid global name conflict
17 months ago Henry S. Thompson moved from /beegfs/common-crawl to get under .hg
17 months ago Henry S. Thompson fix typo
17 months ago Henry S. Thompson build cluster.idx
17 months ago Henry S. Thompson no longer using cmp_to_key
17 months ago Henry S. Thompson handle -m case, support src from cmdline [mergefix]
17 months ago Henry S. Thompson new branch to save do_idx.sh from abandoned merge fixup [mergefix]
17 months ago Henry S. Thompson try to get the counts right, particularly when re-merging
17 months ago Henry S. Thompson for use in debugging, see notes and tests 2, 17, merge test
17 months ago Henry S. Thompson add various www deletion cases
17 months ago Henry S. Thompson iterate WPAT fix with improved pattern
17 months ago Henry S. Thompson loosen WARC pattern to avoid failure from "mime" = "{...}" intervening
17 months ago Henry S. Thompson refactor to enable rerun with fixup,
17 months ago Henry S. Thompson correct mistaken futnsz test,
17 months ago Henry S. Thompson change path to merge_date.py
17 months ago Henry S. Thompson remove the mistaken deletion of NONPRINT,
17 months ago Henry S. Thompson fix a bad fix and a bad test for the televida case
17 months ago Henry S. Thompson fix and test for all-decimal host
17 months ago Henry S. Thompson no import in lmh.__init__ any more
17 months ago Henry S. Thompson importing in __init__ causes problems
17 months ago Henry S. Thompson commented out duplicate, handle comments better
17 months ago Henry S. Thompson more corner case tests
17 months ago Henry S. Thompson tweaks to get all tests through #14
17 months ago Henry S. Thompson get 7f (two cases) and %25 working
17 months ago Henry S. Thompson add televida case test
17 months ago Henry S. Thompson add test description
17 months ago Henry S. Thompson importable just in case
17 months ago Henry S. Thompson move most of the hacking into fixGoogleCanon,
17 months ago Henry S. Thompson forget assert, allow multiple failures
17 months ago Henry S. Thompson x
17 months ago Henry S. Thompson found right place for \x7f hack, maybe
17 months ago Henry S. Thompson readability
17 months ago Henry S. Thompson x
17 months ago Henry S. Thompson refactor to sort a module in an lmh package
17 months ago Henry S. Thompson start some regression tests
17 months ago Henry S. Thompson creating lmh package
17 months ago Henry S. Thompson moved from bin
17 months ago Henry S. Thompson minor bug wrt EOF of final cdx input file
17 months ago Henry S. Thompson replicate two extremely-corner cases of the way
17 months ago Henry S. Thompson a bit more logging
17 months ago Henry S. Thompson a bit more logging
17 months ago Henry S. Thompson robotstxt and crawldiagnostics get free ride,
17 months ago Henry S. Thompson a few more from ecclerig,
17 months ago Henry S. Thompson refactor datestream reading,
17 months ago Henry S. Thompson more faithful regexps and non-byte uri output
17 months ago Henry S. Thompson one uncommitted fix from quentin
17 months ago Henry Thompson pass in debug flag(s) to merge_date.py
17 months ago Henry Thompson loosen must-match criterion in the both-messy case
17 months ago Henry Thompson one more sid fix,
17 months ago Henry S. Thompson still working on sessionID problems
17 months ago Henry Thompson first try
18 months ago Henry S. Thompson switch to gzip -7 to get comparable compressed cdx block size
18 months ago Henry S. Thompson use my own Canonicalizer to fix more obscure
18 months ago Henry S. Thompson re-instate logging splits for .idx
18 months ago Henry S. Thompson reinstate better check to start queuing,
18 months ago Henry S. Thompson bug4 fixed, but that created a new, earlier bug
18 months ago Henry S. Thompson rework handling of session key problem
18 months ago Henry S. Thompson initialise paths for csing
18 months ago Henry S. Thompson d'oh
18 months ago Henry S. Thompson include full URI in output
18 months ago Henry S. Thompson try to do csing correctly on compute nodes
18 months ago Henry S. Thompson version which outputs more identification,
18 months ago Henry S. Thompson last version before giving up on approach based only on key and datestamp
18 months ago Henry S. Thompson improve reordering, still failing on cdx-00004
18 months ago Henry S. Thompson attempt at reordering if necessary
18 months ago Henry S. Thompson mostly working, but need to reorder in case of cfid and friends
18 months ago Henry S. Thompson flip loops
18 months ago Henry S. Thompson merge a stream of ks files with a set of cdx files
18 months ago Henry S. Thompson final keystroke fixes, recurse and decimal www stripping
18 months ago Henry S. Thompson final keystroke fixes,
18 months ago Henry S. Thompson handle double .www, more keep-me chars
18 months ago Henry S. Thompson work-around for weird handling of %-encoding in Java impl. of SURT
18 months ago Henry Thompson merge, including pointless fix wrt pq
18 months ago Henry Thompson use surt instead of trying to create index term by hand
18 months ago Henry Thompson merge
18 months ago Henry Thompson stale
18 months ago Henry Thompson catching up by hand with markup version,
18 months ago Henry S. Thompson include timestamp
18 months ago Henry S. Thompson include query
18 months ago Henry S. Thompson make CC's own sorting explicit
19 months ago Henry S. Thompson handle corner cases with final . and initial www..+
19 months ago Henry S. Thompson handle %-encoded utf-8 as idna
19 months ago Henry S. Thompson merge
19 months ago Henry S. Thompson compute timestamps, key and sort lmh lines
19 months ago Henry S. Thompson work with csing
19 months ago Henry S. Thompson get man -k working
19 months ago Henry Thompson for warc_lmh slurm logs
19 months ago Henry S. Thompson for timing analysis
19 months ago Henry S. Thompson add support for multiple calls to srun with a counter
19 months ago Henry S. Thompson fix eof bug, expand error messages
19 months ago Henry S. Thompson part 2 is now working for all types
19 months ago Henry S. Thompson add a response-only test
19 months ago Henry S. Thompson revert to just showing first LM
20 months ago Henry S. Thompson more tests
20 months ago Henry S. Thompson Test 2 works with parts=1,2,3.
20 months ago Henry S. Thompson whole working
20 months ago Henry S. Thompson tests 1 & 2 now working
20 months ago Henry S. Thompson avoid slicing buf by using memoryview to save copying
20 months ago Henry S. Thompson but skip at eobp is not working (with test 2)
20 months ago Henry S. Thompson works with all types, part=1
20 months ago Henry S. Thompson rework completely to refill as much as possible only when necessary,
20 months ago Henry S. Thompson finds multiples
20 months ago Henry S. Thompson little steps
20 months ago Henry S. Thompson made 1 mean 1, still losing after a while
20 months ago Henry S. Thompson better debugging output
20 months ago Henry S. Thompson working better, gets confused by 3-part response
20 months ago Henry S. Thompson a bit better
20 months ago Henry S. Thompson just barely working for 1, need to rethink buffering
20 months ago Henry S. Thompson starting on conversion to direct-querying of buffer
20 months ago Henry S. Thompson sic
20 months ago Henry S. Thompson support on-board unzipping, reduce buffer size to 2MB
20 months ago Henry S. Thompson make test 1 idempotent
20 months ago Henry S. Thompson just count part length
20 months ago Henry S. Thompson get EOF right, finally
20 months ago Henry S. Thompson make warc.py a library, separate out testing
20 months ago Henry S. Thompson correct comment
20 months ago Henry S. Thompson add lots more debugging output,
20 months ago Henry S. Thompson moved from home bin