Mon, 27 Nov 2023 22:14:53 +0000 |
Henry S. Thompson |
add head_map fn
|
Mon, 27 Nov 2023 18:25:39 +0000 |
Henry S. Thompson |
add explore_deltas and predict analysis fns
|
Sun, 26 Nov 2023 21:24:38 +0000 |
Henry S. Thompson |
rename to avoid name clash with scipy.stats
|
Fri, 24 Nov 2023 20:41:03 +0000 |
Henry S. Thompson |
move to class with local vars instead of many globals
|
Fri, 24 Nov 2023 20:40:09 +0000 |
Henry S. Thompson |
renamed to by_interval.py
|
Fri, 24 Nov 2023 20:39:08 +0000 |
Henry S. Thompson |
renamed from spearman.py
|
Fri, 24 Nov 2023 20:38:39 +0000 |
Henry S. Thompson |
renamed to stats.py
|
Fri, 24 Nov 2023 19:52:52 +0000 |
Henry S. Thompson |
do the __main__ thing
|
Fri, 24 Nov 2023 19:52:14 +0000 |
Henry S. Thompson |
put results in numbered subdirs
|
Fri, 24 Nov 2023 19:50:12 +0000 |
Henry S. Thompson |
add minimal logging and don't return until finished
|
Wed, 15 Nov 2023 10:24:32 +0000 |
Henry S. Thompson |
should work for months also now
|
Wed, 15 Nov 2023 09:36:23 +0000 |
Henry S. Thompson |
cross-language confusion :-)
|
Mon, 06 Nov 2023 15:55:57 +0000 |
Henry S. Thompson |
LM plot for multiple crawls, magnitude or %age
|
Fri, 03 Nov 2023 19:05:54 +0000 |
Henry S. Thompson |
can overlay the two
|
Thu, 02 Nov 2023 15:38:39 +0000 |
Henry S. Thompson |
fix output year
|
Thu, 02 Nov 2023 13:49:02 +0000 |
Henry S. Thompson |
sic
|
Tue, 31 Oct 2023 14:05:12 +0000 |
Henry S. Thompson |
sic
|
Tue, 31 Oct 2023 14:04:24 +0000 |
Henry S. Thompson |
get in/out file management working right
|
Tue, 31 Oct 2023 14:03:02 +0000 |
Henry S. Thompson |
refactor to provide for buffer overflow fix
|
Tue, 31 Oct 2023 14:01:50 +0000 |
Henry S. Thompson |
bug-fix wrt 1st time,
|
Mon, 30 Oct 2023 12:19:53 +0000 |
Henry S. Thompson |
make extra file info optional
|
Wed, 25 Oct 2023 23:01:59 +0100 |
Henry S. Thompson |
forget parallel, just do (default 2) parallel single threads
|
Wed, 25 Oct 2023 23:00:45 +0100 |
Henry S. Thompson |
add missing makedir
|
Tue, 24 Oct 2023 16:59:23 +0100 |
Henry S. Thompson |
now does one named segment only
|
Tue, 24 Oct 2023 16:58:44 +0100 |
Henry S. Thompson |
resurrect parallel fetch
|
Tue, 24 Oct 2023 14:34:58 +0100 |
Henry S. Thompson |
convert to single thread,
|
Tue, 24 Oct 2023 14:26:36 +0100 |
Henry S. Thompson |
avoid global name conflict
|
Wed, 11 Oct 2023 12:51:06 +0100 |
Henry S. Thompson |
moved from /beegfs/common-crawl to get under .hg
|
Wed, 11 Oct 2023 12:50:29 +0100 |
Henry S. Thompson |
fix typo
|
Fri, 06 Oct 2023 15:06:53 +0100 |
Henry S. Thompson |
build cluster.idx
|
Fri, 06 Oct 2023 15:05:55 +0100 |
Henry S. Thompson |
no longer using cmp_to_key
|
Wed, 04 Oct 2023 20:04:34 +0100 |
Henry S. Thompson |
handle -m case, support src from cmdline
mergefix
|
Thu, 05 Oct 2023 10:42:15 +0100 |
Henry S. Thompson |
new branch to save do_idx.sh from abandoned merge fixup
mergefix
|
Wed, 04 Oct 2023 18:53:55 +0100 |
Henry S. Thompson |
try to get the counts right, particularly when re-merging
|
Wed, 04 Oct 2023 18:51:56 +0100 |
Henry S. Thompson |
for use in debugging, see notes and tests 2, 17, merge test
|
Tue, 03 Oct 2023 17:45:57 +0100 |
Henry S. Thompson |
add various www deletion cases
|
Tue, 03 Oct 2023 17:44:59 +0100 |
Henry S. Thompson |
iterate WPAT fix with improved pattern
|
Tue, 03 Oct 2023 17:43:52 +0100 |
Henry S. Thompson |
loosen WARC pattern to avoid failure from "mime" = "{...}" intervening
|
Mon, 02 Oct 2023 18:56:50 +0100 |
Henry S. Thompson |
refactor to enable rerun with fixup,
|
Mon, 02 Oct 2023 18:55:48 +0100 |
Henry S. Thompson |
correct mistaken futnsz test,
|
Mon, 02 Oct 2023 18:54:10 +0100 |
Henry S. Thompson |
change path to merge_date.py
|
Mon, 02 Oct 2023 18:52:43 +0100 |
Henry S. Thompson |
remove the mistaken deletion of NONPRINT,
|
Sat, 30 Sep 2023 18:04:15 +0100 |
Henry S. Thompson |
fix a bad fix and a bad test for the televida case
|
Sat, 30 Sep 2023 14:13:19 +0100 |
Henry S. Thompson |
fix and test for all-decimal host
|
Sat, 30 Sep 2023 14:12:39 +0100 |
Henry S. Thompson |
no import in lmh.__init__ any more
|
Sat, 30 Sep 2023 14:11:49 +0100 |
Henry S. Thompson |
importing in __init__ causes problems
|
Fri, 29 Sep 2023 15:59:34 +0100 |
Henry S. Thompson |
commented out duplicate, handle comments better
|
Fri, 29 Sep 2023 15:14:29 +0100 |
Henry S. Thompson |
more corner case tests
|
Fri, 29 Sep 2023 15:13:51 +0100 |
Henry S. Thompson |
tweaks to get all tests through #14
|
Thu, 28 Sep 2023 18:31:23 +0100 |
Henry S. Thompson |
get 7f (two cases) and %25 working
|
Thu, 28 Sep 2023 18:30:48 +0100 |
Henry S. Thompson |
add televida case test
|
Thu, 28 Sep 2023 16:36:15 +0100 |
Henry S. Thompson |
add test description
|
Thu, 28 Sep 2023 16:35:39 +0100 |
Henry S. Thompson |
importable just in case
|
Thu, 28 Sep 2023 16:34:49 +0100 |
Henry S. Thompson |
move most of the hacking into fixGoogleCanon,
|
Thu, 28 Sep 2023 16:10:05 +0100 |
Henry S. Thompson |
forget assert, allow multiple failures
|
Thu, 28 Sep 2023 16:09:38 +0100 |
Henry S. Thompson |
x
|
Thu, 28 Sep 2023 14:08:36 +0100 |
Henry S. Thompson |
found right place for \x7f hack, maybe
|
Thu, 28 Sep 2023 14:06:11 +0100 |
Henry S. Thompson |
readability
|
Thu, 28 Sep 2023 11:00:36 +0100 |
Henry S. Thompson |
x
|
Thu, 28 Sep 2023 11:00:24 +0100 |
Henry S. Thompson |
refactor to sort a module in an lmh package
|
Thu, 28 Sep 2023 10:54:12 +0100 |
Henry S. Thompson |
start some regression tests
|
Thu, 28 Sep 2023 09:01:18 +0100 |
Henry S. Thompson |
creating lmh package
|
Thu, 28 Sep 2023 08:46:01 +0100 |
Henry S. Thompson |
moved from bin
|
Wed, 27 Sep 2023 17:29:51 +0100 |
Henry S. Thompson |
minor bug wrt EOF of final cdx input file
|
Wed, 27 Sep 2023 17:29:09 +0100 |
Henry S. Thompson |
replicate two extremely-corner cases of the way
|
Tue, 26 Sep 2023 18:55:43 +0100 |
Henry S. Thompson |
a bit more logging
|
Tue, 26 Sep 2023 18:55:11 +0100 |
Henry S. Thompson |
a bit more logging
|
Tue, 26 Sep 2023 17:42:57 +0100 |
Henry S. Thompson |
robotstxt and crawldiagnostics get free ride,
|
Tue, 26 Sep 2023 14:18:40 +0100 |
Henry S. Thompson |
a few more from ecclerig,
|
Tue, 26 Sep 2023 09:03:47 +0100 |
Henry S. Thompson |
refactor datestream reading,
|
Mon, 25 Sep 2023 23:53:13 +0100 |
Henry S. Thompson |
more faithful regexps and non-byte uri output
|
Fri, 22 Sep 2023 15:27:28 +0100 |
Henry S. Thompson |
one uncommitted fix from quentin
|
Tue, 19 Sep 2023 19:40:58 +0100 |
Henry Thompson |
pass in debug flag(s) to merge_date.py
|
Tue, 19 Sep 2023 19:29:41 +0100 |
Henry Thompson |
loosen must-match criterion in the both-messy case
|
Tue, 19 Sep 2023 19:28:34 +0100 |
Henry Thompson |
one more sid fix,
|
Sun, 17 Sep 2023 15:18:11 +0100 |
Henry S. Thompson |
working on sessionID problems, still
|
Thu, 14 Sep 2023 19:27:23 +0100 |
Henry Thompson |
first try
|
Wed, 13 Sep 2023 16:48:43 +0100 |
Henry S. Thompson |
switch to gzip -7 to get comparable compressed cdx block size
|
Wed, 13 Sep 2023 12:41:55 +0100 |
Henry S. Thompson |
use my own Canonicalizer to fix more obscure
|
Wed, 13 Sep 2023 12:40:39 +0100 |
Henry S. Thompson |
re-instate logging splits for .idx
|
Tue, 12 Sep 2023 12:14:04 +0100 |
Henry S. Thompson |
reinstate better check to start queuing,
|
Mon, 11 Sep 2023 22:06:45 +0100 |
Henry S. Thompson |
bug4 fixed, but that created a new, earlier bug
|
Mon, 11 Sep 2023 12:56:47 +0100 |
Henry S. Thompson |
rework handling of session key problem
|
Fri, 08 Sep 2023 21:40:52 +0100 |
Henry S. Thompson |
initialise paths for csing
|
Fri, 08 Sep 2023 21:40:06 +0100 |
Henry S. Thompson |
d'oh
|
Fri, 08 Sep 2023 18:06:54 +0100 |
Henry S. Thompson |
include full URI in output
|
Fri, 08 Sep 2023 18:05:57 +0100 |
Henry S. Thompson |
try to do csing correctly on compute nodes
|
Fri, 08 Sep 2023 09:29:25 +0100 |
Henry S. Thompson |
version which outputs more identification,
|
Thu, 07 Sep 2023 18:03:55 +0100 |
Henry S. Thompson |
last version before giving up on approach based only on key and datestamp
|
Wed, 06 Sep 2023 18:51:21 +0100 |
Henry S. Thompson |
improve reordering, still failing on cdx-00004
|
Tue, 05 Sep 2023 17:33:29 +0100 |
Henry S. Thompson |
attempt at reordering if necessary
|
Tue, 05 Sep 2023 17:32:46 +0100 |
Henry S. Thompson |
mostly working, but need to reorder in case of cfid and friends
|
Thu, 31 Aug 2023 14:14:21 +0100 |
Henry S. Thompson |
flip loops
|
Wed, 30 Aug 2023 21:49:43 +0100 |
Henry S. Thompson |
merge a stream of ks files with a set of cdx files
|
Wed, 30 Aug 2023 11:11:31 +0100 |
Henry S. Thompson |
final keystroke fixes, recurse and decimal www stripping
|
Wed, 30 Aug 2023 11:10:54 +0100 |
Henry S. Thompson |
final keystroke fixes,
|
Mon, 28 Aug 2023 21:07:43 +0100 |
Henry S. Thompson |
handle double .www, more keep-me chars
|
Thu, 24 Aug 2023 18:21:41 +0100 |
Henry S. Thompson |
work-around for weird handling of %-encoding in Java impl. of SURT
|
Mon, 21 Aug 2023 13:06:20 -0400 |
Henry Thompson |
merge, including pointless fix wrt pq
|
Sat, 19 Aug 2023 16:33:23 -0400 |
Henry Thompson |
use surt instead of trying to create index term by hand
|
Sat, 19 Aug 2023 16:02:29 -0400 |
Henry Thompson |
merge
|
Sat, 19 Aug 2023 15:58:38 -0400 |
Henry Thompson |
stale
|
Sat, 19 Aug 2023 15:53:59 -0400 |
Henry Thompson |
catching up by hand with markup version,
|
Mon, 21 Aug 2023 13:37:07 +0100 |
Henry S. Thompson |
include timestamp
|
Sun, 20 Aug 2023 00:28:43 +0100 |
Henry S. Thompson |
include query
|
Fri, 18 Aug 2023 18:25:54 +0100 |
Henry S. Thompson |
make CC's own sorting explicit
|
Thu, 10 Aug 2023 22:14:49 +0100 |
Henry S. Thompson |
handle corner cases with final . and initial www..+
|
Wed, 09 Aug 2023 02:01:32 +0100 |
Henry S. Thompson |
handle %-encoded utf-8 as idna
|
Tue, 08 Aug 2023 17:48:29 +0100 |
Henry S. Thompson |
merge
|
Tue, 08 Aug 2023 17:47:27 +0100 |
Henry S. Thompson |
compute timestamps, key and sort lmh lines
|
Tue, 08 Aug 2023 17:46:20 +0100 |
Henry S. Thompson |
work with csing
|
Tue, 08 Aug 2023 17:46:02 +0100 |
Henry S. Thompson |
get man -k working
|
Fri, 28 Jul 2023 00:50:13 +0100 |
Henry Thompson |
for warc_lmh slurm logs
|
Wed, 26 Jul 2023 18:42:19 +0100 |
Henry S. Thompson |
for timing analysis
|
Fri, 21 Jul 2023 11:37:47 +0100 |
Henry S. Thompson |
add support for multiple calls to srun with a counter
|
Thu, 20 Jul 2023 10:32:55 +0100 |
Henry S. Thompson |
fix eof bug, expand error messages
|
Wed, 19 Jul 2023 13:20:46 +0100 |
Henry S. Thompson |
part 2 is now working for all types
|
Wed, 19 Jul 2023 13:19:58 +0100 |
Henry S. Thompson |
add a response-only test
|
Wed, 19 Jul 2023 13:19:42 +0100 |
Henry S. Thompson |
revert to just showing first LM
|
Fri, 14 Jul 2023 17:39:14 +0100 |
Henry S. Thompson |
more tests
|