15 months ago |
Henry S. Thompson |
change heatmap to by percentile
|
15 months ago |
Henry S. Thompson |
with heat
|
15 months ago |
Henry S. Thompson |
heat map for mime vs. nl1 vs. len
|
15 months ago |
Henry S. Thompson |
add head_map fn
|
15 months ago |
Henry S. Thompson |
add explore_deltas and predict analysis fns
|
15 months ago |
Henry S. Thompson |
rename to avoid name clash with scipy.stats
|
15 months ago |
Henry S. Thompson |
move to class with local vars instead of many globals
|
15 months ago |
Henry S. Thompson |
renamed to by_interval.py
|
15 months ago |
Henry S. Thompson |
renamed from spearman.py
|
15 months ago |
Henry S. Thompson |
renamed to stats.py
|
15 months ago |
Henry S. Thompson |
do the __main__ thing
|
15 months ago |
Henry S. Thompson |
put results in numbered subdirs
|
15 months ago |
Henry S. Thompson |
add minimal logging and don't return until finished
|
15 months ago |
Henry S. Thompson |
should work for months also now
|
15 months ago |
Henry S. Thompson |
cross-language confusion :-)
|
16 months ago |
Henry S. Thompson |
LM plot for multiple crawls, magnitude or %age
|
16 months ago |
Henry S. Thompson |
can overlay the two
|
16 months ago |
Henry S. Thompson |
fix output year
|
16 months ago |
Henry S. Thompson |
sic
|
16 months ago |
Henry S. Thompson |
sic
|
16 months ago |
Henry S. Thompson |
get in/out file management working right
|
16 months ago |
Henry S. Thompson |
refactor to provide for buffer overflow fix
|
16 months ago |
Henry S. Thompson |
bug-fix wrt 1st time,
|
16 months ago |
Henry S. Thompson |
make extra file info optional
|
16 months ago |
Henry S. Thompson |
forget parallel, just do (default 2) parallel single threads
|
16 months ago |
Henry S. Thompson |
add missing makedir
|
16 months ago |
Henry S. Thompson |
now does one named segment only
|
16 months ago |
Henry S. Thompson |
resurrect parallel fetch
|
16 months ago |
Henry S. Thompson |
convert to single thread,
|
16 months ago |
Henry S. Thompson |
avoid global name conflict
|
17 months ago |
Henry S. Thompson |
moved from /beegfs/common-crawl to get under .hg
|
17 months ago |
Henry S. Thompson |
fix typo
|
17 months ago |
Henry S. Thompson |
build cluster.idx
|
17 months ago |
Henry S. Thompson |
no longer using cmp_to_key
|
17 months ago |
Henry S. Thompson |
handle -m case, support src from cmdline
mergefix
|
17 months ago |
Henry S. Thompson |
new branch to save do_idx.sh from abandoned merge fixup
mergefix
|
17 months ago |
Henry S. Thompson |
try to get the counts right, particularly when re-merging
|
17 months ago |
Henry S. Thompson |
for use in debugging, see notes and tests 2, 17, merge test
|
17 months ago |
Henry S. Thompson |
add various www deletion cases
|
17 months ago |
Henry S. Thompson |
iterate WPAT fix with improved pattern
|
17 months ago |
Henry S. Thompson |
loosen WARC pattern to avoid failure from "mime" = "{...}" intervening
|
17 months ago |
Henry S. Thompson |
refactor to enable rerun with fixup,
|
17 months ago |
Henry S. Thompson |
correct mistaken futnsz test,
|
17 months ago |
Henry S. Thompson |
change path to merge_date.py
|
17 months ago |
Henry S. Thompson |
remove the mistaken deletion of NONPRINT,
|
17 months ago |
Henry S. Thompson |
fix a bad fix and a bad test for the televida case
|
17 months ago |
Henry S. Thompson |
fix and test for all-decimal host
|
17 months ago |
Henry S. Thompson |
no import in lmh.__init__ any more
|
17 months ago |
Henry S. Thompson |
importing in __init__ causes problems
|
17 months ago |
Henry S. Thompson |
commented out duplicate, handle comments better
|
17 months ago |
Henry S. Thompson |
more corner case tests
|
17 months ago |
Henry S. Thompson |
tweaks to get all tests through #14
|
17 months ago |
Henry S. Thompson |
get 7f (two cases) and %25 working
|
17 months ago |
Henry S. Thompson |
add televida case test
|
17 months ago |
Henry S. Thompson |
add test description
|
17 months ago |
Henry S. Thompson |
importable just in case
|
17 months ago |
Henry S. Thompson |
move most of the hacking into fixGoogleCanon,
|
17 months ago |
Henry S. Thompson |
forget assert, allow multiple failures
|
17 months ago |
Henry S. Thompson |
x
|
17 months ago |
Henry S. Thompson |
found right place for \x7f hack, maybe
|
17 months ago |
Henry S. Thompson |
readability
|
17 months ago |
Henry S. Thompson |
x
|
17 months ago |
Henry S. Thompson |
refactor to sort a module in an lmh package
|
17 months ago |
Henry S. Thompson |
start some regression tests
|
17 months ago |
Henry S. Thompson |
creating lmh package
|
17 months ago |
Henry S. Thompson |
moved from bin
|
17 months ago |
Henry S. Thompson |
minor bug wrt EOF of final cdx input file
|
17 months ago |
Henry S. Thompson |
replicate two extremely-corner cases of the way
|
17 months ago |
Henry S. Thompson |
a bit more logging
|
17 months ago |
Henry S. Thompson |
a bit more logging
|
17 months ago |
Henry S. Thompson |
robotstxt and crawldiagnostics get free ride,
|
17 months ago |
Henry S. Thompson |
a few more from ecclerig,
|
17 months ago |
Henry S. Thompson |
refactor datestream reading,
|
17 months ago |
Henry S. Thompson |
more faithful regexps and non-byte uri output
|
17 months ago |
Henry S. Thompson |
one uncommitted fix from quentin
|
17 months ago |
Henry Thompson |
pass in debug flag(s) to merge_date.py
|
17 months ago |
Henry Thompson |
loosen must-match criterion in the both-messy case
|
17 months ago |
Henry Thompson |
one more sid fix,
|
17 months ago |
Henry S. Thompson |
working on sessionID problems, still
|
17 months ago |
Henry Thompson |
first try
|
17 months ago |
Henry S. Thompson |
switch to gzip -7 to get comparable compressed cdx block size
|
17 months ago |
Henry S. Thompson |
use my own Canonicalizer to fix more obscure
|
17 months ago |
Henry S. Thompson |
re-instate logging splits for .idx
|
18 months ago |
Henry S. Thompson |
reinstate better check to start queuing,
|
18 months ago |
Henry S. Thompson |
bug4 fixed, but that created a new, earlier bug
|
18 months ago |
Henry S. Thompson |
rework handling of session key problem
|
18 months ago |
Henry S. Thompson |
initialise paths for csing
|
18 months ago |
Henry S. Thompson |
d'oh
|
18 months ago |
Henry S. Thompson |
include full URI in output
|
18 months ago |
Henry S. Thompson |
try to do csing correctly on compute nodes
|
18 months ago |
Henry S. Thompson |
version which outputs more identification,
|
18 months ago |
Henry S. Thompson |
last version before giving up on approach based only on key and datestamp
|
18 months ago |
Henry S. Thompson |
improve reordering, still failing on cdx-00004
|
18 months ago |
Henry S. Thompson |
attempt at reordering if necessary
|
18 months ago |
Henry S. Thompson |
mostly working, but need to reorder in case of cfid and friends
|
18 months ago |
Henry S. Thompson |
flip loops
|