5 weeks ago |
Henry S. Thompson |
prepare a ks..tsv file for indexing into a cdb
|
5 weeks ago |
Henry S. Thompson |
renamed cpython class Cdb to CCdb to avoid name conflict with cdb.Cdb
|
5 weeks ago |
Henry S. Thompson |
work with libcdb.a
|
6 weeks ago |
Henry S. Thompson |
value from memory view working
|
6 weeks ago |
Henry S. Thompson |
try using cdb as C library
|
6 weeks ago |
Henry S. Thompson |
add some cython decoration, not much effect
|
6 weeks ago |
Henry S. Thompson |
run with login shell
|
6 weeks ago |
Henry S. Thompson |
tweak XEmacs font/key bindings
|
6 weeks ago |
Henry S. Thompson |
tweak XEmacs font
|
2 months ago |
Henry S. Thompson |
time the unpickling
|
2 months ago |
Henry S. Thompson |
with bloom prefilter
|
2 months ago |
Henry S. Thompson |
try adding lm to existing index from ks_0-9
|
2 months ago |
Henry S. Thompson |
output bytes, pickle and save dict if -p, trim lm value to int
|
2 months ago |
Henry S. Thompson |
test big dict for associating lm timestamp with cc timestamp+uri
|
5 months ago |
Henry S. Thompson |
working together works well to provide what's needed to update a cdx to include lastmod where possible
|
5 months ago |
Henry S. Thompson |
make into a library, entry point def unpackz(infileName, callback, outfile = None),
|
5 months ago |
Henry S. Thompson |
cleaned up indentation to 2 spaces throughout
|
5 months ago |
Henry S. Thompson |
take bufsize from cmdline
|
5 months ago |
Henry S. Thompson |
EOF problems fixed, seems to work
|
5 months ago |
Henry S. Thompson |
working, but last count/offset not being written
|
5 months ago |
Henry S. Thompson |
fix error message
|
5 months ago |
Henry S. Thompson |
csing disabled for now
|
5 months ago |
Henry S. Thompson |
font hacking, see also lib/xemacs/common-init.el
|
5 months ago |
Henry S. Thompson |
new default from CC themselves
|
5 months ago |
Henry S. Thompson |
for debugging?
|
10 months ago |
Henry S. Thompson |
for use in Stuttgart, maybe
|
12 months ago |
Henry S. Thompson |
xxx
|
12 months ago |
Henry S. Thompson |
merge
|
12 months ago |
Henry S. Thompson |
post-processing
|
12 months ago |
Henry S. Thompson |
sic
|
12 months ago |
Henry S. Thompson |
compute offset between LM and crawl timestamp
|
12 months ago |
Henry S. Thompson |
sic
|
12 months ago |
Henry S. Thompson |
rebuild to match triple fig line colour
|
12 months ago |
Henry S. Thompson |
rebuild with more consistent appearance
|
12 months ago |
Henry S. Thompson |
merge
|
12 months ago |
Henry S. Thompson |
replaced mean_lens by w or wo bogon
|
12 months ago |
Henry S. Thompson |
now using clean 2005 count
|
12 months ago |
Henry Thompson |
minor addition?
|
12 months ago |
Henry S. Thompson |
merge
|
12 months ago |
Henry S. Thompson |
what is this?
|
12 months ago |
Henry Thompson |
add percentage of non-latin by crawl table
|
12 months ago |
Henry Thompson |
tld change investigation
|
12 months ago |
Henry S. Thompson |
nl1 and tld summary results
|
12 months ago |
Henry S. Thompson |
correct Usage
|
12 months ago |
Henry S. Thompson |
csing-related tweaks
|
12 months ago |
Henry S. Thompson |
merge
|
12 months ago |
Henry S. Thompson |
see Paul:Documents/HTalks/WebSci2024
|
13 months ago |
Henry S. Thompson |
add some debugging info
|
13 months ago |
Henry S. Thompson |
use 2-digit suffixes,
|
15 months ago |
Henry S. Thompson |
sic
|
15 months ago |
Henry S. Thompson |
sic
|
15 months ago |
Henry S. Thompson |
added back missing years
|
15 months ago |
Henry S. Thompson |
support semilogy from cmd line
|
15 months ago |
Henry S. Thompson |
means of all columns in length analyses
|
15 months ago |
Henry S. Thompson |
normalise % counts by non-empty bases only
|
15 months ago |
Henry S. Thompson |
new plots various
|
15 months ago |
Henry S. Thompson |
get single graph working, tweak params various
|
15 months ago |
Henry S. Thompson |
compute (component) uri lengths and a few other properties
|
15 months ago |
Henry S. Thompson |
with three tracks from two years
|
15 months ago |
Henry S. Thompson |
for pub
|
15 months ago |
Henry S. Thompson |
tweaked formatting
|
15 months ago |
Henry S. Thompson |
excel rewrote, no important changes (?)
|
15 months ago |
Henry S. Thompson |
replace wrong one with right one
|
15 months ago |
Henry S. Thompson |
merge
|
15 months ago |
Henry S. Thompson |
implement alternative confidence measure using stats.bootstrap,
|
15 months ago |
Henry S. Thompson |
for LMh percentile
|
15 months ago |
Henry S. Thompson |
decorated
|
15 months ago |
Henry S. Thompson |
merge
|
15 months ago |
Henry S. Thompson |
can't add props to DescribeResult
|
15 months ago |
Henry S. Thompson |
for 2023-40
|
15 months ago |
Henry S. Thompson |
with decorations
|
15 months ago |
Henry S. Thompson |
excel rewrote, no important changes (?)
|
15 months ago |
Henry S. Thompson |
with percentile instead of raw mean correl
|
15 months ago |
Henry S. Thompson |
change heatmap to by percentile
|
15 months ago |
Henry S. Thompson |
with heat
|
15 months ago |
Henry S. Thompson |
heat map for mime vs. nl1 vs. len
|
15 months ago |
Henry S. Thompson |
add head_map fn
|
15 months ago |
Henry S. Thompson |
add explore_deltas and predict analysis fns
|
15 months ago |
Henry S. Thompson |
rename to avoid name clash with scipy.stats
|
15 months ago |
Henry S. Thompson |
move to class with local vars instead of many globals
|
15 months ago |
Henry S. Thompson |
renamed to by_interval.py
|
15 months ago |
Henry S. Thompson |
renamed from spearman.py
|
15 months ago |
Henry S. Thompson |
renamed to stats.py
|
15 months ago |
Henry S. Thompson |
do the __main__ thing
|
15 months ago |
Henry S. Thompson |
put results in numbered subdirs
|
15 months ago |
Henry S. Thompson |
add minimal logging and don't return until finished
|
15 months ago |
Henry S. Thompson |
should work for months also now
|
15 months ago |
Henry S. Thompson |
cross-language confusion :-)
|
16 months ago |
Henry S. Thompson |
LM plot for multiple crawls, magnitude or %age
|
16 months ago |
Henry S. Thompson |
can overlay the two
|
16 months ago |
Henry S. Thompson |
fix output year
|
16 months ago |
Henry S. Thompson |
sic
|
16 months ago |
Henry S. Thompson |
sic
|
16 months ago |
Henry S. Thompson |
get in/out file management working right
|
16 months ago |
Henry S. Thompson |
refactor to provide for buffer overflow fix
|
16 months ago |
Henry S. Thompson |
bug-fix wrt 1st time,
|
16 months ago |
Henry S. Thompson |
make extra file info optional
|
16 months ago |
Henry S. Thompson |
forget parallel, just do (default 2) parallel single threads
|
16 months ago |
Henry S. Thompson |
add missing makedir
|
16 months ago |
Henry S. Thompson |
now does one named segment only
|
16 months ago |
Henry S. Thompson |
resurrect parallel fetch
|
16 months ago |
Henry S. Thompson |
convert to single thread,
|
16 months ago |
Henry S. Thompson |
avoid global name conflict
|
17 months ago |
Henry S. Thompson |
moved from /beegfs/common-crawl to get under .hg
|
17 months ago |
Henry S. Thompson |
fix typo
|
17 months ago |
Henry S. Thompson |
build cluster.idx
|
17 months ago |
Henry S. Thompson |
no longer using cmp_to_key
|
17 months ago |
Henry S. Thompson |
handle -m case, support src from cmdline (branch: mergefix)
|
17 months ago |
Henry S. Thompson |
new branch to save do_idx.sh from abandoned merge fixup (branch: mergefix)
|
17 months ago |
Henry S. Thompson |
try to get the counts right, particularly when re-merging
|
17 months ago |
Henry S. Thompson |
for use in debugging, see notes and tests 2, 17, merge test
|
17 months ago |
Henry S. Thompson |
add various www deletion cases
|
17 months ago |
Henry S. Thompson |
iterate WPAT fix with improved pattern
|
17 months ago |
Henry S. Thompson |
loosen WARC pattern to avoid failure from "mime" = "{...}" intervening
|
17 months ago |
Henry S. Thompson |
refactor to enable rerun with fixup,
|
17 months ago |
Henry S. Thompson |
correct mistaken futnsz test,
|
17 months ago |
Henry S. Thompson |
change path to merge_date.py
|
17 months ago |
Henry S. Thompson |
remove the mistaken deletion of NONPRINT,
|
17 months ago |
Henry S. Thompson |
fix a bad fix and a bad test for the televida case
|
17 months ago |
Henry S. Thompson |
fix and test for all-decimal host
|
17 months ago |
Henry S. Thompson |
no import in lmh.__init__ any more
|
17 months ago |
Henry S. Thompson |
importing in __init__ causes problems
|
17 months ago |
Henry S. Thompson |
commented out duplicate, handle comments better
|
17 months ago |
Henry S. Thompson |
more corner case tests
|
17 months ago |
Henry S. Thompson |
tweaks to get all tests through #14
|
17 months ago |
Henry S. Thompson |
get 7f (two cases) and %25 working
|
17 months ago |
Henry S. Thompson |
add televida case test
|
17 months ago |
Henry S. Thompson |
add test description
|
17 months ago |
Henry S. Thompson |
importable just in case
|
17 months ago |
Henry S. Thompson |
move most of the hacking into fixGoogleCanon,
|
17 months ago |
Henry S. Thompson |
forget assert, allow multiple failures
|
17 months ago |
Henry S. Thompson |
x
|
17 months ago |
Henry S. Thompson |
found right place for \x7f hack, maybe
|
17 months ago |
Henry S. Thompson |
readability
|
17 months ago |
Henry S. Thompson |
x
|
17 months ago |
Henry S. Thompson |
refactor to sort a module in an lmh package
|
17 months ago |
Henry S. Thompson |
start some regression tests
|
17 months ago |
Henry S. Thompson |
creating lmh package
|
17 months ago |
Henry S. Thompson |
moved from bin
|
17 months ago |
Henry S. Thompson |
minor bug wrt EOF of final cdx input file
|
17 months ago |
Henry S. Thompson |
replicate two extremely-corner cases of the way
|
17 months ago |
Henry S. Thompson |
a bit more logging
|
17 months ago |
Henry S. Thompson |
a bit more logging
|
17 months ago |
Henry S. Thompson |
robotstxt and crawldiagnostics get free ride,
|
17 months ago |
Henry S. Thompson |
a few more from ecclerig,
|
17 months ago |
Henry S. Thompson |
refactor datestream reading,
|
17 months ago |
Henry S. Thompson |
more faithful regexps and non-byte uri output
|
17 months ago |
Henry S. Thompson |
one uncommitted fix from quentin
|
17 months ago |
Henry Thompson |
pass in debug flag(s) to merge_date.py
|
17 months ago |
Henry Thompson |
loosen must-match criterion in the both-messy case
|
17 months ago |
Henry Thompson |
one more sid fix,
|
17 months ago |
Henry S. Thompson |
still working on sessionID problems
|
17 months ago |
Henry Thompson |
first try
|
17 months ago |
Henry S. Thompson |
switch to gzip -7 to get comparable compressed cdx block size
|
17 months ago |
Henry S. Thompson |
use my own Canonicalizer to fix more obscure
|
17 months ago |
Henry S. Thompson |
re-instate logging splits for .idx
|
18 months ago |
Henry S. Thompson |
reinstate better check to start queuing,
|
18 months ago |
Henry S. Thompson |
bug4 fixed, but that created a new, earlier bug
|
18 months ago |
Henry S. Thompson |
rework handling of session key problem
|
18 months ago |
Henry S. Thompson |
initialise paths for csing
|
18 months ago |
Henry S. Thompson |
d'oh
|
18 months ago |
Henry S. Thompson |
include full URI in output
|
18 months ago |
Henry S. Thompson |
try to do csing correctly on compute nodes
|
18 months ago |
Henry S. Thompson |
version which outputs more identification,
|
18 months ago |
Henry S. Thompson |
last version before giving up on approach based only on key and datestamp
|
18 months ago |
Henry S. Thompson |
improve reordering, still failing on cdx-00004
|
18 months ago |
Henry S. Thompson |
attempt at reordering if necessary
|
18 months ago |
Henry S. Thompson |
mostly working, but need to reorder in case of cfid and friends
|
18 months ago |
Henry S. Thompson |
flip loops
|
18 months ago |
Henry S. Thompson |
merge a stream of ks files with a set of cdx files
|
18 months ago |
Henry S. Thompson |
final keystroke fixes, recurse and decimal www stripping
|
18 months ago |
Henry S. Thompson |
final keystroke fixes,
|
18 months ago |
Henry S. Thompson |
handle double .www, more keep-me chars
|
18 months ago |
Henry S. Thompson |
work-around for weird handling of %-encoding in Java impl. of SURT
|
18 months ago |
Henry Thompson |
merge, including pointless fix wrt pq
|
18 months ago |
Henry Thompson |
use surt instead of trying to create index term by hand
|
18 months ago |
Henry Thompson |
merge
|
18 months ago |
Henry Thompson |
stale
|
18 months ago |
Henry Thompson |
catching up by hand with markup version,
|
18 months ago |
Henry S. Thompson |
include timestamp
|
18 months ago |
Henry S. Thompson |
include query
|
18 months ago |
Henry S. Thompson |
make CC's own sorting explicit
|
19 months ago |
Henry S. Thompson |
handle corner cases with final . and initial www..+
|
19 months ago |
Henry S. Thompson |
handle %-encoded utf-8 as idna
|
19 months ago |
Henry S. Thompson |
merge
|
19 months ago |
Henry S. Thompson |
compute timestamps, key and sort lmh lines
|
19 months ago |
Henry S. Thompson |
work with csing
|
19 months ago |
Henry S. Thompson |
get man -k working
|
19 months ago |
Henry Thompson |
for warc_lmh slurm logs
|
19 months ago |
Henry S. Thompson |
for timing analysis
|
19 months ago |
Henry S. Thompson |
add support for multiple calls to srun with a counter
|
19 months ago |
Henry S. Thompson |
fix eof bug, expand error messages
|
19 months ago |
Henry S. Thompson |
part 2 is now working for all types
|
19 months ago |
Henry S. Thompson |
add a response-only test
|
19 months ago |
Henry S. Thompson |
revert to just showing first LM
|
20 months ago |
Henry S. Thompson |
more tests
|
20 months ago |
Henry S. Thompson |
Test 2 works with parts=1,2,3.
|
20 months ago |
Henry S. Thompson |
whole working
|
20 months ago |
Henry S. Thompson |
tests 1 & 2 now working
|
20 months ago |
Henry S. Thompson |
avoid slicing buf by using memoryview to save copying
|
20 months ago |
Henry S. Thompson |
but skip at eobp is not working (with test 2)
|
20 months ago |
Henry S. Thompson |
works with all types, part=1
|
20 months ago |
Henry S. Thompson |
rework completely to refill as much as possible only when necessary,
|
20 months ago |
Henry S. Thompson |
finds multiples
|
20 months ago |
Henry S. Thompson |
little steps
|
20 months ago |
Henry S. Thompson |
made 1 mean 1, still losing after a while
|
20 months ago |
Henry S. Thompson |
better debugging output
|
20 months ago |
Henry S. Thompson |
working better, gets confused by 3-part response
|
20 months ago |
Henry S. Thompson |
a bit better
|
20 months ago |
Henry S. Thompson |
just barely working for 1, need to rethink buffering
|
20 months ago |
Henry S. Thompson |
starting on conversion to direct-querying of buffer
|
20 months ago |
Henry S. Thompson |
sic
|
20 months ago |
Henry S. Thompson |
support on-board unzipping, reduce buffer size to 2MB
|
20 months ago |
Henry S. Thompson |
make test 1 idempotent
|
20 months ago |
Henry S. Thompson |
just count part length
|
20 months ago |
Henry S. Thompson |
get EOF right, finally
|
20 months ago |
Henry S. Thompson |
make warc.py a library, separate out testing
|
20 months ago |
Henry S. Thompson |
correct comment
|
20 months ago |
Henry S. Thompson |
add lots more debugging output,
|
20 months ago |
Henry S. Thompson |
moved from home bin
|
2023-01-10 |
Henry S. Thompson |
doc pointer
|
2022-12-13 |
Henry S. Thompson |
push actions in main fn
|
2022-12-13 |
Henry S. Thompson |
fixed for paper
|
2022-11-24 |
Henry S. Thompson |
fix N
|
2022-11-23 |
Henry S. Thompson |
compute and graph confidence intervals
|
2022-11-22 |
Henry S. Thompson |
generalise hist
|
2022-11-22 |
Henry S. Thompson |
add sort flag to plot_x
|
2022-11-17 |
Henry S. Thompson |
get multi-ranking done right
|
2022-11-17 |
Henry S. Thompson |
comments and more care about rows vs. columns
|
2022-11-16 |
Henry S. Thompson |
start work on ranking,
|
2022-11-16 |
Henry S. Thompson |
Spearman for matlab
|
2022-11-16 |
Henry S. Thompson |
move all plots into functions
|
2022-11-15 |
Henry S. Thompson |
a bit more
|
2022-11-14 |
Henry S. Thompson |
framework for stats over results of rank correlations
|
2022-11-11 |
Henry S. Thompson |
first plot efforts w. scipy
|
2022-10-21 |
Henry S. Thompson |
sic
|
2022-09-29 |
Henry S. Thompson |
accept filenames on stdin,
|
2022-09-29 |
Henry S. Thompson |
interpolate process0, support permanent subproc
|
2022-09-29 |
Henry S. Thompson |
new
|
2022-09-29 |
Henry S. Thompson |
new
|
2022-08-07 |
Henry S. Thompson |
write to tmp file implemented
|
2022-08-07 |
Henry S. Thompson |
use awk for simple cut
|
2022-08-07 |
Henry S. Thompson |
toward link extractions from pdf
|
2022-08-07 |
Henry S. Thompson |
in progress...
|
2022-08-07 |
Henry S. Thompson |
x
|
2022-07-28 |
Henry S. Thompson |
x
|
2022-07-28 |
Henry S. Thompson |
fix quoting problem by using parallel ... -q
|
2022-07-28 |
Henry S. Thompson |
catch-up
|
2022-07-23 |
Henry S. Thompson |
minimal hst preferred options
|
2022-07-23 |
Henry S. Thompson |
work around problem with PROMPT_COMMAND
|
2022-07-20 |
Henry S. Thompson |
x
|
2022-07-20 |
Henry S. Thompson |
fix PROMPT_COMMAND
|
2022-07-20 |
Henry S. Thompson |
x
|
2022-07-20 |
Henry S. Thompson |
tidy up and include uniq -c
|
2022-07-20 |
Henry S. Thompson |
convert to no longer need uniq -c
|
2022-07-19 |
Henry S. Thompson |
oops, 1.1 was half-modified, bogus
|
2022-07-18 |
Henry S. Thompson |
compute node workers, see cirrus_home/bin repo for login node masters
|
2022-07-18 |
Henry S. Thompson |
getting started
|
2022-07-18 |
Henry S. Thompson |
getting started
|