| Wed, 21 May 2025 22:05:24 +0100 |
Henry S. Thompson |
double buffer size to deal with massive header cases
trim tip
|
| Sun, 18 May 2025 13:08:04 +0100 |
Henry S. Thompson |
replace xargs with an explicit serial loop plus wait
trim
|
| Sat, 17 May 2025 11:12:46 +0100 |
Henry S. Thompson |
get rm in loop in right place, ensure unique pipe names
trim
|
| Tue, 13 May 2025 14:44:15 +0100 |
Henry S. Thompson |
working
trim
|
| Tue, 13 May 2025 13:32:26 +0100 |
Henry S. Thompson |
fixed time-stamp fixup bugs
trim
|
| Tue, 13 May 2025 12:06:01 +0100 |
Henry S. Thompson |
sic
trim
|
| Tue, 13 May 2025 12:05:22 +0100 |
Henry S. Thompson |
adapt to new configuration
trim
|
| Tue, 13 May 2025 12:04:01 +0100 |
Henry S. Thompson |
works
trim
|
| Thu, 08 May 2025 19:00:26 +0100 |
Henry S. Thompson |
just starting
trim
|
| Tue, 06 May 2025 16:52:32 +0100 |
Henry S. Thompson |
try trimming various more-or-less constant bits of the key and value
trim
|
| Mon, 05 May 2025 20:57:46 +0100 |
Henry S. Thompson |
robotstxt now working?
default
|
| Mon, 05 May 2025 20:57:30 +0100 |
Henry S. Thompson |
add another digit or two (segment #) to key for r_t
|
| Mon, 05 May 2025 20:39:16 +0100 |
Henry S. Thompson |
better font
|
| Wed, 23 Apr 2025 11:03:48 +0100 |
Henry S. Thompson |
still hacking var bindings...
|
| Tue, 22 Apr 2025 14:32:07 +0100 |
Henry S. Thompson |
job index arg to doit, slightly better diagnostic output
|
| Fri, 18 Apr 2025 13:39:55 +0100 |
Henry S. Thompson |
extend, then fix, to get it working for crawldiagnostics warc files
|
| Wed, 09 Apr 2025 20:42:29 +0100 |
Henry S. Thompson |
fix another long-tail bug
|
| Wed, 09 Apr 2025 17:15:40 +0100 |
Henry S. Thompson |
accommodate to change to digits for record type,
|
| Wed, 09 Apr 2025 12:57:50 +0100 |
Henry S. Thompson |
simple refill working?
|
| Wed, 09 Apr 2025 11:15:14 +0100 |
Henry S. Thompson |
try simpler refill
|
| Tue, 08 Apr 2025 16:06:33 +0100 |
Henry S. Thompson |
park that, try fixed large buffer and large-enough min to ensure we always have a whole record in view
|
| Mon, 07 Apr 2025 16:34:31 +0100 |
Henry S. Thompson |
in the midst of trying to rethink the refill logic
|
| Mon, 24 Mar 2025 14:30:32 +0000 |
Henry S. Thompson |
trying to recover from partial, not-ordered, run of segs 0--7
|
| Sat, 08 Mar 2025 22:31:14 +0000 |
Henry S. Thompson |
fix GMT fix,
|
| Fri, 07 Mar 2025 21:17:47 +0000 |
Henry S. Thompson |
try to do the whole thing in one go
|
| Fri, 07 Mar 2025 18:15:41 +0000 |
Henry S. Thompson |
type decls, cythonize works
|
| Fri, 07 Mar 2025 15:39:36 +0000 |
Henry S. Thompson |
type decls, cythonize works
|
| Wed, 05 Mar 2025 23:29:25 +0000 |
Henry S. Thompson |
automate a cdb chain
|
| Thu, 27 Feb 2025 18:23:31 +0000 |
Henry S. Thompson |
move final report to stderr
|
| Thu, 27 Feb 2025 18:23:05 +0000 |
Henry S. Thompson |
work with cdb logging, not sure why it was necessary
|
| Wed, 26 Feb 2025 19:52:22 +0000 |
Henry S. Thompson |
parameterise the range of cdbs and segments,
|
| Wed, 19 Feb 2025 17:49:31 +0000 |
Henry S. Thompson |
push value printing into C,
|
| Wed, 19 Feb 2025 17:48:11 +0000 |
Henry S. Thompson |
try piping instead of python.isal,
|
| Wed, 19 Feb 2025 17:46:24 +0000 |
Henry S. Thompson |
trivial test, suitable for gdb
|
| Wed, 12 Feb 2025 20:17:39 +0000 |
Henry S. Thompson |
working, but very slowly
|
| Wed, 12 Feb 2025 13:01:05 +0000 |
Henry S. Thompson |
maybe ready
|
| Wed, 12 Feb 2025 12:59:28 +0000 |
Henry S. Thompson |
renamed
|
| Wed, 12 Feb 2025 11:29:41 +0000 |
Henry S. Thompson |
towards a real test of cdb
|
| Tue, 11 Feb 2025 11:25:44 +0000 |
Henry S. Thompson |
convert most CCdb methods to cpdef
|
| Tue, 04 Feb 2025 11:17:13 +0000 |
Henry S. Thompson |
don't use print. Working
|
| Tue, 04 Feb 2025 11:16:12 +0000 |
Henry S. Thompson |
align with change to non-static Cdb.
|
| Tue, 04 Feb 2025 11:13:59 +0000 |
Henry S. Thompson |
align with non-static Cdb, add raw access for debugging
|
| Mon, 03 Feb 2025 23:12:55 +0000 |
Henry S. Thompson |
running but not working
|
| Mon, 03 Feb 2025 19:16:20 +0000 |
Henry S. Thompson |
Test for having multiple cdbs open at once: compiles
|
| Fri, 31 Jan 2025 13:31:02 +0000 |
Henry S. Thompson |
use cdb library directly,
|
| Fri, 31 Jan 2025 13:28:09 +0000 |
Henry S. Thompson |
use cdb library directly
|
| Mon, 27 Jan 2025 21:19:18 +0000 |
Henry S. Thompson |
works with big (ks_0-9.60.cdb) cdb file
|
| Fri, 24 Jan 2025 15:07:00 +0000 |
Henry S. Thompson |
finally get test code separated from db.pyx to work
|
| Fri, 24 Jan 2025 15:04:41 +0000 |
Henry S. Thompson |
cython header file for db.pyx
|
| Fri, 24 Jan 2025 15:02:57 +0000 |
Henry S. Thompson |
remove the testing code, leaving just the class
|
| Fri, 24 Jan 2025 15:01:42 +0000 |
Henry S. Thompson |
prepare a ks..tsv file for indexing into a cdb
|
| Thu, 23 Jan 2025 12:53:28 +0000 |
Henry S. Thompson |
renamed Cython class Cdb to CCdb to avoid name conflict with cdb.Cdb
|
| Thu, 23 Jan 2025 12:27:57 +0000 |
Henry S. Thompson |
work with libcdb.a
|
| Sat, 18 Jan 2025 23:00:30 +0000 |
Henry S. Thompson |
value from memory view working
|
| Sat, 18 Jan 2025 21:25:17 +0000 |
Henry S. Thompson |
try using cdb as C library
|
| Fri, 17 Jan 2025 20:37:10 +0000 |
Henry S. Thompson |
add some cython decoration, not much effect
|
| Fri, 17 Jan 2025 20:35:21 +0000 |
Henry S. Thompson |
run with login shell
|
| Fri, 17 Jan 2025 20:34:32 +0000 |
Henry S. Thompson |
tweak XEmacs font/key bindings
|
| Fri, 17 Jan 2025 19:58:04 +0000 |
Henry S. Thompson |
tweak XEmacs font
|
| Thu, 02 Jan 2025 18:35:08 +0000 |
Henry S. Thompson |
time the unpickling
|
| Thu, 02 Jan 2025 18:30:03 +0000 |
Henry S. Thompson |
with bloom prefilter
|
| Thu, 02 Jan 2025 14:52:14 +0000 |
Henry S. Thompson |
try adding lm to existing index from ks_0-9
|
| Thu, 02 Jan 2025 14:51:00 +0000 |
Henry S. Thompson |
output bytes, pickle and save dict if -p, trim lm value to int
|
| Wed, 01 Jan 2025 23:02:35 +0000 |
Henry S. Thompson |
test big dict for associating lm timestamp with cc timestamp+uri
|
| Thu, 03 Oct 2024 18:17:55 +0100 |
Henry S. Thompson |
working together works well to provide what's needed to update a cdx to include lastmod where possible
|
| Wed, 02 Oct 2024 19:54:45 +0100 |
Henry S. Thompson |
make into a library, entry point def unpackz(infileName, callback, outfile = None),
|
| Wed, 02 Oct 2024 11:09:58 +0100 |
Henry S. Thompson |
cleaned up indentation to 2 spaces throughout
|
| Wed, 02 Oct 2024 09:56:37 +0100 |
Henry S. Thompson |
take bufsize from cmdline
|
| Tue, 01 Oct 2024 15:59:26 +0100 |
Henry S. Thompson |
EOF problems fixed, seems to work
|
| Sat, 28 Sep 2024 15:19:05 +0100 |
Henry S. Thompson |
working, but last count/offset not being written
|
| Thu, 26 Sep 2024 17:54:12 +0100 |
Henry S. Thompson |
fix error message
|
| Thu, 26 Sep 2024 12:38:34 +0100 |
Henry S. Thompson |
csing disabled for now
|
| Thu, 26 Sep 2024 12:29:27 +0100 |
Henry S. Thompson |
font hacking, see also lib/xemacs/common-init.el
|
| Thu, 26 Sep 2024 12:25:54 +0100 |
Henry S. Thompson |
new default from CC themselves
|
| Thu, 26 Sep 2024 12:24:16 +0100 |
Henry S. Thompson |
for debugging?
|
| Thu, 09 May 2024 12:36:57 +0100 |
Henry S. Thompson |
for use in Stuttgart, maybe
|
| Sat, 02 Mar 2024 10:59:06 +0000 |
Henry S. Thompson |
xxx
|
| Thu, 29 Feb 2024 15:01:10 +0000 |
Henry S. Thompson |
merge
|
| Thu, 29 Feb 2024 15:01:02 +0000 |
Henry S. Thompson |
post-processing
|
| Wed, 28 Feb 2024 18:31:52 +0000 |
Henry S. Thompson |
sic
|
| Thu, 29 Feb 2024 14:59:50 +0000 |
Henry S. Thompson |
compute offset between LM and crawl timestamp
|
| Thu, 29 Feb 2024 14:59:09 +0000 |
Henry S. Thompson |
sic
|
| Wed, 28 Feb 2024 15:27:00 +0000 |
Henry S. Thompson |
rebuild to match triple fig line colour
|
| Wed, 28 Feb 2024 15:13:38 +0000 |
Henry S. Thompson |
rebuild with more consistent appearance
|
| Wed, 28 Feb 2024 14:50:08 +0000 |
Henry S. Thompson |
merge
|
| Wed, 28 Feb 2024 14:49:45 +0000 |
Henry S. Thompson |
replaced mean_lens by w or wo bogon
|
| Wed, 28 Feb 2024 14:44:59 +0000 |
Henry S. Thompson |
now using clean 2005 count
|
| Wed, 28 Feb 2024 10:32:01 +0000 |
Henry Thompson |
minor addition?
|
| Wed, 28 Feb 2024 10:20:44 +0000 |
Henry S. Thompson |
merge
|
| Wed, 28 Feb 2024 10:15:56 +0000 |
Henry S. Thompson |
what is this?
|
| Tue, 20 Feb 2024 15:23:47 +0000 |
Henry Thompson |
add percentage of non-latin by crawl table
|
| Fri, 16 Feb 2024 16:24:28 +0000 |
Henry Thompson |
tld change investigation
|
| Fri, 16 Feb 2024 13:54:12 +0000 |
Henry S. Thompson |
nl1 and tld summary results
|
| Thu, 15 Feb 2024 22:31:09 +0000 |
Henry S. Thompson |
correct Usage
|
| Thu, 15 Feb 2024 22:30:40 +0000 |
Henry S. Thompson |
csing-related tweaks
|
| Thu, 15 Feb 2024 16:36:00 +0000 |
Henry S. Thompson |
merge
|
| Thu, 15 Feb 2024 15:10:34 +0000 |
Henry S. Thompson |
see Paul:Documents/HTalks/WebSci2024
|
| Thu, 11 Jan 2024 16:44:45 +0000 |
Henry S. Thompson |
add some debugging info
|
| Thu, 11 Jan 2024 16:43:16 +0000 |
Henry S. Thompson |
use 2-digit suffixes,
|
| Fri, 08 Dec 2023 10:32:07 +0000 |
Henry S. Thompson |
sic
|
| Thu, 07 Dec 2023 18:23:11 +0000 |
Henry S. Thompson |
sic
|
| Thu, 07 Dec 2023 18:21:48 +0000 |
Henry S. Thompson |
added back missing years
|
| Thu, 07 Dec 2023 18:15:43 +0000 |
Henry S. Thompson |
support semilogy from cmd line
|
| Wed, 06 Dec 2023 13:36:49 +0000 |
Henry S. Thompson |
means of all columns in length analyses
|
| Wed, 06 Dec 2023 13:33:25 +0000 |
Henry S. Thompson |
normalise % counts by non-empty bases only
|
| Tue, 05 Dec 2023 19:49:29 +0000 |
Henry S. Thompson |
new plots various
|
| Tue, 05 Dec 2023 19:49:11 +0000 |
Henry S. Thompson |
get single graph working, tweak params various
|
| Tue, 05 Dec 2023 10:35:15 +0000 |
Henry S. Thompson |
compute (component) uri lengths and a few other properties
|
| Mon, 04 Dec 2023 19:06:13 +0000 |
Henry S. Thompson |
with three tracks from two years
|
| Mon, 04 Dec 2023 10:42:02 +0000 |
Henry S. Thompson |
for pub
|
| Mon, 04 Dec 2023 10:40:47 +0000 |
Henry S. Thompson |
tweaked formatting
|
| Mon, 04 Dec 2023 10:21:30 +0000 |
Henry S. Thompson |
Excel rewrote, no important changes (?)
|
| Mon, 04 Dec 2023 09:42:39 +0000 |
Henry S. Thompson |
replace wrong one with right one
|
| Mon, 04 Dec 2023 09:37:14 +0000 |
Henry S. Thompson |
merge
|
| Mon, 04 Dec 2023 09:35:53 +0000 |
Henry S. Thompson |
implement alternative confidence measure using stats.bootstrap,
|
| Mon, 04 Dec 2023 09:33:13 +0000 |
Henry S. Thompson |
for LMh percentile
|
| Thu, 30 Nov 2023 14:42:46 +0000 |
Henry S. Thompson |
decorated
|
| Thu, 30 Nov 2023 14:20:22 +0000 |
Henry S. Thompson |
merge
|
| Thu, 30 Nov 2023 14:18:56 +0000 |
Henry S. Thompson |
can't add props to DescribeResult
|
| Thu, 30 Nov 2023 14:17:34 +0000 |
Henry S. Thompson |
for 2023-40
|