changeset 44:7209df5fa5b4

turn attention to nutch-cc and its Cdx code
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Sun, 22 Sep 2024 23:13:56 +0100
parents 6ae6a21ccfb9
children 737c61f98cbf
files lurid3/notes.txt
diffstat 1 files changed, 20 insertions(+), 0 deletions(-)
--- a/lurid3/notes.txt	Thu Sep 05 17:59:02 2024 +0100
+++ b/lurid3/notes.txt	Sun Sep 22 23:13:56 2024 +0100
@@ -284,3 +284,23 @@
    **** Error:  An error occurred while reading an XREF table.
    **** The file has been damaged.
 Look in big_pdf?
+
+====Modify the original CC indexer to write new indices including lastmod====
+Looks like WarcRecordWriter.write, in
+src/nutch-cc/src/java/org/commoncrawl/util/WarcRecordWriter, is what
+needs to be edited to include LastModified date
+
+To rebuild nutch-cc, particularly to recompile jar files after editing
+anything:
+
+  >: cd $HHOME/src/nutch-cc
+  >: ant
+
+Fixed deprecation bug in WarcCdxWriter.java
+
+Modified src/java/org/commoncrawl/util/WarcCdxWriter.java
+to include lastmod
+
+Can run just one test, which should allow testing this:
+
+  >: ant test-core -Dtestcase='TestWarcRecordWriter'