# HG changeset patch
# User Henry S. Thompson
# Date 1727043236 -3600
# Node ID 7209df5fa5b46f2531219d22a9e2b8f4f8b3b0bf
# Parent  6ae6a21ccfb937d5435830c8829816ac78268e2b
turn attention to nutch-cc and its Cdx code

diff -r 6ae6a21ccfb9 -r 7209df5fa5b4 lurid3/notes.txt
--- a/lurid3/notes.txt	Thu Sep 05 17:59:02 2024 +0100
+++ b/lurid3/notes.txt	Sun Sep 22 23:13:56 2024 +0100
@@ -284,3 +284,23 @@
 **** Error: An error occurred while reading an XREF table.
 **** The file has been damaged.
 Look in big_pdf?
+
+====Modify the original CC indexer to write new indices including lastmod=====
+Looks like WarcRecordWriter.write, in
+src/nutch-cc/src/java/org/commoncrawl/util/WarcRecordWriter, is what
+needs to be edited to include LastModified date
+
+To rebuild nutch-cc, particularly to recompile jar files after editing
+anything
+
+ >: cd $HHOME/src/nutch-cc
+ >: ant
+
+Fixed deprecation bug in WarcCdxWriter.java
+
+Modified src/java/org/commoncrawl/util/WarcCdxWriter.java
+to include lastmod
+
+Can run just one test, which should allow testing this:
+
+ >: ant test-core -Dtestcase='TestWarcRecordWriter'