changeset 44:7209df5fa5b4
turn attention to nutch-cc and its Cdx code
author | Henry S. Thompson <ht@inf.ed.ac.uk> |
---|---|
date | Sun, 22 Sep 2024 23:13:56 +0100 |
parents | 6ae6a21ccfb9 |
children | 737c61f98cbf |
files | lurid3/notes.txt |
diffstat | 1 files changed, 20 insertions(+), 0 deletions(-) |
--- a/lurid3/notes.txt Thu Sep 05 17:59:02 2024 +0100
+++ b/lurid3/notes.txt Sun Sep 22 23:13:56 2024 +0100
@@ -284,3 +284,23 @@
 **** Error: An error occurred while reading an XREF table.
 **** The file has been damaged.
 Look in big_pdf?
+
+====Modify the original CC indexer to write new indices including lastmod=====
+Looks like WarcRecordWriter.write, in
+src/nutch-cc/src/java/org/commoncrawl/util/WarcRecordWriter, is what
+needs to be editted to include LastModified date
+
+To rebuild nutch-cc, particularly to recompile jar files after editting
+anything
+
+ >: cd $HHOME/src/nutch-cc
+ >: ant
+
+Fixed deprecation bug in WarcCdxWriter.java
+
+Modified src/java/org/commoncrawl/util/WarcCdxWriter.java
+to include lastmod
+
+Can run just one test, which should allow testing this:
+
+ >: ant test-core -Dtestcase='TestWarcRecordWriter'
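
The changeset records that WarcCdxWriter.java was modified to include lastmod, but the Java change itself is not part of this diff. As a rough illustration only, and not the actual nutch-cc code, the sketch below shows one way an HTTP Last-Modified header value could be turned into a 14-digit UTC timestamp of the kind CDX indices use for capture times. The class name LastModSketch, the method lastModToCdx, and the choice of output format are all assumptions made for this example.

```java
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

// Hypothetical sketch: none of these names come from nutch-cc.
final class LastModSketch {
    // Assumed output shape: 14-digit yyyyMMddHHmmss, as in CDX capture timestamps.
    private static final DateTimeFormatter CDX_TS =
        DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    // Returns the header as a UTC timestamp, or empty if it is missing or unparsable.
    static Optional<String> lastModToCdx(String header) {
        if (header == null) {
            return Optional.empty();
        }
        try {
            // HTTP dates such as "Sun, 22 Sep 2024 22:13:56 GMT" follow RFC 1123.
            ZonedDateTime parsed =
                ZonedDateTime.parse(header.trim(), DateTimeFormatter.RFC_1123_DATE_TIME);
            // Normalise to UTC before formatting so numeric offsets are handled too.
            return Optional.of(parsed.withZoneSameInstant(ZoneOffset.UTC).format(CDX_TS));
        } catch (DateTimeParseException e) {
            // Servers often send malformed dates; treat them as "no lastmod".
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(lastModToCdx("Sun, 22 Sep 2024 22:13:56 GMT").orElse("-"));
        System.out.println(lastModToCdx("not a date").orElse("-"));
    }
}
```

Returning an empty Optional for a missing or malformed header lets the caller decide whether to omit the field or write a placeholder, a choice the notes above do not specify.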