comparison lurid3/notes.txt @ 44:7209df5fa5b4

turn attention to nutch-cc and its Cdx code
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Sun, 22 Sep 2024 23:13:56 +0100
parents 6ae6a21ccfb9
children 737c61f98cbf
43:6ae6a21ccfb9 44:7209df5fa5b4
282 >: uz 1566027313501.0/orig/warc/*-00000.warc.gz|tail -n +38896858 | tail -n +21 | head -c 1048576 > ~/results/CC-MAIN-2019-35/museum.pdf
283 >: ps2ascii ~/results/CC-MAIN-2019-35/museum.pdf
284 **** Error: An error occurred while reading an XREF table.
285 **** The file has been damaged.
286 Look in big_pdf?
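A PDF's cross-reference (XREF) table and its closing %%EOF marker sit at the end of the file, so one plausible reading of the error above is that the 1 MiB cut made by head -c 1048576 dropped the tail of a larger document. The sketch below is a quick check for that; the function name pdf_looks_truncated is hypothetical and not part of the original notes.

```shell
# Hypothetical helper, not part of the original notes.
# A complete PDF ends with an %%EOF marker; if the last 1024 bytes of the
# payload do not contain it, the file was probably cut short.
pdf_looks_truncated () {
    # $1: path to the extracted payload
    if tail -c 1024 "$1" | grep -aq '%%EOF'; then
        echo "complete"
    else
        echo "truncated"
    fi
}
```

For example, pdf_looks_truncated ~/results/CC-MAIN-2019-35/museum.pdf would be expected to print "truncated" if the guess about the 1 MiB cut is right.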
287
288 ====Modify the original CC indexer to write new indices including lastmod=====
289 Looks like WarcRecordWriter.write, in
290 src/nutch-cc/src/java/org/commoncrawl/util/WarcRecordWriter, is what
291 needs to be edited to include LastModified date
292
293 To rebuild nutch-cc, particularly to recompile jar files after editing
294 anything
295
296 >: cd $HHOME/src/nutch-cc
297 >: ant
298
299 Fixed deprecation bug in WarcCdxWriter.java
300
301 Modified src/java/org/commoncrawl/util/WarcCdxWriter.java
302 to include lastmod
303
304 Can run just one test, which should allow testing this:
305
306 >: ant test-core -Dtestcase='TestWarcRecordWriter'
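The rebuild and single-test steps above can be chained so that the test only runs when the build succeeds. This is a minimal sketch, assuming the two ant invocations recorded in these notes; the function name rebuild_and_test and the DRY_RUN switch are hypothetical additions for illustration.

```shell
# Hypothetical wrapper, not from the original notes: rebuild nutch-cc and
# then run a single test class. Set DRY_RUN=1 to print the commands only.
rebuild_and_test () {
    # $1: nutch-cc source directory, $2: test class name
    src_dir=$1
    testcase=$2
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "cd $src_dir"
        echo "ant"
        echo "ant test-core -Dtestcase=$testcase"
        return 0
    fi
    # Run in a subshell so the caller's working directory is unchanged,
    # and stop at the first failing step so a broken build is not tested.
    ( cd "$src_dir" && ant && ant test-core -Dtestcase="$testcase" )
}
```

Usage would be along the lines of rebuild_and_test "$HHOME/src/nutch-cc" TestWarcRecordWriter.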