comparison lurid3/notes.txt @ 44:7209df5fa5b4
turn attention to nutch-cc and its Cdx code
author | Henry S. Thompson <ht@inf.ed.ac.uk> |
---|---|
date | Sun, 22 Sep 2024 23:13:56 +0100 |
parents | 6ae6a21ccfb9 |
children | 737c61f98cbf |
43:6ae6a21ccfb9 | 44:7209df5fa5b4 |
---|---|
282 >: uz 1566027313501.0/orig/warc/*-00000.warc.gz|tail -n +38896858 | tail -n +21 | head -c 1048576 > ~/results/CC-MAIN-2019-35/museum.pdf | 282 >: uz 1566027313501.0/orig/warc/*-00000.warc.gz|tail -n +38896858 | tail -n +21 | head -c 1048576 > ~/results/CC-MAIN-2019-35/museum.pdf |
283 >: ps2ascii ~/results/CC-MAIN-2019-35/museum.pdf | 283 >: ps2ascii ~/results/CC-MAIN-2019-35/museum.pdf |
284 **** Error: An error occurred while reading an XREF table. | 284 **** Error: An error occurred while reading an XREF table. |
285 **** The file has been damaged. | 285 **** The file has been damaged. |
286 Look in big_pdf? | 286 Look in big_pdf? |
287 | |
288 ====Modify the original CC indexer to write new indices including lastmod===== | |
289 Looks like WarcRecordWriter.write, in | |
290 src/nutch-cc/src/java/org/commoncrawl/util/WarcRecordWriter, is what | |
291 needs to be edited to include LastModified date | |
292 | |
293 To rebuild nutch-cc, particularly to recompile jar files after editing | |
294 anything | |
295 | |
296 >: cd $HHOME/src/nutch-cc | |
297 >: ant | |
298 | |
299 Fixed deprecation bug in WarcCdxWriter.java | |
300 | |
301 Modified src/java/org/commoncrawl/util/WarcCdxWriter.java | |
302 to include lastmod | |
303 | |
304 Can run just one test, which should allow testing this: | |
305 | |
306 >: ant test-core -Dtestcase='TestWarcRecordWriter' |
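The rebuild and single-test commands recorded above can be chained so that a failed build never goes on to run the test against stale jars. What follows is a minimal sketch, not something taken from the notes: the function name `rebuild_and_test` and the `NUTCH_DIR` and `ANT` override variables are hypothetical, introduced only for illustration. The directory, the `ant` targets and the test case name are the ones the notes use.

```shell
#!/bin/sh
# Sketch only: rebuild nutch-cc, then run one test case, stopping at the
# first failure. NUTCH_DIR and ANT are hypothetical overrides; by default
# the function uses $HHOME/src/nutch-cc and plain `ant`, as in the notes.
rebuild_and_test() {
    testcase=${1:-TestWarcRecordWriter}
    nutch_dir=${NUTCH_DIR:-$HHOME/src/nutch-cc}
    ant_cmd=${ANT:-ant}
    (
        # Work in a subshell so the caller's working directory is unchanged.
        cd "$nutch_dir" || exit 1
        # Recompile the jar files first; skip the test if the build fails.
        "$ant_cmd" || exit 1
        # Run just the one test case.
        "$ant_cmd" test-core "-Dtestcase=$testcase"
    )
}
```

Calling `rebuild_and_test` with no argument runs the same test as the last command in the notes. Passing another class name as the first argument would select a different test case, assuming such a test exists in the tree.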