comparison master/wecu/README.md @ 57:ac1a20e627a9

from lukasz git repo 2020-05-26 (see ~/src/wecu), then edited; sac not quite working yet
author Henry S. Thompson <ht@markup.co.uk>
date Wed, 27 May 2020 20:54:34 +0000
# Wee Common Crawl Utility (`wecu`)

Apache Hadoop is seen by many as the default choice for batch processing of large-scale data, but compared to alternatives like Apache Spark its performance often leaves a lot to be desired, even for non-iterative workloads.
This utility uses efficient bash tools to run your existing Hadoop Streaming scripts without the overhead and complicated configuration of Apache Hadoop.

`wecu` can be configured manually to run on any cluster you have access to, but it works best on Azure HDInsight clusters, where it will configure itself and install the required dependencies automatically.

`wecu` streams the Common Crawl files from Amazon AWS (where the dataset is hosted) and runs the required computation, as detailed below.
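
For illustration only, this is the kind of streaming access `wecu` relies on: reading crawl data over HTTP without downloading it first. The Python sketch below is an assumption about the mechanism, not `wecu`'s own code, and the crawl name is a placeholder:

```python
# Illustrative sketch (not part of wecu): stream a gzipped Common Crawl
# listing over HTTP and read it without saving anything to disk.
import gzip
import urllib.request

# Placeholder crawl; real listings live under https://data.commoncrawl.org/
URL = "https://data.commoncrawl.org/crawl-data/CC-MAIN-2020-24/wet.paths.gz"

with urllib.request.urlopen(URL) as resp:
    with gzip.open(resp, mode="rt") as listing:
        # Print the first five WET file paths from the chosen crawl
        for i, path in enumerate(listing):
            print(path.strip())
            if i == 4:
                break
```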

# Usage

## Setup and configuration
To automatically configure the cluster (if you are using Azure HDInsight), run:

```bash
git clone https://github.com/maestromusica/wecu
# `cd` to the wecu directory and use ./wecu, or add the directory to $PATH

# To configure the tool and install dependencies
wecu setup [cluster_password]

# To check that the required files are in place
wecu setup [cluster_password] --check_files
```

### Choosing a sample of Common Crawl files
```bash
# Opens a wizard that will present you with a list of crawls available for
# streaming, let you choose the file type (WARC/WAT/WET) and the size of the
# sample, and ask whether you want a random sample or just the first N files.
wecu generate-sample
```

### Viewing configuration
```bash
# Display a list of machines in the cluster
wecu list machines

# Display the currently chosen month & number of files
wecu list input_files

# Display the filenames of all chosen files
wecu list input_files --all
```

# Arbitrary commands
Execute the same command on all machines in the cluster:
```bash
wecu execute "sudo apt-get install foobar"
wecu execute "./setup.sh" --transfer_file setup.sh
```
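
Under the hood this amounts to fanning a command out to every worker and collecting the results. A rough Python sketch of that pattern follows; the hostnames, and the assumption of passwordless SSH from the head node, are illustrative, not `wecu`'s actual implementation:

```python
# Illustrative fan-out pattern (assumed, not wecu's actual code): run one
# command on every worker over SSH in parallel and report each exit status.
import subprocess
from concurrent.futures import ThreadPoolExecutor

WORKERS = ["wn0", "wn1", "wn2"]  # hypothetical worker hostnames

def run_on(host):
    # Assumes passwordless SSH from the head node to each worker
    result = subprocess.run(["ssh", host, "sudo apt-get install -y foobar"],
                            capture_output=True, text=True)
    return host, result.returncode

with ThreadPoolExecutor(max_workers=len(WORKERS)) as pool:
    for host, code in pool.map(run_on, WORKERS):
        print(f"{host}: exit {code}")
```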
# MapReduce/Hadoop Streaming jobs
Run existing Hadoop Streaming code, or write new MapReduce jobs easily in any programming language.
```bash
wecu mapred ./mapper.py ./reducer.py

# You can adjust the number of simultaneous map tasks (default = # logical cores)
wecu mapred --jobs-per-worker 12 ./mapper.py ./reducer.py
```
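
To give a feel for the interface, here is a minimal Hadoop-Streaming-style mapper and reducer pair in Python. The file names match the example above, but the counting task itself is just an illustration, not a job shipped with `wecu`:

```python
#!/usr/bin/env python3
# mapper.py -- illustrative only: emit a key/1 pair for every WARC-Type
# header seen on stdin, in the usual Hadoop Streaming tab-separated format.
import sys

for line in sys.stdin:
    if line.startswith("WARC-Type:"):
        print(f"{line.strip()}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- illustrative only: sum the counts for each key, relying on
# the input being grouped by key as Hadoop Streaming guarantees.
import sys

current, total = None, 0
for line in sys.stdin:
    key, _, count = line.rstrip("\n").rpartition("\t")
    if key != current:
        if current is not None:
            print(f"{current}\t{total}")
        current, total = key, 0
    total += int(count)
if current is not None:
    print(f"{current}\t{total}")
```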

# "Scan-and-count" jobs without writing any code
"Scan-and-count" is my name for workloads that go through ("**scan**") a sample of Common Crawl and **count** the occurrences of a given string, or the number of matches of a given regular expression.
These workloads are very common when analysing Common Crawl, and can be run using `wecu` without writing any code:

```bash
# Count occurrences of "foo" and "bar"
wecu sac "foo" "bar"

# Count the number of matches for the "^WARC-Type: response" regex
wecu sac --regex "^WARC-Type: response"

# You can use as many regexes/strings as you want - runtime grows
# linearly with the number of regexes/strings you provide
wecu sac --regex \
    "^WARC-Type: response" \
    "^WARC-Type: request"

# You can display the results per file, without aggregating them
wecu sac --regex \
    --by-file \
    "^WARC-Type: response" \
    "^WARC-Type: request"

# You can adjust the number of simultaneous map tasks (default = # logical cores)
wecu sac --jobs-per-worker 8 "keyword"
```
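
Conceptually, each pattern is an independent per-file count, which is why the runtime grows linearly with the number of patterns. A hand-written Python equivalent of a single regex count (illustrative only; `wecu` itself does this with shell tools, in parallel across the sample) would be roughly:

```python
# Illustrative equivalent of one sac regex count over one decompressed
# file read from stdin; wecu runs such counts across the whole sample.
import re
import sys

pattern = re.compile(r"^WARC-Type: response")
print(sum(1 for line in sys.stdin if pattern.search(line)))
```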

# Monitor CPU utilisation
You can monitor CPU utilisation on all worker machines simultaneously. The output from each machine is saved to a file on the head node (from which you should run this command); the utilisation is then plotted and the graph saved at the location you provide.
```bash
wecu utilisation graph_filename.png

# Adjust how long the utilisation will be monitored for (default = 120 seconds)
wecu utilisation --seconds 600 graph_filename.png
```
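
The per-machine file format is not documented here, but assuming one utilisation percentage per line in one file per worker (a guess, for illustration only), plotting such data yourself could look like:

```python
# Illustrative plotting sketch (assumed layout: one file per worker,
# one CPU-utilisation percentage per line, sampled once per second).
import glob
import matplotlib.pyplot as plt

for path in glob.glob("util_*.txt"):  # hypothetical per-worker files
    with open(path) as f:
        samples = [float(line) for line in f]
    plt.plot(samples, label=path)

plt.xlabel("seconds")
plt.ylabel("CPU utilisation (%)")
plt.legend()
plt.savefig("graph_filename.png")
```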