comparison bin/spearman.py @ 30:c73ec9deabbe
comments and more care about rows vs. columns
author | Henry S. Thompson <ht@inf.ed.ac.uk> |
---|---|
date | Thu, 17 Nov 2022 11:27:07 +0000 |
parents | 669a0b120d34 |
children | e7c8e64c2fdd |
29:669a0b120d34 | 30:c73ec9deabbe |
---|---|
1 #!/usr/bin/env python3 | 1 #!/usr/bin/env python3 |
2 '''Rank correlation processing for a csv tabulation of counts by segment | 2 '''Rank correlation processing for a csv tabulation of counts by segment |
3 First column is for whole crawl, then 100 columns for segs 0-99 | 3 First column is for whole crawl, then 100 columns for segs 0-99 |
4 Each row is counts for some property, e.g. mime-detected or tld | 4 Each row is counts for some property, e.g. mime-detected or tld |
5 | 5 |
6 For example | 6 For example, assuming all.tsv has the whole-crawl warc-only counts |
7 and s...tsv have the segment counts, all with counts in column 1, | |
7 | 8 |
8 tr -d ',' <all.tsv |head -100 | while read n m; do printf "%s%s\n" $n $(for i in {0..99}; do printf ",%s" $({ grep -w "w $m\$" s${i}.tsv || echo NaN ;} | cut -f 1 ) ; done ) ; done > all_100.csv | 9 tr -d ',' <all.tsv |head -100 | while read n m; do printf "%s%s\n" $n $(for i in {0..99}; do printf ",%s" $({ grep -w "w $m\$" s${i}.tsv || echo NaN ;} | cut -f 1 ) ; done ) ; done > all_100.csv |
9 | 10 |
10 will produce such a file with 100 rows assuming all.tsv has the whole-crawl | 11 will produce such a file with |
11 warc-only counts and s...tsv have the segment counts, all counts in column 1 | 12 * 100 rows, one for each of the top 100 counts |
13 * 101 columns, 0 for all and 1--100 for segs 0--99 | |
12 | 14 |
13 Usage: python3 -i spearman.py name | 15 Usage: python3 -i spearman.py name |
14 where name.csv has the input | 16 where name.csv has the input |
15 ''' | 17 ''' |
16 | 18 |
80 return np.array([i,all[i],1.0/xd[i].variance,xd[i].mean,first_diff(ranks[i])]) | 82 return np.array([i,all[i],1.0/xd[i].variance,xd[i].mean,first_diff(ranks[i])]) |
81 | 83 |
82 counts=loadtxt(sys.argv[1]+".csv",delimiter=',') | 84 counts=loadtxt(sys.argv[1]+".csv",delimiter=',') |
83 # "If axis=0 (default), then each column represents a variable, with | 85 # "If axis=0 (default), then each column represents a variable, with |
84 # observations in the rows" | 86 # observations in the rows" |
85 ranks=[stats.rankdata(-counts[i],method='average') for for i in range(1,100)] | 87 # So each column is a sequence of counts, for whole crawl in column 0 |
88 # and for segments 0--99 in columns 1--100 | |
86 corr=stats.spearmanr(counts,nan_policy='omit').correlation | 89 corr=stats.spearmanr(counts,nan_policy='omit').correlation |
87 | 90 |
88 all=corr[0][1:] | 91 all=corr[0][1:] |
89 all_s=stats.describe(all) | 92 all_s=stats.describe(all) |
90 all_m=all_s.mean | 93 all_m=all_s.mean |
91 | 94 |
92 x=np.array([np.concatenate((corr[i][1:i], | 95 x=np.array([np.concatenate((corr[i][1:i], |
93 corr[i][i+1:])) for i in range(1,101)]) | 96 corr[i][i+1:])) for i in range(1,101)]) |
97 # The above, although transposed, works because the correlation matrix | |
98 # is symmetric | |
94 xd=[stats.describe(x[i]) for i in range(100)] | 99 xd=[stats.describe(x[i]) for i in range(100)] |
95 xs=stats.describe(np.array([xd[i].mean for i in range(100)])) | 100 xs=stats.describe(np.array([xd[i].mean for i in range(100)])) |
96 xm=xs.mean | 101 xm=xs.mean |
97 xsd=np.sqrt(xs.variance) | 102 xsd=np.sqrt(xs.variance) |
103 | |
104 ranks=[stats.rankdata(-counts[:,i],method='average') for for i in range(1,100)] | |
98 | 105 |
99 ### I need to review rows, e.g. counts[0] is an array of 101 counts | 106 ### I need to review rows, e.g. counts[0] is an array of 101 counts |
100 ### for the most common label in the complete crawl, | 107 ### for the most common label in the complete crawl, |
101 ### from the complete crawl and all the segments | 108 ### from the complete crawl and all the segments |
102 ### versus columns, e.g. counts[:,0] is an array of 100 decreasing counts | 109 ### versus columns, e.g. counts[:,0] is an array of 100 decreasing counts |
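
The rows-versus-columns distinction that this changeset's comments describe can be checked on a small array. The following is a minimal sketch, not the script itself: the 6x4 `counts` array is made-up illustrative data (column 0 standing in for the whole crawl, columns 1-3 for three hypothetical segments), and the loop bound is taken from the array shape, which is an assumption about intent rather than what the commit does. Note also that the rank line as shown in the listing repeats `for` and would not parse as written; the sketch uses a single `for`.

```python
import numpy as np
from scipy import stats

# Made-up stand-in for name.csv: 6 properties (rows) by 4 columns.
# Column 0 plays the whole crawl, columns 1-3 play three segments.
counts = np.array([
    [600., 210., 170., 200.],
    [500., 160., 190., 165.],
    [400., 140., 120., 135.],
    [300.,  90., 110., 100.],
    [200.,  70.,  60.,  65.],
    [100.,  30.,  40.,  35.],
])

# A row is one property's counts across the crawl and every segment;
# a column is every property's counts for one crawl or segment.
row0 = counts[0]       # length = number of columns
col0 = counts[:, 0]    # length = number of properties

# With the default axis=0, spearmanr treats each column as a variable,
# so the result is a (columns x columns) correlation matrix.
corr = stats.spearmanr(counts, nan_policy='omit').correlation

# Correlation of the whole crawl (column 0) with each segment column.
whole_vs_segments = corr[0][1:]

# Rank each segment column, largest count first (rank 1).
# Slicing with [:, i] selects a column; counts[i] would select a row.
n_cols = counts.shape[1]
ranks = [stats.rankdata(-counts[:, i], method='average')
         for i in range(1, n_cols)]
```

Because the correlation matrix is symmetric, `corr[0][1:]` and `corr[1:, 0]` hold the same values, which is why indexing it by row still gives per-column results.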