comparison bin/spearman.py @ 30:c73ec9deabbe

comments and more care about rows vs. columns
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Thu, 17 Nov 2022 11:27:07 +0000
parents 669a0b120d34
children e7c8e64c2fdd
--- bin/spearman.py (29:669a0b120d34)
+++ bin/spearman.py (30:c73ec9deabbe)
@@ -1,16 +1,18 @@
 #!/usr/bin/env python3
 '''Rank correlation processing for a csv tabulation of counts by segment
 First column is for whole crawl, then 100 columns for segs 0-99
 Each row is counts for some property, e.g. mime-detected or tld
 
-For example
+For example, assuming all.tsv has the whole-crawl warc-only counts
+and s...tsv have the segment counts, all with counts in column 1,
 
 tr -d ',' <all.tsv |head -100 | while read n m; do printf "%s%s\n" $n $(for i in {0..99}; do printf ",%s" $({ grep -w "w $m\$" s${i}.tsv || echo NaN ;} | cut -f 1 ) ; done ) ; done > all_100.csv
 
-will produce such a file with 100 rows assuming all.tsv has the whole-crawl
-warc-only counts and s...tsv have the segment counts, all counts in column 1
+will produce such a file with
+ * 100 rows, one for each of the top 100 counts
+ * 101 columns, 0 for all and 1--100 for segs 0--99
 
 Usage: python3 -i spearman.py name
  where name.csv has the input
 '''
 
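A quick way to sanity-check a file built by the pipeline in the docstring is to load it and confirm the layout it promises. This is only an illustrative sketch, not part of the change: the file name all_100.csv is taken from the example above, and the shape and indexing checks are assumptions about how one might verify it.

    import numpy as np

    counts = np.loadtxt('all_100.csv', delimiter=',')  # missing segment counts parse as NaN
    assert counts.shape == (100, 101)  # 100 top labels x (whole crawl + segments 0-99)
    print(counts[0])      # a row: one label's count in the whole crawl and in every segment
    print(counts[:, 0])   # a column: the 100 whole-crawl counts, in decreasing order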
@@ -80,23 +82,28 @@
     return np.array([i,all[i],1.0/xd[i].variance,xd[i].mean,first_diff(ranks[i])])
 
 counts=loadtxt(sys.argv[1]+".csv",delimiter=',')
 # "If axis=0 (default), then each column represents a variable, with
 # observations in the rows"
-ranks=[stats.rankdata(-counts[i],method='average') for i in range(1,100)]
+# So each column is a sequence of counts, for whole crawl in column 0
+# and for segments 0--99 in columns 1--100
 corr=stats.spearmanr(counts,nan_policy='omit').correlation
 
 all=corr[0][1:]
 all_s=stats.describe(all)
 all_m=all_s.mean
 
 x=np.array([np.concatenate((corr[i][1:i],
                             corr[i][i+1:])) for i in range(1,101)])
+# The above, although transposed, works because the correlation matrix
+# is symmetric
 xd=[stats.describe(x[i]) for i in range(100)]
 xs=stats.describe(np.array([xd[i].mean for i in range(100)]))
 xm=xs.mean
 xsd=np.sqrt(xs.variance)
+
+ranks=[stats.rankdata(-counts[:,i],method='average') for i in range(1,100)]
 
 ### I need to review rows, e.g. counts[0] is an array of 101 counts
 ### for the most common label in the complete crawl,
 ### from the complete crawl and all the segments
 ### versus columns, e.g. counts[:,0] is an array of 100 decreasing counts
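A minimal sketch of the rows-vs-columns convention the new comments insist on, using a made-up 4x3 array rather than the real data: with the default axis=0, scipy.stats.spearmanr treats each column as a variable, so on the 100x101 input corr is 101x101 and corr[0][1:] holds the whole crawl's correlation with each of the 100 segments; likewise rankdata has to be fed a column, counts[:,i], as the relocated line now does.

    import numpy as np
    from scipy import stats

    toy = np.array([[40., 39., 41.],
                    [30., 31., 29.],
                    [20., 19., 22.],
                    [10., 11.,  8.]])
    corr = stats.spearmanr(toy, nan_policy='omit').correlation
    print(corr.shape)                                    # (3, 3): one variable per column
    print(stats.rankdata(-toy[:, 0], method='average'))  # [1. 2. 3. 4.]: ranks down a column, largest count first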