annotate bin/spearman.py @ 30:c73ec9deabbe

comments and more care about rows vs. columns
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Thu, 17 Nov 2022 11:27:07 +0000
parents 669a0b120d34
children e7c8e64c2fdd
rev   line source
25
50337cd1d16f framework for stats over results of rank correlations
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1 #!/usr/bin/env python3
27
21da4d6521db move all plots into functions
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 26
diff changeset
2 '''Rank correlation processing for a csv tabulation of counts by segment
3 First column is for whole crawl, then 100 columns for segs 0-99
4 Each row is counts for some property, e.g. mime-detected or tld
5
30
c73ec9deabbe comments and more care about rows vs. columns
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 29
diff changeset
6 For example, assuming all.tsv has the whole-crawl warc-only counts
7 and s...tsv have the segment counts, all with counts in column 1,
27
8
9 tr -d ',' <all.tsv |head -100 | while read n m; do printf "%s%s\n" $n $(for i in {0..99}; do printf ",%s" $({ grep -w "w $m\$" s${i}.tsv || echo NaN ;} | cut -f 1 ) ; done ) ; done > all_100.csv
10
30
11 will produce such a file with
12 * 100 rows, one for each of the top 100 counts
13 * 101 columns, 0 for all and 1--100 for segs 0--99
27
14
15 Usage: python3 -i spearman.py name
16 where name.csv has the input
17 '''
18
25
19 import numpy as np
20 from numpy import loadtxt
21 from scipy import stats
22 import statsmodels.api as sm
26
5c5440e7854a a bit more
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 25
diff changeset
23 import matplotlib.pyplot as plt
25
24 import pylab
25
26 import sys
27
27
28 def qqa():
29   # q-q plot for the whole crawl
30   sm.qqplot(all, line='s')
31   plt.gca().set_title('Rank correlation per segment wrt whole crawl (warc results only)')
32   plt.show()
33
34 def qqs():
35   # q-q plots for the best and worst (by variance) segments
36   global xv, xworst, xbest
37   xv=[d.variance for d in xd]
38   xworst=xv.index(max(xv))
39   xbest=xv.index(min(xv))
40   print(xbest,xworst)
41   sm.qqplot(x[xbest], line='s')
42   plt.gca().set_title('Best segment (least variance): %s'%xbest)
43   plt.show()
44   sm.qqplot(x[xworst], line='s')
45   plt.gca().set_title('Worst segment (most variance): %s'%xworst)
46   plt.show()
47
48 def plot_x():
49   plt.plot([xd[i].mean for i in range(100)],'bx',label='Mean of rank correlation of each segment x all other segments')
50   plt.plot([0,99],[xm,xm],'b',label='Mean of segment x segment means')
51   plt.plot(all,'rx',label='Rank correlation of segment x whole crawl')
52   plt.plot([0,99],[all_m,all_m],'r',label='Mean of segment x whole crawl')
53   plt.axis([0,99,0.8,1.0])
54   plt.legend(loc='best')
55   plt.grid(True)
56   plt.show()
57
58 def hist():
59   sdd=[(i,xm+(i*xsd)) for i in range(-2,3)] # (label, position): tick i sits i standard deviations from the mean
60   fig,hax=plt.subplots() # Thanks to https://stackoverflow.com/a/7769497
61   sdax=hax.twiny()
62   hax.hist([xd[i].mean for i in range(100)],color='lightblue')
63   hax.set_title('Mean of rank correlation of each segment x all other segments')
64   for s,v in sdd:
65     sdax.plot([v,v],[0,18],'b')
66   sdax.set_xlim(hax.get_xlim())
67   sdax.set_ylim(hax.get_ylim())
68   sdax.set_xticks([v for s,v in sdd])
69   sdax.set_xticklabels([str(s) for s,v in sdd])
70   plt.show()
71
29
669a0b120d34 start work on ranking,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 27
diff changeset
72 def first_diff(ranks):
73   # first disagreement with baseline == {1,2,...}
74   for i in range(len(ranks)):
75     if ranks[i]!=i+1.0:
76       return i
77   return i+1
78
79 def ranked():
80   # Combine segment measures, one row per segment (uses the global list ranks):
81   #  segID,rank corr. wrt all,inverse variance, mean cross rank corr.,first disagreement
82   return np.array([[i,all[i],1.0/xd[i].variance,xd[i].mean,first_diff(ranks[i])] for i in range(len(ranks))])
83
27
84 counts=loadtxt(sys.argv[1]+".csv",delimiter=',')
29
85 # "If axis=0 (default), then each column represents a variable, with
86 # observations in the rows"
30
87 # So each column is a sequence of counts, for whole crawl in column 0
88 # and for segments 0--99 in columns 1--100
29
89 corr=stats.spearmanr(counts,nan_policy='omit').correlation
27
90
29
91 all=corr[0][1:]
27
92 all_s=stats.describe(all)
93 all_m=all_s.mean
94
29
95 x=np.array([np.concatenate((corr[i][1:i],
96 corr[i][i+1:])) for i in range(1,101)])
30
97 # The above, although transposed, works because the correlation matrix
98 # is symmetric
27
99 xd=[stats.describe(x[i]) for i in range(100)]
100 xs=stats.describe(np.array([xd[i].mean for i in range(100)]))
101 xm=xs.mean
102 xsd=np.sqrt(xs.variance)
29
103
30
104 ranks=[stats.rankdata(-counts[:,i],method='average') for i in range(1,101)]
105
29
106 ### I need to review rows, e.g. counts[0] is an array of 101 counts
107 ### for the most common label in the complete crawl,
108 ### from the complete crawl and all the segments
109 ### versus columns, e.g. counts[:,0] is an array of 100 decreasing counts
110 ### for all the labels in the complete crawl
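
The rows-versus-columns question raised in the closing comments can be checked on a toy table. The sketch below is not part of spearman.py: the 5x4 array, the name all_corr (used in place of all) and the enumerate-based first_diff are illustrative assumptions. It uses 5 labels and 3 segments instead of 100 and 100, and drops nan_policy because the toy data has no NaN entries.

```python
import numpy as np
from scipy import stats

# Toy stand-in for name.csv: one row per label, column 0 for the whole
# crawl and columns 1-3 for three hypothetical segments.
counts = np.array([
    [500.0, 50.0, 48.0, 40.0],
    [400.0, 40.0, 41.0, 45.0],
    [300.0, 30.0, 29.0, 20.0],
    [200.0, 20.0, 22.0, 30.0],
    [100.0, 10.0,  9.0, 10.0],
])
n_seg = counts.shape[1] - 1

# A row holds one label's counts across the crawl and every segment;
# a column holds one source's counts across all labels.
row0 = counts[0]       # shape (4,)
col0 = counts[:, 0]    # shape (5,)

# With the default axis=0 each column is a variable, so the result is a
# square matrix with one row and one column per source.
corr = stats.spearmanr(counts).correlation

# Row 0 without its diagonal entry: each segment against the whole crawl.
all_corr = corr[0][1:]

# For each segment, its correlations with the other segments only
# (whole-crawl column and self-correlation removed).
x = np.array([np.concatenate((corr[i][1:i], corr[i][i + 1:]))
              for i in range(1, n_seg + 1)])

def first_diff(ranks):
    # Index of the first rank that departs from the baseline 1,2,3,...
    for i, r in enumerate(ranks):
        if r != i + 1.0:
            return i
    return len(ranks)

# Rank each segment column, largest count first.
seg_ranks = [stats.rankdata(-counts[:, i], method='average')
             for i in range(1, n_seg + 1)]
diffs = [first_diff(r) for r in seg_ranks]

print(all_corr)
print(diffs)
```

Segments 1 and 2 order the labels exactly as the whole crawl does, so they never disagree with the baseline; segment 3 swaps two pairs of labels and disagrees at the first label.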