python: twt.py annotate

annotate twt.py @ 69:157f012ffab7 default tip

from local

author	Henry S Thompson <ht@inf.ed.ac.uk>
date	Fri, 17 Jan 2025 15:45:26 +0000

from twitter.twitter import *
import cld2full

# Read one day's tweets as tokenised sentences.
# xtwc is not defined in this file; presumably it comes from the star import above.
tt=xtwc.sents('20100128.txt')
corp=[]

# Keep only alphabetic tokens, lower-cased, and drop tweets
# with fewer than 5 such tokens. Store each tweet as UTF-8 bytes.
for t in tt:
    at=[w.lower() for w in t if w.isalpha()]
    if len(at) >= 5:
        corp.append((' '.join(at)).encode('utf8'))
print(len(corp))

# Tweets for which language detection is flagged as reliable,
# kept as (detection result, text) pairs.
rcorp=[r for r in ((cld2full.detect(t),t) for t in corp) if r[0].is_reliable]
print(len(rcorp))

# Mainly English: the top-ranked language is 'en'.
mecorp=[r for r in rcorp if r[0].details[0].language_code=='en']
print(len(mecorp))

# English only: the second-ranked language is 'un', i.e. nothing else was detected.
eecorp=[r for r in mecorp if r[0].details[1].language_code=='un']
print(len(eecorp))

# No English: 'en' is absent from all three ranked languages.
necorp=[r for r in rcorp
        if (r[0].details[0].language_code!='en')
        and (r[0].details[1].language_code!='en')
        and (r[0].details[2].language_code!='en')]
print(len(necorp))
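The filtering steps in twt.py can be tried out without the tweet corpus or cld2full installed. The sketch below is only an illustration, not part of the repository: Detection, Detections, clean_tweet, partition and fake are hypothetical names, and the two namedtuples merely mimic the attributes the script reads (is_reliable, details, language_code). It also assumes that 'un' marks an unused detail slot, as the script's eecorp test suggests.

```python
from collections import namedtuple

# Hypothetical stand-ins for the result shape that twt.py reads from
# cld2full.detect(); only the attributes used by the script are modelled.
Detection = namedtuple("Detection", ["language_code"])
Detections = namedtuple("Detections", ["is_reliable", "details"])


def clean_tweet(tokens, min_words=5):
    """Lower-case the alphabetic tokens; return None if too few remain."""
    words = [w.lower() for w in tokens if w.isalpha()]
    return " ".join(words) if len(words) >= min_words else None


def partition(results):
    """Split (detections, text) pairs the same way twt.py does."""
    reliable = [r for r in results if r[0].is_reliable]
    mainly_en = [r for r in reliable if r[0].details[0].language_code == "en"]
    only_en = [r for r in mainly_en if r[0].details[1].language_code == "un"]
    no_en = [r for r in reliable
             if all(d.language_code != "en" for d in r[0].details[:3])]
    return reliable, mainly_en, only_en, no_en


def fake(is_reliable, *codes):
    """Build a toy detection result, padding unused slots with 'un'."""
    padded = list(codes) + ["un"] * (3 - len(codes))
    return Detections(is_reliable, tuple(Detection(c) for c in padded))


# Toy data: English only, English + French, French + English,
# German only, and an unreliable English detection.
sample = [
    (fake(True, "en"), "a"),
    (fake(True, "en", "fr"), "b"),
    (fake(True, "fr", "en"), "c"),
    (fake(True, "de"), "d"),
    (fake(False, "en"), "e"),
]

reliable, mainly_en, only_en, no_en = partition(sample)
print(len(reliable), len(mainly_en), len(only_en), len(no_en))  # prints: 4 2 1 1
```

Note that the French + English tweet ends up in none of the three English-related subsets: it is not mainly English, yet it is excluded from the no-English set because 'en' appears in its second slot.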
