FOSDEM's data devroom, Brussels, Feb 5 2011
to analyse the full text of the cables rather than rely on their metadata for structure
to visualize how the discussions within the cables are composed and relate to each other
analyse the ~3300 cables with Python and a set of productive libraries (NLTK, BeautifulSoup)
use MongoDB and Neo4j for document and network storage
explore the network with Gephi
first used BeautifulSoup, which is easy to use, then switched to Cablemap, which is based on regular expressions and very fast
NLTK : clean_html() and you're done
MongoDB : document storage
decompose text into tokens : nltk.sent_tokenize and nltk.TreebankWordTokenizer
>>> sentences = nltk.sent_tokenize("WikiLeaks is a non-profit media organization dedicated to bringing important news and information to the public. We provide an innovative, secure and anonymous way for independent sources around the world to leak information to our journalists.")
['WikiLeaks is a [...] to the public.', 'We provide [...] journalists.']
>>> nltk.TreebankWordTokenizer().tokenize(sentences[0])
['WikiLeaks', 'is', 'a', 'non-profit', 'media', 'organization', 'dedicated', 'to', 'bringing', 'important', 'news', 'and', 'information', 'to', 'the', 'public', '.']
easy way to de-duplicate words
group words by their stem (root form)
use nltk.PorterStemmer
>>> nltk.PorterStemmer().stem("language")
'languag'
compute the sha256 hash to use it as a database index
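A minimal sketch of that indexing step, assuming the stem is hashed as UTF-8 text and the hex digest is used as the key. The helper name stem_id is hypothetical, and Python 3 is used here:

```python
import hashlib

def stem_id(stem):
    """Return the hex SHA-256 digest of a stem, usable as a database key."""
    return hashlib.sha256(stem.encode("utf-8")).hexdigest()

# "languag" is the stem produced by the Porter stemmer for "language".
languag_id = stem_id("languag")
```

Because the digest depends only on the stem, every occurrence of the same stem maps to the same fixed-length key.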
>>> nltk.tag.pos_tag(['Help','Wikileaks','keep','governments','open'])
[('Help', 'NNP'), ('Wikileaks', 'NNP'), ('keep', 'VB'),
('governments', 'NNS'), ('open', 'JJ')]
$ python train_tagger.py --sequential aubt --default "?" conll2000
The nltk.tag.SequentialBackoffTagger chains many taggers together
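A small sketch of such a backoff chain, built by hand rather than with train_tagger.py. The one-sentence training set is a hypothetical stand-in for conll2000, and only three of the taggers are chained here:

```python
import nltk

# Tiny hand-made training set (hypothetical; the talk trained on conll2000).
train_sents = [[("keep", "VB"), ("governments", "NNS"), ("open", "JJ")]]

# Each tagger hands over to its backoff when it has no answer for a token.
default = nltk.DefaultTagger("?")
unigram = nltk.UnigramTagger(train_sents, backoff=default)
bigram = nltk.BigramTagger(train_sents, backoff=unigram)

# "Wikileaks" was never seen in training, so it ends up with the default tag.
tagged = bigram.tag(["Wikileaks", "keep", "governments", "open"])
```

Words unknown to every trained tagger fall through to the default tag "?", which is why the pattern on the next slide also accepts "?".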
^(
(VB,|VBD,|VBG,|VBN,|CD.?,|JJ.?,|\?,){0,2}?
(N.?,|\?,)+?
(CD.,)?
)+?$
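One way the pattern above could be applied, assuming the part-of-speech tags of a word sequence are joined into a single string with a comma after each tag (the pattern suggests this encoding, but the slides do not show it). The names TAG_PATTERN and is_candidate are hypothetical:

```python
import re

# The pattern from the slide, compiled in verbose mode so that the line
# breaks and indentation inside it are ignored.
TAG_PATTERN = re.compile(r"""
    ^(
    (VB,|VBD,|VBG,|VBN,|CD.?,|JJ.?,|\?,){0,2}?
    (N.?,|\?,)+?
    (CD.,)?
    )+?$
""", re.VERBOSE)

def is_candidate(tags):
    """Return True if the tag sequence matches the pattern.

    Assumed encoding: every tag is followed by a comma, e.g. "JJ,NN,".
    """
    encoded = "".join(tag + "," for tag in tags)
    return TAG_PATTERN.match(encoded) is not None

adjective_noun = is_candidate(["JJ", "NN"])
verb_only = is_candidate(["VB"])
```

Under this encoding an adjective followed by a noun is accepted, while a lone verb is rejected because at least one noun-like or unknown ("?") tag is required.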
Writing to MongoDB :
key/value storage usage : update and modifiers are the key
mongodb.cooc.update({'_id': some_id}, {"$inc":{"value":1}})
compose id patterns to organize records : the edge example
mongodb.cooc.save({'_id': node_source_id +"_"+ node_target_id, "value":1})
Querying MongoDB :
# select every edge whose _id starts with the source hash, strongest first
mystartswith_regexp = re.compile("^"+mysha256+"_[a-z0-9]+$")
cooc_curs = mongodb.cooc.find(
    {"_id": {"$regex": mystartswith_regexp}},
    timeout=False,
    sort=[("value", pymongo.DESCENDING)],
    limit=MAXEDGES)
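The id pattern and the prefix query can be tried without a database. The sketch below uses an in-memory Counter as a stand-in for the cooc collection; the helper names and the short ids (in place of real sha256 digests) are hypothetical:

```python
import re
from collections import Counter

cooc = Counter()  # in-memory stand-in for the mongodb.cooc collection

def edge_id(source_id, target_id):
    """Compose the edge key as <source>_<target>."""
    return source_id + "_" + target_id

def add_cooccurrence(source_id, target_id):
    # Comparable to incrementing "value" with {"$inc": {"value": 1}}.
    cooc[edge_id(source_id, target_id)] += 1

def strongest_edges(source_id, max_edges):
    """Return up to max_edges edges leaving source_id, highest value first."""
    prefix = re.compile("^" + re.escape(source_id) + "_[a-z0-9]+$")
    matches = [(key, value) for key, value in cooc.items() if prefix.match(key)]
    matches.sort(key=lambda item: item[1], reverse=True)
    return matches[:max_edges]

add_cooccurrence("aa11", "bb22")
add_cooccurrence("aa11", "cc33")
add_cooccurrence("aa11", "cc33")
add_cooccurrence("dd44", "aa11")
top = strongest_edges("aa11", 10)
```

Since the source id is the first part of the key, all edges leaving a node share a prefix and can be selected with one anchored regular expression.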
About Neo4j :
use the official neo4j.py component (using python-jpype)
use transactions to reach the maximum performance
with graphdb.transaction as trans:
    node = graphdb.node()
    node[key.encode("ascii","ignore")] = value.encode("ascii","ignore")
control the types written to node properties : use ascii
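The ascii conversion can be isolated in a small helper, sketched here for Python 3 and independent of neo4j.py. The function names are hypothetical, and every property value is assumed to be a string:

```python
def to_ascii(text):
    """Drop every character that cannot be encoded as ASCII."""
    return text.encode("ascii", "ignore").decode("ascii")

def ascii_properties(properties):
    """Return a copy of a property dict with ASCII-only keys and values."""
    return {to_ascii(key): to_ascii(value) for key, value in properties.items()}

# The accented character is silently dropped, not transliterated.
clean = ascii_properties({"label": "caf\u00e9"})
```

Note that "ignore" discards characters instead of replacing them, so labels in non-Latin scripts may come out empty.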
Neo4j for Gephi as a plugin : the direct connection
Import network from database
Basic data laboratory features (sort, delete)
Rank category and occurrences to colors
Filter data on graph topology (degree and weight)
Spatialize the network using an OpenOrd layout
Remove artifacts directly from the visualization
Preview the final map, to tweak appearance
Finally export the map to PDF and GEXF
~600 lines of GNU/GPL one-shot code, 4 external libraries, 2 databases, and 1 Gephi
~1 full week coding, ~5 hours executing the whole process
2 networks obtained without much science :
2 maps online to be explored by all topic maps lovers
1 talk, 1 FOSDEM, 1 big leak ;-)
Special thanks to WikiLeaks
Learn more at: