Mapping Wikileaks' Cablegate topics using Python, MongoDB, Neo4j and Gephi

FOSDEM's data devroom, Brussels, Feb 5 2011

Goals

  • to analyse the cables' full text rather than relying on their metadata for structure

  • to produce occurrence and co-occurrence networks of topics and cables

  • to visualize how the discussions within the cables are composed and relate to each other

How ?

  • analyse the ~3300 cables with Python and a set of productive libraries (NLTK, BeautifulSoup)

  • use MongoDB and Neo4j for document and network storage

  • explore the network with Gephi

A preview of the result

Import cables to MongoDB

  • Fortunately, Wikileaks' archive follows a simple structure, making data hackers' job easy !

  • first used BeautifulSoup, which is easy to use, then switched to Cablemap, which is based on regular expressions and very fast

  • NLTK : clean_html() and you're done

    • but character decoding still has to be handled manually

  • MongoDB : document storage

    • transparently inserting and reading records as Python dictionaries
    • automatic serializing/deserializing : unicode, nested lists and dict, datetime...
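As a rough illustration of the import step, the sketch below turns one cable's raw HTML into a dictionary ready for insertion. It uses only the Python standard library rather than BeautifulSoup, Cablemap or NLTK. The function name `cable_to_document`, the `content` field and the sample id are hypothetical, and the explicit `decode` call stands in for the manual decoding mentioned above.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text found between HTML tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def cable_to_document(cable_id, raw_html, encoding="utf-8"):
    """Build a dictionary for one cable (hypothetical field names)."""
    # Decode explicitly; undecodable bytes are replaced, not fatal.
    html = raw_html.decode(encoding, errors="replace")
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left behind by the markup.
    text = " ".join("".join(parser.chunks).split())
    return {"_id": cable_id, "content": text}


doc = cable_to_document("10EXAMPLE1", b"<p>Hello <b>world</b></p>")
```

A dictionary of this shape can be handed to a PyMongo collection as-is, for example through `insert_one` in current PyMongo releases.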

Extract topics (1) : tokenize

  • decompose text into tokens : nltk.sent_tokenize and nltk.TreebankWordTokenizer

    >>> sentences = nltk.sent_tokenize("WikiLeaks is a non-profit media organization dedicated to bringing important news and information to the public. We provide an innovative, secure and anonymous way for independent sources around the world to leak information to our journalists.")
    ['WikiLeaks is a [...] to the public.', 'We provide [...] journalists.']
    
    
    >>> nltk.TreebankWordTokenizer().tokenize(sentences[0])
    ['WikiLeaks', 'is', 'a', 'non-profit', 'media', 'organization', 'dedicated', 'to', 'bringing', 'important', 'news', 'and', 'information', 'to', 'the', 'public', '.']
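The summary of this talk refers to bi-gram topics. A minimal sketch of how adjacent token pairs could be formed from a tokenized sentence follows; the `bigrams` helper is hypothetical and the n-gram code actually used on the cables may differ.

```python
def bigrams(tokens):
    """Return the list of adjacent token pairs."""
    return list(zip(tokens, tokens[1:]))


tokens = ['WikiLeaks', 'is', 'a', 'non-profit', 'media', 'organization']
pairs = bigrams(tokens)
```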
                    

Extract topics (2) : stemming

  • easy way to de-duplicate words

  • group words by their radical (stem)

  • use nltk.PorterStemmer

  • >>> print PorterStemmer().stem("language")
    languag
                    
  • compute the sha256 hash to use it as a database index
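A possible way to derive such an index with the standard library is sketched below. It assumes the hash is taken over the UTF-8 bytes of the stemmed form; the `topic_id` name is hypothetical.

```python
import hashlib


def topic_id(stemmed_topic):
    """Return a fixed-length hexadecimal key for a stemmed topic."""
    return hashlib.sha256(stemmed_topic.encode("utf-8")).hexdigest()


key = topic_id("languag")
```

The hexadecimal digest is always 64 characters long and uses only `0-9` and `a-f`, so it can be concatenated into composite ids without escaping.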

Extract topics (3) : part of speech tagging with nltk.tag

Choose more relevant topics

  • a DIY POS tag regular expression filtering "useless" words
    ^(
        (VB,|VBD,|VBG,|VBN,|CD.?,|JJ.?,|\?,){0,2}?
        (N.?,|\?,)+?
        (CD.,)?
    )+?$
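The pattern above can be tried out as follows. This sketch assumes each POS tag is followed by a comma before matching, which is what the literal commas in the pattern suggest; the `is_topic` helper is hypothetical.

```python
import re

# Same pattern as above, compiled in verbose mode so the layout is kept.
TOPIC_TAGS = re.compile(r"""
    ^(
        (VB,|VBD,|VBG,|VBN,|CD.?,|JJ.?,|\?,){0,2}?
        (N.?,|\?,)+?
        (CD.,)?
    )+?$
""", re.VERBOSE)


def is_topic(tags):
    """Return True when a sequence of POS tags matches the pattern."""
    tag_string = "".join(tag + "," for tag in tags)
    return TOPIC_TAGS.match(tag_string) is not None


print(is_topic(["JJ", "NN"]))  # adjective followed by a noun
print(is_topic(["DT", "NN"]))  # determiner followed by a noun
```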
                      

Create the network : from MongoDB to Neo4j (1)

  • Writing to MongoDB :

    • key/value storage usage : update and modifiers are the key

      mongodb.cooc.update({'_id': some_id}, {"$inc":{"value":1}})
                          

    • compose id patterns to organize records : the edge example

      mongodb.cooc.save({'_id': node_source_id +"_"+ node_target_id, "value":1})
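The two calls above can be combined into a single counter update. The sketch below assumes a current PyMongo release, where `update_one` with `upsert=True` creates the record the first time an edge is seen and increments it afterwards. The helper names are hypothetical, and `collection` is any object exposing `update_one`, such as `mongodb.cooc`.

```python
def edge_id(source_id, target_id):
    """Compose the edge key from the ids of its two end nodes."""
    return source_id + "_" + target_id


def increment_edge(collection, source_id, target_id):
    """Add 1 to the weight of an edge, creating it when missing."""
    collection.update_one(
        {"_id": edge_id(source_id, target_id)},
        {"$inc": {"value": 1}},
        upsert=True,
    )
```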
                          

Create the network : from MongoDB to Neo4j (2)

  • Querying MongoDB :

    • example : extract the heaviest co-occurrence edges from a node
      mystartswith_regexp = re.compile("^"+mysha256+"_[a-z0-9]+$")
      cooc_curs = mongodb.cooc.find(
          {"_id":{
              "$regex": mystartswith_regexp
          }},
          timeout=False,
          sort=[("value",pymongo.DESCENDING)],
          limit=MAXEDGES)
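To see what the anchored "starts with" expression selects, it can be tested on plain strings without a database. The ids below are shortened, hypothetical stand-ins for real sha256 digests, and `edges_from` is an illustrative helper.

```python
import re


def edges_from(node_sha256):
    """Build the regular expression matching edge ids that start at a node."""
    return re.compile("^" + re.escape(node_sha256) + "_[a-z0-9]+$")


pattern = edges_from("ab12")
ids = ["ab12_cd34", "cd34_ab12", "ab12", "ab123_ff"]
matching = [edge for edge in ids if pattern.match(edge)]
```

Only ids of the form `<source>_<target>` with the requested source are kept, so an edge stored as `cd34_ab12` is not returned when querying from `ab12`.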
                            

Create the network : from MongoDB to Neo4j (3)

  • About :

    • use the official neo4j.py component (using python-jpype)

    • use transactions to reach maximum performance

      with graphdb.transaction as trans:
          node = graphdb.node()
          node[key.encode("ascii","ignore")] = value.encode("ascii","ignore")
                        

    • control the types written to node properties : use ASCII

    • Neo4j plugin for Gephi : a direct connection

Introduction to Gephi

  • An AGPL3 desktop app for visualization of complex networks
  • Dedicated toolset for social network analysis and network map creation
  • Based on Java, NetBeans Platform and OpenGL (JOGL)
  • Also available as headless library: Gephi Toolkit
  • Connects nicely with Jython, JPype...
  • Plugin center, with connectors for Neo4j, SQL, some social network APIs...
  • To learn more: http://gephi.org

Summary of our Gephi workflow

  1. Import network from database

  2. Basic data laboratory features (sort, delete)

  3. Rank category and occurrences to colors

  4. Filter data on graph topology (degree and weight)

  5. Spatialize the network using an OpenOrd layout

  6. Remove artifacts directly from the visualization

  7. Preview the final map, to tweak appearance

  8. Finally export the map to PDF and GEXF

To sum up:

  • ~600 lines of GNU/GPL one-shot code, 4 external libraries, 2 databases, and 1 Gephi

  • ~1 full week coding, ~5 hours executing the whole process

  • 2 networks obtained without much science :

    • bi-grams topics and cables, linked by occurrences : 43 179 nodes, 237 058 edges
    • bi-grams topics only, linked by co-occurrences : 39 808 nodes, 177 023 edges

  • 2 maps online to be explored by all topic maps lovers

  • 1 talk, 1 FOSDEM, 1 big leak ;-)

A zoom on Egypt's cables

A zoom on "Central Bank" neighbourhood

Thanks

Special thanks to Wikileaks