Don’t be a lonely document

9 June 2017

"Don’t be a lonely document": that is a famous quote from Emil Eifrem. Last week at GraphConnect he repeated it once again, together with the assignment to tweet about the conference. That inspired me to start scraping Twitter for the keywords "neo4j" and "graphconnect" and put the results into Neo4j. Are people really connecting?

In my first setup, I tried to fetch tweets in real time with Logstash, publish the stream to Kafka, and have a Spark Streaming job process every tweet and insert it into Neo4j.

The following Logstash configuration covers the first two steps: it reads matching tweets from Twitter and publishes them to a Kafka topic.

input {
  twitter {
    consumer_key => "foo"
    consumer_secret => "bar"
    oauth_token => "baz"
    oauth_token_secret => "qux"
    keywords => ["graphconnect", "neo4j", "GraphConnect"]
  }
}
output {
  kafka {
    codec => plain {
      format => "%{message}"
    }
    topic_id => "tweets"
  }
}

The next step is the Spark Streaming job. I had an old test project that does exactly that. For some code examples take a look at: https://github.com/rweverwijk/twitter-to-neo4j

Low laptop battery forced me to abandon this little experiment, but it didn’t leave my mind.

Later at home, I looked for a new way to collect all the tweets with the selected keywords. I created the following simple Python script to search for tweets and store the JSON in a file, one tweet per line:

import json
import time

import tweepy

# Placeholder credentials; replace them with your own Twitter API keys.
ckey = 'foo'
csecret = 'bar'
atoken = 'baz'
asecret = 'qux'

auth = tweepy.OAuthHandler(ckey, csecret)
auth.set_access_token(atoken, asecret)
api = tweepy.API(auth)


def limit_handled(cursor):
    """Yield items from a tweepy cursor, sleeping 15 minutes on any API error."""
    while True:
        try:
            yield cursor.next()
        except StopIteration:
            # No more results: end the generator cleanly.
            return
        except tweepy.TweepError as e:
            # Usually a rate limit; wait out the 15-minute window and retry.
            print(e)
            time.sleep(15 * 60)


def search(keyword):
    """Append every tweet matching `keyword` to the file, one JSON object per line."""
    with open('tweets_friday.json', 'a') as the_file:
        cursor = tweepy.Cursor(api.search, q=keyword, since='2017-05-09')
        for tweet in limit_handled(cursor.items()):
            the_file.write(json.dumps(tweet._json) + '\n')


# Keywords as in the Logstash configuration above.
for keyword in ('neo4j', 'graphconnect'):
    search(keyword)
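
Because the script appends to its output file, searching for several keywords (or running it more than once) can write the same tweet several times. A minimal sketch of reading the file back while skipping duplicates, assuming one JSON object per line as written by the script; the helper name `read_unique_tweets` is hypothetical and not part of the original code:

```python
import json


def read_unique_tweets(lines):
    """Yield each tweet once, keyed on its numeric 'id' field."""
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:  # tolerate blank lines
            continue
        tweet = json.loads(line)
        if tweet['id'] not in seen:
            seen.add(tweet['id'])
            yield tweet


# Possible usage, with the file name from the script above:
# with open('tweets_friday.json') as the_file:
#     tweets = list(read_unique_tweets(the_file))
```

Since the generator takes any iterable of lines, it works on an open file as well as on a list of strings in a test.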

Now the real fun could begin: loading the tweets into Neo4j. The selected structure is very simple. As I’m particularly interested in people who connect, I look at Twitter users and the mentions in their tweets. In addition, I want to distinguish between the original writer of a tweet and its retweeters. This gives the following structure:

[Figure: Neo4j schema. (User)-[:TWEETS]->(Tweet), (Tweet)-[:MENTIONS]->(User), (Tweet)-[:RE_TWEETS]->(Tweet)]
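
To make the structure concrete, here is a small sketch that lists the relationships a single tweet contributes. The relationship names TWEETS, MENTIONS, and RE_TWEETS are the ones used in this post; the function name `tweet_to_edges` and the example tweet (ids and names) are made up for illustration:

```python
def tweet_to_edges(tweet):
    """Return (start, relationship, end) triples for one tweet dict."""
    tweet_node = ('Tweet', tweet['id'])
    edges = [(('User', tweet['user']['id']), 'TWEETS', tweet_node)]
    for mention in tweet['entities']['user_mentions']:
        edges.append((tweet_node, 'MENTIONS', ('User', mention['id'])))
    original = tweet.get('retweeted_status')
    if original is not None:
        # A retweet points at the tweet it repeats.
        edges.append((tweet_node, 'RE_TWEETS', ('Tweet', original['id'])))
    return edges


# Hypothetical tweet with only the fields the function needs.
example = {
    'id': 2,
    'user': {'id': 10, 'name': 'alice'},
    'entities': {'user_mentions': [{'id': 20, 'name': 'bob'}]},
    'retweeted_status': {'id': 1},
}
for edge in tweet_to_edges(example):
    print(edge)
```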

The input data is in JSON format. I prefer using Python to read this data, extract the fields that I want to store, and write them to Neo4j. The following code snippet does just that:

import json
import time

from neo4j.v1 import GraphDatabase  # import path of the 1.x Python driver


def store_tweet(tx, tweet):
    """MERGE the author, the tweet, and one MENTIONS relationship per mentioned user."""
    # Twitter timestamps have the form 'Thu May 11 09:30:00 +0000 2017'.
    created = time.strptime(tweet['created_at'], '%a %b %d %H:%M:%S +0000 %Y')
    neo4j_params = {
        "user_id": tweet['user']['id'],
        "user_name": tweet['user']['name'],
        "tweet_id": tweet['id'],
        "tweet_text": tweet['text'],
        "tweet_time": time.strftime('%Y-%m-%d %H:%M:%S', created),
        "mentions": tweet['entities']['user_mentions'],
    }
    # After WITH, only t and mentions are in scope, so the mentioned
    # user gets its own variable (m) instead of reusing u.
    tx.run("""
        MERGE (u:User {uid: $user_id})
          ON CREATE SET u.name = $user_name
        MERGE (t:Tweet {uid: $tweet_id})
          ON CREATE SET t.text = $tweet_text, t.time = $tweet_time
        MERGE (u)-[:TWEETS]->(t)
        WITH t, $mentions AS mentions
        UNWIND mentions AS mention
        MERGE (m:User {uid: mention.id})
          ON CREATE SET m.name = mention.name
        MERGE (t)-[:MENTIONS]->(m)
        """, neo4j_params)


def process_file(file_name):
    """Load every tweet, plus the original of every retweet, in one transaction."""
    with GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "test")) as driver:
        with open(file_name, 'r') as the_file:
            with driver.session() as session:
                with session.begin_transaction() as tx:
                    for line in the_file:
                        tweet = json.loads(line)
                        store_tweet(tx, tweet)

                        if 'retweeted_status' in tweet:
                            # Store the original tweet too, then link the retweet to it.
                            store_tweet(tx, tweet['retweeted_status'])
                            retweet_data = {
                                "tweet_id": tweet['id'],
                                "retweet_id": tweet['retweeted_status']['id'],
                            }
                            tx.run("""
                                MATCH (t:Tweet {uid: $tweet_id})
                                MATCH (r:Tweet {uid: $retweet_id})
                                MERGE (t)-[:RE_TWEETS]->(r)
                                """, retweet_data)


process_file('tweets_friday2.json')
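
The loader writes the whole file in a single transaction, which was fine for this data set. For a much larger file it may be worth committing in batches instead. A sketch of a chunking helper; the name `batched` and the batch size of 1000 are assumptions for illustration, not part of the original script:

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Possible use inside process_file (not run here):
# for lines in batched(the_file, 1000):
#     with session.begin_transaction() as tx:
#         for line in lines:
#             store_tweet(tx, json.loads(line))
#             # ...followed by the same retweet handling as above
```

Each batch then gets its own transaction, so a failure halfway through only rolls back the current batch; because the loader uses MERGE, re-running the file afterwards should not create duplicates.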

Let’s see what we can find in Neo4j now.

First, let’s take a quick look at the MENTIONS relationships:

MATCH p=()-[r:MENTIONS]->() RETURN p LIMIT 50

[Figure: Neo4j overview]

This looks quite nice already!

Let’s find out which user is mentioned most:

MATCH (t:Tweet)-[r:MENTIONS]->(mentioned:User)
RETURN mentioned.name AS user, count(r) AS numberOfMentions
ORDER BY numberOfMentions DESC
LIMIT 10

results in:

| user | numberOfMentions |
|---|---|
| Neo4j | 1014 |
| GraphConnect | 613 |
| Emil Eifrem | 236 |
| Jim Webber | 150 |
| ICIJ | 149 |
| Rik Van Bruggen | 99 |
| GraphAware | 88 |
| LARUS | 86 |
| Philip Rathle | 86 |
| CluedIn | 69 |
If we exclude organization accounts, the strongest influencers in the graph are Emil, Jim, and Rik. They most certainly were no lonely documents.
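
As a sanity check, a similar ranking can be computed straight from the JSON lines without Neo4j. This is only a sketch: `count_mentions` is a hypothetical helper, and it mimics the loader by also visiting the original of each retweet and counting every (tweet, mentioned user) pair once:

```python
from collections import Counter


def count_mentions(tweets):
    """Count distinct (tweet, mentioned user) pairs per mentioned user name."""
    pairs = set()
    names = {}
    for tweet in tweets:
        # Like the loader, also visit the original of a retweet.
        for status in (tweet, tweet.get('retweeted_status')):
            if status is None:
                continue
            for mention in status['entities']['user_mentions']:
                pairs.add((status['id'], mention['id']))
                # Keep the first name seen, as ON CREATE SET does.
                names.setdefault(mention['id'], mention['name'])
    return Counter(names[user_id] for _, user_id in pairs)


# Possible usage on a list of parsed tweets:
# for name, total in count_mentions(tweets).most_common(10):
#     print(name, total)
```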

Let’s continue exploring and find out who is writing these tweets:

MATCH (u:User)-[:TWEETS]->(t:Tweet)
RETURN u.name AS user, count(t) AS numberOfTweets
ORDER BY numberOfTweets DESC
LIMIT 10

| user | numberOfTweets |
|---|---|
| GraphConnect | 433 |
| Hakaishin Hokutosei | 379 |
| Neo4j | 244 |
| Yuxing Sun | 113 |
| Christophe Willemsen | 109 |
| Neo Questions | 85 |
| Bence Arato | 42 |
| Cedric Fauvet | 41 |
| Nigel Small 🇪🇺 | 38 |
| Mark Wood | 36 |
GraphConnect and Neo4j are obvious, but I don’t know Hakaishin Hokutosei, and 379 seems like a lot of tweets. What is this user tweeting about?

MATCH (u:User)-[:TWEETS]->(t:Tweet)
WHERE u.name = "Hakaishin Hokutosei"
RETURN t.text
LIMIT 100

t.text
RT @BenceArato: Major @neo4j milestones from version 3.0 to current to future plans #GraphConnect https://t.co/J99pXpSzV5
RT @GraphConnect: .@jimwebber: #Neo4j doesn’t do crazy JOINs or sets — it simply chases pointers\n#GraphConnect
RT @GraphConnect: .@jimwebber: Because #Neo4j is a native #graphdatabase and we own the whole stack, we can build to any clustering need…
RT @GraphConnect: .@jimwebber: #Neo4j 3.1 introduced security and Causal Clustering\n#GraphConnect
RT @GraphConnect: .@jimwebber: Causal Clustering, intro-ed in #Neo4j 3.1, can now span multiple data centers\n#GraphConnect
RT @GraphConnect: .@jimwebber: #Neo4j 3.2 drivers are also more aware of Causal Clusters\n#GraphConnect
RT @GraphConnect: .@jimwebber: #Neo4j 3.2 now is able to use #Kerberos, esp for those of you in #FinServ who are required to use it\n#GraphC…
RT @matethurzo: Closing keynote of #graphconnect @jimwebber is always fun to watch #graph #graphdb #conferenceday #neo4j #devlife https://…
RT @mfalcier: Watching @neo4j #graphconnect Dr. @jimwebber ’s talk from the sofa? Awesome! https://t.co/U0bCXGEd0D
RT @GraphConnect: .@jimwebber: Last year in London, #Neo4j 3.0 abolished the upper storage limit altogether\n#GraphConnect

Wait a second, every tweet starts with "RT". Is this user only retweeting, or are there self-written tweets as well?

Let’s see:

MATCH (u:User)-[:TWEETS]->(t:Tweet)
WHERE u.name = "Hakaishin Hokutosei"
AND NOT (t)-[:RE_TWEETS]->()
RETURN count(t)

| count(t) |
|---|
| 0 |

Not a single original tweet. So we need to separate tweets from retweets to distinguish between original writers and retweeters:

MATCH (u)-[r1:TWEETS]->(t)
WHERE NOT (t)-[:RE_TWEETS]->()
OPTIONAL MATCH (u)-[:TWEETS]->(rt)-[r2:RE_TWEETS]->()
RETURN u.name, count(DISTINCT r2) AS numberOfReTweets, count(DISTINCT r1) AS numberOfTweets
ORDER BY numberOfTweets DESC

| u.name | numberOfReTweets | numberOfTweets |
|---|---|---|
| GraphConnect | 238 | 195 |
| Neo4j | 65 | 179 |
| Yuxing Sun | 0 | 113 |
| Neo Questions | 0 | 85 |
| Mark Wood | 4 | 32 |
| Bence Arato | 18 | 24 |
| Carina Birt | 3 | 23 |
| Marlon Samuels | 0 | 20 |
| Daily Tech Issues | 0 | 16 |
| Andres L. Martinez | 1 | 15 |
| Neo4j France | 11 | 15 |
| Louis Dubruel | 0 | 15 |
| Nigel Small 🇪🇺 | 23 | 15 |
| Adam Hill | 5 | 15 |
| Rik Van Bruggen | 1 | 15 |
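
The same split can be sketched in plain Python on the raw tweets: a tweet counts as a retweet when it carries a `retweeted_status` field, which is also what the loader keys on. The helper name `tweet_and_retweet_counts` is hypothetical; like the loader, it also counts the embedded original of each retweet, once:

```python
from collections import Counter


def tweet_and_retweet_counts(tweets):
    """Return (originals, retweets) Counters keyed on the author's name."""
    originals, retweets = Counter(), Counter()
    seen = set()
    for tweet in tweets:
        # Visit the tweet itself and, for a retweet, the embedded original.
        for status in (tweet, tweet.get('retweeted_status')):
            if status is None or status['id'] in seen:
                continue
            seen.add(status['id'])
            kind = retweets if 'retweeted_status' in status else originals
            kind[status['user']['name']] += 1
    return originals, retweets
```

A user who only retweets, like the one found above, would then appear in the second counter but not in the first.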

What are the most popular tweets?

MATCH (rt)-[r2:RE_TWEETS]->(t)<-[:TWEETS]-(u)
RETURN u.name AS user, t.text, count(rt) AS numberOfRetweets
ORDER BY numberOfRetweets DESC

| user | t.text | numberOfRetweets |
|---|---|---|
| Mar Cabra | Work with @ICIJorg from DC, Paris or Madrid for 6 months making sense of complex data and graphs thanks to @neo4j’s… https://t.co/Z0rR3Rt7zV | 20 |
| ICIJ | Interested in using data to find stories? Want to join ICIJ’s next project? Apply for the Connected Data Fellowship https://t.co/LUdsjWKwRJ | 18 |
| William Lyon | Democratizing Data at @AirbnbEng w/ Dataportal, a new tool for scaling data search and discovery powered by @neo4j \n\nhttps://t.co/e12fHuA26M | 18 |
| Pat Patterson | Visualizing & Analyzing Salesforce Data with #StreamSets Data Collector & @Neo4j https://t.co/DunEFtAPyO Thx for gr… https://t.co/pXwBISQtme | 18 |
| ICIJ | Exciting announcement: We’re now hiring a Neo4j Connected Data Fellow! More info & how to apply here: https://t.co/knjHKgyQiz #GraphConnect | 17 |
| Dr. GP Pulipaka | Announcing Neo4j in the Microsoft Azure Marketplace (Part I). #BigData #DataScience #Neo4J #Azure #Analytics… https://t.co/1XEqQHgedu | 16 |
| Kursion | #GraphConnect Neo4j 3.2 ready to download today https://t.co/QZu3XvAjts | 15 |

So the most popular tweets are about ICIJ and their Connected Data Fellowship, or about the new Neo4j version.

Last but not least: whose tweets received the most retweets, making that user the winner of the least "lonely document" award (if that were a real award)?

MATCH (rt)-[r2:RE_TWEETS]->(t)<-[:TWEETS]-(u)
RETURN u.name AS user, count(rt) AS numberOfRetweets
ORDER BY numberOfRetweets DESC

Rik van Bruggen

Or, as my dear friend Rik would say: "Maybe I’m the most lonely document and that’s the reason why I tweet that much about Neo4j, I don’t have any hobbies. 😉"
