Don’t be a lonely document – GoDataDriven Academy

"Don’t be a lonely document": that is a famous quote from Emil Eifrem. Last week at GraphConnect he repeated it once again, together with the assignment to tweet about the conference. That inspired me to start scraping Twitter for the keywords "neo4j" and "graphconnect" and to put the results into Neo4j. Are people really connecting?

In my first setup, I tried to fetch tweets in real time with Logstash, publish the stream to Kafka, and have a Spark Streaming job process every tweet and insert it into Neo4j.

The following Logstash configuration covers the first two steps: it reads the matching tweets and publishes them to a Kafka topic.

input {
  twitter {
    consumer_key => "foo"
    consumer_secret => "bar"
    oauth_token => "baz"
    oauth_token_secret => "qux"
    keywords => ["graphconnect", "neo4j", "GraphConnect"]
  }
}
output {
  kafka {
    codec => plain {
      format => "%{message}"
    }
    topic_id => "tweets"
  }
}

The next step is the Spark Streaming job. I had an old test project that does exactly that. For some code examples take a look at: https://github.com/rweverwijk/twitter-to-neo4j

A low laptop battery forced me to abandon this little experiment, but it didn’t leave my mind.

Later at home, I looked for a new way to collect all the tweets with the selected keywords. I created the following simple Python script to search for tweets and store the JSON in a file, one tweet per line:

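import json
import time

import tweepy

# Placeholder credentials: replace with your own Twitter API keys.
ckey = 'foo'
csecret = 'bar'
atoken = 'baz'
asecret = 'qux'

OAUTH_KEYS = {'consumer_key': ckey, 'consumer_secret': csecret,
              'access_token_key': atoken, 'access_token_secret': asecret}
auth = tweepy.OAuthHandler(OAUTH_KEYS['consumer_key'], OAUTH_KEYS['consumer_secret'])
# Apply the access token so that requests are made on behalf of the account.
auth.set_access_token(OAUTH_KEYS['access_token_key'], OAUTH_KEYS['access_token_secret'])
api = tweepy.API(auth)


def limit_handled(cursor):
    # Yield results one by one. When the API reports an error (typically the
    # rate limit), wait 15 minutes and try again.
    while True:
        try:
            yield cursor.next()
        except StopIteration:
            # No more results: end the generator cleanly.
            return
        except tweepy.TweepError as e:
            print(e)
            time.sleep(15 * 60)


def search(keyword):
    # Append every tweet matching the keyword since 2017-05-09 to a JSON-lines file.
    with open('tweets_friday.json', 'a') as the_file:
        for tweet in limit_handled(tweepy.Cursor(api.search, q=keyword, since='2017-05-09').items()):
            the_file.write(json.dumps(tweet._json) + '\n')


# Keywords as described at the start of this post.
for keyword in ['neo4j', 'graphconnect']:
    search(keyword)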
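
The retry pattern in limit_handled can be tried out without Twitter credentials. The following is a minimal sketch under stated assumptions: FakeCursor and RateLimited are hypothetical stand-ins for the tweepy cursor and its error class, and the sleep and wait_seconds parameters are additions that let the pause be recorded instead of actually waiting.

```python
import time


class RateLimited(Exception):
    """Hypothetical stand-in for the rate-limit error raised by an API client."""


class FakeCursor:
    """Hypothetical cursor: yields items, failing once before the item 'b'."""

    def __init__(self, items):
        self._items = list(items)
        self._failed = False

    def __iter__(self):
        return self

    def __next__(self):
        if not self._items:
            raise StopIteration
        if self._items[0] == "b" and not self._failed:
            self._failed = True
            raise RateLimited("rate limit exceeded")
        return self._items.pop(0)


def limit_handled(cursor, wait_seconds=15 * 60, sleep=time.sleep):
    """Yield items from cursor, pausing and retrying when rate limited."""
    while True:
        try:
            yield next(cursor)
        except StopIteration:
            return  # cursor exhausted: end the generator cleanly
        except RateLimited as error:
            print(error)
            sleep(wait_seconds)


# Record the requested pauses instead of sleeping for real.
pauses = []
collected = list(limit_handled(FakeCursor(["a", "b", "c"]), sleep=pauses.append))
print(collected, pauses)
```

The item that triggered the error is not lost: after the pause the generator asks the cursor again, so all three items arrive in order and exactly one 15-minute pause is requested.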

Now the real fun could begin: loading the tweets into Neo4j. The selected structure is very simple. As I’m particularly interested in people that connect, I will look at Twitter users and the mentions in their tweets. Next to that, I want to distinguish between the original writer of a tweet and retweeters. This gives the following structure: a User TWEETS a Tweet, a Tweet MENTIONS a User, and a retweet RE_TWEETS the original Tweet.

The input data is in JSON format. I prefer using Python to read this data, extract the fields that I want to keep, and store them in Neo4j. The following code snippet does just that:

import json
import time

from neo4j.v1 import GraphDatabase


def store_tweet(tx, tweet):
    # Extract only the fields that end up in the graph.
    neo4j_params = {
        "user_id": tweet['user']['id'],
        "user_name": tweet['user']['name'],
        "tweet_id": tweet['id'],
        "tweet_text": tweet['text'],
        # Convert Twitter's created_at timestamp to "YYYY-MM-DD HH:MM:SS".
        "tweet_time": time.strftime('%Y-%m-%d %H:%M:%S',
                                    time.strptime(tweet['created_at'], '%a %b %d %H:%M:%S +0000 %Y')),
        "mentions": tweet['entities']['user_mentions'],
    }
    # MERGE matches users and tweets on uid, so loading the same tweet twice
    # does not create duplicates.
    tx.run("""
        MERGE (u:User {uid: $user_id})
          ON CREATE SET u.name = $user_name
        MERGE (t:Tweet {uid: $tweet_id})
          ON CREATE SET t.text = $tweet_text, t.time = $tweet_time
        MERGE (u)-[:TWEETS]->(t)
        WITH t, $mentions AS mentions
        UNWIND mentions AS mention
        MERGE (m:User {uid: mention.id})
          ON CREATE SET m.name = mention.name
        MERGE (t)-[:MENTIONS]->(m)
        """, neo4j_params)


def process_file(file_name):
    with GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "test")) as driver:
        with open(file_name, 'r') as the_file:
            with driver.session() as session:
                # All tweets are written in a single transaction.
                with session.begin_transaction() as tx:
                    for line in the_file:
                        tweet = json.loads(line)
                        store_tweet(tx, tweet)

                        # A retweet embeds the original tweet: store it too and link the two.
                        if 'retweeted_status' in tweet:
                            store_tweet(tx, tweet['retweeted_status'])
                            retweet_data = {
                                "tweet_id": tweet['id'],
                                "retweet_id": tweet['retweeted_status']['id'],
                            }
                            tx.run("""
                                MATCH (t:Tweet {uid: $tweet_id})
                                MATCH (r:Tweet {uid: $retweet_id})
                                MERGE (t)-[:RE_TWEETS]->(r)
                                """, retweet_data)


process_file('tweets_friday2.json')
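
The field extraction can be checked without a running database. The sketch below mirrors the parameter dictionary built in store_tweet; the function name tweet_to_params and the sample tweet are made up for illustration, and trimming each mention to id and name is an assumption of this sketch (the loader above passes the full mention objects).

```python
import time


def tweet_to_params(tweet):
    """Build the query parameters for one tweet, without touching Neo4j."""
    created = time.strptime(tweet["created_at"], "%a %b %d %H:%M:%S +0000 %Y")
    return {
        "user_id": tweet["user"]["id"],
        "user_name": tweet["user"]["name"],
        "tweet_id": tweet["id"],
        "tweet_text": tweet["text"],
        "tweet_time": time.strftime("%Y-%m-%d %H:%M:%S", created),
        # Keep only the two mention fields that the Cypher statement reads.
        "mentions": [
            {"id": m["id"], "name": m["name"]}
            for m in tweet["entities"]["user_mentions"]
        ],
    }


# Invented sample in the shape of a Twitter search result.
sample_tweet = {
    "id": 1,
    "text": "Hello @neo4j #GraphConnect",
    "created_at": "Thu May 11 09:30:00 +0000 2017",
    "user": {"id": 10, "name": "Example User"},
    "entities": {
        "user_mentions": [{"id": 20, "name": "Neo4j", "screen_name": "neo4j"}]
    },
}

params = tweet_to_params(sample_tweet)
print(params["tweet_time"])  # 2017-05-11 09:30:00
print(params["mentions"])
```

Note that the weekday and month names in created_at are parsed with the current locale, so this assumes an English (or C) locale.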

Let’s see what we can find in Neo4j now.

First, let’s take a quick look at the MENTIONS relationships:

MATCH p=()-[r:MENTIONS]->() RETURN p LIMIT 50

This looks quite nice already!

Let’s find out which user is mentioned most:

MATCH (t:Tweet)-[r:MENTIONS]->(mentioned:User)
RETURN mentioned.name AS user, count(r) AS numberOfMentions
ORDER BY numberOfMentions DESC
LIMIT 10

This results in:

user	numberOfMentions
Neo4j	1014
GraphConnect	613
Emil Eifrem	236
Jim Webber	150
ICIJ	149
Rik Van Bruggen	99
GraphAware	88
LARUS	86
Philip Rathle	86
CluedIn	69

If we exclude organization accounts, the strongest influencers in the graph are Emil, Jim, and Rik. They most certainly were no lonely documents.
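
As a sanity check, the same ranking can be approximated straight from the JSON lines, without the graph. This is a rough sketch, not the method used above: the function name count_mentions and the sample data are invented, and it assumes that each (tweet, mentioned user) pair should count once, which is what MERGE on the MENTIONS relationship does in the loader.

```python
from collections import Counter


def count_mentions(tweets):
    """Count mentions per user name, counting each (tweet, user) pair once."""
    seen_pairs = set()
    counts = Counter()
    for tweet in tweets:
        # A retweet embeds the original tweet; the loader stores both.
        for item in (tweet, tweet.get("retweeted_status")):
            if item is None:
                continue
            for mention in item["entities"]["user_mentions"]:
                pair = (item["id"], mention["id"])
                if pair in seen_pairs:
                    continue
                seen_pairs.add(pair)
                counts[mention["name"]] += 1
    return counts


# Invented sample: tweet 2 is a retweet of tweet 1.
original = {
    "id": 1,
    "entities": {"user_mentions": [{"id": 20, "name": "Neo4j"},
                                   {"id": 21, "name": "GraphConnect"}]},
}
retweet = {
    "id": 2,
    "entities": {"user_mentions": [{"id": 10, "name": "Example User"},
                                   {"id": 20, "name": "Neo4j"},
                                   {"id": 21, "name": "GraphConnect"}]},
    "retweeted_status": original,
}

counts = count_mentions([original, retweet])
print(counts.most_common(10))
```

With a real file, the list of tweets would come from json.loads applied to every line.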

Let’s continue exploring and find out who is writing the most tweets:

MATCH (u:User)-[:TWEETS]->(t:Tweet)
RETURN u.name AS user, count(t) AS numberOfTweets
ORDER BY numberOfTweets DESC
LIMIT 10

user	numberOfTweets
GraphConnect	433
Hakaishin Hokutosei	379
Neo4j	244
Yuxing Sun	113
Christophe Willemsen	109
Neo Questions	85
Bence Arato	42
Cedric Fauvet	41
Nigel Small 🇪🇺	38
Mark Wood	36

GraphConnect and Neo4j seem quite obvious, but I don’t know Hakaishin Hokutosei and 379 seems to be a lot of tweets. What is this user tweeting about?

MATCH (u:User)-[:TWEETS]->(t:Tweet)
WHERE u.name = "Hakaishin Hokutosei"
RETURN t.text
LIMIT 100

t.text
RT @BenceArato: Major @neo4j milestones from version 3.0 to current to future plans #GraphConnect https://t.co/J99pXpSzV5
RT @GraphConnect: .@jimwebber: #Neo4j doesn’t do crazy JOINs or sets — it simply chases pointers\n#GraphConnect
RT @GraphConnect: .@jimwebber: Because #Neo4j is a native #graphdatabase and we own the whole stack, we can build to any clustering need…
RT @GraphConnect: .@jimwebber: #Neo4j 3.1 introduced security and Causal Clustering\n#GraphConnect
RT @GraphConnect: .@jimwebber: Causal Clustering, intro-ed in #Neo4j 3.1, can now span multiple data centers\n#GraphConnect
RT @GraphConnect: .@jimwebber: #Neo4j 3.2 drivers are also more aware of Causal Clusters\n#GraphConnect
RT @GraphConnect: .@jimwebber: #Neo4j 3.2 now is able to use #Kerberos, esp for those of you in #FinServ who are required to use it\n#GraphC…
RT @matethurzo: Closing keynote of #graphconnect @jimwebber is always fun to watch #graph #graphdb #conferenceday #neo4j #devlife https://…
RT @mfalcier: Watching @neo4j #graphconnect Dr. @jimwebber ’s talk from the sofa? Awesome! https://t.co/U0bCXGEd0D
RT @GraphConnect: .@jimwebber: Last year in London, #Neo4j 3.0 abolished the upper storage limit altogether\n#GraphConnect

Wait a second, every tweet starts with "RT". Is this user only retweeting, or are there self-written tweets as well?

Let’s see:

MATCH (u:User)-[:TWEETS]->(t:Tweet)
WHERE u.name = "Hakaishin Hokutosei"
AND NOT (t)-[:RE_TWEETS]->()
RETURN count(t)

count(t)
0

So we need to separate tweets from retweets to distinguish between original writers and retweeters:

MATCH (u)-[r1:TWEETS]->(t)
WHERE NOT (t)-[:RE_TWEETS]->()
OPTIONAL MATCH (u)-[:TWEETS]->(rt)-[r2:RE_TWEETS]->()
RETURN u.name, count(DISTINCT r2) AS numberOfReTweets, count(DISTINCT r1) AS numberOfTweets
ORDER BY numberOfTweets DESC

u.name	numberOfReTweets	numberOfTweets
GraphConnect	238	195
Neo4j	65	179
Yuxing Sun	0	113
Neo Questions	0	85
Mark Wood	4	32
Bence Arato	18	24
Carina Birt	3	23
Marlon Samuels	0	20
Daily Tech Issues	0	16
Andres L. Martinez	1	15
Neo4j France	11	15
Louis Dubruel	0	15
Nigel Small 🇪🇺	23	15
Adam Hill	5	15
Rik Van Bruggen	1	15
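
The same split can be sketched on the raw JSON, where a retweet is recognizable by its retweeted_status field. This is an illustration under assumptions: the function name and the sample tweets are invented. Like the query above, it only reports users with at least one original tweet, so a pure retweeter does not show up.

```python
from collections import defaultdict


def tweets_and_retweets_per_user(tweets):
    """Count original tweets and retweets per user name."""
    originals = defaultdict(set)
    retweets = defaultdict(set)
    for tweet in tweets:
        name = tweet["user"]["name"]
        if "retweeted_status" in tweet:
            retweets[name].add(tweet["id"])
            # The embedded original is stored as well, so credit its author.
            original = tweet["retweeted_status"]
            originals[original["user"]["name"]].add(original["id"])
        else:
            originals[name].add(tweet["id"])
    return {
        name: {
            "numberOfTweets": len(ids),
            "numberOfReTweets": len(retweets.get(name, ())),
        }
        for name, ids in originals.items()
    }


# Invented sample: Bob only retweets, Alice does both, Carol is only retweeted.
a = {"id": 1, "user": {"name": "Alice"}}
b = {"id": 2, "user": {"name": "Bob"}, "retweeted_status": a}
c = {"id": 3, "user": {"name": "Alice"},
     "retweeted_status": {"id": 4, "user": {"name": "Carol"}}}

result = tweets_and_retweets_per_user([a, b, c])
print(result)
```

Sets of tweet ids are used instead of plain counters so that a tweet seen both on its own and embedded in a retweet is counted once.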

What are the most popular tweets?

MATCH (rt)-[r2:RE_TWEETS]->(t)<-[:TWEETS]-(u)
RETURN u.name AS user, t.text, count(rt) AS numberOfRetweets
ORDER BY numberOfRetweets DESC

user	t.text	numberOfRetweets
Mar Cabra	Work with @ICIJorg from DC, Paris or Madrid for 6 months making sense of complex data and graphs thanks to @neo4j’s… https://t.co/Z0rR3Rt7zV	20
ICIJ	Interested in using data to find stories? Want to join ICIJ’s next project? Apply for the Connected Data Fellowship https://t.co/LUdsjWKwRJ	18
William Lyon	Democratizing Data at @AirbnbEng w/ Dataportal, a new tool for scaling data search and discovery powered by @neo4j \n\nhttps://t.co/e12fHuA26M	18
Pat Patterson	Visualizing & Analyzing Salesforce Data with #StreamSets Data Collector & @Neo4j https://t.co/DunEFtAPyO Thx for gr… https://t.co/pXwBISQtme	18
ICIJ	Exciting announcement: We’re now hiring a Neo4j Connected Data Fellow! More info & how to apply here: https://t.co/knjHKgyQiz #GraphConnect	17
Dr. GP Pulipaka	Announcing Neo4j in the Microsoft Azure Marketplace (Part I). #BigData #DataScience #Neo4J #Azure #Analytics… https://t.co/1XEqQHgedu	16
Kursion	#GraphConnect Neo4j 3.2 ready to download today https://t.co/QZu3XvAjts	15

So the most popular tweets are about ICIJ and their Connected Data Fellowship, or about the new Neo4j version.
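
For comparison, retweet counts per original tweet can also be derived from the collected JSON. The sketch below is hypothetical (function name and sample data are invented) and assumes that a retweet collected twice should count once, as it does in the graph because tweets are merged on their id.

```python
from collections import defaultdict


def retweet_counts(tweets):
    """Return (author, original tweet id, retweet count), most retweeted first."""
    retweet_ids = defaultdict(set)
    authors = {}
    for tweet in tweets:
        original = tweet.get("retweeted_status")
        if original is None:
            continue  # not a retweet
        retweet_ids[original["id"]].add(tweet["id"])
        authors[original["id"]] = original["user"]["name"]
    ranked = sorted(retweet_ids, key=lambda tid: len(retweet_ids[tid]), reverse=True)
    return [(authors[tid], tid, len(retweet_ids[tid])) for tid in ranked]


# Invented sample: tweet 1 is retweeted twice (one retweet appears twice in the file).
first = {"id": 1, "user": {"name": "Alice"}}
second = {"id": 2, "user": {"name": "Bob"}}
sample = [
    {"id": 11, "user": {"name": "X"}, "retweeted_status": first},
    {"id": 12, "user": {"name": "Y"}, "retweeted_status": first},
    {"id": 11, "user": {"name": "X"}, "retweeted_status": first},
    {"id": 13, "user": {"name": "Z"}, "retweeted_status": second},
]

ranking = retweet_counts(sample)
print(ranking)
```

Summing the counts per author instead of per tweet gives the per-user ranking.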

Last but not least: which users received the most retweets and can be declared winners of the least "lonely document" award (if that were a real award)?

MATCH (rt)-[r2:RE_TWEETS]->(t)<-[:TWEETS]-(u)
RETURN u.name AS user, count(rt) AS numberOfRetweets
ORDER BY numberOfRetweets DESC

Rik van Bruggen

Or, as my dear friend Rik would say: "Maybe I’m the most lonely document and that’s the reason why I tweet that much about Neo4j; I don’t have any hobbies. 😉"
