This repository was archived by the owner on Jan 1, 2021. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathcode.Rmd
More file actions
150 lines (116 loc) · 6.07 KB
/
Copy pathcode.Rmd
File metadata and controls
150 lines (116 loc) · 6.07 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
---
title: "Neo4j gRaphs"
output:
html_document:
df_print: paged
toc: yes
html_notebook:
toc: yes
---
## Introduction
Thanks to the driver `RNeo4j` we can use the advantages of the database Neo4j within R. Treat with a new database can be scary and learning the language `Cypher` requires time and effort. In this demo we are going to import data from a Neo4j database into R and treat it like R objects.
For this example we are going to talk about a *movie graph*, that means movies/actors/directors/... and the relationships between them. Get the data from the [Neo4j website](https://neo4j.com/developer/example-data/), or directly from the [GitHub repository](https://github.com/neo4j-examples/cineasts-spring-data-neo4j-3).
## Set the environment
First of all we need the driver to connect to Neo4j. Unfortunately this package is not on CRAN anymore but you can install it directly from [GitHub](https://github.com/nicolewhite/RNeo4j).
I load dplyr to apply easy transformations on the dataframe.
I also load igraph and [ggraph](https://cran.r-project.org/web/packages/ggraph/) to use the potential of R graph packages.
Note: Before connecting you must start your database.
```{r environment, message=FALSE}
# Connector
library(RNeo4j)
# Manipulate data
library(dplyr)
# Work with graphs
library(igraph)
# Visualize
library(ggplot2)
library(ggthemes)
library(visNetwork)
## If you want to use the Grammar of graphics
# library(ggraph)
graph <- startGraph("http://localhost:7474/db/data/", username="neo4j", password="root")
```
## About graphs in Neo4j
A graph is a very simple concept: **A graph G is a collection of entities (nodes) and the relationships (edges) that connect those entities _G = {N, E}_**
In our movie database we can easily find an example listing the actor Kevin Bacon and the movies he has acted in:
```{r kevin}
q <- ("MATCH (a:Actor { name: 'Kevin Bacon' })--(m:Movie)
RETURN a.name AS from, m.title AS to, m.releaseDate AS epoch")
bacon <- cypher(graph, q) %>%
mutate(epoch = as.numeric(epoch),
year = format(as.POSIXct(epoch/1000, origin="1970-01-01"), format="%Y"),
recent = ifelse(year >= 2000, TRUE, FALSE))
ig <- graph_from_data_frame(bacon)
data <- toVisNetworkData(ig)
visNetwork(nodes = data$nodes, edges = data$edges, height = "500px")
```
So the **nodes** are *Kevin Bacon* and the *movies* he has acted in. And the edges are the actual relationship *has acted in*. Both nodes and edges can posses searchable properties. Let's see the properties of a movie node, in this case _White Water Summer_, which is the first movie listed in our `bacon` dataframe.
```{r properties}
q <- paste0("Match (n:Movie {title: '", bacon[1,2],"' })
RETURN properties(n) AS properties")
properties <- cypherToList(graph, q)
purrr::map(properties, ~names(.x$properties))
q <- ("MATCH (a:Actor) -[r]- (m:Movie { title: 'Footloose' })
RETURN properties(r) AS properties")
properties <- cypherToList(graph, q)
purrr::map(properties, ~(.x$properties[[1]][1]))
```
We can make our relationships bidirectional or with one direction, it all depends on our definition of that relationship. In our case, Kevin Bacon **acts in**, so it will only have one direction. Neo4j doesn't work with undirectional graphs.
We can add useful information to the node accompanied with a label which helps to group the nodes into sets. You can tell Neo4j to perform operation only in a given set, e.g. _search all actors born in New York_.
Labeling and adding properties works in relationships too, in our database the relationship *ACTS_IN* has the name of the character as property.
```{r NY}
q <- ("MATCH (a:Actor { birthplace: 'New York' })
RETURN a.name")
NY <- cypher(graph, q)
NY
```
## Query in Neo4j, analise in R
Let's extract some data from our data base and visualise and analise directly in R. It's easy to start with the actors who have more movies in the database.
```{r most_movies}
q <- ("MATCH (a:Actor)-[:ACTS_IN]->(m:Movie)
RETURN m.genre AS genre, COUNT(m) AS movies
ORDER BY movies DESC
LIMIT 20")
most_movies <- cypher(graph, q)
ggplot(most_movies, aes(reorder(genre, -movies), movies)) +
geom_bar(stat="identity") +
theme_tufte() +
theme(axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle("Most played genres")
```
Let's see the genres of these movies, to see the most acted genre (can we adventure drama or action?)
```{r most_genres}
q <- ("MATCH (a:Actor)-[:ACTS_IN]->(m:Movie)
WHERE m.genre IN ['Comedy', 'Drama', 'Action']
RETURN a.name AS name, m.genre AS genre, COUNT(m) AS count
ORDER BY count DESC
LIMIT 30")
most_genres <- cypher(graph, q)
ggplot(most_genres, aes(name, genre)) +
geom_tile(aes(fill = genre, alpha = count)) +
theme_tufte() +
theme(axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle("Actors with most movies by genre ")
```
My friend Leti used to play this game in highschool: she chooses two actors/actresses and she must find the shortest path between them, taking movies as the middle nodes. You can also play with directors or any other type of cast. This is similar to the [Six degrees of Kevin Bacon](https://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon) extended to any two actors.
Let's play.
```{r play}
q <- "MATCH (a:Actor) RETURN a.name as name"
actors <- cypher(graph, q)
q <- "MATCH path=shortestPath((a1:Actor {name:{a1_name}})-[:ACTS_IN*]-(a2:Actor {name:{a2_name}})) RETURN NODES(path)"
# For directors
# q <- "MATCH path=shortestPath((a1:Director {name:{a1_name}})-[:DIRECTED*]-(a2:Director {name:{a2_name}})) RETURN NODES(path)"
names <- list(a1_name=sample(actors$name, 1), a2_name=sample(actors$name, 1))
print(names)
shortest <- cypherToList(graph, q, names)
shortest
info <- tibble (
name = shortest[[1]][["NODES(path)"]] %>% purrr::map("name"),
title = shortest[[1]][["NODES(path)"]] %>% purrr::map("title"),
label = shortest[[1]][["NODES(path)"]] %>% purrr::map(getLabel)
)
```