# Tutorial 1: Research Papers Knowledge Graph
Build a knowledge graph of machine learning research papers with field-specific similarity.
## Goal

Create a searchable database of ML papers where:

- Titles must match precisely (high weight, high threshold)
- Abstracts capture semantic content (medium weight, chunked)
- Tags use set overlap (Jaccard similarity; see the sketch after this list)
- Authors use set overlap (low weight)
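Jaccard similarity scores two sets by the size of their intersection over the size of their union, so it rewards shared tags without penalizing papers for having many of them. A minimal sketch of the metric (the `jaccard` helper here is illustrative, not a library function):

```python
def jaccard(a: set, b: set) -> float:
    """|A & B| / |A | B|: 1.0 for identical sets, 0.0 for disjoint ones."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# One shared tag out of five distinct tags overall scores 1/5 = 0.2
print(jaccard({'transformers', 'attention', 'nlp'},
              {'transformers', 'vision', 'image-classification'}))  # 0.2
```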
## Step 1: Create Configuration

Create `config/ml_papers.yaml`:
```yaml
schema:
  title:
    type: text
    required: true
  abstract:
    type: text
    required: true
  tags:
    type: list
    default: []
  authors:
    type: list
    default: []
  year:
    type: number

embeddings:
  # Title: high-precision matching
  title_vec:
    field: title
    model: tfidf

  # Abstract: semantic content with chunking
  abstract_vec:
    field: abstract
    model: tfidf
    chunking:
      method: sentences
      max_tokens: 512
      overlap: 50

  # Combined text embedding
  text_vec:
    combine:
      - ref: title_vec
        weight: 0.35
      - ref: abstract_vec
        weight: 0.65

similarities:
  # Semantic text similarity
  text_sim:
    embedding: text_vec

  # Tags: set overlap
  tag_sim:
    field: tags
    metric: jaccard

  # Authors: set overlap
  author_sim:
    field: authors
    metric: jaccard

  # Combined similarity
  overall:
    combine:
      - ref: text_sim
        weight: 0.8
      - ref: tag_sim
        weight: 0.15
      - ref: author_sim
        weight: 0.05

network:
  edges:
    similarity: overall
    min: 0.4
```
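Each `combine` block is a weighted sum: a pair's `overall` score is `0.8 * text_sim + 0.15 * tag_sim + 0.05 * author_sim`, and the `network` section only creates an edge when that score reaches `min: 0.4`. A hand-worked sketch of the arithmetic (the component scores below are made-up inputs, not library output):

```python
# Hypothetical component scores for one pair of papers
text_sim, tag_sim, author_sim = 0.55, 0.40, 0.0

overall = 0.8 * text_sim + 0.15 * tag_sim + 0.05 * author_sim
print(f"{overall:.2f}")   # 0.50
print(overall >= 0.4)     # True -> this pair gets an edge
```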
## Step 2: Initialize System
```python
from src.network_rag import NetworkRAG

# Create the RAG instance from the config
rag = (NetworkRAG.builder()
       .with_storage('ml_papers.db')
       .with_tfidf_embeddings()
       .from_config('config/ml_papers.yaml')
       .build())
```
## Step 3: Add Papers
```python
papers = [
    {
        'id': 'transformer',
        'title': 'Attention Is All You Need',
        'abstract': 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks...',
        'tags': ['transformers', 'attention', 'seq2seq', 'nlp'],
        'authors': ['Vaswani', 'Shazeer', 'Parmar', 'Uszkoreit', 'Jones', 'Gomez', 'Kaiser', 'Polosukhin'],
        'year': 2017
    },
    {
        'id': 'bert',
        'title': 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding',
        'abstract': 'We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers...',
        'tags': ['transformers', 'pretraining', 'nlp', 'bert'],
        'authors': ['Devlin', 'Chang', 'Lee', 'Toutanova'],
        'year': 2019
    },
    {
        'id': 'gpt3',
        'title': 'Language Models are Few-Shot Learners',
        'abstract': 'Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning...',
        'tags': ['transformers', 'gpt', 'few-shot', 'nlp'],
        'authors': ['Brown', 'Mann', 'Ryder', 'Subbiah', 'Kaplan'],
        'year': 2020
    },
    {
        'id': 'vit',
        'title': 'An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale',
        'abstract': 'While the Transformer architecture has become the de-facto standard for natural language processing tasks...',
        'tags': ['transformers', 'vision', 'image-classification'],
        'authors': ['Dosovitskiy', 'Beyer', 'Kolesnikov', 'Weissenborn'],
        'year': 2021
    },
    {
        'id': 'resnet',
        'title': 'Deep Residual Learning for Image Recognition',
        'abstract': 'Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks...',
        'tags': ['cnn', 'vision', 'residual', 'image-classification'],
        'authors': ['He', 'Zhang', 'Ren', 'Sun'],
        'year': 2016
    }
]

# Add papers
for paper in papers:
    rag.add(paper['id'], document=paper)

print(f"Added {len(papers)} papers")
```
## Step 4: Build Network
```python
# Build the similarity network
graph = rag.build_network()

print("\nNetwork Statistics:")
print(f"  Nodes: {len(graph.nodes())}")
print(f"  Edges: {len(graph.edges())}")

# Check density (fraction of possible edges that exist)
if len(graph.nodes()) > 1:
    max_edges = len(graph.nodes()) * (len(graph.nodes()) - 1) / 2
    density = len(graph.edges()) / max_edges
    print(f"  Density: {density:.3f}")
```
## Step 5: Analyze Communities
```python
from collections import defaultdict

# Detect communities
communities = rag.detect_communities()

# Group papers by community
comm_groups = defaultdict(list)
for node_id, comm_id in communities.items():
    comm_groups[comm_id].append(node_id)

print(f"\nCommunities Detected: {len(comm_groups)}")

for comm_id, nodes in sorted(comm_groups.items()):
    print(f"\n  Community {comm_id}: {len(nodes)} papers")

    # Auto-tag the community with representative keywords
    tags = rag.auto_tag_community(comm_id, n_samples=len(nodes))
    print(f"  Keywords: {', '.join(tags[:5])}")

    # Show the papers in this community
    for node_id in nodes:
        node = rag.storage.get_node(node_id)
        doc = node['metadata']
        print(f"    - {doc['title'][:60]}... ({doc['year']})")
```
Expected output:
```text
Community 0: 4 papers (NLP/Transformer papers)
  Keywords: transformers, attention, language, nlp
  - Attention Is All You Need (2017)
  - BERT: Pre-training... (2019)
  - Language Models are Few-Shot Learners (2020)
  - An Image is Worth 16x16 Words... (2021)

Community 1: 1 paper (Vision/CNN)
  Keywords: residual, image, recognition
  - Deep Residual Learning for Image Recognition (2016)
```
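Note that ViT lands in the transformer community even though tag overlap alone would pull it toward ResNet: on tags, ViT–ResNet scores 2/5 = 0.4 (shared `vision` and `image-classification`), while ViT–BERT scores only 1/6 ≈ 0.17 (shared `transformers`). Because `tag_sim` carries just 0.15 of the `overall` weight, the transformer-heavy title and abstract text dominates the grouping.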
## Step 6: Search
```python
# Search for transformer papers
print("\n=== Search: 'transformer architecture' ===")
results = rag.search('transformer architecture').with_strategy('hybrid').top(5)

for i, result in enumerate(results, 1):
    print(f"\n{i}. {result.id} (score: {result.score:.3f})")
    doc = result.metadata
    print(f"   Title: {doc['title']}")
    print(f"   Year: {doc['year']}")
    print(f"   Tags: {', '.join(doc.get('tags', []))}")
    print(f"   Community: {result.community_id}")
```
## Step 7: Find Bridges
```python
# Find papers that bridge NLP and vision
bridges = rag.get_bridge_nodes(min_betweenness=0.1)

print("\n=== Bridge Papers (connect different communities) ===")
for bridge_id in bridges:
    node = rag.storage.get_node(bridge_id)
    doc = node['metadata']

    # Get the communities this paper connects
    neighbors = rag.get_neighbors(bridge_id, k_hops=1)
    neighbor_communities = {rag.get_community_for_node(n) for n in neighbors}

    print(f"\n{doc['title']}")
    print(f"  Connects communities: {neighbor_communities}")
    print(f"  Tags: {', '.join(doc.get('tags', []))}")
```
## Complete Script

See `examples/ml_papers_tutorial.py` for the complete working script.