# Tutorial 1: Research Papers Knowledge Graph
Build a knowledge graph of machine learning research papers with field-specific similarity.
## Goal

Create a searchable database of ML papers where:

- Titles must match precisely (high weight, high threshold)
- Abstracts capture semantic content (medium weight, chunked)
- Tags use set overlap (Jaccard similarity; see the sketch after this list)
- Authors use set overlap (low weight)
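Jaccard similarity scores two sets by the size of their intersection over the size of their union, so it rewards shared tags without penalizing papers for having many of them. A minimal sketch of the metric (the `jaccard` helper here is illustrative, not a library function):

```python
def jaccard(a: set, b: set) -> float:
    """|A & B| / |A | B|: 1.0 for identical sets, 0.0 for disjoint ones."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# One shared tag out of five distinct tags overall scores 1/5 = 0.2
print(jaccard({'transformers', 'attention', 'nlp'},
              {'transformers', 'vision', 'image-classification'}))  # 0.2
```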
## Step 1: Create Configuration

Create `config/ml_papers.yaml`:
```yaml
schema:
  title:
    type: text
    required: true
  abstract:
    type: text
    required: true
  tags:
    type: list
    default: []
  authors:
    type: list
    default: []
  year:
    type: number

embeddings:
  # Title: high-precision matching
  title_vec:
    field: title
    model: tfidf

  # Abstract: semantic content with chunking
  abstract_vec:
    field: abstract
    model: tfidf
    chunking:
      method: sentences
      max_tokens: 512
      overlap: 50

  # Combined text embedding
  text_vec:
    combine:
      - ref: title_vec
        weight: 0.35
      - ref: abstract_vec
        weight: 0.65

similarities:
  # Semantic text similarity
  text_sim:
    embedding: text_vec

  # Tags: set overlap
  tag_sim:
    field: tags
    metric: jaccard

  # Authors: set overlap
  author_sim:
    field: authors
    metric: jaccard

  # Combined similarity
  overall:
    combine:
      - ref: text_sim
        weight: 0.8
      - ref: tag_sim
        weight: 0.15
      - ref: author_sim
        weight: 0.05

network:
  edges:
    similarity: overall
    min: 0.4
```
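Each `combine` block is a weighted sum: a pair's `overall` score is `0.8 * text_sim + 0.15 * tag_sim + 0.05 * author_sim`, and the `network` section only creates an edge when that score reaches `min: 0.4`. A hand-worked sketch of the arithmetic (the component scores below are made-up inputs, not library output):

```python
# Hypothetical component scores for one pair of papers
text_sim, tag_sim, author_sim = 0.55, 0.40, 0.0

overall = 0.8 * text_sim + 0.15 * tag_sim + 0.05 * author_sim
print(f"{overall:.2f}")   # 0.50
print(overall >= 0.4)     # True -> this pair gets an edge
```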
## Step 2: Initialize System
```python
from src.network_rag import NetworkRAG

# Create the RAG instance from the config
rag = (NetworkRAG.builder()
       .with_storage('ml_papers.db')
       .with_tfidf_embeddings()
       .from_config('config/ml_papers.yaml')
       .build())
```
## Step 3: Add Papers
```python
papers = [
    {
        'id': 'transformer',
        'title': 'Attention Is All You Need',
        'abstract': 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks...',
        'tags': ['transformers', 'attention', 'seq2seq', 'nlp'],
        'authors': ['Vaswani', 'Shazeer', 'Parmar', 'Uszkoreit', 'Jones', 'Gomez', 'Kaiser', 'Polosukhin'],
        'year': 2017
    },
    {
        'id': 'bert',
        'title': 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding',
        'abstract': 'We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers...',
        'tags': ['transformers', 'pretraining', 'nlp', 'bert'],
        'authors': ['Devlin', 'Chang', 'Lee', 'Toutanova'],
        'year': 2019
    },
    {
        'id': 'gpt3',
        'title': 'Language Models are Few-Shot Learners',
        'abstract': 'Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning...',
        'tags': ['transformers', 'gpt', 'few-shot', 'nlp'],
        'authors': ['Brown', 'Mann', 'Ryder', 'Subbiah', 'Kaplan'],
        'year': 2020
    },
    {
        'id': 'vit',
        'title': 'An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale',
        'abstract': 'While the Transformer architecture has become the de-facto standard for natural language processing tasks...',
        'tags': ['transformers', 'vision', 'image-classification'],
        'authors': ['Dosovitskiy', 'Beyer', 'Kolesnikov', 'Weissenborn'],
        'year': 2021
    },
    {
        'id': 'resnet',
        'title': 'Deep Residual Learning for Image Recognition',
        'abstract': 'Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks...',
        'tags': ['cnn', 'vision', 'residual', 'image-classification'],
        'authors': ['He', 'Zhang', 'Ren', 'Sun'],
        'year': 2016
    }
]

# Add papers
for paper in papers:
    rag.add(paper['id'], document=paper)

print(f"Added {len(papers)} papers")
```
## Step 4: Build Network
```python
# Build the similarity network
graph = rag.build_network()

print("\nNetwork Statistics:")
print(f"  Nodes: {len(graph.nodes())}")
print(f"  Edges: {len(graph.edges())}")

# Check density (fraction of possible edges that exist)
if len(graph.nodes()) > 1:
    max_edges = len(graph.nodes()) * (len(graph.nodes()) - 1) / 2
    density = len(graph.edges()) / max_edges
    print(f"  Density: {density:.3f}")
```
## Step 5: Analyze Communities
```python
from collections import defaultdict

# Detect communities
communities = rag.detect_communities()

# Group papers by community
comm_groups = defaultdict(list)
for node_id, comm_id in communities.items():
    comm_groups[comm_id].append(node_id)

print(f"\nCommunities Detected: {len(comm_groups)}")

for comm_id, nodes in sorted(comm_groups.items()):
    print(f"\n  Community {comm_id}: {len(nodes)} papers")

    # Auto-tag the community with representative keywords
    tags = rag.auto_tag_community(comm_id, n_samples=len(nodes))
    print(f"  Keywords: {', '.join(tags[:5])}")

    # Show the papers in this community
    for node_id in nodes:
        node = rag.storage.get_node(node_id)
        doc = node['metadata']
        print(f"    - {doc['title'][:60]}... ({doc['year']})")
```
Expected output:
```text
Community 0: 4 papers (NLP/Transformer papers)
  Keywords: transformers, attention, language, nlp
  - Attention Is All You Need (2017)
  - BERT: Pre-training... (2019)
  - Language Models are Few-Shot Learners (2020)
  - An Image is Worth 16x16 Words... (2021)

Community 1: 1 paper (Vision/CNN)
  Keywords: residual, image, recognition
  - Deep Residual Learning for Image Recognition (2016)
```
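Note that ViT lands in the transformer community even though tag overlap alone would pull it toward ResNet: on tags, ViT–ResNet scores 2/5 = 0.4 (shared `vision` and `image-classification`), while ViT–BERT scores only 1/6 ≈ 0.17 (shared `transformers`). Because `tag_sim` carries just 0.15 of the `overall` weight, the transformer-heavy title and abstract text dominates the grouping.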
## Step 6: Search
```python
# Search for transformer papers
print("\n=== Search: 'transformer architecture' ===")
results = rag.search('transformer architecture').with_strategy('hybrid').top(5)

for i, result in enumerate(results, 1):
    print(f"\n{i}. {result.id} (score: {result.score:.3f})")
    doc = result.metadata
    print(f"   Title: {doc['title']}")
    print(f"   Year: {doc['year']}")
    print(f"   Tags: {', '.join(doc.get('tags', []))}")
    print(f"   Community: {result.community_id}")
```
## Step 7: Find Bridges
```python
# Find papers that bridge NLP and vision
bridges = rag.get_bridge_nodes(min_betweenness=0.1)

print("\n=== Bridge Papers (connect different communities) ===")
for bridge_id in bridges:
    node = rag.storage.get_node(bridge_id)
    doc = node['metadata']

    # Get the communities this paper connects
    neighbors = rag.get_neighbors(bridge_id, k_hops=1)
    neighbor_communities = {rag.get_community_for_node(n) for n in neighbors}

    print(f"\n{doc['title']}")
    print(f"  Connects communities: {neighbor_communities}")
    print(f"  Tags: {', '.join(doc.get('tags', []))}")
```
## Complete Script

See `examples/ml_papers_tutorial.py` for the complete working script.