Tutorial Notebooks
Learn repoindex interactively with our comprehensive Jupyter notebook tutorials. These hands-on guides cover everything from basic repository management to advanced clustering and workflow automation.
Overview
The repoindex tutorial notebooks provide:
- Interactive Learning: Run code and see results immediately
- Progressive Complexity: Start simple, build to advanced topics
- Real Examples: Work with actual repository data
- Best Practices: Learn recommended patterns and workflows
- Visualization: See your data through charts and graphs
Getting Started
Prerequisites
- Python 3.8 or higher
- Jupyter Notebook or JupyterLab
- repoindex installed:
```bash
pip install repoindex
```
Setup
```bash
# Clone the repoindex repository to access the notebooks
git clone https://github.com/queelius/repoindex.git
cd repoindex/notebooks

# Install Jupyter and the visualization dependencies
pip install jupyter matplotlib seaborn plotly pandas

# Start Jupyter
jupyter notebook

# Or use JupyterLab for a more modern interface
jupyter lab
```
Quick Start with Docker
If you prefer a containerized environment:
```bash
docker run -it -p 8888:8888 \
  -v $(pwd):/home/jovyan/work \
  jupyter/datascience-notebook

# Then navigate to /work/notebooks in the Jupyter interface
```
Tutorial Notebooks
1. Getting Started (01_getting_started.ipynb)
Duration: 20-30 minutes | Level: Beginner
Learn the fundamentals of repoindex:
- Installing and configuring repoindex
- Understanding the JSONL output format
- Listing and querying repositories
- Working with the catalog and tags
- Basic status checks and updates
What You'll Learn:
```python
# List repositories and parse JSONL
import json

repos = !repoindex list
for line in repos:
    repo = json.loads(line)
    print(f"{repo['name']}: {repo['language']}")

# Query repositories with filters
python_repos = !repoindex query "language == 'Python'"
print(f"Found {len(python_repos)} Python repositories")

# Add tags and organize
!repoindex catalog tag myrepo "production" "python" "api"
```
Key Concepts:
- JSONL streaming output
- Unix pipeline composition
- Tag-based organization
- Query language basics
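Because repoindex emits JSONL, you can also process its output as a stream rather than buffering everything in memory. A minimal sketch outside the notebook, assuming `repoindex` is on your PATH and `repoindex list` emits one JSON object per line (as in the examples above):

```python
# Stream `repoindex list` output line by line instead of loading it all at once.
import json
import subprocess

proc = subprocess.Popen(["repoindex", "list"], stdout=subprocess.PIPE, text=True)
for line in proc.stdout:
    line = line.strip()
    if not line:
        continue
    repo = json.loads(line)  # each JSONL line is a standalone JSON object
    if repo.get("language") == "Python":
        print(repo["name"])
proc.wait()
```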
2. Clustering Analysis (02_clustering_analysis.ipynb)
Duration: 45-60 minutes | Level: Intermediate
Master repository clustering and analysis:
- Understanding clustering algorithms
- Feature extraction and selection
- Detecting duplicate code
- Visualizing cluster relationships
- Interpreting clustering results
What You'll Learn:
```python
import pandas as pd
import matplotlib.pyplot as plt

# Run clustering and load results
!repoindex cluster analyze --algorithm kmeans -r > clusters.jsonl

# Parse and visualize
clusters = pd.read_json('clusters.jsonl', lines=True)
cluster_results = clusters[clusters['action'] == 'cluster_result']

# Visualize cluster sizes
cluster_sizes = cluster_results['cluster'].apply(lambda x: x['size'])
plt.bar(range(len(cluster_sizes)), cluster_sizes)
plt.title('Cluster Size Distribution')
plt.xlabel('Cluster ID')
plt.ylabel('Number of Repositories')
plt.show()

# Find duplicates
!repoindex cluster find-duplicates --min-similarity 0.8 -r > duplicates.jsonl
duplicates = pd.read_json('duplicates.jsonl', lines=True)
high_sim = duplicates[duplicates['similarity'] > 0.9]
print(f"Found {len(high_sim)} highly similar repository pairs")
```
Key Concepts:
- Clustering algorithm selection (K-means, DBSCAN, hierarchical)
- Feature importance and weighting
- Similarity scoring and thresholds
- Code duplication patterns
- Cluster quality metrics (silhouette score, coherence)
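If you want to sanity-check cluster quality yourself, the silhouette score mentioned above is straightforward to compute with scikit-learn. A minimal sketch, assuming you have already built a numeric feature matrix for your repositories (the random `X` below is a stand-in, and scikit-learn is not in the tutorial dependencies above, so `pip install scikit-learn` first):

```python
# Compare K-means cluster counts by silhouette score (range -1..1, higher is better).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 8))  # stand-in for your repository feature matrix

for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(f"k={k}: silhouette={silhouette_score(X, labels):.3f}")
```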
Hands-On Exercises:
1. Compare clustering algorithms on your repositories
2. Identify duplicate code across projects
3. Create a consolidation plan based on analysis
4. Visualize technology stack distribution
3. Workflow Orchestration (03_workflow_orchestration.ipynb)
Duration: 60-90 minutes | Level: Intermediate to Advanced
Build powerful automated workflows:
- YAML workflow syntax
- DAG execution and dependencies
- Conditional logic and branching
- Parallel task execution
- Error handling and retries
What You'll Learn:
```python
# Create a workflow programmatically
from repoindex.integrations.workflow import Workflow, Task

workflow = Workflow(
    name="Portfolio Analysis",
    description="Analyze entire repository portfolio"
)

# Add tasks
workflow.add_task(Task(
    id="list_repos",
    type="repoindex",
    command="list",
    args=["--format", "json"],
    parse_output=True,
    output_var="all_repos"
))

workflow.add_task(Task(
    id="cluster_analysis",
    type="repoindex",
    command="cluster analyze",
    args=["--algorithm", "kmeans"],
    depends_on=["list_repos"],
    parse_output=True,
    output_var="clusters"
))

workflow.add_task(Task(
    id="generate_report",
    type="python",
    code="""
report = []
report.append(f"Total repos: {len(context['all_repos'])}")
report.append(f"Clusters found: {len(context['clusters'])}")
with open('portfolio-report.md', 'w') as f:
    f.write('\\n'.join(report))
""",
    depends_on=["cluster_analysis"]
))

# Save and run
workflow.save('portfolio-analysis.yaml')
!repoindex workflow run portfolio-analysis.yaml
```
Key Concepts:
- YAML workflow structure
- Task dependencies and the DAG
- Variable templating and context
- Parallel vs. sequential execution
- Error recovery strategies
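For reference, the saved workflow is plain YAML you can edit by hand. The exact schema is defined by repoindex; the sketch below is only an assumption inferred from the `Task` fields used above (`id`, `type`, `command`, `args`, `depends_on`, `output_var`), so treat the field names as illustrative rather than authoritative:

```yaml
# Hypothetical shape of portfolio-analysis.yaml, inferred from the Python API above.
name: Portfolio Analysis
description: Analyze entire repository portfolio
tasks:
  - id: list_repos
    type: repoindex
    command: list
    args: ["--format", "json"]
    output_var: all_repos
  - id: cluster_analysis
    type: repoindex
    command: cluster analyze
    args: ["--algorithm", "kmeans"]
    depends_on: [list_repos]
    output_var: clusters
```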
Hands-On Exercises:
1. Build a morning routine workflow
2. Create a release pipeline workflow
3. Implement error handling and retries
4. Use conditional execution for different scenarios
4. Advanced Integrations (04_advanced_integrations.ipynb)
Duration: 60-75 minutes | Level: Advanced
Combine multiple repoindex features for powerful workflows:
- Integration patterns and composition
- Custom action development
- Network analysis and visualization
- Exporting to multiple formats
- Automation and scheduling
What You'll Learn:
```python
# Combine clustering with export: a single shell pipeline run from the notebook
# (kept on one line, since `!` lines don't reliably support backslash continuation)
!repoindex cluster analyze -r | jq 'select(.action == "cluster_result")' | repoindex export markdown --stdin --output clusters.md

# Create a custom workflow action
from repoindex.integrations.workflow import Action

class AnalyzeAndReportAction(Action):
    def execute(self, parameters, context):
        # Cluster repositories
        clusters = self.run_repoindex_command('cluster', 'analyze', '-r')
        # Find duplicates
        duplicates = self.run_repoindex_command('cluster', 'find-duplicates', '-r')
        # Generate a comprehensive report
        report = self.generate_report(clusters, duplicates)
        return {'status': 'success', 'report': report}

# Use it in a workflow
workflow.add_task(Task(
    id="comprehensive_analysis",
    action="custom.analyze_and_report"
))
```
Key Concepts:
- Pipeline composition with Unix tools
- Custom integration development
- Multi-format export workflows
- Integration with external tools (jq, awk, etc.)
- Service mode and automation
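The same pipeline composition works outside the notebook. A minimal Python sketch of chaining repoindex into `jq` with OS-level pipes, assuming both commands are on your PATH:

```python
# Chain `repoindex cluster analyze -r` into `jq` via subprocess pipes.
import subprocess

analyze = subprocess.Popen(
    ["repoindex", "cluster", "analyze", "-r"],
    stdout=subprocess.PIPE,
)
filtered = subprocess.run(
    ["jq", 'select(.action == "cluster_result")'],
    stdin=analyze.stdout,
    capture_output=True,
    text=True,
)
analyze.stdout.close()  # let repoindex receive SIGPIPE if jq exits early
analyze.wait()
print(filtered.stdout)
```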
Hands-On Exercises:
1. Build a custom workflow action
2. Create a multi-stage analysis pipeline
3. Integrate with external APIs
4. Set up automated reporting
5. Data Visualization (05_data_visualization.ipynb)
Duration: 45-60 minutes | Level: Intermediate
Visualize repository data effectively:
- Creating informative charts
- Interactive dashboards
- Network visualizations
- Technology landscape maps
- Trend analysis over time
What You'll Learn:
```python
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import networkx as nx

# Load repository data
repos = pd.read_json('repos.jsonl', lines=True)

# Language distribution pie chart
lang_counts = repos['language'].value_counts()
fig = px.pie(values=lang_counts.values, names=lang_counts.index,
             title='Repository Language Distribution')
fig.show()

# Cluster coherence scores
clusters = pd.read_json('clusters.jsonl', lines=True)
cluster_data = clusters[clusters['action'] == 'cluster_result']
coherence = cluster_data['cluster'].apply(lambda x: x['coherence_score'])
fig = go.Figure(data=[go.Bar(x=list(range(len(coherence))), y=coherence)])
fig.update_layout(title='Cluster Coherence Scores',
                  xaxis_title='Cluster ID',
                  yaxis_title='Coherence Score')
fig.show()

# Repository dependency network
G = nx.Graph()

# Add nodes and edges from dependency data
# (iterate rows with iterrows(); looping over a DataFrame directly yields column names)
for _, repo in repos.iterrows():
    G.add_node(repo['name'])
    for dep in repo.get('dependencies', []):
        if dep in repos['name'].values:
            G.add_edge(repo['name'], dep)

# Visualize the network
pos = nx.spring_layout(G)
nx.draw(G, pos, with_labels=True, node_color='lightblue',
        node_size=500, font_size=8, font_weight='bold')
plt.title('Repository Dependency Network')
plt.show()
```
Key Concepts:
- Data preparation from JSONL
- Static visualizations with matplotlib/seaborn
- Interactive plots with Plotly
- Network graphs with NetworkX
- Dashboard creation with Dash
Hands-On Exercises:
1. Create a technology stack heatmap (a starter sketch follows this list)
2. Build an interactive cluster explorer
3. Visualize repository dependencies
4. Generate trend charts over time
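As a starting point for the heatmap exercise, here is a minimal sketch with seaborn. It assumes your `repos` DataFrame has a `language` column and a `tags` column holding a list of tags per repository (as produced by `repoindex catalog tag`); adjust the field names to match your actual output:

```python
# Heatmap of tag counts per language, built from the repos DataFrame.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

repos = pd.read_json('repos.jsonl', lines=True)

# One row per (language, tag) pair, then cross-tabulate into a count matrix.
pairs = repos[['language', 'tags']].explode('tags').dropna()
matrix = pd.crosstab(pairs['language'], pairs['tags'])

sns.heatmap(matrix, annot=True, fmt='d', cmap='Blues')
plt.title('Technology Stack Heatmap')
plt.tight_layout()
plt.show()
```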
Learning Path
For Beginners
1. Start with Notebook 1 (Getting Started)
   - Learn basic commands and concepts
   - Understand JSONL output format
   - Practice with your own repositories
2. Move to Notebook 5 (Data Visualization)
   - See your data in visual form
   - Understand repository patterns
   - Create informative charts
3. Try Notebook 2 (Clustering Analysis)
   - Group similar repositories
   - Find patterns in your portfolio
   - Detect duplicates
For Intermediate Users
1. Review Notebook 1 quickly
2. Deep dive into Notebook 2 (Clustering)
   - Experiment with different algorithms
   - Understand feature selection
   - Optimize clustering parameters
3. Master Notebook 3 (Workflows)
   - Automate repetitive tasks
   - Build complex pipelines
   - Implement error handling
4. Explore Notebook 4 (Advanced Integrations)
   - Combine features creatively
   - Build custom integrations
   - Create production workflows
For Advanced Users
1. Skip to Notebook 4 (Advanced Integrations)
   - Study integration patterns
   - Develop custom actions
   - Build enterprise workflows
2. Use Notebook 5 for inspiration
   - Advanced visualization techniques
   - Custom dashboard creation
   - Real-time monitoring
3. Contribute back
   - Share your notebooks
   - Create new integrations
   - Improve documentation
Tips for Success
Environment Setup
```bash
# Create a dedicated environment
python -m venv repoindex-tutorials
source repoindex-tutorials/bin/activate  # On Windows: repoindex-tutorials\Scripts\activate

# Install all dependencies (extras quoted so the brackets survive shells like zsh)
pip install "repoindex[clustering,workflows]" jupyter pandas matplotlib seaborn plotly networkx
```
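Before opening the notebooks, it is worth confirming everything imports cleanly. A quick sanity check (this assumes each package's importable module name matches its distribution name, which holds for the packages listed above):

```python
# Fail fast if any tutorial dependency is missing from the environment.
import importlib

for pkg in ("repoindex", "pandas", "matplotlib", "seaborn", "plotly", "networkx"):
    importlib.import_module(pkg)
    print(f"ok: {pkg}")
```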
Working with Notebooks
- Execute cells sequentially: Notebooks build on previous cells
- Save frequently: Use Ctrl+S or Cmd+S to save your work
- Restart kernel if needed: If something breaks, restart and run all cells
- Experiment freely: Copy cells to try variations
- Add your own notes: Use markdown cells for observations
Data Preparation
```python
# Helper function to load JSONL data
import json
import subprocess

import pandas as pd

def load_repoindex_output(file_or_command):
    """Load repoindex JSONL output into a DataFrame.

    Accepts either a path to a .jsonl file or a repoindex command to run.
    """
    if file_or_command.endswith('.jsonl'):
        return pd.read_json(file_or_command, lines=True)
    result = subprocess.run(file_or_command, shell=True,
                            capture_output=True, text=True)
    lines = result.stdout.strip().split('\n')
    data = [json.loads(line) for line in lines if line]
    return pd.DataFrame(data)

# Usage
repos = load_repoindex_output('repoindex list')
# or
repos = load_repoindex_output('repos.jsonl')
```
Troubleshooting
Jupyter not starting:
```bash
jupyter notebook --debug

# Check for port conflicts; try a different port
jupyter notebook --port 8889
```
Kernel dies when running repoindex commands:
```bash
# Raise the output data-rate limit so large JSONL streams don't kill the connection
jupyter notebook --NotebookApp.iopub_data_rate_limit=1000000000
```
Import errors:
```bash
# Ensure you're using the right Python
which python

# Reinstall in the correct environment
pip install --force-reinstall repoindex
```
JSONL parsing errors:
```python
# Robust JSONL parsing
import json

def safe_parse_jsonl(file_path):
    """Parse a JSONL file, skipping (and reporting) malformed lines."""
    data = []
    with open(file_path, 'r') as f:
        for i, line in enumerate(f, 1):
            try:
                data.append(json.loads(line))
            except json.JSONDecodeError as e:
                print(f"Error on line {i}: {e}")
    return data
```
Advanced Topics
Creating Custom Visualizations
```python
# Template for a custom visualization
import plotly.graph_objects as go

def create_cluster_sunburst(clusters_df):
    """Create a sunburst chart of cluster hierarchies."""
    labels = []
    parents = []
    values = []

    for _, cluster in clusters_df.iterrows():
        cluster_info = cluster['cluster']
        cluster_id = f"Cluster {cluster_info['cluster_id']}"

        # Add the cluster as a parent node
        labels.append(cluster_id)
        parents.append("")
        values.append(cluster_info['size'])

        # Add its repositories as children
        for repo in cluster_info['repositories']:
            repo_name = repo.split('/')[-1]
            labels.append(repo_name)
            parents.append(cluster_id)
            values.append(1)

    fig = go.Figure(go.Sunburst(
        labels=labels,
        parents=parents,
        values=values,
    ))
    fig.update_layout(title='Repository Cluster Hierarchy')
    return fig

# Usage
fig = create_cluster_sunburst(cluster_results)
fig.show()
```
Building Interactive Dashboards
```python
# Simple Dash dashboard skeleton (figures are populated separately; see the
# callback sketch below)
from dash import Dash, dcc, html
import dash_bootstrap_components as dbc

app = Dash(__name__, external_stylesheets=[dbc.themes.BOOTSTRAP])

app.layout = dbc.Container([
    dbc.Row([
        dbc.Col(html.H1("Repository Dashboard"), width=12)
    ]),
    dbc.Row([
        dbc.Col(dcc.Graph(id='language-dist'), width=6),
        dbc.Col(dcc.Graph(id='cluster-coherence'), width=6)
    ]),
    dbc.Row([
        dbc.Col(dcc.Graph(id='dependency-network'), width=12)
    ])
])

if __name__ == '__main__':
    app.run_server(debug=True, port=8050)  # on newer Dash releases, use app.run(...)
```
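The graphs above are empty until something supplies a `figure`. You can compute figures up front and pass them as `dcc.Graph(figure=...)`, but the usual Dash pattern is a callback, which also scales to interactive dashboards. A minimal sketch for the language chart, assuming `repos.jsonl` exists in the working directory:

```python
# Populate the language-distribution graph. Dash runs callbacks once on page
# load, so using the graph's own (constant) `id` property as the Input is one
# simple way to fill a static figure without adding extra controls.
from dash import Input, Output
import pandas as pd
import plotly.express as px

@app.callback(Output('language-dist', 'figure'),
              Input('language-dist', 'id'))
def update_language_chart(_):
    repos = pd.read_json('repos.jsonl', lines=True)
    counts = repos['language'].value_counts()
    return px.pie(values=counts.values, names=counts.index,
                  title='Repository Language Distribution')
```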
Additional Resources
Documentation
Example Repositories
Community
Video Tutorials (Coming Soon)
- Introduction to repoindex
- Clustering in action
- Building your first workflow
- Advanced integration patterns
Contributing Your Notebooks
Have you created useful notebooks? Share them with the community!
- Fork the repository
- Add your notebook to `notebooks/community/`
- Include a README explaining the notebook's purpose
- Submit a pull request
Notebook Guidelines
- Clear documentation and comments
- Self-contained (include all necessary imports)
- Sample data or instructions to generate it
- Expected outcomes and interpretations
- Attribution for external resources
Next Steps
After completing the tutorials:
- Apply to your repositories: Use repoindex with your actual projects
- Customize workflows: Build workflows for your specific needs
- Explore integrations: Try advanced features and integrations
- Join the community: Share experiences and get help
- Contribute: Help improve repoindex and its documentation
Ready to dive in? Start with Notebook 1: Getting Started