Repository Clustering Integration
The clustering integration uses machine learning clustering algorithms (k-means, hierarchical, DBSCAN, and spectral) to group similar repositories automatically, detect duplicates, and analyze the structure of your project portfolio.
Overview
Repository clustering helps you:
- Organize Large Portfolios: Automatically group hundreds of repositories
- Detect Duplicates: Find similar or duplicate projects for consolidation
- Understand Patterns: Discover hidden relationships between projects
- Technology Analysis: Group projects by technology stack
- Maintenance Planning: Identify clusters needing attention
Installation
The clustering integration requires additional dependencies:
# Install with clustering support
pip install "repoindex[clustering]"
# Or install dependencies separately
pip install scikit-learn numpy pandas matplotlib seaborn
Quick Start
# Basic clustering with defaults
repoindex cluster analyze
# Specify algorithm and clusters
repoindex cluster analyze --algorithm kmeans --n-clusters 5
# Detect duplicates
repoindex cluster duplicates --threshold 0.8
# Visualize clusters
repoindex cluster visualize --output clusters.html
# Export clustering results
repoindex cluster export --format json --output clusters.json
Clustering Algorithms
K-Means Clustering
Best for: Well-separated spherical clusters of similar size
repoindex cluster analyze --algorithm kmeans --n-clusters 5
Options:
- --n-clusters: Number of clusters (default: auto-detect)
- --max-iter: Maximum iterations (default: 300)
- --n-init: Number of initializations (default: 10)
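To make the k-means options above more concrete, here is a minimal pure-Python sketch of Lloyd's algorithm. It is an illustration only, not repoindex's implementation (the integration lists scikit-learn as a dependency); the function name `kmeans` and its parameters are hypothetical. It runs a single initialization, so it has no equivalent of `--n-init`.

```python
import math
import random


def kmeans(points, k, max_iter=300, seed=0):
    """Lloyd's algorithm: alternate an assignment step and a centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k distinct points as initial centroids
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, new_labels) if label == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
    return new_labels, centroids
```

Each point is a tuple of numeric features, one tuple per repository. `max_iter` only caps the loop; well-separated data usually converges after a handful of iterations.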
Hierarchical Clustering
Best for: Understanding cluster hierarchy and relationships
repoindex cluster analyze --algorithm hierarchical --distance-threshold 0.5
Options:
- --linkage: Linkage criterion (ward, complete, average, single)
- --distance-threshold: Linkage distance beyond which clusters are not merged
- --dendrogram: Generate dendrogram visualization
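The interaction between a linkage criterion and a distance threshold can be sketched in a few lines of plain Python. This is a naive illustration under assumed names (`agglomerative`, `linkage`), not how repoindex performs hierarchical clustering, and it only covers single and complete linkage.

```python
import math
from itertools import combinations


def agglomerative(points, distance_threshold, linkage=max):
    """Merge the closest pair of clusters until none is closer than the threshold.

    linkage=max gives complete linkage, linkage=min gives single linkage.
    """
    clusters = [[i] for i in range(len(points))]  # start with singletons

    def cluster_distance(a, b):
        return linkage(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: cluster_distance(*pair))
        if cluster_distance(a, b) >= distance_threshold:
            break  # remaining clusters are too far apart to merge
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
    return [sorted(c) for c in clusters]
```

The function returns lists of point indices. A lower threshold stops merging earlier and therefore yields more, smaller clusters.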
DBSCAN
Best for: Arbitrarily shaped clusters, outlier detection
repoindex cluster analyze --algorithm dbscan --eps 0.5 --min-samples 3
Options:
- --eps: Maximum distance between two samples for them to count as neighbors
- --min-samples: Minimum number of samples in a neighborhood for a point to count as a core point
- --metric: Distance metric (euclidean, cosine, manhattan)
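The roles of `--eps` and `--min-samples` are easiest to see in code. Below is a small pure-Python sketch of DBSCAN with Euclidean distance only; the function name `dbscan` is hypothetical and the code is not taken from repoindex. Points that belong to no dense region get the label -1, which is what makes the algorithm useful for outlier detection.

```python
import math


def dbscan(points, eps, min_samples):
    """Return one label per point; -1 marks noise (outliers)."""
    NOISE, UNVISITED = -1, None
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # A point counts as its own neighbor, as in the usual DBSCAN definition.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core point, keep expanding
        cluster += 1
    return labels
```

Unlike k-means, the number of clusters is not an input: it falls out of how many dense regions the chosen `eps` and `min_samples` produce.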
Spectral Clustering
Best for: Non-convex clusters, where connectivity between repositories matters more than compactness
repoindex cluster analyze --algorithm spectral --n-clusters 4
Options:
- --affinity: Affinity matrix construction (rbf, nearest_neighbors)
- --n-neighbors: Number of neighbors for affinity matrix
- --gamma: Kernel coefficient for RBF
Feature Extraction
Available Features
Control which repository features are used for clustering:
# Technology stack based clustering
repoindex cluster analyze --features tech-stack
# Multi-feature clustering
repoindex cluster analyze --features tech-stack,size,activity,complexity
# All available features
repoindex cluster analyze --features all
Feature categories:
- tech-stack: Programming languages, frameworks, dependencies
- size: Lines of code, number of files, repository size
- activity: Commit frequency, last update, contributor count
- complexity: Cyclomatic complexity, dependency depth, file structure
- documentation: README quality, documentation coverage, examples
- quality: Test coverage, linting scores, security issues
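Before any of these categories can be clustered, they have to become numeric vectors. The sketch below shows one common approach, assuming hypothetical repository records with `name`, `languages`, and `loc` fields: languages become a multi-hot vector and lines of code are min-max scaled so that large values do not dominate distances. How repoindex actually encodes its features is not described here.

```python
def build_feature_vectors(repos):
    """Turn repo dicts into numeric vectors: multi-hot languages + scaled size."""
    languages = sorted({lang for r in repos for lang in r["languages"]})
    sizes = [r["loc"] for r in repos]
    lo, hi = min(sizes), max(sizes)
    span = (hi - lo) or 1  # avoid division by zero when all sizes match
    vectors = {}
    for r in repos:
        multi_hot = [1.0 if lang in r["languages"] else 0.0 for lang in languages]
        scaled_size = (r["loc"] - lo) / span  # min-max scale into [0, 1]
        vectors[r["name"]] = multi_hot + [scaled_size]
    return languages, vectors
```

The returned `languages` list records which column corresponds to which language, which helps when interpreting cluster centroids later.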
Custom Feature Weights
Adjust the importance of different features:
repoindex cluster analyze \
  --features tech-stack,size,activity \
  --weights 0.5,0.3,0.2
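One plausible way to apply such weights, shown here as an assumption rather than as repoindex's actual behavior, is to scale every feature group by the square root of its weight before concatenating. The squared Euclidean distance between two repositories then equals the weighted sum of the per-group squared distances. The function name `apply_weights` and the dict layout are hypothetical.

```python
import math


def apply_weights(groups, weights):
    """Concatenate feature groups, scaling each group by sqrt(weight)."""
    if set(groups) != set(weights):
        raise ValueError("every feature group needs exactly one weight")
    vector = []
    for name in sorted(groups):  # fixed order so vectors stay comparable
        scale = math.sqrt(weights[name])
        vector.extend(scale * x for x in groups[name])
    return vector
```

With weights of 0.5, 0.3, and 0.2, a difference in the technology stack contributes more to the distance than an equally large difference in activity.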
Duplicate Detection
Find Duplicate Repositories
# Find duplicates with default threshold (0.8)
repoindex cluster duplicates
# Adjust similarity threshold
repoindex cluster duplicates --threshold 0.9
# Include archived repositories
repoindex cluster duplicates --include-archived
# Output detailed similarity scores
repoindex cluster duplicates --detailed
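The threshold is a similarity cutoff between 0 and 1. The similarity measure repoindex uses is not stated here, so the following sketch assumes cosine similarity over feature vectors; `cosine_similarity` and `find_duplicates` are illustrative names, not part of the repoindex API.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_duplicates(vectors, threshold=0.8):
    """Return (name_a, name_b, similarity) for pairs at or above the threshold."""
    pairs = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(sorted(vectors.items()), 2):
        similarity = cosine_similarity(vec_a, vec_b)
        if similarity >= threshold:
            pairs.append((name_a, name_b, similarity))
    # Most similar pairs first, which is the order a reviewer usually wants.
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)
```

Raising the threshold toward 1.0 reports only near-identical repositories; lowering it surfaces looser matches that need manual review.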
Consolidation Suggestions
# Get consolidation recommendations
repoindex cluster consolidate
# Interactive consolidation wizard
repoindex cluster consolidate --interactive
# Generate consolidation script
repoindex cluster consolidate --generate-script
Visualization
Interactive Web Visualization
# Generate interactive HTML visualization
repoindex cluster visualize --output clusters.html
# Include specific metadata in tooltips
repoindex cluster visualize \
  --output clusters.html \
  --metadata name,language,stars,last_commit
# 3D visualization
repoindex cluster visualize --3d --output clusters3d.html
Static Visualizations
# Generate static plots
repoindex cluster plot --output clusters.png
# Dendrogram for hierarchical clustering
repoindex cluster plot --type dendrogram --output dendrogram.png
# Scatter plot matrix
repoindex cluster plot --type scatter-matrix --output matrix.png
Cluster Analysis
Cluster Statistics
# Get detailed cluster statistics
repoindex cluster stats
# Focus on specific cluster
repoindex cluster stats --cluster-id 2
# Compare clusters
repoindex cluster compare --cluster-a 1 --cluster-b 2
Cluster Profiles
# Generate cluster profiles
repoindex cluster profile
# Export profiles to markdown
repoindex cluster profile --format markdown --output profiles.md
Example output:
## Cluster 1: Python Data Science (12 repositories)
**Characteristics:**
- Primary Language: Python (100%)
- Common Dependencies: numpy, pandas, scikit-learn
- Average Size: 2,500 LOC
- Activity Level: High (daily commits)
**Repositories:**
- ml-experiments
- data-pipeline
- analytics-dashboard
...
Integration with Other Commands
Pipeline Integration
# Cluster only active Python projects
repoindex query "language == 'Python' and days_since_commit < 30" | \
  repoindex cluster analyze --stdin
# Export clustered repositories
repoindex cluster analyze | \
  repoindex export hugo --group-by cluster
# Audit each cluster
repoindex cluster analyze | \
  jq -r '.cluster_id' | sort -u | \
  xargs -I {} repoindex audit --cluster {}
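The jq filter above implies that `repoindex cluster analyze` emits one JSON object per repository with a `cluster_id` field. Under that assumption, plus a hypothetical `name` field, the same stream can be grouped in Python when a shell pipeline becomes unwieldy:

```python
import json
from collections import defaultdict


def group_by_cluster(lines):
    """Group JSON-lines records by their cluster_id field."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the stream
        record = json.loads(line)
        # "name" is an assumed field; missing names are stored as None.
        groups[record["cluster_id"]].append(record.get("name"))
    return dict(groups)
```

Calling `group_by_cluster(sys.stdin)` in a small script lets it sit at the end of a pipe in place of the jq and xargs steps.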
Workflow Integration
name: Weekly Clustering Analysis
steps:
  - name: cluster-analysis
    action: repoindex.cluster.analyze
    parameters:
      algorithm: kmeans
      n_clusters: 5
  - name: find-outliers
    action: repoindex.cluster.outliers
    parameters:
      threshold: 2.0
  - name: report
    action: repoindex.export
    parameters:
      format: markdown
      template: cluster-report
Configuration
Configure clustering defaults in ~/.repoindex/config.json:
{
  "integrations": {
    "clustering": {
      "default_algorithm": "kmeans",
      "default_n_clusters": "auto",
      "default_features": ["tech-stack", "size"],
      "cache_features": true,
      "visualization": {
        "colormap": "viridis",
        "figure_size": [10, 8],
        "include_labels": true
      }
    }
  }
}
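If you want to read these settings from your own scripts, a minimal sketch could look like the following. The fallback values simply mirror the example above; whether repoindex ships the same built-in defaults is an assumption, and both function names are hypothetical.

```python
import json
from pathlib import Path

# Fallbacks copied from the example config; assumed, not confirmed defaults.
DEFAULTS = {
    "default_algorithm": "kmeans",
    "default_n_clusters": "auto",
    "default_features": ["tech-stack", "size"],
    "cache_features": True,
}


def clustering_settings(config):
    """Extract integrations.clustering from a parsed config, filling in defaults."""
    section = config.get("integrations", {}).get("clustering", {})
    return {**DEFAULTS, **section}


def load_clustering_settings(path="~/.repoindex/config.json"):
    """Read the config file if it exists; otherwise return the defaults."""
    path = Path(path).expanduser()
    config = json.loads(path.read_text()) if path.exists() else {}
    return clustering_settings(config)
```

Keys present in the file override the fallbacks, so a config that only sets `default_algorithm` still yields a complete settings dict.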
Advanced Usage
Feature Engineering
Create custom features for clustering:
# Custom feature extractor (save as custom_extractor.py)
from repoindex.integrations.clustering import FeatureExtractor

class CustomExtractor(FeatureExtractor):
    def extract(self, repo):
        return {
            'custom_metric': self.calculate_metric(repo),
            'business_value': self.estimate_value(repo)
        }

# Use in clustering (shell command)
repoindex cluster analyze --extractor custom_extractor.py
Clustering Pipelines
# Multi-stage clustering pipeline
repoindex cluster pipeline \
  --stage hierarchical:n_clusters=10 \
  --stage kmeans:n_clusters=5 \
  --stage dbscan:eps=0.3
Temporal Clustering
Analyze how clusters change over time:
# Cluster evolution analysis
repoindex cluster evolution \
  --start-date 2024-01-01 \
  --interval monthly \
  --output evolution.gif
Use Cases
Portfolio Organization
# Organize repositories by technology
repoindex cluster analyze --features tech-stack --n-clusters 7
repoindex catalog tag-from-clusters --prefix "tech"
# Create directory structure based on clusters
repoindex cluster organize --create-dirs --base-path ~/organized-repos
Technical Debt Analysis
# Find repositories needing updates
repoindex cluster analyze --features quality,activity
repoindex cluster stats | jq '.clusters[] | select(.avg_quality < 0.5)'
Team Assignment
# Cluster by expertise requirements
repoindex cluster analyze --features tech-stack,complexity
repoindex cluster assign-teams --team-config teams.json
Performance Optimization
Large-Scale Clustering
For portfolios with 1000+ repositories:
# Use sampling for initial analysis
repoindex cluster analyze --sample-size 100 --algorithm kmeans
# Mini-batch K-means for large datasets
repoindex cluster analyze --algorithm mini-batch-kmeans --batch-size 100
# Incremental clustering
repoindex cluster analyze --incremental --checkpoint cluster.pkl
Feature Caching
# Cache extracted features
repoindex cluster cache-features
# Use cached features
repoindex cluster analyze --use-cache
# Clear feature cache
repoindex cluster clear-cache
Troubleshooting
Common Issues
No clusters found
# Check feature variance
repoindex cluster diagnose --check-variance
# Try different algorithm
repoindex cluster analyze --algorithm dbscan --eps 0.1
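A feature that is nearly constant across all repositories cannot help separate them, which is one common reason clustering finds no structure. What `diagnose --check-variance` does internally is not shown here; the sketch below illustrates the general idea with a hypothetical helper that flags low-variance feature columns.

```python
from statistics import pvariance


def low_variance_features(rows, names, min_variance=1e-6):
    """Return the names of feature columns whose variance is below min_variance."""
    columns = zip(*rows)  # transpose: one tuple per feature column
    return [name for name, col in zip(names, columns) if pvariance(col) < min_variance]
```

Each row is one repository's feature vector. Columns reported by this check are candidates to drop or to replace with a more informative feature.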
Too many singleton clusters
# Adjust parameters
repoindex cluster analyze --algorithm kmeans --n-clusters 3
# Use different features
repoindex cluster analyze --features size,activity
Memory issues with large datasets
# Use mini-batch k-means, which processes the data in small batches
repoindex cluster analyze --algorithm mini-batch-kmeans
# Reduce feature dimensions
repoindex cluster analyze --max-features 50
API Reference
Python API
import repoindex
from repoindex.integrations.clustering import ClusteringIntegration

# Initialize (config must already hold your clustering settings)
clustering = ClusteringIntegration(config)

# Analyze repositories
repos = list(repoindex.list_repositories())
results = clustering.analyze(
    repos,
    algorithm='kmeans',
    n_clusters=5,
    features=['tech-stack', 'size']
)

# Get cluster assignments
for repo, cluster_id in results.assignments.items():
    print(f"{repo}: Cluster {cluster_id}")

# Visualize
clustering.visualize(results, output='clusters.html')
CLI Reference
# Main commands
repoindex cluster analyze # Perform clustering analysis
repoindex cluster duplicates # Find duplicate repositories
repoindex cluster visualize # Create visualizations
repoindex cluster stats # Show cluster statistics
repoindex cluster export # Export clustering results
# Common options
--algorithm # Clustering algorithm
--n-clusters # Number of clusters
--features # Features to use
--threshold # Similarity threshold
--output # Output file path
--format # Output format
--stdin # Read from stdin
--pretty # Human-readable output
Best Practices
- Start Simple: Begin with k-means and basic features
- Iterate: Refine features and parameters based on results
- Validate: Manually review cluster assignments for accuracy
- Document: Save clustering parameters and rationale
- Monitor: Track how clusters evolve over time
- Combine: Use clustering with other repoindex features
Next Steps
- Explore Workflow Integration for automated clustering
- Learn about Network Analysis for relationship mapping
- Check Tutorial Notebooks for hands-on examples
- See API Documentation for detailed reference