Skip to content

Repository Clustering Integration

The clustering integration provides advanced machine learning algorithms to automatically group similar repositories, detect duplicates, and analyze your project portfolio structure.

Overview

Repository clustering helps you:

  • Organize Large Portfolios: Automatically group hundreds of repositories
  • Detect Duplicates: Find similar or duplicate projects for consolidation
  • Understand Patterns: Discover hidden relationships between projects
  • Technology Analysis: Group projects by technology stack
  • Maintenance Planning: Identify clusters needing attention

Installation

The clustering integration requires additional dependencies:

# Install with clustering support
pip install repoindex[clustering]

# Or install dependencies separately
pip install scikit-learn numpy pandas matplotlib seaborn

Quick Start

# Basic clustering with defaults
repoindex cluster analyze

# Specify algorithm and clusters
repoindex cluster analyze --algorithm kmeans --n-clusters 5

# Detect duplicates
repoindex cluster duplicates --threshold 0.8

# Visualize clusters
repoindex cluster visualize --output clusters.html

# Export clustering results
repoindex cluster export --format json --output clusters.json

Clustering Algorithms

K-Means Clustering

Best for: Well-separated spherical clusters of similar size

repoindex cluster analyze --algorithm kmeans --n-clusters 5

Options: - --n-clusters: Number of clusters (default: auto-detect) - --max-iter: Maximum iterations (default: 300) - --n-init: Number of initializations (default: 10)

Hierarchical Clustering

Best for: Understanding cluster hierarchy and relationships

repoindex cluster analyze --algorithm hierarchical --distance-threshold 0.5

Options: - --linkage: Linkage criterion (ward, complete, average, single) - --distance-threshold: Distance threshold for clustering - --dendrogram: Generate dendrogram visualization

DBSCAN

Best for: Arbitrary shaped clusters, outlier detection

repoindex cluster analyze --algorithm dbscan --eps 0.5 --min-samples 3

Options: - --eps: Maximum distance between samples - --min-samples: Minimum samples in a neighborhood - --metric: Distance metric (euclidean, cosine, manhattan)

Spectral Clustering

Best for: Non-convex clusters, image segmentation patterns

repoindex cluster analyze --algorithm spectral --n-clusters 4

Options: - --affinity: Affinity matrix construction (rbf, nearest_neighbors) - --n-neighbors: Number of neighbors for affinity matrix - --gamma: Kernel coefficient for RBF

Feature Extraction

Available Features

Control which repository features are used for clustering:

# Technology stack based clustering
repoindex cluster analyze --features tech-stack

# Multi-feature clustering
repoindex cluster analyze --features tech-stack,size,activity,complexity

# All available features
repoindex cluster analyze --features all

Feature categories:

  • tech-stack: Programming languages, frameworks, dependencies
  • size: Lines of code, number of files, repository size
  • activity: Commit frequency, last update, contributor count
  • complexity: Cyclomatic complexity, dependency depth, file structure
  • documentation: README quality, documentation coverage, examples
  • quality: Test coverage, linting scores, security issues

Custom Feature Weights

Adjust the importance of different features:

repoindex cluster analyze \
  --features tech-stack,size,activity \
  --weights 0.5,0.3,0.2

Duplicate Detection

Find Duplicate Repositories

# Find duplicates with default threshold (0.8)
repoindex cluster duplicates

# Adjust similarity threshold
repoindex cluster duplicates --threshold 0.9

# Include archived repositories
repoindex cluster duplicates --include-archived

# Output detailed similarity scores
repoindex cluster duplicates --detailed

Consolidation Suggestions

# Get consolidation recommendations
repoindex cluster consolidate

# Interactive consolidation wizard
repoindex cluster consolidate --interactive

# Generate consolidation script
repoindex cluster consolidate --generate-script

Visualization

Interactive Web Visualization

# Generate interactive HTML visualization
repoindex cluster visualize --output clusters.html

# Include specific metadata in tooltips
repoindex cluster visualize \
  --output clusters.html \
  --metadata name,language,stars,last_commit

# 3D visualization
repoindex cluster visualize --3d --output clusters3d.html

Static Visualizations

# Generate static plots
repoindex cluster plot --output clusters.png

# Dendrogram for hierarchical clustering
repoindex cluster plot --type dendrogram --output dendrogram.png

# Scatter plot matrix
repoindex cluster plot --type scatter-matrix --output matrix.png

Cluster Analysis

Cluster Statistics

# Get detailed cluster statistics
repoindex cluster stats

# Focus on specific cluster
repoindex cluster stats --cluster-id 2

# Compare clusters
repoindex cluster compare --cluster-a 1 --cluster-b 2

Cluster Profiles

# Generate cluster profiles
repoindex cluster profile

# Export profiles to markdown
repoindex cluster profile --format markdown --output profiles.md

Example output:

## Cluster 1: Python Data Science (12 repositories)

**Characteristics:**
- Primary Language: Python (100%)
- Common Dependencies: numpy, pandas, scikit-learn
- Average Size: 2,500 LOC
- Activity Level: High (daily commits)

**Repositories:**
- ml-experiments
- data-pipeline
- analytics-dashboard
...

Integration with Other Commands

Pipeline Integration

# Cluster only active Python projects
repoindex query "language == 'Python' and days_since_commit < 30" | \
  repoindex cluster analyze --stdin

# Export clustered repositories
repoindex cluster analyze | \
  repoindex export hugo --group-by cluster

# Audit each cluster
repoindex cluster analyze | \
  jq -r '.cluster_id' | sort -u | \
  xargs -I {} repoindex audit --cluster {}

Workflow Integration

name: Weekly Clustering Analysis
steps:
  - name: cluster-analysis
    action: repoindex.cluster.analyze
    parameters:
      algorithm: kmeans
      n_clusters: 5

  - name: find-outliers
    action: repoindex.cluster.outliers
    parameters:
      threshold: 2.0

  - name: report
    action: repoindex.export
    parameters:
      format: markdown
      template: cluster-report

Configuration

Configure clustering defaults in ~/.repoindex/config.json:

{
  "integrations": {
    "clustering": {
      "default_algorithm": "kmeans",
      "default_n_clusters": "auto",
      "default_features": ["tech-stack", "size"],
      "cache_features": true,
      "visualization": {
        "colormap": "viridis",
        "figure_size": [10, 8],
        "include_labels": true
      }
    }
  }
}

Advanced Usage

Feature Engineering

Create custom features for clustering:

# Custom feature extractor
from repoindex.integrations.clustering import FeatureExtractor

class CustomExtractor(FeatureExtractor):
    def extract(self, repo):
        return {
            'custom_metric': self.calculate_metric(repo),
            'business_value': self.estimate_value(repo)
        }

# Use in clustering
repoindex cluster analyze --extractor custom_extractor.py

Clustering Pipelines

# Multi-stage clustering pipeline
repoindex cluster pipeline \
  --stage hierarchical:n_clusters=10 \
  --stage kmeans:n_clusters=5 \
  --stage dbscan:eps=0.3

Temporal Clustering

Analyze how clusters change over time:

# Cluster evolution analysis
repoindex cluster evolution \
  --start-date 2024-01-01 \
  --interval monthly \
  --output evolution.gif

Use Cases

Portfolio Organization

# Organize repositories by technology
repoindex cluster analyze --features tech-stack --n-clusters 7
repoindex catalog tag-from-clusters --prefix "tech"

# Create directory structure based on clusters
repoindex cluster organize --create-dirs --base-path ~/organized-repos

Technical Debt Analysis

# Find repositories needing updates
repoindex cluster analyze --features quality,activity
repoindex cluster stats | jq '.clusters[] | select(.avg_quality < 0.5)'

Team Assignment

# Cluster by expertise requirements
repoindex cluster analyze --features tech-stack,complexity
repoindex cluster assign-teams --team-config teams.json

Performance Optimization

Large-Scale Clustering

For portfolios with 1000+ repositories:

# Use sampling for initial analysis
repoindex cluster analyze --sample-size 100 --algorithm kmeans

# Mini-batch K-means for large datasets
repoindex cluster analyze --algorithm mini-batch-kmeans --batch-size 100

# Incremental clustering
repoindex cluster analyze --incremental --checkpoint cluster.pkl

Feature Caching

# Cache extracted features
repoindex cluster cache-features

# Use cached features
repoindex cluster analyze --use-cache

# Clear feature cache
repoindex cluster clear-cache

Troubleshooting

Common Issues

No clusters found

# Check feature variance
repoindex cluster diagnose --check-variance

# Try different algorithm
repoindex cluster analyze --algorithm dbscan --eps 0.1

Too many singleton clusters

# Adjust parameters
repoindex cluster analyze --algorithm kmeans --n-clusters 3

# Use different features
repoindex cluster analyze --features size,activity

Memory issues with large datasets

# Use incremental learning
repoindex cluster analyze --algorithm mini-batch-kmeans

# Reduce feature dimensions
repoindex cluster analyze --max-features 50

API Reference

Python API

from repoindex.integrations.clustering import ClusteringIntegration

# Initialize
clustering = ClusteringIntegration(config)

# Analyze repositories
repos = list(repoindex.list_repositories())
results = clustering.analyze(
    repos,
    algorithm='kmeans',
    n_clusters=5,
    features=['tech-stack', 'size']
)

# Get cluster assignments
for repo, cluster_id in results.assignments.items():
    print(f"{repo}: Cluster {cluster_id}")

# Visualize
clustering.visualize(results, output='clusters.html')

CLI Reference

# Main commands
repoindex cluster analyze      # Perform clustering analysis
repoindex cluster duplicates    # Find duplicate repositories
repoindex cluster visualize     # Create visualizations
repoindex cluster stats        # Show cluster statistics
repoindex cluster export       # Export clustering results

# Common options
--algorithm         # Clustering algorithm
--n-clusters       # Number of clusters
--features         # Features to use
--threshold        # Similarity threshold
--output          # Output file path
--format          # Output format
--stdin           # Read from stdin
--pretty          # Human-readable output

Best Practices

  1. Start Simple: Begin with k-means and basic features
  2. Iterate: Refine features and parameters based on results
  3. Validate: Manually review cluster assignments for accuracy
  4. Document: Save clustering parameters and rationale
  5. Monitor: Track how clusters evolve over time
  6. Combine: Use clustering with other repoindex features

Next Steps