I am fine-tuning a small LLM to generate Elasticsearch DSL queries from natural language. The project is on GitHub.
The idea: take a task that large models handle well (translating “find all orders over $100 from last month” into the corresponding Elasticsearch JSON query) and see if a tiny model can learn it from synthetic data.
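To make the target concrete, here is a sketch of what the output side of that translation might look like. The field names (`order_total`, `order_date`) and the date-math choice are illustrative assumptions, not taken from the project.

```python
import json

# One plausible Elasticsearch DSL target for the request
# "find all orders over $100 from last month".
# Field names are hypothetical; "now-1M/M".."now/M" is Elasticsearch
# date math for the previous calendar month.
query = {
    "query": {
        "bool": {
            "filter": [
                {"range": {"order_total": {"gt": 100}}},
                {"range": {"order_date": {"gte": "now-1M/M", "lt": "now/M"}}},
            ]
        }
    }
}

print(json.dumps(query, indent=2))
```

The model's job is to emit this JSON given only the natural-language request, so well-formedness of the nested structure is part of what it has to learn.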
The data pipeline works like this. I started by generating examples from GPT-4. Then I wrote a script that samples from those outputs and uses them as few-shot examples for Mistral, which generates a much larger synthetic dataset. The next step is reshaping the data into the expected format and fine-tuning.
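The pipeline steps above can be sketched roughly as follows. Everything here is a hypothetical reconstruction: the seed pairs, the prompt format, the chat-style fine-tuning record shape, and all function names are my assumptions, not the project's actual code.

```python
import json
import random

# Seed pairs in the style of the GPT-4-generated examples (invented here).
seed_examples = [
    {"question": "find all orders over $100 from last month",
     "query": '{"query": {"range": {"order_total": {"gt": 100}}}}'},
    {"question": "count users who signed up this week",
     "query": '{"query": {"range": {"signup_date": {"gte": "now/w"}}}}'},
]

def build_prompt(examples, new_question):
    """Format sampled pairs as few-shot demonstrations for the generator model."""
    shots = "\n\n".join(
        f"Question: {ex['question']}\nQuery: {ex['query']}" for ex in examples
    )
    return f"{shots}\n\nQuestion: {new_question}\nQuery:"

def to_finetune_record(question, query):
    """Reshape one synthetic pair into a chat-style fine-tuning record."""
    return {"messages": [
        {"role": "user", "content": question},
        {"role": "assistant", "content": query},
    ]}

# Sample seeds as few-shot context, then reshape a pair for fine-tuning.
shots = random.sample(seed_examples, k=2)
prompt = build_prompt(shots, "show all products tagged 'sale'")
record = to_finetune_record(seed_examples[0]["question"],
                            seed_examples[0]["query"])
print(prompt)
print(json.dumps(record))
```

In the real pipeline the prompt would be sent to Mistral and its completions collected into the larger synthetic dataset; the record format would need to match whatever the fine-tuning framework expects.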
This is early-stage work: the synthetic data exists, but the fine-tuning has not happened yet. I will update this post with results.
If you are interested in collaborating on this or related projects, email me at lex@metafunctor.com.