Sadly I can't help much here. In my experience you have to create test cases and experiment with different models (a LOT). Some general thoughts, in no particular order:
- InstructorXL is my favorite embedding model because it takes a two-part input: an instruction, sort of like a system prompt, plus the text itself, which you can use to qualify the user input without modifying it yourself. You can experiment with different instructions for the stored embeddings and for the user prompt, and you can also run a bunch of different instructions and weight their scores, average them, add them, etc. (first sketch below).
- You can start with qualitative test cases like the obvious Leviticus prohibitions and see what the range of scores looks like before you create automated test cases and evaluations. Find one of those parallel Bibles where one side is the original King James translation and the other side is in modern English to use for more complex pairings (second sketch below).
- If that doesn't lead to an obvious winner, you may need to create a dataset for fine-tuning. Make sure the dataset includes lots of negative examples too - cosine similarity runs from -1 to 1, and the scores are most useful when the model actually uses that whole range. Maybe take some important verses and flip them to the opposite meaning ("thou shall not" => "thou shall") to create those negatives. Split your fine-tuning dataset into categories so you can experiment with different combinations (e.g. the aforementioned autogenerated opposite pairs might really hurt the fine-tuning because they're lexically too similar) - see the third sketch below.
- You can probably fine-tune with a completely synthetic dataset, using GPT-3.5/4 to do all the work. It's "aware" of the concept of vector embeddings and of the training-data format, so it can create positive and negative pairs for you based on your instructions. You can probably find some ranking of the most important passages (say, most quoted) and feed those into a prompt to generate tons of pairs quickly (last sketch below).
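Here's a minimal sketch of that two-part input, assuming the InstructorEmbedding package (`pip install InstructorEmbedding sentence-transformers`). The instruction strings are made up - experiment with your own:

```python
from InstructorEmbedding import INSTRUCTOR
from sklearn.metrics.pairwise import cosine_similarity

model = INSTRUCTOR("hkunlp/instructor-xl")

# Each input is [instruction, text]; the instruction qualifies the text
# without changing it. These particular instructions are just examples.
doc_instruction = "Represent the Bible verse for retrieval:"
query_instructions = [
    "Represent the question for retrieving relevant Bible verses:",
    "Represent the religious question for retrieving supporting passages:",
]

verses = ["Thou shalt not steal.", "Love thy neighbour as thyself."]
query = "What does the Bible say about theft?"

doc_embs = model.encode([[doc_instruction, v] for v in verses])

# Embed the same query under several instructions and average the scores.
scores = sum(
    cosine_similarity(model.encode([[qi, query]]), doc_embs)
    for qi in query_instructions
) / len(query_instructions)
print(scores)  # one averaged score per verse
```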
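For the qualitative checks, something like this - the verse pairs are placeholders for whatever parallel KJV/modern text you pull in:

```python
from InstructorEmbedding import INSTRUCTOR
from sklearn.metrics.pairwise import cosine_similarity

model = INSTRUCTOR("hkunlp/instructor-xl")
instruction = "Represent the Bible verse for retrieval:"  # example instruction

# KJV wording vs. a modern-English rendering of the same verse.
pairs = [
    ("Thou shalt not kill.", "You must not murder."),
    ("Thou shalt not steal.", "Do not steal."),
]

# Eyeball the score range before investing in automated evaluations.
for kjv, modern in pairs:
    a, b = model.encode([[instruction, kjv], [instruction, modern]])
    print(f"{cosine_similarity([a], [b])[0][0]:.3f}  {kjv!r} vs {modern!r}")
```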
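For the fine-tuning dataset, a rough sketch of the shape I mean, using sentence-transformers with a generic base model (fine-tuning InstructorXL itself needs its own training setup). The texts, target scores, and category names are all invented:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

dataset = [
    # (text_a, text_b, target_score, category)
    ("Thou shalt not steal.", "Do not steal.", 0.9, "paraphrase"),
    ("Thou shalt not steal.", "Thou shalt steal.", -0.8, "autogen_opposite"),
]

# Category tags let you drop a slice and re-run, e.g. to test whether the
# autogenerated opposite pairs help or hurt.
active = {"paraphrase"}  # add "autogen_opposite" back in to compare
examples = [
    InputExample(texts=[a, b], label=score)
    for a, b, score, cat in dataset
    if cat in active
]

# CosineSimilarityLoss pushes each pair's cosine similarity toward its label,
# so the flipped negatives get pulled toward a negative score.
loader = DataLoader(examples, shuffle=True, batch_size=16)
loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
```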
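And a sketch of the synthetic generation loop with the OpenAI API - the prompt wording, passage list, and JSON-lines output format are just one convenient choice:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

important_passages = ["John 3:16", "Psalm 23", "Exodus 20:1-17"]  # e.g. most quoted

prompt = (
    "You are generating training pairs for fine-tuning a text embedding model. "
    "For the passage {ref}, write 3 positive pairs (a question or paraphrase "
    "that should embed close to the passage) and 3 negative pairs (related "
    "wording but opposite or unrelated meaning). Output JSON lines with keys "
    '"text_a", "text_b", "label", where label is 1.0 for positives and -1.0 '
    "for negatives."
)

for ref in important_passages:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt.format(ref=ref)}],
    )
    print(resp.choices[0].message.content)  # collect these into your dataset
```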