I think you’ll want to use web-search RAG such as the tools built into GPT-4 or Gemini (or my employer’s product, which has a less complete index but returns a longer text snippet for each result) to search the entire web for relevant data, including dynamically during your data analysis.
If you have O(1,000) long documents, and only want to ask one question or a short, predictable-in-advance set of questions of them, then what you propose might work well. But if you’re going to be querying them repeatedly, and/or you have O(100,000) documents, then building both a conventional keyword index (e.g. Lucene) and a semantic index (e.g. Pinecone) and querying both of those (since they each have strengths and weaknesses) is going to be more cost-effective, and hopefully nearly as good.
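To make the hybrid approach concrete, here’s a minimal, self-contained sketch of querying a keyword index and a semantic index and fusing the two rankings. Everything here is a stand-in: the TF-IDF-style scorer approximates what Lucene/BM25 would return, the character-trigram cosine approximates embedding similarity from a vector store like Pinecone, and reciprocal-rank fusion is one common (but not the only) way to merge the result lists.

```python
# Toy hybrid retrieval: keyword scoring + "semantic" scoring, merged
# with reciprocal-rank fusion. All components are simplified stand-ins
# for real systems (Lucene for keywords, a vector DB for embeddings).
import math
from collections import Counter

docs = {
    "d1": "lucene builds an inverted keyword index for fast exact search",
    "d2": "vector databases store embeddings for semantic similarity search",
    "d3": "hybrid retrieval combines keyword and semantic search results",
}

def keyword_scores(query, docs):
    """Crude TF-IDF scoring, standing in for a Lucene/BM25 query."""
    q_terms = query.lower().split()
    n = len(docs)
    # Document frequency of each term across the corpus.
    df = Counter(t for text in docs.values() for t in set(text.split()))
    scores = {}
    for doc_id, text in docs.items():
        tf = Counter(text.split())
        scores[doc_id] = sum(
            tf[t] * math.log((n + 1) / (1 + df[t])) for t in q_terms
        )
    return scores

def semantic_scores(query, docs):
    """Character-trigram cosine similarity as a cheap stand-in for
    embedding similarity from a vector index."""
    def vec(s):
        return Counter(s[i:i + 3] for i in range(len(s) - 2))
    def cosine(a, b):
        dot = sum(a[k] * b[k] for k in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0
    qv = vec(query.lower())
    return {doc_id: cosine(qv, vec(text)) for doc_id, text in docs.items()}

def hybrid_search(query, docs, k=60):
    """Reciprocal-rank fusion: each index contributes 1/(k + rank)
    per document, so a document ranked highly by both wins."""
    fused = Counter()
    for scores in (keyword_scores(query, docs), semantic_scores(query, docs)):
        ranked = sorted(scores, key=scores.get, reverse=True)
        for rank, doc_id in enumerate(ranked):
            fused[doc_id] += 1.0 / (k + rank + 1)
    return [doc_id for doc_id, _ in fused.most_common()]

print(hybrid_search("keyword semantic search", docs))
```

In a real deployment you would issue the two queries to the actual indexes and fuse only the top-k results from each; the fusion step is what lets the keyword side catch exact terms the embeddings miss, and vice versa.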
A third strategy would be to fine-tune an open-source LLM on them (which would be more expensive and carries a much higher hallucination risk, but might also extract more complex/interesting structures from them, if you probed the fine-tuned model with the right prompts).