I've spent quite a lot of time in the medical/scientific literature space. With regard to LLMs, specifically RAG, how the data is chunked is quite important. With that in mind, I have a couple of projects that might be beneficial additions.
paperetl (https://github.com/neuml/paperetl) - supports parsing arXiv and PubMed content and integrates with GROBID to handle parsing metadata and text from arbitrary papers.
paperai (https://github.com/neuml/paperai) - builds embeddings databases of medical/scientific papers. Supports LLM prompting, semantic workflows and vector search. Built with txtai (https://github.com/neuml/txtai).
While arbitrary chunking/splitting can work, I've found that parsing that is aware of medical/scientific paper structure increases the overall accuracy and quality of downstream applications.
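To make that concrete, here's roughly what section-aware chunking looks like against GROBID's TEI XML output. A minimal sketch: real TEI nesting varies by paper and error handling is omitted.

```python
# Minimal sketch: one chunk per paper section from GROBID's TEI XML output
# (the /api/processFulltextDocument response).
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

def section_chunks(tei_xml: str):
    """Yield (section_title, section_text) pairs, one per body <div>."""
    root = ET.fromstring(tei_xml)
    body = root.find(f".//{TEI}text/{TEI}body")
    for div in body.findall(f"{TEI}div"):
        head = div.find(f"{TEI}head")
        title = "".join(head.itertext()).strip() if head is not None else "untitled"
        text = "\n".join("".join(p.itertext()).strip() for p in div.findall(f"{TEI}p"))
        yield title, text

# Each (title, text) pair becomes one retrieval unit, so "Methods" and
# "Results" stay intact instead of being split mid-section.
```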
it would accelerate research so much if LLM accuracy increased on biomedical papers.
very much agreed on the potential to extract signal from paper structures.
two questions if you don't mind:
1. did you post a summary of your chunking analysis somewhere? i'm curious which method maximized accuracy, and which sentence-overlap methods were most effective.
2. do you think general tokenization methods limit LLMs on scientific/biomedical papers?
> 1. did you post a summary of your chunking analysis somewhere? i'm curious which method maximized accuracy, and which sentence-overlap methods were most effective.
Good idea, but no, nothing posted. In general, grouping by the sections of a paper has worked best (e.g. methods, results, conclusions). GROBID is helpful with arbitrary papers.
> 2. do you think general tokenization methods limit LLMs on scientific/biomedical papers?
The problem is that this is still just mechanical retrieval. In RAG, you split a PDF into small chunks, but that is very different from how humans digest PDFs. If I hire an RA to go through my Zotero library and make a mind-map of sorts, he/she would combine papers, paragraphs, figures, etc. to come up with a "concepts" map, which is far richer than a retrieval system that merely finds the semantic similarity between my query and pieces of text.
RAG is good for semantic search, but really we need something that works at a knowledge/understanding level as opposed to data/information level.
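To illustrate the gap, here's a toy sketch of what an automated version of that concept map might look like; llm() is a hypothetical completion helper and the prompt is a placeholder, not any existing tool's API:

```python
# Toy sketch of an RA-style "concepts" map built on top of chunked papers:
# pull concepts out of each chunk with an LLM, then link papers that share
# concepts. llm() is a hypothetical completion helper, not a real API.
from collections import defaultdict

def llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for any completion API")

def extract_concepts(text: str) -> list[str]:
    reply = llm(f"List the key concepts in this passage, one per line:\n{text}")
    return [line.strip().lower() for line in reply.splitlines() if line.strip()]

def build_concept_map(chunks):
    """chunks: list of (paper_id, text) pairs."""
    index = defaultdict(set)
    for paper_id, text in chunks:
        for concept in extract_concepts(text):
            index[concept].add(paper_id)
    # Concepts shared across papers are the cross-document links that
    # query-to-chunk similarity search never materializes explicitly.
    return {c: sorted(papers) for c, papers in index.items() if len(papers) > 1}
```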
I think what you're looking for is possible with LLM agents. For paperai (mentioned previously) at least, it's possible to build workflows that connect multiple prompt steps together.
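A minimal sketch of what that chaining looks like with txtai's Workflow/Task API, which is what paperai builds on; the model choice and prompts here are placeholders:

```python
# Chain two prompt steps into one workflow.
from txtai.pipeline import LLM
from txtai.workflow import Task, Workflow

llm = LLM("google/flan-t5-base")  # small placeholder model

# Each Task action receives a batch (list) of inputs and returns a list.
summarize = Task(lambda texts: [llm(f"Summarize the methods:\n{t}") for t in texts])
findings = Task(lambda texts: [llm(f"List the key findings:\n{t}") for t in texts])

workflow = Workflow([summarize, findings])

paper_text = "We trained a transformer on 10k PubMed abstracts and evaluated recall."
print(list(workflow([paper_text])))
```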
In the example of the RA being automated by an LLM agent workflow, I agree it's very possible. It requires defining a set of specific agents, using prompts and allowing function calling for tools, and then defining a full workflow between the agents. The workflow can likely be modeled by breaking down the individual steps the RA takes when doing their work.
The agents are likely very narrow and specific; each does one well-defined task. The workflow is then a DAG chaining their work together, as in the sketch below.
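A toy sketch of that shape in plain Python; the agent bodies are stand-ins for real prompt/tool-calling steps:

```python
# Each "agent" is a narrow function reading from and writing to a shared
# context; the DAG runs them in dependency order.
from graphlib import TopologicalSorter

def find_papers(ctx):
    return {"papers": ["paper1.pdf", "paper2.pdf"]}

def extract_claims(ctx):
    return {"claims": [f"main claim of {p}" for p in ctx["papers"]]}

def summarize(ctx):
    return {"summary": "; ".join(ctx["claims"])}

agents = {"find_papers": find_papers, "extract_claims": extract_claims, "summarize": summarize}
dag = {"find_papers": set(), "extract_claims": {"find_papers"}, "summarize": {"extract_claims"}}

ctx = {}
for name in TopologicalSorter(dag).static_order():
    ctx.update(agents[name](ctx))  # run agents in dependency order
print(ctx["summary"])
```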
Why does this post link to a renamed fork of Paper-QA (https://github.com/whitead/paper-qa) which has made zero changes and is 19 commits behind the original?
Maybe a stupid question, but how are equations handled when parsing a paper? Are locally runnable LLMs capable of proposing model equations the way they propose programming code? I have seen that GPT-4 can, so I'm just wondering whether equations are "treated" like normal computer code. My Zotero papers are equation-heavy.
I looked into the available options for parsing PDFs a while ago, including pypdf, which is what is being used here, and it's not good. While I haven't tested equations specifically, I think it's fair to assume that the results will be subpar, especially for complex ones.
I guess this could be an application of the agent model. I've seen multiple LLMs recently trained specifically on LaTeX parsing. One model would recognize from the parsed PDF garbage that there is probably an equation there and call a different one to parse it.
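The detection half could be as simple as a heuristic. A rough sketch, where the symbol set and threshold are guesses and the image-to-LaTeX handoff is only indicated in a comment:

```python
# Flag extracted spans whose math-symbol density suggests a mangled equation.
import re

MATH_CHARS = re.compile(r"[=+\-*/^_\\{}<>|∂∑∫√α-ω]")

def looks_like_garbled_equation(span: str, threshold: float = 0.2) -> bool:
    stripped = span.strip()
    if not stripped:
        return False
    return len(MATH_CHARS.findall(stripped)) / len(stripped) > threshold

# Spans that trip the heuristic would be cropped from the rendered page
# image and sent to an image-to-LaTeX model instead of being kept as text.
print(looks_like_garbled_equation("The results in Table 2 show"))  # False
print(looks_like_garbled_equation("∂u ∂t = α ∂2u ∂x2 |x=0"))       # True
```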
Thank you for the idea of recognizing the garbage and then using a different flow for the image of the equation from the PDF. That still leaves an image-to-LaTeX problem, but maybe the state of the art has improved in the past few years.
Thanks for sharing various projects. Any tools for materials science that can create summary tables of things like material, application, performance would be really valuable.
This is built on LangChain, and I think it's also possible to build it on top of Haystack now. I'm torn between the two, and I'm wondering whether this project is a good example of why LangChain can be a better fit in certain situations; I'm just not sure what those situations are exactly.
Oh no I'm just realizing that arxiv will be increasingly spammed with what should have been a blog post. I hope I'm wrong in assuming that in a few years the level of credibility that comes with a paper being on arxiv will have entirely worn off.
I know that in theory arXiv, being a pre-print server, shouldn't confer any credibility, but in practice it does, and it is still a good quality/BS filter compared to e.g. Medium articles.
In my field, ArXiv has about the same level of credibility as Wikipedia or random journal articles from the International Journal of Sciency Science, i.e. trust, but verify. Among non-peer-reviewed documents, they rank below things like DoE or NASA reports and tend to not be cited.
There are preprints of articles that have since been published (which have the same credibility as the peer-reviewed article), articles from mates (which are obviously great), and the rest, which might be interesting but is not a solid source on its own.
It seems to be working as intended, to be fair. ArXiv has precious few ways of improving the accuracy of preprints.
There's a tool I use called Petal (https://www.petal.org/reference-manager). The free tier allows up to 1GB of PDFs, which I believe are processed by GROBID and chunked for LLM QA.
The feature I find most useful is the table automation which I use for literature review, since it lets me run the same QA prompts on a collection of documents all at once.
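The underlying pattern is easy to sketch generically; here ask() is a hypothetical document-QA helper standing in for whatever backend does the retrieval, not Petal's actual API:

```python
# Run one fixed set of QA prompts over every document and collect the
# answers as rows of a literature-review table.
import csv
import sys

QUESTIONS = [
    "What methods were used?",
    "What were the main findings?",
    "What limitations are noted?",
]

def literature_table(documents, ask):
    """One row per document: [doc, answer1, answer2, ...]."""
    return [[doc] + [ask(doc, q) for q in QUESTIONS] for doc in documents]

writer = csv.writer(sys.stdout)
writer.writerow(["document"] + QUESTIONS)
writer.writerows(literature_table(["a.pdf", "b.pdf"], lambda d, q: f"answer for {d}"))
```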
That would be fantastic. At the moment the barrier to entry for using these kinds of models is quite high. Something that could be used from a GUI would be great.