This repo implements a real-time RAG pipeline for research papers. It extracts RAG research papers from arXiv and applies semantic chunking to them. An embedding model is then fine-tuned so that the chunks can be clustered. The synthetic dataset used to fine-tune the embedding model is generated with distilabel.
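
As a rough illustration of the semantic chunking step, here is a minimal sketch. It is not the repo's actual code. It assumes that chunking means splitting the text wherever two adjacent sentences stop being similar. The names `toy_embed`, `cosine`, `semantic_chunks` and the `threshold` value are hypothetical, and `toy_embed` is a bag-of-words stand-in for a real embedding model.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List

Vector = Dict[str, float]


def toy_embed(sentence: str) -> Vector:
    """Hypothetical stand-in for a real embedding model: word counts."""
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    return dict(Counter(words))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_chunks(
    text: str,
    embed: Callable[[str], Vector] = toy_embed,
    threshold: float = 0.2,  # assumed cut-off, tune for a real model
) -> List[str]:
    """Group consecutive sentences, starting a new chunk when the
    similarity between neighbouring sentences drops below `threshold`."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []

    chunks: List[List[str]] = [[sentences[0]]]
    previous = embed(sentences[0])
    for sentence in sentences[1:]:
        current = embed(sentence)
        if cosine(previous, current) < threshold:
            chunks.append([sentence])  # topic shift: open a new chunk
        else:
            chunks[-1].append(sentence)  # same topic: extend the chunk
        previous = current
    return [" ".join(chunk) for chunk in chunks]


if __name__ == "__main__":
    sample = (
        "Retrieval augmented generation uses retrieval. "
        "Retrieval augmented generation improves answers. "
        "Cats sleep all day."
    )
    for chunk in semantic_chunks(sample):
        print(chunk)
```

In a real pipeline the `embed` argument would be replaced by a call to the fine-tuned embedding model, and the sentence splitter by something more robust for scientific text.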