Bitswired presents a tutorial on semantic splitting, a method to improve document processing for generative AI applications. The video begins by explaining why large language models (LLMs) struggle with up-to-date data and why retraining them frequently is impractical. Retrieval-augmented generation (RAG) is proposed instead: documents are split into chunks, and the chunks most relevant to a user query are retrieved to answer it.

The tutorial then introduces semantic splitting, which splits documents at points chosen by meaning rather than by fixed character counts or paragraph boundaries. Using an example that interleaves sections from the Wikipedia pages on Francis I of France and linear algebra, it demonstrates how to identify optimal split points by calculating the semantic divergence between consecutive sentences.

The implementation involves fetching the Wikipedia pages, interleaving their sections, splitting the text into sentences, and computing embeddings to measure semantic divergence. A Python notebook is provided for hands-on practice, and each step is explained in detail, including the use of libraries such as NumPy, Pandas, and OpenAI's embedding models. The video concludes by plotting and analyzing the results, showing that semantic splitting effectively identifies meaningful split points in the document.
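The core computation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the tutorial's actual notebook: it assumes sentence embeddings have already been computed (the video uses OpenAI's embedding models for this step), and the `threshold` value and toy vectors are purely illustrative.

```python
import numpy as np

def cosine_distance(a, b):
    # Semantic divergence between two embedding vectors:
    # 1 minus their cosine similarity.
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def semantic_split_points(embeddings, threshold=0.5):
    # Divergence between each pair of consecutive sentence embeddings.
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    # Propose a split before sentence i+1 wherever the divergence
    # exceeds the (illustrative) threshold.
    splits = [i + 1 for i, d in enumerate(distances) if d > threshold]
    return splits, distances

# Toy example: the first two vectors point one way, the last two another,
# mimicking two topically distinct runs of sentences.
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
splits, distances = semantic_split_points(toy)
print(splits)  # a single split where the topic shifts
```

In the tutorial, the `distances` series is what gets plotted: peaks in the divergence curve mark the boundaries between the interleaved Wikipedia sections.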