In this video, the focus is on more advanced techniques for parsing and chunking a wider range of document types to create datasets suitable for model training, including Microsoft Office documents, embedded tables, and images that require OCR. The tutorial starts by downloading sample documents from Microsoft's Investor Relations site, including Word, Excel, and PowerPoint files. These documents are then parsed, chunked, and indexed using llmware's tools.
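
For readers who want to follow along, here is a minimal sketch of that setup step in Python, assuming the sample Office documents have already been downloaded and unzipped into a local folder; the library name "msft_ir_docs" and the folder path are placeholders, not names from the video:

from llmware.library import Library

# create a new library - the container that llmware parses and indexes into
library = Library().create_new_library("msft_ir_docs")

# parse, chunk, and index every document in the folder;
# tables and images are extracted along the way
# (replace the placeholder path with the unzipped sample folder)
parsing_output = library.add_files(input_folder_path="/path/to/microsoft_ir_docs")
print("parsing output: ", parsing_output)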

The process begins with setting up the environment and loading the sample files into a library. The add_files method is used to unzip, parse, and chunk the documents, extracting tables and images along the way. The video then demonstrates running a simple text query to confirm that parsing and indexing succeeded. Advanced techniques include exporting the extracted tables to CSV files and running OCR on the extracted images to convert them into text, which is then added back into the library.
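
A sketch of those three steps, reloading the library created above. The query string, output path, and OCR parameters are illustrative; export_all_tables and run_ocr_on_images follow their usage in llmware's example scripts, so treat the exact signatures as assumptions, and note that OCR requires a local engine such as Tesseract to be installed:

from llmware.library import Library
from llmware.retrieval import Query

library = Library().load_library("msft_ir_docs")

# quick sanity check that parsing and indexing succeeded
q = Query(library)
results = q.text_query("cloud revenue", result_count=5)
for r in results:
    print(r["file_source"], "-", r["text"][:80])

# write the tables captured at parse time out as CSV files
# (placeholder output path)
q.export_all_tables(output_fp="/path/to/table_output")

# run OCR on the images extracted during parsing, and add the
# recognized text back into the library as new text chunks
library.run_ocr_on_images(add_to_library=True, chunk_size=400, min_size=10)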

The tutorial also covers creating datasets from the parsed documents, including different dataset types for model training. This involves exporting all text blocks into a JSONL file and creating training, validation, and testing splits. The video emphasizes experimenting with different chunking parameters to achieve the best results.
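
A sketch of the dataset step, again assuming the library built above and llmware's Datasets tool; the split fractions and token bounds are illustrative values, not settings from the video:

from llmware.library import Library
from llmware.dataset_tools import Datasets

library = Library().load_library("msft_ir_docs")

# export every text block in the library to a single JSONL file
# (placeholder output path and file name)
library.export_library_to_jsonl_file("/path/to/dataset_output", "msft_ir_export")

# build a basic text dataset with train / validation / test splits
ds = Datasets(library=library, validation_split=0.10, testing_split=0.10)
ds_output = ds.build_text_ds(min_tokens=100, max_tokens=500)
print("dataset summary: ", ds_output)

Varying min_tokens and max_tokens here is one simple way to experiment with the chunking parameters the video highlights.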

The video concludes by showcasing the generated datasets and discussing potential applications in domain adaptation and embedding model fine-tuning. Viewers are encouraged to check out the sample files and code available in the llmware GitHub repository and to join the Discord community for further discussion and support.
