ICC Retrieval Augmented Generation Corpus
Generating RAG Corpus
The Retrieval Augmented Generation (RAG) corpus is generated to support semantic search (vector similarity)
across the body of regulations. Raw exports of the regulation texts are stored in the api/icc folder.
Understanding Corpus Structure
The RAG corpus is organized on the file system as follows:
database/icc/
└ part-1/
  └ chapter-1/
    └ part-1/
      └ section-r101/
        ├ content.md
        ├ content.pdf
        ├ embedding.json
        ├ metadata.json
        └ article-1/
          ├ content.md
          ├ content.pdf
          ├ embedding.json
          └ metadata.json
The filesystem can be either a local disk or a GCS bucket.
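As a rough illustration of the layout above, the sketch below builds a section or article directory path from its part, chapter and section number. The class CorpusPaths and its methods are hypothetical and not part of the project. The sketch assumes the simple part/chapter/section nesting seen in the metadata.json example and ignores any intermediate level, such as the nested part-1 folder shown under chapter-1.

```java
import java.util.Locale;

/** Hypothetical helper for illustration only; not part of the project. */
final class CorpusPaths {
    /** File names stored in every section or article directory. */
    static final String[] FILES = {"content.md", "content.pdf", "embedding.json", "metadata.json"};

    /** Builds "<root>/part-P/chapter-C/section-<number in lower case>". */
    static String sectionDirectory(String root, int part, int chapter, String sectionNumber) {
        return root + "/part-" + part
                + "/chapter-" + chapter
                + "/section-" + sectionNumber.toLowerCase(Locale.ROOT);
    }

    /** Builds the directory of an article nested inside a section directory. */
    static String articleDirectory(String sectionDirectory, int articleNumber) {
        return sectionDirectory + "/article-" + articleNumber;
    }

    public static void main(String[] args) {
        String section = sectionDirectory("database/icc-irc", 3, 3, "R312");
        System.out.println(section); // database/icc-irc/part-3/chapter-3/section-r312
        for (String file : FILES) {
            System.out.println(articleDirectory(section, 1) + "/" + file);
        }
    }
}
```

Plain forward-slash strings are used instead of java.nio.file.Path so that the same value works as a local path and as a GCS object prefix.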
Building Corpus Index
BuildingRegulationParser is used to build the corpus index. It uses Apache PDFBox to parse the outline from the Table of Contents, with the help of the PdfOutlineNavigator class. The outline is then used to build the index of the corpus, initially as a List<PDOutlineItem>.
The actual index.json file is created when BuildingRegulationParser.parseAndSaveSections(List<String> sectionNumbers)
is invoked.
Try parsing a few sections of the code as follows:
cli/codeproof.sh building-regulation-parser \
--source-pdf "inputs/BuildingRegulationCode.pdf" \
--filesystem LOCAL \
--model gemini-1.5-pro-002 \
301 302 303 311
Check the GCS bucket for the necessary input resources: https://console.cloud.google.com/storage/browser/construction-code-expert-dev/resources
Generating content.pdf file
The content.pdf file is a subset of pages from the original PDF document, which contains the entire body of the construction code.
The page boundaries of content.pdf do not align exactly with the boundaries of the section or article
of the code; the file is an excerpt of the original PDF document "rounded up to page boundaries".
The file is generated using the following methods:
BuildingRegulationSectionParser.getSectionPdfDocument() (used to get the PDF excerpt)
BuildingRegulationArticleParser.saveArticlePdfToFilesystem() (used to get the PDF excerpt)
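The "rounded up to page boundaries" behavior can be sketched as a small page-range calculation. This is only an illustration under stated assumptions, not the project's implementation: PageRanges and excerptRange are hypothetical names, pages are 1-based, and the outline is assumed to provide only the page on which each section starts. Under those assumptions, the excerpt runs from the section's start page through the page on which the next section starts, because the section may end partway down that page.

```java
/** Hypothetical helper for illustration only; not part of the project. */
final class PageRanges {
    /**
     * Returns the inclusive, 1-based page range {first, last} of an excerpt.
     *
     * @param startPage      page on which the section starts
     * @param nextStartPage  page on which the next section starts, or -1 if there is none
     * @param lastPage       last page of the whole document
     */
    static int[] excerptRange(int startPage, int nextStartPage, int lastPage) {
        if (startPage < 1 || startPage > lastPage) {
            throw new IllegalArgumentException("startPage out of range: " + startPage);
        }
        // Without a following section, the excerpt runs to the end of the document.
        int end = (nextStartPage < 0) ? lastPage : Math.min(nextStartPage, lastPage);
        // Guard against inconsistent outline data: never end before the start page.
        end = Math.max(startPage, end);
        return new int[] {startPage, end};
    }

    public static void main(String[] args) {
        int[] r = excerptRange(45, 47, 900);
        System.out.println(r[0] + "-" + r[1]); // 45-47
    }
}
```

Including the next section's start page means an excerpt can carry a few lines of neighboring text, which is the trade-off that rounding up to whole pages implies.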
Generating content.md file
After the PDF excerpt is extracted, an LLM is used to extract the raw text from the PDF into a Markdown file.
BuildingRegulationSectionParser.getTextFromSection() (processes the PDF into markdown)
BuildingRegulationArticleParser.getTextFromArticleAsMarkdown() (processes the PDF into markdown)
Generating embedding.json file
The embedding.json file contains the embeddings of the raw text extracted from the PDF. The embeddings are generated
using an LLM embedding model.
The file is generated using the following method:
BuildingRegulationArticleParser.saveEmbeddingToFileSystem()
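To show how stored embeddings can serve vector-similarity search, here is a minimal sketch that ranks corpus entries by cosine similarity to a query vector. The class VectorSimilarity, its methods and the toy three-dimensional vectors are all hypothetical. The sketch does not read embedding.json (its schema is not described here), and the project's retriever may use a different metric or index.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of the project. */
final class VectorSimilarity {
    /** Cosine similarity of two vectors of equal length; 0.0 if either is all zeros. */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns the keys of the k entries most similar to the query, best first. */
    static List<String> topK(double[] query, Map<String, double[]> corpus, int k) {
        List<String> keys = new ArrayList<>(corpus.keySet());
        keys.sort(Comparator.comparingDouble(
                (String key) -> cosine(query, corpus.get(key))).reversed());
        return new ArrayList<>(keys.subList(0, Math.min(k, keys.size())));
    }

    public static void main(String[] args) {
        // Toy vectors; real embeddings have hundreds of dimensions.
        Map<String, double[]> corpus = new LinkedHashMap<>();
        corpus.put("section-r311", new double[] {0.9, 0.1, 0.0});
        corpus.put("section-r312", new double[] {0.0, 1.0, 0.0});
        corpus.put("section-r301", new double[] {0.5, 0.5, 0.0});
        System.out.println(topK(new double[] {1.0, 0.0, 0.0}, corpus, 2));
    }
}
```

A linear scan like this is enough for a corpus of a few thousand sections; a larger corpus would usually call for an approximate nearest-neighbor index.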
Generating metadata.json file
The metadata.json file contains the metadata of the section or article. Here's an example:
{
  "sectionNumber" : "R312",
  "titleHierarchy" : [ "PART III—BUILDING PLANNING AND CONSTRUCTION", "CHAPTER 3 BUILDING PLANNING", "SECTION R312 GUARDS AND WINDOW FALL PROTECTION" ],
  "dateParsed" : [ 2025, 1, 17 ],
  "textLength" : 3048,
  "directoryPath" : "database/icc-irc/part-3/chapter-3/section-r312"
}
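The dateParsed field in the example is a three-element array rather than a string. Assuming it holds [year, month, day] (the layout some JSON libraries use for java.time.LocalDate values), it can be converted back as sketched below. MetadataDates and toLocalDate are hypothetical names, and the sketch does not parse JSON itself.

```java
import java.time.LocalDate;

/** Hypothetical helper for illustration only; not part of the project. */
final class MetadataDates {
    /** Converts an assumed [year, month, day] array into a LocalDate. */
    static LocalDate toLocalDate(int[] parts) {
        if (parts == null || parts.length != 3) {
            throw new IllegalArgumentException("expected [year, month, day]");
        }
        // LocalDate.of validates the month and day ranges.
        return LocalDate.of(parts[0], parts[1], parts[2]);
    }

    public static void main(String[] args) {
        System.out.println(toLocalDate(new int[] {2025, 1, 17})); // 2025-01-17
    }
}
```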
Taking the Inspector Exam
The inspector exam is a multiple-choice exam that evaluates the LLM system's expertise in construction code. It is a good way to evaluate both the quality of the RAG corpus and the AI's understanding of it.
To take the exam, run the following command:
cli/codeproof.sh exam-proctor \
--examinee-model gemini-1.5-pro-002 \
--judge-model gemini-1.5-pro-002 \
--exam "src/test/resources/icc-b1-practice-exam.textproto" \
--question-style MULTIPLE_CHOICE \
1 2 3 4 5
If using RAG, add the --use-rag and --filesystem flags. Make sure the RAG corpus has been built on the
corresponding filesystem before running the exam in this mode.
cli/codeproof.sh exam-proctor \
--examinee-model gemini-1.5-pro-002 \
--use-rag \
--filesystem LOCAL \
--judge-model gemini-1.5-pro-002 \
--exam "src/test/resources/icc-b1-practice-exam.textproto" \
--question-style MULTIPLE_CHOICE \
1 2 3 4 5
Testing RAG Corpus Retriever
The project includes a command-line interface at cli/codeproof.sh, implemented using the picocli library. Run cli/codeproof.sh --help to see all available commands.
# To be implemented
cli/codeproof.sh rag-retriever \
--filesystem LOCAL \
--question 1
In the meantime, try:
mvn test -Dtest=IrcSectionRetrieverTest#testExamRetriever