
ICC Retrieval Augmented Generation Corpus

Generating RAG Corpus

The Retrieval Augmented Generation (RAG) corpus is generated to support semantic search (vector similarity) across the body of regulations. Raw exports of the regulation texts are stored in the api/icc folder.

Understanding Corpus Structure

The RAG corpus is organized on the file system as follows:

database/icc/
└ part-1/
  └ chapter-1/
    └ part-1/
      └ section-r101/
        ├ content.md
        ├ content.pdf
        ├ embedding.json
        ├ metadata.json
        └ article-1/
          ├ content.md
          ├ content.pdf
          ├ embedding.json
          └ metadata.json

The filesystem can be either a local disk or a Google Cloud Storage (GCS) bucket.
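As a quick sanity check on a locally built corpus, the layout above can be verified per directory. The following is a minimal sketch, not part of the project: the class name CorpusLayoutCheck and the method missingFiles are hypothetical, and it only covers the local-disk case.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of the project): reports which of the four
// corpus files are missing from a section or article directory on local disk.
public class CorpusLayoutCheck {
    // The four file names shown in the corpus layout.
    static final List<String> EXPECTED_FILES =
            List.of("content.md", "content.pdf", "embedding.json", "metadata.json");

    /** Returns the expected file names that are absent from the given directory. */
    static List<String> missingFiles(Path sectionDir) {
        List<String> missing = new ArrayList<>();
        for (String name : EXPECTED_FILES) {
            if (!Files.isRegularFile(sectionDir.resolve(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Default path taken from the layout example; pass another directory as an argument.
        Path dir = Paths.get(args.length > 0
                ? args[0]
                : "database/icc/part-1/chapter-1/part-1/section-r101");
        System.out.println("Missing files in " + dir + ": " + missingFiles(dir));
    }
}
```

An empty list means the directory contains all four files. A GCS-backed corpus would need the same check through a storage client instead of java.nio.file.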

Building Corpus Index

BuildingRegulationParser builds the corpus index. It uses Apache PDFBox to parse the outline from the Table of Contents through the PdfOutlineNavigator helper class. The outline is then used to build the index of the corpus, initially as a List<PDOutlineItem>.

The actual index.json file is created when BuildingRegulationParser.parseAndSaveSections(List<String> sectionNumbers) is invoked.

Try parsing a few sections of the code as follows:

cli/codeproof.sh building-regulation-parser \
--source-pdf "inputs/BuildingRegulationCode.pdf" \
--filesystem LOCAL \
--model gemini-1.5-pro-002 \
301 302 303 311

Check the GCS bucket for necessary input resources: https://console.cloud.google.com/storage/browser/construction-code-expert-dev/resources

Generating content.pdf file

The content.pdf file is a subset of pages from the original PDF document, which contains the entire body of the construction code. The page boundaries of content.pdf do not align exactly with the boundaries of the section or article. Instead, the file is an excerpt of the original PDF "rounded up to page boundaries".
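One way to picture the "rounded up to page boundaries" behavior is as a page-range calculation. The sketch below is an illustration under assumptions, not the project's logic: it assumes each section's start page is known (for example from the outline) and that a section may end partway down the page on which the next section starts, so that page is included whole. The class and method names are hypothetical.

```java
// Hypothetical sketch (not the project's implementation) of rounding a
// section excerpt up to whole pages.
public class PageRangeSketch {
    /**
     * Returns {firstPage, lastPage}, both inclusive and 1-based.
     * Assumption: the section begins somewhere on sectionStartPage and may
     * run onto the page where the next section begins, so that page is kept.
     */
    static int[] excerptRange(int sectionStartPage, int nextSectionStartPage) {
        if (sectionStartPage < 1 || nextSectionStartPage < sectionStartPage) {
            throw new IllegalArgumentException("Pages must be 1-based and in order");
        }
        return new int[] { sectionStartPage, nextSectionStartPage };
    }

    public static void main(String[] args) {
        // Illustrative page numbers only.
        int[] range = excerptRange(10, 12);
        System.out.println("Excerpt pages: " + range[0] + " to " + range[1]);
    }
}
```

Under this assumption an excerpt can contain text from neighboring sections at its first and last pages, which is why the excerpt boundaries and the section boundaries differ.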

The file is generated using the following methods:

  1. BuildingRegulationSectionParser.getSectionPdfDocument() (used to get the PDF excerpt)
  2. BuildingRegulationArticleParser.saveArticlePdfToFilesystem() (used to get the PDF excerpt)

Generating content.md file

After the PDF excerpt is extracted, an LLM extracts its raw text into a Markdown file. This is done by the following methods:

  1. BuildingRegulationSectionParser.getTextFromSection() (processes the PDF into markdown)
  2. BuildingRegulationArticleParser.getTextFromArticleAsMarkdown() (processes the PDF into markdown)

Generating embedding.json file

The embedding.json file contains the embeddings of the raw text extracted from the PDF. The embeddings are generated using an LLM embedding model.

The file is generated using the following method:

  1. BuildingRegulationArticleParser.saveEmbeddingToFileSystem()
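The embeddings support the vector-similarity search the corpus is built for. The source does not say which similarity measure is used, so the sketch below assumes cosine similarity, a common choice, and assumes the embeddings have already been loaded as double arrays. The class name is hypothetical.

```java
// Hypothetical sketch: cosine similarity between two embedding vectors.
// The actual measure and the embedding.json schema are assumptions.
public class CosineSimilaritySketch {
    /** Returns the cosine of the angle between a and b, or 0 if either is a zero vector. */
    static double cosineSimilarity(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // similarity is undefined for a zero vector; treat as no match
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Toy vectors for illustration; real embeddings have many more dimensions.
        double[] question = { 0.2, 0.8, 0.1 };
        double[] section = { 0.25, 0.7, 0.05 };
        System.out.println("Similarity: " + cosineSimilarity(question, section));
    }
}
```

A retriever would compute this score between the question embedding and each section or article embedding, then return the highest-scoring entries.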

Generating metadata.json file

The metadata.json file describes the section or article. Here's an example:

{
  "sectionNumber" : "R312",
  "titleHierarchy" : [
    "PART III—BUILDING PLANNING AND CONSTRUCTION",
    "CHAPTER 3 BUILDING PLANNING",
    "SECTION R312 GUARDS AND WINDOW FALL PROTECTION"
  ],
  "dateParsed" : [ 2025, 1, 17 ],
  "textLength" : 3048,
  "directoryPath" : "database/icc-irc/part-3/chapter-3/section-r312"
}
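Two fields in this example follow a pattern that a short sketch can make explicit. The code below assumes that dateParsed is a [year, month, day] array and that directoryPath is composed from the part, chapter, and lowercased section number. Both are inferred from this one example rather than confirmed by the project, and the class and method names are hypothetical.

```java
import java.time.LocalDate;
import java.util.Locale;

// Hypothetical sketch: interpreting two metadata.json fields.
// Field meanings are inferred from the single example, not confirmed.
public class MetadataSketch {
    /** Converts a [year, month, day] array into a LocalDate. */
    static LocalDate toLocalDate(int[] dateParsed) {
        if (dateParsed.length != 3) {
            throw new IllegalArgumentException("Expected [year, month, day]");
        }
        return LocalDate.of(dateParsed[0], dateParsed[1], dateParsed[2]);
    }

    /** Builds a directory path in the shape shown by the example's directoryPath. */
    static String directoryPath(String root, int part, int chapter, String sectionNumber) {
        return root
                + "/part-" + part
                + "/chapter-" + chapter
                + "/section-" + sectionNumber.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Values copied from the metadata example.
        System.out.println(toLocalDate(new int[] { 2025, 1, 17 }));
        System.out.println(directoryPath("database/icc-irc", 3, 3, "R312"));
    }
}
```

With the values from the example, the second call reproduces the directoryPath string shown above.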

Taking the Inspector Exam

The inspector exam is a multiple-choice exam that evaluates the LLM system's expertise in the construction code. It is a useful way to assess both the quality of the RAG corpus and how well the AI understands it.

To take the exam, run the following command:

cli/codeproof.sh exam-proctor \
--examinee-model gemini-1.5-pro-002 \
--judge-model gemini-1.5-pro-002 \
--exam "src/test/resources/icc-b1-practice-exam.textproto" \
--question-style MULTIPLE_CHOICE \
1 2 3 4 5

To use RAG, add the --use-rag and --filesystem flags. Make sure the RAG corpus has been built on the corresponding filesystem before running the exam in this mode.

cli/codeproof.sh exam-proctor \
--examinee-model gemini-1.5-pro-002 \
--use-rag \
--filesystem LOCAL \
--judge-model gemini-1.5-pro-002 \
--exam "src/test/resources/icc-b1-practice-exam.textproto" \
--question-style MULTIPLE_CHOICE \
1 2 3 4 5

Testing RAG Corpus Retriever

The project includes a command-line interface at cli/codeproof.sh, implemented using the picocli library. Run cli/codeproof.sh --help to see all available commands.

# To be implemented
cli/codeproof.sh rag-retriever \
--filesystem LOCAL \
--question 1

In the meantime, try:

mvn test -Dtest=IrcSectionRetrieverTest#testExamRetriever