Retrieval Augmented Generation (RAG) Application

Learn How to Use Retrieval Augmented Generation on Intel® Gaudi® Processors with Hugging Face*


RAG on the Intel® Gaudi® 2 Platform

A scalable RAG application uses Hugging Face* tools to deploy optimized applications using the Intel® Gaudi® 2 AI accelerator.

The application is built with readily accessible Hugging Face tools such as text-generation-inference (TGI) and text-embeddings-inference (TEI). LangChain is used to keep the code easy to follow. The user interface at the end of the tutorial uses Gradio to submit your queries. The application runs in a Docker* environment but can be deployed to a Kubernetes* cluster.

RAG is a method that enhances the precision and dependability of generative AI (GenAI) models by incorporating facts from external sources. This technique addresses the limitations of large language models (LLMs), which, despite their ability to generate responses to general prompts rapidly, may not provide in-depth or specific information. By enabling access to external knowledge sources, RAG improves factual consistency, increases the reliability of generated responses, and helps to mitigate the issue of hallucination in more complex and knowledge-intensive tasks.

This tutorial shows the steps of building the full RAG pipeline on the Intel Gaudi 2 platform. First, you build the text generation, text embedding, vector store index, and database generation tools. Then, the external dataset is prepared by extracting the information from the external document, splitting the document into chunks, and creating numerical embeddings of those chunks. These embeddings are then loaded into the vector database. To run a query, the application runs the embedding model on the query, matches it against the contents of the database, and sends the prompt together with the retrieved context to the Llama 2 LLM to generate the full response.
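The query flow described above can be sketched in a few lines of plain Python. This is a toy illustration only: a hypothetical bag-of-words counter stands in for the embedding model and a Python list stands in for the vector database, whereas the pipeline built in this tutorial uses TEI, PGVector, and TGI.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for the embedding model: a bag-of-words vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# 1. Chunk the external document and embed each chunk into the "database".
chunks = ["Gaudi 2 is an AI accelerator.", "Paris is the capital of France."]
index = [(embed(c), c) for c in chunks]

# 2. Embed the query and retrieve the best-matching chunk.
query = "What is the capital of France?"
best = max(index, key=lambda entry: cosine(embed(query), entry[0]))[1]

# 3. The retrieved chunk becomes the context sent with the question to the LLM.
prompt = f"Context: {best}\nQuestion: {query}"
print(best)  # -> Paris is the capital of France.
```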


Figure 1. RAG model details

Initial Setup

To ensure that your build environment is set correctly, follow these steps:
 

  1. Set the appropriate ports for access when you use SSH to connect to the node of the Intel Gaudi 2 accelerator. Ensure that the following ports are open:

    Port 8888 (for running this Jupyter* Notebook)
    Port 7860 (for running the Gradio server)

  2. Add the following port-forwarding options to your overall SSH command when connecting to the node of the Intel Gaudi accelerator:
    ssh -L 8888:localhost:8888 -L 7860:localhost:7860 ....
  3. Before you load this notebook, run the standard Docker image for the Intel Gaudi accelerator. Include the /var/run/docker.sock file. Use the following commands to start your Docker image.
    docker run -itd --name RAG --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host -v /var/run/docker.sock:/var/run/docker.sock vault.habana.ai/gaudi-docker/1.15.0/ubuntu22.04/habanalabs/pytorch-installer-2.2.0:latest
    
    
    docker exec -it RAG bash
    
    
    
    cd ~ && git clone https://github.com/HabanaAI/Gaudi-tutorials
    
    
    
    python3 -m pip install jupyterlab && python3 -m jupyterlab_server --IdentityProvider.token='' --ServerApp.password='' --allow-root --port 8888 --ServerApp.root_dir=$HOME &

Set Up the Docker Environment in the Notebook

At this point, you have cloned the Gaudi-tutorials notebook inside your Docker image and have opened the notebook. Note that you will need to install Docker again inside the container with the Intel Gaudi accelerator to manage the running of the RAG tools.

cd /root/Gaudi-tutorials/PyTorch/RAG_Application

apt-get update

apt-get install docker.io curl -y

Load the Tools for RAG

Three building blocks make up the RAG environment: text generation, text embedding, and a vector database.

Text-Generation-Inference (TGI)

The first building block of the application is TGI, and its purpose is to serve the LLM that answers questions based on context. To run it, you need to build a Docker image.

Note The Hugging Face TGI depends on software that is subject to non-open source licenses. If you use or redistribute this software, it is your sole responsibility to ensure compliance with such licenses.

cd /root/Gaudi-tutorials/PyTorch/RAG_Application

git clone -b v1.2.1 https://github.com/huggingface/tgi-gaudi.git

cd tgi-gaudi

docker build -t tgi-gaudi .

cd ../

After you build the image, you run it.

How to Access and Use the Llama 2 Model

To use the Llama 2 model, you need a Hugging Face account. Agree to the terms of use of the model in its model card on the Hugging Face Hub and create a read token. Then, copy that token to the HUGGING_FACE_HUB_TOKEN variable that follows.

Use of the pretrained model is subject to compliance with third-party licenses, including the “Llama 2 Community License Agreement” (LLAMAV2). For guidance on the intended use of the Llama 2 model, what will be considered misuse and out-of-scope uses, who the intended users are, and additional terms, review the instructions. Users bear sole liability and responsibility to follow and comply with any third-party licenses, and Habana Labs disclaims and will bear no liability with respect to users’ use or compliance with third-party licenses.

docker run -d -p 9001:80 \
    --runtime=habana \
    --name gaudi-tgi \
    -e HABANA_VISIBLE_DEVICES=0 \
    -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
    -e HUGGING_FACE_HUB_TOKEN="<your_token_here>" \
    --cap-add=sys_nice \
    --ipc=host \
    tgi-gaudi \
    --model-id meta-llama/Llama-2-7b-chat-hf

After the Docker server starts, it takes some time to download the model and load it into the device. To check the status of the download, run docker logs gaudi-tgi. The following is sample output:

2024-02-23T16:24:35.125179Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0

2024-02-23T16:24:40.729388Z INFO shard-manager: text_generation_launcher: Shard ready in 65.710470677s rank=0

2024-02-23T16:24:40.796775Z INFO text_generation_launcher: Starting Webserver

2024-02-23T16:24:42.589516Z WARN text_generation_router: router/src/main.rs:355: `--revision` is not set

2024-02-23T16:24:42.589551Z WARN text_generation_router: router/src/main.rs:356: We strongly advise to set it to a known supported commit.

2024-02-23T16:24:42.842098Z INFO text_generation_router: router/src/main.rs:377: Serving revision e852bc2e78a3fe509ec28c6d76512df3012acba7 of model Intel/neural-chat-7b-v3-1

2024-02-23T16:24:42.845898Z INFO text_generation_router: router/src/main.rs:219: Warming up model

2024-02-23T16:24:42.846613Z WARN text_generation_router: router/src/main.rs:230: Model does not support automatic max batch total tokens

2024-02-23T16:24:42.846620Z INFO text_generation_router: router/src/main.rs:252: Setting max batch total tokens to 16000

2024-02-23T16:24:42.846623Z INFO text_generation_router: router/src/main.rs:253: Connected

2024-02-23T16:24:42.846626Z WARN text_generation_router: router/src/main.rs:258: Invalid hostname, defaulting to 0.0.0.0

Once the setup is complete, you can verify that the text generation is working by sending a request to it. Note that the first request could be slow due to graph compilation.

curl 127.0.0.1:9001/generate \
    -X POST \
    -d '{"inputs":"why is the earth round?","parameters":{"max_new_tokens":200}}' \
    -H 'Content-Type: application/json'
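The same request can be issued from Python using only the standard library. A hedged sketch, assuming the TGI container above is listening on port 9001; `build_generate_request` is a hypothetical helper, and the payload shape mirrors the curl command.

```python
import json
import urllib.request

def build_generate_request(inputs: str, max_new_tokens: int = 200):
    # Build a POST request for the TGI /generate endpoint.
    payload = {"inputs": inputs, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        "http://127.0.0.1:9001/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_generate_request("why is the earth round?")
# With the server running: print(urllib.request.urlopen(req).read().decode())
print(req.get_full_url())
```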

Text-Embedding-Inference (TEI)

The purpose of TEI is to serve the embeddings model that produces embeddings for the vector database. To run it, you need to build a Docker image.

Note The Hugging Face TEI depends on software that is subject to non-open source licenses. If you use or redistribute this software, it is your sole responsibility to ensure compliance with such licenses.

git clone https://github.com/huggingface/tei-gaudi

cd tei-gaudi

docker build --quiet -t tei-gaudi .

cd ../

After building the image, run it:

docker run -d -p 9002:80 \
    --runtime=habana \
    --name gaudi-tei \
    -e HABANA_VISIBLE_DEVICES=4 \
    -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
    --cap-add=sys_nice \
    --ipc=host \
    tei-gaudi \
    --model-id BAAI/bge-large-en-v1.5
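You can exercise the embedding server the same way. A hedged sketch, assuming the TEI container above is listening on port 9002 and serving the standard TEI /embed route; `build_embed_request` is a hypothetical helper.

```python
import json
import urllib.request

def build_embed_request(text: str):
    # Build a POST request for the TEI /embed endpoint.
    payload = {"inputs": text}
    return urllib.request.Request(
        "http://127.0.0.1:9002/embed",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_embed_request("hello world")
# With the server running, the response is a list of embedding vectors:
# vectors = json.loads(urllib.request.urlopen(req).read())
print(req.get_full_url())
```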

PGVector Database

The third building block is a vector database, PGVector. Set up the Docker image using the following code:

docker pull pgvector/pgvector:pg16

docker run \

-d \

-e POSTGRES_PASSWORD=postgres \

-p 9003:5432 \

pgvector/pgvector:pg16

Application Front End

The last building block is a front end that serves as an HTTP server. The front end is implemented in Python* using the Gradio interface. To set up the environment, run the following code:

cd /root/Gaudi-tutorials/PyTorch/RAG_Application

pip install -q -r requirements.txt

Data Preparation

To build a high-quality RAG application, you need to prepare the data. To perform data processing for the vector database, extract the text from your documents (such as PDFs or CSVs), split it into chunks that do not exceed a maximum length, and attach additional metadata (for example, the filename or file creation date). Then upload the preprocessed data to the vector database.

In the process of data preprocessing, text splitting plays a crucial role. It involves breaking down the text into smaller, semantically meaningful chunks for further processing and analysis. Here are some common methods of text splitting:

  • By Character: This method involves splitting the text into individual characters. It’s a straightforward approach, but it may not always be the most effective as it doesn’t take into account the semantic meaning of words or phrases.
  • Recursive: Recursive splitting involves breaking down the text into smaller parts repeatedly until a certain condition is met. This method is particularly useful when dealing with complex structures in the text as it allows for a more granular level of splitting.
  • HTML Specific: When dealing with HTML content, text splitting can be done based on specific HTML tags or elements. This method is useful for extracting meaningful information from web pages or other HTML documents.
  • Code Specific: In programming code, text can be split based on specific code syntax or structures. This method is particularly useful for code analysis or for building tools that work with code.
  • By Tokens: Tokenization is a common method of text splitting in natural language processing (NLP). It involves breaking down the text into individual words or tokens. This method is effective for understanding the semantic meaning of the text as it allows for the analysis of individual words and their context.

In conclusion, the choice of a text splitting method depends largely on the nature of the text and the specific requirements of the task at hand. It’s important to choose a method that effectively captures the semantic meaning of the text and facilitates further processing and analysis.

This tutorial uses the recursive method for text splitting. For more on text splitting, try the Text Splitter Playground.
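A minimal sketch of the recursive strategy, assuming a fixed list of separators tried from coarsest to finest; LangChain's splitters implement a production-grade version of this idea.

```python
def recursive_split(text, chunk_size=512, seps=("\n\n", "\n", " ")):
    """Split text into chunks of at most chunk_size characters,
    preferring coarse separators and recursing to finer ones."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep = next((s for s in seps if s in text), None)
    if sep is None:
        # No separator left: fall back to a hard character cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, buf = [], ""
    for part in text.split(sep):
        piece = part if not buf else buf + sep + part
        if len(piece) <= chunk_size:
            buf = piece
            continue
        if buf:
            chunks.append(buf)
        if len(part) > chunk_size:
            # A single part is still too long: recurse with finer separators.
            chunks.extend(recursive_split(part, chunk_size, seps))
            buf = ""
        else:
            buf = part
    if buf:
        chunks.append(buf)
    return chunks

print(recursive_split("aaa bbb ccc ddd", chunk_size=7))
# -> ['aaa bbb', 'ccc ddd']
```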

Database Population

Database population is the step where we load documents, embed them, and store them in the database.

Data Loading

For ease of use, we’ll use helper functions from langchain. Note that langchain_community is also required.

from pathlib import Path

from langchain.docstore.document import Document
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain.vectorstores.pgvector import PGVector
from langchain_community.embeddings import HuggingFaceHubEmbeddings

Load Documents with Embeddings

Here, we need to create a Hugging Face TEI client and a PGVector client. For PGVector, the collection name corresponds to a table name, and the connection string specifies the connection protocol (postgresql+psycopg2), user, password, host, port, and database name. For ease of use, pre_delete_collection is set to True to prevent duplicates in the database.

embeddings = HuggingFaceHubEmbeddings(model="http://localhost:9002", huggingfacehub_api_token="EMPTY")

store = PGVector(
    collection_name="documents",
    connection_string="postgresql+psycopg2://postgres:postgres@localhost:9003/postgres",
    embedding_function=embeddings,
    pre_delete_collection=True,
)

Data Loading and Splitting

Data is loaded from text files in the data/ folder. Then, documents are split into chunks of 512 characters and loaded into the database. Note that documents can have metadata that can also be stored in the vector database.

To load a new text file in the data/ folder and run the RAG pipeline on new content, run the following cell again with new data. This cell creates a new database to run your query.

def load_file_to_db(path: str, store: PGVector):
    loader = TextLoader(path)
    document = loader.load()
    text_splitter = CharacterTextSplitter(chunk_size=512, chunk_overlap=0)
    for chunk in text_splitter.split_documents(document):
        store.add_documents([chunk])

for doc in Path("data/").glob("*.txt"):
    print(f"Loading {doc}...")
    load_file_to_db(str(doc), store)

print("Finished.")

Running the Application

To start the application, run the following commands to set up the Gradio interface. Load a text file in the data folder and run the cell. The application ingests the data and starts the chat application to ask questions about the document. It directly accesses the TGI and TEI libraries to ingest the data, creates the embeddings and vector database, runs the query through the database, and uses the LLM to generate an answer to your query.

%load_ext gradio



from langchain.vectorstores.pgvector import PGVector
from langchain_community.embeddings import HuggingFaceHubEmbeddings
from text_generation import Client



rag_prompt_intel_raw = """### System: You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.



### User: Question: {question}



Context: {context}



### Assistant: """



def get_sources(question):
    embeddings = HuggingFaceHubEmbeddings(model="http://localhost:9002", huggingfacehub_api_token="EMPTY")
    store = PGVector(
        collection_name="documents",
        connection_string="postgresql+psycopg2://postgres:postgres@localhost:9003/postgres",
        embedding_function=embeddings,
    )
    return store.similarity_search(f"Represent this sentence for searching relevant passages: {question}", k=2)



def sources_to_str(sources):
    return "\n".join(f"{i+1}. {s.page_content}" for i, s in enumerate(sources))



def get_answer(question, sources):
    client = Client("http://localhost:9001")  # change this to 9009 for the new model
    context = "\n".join(s.page_content for s in sources)
    prompt = rag_prompt_intel_raw.format(question=question, context=context)
    # return client.generate_stream(prompt, max_new_tokens=1024, stop_sequences=["### User:", "</s>"])
    return client.generate(prompt, max_new_tokens=1024, stop_sequences=["### User:", "</s>"]).generated_text



default_question = "What is the summary of this document?"



def rag_answer(question):
    sources = get_sources(question)
    answer = get_answer(question, sources)
    # return f"Sources:\n{sources_to_str(sources)}\nAnswer:\n{answer}"
    return f"{answer}"

Finally, run the Gradio application and see the output.

%%blocks



import gradio as gr



with gr.Blocks() as demo:
    gr.Markdown("# Intel Gaudi 2 RAG app")
    question = gr.Textbox(default_question, label="Question")
    answer = gr.Textbox(label="Answer")
    send_btn = gr.Button("Run")
    send_btn.click(fn=rag_answer, inputs=question, outputs=answer)

Figure 2. Example of the user interface output

Next Steps

You can add other .txt documents into the ./data folder and then rerun the steps in the Data Preparation section to update the vector database with the new document.

You can also try the LocalGPT tutorial, which also uses RAG with the LocalGPT script to generate responses using the Chroma database.

For a complete list of models optimized for the Intel Gaudi platform, see Optimum for Intel Gaudi AI Accelerator.
