# Document Completion API Workflow

The Document Completion API lets users extract relevant information from their documents using Retrieval-Augmented Generation (RAG). RAG combines information retrieval with a language model: the system first retrieves the document passages most relevant to a query, then uses them as context to generate an accurate, context-aware response.

The process consists of four main steps:
1. Create Dataset
2. Ingest Dataset
3. Prepare Dataset
4. Query

## Create Dataset

This endpoint creates a new dataset in the system. A dataset is a logical grouping that stores and organizes all documents uploaded by the user. It is the foundation for the subsequent steps, such as document ingestion and querying.

In [1]:
import requests  # Import the requests module to send HTTP requests

# Define the API endpoint for the datasets
datasets_endpoint = "https://api.lab45.ai/v1.1/datasets"

# Set the headers for the request, including the content type, accepted response format, and authorization token
headers = {
    "Content-Type": "application/json",
    "Accept": "text/event-stream, application/json",
    "Authorization": "Bearer <api_key>"  # Replace <api_key> with your API key for authentication
}

# Payload sent with the request as JSON. It defines the name and description of the new dataset.
payload = {
    "name": "Test Dataset1",
    "description": "Some test Dataset1"
}

# Make the POST request to the datasets API endpoint with the provided headers and payload
response = requests.post(datasets_endpoint, headers=headers, json=payload)

# Print the JSON response, which contains the new dataset's ID (`_id`), its name, and the other fields of the dataset record
print(response.json())

{'_id': 'd90b8b53-da18-4b85-8fb7-6c23f67fb2ed', 'desc': 'Some test Dataset1', 'files': [], 'name': 'Test Dataset1', 'owners': ['edee4f9f-873f-439e-983c-8f732d312177'], 'tenant_id': 'a919164d-8b7c-43fb-8119-f1997d45ca4f'}
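The `_id` value in this response is the dataset ID that the later steps reuse: the same value appears in the ingest URL and in the prepare and query payloads below. The following is a minimal sketch of pulling it out of the response, assuming the response shape shown above. `extract_dataset_id` is a hypothetical helper written for illustration, not part of the API.

```python
def extract_dataset_id(dataset: dict) -> str:
    """Return the dataset ID from a create-dataset response body."""
    dataset_id = dataset.get("_id")
    if not dataset_id:
        raise ValueError(f"No '_id' field in response: {dataset!r}")
    return dataset_id


# Sample body copied from the output above; in practice, pass response.json()
sample = {
    "_id": "d90b8b53-da18-4b85-8fb7-6c23f67fb2ed",
    "desc": "Some test Dataset1",
    "files": [],
    "name": "Test Dataset1",
}

dataset_id = extract_dataset_id(sample)

# Build the ingest URL for the next step from the extracted ID
ingest_url = f"https://api.lab45.ai/v1.1/datasets/{dataset_id}/ingest"
print(ingest_url)
```

Deriving the URL from the response avoids copying the ID by hand between cells.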


## Ingest Dataset
   
The Ingest Dataset endpoint uploads documents into a previously created dataset. The uploaded documents are kept in blob storage for later processing and retrieval. This step populates the dataset with the documents the system needs to perform document completion tasks.

In [1]:
import requests

# Define the API endpoint for the ingest step. The ID in the URL is the `_id` returned by the Create Dataset step.
url = "https://api.lab45.ai/v1.1/datasets/d90b8b53-da18-4b85-8fb7-6c23f67fb2ed/ingest"

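# Set the authorization header. Content-Type is left out so that requests can set
# the multipart/form-data boundary itself.
headers = {
    "Authorization": "Bearer <api_key>"  # Replace <api_key> with your API key for authentication
}

# Open the files in binary mode; the with-block closes them once the upload finishes
with open("./test_data.txt", "rb") as file1, open("./test_data2.txt", "rb") as file2:
    # Each entry is (form field name, (file name, file object, MIME type))
    files = [
        ("test_sample", ("test_data.txt", file1, "text/plain")),
        ("test_sample2", ("test_data2.txt", file2, "text/plain")),
    ]

    # Make the POST request; requests encodes the files as multipart/form-data
    response = requests.post(url, headers=headers, files=files)

# Print the raw response body (the dataset record, now listing the uploaded files)
print(response.text)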

{"_id":"d90b8b53-da18-4b85-8fb7-6c23f67fb2ed","desc":"Some test Dataset1","files":[{"name":"test_data.txt","ts":1737615558.063004},{"name":"test_data2.txt","ts":1737615558.0630112}],"name":"Test Dataset1","owners":["edee4f9f-873f-439e-983c-8f732d312177"],"tenant_id":"a919164d-8b7c-43fb-8119-f1997d45ca4f"}
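To confirm that every file arrived, the `files` array in this response can be read back. The sketch below assumes the response shape shown above and assumes that `ts` is a Unix timestamp in seconds, which the source does not state explicitly. `list_ingested_files` is a hypothetical helper.

```python
import json
from datetime import datetime, timezone


def list_ingested_files(body: str) -> list:
    """Return (file name, upload time in UTC) pairs from an ingest response body."""
    dataset = json.loads(body)
    return [
        # Assumption: 'ts' is seconds since the Unix epoch
        (entry["name"], datetime.fromtimestamp(entry["ts"], tz=timezone.utc))
        for entry in dataset.get("files", [])
    ]


# Sample body reduced from the output above; in practice, pass response.text
sample_body = (
    '{"_id":"d90b8b53-da18-4b85-8fb7-6c23f67fb2ed",'
    '"files":[{"name":"test_data.txt","ts":1737615558.063004},'
    '{"name":"test_data2.txt","ts":1737615558.0630112}],'
    '"name":"Test Dataset1"}'
)

uploaded = list_ingested_files(sample_body)
for name, uploaded_at in uploaded:
    print(name, uploaded_at.isoformat())
```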



## Prepare Dataset
   
The Prepare Dataset step is critical in the document completion workflow. It converts the ingested documents into embeddings, which are dense vector representations of the document content. These embeddings encode the semantic meaning of the text, allowing the system to retrieve relevant information efficiently based on user queries.

In [2]:
import requests

# Define the API endpoint for the prepare step
prepare_endpoint = "https://api.lab45.ai/v1.1/skills/doc_completion/prepare"

# Set the headers for the request, including the content type, accepted response format, and authorization token
headers = {
    "Content-Type": "application/json",
    "Accept": "text/event-stream, application/json",
    "Authorization": "Bearer <api_key>"  # Replace <api_key> with your API key for authentication
}

# Define the payload (request body) for the API call, which includes a dataset ID
payload ={
    "dataset_id": "d90b8b53-da18-4b85-8fb7-6c23f67fb2ed"
}

# Make the POST request to the "prepare" API endpoint with the provided headers and payload
response = requests.post(prepare_endpoint, headers=headers, json=payload)

print(response.text)

{"_id":"54ccfbc7-358a-4e99-84b9-19b8b12b89f1","emb_type":"openai","resource_group_id":"d90b8b53-da18-4b85-8fb7-6c23f67fb2ed","status":"Started"}
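The response reports `"status":"Started"`, which suggests that preparation may still be running when the call returns. The source does not show how to check progress, so the sketch below is a generic polling loop that takes a caller-supplied `fetch_status` function. That function, the helper `wait_for_status`, and the terminal status names `"Completed"` and `"Failed"` are all assumptions made for illustration and are not confirmed by the API output above.

```python
import time


def wait_for_status(fetch_status, done_statuses, interval_s=5.0, max_attempts=60):
    """Call fetch_status() until it returns one of done_statuses, then return it."""
    done = set(done_statuses)
    for attempt in range(max_attempts):
        status = fetch_status()
        if status in done:
            return status
        # Wait before the next check, except after the final attempt
        if attempt < max_attempts - 1:
            time.sleep(interval_s)
    raise TimeoutError(f"Status did not reach {sorted(done)} after {max_attempts} checks")


# Demo with a stand-in for a real status check (hypothetical status values)
fake_statuses = iter(["Started", "Started", "Completed"])
final_status = wait_for_status(
    lambda: next(fake_statuses),
    done_statuses={"Completed", "Failed"},
    interval_s=0,
)
print(final_status)
```

Passing the status check in as a function keeps the loop independent of whichever endpoint actually reports preparation progress.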



## Query

The Query step lets users ask questions or make requests and receive context-aware responses grounded in the prepared dataset. The system uses embeddings and indexed document content to identify the sections of the dataset that are relevant to the query.
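The idea of matching a query against embedded content can be illustrated with a toy example. This is not the service's implementation: the three-dimensional vectors and chunk names below are made up for illustration, whereas real embeddings come from an embedding model and have many more dimensions. The sketch ranks chunks by cosine similarity to the query vector, one common way to compare embeddings.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up embeddings for two document chunks (illustrative values only)
chunks = {
    "platform overview": [0.9, 0.1, 0.0],
    "cricket rules": [0.1, 0.9, 0.2],
}

# Made-up embedding for a user query about cricket
query_vector = [0.2, 0.8, 0.1]

# Rank chunk names from most to least similar to the query
ranked = sorted(
    chunks,
    key=lambda name: cosine_similarity(query_vector, chunks[name]),
    reverse=True,
)
print(ranked[0])
```

In a RAG pipeline, the top-ranked chunks are the ones handed to the language model as context.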



In [3]:
import requests
import json

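# Define the API endpoint for the query step
url = "https://api.lab45.ai/v1.1/skills/doc_completion/query"

# Build the request body and serialize it to a JSON string
payload = json.dumps({
    "dataset_id": "d90b8b53-da18-4b85-8fb7-6c23f67fb2ed",  # ID of the prepared dataset
    "skill_parameters": {
        "model_name": "gpt-4o",  # Language model that generates the answer
        "retrieval_chain": "custom",
        "emb_type": "openai",  # Same emb_type value that the Prepare step reported
        "temperature": 0,
        "max_output_tokens": 100,  # Upper bound on the length of the generated answer
        "return_sources": False
    },
    "stream_response": False,
    "messages": [
        {"content": "Hi", "role": "user"},
        {"content": "give summary of the uploaded document", "role": "user"}
    ]
})

# Set the headers for the request, including the content type and authorization token
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer <api_key>"  # Replace <api_key> with your API key for authentication
}

# Make the POST request to the query endpoint with the serialized payload
response = requests.post(url, headers=headers, data=payload)

# Print the raw response body, which contains the generated answer
print(response.text)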

{"data": {"content": "The document provided consists of two main parts. The first part describes a platform that is an AI-powered digital assistant offering comprehensive and reliable information across various topics. It emphasizes the platform's ability to provide accurate and detailed responses for educational support, general knowledge insights, writing assistance, and guidance on different subjects. The platform is committed to delivering complete and trustworthy information while maintaining a safe and respectful environment for all users.\n\nThe second part of the document provides information about cricket, a popular bat-and-ball sport played between"}}
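The generated answer sits under `data.content` in this response. Note that the answer above stops mid-sentence, most likely because `max_output_tokens` was set to 100 in the request. A minimal sketch of extracting the answer text follows, assuming the response shape shown above. `extract_answer` is a hypothetical helper, and the sample body is a shortened version of the real output.

```python
import json


def extract_answer(body: str) -> str:
    """Return the generated answer text from a query response body."""
    parsed = json.loads(body)
    try:
        return parsed["data"]["content"]
    except (KeyError, TypeError) as exc:
        raise ValueError(f"Unexpected response shape: {body!r}") from exc


# Shortened sample of the output above; in practice, pass response.text
sample_body = '{"data": {"content": "The document provided consists of two main parts."}}'

answer = extract_answer(sample_body)
print(answer)
```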

