Azure OpenAI embedding creation

Azure OpenAI

There are many articles that already explain Azure OpenAI, so I would like to keep this short. Azure OpenAI is all about prompt engineering, embeddings, and content filters. Here we will jump directly into practical examples and, along the way, explain the models that are available.

Before starting, I would like to make a few points.

My exploration of Azure OpenAI started with chat.openai.com. Explore it and try random prompts like the ones below.

  1. Give me a C# hello world sample.
  2. Paste some lengthy text and ask for a summary of it.

ChatGPT (Generative Pre-trained Transformer) answers from data that is available on the public internet. According to ChatGPT itself, that data only goes up to 2021, so if we ask who won IPL 2023, it may not return a result.

Note. Before proceeding, here is something not to do.

We are now at a point where we can get sample code to use in an application and fetch any data from the public internet. That raises the real question: is it fine to paste our own code into ChatGPT? The strict answer is "NO". When we paste our code into ChatGPT, ChatGPT stores our data, so we are opening up the security of the application and handing code that we had so far considered secure to the public domain. It is always better to avoid copying application code into ChatGPT. Search for "samsung chatgpt news" and you will find a few articles about exactly this.

Models

Models are nothing but modules, each built for a particular scenario.

Model family: Description

GPT-4: A set of models that improve on GPT-3.5 and can understand as well as generate natural language and code.
GPT-3: A series of models that can understand and generate natural language. This includes the new ChatGPT model.
DALL-E: A series of models that can generate original images from natural language.
Codex: A series of models that can understand and generate code, including translating natural language to code.
Embeddings: A set of models that can understand and use embeddings. An embedding is a special format of data representation that can be easily utilized by machine learning models and algorithms. The embedding is an information-dense representation of the semantic meaning of a piece of text. Azure OpenAI currently offers three families of Embeddings models for different functionalities: similarity, text search, and code search.

The GPT-3 family is further divided into the models below.

Davinci

Davinci is the most capable model and can do anything that any other model can do and much more—often with fewer instructions.

Davinci can solve logic problems, determine cause and effect, understand the intent of the text, produce creative content, explain character motives, and handle complex summarization tasks.

Curie

Curie tries to balance power and speed. It can do anything that Ada or Babbage can do, but it's also capable of handling more complex classification tasks and more nuanced tasks like summarization, sentiment analysis, chatbot applications, and Question and Answers.

Babbage

Babbage is a bit more capable than Ada but not quite as performant. It can perform all the same tasks as Ada, but it can also handle a bit more involved classification tasks, and it's well suited for semantic search tasks that rank how well documents match a search query.

Ada

Ada is usually the fastest model and least costly. It's best for less nuanced tasks—for example, parsing text, reformatting text, and simpler classification tasks. The more context you provide Ada, the better it will likely perform.

Apart from models, we will constantly hear a few keywords while using Gen AI, such as NLP, named entity, and LLM. Depending on the use case, I will touch on them and explain what they mean.

NLP (Natural Language Processing)? What is it? Here, NLP is nothing but a way to send a prompt to the system in plain English and have it fetch the data from the database.

Example

Prompt: Give me the top 10 records from employee

AI System: Understands the prompt and converts it into a machine-understandable format such as SELECT TOP 10 * FROM employee.

Prompt: Give me information on the customer name, billing address, agent name, product name, and cost from the invoice document.

AI System: Retrieves all the information requested in the prompt above from all the documents.

Use Case: Fetch data from PDF documents that contain text (here, we are talking about files with only text, no tables or images).

Scenario: I have multiple documents in different formats, such as txt, Word, and PDF, and I want a mechanism to search the text of these documents using natural language (NLP).

Solution Approach

Step 1. Create a blob container where we will be uploading the files.

Step 2. Create an Azure Function with a blob trigger, because we want document search to be available in real time. Whenever a file is uploaded to the blob container, the Azure Function is triggered and reads the file.

Step 3. Write C# code that reads the data from the file by converting the byte[] to a memory stream. We can use the iTextSharp NuGet package to read the content of the file, then go through it page by page, read each paragraph, generate embeddings for it, and save them to a vector database.
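To make the reading step concrete, here is a minimal sketch of the same idea. It is written in Python purely for illustration (the actual solution described here is C# with iTextSharp), and it assumes the text of each page has already been extracted. The function names `read_text_file`, `split_paragraphs`, and `paragraphs_by_page` are hypothetical.

```python
import io
import re


def read_text_file(data: bytes, encoding: str = "utf-8") -> str:
    """Wrap raw bytes in an in-memory stream and decode them as text."""
    with io.BytesIO(data) as stream:
        return stream.read().decode(encoding)


def split_paragraphs(page_text: str) -> list[str]:
    """Split text on blank lines and collapse whitespace inside each paragraph."""
    blocks = re.split(r"\n\s*\n", page_text)
    return [" ".join(block.split()) for block in blocks if block.strip()]


def paragraphs_by_page(pages: list[str]) -> list[tuple[int, str]]:
    """Return (page_number, paragraph) pairs, with pages numbered from 1."""
    return [
        (number, paragraph)
        for number, page in enumerate(pages, start=1)
        for paragraph in split_paragraphs(page)
    ]
```

Keeping the page number next to each paragraph is a design choice of this sketch; it makes it possible to point the user back to the source page when a search hit is returned.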

Why do I need each paragraph?

ChatGPT has a token limit. A token is roughly 4 characters of English text. GPT-3.5 supports 4,096 tokens per request; if we send more than that, we get an exception. Hence, instead of passing the whole document at once, we chunk it into smaller pieces and generate embeddings for each piece of plain text.
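The chunking idea can be sketched roughly as below. This is an illustration in Python, not the C# code of the solution, and it assumes the "about 4 characters per token" rule of thumb; a real tokenizer would give exact counts. The names `estimate_tokens` and `chunk_paragraphs` are hypothetical.

```python
def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Rough token estimate using the ~4 characters per token rule of thumb."""
    return -(-len(text) // chars_per_token)  # ceiling division


def chunk_paragraphs(paragraphs, max_tokens=4096):
    """Greedily pack paragraphs into chunks whose estimated size fits max_tokens."""
    chunks, current = [], []
    for paragraph in paragraphs:
        if estimate_tokens(paragraph) > max_tokens:
            raise ValueError("paragraph exceeds the token budget; split it further")
        # Estimate the size of the chunk as it would look with this paragraph added.
        candidate = "\n\n".join(current + [paragraph])
        if current and estimate_tokens(candidate) > max_tokens:
            chunks.append("\n\n".join(current))
            current = [paragraph]
        else:
            current.append(paragraph)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because the estimate is only approximate, it is safer to pass a `max_tokens` value somewhat below the real limit of the model being called.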

What are these embeddings?

Embeddings are a mathematical representation of complex data such as words, sentences, and objects as vectors in a lower-dimensional space. Put simply, the plain text is converted into a list of decimal values and stored. We use embeddings so that we can fetch data with semantic search.
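To see why a list of decimal values enables semantic search, consider cosine similarity, a common way to compare two embeddings. The sketch below is in Python and uses made-up 3-dimensional vectors; real embeddings have far more dimensions, and the values here are purely hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for the embeddings of three pieces of text (hypothetical values).
invoice = [0.9, 0.1, 0.0]
bill = [0.8, 0.2, 0.1]
weather = [0.0, 0.1, 0.9]

print(cosine_similarity(invoice, bill))     # high: related meaning
print(cosine_similarity(invoice, weather))  # low: unrelated meaning
```

Texts with related meaning end up with vectors that point in a similar direction, so a search for "invoice" can also surface a paragraph that only says "bill".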

Embeddings can be generated by calling the Azure OpenAI endpoint.
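A call to the embeddings endpoint could look roughly like the sketch below. It is in Python with only the standard library, and it assumes the REST route `/openai/deployments/{deployment}/embeddings`, the `api-key` header, and API version `2023-05-15`; check these against the current Azure OpenAI documentation. The resource endpoint, deployment name, and helper names are placeholders, not values from this solution.

```python
import json
import urllib.request


def build_embedding_request(endpoint, deployment, api_key, text,
                            api_version="2023-05-15"):
    """Build (but do not send) a POST request for the embeddings endpoint."""
    url = (f"{endpoint.rstrip('/')}/openai/deployments/{deployment}"
           f"/embeddings?api-version={api_version}")
    body = json.dumps({"input": text}).encode("utf-8")
    headers = {"Content-Type": "application/json", "api-key": api_key}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def parse_embedding_response(payload: bytes):
    """Pull the vector out of a response shaped like {"data": [{"embedding": [...]}]}."""
    return json.loads(payload)["data"][0]["embedding"]


def get_embedding(endpoint, deployment, api_key, text):
    """Send the request and return the embedding as a list of floats."""
    request = build_embedding_request(endpoint, deployment, api_key, text)
    with urllib.request.urlopen(request) as response:
        return parse_embedding_response(response.read())
```

Splitting request building from sending keeps the network call in one small function, so the rest can be checked without an Azure subscription.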

Step 4. We have the embeddings; now, where do we store them? Do SQL or Oracle support them? Not in a way that suits this use case: embeddings are high-dimensional vectors that we need to search by similarity, which ordinary SQL or Oracle tables are not designed for. Instead, we store them in what are called vector databases. Vector databases where these embeddings can be stored include Pinecone, Qdrant, Azure Cosmos DB for MongoDB vCore, MongoDB Atlas, SQLite, etc.

These steps make up the end-to-end solution approach: uploading a file to blob storage, triggering the Azure Function, generating embeddings, and storing them in a vector database.
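To give a feel for what a vector database does, here is a toy in-memory stand-in, written in Python. It is not the API of Pinecone, Qdrant, or any other product; the class `InMemoryVectorStore` and its `upsert` and `query` methods are hypothetical, and it simply scans every record, which real vector databases avoid by using specialized indexes.

```python
import math
from dataclasses import dataclass, field


def _cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


@dataclass
class InMemoryVectorStore:
    """Toy stand-in for a vector database: keeps records in a dict and scans them all."""
    records: dict = field(default_factory=dict)

    def upsert(self, record_id, vector, metadata):
        """Insert or overwrite a record made of an id, a vector, and metadata."""
        self.records[record_id] = (list(vector), dict(metadata))

    def query(self, vector, top_k=3):
        """Return up to top_k (id, score, metadata) tuples, most similar first."""
        scored = [
            (_cosine(vector, stored), record_id, metadata)
            for record_id, (stored, metadata) in self.records.items()
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return [(rid, score, meta) for score, rid, meta in scored[:top_k]]
```

A usage example under the same assumptions: store one record per paragraph with the file name as metadata, then embed the user's question and pass that vector to `query` to get the closest paragraphs.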

 

Reference