LangChain v0.2 Basic Tutorial: Building Vector Stores and Retrievers

yuyutoo 2024-10-12 00:49

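In this lesson you will learn about LangChain's vector store and retriever abstractions. They exist to bring data from (vector) databases and other sources into LLM workflows, and they matter for any application that fetches data for the model to reason over at inference time. This technique is usually called retrieval-augmented generation, or RAG.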

Concepts

This chapter focuses on retrieving text data. You should be familiar with the following concepts:

Documents
Vector stores
Retrievers

Jupyter Notebook

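This tutorial (like most of the others) uses Jupyter notebooks and assumes that you do too. Jupyter notebooks are well suited to learning how to work with LLM systems and to prototyping, because things often go wrong while you learn or build (unexpected output, an API that is down, and so on). Working step by step in an interactive environment lets you debug quickly and learn as you go.

Installing and configuring Jupyter Notebook is not covered here; please refer to its own documentation.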

LangSmith

LangSmith installation and usage were covered earlier in this series and are not repeated here.

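Documents

LangChain provides a Document class that represents a unit of text together with its metadata. It has two attributes:

page_content: a string holding the text of the document;
metadata: a dict holding arbitrary metadata.

The metadata attribute records information such as the source of the document and its relationship to other documents or other information. Note that a single Document object often represents one chunk of a larger document.

Let's create some sample documents: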

from langchain_core.documents import Document

documents = [
    Document(
        page_content="Dogs are great companions, known for their loyalty and friendliness.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Goldfish are popular pets for beginners, requiring relatively simple care.",
        metadata={"source": "fish-pets-doc"},
    ),
    Document(
        page_content="Parrots are intelligent birds capable of mimicking human speech.",
        metadata={"source": "bird-pets-doc"},
    ),
    Document(
        page_content="Rabbits are social animals that need plenty of space to hop around.",
        metadata={"source": "mammal-pets-doc"},
    ),
]

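The code above creates five documents. Each document's metadata has a single "source" key, and across the five documents there are three distinct sources: mammal-pets-doc, fish-pets-doc, and bird-pets-doc.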
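To see how the source metadata is distributed, the sketch below counts documents per source. It is only an illustration: Doc is a hypothetical dataclass that mimics the two attributes of Document so the snippet runs without LangChain installed.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in with the same two attributes as a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


docs = [
    Doc("Dogs are great companions, known for their loyalty and friendliness.", {"source": "mammal-pets-doc"}),
    Doc("Cats are independent pets that often enjoy their own space.", {"source": "mammal-pets-doc"}),
    Doc("Goldfish are popular pets for beginners, requiring relatively simple care.", {"source": "fish-pets-doc"}),
    Doc("Parrots are intelligent birds capable of mimicking human speech.", {"source": "bird-pets-doc"}),
    Doc("Rabbits are social animals that need plenty of space to hop around.", {"source": "mammal-pets-doc"}),
]

# Count how many documents share each value of the "source" metadata key.
source_counts = Counter(doc.metadata["source"] for doc in docs)
print(source_counts)
```

The same counting works on real Document objects, since it only reads the metadata attribute.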

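Vector stores

Vector search is a common way to store and search over unstructured data (such as unstructured text). The idea is to store a numeric vector associated with each text. Given a query, we embed it as a vector in the same vector space and use a vector similarity metric (that is, a distance between vectors) to find related data in the store.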
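The ranking step can be illustrated with plain Python. The sketch below uses cosine similarity, one common similarity metric, on hand-made three-dimensional vectors; the vectors and labels are assumptions chosen for illustration and are not output from any embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made 3-dimensional "embeddings" (illustrative values only, not model output).
stored = {
    "cat text": [0.9, 0.1, 0.0],
    "dog text": [0.6, 0.4, 0.0],
    "fish text": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.0, 0.0]  # pretend this is the embedded query

# Rank the stored texts from most to least similar to the query.
ranked = sorted(
    stored,
    key=lambda name: cosine_similarity(query_vector, stored[name]),
    reverse=True,
)
print(ranked)
```

Real vector stores typically use approximate nearest-neighbor indexes rather than a full sort, but the ordering idea is the same.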

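A LangChain VectorStore object exposes methods for adding text and Document objects to the store and for querying them with various similarity metrics. It is usually initialized with an embedding model, which determines how text is translated into numeric vectors in a shared semantic space.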
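To make the shape of such an object concrete, here is a toy in-memory store. ToyVectorStore, embed, and cosine are hypothetical names written for this sketch; the method names only loosely mirror LangChain's, and the word-count "embedding" is a crude stand-in that matches exact words rather than meaning.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: word counts. A real embedding model returns dense vectors."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Hypothetical in-memory store, initialized with an embedding function."""

    def __init__(self, embedding_function):
        self.embedding_function = embedding_function
        self.entries = []  # list of (text, vector) pairs

    def add_texts(self, texts):
        for text in texts:
            self.entries.append((text, self.embedding_function(text)))

    def similarity_search(self, query, k=4):
        query_vector = self.embedding_function(query)
        scored = sorted(self.entries, key=lambda e: cosine(query_vector, e[1]), reverse=True)
        return [text for text, _ in scored[:k]]


store = ToyVectorStore(embed)
store.add_texts([
    "Cats are independent pets that often enjoy their own space.",
    "Goldfish are popular pets for beginners, requiring relatively simple care.",
    "Parrots are intelligent birds capable of mimicking human speech.",
])
top = store.similarity_search("pets for beginners", k=1)
print(top)
```

Swapping embed for a real embedding model is what lets a store match on meaning instead of shared words.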

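LangChain integrates with a large number of vector stores. Some are hosted by a provider (for example, many cloud providers) and require credentials to use; some (such as Postgres) run in separate infrastructure that can be run locally or through a third party; others can run in memory for lightweight workloads. This lesson uses Chroma through the langchain_chroma integration package, which can run in memory.

Before instantiating a vector store, we usually need to choose an embedding model. Here we use OpenAI's embedding model.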

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma.from_documents(
    documents,
    embedding=OpenAIEmbeddings(),
)

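Calling .from_documents adds the documents to the vector store. VectorStore also has methods for adding documents that can be called after the store object has been instantiated. In practice you will often connect to an existing vector store instead, in which case you supply a client, an index name, or other connection information.

Once we have instantiated a vector store that contains documents, we can query it. VectorStore includes methods for querying:

Synchronously and asynchronously;
By string query and by vector;
With and without returning similarity scores;
By similarity and by maximum marginal relevance (MMR, a compromise that balances similarity to the query against diversity among the retrieved results).

The query methods generally return a list of Document objects.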
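The difference between plain similarity ranking and maximum marginal relevance can be sketched in a few lines. This is a minimal greedy version written for illustration, not the implementation used by any particular vector store; the vectors, the names "a", "a_dup", and "b", and the lambda_mult value are all assumptions.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query, candidates, k=2, lambda_mult=0.5):
    """Greedy maximal marginal relevance over a dict of name -> vector."""
    selected = []
    remaining = dict(candidates)
    while remaining and len(selected) < k:
        def score(name):
            relevance = cosine(query, remaining[name])
            # Redundancy: similarity to the closest already-selected item.
            redundancy = max(
                (cosine(remaining[name], candidates[s]) for s in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected


# Illustrative vectors: "a" and "a_dup" are near-duplicates, "b" is different.
vectors = {
    "a": [1.0, 0.0],
    "a_dup": [0.99, 0.05],
    "b": [0.6, 0.8],
}
query = [1.0, 0.1]

by_similarity = sorted(vectors, key=lambda n: cosine(query, vectors[n]), reverse=True)[:2]
by_mmr = mmr(query, vectors, k=2, lambda_mult=0.5)
print(by_similarity, by_mmr)
```

Plain similarity picks both near-duplicates, while MMR keeps one of them and then prefers the less redundant "b".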

Similarity search examples

A similarity search takes a text string and returns documents. For example, querying "cat":

vectorstore.similarity_search("cat")

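The search returns the 4 most similar of the 5 documents (the default number of results is 4), ordered from most to least similar: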

[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'}),
Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'source': 'mammal-pets-doc'}),
Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'source': 'mammal-pets-doc'}),
Document(page_content='Parrots are intelligent birds capable of mimicking human speech.', metadata={'source': 'bird-pets-doc'})]

Async query for "cat":

await vectorstore.asimilarity_search("cat")

[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'}),
Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'source': 'mammal-pets-doc'}),
Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'source': 'mammal-pets-doc'}),
Document(page_content='Parrots are intelligent birds capable of mimicking human speech.', metadata={'source': 'bird-pets-doc'})]

To also return scores, use similarity_search_with_score. Here is the same "cat" query with scores:

# Note that providers implement different scores; Chroma here
# returns a distance metric that should vary inversely with
# similarity.

vectorstore.similarity_search_with_score("cat")

The score is the second element of each tuple:

[(Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'}),
0.3751849830150604),
(Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'source': 'mammal-pets-doc'}),
0.48316916823387146),
(Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'source': 'mammal-pets-doc'}),
0.49601367115974426),
(Document(page_content='Parrots are intelligent birds capable of mimicking human speech.', metadata={'source': 'bird-pets-doc'}),
0.4972994923591614)]
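Because the scores here are distances, smaller means more similar. The sketch below reuses the four scores from the output above (with the text shortened) to show how to pick the best match and apply a cutoff; max_distance is a hypothetical value chosen only for illustration.

```python
# (text, distance) pairs copied from the output above, with the text shortened.
results = [
    ("Cats are independent pets...", 0.3751849830150604),
    ("Dogs are great companions...", 0.48316916823387146),
    ("Rabbits are social animals...", 0.49601367115974426),
    ("Parrots are intelligent birds...", 0.4972994923591614),
]

# With a distance metric, the best match has the smallest score.
best_text, best_distance = min(results, key=lambda pair: pair[1])

# The list is already sorted by ascending distance (most similar first).
distances = [distance for _, distance in results]
is_ascending = distances == sorted(distances)

# Hypothetical cutoff, chosen only for illustration.
max_distance = 0.49
close_matches = [text for text, distance in results if distance <= max_distance]
print(best_text, is_ascending, close_matches)
```

Other providers may return a similarity where larger is better, so check the direction of the score before applying a cutoff.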

Return documents based on similarity to an embedded query vector:

embedding = OpenAIEmbeddings().embed_query("cat")

vectorstore.similarity_search_by_vector(embedding)

[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'}),
Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'source': 'mammal-pets-doc'}),
Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'source': 'mammal-pets-doc'}),
Document(page_content='Parrots are intelligent birds capable of mimicking human speech.', metadata={'source': 'bird-pets-doc'})]

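Retrievers

LangChain VectorStore objects do not implement the Runnable protocol, so they cannot be plugged directly into chains written in LangChain Expression Language (LCEL).

Retrievers, by contrast, are Runnables.

We can create a simple Runnable retriever ourselves by wrapping a function in RunnableLambda, without subclassing a retriever class. That lets us build a retriever on the fly. Below, we wrap the similarity_search method and bind k=1 so that it returns only the top result.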

from typing import List

from langchain_core.documents import Document
from langchain_core.runnables import RunnableLambda

retriever = RunnableLambda(vectorstore.similarity_search).bind(k=1)  # select top result

retriever.batch(["cat", "shark"])

The retriever we built supports the Runnable interface, so we can call it on several inputs at once with .batch.

The batch query above returns the following:

[[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'})],
[Document(page_content='Goldfish are popular pets for beginners, requiring relatively simple care.', metadata={'source': 'fish-pets-doc'})]]
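As a rough mental model of what binding an argument and batching do, the following plain-Python analogy may help. It is not how RunnableLambda is implemented; search, batch, and the corpus are hypothetical, and a real .batch call may also run inputs concurrently.

```python
from functools import partial


def search(query, k=4):
    """Hypothetical stand-in for a search method: returns up to k matching strings."""
    corpus = ["cat facts", "cat care", "shark facts", "dog facts"]
    # Toy matching: keep entries that contain the query string.
    return [entry for entry in corpus if query in entry][:k]


# Roughly what .bind(k=1) does: fix a keyword argument ahead of time.
top_one = partial(search, k=1)


def batch(func, inputs):
    """Roughly what .batch([...]) does: apply the callable to each input, keeping order."""
    return [func(item) for item in inputs]


batched = batch(top_one, ["cat", "shark"])
print(batched)
```

As in the output above, the result is one list per input, in the same order as the inputs.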

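Vector stores also provide an as_retriever method that builds a dedicated retriever, specifically a VectorStoreRetriever. Such a retriever takes search_type and search_kwargs settings that specify which method of the underlying vector store to call and how to parameterize it. Let's rebuild the retriever this way: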

retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1},
)

retriever.batch(["cat", "shark"])

The result is the same:

[[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'})],
[Document(page_content='Goldfish are popular pets for beginners, requiring relatively simple care.', metadata={'source': 'fish-pets-doc'})]]

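VectorStoreRetriever uses the "similarity" search type by default; the other supported types are "mmr" (maximum marginal relevance, described above) and "similarity_score_threshold". The last one filters out results whose similarity score falls below the threshold.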
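The thresholding idea itself is simple and can be sketched without LangChain. In the snippet below, filter_by_threshold is a hypothetical helper and the relevance scores are made-up values in [0, 1] where higher means more relevant.

```python
def filter_by_threshold(scored_docs, score_threshold):
    """Keep (text, relevance) pairs whose relevance is at least score_threshold."""
    return [(text, score) for text, score in scored_docs if score >= score_threshold]


# Illustrative relevance scores in [0, 1], higher meaning more relevant (made-up values).
scored_docs = [
    ("Cats are independent pets...", 0.82),
    ("Dogs are great companions...", 0.55),
    ("Goldfish are popular pets...", 0.31),
]

kept = filter_by_threshold(scored_docs, score_threshold=0.5)
print(kept)
```

Note that this assumes a relevance score, not a distance; with a distance metric the comparison would point the other way.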

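Retrievers can be incorporated into more complex applications, such as RAG applications, which take the user's question, first use a retriever to fetch context relevant to it, and then combine the question and the context into a prompt for the LLM. Here is a minimal example.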

pip install -qU langchain-openai

Import the model to use:

import getpass
import os

os.environ["OPENAI_API_KEY"] = getpass.getpass()

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo-0125")

Define the prompt and build the RAG chain:

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

message = """
Answer this question using the provided context only.

{question}

Context:
{context}
"""

prompt = ChatPromptTemplate.from_messages([("human", message)])

rag_chain = {"context": retriever, "question": RunnablePassthrough()} | prompt | llm

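{"context": retriever, "question": RunnablePassthrough()} takes the user's original input and passes it to both entries: the documents that the retriever returns fill context, and the input itself, unchanged, fills question.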
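The fan-out performed by that dictionary can be mimicked with ordinary functions. This is only an analogy for how the inputs reach the prompt, not LangChain's implementation; retriever, passthrough, and run_parallel are hypothetical stand-ins, and the retriever here returns a fixed string instead of searching anything.

```python
def retriever(question):
    """Hypothetical retriever: always returns one fixed context string."""
    return ["Cats are independent pets that often enjoy their own space."]


def passthrough(value):
    """Return the input unchanged, in the spirit of RunnablePassthrough."""
    return value


def run_parallel(steps, user_input):
    """Call every step with the same input and collect results under the same keys."""
    return {key: step(user_input) for key, step in steps.items()}


template = (
    "Answer this question using the provided context only.\n\n"
    "{question}\n\n"
    "Context:\n{context}\n"
)

inputs = run_parallel({"context": retriever, "question": passthrough}, "tell me about cats")
prompt_text = template.format(**inputs)
print(prompt_text)
```

The resulting dictionary has exactly the two keys that the prompt template expects, which is why it can be piped straight into the prompt.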

Invoke the chain with "tell me about cats":

response = rag_chain.invoke("tell me about cats")

print(response.content)

The response is about cats:

Cats are independent pets that often enjoy their own space.

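Retrieval strategies can be rich and complex. For example:

Infer hard rules and filters from the user's query;
Return documents that are linked to the retrieved text in some way (for example, documents of the same type);
Generate multiple embeddings for each unit of text;
Merge results from multiple retrievers;
Assign weights to documents, for example weighting recent documents higher.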
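As one example of merging results from multiple retrievers, the sketch below uses reciprocal rank fusion, a common way to combine ranked lists. It is an illustration under assumptions, not a description of what any LangChain retriever does internally; the document ids, the two result lists, and the constant c=60 are hypothetical choices.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, c=60):
    """Merge ranked lists: each document earns 1 / (c + rank) per list it appears in."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical outputs of two different retrievers for the same query.
keyword_results = ["doc_a", "doc_b", "doc_c"]
vector_results = ["doc_b", "doc_d", "doc_a"]

merged = reciprocal_rank_fusion([keyword_results, vector_results])
print(merged)
```

Documents that rank well in both lists rise to the top, while documents found by only one retriever are kept but ranked lower.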

The retrievers section of the how-to guides covers more retrieval strategies.

You can also extend the BaseRetriever class directly to implement a custom retriever.
