LangChain: Length-Based Text Splitters

Large language models have token limits that must not be exceeded, so when splitting text into chunks you need to count tokens. Many tokenizers exist; when counting the tokens in a text, use the tokenizer that matches your language model.
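As a rough illustration (standard library only; the two heuristics below are illustrative stand-ins, not real tokenizers), different counting schemes can disagree substantially, which is why the matching tokenizer matters:

```python
# Two crude stand-ins for tokenizers -- real counts must come from the
# model's own tokenizer (e.g. tiktoken for OpenAI models).
text = "LangChain splits long documents into model-sized chunks."

by_words = len(text.split())  # word-level approximation
by_chars = len(text) // 4     # the common "~4 characters per token" rule of thumb

print(by_words, by_chars)  # 7 14
```

The two estimates differ by a factor of two here; only the tokenizer the model actually uses gives the number the API will enforce.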

Document loaders read documents of various types and extract their content as text. Document splitting is therefore fundamentally text splitting, refined further by the text's structure and semantics.

For LangChain's built-in text splitter integrations, see LangChain Docs > Text splitter integrations.

Splitting by Token

Splitting by token - Text splitter integration guide

tiktoken

tiktoken is a fast BPE tokenizer created by OpenAI. You can use tiktoken to estimate token counts; for OpenAI models this is likely more accurate than other tokenizers.

  • How the text is split: by the characters passed in.
  • How the chunk size is measured: by the tiktoken tokenizer.

CharacterTextSplitter, RecursiveCharacterTextSplitter, and TokenTextSplitter can all be used with tiktoken directly.

```shell
pip install --upgrade --quiet langchain-text-splitters tiktoken
```

Example:

```python
from langchain_text_splitters import CharacterTextSplitter

# This is a long document we can split up.
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()
```

To split with a CharacterTextSplitter and then merge chunks with tiktoken, use its .from_tiktoken_encoder() method. Note that splits from this method can be larger than the chunk size as measured by the tiktoken tokenizer.

The .from_tiktoken_encoder() method takes either encoding_name (e.g. cl100k_base) or model_name (e.g. gpt-4) as an argument. All other arguments, such as chunk_size, chunk_overlap, and separators, are used to instantiate the CharacterTextSplitter.

```python
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base", chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)

print(texts[0])
```
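The merge step that a tokenizer-based length function drives can be pictured in plain Python. This is a toy model of the idea, not LangChain's implementation; `toy_token_len` and `merge_splits` are illustrative names:

```python
def toy_token_len(text: str) -> int:
    # Assumption: roughly one token per whitespace-separated word.
    return len(text.split())

def merge_splits(splits, chunk_size, length_function):
    """Greedily merge small splits until adding one more would exceed chunk_size."""
    chunks, current = [], []
    for piece in splits:
        if current and length_function(" ".join(current + [piece])) > chunk_size:
            chunks.append(" ".join(current))
            current = []
        current.append(piece)
    if current:
        chunks.append(" ".join(current))
    return chunks

paragraphs = ["one two three", "four five", "six seven eight nine", "ten"]
chunks = merge_splits(paragraphs, chunk_size=5, length_function=toy_token_len)
print(chunks)
# ['one two three four five', 'six seven eight nine ten']
```

Note that a single piece longer than chunk_size would be emitted as-is, which is exactly why splits produced this way can exceed the measured chunk size.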

To enforce a hard constraint on chunk size, we can use RecursiveCharacterTextSplitter.from_tiktoken_encoder, where any split that is still too large is split again recursively:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    model_name="gpt-4",
    chunk_size=100,
    chunk_overlap=0,
)
```
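The hard-cap behavior can be sketched as follows. This is a toy, character-based model of the recursive idea; the separator list and `recursive_split` are illustrative, not LangChain's actual implementation:

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Try coarse separators first; recurse with finer ones while pieces are too big."""
    if len(text) <= chunk_size:
        return [text]
    sep, *rest = separators
    if sep == "":
        # Last resort: hard-cut by characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    pieces = [p for p in text.split(sep) if p]
    out = []
    for piece in pieces:
        if len(piece) <= chunk_size:
            out.append(piece)
        else:
            out.extend(recursive_split(piece, chunk_size, tuple(rest)))
    return out

chunks = recursive_split("aaaa bb cccccc\n\ndd", chunk_size=4)
print(chunks)
# ['aaaa', 'bb', 'cccc', 'cc', 'dd']
```

Because the last-resort separator is the empty string, every chunk is guaranteed to fit under the cap, at the cost of occasionally cutting mid-word.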

We can also load a TokenTextSplitter, which works with tiktoken directly and ensures each split is smaller than the chunk size.

```python
from langchain_text_splitters import TokenTextSplitter

text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)

texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
```

**Note:** In some written languages (e.g. Chinese and Japanese), a single character can encode to two or more tokens. Using TokenTextSplitter directly can split one character's tokens across two chunks, producing malformed Unicode. Use RecursiveCharacterTextSplitter.from_tiktoken_encoder or CharacterTextSplitter.from_tiktoken_encoder to ensure each chunk contains valid Unicode strings.
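This caveat is easy to demonstrate at the byte level with the standard library: one CJK character spans several UTF-8 bytes, so any cut that ignores character boundaries can yield invalid text:

```python
char = "语"
data = char.encode("utf-8")
print(len(data))  # 3 -- one character, three bytes

# Cutting after the first byte leaves an incomplete UTF-8 sequence.
try:
    data[:1].decode("utf-8")
    valid = True
except UnicodeDecodeError:
    valid = False
print("decodable after byte-level cut:", valid)
```

Token-level cuts have the same failure mode when one character maps to multiple tokens, which is what the .from_tiktoken_encoder variants guard against.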

spaCy

spaCy is an open-source software library for advanced natural language processing, written in Python and Cython.

LangChain implements a text splitter based on the spaCy tokenizer.

  • How the text is split: by the spaCy tokenizer.

  • How the chunk size is measured: by number of characters.

Install the dependency:

```shell
pip install --upgrade --quiet spacy
```

Usage example:

```python
from langchain_text_splitters import SpacyTextSplitter

# This is a long document we can split up.
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()

text_splitter = SpacyTextSplitter(chunk_size=1000)

texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
```

SentenceTransformers

SentenceTransformersTokenTextSplitter is a specialized text splitter for sentence-transformer models. By default it splits text into chunks that fit the token window of the sentence-transformer model you intend to use.

To split text and cap token counts according to a sentence-transformers tokenizer, instantiate a SentenceTransformersTokenTextSplitter. You can optionally specify:

  • chunk_overlap: integer count of overlapping tokens between chunks.
  • model_name: the sentence-transformer model name, defaulting to sentence-transformers/all-mpnet-base-v2.
  • tokens_per_chunk: the desired number of tokens per chunk.

Example:

```python
from langchain_text_splitters import SentenceTransformersTokenTextSplitter

splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0)
text = "Lorem "

count_start_and_stop_tokens = 2
text_token_count = splitter.count_tokens(text=text) - count_start_and_stop_tokens
print(text_token_count)
```

```
2
```

```python
token_multiplier = splitter.maximum_tokens_per_chunk // text_token_count + 1

# `text_to_split` does not fit in a single chunk
text_to_split = text * token_multiplier

print(f"tokens in text to split: {splitter.count_tokens(text=text_to_split)}")
```

```
tokens in text to split: 514
```

```python
text_chunks = splitter.split_text(text=text_to_split)

print(text_chunks[1])
```

```
lorem
```

NLTK

The Natural Language Toolkit (NLTK) is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) of English, written in Python.

Rather than just splitting on "\n\n", we can use NLTK to split based on NLTK tokenizers.

  • How the text is split: by the NLTK tokenizer.

  • How the chunk size is measured: by number of characters.

Install the dependency:

```shell
pip install nltk
```
```python
# This is a long document we can split up.
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()

from langchain_text_splitters import NLTKTextSplitter

# NLTK's sentence tokenizer data may be required, e.g. nltk.download("punkt")
text_splitter = NLTKTextSplitter(chunk_size=1000)

texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
```
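The sentence-then-merge idea behind splitters like this can be sketched with the standard library. A naive regex stands in for NLTK's sentence tokenizer here, and `sentence_chunks` is an illustrative helper, not a LangChain API:

```python
import re

def sentence_chunks(text, chunk_size):
    """Split into sentences, then greedily pack sentences into chunks by character count."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        candidate = (current + " " + s).strip()
        if current and len(candidate) > chunk_size:
            chunks.append(current)
            current = s
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

print(sentence_chunks("One fish. Two fish. Red fish. Blue fish.", chunk_size=20))
# ['One fish. Two fish.', 'Red fish. Blue fish.']
```

Splitting on sentence boundaries keeps each chunk semantically coherent, which is the main advantage over raw character cuts.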

KoNLPy

KoNLPy ("Korean NLP in Python") is a Python package for natural language processing of the Korean language. LangChain provides a KonlpyTextSplitter built on it.

Hugging Face tokenizer

Hugging Face provides many tokenizers. Here we use Hugging Face's GPT2TokenizerFast and measure length in tokens.

  • How the text is split: by the characters passed in.

  • How the chunk size is measured: by the number of tokens computed by the Hugging Face tokenizer.

Usage example:

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
```

```python
# This is a long document we can split up.
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()

from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)

print(texts[0])
```

Splitting by Character

Character-based splitting is the simplest method of text splitting. It splits the text on a given character sequence (default: "\n\n"), and chunk length is measured in characters.
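Chunk size and overlap can be pictured as a sliding window over characters. This is a toy illustration only; CharacterTextSplitter actually splits on the separator and merges pieces rather than sliding a fixed window:

```python
def window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    # Each step advances by chunk_size - chunk_overlap, so consecutive
    # chunks share chunk_overlap characters.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

print(window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Overlap trades some redundancy for context continuity: a sentence cut at one chunk's end reappears at the start of the next.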

  • How the text is split: by the characters passed in.

  • How the chunk size is measured: by number of characters.

Available methods:

  • .split_text: returns plain string chunks.
  • .create_documents: returns LangChain Document objects, useful when you need to preserve metadata for downstream tasks.
```shell
pip install -qU langchain-text-splitters
```
```python
from langchain_text_splitters import CharacterTextSplitter

# Load an example document
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False,
)
texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])
```
```
page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.'
```

Use .create_documents to propagate the metadata associated with each document into the output chunks:

```python
metadatas = [{"document": 1}, {"document": 2}]
documents = text_splitter.create_documents(
    [state_of_the_union, state_of_the_union], metadatas=metadatas
)
print(documents[0])
```
```
page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.' metadata={'document': 1}
```

Use .split_text to get the string content directly:

```python
text_splitter.split_text(state_of_the_union)[0]
```
```
'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.'
```
Author: 光星

Published: 2026-02-20

Updated: 2026-02-24