LangChain: Splitting Text Based on Document Structure

Some documents have an inherent structure, such as HTML, Markdown, or JSON files. In these cases it is often beneficial to split based on that structure, since the structure tends to group semantically related text together naturally.

Splitting Based on Document Structure

The key advantages of structure-based splitting include:

  • It preserves the document's logical organization
  • It maintains contextual cohesion within each chunk
  • It is more effective for downstream tasks such as retrieval or summarization

Examples of structure-based splitting:

  • Markdown: split on heading levels (e.g. #, ##, ###)
  • HTML: split on tags
  • JSON: split on object or array elements
  • Code: split on functions, classes, or logical blocks
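As a concrete illustration of the Markdown case, here is a minimal pure-Python sketch of the idea (a simplified illustration, not LangChain's implementation): walk the lines, track the current header path, and group body text under the headers it falls beneath.

```python
# Minimal sketch of header-based Markdown splitting (illustration only,
# not LangChain's implementation).

def split_by_headers(md: str) -> list[dict]:
    chunks: list[dict] = []
    headers: dict[str, str] = {}   # current header path, e.g. {"h1": "Foo"}
    body: list[str] = []

    def flush() -> None:
        if body:
            chunks.append({"content": "\n".join(body), "metadata": dict(headers)})
            body.clear()

    for line in md.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            flush()  # a new header closes the previous chunk
            level = len(stripped) - len(stripped.lstrip("#"))
            headers[f"h{level}"] = stripped.lstrip("#").strip()
            # a new section invalidates any deeper header levels
            for k in [k for k in headers if int(k[1:]) > level]:
                del headers[k]
        elif stripped:
            body.append(stripped)
    flush()
    return chunks

md = "# Foo\n\n## Bar\n\nHi this is Jim\n\n## Baz\n\nHi this is Molly"
for chunk in split_by_headers(md):
    print(chunk)
```

Each chunk carries the header path it belongs to as metadata, which is the property that makes structure-based splitting useful for retrieval.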

Splitting Markdown

Many chat and Q&A applications chunk input documents before embedding them and writing them to a vector store.

These notes from Pinecone provide some useful advice:

When a whole paragraph or document is embedded, the embedding process considers the overall context as well as the relationships between the sentences and phrases within the text. This can yield a more comprehensive vector representation that captures the broader meaning and themes of the text.

As mentioned above, chunking usually aims to keep text with a shared context together. With that in mind, we may want to follow the document's own structure specifically. For example, a Markdown file is organized by headers, so creating chunks within particular header groups is an intuitive idea.

To address this need, we can use MarkdownHeaderTextSplitter, which splits a Markdown file on a specified set of headers.
For example, if we want to split the following Markdown document:

md = '# Foo\n\n ## Bar\n\nHi this is Jim  \nHi this is Joe\n\n ## Baz\n\n Hi this is Molly'

We can specify the headers to split on:

[("#", "Header 1"),("##", "Header 2")]

Content is grouped, or split, by its shared headers:

{'content': 'Hi this is Jim  \nHi this is Joe', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar'}}
{'content': 'Hi this is Molly', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Baz'}}

Usage example:

pip install -qU langchain-text-splitters
from langchain_text_splitters import MarkdownHeaderTextSplitter

markdown_document = "# Foo\n\n ## Bar\n\nHi this is Jim\n\nHi this is Joe\n\n ### Boo \n\n Hi this is Lance \n\n ## Baz\n\n Hi this is Molly"

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_document)
md_header_splits
[Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}, page_content='Hi this is Jim  \nHi this is Joe'),
Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}, page_content='Hi this is Lance'),
Document(metadata={'Header 1': 'Foo', 'Header 2': 'Baz'}, page_content='Hi this is Molly')]
type(md_header_splits[0])
langchain_core.documents.base.Document

By default, MarkdownHeaderTextSplitter removes the headers being split on from the content of the output chunks. This can be disabled by setting strip_headers=False.

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on, strip_headers=False)
md_header_splits = markdown_splitter.split_text(markdown_document)
md_header_splits
[Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}, page_content='# Foo  \n## Bar  \nHi this is Jim  \nHi this is Joe'),
Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}, page_content='### Boo \nHi this is Lance'),
Document(metadata={'Header 1': 'Foo', 'Header 2': 'Baz'}, page_content='## Baz \nHi this is Molly')]

Note: the default MarkdownHeaderTextSplitter strips whitespace and newlines. To preserve a Markdown document's original formatting, use ExperimentalMarkdownSyntaxTextSplitter.

How to return Markdown lines as separate documents:
By default, MarkdownHeaderTextSplitter aggregates lines according to the headers specified in headers_to_split_on. We can disable this by specifying return_each_line:

markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on,
    return_each_line=True,
)
md_header_splits = markdown_splitter.split_text(markdown_document)
md_header_splits
[Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}, page_content='Hi this is Jim'),
Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}, page_content='Hi this is Joe'),
Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}, page_content='Hi this is Lance'),
Document(metadata={'Header 1': 'Foo', 'Header 2': 'Baz'}, page_content='Hi this is Molly')]

Note that the header information here is retained in the metadata of each document.

How to constrain chunk size
Within each Markdown group we can then apply any text splitter we want, such as RecursiveCharacterTextSplitter, which allows further control over chunk size.

markdown_document = "# Intro \n\n    ## History \n\n Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9] \n\n Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files. \n\n ## Rise and divergence \n\n As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for \n\n additional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks. \n\n #### Standardization \n\n From 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort. \n\n ## Implementations \n\n Implementations of Markdown are available for over a dozen programming languages."

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]

# MD splits
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on, strip_headers=False
)
md_header_splits = markdown_splitter.split_text(markdown_document)

# Char-level splits
from langchain_text_splitters import RecursiveCharacterTextSplitter

chunk_size = 250
chunk_overlap = 30
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)

# Split
splits = text_splitter.split_documents(md_header_splits)
splits
[Document(metadata={'Header 1': 'Intro', 'Header 2': 'History'}, page_content='# Intro  \n## History  \nMarkdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9]'),
Document(metadata={'Header 1': 'Intro', 'Header 2': 'History'}, page_content='Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files.'),
Document(metadata={'Header 1': 'Intro', 'Header 2': 'Rise and divergence'}, page_content='## Rise and divergence \nAs Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for \nadditional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks.'),
Document(metadata={'Header 1': 'Intro', 'Header 2': 'Rise and divergence'}, page_content='#### Standardization \nFrom 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort.'),
Document(metadata={'Header 1': 'Intro', 'Header 2': 'Implementations'}, page_content='## Implementations \nImplementations of Markdown are available for over a dozen programming languages.')]

Troubleshooting: the chunk-overlap parameter seems to have no effect

  • After header-based splitting (e.g. with MarkdownHeaderTextSplitter), use split_documents(docs) rather than split_text, so that the overlap is applied within each section and each section's metadata (the headers) is kept on the chunks.
  • Overlap only appears when a single section exceeds chunk_size and is split into multiple chunks.
  • Overlap does not cross section/document boundaries (e.g. # H1 → ## H2).
  • If a header by itself ends up as a tiny first chunk, consider setting strip_headers to True so that the header line does not become a standalone chunk.
  • If your text lacks newlines/spaces, keep a fallback "" (empty string) among the separators so the splitter can still split and apply the overlap.
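The overlap semantics described above can be made concrete with a pure-Python sliding-window sketch (an illustration only, not LangChain's code): overlap can only appear between consecutive chunks cut from the same section.

```python
# Sliding-window chunking sketch: overlap only materializes when a
# section is longer than chunk_size and must be cut into several chunks.

def chunk_section(text: str, chunk_size: int, overlap: int) -> list[str]:
    if len(text) <= chunk_size:
        return [text]  # a single chunk has nothing to overlap with
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

print(chunk_section("abc", chunk_size=6, overlap=2))       # short section: no overlap
print(chunk_section("abcdefgh", chunk_size=6, overlap=2))  # long section: "ef" repeats
```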

Splitting JSON

This JSON splitter splits JSON data while allowing control over chunk sizes. It traverses the JSON data depth-first and builds smaller JSON chunks. It tries to keep nested JSON objects whole, but will split them if needed to keep chunks between the minimum and maximum chunk size.

If a value is not nested JSON but a very large string, that string will not be split. If you need a hard cap on chunk size, consider composing this splitter with a recursive text splitter on those chunks. There is also an optional preprocessing step for splitting lists, which first converts them to JSON (dict) form before splitting.

  • How the text is split: on JSON values.
  • How chunk size is measured: by number of characters.
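The depth-first behavior described above can be sketched in plain Python (a simplified illustration of the traversal, not RecursiveJsonSplitter's actual implementation): keep a sub-object whole if its serialized form fits, otherwise recurse into its keys, re-nesting each piece under its original path so the context is preserved.

```python
import json

# Sketch of recursive JSON splitting (illustration only): emit a
# sub-object whole when it serializes under max_chunk_size, otherwise
# descend depth-first, re-nesting each piece under its original path.

def nest(path: tuple, value):
    for key in reversed(path):
        value = {key: value}
    return value

def split_json(data, max_chunk_size: int, path: tuple = ()):
    text = json.dumps(nest(path, data))
    if len(text) <= max_chunk_size or not isinstance(data, dict):
        return [nest(path, data)]  # fits, or cannot be split further
    chunks = []
    for key, value in data.items():
        chunks.extend(split_json(value, max_chunk_size, path + (key,)))
    return chunks

doc = {"info": {"title": "LangSmith"}, "paths": {"/a": {"get": "..."}, "/b": {"post": "..."}}}
for chunk in split_json(doc, max_chunk_size=40):
    print(json.dumps(chunk))
```

Because each chunk is re-nested under its full key path, even a deeply nested fragment remains a valid, self-describing JSON object.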
pip install -qU langchain-text-splitters

First, load the JSON data:

import json

import requests

# This is a large nested json object and will be loaded as a python dict
json_data = requests.get("https://api.smith.langchain.com/openapi.json").json()

Basic usage

from langchain_text_splitters import RecursiveJsonSplitter

splitter = RecursiveJsonSplitter(max_chunk_size=300)

To get JSON chunks, use the .split_json method:

# Recursively split json data - If you need to access/manipulate the smaller json chunks
json_chunks = splitter.split_json(json_data=json_data)

for chunk in json_chunks[:3]:
    print(chunk)
{'openapi': '3.1.0', 'info': {'title': 'LangSmith', 'version': '0.1.0'}, 'servers': [{'url': 'https://api.smith.langchain.com', 'description': 'LangSmith API endpoint.'}]}
{'paths': {'/api/v1/sessions/{session_id}': {'get': {'tags': ['tracer-sessions'], 'summary': 'Read Tracer Session', 'description': 'Get a specific session.', 'operationId': 'read_tracer_session_api_v1_sessions__session_id__get'}}}}
{'paths': {'/api/v1/sessions/{session_id}': {'get': {'security': [{'API Key': []}, {'Tenant ID': []}, {'Bearer Auth': []}]}}}}

To get LangChain Document objects, use the .create_documents method:

# The splitter can also output documents
docs = splitter.create_documents(texts=[json_data])

for doc in docs[:3]:
    print(doc)
page_content='{"openapi": "3.1.0", "info": {"title": "LangSmith", "version": "0.1.0"}, "servers": [{"url": "https://api.smith.langchain.com", "description": "LangSmith API endpoint."}]}'
page_content='{"paths": {"/api/v1/sessions/{session_id}": {"get": {"tags": ["tracer-sessions"], "summary": "Read Tracer Session", "description": "Get a specific session.", "operationId": "read_tracer_session_api_v1_sessions__session_id__get"}}}}'
page_content='{"paths": {"/api/v1/sessions/{session_id}": {"get": {"security": [{"API Key": []}, {"Tenant ID": []}, {"Bearer Auth": []}]}}}}'

Or use the .split_text method to get string content directly:

texts = splitter.split_text(json_data=json_data)

print(texts[0])
print(texts[1])
{"openapi": "3.1.0", "info": {"title": "LangSmith", "version": "0.1.0"}, "servers": [{"url": "https://api.smith.langchain.com", "description": "LangSmith API endpoint."}]}
{"paths": {"/api/v1/sessions/{session_id}": {"get": {"tags": ["tracer-sessions"], "summary": "Read Tracer Session", "description": "Get a specific session.", "operationId": "read_tracer_session_api_v1_sessions__session_id__get"}}}}

How to manage chunk size for list content:

Note that one of the chunks in this example is larger than the specified max_chunk_size of 300. Looking at one of these larger chunks, we can see a list object inside:

print([len(text) for text in texts][:10])
print()
print(texts[3])
[171, 231, 126, 469, 210, 213, 237, 271, 191, 232]

{"paths": {"/api/v1/sessions/{session_id}": {"get": {"parameters": [{"name": "session_id", "in": "path", "required": true, "schema": {"type": "string", "format": "uuid", "title": "Session Id"}}, {"name": "include_stats", "in": "query", "required": false, "schema": {"type": "boolean", "default": false, "title": "Include Stats"}}, {"name": "accept", "in": "header", "required": false, "schema": {"anyOf": [{"type": "string"}, {"type": "null"}], "title": "Accept"}}]}}}}

By default, the JSON splitter does not split lists.

Specify convert_lists=True to preprocess the JSON, converting list content into dicts with index:item as key:value pairs:
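The idea behind this convert_lists preprocessing can be sketched as follows (an illustration of the concept, not the library's code):

```python
# Sketch of the convert_lists preprocessing: recursively replace every
# list with a dict keyed by element index, so list items can then be
# split like ordinary key/value pairs without losing their position.

def convert_lists(obj):
    if isinstance(obj, list):
        return {str(i): convert_lists(v) for i, v in enumerate(obj)}
    if isinstance(obj, dict):
        return {k: convert_lists(v) for k, v in obj.items()}
    return obj

print(convert_lists({"tags": ["tracer-sessions", "runs"]}))
# -> {'tags': {'0': 'tracer-sessions', '1': 'runs'}}
```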

texts = splitter.split_text(json_data=json_data, convert_lists=True)

Let's look at the chunk sizes. Now they are all under the maximum:

print([len(text) for text in texts][:10])

[176, 236, 141, 203, 212, 221, 210, 213, 242, 291]

The list has been converted to a dict, yet all the needed contextual information is retained even when it is split into many chunks:

print(texts[1])
{"paths": {"/api/v1/sessions/{session_id}": {"get": {"tags": {"0": "tracer-sessions"}, "summary": "Read Tracer Session", "description": "Get a specific session.", "operationId": "read_tracer_session_api_v1_sessions__session_id__get"}}}}
# We can also look at the documents
docs[1]
Document(page_content='{"paths": {"/api/v1/sessions/{session_id}": {"get": {"tags": ["tracer-sessions"], "summary": "Read Tracer Session", "description": "Get a specific session.", "operationId": "read_tracer_session_api_v1_sessions__session_id__get"}}}}')


Author: 光星
Published: 2026-02-25
Updated: 2026-02-25
