[Document(metadata={'Header 1': 'Main Title'}, page_content='This is an introductory paragraph with some basic content.'), Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Section 1: Introduction'}, page_content='This section introduces the topic. Below is a list: \nFirst item Second item Third item with bold text and a link'), Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Section 1: Introduction', 'Header 3': 'Subsection 1.1: Details'}, page_content="This subsection provides additional details. Here's a table:"), Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Section 2: Media Content'}, page_content='This section contains an image and a video:'), Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Section 3: Code Example'}, page_content='This section contains a code block:'), Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Conclusion'}, page_content='This is the conclusion of the document.')]
html_splitter = HTMLHeaderTextSplitter(
    headers_to_split_on,
    return_each_element=True,
)
html_header_splits_elements = html_splitter.split_text(html_string)

for element in html_header_splits_elements[:3]:
    print(element)
Now each element is returned as a separate Document:
page_content='This is an introductory paragraph with some basic content.' metadata={'Header 1': 'Main Title'}
page_content='This section introduces the topic. Below is a list:' metadata={'Header 1': 'Main Title', 'Header 2': 'Section 1: Introduction'}
page_content='First item Second item Third item with bold text and a link' metadata={'Header 1': 'Main Title', 'Header 2': 'Section 1: Introduction'}
Compare this with the earlier behavior, where elements are aggregated by their headers:
page_content='This is an introductory paragraph with some basic content.' metadata={'Header 1': 'Main Title'}
page_content='This section introduces the topic. Below is a list:\nFirst item Second item Third item with bold text and a link' metadata={'Header 1': 'Main Title', 'Header 2': 'Section 1: Introduction'}
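The difference between the two outputs comes down to whether consecutive elements that share identical metadata are merged into one chunk. The sketch below models that step in plain Python. It is an illustration, not the library's implementation: the `(text, metadata)` tuple representation and the `aggregate_by_metadata` helper are assumptions made for this example.

```python
from itertools import groupby


def aggregate_by_metadata(elements):
    """Merge consecutive (text, metadata) pairs whose metadata dicts are equal."""
    merged = []
    for metadata, group in groupby(elements, key=lambda pair: pair[1]):
        # Join the texts of one run of equal-metadata elements with newlines.
        text = "\n".join(text for text, _ in group)
        merged.append((text, metadata))
    return merged


section_meta = {"Header 1": "Main Title", "Header 2": "Section 1: Introduction"}
elements = [
    ("This is an introductory paragraph with some basic content.",
     {"Header 1": "Main Title"}),
    ("This section introduces the topic. Below is a list:", section_meta),
    ("First item Second item Third item with bold text and a link", section_meta),
]

aggregated = aggregate_by_metadata(elements)
for text, metadata in aggregated:
    print(repr(text), metadata)
```

The three per-element entries collapse into two chunks, mirroring the aggregated output shown above.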
[Document(metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödel’s Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 The First Incompleteness Theorem'}, page_content='We see that Gödel first tried to reduce the consistency problem for analysis to that of arithmetic. This seemed to require a truth definition for arithmetic, which in turn led to paradoxes, such as the Liar paradox (“This sentence is false”) and Berry’s paradox (“The least number not defined by an expression consisting of just fourteen English words”). Gödel then noticed that such paradoxes would not necessarily arise if truth were replaced by provability. But this means that arithmetic truth'), Document(metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödel’s Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 The First Incompleteness Theorem'}, page_content='means that arithmetic truth and arithmetic provability are not co-extensive — whence the First Incompleteness Theorem.'), Document(metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödel’s Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 The First Incompleteness Theorem'}, page_content='This account of Gödel’s discovery was told to Hao Wang very much after the fact; but in Gödel’s contemporary correspondence with Bernays and Zermelo, essentially the same description of his path to the theorems is given. (See Gödel 2003a and Gödel 2003b respectively.) From those accounts we see that the undefinability of truth in arithmetic, a result credited to Tarski, was likely obtained in some form by Gödel by 1931. But he neither publicized nor published the result; the biases logicians'), Document(metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödel’s Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 The First Incompleteness Theorem'}, page_content='result; the biases logicians had expressed at the time concerning the notion of truth, biases which came vehemently to the fore when Tarski announced his results on the undefinability of truth in formal systems 1935, may have served as a deterrent to Gödel’s publication of that theorem.'), Document(metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödel’s Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.2 The proof of the First Incompleteness Theorem'}, page_content='We now describe the proof of the two theorems, formulating Gödel’s results in Peano arithmetic. Gödel himself used a system related to that defined in Principia Mathematica, but containing Peano arithmetic. In our presentation of the First and Second Incompleteness Theorems we refer to Peano arithmetic as P, following Gödel’s notation.')]
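Limitations
Structure can vary considerably from one HTML document to another. HTMLHeaderTextSplitter tries to attach every "relevant" header to any given chunk, but it can sometimes miss certain headers. For example, the algorithm assumes an information hierarchy in which headers always sit at nodes "above" their associated text, that is, at prior siblings, ancestors, or combinations of the two. In the following news article (as of the writing of this document), the text of the top-level headline, although tagged "h1", lives in a different subtree from the text elements we would expect it to be "above". As a result, the "h1" element and its associated text do not appear in the chunk metadata (although, where applicable, we do see "h2" and its associated text):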
No two El Niño winters are the same, but many have temperature and precipitation trends in common. Average conditions during an El Niño winter across the continental US. One of the major reasons is the position of the jet stream, which often shifts south during an El Niño winter. This shift typically brings wetter and cooler weather to the South while the North becomes drier and warmer, according to NOAA. Because the jet stream is essentially a river of air that storms flow through, they c
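The "headers sit above the text" assumption described in this section can be modeled with a small standard-library sketch. It is a deliberately simplified model, not HTMLHeaderTextSplitter's actual algorithm: the `headers_above` helper, the `HEADER_TAGS` set, and the sample markup are all hypothetical. It only looks for headers among the prior siblings of a text node and of each of its ancestors, so a headline nested inside a separate subtree is missed.

```python
import xml.etree.ElementTree as ET

HEADER_TAGS = {"h1", "h2", "h3"}


def headers_above(root, target):
    """Collect headers that are prior siblings of `target` or of its ancestors."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    found = {}
    node = target
    while node in parent_of:
        parent = parent_of[node]
        siblings = list(parent)
        position = siblings.index(node)
        # Walk backwards so the nearest prior header of each level wins.
        for sibling in reversed(siblings[:position]):
            if sibling.tag in HEADER_TAGS:
                found.setdefault(sibling.tag, sibling.text)
        node = parent
    return found


# The <h1> sits inside <header><div>, a different subtree from the body text.
doc = ET.fromstring(
    "<body>"
    "<header><div><h1>Top headline</h1></div></header>"
    "<main><h2>Section</h2><p>Body text</p></main>"
    "</body>"
)
paragraph = doc.find(".//p")
found_headers = headers_above(doc, paragraph)
print(found_headers)
```

In this toy document the paragraph picks up the "h2" header but not the "h1", which is the same kind of gap the news article exhibits.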
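HTMLSectionSplitter Example
Similar in concept to HTMLHeaderTextSplitter, HTMLSectionSplitter is a "structure-aware" text splitter that splits text at the element level and adds metadata for each header "relevant" to a given chunk. It lets you split HTML by section. It can return chunks element by element or combine elements that share the same metadata, with the goals of (a) keeping related text grouped (more or less) semantically and (b) preserving the context-rich information encoded in the document structure. Use xslt_path to provide an absolute path to an XSLT file that transforms the HTML so that sections can be detected based on the provided tags. By default, the converting_to_header.xslt file in the data_connection/document_transformers directory is used. Its purpose is to convert the HTML into a format/layout in which sections are easier to detect. For example, a span can be converted to a header tag based on its font size so that it is detected as a section.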
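To make the span-to-header idea concrete, here is a rough sketch in plain Python. The library performs this step with an XSLT stylesheet; the regular expression, the `promote_large_spans` helper, and the pixel thresholds below are assumptions chosen only to illustrate the transformation, not values taken from converting_to_header.xslt.

```python
import re

# Matches a span whose inline style declares a font size in pixels.
SPAN_RE = re.compile(
    r'<span\s+style="[^"]*font-size:\s*(\d+)px[^"]*"\s*>(.*?)</span>',
    re.IGNORECASE | re.DOTALL,
)


def promote_large_spans(html, h1_min_px=24, h2_min_px=18):
    """Rewrite large-font spans as header tags (thresholds are hypothetical)."""

    def replace(match):
        size = int(match.group(1))
        text = match.group(2)
        if size >= h1_min_px:
            return f"<h1>{text}</h1>"
        if size >= h2_min_px:
            return f"<h2>{text}</h2>"
        return match.group(0)  # Small spans are left untouched.

    return SPAN_RE.sub(replace, html)


sample = (
    '<span style="font-size: 26px">Title</span>'
    '<span style="font-size: 12px">body</span>'
)
converted = promote_large_spans(sample)
print(converted)
```

After a conversion like this, a header-based splitter can treat the former span as a section boundary.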
How to split an HTML string
from langchain_text_splitters import HTMLSectionSplitter
[Document(metadata={'Header 1': 'Main Title'}, page_content='Main Title \n This is an introductory paragraph with some basic content.'), Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content="Section 1: Introduction \n This section introduces the topic. Below is a list: \n \n First item \n Second item \n Third item with bold text and a link \n \n \n Subsection 1.1: Details \n This subsection provides additional details. Here's a table: \n \n \n \n Header 1 \n Header 2 \n Header 3 \n \n \n \n \n Row 1, Cell 1 \n Row 1, Cell 2 \n Row 1, Cell 3 \n \n \n Row 2, Cell 1 \n Row 2, Cell 2 \n Row 2, Cell 3"), Document(metadata={'Header 2': 'Section 2: Media Content'}, page_content='Section 2: Media Content \n This section contains an image and a video: \n \n \n Your browser does not support the video tag.'), Document(metadata={'Header 2': 'Section 3: Code Example'}, page_content='Section 3: Code Example \n This section contains a code block: \n \n <div>\n <p>This is a paragraph inside a div.</p>\n </div>'), Document(metadata={'Header 2': 'Conclusion'}, page_content='Conclusion \n This is the conclusion of the document.')]
[Document(metadata={'Header 1': 'Main Title'}, page_content='This is an introductory paragraph with some basic content.'), Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='This section introduces the topic'), Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='. Below is a list: First item Second item Third item with bold text and a link Subsection 1.1: Details This subsection provides additional details'), Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content=". Here's a table: Header 1 Header 2 Header 3 Row 1, Cell 1 Row 1, Cell 2 Row 1, Cell 3 Row 2, Cell 1 Row 2, Cell 2 Row 2, Cell 3"), Document(metadata={'Header 2': 'Section 2: Media Content'}, page_content='This section contains an image and a video:  '), Document(metadata={'Header 2': 'Section 3: Code Example'}, page_content='This section contains a code block: <code:html> <div> <p>This is a paragraph inside a div.</p> </div> </code>'), Document(metadata={'Header 2': 'Conclusion'}, page_content='This is the conclusion of the document.')]
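Preserving tables and lists
In this example, we demonstrate how HTMLSemanticPreservingSplitter preserves a table and a large list in an HTML document. The chunk size is set to 50 characters to illustrate how the splitter ensures that these elements are not split, even when they exceed the defined maximum chunk size.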
from langchain_text_splitters import HTMLSemanticPreservingSplitter
html_string = """
<!DOCTYPE html>
<html>
    <body>
        <div>
            <h1>Section 1</h1>
            <p>This section contains an important table and list that should not be split across chunks.</p>
            <table>
                <tr>
                    <th>Item</th>
                    <th>Quantity</th>
                    <th>Price</th>
                </tr>
                <tr>
                    <td>Apples</td>
                    <td>10</td>
                    <td>$1.00</td>
                </tr>
                <tr>
                    <td>Oranges</td>
                    <td>5</td>
                    <td>$0.50</td>
                </tr>
                <tr>
                    <td>Bananas</td>
                    <td>50</td>
                    <td>$1.50</td>
                </tr>
            </table>
            <h2>Subsection 1.1</h2>
            <p>Additional text in subsection 1.1 that is separated from the table and list.</p>
            <p>Here is a detailed list:</p>
            <ul>
                <li>Item 1: Description of item 1, which is quite detailed and important.</li>
                <li>Item 2: Description of item 2, which also contains significant information.</li>
                <li>Item 3: Description of item 3, another item that we don't want to split across chunks.</li>
            </ul>
        </div>
    </body>
</html>
"""
[Document(metadata={'Header 1': 'Section 1'}, page_content='This section contains an important table and list'), Document(metadata={'Header 1': 'Section 1'}, page_content='that should not be split across chunks.'), Document(metadata={'Header 1': 'Section 1'}, page_content='Item Quantity Price Apples 10 $1.00 Oranges 5 $0.50 Bananas 50 $1.50'), Document(metadata={'Header 2': 'Subsection 1.1'}, page_content='Additional text in subsection 1.1 that is'), Document(metadata={'Header 2': 'Subsection 1.1'}, page_content='separated from the table and list. Here is a'), Document(metadata={'Header 2': 'Subsection 1.1'}, page_content="detailed list: Item 1: Description of item 1, which is quite detailed and important. Item 2: Description of item 2, which also contains significant information. Item 3: Description of item 3, another item that we don't want to split across chunks.")]
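In the output above, ordinary paragraphs are cut to fit within 50 characters while the table and the list come back whole. The following sketch reproduces that behavior in plain Python under simplifying assumptions: segments are given as `(text, preserve)` pairs, splitting happens on whitespace, and `chunk_segments` is a hypothetical helper rather than part of the library.

```python
def chunk_segments(segments, max_chunk_size):
    """Split plain text to max_chunk_size, but never split preserved segments."""
    chunks = []
    for text, preserve in segments:
        if preserve or len(text) <= max_chunk_size:
            # Preserved elements (tables, lists) are emitted whole, even if oversized.
            chunks.append(text)
            continue
        current = ""
        for word in text.split():
            candidate = f"{current} {word}".strip()
            if len(candidate) > max_chunk_size and current:
                chunks.append(current)
                current = word
            else:
                current = candidate
        if current:
            chunks.append(current)
    return chunks


segments = [
    ("This section contains an important table and list that should not be "
     "split across chunks.", False),
    ("Item Quantity Price Apples 10 $1.00 Oranges 5 $0.50 Bananas 50 $1.50", True),
]
chunks = chunk_segments(segments, max_chunk_size=50)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

The paragraph splits into two pieces of at most 50 characters, while the table text stays in a single chunk longer than the limit.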
html_string = """
<!DOCTYPE html>
<html>
    <body>
        <div>
            <h1>Section with Iframe</h1>
            <iframe src="https://example.com/embed"></iframe>
            <p>Some text after the iframe.</p>
            <ul>
                <li>Item 1: Description of item 1, which is quite detailed and important.</li>
                <li>Item 2: Description of item 2, which also contains significant information.</li>
                <li>Item 3: Description of item 3, another item that we don't want to split across chunks.</li>
            </ul>
        </div>
    </body>
</html>
"""
[Document(metadata={'Header 1': 'Section with Iframe'}, page_content='[iframe:https://example.com/embed](https://example.com/embed) Some text after the iframe'), Document(metadata={'Header 1': 'Section with Iframe'}, page_content=". Item 1: Description of item 1, which is quite detailed and important. Item 2: Description of item 2, which also contains significant information. Item 3: Description of item 3, another item that we don't want to split across chunks.")]
html_string = """
<!DOCTYPE html>
<html>
    <body>
        <div>
            <h1>Section with Image and Link</h1>
            <p>
                <img src="https://example.com/image.jpg" alt="An example image" />
                Some text after the image.
            </p>
            <ul>
                <li>Item 1: Description of item 1, which is quite detailed and important.</li>
                <li>Item 2: Description of item 2, which also contains significant information.</li>
                <li>Item 3: Description of item 3, another item that we don't want to split across chunks.</li>
            </ul>
        </div>
    </body>
</html>
"""
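def custom_image_handler(img_tag) -> str:
    img_src = img_tag.get("src", "")
    img_alt = img_tag.get("alt", "No alt text provided")
    # Placeholder: this listing is cut off before the agent call; the sample output shows this literal.
    semantic_meaning = "semantic-meaning"
    return f"[Image Alt Text: {img_alt} | Image Source: {img_src} | Image Semantic Meaning: {semantic_meaning}]"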
[Document(metadata={'Header 1': 'Section with Image and Link'}, page_content='[Image Alt Text: An example image | Image Source: https://example.com/image.jpg | Image Semantic Meaning: semantic-meaning] Some text after the image'), Document(metadata={'Header 1': 'Section with Image and Link'}, page_content=". Item 1: Description of item 1, which is quite detailed and important. Item 2: Description of item 2, which also contains significant information. Item 3: Description of item 3, another item that we don't want to split across chunks.")]
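Explanation
By writing a custom handler that extracts specific fields from the <img> element in the HTML, we can process the data further with our agent and insert the result directly into our chunk. It is important to make sure preserve_images is set to False; otherwise the default handling of <img> fields takes place.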
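The handler mechanism can be pictured as a lookup from tag name to a function whose return value replaces the element in the extracted text. The sketch below shows that dispatch using only the standard library. It is an assumption-laden illustration, not the splitter's implementation: `HandlerParser`, `render`, and `image_handler` are hypothetical names, and the handler receives a plain dict of attributes here, whereas the handler in this document receives a tag object.

```python
from html.parser import HTMLParser


class HandlerParser(HTMLParser):
    """Collects text, replacing tags that have a registered handler."""

    def __init__(self, handlers):
        super().__init__()
        self.handlers = handlers
        self.parts = []

    def handle_starttag(self, tag, attrs):
        handler = self.handlers.get(tag)
        if handler is not None:
            # The handler's return value is inserted in place of the element.
            self.parts.append(handler(dict(attrs)))

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.parts.append(text)


def render(html, handlers):
    parser = HandlerParser(handlers)
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def image_handler(img_attrs):
    src = img_attrs.get("src", "")
    alt = img_attrs.get("alt", "No alt text provided")
    return f"[Image Alt Text: {alt} | Image Source: {src}]"


html = (
    '<p><img src="https://example.com/image.jpg" alt="An example image" /> '
    "Some text after the image.</p>"
)
with_handler = render(html, {"img": image_handler})
without_handler = render(html, {})
print(with_handler)
print(without_handler)
```

With the handler registered, the image becomes a bracketed description inside the text; without it, this toy renderer drops the image entirely.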