md = '# Foo\n\n ## Bar\n\nHi this is Jim \nHi this is Joe\n\n ## Baz\n\n Hi this is Molly'
我们可以指定要分割的头部:
1
[("#","Header 1"),("##","Header 2")]
内容按共有的标题进行分组或划分:
1 2
{'content': 'Hi this is Jim \nHi this is Joe', 'metadata':{'Header 1': 'Foo', 'Header 2': 'Bar'}} {'content': 'Hi this is Molly', 'metadata':{'Header 1': 'Foo', 'Header 2': 'Baz'}}
使用示例:
1
pip install -qU langchain-text-splitters
1 2 3 4 5 6 7 8 9 10 11 12 13
from langchain_text_splitters import MarkdownHeaderTextSplitter
markdown_document = "# Foo\n\n ## Bar\n\nHi this is Jim\n\nHi this is Joe\n\n ### Boo \n\n Hi this is Lance \n\n ## Baz\n\n Hi this is Molly"
[Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}, page_content='Hi this is Jim \nHi this is Joe'), Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}, page_content='Hi this is Lance'), Document(metadata={'Header 1': 'Foo', 'Header 2': 'Baz'}, page_content='Hi this is Molly')]
[Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}, page_content='# Foo \n## Bar \nHi this is Jim \nHi this is Joe'), Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}, page_content='### Boo \nHi this is Lance'), Document(metadata={'Header 1': 'Foo', 'Header 2': 'Baz'}, page_content='## Baz \nHi this is Molly')]
markdown_document = "# Intro \n\n ## History \n\n Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9] \n\n Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files. \n\n ## Rise and divergence \n\n As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for \n\n additional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks. \n\n #### Standardization \n\n From 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort. \n\n ## Implementations \n\n Implementations of Markdown are available for over a dozen programming languages."
[Document(metadata={'Header 1': 'Intro', 'Header 2': 'History'}, page_content='# Intro \n## History \nMarkdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9]'), Document(metadata={'Header 1': 'Intro', 'Header 2': 'History'}, page_content='Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files.'), Document(metadata={'Header 1': 'Intro', 'Header 2': 'Rise and divergence'}, page_content='## Rise and divergence \nAs Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for \nadditional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks.'), Document(metadata={'Header 1': 'Intro', 'Header 2': 'Rise and divergence'}, page_content='#### Standardization \nFrom 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort.'), Document(metadata={'Header 1': 'Intro', 'Header 2': 'Implementations'}, page_content='## Implementations \nImplementations of Markdown are available for over a dozen programming languages.')]
# This is a large nested json object and will be loaded as a python dict json_data = requests.get("https://api.smith.langchain.com/openapi.json").json()
基本使用:
1 2 3
from langchain_text_splitters import RecursiveJsonSplitter
{"openapi":"3.1.0","info":{"title":"LangSmith","version":"0.1.0"},"servers":[{"url":"https://api.smith.langchain.com","description":"LangSmith API endpoint."}]} {"paths":{"/api/v1/sessions/{session_id}":{"get":{"tags":["tracer-sessions"],"summary":"Read Tracer Session","description":"Get a specific session.","operationId":"read_tracer_session_api_v1_sessions__session_id__get"}}}}
{"paths":{"/api/v1/sessions/{session_id}":{"get":{"tags":{"0":"tracer-sessions"},"summary":"Read Tracer Session","description":"Get a specific session.","operationId":"read_tracer_session_api_v1_sessions__session_id__get"}}}}
1 2
# We can also look at the documents docs[1]
1
Document(page_content='{"paths": {"/api/v1/sessions/{session_id}": {"get": {"tags": ["tracer-sessions"], "summary": "Read Tracer Session", "description": "Get a specific session.", "operationId": "read_tracer_session_api_v1_sessions__session_id__get"}}}}')