Converting HTML to a Markdown file with Python

Converting Markdown to HTML with Python is the more common task. Today we look at a library that goes the other way and converts HTML to Markdown.

html2text

Installation

1. Using pip

pip install html2text  # use pip3 for Python 3

2. Installing from source. If you are using Python 3, use python3 instead of python in the commands below.

git clone --depth 1 https://github.com/Alir3z4/html2text.git
cd html2text
python setup.py build
python setup.py install
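
To confirm that either installation method worked, a small check like the following can be run. It is only a sketch that uses the standard library; the helper name is_installed is made up for this example and is not part of html2text.

```python
import importlib.util
from importlib import metadata


def is_installed(module_name):
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("html2text"):
    try:
        # Assumes the distribution is registered under the name "html2text".
        print("html2text version:", metadata.version("html2text"))
    except metadata.PackageNotFoundError:
        print("html2text is importable, but no package metadata was found")
else:
    print("html2text is not installed")
```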

Usage

import html2text

html = "<p><strong>hello </strong> https://litets.com </p>"
md = html2text.html2text(html)
print(md)

Output

**hello** https://litets.com

Advanced usage

Ignoring links (the a tag)

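import html2text

# Sample input containing an <a> tag (assumed here; chosen to match the output shown below)
html = '<p><strong>hello </strong> https://litets.com </p><p><a href="https://litets.com">link</a></p>'

text_maker = html2text.HTML2Text()
text_maker.ignore_links = True    # keep only the link text, drop the target
text_maker.bypass_tables = False  # render tables as Markdown rather than raw HTML
text = text_maker.handle(html)
print(text)

Output

**hello** https://litets.com

link

With ignore_links = False, the output is:

**hello** https://litets.com

[link](https://litets.com)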

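As we can see, with ignore_links enabled only the link text is extracted, while with it disabled the link is rendered in Markdown link syntax.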

Other options

  • UNICODE_SNOB for using unicode
  • ESCAPE_SNOB for escaping every special character
  • LINKS_EACH_PARAGRAPH for putting links after every paragraph
  • BODY_WIDTH for wrapping long lines
  • SKIP_INTERNAL_LINKS to skip #local-anchor things
  • INLINE_LINKS for formatting images and links
  • PROTECT_LINKS protect from line breaks
  • GOOGLE_LIST_INDENT number of pixels to indent nested lists
  • IGNORE_ANCHORS
  • IGNORE_IMAGES
  • IMAGES_AS_HTML always generate HTML tags for images; preserves height, width, alt if possible.
  • IMAGES_TO_ALT
  • IMAGES_WITH_SIZE
  • IGNORE_EMPHASIS
  • BYPASS_TABLES format tables in HTML rather than Markdown
  • IGNORE_TABLES ignore table-related tags (table, th, td, tr) while keeping rows
  • SINGLE_LINE_BREAK to use a single line break rather than two
  • UNIFIABLE is a dictionary which maps unicode abbreviations to ASCII values
  • RE_SPACE for finding space-only lines
  • RE_ORDERED_LIST_MATCHER for matching ordered lists in MD
  • RE_UNORDERED_LIST_MATCHER for matching unordered lists in MD
  • RE_MD_CHARS_MATCHER for matching Md \,[,],( and )
  • RE_MD_CHARS_MATCHER_ALL for matching `,*,_,{,},[,],(,),#,!
  • RE_MD_DOT_MATCHER for matching lines starting with 1.
  • RE_MD_PLUS_MATCHER for matching lines starting with +
  • RE_MD_DASH_MATCHER for matching lines starting with (-)
  • RE_SLASH_CHARS a string of slash escapeable characters
  • RE_MD_BACKSLASH_MATCHER to match \char
  • USE_AUTOMATIC_LINKS to convert <a href='http://xyz'>http://xyz</a> to <http://xyz>
  • MARK_CODE to wrap 'pre' blocks with [code]...[/code] tags
  • WRAP_LINKS to decide if links have to be wrapped during text wrapping (implies INLINE_LINKS = False)
  • WRAP_LIST_ITEMS to decide if list items have to be wrapped during text wrapping
  • DECODE_ERRORS to handle decoding errors. 'strict', 'ignore', 'replace' are the acceptable values.
  • DEFAULT_IMAGE_ALT takes a string as value and is used whenever an image tag is missing an alt value. The default for this is an empty string '' to avoid backward breakage
  • OPEN_QUOTE is the character used to open a quote when replacing the <q> tag. It defaults to ".
  • CLOSE_QUOTE is the character used to close a quote when replacing the <q> tag. It defaults to ".

Notice: this is an original article, all rights reserved. Please credit the source when reposting: https://litets.com.