Welcome to w3lib’s documentation!
Overview
This is a Python library of web-related functions, such as:
remove comments or tags from HTML snippets
extract base url from HTML snippets
translate entities on HTML strings
convert raw HTTP headers to dicts and vice-versa
construct HTTP auth header
convert HTML pages to unicode
sanitize urls (like browsers do)
extract arguments from urls
The w3lib library is licensed under the BSD license.
Modules
- w3lib Package
Requirements
Python 3.9+
Install
pip install w3lib
Tests
pytest is the preferred way to run tests. Just run:
pytest
from the root directory to execute tests using the default Python interpreter.
tox can be used to run tests for all supported Python versions. Install it (using ‘pip install tox’) and then run tox from the root directory; tests will be executed for all available Python interpreters.
Changelog
2.2.1 (2024-06-12)
canonicalize_url() no longer applies lowercase to the userinfo URL component. (#229, #230)
2.2.0 (2024-06-05)
Dropped Python 3.7 support (#214).
Added Python 3.12 and PyPy 3.10 support (#218).
Added the description to the package metadata (#227).
Improved type hints (#226).
Added .readthedocs.yml (#219).
Updated the intersphinx URLs (#224).
Added the pre-commit configuration, code reformatted with black (#220).
Updated CI configuration (#217, #227).
2.1.2 (2023-08-03)
Fix test failures on Python 3.11.4+ (#212, #213).
Fix an incorrect type hint (#211).
Add project URLs to setup.py (#215).
2.1.1 (2022-12-09)
safe_url_string(), safe_download_url() and canonicalize_url() now strip whitespace and control characters from URLs according to the URL living standard.
2.1.0 (2022-11-28)
Dropped Python 3.6 support, and made Python 3.11 support official. (#195, #200)
safe_url_string() now generates safer URLs.
To make URLs safer for the URL living standard:
;= are percent-encoded in the URL username.
;:= are percent-encoded in the URL password.
' is percent-encoded in the URL query if the URL scheme is special.
To make URLs safer for RFC 2396 and RFC 3986, |[] are percent-encoded in URL paths, queries, and fragments.
(#80, #203)
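As a rough illustration of the percent-encoding rules above, the following sketch uses only the standard library's urllib.parse.quote(). It is not w3lib's implementation: encode_userinfo and encode_path are hypothetical helpers, and passing safe="" escapes more characters than safe_url_string() necessarily does.

```python
from urllib.parse import quote


def encode_userinfo(username: str, password: str) -> str:
    """Hypothetical helper: percent-encode credentials for use in a URL."""
    # safe="" makes quote() escape every reserved character,
    # which includes ";", ":" and "=".
    return f"{quote(username, safe='')}:{quote(password, safe='')}"


def encode_path(path: str) -> str:
    """Hypothetical helper: percent-encode a URL path, keeping "/" as is."""
    # "|", "[" and "]" are not in the safe set, so they are escaped.
    return quote(path, safe="/")


print(encode_userinfo("us;er=1", "p:w;d="))  # us%3Ber%3D1:p%3Aw%3Bd%3D
print(encode_path("/a|b[1]"))                # /a%7Cb%5B1%5D
```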
html_to_unicode() now checks for the byte order mark before inspecting the Content-Type header when determining the content encoding, in line with the URL living standard. (#189, #191)
canonicalize_url() now strips spaces from the input URL, to be more in line with the URL living standard. (#132, #136)
get_base_url() now ignores HTML comments. (#70, #77)
Fixed safe_url_string() re-encoding percent signs on the URL username and password even when they were being used as part of an escape sequence. (#187, #196)
Fixed basic_auth_header() using the wrong flavor of base64 encoding, which could prevent authentication in rare cases. (#181, #192)
Fixed replace_entities() raising OverflowError in some cases due to a bug in CPython. (#199, #202)
Improved typing and fixed typing issues. (#190, #206)
Made CI and test improvements. (#197, #198)
Adopted a Code of Conduct. (#194)
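The byte order mark check described for html_to_unicode() in this release can be pictured with the sketch below. It only shows the ordering (BOM first, Content-Type second) using the standard library; pick_encoding is a hypothetical name, the charset regular expression is a simplification, and none of this is taken from w3lib's code.

```python
import codecs
import re
from typing import Optional

# Longer BOMs come first: the UTF-32-LE BOM starts with the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF8, "utf-8"),
]


def pick_encoding(body: bytes, content_type: Optional[str] = None) -> Optional[str]:
    """Hypothetical helper: prefer a BOM over the Content-Type charset."""
    for bom, encoding in _BOMS:
        if body.startswith(bom):
            return encoding
    if content_type:
        # Simplified charset extraction, for illustration only.
        match = re.search(r"charset=([\w-]+)", content_type, re.IGNORECASE)
        if match:
            return match.group(1).lower()
    return None


# The BOM wins even though the header declares a different charset.
print(pick_encoding(codecs.BOM_UTF8 + b"<html>", "text/html; charset=latin-1"))
```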
2.0.1 (2022-08-11)
Minor documentation fix (release date is set in the changelog).
2.0.0 (2022-08-11)
Backwards incompatible changes:
Python 2 is no longer supported; Python 3.6+ is required now (#168, #175).
w3lib.url.safe_url_string() and w3lib.url.canonicalize_url() no longer convert “%23” to “#” when it appears in the URL path. This is a bug fix. It’s listed as a backward-incompatible change because in some cases the output of w3lib.url.canonicalize_url() is going to change, and so, if this output is used to generate URL fingerprints, new fingerprints might be incompatible with those created with the previous w3lib versions (#141).
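To see why decoding “%23” in a path counts as a bug, the standard-library sketch below parses both forms of a URL. It does not call w3lib; it only shows that the two strings identify different resources, assuming example.com as a placeholder host.

```python
from urllib.parse import urlsplit

encoded = urlsplit("http://example.com/a%23b")
decoded = urlsplit("http://example.com/a#b")

# With "%23" kept, the whole "/a%23b" is the path and there is no fragment.
print(encoded.path, repr(encoded.fragment))
# With a literal "#", the path is cut short and "b" becomes the fragment.
print(decoded.path, repr(decoded.fragment))
```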
Deprecation removals (#169):
The w3lib.form module is removed.
The w3lib.html.remove_entities function is removed.
The w3lib.url.urljoin_rfc function is removed.
The following functions are deprecated, and will be removed in future releases (#170):
w3lib.util.str_to_unicode
w3lib.util.unicode_to_str
w3lib.util.to_native_str
Other improvements and bug fixes:
Type annotations are added (#172, #184).
Added support for Python 3.9 and 3.10 (#168, #176).
Fixed w3lib.html.get_meta_refresh() for <meta> tags where http-equiv is written after content (#179).
Fixed w3lib.url.safe_url_string() for IDNA domains with ports (#174).
w3lib.url.url_query_cleaner() no longer adds an unneeded # when keep_fragments=True is passed, and the URL doesn’t have a fragment (#159).
Removed a workaround for an ancient pathname2url bug (#142)
CI is migrated to GitHub Actions (#166, #177); other CI improvements (#160, #182).
The code is formatted using black (#173).
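For the IDNA-with-port fix mentioned in this release, the sketch below shows one way to encode only the host part of a network location while leaving the port untouched. idna_encode_host is a hypothetical helper built on the standard library; it ignores userinfo and IPv6 literals and is not how w3lib does it.

```python
from urllib.parse import urlsplit, urlunsplit


def idna_encode_host(url: str) -> str:
    """Hypothetical helper: IDNA-encode the host, keep the port as is."""
    parts = urlsplit(url)
    # hostname excludes the port, so only the domain is IDNA-encoded.
    host = (parts.hostname or "").encode("idna").decode("ascii")
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit(parts._replace(netloc=netloc))


print(idna_encode_host("http://münchen.de:8080/path"))
```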
1.22.0 (2020-05-13)
Python 3.4 is no longer supported (issue #156)
w3lib.url.safe_url_string() now supports an optional quote_path parameter to disable the percent-encoding of the URL path (issue #119)
w3lib.url.add_or_replace_parameter() and w3lib.url.add_or_replace_parameters() no longer remove duplicate parameters from the original query string that are not being added or replaced (issue #126)
w3lib.html.remove_tags() now raises a ValueError exception instead of AssertionError when using both the which_ones and the keep parameters (issue #154)
Test improvements (issues #143, #146, #148, #149)
Documentation improvements (issues #140, #144, #145, #151, #152, #153)
Code cleanup (issue #139)
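The duplicate-parameter behaviour described for add_or_replace_parameter() in this release can be approximated with the standard library as follows. replace_param is a hypothetical helper, and how it treats repeated occurrences of the replaced name is an assumption of this sketch, not a statement about w3lib.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def replace_param(url: str, name: str, value: str) -> str:
    """Hypothetical helper: set one query parameter, keep all others."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    result, replaced = [], False
    for key, old_value in pairs:
        if key != name:
            # Unrelated parameters are kept, duplicates included.
            result.append((key, old_value))
        elif not replaced:
            result.append((key, value))
            replaced = True
    if not replaced:
        result.append((name, value))
    return urlunsplit(parts._replace(query=urlencode(result)))


print(replace_param("http://example.com/?a=1&a=2&b=3", "b", "4"))
# http://example.com/?a=1&a=2&b=4
```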
1.21.0 (2019-08-09)
Add the encoding and path_encoding parameters to w3lib.url.safe_download_url() (issue #118)
w3lib.url.safe_url_string() now also removes tabs and new lines (issue #133)
w3lib.html.remove_comments() now also removes truncated comments (issue #129)
w3lib.html.remove_tags_with_content() no longer removes tags which start with the same text as one of the specified tags (issue #114)
Recommend pytest instead of nose to run tests (issue #124)
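Removing truncated comments, as mentioned for remove_comments() in this release, can be sketched with a regular expression that accepts either a closing “-->” or the end of the input. strip_comments is a hypothetical name, and the pattern is an illustration rather than a copy of w3lib's code.

```python
import re

# A comment ends at "-->" or, if it was cut off, at the end of the input.
_COMMENT_RE = re.compile(r"<!--.*?(?:-->|$)", re.DOTALL)


def strip_comments(html: str) -> str:
    """Hypothetical helper: drop complete and truncated HTML comments."""
    return _COMMENT_RE.sub("", html)


print(strip_comments("a<!-- one -->b<!-- cut off"))  # ab
```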
1.20.0 (2019-01-11)
Fix url_query_cleaner to not append “?” to urls without a query string (issue #109)
Add support for Python 3.7 and drop Python 3.3 (issue #113)
Add w3lib.url.add_or_replace_parameters helper (issue #117)
Documentation fixes (issue #115)
1.19.0 (2018-01-25)
Add a workaround for CPython segfault (https://bugs.python.org/issue32583) which affects w3lib.encoding functions. This is technically backwards incompatible because it changes the way non-decodable bytes are replaced (in some cases instead of two \ufffd chars you can get one). As a side effect, the fix speeds up decoding in Python 3.4+.
Add ‘encoding’ parameter for w3lib.http.basic_auth_header.
Fix pypy testing setup, add pypy3 to CI.
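The role of the ‘encoding’ parameter added to w3lib.http.basic_auth_header in this release can be sketched as below: the credentials are turned into bytes with the chosen encoding before being base64-encoded. basic_auth_header_sketch is a hypothetical stand-in written with the standard library; the ISO-8859-1 default follows the 1.4 entry of this changelog, and the use of standard (not URL-safe) base64 follows the 2.1.0 entry.

```python
import base64


def basic_auth_header_sketch(
    username: str, password: str, encoding: str = "ISO-8859-1"
) -> bytes:
    """Hypothetical stand-in for building an HTTP Basic auth header value."""
    credentials = f"{username}:{password}".encode(encoding)
    # Standard base64 may emit "+" and "/"; the URL-safe flavor would
    # replace them with "-" and "_", which servers do not expect here.
    return b"Basic " + base64.b64encode(credentials)


print(basic_auth_header_sketch("a", "?"))  # b'Basic YTo/'
print(base64.urlsafe_b64encode(b"a:?"))    # b'YTo_'
```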
1.18.0 (2017-08-03)
Include additional assets used for distribution packages in the source tarball
Consider [ and ] as safe characters in path and query components of URLs, i.e. they are not escaped anymore
Disable codecov project coverage check
1.17.0 (2017-02-08)
Add Python 3.5 and 3.6 support
Add w3lib.url.parse_data_uri helper for parsing “data:” URIs
Add w3lib.html.strip_html5_whitespace function to strip leading and trailing whitespace as per W3C recommendations, e.g. for cleaning “href” attribute values
Fix w3lib.http.headers_raw_to_dict for multiple headers with same name
Do not distribute tests/test_*.pyc artifacts
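Handling several headers with the same name, as in the headers_raw_to_dict fix in this release, comes down to collecting values in a list per header name. The sketch below shows that idea with a hypothetical raw_headers_to_dict function; the exact return type and parsing rules of w3lib's function are not assumed here.

```python
def raw_headers_to_dict(raw: bytes) -> dict[bytes, list[bytes]]:
    """Hypothetical helper: group raw header lines by header name."""
    headers: dict[bytes, list[bytes]] = {}
    for line in raw.splitlines():
        if b":" not in line:
            continue  # skip blank or malformed lines
        name, _, value = line.partition(b":")
        # Repeated names append to the same list instead of overwriting.
        headers.setdefault(name.strip(), []).append(value.strip())
    return headers


raw = b"Set-Cookie: a=1\r\nSet-Cookie: b=2\r\nHost: example.com"
print(raw_headers_to_dict(raw))
```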
1.16.0 (2016-11-10)
canonicalize_url() and safe_url_string(): strip “:” when no port is specified (as per RFC 3986; see also https://github.com/scrapy/scrapy/issues/2377)
url_query_cleaner(): support new keep_fragments argument (defaulting to False)
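The “:” stripping described above for URLs without a port can be illustrated with the standard library. strip_empty_port is a hypothetical helper that only covers the trailing-colon case; it is a sketch of the idea, not w3lib's code.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_empty_port(url: str) -> str:
    """Hypothetical helper: drop a trailing ":" that has no port after it."""
    parts = urlsplit(url)
    netloc = parts.netloc
    if netloc.endswith(":"):
        netloc = netloc[:-1]
    return urlunsplit(parts._replace(netloc=netloc))


print(strip_empty_port("http://example.com:/path"))  # http://example.com/path
```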
1.15.0 (2016-07-29)
Add canonicalize_url() to w3lib.url
1.14.3 (2016-07-14)
Bugfix release:
Handle IDNA encoding failures in safe_url_string() (issue #62)
1.14.2 (2016-04-11)
Bugfix release:
fix function import for (deprecated) urljoin_rfc (issue #51)
only expose wanted functions from w3lib.url, via __all__ (see issue #54, https://github.com/scrapy/scrapy/issues/1917)
1.14.1 (2016-04-07)
Bugfix release:
For bytes URLs, when supplied encoding (or default UTF8) is wrong, safe_url_string falls back to percent-encoding offending bytes.
1.14.0 (2016-04-06)
Changes to safe_url_string:
proper handling of non-ASCII characters in Python2 and Python3
support IDNs
new path_encoding to override default UTF-8 when serializing non-ASCII characters before percent-encoding
html_body_declared_encoding also detects encoding when not sole attribute in <meta>.
Package is now properly marked as zip_safe.
1.13.0 (2015-11-05)
remove_tags removes uppercase tags as well;
ignore meta-redirects inside script or noscript tags by default, but add an option to not ignore them;
replace_entities now handles entities without trailing semicolon;
fixed uncaught UnicodeDecodeError when decoding entities.
1.12.0 (2015-06-29)
meta_refresh regex now handles leading newlines and whitespaces in the url;
include tests folder in source distribution.
1.11.0 (2015-01-13)
url_query_cleaner now supports str or list parameters;
add support for resolving base URLs in <base> tags with attributes before href.
1.10.0 (2014-08-20)
reverted all 1.9.0 changes.
1.9.0 (2014-08-16)
all url-related functions accept bytes and unicode and now return bytes.
1.8.1 (2014-08-14)
w3lib.http.basic_auth_header now returns bytes
1.8.0 (2014-07-31)
add support for big5-hkscs encoding.
1.7.1 (2014-07-26)
PY3 fixed headers_raw_to_dict and headers_dict_to_raw;
documentation improvements;
provide wheels.
1.6 (2014-06-03)
w3lib.form.encode_multipart is deprecated;
docstrings and docs are improved;
w3lib.url.add_or_replace_parameter is re-implemented on top of stdlib functions;
remove_entities is renamed to replace_entities.
1.5 (2013-11-09)
Python 2.6 support is dropped.
1.4 (2013-10-18)
Python 3 support;
get_meta_refresh encoding handling is fixed;
check for ‘?’ in add_or_replace_parameter;
ISO-8859-1 is used for HTTP Basic Auth;
fixed unicode handling in replace_escape_chars;
1.3 (2012-05-13)
support non-standard gb_2312_80 encoding;
drop Python 2.5 support.
1.2 (2012-05-02)
Detect encoding for content attr before http-equiv in meta tag.
1.1 (2012-04-18)
w3lib.html.remove_comments handles multiline comments;
Added w3lib.encoding module, containing functions for working with character encoding, like encoding autodetection from HTML pages.
w3lib.url.urljoin_rfc is deprecated.
1.0 (2011-04-17)
First release of w3lib.
History
The code of w3lib was originally part of the Scrapy framework but was later split out of Scrapy, with the aim of making it more reusable and providing a useful library of web functions without depending on Scrapy.