Welcome to w3lib’s documentation!

Overview

This is a Python library of web-related functions, such as:

  • remove comments or tags from HTML snippets
  • extract base url from HTML snippets
  • translate entities in HTML strings
  • convert raw HTTP headers to dicts and vice-versa
  • construct HTTP auth header
  • convert HTML pages to unicode
  • sanitize urls (like browsers do)
  • extract arguments from urls

The w3lib library is licensed under the BSD license.

Requirements

Python 2.7 or Python 3.3+

Install

pip install w3lib

Tests

nose is the preferred way to run tests. Run nosetests from the root directory to execute the tests with the default Python interpreter.

tox can be used to run the tests against all supported Python versions. Install it with ‘pip install tox’, then run tox from the root directory; the tests will be executed for every available Python interpreter.

Changelog

1.18.0 (2017-08-03)

  • Include additional assets used for distribution packages in the source tarball
  • Consider [ and ] as safe characters in path and query components of URLs, i.e. they are not escaped anymore
  • Disable codecov project coverage check
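The effect of treating [ and ] as safe characters can be illustrated with the standard library alone. This is a minimal sketch assuming Python 3; it uses urllib.parse.quote rather than w3lib itself, and the sample path is made up for illustration.

```python
from urllib.parse import quote

# Hypothetical sample path containing square brackets.
path = "/items[0]/tags[]"

# By default, quote() percent-encodes the brackets.
escaped = quote(path)

# Listing the brackets as safe characters leaves them untouched,
# which mirrors the behaviour described for 1.18.0.
kept = quote(path, safe="/[]")

print(escaped)  # /items%5B0%5D/tags%5B%5D
print(kept)     # /items[0]/tags[]
```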

1.17.0 (2017-02-08)

  • Add Python 3.5 and 3.6 support
  • Add w3lib.url.parse_data_uri helper for parsing “data:” URIs
  • Add w3lib.html.strip_html5_whitespace function to strip leading and trailing whitespace as per W3C recommendations, e.g. for cleaning “href” attribute values
  • Fix w3lib.http.headers_raw_to_dict for multiple headers with same name
  • Do not distribute tests/test_*.pyc artifacts
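To make the "multiple headers with same name" case concrete, here is a standard-library sketch of turning raw headers into a dict of lists. The helper name raw_headers_to_dict is hypothetical and this is not w3lib's implementation; the library's own function is w3lib.http.headers_raw_to_dict.

```python
def raw_headers_to_dict(raw):
    """Parse raw header bytes into {name: [value, ...]} (illustrative only)."""
    headers = {}
    for line in raw.splitlines():
        if b":" not in line:
            continue  # skip blank or malformed lines
        name, _, value = line.partition(b":")
        # Repeated header names accumulate instead of overwriting each other.
        headers.setdefault(name.strip(), []).append(value.strip())
    return headers


raw = b"Set-Cookie: a=1\r\nSet-Cookie: b=2\r\nContent-Type: text/html"
parsed = raw_headers_to_dict(raw)
print(parsed[b"Set-Cookie"])  # [b'a=1', b'b=2']
```

Keeping a list per header name is what lets both Set-Cookie values survive the conversion.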

1.16.0 (2016-11-10)

1.15.0 (2016-07-29)

  • Add canonicalize_url() to w3lib.url

1.14.3 (2016-07-14)

Bugfix release:

  • Handle IDNA encoding failures in safe_url_string() (issue #62)

1.14.2 (2016-04-11)

Bugfix release:

1.14.1 (2016-04-07)

Bugfix release:

  • For bytes URLs, when the supplied encoding (or the default, UTF-8) is wrong, safe_url_string falls back to percent-encoding the offending bytes.

1.14.0 (2016-04-06)

Changes to safe_url_string:

  • proper handling of non-ASCII characters in Python 2 and Python 3
  • support IDNs
  • new path_encoding parameter to override the default UTF-8 when serializing non-ASCII characters before percent-encoding
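The two ideas behind these changes, percent-encoding a non-ASCII path in a chosen encoding and converting an internationalized domain name to ASCII, can be sketched with the standard library. This assumes Python 3 and does not call safe_url_string; the sample path and host are made up for illustration.

```python
from urllib.parse import quote

# Hypothetical path with a non-ASCII character ("é").
path = "/caf\u00e9"

# Serializing as UTF-8 (the default) versus Latin-1 before percent-encoding
# gives different results, which is why an encoding override is useful.
utf8_path = quote(path, encoding="utf-8")
latin1_path = quote(path, encoding="latin-1")

# An internationalized domain name converted to its ASCII (IDNA) form.
host = "m\u00fcnchen.de".encode("idna").decode("ascii")

print(utf8_path)    # /caf%C3%A9
print(latin1_path)  # /caf%E9
print(host)         # xn--mnchen-3ya.de
```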

html_body_declared_encoding also detects the encoding when it is not the sole attribute in <meta>.

Package is now properly marked as zip_safe.

1.13.0 (2015-11-05)

  • remove_tags removes uppercase tags as well;
  • ignore meta-redirects inside script or noscript tags by default, but add an option to not ignore them;
  • replace_entities now handles entities without trailing semicolon;
  • fixed uncaught UnicodeDecodeError when decoding entities.

1.12.0 (2015-06-29)

  • meta_refresh regex now handles leading newlines and whitespace in the URL;
  • include tests folder in source distribution.

1.11.0 (2015-01-13)

  • url_query_cleaner now supports str or list parameters;
  • add support for resolving base URLs in <base> tags with attributes before href.

1.10.0 (2014-08-20)

  • reverted all 1.9.0 changes.

1.9.0 (2014-08-16)

  • all url-related functions accept bytes and unicode and now return bytes.

1.8.1 (2014-08-14)

  • w3lib.http.basic_auth_header now returns bytes

1.8.0 (2014-07-31)

  • add support for big5-hkscs encoding.

1.7.1 (2014-07-26)

  • fixed headers_raw_to_dict and headers_dict_to_raw on Python 3;
  • documentation improvements;
  • provide wheels.

1.6 (2014-06-03)

  • w3lib.form.encode_multipart is deprecated;
  • docstrings and docs are improved;
  • w3lib.url.add_or_replace_parameter is re-implemented on top of stdlib functions;
  • remove_entities is renamed to replace_entities.

1.5 (2013-11-09)

  • Python 2.6 support is dropped.

1.4 (2013-10-18)

  • Python 3 support;
  • get_meta_refresh encoding handling is fixed;
  • check for ‘?’ in add_or_replace_parameter;
  • ISO-8859-1 is used for HTTP Basic Auth;
  • fixed unicode handling in replace_escape_chars;

1.3 (2012-05-13)

  • support non-standard gb_2312_80 encoding;
  • drop Python 2.5 support.

1.2 (2012-05-02)

  • Detect encoding for content attr before http-equiv in meta tag.

1.1 (2012-04-18)

  • w3lib.html.remove_comments handles multiline comments;
  • Added w3lib.encoding module, containing functions for working with character encoding, like encoding autodetection from HTML pages.
  • w3lib.url.urljoin_rfc is deprecated.

1.0 (2011-04-17)

First release of w3lib.

History

The code of w3lib was originally part of the Scrapy framework. It was later split out of Scrapy to make it more reusable and to provide a useful library of web functions that does not depend on Scrapy.
