w3lib Package

encoding Module

Functions for handling encoding of web pages.
w3lib.encoding.html_body_declared_encoding(html_body_str: Union[str, bytes]) → Optional[str]

Return the encoding specified in meta tags in the HTML body, or None if no suitable encoding was found.

>>> import w3lib.encoding
>>> w3lib.encoding.html_body_declared_encoding(
...     """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
...     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
...     <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
...     <head>
...         <title>Some title</title>
...         <meta http-equiv="content-type" content="text/html;charset=utf-8" />
...     </head>
...     <body>
...         ...
...     </body>
...     </html>""")
'utf-8'
w3lib.encoding.html_to_unicode(content_type_header: Optional[str], html_body_str: bytes, default_encoding: str = 'utf8', auto_detect_fun: Optional[Callable[[bytes], Optional[str]]] = None) → Tuple[str, str]

Convert raw HTML bytes to unicode.

This attempts to make a reasonable guess at the content encoding of the HTML body, following a process similar to a web browser's. It tries, in order:

- BOM (byte-order mark)
- HTTP Content-Type header
- meta or xml tag declarations
- auto-detection, if the auto_detect_fun keyword argument is not None
- the default_encoding keyword argument (which defaults to utf8)
If an encoding other than the auto-detected or default encoding is used, overrides will be applied, converting some character encodings to more suitable alternatives.

If a BOM is found matching the encoding, it will be stripped.

The auto_detect_fun argument can be used to pass a function that will sniff the encoding of the text. This function must take the raw text as an argument and return the name of an encoding that Python can process, or None. To use chardet, for example, you can define the function as:

    auto_detect_fun=lambda x: chardet.detect(x).get('encoding')

or, to use UnicodeDammit (shipped with the BeautifulSoup library):

    auto_detect_fun=lambda x: UnicodeDammit(x).originalEncoding

If the locale of the website or the user's language preference is known, a better default encoding can be supplied.

If the Content-Type header is not available, pass None as content_type_header to signify that the header was absent.

This method will not fail: if characters cannot be converted to unicode, \ufffd (the unicode replacement character) is inserted instead.

Returns a tuple of (<encoding used>, <unicode_string>).

Examples:
>>> import w3lib.encoding
>>> w3lib.encoding.html_to_unicode(None,
...     b"""<!DOCTYPE html>
...     <head>
...         <meta charset="UTF-8" />
...         <meta name="viewport" content="width=device-width" />
...         <title>Creative Commons France</title>
...         <link rel='canonical' href='http://creativecommons.fr/' />
...     <body>
...         <p>Creative Commons est une organisation \xc3\xa0 but non lucratif
...         qui a pour dessein de faciliter la diffusion et le partage des oeuvres
...         tout en accompagnant les nouvelles pratiques de cr\xc3\xa9ation \xc3\xa0 l\xe2\x80\x99\xc3\xa8re numerique.</p>
...     </body>
...     </html>""")
('utf-8', '<!DOCTYPE html>\n<head>\n<meta charset="UTF-8" />\n<meta name="viewport" content="width=device-width" />\n<title>Creative Commons France</title>\n<link rel=\'canonical\' href=\'http://creativecommons.fr/\' />\n<body>\n<p>Creative Commons est une organisation \xe0 but non lucratif\nqui a pour dessein de faciliter la diffusion et le partage des oeuvres\ntout en accompagnant les nouvelles pratiques de cr\xe9ation \xe0 l\u2019\xe8re numerique.</p>\n</body>\n</html>')
w3lib.encoding.http_content_type_encoding(content_type: Optional[str]) → Optional[str]

Extract the encoding from a Content-Type header.

>>> import w3lib.encoding
>>> w3lib.encoding.http_content_type_encoding("Content-Type: text/html; charset=ISO-8859-4")
'iso8859-4'
w3lib.encoding.read_bom(data: bytes) → Union[Tuple[None, None], Tuple[str, bytes]]

Read the byte-order mark (BOM) at the start of the text, if present, and return the encoding it represents together with the BOM bytes. If no BOM can be detected, (None, None) is returned.

>>> import w3lib.encoding
>>> w3lib.encoding.read_bom(b'\xfe\xff\x6c\x34')
('utf-16-be', b'\xfe\xff')
>>> w3lib.encoding.read_bom(b'\xff\xfe\x34\x6c')
('utf-16-le', b'\xff\xfe')
>>> w3lib.encoding.read_bom(b'\x00\x00\xfe\xff\x00\x00\x6c\x34')
('utf-32-be', b'\x00\x00\xfe\xff')
>>> w3lib.encoding.read_bom(b'\xff\xfe\x00\x00\x34\x6c\x00\x00')
('utf-32-le', b'\xff\xfe\x00\x00')
>>> w3lib.encoding.read_bom(b'\x01\x02\x03\x04')
(None, None)
w3lib.encoding.resolve_encoding(encoding_alias: str) → Optional[str]

Return the encoding that encoding_alias maps to, or None if the alias cannot be interpreted.

>>> import w3lib.encoding
>>> w3lib.encoding.resolve_encoding('latin1')
'cp1252'
>>> w3lib.encoding.resolve_encoding('gb_2312-80')
'gb18030'
html Module

Functions for dealing with markup text.
w3lib.html.get_base_url(text: AnyStr, baseurl: Union[str, bytes] = '', encoding: str = 'utf-8') → str

Return the base url declared in the given HTML text, resolved relative to the given base url. If no base url is declared in the HTML, the given baseurl is returned unchanged.
w3lib.html.get_meta_refresh(text: AnyStr, baseurl: str = '', encoding: str = 'utf-8', ignore_tags: Iterable[str] = ('script', 'noscript')) → Union[Tuple[None, None], Tuple[float, str]]

Parse the http-equiv="refresh" meta element in the given HTML text and return a tuple (interval, url), where interval is a number giving the delay in seconds (or zero if not present) and url is a string with the absolute url to redirect to. If no meta redirect is found, (None, None) is returned.
w3lib.html.remove_comments(text: AnyStr, encoding: Optional[str] = None) → str

Remove HTML comments.

>>> import w3lib.html
>>> w3lib.html.remove_comments(b"test <!--textcoment--> whatever")
'test  whatever'
w3lib.html.remove_tags(text: AnyStr, which_ones: Iterable[str] = (), keep: Iterable[str] = (), encoding: Optional[str] = None) → str

Remove HTML tags only, leaving their content in place.

which_ones and keep are both tuples; there are four cases:

    which_ones   keep         what it does
    not empty    empty        remove all tags in which_ones
    empty        not empty    remove all tags except the ones in keep
    empty        empty        remove all tags
    not empty    not empty    not allowed

Remove all tags:

>>> import w3lib.html
>>> doc = '<div><p><b>This is a link:</b> <a href="http://www.example.com">example</a></p></div>'
>>> w3lib.html.remove_tags(doc)
'This is a link: example'

Keep only some tags:

>>> w3lib.html.remove_tags(doc, keep=('div',))
'<div>This is a link: example</div>'

Remove only specific tags:

>>> w3lib.html.remove_tags(doc, which_ones=('a', 'b'))
'<div><p>This is a link: example</p></div>'

You can't both remove some and keep some:

>>> w3lib.html.remove_tags(doc, which_ones=('a',), keep=('p',))
Traceback (most recent call last):
    ...
ValueError: Cannot use both which_ones and keep
w3lib.html.remove_tags_with_content(text: AnyStr, which_ones: Iterable[str] = (), encoding: Optional[str] = None) → str

Remove tags and their content.

which_ones is a tuple of the tags to remove, including their content. If it is empty, the string is returned unmodified.

>>> import w3lib.html
>>> doc = '<div><p><b>This is a link:</b> <a href="http://www.example.com">example</a></p></div>'
>>> w3lib.html.remove_tags_with_content(doc, which_ones=('b',))
'<div><p> <a href="http://www.example.com">example</a></p></div>'
w3lib.html.replace_entities(text: AnyStr, keep: Iterable[str] = (), remove_illegal: bool = True, encoding: str = 'utf-8') → str

Remove entities from the given text by converting them to their corresponding unicode character.

text can be a unicode string or a byte string encoded in the given encoding (which defaults to 'utf-8').

If keep is passed (with a list of entity names), those entities will be kept (they won't be removed).

It supports both numeric entities (&#nnnn; decimal and &#xhhhh; hexadecimal) and named entities (such as &nbsp; or &gt;).

If remove_illegal is True, entities that can't be converted are removed. If remove_illegal is False, entities that can't be converted are kept "as is". For more information, see the tests.

Always returns a unicode string (with the entities removed).

>>> import w3lib.html
>>> w3lib.html.replace_entities(b'Price: &pound;100')
'Price: \xa3100'
>>> print(w3lib.html.replace_entities(b'Price: &pound;100'))
Price: £100
w3lib.html.replace_escape_chars(text: AnyStr, which_ones: Iterable[str] = ('\n', '\t', '\r'), replace_by: Union[str, bytes] = '', encoding: Optional[str] = None) → str

Remove escape characters.

which_ones is a tuple of the escape characters to remove. By default it removes \n, \t and \r.

replace_by is the string to replace the escape characters with. It defaults to '', meaning the escape characters are removed outright.
w3lib.html.replace_tags(text: AnyStr, token: str = '', encoding: Optional[str] = None) → str

Replace all markup tags found in the given text with the given token. By default token is an empty string, so all tags are simply removed.

text can be a unicode string or a regular string encoded as encoding (or 'utf-8' if encoding is not given).

Always returns a unicode string.

Examples:

>>> import w3lib.html
>>> w3lib.html.replace_tags('This text contains <a>some tag</a>')
'This text contains some tag'
>>> w3lib.html.replace_tags('<p>Je ne parle pas <b>fran\xe7ais</b></p>', ' -- ', 'latin-1')
' -- Je ne parle pas -- fran\xe7ais -- -- '
w3lib.html.strip_html5_whitespace(text: str) → str

Strip all leading and trailing space characters (as defined in https://www.w3.org/TR/html5/infrastructure.html#space-character).

Such stripping is useful e.g. for processing HTML element attributes which contain URLs, like href, src or form action; the HTML5 standard defines them as "valid URL potentially surrounded by spaces" or "valid non-empty URL potentially surrounded by spaces".

>>> import w3lib.html
>>> w3lib.html.strip_html5_whitespace(' hello\n')
'hello'
w3lib.html.unquote_markup(text: AnyStr, keep: Iterable[str] = (), remove_illegal: bool = True, encoding: Optional[str] = None) → str

This function receives markup as text (always a unicode string or a UTF-8 encoded string) and does the following:

- removes entities (except the ones in keep) from any part of it that is not inside a CDATA
- searches for CDATAs and extracts their text (if any) without modifying it
- removes the found CDATA delimiters
http Module

w3lib.http.basic_auth_header(username: AnyStr, password: AnyStr, encoding: str = 'ISO-8859-1') → bytes

Return an Authorization header field value for HTTP Basic Access Authentication (RFC 2617).

>>> import w3lib.http
>>> w3lib.http.basic_auth_header('someuser', 'somepass')
b'Basic c29tZXVzZXI6c29tZXBhc3M='
w3lib.http.headers_dict_to_raw(headers_dict: Optional[Mapping[bytes, Union[Any, Sequence[bytes]]]]) → Optional[bytes]

Return a raw HTTP headers representation of the given headers dictionary. Note that keys and values must be bytes.

For example:

>>> import w3lib.http
>>> w3lib.http.headers_dict_to_raw({b'Content-type': b'text/html', b'Accept': b'gzip'})  # doctest: +SKIP
b'Content-type: text/html\r\nAccept: gzip'

If the argument is None, None is returned:

>>> w3lib.http.headers_dict_to_raw(None)
>>>
w3lib.http.headers_raw_to_dict(headers_raw: Optional[bytes]) → Optional[MutableMapping[bytes, List[bytes]]]

Convert raw headers (a single multi-line bytestring) to a dictionary.

For example:

>>> import w3lib.http
>>> w3lib.http.headers_raw_to_dict(b"Content-type: text/html\n\rAccept: gzip\n\n")  # doctest: +SKIP
{b'Content-type': [b'text/html'], b'Accept': [b'gzip']}

Incorrect input:

>>> w3lib.http.headers_raw_to_dict(b"Content-typt gzip\n\n")
{}

If the argument is None, None is returned:

>>> w3lib.http.headers_raw_to_dict(None)
>>>
url Module

This module contains general purpose URL functions not found in the standard library.

class w3lib.url.ParseDataURIResult

Named tuple returned by parse_data_uri(). Its fields:

- data: the data, decoded if it was encoded in base64 format
- media_type: MIME type and subtype, separated by / (e.g. "text/plain")
- media_type_parameters: MIME type parameters (e.g. {"charset": "US-ASCII"})
w3lib.url.add_or_replace_parameter(url: str, name: str, new_value: str) → str

Add or replace a parameter in the given url.

>>> import w3lib.url
>>> w3lib.url.add_or_replace_parameter('http://www.example.com/index.php', 'arg', 'v')
'http://www.example.com/index.php?arg=v'
>>> w3lib.url.add_or_replace_parameter('http://www.example.com/index.php?arg1=v1&arg2=v2&arg3=v3', 'arg4', 'v4')
'http://www.example.com/index.php?arg1=v1&arg2=v2&arg3=v3&arg4=v4'
>>> w3lib.url.add_or_replace_parameter('http://www.example.com/index.php?arg1=v1&arg2=v2&arg3=v3', 'arg3', 'v3new')
'http://www.example.com/index.php?arg1=v1&arg2=v2&arg3=v3new'
w3lib.url.add_or_replace_parameters(url: str, new_parameters: Dict[str, str]) → str

Add or replace parameters in the given url.

>>> import w3lib.url
>>> w3lib.url.add_or_replace_parameters('http://www.example.com/index.php', {'arg': 'v'})
'http://www.example.com/index.php?arg=v'
>>> args = {'arg4': 'v4', 'arg3': 'v3new'}
>>> w3lib.url.add_or_replace_parameters('http://www.example.com/index.php?arg1=v1&arg2=v2&arg3=v3', args)
'http://www.example.com/index.php?arg1=v1&arg2=v2&arg3=v3new&arg4=v4'
w3lib.url.any_to_uri(uri_or_path: str) → str

If given a path name, return its file URI; otherwise return it unmodified.
w3lib.url.canonicalize_url(url: Union[str, bytes, urllib.parse.ParseResult], keep_blank_values: bool = True, keep_fragments: bool = False, encoding: Optional[str] = None) → str

Canonicalize the given url by applying the following procedures:

- make the URL safe
- sort query arguments, first by key, then by value
- normalize all spaces (in query arguments) to '+' (plus symbol)
- normalize percent encodings to uppercase (%2f -> %2F)
- remove query arguments with blank values (unless keep_blank_values is True)
- remove fragments (unless keep_fragments is True)

The url passed can be bytes or unicode; the url returned is always a native str (bytes in Python 2, unicode in Python 3).

>>> import w3lib.url
>>>
>>> # sorting query arguments
>>> w3lib.url.canonicalize_url('http://www.example.com/do?c=3&b=5&b=2&a=50')
'http://www.example.com/do?a=50&b=2&b=5&c=3'
>>>
>>> # UTF-8 conversion + percent-encoding of non-ASCII characters
>>> w3lib.url.canonicalize_url('http://www.example.com/r\u00e9sum\u00e9')
'http://www.example.com/r%C3%A9sum%C3%A9'

For more examples, see the tests in tests/test_url.py.
w3lib.url.file_uri_to_path(uri: str) → str

Convert a file URI to a local filesystem path according to http://en.wikipedia.org/wiki/File_URI_scheme.
w3lib.url.parse_data_uri(uri: Union[str, bytes]) → w3lib.url.ParseDataURIResult

Parse a data: URI into a ParseDataURIResult.
w3lib.url.path_to_file_uri(path: str) → str

Convert a local filesystem path to a legal file URI as described in http://en.wikipedia.org/wiki/File_URI_scheme.
w3lib.url.safe_download_url(url: Union[str, bytes], encoding: str = 'utf8', path_encoding: str = 'utf8') → str

Make a url safe for download. This calls safe_url_string and then strips the fragment, if one exists. The path is normalised; if it points outside the document root, it is changed to be within the document root.
w3lib.url.safe_url_string(url: Union[str, bytes], encoding: str = 'utf8', path_encoding: str = 'utf8', quote_path: bool = True) → str

Return a URL equivalent to url that a wide range of web browsers and web servers consider valid.

url is parsed according to the rules of the URL living standard, and during serialization additional characters are percent-encoded to make the URL valid by additional URL standards.

The returned URL should be valid by all of the following URL standards known to be enforced by modern-day web browsers and web servers:

- URL living standard
- RFC 3986
- RFC 2396 and RFC 2732, as interpreted by Java 8's java.net.URI class

If a bytes URL is given, it is first converted to str using the given encoding (which defaults to 'utf8'). If quote_path is True (the default), the URL path component is encoded using path_encoding ('utf8' by default) and then quoted; otherwise the path component is left unencoded and unquoted. The given encoding is used for the query string and form data.

When passing an encoding, you should use the encoding of the original page (the page the URL was extracted from).

Calling this function on an already "safe" URL will return the URL unmodified.
w3lib.url.url_query_cleaner(url: Union[str, bytes], parameterlist: Union[str, bytes, Sequence[Union[str, bytes]]] = (), sep: str = '&', kvsep: str = '=', remove: bool = False, unique: bool = True, keep_fragments: bool = False) → str

Clean the URL query string, leaving only the arguments named in parameterlist and keeping their order.

>>> import w3lib.url
>>> w3lib.url.url_query_cleaner("product.html?id=200&foo=bar&name=wired", ('id',))
'product.html?id=200'
>>> w3lib.url.url_query_cleaner("product.html?id=200&foo=bar&name=wired", ['id', 'name'])
'product.html?id=200&name=wired'

If unique is False, duplicated keys are not removed:

>>> w3lib.url.url_query_cleaner("product.html?d=1&e=b&d=2&d=3&other=other", ['d'], unique=False)
'product.html?d=1&d=2&d=3'

If remove is True, leave only the arguments not in parameterlist:

>>> w3lib.url.url_query_cleaner("product.html?id=200&foo=bar&name=wired", ['id'], remove=True)
'product.html?foo=bar&name=wired'
>>> w3lib.url.url_query_cleaner("product.html?id=2&foo=bar&name=wired", ['id', 'foo'], remove=True)
'product.html?name=wired'

By default, URL fragments are removed. If you need to preserve fragments, pass keep_fragments=True:

>>> w3lib.url.url_query_cleaner('http://domain.tld/?bla=123#123123', ['bla'], remove=True, keep_fragments=True)
'http://domain.tld/#123123'
w3lib.url.url_query_parameter(url: Union[str, bytes], parameter: str, default: Optional[str] = None, keep_blank_values: Union[bool, int] = 0) → Optional[str]

Return the value of a url parameter, given the url and the parameter name.

General case:

>>> import w3lib.url
>>> w3lib.url.url_query_parameter("product.html?id=200&foo=bar", "id")
'200'

Return a default value if the parameter is not found:

>>> w3lib.url.url_query_parameter("product.html?id=200&foo=bar", "notthere", "mydefault")
'mydefault'

Returns None if keep_blank_values is not set, or is 0 (the default):

>>> w3lib.url.url_query_parameter("product.html?id=", "id")
>>>

Returns an empty string if keep_blank_values is set to 1:

>>> w3lib.url.url_query_parameter("product.html?id=", "id", keep_blank_values=1)
''