Lazyfile documentation

The lazyfile module is an implementation of an “on demand” file object in Python. It allows you to define how the data for a file is fetched, and it ensures that as the file is read, data is only requested when needed, and is cached so that the same block of data is never requested more than once.

This is useful when requesting data is costly, for example when accessing a large file over HTTP, where range requests can be used to limit the amount of network traffic needed.

Example

We will demonstrate the library by implementing a HTTP-based file, that only requests data from the server as needed.

from lazyfile import LazyFile
from urllib.request import Request, urlopen
from functools import partial

# We need to know the size of the file when we
# create the LazyFile object, so we issue a
# HTTP HEAD request to get it
def content_len(url):
    req = Request(url, method="HEAD")
    with urlopen(req) as f:
        return int(f.headers["Content-Length"])

# Get a block of bytes from the URL.
def getter(url, lo, hi):
    # The library passes the start and one-past-the-end values
    # for the range, just like a Python slice. So we need to
    # adjust the end value for the Range header.
    req = Request(url, headers={"Range": f"bytes={lo}-{hi-1}"})

    # Read the actual data
    with urlopen(req) as f:
        data = f.read()
        assert len(data) == hi-lo, f"Data ({lo}, {hi}) = {data!r}"

    return data

def open_url(url):
    url_getter = partial(getter, url)
    file_size = content_len(url)
    return LazyFile(file_size, url_getter)

Tutorial

Basic Usage

You need to know two things to create a LazyFile. The length of the file, and how to get a block of data from it.

>>> from lazyfile import LazyFile
>>> lf = LazyFile(size, reader)

The size is just an integer. The reader argument is a function which will be passed the range of bytes to read, expressed as the index of the start of the range, and the index one past the end of the range (this is the same convention as for Python ranges). The indices will always be between 0 and the size of the file.

As a simple example, the following will implement a reader for a byte string:

>>> data = b"Hello, world!"
>>> def get_bytes(lo, hi):
...     return data[lo:hi]
...
>>> lf = LazyFile(len(data), get_bytes)
>>> lf.seek(3)
3
>>> lf.read(4)
b'lo, '

Obviously, this is pointless, as io.BytesIO does the job far more effectively. The LazyFile class is intended for use when the get_bytes function is expensive, and it is worth minimising the number of calls to it. But the above is fully functional, and demonstrates the usage without needing a more complex data source.

LazyFile How To Guides

How to lazily extract a file from a remote wheel

The original motivation for this library was to extract the metadata file from a wheel, without downloading the whole wheel file (which could potentially be very large).

Wheel files are structured as zip files, and can be read using the standard library zipfile module. The ZipFile constructor takes a file-like object which must return bytes, and be seekable. A LazyFile is ideal for this.

First, we need to implement methods to get the length of the file, and to get a block of bytes from the file.

from urllib.request import Request, urlopen

# Get the file size with a HEAD request
def content_len(url):
    req = Request(url, method="HEAD")
    with urlopen(req) as f:
        return int(f.headers["Content-Length"])

# Get a block of bytes from the URL.
def get_url_range(url, lo, hi):
    # Adjust the range, as hi is "past the end"
    req = Request(url, headers={"Range": f"bytes={lo}-{hi-1}"})
    with urlopen(req) as f:
        data = f.read()
        if len(data) != hi-lo:
            raise ValueError(f"Failed to read {hi-lo} bytes")

    return data

With these helpers, we can open a URL lazily

from lazyfile import LazyFile
from functools import partial

def open_url(url):
    url_getter = partial(get_url_range, url)
    file_size = content_len(url)
    return LazyFile(file_size, url_getter)

And that’s all we need to process the file as a zipfile and extract the metadata

def extract_metadata(url)
    f = open_url(url)
    z = ZipFile(f)
    for name in z.namelist():
        if name.endswith(".dist-info/METADATA"):
            metadata = z.read(name)
            return metadata

Indices and tables