Reading a binary file and looping over each byte

377

In Python, how do I read in a binary file and loop over each byte of that file?

python file-io binary

— Jesse Vogt

387

Python 2.4 and earlier

f = open("myfile", "rb")
try:
    byte = f.read(1)
    while byte != "":
        # Do stuff with byte.
        byte = f.read(1)
finally:
    f.close()

Python 2.5-2.7

with open("myfile", "rb") as f:
    byte = f.read(1)
    while byte != "":
        # Do stuff with byte.
        byte = f.read(1)

Note that the with statement is not available in versions of Python below 2.5. To use it in v 2.5 you'll need to import it:

from __future__ import with_statement

In 2.6 this is not needed.

Python 3

In Python 3, it's a bit different. In binary mode we no longer get raw characters from the stream but bytes objects, so we need to alter the condition:

with open("myfile", "rb") as f:
    byte = f.read(1)
    while byte != b"":
        # Do stuff with byte.
        byte = f.read(1)

Or, as benhoyt says, skip the not-equal and take advantage of the fact that b"" evaluates to false. This makes the code compatible between 2.6 and 3.x without any changes. It would also save you from changing the condition if you go from byte mode to text or the reverse.

with open("myfile", "rb") as f:
    byte = f.read(1)
    while byte:
        # Do stuff with byte.
        byte = f.read(1)
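
The truthiness check can be tried in isolation. What follows is only a minimal sketch: it uses an in-memory io.BytesIO stream as a stand-in for a real file (an assumption made purely to keep the example self-contained) and shows that a zero byte does not end the loop, only an empty read does.

```python
import io

# An empty bytes object is falsy; any non-empty one is truthy,
# even when the single byte it holds is zero.
assert not b""
assert b"\x00"

stream = io.BytesIO(b"\x00\x01\x02")  # stand-in for open("myfile", "rb")
seen = []
byte = stream.read(1)
while byte:  # stops only at end of stream, not on the zero byte
    seen.append(byte)
    byte = stream.read(1)

print(seen)  # [b'\x00', b'\x01', b'\x02']
```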

Python 3.8

From now on, thanks to the := operator, the above code can be written in a shorter way.

with open("myfile", "rb") as f:
    while (byte := f.read(1)):
        # Do stuff with byte.
        pass

— Skurmedel

40

Reading a file byte-wise is a performance nightmare. This cannot be the best solution available in Python. This code should be used with care.

— usr

7

@usr: Well, file objects are buffered internally, and even so this is what was asked for. Not every script needs optimal performance.

— Skurmedel,

4

@mezhaka: So you change it from read(1) to read(bufsize) and in the while loop you do a for-in... the example still stands.

— Skurmedel,

3

@usr: the performance difference can be as much as 200 times for the code I've tried.

— jfs

2

@usr - it depends on how many bytes you want to process. If they are few enough, the "badly" performing but easily understandable code can be much preferred. The waste of CPU cycles is compensated for by saving "reader CPU cycles" when maintaining the code.

— IllvilJa

172

This generator yields bytes from a file, reading the file in chunks:

def bytes_from_file(filename, chunksize=8192):
    with open(filename, "rb") as f:
        while True:
            chunk = f.read(chunksize)
            if chunk:
                for b in chunk:
                    yield b
            else:
                break

# example:
for b in bytes_from_file('filename'):
    do_stuff_with(b)

See the Python documentation for information on iterators and generators.
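
To try the generator without an existing data file, here is a self-contained sketch. It assumes Python 3.3+ (for yield from) and writes a throwaway temporary file; the name bytes_from_file_py3 is hypothetical and not part of the answer above. Note that in Python 3 each yielded item is an int in range(256), whereas in Python 2 it would be a one-character str.

```python
import os
import tempfile

def bytes_from_file_py3(filename, chunksize=8192):
    """Yield the bytes of a file one at a time, reading it in chunks."""
    with open(filename, "rb") as f:
        while True:
            chunk = f.read(chunksize)
            if not chunk:
                break
            yield from chunk  # Python 3.3+; items are ints in Python 3

# Write a small temporary file so the example does not depend on existing data.
fd, tmp_name = tempfile.mkstemp()
with os.fdopen(fd, "wb") as tmp:
    tmp.write(b"ABC")

try:
    result = list(bytes_from_file_py3(tmp_name, chunksize=2))
finally:
    os.remove(tmp_name)

print(result)  # [65, 66, 67]
```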

— Codeape

3

@codeape Just what I am looking for. But how do you determine the chunk size? Can it be an arbitrary value?

— swdev,

3

@swdev: The example uses a chunk size of 8192 bytes. The parameter of the file.read() function simply specifies the size, i.e. the number of bytes to be read. codeape chose 8192 bytes = 8 kB (actually it's KiB, but that's not as commonly known). The value is "totally" arbitrary, but 8 kB seems to be an appropriate value: not too much memory is wasted, and there are still not "too many" read operations as in the accepted answer by Skurmedel...

— mozzbozz,

3

The filesystem already buffers chunks of data, so this code is redundant. It's better to read a byte at a time.

— rigido

17

While it's already faster than the accepted answer, this could be sped up by another 20-25% by replacing the entire innermost for b in chunk: loop with yield from chunk. This form of yield was added in Python 3.3 (see Yield Expressions).

— martineau,

3

Hmm, seems unlikely. Link?

— codeape,

54

If the file is not so big that holding it in memory is a problem:

with open("filename", "rb") as f:
    bytes_read = f.read()
for b in bytes_read:
    process_byte(b)

where process_byte represents some operation you want to perform on the passed-in byte.

If you want to process a chunk at a time (CHUNKSIZE being whatever read size you choose):

with open("filename", "rb") as f:
    bytes_read = f.read(CHUNKSIZE)
    while bytes_read:
        for b in bytes_read:
            process_byte(b)
        bytes_read = f.read(CHUNKSIZE)

The with statement is available in Python 2.5 and greater.

— Vinay Sajip

1

You might be interested in the benchmark I just posted.

— martineau,

37

To read a file one byte at a time (ignoring the buffering), you could use the two-argument iter(callable, sentinel) built-in function:

with open(filename, 'rb') as file:
    for byte in iter(lambda: file.read(1), b''):
        pass  # Do stuff with byte

It calls file.read(1) until it returns nothing, i.e. b'' (an empty bytestring). Memory use does not grow without bound for large files. You could pass buffering=0 to open() to disable the buffering; that guarantees that only one byte is read per iteration (slow).

The with statement closes the file automatically, including the case when the code underneath raises an exception.
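
The two-argument form of iter() works with any read size, not just 1. A small sketch, using io.BytesIO as a stand-in for a file opened in 'rb' mode (an assumption made only so the snippet runs on its own):

```python
import io
from functools import partial

stream = io.BytesIO(b"hello")  # stand-in for a file opened in "rb" mode

# iter(callable, sentinel) calls the callable repeatedly and stops
# as soon as it returns a value equal to the sentinel.
chunks = list(iter(partial(stream.read, 2), b""))
print(chunks)  # [b'he', b'll', b'o']
```

functools.partial(stream.read, 2) is equivalent here to lambda: stream.read(2); either spelling gives iter() the zero-argument callable it needs.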

Despite the presence of internal buffering by default, it is still inefficient to process one byte at a time. For example, here's the blackhole.py utility that eats everything it is given:

#!/usr/bin/env python3
"""Discard all input. `cat > /dev/null` analog."""
import sys
from functools import partial
from collections import deque

chunksize = int(sys.argv[1]) if len(sys.argv) > 1 else (1 << 15)
deque(iter(partial(sys.stdin.detach().read, chunksize), b''), maxlen=0)

Example:

$ dd if=/dev/zero bs=1M count=1000 | python3 blackhole.py

It processes ~1.5 GB/s when chunksize == 32768 on my machine and only ~7.5 MB/s when chunksize == 1. That is, it is 200 times slower to read one byte at a time. Take it into account if you can rewrite your processing to use more than one byte at a time and if you need performance.
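
Part of that gap plausibly comes from the sheer number of Python-level read() calls. The following sketch does not measure time; it only counts how many non-empty reads are needed for a given chunk size. The helper name count_reads and the in-memory io.BytesIO stream are illustrative assumptions.

```python
import io
from functools import partial

def count_reads(data, chunksize):
    """Return how many non-empty read() results it takes to consume data."""
    stream = io.BytesIO(data)
    return sum(1 for _ in iter(partial(stream.read, chunksize), b""))

data = bytes(64 * 1024)  # 64 KiB of zero bytes
for size in (1, 4096, 32768):
    print(size, count_reads(data, size))
# 1 65536
# 4096 16
# 32768 2
```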

mmap allows you to treat a file as a bytearray and a file object simultaneously. It can serve as an alternative to loading the whole file in memory if you need access to both interfaces. In particular, you can iterate one byte at a time over a memory-mapped file just using a plain for-loop:

from mmap import ACCESS_READ, mmap

with open(filename, 'rb', 0) as f, mmap(f.fileno(), 0, access=ACCESS_READ) as s:
    for byte in s: # length is equal to the current file size
        pass  # Do stuff with byte

mmap supports the slice notation. For example, mm[i:i+len] returns len bytes from the file starting at position i. The context manager protocol is not supported before Python 3.2; you need to call mm.close() explicitly in that case. Iterating over each byte using mmap consumes more memory than file.read(1), but mmap is an order of magnitude faster.
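
A short sketch of the slice notation, assuming Python 3.2+ (for the context manager) and a throwaway temporary file created just for the demonstration:

```python
import mmap
import os
import tempfile

# Create a small file to map; the content is arbitrary sample data.
fd, tmp_name = tempfile.mkstemp()
with os.fdopen(fd, "wb") as tmp:
    tmp.write(b"0123456789")

try:
    with open(tmp_name, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        i, length = 3, 4
        piece = mm[i:i + length]  # bytes object: b"3456"
        first = mm[0]             # int in Python 3: 48, i.e. ord("0")
        size = len(mm)            # equals the file size: 10
finally:
    os.remove(tmp_name)

print(piece, first, size)
```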

— jfs

I found the last example very interesting. Too bad there are no equivalent numpy memory-mapped (byte) arrays.

— martineau,

1

@martineau there is numpy.memmap() and you can get the data one byte at a time (ctypes.data). You could think of numpy arrays as just a little more than blobs in memory + metadata.

— jfs,

jfs: Thanks, excellent news! I didn't know such a thing existed. Great answer, BTW.

— martineau,

25

Reading a binary file in Python and looping over each byte

The pathlib module, added in Python 3.4, gained in Python 3.5 a convenience method specifically to read in a file as bytes, allowing us to iterate over the bytes. I consider this a decent (if quick and dirty) answer:

import pathlib

for byte in pathlib.Path(path).read_bytes():
    print(byte)

Interesting that this is the only answer to mention pathlib.

In Python 2, you probably would do this (as Vinay Sajip also suggests):

with open(path, 'rb') as file:
    for byte in file.read():
        print(byte)

In the case that the file may be too large to iterate over in memory, you would chunk it, idiomatically, using the iter function with the callable, sentinel signature - the Python 2 version:

with open(path, 'rb') as file:
    callable = lambda: file.read(1024)
    sentinel = bytes() # or b''
    for chunk in iter(callable, sentinel): 
        for byte in chunk:
            print(byte)

(Several other answers mention this, but few offer a sensible read size.)

Best practice for large files or buffered/interactive reading

Let's create a function to do this, including idiomatic uses of the standard library for Python 3.5+:

from pathlib import Path
from functools import partial
from io import DEFAULT_BUFFER_SIZE

def file_byte_iterator(path):
    """given a path, return an iterator over the file
    that lazily loads the file
    """
    path = Path(path)
    with path.open('rb') as file:
        reader = partial(file.read1, DEFAULT_BUFFER_SIZE)
        file_iterator = iter(reader, bytes())
        for chunk in file_iterator:
            yield from chunk

Note that we use file.read1. file.read blocks until it gets all the bytes requested of it or EOF. file.read1 allows us to avoid blocking, and it can return more quickly because of this. No other answers mention this either.
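
The difference between read and read1 is easiest to see on a stream that delivers data slowly. The sketch below is only an illustration: TrickleRaw is a hypothetical raw stream, invented for this example, that hands out at most a few bytes per underlying read.

```python
import io

class TrickleRaw(io.RawIOBase):
    """Hypothetical raw stream that hands out at most `step` bytes per read."""

    def __init__(self, data, step):
        self._data = data
        self._pos = 0
        self._step = step

    def readable(self):
        return True

    def readinto(self, buffer):
        n = min(len(buffer), self._step, len(self._data) - self._pos)
        buffer[:n] = self._data[self._pos:self._pos + n]
        self._pos += n
        return n  # 0 signals end of stream

payload = b"abcdefghij"

# read(10) keeps calling the raw stream until it has 10 bytes or hits EOF.
full = io.BufferedReader(TrickleRaw(payload, 3)).read(10)

# read1(10) performs at most one raw read, so it may return fewer bytes.
partial_read = io.BufferedReader(TrickleRaw(payload, 3)).read1(10)

print(full, partial_read)
```

On a regular file on disk the two calls usually return the same amount of data; the distinction matters mostly for pipes, sockets and other interactive sources.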

Demonstration of best practice usage:

Let's make a file with a megabyte (actually mebibyte) of pseudorandom data:

import random
import pathlib
path = 'pseudorandom_bytes'
pathobj = pathlib.Path(path)

pathobj.write_bytes(
  bytes(random.randint(0, 255) for _ in range(2**20)))

Now let's iterate over it and materialize it in memory:

>>> l = list(file_byte_iterator(path))
>>> len(l)
1048576

We can inspect any part of the data, for example the last 100 and first 100 bytes:

>>> l[-100:]
[208, 5, 156, 186, 58, 107, 24, 12, 75, 15, 1, 252, 216, 183, 235, 6, 136, 50, 222, 218, 7, 65, 234, 129, 240, 195, 165, 215, 245, 201, 222, 95, 87, 71, 232, 235, 36, 224, 190, 185, 12, 40, 131, 54, 79, 93, 210, 6, 154, 184, 82, 222, 80, 141, 117, 110, 254, 82, 29, 166, 91, 42, 232, 72, 231, 235, 33, 180, 238, 29, 61, 250, 38, 86, 120, 38, 49, 141, 17, 190, 191, 107, 95, 223, 222, 162, 116, 153, 232, 85, 100, 97, 41, 61, 219, 233, 237, 55, 246, 181]
>>> l[:100]
[28, 172, 79, 126, 36, 99, 103, 191, 146, 225, 24, 48, 113, 187, 48, 185, 31, 142, 216, 187, 27, 146, 215, 61, 111, 218, 171, 4, 160, 250, 110, 51, 128, 106, 3, 10, 116, 123, 128, 31, 73, 152, 58, 49, 184, 223, 17, 176, 166, 195, 6, 35, 206, 206, 39, 231, 89, 249, 21, 112, 168, 4, 88, 169, 215, 132, 255, 168, 129, 127, 60, 252, 244, 160, 80, 155, 246, 147, 234, 227, 157, 137, 101, 84, 115, 103, 77, 44, 84, 134, 140, 77, 224, 176, 242, 254, 171, 115, 193, 29]

Don't iterate by lines for binary files

Don't do the following - this pulls a chunk of arbitrary size until it gets to a newline character - too slow when the chunks are too small, and possibly too large as well:

    with open(path, 'rb') as file:
        for chunk in file: # text newline iteration - not for bytes
            yield from chunk

The above is only good for what are semantically human-readable text files (like plain text, code, markup, markdown etc... essentially anything ASCII, UTF, Latin, etc... encoded) that you should open without the 'b' flag.
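
To see why line iteration gives unpredictable chunk sizes on binary data, consider this small sketch, which uses io.BytesIO as a stand-in for a binary file (an assumption for self-containment):

```python
import io

stream = io.BytesIO(b"\x00\x01\n\x02\x03\x04\x05\n\x06")

# Iterating a binary stream splits on b"\n", so chunk sizes depend
# entirely on where byte value 10 happens to occur in the data.
sizes = [len(line) for line in stream]
print(sizes)  # [3, 5, 1]
```

In a file with no byte equal to 10 at all, the single "line" would be the entire file.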

— Aaron Hall

2

This is so much better... thank you for doing this. I know it's not always fun to go back to a two-year-old answer, but I appreciate that you did. I particularly like the "Don't iterate by lines" subheading :-)

— Floris,

1

Hi Aaron, is there any reason why you chose to use path = Path(path), with path.open('rb') as file: rather than the built-in open function? They both do the same thing, correct?

— Joshua Yonathan,

1

@JoshuaYonathan I use the Path object because it's a very convenient new way to handle paths. Instead of passing a string around into the carefully chosen "right" functions, we can simply call the methods on the path object, which essentially contains most of the important functionality you want with what is semantically a path string. With IDEs that can inspect, we can more easily get autocompletion as well. We could accomplish the same with the open builtin, but there are lots of upsides for the programmer in writing the program with the Path object instead.

— Aaron Hall

1

The last method you mentioned, using the function file_byte_iterator, is much faster than all the methods I have tried on this page. Kudos to you!

— Rick M.

@RickM: You might be interested in the benchmark I just posted.

— martineau,

19

To sum up all the brilliant points of chrispy, Skurmedel, Ben Hoyt and Peter Hansen, this would be the optimal solution for processing a binary file one byte at a time:

with open("myfile", "rb") as f:
    while True:
        byte = f.read(1)
        if not byte:
            break
        do_stuff_with(ord(byte))

For Python versions 2.6 and above, because:

Python buffers internally - no need to read chunks
DRY principle - do not repeat the read line
the with statement ensures a clean file close
'byte' evaluates to false when there are no more bytes (not when a byte is zero)

Or use J.F. Sebastian's solution for improved speed

from functools import partial

with open(filename, 'rb') as file:
    for byte in iter(partial(file.read, 1), b''):
        pass  # Do stuff with byte

Or if you want it as a generator function, as demonstrated by codeape:

def bytes_from_file(filename):
    with open(filename, "rb") as f:
        while True:
            byte = f.read(1)
            if not byte:
                break
            yield(ord(byte))

# example:
for b in bytes_from_file('filename'):
    do_stuff_with(b)

— Holger Bille

2

As the linked answer says, reading/processing one byte at a time is still slow in Python even if the reads are buffered. The performance can be improved drastically if several bytes at a time can be processed, as in the example in the linked answer: 1.5 GB/s vs. 7.5 MB/s.

— jfs,

6

Python 3, read all of the file at once:

with open("filename", "rb") as binary_file:
    # Read the whole file at once
    data = binary_file.read()
    print(data)

You can iterate over whatever you want using the data variable.
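
For illustration, here is what indexing, slicing and iterating a bytes object give you in Python 3. The literal below is just sample data standing in for the result of binary_file.read():

```python
data = b"\x89PNG"  # stand-in for binary_file.read()

first = data[0]       # 137: indexing yields an int in Python 3
tail = data[1:4]      # b'PNG': slicing yields a bytes object
values = list(data)   # [137, 80, 78, 71]: iteration yields ints

print(first, tail, values)
```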

— Mircea

6

After trying all of the above and using the answer from @Aaron Hall, I was getting memory errors for a ~90 MB file on a computer running Windows 10, 8 GB RAM and 32-bit Python 3.5. A colleague recommended that I use numpy instead, and it works wonders.

By far the fastest way to read an entire binary file (that I have tested) is:

import numpy as np

file = "binary_file.bin"
data = np.fromfile(file, 'u1')

Reference

Multitudes faster than any other method so far. Hope it helps someone!

— Rick M.

3

Nice, but it can't be used on a binary file containing different data types.

— Nirmal,

@Nirmal: The question is about looping over each byte, so it's not clear whether your comment about different data types has any bearing.

— martineau,

1

Rick: Your code isn't doing quite the same thing as the others, namely looping over each byte. If that's added to it, it's no faster than the majority of the others, at least according to the results in my benchmark. In fact it appears to be one of the slower approaches. If the processing done to each byte (whatever that may be) was something that could be done via numpy, then it might be worthwhile.

— martineau,

@martineau Thanks for your comments. Yes, I do understand that the question is about looping over each byte and not just loading everything in one go, but there are other answers to this question that also point to reading all contents, hence my answer.

— Rick M.

4

If you have a lot of binary data to read, you might want to consider the struct module. It is documented as converting "between C and Python types", but of course bytes are bytes, and whether those were created as C types does not matter. For example, if your binary data contains two 2-byte integers and one 4-byte integer, you can read them as follows (example adapted from the struct documentation, with an explicit big-endian > prefix so that the result does not depend on the platform's native sizes and byte order):

>>> import struct
>>> struct.unpack('>hhl', b'\x00\x01\x00\x02\x00\x00\x00\x03')
(1, 2, 3)

You might find this more convenient, faster, or both, than explicitly looping over the content of a file.
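
As a sketch of how this could look when reading record after record from a stream: the record layout below (two 2-byte ints and one 4-byte int, little-endian) is a hypothetical format chosen for illustration, and io.BytesIO stands in for a file opened in 'rb' mode.

```python
import io
import struct

# Hypothetical record layout: two 2-byte signed ints and one 4-byte
# signed int, little-endian, no padding ("<" selects standard sizes).
record = struct.Struct("<hhi")

raw = record.pack(1, 2, 3) + record.pack(-1, 256, 70000)
stream = io.BytesIO(raw)  # stand-in for a file opened in "rb" mode

records = []
while True:
    block = stream.read(record.size)
    if len(block) < record.size:
        break  # end of stream (or a truncated trailing record)
    records.append(record.unpack(block))

print(records)  # [(1, 2, 3), (-1, 256, 70000)]
```

Stating the byte order explicitly in the format string avoids surprises from native alignment and from the size of 'l', which differs between platforms.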

— Gerrit

4

This post itself is not a direct answer to the question. What it is instead is a data-driven, extensible benchmark that can be used to compare many of the answers (and variations that use new features added in later, more modern versions of Python) that have been posted to this question - and should therefore be helpful in determining which has the best performance.

In a few cases I've modified the code in the referenced answer to make it compatible with the benchmark framework.

First, here are the results for what currently are the latest versions of Python 2 and 3:

Fastest to slowest execution speeds with 32-bit Python 2.7.16
  numpy version 1.16.5
  Test file size: 1,024 KiB
  100 executions, best of 3 repetitions

1                  Tcll (array.array) :   3.8943 secs, rel speed   1.00x,   0.00% slower (262.95 KiB/sec)
2  Vinay Sajip (read all into memory) :   4.1164 secs, rel speed   1.06x,   5.71% slower (248.76 KiB/sec)
3            codeape + iter + partial :   4.1616 secs, rel speed   1.07x,   6.87% slower (246.06 KiB/sec)
4                             codeape :   4.1889 secs, rel speed   1.08x,   7.57% slower (244.46 KiB/sec)
5               Vinay Sajip (chunked) :   4.1977 secs, rel speed   1.08x,   7.79% slower (243.94 KiB/sec)
6           Aaron Hall (Py 2 version) :   4.2417 secs, rel speed   1.09x,   8.92% slower (241.41 KiB/sec)
7                     gerrit (struct) :   4.2561 secs, rel speed   1.09x,   9.29% slower (240.59 KiB/sec)
8                     Rick M. (numpy) :   8.1398 secs, rel speed   2.09x, 109.02% slower (125.80 KiB/sec)
9                           Skurmedel :  31.3264 secs, rel speed   8.04x, 704.42% slower ( 32.69 KiB/sec)

Benchmark runtime (min:sec) - 03:26

Fastest to slowest execution speeds with 32-bit Python 3.8.0
  numpy version 1.17.4
  Test file size: 1,024 KiB
  100 executions, best of 3 repetitions

1  Vinay Sajip + "yield from" + "walrus operator" :   3.5235 secs, rel speed   1.00x,   0.00% slower (290.62 KiB/sec)
2                       Aaron Hall + "yield from" :   3.5284 secs, rel speed   1.00x,   0.14% slower (290.22 KiB/sec)
3         codeape + iter + partial + "yield from" :   3.5303 secs, rel speed   1.00x,   0.19% slower (290.06 KiB/sec)
4                      Vinay Sajip + "yield from" :   3.5312 secs, rel speed   1.00x,   0.22% slower (289.99 KiB/sec)
5      codeape + "yield from" + "walrus operator" :   3.5370 secs, rel speed   1.00x,   0.38% slower (289.51 KiB/sec)
6                          codeape + "yield from" :   3.5390 secs, rel speed   1.00x,   0.44% slower (289.35 KiB/sec)
7                                      jfs (mmap) :   4.0612 secs, rel speed   1.15x,  15.26% slower (252.14 KiB/sec)
8              Vinay Sajip (read all into memory) :   4.5948 secs, rel speed   1.30x,  30.40% slower (222.86 KiB/sec)
9                        codeape + iter + partial :   4.5994 secs, rel speed   1.31x,  30.54% slower (222.64 KiB/sec)
10                                        codeape :   4.5995 secs, rel speed   1.31x,  30.54% slower (222.63 KiB/sec)
11                          Vinay Sajip (chunked) :   4.6110 secs, rel speed   1.31x,  30.87% slower (222.08 KiB/sec)
12                      Aaron Hall (Py 2 version) :   4.6292 secs, rel speed   1.31x,  31.38% slower (221.20 KiB/sec)
13                             Tcll (array.array) :   4.8627 secs, rel speed   1.38x,  38.01% slower (210.58 KiB/sec)
14                                gerrit (struct) :   5.0816 secs, rel speed   1.44x,  44.22% slower (201.51 KiB/sec)
15                 Rick M. (numpy) + "yield from" :  11.8084 secs, rel speed   3.35x, 235.13% slower ( 86.72 KiB/sec)
16                                      Skurmedel :  11.8806 secs, rel speed   3.37x, 237.18% slower ( 86.19 KiB/sec)
17                                Rick M. (numpy) :  13.3860 secs, rel speed   3.80x, 279.91% slower ( 76.50 KiB/sec)

Benchmark runtime (min:sec) - 04:47

I also ran it with a much larger 10 MiB test file (which took nearly an hour to run) and got performance results comparable to those shown above.

Here's the code used to do the benchmarking:

from __future__ import print_function
import array
import atexit
from collections import deque, namedtuple
import io
from mmap import ACCESS_READ, mmap
import numpy as np
from operator import attrgetter
import os
import random
import struct
import sys
import tempfile
from textwrap import dedent
import time
import timeit
import traceback

try:
    xrange
except NameError:  # Python 3
    xrange = range


class KiB(int):
    """ KibiBytes - multiples of the byte units for quantities of information. """
    def __new__(self, value=0):
        return 1024*value


BIG_TEST_FILE = 1  # MiBs or 0 for a small file.
SML_TEST_FILE = KiB(64)
EXECUTIONS = 100  # Number of times each "algorithm" is executed per timing run.
TIMINGS = 3  # Number of timing runs.
CHUNK_SIZE = KiB(8)
if BIG_TEST_FILE:
    FILE_SIZE = KiB(1024) * BIG_TEST_FILE
else:
    FILE_SIZE = SML_TEST_FILE  # For quicker testing.

# Common setup for all algorithms -- prefixed to each algorithm's setup.
COMMON_SETUP = dedent("""
    # Make accessible in algorithms.
    from __main__ import array, deque, get_buffer_size, mmap, np, struct
    from __main__ import ACCESS_READ, CHUNK_SIZE, FILE_SIZE, TEMP_FILENAME
    from functools import partial
    try:
        xrange
    except NameError:  # Python 3
        xrange = range
""")


def get_buffer_size(path):
    """ Determine optimal buffer size for reading files. """
    st = os.stat(path)
    try:
        bufsize = st.st_blksize # Available on some Unix systems (like Linux)
    except AttributeError:
        bufsize = io.DEFAULT_BUFFER_SIZE
    return bufsize

# Utility primarily for use when embedding additional algorithms into benchmark.
VERIFY_NUM_READ = """
    # Verify generator reads correct number of bytes (assumes values are correct).
    bytes_read = sum(1 for _ in file_byte_iterator(TEMP_FILENAME))
    assert bytes_read == FILE_SIZE, \
           'Wrong number of bytes generated: got {:,} instead of {:,}'.format(
                bytes_read, FILE_SIZE)
"""

TIMING = namedtuple('TIMING', 'label, exec_time')

class Algorithm(namedtuple('CodeFragments', 'setup, test')):

    # Default timeit "stmt" code fragment.
    _TEST = """
        #for b in file_byte_iterator(TEMP_FILENAME):  # Loop over every byte.
        #    pass  # Do stuff with byte...
        deque(file_byte_iterator(TEMP_FILENAME), maxlen=0)  # Data sink.
    """

    # Must overload __new__ because (named)tuples are immutable.
    def __new__(cls, setup, test=None):
        """ Dedent (unindent) code fragment string arguments.
        Args:
          `setup` -- Code fragment that defines things used by `test` code.
                     In this case it should define a generator function named
                     `file_byte_iterator()` that will be passed that name of a test file
                     of binary data. This code is not timed.
          `test` -- Code fragment that uses things defined in `setup` code.
                    Defaults to _TEST. This is the code that's timed.
        """
        test =  cls._TEST if test is None else test  # Use default unless one is provided.

        # Uncomment to replace all performance tests with one that verifies the correct
        # number of bytes values are being generated by the file_byte_iterator function.
        #test = VERIFY_NUM_READ

        return tuple.__new__(cls, (dedent(setup), dedent(test)))


algorithms = {

    'Aaron Hall (Py 2 version)': Algorithm("""
        def file_byte_iterator(path):
            with open(path, "rb") as file:
                callable = partial(file.read, 1024)
                sentinel = bytes() # or b''
                for chunk in iter(callable, sentinel):
                    for byte in chunk:
                        yield byte
    """),

    "codeape": Algorithm("""
        def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
            with open(filename, "rb") as f:
                while True:
                    chunk = f.read(chunksize)
                    if chunk:
                        for b in chunk:
                            yield b
                    else:
                        break
    """),

    "codeape + iter + partial": Algorithm("""
        def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
            with open(filename, "rb") as f:
                for chunk in iter(partial(f.read, chunksize), b''):
                    for b in chunk:
                        yield b
    """),

    "gerrit (struct)": Algorithm("""
        def file_byte_iterator(filename):
            with open(filename, "rb") as f:
                fmt = '{}B'.format(FILE_SIZE)  # Reads entire file at once.
                for b in struct.unpack(fmt, f.read()):
                    yield b
    """),

    'Rick M. (numpy)': Algorithm("""
        def file_byte_iterator(filename):
            for byte in np.fromfile(filename, 'u1'):
                yield byte
    """),

    "Skurmedel": Algorithm("""
        def file_byte_iterator(filename):
            with open(filename, "rb") as f:
                byte = f.read(1)
                while byte:
                    yield byte
                    byte = f.read(1)
    """),

    "Tcll (array.array)": Algorithm("""
        def file_byte_iterator(filename):
            with open(filename, "rb") as f:
                arr = array.array('B')
                arr.fromfile(f, FILE_SIZE)  # Reads entire file at once.
                for b in arr:
                    yield b
    """),

    "Vinay Sajip (read all into memory)": Algorithm("""
        def file_byte_iterator(filename):
            with open(filename, "rb") as f:
                bytes_read = f.read()  # Reads entire file at once.
            for b in bytes_read:
                yield b
    """),

    "Vinay Sajip (chunked)": Algorithm("""
        def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
            with open(filename, "rb") as f:
                chunk = f.read(chunksize)
                while chunk:
                    for b in chunk:
                        yield b
                    chunk = f.read(chunksize)
    """),

}  # End algorithms

#
# Versions of algorithms that will only work in certain releases (or better) of Python.
#
if sys.version_info >= (3, 3):
    algorithms.update({

        'codeape + iter + partial + "yield from"': Algorithm("""
            def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
                with open(filename, "rb") as f:
                    for chunk in iter(partial(f.read, chunksize), b''):
                        yield from chunk
        """),

        'codeape + "yield from"': Algorithm("""
            def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
                with open(filename, "rb") as f:
                    while True:
                        chunk = f.read(chunksize)
                        if chunk:
                            yield from chunk
                        else:
                            break
        """),

        "jfs (mmap)": Algorithm("""
            def file_byte_iterator(filename):
                with open(filename, "rb") as f, \
                     mmap(f.fileno(), 0, access=ACCESS_READ) as s:
                    yield from s
        """),

        'Rick M. (numpy) + "yield from"': Algorithm("""
            def file_byte_iterator(filename):
            #    data = np.fromfile(filename, 'u1')
                yield from np.fromfile(filename, 'u1')
        """),

        'Vinay Sajip + "yield from"': Algorithm("""
            def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
                with open(filename, "rb") as f:
                    chunk = f.read(chunksize)
                    while chunk:
                        yield from chunk  # Added in Py 3.3
                        chunk = f.read(chunksize)
        """),

    })  # End Python 3.3 update.

if sys.version_info >= (3, 5):
    algorithms.update({

        'Aaron Hall + "yield from"': Algorithm("""
            from pathlib import Path

            def file_byte_iterator(path):
                ''' Given a path, return an iterator over the file
                    that lazily loads the file.
                '''
                path = Path(path)
                bufsize = get_buffer_size(path)

                with path.open('rb') as file:
                    reader = partial(file.read1, bufsize)
                    for chunk in iter(reader, bytes()):
                        yield from chunk
        """),

    })  # End Python 3.5 update.

if sys.version_info >= (3, 8, 0):
    algorithms.update({

        'Vinay Sajip + "yield from" + "walrus operator"': Algorithm("""
            def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
                with open(filename, "rb") as f:
                    while chunk := f.read(chunksize):
                        yield from chunk  # Added in Py 3.3
        """),

        'codeape + "yield from" + "walrus operator"': Algorithm("""
            def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
                with open(filename, "rb") as f:
                    while chunk := f.read(chunksize):
                        yield from chunk
        """),

    })  # End Python 3.8.0 update.


#### Main ####

def main():
    global TEMP_FILENAME

    def cleanup():
        """ Clean up after testing is completed. """
        try:
            os.remove(TEMP_FILENAME)  # Delete the temporary file.
        except Exception:
            pass

    atexit.register(cleanup)

    # Create a named temporary binary file of pseudo-random bytes for testing.
    fd, TEMP_FILENAME = tempfile.mkstemp('.bin')
    with os.fdopen(fd, 'wb') as file:
         os.write(fd, bytearray(random.randrange(256) for _ in range(FILE_SIZE)))

    # Execute and time each algorithm, gather results.
    start_time = time.time()  # To determine how long testing itself takes.

    timings = []
    for label in algorithms:
        try:
            timing = TIMING(label,
                            min(timeit.repeat(algorithms[label].test,
                                              setup=COMMON_SETUP + algorithms[label].setup,
                                              repeat=TIMINGS, number=EXECUTIONS)))
        except Exception as exc:
            print('{} occurred timing the algorithm: "{}"\n  {}'.format(
                    type(exc).__name__, label, exc))
            traceback.print_exc(file=sys.stdout)  # Redirect to stdout.
            sys.exit(1)
        timings.append(timing)

    # Report results.
    print('Fastest to slowest execution speeds with {}-bit Python {}.{}.{}'.format(
            64 if sys.maxsize > 2**32 else 32, *sys.version_info[:3]))
    print('  numpy version {}'.format(np.version.full_version))
    print('  Test file size: {:,} KiB'.format(FILE_SIZE // KiB(1)))
    print('  {:,d} executions, best of {:d} repetitions'.format(EXECUTIONS, TIMINGS))
    print()

    longest = max(len(timing.label) for timing in timings)  # Len of longest identifier.
    ranked = sorted(timings, key=attrgetter('exec_time')) # Sort so fastest is first.
    fastest = ranked[0].exec_time
    for rank, timing in enumerate(ranked, 1):
        print('{:<2d} {:>{width}} : {:8.4f} secs, rel speed {:6.2f}x, {:6.2f}% slower '
              '({:6.2f} KiB/sec)'.format(
                    rank,
                    timing.label, timing.exec_time, round(timing.exec_time/fastest, 2),
                    round((timing.exec_time/fastest - 1) * 100, 2),
                    (FILE_SIZE/timing.exec_time) / KiB(1),  # per sec.
                    width=longest))
    print()
    mins, secs = divmod(time.time()-start_time, 60)
    print('Benchmark runtime (min:sec) - {:02d}:{:02d}'.format(int(mins),
                                                               int(round(secs))))

main()
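
For readers who only want the gist of the timing approach, here is a much smaller sketch of the same idea. It is not the benchmark above: the two candidate functions, the in-memory io.BytesIO data and the repeat counts are all illustrative assumptions, and it requires Python 3.3+ for yield from.

```python
import io
import timeit
from collections import deque
from functools import partial

DATA = bytes(range(256)) * 64  # 16 KiB of sample data

def one_at_a_time(data):
    """Return an iterator that reads the stream one byte per call."""
    stream = io.BytesIO(data)
    return iter(partial(stream.read, 1), b"")

def chunked(data, chunksize=8192):
    """Yield individual bytes while reading the stream in larger chunks."""
    stream = io.BytesIO(data)
    for chunk in iter(partial(stream.read, chunksize), b""):
        yield from chunk

def best_time(func):
    # deque(..., maxlen=0) consumes an iterator without storing anything.
    runs = timeit.repeat(lambda: deque(func(DATA), maxlen=0),
                         repeat=3, number=5)
    return min(runs)

timings = {func.__name__: best_time(func) for func in (one_at_a_time, chunked)}
for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print("{:>15}: {:.4f} s".format(name, seconds))
```

Taking the minimum of several repetitions, as timeit's documentation suggests, reduces the influence of other activity on the machine.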

— Martineau

Suppose you do yield from chunk instead of for byte in chunk: yield byte? I'm thinking I should tighten up my answer with that.

— Aaron Hall

@Aaron: There are two versions of your answer in the Python 3 results, and one of them uses yield from.

— martineau,

OK, I've updated my answer. I also suggest you drop enumerate, since the iteration should be understood to complete - if not, last I checked, enumerate has a bit of overhead compared with doing the bookkeeping for the index with += 1, so you might alternatively do the bookkeeping in your own code. Or even pass it to a deque with maxlen=0.

— Aaron Hall

@Aaron: Agreed about the enumerate. Thanks for the feedback. I will be adding an update to my post that doesn't have it (although I don't think it changes the results much). I will also be adding @Rick M.'s numpy-based answer.

— martineau,

A bit more code review: I don't think it makes any sense to write answers for Python 2 at this point - I would consider removing Python 2, as I would expect you to use 64-bit Python 3.7 or 3.8. You could set the cleanup to go at the end with atexit and a partial application. Typo: "verify". I see no sense in the duplication of the test strings - are they at all different? I imagine that if you use super(). instead of tuple. in your __new__ you could use the namedtuple attribute names instead of indexes.

— Aaron Hall

3

If you're looking for something speedy, here's a method I've been using that has worked for years:

from array import array

with open( path, 'rb' ) as file:
    data = array( 'B', file.read() ) # buffer the file

# evaluate its data
for byte in data:
    v = byte # int value
    c = chr(byte)

If you want to iterate chars instead of ints, you can simply use data = file.read(), which should be a bytes() object in py3.

— Tcll

1

'array' is imported with 'from array import array'

— quanly_mc

@quanly_mc Yes, thanks for catching that, and sorry I forgot to include it; editing now.

— Tcll