Byte order mark screws up file reading in Java

107

I'm trying to read CSV files using Java. Some of the files may have a byte order mark at the beginning, but not all. When present, the byte order mark gets read along with the rest of the first line, which causes problems with string comparisons.

Is there an easy way to skip the byte order mark when it is present?

Thanks!

java utf-8 byte-order-mark

— Tom

maybe: rgagnon.com/javadetails/java-handle-utf8-file-with-bom.html

— Chris

114

EDIT: I've made a proper release on GitHub: https://github.com/gpakosz/UnicodeBOMInputStream

Here is a class I coded a while ago; I just edited the package name before pasting. Nothing special, it is quite similar to the solutions posted in SUN's bug database. Incorporate it in your code and you're fine.

/* ____________________________________________________________________________
 * 
 * File:    UnicodeBOMInputStream.java
 * Author:  Gregory Pakosz.
 * Date:    02 - November - 2005    
 * ____________________________________________________________________________
 */
package com.stackoverflow.answer;

import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

/**
 * The <code>UnicodeBOMInputStream</code> class wraps any
 * <code>InputStream</code> and detects the presence of any Unicode BOM
 * (Byte Order Mark) at its beginning, as defined by
 * <a href="http://www.faqs.org/rfcs/rfc3629.html">RFC 3629 - UTF-8, a transformation format of ISO 10646</a>
 * 
 * <p>The
 * <a href="http://www.unicode.org/unicode/faq/utf_bom.html">Unicode FAQ</a>
 * defines 5 types of BOMs:<ul>
 * <li><pre>00 00 FE FF  = UTF-32, big-endian</pre></li>
 * <li><pre>FF FE 00 00  = UTF-32, little-endian</pre></li>
 * <li><pre>FE FF        = UTF-16, big-endian</pre></li>
 * <li><pre>FF FE        = UTF-16, little-endian</pre></li>
 * <li><pre>EF BB BF     = UTF-8</pre></li>
 * </ul></p>
 * 
 * <p>Use the {@link #getBOM()} method to know whether a BOM has been detected
 * or not.
 * </p>
 * <p>Use the {@link #skipBOM()} method to remove the detected BOM from the
 * wrapped <code>InputStream</code> object.</p>
 */
public class UnicodeBOMInputStream extends InputStream
{
  /**
   * Type safe enumeration class that describes the different types of Unicode
   * BOMs.
   */
  public static final class BOM
  {
    /**
     * NONE.
     */
    public static final BOM NONE = new BOM(new byte[]{},"NONE");

    /**
     * UTF-8 BOM (EF BB BF).
     */
    public static final BOM UTF_8 = new BOM(new byte[]{(byte)0xEF,
                                                       (byte)0xBB,
                                                       (byte)0xBF},
                                            "UTF-8");

    /**
     * UTF-16, little-endian (FF FE).
     */
    public static final BOM UTF_16_LE = new BOM(new byte[]{ (byte)0xFF,
                                                            (byte)0xFE},
                                                "UTF-16 little-endian");

    /**
     * UTF-16, big-endian (FE FF).
     */
    public static final BOM UTF_16_BE = new BOM(new byte[]{ (byte)0xFE,
                                                            (byte)0xFF},
                                                "UTF-16 big-endian");

    /**
     * UTF-32, little-endian (FF FE 00 00).
     */
    public static final BOM UTF_32_LE = new BOM(new byte[]{ (byte)0xFF,
                                                            (byte)0xFE,
                                                            (byte)0x00,
                                                            (byte)0x00},
                                                "UTF-32 little-endian");

    /**
     * UTF-32, big-endian (00 00 FE FF).
     */
    public static final BOM UTF_32_BE = new BOM(new byte[]{ (byte)0x00,
                                                            (byte)0x00,
                                                            (byte)0xFE,
                                                            (byte)0xFF},
                                                "UTF-32 big-endian");

    /**
     * Returns a <code>String</code> representation of this <code>BOM</code>
     * value.
     */
    public final String toString()
    {
      return description;
    }

    /**
     * Returns the bytes corresponding to this <code>BOM</code> value.
     */
    public final byte[] getBytes()
    {
      final int     length = bytes.length;
      final byte[]  result = new byte[length];

      // Make a defensive copy
      System.arraycopy(bytes,0,result,0,length);

      return result;
    }

    private BOM(final byte bom[], final String description)
    {
      assert(bom != null)               : "invalid BOM: null is not allowed";
      assert(description != null)       : "invalid description: null is not allowed";
      assert(description.length() != 0) : "invalid description: empty string is not allowed";

      this.bytes          = bom;
      this.description  = description;
    }

            final byte    bytes[];
    private final String  description;

  } // BOM

  /**
   * Constructs a new <code>UnicodeBOMInputStream</code> that wraps the
   * specified <code>InputStream</code>.
   * 
   * @param inputStream an <code>InputStream</code>.
   * 
   * @throws NullPointerException when <code>inputStream</code> is
   * <code>null</code>.
   * @throws IOException on reading from the specified <code>InputStream</code>
   * when trying to detect the Unicode BOM.
   */
  public UnicodeBOMInputStream(final InputStream inputStream) throws  NullPointerException,
                                                                      IOException

  {
    if (inputStream == null)
      throw new NullPointerException("invalid input stream: null is not allowed");

    in = new PushbackInputStream(inputStream,4);

    final byte  bom[] = new byte[4];
    final int   read  = in.read(bom);

    switch(read)
    {
      case 4:
        if ((bom[0] == (byte)0xFF) &&
            (bom[1] == (byte)0xFE) &&
            (bom[2] == (byte)0x00) &&
            (bom[3] == (byte)0x00))
        {
          this.bom = BOM.UTF_32_LE;
          break;
        }
        else
        if ((bom[0] == (byte)0x00) &&
            (bom[1] == (byte)0x00) &&
            (bom[2] == (byte)0xFE) &&
            (bom[3] == (byte)0xFF))
        {
          this.bom = BOM.UTF_32_BE;
          break;
        }

      case 3:
        if ((bom[0] == (byte)0xEF) &&
            (bom[1] == (byte)0xBB) &&
            (bom[2] == (byte)0xBF))
        {
          this.bom = BOM.UTF_8;
          break;
        }

      case 2:
        if ((bom[0] == (byte)0xFF) &&
            (bom[1] == (byte)0xFE))
        {
          this.bom = BOM.UTF_16_LE;
          break;
        }
        else
        if ((bom[0] == (byte)0xFE) &&
            (bom[1] == (byte)0xFF))
        {
          this.bom = BOM.UTF_16_BE;
          break;
        }

      default:
        this.bom = BOM.NONE;
        break;
    }

    if (read > 0)
      in.unread(bom,0,read);
  }

  /**
   * Returns the <code>BOM</code> that was detected in the wrapped
   * <code>InputStream</code> object.
   * 
   * @return a <code>BOM</code> value.
   */
  public final BOM getBOM()
  {
    // BOM type is immutable.
    return bom;
  }

  /**
   * Skips the <code>BOM</code> that was found in the wrapped
   * <code>InputStream</code> object.
   * 
   * @return this <code>UnicodeBOMInputStream</code>.
   * 
   * @throws IOException when trying to skip the BOM from the wrapped
   * <code>InputStream</code> object.
   */
  public final synchronized UnicodeBOMInputStream skipBOM() throws IOException
  {
    if (!skipped)
    {
      in.skip(bom.bytes.length);
      skipped = true;
    }
    return this;
  }

  /**
   * {@inheritDoc}
   */
  public int read() throws IOException
  {
    return in.read();
  }

  /**
   * {@inheritDoc}
   */
  public int read(final byte b[]) throws  IOException,
                                          NullPointerException
  {
    return in.read(b,0,b.length);
  }

  /**
   * {@inheritDoc}
   */
  public int read(final byte b[],
                  final int off,
                  final int len) throws IOException,
                                        NullPointerException
  {
    return in.read(b,off,len);
  }

  /**
   * {@inheritDoc}
   */
  public long skip(final long n) throws IOException
  {
    return in.skip(n);
  }

  /**
   * {@inheritDoc}
   */
  public int available() throws IOException
  {
    return in.available();
  }

  /**
   * {@inheritDoc}
   */
  public void close() throws IOException
  {
    in.close();
  }

  /**
   * {@inheritDoc}
   */
  public synchronized void mark(final int readlimit)
  {
    in.mark(readlimit);
  }

  /**
   * {@inheritDoc}
   */
  public synchronized void reset() throws IOException
  {
    in.reset();
  }

  /**
   * {@inheritDoc}
   */
  public boolean markSupported() 
  {
    return in.markSupported();
  }

  private final PushbackInputStream in;
  private final BOM                 bom;
  private       boolean             skipped = false;

} // UnicodeBOMInputStream

And you use it this way:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

public final class UnicodeBOMInputStreamUsage
{
  public static void main(final String[] args) throws Exception
  {
    FileInputStream fis = new FileInputStream("test/offending_bom.txt");
    UnicodeBOMInputStream ubis = new UnicodeBOMInputStream(fis);

    System.out.println("detected BOM: " + ubis.getBOM());

    System.out.print("Reading the content of the file without skipping the BOM: ");
    InputStreamReader isr = new InputStreamReader(ubis);
    BufferedReader br = new BufferedReader(isr);

    System.out.println(br.readLine());

    br.close();
    isr.close();
    ubis.close();
    fis.close();

    fis = new FileInputStream("test/offending_bom.txt");
    ubis = new UnicodeBOMInputStream(fis);
    isr = new InputStreamReader(ubis);
    br = new BufferedReader(isr);

    ubis.skipBOM();

    System.out.print("Reading the content of the file after skipping the BOM: ");
    System.out.println(br.readLine());

    br.close();
    isr.close();
    ubis.close();
    fis.close();
  }

} // UnicodeBOMInputStreamUsage

— Gregory Pakosz

2

Sorry about the long scrolling areas,

— too bad that

Thanks Gregory, that's just what I'm looking for.

— Tom

3

This should be in the core Java API

— Denis Kniazhev

7

10 years have passed and I'm still receiving karma for this :D I'm looking at you, Java!

— Gregory Pakosz

1

Upvoted because the answer provides history on why the file input stream doesn't offer the option to discard the BOM by default.

— MxLDevs

95

The Apache Commons IO library has an InputStream that can detect and discard BOMs: BOMInputStream (javadoc):

BOMInputStream bomIn = new BOMInputStream(in);
int firstNonBOMByte = bomIn.read(); // Skips BOM
if (bomIn.hasBOM()) {
    // has a UTF-8 BOM
}

If you also need to detect different encodings, it can also distinguish among various byte order marks, e.g. UTF-8 vs. UTF-16 big + little endian - details at the doc link above. You can then use the detected ByteOrderMark to choose a Charset to decode the stream. (There's probably a more streamlined way to do this if you need all of this functionality - maybe the UnicodeReader in BalusC's answer?) Note that, in general, there's not a very good way to detect what encoding some bytes are in, but if the stream starts with a BOM, apparently this can be helpful.

Edit: if you need to detect the BOM in UTF-16, UTF-32, etc., then the constructor should be:

new BOMInputStream(is, ByteOrderMark.UTF_8, ByteOrderMark.UTF_16BE,
        ByteOrderMark.UTF_16LE, ByteOrderMark.UTF_32BE, ByteOrderMark.UTF_32LE)

Upvote @martin-charlesworth's comment :)

— rescdsk

It just skips the BOM. Should be the perfect solution for 99% of the use cases.

— atamanroman

7

I used this answer successfully. However, I'd respectfully add the boolean argument to specify whether to include or exclude the BOM. Example: BOMInputStream bomIn = new BOMInputStream(in, false); // don't include the BOM

— Kevin Meredith

19

I would also add that this only detects the UTF-8 BOM. If you want to detect all UTF-X BOMs, you need to pass them in to the BOMInputStream constructor.

BOMInputStream bomIn = new BOMInputStream(is, ByteOrderMark.UTF_8, ByteOrderMark.UTF_16BE,
        ByteOrderMark.UTF_16LE, ByteOrderMark.UTF_32BE, ByteOrderMark.UTF_32LE);

— Martin Charlesworth

Regarding @KevinMeredith's comment, I want to point out that the constructor with the boolean is clearer, but the default constructor already gets rid of the UTF-8 BOM, as the JavaDoc suggests: BOMInputStream(InputStream delegate) Constructs a new BOM InputStream that excludes a ByteOrderMark.UTF_8 BOM.

— WesternGun

Skipping solves most of my problems. If my file starts with a UTF_16BE BOM, can I create an InputReader by skipping the BOM and reading the file as UTF_8? So far it works, but I want to understand whether there is any edge case. Thanks in advance.

— Bhaskar
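
On the UTF_16BE question above: skipping the BOM does not change how the remaining bytes are encoded, so they still need to be decoded as UTF-16BE. The following is only a sketch using the JDK; the class and method names are made up for illustration. It shows what happens to the text "Hi" when the 2-byte BOM is dropped and the rest is decoded as UTF-8 instead.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not taken from any answer in this thread.
final class WrongCharsetSketch {

    /** Drops the 2-byte UTF-16BE BOM, then decodes the rest as UTF-8 (the wrong charset). */
    static String decodeAsUtf8AfterSkippingBom(byte[] utf16beWithBom) {
        return new String(utf16beWithBom, 2, utf16beWithBom.length - 2, StandardCharsets.UTF_8);
    }

    /** Drops the 2-byte BOM and decodes the rest with the charset the BOM announced. */
    static String decodeAsUtf16be(byte[] utf16beWithBom) {
        return new String(utf16beWithBom, 2, utf16beWithBom.length - 2, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        // Bytes: FE FF 00 48 00 69
        byte[] file = "\uFEFFHi".getBytes(StandardCharsets.UTF_16BE);

        String wrong = decodeAsUtf8AfterSkippingBom(file);
        String right = decodeAsUtf16be(file);

        // Every 0x00 byte becomes a NUL character, so "Hi" turns into 4 chars.
        System.out.println("UTF-8 decode length:    " + wrong.length());
        System.out.println("UTF-16BE decode length: " + right.length());
    }
}
```

Under this sketch, ASCII text survives only as letters interleaved with NUL characters, which can go unnoticed when printed but breaks string comparisons.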

31

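Simpler solution:

import java.io.IOException;
import java.io.Reader;

public class BOMSkipper
{
    // Consumes a leading U+FEFF if present. The reader must support
    // mark/reset (e.g. BufferedReader).
    public static void skip(Reader reader) throws IOException
    {
        reader.mark(1);
        char[] possibleBOM = new char[1];
        reader.read(possibleBOM);

        // Not a BOM: rewind so the first character is read again later.
        if (possibleBOM[0] != '\ufeff')
        {
            reader.reset();
        }
    }
}

Usage example: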

BufferedReader input = new BufferedReader(new InputStreamReader(new FileInputStream(file), fileExpectedCharset));
BOMSkipper.skip(input);
//Now UTF prefix not present:
input.readLine();
...

Works with all 5 UTF encodings!

1

Very clever, Andrei. But could you explain why it works? How does the pattern 0xFEFF successfully match UTF-8 files, which seem to have a different pattern and 3 bytes instead of 2? And how can that pattern match both endiannesses of UTF-16 and UTF-32?

— Vahid Pazirandeh

1

As you can see, I don't use a byte stream but a character stream opened with the expected charset. So if the first character of this stream is the BOM, I skip it. The BOM can have a different byte representation in each encoding, but it is a single character. Please read this article, it helped me: joelonsoftware.com/articles/Unicode.html

Nice solution, just make sure to check that the file is not empty, to avoid an IOException in the skip method before reading. You can do that by calling if (reader.ready()) { reader.read(possibleBOM) ... }

— Snow
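
For the empty-input case raised above, one alternative to reader.ready() is to look at the return value of read() rather than at the buffer. Below is a sketch of that variant; the class name SafeBomSkipper is hypothetical, and it assumes the Reader supports mark/reset (BufferedReader and StringReader do).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical variant of BOMSkipper that checks the result of read().
final class SafeBomSkipper {

    private static final int BOM = 0xFEFF;

    /** Consumes a leading U+FEFF if present; otherwise leaves the reader where it was. */
    static void skip(Reader reader) throws IOException {
        if (!reader.markSupported()) {
            throw new IllegalArgumentException("reader must support mark/reset");
        }
        reader.mark(1);
        int first = reader.read(); // -1 when the stream is empty
        if (first != -1 && first != BOM) {
            // A real first character was consumed: rewind so it is read again.
            reader.reset();
        }
    }

    public static void main(String[] args) throws IOException {
        Reader withBom = new StringReader("\uFEFFid,name");
        skip(withBom);
        System.out.println((char) withBom.read()); // first character after the BOM

        Reader empty = new StringReader("");
        skip(empty); // nothing to skip, nothing to rewind
        System.out.println(empty.read());
    }
}
```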

I see you've got 0xFE 0xFF covered, which is the byte order mark for UTF-16BE. But what if the first 3 bytes are 0xEF 0xBB 0xBF (the byte order mark for UTF-8)? You claim this works for all UTF-8 formats. That could be true (I haven't tested your code), but then how does it work?

— bvdb

1

See my answer to Vahid: I don't open a byte stream but a character stream, and I read one character from it. It doesn't matter which UTF encoding is used for the file - the BOM prefix can be represented by a different number of bytes, but in terms of characters it is just one character.
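
The point made in these replies can be checked with a short sketch. It is limited to the three charsets available in StandardCharsets (UTF-8, UTF-16BE, UTF-16LE), and the class and method names are invented for the illustration. U+FEFF takes a different number of bytes in each charset, yet it decodes back to one and the same char.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; covers only three of the encodings discussed.
final class BomIsOneCharSketch {

    /** Number of bytes U+FEFF occupies when encoded with the given charset. */
    static int bomByteLength(Charset charset) {
        return "\uFEFF".getBytes(charset).length;
    }

    /** Encodes U+FEFF + text, decodes it with the same charset, returns the first char. */
    static char firstCharAfterRoundTrip(String text, Charset charset) {
        byte[] bytes = ("\uFEFF" + text).getBytes(charset);
        return new String(bytes, charset).charAt(0);
    }

    public static void main(String[] args) {
        Charset[] charsets = {
            StandardCharsets.UTF_8, StandardCharsets.UTF_16BE, StandardCharsets.UTF_16LE
        };
        for (Charset cs : charsets) {
            System.out.printf("%s: BOM is %d bytes, first decoded char is U+%04X%n",
                    cs, bomByteLength(cs), (int) firstCharAfterRoundTrip("abc", cs));
        }
    }
}
```

This is why comparing the first decoded char with '\ufeff' is enough, provided the reader was opened with the file's actual charset.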

24

Google Data API has a UnicodeReader which automatically detects the encoding.

You can use it instead of InputStreamReader. Here's a slightly compacted extract of its source, which is pretty straightforward:

public class UnicodeReader extends Reader {
    private static final int BOM_SIZE = 4;
    private final InputStreamReader reader;

    /**
     * Construct UnicodeReader
     * @param in Input stream.
     * @param defaultEncoding Default encoding to be used if BOM is not found,
     * or <code>null</code> to use system default encoding.
     * @throws IOException If an I/O error occurs.
     */
    public UnicodeReader(InputStream in, String defaultEncoding) throws IOException {
        byte bom[] = new byte[BOM_SIZE];
        String encoding;
        int unread;
        PushbackInputStream pushbackStream = new PushbackInputStream(in, BOM_SIZE);
        int n = pushbackStream.read(bom, 0, bom.length);

        // Read ahead four bytes and check for BOM marks.
        if ((bom[0] == (byte) 0xEF) && (bom[1] == (byte) 0xBB) && (bom[2] == (byte) 0xBF)) {
            encoding = "UTF-8";
            unread = n - 3;
        } else if ((bom[0] == (byte) 0xFE) && (bom[1] == (byte) 0xFF)) {
            encoding = "UTF-16BE";
            unread = n - 2;
        } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)) {
            encoding = "UTF-16LE";
            unread = n - 2;
        } else if ((bom[0] == (byte) 0x00) && (bom[1] == (byte) 0x00) && (bom[2] == (byte) 0xFE) && (bom[3] == (byte) 0xFF)) {
            encoding = "UTF-32BE";
            unread = n - 4;
        } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE) && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00)) {
            encoding = "UTF-32LE";
            unread = n - 4;
        } else {
            encoding = defaultEncoding;
            unread = n;
        }

        // Unread bytes if necessary and skip BOM marks.
        if (unread > 0) {
            pushbackStream.unread(bom, (n - unread), unread);
        } else if (unread < -1) {
            pushbackStream.unread(bom, 0, 0);
        }

        // Use given encoding.
        if (encoding == null) {
            reader = new InputStreamReader(pushbackStream);
        } else {
            reader = new InputStreamReader(pushbackStream, encoding);
        }
    }

    public String getEncoding() {
        return reader.getEncoding();
    }

    public int read(char[] cbuf, int off, int len) throws IOException {
        return reader.read(cbuf, off, len);
    }

    public void close() throws IOException {
        reader.close();
    }
}

— BalusC

It seems the link says the Google Data API is deprecated? Where should one look for the Google Data API now?

— SOUser

1

@XichenLi: The GData API has been deprecated for its intended purpose. I didn't intend to suggest using the GData API directly (the OP isn't using any GData service); I meant that you can take the source code as an example for your own implementation. That's also why I included it in my answer, ready for copy-paste.

— BalusC

There is a bug in this. The UTF-32LE case is unreachable. For (bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE) && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00) to be true, the UTF-16LE case ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)) would already have matched.

— Joshua Taylor

Since this code comes from the Google Data API, I posted Issue 471 about it.

— Joshua Taylor
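
One way to avoid the unreachable branch described above is to test the 4-byte marks before the 2-byte ones. The sketch below shows only that ordering; it is not the GData fix, and the class name BomOrderSketch and the method detect are hypothetical.

```java
// Hypothetical helper illustrating the check order; not GData code.
final class BomOrderSketch {

    /**
     * Returns the encoding implied by a leading BOM, or null if none is recognised.
     * head holds the first bytes of the stream, n says how many of them are valid.
     */
    static String detect(byte[] head, int n) {
        // Longest marks first: FF FE 00 00 (UTF-32LE) also starts with FF FE (UTF-16LE).
        if (n >= 4 && matches(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (n >= 4 && matches(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (n >= 3 && matches(head, 0xEF, 0xBB, 0xBF))       return "UTF-8";
        if (n >= 2 && matches(head, 0xFE, 0xFF))             return "UTF-16BE";
        if (n >= 2 && matches(head, 0xFF, 0xFE))             return "UTF-16LE";
        return null;
    }

    private static boolean matches(byte[] data, int... prefix) {
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf32le = {(byte) 0xFF, (byte) 0xFE, 0x00, 0x00};
        byte[] utf16le = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00};
        System.out.println(detect(utf32le, 4));
        System.out.println(detect(utf16le, 4));
    }
}
```

The trade-off of this ordering is that a UTF-16LE stream whose first character after the BOM is U+0000 would be reported as UTF-32LE.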

13

The Apache Commons IO library's BOMInputStream has already been mentioned by @rescdsk, but I did not see any mention of how to get an InputStream without the BOM.

Here's how I did it in Scala.

 import java.io._
 import org.apache.commons.io.input.BOMInputStream

 val file = new File(path_to_xml_file_with_BOM)
 val fileInpStream = new FileInputStream(file)
 val bomIn = new BOMInputStream(fileInpStream,
         false) // false means don't include BOM

— Kevin Meredith

The single-arg constructor does this: public BOMInputStream(InputStream delegate) { this(delegate, false, ByteOrderMark.UTF_8); }. It excludes the UTF-8 BOM by default.

— Vladimir Vagaytsev

Good point, Vladimir. I see that in its docs - commons.apache.org/proper/commons-io/javadocs/api-2.2/org/… : Constructs a new BOM InputStream that excludes a ByteOrderMark.UTF_8 BOM.

— Kevin Meredith

4

To simply remove the BOM characters from your file, I recommend using Apache Commons IO

public BOMInputStream(InputStream delegate,
              boolean include)
Constructs a new BOM InputStream that detects a a ByteOrderMark.UTF_8 and optionally includes it.
Parameters:
delegate - the InputStream to delegate to
include - true to include the UTF-8 BOM or false to exclude it

Set include to false and your BOM characters will be excluded.

— Andreas Baaserud

2

Unfortunately not. You'll have to identify and skip it yourself. This page details what you have to watch for. Also see this SO question for more details.

— Brian Agnew

1

I had the same problem, and because I wasn't reading in a bunch of files I went for a simpler solution. I think my encoding was UTF-8, because when I printed out the offending character with the help of this page: Get unicode value of a character, I found that it was \ufeff. I used the code System.out.println( "\\u" + Integer.toHexString(str.charAt(0) | 0x10000).substring(1) ); to print out the offending Unicode value.

Once I had the offending Unicode value, I replaced it in the first line of my file before I went on reading. The business logic of that section:

String str = reader.readLine().trim();
str = str.replace("\ufeff", "");

This fixed my problem. I was then able to go on processing the file with no issue. I added trim() just in case of leading or trailing whitespace; you can do that or not, based on your specific needs.

— Amy B Higgins
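
A slightly narrower variant of the same idea removes U+FEFF only when it is the very first character of the first line, and also copes with an empty file. This is a sketch under those assumptions; the class name FirstLineBomSketch and the method readFirstLine are made up for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical helper; strips a BOM from the start of the first line only.
final class FirstLineBomSketch {

    /** Reads the first line and drops one leading U+FEFF. Returns null on empty input. */
    static String readFirstLine(BufferedReader reader) throws IOException {
        String line = reader.readLine();
        if (line != null && line.startsWith("\uFEFF")) {
            line = line.substring(1);
        }
        return line;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader reader = new BufferedReader(new StringReader("\uFEFFid,name\n1,Tom"));
        System.out.println(readFirstLine(reader)); // header line without the BOM
        System.out.println(reader.readLine());     // later lines are read as usual
    }
}
```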

1

It didn't work for me, but I used .replaceFirst("\u00EF\u00BB\u00BF", "") which did work.

— StackUMan
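
A possible explanation for the difference (an assumption, since the comment does not say which charset was used to read the file): when a UTF-8 file is decoded with a single-byte charset such as ISO-8859-1, the three BOM bytes EF BB BF become three separate characters, so "\ufeff" never shows up in the string. The sketch below reproduces that situation; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo of a UTF-8 BOM decoded with the wrong, single-byte charset.
final class MisdecodedBomSketch {

    /** Decodes raw bytes as ISO-8859-1, where every byte maps to exactly one char. */
    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    /** Removes the three chars a UTF-8 BOM turns into under ISO-8859-1, if they lead the string. */
    static String stripMisdecodedBom(String s) {
        return s.replaceFirst("^\u00EF\u00BB\u00BF", "");
    }

    public static void main(String[] args) {
        // Bytes: EF BB BF 61 2C 62
        byte[] utf8File = "\uFEFFa,b".getBytes(StandardCharsets.UTF_8);

        String misdecoded = decodeLatin1(utf8File);
        System.out.println(misdecoded.contains("\uFEFF"));   // the single BOM char is absent
        System.out.println(stripMisdecodedBom(misdecoded));  // text without the three stray chars
    }
}
```

If this is what happened, opening the reader with UTF-8 in the first place would be the cleaner fix, because non-ASCII text in the rest of the file is mis-decoded in the same way.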