How do I find all files in a directory that contain a UTF-8 BOM (byte order mark)?

8

On Windows, I need to find all files in a directory that contain a UTF-8 BOM (byte order mark). Which tool can do that, and how?

It can be a PowerShell script, some text editor's advanced search feature, or anything else.

windows search utf-8

— Borek Bernard
source

15

Here is an example PowerShell script. It looks in the C:\ path for any files whose first 3 bytes are 0xEF, 0xBB, 0xBF.

Function ContainsBOM
{   
    return $input | where {
        $contents = [System.IO.File]::ReadAllBytes($_.FullName)
        $_.Length -gt 2 -and $contents[0] -eq 0xEF -and $contents[1] -eq 0xBB -and $contents[2] -eq 0xBF }
}

get-childitem "C:\*.*" | where {!$_.PsIsContainer } | ContainsBOM

Is "ReadAllBytes" necessary? Maybe reading just the first few bytes would perform better?

Valid point. Here is an updated version that reads only the first 3 bytes.

Function ContainsBOM
{   
    return $input | where {
        $contents = new-object byte[] 3
        $stream = [System.IO.File]::OpenRead($_.FullName)
        $stream.Read($contents, 0, 3) | Out-Null
        $stream.Close()
        $contents[0] -eq 0xEF -and $contents[1] -eq 0xBB -and $contents[2] -eq 0xBF }
}

get-childitem "C:\*.*" | where {!$_.PsIsContainer -and $_.Length -gt 2 } | ContainsBOM
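
The same check can be sketched in Python, the language the Notepad++ answer on this page uses. This is only an illustration, not part of the original script: the names `has_utf8_bom` and `find_bom_files` are made up here, and the sketch looks at a single directory without descending into subdirectories.

```python
import codecs
import os


def has_utf8_bom(path):
    """Return True if the file starts with the bytes EF BB BF."""
    with open(path, "rb") as handle:
        # Read at most 3 bytes; shorter files simply return fewer bytes.
        return handle.read(3) == codecs.BOM_UTF8


def find_bom_files(directory):
    """List the files directly inside `directory` that start with a UTF-8 BOM."""
    matches = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and has_utf8_bom(path):
            matches.append(path)
    return matches
```

Like the updated PowerShell version, it opens each file and reads only the first 3 bytes, so large files cost no more than small ones.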

— vcsjones
source

1

Cool. Before I mark it as the answer, is "ReadAllBytes" necessary? Maybe reading just the first few bytes would perform better?

— Borek Bernard

@Borek See the edit.

— vcsjones

2

This saved my day! I also learned that get-childitem -recurse handles subdirectories as well.

— Diynevala
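
For readers following along in Python rather than PowerShell, the recursive search mentioned in this comment might look roughly like the sketch below. The function name `find_bom_files_recursive` is hypothetical and the code is not taken from any answer here; it relies on `os.walk` to visit subdirectories.

```python
import os

UTF8_BOM = b"\xef\xbb\xbf"


def find_bom_files_recursive(root):
    """Yield every file under `root`, at any depth, that starts with a UTF-8 BOM."""
    for folder, _subdirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(folder, name)
            with open(path, "rb") as handle:
                # Only the first 3 bytes are needed to recognise the BOM.
                if handle.read(3) == UTF8_BOM:
                    yield path
```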

I was wondering if there is a way to remove the BOMs using the script above?

— tom_mai78101

2

As a side note, here is a PowerShell script I use to strip the UTF-8 BOM from my source files:

# Rewrite every *.h and *.cpp file under the current directory without its UTF-8 BOM.
$files = Get-ChildItem -Path . -Include @("*.h","*.cpp") -Recurse
foreach ($f in $files)
{
    (Get-Content $f.PSPath) |
        ForEach-Object { $_ -replace "\xEF\xBB\xBF", "" } |
        Set-Content $f.PSPath
}
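
For comparison, here is a byte-level sketch in Python that removes only the three BOM bytes and writes the rest of the file back unchanged. It is an illustration under assumptions, not a replacement for the script above: `strip_utf8_bom` is a made-up name, and the function reads the whole file into memory, which is reasonable for source files but not for very large ones.

```python
import codecs


def strip_utf8_bom(path):
    """Remove a leading UTF-8 BOM from `path`; return True if the file was changed."""
    with open(path, "rb") as handle:
        data = handle.read()
    if not data.startswith(codecs.BOM_UTF8):
        return False  # no BOM, leave the file untouched
    with open(path, "wb") as handle:
        # Write back everything after the 3 BOM bytes, byte for byte.
        handle.write(data[len(codecs.BOM_UTF8):])
    return True
```

Files without a BOM are not rewritten at all, so their timestamps stay as they were.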

— Scott Smith
source

I just received a set of files that differed only in that some had a BOM and others did not. Your answer was exactly what I needed to clean it all up. Thanks!

— Tevya

1

If you are on a corporate computer (like me) with restricted privileges and cannot run PowerShell scripts, you can use a portable Notepad++ with the PythonScript plugin to do the job, with the following script:

import os

# Runs inside Notepad++'s PythonScript plugin, which provides `notepad` and `console`.
filePathSrc = "C:\\Temp\\UTF8"
# Binary file types to leave untouched (compared case-insensitively).
skipExtensions = ('.jar', '.ear', '.gif', '.jpg', '.jpeg', '.xls', '.png', '.cab', '.ico')
for root, dirs, files in os.walk(filePathSrc):
    for fn in files:
        if not fn.lower().endswith(skipExtensions):
            path = os.path.join(root, fn)
            notepad.open(path)
            console.write(path + "\r\n")
            notepad.runMenuCommand("Encoding", "Convert to UTF-8 without BOM")
            notepad.save()
            notepad.close()

Credit goes to https://pw999.wordpress.com/2013/08/19/mass-convert-a-project-to-utf-8-using-notepad/

— Hoàng Long
source