How do I remove words from a txt file that exist in another txt file?

8

The file a.txt has about 100k words, each word on its own line:

july.cpp
windows.exe
ttm.rar
document.zip

The file b.txt has 150k words, one word per line. Some of the words come from a.txt, but some are new:

july.cpp
NOVEMBER.txt
windows.exe
ttm.rar
document.zip
diary.txt

How can I merge these files into one, remove all duplicated lines, and keep only the lines that are new (lines that exist in a.txt but not in b.txt, and vice versa)?

text-processing

— Kate-Kasia

Would you be happy to use Python?

— Tim

2

@MikołajBartnicki Unix.SE would probably be a better place to ask.

— Glutanimate

1

Kasia, I made a mistake in my answer, which is why I deleted it. I'm working on a new one.

2

@Glutanimate This question is perfectly fine here.

— Seth

1

@Glutanimate Ah, sorry, I somehow missed that comment.

— Seth

13

There is a command for this: comm. As stated in man comm, it is quite simple:

   comm -3 file1 file2
          Print lines in file1 not in file2, and vice versa.

Note that comm expects the contents of the files to be sorted, so you must sort them before calling comm, like this:

sort unsorted-file.txt > sorted-file.txt

So, to sum up:

sort a.txt > as.txt

sort b.txt > bs.txt

comm -3 as.txt bs.txt > result.txt

After running the commands above, the expected lines will be in the result.txt file.
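
For comparison, the same sort-then-compare idea can be sketched in Python. This is only a minimal sketch under a few assumptions: the function name `symmetric_difference_lines` is hypothetical, the inputs fit in memory, and, unlike comm, it collapses lines repeated within one file because it works on sets. It also does not reproduce the tab that `comm -3` puts in front of lines unique to the second file.

```python
def symmetric_difference_lines(a_lines, b_lines):
    """Return, sorted, the lines that occur in exactly one of the two inputs."""
    a = {line.rstrip('\n') for line in a_lines}
    b = {line.rstrip('\n') for line in b_lines}
    # The ^ operator on sets keeps items present in one set but not the other.
    return sorted(a ^ b)


# Sample data taken from the question.
a_lines = ['july.cpp', 'windows.exe', 'ttm.rar', 'document.zip']
b_lines = ['july.cpp', 'NOVEMBER.txt', 'windows.exe',
           'ttm.rar', 'document.zip', 'diary.txt']

print(symmetric_difference_lines(a_lines, b_lines))
# prints ['NOVEMBER.txt', 'diary.txt']
```

Open file objects can be passed in place of the sample lists, since the function strips the trailing newline from each line. Python sorts strings by code point, so uppercase names come first here; `sort` may order them differently depending on the locale.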

Thanks, it works like a charm. P.S. That photo with the pestle on your profile is cool ;-)

— Kate-Kasia

2

Here is a short python3 script, based on Germar's answer, that should accomplish this while preserving the unsorted order of b.txt.

#!/usr/bin/python3

with open('a.txt', 'r') as afile:
    a = set(line.rstrip('\n') for line in afile)

with open('b.txt', 'r') as bfile:
    for line in bfile:
        line = line.rstrip('\n')
        if line not in a:
            print(line)
            # Uncomment the following if you also want to remove duplicates:
            # a.add(line)
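
The script above prints only the lines of b.txt that are missing from a.txt, while the question also asks for the reverse direction. One possible extension, as a sketch: `unique_to_each` is a hypothetical helper name, and it assumes both files fit comfortably in memory.

```python
def unique_to_each(a_lines, b_lines):
    """Return lines found in only one input, keeping each input's original order."""
    a = [line.rstrip('\n') for line in a_lines]
    b = [line.rstrip('\n') for line in b_lines]
    a_set, b_set = set(a), set(b)
    only_a = [line for line in a if line not in b_set]
    only_b = [line for line in b if line not in a_set]
    # dict.fromkeys drops repeated lines while keeping first-seen order.
    return list(dict.fromkeys(only_a + only_b))


# Sample data taken from the question.
a_lines = ['july.cpp', 'windows.exe', 'ttm.rar', 'document.zip']
b_lines = ['july.cpp', 'NOVEMBER.txt', 'windows.exe',
           'ttm.rar', 'document.zip', 'diary.txt']

for line in unique_to_each(a_lines, b_lines):
    print(line)
```

Lines unique to a.txt come first, followed by lines unique to b.txt, each group in the order in which it appeared in its file.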

— Lily Chung

1

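#!/usr/bin/env python3

# Load every line of a.txt into a set so that membership tests stay fast
# even with 100k entries.
with open('a.txt', 'r') as f:
    a = set(f.read().splitlines())

# Print each non-empty line of b.txt that does not occur in a.txt.
with open('b.txt', 'r') as f:
    for line in f:
        b = line.strip('\n ')
        # Skip blank lines instead of stopping at the first one.
        if b and b not in a:
            print(b)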

— Germar

2

Man, you are shooting a mosquito with a naval cannon!

:-) You're right. I missed the 'k' in 100k.

— Germar

1

Take a look at the comm command from coreutils (man comm):

NAME
       comm - compare two sorted files line by line

SYNOPSIS
       comm [OPTION]... FILE1 FILE2

DESCRIPTION
       Compare sorted files FILE1 and FILE2 line by line.

       With  no  options,  produce  three-column  output.  Column one contains
       lines unique to FILE1, column two contains lines unique to  FILE2,  and
       column three contains lines common to both files.

       -1     suppress column 1 (lines unique to FILE1)

       -2     suppress column 2 (lines unique to FILE2)

       -3     suppress column 3 (lines that appear in both files)

So, for example, you can do:

$ comm -13 <(sort a.txt) <(sort b.txt)
diary.txt
NOVEMBER.txt

(lines unique to b.txt)
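
To see why comm needs sorted input, its three-column comparison can be sketched as a single merge pass over two sorted lists. This is an illustration only, not the coreutils implementation: `comm_columns` is a hypothetical name, and strings are compared by code point, which for ASCII text matches the order produced by `LC_ALL=C sort` rather than a locale-aware sort.

```python
def comm_columns(sorted_a, sorted_b):
    """Split two sorted lists into (only in a, only in b, in both)."""
    only_a, only_b, both = [], [], []
    i = j = 0
    while i < len(sorted_a) and j < len(sorted_b):
        if sorted_a[i] == sorted_b[j]:
            both.append(sorted_a[i])
            i += 1
            j += 1
        elif sorted_a[i] < sorted_b[j]:
            # Everything left in sorted_b is larger, so this line is unique to a.
            only_a.append(sorted_a[i])
            i += 1
        else:
            # Everything left in sorted_a is larger, so this line is unique to b.
            only_b.append(sorted_b[j])
            j += 1
    # Whatever remains in either list has no partner in the other.
    only_a.extend(sorted_a[i:])
    only_b.extend(sorted_b[j:])
    return only_a, only_b, both


# Sample data taken from the question, sorted first as comm requires.
a = sorted(['july.cpp', 'windows.exe', 'ttm.rar', 'document.zip'])
b = sorted(['july.cpp', 'NOVEMBER.txt', 'windows.exe',
            'ttm.rar', 'document.zip', 'diary.txt'])

only_a, only_b, both = comm_columns(a, b)
print(only_b)  # the counterpart of comm -13: lines unique to b
```

Because each list is walked once from front to back, the comparison is only correct when both inputs are sorted in the same order. With unsorted input, matching lines can be passed over and wrongly reported as unique.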

— steeldriver