How can I split a text into sentences?

108

I have a text file. I need a list of sentences.

How can this be implemented? There are many subtleties, such as a period being used in abbreviations.

My old regular expression works badly:

re.compile('(\. |^|!|\?)([A-Z][^;↑\.<>@\^&/\[\]]*(\.|!|\?) )',re.M)

python text split

— Artyom

18

Define "sentence".

— martineau

I want to do this, but I want to split wherever there is either a period or a newline.

— yishairasowsky
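
For the literal reading of the comment above (split at every period or newline, with no special handling of abbreviations), a minimal sketch could look like this. The function name is made up for illustration, and dropping empty pieces is an assumption about what is wanted.

```python
import re


def split_on_period_or_newline(text):
    """Split wherever a period or a newline occurs, dropping empty pieces."""
    # [.\n]+ treats runs such as ".\n" or "..." as a single boundary.
    pieces = re.split(r"[.\n]+", text)
    return [piece.strip() for piece in pieces if piece.strip()]


print(split_on_period_or_newline("First line. Second part\nThird part."))
```

Note that this discards the periods themselves and will happily cut "Mr. Smith" in two, which is exactly the subtlety the question is about.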

152

The Natural Language Toolkit (nltk.org) has what you need. This group posting indicates this does it:

import nltk.data

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
with open("test.txt") as fp:
    data = fp.read()
print('\n-----\n'.join(tokenizer.tokenize(data)))

(I haven't tried it!)

— Ned Batchelder

3

@Artyom: It can probably work with Russian - see can NLTK/pyNLTK work "per language" (i.e. non-English), and how?.

— martineau

4

@Artyom: here is a direct link to the online documentation for nltk.tokenize.punkt.PunktSentenceTokenizer.

— martineau

10

You might have to run nltk.download() first and download the models -> punkt

— Martin Thoma

2

This fails on cases with ending quotation marks. If we have a sentence that ends like "this."

— Fosa

1

Okay, you convinced me. But I just tested it and it does not seem to fail. My input is

'This fails on cases with ending quotation marks. If we have a sentence that ends like "this." This is another sentence.'

and my output is

['This fails on cases with ending quotation marks.',  'If we have a sentence that ends like "this."',  'This is another sentence.']

Seems correct to me.

— szedjani

101

This function can split the entire text of Huckleberry Finn into sentences in about 0.1 seconds and handles many of the more painful edge cases that make sentence parsing non-trivial, e.g. "Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in Israel before joining Nike Inc. as an engineer. He also worked at craigslist.org as a business analyst."

# -*- coding: utf-8 -*-
import re
alphabets= "([A-Za-z])"
prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = "[.](com|net|org|io|gov)"

def split_into_sentences(text):
    text = " " + text + "  "
    text = text.replace("\n"," ")
    text = re.sub(prefixes,"\\1<prd>",text)
    text = re.sub(websites,"<prd>\\1",text)
    if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
    text = re.sub(r"\s" + alphabets + "[.] "," \\1<prd> ",text)
    text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>\\3<prd>",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>",text)
    text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text)
    text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text)
    text = re.sub(" " + alphabets + "[.]"," \\1<prd>",text)
    if "”" in text: text = text.replace(".”","”.")
    if "\"" in text: text = text.replace(".\"","\".")
    if "!" in text: text = text.replace("!\"","\"!")
    if "?" in text: text = text.replace("?\"","\"?")
    text = text.replace(".",".<stop>")
    text = text.replace("?","?<stop>")
    text = text.replace("!","!<stop>")
    text = text.replace("<prd>",".")
    sentences = text.split("<stop>")
    sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    return sentences

— D Greenberg

19

This is an awesome solution. However, I added two more lines to it: digits = "([0-9])" in the declaration of regular expressions, and text = re.sub(digits + "[.]" + digits, "\\1<prd>\\2", text) in the function. Now it does not split the line at decimals such as 5.5. Thank you for this answer.

— Ameya Kulkarni
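
The decimal fix described in the comment above can be sketched in isolation. This is not the full function from the answer, only the placeholder idea applied to periods between digits; the function name is hypothetical.

```python
import re

digits = "([0-9])"


def protect_decimals_then_split(text):
    # Hide periods that sit between two digits so they are not treated as sentence ends.
    text = re.sub(digits + "[.]" + digits, "\\1<prd>\\2", text)
    # Mark the remaining periods as sentence boundaries.
    text = text.replace(".", ".<stop>")
    # Restore the protected periods.
    text = text.replace("<prd>", ".")
    return [s.strip() for s in text.split("<stop>") if s.strip()]


print(protect_decimals_then_split("It costs 5.5 dollars. That is cheap."))
```

In the answer's function, the re.sub line would have to run before the step that replaces "." with ".<stop>", otherwise the decimal point is already marked as a boundary.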

1

How did you parse the entire Huckleberry Finn? Where is it in text format?

— PascalVKooten

6

A great solution. In the function, I added if "e.g." in text: text = text.replace("e.g.","e<prd>g<prd>") if "i.e." in text: text = text.replace("i.e.","i<prd>e<prd>") and it fully solved my problem.

— Sisay Chala

3

Great solution with very helpful comments! Just to make it a little more robust, though: prefixes = "(Mr|St|Mrs|Ms|Dr|Prof|Capt|Cpt|Lt|Mt)[.]", websites = "[.](com|net|org|io|gov|me|edu)", and if "..." in text: text = text.replace("...","<prd><prd><prd>")

— Dascienz

1

Can this function be made to see sentences like this as one sentence: When a child asks her mother "Where do babies come from?", what should one reply to her?

— twhale
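
One possible way to get the behaviour asked about in the comment above is to split only on whitespace that directly follows a terminator (optionally with a closing quote), so that a terminator followed by a comma is left alone. This is a different and much simpler technique than the function in the answer; it ignores abbreviations entirely, and the function name is an assumption.

```python
import re

# Whitespace is a boundary only if it comes right after . ! ? or after
# one of those followed by a closing double quote (straight or curly).
boundary = re.compile(r'(?:(?<=[.!?])|(?<=[.!?]["”]))\s+')


def split_keeping_quoted_questions(text):
    return [s for s in boundary.split(text.strip()) if s]


example = ('When a child asks her mother "Where do babies come from?", '
           'what should one reply to her? Next sentence.')
for sentence in split_keeping_quoted_questions(example):
    print(sentence)
```

Because the quoted question is followed by a comma rather than whitespace, it stays inside the surrounding sentence.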

50

Instead of using regex to split the text into sentences, you can also use the nltk library.

>>> from nltk import tokenize
>>> p = "Good morning Dr. Adams. The patient is waiting for you in room number 3."

>>> tokenize.sent_tokenize(p)
['Good morning Dr. Adams.', 'The patient is waiting for you in room number 3.']

ref: https://stackoverflow.com/a/9474645/2877052

— Hassan Raza

Great, simpler and more reusable example than the accepted answer.

— Jay D.

If you remove a space after a dot, tokenize.sent_tokenize() doesn't work, but tokenizer.tokenize() works! Hmm...

— Leonid Ganeline

1

for sentence in tokenize.sent_tokenize(text): print(sentence)

— Victoria Stuart

11

You can try using spaCy instead of regex. I use it and it does the job.

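import spacy
# spaCy 3 removed the 'en' shortcut; the model package must be installed first
nlp = spacy.load('en_core_web_sm')

text = '''Your text here'''
tokens = nlp(text)

for sent in tokens.sents:
    # Span.string was removed in spaCy 3; use .text instead
    print(sent.text.strip())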

— Elfo

1

spaCy is great, but if you just need to separate text into sentences, passing the text through spaCy will take too long if you are dealing with a data pipe.

— Berlines

@Berlines I agree, but I couldn't find any other library that does the job as cleanly as spaCy. If you have any suggestion, I can try it.

— Elfo

Also, for AWS Lambda Serverless users out there: spaCy's support data files are many hundreds of MB (English large is > 400 MB), so you can't use things like this out of the box, very sadly (huge fan of spaCy here).

— Julian H

9

Here is a middle-of-the-road approach that doesn't rely on any external libraries. I use list comprehensions to exclude overlaps between abbreviations and terminators, as well as overlaps between variations on terminations, for example: '.' vs. '."'

abbreviations = {'dr.': 'doctor', 'mr.': 'mister', 'bro.': 'brother', 'bro': 'brother', 'mrs.': 'mistress', 'ms.': 'miss', 'jr.': 'junior', 'sr.': 'senior',
                 'i.e.': 'that is', 'e.g.': 'for example', 'vs.': 'versus'}
terminators = ['.', '!', '?']
wrappers = ['"', "'", ')', ']', '}']


def find_sentences(paragraph):
   end = True
   sentences = []
   while end > -1:
       end = find_sentence_end(paragraph)
       if end > -1:
           sentences.append(paragraph[end:].strip())
           paragraph = paragraph[:end]
   sentences.append(paragraph)
   sentences.reverse()
   return sentences


def find_sentence_end(paragraph):
    [possible_endings, contraction_locations] = [[], []]
    contractions = abbreviations.keys()
    sentence_terminators = terminators + [terminator + wrapper for wrapper in wrappers for terminator in terminators]
    for sentence_terminator in sentence_terminators:
        t_indices = list(find_all(paragraph, sentence_terminator))
        possible_endings.extend(([] if not len(t_indices) else [[i, len(sentence_terminator)] for i in t_indices]))
    for contraction in contractions:
        c_indices = list(find_all(paragraph, contraction))
        contraction_locations.extend(([] if not len(c_indices) else [i + len(contraction) for i in c_indices]))
    possible_endings = [pe for pe in possible_endings if pe[0] + pe[1] not in contraction_locations]
    if len(paragraph) in [pe[0] + pe[1] for pe in possible_endings]:
        max_end_start = max([pe[0] for pe in possible_endings])
        possible_endings = [pe for pe in possible_endings if pe[0] != max_end_start]
    possible_endings = [pe[0] + pe[1] for pe in possible_endings if sum(pe) > len(paragraph) or (sum(pe) < len(paragraph) and paragraph[sum(pe)] == ' ')]
    end = (-1 if not len(possible_endings) else max(possible_endings))
    return end


def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1:
            return
        yield start
        start += len(sub)

I used Karl's find_all function from this entry: Find all occurrences of a substring in Python

— TennisVisuals

1

Perfect approach! The others don't catch ... and ?!.

— Shane Smiskol

6

For simple cases (where sentences are terminated normally), this should work:

import re
text = ''.join(open('somefile.txt').readlines())
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)

The regex is *\. +, which matches a period surrounded by 0 or more spaces to the left and 1 or more to the right (to prevent something like the period in re.split from being counted as a change of sentence).

Obviously, this is not the most robust solution, but it will do fine in most cases. The only case it won't cover is abbreviations (perhaps run through the list of sentences and check that each string in sentences starts with a capital letter?)

— Rafe Kettler
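
The capital-letter check suggested at the end of the answer above might be sketched as follows. This is one possible reading of that suggestion, not the answerer's code: fragments that do not start with an uppercase letter are glued back onto the previous sentence. The function name and the lookbehind-based split (which keeps the terminators) are assumptions.

```python
import re


def split_and_merge(text):
    # Split on whitespace that follows ., ! or ?, keeping the terminator.
    fragments = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = []
    for fragment in fragments:
        if not fragment:
            continue
        # A fragment that does not start with an uppercase letter is assumed
        # to continue the previous sentence (for example after "e.g.").
        if sentences and not fragment[0].isupper():
            sentences[-1] += " " + fragment
        else:
            sentences.append(fragment)
    return sentences


print(split_and_merge("Many tools exist, e.g. spaCy and others. They differ."))
```

The heuristic only helps when the word after the abbreviation is lowercase; "Mr. Smith" is still split, and a real sentence that starts with a digit or a quotation mark gets merged into the previous one.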

29

You can't think of a situation in English where a sentence doesn't end with a period? Imagine that! My response to that would be, "think again." (See what I did there?)

— Ned Batchelder

@Ned wow, I can't believe I was that stupid. I must be drunk or something.

— Rafe Kettler

I am using Python 2.7.2 on Win 7 x86, and the regex in the code above gives me this error: SyntaxError: EOL while scanning string literal, pointing to the closing parenthesis (after text). Also, the regex you refer to in your text does not exist in your code sample.

— Sabuncu

1

The regex is not completely correct; it should be r' *[\.\?!][\'"\)\]]* +'

— fsociety
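
To see what the change from a trailing " *" to a trailing " +" does in practice, the two patterns can be run side by side. The sample text and variable names are made up for illustration.

```python
import re

# Pattern from the answer: spaces after the terminator are optional.
loose = r' *[\.\?!][\'"\)\]]* *'
# Pattern from the comment above: at least one space must follow.
strict = r' *[\.\?!][\'"\)\]]* +'

text = "Use re.split here. It works!  Really?"

print(re.split(loose, text))
print(re.split(strict, text))
```

The loose pattern cuts "re.split" at its dot and leaves an empty string at the end; the strict one keeps "re.split" intact but, since nothing follows the final "?", leaves that terminator attached to the last piece.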

It may cause many issues and chunk a sentence into smaller pieces too. Consider the case where we have "I paid $3.5 for this ice cream": the chunks are "I paid $3" and "5 for this ice cream". Use the default nltk sentence tokenizer; it is safer!

— Reihan_amn

6

You can also use the sentence tokenization function in NLTK:

from nltk.tokenize import sent_tokenize
sentence = "As the most quoted English writer Shakespeare has more than his share of famous quotes.  Some Shakespare famous quotes are known for their beauty, some for their everyday truths and some for their wisdom. We often talk about Shakespeare’s quotes as things the wise Bard is saying to us but, we should remember that some of his wisest words are spoken by his biggest fools. For example, both ‘neither a borrower nor a lender be,’ and ‘to thine own self be true’ are from the foolish, garrulous and quite disreputable Polonius in Hamlet."

sent_tokenize(sentence)

— amiref

2

@Artyom,

Hi! You could make a new tokenizer for Russian (and some other languages) using this function:

def russianTokenizer(text):
    result = text
    result = result.replace('.', ' . ')
    result = result.replace(' .  .  . ', ' ... ')
    result = result.replace(',', ' , ')
    result = result.replace(':', ' : ')
    result = result.replace(';', ' ; ')
    result = result.replace('!', ' ! ')
    result = result.replace('?', ' ? ')
    result = result.replace('\"', ' \" ')
    result = result.replace('\'', ' \' ')
    result = result.replace('(', ' ( ')
    result = result.replace(')', ' ) ') 
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.strip()
    result = result.split(' ')
    return result

and then call it in this way:

text = 'вы выполняете поиск, используя Google SSL;'
tokens = russianTokenizer(text)

Good luck, Marilena.

— Marilena Di Bari

0

No doubt NLTK is the most suitable for the purpose. But getting started with NLTK is quite painful (though once you install it, you just reap the rewards).

So here is some simple re-based code, available at http://pythonicprose.blogspot.com/2009/09/python-split-paragraph-into-sentences.html

# split up a paragraph into sentences
# using regular expressions


def splitParagraphIntoSentences(paragraph):
    ''' break a paragraph into sentences
        and return a list '''
    import re
    # to split by multiple characters
    # regular expressions are easiest (and fastest)
    sentenceEnders = re.compile('[.!?]')
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList


if __name__ == '__main__':
    p = """This is a sentence.  This is an excited sentence! And do you think this is a question?"""

    sentences = splitParagraphIntoSentences(p)
    for s in sentences:
        print(s.strip())

#output:
#   This is a sentence
#   This is an excited sentence
#   And do you think this is a question

— vaichidrewar

3

Yes, but this fails so easily, with: "Mr. Smith knows this is a sentence."

— thomas
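
A common regex-only workaround for the failure in the comment above is to refuse to split right after a known title. The sketch below uses negative lookbehinds for a deliberately short, hypothetical list of titles; it is not part of the linked blog code, and the function name is an assumption.

```python
import re

# Split on whitespace that follows ., ! or ?, unless the text just before
# it is one of a few titles. Each lookbehind has a fixed width, as Python's
# re module requires.
boundary = re.compile(
    r"(?<!\bMr\.)(?<!\bMrs\.)(?<!\bMs\.)(?<!\bDr\.)(?<=[.!?])\s+"
)


def split_skipping_titles(paragraph):
    return [s for s in boundary.split(paragraph.strip()) if s]


p = "Mr. Smith knows this is a sentence. Dr. Adams agrees! Does Mrs. Jones?"
for sentence in split_skipping_titles(p):
    print(sentence)
```

The list of titles has to be maintained by hand, and a sentence that genuinely ends in one of them ("I saw the Dr. He left.") will not be split.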

0

I had to read subtitle files and split them into sentences. After pre-processing (like removing time information etc. in the .srt files), the variable fullFile contained the full text of the subtitle file. The crude way below neatly splits it into sentences. I was probably lucky that the sentences always ended (correctly) with a space. Try this first and, if it has any exceptions, add more checks and balances.

import re

# Very approximate way to split the text into sentences - break after ? . and !
# fullFile is assumed to already hold the pre-processed subtitle text.
fullFile = re.sub(r"(!|\?|\.) ", "\\1<BRK>", fullFile)
sentences = fullFile.split("<BRK>")
with open("./sentences.out", "w+") as sentFile:
    for line in sentences:
        sentFile.write(line)
        sentFile.write("\n")

Oh well! I now realize that, since my content was Spanish, I did not have the issue of dealing with "Mr. Smith" etc. Still, if someone wants a quick and dirty parser...

— kishore

0

Hope this helps you with Latin, Chinese, and Arabic text.

import re

punctuation = re.compile(r"([^\d+])(\.|!|\?|;|\n|。|！|？|；|…|　|!|؟|؛)+")
lines = []

with open('myData.txt','r',encoding="utf-8") as myFile:
    lines = punctuation.sub(r"\1\2<pad>", myFile.read())
    lines = [line.strip() for line in lines.split("<pad>") if line.strip()]

— mamtimen

0

I was working on a similar task and came across this question. After following a few links and working through a few nltk exercises, the code below worked for me like magic.

from nltk.tokenize import sent_tokenize 
  
text = "Hello everyone. Welcome to GeeksforGeeks. You are studying NLP article"
sent_tokenize(text)

Output:

['Hello everyone.',
 'Welcome to GeeksforGeeks.',
 'You are studying NLP article']

Source: https://www.geeksforgeeks.org/nlp-how-tokenizing-text-sentence-words-works/

— Mazeen Muhammed