How to remove stop words using nltk or python

110

I have a dataset from which I would like to remove stop words, using

stopwords.words('english')

I'm struggling with how to use this in my code to simply take out these words. I already have a list of the words from this dataset; the part I'm stuck on is comparing that list against the stop words and removing them. Any help is appreciated.

python nltk stop-words

— alex

4

Where did you get the stop words from? Is it from NLTK?

— tumultous_rooster

37

@MattO'Brien from nltk.corpus import stopwords, for future googlers

— danodonovan

13

You also need to run nltk.download("stopwords") to make the stop word dictionary available.

— sffc

See also stackoverflow.com/questions/19130512/stopword-removal-with-nltk

— alvas

1

Be aware that a word like "not" is also considered a stop word in nltk. If you are doing something like sentiment analysis or spam filtering, a negation can change the entire meaning of a sentence, and if you remove it during preprocessing you might not get accurate results.

— Darkov
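
A minimal sketch of one way to act on the caveat above: subtract the negations you want to keep from the stop word set before filtering. SAMPLE_STOPWORDS and KEEP are illustrative assumptions, not NLTK's actual list; in practice you would presumably start from set(stopwords.words('english')).

```python
# Hypothetical stand-in for set(stopwords.words('english')); assumed for illustration.
SAMPLE_STOPWORDS = {"this", "is", "a", "not", "no", "the", "was"}

# Negations to keep (an assumption; adjust this for your own task).
KEEP = {"not", "no"}

def remove_stopwords_keep_negations(words, stop_words=SAMPLE_STOPWORDS, keep=KEEP):
    """Drop stop words but leave the words listed in `keep` untouched."""
    effective = set(stop_words) - set(keep)
    return [w for w in words if w.lower() not in effective]

tokens = "this movie was not good".split()
print(remove_stopwords_keep_negations(tokens))  # ['movie', 'not', 'good']
```

Whether keeping negations actually helps depends on the downstream model, so this is worth checking on your own data.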

206

from nltk.corpus import stopwords
# ...
filtered_words = [word for word in word_list if word not in stopwords.words('english')]

— Daren Thomas

Thanks for both answers. They both work, although it seems I have a flaw in my code that prevents the stop list from working correctly. Should this be a new question? I'm not sure how things work around here yet!

— Alex

51

To improve performance, consider stops = set(stopwords.words("english")) instead.

— isakkarlsson
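
A rough sketch of why the set helps: membership tests on a list scan it element by element, while a set uses hashing. The stop list and word list below are synthetic stand-ins (assumptions for illustration), so the timings only show the trend, not what you would measure with NLTK's real list.

```python
import timeit

# Synthetic stand-in for stopwords.words('english'); all names here are assumed.
stop_list = [f"stop{i}" for i in range(200)]
stop_set = set(stop_list)
word_list = [f"stop{i % 400}" for i in range(5000)]

def filter_with_list():
    # Each `in` test scans the list linearly.
    return [w for w in word_list if w not in stop_list]

def filter_with_set():
    # Each `in` test is a hash lookup.
    return [w for w in word_list if w not in stop_set]

# Both variants produce the same filtered list.
assert filter_with_list() == filter_with_set()

print("list:", timeit.timeit(filter_with_list, number=3))
print("set: ", timeit.timeit(filter_with_set, number=3))
```

The important detail is that the set is built once, outside the comprehension, rather than on every iteration.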

1

>>> import nltk
>>> nltk.download()

2

The words in stopwords.words('english') are lowercase, so make sure to compare only lowercase words from your list, e.g. [w.lower() for w in word_list]

— AlexG
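
If you want to keep the original casing of the words that survive, one option is to lowercase only for the comparison. This is a sketch under the assumption that SAMPLE_STOPWORDS stands in for set(stopwords.words('english')); the function name is made up for illustration.

```python
# Hypothetical stand-in for set(stopwords.words('english')).
SAMPLE_STOPWORDS = {"the", "is", "on", "a"}

def remove_stopwords_case_insensitive(words, stop_words=SAMPLE_STOPWORDS):
    """Compare lowercased words with the stop words, but return the original casing."""
    return [w for w in words if w.lower() not in stop_words]

word_list = ["The", "Cat", "IS", "on", "a", "Mat"]
print(remove_stopwords_case_insensitive(word_list))  # ['Cat', 'Mat']
```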

19

You could also do a set difference, for example:

list(set(nltk.regexp_tokenize(sentence, pattern, gaps=True)) - set(nltk.corpus.stopwords.words('english')))

— David Lemphers

16

Note: this converts the sentence to a SET, which removes all duplicate words, so you won't be able to do frequency counting on the result.

— David Dehghan

1

Converting to a set could remove vital information from the sentence by discarding multiple occurrences of an important word.

— Ujjwal
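
To make the point in the two comments above concrete, here is a small sketch comparing the set difference with a list comprehension. SAMPLE_STOPWORDS and the token list are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical stand-in for set(stopwords.words('english')).
SAMPLE_STOPWORDS = {"the", "and", "a"}
tokens = ["the", "dog", "and", "the", "other", "dog"]

# Set difference: duplicates and word order are lost.
as_set = set(tokens) - SAMPLE_STOPWORDS

# List comprehension: duplicates and word order are kept.
as_list = [t for t in tokens if t not in SAMPLE_STOPWORDS]

print(sorted(as_set))    # ['dog', 'other']
print(Counter(as_list))  # Counter({'dog': 2, 'other': 1})
```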

14

I suppose you have a list of words (word_list) from which you want to remove stop words. You could do something like this:

filtered_word_list = word_list[:] #make a copy of the word_list
for word in word_list: # iterate over word_list
  if word in stopwords.words('english'): 
    filtered_word_list.remove(word) # remove word from filtered_word_list if it is a stopword

— das_weezul

5

This will be a lot slower than Daren Thomas's list comprehension...

— drevicko

12

To exclude all kinds of stop words, including the nltk stop words, you could do something like this:

from stop_words import get_stop_words
from nltk.corpus import stopwords

stop_words = list(get_stop_words('en'))         #About 900 stopwords
nltk_words = list(stopwords.words('english')) #About 150 stopwords
stop_words.extend(nltk_words)

output = [w for w in word_list if not w in stop_words]

— sumitjainjr

I get len(get_stop_words('en')) == 174 vs len(stopwords.words('english')) == 179

— rubencart
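
Since the two lists overlap heavily, a set union may be a tidier way to combine them than extend, and it also gives fast lookups. In this sketch list_a and list_b are tiny made-up stand-ins for get_stop_words('en') and stopwords.words('english').

```python
# Hypothetical stand-ins for get_stop_words('en') and stopwords.words('english').
list_a = ["a", "the", "and", "of"]
list_b = ["the", "of", "in", "to"]

# The union keeps each stop word once, however many lists contain it.
combined = set(list_a) | set(list_b)

word_list = ["the", "history", "of", "art", "in", "europe"]
output = [w for w in word_list if w not in combined]

print(len(combined))  # 6
print(output)         # ['history', 'art', 'europe']
```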

6

There's a very simple, lightweight Python package, stop-words, just for this purpose.

First install the package using: pip install stop-words

Then you can remove your words in one line using a list comprehension:

from stop_words import get_stop_words

filtered_words = [word for word in dataset if word not in get_stop_words('english')]

This package is very lightweight to download (unlike nltk), works with both Python 2 and Python 3, and has stop words for many other languages, such as:

    Arabic
    Bulgarian
    Catalan
    Czech
    Danish
    Dutch
    English
    Finnish
    French
    German
    Hungarian
    Indonesian
    Italian
    Norwegian
    Polish
    Portuguese
    Romanian
    Russian
    Spanish
    Swedish
    Turkish
    Ukrainian

— user_3pij

3

Use the textcleaner library to remove stop words from your data.

Follow this link: https://yugantm.github.io/textcleaner/documentation.html#remove_stpwrds

Follow these steps to do so with this library.

pip install textcleaner

After installing:

import textcleaner as tc
data = tc.document(<file_name>) 
#you can also pass list of sentences to the document class constructor.
data.remove_stpwrds() #inplace is set to False by default

Use the code above to remove the stop words.

— Yugant Hadiyal

1

You can use this function. Note that you need to lowercase all the words.

from nltk.corpus import stopwords

def remove_stopwords(word_list):
        processed_word_list = []
        for word in word_list:
            word = word.lower() # in case they aren't all lowercase
            if word not in stopwords.words("english"):
                processed_word_list.append(word)
        return processed_word_list

— Mohammed_Ashour

1

Using filter:

from nltk.corpus import stopwords
# ...  
filtered_words = list(filter(lambda word: word not in stopwords.words('english'), word_list))

— Saeid BK

3

If word_list is large, this code is very slow. It's better to convert the stopwords list to a set before using it: .. in set(stopwords.words('english')).

— Robert

1

Here's my take on this, in case you want to get the answer directly as a string (instead of a list of filtered words):

from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english'))
text = ' '.join([word for word in text.split() if word not in STOPWORDS]) # delete stopwords from text

— justadev

Don't use this approach for French text, otherwise the stop words won't be caught.

— David Beauchemin
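
One possible reading of the comment above is that splitting on whitespace leaves elided forms such as l' attached to the following word, so they never match a stop word. The sketch below tokenizes with a regular expression instead. The stop word set is a small hypothetical sample, and the function name is invented for illustration.

```python
import re

# Hypothetical sample; a real French stop word list is assumed to hold similar entries.
SAMPLE_FRENCH_STOPWORDS = {"l", "d", "le", "la", "de", "est"}

def remove_stopwords_regex(text, stop_words=SAMPLE_FRENCH_STOPWORDS):
    """Tokenize on runs of word characters so elided forms like l' are split off."""
    tokens = re.findall(r"\w+", text.lower())
    return " ".join(t for t in tokens if t not in stop_words)

text = "L'arbre est près de la maison"

# Whitespace splitting keeps "L'arbre" as one token, so the article survives.
print(" ".join(w for w in text.split() if w.lower() not in SAMPLE_FRENCH_STOPWORDS))
# L'arbre près maison

# Regex tokenization separates "l" from "arbre", so the article is removed.
print(remove_stopwords_regex(text))
# arbre près maison
```

Note that this variant also lowercases the text and drops punctuation, which may or may not be what you want.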

0

If your data are stored in a Pandas DataFrame, you can use remove_stopwords from texthero, which uses the NLTK stop word list by default.

import pandas as pd
import texthero as hero
df['text_without_stopwords'] = hero.remove_stopwords(df['text'])

— Jonathan Besomi

0

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)

# List-comprehension version
filtered_sentence = [w for w in word_tokens if w not in stop_words]

# Equivalent explicit loop
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)

— HM

-3

# 1) Build a new list that leaves out the stop words
print("enter the string from which you want to remove list of stop words")
userstring = input().split(" ")
stop_list = ["a", "an", "the", "in"]
another_list = []
for x in userstring:
    if x not in stop_list:      # compare against the list and skip stop words
        another_list.append(x)
for x in another_list:
    print(x, end=' ')

# 2) If you prefer to use .remove, iterate over a copy of the list
print("enter the string from which you want to remove list of stop words")
userstring = input().split(" ")
stop_list = ["a", "an", "the", "in"]
for x in userstring[:]:         # iterate over a copy so removing does not skip items
    if x in stop_list:
        userstring.remove(x)
for x in userstring:
    print(x, end=' ')

— Muhammad Yusuf

It's better to use stopwords.words("english") than to specify every word you need to remove.

— Led