How to input a regex in string.replace?

317

I need some help declaring a regex. My inputs look like the following:

this is a paragraph with<[1> in between</[1> and then there are cases ... where the<[99> number ranges from 1-100</[99>. 
and there are many other lines in the txt files
with<[3> such tags </[3>

The required output is:

this is a paragraph with in between and then there are cases ... where the number ranges from 1-100. 
and there are many other lines in the txt files
with such tags

I've tried this:

#!/usr/bin/python
import os, sys, re, glob
for infile in glob.glob(os.path.join(os.getcwd(), '*.txt')):
    with open(infile) as reader:
        for line in reader:
            line2 = line.replace('<[1> ', '')
            line = line2.replace('</[1> ', '')
            line2 = line.replace('<[1>', '')
            line = line2.replace('</[1>', '')

            print(line)

I've also tried this (but it seems I'm using the wrong regex syntax):

    line2 = line.replace('<[*> ', '')
    line = line2.replace('</[*> ', '')
    line2 = line.replace('<[*>', '')
    line = line2.replace('</[*>', '')

I don't want to hard-code the replace calls from 1 to 99...

— alvas

4

The accepted answer already covers your problem and solves it. Do you need anything else?

— HamZa

What should the result be for where the<[99> number ranges from 1-100</[100>?

— utapyngo

It should also remove the number in the <...> tag, so the output should be: where the number ranges from 1-100

— alvas
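To illustrate the mismatched-number case raised in these comments, here is a minimal sketch. The variable names are made up for the example, and the backreference variant is only an assumption about what a stricter rule could look like; nobody in the thread proposes it.

```python
import re

line = "where the<[99> number ranges from 1-100</[100>"

# Each marker is removed on its own; the opening and closing
# numbers are never compared with each other.
independent = re.sub(r"</?\[\d+>", "", line)

# Hypothetical stricter variant: strip a pair only when the closing
# number repeats the opening one (\1 is a backreference to group 1).
paired_re = re.compile(r"<\[(\d+)>(.*?)</\[\1>")
mismatched = paired_re.sub(r"\2", line)
matched = paired_re.sub(r"\2", "where the<[99> number ranges from 1-100</[99>")

print(independent)
print(mismatched)
print(matched)
```

With the simple pattern the tags disappear whether or not the numbers agree, which is the behaviour the asker describes.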

566

This tested snippet should do it:

import re
line = re.sub(r"</?\[\d+>", "", line)
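As a quick check, the one-liner can be run against the sample input from the question. This is only a sketch: the names `TAG_RE` and `strip_tags` are invented for the illustration and do not come from the answer.

```python
import re

# Hypothetical helper names, chosen for this illustration only.
TAG_RE = re.compile(r"</?\[\d+>")

def strip_tags(text):
    """Remove every <[N> and </[N> marker from text."""
    return TAG_RE.sub("", text)

sample = (
    "this is a paragraph with<[1> in between</[1> and then there are "
    "cases ... where the<[99> number ranges from 1-100</[99>. \n"
    "and there are many other lines in the txt files\n"
    "with<[3> such tags </[3>"
)

cleaned = strip_tags(sample)
print(cleaned)
```

Only the markers themselves are removed, so the space that precedes the last closing tag stays in the output; calling `.rstrip()` on the result would drop it if that matters.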

Edit: here is a commented version explaining how it works:

line = re.sub(r"""
  (?x) # Use free-spacing mode.
  <    # Match a literal '<'
  /?   # Optionally match a '/'
  \[   # Match a literal '['
  \d+  # Match one or more digits
  >    # Match a literal '>'
  """, "", line)

Regexes are fun! I would strongly recommend spending an hour or two studying the basics. For starters, you need to learn which characters are special: the "metacharacters" that need to be escaped (i.e. with a backslash placed in front; the rules are different inside and outside character classes). There is an excellent online tutorial at www.regular-expressions.info. The time you spend there will pay for itself many times over. Happy regexing!
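The point about escaping rules differing inside and outside character classes can be seen in a small experiment. This is a sketch with a made-up test string, not something taken from the tutorial mentioned above.

```python
import re

text = "a.b [x] a+b"

# Outside a class, '.' means "any character", so it must be escaped
# to match a literal dot.
unescaped_dot = re.findall(r"a.b", text)   # finds both 'a.b' and 'a+b'
escaped_dot = re.findall(r"a\.b", text)    # finds only 'a.b'

# Inside a class, '.' and '+' are already literal; no backslash needed.
in_class = re.findall(r"a[.+]b", text)

# re.escape() adds whatever backslashes are needed to match a string literally.
literal_brackets = re.findall(re.escape("[x]"), text)

print(unescaped_dot, escaped_dot, in_class, literal_brackets)
```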

— ridgerunner

Yes, it works!! Thanks, but can you explain the regex briefly?

— alvas

9

Also, don't neglect the book on regular expressions: Mastering Regular Expressions, by Jeffrey Friedl.

— pcurry

Another good reference is w3schools.com/python/python_regex.asp

— Carson

38

str.replace() does fixed replacements. Use re.sub() instead.
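The difference can be shown side by side. This is an illustrative sketch rather than part of the answer itself; the sample line is taken from the question and the variable names are arbitrary.

```python
import re

line = "with<[3> such tags </[3>"

# str.replace() searches for the literal text '<[*>', which never
# occurs in the line, so the line comes back unchanged.
literal_attempt = line.replace("<[*>", "")

# re.sub() interprets its first argument as a pattern.
regex_result = re.sub(r"</?\[\d+>", "", line)

print(repr(literal_attempt))
print(repr(regex_result))
```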

— Ignacio Vazquez-Abrams

3

It is also worth noting that your pattern should look something like "</{0-1}\d{1-2}>", or whatever variant of regexp notation Python uses.

3

What does "fixed replacements" mean?

— avi

@avi He probably meant replacing fixed words rather than locating partial words through a regex.

— Gunay Anach

Fixed (literal, constant) strings.

— vstepaniuk

23

I would go like this (regex explained in the comments):

import re

# If you need to use the regex more than once it is suggested to compile it.
pattern = re.compile(r"</{0,}\[\d+>")

# <\/{0,}\[\d+>
# 
# Match the character “<” literally «<»
# Match the character “/” literally «\/{0,}»
#    Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «{0,}»
# Match the character “[” literally «\[»
# Match a single digit 0..9 «\d+»
#    Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
# Match the character “>” literally «>»

subject = """this is a paragraph with<[1> in between</[1> and then there are cases ... where the<[99> number ranges from 1-100</[99>. 
and there are many other lines in the txt files
with<[3> such tags </[3>"""

result = pattern.sub("", subject)

print(result)

If you want to learn more about regex, I recommend reading the Regular Expressions Cookbook by Jan Goyvaerts and Steven Levithan.

— Lorenzo Persichetti

2

You could just use * instead of {0,}

— HamZa

3

From the Python docs: {0,} is the same as *, {1,} is equivalent to +, and {0,1} is the same as ?. It is better to use *, + or ? when you can, simply because they are shorter and easier to read.

— winklerrr
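A small check of these equivalences, as a sketch with made-up sample strings. It also shows one consequence worth knowing: `*` (like `{0,}`) accepts repeated slashes, while `?` accepts at most one.

```python
import re

samples = ["<[1>", "</[1>", "<//[1>", "<[>"]

# Each pair holds a brace quantifier form and its shorthand equivalent.
pairs = [
    (r"</{0,}\[\d{1,}>", r"</*\[\d+>"),
    (r"</{0,1}\[\d{1,}>", r"</?\[\d+>"),
]

# Both forms of each pair accept and reject exactly the same samples.
equivalent = all(
    bool(re.fullmatch(braces, s)) == bool(re.fullmatch(shorthand, s))
    for braces, shorthand in pairs
    for s in samples
)

star_accepts_double_slash = bool(re.fullmatch(r"</*\[\d+>", "<//[1>"))
question_accepts_double_slash = bool(re.fullmatch(r"</?\[\d+>", "<//[1>"))

print(equivalent, star_accepts_double_slash, question_accepts_double_slash)
```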

15

The easiest way

import re

txt='this is a paragraph with<[1> in between</[1> and then there are cases ... where the<[99> number ranges from 1-100</[99>.  and there are many other lines in the txt files with<[3> such tags </[3>'

out = re.sub("(<[^>]+>)", '', txt)
print(out)

— Ezequiel Marquez

Are the parentheses really needed? Wouldn't this be the same regex: <[^>]+>? By the way: I think your regex would match too much (e.g. something like <html>).
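Both observations can be checked with a short sketch. The input string below is invented for the purpose and mixes ordinary angle-bracket tags with the numbered markers from the question.

```python
import re

text = "see <html> and <b>bold</b> with<[3> such tags </[3>"

# The capturing group does not change what gets replaced.
with_group = re.sub(r"(<[^>]+>)", "", text)
without_group = re.sub(r"<[^>]+>", "", text)

# The broad pattern strips anything between '<' and '>';
# the narrow one only strips the numbered markers.
narrow = re.sub(r"</?\[\d+>", "", text)

print(repr(without_group))
print(repr(narrow))
```

Whether the broad pattern is acceptable depends on whether other angle-bracket text can occur in the files.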

— winklerrr

10

The replace method of string objects does not accept regular expressions, only fixed strings (see the documentation: http://docs.python.org/2/library/stdtypes.html#str.replace).

You have to use the re module:

import re
newline = re.sub(r"<\/?\[[0-9]+>", "", line)

— Zac

4

You should use \d+ instead of [0-9]+

— winklerrr
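One caveat, sketched below under the assumption of Python 3 and str (not bytes) patterns: the two spellings are not strictly interchangeable there, because `\d` also matches non-ASCII decimal digits unless the `re.ASCII` flag is given. For the tags in the question, which only use 0-9, either form works.

```python
import re

arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE

# In Python 3, \d on a str pattern matches any Unicode decimal digit.
unicode_digit = bool(re.fullmatch(r"\d", arabic_three))

# The explicit class only matches the ten ASCII digits.
ascii_class = bool(re.fullmatch(r"[0-9]", arabic_three))

# With re.ASCII, \d is restricted to [0-9] as well.
ascii_flag = bool(re.fullmatch(r"\d", arabic_three, flags=re.ASCII))

print(unicode_digit, ascii_class, ascii_flag)
```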

3

You don't need a regular expression (for your sample string):

>>> s
'this is a paragraph with<[1> in between</[1> and then there are cases ... where the<[99> number ranges from 1-100</[99>. \nand there are many other lines in the txt files\nwith<[3> such tags </[3>\n'

>>> for w in s.split(">"):
...   if "<" in w:
...      print(w.split("<")[0])
...
this is a paragraph with
 in between
 and then there are cases ... where the
 number ranges from 1-100
.
and there are many other lines in the txt files
with
 such tags

— Kurumi

3

import glob, os, re, sys

# Matches opening and closing markers such as <[1> or </[99>.
pattern = re.compile(r"</?\[\d+>")

for infile in glob.glob(os.path.join(os.getcwd(), '*.txt')):
    with open(infile) as reader:
        for line in reader:
            # pattern.sub(replacement, string): replace each marker with nothing.
            retline = pattern.sub("", line)
            sys.stdout.write(retline)

— Abena Saulka