How to remove an element in lxml

Question 1

I need to completely remove elements, based on the contents of an attribute, using Python's lxml. Example:

import lxml.etree as et

xml="""
<groceries>
  <fruit state="rotten">apple</fruit>
  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="rotten">mango</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>
"""

tree = et.fromstring(xml)

for bad in tree.xpath("//fruit[@state='rotten']"):
    pass  # remove this element from the tree

print(et.tostring(tree, pretty_print=True).decode("utf-8"))

I would like this to print:

<groceries>
  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>

Is there a way to do this without storing a temporary variable and building the output manually, like this:

newxml = "<groceries>\n"
for elt in tree.xpath("//fruit[@state='fresh']"):
    newxml += et.tostring(elt, encoding="unicode")

newxml += "</groceries>"

Question 2

Use the remove method of an XML element:

tree = et.fromstring(xml)

for bad in tree.xpath("//fruit[@state='rotten']"):
    # grab the parent of the element and call remove() directly on it
    bad.getparent().remove(bad)

print(et.tostring(tree, pretty_print=True, xml_declaration=True).decode("utf-8"))

Compared with @Acorn's version, mine also works when the elements to remove are not directly under the root node of your XML.
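To make the point about nesting concrete, here is a minimal Python 3 sketch. The `<punnet>` wrapper and the helper name `remove_rotten` are made up for illustration; the only lxml calls used are `xpath`, `getparent`, `remove` and `iter`.

```python
import lxml.etree as et

xml = """
<groceries>
  <fruit state="rotten">apple</fruit>
  <punnet>
    <fruit state="rotten">strawberry</fruit>
    <fruit state="fresh">blueberry</fruit>
  </punnet>
  <fruit state="fresh">peach</fruit>
</groceries>
"""

def remove_rotten(root):
    """Remove every <fruit state="rotten"> at any depth; return the number removed."""
    # xpath() returns a plain list, so removing while looping over it is safe
    bad_elements = root.xpath("//fruit[@state='rotten']")
    for bad in bad_elements:
        # remove() must be called on the direct parent, whatever its depth
        bad.getparent().remove(bad)
    return len(bad_elements)

tree = et.fromstring(xml)
removed = remove_rotten(tree)
remaining = [el.text for el in tree.iter("fruit")]
print(removed, remaining)
```

Note that this sketch assumes the root element itself never matches the XPath: `getparent()` returns None for the root, so that case would need separate handling.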

Question 3

You're looking for the remove function. Call the remove method on the element's parent and pass it the subelement to remove.

import lxml.etree as et

xml="""
<groceries>
  <fruit state="rotten">apple</fruit>
  <fruit state="fresh">pear</fruit>
  <punnet>
    <fruit state="rotten">strawberry</fruit>
    <fruit state="fresh">blueberry</fruit>
  </punnet>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="rotten">mango</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>
"""

tree = et.fromstring(xml)

for bad in tree.xpath("//fruit[@state='rotten']"):
    bad.getparent().remove(bad)

print(et.tostring(tree, pretty_print=True).decode("utf-8"))

Result:

<groceries>
  <fruit state="fresh">pear</fruit>
  <punnet>
    <fruit state="fresh">blueberry</fruit>
  </punnet>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>

Question 4

I ran into this situation:

<div>
    <script>
        some code
    </script>
    text here
</div>

div.remove(script) will also remove the "text here" part, which I did not intend.

Following the answer here, I found that etree.strip_elements is a better solution for me, because you can control whether the text after the element is removed with the with_tail=(bool) parameter.

I still don't know whether this can use an XPath filter for the tag, though. I'm posting this for information.

Here is the documentation:

strip_elements(tree_or_element, *tag_names, with_tail=True)

Delete all elements with the provided tag names from a tree or subtree. This will remove the elements and their entire subtree, including all their attributes, text content and descendants. It will also remove the tail text of the element unless you explicitly set the with_tail keyword argument option to False.

Tag names can contain wildcards as in _Element.iter.

Note that this will not delete the element (or ElementTree root element) that you passed even if it matches. It will only treat its descendants. If you want to include the root element, check its tag name directly before even calling this function.

Example usage:
   strip_elements(some_element,
       'simpletagname',             # non-namespaced tag
       '{http://some/ns}tagname',   # namespaced tag
       '{http://some/other/ns}*',   # any tag from a namespace
       lxml.etree.Comment           # comments
       )
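As a quick check of the with_tail behaviour on the div/script case above, here is a small sketch (Python 3). The markup is a condensed, single-line version of my example, and the variable names are just for illustration.

```python
import lxml.etree as et

markup = "<div><script>some code</script>text here</div>"

# Default (with_tail=True): the text following <script> is removed too.
div_default = et.fromstring(markup)
et.strip_elements(div_default, "script")
without_tail = et.tostring(div_default, encoding="unicode")

# with_tail=False: only the <script> element goes, the text after it stays.
div_keep = et.fromstring(markup)
et.strip_elements(div_keep, "script", with_tail=False)
with_tail_kept = et.tostring(div_keep, encoding="unicode")

print(without_tail)
print(with_tail_kept)
```

strip_elements only takes tag names, so if you need an attribute-based filter you would still have to select the elements with XPath and remove them yourself.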

Question 5

As already mentioned, you can use the remove() method to delete (sub)elements from the tree:

for bad in tree.xpath("//fruit[@state='rotten']"):
    bad.getparent().remove(bad)

But it removes the element together with its tail, which is a problem if you are processing mixed-content documents such as HTML:

<div><fruit state="rotten">avocado</fruit> Hello!</div>

Becomes

<div></div>

Which, I suppose, is not always what you want :) I wrote a helper function that removes just the element and keeps its tail:

def remove_element(el):
    parent = el.getparent()
    # el.tail is None when no text follows the element
    if el.tail and el.tail.strip():
        prev = el.getprevious()
        # compare with None: an lxml element without children is falsy
        if prev is not None:
            prev.tail = (prev.tail or '') + el.tail
        else:
            parent.text = (parent.text or '') + el.tail
    parent.remove(el)

for bad in tree.xpath("//fruit[@state='rotten']"):
    remove_element(bad)

This way the tail text is kept:

<div> Hello!</div>
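Here is a self-contained sketch (Python 3) that also exercises the case where the removed element has a preceding sibling. The name `remove_keep_tail` and the sample markup are made up for illustration, and unlike a helper that ignores blank tails, this variant keeps whitespace-only tails as well.

```python
import lxml.etree as et

def remove_keep_tail(el):
    """Remove el from its parent but keep the text that follows it (its tail)."""
    parent = el.getparent()
    if el.tail:
        prev = el.getprevious()
        if prev is not None:
            # the tail moves onto the previous sibling's tail
            prev.tail = (prev.tail or "") + el.tail
        else:
            # no previous sibling: the tail becomes part of the parent's text
            parent.text = (parent.text or "") + el.tail
    parent.remove(el)

div = et.fromstring(
    '<div>Intro <b>bold</b> middle <fruit state="rotten">avocado</fruit> Hello!</div>'
)
for bad in div.xpath("//fruit[@state='rotten']"):
    remove_keep_tail(bad)

result = et.tostring(div, encoding="unicode")
print(result)
```

The two spaces between "middle" and "Hello!" in the output come from joining the two tails as they are; normalize the whitespace afterwards if that matters for your document.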

Question 6

You can also use the html module from lxml to solve this:

from lxml import html

xml="""
<groceries>
  <fruit state="rotten">apple</fruit>
  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="rotten">mango</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>
"""

tree = html.fromstring(xml)

print("//BEFORE")
print(html.tostring(tree, pretty_print=True).decode("utf-8"))

for i in tree.xpath("//fruit[@state='rotten']"):
    i.drop_tree()

print("//AFTER")
print(html.tostring(tree, pretty_print=True).decode("utf-8"))

It should output this:

//BEFORE
<groceries>
  <fruit state="rotten">apple</fruit>
  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="rotten">mango</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>


//AFTER
<groceries>

  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>

  <fruit state="fresh">peach</fruit>
</groceries>