How do I preserve line breaks when using jsoup to convert HTML to plain text?

101

I have the following code:

 import org.jsoup.Jsoup;

 public class NewClass {
     public String noTags(String str){
         return Jsoup.parse(str).text();
     }


     public static void main(String args[]) {
         String strings="<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN \">" +
         "<HTML> <HEAD> <TITLE></TITLE> <style>body{ font-size: 12px;font-family: verdana, arial, helvetica, sans-serif;}</style> </HEAD> <BODY><p><b>hello world</b></p><p><br><b>yo</b> <a href=\"http://google.com\">googlez</a></p></BODY> </HTML> ";

         NewClass text = new NewClass();
         System.out.println(text.noTags(strings));
     }
 }

And I get this result:

hello world yo googlez

But I want the line break to be kept:

hello world
yo googlez

I have looked at jsoup's TextNode#getWholeText() but I can't figure out how to use it.

If there is a <br> in the markup I parse, how can I get a line break in the resulting output?

java jsoup

— Billy

Edit your text: no line break shows up in your question. In general, read the preview of your question before posting it, to check that everything is displayed correctly.

— Robin Green

I asked the same question (without the jsoup requirement) but I still don't have a good solution: stackoverflow.com/questions/2513707/…

— Eduardo

See @zeenosaur's answer.

— Jang-Ho Bae

102

The real solution that preserves line breaks should look like this:

public static String br2nl(String html) {
    if(html==null)
        return html;
    Document document = Jsoup.parse(html);
    document.outputSettings(new Document.OutputSettings().prettyPrint(false));//makes html() preserve linebreaks and spacing
    document.select("br").append("\\n");
    document.select("p").prepend("\\n\\n");
    String s = document.html().replaceAll("\\\\n", "\n");
    return Jsoup.clean(s, "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));
}

It satisfies the following requirements:

if the original HTML contains a newline (\n), it is preserved
if the original HTML contains br or p tags, they are translated to newlines (\n).

— user121196
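
The "\\n" marker and the "\\\\n" pattern in the answer above sit at two different escaping levels, which is easy to misread. The following is a minimal sketch in plain Java (no jsoup involved; the class and method names are made up for illustration) of what those two literals denote:

```java
final class EscapedNewlineMarker {
    // In Java source, "\\n" is two characters: a backslash followed by 'n'.
    // It is a visible marker, not a line feed.
    static final String MARKER = "\\n";

    // Replaces every literal backslash-n marker with a real line feed.
    static String markersToNewlines(String text) {
        // As a regex, "\\\\n" means: one literal backslash, then 'n'.
        // Real line feeds already present in the text do not match it.
        return text.replaceAll("\\\\n", "\n");
    }

    public static void main(String[] args) {
        String withMarkers = "hello world" + MARKER + "yo googlez";
        System.out.println(withMarkers);                    // one line, marker visible
        System.out.println(markersToNewlines(withMarkers)); // two lines
    }
}
```

Because the marker is a fixed string, `text.replace("\\n", "\n")` would give the same result without going through the regex engine.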

5

This should be the selected answer

— Duy

2

br2nl is not the most helpful or accurate method name

— DD.

2

This is the best answer. But how about for (Element e : document.select("br")) e.after(new TextNode("\n", "")); to append a real newline rather than the sequence \n? See Node::after() and Elements::append() for the difference. The replaceAll() is not needed in that case. The same goes for p and other block elements.

— user2043553

1

@user121196's answer should be the chosen answer. If you still have HTML entities after cleaning the input HTML, apply StringEscapeUtils.unescapeHtml(...) from Apache Commons to the output of Jsoup.clean.

— karth500

6

See github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/… for a complete answer to this problem.

— Malcolm Smith

44

Jsoup.clean(unsafeString, "", Whitelist.none(), new OutputSettings().prettyPrint(false));

We're using this method here:

public static String clean(String bodyHtml,
                       String baseUri,
                       Whitelist whitelist,
                       Document.OutputSettings outputSettings)

By passing it Whitelist.none() we make sure that all HTML is removed.

By passing new OutputSettings().prettyPrint(false) we make sure that the output is not reformatted and that line breaks are preserved.

— Paulius Z

This should be the only correct answer. All the others assume that only br tags produce new lines. What about any other block element in HTML, such as div, p, ul, etc.? They all introduce new lines too.

— adarshr

7

With this solution, the HTML "<html> <body> <div> line 1 </div> <div> line 2 </div> <div> line 3 </div> </body> </html>" produced the output "line 1 line 2 line 3" with no new lines.

— JohnC

2

This doesn't work for me; <br> tags don't create line breaks.

— JoshuaD

43

With

Jsoup.parse("A\nB").text();

you get the output

"A B"

and not

A

B

For this I'm using:

descrizione = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "br2n")).text();
text = descrizione.replaceAll("br2n", "\n");

— Mirco Attocchi

2

Indeed this is an easy stopgap, but IMHO this should be fully handled by the Jsoup library itself (which at this time has a few disturbing behaviors like this one - otherwise it's a great library!).

— SRG

5

Doesn't JSoup give you a DOM? Why not replace all <br> elements with text nodes containing new lines and then call .text(), instead of doing a regex transform that will produce incorrect output for some strings, such as <div title=<br>'not an attribute'></div>

— Mike Samuel
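
The point about the regex can be checked without jsoup. The sketch below is plain Java with hypothetical class and method names; it applies the same pattern and placeholder used in the answer above to raw strings and shows that the pattern rewrites any "<br...>" run of characters, whether or not an HTML parser would treat it as an element:

```java
final class BrRegexPitfall {
    // The pattern from the answer above: "<br", then anything up to the next ">".
    static final String BR_PATTERN = "(?i)<br[^>]*>";

    // Textual replacement on the raw markup; no HTML parsing takes place.
    static String replaceBr(String html, String placeholder) {
        return html.replaceAll(BR_PATTERN, placeholder);
    }

    public static void main(String[] args) {
        // Ordinary line breaks are replaced as intended.
        System.out.println(replaceBr("a<br>b<BR />c", "br2n"));
        // prints: abr2nbbr2nc

        // The "<br>" inside the tag is rewritten too, because the regex
        // only sees characters, not the document structure.
        System.out.println(replaceBr("<div title=<br>'not an attribute'></div>", "br2n"));
        // prints: <div title=br2n'not an attribute'></div>
    }
}
```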

5

Nice, but where does that "descrizione" come from?

— Steve Waters

"descrizione" is the variable the plain text gets assigned to

— enigma969

23

Try this using jsoup:

public static String cleanPreserveLineBreaks(String bodyHtml) {

    // get pretty printed html with preserved br and p tags
    String prettyPrintedBodyFragment = Jsoup.clean(bodyHtml, "", Whitelist.none().addTags("br", "p"), new OutputSettings().prettyPrint(true));
    // get plain text with preserved line breaks by disabled prettyPrint
    return Jsoup.clean(prettyPrintedBodyFragment, "", Whitelist.none(), new OutputSettings().prettyPrint(false));
}

— mkowa

Nice, this works for me with a small change: new Document.OutputSettings().prettyPrint(true)

— Ashu

This solution leaves "&nbsp;" as text instead of parsing it into a space.

— Andrei Volgin

13

With Jsoup v1.11.2, we can now use Element.wholeText().

Example code:

String cleanString = Jsoup.parse(htmlString).wholeText();

user121196's answer still works, but wholeText() preserves the alignment of the text.

— zeenosaur

Super nice feature!

— Denis Kulagin

8

For more complex HTML none of the solutions above worked correctly; I was able to do the conversion while preserving line breaks with:

Document document = Jsoup.parse(myHtml);
String text = new HtmlToPlainText().getPlainText(document);

(version 1.10.3)

— Andy Res

1

Best of all the answers! Thanks Andy Res!

— Bharath Nadukatla

6

You can traverse a given element

public String convertNodeToText(Element element)
{
    final StringBuilder buffer = new StringBuilder();

    new NodeTraversor(new NodeVisitor() {
        boolean isNewline = true;

        @Override
        public void head(Node node, int depth) {
            if (node instanceof TextNode) {
                TextNode textNode = (TextNode) node;
                String text = textNode.text().replace('\u00A0', ' ').trim();                    
                if(!text.isEmpty())
                {                        
                    buffer.append(text);
                    isNewline = false;
                }
            } else if (node instanceof Element) {
                Element element = (Element) node;
                if (!isNewline)
                {
                    if((element.isBlock() || element.tagName().equals("br")))
                    {
                        buffer.append("\n");
                        isNewline = true;
                    }
                }
            }                
        }

        @Override
        public void tail(Node node, int depth) {                
        }                        
    }).traverse(element);        

    return buffer.toString();               
}

And for your code

String result = convertNodeToText(Jsoup.parse(html));

— popcorny

I think you should test isBlock in tail(node, depth) instead, and append \n when leaving the block rather than when entering it? I'm doing that (i.e. using tail) and it works fine. However, if I use head like you do, then this: <p>line one<p>line two ends up as a single line.

— KajMagnus

4

text = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "br2n")).text();
text = text.replaceAll("br2n", "\n");

works if the HTML itself doesn't contain "br2n"

So,

text = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "<pre>\n</pre>")).text();

works more reliably and is simpler.

— Green Beret

4

Try this using jsoup:

    doc.outputSettings(new OutputSettings().prettyPrint(false));

    //select all <br> tags and append \n after that
    doc.select("br").after("\\n");

    //select all <p> tags and prepend \n before that
    doc.select("p").before("\\n");

    //get the HTML from the document, and retaining original new lines
    String str = doc.html().replaceAll("\\\\n", "\n");

— Abhay Gupta

3

Use textNodes() to get a list of the text nodes. Then concatenate them with \n as the separator. Here's some Scala code I use for this; a Java port should be easy:

val rawTxt = doc.body().getElementsByTag("div").first.textNodes()
                    .asScala.mkString("<br />\n")

— Michael Bar-Sinai

3

Based on the other answers and the comments on this question, it seems that most people coming here are really looking for a general solution that produces a nicely formatted plain-text representation of an HTML document. I know I was.

Fortunately JSoup already provides a fairly comprehensive example of how to achieve this: HtmlToPlainText.java

The example FormattingVisitor can easily be tweaked to your preference and deals with most block elements and line wrapping.

To avoid link rot, here is Jonathan Hedley's solution in full:

package org.jsoup.examples;

import org.jsoup.Jsoup;
import org.jsoup.helper.StringUtil;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;

import java.io.IOException;

/**
 * HTML to plain-text. This example program demonstrates the use of jsoup to convert HTML input to lightly-formatted
 * plain-text. That is divergent from the general goal of jsoup's .text() methods, which is to get clean data from a
 * scrape.
 * <p>
 * Note that this is a fairly simplistic formatter -- for real world use you'll want to embrace and extend.
 * </p>
 * <p>
 * To invoke from the command line, assuming you've downloaded the jsoup jar to your current directory:</p>
 * <p><code>java -cp jsoup.jar org.jsoup.examples.HtmlToPlainText url [selector]</code></p>
 * where <i>url</i> is the URL to fetch, and <i>selector</i> is an optional CSS selector.
 * 
 * @author Jonathan Hedley, jonathan@hedley.net
 */
public class HtmlToPlainText {
    private static final String userAgent = "Mozilla/5.0 (jsoup)";
    private static final int timeout = 5 * 1000;

    public static void main(String... args) throws IOException {
        Validate.isTrue(args.length == 1 || args.length == 2, "usage: java -cp jsoup.jar org.jsoup.examples.HtmlToPlainText url [selector]");
        final String url = args[0];
        final String selector = args.length == 2 ? args[1] : null;

        // fetch the specified URL and parse to a HTML DOM
        Document doc = Jsoup.connect(url).userAgent(userAgent).timeout(timeout).get();

        HtmlToPlainText formatter = new HtmlToPlainText();

        if (selector != null) {
            Elements elements = doc.select(selector); // get each element that matches the CSS selector
            for (Element element : elements) {
                String plainText = formatter.getPlainText(element); // format that element to plain text
                System.out.println(plainText);
            }
        } else { // format the whole doc
            String plainText = formatter.getPlainText(doc);
            System.out.println(plainText);
        }
    }

    /**
     * Format an Element to plain-text
     * @param element the root element to format
     * @return formatted text
     */
    public String getPlainText(Element element) {
        FormattingVisitor formatter = new FormattingVisitor();
        NodeTraversor traversor = new NodeTraversor(formatter);
        traversor.traverse(element); // walk the DOM, and call .head() and .tail() for each node

        return formatter.toString();
    }

    // the formatting rules, implemented in a depth-first DOM traverse
    private class FormattingVisitor implements NodeVisitor {
        private static final int maxWidth = 80;
        private int width = 0;
        private StringBuilder accum = new StringBuilder(); // holds the accumulated text

        // hit when the node is first seen
        public void head(Node node, int depth) {
            String name = node.nodeName();
            if (node instanceof TextNode)
                append(((TextNode) node).text()); // TextNodes carry all user-readable text in the DOM.
            else if (name.equals("li"))
                append("\n * ");
            else if (name.equals("dt"))
                append("  ");
            else if (StringUtil.in(name, "p", "h1", "h2", "h3", "h4", "h5", "tr"))
                append("\n");
        }

        // hit when all of the node's children (if any) have been visited
        public void tail(Node node, int depth) {
            String name = node.nodeName();
            if (StringUtil.in(name, "br", "dd", "dt", "p", "h1", "h2", "h3", "h4", "h5"))
                append("\n");
            else if (name.equals("a"))
                append(String.format(" <%s>", node.absUrl("href")));
        }

        // appends text to the string builder with a simple word wrap method
        private void append(String text) {
            if (text.startsWith("\n"))
                width = 0; // reset counter if starts with a newline. only from formats above, not in natural text
            if (text.equals(" ") &&
                    (accum.length() == 0 || StringUtil.in(accum.substring(accum.length() - 1), " ", "\n")))
                return; // don't accumulate long runs of empty spaces

            if (text.length() + width > maxWidth) { // won't fit, needs to wrap
                String words[] = text.split("\\s+");
                for (int i = 0; i < words.length; i++) {
                    String word = words[i];
                    boolean last = i == words.length - 1;
                    if (!last) // insert a space if not the last word
                        word = word + " ";
                    if (word.length() + width > maxWidth) { // wrap and reset counter
                        accum.append("\n").append(word);
                        width = word.length();
                    } else {
                        accum.append(word);
                        width += word.length();
                    }
                }
            } else { // fits as is, without need to wrap text
                accum.append(text);
                width += text.length();
            }
        }

        @Override
        public String toString() {
            return accum.toString();
        }
    }
}

— Malcolm Smith

3

This is my version of translating HTML to text (a modified version of user121196's answer, actually).

It doesn't just preserve line breaks: it also formats the text and removes excessive line breaks and HTML escape symbols, so you get a much better result from your HTML (in my case I'm receiving it from mail).

It was originally written in Scala, but you can change it to Java easily

def html2text( rawHtml : String ) : String = {

    val htmlDoc = Jsoup.parseBodyFragment( rawHtml, "/" )
    htmlDoc.select("br").append("\\nl")
    htmlDoc.select("div").prepend("\\nl").append("\\nl")
    htmlDoc.select("p").prepend("\\nl\\nl").append("\\nl\\nl")

    org.jsoup.parser.Parser.unescapeEntities(
        Jsoup.clean(
          htmlDoc.html(),
          "",
          Whitelist.none(),
          new org.jsoup.nodes.Document.OutputSettings().prettyPrint(true)
        ),false
    ).
    replaceAll("\\\\nl", "\n").
    replaceAll("\r","").
    replaceAll("\n\\s+\n","\n").
    replaceAll("\n\n+","\n\n").     
    trim()      
}

— abdolenza
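
For readers who want the Java version, the trailing replaceAll chain of the Scala snippet above can be carried over almost verbatim. This is a sketch of that post-processing step only, in plain Java; the jsoup calls are left out, and the class and method names are made up for illustration:

```java
final class LineBreakCleanup {
    // Java port of the replaceAll chain from the Scala snippet above.
    static String normalize(String text) {
        return text
                .replaceAll("\\\\nl", "\n")    // turn the literal \nl markers into real newlines
                .replaceAll("\r", "")          // drop carriage returns
                .replaceAll("\n\\s+\n", "\n")  // newline + whitespace run + newline becomes one newline
                .replaceAll("\n\n+", "\n\n")   // cap any remaining run of newlines at two
                .trim();                       // strip leading and trailing whitespace
    }

    public static void main(String[] args) {
        String raw = "  first\r\n \r\nsecond\\nlthird  ";
        System.out.println(normalize(raw));
    }
}
```

Note that \s also matches line feeds, so the third step collapses runs of three or more consecutive newlines as well as whitespace-only lines.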

You also need to prepend a new line to <div> tags. Otherwise, if a div follows <a> or <span> tags, it will not be on a new line.

— Andrei Volgin

2

Try this:

public String noTags(String str){
    Document d = Jsoup.parse(str);
    TextNode tn = new TextNode(d.body().html(), "");
    return tn.getWholeText();
}

— manji

1

<p><b>hello world</b></p><p><br /><b>yo</b> <a href="google.com">googlez</a></p> but I need hello world yo googlez (without HTML tags)

— Billy

This answer does not return plain text; it returns HTML with newlines inserted.

— KajMagnus

1

/**
 * Recursive method to replace html br with java \n. The recursive method ensures that the linebreaker can never end up pre-existing in the text being replaced.
 * @param html
 * @param linebreakerString
 * @return the html as String with proper java newlines instead of br
 */
public static String replaceBrWithNewLine(String html, String linebreakerString){
    String result = "";
    if(html.contains(linebreakerString)){
        result = replaceBrWithNewLine(html, linebreakerString+"1");
    } else {
        result = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", linebreakerString)).text(); // replace and html line breaks with java linebreak.
        result = result.replaceAll(linebreakerString, "\n");
    }
    return result;
}

Use it by calling it with the HTML in question, containing the br, along with whatever string you want to use as the temporary newline placeholder. For example:

replaceBrWithNewLine(element.html(), "br2n")

The recursion ensures that the string you use as the newline/linebreaker placeholder never actually occurs in the source HTML, because it keeps appending a "1" until the linebreaker placeholder string is no longer found in the HTML. It won't have the formatting issue that the Jsoup.clean methods seem to run into with special characters.

— Chris6647

Good, but you don't need recursion; just add this line: while (dirtyHTML.contains(linebreakerString)) linebreakerString = linebreakerString + "1";

— Dr NotSoKind

Ah yes. Completely true. I guess my mind got carried away by, for once, actually being able to use recursion :)

— Chris6647
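
The loop suggested in the comment above can be sketched as follows. This is plain Java with hypothetical names; the Jsoup.parse(...).text() call that would strip the remaining tags is deliberately left out so that the sketch runs with the standard library alone:

```java
import java.util.regex.Matcher;

final class UniquePlaceholder {
    // Appends "1" to the candidate until it no longer occurs in the HTML.
    static String uniquePlaceholder(String html, String candidate) {
        String placeholder = candidate;
        while (html.contains(placeholder)) {
            placeholder = placeholder + "1";
        }
        return placeholder;
    }

    // Swaps <br> tags for the placeholder, then the placeholder for "\n".
    // Assumption: a real implementation would strip the other tags
    // (for example with jsoup) between these two steps.
    static String brToNewlines(String html, String candidate) {
        String placeholder = uniquePlaceholder(html, candidate);
        String marked = html.replaceAll("(?i)<br[^>]*>", Matcher.quoteReplacement(placeholder));
        // Literal replace, so the placeholder is never interpreted as a regex.
        return marked.replace(placeholder, "\n");
    }

    public static void main(String[] args) {
        String html = "literal br2n here<br>next line";
        System.out.println(uniquePlaceholder(html, "br2n")); // prints: br2n1
        System.out.println(brToNewlines(html, "br2n"));
    }
}
```

With a fixed "br2n" placeholder the literal text "br2n" in the input would also be turned into a line break; growing the placeholder first avoids that collision.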

1

Based on the answers of user121196 and Green Beret, with the selects and the <pre>s, the only solution that works for me is:

org.jsoup.nodes.Element elementWithHtml = ....
elementWithHtml.select("br").append("<pre>\n</pre>");
elementWithHtml.select("p").prepend("<pre>\n\n</pre>");
elementWithHtml.text();

— Bevor