How many characters per character?

At http://shakespeare.mit.edu/ you can find the full text of each of Shakespeare's plays on a single page (e.g. Hamlet).

Write a script that takes the URL of a play from stdin, such as http://shakespeare.mit.edu/hamlet/full.html, and writes to stdout the number of text characters each character in the play spoke, sorted by who spoke the most.

Play, scene and act titles obviously do not count as dialogue, and neither do the character names. Italicized text and [text in square brackets] are not actual dialogue and must not be counted. Spaces and other punctuation within the dialogue must be counted.

(The format of the plays looks very consistent, though I have not looked at all of them. Tell me if I have overlooked anything. Your script does not have to work for the poems.)

Example

Here is a simulated section from Much Ado About Nothing to show what I expect for output:

More Ado About Nothing

Scene 0.

Messenger

I will.

BEATRICE

Do.

LEONATO

You will never.

BEATRICE

No.

Expected output:

LEONATO 15
Messenger 7
BEATRICE 6
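
As a cross-check of the expected output, here is a minimal Python sketch that tallies the simulated section by hand. It uses the four speeches in their English wording ("I will.", "Do.", "You will never.", "No."); the `SPEECHES` list and the `tally` helper are illustrative names, not part of the challenge, and the sketch does no fetching or HTML parsing.

```python
import re
from collections import Counter

# (speaker, line) pairs transcribed from the simulated section.
SPEECHES = [
    ("Messenger", "I will."),
    ("BEATRICE", "Do."),
    ("LEONATO", "You will never."),
    ("BEATRICE", "No."),
]


def tally(speeches):
    """Return (speaker, character count) pairs, most talkative first."""
    totals = Counter()
    for speaker, line in speeches:
        # Drop [bracketed] text; spaces and punctuation are still counted.
        spoken = re.sub(r"\[[^\]]*\]", "", line)
        totals[speaker] += len(spoken)
    return totals.most_common()


for name, count in tally(SPEECHES):
    print(name, count)
```

BEATRICE speaks twice, so her two three-character lines are added together before sorting.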

Scoring

This is code golf. The smallest program in bytes will win.

code-golf string counting

— Calvin's Hobbies

What if someone tackled this Shakespeare challenge in Shakespeare? It would be amazing if that were even possible...

— fuandon

Can we assume we have a list of the characters in the play? Or must we infer the characters from the text? The latter is very difficult, given that some characters (e.g. Messenger) have a mix of upper- and lowercase letters. Others have names in capital letters only (e.g. LEONATO), and some of these are compound names.

— DavidC

Yes, you should infer the names. They are formatted very differently from the dialogue, so given the HTML, telling them apart should not be too tricky.

— Calvin's Hobbies

Should "All" be considered a separate character?

— es1024

@es1024 Yes. Any character with a unique title is considered separate, even if the result does not exactly make sense.

— Calvin's Hobbies

Answers:

PHP (240 characters)

Splits the HTML into strings (using /bl as the delimiter), then runs a couple of regular expressions to extract the name and the words spoken. Stores the length of the spoken words in an array.

Golfed:

<?@$p=preg_match_all;foreach(explode('/bl',implode(file(trim(fgets(STDIN)))))as$c)if($p('/=s.*?b>(.*?):?</',$c,$m)){$p('/=\d.*?>(.*?)</',$c,$o);foreach($m[1]as$n)@$q[$n]+=strlen(implode($o[1]));}arsort($q);foreach($q as$n=>$c)echo"$n $c\n";

Ungolfed:

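<?php
// Read the URL from stdin, fetch the page and join its lines into one string.
$html = implode(file(trim(fgets(STDIN))));
// Split on "/bl" (from "</blockquote>") so each chunk holds one speech.
$arr = explode('/bl',$html);
foreach($arr as $chunk){
    // Speaker names: text inside <b> after a NAME=speech... anchor (trailing ":" dropped).
    if(preg_match_all('/=s.*?b>(.*?):?</',$chunk,$matches)){
        $name = $matches[1];
        // Dialogue lines: text of anchors whose NAME starts with a digit.
        preg_match_all('/=\d.*?>(.*?)</',$chunk,$matches);
        foreach($name as $n)
            // @ suppresses the notice for a name seen for the first time.
            @$names[$n] += strlen(implode($matches[1]));
    }
}
// Sort by count, highest first, keeping the name keys.
arsort($names);
foreach($names as $name=>$count)
    echo "$name $count\n";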

Note: this treats "All" as a separate character.
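
For readers who do not know PHP, the same split-and-regex idea can be sketched in Python. This is only an illustration: the `SAMPLE` fragment is an assumption modeled on the markup that the patterns in this answer match (`NAME=speech...` anchors for speakers, numbered anchors for lines), `count_dialogue` is a hypothetical helper name, and nothing is fetched from the network. Like the PHP version, it does not strip [bracketed] text.

```python
import re
from collections import Counter

# Assumed fragment in the style of the play pages (not fetched from the site).
SAMPLE = """<A NAME=speech1><b>Messenger</b></a>
<blockquote>
<A NAME=1.1.1>I will.</A><br>
</blockquote>

<A NAME=speech2><b>BEATRICE</b></a>
<blockquote>
<A NAME=1.1.2>Do.</A><br>
</blockquote>

<A NAME=speech3><b>LEONATO</b></a>
<blockquote>
<A NAME=1.1.3>You will never.</A><br>
</blockquote>

<A NAME=speech4><b>BEATRICE</b></a>
<blockquote>
<A NAME=1.1.4>No.</A><br>
</blockquote>
"""


def count_dialogue(html):
    """Tally spoken characters per speaker with the answer's two patterns."""
    totals = Counter()
    # "/bl" only occurs in "</blockquote>", so each chunk ends with one speech.
    for chunk in html.split("/bl"):
        names = re.findall(r"=s.*?b>(.*?):?<", chunk)
        if not names:
            continue
        lines = re.findall(r"=\d.*?>(.*?)<", chunk)
        spoken = sum(len(line) for line in lines)
        # A heading can list several speakers; each gets the full count.
        for name in names:
            totals[name] += spoken
    return totals


for name, count in count_dialogue(SAMPLE).most_common():
    print(name, count)
```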

Example:

$ php shakespeare.php <<< "http://shakespeare.mit.edu/hamlet/full.html"
HAMLET 60063
KING CLAUDIUS 21461
LORD POLONIUS 13877
HORATIO 10605
LAERTES 7519
OPHELIA 5916
QUEEN GERTRUDE 5554
First Clown 3701
ROSENCRANTZ 3635
Ghost 3619
MARCELLUS 2350
First Player 1980
OSRIC 1943
Player King 1849
GUILDENSTERN 1747
Player Queen 1220
BERNARDO 1153
Gentleman 978
PRINCE FORTINBRAS 971
VOLTIMAND 896
Second Clown 511
First Priest 499
Captain 400
Lord 338
REYNALDO 330
FRANCISCO 287
LUCIANUS 272
First Ambassador 230
First Sailor 187
Messenger 185
Prologue 94
All 94
Danes 75
Servant 49
CORNELIUS 45

— es1024

Please show some example output.

— DavidC

@DavidCarraher An example has been added.

— es1024

Rebol - 527 (previously 556)

t: complement charset"<"d: charset"0123456789."m: map[]parse to-string read to-url input[any[(s: 0 a: copy[])some["<A NAME=speech"some d"><b>"copy n some t</b></a>(append a trim/with n":")some newline]<blockquote>newline any["<A NAME="some d">"copy q some t</a><br>newline(while[f: find q"["][q: remove/part f next find f"]"]s: s + length? trim head q)|<p><i>some t</i></p>newline][</blockquote>|</body>](foreach n a[m/:n: either none? m/:n[s][s + m/:n]])| skip]]foreach[x y]sort/reverse/skip/compare to-block m 2 2[print[x y]]

This could probably be golfed further, but it is unlikely to get below the answers already provided :(

Ungolfed:

t: complement charset "<"
d: charset "0123456789."
m: map []

parse to-string read to-url input [
    any [
        (s: 0 a: copy [])

        some [
            "<A NAME=speech" some d "><b>" copy n some t </b></a>
            (append a trim/with n ":")
            some newline
        ]

        <blockquote> newline
        any [
            "<A NAME=" some d ">" copy q some t </a><br> newline (
                while [f: find q "["] [
                    q: remove/part f next find f "]"
                ]
                s: s + length? trim head q
            )
            | <p><i> some t </i></p> newline
        ]
        [</blockquote> | </body>]
        (foreach n a [m/:n: either none? m/:n [s] [s + m/:n]])

        | skip
    ]
]

foreach [x y] sort/reverse/skip/compare to-block m 2 2 [print [x y]]

This program removes [text in square brackets] and also trims surrounding whitespace from the dialogue. Without this, the output is identical to es1024's answer.
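
The clean-up step described here can be sketched in a few lines of Python. This is a rough equivalent under assumptions, not a translation of the Rebol code: `clean_line` is a hypothetical helper, brackets are assumed not to nest, and the sample line is only illustrative.

```python
import re


def clean_line(line: str) -> str:
    """Remove [bracketed] text, then trim surrounding whitespace."""
    without_brackets = re.sub(r"\[[^\]]*\]", "", line)
    return without_brackets.strip()


raw = "[Aside]  A little more than kin, and less than kind."
cleaned = clean_line(raw)
# The difference between the two lengths is what separates the totals
# of an answer that strips directions from one that does not.
print(len(raw), len(cleaned))
print(cleaned)
```

Only the whitespace at the ends is trimmed; spaces inside the line still count, as the challenge requires.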

Example:

$ rebol -q shakespeare.reb <<< "http://shakespeare.mit.edu/hamlet/full.html"
HAMLET 59796
KING CLAUDIUS 21343
LORD POLONIUS 13685
HORATIO 10495
LAERTES 7402
OPHELIA 5856
QUEEN GERTRUDE 5464
First Clown 3687
ROSENCRANTZ 3585
Ghost 3556
MARCELLUS 2259
First Player 1980
OSRIC 1925
Player King 1843
GUILDENSTERN 1719
Player Queen 1211
BERNARDO 1135
Gentleman 978
PRINCE FORTINBRAS 953
VOLTIMAND 896
Second Clown 511
First Priest 499
Captain 400
Lord 338
REYNALDO 312
FRANCISCO 287
LUCIANUS 269
First Ambassador 230
First Sailor 187
Messenger 185
Prologue 89
All 76
Danes 51
Servant 49
CORNELIUS 45

— draegtun

Common Lisp - 528

(use-package :plump)(lambda c(u &aux(h (make-hash-table))n r p)(traverse(parse(drakma:http-request u))(lambda(x &aux y)(case p(0(when(and n(not(ppcre:scan"speech"(attribute x"NAME"))))(setf r t y(#1=ppcre:regex-replace-all"aside: "(#1#"^(\\[[^]]*\\] |\\s*)"(text x)"")""))(dolist(w n)(incf(gethash w h 0)(length y)))))(1(if r(setf n()r()))(push(intern(text(aref(children x)0)))n)))):test(lambda(x)(and(element-p x)(setf p(position(tag-name x)'("A""b"):test #'string=)))))(format t"~{~a ~a~^~%~}"(alexandria:hash-table-plist h)))

Explanation

This is a slightly modified version that also prints debugging information (see paste).

(defun c (u &aux
                 (h (make-hash-table)) ;; hash-table
                 n ;; last seen character name
                 r p
                 )
      (traverse                 ;; traverse the DOM generated by ...
       (parse                   ;; ... parsing the text string
        (drakma:http-request u) ;; ... resulting from http-request to link U
        )

       ;; call this lambda for each traversed element
       (lambda (x &aux y)
         (case p
           (0 ;a
            (when(and n(not(alexandria:starts-with-subseq"speech"(attribute x "NAME"))))
              (setf r t)
              (setf y(#1=ppcre:regex-replace-all"aside: "(#1#"^(\\[[^]]*\\] |\\s*)"(text x)"")""))
              (format t "~A ~S~%" n y) ;; debugging
              (dolist(w n)
                (incf
                    (gethash w h 0) ;; get values in hash, with default value 0
                    (length y)))) ;; length of text
            )
           (1 ;b
            (if r(setf n()r()))
            (push (intern (text (aref (children x)0)))n))))

       ;; but only for elements that satisfy the test predicate
       :test
       (lambda(x)
         (and (element-p x) ;; must be an element node
              (setf p(position(tag-name x)'("A""b"):test #'string=)) ;; either <a> or <b>; save result of "position" in p
              )))

        ;; finally, iterate over the elements of the hash table, as a
        ;; plist, i.e. a list of alternating key values (k1 v1 k2 v2 ...),
        ;; and print them as requested. ~{ ~} is an iteration control format.
  (format t "~&~%~%TOTAL:~%~%~{~a ~a~^~%~}" (alexandria:hash-table-plist h)))

Notes

I remove bracketed text as well as the "aside: " occurrence that does not appear in brackets (I also trim whitespace characters). Here is an execution trace with the text being matched and the total for each character, for Hamlet.
Like the other answers, this assumes All is a character. It might be tempting to add the value of All to every other character, but that would be incorrect, since "All" refers to the characters actually present on stage, which requires keeping track of who is present (tracking "exit", "exeunt" and "enter" directions). This is not done.
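
The traversal explained above (remember the names found in `<b>` elements, then credit the text of each non-speech `<A>` anchor to them) could be sketched in Python with the standard `html.parser` module. This is an approximation under assumptions, not the author's code: `SpeechTally`, `tally_speeches` and the `SAMPLE` fragment are hypothetical, and the fragment is modeled on the markup this answer inspects.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Assumed fragment in the style of the play pages (not fetched from the site).
SAMPLE = """<A NAME=speech1><b>Messenger</b></a>
<blockquote>
<A NAME=1.1.1>I will.</A><br>
</blockquote>
<A NAME=speech2><b>BEATRICE</b></a>
<blockquote>
<A NAME=1.1.2>[Aside] Do.</A><br>
<p><i>Exit</i></p>
</blockquote>
<A NAME=speech3><b>LEONATO</b></a>
<blockquote>
<A NAME=1.1.3>You will never.</A><br>
</blockquote>
<A NAME=speech4><b>BEATRICE</b></a>
<blockquote>
<A NAME=1.1.4>No.</A><br>
</blockquote>
"""


class SpeechTally(HTMLParser):
    """Tally dialogue length per speaker while walking <a> and <b> tags."""

    def __init__(self):
        super().__init__()
        self.totals = Counter()
        self.speakers = []          # names heading the current speech
        self.dialogue_seen = False  # True once those speakers have spoken
        self.capture = None         # "name", "line", or None
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names.
        if tag == "b":
            self.capture, self.buffer = "name", []
        elif tag == "a":
            anchor = dict(attrs).get("name") or ""
            if not anchor.startswith("speech"):
                self.capture, self.buffer = "line", []

    def handle_data(self, data):
        if self.capture:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        text = "".join(self.buffer)
        if tag == "b" and self.capture == "name":
            if self.dialogue_seen:  # a new speech starts: forget old speakers
                self.speakers, self.dialogue_seen = [], False
            self.speakers.append(text)
            self.capture = None
        elif tag == "a" and self.capture == "line":
            if self.speakers:
                # Strip a leading "[...] " or leading whitespace, then "aside: ".
                spoken = re.sub(r"^(\[[^\]]*\] |\s*)", "", text)
                spoken = spoken.replace("aside: ", "")
                self.dialogue_seen = True
                for name in self.speakers:
                    self.totals[name] += len(spoken)
            self.capture = None


def tally_speeches(html):
    parser = SpeechTally()
    parser.feed(html)
    return parser.totals


for name, count in tally_speeches(SAMPLE).most_common():
    print(name, count)
```

Italic stage directions such as `<p><i>Exit</i></p>` are ignored simply because nothing is being captured when their text arrives. Tracking who is on stage, as the note on "All" would require, is not attempted here either.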

— coredump