Regex: Match an egalitarian series

Introduction

I don't see many regex challenges on here, so I would like to offer up this deceptively simple one, which can be solved in a number of ways using a number of regex flavours. I hope it gives regex enthusiasts a bit of fun golfing time.

Challenge

The challenge is to match what I've very loosely dubbed an "egalitarian" series: a series of equal numbers of different characters. This is best described with examples.

Match:

aaabbbccc
xyz 
iillppddff
ggggggoooooollllllffffff
abc
banana

Don't match:

aabc
xxxyyzzz
iilllpppddff
ggggggoooooollllllfff
aaaaaabbbccc
aaabbbc
abbaa
aabbbc

To generalize, we want to match a subject of the form (c₁)ⁿ(c₂)ⁿ(c₃)ⁿ...(c_k)ⁿ for any list of characters c₁ to c_k, where c_i != c_(i+1) for all i, k > 1, and n > 0.

Clarifications:

Input will not be empty.
A character may repeat itself later in the string (e.g. "banana").
k > 1, so there will always be at least 2 different characters in the string.
You can assume that only ASCII characters will be passed as input and that no character will be a line terminator.
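
For anyone who wants a non-regex reference to check candidate regexes against, the definition above can be sketched in a few lines of Python. This is only my reading of the spec, not part of the challenge: the helper names run_lengths and is_egalitarian are made up for illustration, and the k > 1 requirement is taken to mean "at least two runs".

```python
from itertools import groupby


def run_lengths(s):
    """Lengths of the maximal runs of identical characters, left to right."""
    return [len(list(group)) for _, group in groupby(s)]


def is_egalitarian(s):
    """True if s has at least two runs (k > 1) that all share one length n."""
    lengths = run_lengths(s)
    return len(lengths) > 1 and len(set(lengths)) == 1


if __name__ == "__main__":
    for s in ["aaabbbccc", "banana", "aabc", "abbaa"]:
        print(s, run_lengths(s), is_egalitarian(s))
```

Because neighbouring runs produced by groupby always consist of different characters, the condition c_i != c_(i+1) holds automatically, so only the run count and the run lengths need checking.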

Rules

(Thank you to Martin Ender for this excellently stated block of rules)

Your answer should consist of a single regex, without any additional code (except, optionally, a list of regex modifiers required to make your solution work). You must not use features of your language's regex flavour that allow you to invoke code in the hosting language (e.g. Perl's e modifier).

You can use any regex flavour that existed before this challenge, but please specify the flavour.

Do not assume that the regex is anchored implicitly, e.g. if you're using Python, assume that your regex is used with re.search and not with re.match. Your regex must match the entire string for valid egalitarian strings and yield no matches for invalid strings. You may use as many capturing groups as you wish.

You may assume that the input will always be a string of two or more ASCII characters not containing any line terminators.

This is regex golf, so the shortest regex in bytes wins. If your language requires delimiters (usually /.../) to denote regular expressions, don't count the delimiters themselves. If your solution requires modifiers, add one byte per modifier.
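
As a rough illustration of how these rules could be checked mechanically, here is a small Python sketch. Everything in it is an assumption for demonstration purposes: score and failing_cases are hypothetical helper names, and TOY is a deliberately incomplete pattern that only handles runs of length 1, since Python's re module has neither balancing groups nor recursion. The .NET regex is only measured, never compiled.

```python
import re


def score(pattern, modifiers=""):
    """Byte count of the regex itself plus one byte per modifier."""
    return len(pattern.encode("utf-8")) + len(modifiers)


def failing_cases(pattern, should_match, should_not_match, flags=0):
    """Inputs on which an unanchored re.search breaks the challenge rules."""
    failures = []
    for s in should_match:
        m = re.search(pattern, s, flags)
        if m is None or m.group(0) != s:  # must match the entire string
            failures.append(s)
    for s in should_not_match:
        if re.search(pattern, s, flags) is not None:  # must yield no match
            failures.append(s)
    return failures


# The 48-byte .NET answer from this thread, used here only for byte counting.
DOTNET_48 = r"^(.)\1*((?<=(\5())*(.))(.)(?<-4>\6)*(?!\4|\6))+$"

# Hypothetical toy candidate: accepts only strings without adjacent repeats.
TOY = r"^(?:(.)(?!\1))+$"

MATCH = ["aaabbbccc", "xyz", "iillppddff", "ggggggoooooollllllffffff", "abc", "banana"]
NO_MATCH = ["aabc", "xxxyyzzz", "iilllpppddff", "ggggggoooooollllllfff",
            "aaaaaabbbccc", "aaabbbc", "abbaa", "aabbbc"]

if __name__ == "__main__":
    print(score(DOTNET_48))
    print(failing_cases(TOY, MATCH, NO_MATCH))
```

The comparison m.group(0) != s is what enforces "match the entire string" without relying on implicit anchoring.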

Criteria

This is good ol' fashioned golf, so forget efficiency and just try to get your regex as small as possible.

Please mention which regex flavour you have used and, if possible, include a link showing an online demo of your expression in action.

code-golf string regular-expression

— jaytea

Is this specifically a regex golf? You should probably clarify that, along with the rules for it. Most challenges on this site are golfs in assorted programming languages.

— LyricLy

@LyricLy Thanks for the advice! Yes, I would like it to be purely regex, i.e. a single regular expression in a regex flavour of the submitter's choosing. Are there any other rules I should be mindful of including?

— jaytea

I don't understand your definition of "egalitarian", such that banana is egalitarian.

— msh210

@msh210 When I came up with the term "egalitarian" to describe the series, I didn't consider that I would allow characters to be repeated later in the series (such as in "banana" or "aaabbbcccaaa", etc.). I just wanted a term to represent the idea that every chunk of repeated characters is the same size. Since "banana" has no adjacent repeated characters (every run has length 1), this definition holds true for it.

— jaytea

Answers:

.NET flavour, 48 bytes

^(.)\1*((?<=(\5())*(.))(.)(?<-4>\6)*(?!\4|\6))+$

Try it online! (using Retina)

Alright, it turns out that not negating the logic is simpler after all. I'm making this a separate answer, because the two approaches are completely different.

Explanation

^            # Anchor the match to the beginning of the string.
(.)\1*       # Match the first run of identical characters. In principle, 
             # it's possible that this matches only half, a quarter, an 
             # eighth etc. of the first run, but that won't affect the
             # result of the match (in other words, if the match fails with 
             # matching this as the entire first run, then backtracking into
             # only matching half of it won't cause the rest of the regex to
             # match either).
(            # Match this part one or more times. Each instance matches one
             # run of identical letters.
  (?<=       #   We start with a lookbehind to record the length
             #   of the preceding run. Remember that the lookbehind
             #   should be read from the bottom up (and so should
             #   my comments).
    (\5())*  #     And then we match all of its adjacent copies, pushing an
             #     empty capture onto stack 4 each time. That means at the
             #     end of the lookbehind, we will have n-1 captures on stack 4,
             #     where n is the length of the preceding run. Due to the 
             #     atomic nature of lookbehinds, we don't have to worry 
             #     about backtracking matching less than n-1 copies here.
    (.)      #     We capture the character that makes up the preceding
             #     run in group 5.
  )
  (.)        #   Capture the character that makes up the next run in group 6.
  (?<-4>\6)* #   Match copies of that character while depleting stack 4.
             #   If the runs are the same length that means we need to be
             #   able to get to the end of the run at the same time we
             #   empty stack 4 completely.
  (?!\4|\6)  #   This lookahead ensures that. If stack 4 is not empty yet,
             #   \4 will match, because the captures are all empty, so the
             #   backreference can't fail. If the stack is empty though,
             #   then the backreference will always fail. Similarly, if we
             #   are not at the end of the run yet, then \6 will match 
             #   another copy of the run. So we ensure that neither \4 nor
             #   \6 are possible at this position to assert that this run
             #   has the same length as the previous one.
)+
$            # Finally, we make sure that we can cover the entire string
             # by going through runs of identical lengths like this.
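
For readers who find the stack bookkeeping hard to follow, here is a rough procedural analogue in Python. It is a sketch of the intended matching path only, under my own assumptions: the function name matches_by_stack_simulation is made up, runs are precomputed with groupby instead of being discovered by backtracking, and the list stack stands in for the captures on stack 4.

```python
from itertools import groupby


def matches_by_stack_simulation(s):
    """Mimic the run-by-run logic of the regex with an explicit stack."""
    runs = ["".join(group) for _, group in groupby(s)]
    if len(runs) < 2:
        return False  # the ( ... )+ part needs at least one run after the first
    for prev, cur in zip(runs, runs[1:]):
        # Lookbehind: one empty capture per extra copy in the preceding run.
        stack = [""] * (len(prev) - 1)
        # (.) has already consumed the first character of the current run.
        remaining = len(cur) - 1
        # (?<-4>\6)*: pop one capture for each further character of the run.
        while stack and remaining:
            stack.pop()
            remaining -= 1
        # (?!\4|\6): the stack and the run must be exhausted at the same time.
        if stack or remaining:
            return False
    return True


if __name__ == "__main__":
    for s in ["aaabbbccc", "banana", "aaabbbc", "abbaa"]:
        print(s, matches_by_stack_simulation(s))
```

The stack is rebuilt for every pair of neighbouring runs, which mirrors how the lookbehind re-measures the preceding run on each iteration of the outer group.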

— Martin Ender

I love that you've oscillated between the two methods! I also thought the negative approach ought to be shorter, until I actually tried it and found it much more awkward (even though it feels like it should be simpler). I have 48b in PCRE and 49b in Perl with a completely different method, and with your third method in .NET the same size, I'd say this is a pretty cool regex challenge :D

— jaytea

@jaytea I'd love to see those. If no one comes up with anything for a week or so, I hope you'll post them yourself. :) And yeah, agreed, it's nice that the approaches are so close in byte count.

— Martin Ender

I just might! Also, the Perl one has been golfed down to 46b ;)

— jaytea

So I figured you might want to see these now! Here's 48b in PCRE: ((^.|\2(?=.*\4\3)|\4(?!\3))(?=\2*+((.)\3?)))+\3$ I was experimenting with \3* in place of (?!\3) to make it 45b, but that fails on "aabbbc" :( The Perl version is easier to understand, and it's down to 45b now: ^((?=(.)\2*(.))(?=(\2(?4)?\3)(?!\3))\2+)+\3+$ - the reason I call it Perl even though it seems to be valid PCRE is that PCRE thinks (\2(?4)?\3) may recurse indefinitely, whereas Perl is a little smarter/more forgiving!

— jaytea

@jaytea Ah, those are really neat solutions. You should really post them in a separate answer. :)

— Martin Ender

.NET flavour, 54 bytes

^(?!.*(?<=(\2)*(.))(?!\2)(?>(.)(?<-1>\3)*)(?(1)|\3)).+

Try it online! (using Retina)

I'm pretty sure this is suboptimal, but it's the best I'm coming up with for balancing groups right now. I've got one alternative at the same byte count, which is mostly the same:

^(?!.*(?<=(\3())*(.))(?!\3)(?>(.)(?<-2>\4)*)(\2|\4)).+

Explanation

The main idea is to invert the problem: match non-egalitarian strings and put the whole thing in a negative lookahead to negate the result. The benefit is that we don't have to keep track of n across the entire string (because, due to the nature of balancing groups, you usually consume n when checking it) to check that all runs are of equal length. Instead, we just look for a single pair of adjacent runs that don't have the same length. That way, I only need to use n once.
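
Outside of regex, the inversion can be sketched like this in Python. It is just an illustration of the idea, not a translation of the regex; the names has_unequal_neighbours and is_egalitarian_by_negation are invented for the example.

```python
from itertools import groupby


def has_unequal_neighbours(s):
    """True if some pair of adjacent runs differs in length."""
    lengths = [len(list(group)) for _, group in groupby(s)]
    return any(a != b for a, b in zip(lengths, lengths[1:]))


def is_egalitarian_by_negation(s):
    """Accept exactly the strings without a mismatched pair of neighbours."""
    return not has_unequal_neighbours(s)


if __name__ == "__main__":
    for s in ["aaabbbccc", "banana", "xxxyyzzz", "abbaa"]:
        print(s, is_egalitarian_by_negation(s))
```

Note that this check, taken on its own, also accepts a single run such as "aaa", because there is no pair of runs to compare. That only matters for inputs the challenge rules out, since k > 1 is guaranteed.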

Here is a breakdown of the regex.

^(?!.*         # This negative lookahead means that we will match
               # all strings where the pattern inside the lookahead
               # would fail if it were used as a regex on its own.
               # Due to the .* that inner regex can match from any
               # position inside the string. The particular position
               # we're looking for is between two runs (and this
               # will be ensured later).

  (?<=         #   We start with a lookbehind to record the length
               #   of the preceding run. Remember that the lookbehind
               #   should be read from the bottom up (and so should
               #   my comments).
    (\2)*      #     And then we match all of its adjacent copies, capturing
               #     them separately in group 1. That means at the
               #     end of the lookbehind, we will have n-1 captures
               #     on stack 1, where n is the length of the preceding
               #     run. Due to the atomic nature of lookbehinds, we
               #     don't have to worry about backtracking matching
               #     less than n-1 copies here.
    (.)        #     We capture the character that makes up the preceding
               #     run in group 2.
  )
  (?!\2)       #   Make sure the next character isn't the same as the one
               #   we used for the preceding run. This ensures we're at a
               #   boundary between runs.
  (?>          #   Match the next stuff with an atomic group to avoid
               #   backtracking.
    (.)        #     Capture the character that makes up the next run
               #     in group 3.
    (?<-1>\3)* #     Match as many of these characters as possible while
               #     depleting the captures on stack 1.
  )
               #   Due to the atomic group, there are two possible
               #   situations that cause the previous quantifier to stop
               #   matching.
               #   Either the run has ended, or stack 1 has been depleted.
               #   If both of those are true, the runs are the same length,
               #   and we don't actually want a match here. But if the runs
               #   are of different lengths then either the run ended but
               #   the stack isn't empty yet, or the stack was depleted but
               #   the run hasn't ended yet.
  (?(1)|\3)    #   This conditional matches these last two cases. If there's
               #   still a capture on stack 1, we don't match anything,
               #   because we know this run was shorter than the previous
               #   one. But if stack 1 is empty, we want to match another copy of
               #   the character in this run to ensure that this run is 
               #   longer than the previous one.
)
.+             # Finally we just match the entire string to comply with the
               # challenge spec.
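
The decision made at a single boundary by the atomic group and the conditional can be mimicked with plain counters. This is a hedged sketch: boundary_is_mismatch is a hypothetical name, and the two integers stand in for the captures on stack 1 and the characters left in the current run.

```python
def boundary_is_mismatch(prev_len, cur_len):
    """Would the inner pattern succeed at a boundary between these two runs?"""
    stack = prev_len - 1         # captures pushed by the lookbehind
    extra = cur_len - 1          # run characters left after the first (.)
    used = min(stack, extra)     # atomic group: greedy, never gives anything back
    stack -= used
    extra -= used
    if stack > 0:
        return True              # (?(1)|...): stack not empty, current run is shorter
    return extra > 0             # otherwise \3 must match: current run is longer


if __name__ == "__main__":
    for pair in [(3, 3), (3, 2), (2, 3), (1, 1)]:
        print(pair, boundary_is_mismatch(*pair))
```

A string is then rejected by the outer negative lookahead as soon as any boundary reports a mismatch.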

— Martin Ender

I tried to make it fail on: banana, aba, bbbaaannnaaannnaaa, bbbaaannnaaannnaaaaaa, The Nineteenth Byte, 11, 110, ^(?!.*(?<=(\2)*(.))(?!\2)(?>(.)(?<-1>\3)*)(?(1)|\3)).+, bababa. I'm the one who failed. :( +1

— Erik the Outgolfer

That moment when you finish your explanation and then figure out that you can save 1 byte by using the exact opposite approach... I guess I'll make another answer in a bit... :|

— Martin Ender

@MartinEnder ... And then you realise you could golf this one by 2 bytes haha :P

— Mr. Xcoder

@Mr.Xcoder It would have to be 7 bytes now, so I hope I'm safe. ;)

— Martin Ender