Split a string by spaces, preserving quoted substrings, in Python

269

I have a string that looks like this:

this is "a test"

I'm trying to write something in Python to split it by spaces while ignoring spaces within quotes. The result I'm looking for is:

['this','is','a test']

PS. I know you are going to ask "what happens if there are quotes within the quotes?" Well, in my application, that will never happen.

python regex

— Adam Pierce

1

Thanks for asking this question. It's exactly what I needed to fix the pypar build module.

— Martlark

393

You want split, from the built-in shlex module.

>>> import shlex
>>> shlex.split('this is "a test"')
['this', 'is', 'a test']

This should do exactly what you want.

— Ierub

13

Use posix=False to preserve the quotes: shlex.split('this is "a test"', posix=False) returns ['this', 'is', '"a test"']

— Boon
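
To see the two modes side by side, here is a minimal sketch that uses only the standard-library shlex module; the variable names are just for illustration.

```python
import shlex

text = 'this is "a test"'

# Default (posix=True): quotes group the words and are then removed.
stripped = shlex.split(text)

# posix=False: quotes still group the words but stay in the token.
kept = shlex.split(text, posix=False)

print(stripped)  # ['this', 'is', 'a test']
print(kept)      # ['this', 'is', '"a test"']
```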

@MatthewG. The "fix" in Python 2.7.3 means that passing a unicode string to shlex.split() will trigger a UnicodeEncodeError exception.

— Rockallite

57

Have a look at the shlex module, particularly shlex.split.

>>> import shlex
>>> shlex.split('This is "a test"')
['This', 'is', 'a test']

— Allen

40

I see regex approaches here that look complex and/or wrong. This surprises me, because regex syntax can easily describe "whitespace or thing surrounded by quotes", and most regex engines (including Python's) can split on a regex. So if you're going to use regexes, why not just say exactly what you mean?

test = 'this is "a test"'  # or "this is 'a test'"
# pieces = [p for p in re.split("( |[\\\"'].*[\\\"'])", test) if p.strip()]
# From comments, use this:
pieces = [p for p in re.split("( |\\\".*?\\\"|'.*?')", test) if p.strip()]

Explanation:

[\\\"'] = double-quote or single-quote
.* = anything
( |X) = space or X
.strip() = remove space and empty-string separators

shlex probably provides more features, though.

1

I was thinking much the same, but would suggest instead [t.strip('"') for t in re.findall(r'[^\s"]+|"[^"]*"', 'this is "a test"')]

— Darius Bacon

2

+1 I'm using this because it was a lot faster than shlex.

— Hanleyp

3

Why the triple backslash? Won't a single backslash do the same?

— Doppelganger

1

Actually, one thing I don't like is that text attached before or after the quotes is not split properly. If I have a string like 'PARAMS val1="Thing" val2="Thing2"', I expect it to split into three pieces, but it splits into 5. It's been a while since I've done regex, so I don't feel like trying to solve it using your solution right now.

— leetNightshade
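
For inputs like that, one possible way to get three pieces is a findall pattern that treats a quoted run as part of the surrounding non-space token. This is only a sketch: the helper name split_keep_attached is made up for illustration, and the pattern keeps the quotes in the output.

```python
import re

# A token is a run of "quoted text" or non-space characters, in any mix,
# so key="value with spaces" stays in one piece.
TOKEN = re.compile(r'(?:".*?"|\S)+')

def split_keep_attached(s):
    # Hypothetical helper name; returns tokens with their quotes intact.
    return TOKEN.findall(s)

pieces = split_keep_attached('PARAMS val1="Thing" val2="Thing 2"')
print(pieces)  # ['PARAMS', 'val1="Thing"', 'val2="Thing 2"']
```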

1

You should use raw strings when using regular expressions.

— asmeurer
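
As a rough illustration of the raw-string point, the split pattern from the answer can be written without any backslashes. This is a sketch rather than the answer author's code; the triple-quoted raw literal is just one way to hold both quote characters.

```python
import re

test = 'this is "a test"'

# Raw string: the regex engine sees the pattern exactly as typed, so the
# double quotes need no backslashes at all.
pattern = r"""( |".*?"|'.*?')"""

# re.split keeps the captured separators; drop the blank and empty ones.
pieces = [p for p in re.split(pattern, test) if p.strip()]
print(pieces)  # ['this', 'is', '"a test"']
```

Note that the quotes stay in the result; strip them afterwards if they are not wanted.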

29

Depending on your use case, you may also want to check out the csv module:

import csv
lines = ['this is "a string"', 'and more "stuff"']
for row in csv.reader(lines, delimiter=" "):
    print(row)

Output:

['this', 'is', 'a string']
['and', 'more', 'stuff']

— Ryan Ginstrom

2

Useful when shlex strips some characters you need.

— scraplesh

1

CSV uses two double quotes in a row (side by side, "") to represent one double quote ", so it will turn two double quotes into a single one: 'this is "a string""' and 'this is "a string"""' will both map to ['this', 'is', 'a string"']

— Boris
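
A small sketch of that doubled-quote behaviour, using a well-formed quoted field so the result does not depend on how the reader treats an unterminated quote. The helper name csv_split mirrors the one used in a later answer.

```python
import csv

def csv_split(line):
    # csv.reader expects an iterable of lines, so wrap the string in a list.
    return next(csv.reader([line], delimiter=" "))

plain = csv_split('this is "a string"')
print(plain)    # ['this', 'is', 'a string']

# Inside a quoted field, "" is read as one literal double quote.
doubled = csv_split('say "a ""quoted"" word" now')
print(doubled)  # ['say', 'a "quoted" word', 'now']
```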

15

I use shlex.split to process 70,000,000 lines of squid log, and it's so slow. So I switched to re.

Try this if you have performance problems with shlex.

import re

def line_split(line):
    return re.findall(r'[^"\s]\S*|".+?"', line)

— Daniel Dai

8

Since this question is tagged with regex, I decided to try a regex approach. I first replace all the spaces in the quoted parts with \x00, then split by spaces, then replace the \x00 back to spaces in each part.

Both versions do the same thing, but splitter is a bit more readable than splitter2.

import re

s = 'this is "a test" some text "another test"'

def splitter(s):
    def replacer(m):
        return m.group(0).replace(" ", "\x00")
    parts = re.sub('".+?"', replacer, s).split()
    parts = [p.replace("\x00", " ") for p in parts]
    return parts

def splitter2(s):
    return [p.replace("\x00", " ") for p in re.sub('".+?"', lambda m: m.group(0).replace(" ", "\x00"), s).split()]

print(splitter2(s))

— elifiner

You should have used re.Scanner instead. It's more reliable (and I have in fact implemented a shlex-like splitter using re.Scanner).

— Devin Jeanpierre
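
For readers who have not met re.Scanner: it is an undocumented class in the standard-library re module, so treat the following as a sketch under that caveat and not as the commenter's implementation. Each lexicon entry pairs a pattern with a callback, or with None to discard the match.

```python
import re

# The three rules below are illustrative; they cover double quotes only.
scanner = re.Scanner([
    (r'"[^"]*"', lambda sc, tok: tok[1:-1]),  # quoted run: drop the quotes
    (r'[^\s"]+', lambda sc, tok: tok),        # bare word: keep as is
    (r'\s+', None),                           # whitespace: skip
])

# scan() returns the collected tokens plus whatever text it could not match.
tokens, remainder = scanner.scan('this is "a test"')
print(tokens)           # ['this', 'is', 'a test']
print(repr(remainder))  # ''
```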

+1 Hm, this is a pretty smart idea: breaking the problem down into multiple steps so the answer isn't terribly complex. shlex didn't do exactly what I needed, even after trying to tweak it. And the single-pass regex solutions were getting really weird and complicated.

— leetNightshade

6

It seems that re is the faster option when performance matters. Here is my solution using a non-greedy operator that preserves the outer quotes:

re.findall(r'(?:".*?"|\S)+', s)

Result:

['this', 'is', '"a test"']

It leaves constructs like aaa"bla blub"bbb together, since these tokens are not separated by spaces. If the string contains escaped characters, you can match like this:

>>> a = "She said \"He said, \\\"My name is Mark.\\\"\""
>>> a
'She said "He said, \\"My name is Mark.\\""'
>>> for i in re.findall(r'(?:".*?[^\\]"|\S)+', a): print(i)
...
She
said
"He said, \"My name is Mark.\""

Note that this also matches the empty string "" by means of the \S part of the pattern.

— Hochl

1

Another important advantage of this solution is its versatility with respect to the delimiting character (e.g. "," via '(?:".*?"|[^,])+'). The same applies to the quoting (enclosing) characters.

— a_guest
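
A quick sketch of that delimiter swap. The helper name split_on is hypothetical and it assumes a single-character delimiter; like the original pattern, it keeps the quotes and silently drops empty fields.

```python
import re

def split_on(delimiter, s):
    # Build the pattern for one delimiter character; re.escape guards
    # characters such as ']' or '^' that are special inside a class.
    pattern = r'(?:".*?"|[^%s])+' % re.escape(delimiter)
    return re.findall(pattern, s)

fields = split_on(',', 'a,"b,c",d')
print(fields)  # ['a', '"b,c"', 'd']
```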

4

The main problem with the accepted shlex approach is that it does not ignore escape characters outside quoted substrings, and it gives slightly unexpected results in some corner cases.

I have the following use case, where I need a split function that splits input strings such that either single-quoted or double-quoted substrings are preserved, with the ability to escape quotes within such a substring. Quotes within an unquoted string should not be treated differently from any other character. Some example test cases with the expected output:

input string         | expected output
===============================================
 'abc def'           | ['abc', 'def']
 "abc \\s def"       | ['abc', '\\s', 'def']
 '"abc def" ghi'     | ['abc def', 'ghi']
 "'abc def' ghi"     | ['abc def', 'ghi']
 '"abc \\" def" ghi' | ['abc " def', 'ghi']
 "'abc \\' def' ghi" | ["abc ' def", 'ghi']
 "'abc \\s def' ghi" | ['abc \\s def', 'ghi']
 '"abc \\s def" ghi' | ['abc \\s def', 'ghi']
 '"" test'           | ['', 'test']
 "'' test"           | ['', 'test']
 "abc'def"           | ["abc'def"]
 "abc'def'"          | ["abc'def'"]
 "abc'def' ghi"      | ["abc'def'", 'ghi']
 "abc'def'ghi"       | ["abc'def'ghi"]
 'abc"def'           | ['abc"def']
 'abc"def"'          | ['abc"def"']
 'abc"def" ghi'      | ['abc"def"', 'ghi']
 'abc"def"ghi'       | ['abc"def"ghi']
 "r'AA' r'.*_xyz$'"  | ["r'AA'", "r'.*_xyz$'"]

I ended up with the following function, which splits a string so that the expected output is produced for all of the input strings above:

import re

def quoted_split(s):
    def strip_quotes(s):
        if s and (s[0] == '"' or s[0] == "'") and s[0] == s[-1]:
            return s[1:-1]
        return s
    return [strip_quotes(p).replace('\\"', '"').replace("\\'", "'") \
            for p in re.findall(r'"(?:\\.|[^"])*"|\'(?:\\.|[^\'])*\'|[^\s]+', s)]

The following test application checks the results of other approaches (shlex and csv for now) and the custom split implementation:

#!/bin/python2.7

import csv
import re
import shlex

from timeit import timeit

def test_case(fn, s, expected):
    try:
        if fn(s) == expected:
            print '[ OK ] %s -> %s' % (s, fn(s))
        else:
            print '[FAIL] %s -> %s' % (s, fn(s))
    except Exception as e:
        print '[FAIL] %s -> exception: %s' % (s, e)

def test_case_no_output(fn, s, expected):
    try:
        fn(s)
    except:
        pass

def test_split(fn, test_case_fn=test_case):
    test_case_fn(fn, 'abc def', ['abc', 'def'])
    test_case_fn(fn, "abc \\s def", ['abc', '\\s', 'def'])
    test_case_fn(fn, '"abc def" ghi', ['abc def', 'ghi'])
    test_case_fn(fn, "'abc def' ghi", ['abc def', 'ghi'])
    test_case_fn(fn, '"abc \\" def" ghi', ['abc " def', 'ghi'])
    test_case_fn(fn, "'abc \\' def' ghi", ["abc ' def", 'ghi'])
    test_case_fn(fn, "'abc \\s def' ghi", ['abc \\s def', 'ghi'])
    test_case_fn(fn, '"abc \\s def" ghi', ['abc \\s def', 'ghi'])
    test_case_fn(fn, '"" test', ['', 'test'])
    test_case_fn(fn, "'' test", ['', 'test'])
    test_case_fn(fn, "abc'def", ["abc'def"])
    test_case_fn(fn, "abc'def'", ["abc'def'"])
    test_case_fn(fn, "abc'def' ghi", ["abc'def'", 'ghi'])
    test_case_fn(fn, "abc'def'ghi", ["abc'def'ghi"])
    test_case_fn(fn, 'abc"def', ['abc"def'])
    test_case_fn(fn, 'abc"def"', ['abc"def"'])
    test_case_fn(fn, 'abc"def" ghi', ['abc"def"', 'ghi'])
    test_case_fn(fn, 'abc"def"ghi', ['abc"def"ghi'])
    test_case_fn(fn, "r'AA' r'.*_xyz$'", ["r'AA'", "r'.*_xyz$'"])

def csv_split(s):
    return list(csv.reader([s], delimiter=' '))[0]

def re_split(s):
    def strip_quotes(s):
        if s and (s[0] == '"' or s[0] == "'") and s[0] == s[-1]:
            return s[1:-1]
        return s
    return [strip_quotes(p).replace('\\"', '"').replace("\\'", "'") for p in re.findall(r'"(?:\\.|[^"])*"|\'(?:\\.|[^\'])*\'|[^\s]+', s)]

if __name__ == '__main__':
    print 'shlex\n'
    test_split(shlex.split)
    print

    print 'csv\n'
    test_split(csv_split)
    print

    print 're\n'
    test_split(re_split)
    print

    iterations = 100
    setup = 'from __main__ import test_split, test_case_no_output, csv_split, re_split\nimport shlex, re'
    def benchmark(method, code):
        print '%s: %.3fms per iteration' % (method, (1000 * timeit(code, setup=setup, number=iterations) / iterations))
    benchmark('shlex', 'test_split(shlex.split, test_case_no_output)')
    benchmark('csv', 'test_split(csv_split, test_case_no_output)')
    benchmark('re', 'test_split(re_split, test_case_no_output)')

Output:

shlex

[ OK ] abc def -> ['abc', 'def']
[FAIL] abc \s def -> ['abc', 's', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[ OK ] 'abc def' ghi -> ['abc def', 'ghi']
[ OK ] "abc \" def" ghi -> ['abc " def', 'ghi']
[FAIL] 'abc \' def' ghi -> exception: No closing quotation
[ OK ] 'abc \s def' ghi -> ['abc \\s def', 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[ OK ] '' test -> ['', 'test']
[FAIL] abc'def -> exception: No closing quotation
[FAIL] abc'def' -> ['abcdef']
[FAIL] abc'def' ghi -> ['abcdef', 'ghi']
[FAIL] abc'def'ghi -> ['abcdefghi']
[FAIL] abc"def -> exception: No closing quotation
[FAIL] abc"def" -> ['abcdef']
[FAIL] abc"def" ghi -> ['abcdef', 'ghi']
[FAIL] abc"def"ghi -> ['abcdefghi']
[FAIL] r'AA' r'.*_xyz$' -> ['rAA', 'r.*_xyz$']

csv

[ OK ] abc def -> ['abc', 'def']
[ OK ] abc \s def -> ['abc', '\\s', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[FAIL] 'abc def' ghi -> ["'abc", "def'", 'ghi']
[FAIL] "abc \" def" ghi -> ['abc \\', 'def"', 'ghi']
[FAIL] 'abc \' def' ghi -> ["'abc", "\\'", "def'", 'ghi']
[FAIL] 'abc \s def' ghi -> ["'abc", '\\s', "def'", 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[FAIL] '' test -> ["''", 'test']
[ OK ] abc'def -> ["abc'def"]
[ OK ] abc'def' -> ["abc'def'"]
[ OK ] abc'def' ghi -> ["abc'def'", 'ghi']
[ OK ] abc'def'ghi -> ["abc'def'ghi"]
[ OK ] abc"def -> ['abc"def']
[ OK ] abc"def" -> ['abc"def"']
[ OK ] abc"def" ghi -> ['abc"def"', 'ghi']
[ OK ] abc"def"ghi -> ['abc"def"ghi']
[ OK ] r'AA' r'.*_xyz$' -> ["r'AA'", "r'.*_xyz$'"]

re

[ OK ] abc def -> ['abc', 'def']
[ OK ] abc \s def -> ['abc', '\\s', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[ OK ] 'abc def' ghi -> ['abc def', 'ghi']
[ OK ] "abc \" def" ghi -> ['abc " def', 'ghi']
[ OK ] 'abc \' def' ghi -> ["abc ' def", 'ghi']
[ OK ] 'abc \s def' ghi -> ['abc \\s def', 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[ OK ] '' test -> ['', 'test']
[ OK ] abc'def -> ["abc'def"]
[ OK ] abc'def' -> ["abc'def'"]
[ OK ] abc'def' ghi -> ["abc'def'", 'ghi']
[ OK ] abc'def'ghi -> ["abc'def'ghi"]
[ OK ] abc"def -> ['abc"def']
[ OK ] abc"def" -> ['abc"def"']
[ OK ] abc"def" ghi -> ['abc"def"', 'ghi']
[ OK ] abc"def"ghi -> ['abc"def"ghi']
[ OK ] r'AA' r'.*_xyz$' -> ["r'AA'", "r'.*_xyz$'"]

shlex: 0.281ms per iteration
csv: 0.030ms per iteration
re: 0.049ms per iteration

So performance is much better than shlex, and it can be improved further by precompiling the regular expression, in which case it will outperform the csv approach.

— Ton van den Heuvel
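
The closing remark about precompiling could look roughly like this. It is a sketch, not the benchmarked code from the answer: the names TOKEN_RE and quoted_split_compiled are made up here, and no timing is claimed for it.

```python
import re

# Same pattern as in the answer, compiled once at import time:
# a double-quoted run, a single-quoted run, or a run of non-whitespace.
TOKEN_RE = re.compile(r'"(?:\\.|[^"])*"|\'(?:\\.|[^\'])*\'|[^\s]+')

def strip_quotes(tok):
    # Remove one pair of matching outer quotes, if present.
    if tok and tok[0] in '"\'' and tok[0] == tok[-1]:
        return tok[1:-1]
    return tok

def quoted_split_compiled(s):
    # Strip the outer quotes, then unescape \" and \' in each token.
    return [strip_quotes(p).replace('\\"', '"').replace("\\'", "'")
            for p in TOKEN_RE.findall(s)]

print(quoted_split_compiled('"abc def" ghi'))  # ['abc def', 'ghi']
print(quoted_split_compiled('abc"def" ghi'))   # ['abc"def"', 'ghi']
```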

Not sure what you're talking about:

>>> shlex.split('this is "a test"')
['this', 'is', 'a test']
>>> shlex.split('this is \\"a test\\"')
['this', 'is', '"a', 'test"']
>>> shlex.split('this is "a \\"test\\""')
['this', 'is', 'a "test"']

— morsik

@morsik, what is your point? Maybe your use case does not match mine? When you look at the test cases you'll see all the cases where shlex does not behave as expected for my use cases.

— Ton van den Heuvel

3

To preserve quotes use this function:

def getArgs(s):
    args = []
    cur = ''
    inQuotes = 0
    for char in s.strip():
        if char == ' ' and not inQuotes:
            args.append(cur)
            cur = ''
        elif char == '"' and not inQuotes:
            inQuotes = 1
            cur += char
        elif char == '"' and inQuotes:
            inQuotes = 0
            cur += char
        else:
            cur += char
    args.append(cur)
    return args

— THE_MAD_KING

When tested against a larger string, your function is very slow.

— Faran2007

3

Speed test of different answers:

import re
import shlex
import csv

line = 'this is "a test"'

%timeit [p for p in re.split("( |\\\".*?\\\"|'.*?')", line) if p.strip()]
100000 loops, best of 3: 5.17 µs per loop

%timeit re.findall(r'[^"\s]\S*|".+?"', line)
100000 loops, best of 3: 2.88 µs per loop

%timeit list(csv.reader([line], delimiter=" "))
The slowest run took 9.62 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.4 µs per loop

%timeit shlex.split(line)
10000 loops, best of 3: 50.2 µs per loop

— har777
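
The %timeit lines above need IPython. Outside it, a comparable measurement can be sketched with the standard timeit module; the dictionary of candidates and the repeat count below are arbitrary choices, and the numbers will differ from machine to machine.

```python
import csv
import re
import shlex
from timeit import timeit

line = 'this is "a test"'

# Each candidate is a zero-argument callable so timeit can call it directly.
candidates = {
    "re.findall": lambda: re.findall(r'[^"\s]\S*|".+?"', line),
    "csv.reader": lambda: list(csv.reader([line], delimiter=" ")),
    "shlex.split": lambda: shlex.split(line),
}

runs = 10000
for name, fn in candidates.items():
    seconds = timeit(fn, number=runs)
    # Convert total seconds to microseconds per call.
    print("%-12s %.2f us per call" % (name, seconds / runs * 1e6))
```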

1

Hmm, I can't seem to find the "Reply" button... anyway, this answer is based on the approach by Kate, but it correctly splits strings with substrings containing escaped quotes and also removes the start and end quotes of the substrings:

  [i.strip('"').strip("'") for i in re.split(r'(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)\'.*?(?<!\\)\')', string) if i.strip()]

This works on strings like 'This is " a \\\"test\\\"\\\'s substring"' (the insane markup is unfortunately necessary to keep Python from removing the escapes).

If the resulting escapes in the strings in the returned list are not wanted, you can use this slightly altered version of the function:

[i.strip('"').strip("'").decode('string_escape') for i in re.split(r'(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)\'.*?(?<!\\)\')', string) if i.strip()]

1

To get around the unicode issues in some Python 2 versions, I suggest:

from shlex import split as _split
split = lambda a: [b.decode('utf-8') for b in _split(a.encode('utf-8'))]

— moschlar

For Python 2.7.5 this should be: split = lambda a: [b.decode('utf-8') for b in _split(a)] otherwise you get: UnicodeDecodeError: 'ascii' codec can't decode byte ... in position ...: ordinal not in range(128)

— Peter Varo

1

As an option, try tssplit:

In [1]: from tssplit import tssplit
In [2]: tssplit('this is "a test"', quote='"', delimiter='')
Out[2]: ['this', 'is', 'a test']

— Mikhail Zakharov

0

I suggest:

test string:

s = 'abc "ad" \'fg\' "kk\'rdt\'" zzz"34"zzz "" \'\''

to also capture "" and '':

import re
re.findall(r'"[^"]*"|\'[^\']*\'|[^"\'\s]+',s)

result:

['abc', '"ad"', "'fg'", '"kk\'rdt\'"', 'zzz', '"34"', 'zzz', '""', "''"]

to ignore empty "" and '':

import re
re.findall(r'"[^"]+"|\'[^\']+\'|[^"\'\s]+',s)

result:

['abc', '"ad"', "'fg'", '"kk\'rdt\'"', 'zzz', '"34"', 'zzz']

— hussic

Could also be written as re.findall("(?:\".*?\"|'.*?'|[^\s'\"]+)", s).

— Hochl

-3

If you don't care about substrings, then a simple

>>> 'a short sized string with spaces '.split()

Performance:

>>> s = " ('a short sized string with spaces '*100).split() "
>>> t = timeit.Timer(stmt=s)
>>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
171.39 usec/pass

Or the string module

>>> from string import split as stringsplit; 
>>> stringsplit('a short sized string with spaces '*100)

Performance: the string module seems to perform better than the string methods

>>> s = "stringsplit('a short sized string with spaces '*100)"
>>> t = timeit.Timer(s, "from string import split as stringsplit")
>>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
154.88 usec/pass

Or you can use the RE engine

>>> from re import split as resplit
>>> regex = '\s+'
>>> medstring = 'a short sized string with spaces '*100
>>> resplit(regex, medstring)

Performance

>>> s = "resplit(regex, medstring)"
>>> t = timeit.Timer(s, "from re import split as resplit; regex='\s+'; medstring='a short sized string with spaces '*100")
>>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
540.21 usec/pass

For very long strings you should not load the entire string into memory; instead, either split it line by line or use an iterative loop

— Gregorio

11

You seem to have missed the whole point of the question. There are quoted sections in the string that must not be split.

— rjmunro

-3

Try this:

  def adamsplit(s):
    result = []
    inquotes = False
    for substring in s.split('"'):
      if not inquotes:
        result.extend(substring.split())
      else:
        result.append(substring)
      inquotes = not inquotes
    return result

Some test strings:

'This is "a test"' -> ['This', 'is', 'a test']
'"This is \'a test\'"' -> ["This is 'a test'"]

— pjz

Please provide the repr of a string you think it fails on.

— pjz

You think? adamsplit("This is 'a test'") → ['This', 'is', "'a", "test'"]

— Matthew Schinckel

The OP only says "within quotes" and only has an example with double quotes.

— pjz
