Un algoritmo di classificazione delle somiglianze migliore per stringhe di lunghezza variabile

152

Sto cercando un algoritmo di somiglianza delle stringhe che produca risultati migliori su stringhe di lunghezza variabile rispetto a quelle che vengono solitamente suggerite (distanza di levenshtein, soundex, ecc.).

Per esempio,

Data stringa A: "Robert",

Quindi stringa B: "Amy Robertson"

sarebbe una partita migliore di

Stringa C: "Richard"

Inoltre, preferibilmente, questo algoritmo dovrebbe essere agnostico (funziona anche in lingue diverse dall'inglese).

— Marzagão
fonte

simile in .net: stackoverflow.com/questions/83777/...

— Ciro Santilli郝海东冠状病六四事件法轮功

Scopri anche: Coefficiente dei dadi

— avid_useR

155

Simon White di Catalysoft ha scritto un articolo su un algoritmo molto intelligente che confronta le coppie di caratteri adiacenti che funziona davvero bene per i miei scopi:

http://www.catalysoft.com/articles/StrikeAMatch.html

Simon ha una versione Java dell'algoritmo e di seguito ne ho scritto una versione PL / Ruby (presa dalla versione ruby semplice fatta nel relativo commento alla voce del forum di Mark Wong-VanHaren) in modo da poterla usare nelle mie query PostgreSQL:

CREATE FUNCTION string_similarity(str1 varchar, str2 varchar)
RETURNS float8 AS '

str1.downcase! 
pairs1 = (0..str1.length-2).collect {|i| str1[i,2]}.reject {
  |pair| pair.include? " "}
str2.downcase! 
pairs2 = (0..str2.length-2).collect {|i| str2[i,2]}.reject {
  |pair| pair.include? " "}
union = pairs1.size + pairs2.size 
intersection = 0 
pairs1.each do |p1| 
  0.upto(pairs2.size-1) do |i| 
    if p1 == pairs2[i] 
      intersection += 1 
      pairs2.slice!(i) 
      break 
    end 
  end 
end 
(2.0 * intersection) / union

' LANGUAGE 'plruby';

Funziona come un fascino!

— Marzagão
fonte

32

Hai trovato la risposta e hai scritto tutto in 4 minuti? Degno di nota!

— Matt J

28

Ho preparato la mia risposta dopo alcune ricerche e implementazioni. L'ho messo qui a beneficio di chiunque altro cerchi in SO una risposta pratica usando un algoritmo alternativo perché la maggior parte delle risposte nelle domande correlate sembrano ruotare attorno a levenshtein o soundex.

— Marzagao,

18

Proprio quello che stavo cercando. Vuoi sposarmi?

— BlackTea,

6

@JasonSundram ha ragione: in effetti, questo è il noto coefficiente di dadi sui bigram a livello di personaggio, come scrive l'autore nell'addendum (in fondo alla pagina).

— Fred Foo,

4

Ciò restituisce un "punteggio" di 1 (corrispondenza del 100%) quando si confrontano le stringhe con una singola lettera isolata come differenza, come questo esempio: string_similarity("vitamin B", "vitamin C") #=> 1esiste un modo semplice per prevenire questo tipo di comportamento?

— MrYoshiji,

77

la risposta di marzagao è ottima. L'ho convertito in C # quindi ho pensato di pubblicarlo qui:

Link Pastebin

/// <summary>
/// This class implements string comparison algorithm
/// based on character pair similarity
/// Source: http://www.catalysoft.com/articles/StrikeAMatch.html
/// </summary>
public class SimilarityTool
{
    /// <summary>
    /// Compares the two strings based on letter pair matches
    /// </summary>
    /// <param name="str1"></param>
    /// <param name="str2"></param>
    /// <returns>The percentage match from 0.0 to 1.0 where 1.0 is 100%</returns>
    public double CompareStrings(string str1, string str2)
    {
        List<string> pairs1 = WordLetterPairs(str1.ToUpper());
        List<string> pairs2 = WordLetterPairs(str2.ToUpper());

        int intersection = 0;
        int union = pairs1.Count + pairs2.Count;

        for (int i = 0; i < pairs1.Count; i++)
        {
            for (int j = 0; j < pairs2.Count; j++)
            {
                if (pairs1[i] == pairs2[j])
                {
                    intersection++;
                    pairs2.RemoveAt(j);//Must remove the match to prevent "GGGG" from appearing to match "GG" with 100% success

                    break;
                }
            }
        }

        return (2.0 * intersection) / union;
    }

    /// <summary>
    /// Gets all letter pairs for each
    /// individual word in the string
    /// </summary>
    /// <param name="str"></param>
    /// <returns></returns>
    private List<string> WordLetterPairs(string str)
    {
        List<string> AllPairs = new List<string>();

        // Tokenize the string and put the tokens/words into an array
        string[] Words = Regex.Split(str, @"\s");

        // For each word
        for (int w = 0; w < Words.Length; w++)
        {
            if (!string.IsNullOrEmpty(Words[w]))
            {
                // Find the pairs of characters
                String[] PairsInWord = LetterPairs(Words[w]);

                for (int p = 0; p < PairsInWord.Length; p++)
                {
                    AllPairs.Add(PairsInWord[p]);
                }
            }
        }

        return AllPairs;
    }

    /// <summary>
    /// Generates an array containing every 
    /// two consecutive letters in the input string
    /// </summary>
    /// <param name="str"></param>
    /// <returns></returns>
    private string[] LetterPairs(string str)
    {
        int numPairs = str.Length - 1;

        string[] pairs = new string[numPairs];

        for (int i = 0; i < numPairs; i++)
        {
            pairs[i] = str.Substring(i, 2);
        }

        return pairs;
    }
}

— Michael La Voie
fonte

2

+100 se potessi, mi hai appena salvato una dura giornata di lavoro amico! Saluti.

— vvohra87,

1

Molto bella! L'unico suggerimento che ho, potrebbe trasformarlo in un'estensione.

— Levitikon,

+1! Fantastico che funzioni, con lievi modifiche anche per Java. E sembra restituire risposte migliori di Levenshtein.

— Xyene,

1

Ho aggiunto una versione convertendola in un metodo di estensione di seguito. Grazie per la versione originale e la fantastica traduzione.

— Frank Rundatz,

@Michael La Voie Grazie, è molto carino! Anche se un piccolo problema con (2.0 * intersection) / union- Ottengo Double.NaN quando confronto due stringhe vuote.

— Vojtěch Dohnal,

41

Ecco un'altra versione della risposta di Marzagao , questa scritta in Python:

def get_bigrams(string):
    """
    Take a string and return a list of bigrams.
    """
    s = string.lower()
    return [s[i:i+2] for i in list(range(len(s) - 1))]

def string_similarity(str1, str2):
    """
    Perform bigram comparison between two strings
    and return a percentage match in decimal form.
    """
    pairs1 = get_bigrams(str1)
    pairs2 = get_bigrams(str2)
    union  = len(pairs1) + len(pairs2)
    hit_count = 0
    for x in pairs1:
        for y in pairs2:
            if x == y:
                hit_count += 1
                break
    return (2.0 * hit_count) / union

if __name__ == "__main__":
    """
    Run a test using the example taken from:
    http://www.catalysoft.com/articles/StrikeAMatch.html
    """
    w1 = 'Healed'
    words = ['Heard', 'Healthy', 'Help', 'Herded', 'Sealed', 'Sold']

    for w2 in words:
        print('Healed --- ' + w2)
        print(string_similarity(w1, w2))
        print()

— John Rutledge
fonte

2

C'è un piccolo bug in string_similarity quando ci sono ngram duplicati in una parola, risultando in un punteggio> 1 per stringhe identiche. L'aggiunta di una "pausa" dopo che "hit_count + = 1" lo ha corretto.

— jbaiter,

1

@jbaiter: buona cattura. L'ho modificato per riflettere le tue modifiche.

— John Rutledge,

3

Nell'articolo di Simon White, dice "Nota che ogni volta che viene trovata una corrispondenza, quella coppia di caratteri viene rimossa dal secondo elenco di array per impedirci di abbinare più volte la stessa coppia di caratteri. Altrimenti," GGGGG "segnerebbe una corrispondenza perfetta contro "GG".) "Modificherei questa affermazione per dire che darebbe una corrispondenza più alta che perfetta. Senza tenerne conto, sembra anche che l'algoritmo non sia transitivo (somiglianza (x, y) = / = somiglianza (y, x)). Aggiungendo coppie2.remove (y) dopo la riga hit_count + = 1 risolve il problema.

— NinjaMeTimbers

17

Ecco la mia implementazione PHP dell'algoritmo StrikeAMatch suggerito, di Simon White. i vantaggi (come si dice nel link) sono:

Un vero riflesso della somiglianza lessicale - le stringhe con piccole differenze dovrebbero essere riconosciute come simili. In particolare, una significativa sovrapposizione della sottostringa dovrebbe indicare un alto livello di somiglianza tra le stringhe.
Una solidità ai cambiamenti nell'ordine delle parole - due stringhe che contengono le stesse parole, ma in un ordine diverso, dovrebbero essere riconosciute come simili. D'altra parte, se una stringa è solo un anagramma casuale dei caratteri contenuti nell'altra, allora dovrebbe (di solito) essere riconosciuta come diversa.
Indipendenza dalla lingua : l'algoritmo dovrebbe funzionare non solo in inglese, ma in molte lingue diverse.

<?php
/**
 * LetterPairSimilarity algorithm implementation in PHP
 * @author Igal Alkon
 * @link http://www.catalysoft.com/articles/StrikeAMatch.html
 */
class LetterPairSimilarity
{
    /**
     * @param $str
     * @return mixed
     */
    private function wordLetterPairs($str)
    {
        $allPairs = array();

        // Tokenize the string and put the tokens/words into an array

        $words = explode(' ', $str);

        // For each word
        for ($w = 0; $w < count($words); $w++)
        {
            // Find the pairs of characters
            $pairsInWord = $this->letterPairs($words[$w]);

            for ($p = 0; $p < count($pairsInWord); $p++)
            {
                $allPairs[] = $pairsInWord[$p];
            }
        }

        return $allPairs;
    }

    /**
     * @param $str
     * @return array
     */
    private function letterPairs($str)
    {
        $numPairs = mb_strlen($str)-1;
        $pairs = array();

        for ($i = 0; $i < $numPairs; $i++)
        {
            $pairs[$i] = mb_substr($str,$i,2);
        }

        return $pairs;
    }

    /**
     * @param $str1
     * @param $str2
     * @return float
     */
    public function compareStrings($str1, $str2)
    {
        $pairs1 = $this->wordLetterPairs(strtoupper($str1));
        $pairs2 = $this->wordLetterPairs(strtoupper($str2));

        $intersection = 0;

        $union = count($pairs1) + count($pairs2);

        for ($i=0; $i < count($pairs1); $i++)
        {
            $pair1 = $pairs1[$i];

            $pairs2 = array_values($pairs2);
            for($j = 0; $j < count($pairs2); $j++)
            {
                $pair2 = $pairs2[$j];
                if ($pair1 === $pair2)
                {
                    $intersection++;
                    unset($pairs2[$j]);
                    break;
                }
            }
        }

        return (2.0*$intersection)/$union;
    }
}

— Igal Alkon
fonte

17

Una versione più breve della risposta di John Rutledge :

def get_bigrams(string):
    '''
    Takes a string and returns a list of bigrams
    '''
    s = string.lower()
    return {s[i:i+2] for i in xrange(len(s) - 1)}

def string_similarity(str1, str2):
    '''
    Perform bigram comparison between two strings
    and return a percentage match in decimal form
    '''
    pairs1 = get_bigrams(str1)
    pairs2 = get_bigrams(str2)
    return (2.0 * len(pairs1 & pairs2)) / (len(pairs1) + len(pairs2))

— quantistico
fonte

Anche la intersectionvariabile è uno spreco di linea.

— Chibueze Opata,

14

Questa discussione è stata davvero utile, grazie. Ho convertito l'algoritmo in VBA per l'uso con Excel e ho scritto alcune versioni di una funzione del foglio di lavoro, una per un semplice confronto di una coppia di stringhe, l'altra per confrontare una stringa con un intervallo / matrice di stringhe. La versione strSimLookup restituisce l'ultima corrispondenza migliore come stringa, indice di matrice o metrica di somiglianza.

Questa implementazione produce gli stessi risultati elencati nell'esempio di Amazon sul sito Web di Simon White con poche eccezioni su partite a basso punteggio; non sono sicuro di dove si insinui la differenza, potrebbe essere la funzione Split di VBA, ma non ho studiato perché funziona bene per i miei scopi.

'Implements functions to rate how similar two strings are on
'a scale of 0.0 (completely dissimilar) to 1.0 (exactly similar)
'Source:   http://www.catalysoft.com/articles/StrikeAMatch.html
'Author: Bob Chatham, bob.chatham at gmail.com
'9/12/2010

Option Explicit

Public Function stringSimilarity(str1 As String, str2 As String) As Variant
'Simple version of the algorithm that computes the similiarity metric
'between two strings.
'NOTE: This verision is not efficient to use if you're comparing one string
'with a range of other values as it will needlessly calculate the pairs for the
'first string over an over again; use the array-optimized version for this case.

    Dim sPairs1 As Collection
    Dim sPairs2 As Collection

    Set sPairs1 = New Collection
    Set sPairs2 = New Collection

    WordLetterPairs str1, sPairs1
    WordLetterPairs str2, sPairs2

    stringSimilarity = SimilarityMetric(sPairs1, sPairs2)

    Set sPairs1 = Nothing
    Set sPairs2 = Nothing

End Function

Public Function strSimA(str1 As Variant, rRng As Range) As Variant
'Return an array of string similarity indexes for str1 vs every string in input range rRng
    Dim sPairs1 As Collection
    Dim sPairs2 As Collection
    Dim arrOut As Variant
    Dim l As Long, j As Long

    Set sPairs1 = New Collection

    WordLetterPairs CStr(str1), sPairs1

    l = rRng.Count
    ReDim arrOut(1 To l)
    For j = 1 To l
        Set sPairs2 = New Collection
        WordLetterPairs CStr(rRng(j)), sPairs2
        arrOut(j) = SimilarityMetric(sPairs1, sPairs2)
        Set sPairs2 = Nothing
    Next j

    strSimA = Application.Transpose(arrOut)

End Function

Public Function strSimLookup(str1 As Variant, rRng As Range, Optional returnType) As Variant
'Return either the best match or the index of the best match
'depending on returnTYype parameter) between str1 and strings in rRng)
' returnType = 0 or omitted: returns the best matching string
' returnType = 1           : returns the index of the best matching string
' returnType = 2           : returns the similarity metric

    Dim sPairs1 As Collection
    Dim sPairs2 As Collection
    Dim metric, bestMetric As Double
    Dim i, iBest As Long
    Const RETURN_STRING As Integer = 0
    Const RETURN_INDEX As Integer = 1
    Const RETURN_METRIC As Integer = 2

    If IsMissing(returnType) Then returnType = RETURN_STRING

    Set sPairs1 = New Collection

    WordLetterPairs CStr(str1), sPairs1

    bestMetric = -1
    iBest = -1

    For i = 1 To rRng.Count
        Set sPairs2 = New Collection
        WordLetterPairs CStr(rRng(i)), sPairs2
        metric = SimilarityMetric(sPairs1, sPairs2)
        If metric > bestMetric Then
            bestMetric = metric
            iBest = i
        End If
        Set sPairs2 = Nothing
    Next i

    If iBest = -1 Then
        strSimLookup = CVErr(xlErrValue)
        Exit Function
    End If

    Select Case returnType
    Case RETURN_STRING
        strSimLookup = CStr(rRng(iBest))
    Case RETURN_INDEX
        strSimLookup = iBest
    Case Else
        strSimLookup = bestMetric
    End Select

End Function

Public Function strSim(str1 As String, str2 As String) As Variant
    Dim ilen, iLen1, ilen2 As Integer

    iLen1 = Len(str1)
    ilen2 = Len(str2)

    If iLen1 >= ilen2 Then ilen = ilen2 Else ilen = iLen1

    strSim = stringSimilarity(Left(str1, ilen), Left(str2, ilen))

End Function

Sub WordLetterPairs(str As String, pairColl As Collection)
'Tokenize str into words, then add all letter pairs to pairColl

    Dim Words() As String
    Dim word, nPairs, pair As Integer

    Words = Split(str)

    If UBound(Words) < 0 Then
        Set pairColl = Nothing
        Exit Sub
    End If

    For word = 0 To UBound(Words)
        nPairs = Len(Words(word)) - 1
        If nPairs > 0 Then
            For pair = 1 To nPairs
                pairColl.Add Mid(Words(word), pair, 2)
            Next pair
        End If
    Next word

End Sub

Private Function SimilarityMetric(sPairs1 As Collection, sPairs2 As Collection) As Variant
'Helper function to calculate similarity metric given two collections of letter pairs.
'This function is designed to allow the pair collections to be set up separately as needed.
'NOTE: sPairs2 collection will be altered as pairs are removed; copy the collection
'if this is not the desired behavior.
'Also assumes that collections will be deallocated somewhere else

    Dim Intersect As Double
    Dim Union As Double
    Dim i, j As Long

    If sPairs1.Count = 0 Or sPairs2.Count = 0 Then
        SimilarityMetric = CVErr(xlErrNA)
        Exit Function
    End If

    Union = sPairs1.Count + sPairs2.Count
    Intersect = 0

    For i = 1 To sPairs1.Count
        For j = 1 To sPairs2.Count
            If StrComp(sPairs1(i), sPairs2(j)) = 0 Then
                Intersect = Intersect + 1
                sPairs2.Remove j
                Exit For
            End If
        Next j
    Next i

    SimilarityMetric = (2 * Intersect) / Union

End Function

— bchatham
fonte

@bchatham Sembra estremamente utile, ma sono nuovo di VBA e un po 'sfidato dal codice. È possibile pubblicare un file Excel che utilizza il tuo contributo? Per i miei scopi spero di usarlo per abbinare nomi simili di una singola colonna in Excel con circa 1000 voci (estratto qui: dropbox.com/s/ofdliln9zxgi882/first-names-excerpt.xlsx ). Userò quindi le corrispondenze come sinonimi in una ricerca di persone. (vedi anche softwarerecs.stackexchange.com/questions/38227/… )

— bjornte

12

Mi dispiace, la risposta non è stata inventata dall'autore. Questo è un algoritmo ben noto che era stato presentato per la prima volta da Digital Equipment Corporation e spesso viene chiamato shingling.

http://www.hpl.hp.com/techreports/Compaq-DEC/SRC-TN-1997-015.pdf

10

Ho tradotto l'algoritmo di Simon White in PL / pgSQL. Questo è il mio contributo

<!-- language: lang-sql -->

create or replace function spt1.letterpairs(in p_str varchar) 
returns varchar  as 
$$
declare

    v_numpairs integer := length(p_str)-1;
    v_pairs varchar[];

begin

    for i in 1 .. v_numpairs loop
        v_pairs[i] := substr(p_str, i, 2);
    end loop;

    return v_pairs;

end;
$$ language 'plpgsql';

--===================================================================

create or replace function spt1.wordletterpairs(in p_str varchar) 
returns varchar as
$$
declare
    v_allpairs varchar[];
    v_words varchar[];
    v_pairsinword varchar[];
begin
    v_words := regexp_split_to_array(p_str, '[[:space:]]');

    for i in 1 .. array_length(v_words, 1) loop
        v_pairsinword := spt1.letterpairs(v_words[i]);

        if v_pairsinword is not null then
            for j in 1 .. array_length(v_pairsinword, 1) loop
                v_allpairs := v_allpairs || v_pairsinword[j];
            end loop;
        end if;

    end loop;


    return v_allpairs;
end;
$$ language 'plpgsql';

--===================================================================

create or replace function spt1.arrayintersect(ANYARRAY, ANYARRAY)
returns anyarray as 
$$
    select array(select unnest($1) intersect select unnest($2))
$$ language 'sql';

--===================================================================

create or replace function spt1.comparestrings(in p_str1 varchar, in p_str2 varchar)
returns float as
$$
declare
    v_pairs1 varchar[];
    v_pairs2 varchar[];
    v_intersection integer;
    v_union integer;
begin
    v_pairs1 := wordletterpairs(upper(p_str1));
    v_pairs2 := wordletterpairs(upper(p_str2));
    v_union := array_length(v_pairs1, 1) + array_length(v_pairs2, 1); 

    v_intersection := array_length(arrayintersect(v_pairs1, v_pairs2), 1);

    return (2.0 * v_intersection / v_union);
end;
$$ language 'plpgsql';

— fabiolimace
fonte

Funziona sul mio PostgreSQL che non ha supporto plruby! Grazie!

— hostnik,

Grazie! Come lo faresti in Oracle SQL?

— olovholm,

Questa porta non è corretta. La stringa esatta non restituisce 1.

— Brandon Wigfield

9

Le metriche di similarità delle stringhe contengono una panoramica di molte metriche diverse utilizzate nel confronto delle stringhe (anche Wikipedia ha una panoramica). Gran parte di queste metriche è implementata in una libreria simmetrica .

Ancora un altro esempio di metrica, non incluso nella panoramica data è ad esempio la distanza di compressione (tentando di approssimare la complessità del Kolmogorov ), che può essere usata per testi un po 'più lunghi di quello che hai presentato.

Potresti anche considerare di esaminare un argomento molto più ampio di Natural Language Processing . Questi pacchetti R possono iniziare rapidamente (o almeno dare alcune idee).

E un'ultima modifica: cerca le altre domande su questo argomento in SO, ce ne sono alcune correlate.

— Anonimo
fonte

9

Una versione PHP più veloce dell'algoritmo:

/**
 *
 * @param $str
 * @return mixed
 */
private static function wordLetterPairs ($str)
{
    $allPairs = array();

    // Tokenize the string and put the tokens/words into an array

    $words = explode(' ', $str);

    // For each word
    for ($w = 0; $w < count($words); $w ++) {
        // Find the pairs of characters
        $pairsInWord = self::letterPairs($words[$w]);

        for ($p = 0; $p < count($pairsInWord); $p ++) {
            $allPairs[$pairsInWord[$p]] = $pairsInWord[$p];
        }
    }

    return array_values($allPairs);
}

/**
 *
 * @param $str
 * @return array
 */
private static function letterPairs ($str)
{
    $numPairs = mb_strlen($str) - 1;
    $pairs = array();

    for ($i = 0; $i < $numPairs; $i ++) {
        $pairs[$i] = mb_substr($str, $i, 2);
    }

    return $pairs;
}

/**
 *
 * @param $str1
 * @param $str2
 * @return float
 */
public static function compareStrings ($str1, $str2)
{
    $pairs1 = self::wordLetterPairs(mb_strtolower($str1));
    $pairs2 = self::wordLetterPairs(mb_strtolower($str2));


    $union = count($pairs1) + count($pairs2);

    $intersection = count(array_intersect($pairs1, $pairs2));

    return (2.0 * $intersection) / $union;
}

Per i dati che ho avuto (circa 2300 confronti) ho avuto un tempo di esecuzione di 0,58 secondi con la soluzione Igal Alkon rispetto a 0,35 secondi con il mio.

— Andy
fonte

9

Una versione nella bellissima Scala:

  def pairDistance(s1: String, s2: String): Double = {

    def strToPairs(s: String, acc: List[String]): List[String] = {
      if (s.size < 2) acc
      else strToPairs(s.drop(1),
        if (s.take(2).contains(" ")) acc else acc ::: List(s.take(2)))
    }

    val lst1 = strToPairs(s1.toUpperCase, List())
    val lst2 = strToPairs(s2.toUpperCase, List())

    (2.0 * lst2.intersect(lst1).size) / (lst1.size + lst2.size)

  }

— ignobilis
fonte

6

Ecco la versione R:

get_bigrams <- function(str)
{
  lstr = tolower(str)
  bigramlst = list()
  for(i in 1:(nchar(str)-1))
  {
    bigramlst[[i]] = substr(str, i, i+1)
  }
  return(bigramlst)
}

str_similarity <- function(str1, str2)
{
   pairs1 = get_bigrams(str1)
   pairs2 = get_bigrams(str2)
   unionlen  = length(pairs1) + length(pairs2)
   hit_count = 0
   for(x in 1:length(pairs1)){
        for(y in 1:length(pairs2)){
            if (pairs1[[x]] == pairs2[[y]])
                hit_count = hit_count + 1
        }
   }
   return ((2.0 * hit_count) / unionlen)
}

— Rainer
fonte

Questo algoritmo è migliore ma piuttosto lento per dati di grandi dimensioni. Voglio dire se si devono confrontare 10000 parole con 15000 altre parole, è troppo lento. Possiamo aumentare le sue prestazioni in termini di velocità ??

— indra_patil,

6

Pubblicando la risposta di Marzagao in C99, ispirata a questi algoritmi

double dice_match(const char *string1, const char *string2) {

    //check fast cases
    if (((string1 != NULL) && (string1[0] == '\0')) || 
        ((string2 != NULL) && (string2[0] == '\0'))) {
        return 0;
    }
    if (string1 == string2) {
        return 1;
    }

    size_t strlen1 = strlen(string1);
    size_t strlen2 = strlen(string2);
    if (strlen1 < 2 || strlen2 < 2) {
        return 0;
    }

    size_t length1 = strlen1 - 1;
    size_t length2 = strlen2 - 1;

    double matches = 0;
    int i = 0, j = 0;

    //get bigrams and compare
    while (i < length1 && j < length2) {
        char a[3] = {string1[i], string1[i + 1], '\0'};
        char b[3] = {string2[j], string2[j + 1], '\0'};
        int cmp = strcmpi(a, b);
        if (cmp == 0) {
            matches += 2;
        }
        i++;
        j++;
    }

    return matches / (length1 + length2);
}

Alcuni test basati sull'articolo originale :

#include <stdio.h>

void article_test1() {
    char *string1 = "FRANCE";
    char *string2 = "FRENCH";
    printf("====%s====\n", __func__);
    printf("%2.f%% == 40%%\n", dice_match(string1, string2) * 100);
}


void article_test2() {
    printf("====%s====\n", __func__);
    char *string = "Healed";
    char *ss[] = {"Heard", "Healthy", "Help",
                  "Herded", "Sealed", "Sold"};
    int correct[] = {44, 55, 25, 40, 80, 0};
    for (int i = 0; i < 6; ++i) {
        printf("%2.f%% == %d%%\n", dice_match(string, ss[i]) * 100, correct[i]);
    }
}

void multicase_test() {
    char *string1 = "FRaNcE";
    char *string2 = "fREnCh";
    printf("====%s====\n", __func__);
    printf("%2.f%% == 40%%\n", dice_match(string1, string2) * 100);

}

void gg_test() {
    char *string1 = "GG";
    char *string2 = "GGGGG";
    printf("====%s====\n", __func__);
    printf("%2.f%% != 100%%\n", dice_match(string1, string2) * 100);
}


int main() {
    article_test1();
    article_test2();
    multicase_test();
    gg_test();

    return 0;
}

— script
fonte

5

Basandomi sulla fantastica versione C # di Michael La Voie, secondo la richiesta di renderlo un metodo di estensione, ecco cosa mi è venuto in mente. Il vantaggio principale di farlo in questo modo è che puoi ordinare un elenco generico in base alla corrispondenza percentuale. Ad esempio, considera di avere un campo stringa denominato "Città" nel tuo oggetto. Un utente cerca "Chester" e desideri restituire i risultati in ordine decrescente di corrispondenza. Ad esempio, vuoi che partite letterali di Chester vengano presentate prima di Rochester. Per fare ciò, aggiungi due nuove proprietà al tuo oggetto:

    public string SearchText { get; set; }
    public double PercentMatch
    {
        get
        {
            return City.ToUpper().PercentMatchTo(this.SearchText.ToUpper());
        }
    }

Quindi su ogni oggetto, imposta SearchText su ciò che l'utente ha cercato. Quindi puoi ordinarlo facilmente con qualcosa del tipo:

    zipcodes = zipcodes.OrderByDescending(x => x.PercentMatch);

Ecco la leggera modifica per renderlo un metodo di estensione:

    /// <summary>
    /// This class implements string comparison algorithm
    /// based on character pair similarity
    /// Source: http://www.catalysoft.com/articles/StrikeAMatch.html
    /// </summary>
    public static double PercentMatchTo(this string str1, string str2)
    {
        List<string> pairs1 = WordLetterPairs(str1.ToUpper());
        List<string> pairs2 = WordLetterPairs(str2.ToUpper());

        int intersection = 0;
        int union = pairs1.Count + pairs2.Count;

        for (int i = 0; i < pairs1.Count; i++)
        {
            for (int j = 0; j < pairs2.Count; j++)
            {
                if (pairs1[i] == pairs2[j])
                {
                    intersection++;
                    pairs2.RemoveAt(j);//Must remove the match to prevent "GGGG" from appearing to match "GG" with 100% success

                    break;
                }
            }
        }

        return (2.0 * intersection) / union;
    }

    /// <summary>
    /// Gets all letter pairs for each
    /// individual word in the string
    /// </summary>
    /// <param name="str"></param>
    /// <returns></returns>
    private static List<string> WordLetterPairs(string str)
    {
        List<string> AllPairs = new List<string>();

        // Tokenize the string and put the tokens/words into an array
        string[] Words = Regex.Split(str, @"\s");

        // For each word
        for (int w = 0; w < Words.Length; w++)
        {
            if (!string.IsNullOrEmpty(Words[w]))
            {
                // Find the pairs of characters
                String[] PairsInWord = LetterPairs(Words[w]);

                for (int p = 0; p < PairsInWord.Length; p++)
                {
                    AllPairs.Add(PairsInWord[p]);
                }
            }
        }

        return AllPairs;
    }

    /// <summary>
    /// Generates an array containing every 
    /// two consecutive letters in the input string
    /// </summary>
    /// <param name="str"></param>
    /// <returns></returns>
    private static  string[] LetterPairs(string str)
    {
        int numPairs = str.Length - 1;

        string[] pairs = new string[numPairs];

        for (int i = 0; i < numPairs; i++)
        {
            pairs[i] = str.Substring(i, 2);
        }

        return pairs;
    }

— Frank Rundatz
fonte

Penso che sarebbe meglio usare un bool isCaseSensitive con un valore predefinito false - anche se è vero l'implementazione è molto più pulita

— Giordania

5

La mia implementazione JavaScript accetta una stringa o una matrice di stringhe e un piano facoltativo (il piano predefinito è 0,5). Se le passi una stringa, restituirà vero o falso a seconda che il punteggio di somiglianza della stringa sia maggiore o uguale al pavimento. Se gli passi una matrice di stringhe, restituirà una matrice di quelle stringhe il cui punteggio di somiglianza è maggiore o uguale al piano, ordinati per punteggio.

Esempi:

'Healed'.fuzzy('Sealed');      // returns true
'Healed'.fuzzy('Help');        // returns false
'Healed'.fuzzy('Help', 0.25);  // returns true

'Healed'.fuzzy(['Sold', 'Herded', 'Heard', 'Help', 'Sealed', 'Healthy']);
// returns ["Sealed", "Healthy"]

'Healed'.fuzzy(['Sold', 'Herded', 'Heard', 'Help', 'Sealed', 'Healthy'], 0);
// returns ["Sealed", "Healthy", "Heard", "Herded", "Help", "Sold"]

Ecco qui:

(function(){
  var default_floor = 0.5;

  function pairs(str){
    var pairs = []
      , length = str.length - 1
      , pair;
    str = str.toLowerCase();
    for(var i = 0; i < length; i++){
      pair = str.substr(i, 2);
      if(!/\s/.test(pair)){
        pairs.push(pair);
      }
    }
    return pairs;
  }

  function similarity(pairs1, pairs2){
    var union = pairs1.length + pairs2.length
      , hits = 0;

    for(var i = 0; i < pairs1.length; i++){
      for(var j = 0; j < pairs2.length; j++){
        if(pairs1[i] == pairs2[j]){
          pairs2.splice(j--, 1);
          hits++;
          break;
        }
      }
    }
    return 2*hits/union || 0;
  }

  String.prototype.fuzzy = function(strings, floor){
    var str1 = this
      , pairs1 = pairs(this);

    floor = typeof floor == 'number' ? floor : default_floor;

    if(typeof(strings) == 'string'){
      return str1.length > 1 && strings.length > 1 && similarity(pairs1, pairs(strings)) >= floor || str1.toLowerCase() == strings.toLowerCase();
    }else if(strings instanceof Array){
      var scores = {};

      strings.map(function(str2){
        scores[str2] = str1.length > 1 ? similarity(pairs1, pairs(str2)) : 1*(str1.toLowerCase() == str2.toLowerCase());
      });

      return strings.filter(function(str){
        return scores[str] >= floor;
      }).sort(function(a, b){
        return scores[b] - scores[a];
      });
    }
  };
})();

— gruppler
fonte

1

Bug / Typo! for(var j = 0; j < pairs1.length; j++){dovrebbe esserefor(var j = 0; j < pairs2.length; j++){

— Searle

3

L'algoritmo del coefficiente dei dadi (la risposta di Simon White / marzagao) è implementato in Ruby nel metodo pair_distance_similar nella gemma amatch

https://github.com/flori/amatch

Questa gemma contiene anche implementazioni di una serie di algoritmi approssimativi di confronto e confronto delle stringhe: Levenshtein modifica distanza, Venditori modifica distanza, la distanza di Hamming, la più lunga lunghezza di sottosequenza comune, la più lunga lunghezza di sottostringa comune, la metrica della distanza della coppia, la metrica Jaro-Winkler .

— s01ipsist
fonte

2

Una versione di Haskell: sentiti libero di suggerire le modifiche perché non ho fatto molto Haskell.

import Data.Char
import Data.List

-- Convert a string into words, then get the pairs of words from that phrase
wordLetterPairs :: String -> [String]
wordLetterPairs s1 = concat $ map pairs $ words s1

-- Converts a String into a list of letter pairs.
pairs :: String -> [String]
pairs [] = []
pairs (x:[]) = []
pairs (x:ys) = [x, head ys]:(pairs ys)

-- Calculates the match rating for two strings
matchRating :: String -> String -> Double
matchRating s1 s2 = (numberOfMatches * 2) / totalLength
  where pairsS1 = wordLetterPairs $ map toLower s1
        pairsS2 = wordLetterPairs $ map toLower s2
        numberOfMatches = fromIntegral $ length $ pairsS1 `intersect` pairsS2
        totalLength = fromIntegral $ length pairsS1 + length pairsS2

— matthewpalmer
fonte

2

Clojure:

(require '[clojure.set :refer [intersection]])

(defn bigrams [s]
  (->> (split s #"\s+")
       (mapcat #(partition 2 1 %))
       (set)))

(defn string-similarity [a b]
  (let [a-pairs (bigrams a)
        b-pairs (bigrams b)
        total-count (+ (count a-pairs) (count b-pairs))
        match-count (count (intersection a-pairs b-pairs))
        similarity (/ (* 2 match-count) total-count)]
    similarity))

— Shaun Lebron
fonte

1

Che dire della distanza di Levenshtein, divisa per la lunghezza della prima stringa (o in alternativa divisa la mia lunghezza min / max / media di entrambe le stringhe)? Finora ha funzionato per me.

— tehvan
fonte

Tuttavia, per citare un altro post su questo argomento, ciò che restituisce è spesso "irregolare". Classifica l'eco come abbastanza simile al "cane".

— Xyene,

@Nox: la parte "divisa per la lunghezza della prima stringa" di questa risposta è significativa. Inoltre, questo si comporta meglio dell'algoritmo Dice tanto lodato per errori di battitura ed errori di trasposizione, e persino per coniugazioni comuni (si consideri il confronto tra "nuoto" e "nuoto", per esempio).

— Logan Pickup,

1

Ehi ragazzi, l'ho provato in javascript, ma sono nuovo, qualcuno conosce modi più veloci per farlo?

function get_bigrams(string) {
    // Takes a string and returns a list of bigrams
    var s = string.toLowerCase();
    var v = new Array(s.length-1);
    for (i = 0; i< v.length; i++){
        v[i] =s.slice(i,i+2);
    }
    return v;
}

function string_similarity(str1, str2){
    /*
    Perform bigram comparison between two strings
    and return a percentage match in decimal form
    */
    var pairs1 = get_bigrams(str1);
    var pairs2 = get_bigrams(str2);
    var union = pairs1.length + pairs2.length;
    var hit_count = 0;
    for (x in pairs1){
        for (y in pairs2){
            if (pairs1[x] == pairs2[y]){
                hit_count++;
            }
        }
    }
    return ((2.0 * hit_count) / union);
}


var w1 = 'Healed';
var word =['Heard','Healthy','Help','Herded','Sealed','Sold']
for (w2 in word){
    console.log('Healed --- ' + word[w2])
    console.log(string_similarity(w1,word[w2]));
}

— user779420
fonte

Questa implementazione non è corretta. La funzione bigram si interrompe per l'inserimento della lunghezza 0. Il metodo string_similarity non si interrompe correttamente all'interno del secondo loop, il che può portare a contare coppie più volte, portando a un valore di ritorno che supera il 100%. E hai anche dimenticato di dichiarare xe y, e non dovresti passare da un for..in..loop all'altro usando un loop (usa for(..;..;..)invece).

— Rob W,

1

Ecco un'altra versione di Somiglianza basata sull'indice Sørensen – Dice (la risposta di Marzagao), questa scritta in C ++ 11:

/*
 * Similarity based in Sørensen–Dice index.
 *
 * Returns the Similarity between _str1 and _str2.
 */
double similarity_sorensen_dice(const std::string& _str1, const std::string& _str2) {
    // Base case: if some string is empty.
    if (_str1.empty() || _str2.empty()) {
        return 1.0;
    }

    auto str1 = upper_string(_str1);
    auto str2 = upper_string(_str2);

    // Base case: if the strings are equals.
    if (str1 == str2) {
        return 0.0;
    }

    // Base case: if some string does not have bigrams.
    if (str1.size() < 2 || str2.size() < 2) {
        return 1.0;
    }

    // Extract bigrams from str1
    auto num_pairs1 = str1.size() - 1;
    std::unordered_set<std::string> str1_bigrams;
    str1_bigrams.reserve(num_pairs1);
    for (unsigned i = 0; i < num_pairs1; ++i) {
        str1_bigrams.insert(str1.substr(i, 2));
    }

    // Extract bigrams from str2
    auto num_pairs2 = str2.size() - 1;
    std::unordered_set<std::string> str2_bigrams;
    str2_bigrams.reserve(num_pairs2);
    for (unsigned int i = 0; i < num_pairs2; ++i) {
        str2_bigrams.insert(str2.substr(i, 2));
    }

    // Find the intersection between the two sets.
    int intersection = 0;
    if (str1_bigrams.size() < str2_bigrams.size()) {
        const auto it_e = str2_bigrams.end();
        for (const auto& bigram : str1_bigrams) {
            intersection += str2_bigrams.find(bigram) != it_e;
        }
    } else {
        const auto it_e = str1_bigrams.end();
        for (const auto& bigram : str2_bigrams) {
            intersection += str1_bigrams.find(bigram) != it_e;
        }
    }

    // Returns similarity coefficient.
    return (2.0 * intersection) / (num_pairs1 + num_pairs2);
}

— chema989
fonte

1

Stavo cercando l'implementazione in puro rubino dell'algoritmo indicato dalla risposta di @ marzagao. Sfortunatamente, il link indicato da @marzagao è rotto. Nella risposta di @s01ipsist, ha indicato la gemma rubino in un punto in cui l'implementazione non è in puro rubino. Quindi ho cercato un po 'e ho trovato gemma fuzzy_match che ha un'implementazione di rubini puri (sebbene questo gemma lo usi amatch) qui . Spero che questo possa aiutare qualcuno come me.

— Engr. Hasanuzzaman Sumon
fonte