Modo efficiente per eliminare le file con tempi di sovrapposizione


9

Ho un set di dati lungo con colonne che rappresentano i tempi di inizio e di fine e voglio eliminare una riga se si sovrappone a un'altra e ha una priorità più alta (ad esempio 1 è la priorità più alta). I miei dati di esempio sono

library(tidyverse)
library(lubridate)
times_df <- tibble(start = as_datetime(c("2019-10-05 14:05:25", 
    "2019-10-05 17:30:20", 
    "2019-10-05 17:37:00", 
    "2019-10-06 04:43:55", 
    "2019-10-06 04:53:45")), 
    stop = as_datetime(c("2019-10-05 14:19:20",
    "2019-10-05 17:45:15", 
    "2019-10-05 17:50:45", 
    "2019-10-06 04:59:00",
    "2019-10-06 05:07:10")), priority = c(5,3,4,3,4))

Il modo in cui ho escogitato attacca il problema all'indietro, trovando le sovrapposizioni con un valore di priorità più elevato e quindi usando un anti_joinper rimuoverle dal frame di dati originale. Questo codice non funziona se ci sono tre periodi che si sovrappongono nello stesso punto temporale e sono sicuro che ci sia un modo più efficiente e funzionale per farlo.

dropOverlaps <- function(df) {
    drops <- df %>% 
        filter(stop > lead(start) | lag(stop) > start) %>% 
        mutate(group = ({seq(1, nrow(.)/2)} %>% 
        rep(each=2))) %>% 
        group_by(group) %>% 
        filter(priority == max(priority))
    anti_join(df, drops)
}

dropOverlaps(times_df)
#> Joining, by = c("start", "stop", "priority")
#> # A tibble: 3 x 3
#>   start               stop                priority
#>   <dttm>              <dttm>                 <dbl>
#> 1 2019-10-05 14:05:25 2019-10-05 14:19:20        5
#> 2 2019-10-05 17:30:20 2019-10-05 17:45:15        3
#> 3 2019-10-06 04:43:55 2019-10-06 04:59:00        3

Qualcuno può aiutarmi a ottenere lo stesso risultato ma con una funzione più pulita? Bonus se può gestire un input con tre o più periodi di tempo che si sovrappongono tutti.


2
Se vuoi puoi controllare tutte le combinazioni con combn, anche se può diventare costoso se hai molte righe. times_df %>% mutate(interval = interval(start, stop)) %>% {combn(nrow(.), 2, function(x) if (int_overlaps(.$interval[x[1]], .$interval[x[2]])) x[which.min(.$priority[x])], simplify = FALSE)} %>% unlist() %>% {slice(times_df, -.)}
Alistaire

Potresti provare a fare casino plyrangesche adatta IRanges / GRanges (usato per trovare sovrapposizioni tra i genomi) per il riordino. Penso che potresti trasformare i tuoi tempi in intervalli "genomici" convertendo i tuoi giorni + ore in un numero intero di ore ("coromosoma") e i tuoi minuti + secondi in un numero intero di secondi ("nucleotidi"). Se hai guardato l'output di pair_overlaps(e hai usato una colonna ID per rimuovere le sovrapposizioni), potresti mantenere la tua priorità e fare un bel filtro dei risultati + inner_join con la tua tabella originale. È confuso ma dovrebbe ottimizzare la facilità di codifica + efficienza.
Genes Rus,

Oppure puoi semplicemente usare IRange con i tempi dei dati convertiti in numeri. Un esempio è qui: stackoverflow.com/questions/40647177/...
GenesRus

2
Ho appena trovato data.table :: foverlaps e questa sarebbe una soluzione migliore rispetto agli strumenti genomici che ho suggerito. Non ho tempo di elaborare la logica di cosa conservare, ma dovrebbe essere risolvibile.
GenesRus,

Risposte:


4

Ecco una data.tablesoluzione che utilizza foverlapsper rilevare i record sovrapposti (come già accennato da @GenesRus). I record sovrapposti vengono assegnati ai gruppi per filtrare il record con max. priorità nel gruppo. Ho aggiunto altri due record ai dati di esempio, per dimostrare che questa procedura funziona anche per tre o più record sovrapposti:

Modifica: ho modificato e tradotto la soluzione di @ pgcudahy alla data.tablequale fornisce un codice ancora più veloce:

library(data.table)
library(lubridate)

times_df <- data.frame(
  start = as_datetime(
    c(
      "2019-10-05 14:05:25",
      "2019-10-05 17:30:20",
      "2019-10-05 17:37:00",
      "2019-10-06 04:43:55",
      "2019-10-06 04:53:45",
      "2019-10-06 04:53:46",
      "2019-10-06 04:53:47"
    )
  ),
  stop = as_datetime(
    c(
      "2019-10-05 14:19:20",
      "2019-10-05 17:45:15",
      "2019-10-05 17:50:45",
      "2019-10-06 04:59:00",
      "2019-10-06 05:07:10",
      "2019-10-06 05:07:11",
      "2019-10-06 05:07:12"
    )
  ),
  priority = c(5, 3, 4, 3, 4, 5, 6)
)

resultDT <- setDT(times_df, key="start")[!(stop >= shift(start, type="lead", fill = TRUE) & priority > shift(priority, type="lead", fill = TRUE)) &
                                         !(start <= shift(stop, type="lag", fill = FALSE) & priority > shift(priority, type="lag", fill = TRUE))]

# old approach ------------------------------------------------------------
# times_dt <- as.data.table(times_df)
# setkey(times_dt, start, stop)[, index := .I]
# overlaps_dt <- foverlaps(times_dt, times_dt, type = "any", which = TRUE)[xid != yid][, group := fifelse(xid > yid, yes = paste0(yid, "_", xid), no = paste0(xid, "_", yid))]
# overlaps_merged <- merge(times_dt, overlaps_dt, by.x = "index", by.y = "xid")[, .(delete_index = index[priority == max(priority)]), by = "group"]
# result_dt <- times_dt[!unique(overlaps_merged$delete_index)][, index := NULL]

Per ulteriori dettagli, consultare ?foverlaps- Esistono alcune funzioni più utili implementate per controllare ciò che è considerato una sovrapposizione come maxgap, minoverlapo type(qualsiasi, all'interno, inizio, fine ed uguale).


Aggiornamento: nuovo benchmark

Unit: microseconds
          expr       min         lq      mean    median        uq        max neval
          Paul 25572.550 26105.2710 30183.930 26514.342 29614.272 153810.600   100
           MKa  5100.447  5276.8350  6508.333  5401.275  5832.270  23137.879   100
      pgcudahy  3330.243  3474.4345  4284.640  3556.802  3748.203  21241.260   100
 ismirsehregal   711.084   913.3475  1144.829  1013.096  1433.427   2316.159   100

Codice benchmark:

#### library ----

library(dplyr)
library(lubridate)
library(igraph)
library(data.table)
library(microbenchmark)

#### data ----

times_df <- data.frame(
  start = as_datetime(
    c(
      "2019-10-05 14:05:25",
      "2019-10-05 17:30:20",
      "2019-10-05 17:37:00",
      "2019-10-06 04:43:55",
      "2019-10-06 04:53:45",
      "2019-10-06 04:53:46",
      "2019-10-06 04:53:47"
    )
  ),
  stop = as_datetime(
    c(
      "2019-10-05 14:19:20",
      "2019-10-05 17:45:15",
      "2019-10-05 17:50:45",
      "2019-10-06 04:59:00",
      "2019-10-06 05:07:10",
      "2019-10-06 05:07:11",
      "2019-10-06 05:07:12"
    )
  ),
  priority = c(5, 3, 4, 3, 4, 5, 6)
)

times_tib <- as_tibble(times_df)
times_dt <- as.data.table(times_df)

#### group_interval function ----

# buffer to take a form similar to: days(1), weeks(2), etc.
group_interval <- function(start, end, buffer = 0) {

  dat <- tibble(rid = 1:length(start),
                start = start,
                end = end,
                intervals = case_when(!is.na(start) & !is.na(end) ~ interval(start, end),
                                      is.na(start) ~ interval(end, end),
                                      is.na(end) ~ interval(start, start),
                                      TRUE ~ interval(NA, NA)))

  # apply buffer period to intervals
  int_start(dat$intervals) <- int_start(dat$intervals) - buffer + seconds(0.01)
  int_end(dat$intervals) <- int_end(dat$intervals) + buffer - seconds(0.01)

  df_overlap <- bind_cols(
    expand.grid(dat$rid, dat$rid), # make a 2 col table with every combination of id numbers
    expand.grid(dat$intervals, dat$intervals)) %>% # make a combination of every interval
    mutate(overlap = int_overlaps(.data$Var11, .data$Var21)) %>% # determine if intervals overlap
    rename("row" = "Var1", "col" = "Var2")

  # Find groups via graph theory See igraph package
  dat_graph <- graph_from_data_frame(filter(df_overlap, overlap) %>% select(row, col))
  groups <- components(dat_graph)$membership[df_overlap$row]

  # create a 2 column df with row (index) and group number, arrange on row number and return distinct values
  df_groups <- tibble(row = as.integer(names(groups)), group = groups) %>%
    unique()

  # returns
  left_join(select(dat, rid), df_groups, by = c("rid" = "row"))$group

}

#### benchmark ----

library(igraph)
library(data.table)
library(dplyr)
library(lubridate)
library(microbenchmark)

df_Paul <- df_MKa <- df_pgcudahy <- df_ismirsehregal <- times_df <- data.frame(
  start = as_datetime(
    c(
      "2019-10-05 14:05:25",
      "2019-10-05 17:30:20",
      "2019-10-05 17:37:00",
      "2019-10-06 04:43:55",
      "2019-10-06 04:53:45",
      "2019-10-06 04:53:46",
      "2019-10-07 06:00:00",
      "2019-10-07 06:10:00",
      "2019-10-07 06:20:00",
      "2019-10-08 06:00:00",
      "2019-10-08 06:10:00",
      "2019-10-08 06:20:00",
      "2019-10-09 03:00:00",
      "2019-10-09 03:10:00",
      "2019-10-10 03:00:00",
      "2019-10-10 03:10:00",
      "2019-10-11 05:00:00",
      "2019-10-11 05:00:00")
  ),
  stop = as_datetime(
    c(
      "2019-10-05 14:19:20",
      "2019-10-05 17:45:15",
      "2019-10-05 17:50:45",
      "2019-10-06 04:59:00",
      "2019-10-06 05:07:10",
      "2019-10-06 05:07:11",
      "2019-10-07 06:18:00",
      "2019-10-07 06:28:00",
      "2019-10-07 06:38:00",
      "2019-10-08 06:18:00",
      "2019-10-08 06:28:00",
      "2019-10-08 06:38:00",
      "2019-10-09 03:30:00",
      "2019-10-09 03:20:00",
      "2019-10-10 03:30:00",
      "2019-10-10 03:20:00",
      "2019-10-11 05:40:00",
      "2019-10-11 05:40:00")
  ),
  priority = c(5, 3, 4, 3, 4, 5, 4, 3, 4, 3, 4, 3, 1, 2, 2, 1, 3, 4)
)


benchmarks <- microbenchmark(Paul = {
  group_interval <- function(start, end, buffer = 0) {

    dat <- tibble(rid = 1:length(start),
                  start = start,
                  end = end,
                  intervals = case_when(!is.na(start) & !is.na(end) ~ interval(start, end),
                                        is.na(start) ~ interval(end, end),
                                        is.na(end) ~ interval(start, start),
                                        TRUE ~ interval(NA, NA)))

    int_start(dat$intervals) <- int_start(dat$intervals) - buffer + seconds(0.01)
    int_end(dat$intervals) <- int_end(dat$intervals) + buffer - seconds(0.01)

    df_overlap <- bind_cols(
      expand.grid(dat$rid, dat$rid), # make a 2 col table with every combination of id numbers
      expand.grid(dat$intervals, dat$intervals)) %>% # make a combination of every interval
      mutate(overlap = int_overlaps(.data$Var11, .data$Var21)) %>% # determine if intervals overlap
      rename("row" = "Var1", "col" = "Var2")

    dat_graph <- graph_from_data_frame(filter(df_overlap, overlap) %>% select(row, col))
    groups <- components(dat_graph)$membership[df_overlap$row]

    df_groups <- tibble(row = as.integer(names(groups)), group = groups) %>%
      unique()

    left_join(select(dat, rid), df_groups, by = c("rid" = "row"))$group
  }

  times_tib <- as_tibble(df_Paul)

  mutate(times_tib, group = group_interval(start, stop)) %>%
    group_by(group) %>%
    top_n(1, desc(priority)) %>%
    ungroup() %>%
    select(-group)
},
MKa = {
  df_MKa$id <- 1:nrow(df_MKa)

  # Create consolidated df which we will use to check if stop date is in between start and stop
  my_df <- bind_rows(replicate(n = nrow(df_MKa), expr = df_MKa, simplify = FALSE))
  my_df$stop_chk <- rep(df_MKa$stop, each = nrow(df_MKa))

  # Flag if stop date sits in between start and stop
  my_df$chk <- my_df$stop_chk >= my_df$start & my_df$stop_chk <= my_df$stop
  my_df$chk_id <- df_MKa[match(my_df$stop_chk, df_MKa$stop), "id"]

  # Using igrpah to cluster ids to create unique groups
  # this will identify any overlapping groups
  library(igraph)
  g <- graph.data.frame(my_df[my_df$chk == TRUE, c("id", "chk_id")])
  df_g <- data.frame(clusters(g)$membership)
  df_g$chk_id <- row.names(df_g)

  # copy the unique groups to the df
  my_df$new_id <- df_g[match(my_df$chk_id, df_g$chk_id), "clusters.g..membership"]
  my_df %>% 
    filter(chk == TRUE) %>%
    arrange(priority) %>%
    filter(!duplicated(new_id)) %>%
    select(start, stop, priority) %>%
    arrange(start)
}, pgcudahy = {
  df_pgcudahy %>%
    arrange(start) %>%
    mutate(remove1 = ifelse((stop >= lead(start, default=FALSE)) & 
                              (priority > lead(priority, default=(max(priority) + 1))), TRUE, FALSE)) %>%
    mutate(remove2 = ifelse((start <= lag(stop, default=FALSE)) & 
                              (priority > lag(priority, default=(max(priority) + 1))), TRUE, FALSE)) %>%
    filter(remove1 == FALSE & remove2 == FALSE) %>%
    select(1:3)
}, ismirsehregal = {
  setDT(df_ismirsehregal, key="start")[!(stop >= shift(start, type="lead", fill = TRUE) & priority > shift(priority, type="lead", fill = TRUE)) &
                                       !(start <= shift(stop, type="lag", fill = FALSE) & priority > shift(priority, type="lag", fill = TRUE))]
})

benchmarks

1

Ho una funzione di aiuto che raggruppa i dati sovrapposti / dati temporali usando il pacchetto igraph (può includere un buffer di sovrapposizione, cioè i terminus sono entro 1 minuto ...)

L'ho usato per raggruppare i tuoi dati in base agli intervalli in lubridate, quindi ho fatto un po 'di wrangling dei dati per ottenere solo la massima priorità dai tempi di sovrapposizione.

Non sono sicuro di quanto si ridimensionerà.

#### library ----

library(dplyr)
library(lubridate)
library(igraph)

#### data ----

times_df <- tibble(start = as_datetime(c("2019-10-05 14:05:25", 
                                         "2019-10-05 17:30:20", 
                                         "2019-10-05 17:37:00", 
                                         "2019-10-06 04:43:55", 
                                         "2019-10-06 04:53:45")), 
                   stop = as_datetime(c("2019-10-05 14:19:20",
                                        "2019-10-05 17:45:15", 
                                        "2019-10-05 17:50:45", 
                                        "2019-10-06 04:59:00",
                                        "2019-10-06 05:07:10")), priority = c(5,3,4,3,4))

#### group_interval function ----

# buffer to take a form similar to: days(1), weeks(2), etc.
group_interval <- function(start, end, buffer = 0) {

  dat <- tibble(rid = 1:length(start),
                start = start,
                end = end,
                intervals = case_when(!is.na(start) & !is.na(end) ~ interval(start, end),
                                      is.na(start) ~ interval(end, end),
                                      is.na(end) ~ interval(start, start),
                                      TRUE ~ interval(NA, NA)))

  # apply buffer period to intervals
  int_start(dat$intervals) <- int_start(dat$intervals) - buffer + seconds(0.01)
  int_end(dat$intervals) <- int_end(dat$intervals) + buffer - seconds(0.01)

  df_overlap <- bind_cols(
    expand.grid(dat$rid, dat$rid), # make a 2 col table with every combination of id numbers
    expand.grid(dat$intervals, dat$intervals)) %>% # make a combination of every interval
    mutate(overlap = int_overlaps(.data$Var11, .data$Var21)) %>% # determine if intervals overlap
    rename("row" = "Var1", "col" = "Var2")

  # Find groups via graph theory See igraph package
  dat_graph <- graph_from_data_frame(filter(df_overlap, overlap) %>% select(row, col))
  groups <- components(dat_graph)$membership[df_overlap$row]

  # create a 2 column df with row (index) and group number, arrange on row number and return distinct values
  df_groups <- tibble(row = as.integer(names(groups)), group = groups) %>%
    unique()

  # returns
  left_join(select(dat, rid), df_groups, by = c("rid" = "row"))$group

}

#### data munging ----

mutate(times_df, group = group_interval(start, stop)) %>%
  group_by(group) %>%
  top_n(1, desc(priority)) %>% # not sure why desc is needed, but top_n was giving the lower 
  ungroup() %>%
  select(-group)

Che dà:

    # A tibble: 3 x 3
      start               stop                priority
      <dttm>              <dttm>                 <dbl>
    1 2019-10-05 14:05:25 2019-10-05 14:19:20        5
    2 2019-10-05 17:30:20 2019-10-05 17:45:15        3
    3 2019-10-06 04:43:55 2019-10-06 04:59:00        3

0

Sono andato in una tana di coniglio guardando gli alberi degli intervalli (e le implementazioni R come IRanges / plyranges) ma penso che questo problema non abbia bisogno di una struttura di dati così coinvolta poiché i tempi di inizio possono essere facilmente ordinati. Ho anche ampliato il set di test come @ismirsehregal per coprire più potenziali relazioni di intervallo come un intervallo che inizia prima e termina dopo il suo vicino o quando tre intervalli si sovrappongono ma il primo e l'ultimo non si sovrappongono tra loro, o due intervalli che iniziano e fermarsi esattamente allo stesso tempo.

library(lubridate)
times_df <- data.frame(
  start = as_datetime(
    c(
      "2019-10-05 14:05:25",
      "2019-10-05 17:30:20",
      "2019-10-05 17:37:00",
      "2019-10-06 04:43:55",
      "2019-10-06 04:53:45",
      "2019-10-06 04:53:46",
      "2019-10-07 06:00:00",
      "2019-10-07 06:10:00",
      "2019-10-07 06:20:00",
      "2019-10-08 06:00:00",
      "2019-10-08 06:10:00",
      "2019-10-08 06:20:00",
      "2019-10-09 03:00:00",
      "2019-10-09 03:10:00",
      "2019-10-10 03:00:00",
      "2019-10-10 03:10:00",
      "2019-10-11 05:00:00",
      "2019-10-11 05:00:00")
  ),
  stop = as_datetime(
    c(
      "2019-10-05 14:19:20",
      "2019-10-05 17:45:15",
      "2019-10-05 17:50:45",
      "2019-10-06 04:59:00",
      "2019-10-06 05:07:10",
      "2019-10-06 05:07:11",
      "2019-10-07 06:18:00",
      "2019-10-07 06:28:00",
      "2019-10-07 06:38:00",
      "2019-10-08 06:18:00",
      "2019-10-08 06:28:00",
      "2019-10-08 06:38:00",
      "2019-10-09 03:30:00",
      "2019-10-09 03:20:00",
      "2019-10-10 03:30:00",
      "2019-10-10 03:20:00",
      "2019-10-11 05:40:00",
      "2019-10-11 05:40:00")
  ),
  priority = c(5, 3, 4, 3, 4, 5, 4, 3, 4, 3, 4, 3, 1, 2, 2, 1, 3, 4)
)

Quindi faccio due passaggi attraverso ogni intervallo per vedere se si sovrappone al suo predecessore o successore

stop >= lead(start, default=FALSE) e start <= lag(stop, default=FALSE))

Durante ogni passaggio, c'è un secondo controllo per vedere se la priorità dell'intervallo ha un valore numerico più alto rispetto al predecessore o al successore priority > lead(priority, default=(max(priority) + 1)). Durante ogni passaggio, se entrambe le condizioni sono vere, un flag "rimuovi" viene impostato su vero in una nuova colonna usando mutate. Tutte le righe con un flag di rimozione vengono quindi filtrate.

library(tidyverse)
times_df %>%
    arrange(start) %>%
    mutate(remove1 = ifelse((stop >= lead(start, default=FALSE)) & 
                            (priority > lead(priority, default=(max(priority) + 1))), 
                            TRUE, FALSE)) %>%
    mutate(remove2 = ifelse((start <= lag(stop, default=FALSE)) & 
                            (priority > lag(priority, default=(max(priority) + 1))), 
                            TRUE, FALSE)) %>%
    filter(remove1 == FALSE & remove2 == FALSE) %>%
    select(1:3)

Questo evita di controllare tutte le potenziali combinazioni di intervalli come la risposta di @ Paul (confronti 2n contro n!) E soddisfa la mia ignoranza della teoria dei grafi :)

Allo stesso modo la risposta di @ ismirsehregal ha una magia data.table che va oltre la mia comprensione.

La soluzione di @ MKa non sembra funzionare con> 2 periodi sovrapposti

Prova le soluzioni dà

#>          expr       min        lq      mean    median        uq       max
#> 1 dplyr_igraph 36.568842 41.510950 46.692147 43.362724 47.065277 241.92073
#> 2  data.table  9.126385  9.935049 11.395977 10.521032 11.446257  34.26953
#> 3       dplyr  5.031397  5.500363  6.224059  5.902589  6.373197  15.09273
#>   neval
#> 1   100
#> 2   100
#> 3   100

Da questo codice

library(igraph)
library(data.table)
library(microbenchmark)
benchmarks <- microbenchmark(dplyr_igraph = {
  group_interval <- function(start, end, buffer = 0) {

  dat <- tibble(rid = 1:length(start),
                start = start,
                end = end,
                intervals = case_when(!is.na(start) & !is.na(end) ~ interval(start, end),
                                      is.na(start) ~ interval(end, end),
                                      is.na(end) ~ interval(start, start),
                                      TRUE ~ interval(NA, NA)))

  int_start(dat$intervals) <- int_start(dat$intervals) - buffer + seconds(0.01)
  int_end(dat$intervals) <- int_end(dat$intervals) + buffer - seconds(0.01)

  df_overlap <- bind_cols(
    expand.grid(dat$rid, dat$rid), # make a 2 col table with every combination of id numbers
    expand.grid(dat$intervals, dat$intervals)) %>% # make a combination of every interval
    mutate(overlap = int_overlaps(.data$Var11, .data$Var21)) %>% # determine if intervals overlap
    rename("row" = "Var1", "col" = "Var2")

  dat_graph <- graph_from_data_frame(filter(df_overlap, overlap) %>% select(row, col))
  groups <- components(dat_graph)$membership[df_overlap$row]

  df_groups <- tibble(row = as.integer(names(groups)), group = groups) %>%
    unique()

  left_join(select(dat, rid), df_groups, by = c("rid" = "row"))$group
  }

  times_tib <- as_tibble(times_df)

  mutate(times_tib, group = group_interval(start, stop)) %>%
    group_by(group) %>%
    top_n(1, desc(priority)) %>%
    ungroup() %>%
    select(-group)
}, data.table = {
  times_dt <- as.data.table(times_df)
  setkey(times_dt, start, stop)[, index := .I]
  overlaps_dt <- foverlaps(times_dt, times_dt, type = "any", which = TRUE)[xid != yid][, group := fifelse(xid > yid, yes = paste0(yid, "_", xid), no = paste0(xid, "_", yid))]
  overlaps_merged <- merge(times_dt, overlaps_dt, by.x = "index", by.y = "xid")[, .(delete_index = index[priority == max(priority)]), by = "group"]
  result_dt <- times_dt[!unique(overlaps_merged$delete_index)][, index := NULL]
}, dplyr = {
times_df %>%
    arrange(start) %>%
    mutate(remove1 = ifelse((stop >= lead(start, default=FALSE)) & 
                            (priority > lead(priority, default=(max(priority) + 1))), TRUE, FALSE)) %>%
    mutate(remove2 = ifelse((start <= lag(stop, default=FALSE)) & 
                            (priority > lag(priority, default=(max(priority) + 1))), TRUE, FALSE)) %>%
    filter(remove1 == FALSE & remove2 == FALSE) %>%
    select(1:3)
})
summary(benchmarks)

Grazie per il feedback: non avevo familiarità con la tibblestruttura e sembra che stia pull()causando il problema. Per dataframe(), dovrebbe funzionare così com'è. Ho appena aggiornato la risposta.
MKa,

Bel approccio, ho preso la tua logica, l'ho modificata un po 'e tradotta in modo da data.tablerendere le cose ancora più veloci (controlla il mio nuovo benchmark).
Ismirsehregal,

0

Inoltre, utilizzando igraphper identificare eventuali gruppi sovrapposti, è possibile provare:

library(tidyverse)
library(lubridate)
times_df <- data.frame(
  start = as_datetime(
    c(
      "2019-10-05 14:05:25",
      "2019-10-05 17:30:20",
      "2019-10-05 17:37:00",
      "2019-10-06 04:43:55",
      "2019-10-06 04:53:45",
      "2019-10-06 04:53:46",
      "2019-10-07 06:00:00",
      "2019-10-07 06:10:00",
      "2019-10-07 06:20:00",
      "2019-10-08 06:00:00",
      "2019-10-08 06:10:00",
      "2019-10-08 06:20:00",
      "2019-10-09 03:00:00",
      "2019-10-09 03:10:00",
      "2019-10-10 03:00:00",
      "2019-10-10 03:10:00",
      "2019-10-11 05:00:00",
      "2019-10-11 05:00:00")
  ),
  stop = as_datetime(
    c(
      "2019-10-05 14:19:20",
      "2019-10-05 17:45:15",
      "2019-10-05 17:50:45",
      "2019-10-06 04:59:00",
      "2019-10-06 05:07:10",
      "2019-10-06 05:07:11",
      "2019-10-07 06:18:00",
      "2019-10-07 06:28:00",
      "2019-10-07 06:38:00",
      "2019-10-08 06:18:00",
      "2019-10-08 06:28:00",
      "2019-10-08 06:38:00",
      "2019-10-09 03:30:00",
      "2019-10-09 03:20:00",
      "2019-10-10 03:30:00",
      "2019-10-10 03:20:00",
      "2019-10-11 05:40:00",
      "2019-10-11 05:40:00")
  ),
  priority = c(5, 3, 4, 3, 4, 5, 4, 3, 4, 3, 4, 3, 1, 2, 2, 1, 3, 4)
)
times_df$id <- 1:nrow(times_df)


# Create consolidated df which we will use to check if stop date is in between start and stop
my_df <- bind_rows(replicate(n = nrow(times_df), expr = times_df, simplify = FALSE))
my_df$stop_chk <- rep(times_df$stop, each = nrow(times_df))

# Flag if stop date sits in between start and stop
my_df$chk <- my_df$stop_chk >= my_df$start & my_df$stop_chk <= my_df$stop
my_df$chk_id <- times_df[match(my_df$stop_chk, times_df$stop), "id"]

# Using igrpah to cluster ids to create unique groups
# this will identify any overlapping groups
library(igraph)
g <- graph.data.frame(my_df[my_df$chk == TRUE, c("id", "chk_id")])
df_g <- data.frame(clusters(g)$membership)
df_g$chk_id <- row.names(df_g)

# copy the unique groups to the df
my_df$new_id <- df_g[match(my_df$chk_id, df_g$chk_id), "clusters.g..membership"]
my_df %>% 
  filter(chk == TRUE) %>%
  arrange(priority) %>%
  filter(!duplicated(new_id)) %>%
  select(start, stop, priority) %>%
  arrange(start)
Utilizzando il nostro sito, riconosci di aver letto e compreso le nostre Informativa sui cookie e Informativa sulla privacy.
Licensed under cc by-sa 3.0 with attribution required.