Select the first and last row from grouped data

137

Question

Using dplyr, how do I select the top and bottom observations/rows of grouped data in one statement?

Data and example

Given a data frame

df <- data.frame(id=c(1,1,1,2,2,2,3,3,3), 
                 stopId=c("a","b","c","a","b","c","a","b","c"), 
                 stopSequence=c(1,2,3,3,1,4,3,1,2))

I can get the top and bottom observations from each group using slice, but with two separate statements:

firstStop <- df %>%
  group_by(id) %>%
  arrange(stopSequence) %>%
  slice(1) %>%
  ungroup

lastStop <- df %>%
  group_by(id) %>%
  arrange(stopSequence) %>%
  slice(n()) %>%
  ungroup

Can I combine these two statements into one that selects both the top and bottom observations?

r dplyr

— tospig

See also How to select the first and last row within a grouping variable in a data frame?

— Henrik

232

There is probably a faster way:

df %>%
  group_by(id) %>%
  arrange(stopSequence) %>%
  filter(row_number()==1 | row_number()==n())

— jeremycg

66

rownumber() %in% c(1, n()) would avoid the need to run the vector scan twice

— MichaelChirico

13

@MichaelChirico I suspect you omitted an _? i.e. filter(row_number() %in% c(1, n()))

— Eric Fail
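
A minimal sketch of the %in% variant from the comments above, assuming the question's df and that dplyr is attached. Sorting by id as well as stopSequence is an addition here, made only so the result prints group by group.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 stopId = c("a", "b", "c", "a", "b", "c", "a", "b", "c"),
                 stopSequence = c(1, 2, 3, 3, 1, 4, 3, 1, 2))

res <- df %>%
  arrange(id, stopSequence) %>%              # sort within each id
  group_by(id) %>%
  filter(row_number() %in% c(1, n())) %>%    # keep first and last row per group
  ungroup()

res
```

Because filter() only tests each row once, a group with a single row is returned once rather than twice.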

107

Just for completeness: you can pass slice a vector of indices:

df %>% arrange(stopSequence) %>% group_by(id) %>% slice(c(1,n()))

which gives

  id stopId stopSequence
1  1      a            1
2  1      c            3
3  2      b            1
4  2      c            4
5  3      b            1
6  3      a            3

— Franco

It might even be faster than filter. I haven't tested it, but see here

— Tjebo

1

@Tjebo Unlike filter, slice can return the same row multiple times, e.g. mtcars[1, ] %>% slice(c(1, n())). In that sense the choice between them depends on what you want returned. I'd expect the timings to be close unless n is very large (where slice might be favored), but I haven't tested that either.

— Frank
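
To make the difference concrete, here is a small sketch on a hypothetical one-row data frame (one_row is made up for illustration, it is not from the question). The unique() call at the end is one possible way to avoid the duplicate, not something proposed in the thread.

```r
library(dplyr)

# Hypothetical group with a single observation
one_row <- data.frame(id = 1, stopId = "a", stopSequence = 1)

# slice() is handed the index 1 twice, so the row comes back twice
via_slice <- one_row %>% group_by(id) %>% slice(c(1, n()))

# filter() tests each row once, so the row comes back once
via_filter <- one_row %>% group_by(id) %>% filter(row_number() %in% c(1, n()))

# De-duplicating the indices makes slice() behave like filter() here
via_unique <- one_row %>% group_by(id) %>% slice(unique(c(1, n())))

c(slice = nrow(via_slice), filter = nrow(via_filter), unique = nrow(via_unique))
```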

15

Not dplyr, but it's much more direct using data.table:

library(data.table)
setDT(df)
df[ df[order(id, stopSequence), .I[c(1L,.N)], by=id]$V1 ]
#    id stopId stopSequence
# 1:  1      a            1
# 2:  1      c            3
# 3:  2      b            1
# 4:  2      c            4
# 5:  3      b            1
# 6:  3      a            3

More detailed explanation:

# 1) get row numbers of first/last observations from each group
#    * basically, we sort the table by id/stopSequence, then,
#      grouping by id, name the row numbers of the first/last
#      observations for each id; since this operation produces
#      a data.table
#    * .I is data.table shorthand for the row number
#    * here, to be maximally explicit, I've named the variable V1
#      as row_num to give other readers of my code a clearer
#      understanding of what operation is producing what variable
first_last = df[order(id, stopSequence), .(row_num = .I[c(1L,.N)]), by=id]
idx = first_last$row_num

# 2) extract rows by number
df[idx]

Be sure to check out the Getting Started wiki to get the data.table basics covered

— MichaelChirico

1

Or df[ df[order(stopSequence), .I[c(1,.N)], keyby=id]$V1 ]. Seeing id appear twice is weird to me.

— Frank

You can set keys in the setDT call, so an order call is not needed here.

— Artem Klevtsov

1

@ArtemKlevtsov - you may not always want to set keys, though.

— SymbolixAU

2

Or df[order(stopSequence), .SD[c(1L,.N)], by = id]. See here

— JWilliman

@JWilliman that won't necessarily be exactly the same, since it won't re-order on id. I think df[order(stopSequence), .SD[c(1L, .N)], keyby = id] should do the trick (with the minor difference from the solution above that the result will be keyed)

— MichaelChirico
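
A small sketch of the by versus keyby point, assuming data.table is installed. The table dt is hypothetical, built so that id 2 holds the smallest stopSequence (on the question's data both calls happen to return the groups in the same order).

```r
library(data.table)

# Hypothetical data: after sorting by stopSequence, id 2 appears before id 1
dt <- data.table(id = c(1, 1, 2, 2),
                 stopId = c("a", "b", "a", "b"),
                 stopSequence = c(2, 3, 1, 4))

# by: groups come out in order of first appearance in the sorted rows
by_res <- dt[order(stopSequence), .SD[c(1L, .N)], by = id]

# keyby: groups come out sorted by id, and the result is keyed on id
keyby_res <- dt[order(stopSequence), .SD[c(1L, .N)], keyby = id]

by_res$id
keyby_res$id
key(keyby_res)
```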

8

Something like:

library(dplyr)

df <- data.frame(id=c(1,1,1,2,2,2,3,3,3),
                 stopId=c("a","b","c","a","b","c","a","b","c"),
                 stopSequence=c(1,2,3,3,1,4,3,1,2))

first_last <- function(x) {
  bind_rows(slice(x, 1), slice(x, n()))
}

df %>%
  group_by(id) %>%
  arrange(stopSequence) %>%
  do(first_last(.)) %>%
  ungroup

## Source: local data frame [6 x 3]
## 
##   id stopId stopSequence
## 1  1      a            1
## 2  1      c            3
## 3  2      b            1
## 4  2      c            4
## 5  3      b            1
## 6  3      a            3

With do you can pretty much perform any number of operations on the group, but @jeremycg's answer is far more appropriate for just this task.

— hrbrmstr
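
In current dplyr releases do() is marked as superseded. A rough equivalent of the same idea using group_modify() might look like the sketch below. This is a substitution of my own rather than part of the answer, and the second argument of first_last (the group key, unused here) is only there because group_modify() passes it.

```r
library(dplyr)

df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 stopId = c("a", "b", "c", "a", "b", "c", "a", "b", "c"),
                 stopSequence = c(1, 2, 3, 3, 1, 4, 3, 1, 2))

# Takes one group's rows (x) and its key (unused); returns first and last row
first_last <- function(x, key) {
  bind_rows(slice(x, 1), slice(x, n()))
}

res <- df %>%
  arrange(stopSequence) %>%
  group_by(id) %>%
  group_modify(first_last) %>%   # apply first_last to each group
  ungroup()

res
```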

1

I hadn't considered writing a function - certainly a good way of doing something more complex.

— domenica

1

This seems overcomplicated compared to just using slice, like df %>% arrange(stopSequence) %>% group_by(id) %>% slice(c(1,n()))

— Frank

4

Not disagreeing (and I pointed to jeremycg's as a better answer in the post), but having a do example here might help others when slice won't work (i.e. more complex operations on a group). And you should post your comment as an answer (it's the best one).

— hrbrmstr

6

I know the question specified dplyr. But since others have already posted solutions using other packages, I decided to have a go with other packages too:

Base package:

df <- df[with(df, order(id, stopSequence, stopId)), ]
merge(df[!duplicated(df$id), ], 
      df[!duplicated(df$id, fromLast = TRUE), ], 
      all = TRUE)

data.table:

df <-  setDT(df)
df[order(id, stopSequence)][, .SD[c(1,.N)], by=id]

sqldf:

library(sqldf)
min <- sqldf("SELECT id, stopId, min(stopSequence) AS StopSequence
      FROM df GROUP BY id 
      ORDER BY id, StopSequence, stopId")
max <- sqldf("SELECT id, stopId, max(stopSequence) AS StopSequence
      FROM df GROUP BY id 
      ORDER BY id, StopSequence, stopId")
sqldf("SELECT * FROM min
      UNION
      SELECT * FROM max")

In one query:

sqldf("SELECT * 
        FROM (SELECT id, stopId, min(stopSequence) AS StopSequence
              FROM df GROUP BY id 
              ORDER BY id, StopSequence, stopId)
        UNION
        SELECT *
        FROM (SELECT id, stopId, max(stopSequence) AS StopSequence
              FROM df GROUP BY id 
              ORDER BY id, StopSequence, stopId)")

Output:

  id stopId StopSequence
1  1      a            1
2  1      c            3
3  2      b            1
4  2      c            4
5  3      a            3
6  3      b            1

— mpalanco

3

Using which.min and which.max:

library(dplyr, warn.conflicts = FALSE)
df %>% 
  group_by(id) %>% 
  slice(c(which.min(stopSequence), which.max(stopSequence)))

#> # A tibble: 6 x 3
#> # Groups:   id [3]
#>      id stopId stopSequence
#>   <dbl> <fct>         <dbl>
#> 1     1 a                 1
#> 2     1 c                 3
#> 3     2 b                 1
#> 4     2 c                 4
#> 5     3 b                 1
#> 6     3 a                 3

benchmark

It is also much faster than the currently accepted answer because we find the min and max value by group, instead of sorting the whole stopSequence column.

# create a 100k times longer data frame
df2 <- bind_rows(replicate(1e5, df, simplify = FALSE))
bench::mark(
  mm =df2 %>% 
    group_by(id) %>% 
    slice(c(which.min(stopSequence), which.max(stopSequence))),
  jeremy = df2 %>%
    group_by(id) %>%
    arrange(stopSequence) %>%
    filter(row_number()==1 | row_number()==n()))
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 mm           22.6ms     27ms     34.9     14.2MB     21.3
#> 2 jeremy      254.3ms    273ms      3.66    58.4MB     11.0

— Moody_Mudskipper
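
As a quick sanity check, the sketch below compares the which.min/which.max approach with the sort-then-slice approach on the question's df. The names by_sort and by_which are made up for illustration. Keep in mind that which.min and which.max return only the first position of the minimum and maximum, so groups with tied extremes or with a single row may need extra care.

```r
library(dplyr)

df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 stopId = c("a", "b", "c", "a", "b", "c", "a", "b", "c"),
                 stopSequence = c(1, 2, 3, 3, 1, 4, 3, 1, 2))

# Sort first, then take the first and last row of each group
by_sort <- df %>%
  arrange(stopSequence) %>%
  group_by(id) %>%
  slice(c(1, n())) %>%
  ungroup()

# No sorting: pick the positions of the min and max within each group
by_which <- df %>%
  group_by(id) %>%
  slice(c(which.min(stopSequence), which.max(stopSequence))) %>%
  ungroup()

all.equal(as.data.frame(by_sort), as.data.frame(by_which))
```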

2

Using data.table:

# convert to data.table
setDT(df) 
# order, group, filter
df[order(stopSequence)][, .SD[c(1, .N)], by = id]

   id stopId stopSequence
1:  1      a            1
2:  1      c            3
3:  2      b            1
4:  2      c            4
5:  3      b            1
6:  3      a            3

— sindri_baldur

1

Another approach with lapply and a dplyr statement. We can apply an arbitrary number of summary functions in the same statement:

# first() and last() follow the current row order, so arrange() beforehand if needed
lapply(c(first, last), 
       function(f) df %>% group_by(id) %>% summarise(across(everything(), f))) %>% 
bind_rows()

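For example, you might also be interested in the per-group maximum of each column (a column-wise maximum, which is not necessarily the row with the max stopSequence) and do:

# max() works on numeric and character columns, but not on unordered factors
lapply(c(first, last, max), 
       function(f) df %>% group_by(id) %>% summarise(across(everything(), f))) %>%
bind_rows()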

— Sahir Moosvi

0

A different base R alternative would be to first order by id and stopSequence, split the resulting indices by id, select only the first and last index for each id, and subset the data frame using those indices.

df[sapply(with(df, split(order(id, stopSequence), id)), function(x) 
                   c(x[1], x[length(x)])), ]


#  id stopId stopSequence
#1  1      a            1
#3  1      c            3
#5  2      b            1
#6  2      c            4
#8  3      b            1
#7  3      a            3

Or similarly using by

df[unlist(with(df, by(order(id, stopSequence), id, function(x) 
                   c(x[1], x[length(x)])))), ]

— Ronak Shah
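
The steps described in this answer can be spelled out one at a time, as in the sketch below. The intermediate names (ord, groups, idx) are made up for illustration. One deliberate difference: the sorted indices are split by df$id[ord] rather than by id, which should also work when the rows are not already stored in id order. That is my assumption about the general case, not something the answer states.

```r
df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 stopId = c("a", "b", "c", "a", "b", "c", "a", "b", "c"),
                 stopSequence = c(1, 2, 3, 3, 1, 4, 3, 1, 2))

# 1) row indices sorted by id, then stopSequence
ord <- with(df, order(id, stopSequence))

# 2) split the sorted indices by the id each index belongs to
groups <- split(ord, df$id[ord])

# 3) keep the first and last index of every group
idx <- unlist(lapply(groups, function(x) c(x[1], x[length(x)])),
              use.names = FALSE)

# 4) subset the data frame with those indices
res <- df[idx, ]
res
```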