将列中以逗号分隔的字符串拆分为单独的行

r string split r-faq

24805 观看

5回复

3907 作者的声誉

我有一个数据框,如下所示:

data.frame(director = c("Aaron Blaise,Bob Walker", "Akira Kurosawa", 
                        "Alan J. Pakula", "Alan Parker", "Alejandro Amenabar", "Alejandro Gonzalez Inarritu", 
                        "Alejandro Gonzalez Inarritu,Benicio Del Toro", "Alejandro González Iñárritu", 
                        "Alex Proyas", "Alexander Hall", "Alfonso Cuaron", "Alfred Hitchcock", 
                        "Anatole Litvak", "Andrew Adamson,Marilyn Fox", "Andrew Dominik", 
                        "Andrew Stanton", "Andrew Stanton,Lee Unkrich", "Angelina Jolie,John Stevenson", 
                        "Anne Fontaine", "Anthony Harvey"), AB = c('A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'B', 'A', 'B', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'A'))

如您所见,director列中的某些条目是由逗号分隔的多个名称。我想将这些条目拆分为单独的行,同时保持另一列的值。例如,上面数据框中的第一行应该分成两行,director列中各有一个名称,列中有“A” AB

作者: RoyalTS 的来源 发布者: 2012 年 12 月 8 日

回应 5


47

36719 作者的声誉

命名原始data.frame v,我们有:

> s <- strsplit(as.character(v$director), ',')
> data.frame(director=unlist(s), AB=rep(v$AB, sapply(s, FUN=length)))
                      director AB
1                 Aaron Blaise  A
2                   Bob Walker  A
3               Akira Kurosawa  B
4               Alan J. Pakula  A
5                  Alan Parker  A
6           Alejandro Amenabar  B
7  Alejandro Gonzalez Inarritu  B
8  Alejandro Gonzalez Inarritu  B
9             Benicio Del Toro  B
10 Alejandro González Iñárritu  A
11                 Alex Proyas  B
12              Alexander Hall  A
13              Alfonso Cuaron  B
14            Alfred Hitchcock  A
15              Anatole Litvak  A
16              Andrew Adamson  B
17                 Marilyn Fox  B
18              Andrew Dominik  B
19              Andrew Stanton  B
20              Andrew Stanton  B
21                 Lee Unkrich  B
22              Angelina Jolie  B
23              John Stevenson  B
24               Anne Fontaine  B
25              Anthony Harvey  A

注意使用rep构建新的AB列。这里,sapply返回每个原始行中的名称数。

作者: Matthew Lundberg 发布者: 2012 年 12 月 8 日

29

157781 作者的声誉

迟到了,但另一个通用的替代方案是使用cSplit我的“splitstackshape”包中有一个direction参数。设置此项以"long"获取您指定的结果:

library(splitstackshape)
head(cSplit(mydf, "director", ",", direction = "long"))
#              director AB
# 1:       Aaron Blaise  A
# 2:         Bob Walker  A
# 3:     Akira Kurosawa  B
# 4:     Alan J. Pakula  A
# 5:        Alan Parker  A
# 6: Alejandro Amenabar  B
作者: A5C1D2H2I1M1N2O1R2T1 发布者: 2015 年 2 月 3 日

77

59459 作者的声誉

几种选择:

1)两种方式data.table

library(data.table)
# method 1 (preferred)
setDT(v)[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE))), by = AB
         ][!is.na(director)]
# method 2
setDT(v)[, strsplit(as.character(director), ",", fixed=TRUE), by = .(AB, director)
         ][,.(director = V1, AB)]

2)a dplyr/ tidyr组合:或者,您也可以使用dplyr/ tidyr组合:

library(dplyr)
library(tidyr)
v %>% 
  mutate(director = strsplit(as.character(director), ",")) %>%
  unnest(director)

3)tidyr仅限:使用tidyr 0.5.0(及以后),您也可以使用separate_rows

separate_rows(v, director, sep = ",")

您可以使用该convert = TRUE参数自动将数字转换为数字列。

4)基础R:

# if 'director' is a character-column:
stack(setNames(strsplit(df$director,','), df$AB))

# if 'director' is a factor-column:
stack(setNames(strsplit(as.character(df$director),','), df$AB))
作者: Jaap 发布者: 2015 年 7 月 20 日

63

25405 作者的声誉

决定

这个老问题经常被用作欺骗目标(标记为r-faq)。截至今天,已经回答了三次,提供了6种不同的方法,但缺乏基准作为指导哪种方法最快1

基准测试解决方案包括

总共8种不同的方法使用该microbenchmark包对6种不同大小的数据帧进行了基准测试(参见下面的代码)。

OP给出的样本数据仅包含20行。为了创建更大的数据帧,这20行简单地重复1次,10次,100次,1000次,10000次和100000次,这给出了最多200万行的问题大小。

基准测试结果

在此输入图像描述

基准测试结果表明,对于足够大的数据帧,所有data.table方法都比任何其他方法更快。对于具有超过约5000行的数据帧,Jaap的data.table方法2和变体DT3是最快的,比最慢的方法更快。

值得注意的是,两种tidyverse方法和splistackshape解决方案的时间非常相似,因此难以对图表中的曲线进行分类。它们是所有数据帧大小中最慢的基准测试方法。

对于较小的数据帧,Matt的基本R解决方案和data.table方法4似乎比其他方法具有更少的开销。

director <- 
  c("Aaron Blaise,Bob Walker", "Akira Kurosawa", "Alan J. Pakula", 
    "Alan Parker", "Alejandro Amenabar", "Alejandro Gonzalez Inarritu", 
    "Alejandro Gonzalez Inarritu,Benicio Del Toro", "Alejandro González Iñárritu", 
    "Alex Proyas", "Alexander Hall", "Alfonso Cuaron", "Alfred Hitchcock", 
    "Anatole Litvak", "Andrew Adamson,Marilyn Fox", "Andrew Dominik", 
    "Andrew Stanton", "Andrew Stanton,Lee Unkrich", "Angelina Jolie,John Stevenson", 
    "Anne Fontaine", "Anthony Harvey")
AB <- c("A", "B", "A", "A", "B", "B", "B", "A", "B", "A", "B", "A", 
        "A", "B", "B", "B", "B", "B", "B", "A")

library(data.table)
library(magrittr)

为问题规模的基准运行定义函数 n

run_mb <- function(n) {
  # compute number of benchmark runs depending on problem size `n`
  mb_times <- scales::squish(10000L / n , c(3L, 100L)) 
  cat(n, " ", mb_times, "\n")
  # create data
  DF <- data.frame(director = rep(director, n), AB = rep(AB, n))
  DT <- as.data.table(DF)
  # start benchmarks
  microbenchmark::microbenchmark(
    matt_mod = {
      s <- strsplit(as.character(DF$director), ',')
      data.frame(director=unlist(s), AB=rep(DF$AB, lengths(s)))},
    jaap_DT1 = {
      DT[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE))), by = AB
         ][!is.na(director)]},
    jaap_DT2 = {
      DT[, strsplit(as.character(director), ",", fixed=TRUE), 
         by = .(AB, director)][,.(director = V1, AB)]},
    jaap_dplyr = {
      DF %>% 
        dplyr::mutate(director = strsplit(as.character(director), ",")) %>%
        tidyr::unnest(director)},
    jaap_tidyr = {
      tidyr::separate_rows(DF, director, sep = ",")},
    cSplit = {
      splitstackshape::cSplit(DF, "director", ",", direction = "long")},
    DT3 = {
      DT[, strsplit(as.character(director), ",", fixed=TRUE),
         by = .(AB, director)][, director := NULL][
           , setnames(.SD, "V1", "director")]},
    DT4 = {
      DT[, .(director = unlist(strsplit(as.character(director), ",", fixed = TRUE))), 
         by = .(AB)]},
    times = mb_times
  )
}

针对不同的问题规模运行基准

# define vector of problem sizes
n_rep <- 10L^(0:5)
# run benchmark for different problem sizes
mb <- lapply(n_rep, run_mb)

准备绘图数据

mbl <- rbindlist(mb, idcol = "N")
mbl[, n_row := NROW(director) * n_rep[N]]
mba <- mbl[, .(median_time = median(time), N = .N), by = .(n_row, expr)]
mba[, expr := forcats::fct_reorder(expr, -median_time)]

创建图表

library(ggplot2)
ggplot(mba, aes(n_row, median_time*1e-6, group = expr, colour = expr)) + 
  geom_point() + geom_smooth(se = FALSE) + 
  scale_x_log10(breaks = NROW(director) * n_rep) + scale_y_log10() + 
  xlab("number of rows") + ylab("median of execution time [ms]") +
  ggtitle("microbenchmark results") + theme_bw()

会话信息和包版本(摘录)

devtools::session_info()
#Session info
# version  R version 3.3.2 (2016-10-31)
# system   x86_64, mingw32
#Packages
# data.table      * 1.10.4  2017-02-01 CRAN (R 3.3.2)
# dplyr             0.5.0   2016-06-24 CRAN (R 3.3.1)
# forcats           0.2.0   2017-01-23 CRAN (R 3.3.2)
# ggplot2         * 2.2.1   2016-12-30 CRAN (R 3.3.2)
# magrittr        * 1.5     2014-11-22 CRAN (R 3.3.0)
# microbenchmark    1.4-2.1 2015-11-25 CRAN (R 3.3.3)
# scales            0.4.1   2016-11-09 CRAN (R 3.3.2)
# splitstackshape   1.4.2   2014-10-23 CRAN (R 3.3.3)
# tidyr             0.6.1   2017-01-10 CRAN (R 3.3.2)

1 我的好奇心被这个充满激情的评论所 震惊!秩序更快!tidyverse的回答问题,其被关闭,因为这问题的一个副本。

作者: Uwe 发布者: 2017 年 4 月 15 日

0

11 作者的声誉

devtools::install_github("yikeshu0611/onetree")

library(onetree)

dd=spread_byonecolumn(data=mydata,bycolumn="director",joint=",")

head(dd)
            director AB
1       Aaron Blaise  A
2         Bob Walker  A
3     Akira Kurosawa  B
4     Alan J. Pakula  A
5        Alan Parker  A
6 Alejandro Amenabar  B
作者: zhang jing 发布者: 2019 年 7 月 26 日
32x32