How to apply summary function on two different types of data

r

50 观看

2回复

92 作者的声誉

I have data frame with multiple variable , some variables those contains only 0's and 1's and other columns contains all the possible values.
How to summarize df columns contains only 0's & 1's with "sts_1=sum(sts_1*0.25,na.rm=T)" and other columns with "non_sts_3=mean(non_sts_3,na.rm = T)," with out specifying column name.

df <- data.frame(year=c("2014","2014","2015","2015","2015"),
                 month_=c("Jan","Jan","Jan","Jan","Feb"),
                 sts_1=c(0,1,1,1,0),
                 sts_2=c(1,0,0,1,NA),
                 non_sts_1=c(0,3,7,31,10),
                 non_sts_2=c(1,4,NA,12,6),
                 non_sts_3 = c(12,14,18,1,9))

We can do by dplyr by entering column names manually with below code

df<-group_by(df,year, month_)

df_aggregation<-summarise(df,
                          non_sts_1=mean(non_sts_1,na.rm = T),
                          non_sts_2=mean(non_sts_2,na.rm = T),
                          non_sts_3=mean(non_sts_3,na.rm = T),
                          sts_1=sum(sts_1*0.25,na.rm=T),
                          sts_2=sum(sts_2*0.25,na.rm=T))

Thanks in advance...

作者: sasir 的来源 发布者: 2017 年 9 月 15 日

回应 2


4

472207 作者的声誉

We can use summarise_all and then remove the extra columns

df %>% 
  group_by(year, month_) %>% 
  summarise_all(funs(mean(., na.rm = TRUE), sum(.*0.25, na.rm = TRUE))) %>%
  select(matches("month_|non_sts.*mean|\\bsts.*sum"))
# A tibble: 3 x 7
# Groups:   year [2]
#    year month_ non_sts_1_mean non_sts_2_mean non_sts_3_mean sts_1_sum sts_2_sum
#    <fctr> <fctr>          <dbl>          <dbl>          <dbl>     <dbl>     <dbl>
#1   2014    Jan            1.5            2.5           13.0      0.25      0.25
#2   2015    Feb           10.0            6.0            9.0      0.00      0.00
#3   2015    Jan           19.0           12.0            9.5      0.50      0.25

Another approach if we have multiple sets of functions to be applied on different set of columns, will be to approach by applying the functions on different blocks of columns separately and then join

library(tidyverse)
flist <- list(function(x) mean(x, na.rm = TRUE), function(x) sum(x*0.25, na.rm = TRUE))
nm1 <- c("^non_sts", "^sts")
map2(nm1, flist, ~df %>%
                    group_by(year, month_) %>% 
                    summarise_at(vars(matches(.x)), funs(.y))) %>% 
                    reduce(inner_join, by = c('year', 'month_'))
# A tibble: 3 x 7
# Groups:   year [?]
#     year month_ non_sts_1 non_sts_2 non_sts_3 sts_1 sts_2
#   <fctr> <fctr>     <dbl>     <dbl>     <dbl> <dbl> <dbl>
#1   2014    Jan       1.5       2.5      13.0  0.25  0.25
#2   2015    Feb      10.0       6.0       9.0  0.00  0.00
#3   2015    Jan      19.0      12.0       9.5  0.50  0.25

NOTE: This approach is flexible to use for any set of columns

If we were to modify the approach for 0:1 case

l1 <- df %>% 
         summarise_at(3:7, funs(all(. %in% c(0, 1, NA)))) %>% 
         unlist
nm1 <- split(names(df)[-(1:2)], l1)

and then apply as above by removing the matches

作者: akrun 发布者: 2017 年 9 月 15 日

5

34091 作者的声誉

决定

@akrun's answer is straight-forward. If you prefer to not calculate unnecessarily, however, you can define a function that discriminates directly:

library(dplyr)
mysumm <- function(x, na.rm = FALSE) {
  if (all(x %in% 0:1)) {
    sum(x * 0.25, na.rm = na.rm)
  } else {
    mean(x, na.rm = na.rm)
  }
}

df %>%
  group_by(year, month_) %>%
  summarise_if(is.numeric, mysumm, na.rm = TRUE)
# # A tibble: 3 x 7
# # Groups:   year [?]
#     year month_ sts_1 sts_2 non_sts_1 non_sts_2 non_sts_3
#   <fctr> <fctr> <dbl> <dbl>     <dbl>     <dbl>     <dbl>
# 1   2014    Jan  0.25  0.25       1.5       2.5      13.0
# 2   2015    Feb  0.00   NaN      10.0       6.0       9.0
# 3   2015    Jan  0.50  0.25      19.0      12.0       9.5
作者: r2evans 发布者: 2017 年 9 月 15 日
32x32