python之Pandas 计算时间序列中的零点

lexus 阅读:36 2024-11-24 20:56:43 评论:0

我有一个每日时间序列 [1980 年至今],我需要在其中检查每个每日时间步长是否为零并系统地删除记录。我最终想对该解决方案进行矢量化处理,这样我就可以在继续分析之前预处理这些操作。如果我有数据框 df:

         date               name  elev_exact      swe 
0  1990-10-30   COTTONWOOD_CREEK    2337.816  0.01524 
1  1990-10-30    EMIGRANT_SUMMIT    2252.472  0.00000 
2  1990-10-30     PHILLIPS_BENCH    2499.360  0.05334 
3  1990-10-30    PINE_CREEK_PASS    2048.256  0.00000 
4  1990-10-30  SALT_RIVER_SUMMIT    2328.672  0.00000 
5  1990-10-30      SEDGWICK_PEAK    2392.680  0.00000 
6  1990-10-30          SHEEP_MTN    2026.920  0.00000 
7  1990-10-30  SLUG_CREEK_DIVIDE    2202.180  0.00000 
8  1990-10-30       SOMSEN_RANCH    2072.640  0.00000 
9  1990-10-30   WILDHORSE_DIVIDE    1978.152  0.00000 
10 1990-10-30       WILLOW_CREEK    2462.784  0.01778 
11 1991-03-15   COTTONWOOD_CREEK    2337.816  0.41910 
12 1991-03-15    EMIGRANT_SUMMIT    2252.472  0.42418 
13 1991-03-15     PHILLIPS_BENCH    2499.360  0.52832 
14 1991-03-15    PINE_CREEK_PASS    2048.256  0.32258 
15 1991-03-15  SALT_RIVER_SUMMIT    2328.672  0.23876 
16 1991-03-15      SEDGWICK_PEAK    2392.680  0.39878 
17 1991-03-15          SHEEP_MTN    2026.920  0.31242 
18 1991-03-15  SLUG_CREEK_DIVIDE    2202.180  0.29464 
19 1991-03-15       SOMSEN_RANCH    2072.640  0.29972 
20 1991-03-15   WILDHORSE_DIVIDE    1978.152  0.35052 
21 1991-03-15       WILLOW_CREEK    2462.784  0.60706 
22 1991-10-25   COTTONWOOD_CREEK    2337.816  0.01270 
23 1991-10-25    EMIGRANT_SUMMIT    2252.472  0.01016 
24 1991-10-25     PHILLIPS_BENCH    2499.360  0.02286 
25 1991-10-25    PINE_CREEK_PASS    2048.256  0.00508 
26 1991-10-25  SALT_RIVER_SUMMIT    2328.672  0.01016 
27 1991-10-25      SEDGWICK_PEAK    2392.680  0.00254 
28 1991-10-25          SHEEP_MTN    2026.920  0.00000 
29 1991-10-25  SLUG_CREEK_DIVIDE    2202.180  0.00762 
30 1991-10-25       SOMSEN_RANCH    2072.640  0.00000 
31 1991-10-25   WILDHORSE_DIVIDE    1978.152  0.00508 
32 1991-10-25       WILLOW_CREEK    2462.784  0.02032 

问题是我想找到超过一个零 swe 测量值的日子,并且只保留具有最大 elev_exact 的观测值。然后我需要将所需的零记录合并回 df

这是一个 groupby 循环,可以实现我想要的:

result = pd.DataFrame() 
for name, group in df.groupby('date'): 
 
    non_zero = group.where(group.swe >0).dropna() 
 
    if not group.equals(non_zero): 
        zeros = group.where(group.swe == 0).dropna()  
        zero_kept = zeros.loc[zeros.elev_exact.idxmax()] 
        out = non_zero.append(zero_kept) 
        out = out[out.elev_exact >= zero_kept.elev_exact] 
        result = pd.concat([result, out]) 
    else: 
        result = pd.concat([result, non_zero]) 

我不介意使用 groupby 但我想更有条理地使用它,所以我没有内部 if-else 循环。

我是这样想的

  1. 对于每个每日时间步长,我想找到哪里有多个零测量值
zero_count = df.groupby('date').apply(lambda x: np.count_nonzero(x==0)) 
zero_count = zero_count.where(zero_count >1).dropna() 
  1. zero_count > 1
  2. 分隔日期
zero_fix = zero_count.where(zero_count >1).dropna() 
  1. 找出具有多个零的每一天的最大海拔
fixes = df[df.date.isin(zero_fix.index)].dropna() 
fixes = fixes.loc[fixes[fixes.swe==0].groupby('date')['elev_exact'].idxmax().to_list()] 
  1. 将找到的海拔阈值应用回 df
df.loc[:,'threshold'] = df.date.map(lu_dict) 
df = df.replace(np.nan, 0) 
df = df[df.elev_exact >= df.threshold].drop('threshold', axis=1) 

这也有效,但步骤 1 中的 lambda 函数非常慢。还有另一种计算零的方法吗?

预期输出:

          date               name  elev_exact      swe 
2   1990-10-30     PHILLIPS_BENCH    2499.360  0.05334 
5   1990-10-30      SEDGWICK_PEAK    2392.680  0.00000 
10  1990-10-30       WILLOW_CREEK    2462.784  0.01778 
11  1991-03-15   COTTONWOOD_CREEK    2337.816  0.41910 
12  1991-03-15    EMIGRANT_SUMMIT    2252.472  0.42418 
13  1991-03-15     PHILLIPS_BENCH    2499.360  0.52832 
14  1991-03-15    PINE_CREEK_PASS    2048.256  0.32258 
15  1991-03-15  SALT_RIVER_SUMMIT    2328.672  0.23876 
16  1991-03-15      SEDGWICK_PEAK    2392.680  0.39878 
17  1991-03-15          SHEEP_MTN    2026.920  0.31242 
18  1991-03-15  SLUG_CREEK_DIVIDE    2202.180  0.29464 
19  1991-03-15       SOMSEN_RANCH    2072.640  0.29972 
20  1991-03-15   WILDHORSE_DIVIDE    1978.152  0.35052 
21  1991-03-15       WILLOW_CREEK    2462.784  0.60706 
22  1991-10-25   COTTONWOOD_CREEK    2337.816  0.01270 
23  1991-10-25    EMIGRANT_SUMMIT    2252.472  0.01016 
24  1991-10-25     PHILLIPS_BENCH    2499.360  0.02286 
26  1991-10-25  SALT_RIVER_SUMMIT    2328.672  0.01016 
27  1991-10-25      SEDGWICK_PEAK    2392.680  0.00254 
29  1991-10-25  SLUG_CREEK_DIVIDE    2202.180  0.00762 
30  1991-10-25       SOMSEN_RANCH    2072.640  0.00000 
32  1991-10-25       WILLOW_CREEK    2462.784  0.02032 

请您参考如下方法:

您可以试试这个,将数据帧拆分为非零和零,然后按最高 elev_exact 对零数据帧进行排序,并将 drop_duplicates 与日期列的子集一起使用。最后,使用 pd.concat 将数据帧重新连接在一起并排序:

df_nonzeroes = df[df['swe'].ne(0)] 
df_zeroes = df[df['swe'].eq(0)].sort_values('elev_exact', ascending=False).drop_duplicates(subset=['date']) 
 
df_out = pd.concat([df_nonzeroes, df_zeroes]).sort_index() 
print(df_out) 

输出:

          date               name  elev_exact      swe 
0   1990-10-30   COTTONWOOD_CREEK    2337.816  0.01524 
2   1990-10-30     PHILLIPS_BENCH    2499.360  0.05334 
5   1990-10-30      SEDGWICK_PEAK    2392.680  0.00000 
10  1990-10-30       WILLOW_CREEK    2462.784  0.01778 
11  1991-03-15   COTTONWOOD_CREEK    2337.816  0.41910 
12  1991-03-15    EMIGRANT_SUMMIT    2252.472  0.42418 
13  1991-03-15     PHILLIPS_BENCH    2499.360  0.52832 
14  1991-03-15    PINE_CREEK_PASS    2048.256  0.32258 
15  1991-03-15  SALT_RIVER_SUMMIT    2328.672  0.23876 
16  1991-03-15      SEDGWICK_PEAK    2392.680  0.39878 
17  1991-03-15          SHEEP_MTN    2026.920  0.31242 
18  1991-03-15  SLUG_CREEK_DIVIDE    2202.180  0.29464 
19  1991-03-15       SOMSEN_RANCH    2072.640  0.29972 
20  1991-03-15   WILDHORSE_DIVIDE    1978.152  0.35052 
21  1991-03-15       WILLOW_CREEK    2462.784  0.60706 
22  1991-10-25   COTTONWOOD_CREEK    2337.816  0.01270 
23  1991-10-25    EMIGRANT_SUMMIT    2252.472  0.01016 
24  1991-10-25     PHILLIPS_BENCH    2499.360  0.02286 
25  1991-10-25    PINE_CREEK_PASS    2048.256  0.00508 
26  1991-10-25  SALT_RIVER_SUMMIT    2328.672  0.01016 
27  1991-10-25      SEDGWICK_PEAK    2392.680  0.00254 
29  1991-10-25  SLUG_CREEK_DIVIDE    2202.180  0.00762 
30  1991-10-25       SOMSEN_RANCH    2072.640  0.00000 
31  1991-10-25   WILDHORSE_DIVIDE    1978.152  0.00508 
32  1991-10-25       WILLOW_CREEK    2462.784  0.02032 


标签:Python
声明

1.本站遵循行业规范,任何转载的稿件都会明确标注作者和来源;2.本站的原创文章,请转载时务必注明文章作者和来源,不尊重原创的行为我们将追究责任;3.作者投稿可能会经我们编辑修改或补充。

关注我们

一个IT知识分享的公众号