A Programmer's Guide to Data Mining: A programmer's cleaning guide for messy sensor data - 白红宇


Published: 2019-05-11


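In this tutorial, I will explain how to work with and process messy data. If you have never used Pandas before and know the basics of Python, this tutorial is for you.

Let's start from scratch and turn a messy file into a useful dataset. The full source code is available online.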

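Reading a CSV file

You can open a CSV file in Pandas with the following:

pandas.read_csv(): Opens a CSV file as a DataFrame (like a table).

DataFrame.head(): Displays the first 5 entries.

A DataFrame is like a table in Pandas; it has a fixed number of columns and an index. CSV files are a good fit for DataFrames because they lay their data out in columns and rows.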

import pandas as pd

# Open a comma-separated values (CSV) file as a DataFrame
weather_observations = \
  pd.read_csv('observations/Canberra_observations.csv')

# Print the first 5 entries
weather_observations.head()
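If you do not have the observations file at hand, the same two calls can be tried on an in-memory string. This is only a minimal sketch: the column names and values below are made up for illustration and are not taken from the Canberra dataset.

```python
import io

import pandas as pd

# Hypothetical sample data, invented for this illustration only
csv_text = "station,temp\nA,21.5\nB,19.0\nC,23.1\n"

# read_csv() accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

# head() returns at most the first 5 rows (here, all 3)
print(sample.head())

# shape is (number of rows, number of columns)
print(sample.shape)
```

Wrapping the text in io.StringIO keeps the example self-contained, so nothing has to exist on disk.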

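It looks like our data is actually separated by tabs (\t). There are also some interesting items in there that appear to be times.

pandas.read_csv() provides versatile keyword arguments for different situations. Here, you have one column for the date and another for the time. You can introduce a few keyword arguments to add some intelligence:

sep: The separator between columns

parse_dates: Treat one or more columns as dates

dayfirst: Use the DD.MM.YYYY format rather than month first

infer_datetime_format: Tell Pandas to guess the date format

na_values: Add values to be treated as empty

Using these keyword arguments pre-formats the data and lets Pandas do some of the heavy lifting.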

# Supply pandas with some hints about the file to read
weather_observations = \
  pd.read_csv('observations/Canberra_observations.csv',
     sep='\t',
     parse_dates={'Datetime': ['Date', 'Time']},
     dayfirst=True,
     infer_datetime_format=True,
     na_values=['-']
)

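Pandas nicely converted the two columns, Date and Time, into a single column, Datetime, and rendered it in a standard format.

There is a NaN value here. Pandas uses NaN (the floating-point "not a number" value) as its marker for missing data, so here it simply means the entry is empty.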
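Recent pandas releases have deprecated both infer_datetime_format and the combining of date columns through parse_dates. As a hedged alternative, the sketch below reads the date and time as plain text and combines them afterwards with pd.to_datetime(). The sample rows are hypothetical, and the column names Date, Time and Tmp are assumed to match the file.

```python
import io

import pandas as pd

# Hypothetical tab-separated sample; '-' marks a missing reading
tsv_text = (
    "Date\tTime\tTmp\n"
    "01/02/2019\t00:30\t18.5\n"
    "01/02/2019\t00:00\t-\n"
)

# sep and na_values work exactly as in the call above
sample = pd.read_csv(io.StringIO(tsv_text), sep="\t", na_values=["-"])

# Join the two text columns and parse them day-first (DD/MM/YYYY)
sample["Datetime"] = pd.to_datetime(
    sample["Date"] + " " + sample["Time"], dayfirst=True
)
sample = sample.drop(columns=["Date", "Time"])

print(sample)

# isna() flags the entries that were read as empty
print(sample["Tmp"].isna().sum())
```

Counting the missing values right after loading is a quick way to check that na_values caught every placeholder the file uses.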

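Sorting data in order

Let's look at how Pandas handles the order of data.

DataFrame.sort_values(): Rearranges entries in order.

DataFrame.drop_duplicates(): Removes duplicated items.

DataFrame.set_index(): Specifies the column to use as the index.

Because time seems to be running backward, let's sort it: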

# Sorting is ascending by default, or chronological order
sorted_dataframe = weather_observations.sort_values('Datetime')
sorted_dataframe.head()

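Why are there two midnights? It turns out that our dataset contains a midnight at both the end and the beginning of each day. You can discard one of them as a duplicate, since the next day comes with another midnight.

The logical order here is to sort the data, discard the duplicates, and then set the index: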

# Sorting is ascending by default, or chronological order
sorted_dataframe = weather_observations.sort_values('Datetime')

# Remove duplicated items with the same date and time
no_duplicates = sorted_dataframe.drop_duplicates('Datetime', keep='last')

# Use `Datetime` as our DataFrame index
# (built from `no_duplicates`, so the duplicates stay removed)
indexed_weather_observations = \
  no_duplicates.set_index('Datetime')
indexed_weather_observations.head()
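The three steps can be checked on a tiny frame. The sketch below uses hypothetical readings with one repeated timestamp. It passes kind="stable" to sort_values(), an addition not present in the code above, so that rows sharing a timestamp keep their original order and keep='last' picks a predictable row.

```python
import pandas as pd

# Hypothetical readings, out of order, with 00:00 appearing twice
frame = pd.DataFrame({
    "Datetime": pd.to_datetime([
        "2019-01-01 00:30", "2019-01-01 00:00",
        "2019-01-01 00:00", "2019-01-01 01:00",
    ]),
    "Tmp": [18.0, 19.0, 19.5, 17.5],
})

cleaned = (
    frame.sort_values("Datetime", kind="stable")   # chronological order
         .drop_duplicates("Datetime", keep="last")  # one row per timestamp
         .set_index("Datetime")                     # time becomes the index
)

print(cleaned)

# A unique, increasing index is what later time-based steps rely on
print(cleaned.index.is_unique, cleaned.index.is_monotonic_increasing)
```

Chaining the calls makes it harder to pass the wrong intermediate frame to set_index().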

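You now have a DataFrame with time as its index, which will come in handy later. First, let's transform the wind directions.

Transforming column values

To prepare wind data for weather modeling, you can use the wind values in a numerical format. By convention, north wind (↓) is 0 degrees, going clockwise ⟳. East wind (←) is 90 degrees, and so on. You will leverage Pandas to do the transformation:

Series.apply(): Transforms each entry with a function.

To work out the exact value of each wind direction, I wrote a dictionary by hand, since there are only 16 values. This is tidy and easy to understand.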

# Translate wind direction to degrees
wind_directions = {
     'N':   0. , 'NNE':  22.5, 'NE':  45. , 'ENE':  67.5,
     'E':  90. , 'ESE': 112.5, 'SE': 135. , 'SSE': 157.5,
     'S': 180. , 'SSW': 202.5, 'SW': 225. , 'WSW': 247.5,
     'W': 270. , 'WNW': 292.5, 'NW': 315. , 'NNW': 337.5}

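You can access a DataFrame column (called a Series in Pandas) with an index accessor, just as you would with a Python dictionary. After the transformation, the Series is replaced with the new values.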

# Replace wind directions column with a new number column
# `get()` accesses values from the dictionary safely
indexed_weather_observations['Wind dir'] = \
    indexed_weather_observations['Wind dir'].apply(wind_directions.get)

# Display some entries
indexed_weather_observations.head()

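Each valid wind direction is now a number. It doesn't matter whether the original value is a string or some other kind of number; you can use Series.apply() to transform it.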
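It can be useful to see what happens to labels that are not in the dictionary. The sketch below uses Series.map() with a dict, a close relative of apply(wind_directions.get) rather than the exact call used above. The lookup table is shortened to four entries and the 'calm' label is a hypothetical unrecognised value.

```python
import pandas as pd

# A shortened, illustrative lookup table (not the full 16-point one)
directions = {"N": 0.0, "E": 90.0, "S": 180.0, "W": 270.0}

# Hypothetical column containing one label the table does not know
wind = pd.Series(["N", "W", "calm", "E"])

# With a dict, map() turns labels missing from the dict into NaN
degrees = wind.map(directions)

print(degrees)

# Count how many labels could not be converted
print(degrees.isna().sum())
```

Unrecognised labels end up as empty values instead of raising an error, so they can be filled or dropped in a later step.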

Setting the index frequency

Digging deeper, you will find more flaws in the dataset:

# One section where the data has weird timestamps ...
indexed_weather_observations[1800:1805]

00:33:00? 01:11:00? These are odd timestamps. There is a function to ensure a consistent frequency:

DataFrame.asfreq(): Forces a specific frequency on the index, discarding the rest.

# Force the index to be every 30 minutes
regular_observations = \
  indexed_weather_observations.asfreq('30min')

# Same section at different indices since setting
# its frequency :)
regular_observations[1633:1638]
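To see the effect of asfreq() in isolation, here is a minimal sketch on hypothetical readings in which one timestamp (00:33) is off the 30-minute grid and one slot (01:00) is missing entirely.

```python
import pandas as pd

# Hypothetical readings: 00:33 is off-grid, 01:00 is absent
index = pd.to_datetime([
    "2019-01-01 00:00", "2019-01-01 00:30",
    "2019-01-01 00:33", "2019-01-01 01:30",
])
readings = pd.DataFrame({"Tmp": [19.0, 18.5, 18.4, 17.0]}, index=index)

# Rebuild the index on a regular 30-minute grid
regular = readings.asfreq("30min")

print(regular)

# The off-grid row is gone; the missing slot exists but is empty
print(pd.Timestamp("2019-01-01 00:33") in regular.index)
print(regular["Tmp"].isna().sum())
```

Note that asfreq() needs a unique index, which is one reason the duplicates were removed earlier.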

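Pandas discards any index entries that don't match the frequency and adds an empty row where one doesn't exist. You now have a consistent index frequency. Let's plot it with matplotlib, a popular plotting library, to see how it looks: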

import matplotlib.pyplot as plt

# Make the graphs a bit prettier
# (the pandas option 'display.mpl_style' is no longer available,
# so a matplotlib style sheet is used instead)
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (18, 5)

# Plot the first 500 entries with selected columns
regular_observations[['Wind spd', 'Wind gust', 'Tmp', 'Feels like']][:500].plot()

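Looking closely, there seem to be gaps around January 6th, 7th, and beyond. You need to fill these with something meaningful.

Interpolating and filling empty rows

To fill the gaps, you can interpolate linearly, that is, draw a line between the two end points of a gap and fill in each timestamp accordingly.

Series.interpolate(): Fills in empty values based on the index.

Here, you can also use the inplace keyword argument to tell Pandas to perform the operation and replace the values in place.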

# Interpolate data to fill empty values
for column in regular_observations.columns:
    regular_observations[column].interpolate('time', inplace=True, limit_direction='both')

# Display some interpolated entries
regular_observations[1633:1638]
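The sketch below shows time-based interpolation on a small hypothetical series. It assigns the result back rather than using inplace, an assumption made here because, depending on the pandas version, an in-place call on a column selected from a DataFrame may act on a temporary copy and leave the original unchanged.

```python
import pandas as pd

# Hypothetical half-hourly temperatures with gaps (None becomes NaN)
index = pd.date_range("2019-01-01 00:00", periods=5, freq="30min")
tmp = pd.Series([20.0, None, None, 17.0, None], index=index)

# 'time' weights the fill by the actual spacing of the timestamps;
# limit_direction='both' also fills gaps at either end of the series
filled = tmp.interpolate("time", limit_direction="both")

print(filled)

# No empty values should remain
print(filled.isna().sum())
```

For a whole DataFrame, the same pattern would be to assign each interpolated column back to its column name inside the loop.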

The NaN values have been replaced. Let's plot it again:

# Plot it again - gap free!
regular_observations[['Wind spd', 'Wind gust', 'Tmp', 'Feels like']][:500].plot()

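Congratulations! The data is now ready to be used for weather processing. You can now put it to use.

Conclusion

I have shown how to clean up messy data with Python and Pandas in several ways, such as:

Reading a CSV file with the correct structure

Sorting the dataset

Transforming columns by applying a function

Regulating the data frequency

Interpolating and filling missing data

Plotting the dataset

Pandas offers many more powerful features than those covered here, and you may find a few gems among them. If you have any questions or thoughts, feel free to reach out to me on Twitter.

Happy data cleaning!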
