This case study analyzes the migration patterns of three birds
case-study
Visualization
EDA
Same old quotidian work of importing libraries and the data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib notebook
plt.style.use('ggplot')
birddata = pd.read_csv("data/bird/bird_tracking.csv", index_col=0)
birddata.head()
The data consists of almost 62,000 data points and 8 feature columns, plus the unnamed first column of the CSV, which is read as the index (index_col=0)
birddata.bird_name.value_counts()
Nico 21121
Sanne 21004
Eric 19795
Name: bird_name, dtype: int64
There are three birds in our dataset, named Nico, Sanne, and Eric
We start with a naive, linear plot of the flight trajectory of one particular bird, “Eric”, using longitude and latitude directly as x and y. Because the earth is not flat, the trajectory will be substantially distorted: we have not applied any cartographic projection to it.
This plot is just to get a rough look at the flight trajectory of a bird
ind = birddata.bird_name == "Eric"
x, y = birddata.longitude[ind], birddata.latitude[ind]
plt.figure(figsize=(5,5), dpi=100)
plt.plot(x, y, "o", ms=2)
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("Eric flight trajectory")
plt.show()
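To get a feel for why the unprojected plot is distorted, here is a small sketch of how the ground length of one degree of longitude shrinks as you move away from the equator. It treats the earth as a sphere with a mean radius of 6371 km, which is an approximation, and the helper name km_per_degree_longitude and the two sample latitudes are made up for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical approximation


def km_per_degree_longitude(latitude_deg):
    """Approximate east-west length of one degree of longitude at a latitude."""
    km_per_degree_at_equator = 2 * math.pi * EARTH_RADIUS_KM / 360
    return km_per_degree_at_equator * math.cos(math.radians(latitude_deg))


# Illustrative latitudes only: one nearer the equator, one further north.
for lat in (15, 51):
    km = km_per_degree_longitude(lat)
    print(f"latitude {lat}: {km:.1f} km per degree of longitude")
```

On a plain longitude/latitude plot a degree is drawn at the same width everywhere, so east-west distances in the north look longer than they really are.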
Let’s plot the flight trajectories of all three birds
birds = birddata.bird_name.unique()
plt.figure(figsize=(5,5), dpi=100)
for bird in birds:
    ind = birddata.bird_name == bird
    x, y = birddata.longitude[ind], birddata.latitude[ind]
    plt.plot(x, y, "o", ms=2, label=bird)
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("Birds flight trajectory")
plt.legend(loc="lower right")
plt.show()
To proceed further, we would like to check whether our data contains missing values and handle them accordingly. We’ll be using sklearn to preprocess the data and handle the missing values
birddata.isnull().sum()
altitude 0
date_time 0
device_info_serial 0
direction 443
latitude 0
longitude 0
speed_2d 443
bird_name 0
dtype: int64
The two columns direction and speed_2d contain the same number of missing values, but for the direction column the mean is not an appropriate approximation. Therefore we’ll first impute speed_2d with the mean, and then we’ll use a nearest-neighbours strategy (KNNImputer, with its n_neighbors parameter) to impute direction
from sklearn.impute import SimpleImputer, KNNImputer
# default args are what we want i.e. missing_values=nan, strategy='mean'
imputer = SimpleImputer()
birddata["speed_2d"] = imputer.fit_transform(birddata[['speed_2d']])
birddata.isnull().sum()
altitude 0
date_time 0
device_info_serial 0
direction 443
latitude 0
longitude 0
speed_2d 0
bird_name 0
dtype: int64
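One way to see why a plain mean is a poor fit for direction: the column holds angles in degrees, and angles wrap around at ±180. The sketch below compares the arithmetic mean with a circular mean on two made-up headings. The helper circular_mean_deg is hypothetical and is not part of the original notebook.

```python
import numpy as np


def circular_mean_deg(angles_deg):
    """Mean direction of angles given in degrees, returned in [-180, 180]."""
    radians = np.deg2rad(np.asarray(angles_deg, dtype=float))
    # Average the unit vectors, then convert the mean vector back to an angle.
    mean_sin = np.sin(radians).mean()
    mean_cos = np.cos(radians).mean()
    return float(np.rad2deg(np.arctan2(mean_sin, mean_cos)))


headings = [160.0, -170.0]  # two headings only 30 degrees apart (made up)
print("arithmetic mean:", np.mean(headings))
print("circular mean:  ", circular_mean_deg(headings))
```

The two headings point almost the same way, yet their arithmetic mean points in nearly the opposite direction, while the circular mean lands between them.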
Let’s impute the direction column with default args
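The imputation call itself is not shown here, so the following is only a sketch of how KNNImputer could be used with its default arguments. It runs on a tiny synthetic frame rather than on birddata, and the choice of latitude and longitude as the neighbour features is an assumption, not something the notebook states.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Tiny synthetic stand-in for birddata; all values are made up.
demo = pd.DataFrame({
    "latitude": [51.35, 51.36, 51.37, 51.38, 51.39, 51.40],
    "longitude": [3.17, 3.18, 3.19, 3.20, 3.21, 3.22],
    "direction": [10.0, 12.0, np.nan, 15.0, 14.0, np.nan],
})

# Assumption: neighbours are found from latitude and longitude.
features = ["latitude", "longitude", "direction"]

imputer = KNNImputer()  # defaults: n_neighbors=5, weights="uniform"
demo[features] = imputer.fit_transform(demo[features])

print(demo)
print(demo.isnull().sum())
```

On the real data the same pattern would apply to birddata with whichever numeric columns are chosen as features.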
Omit the last row, as it was unnecessarily introduced into the dataset.
birddata = birddata.iloc[:-1, :]
birddata.tail()
       altitude               date_time  device_info_serial    direction   latitude  longitude  speed_2d  bird_name
61914       -10  2014-04-30 21:29:45+00                 833   -10.057916  51.352661   3.177122  5.531148      Sanne
61915        11  2014-04-30 22:00:08+00                 833    45.448157  51.352572   3.177151  0.208087      Sanne
61916         6  2014-04-30 22:29:57+00                 833  -112.073055  51.352585   3.177144  1.522662      Sanne
61917         5  2014-04-30 22:59:52+00                 833    69.989037  51.352622   3.177257  3.120545      Sanne
61918        16  2014-04-30 23:29:43+00                 833    88.376373  51.354641   3.181509  0.592115      Sanne
birddata.isnull().sum()
altitude 0
date_time 0
device_info_serial 0
direction 0
latitude 0
longitude 0
speed_2d 0
bird_name 0
dtype: int64
Let’s try plotting a histogram of speed_2d for a particular bird Eric
# recompute the mask for "Eric": the plotting loop above reused ind for every bird
ind = birddata.bird_name == "Eric"
speed = birddata.speed_2d[ind]
plt.figure(figsize=(5,5), dpi=100)
plt.hist(speed, bins=np.linspace(0,30,20), density=True)
plt.title("Eric 2D speed Histogram")
plt.xlabel("Speed (m/s)")
plt.ylabel("Frequency")
plt.show()
Notice that our dataset has a column containing date-times, so let’s check the datatype of this column
type(birddata.date_time[0])
str
birddata.date_time[0]
'2013-08-15 00:18:08+00'
The date_time column in our dataset is stored as str. To be able to perform computations on it, such as computing the time interval between two data points, we would like to convert it to datetime objects
import datetime as dt
# remove '+00' from the strings as the time is already in UTC
timestamps = birddata.date_time
timestamps = [stamp[:-3] for stamp in timestamps]
# convert str to a datetime object to be able to perform arithmetic operations on it
timestamps = list(map(
    lambda str_stamp: dt.datetime.strptime(str_stamp, "%Y-%m-%d %H:%M:%S"),
    timestamps))
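As a quick check of the conversion, here is a minimal standalone sketch that parses two date_time strings taken from the tail of the dataset and subtracts them. The variable names raw, parsed and gap are only for illustration.

```python
import datetime as dt

# Two consecutive date_time strings from the tail of the dataset.
raw = ["2014-04-30 21:29:45+00", "2014-04-30 22:00:08+00"]

# Drop the trailing "+00" UTC offset, then parse the remaining text.
parsed = [dt.datetime.strptime(s[:-3], "%Y-%m-%d %H:%M:%S") for s in raw]

# Subtracting two datetime objects yields a timedelta.
gap = parsed[1] - parsed[0]
print(gap)
print(gap.total_seconds())
```

Once the strings are datetime objects, subtraction works directly, which is what the elapsed-time computations rely on.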
birddata["timestamp"] = pd.Series(timestamps, index=birddata.index)
birddata.tail()
       altitude               date_time  device_info_serial    direction   latitude  longitude  speed_2d  bird_name            timestamp
61914       -10  2014-04-30 21:29:45+00                 833   -10.057916  51.352661   3.177122  5.531148      Sanne  2014-04-30 21:29:45
61915        11  2014-04-30 22:00:08+00                 833    45.448157  51.352572   3.177151  0.208087      Sanne  2014-04-30 22:00:08
61916         6  2014-04-30 22:29:57+00                 833  -112.073055  51.352585   3.177144  1.522662      Sanne  2014-04-30 22:29:57
61917         5  2014-04-30 22:59:52+00                 833    69.989037  51.352622   3.177257  3.120545      Sanne  2014-04-30 22:59:52
61918        16  2014-04-30 23:29:43+00                 833    88.376373  51.354641   3.181509  0.592115      Sanne  2014-04-30 23:29:43
birddata.timestamp[0]
Timestamp('2013-08-15 00:18:08')
birddata.timestamp[4] - birddata.timestamp[3]
Timedelta('0 days 00:29:51')
Now that we have our timestamps in place, we’d like to see how often, and when, the data was collected. For this, too, we’ll limit ourselves to Eric
times = birddata.timestamp[birddata.bird_name == "Eric"]
elapsed_time = [time - times[0] for time in times]
plt.figure(figsize=(5,5), dpi=100)
plt.plot(np.array(elapsed_time) / dt.timedelta(days=1))
plt.xlabel("Observations")
plt.ylabel("Elapsed Time")
plt.title("Elapsed time for Eric")
plt.show()
Our next goal is to find out when “Eric” migrates. To do that we’ll plot Eric’s daily mean speed. The data is recorded unevenly, i.e. on some days more observations were collected than on others. We’ll start by getting the indices of the speed_2d values that were collected on the same day, then take the mean of those speeds, and finally plot the daily means to see if there’s any pattern.
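The computation of daily_mean_speed is not shown, so here is one possible sketch of it. It groups observations by the number of whole days elapsed since the bird’s first observation and averages speed_2d within each group. The helper daily_mean_speed_of and the synthetic frame are hypothetical, and the notebook may have used an explicit loop over indices instead of groupby.

```python
import pandas as pd


def daily_mean_speed_of(df, bird):
    """Mean speed_2d per elapsed day for one bird (day 0 = first observation)."""
    one_bird = df[df.bird_name == bird]
    elapsed = one_bird.timestamp - one_bird.timestamp.iloc[0]
    day_number = elapsed.dt.days  # whole days since the first observation
    return one_bird.speed_2d.groupby(day_number).mean()


# Synthetic stand-in for birddata; the speeds are made up.
demo = pd.DataFrame({
    "bird_name": ["Eric"] * 4,
    "timestamp": pd.to_datetime([
        "2013-08-15 00:18:08", "2013-08-15 12:00:00",
        "2013-08-16 01:00:00", "2013-08-16 23:00:00",
    ]),
    "speed_2d": [1.0, 3.0, 2.0, 6.0],
})

daily_mean_speed = daily_mean_speed_of(demo, "Eric")
print(daily_mean_speed)
```

On the real data this would be called as daily_mean_speed_of(birddata, "Eric"). Days without any observation simply do not appear in the result.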
plt.figure(figsize=(7,5), dpi=100)
plt.plot(daily_mean_speed)
plt.xlabel("Days")
plt.ylabel("Speed (m/s)")
plt.title("Eric Daily Mean Speed")
plt.show()
Migration Pattern
From Eric’s 2D speed it can be argued that during days 90 - 100 and 230 - 240 his speed was markedly higher than on other days, so it is likely that Eric migrated during those days. To corroborate this, we would like to look at where Eric ended up during those days.
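Instead of reading the fast days off the plot by eye, they can also be picked out programmatically. The sketch below flags days whose mean speed exceeds a threshold. The helper fast_days, the sample speeds and the threshold of 5.0 m/s are all made up for illustration and are not taken from the data.

```python
import numpy as np


def fast_days(daily_mean_speed, threshold):
    """Positions of days whose mean speed exceeds the threshold."""
    speeds = np.asarray(daily_mean_speed, dtype=float)
    return np.flatnonzero(speeds > threshold)


# Made-up daily mean speeds in m/s.
demo_speeds = [2.1, 2.4, 9.8, 10.5, 2.0, 1.9, 8.7]
print(fast_days(demo_speeds, threshold=5.0))
```

A suitable threshold for the real data would have to be chosen by looking at the daily mean speed plot.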
Cartographic Projection using Cartopy
Earlier we tried plotting the birds’ migration patterns, but the result was not quite what we were looking for because it was not a cartographic projection. So now we’ll use Cartopy to draw a cartographic projection of the birds’ flight patterns.
import cartopy.crs as ccrs
import cartopy.feature as cfeature
proj = ccrs.Mercator()
plt.figure(figsize=(8,8), dpi=100)
ax = plt.axes(projection=proj)
ax.set_extent((-25.0, 20.0, 52.0, 10.0))
ax.add_feature(cfeature.LAND)
ax.add_feature(cfeature.OCEAN)
ax.add_feature(cfeature.COASTLINE)
ax.add_feature(cfeature.BORDERS, linestyle=':', alpha=0.95)
for bird in birds:
    ix = birddata["bird_name"] == bird
    x, y = birddata.longitude[ix], birddata.latitude[ix]
    ax.plot(x, y, '.', transform=ccrs.Geodetic(), label=bird)
plt.legend(loc="upper left")
plt.savefig("map.pdf")
plt.show()
Analysis for each bird
We’ll now group the data by bird_name to get the average 2D speed of the birds
data = birddata.groupby('bird_name')
names = pd.Series(birds, name="Bird Name")
mean_speeds = data.speed_2d.mean()
data.speed_2d.describe().set_index(names)