1. Check whether data lies outside Rwanda and, if so, check its distribution
There are several ways to check whether a location falls outside Rwanda. Two that I am aware of:
a) Use an API from Google or another spatial service provider.
b) Use the country boundary datasets published by the Humanitarian Data Exchange (we will go with this option because it is free; link: https://data.humdata.org/).
With the second method, we can check which country each point (latitude and longitude) falls in.
# geopandas and Point (from shapely.geometry) were already imported at the top
# Paths to the unzipped shapefiles
rwanda_map = gpd.read_file('/Kaggle Datasets/RWA_adm/RWA_adm0.shp')
burundi_map = gpd.read_file('/Kaggle Datasets/Burundi/geoBoundaries-BDI-ADM0.shp')
drc_map = gpd.read_file('/Kaggle Datasets/DRC/cod_admbnda_adm0_rgc_itos_20190911.shp')
uganda_map = gpd.read_file('/Kaggle Datasets/Uganda/uga_admbnda_adm0_ubos_20200824.shp')
tanzania_map = gpd.read_file('/Kaggle Datasets/Tanzania/tza_admbnda_adm0_20181019.shp')
# Keep only the country name and geometry columns
rwanda_map = rwanda_map[['NAME_0','geometry']]
burundi_map = burundi_map[['shapeName','geometry']]
drc_map = drc_map[['ADM0_FR','geometry']]
uganda_map = uganda_map[['ADM0_EN','geometry']]
tanzania_map = tanzania_map[['ADM0_EN','geometry']]
# Stack the five maps and collapse the differently named columns into a single 'Country' column
mapData = pd.concat([rwanda_map,burundi_map,drc_map,uganda_map,tanzania_map],axis=0)
mapData['Country'] = np.where(mapData['NAME_0'].notnull(),mapData['NAME_0'],
                     np.where(mapData['shapeName'].notnull(),mapData['shapeName'],
                     np.where(mapData['ADM0_FR'].notnull(),mapData['ADM0_FR'],mapData['ADM0_EN'])))
mapData = mapData[['Country','geometry']]
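As an aside, the nested np.where calls can be avoided by renaming each name column to a shared name before concatenating. The sketch below uses small toy DataFrames in place of the real shapefile frames; the values and the omission of the geometry column are simplifications for illustration, not the actual contents of the files.

```python
import pandas as pd

# Toy stand-ins for the per-country frames (geometry column omitted for brevity;
# the name values are placeholders, not read from the shapefiles)
rwanda = pd.DataFrame({"NAME_0": ["Rwanda"]})
burundi = pd.DataFrame({"shapeName": ["Burundi"]})
uganda = pd.DataFrame({"ADM0_EN": ["Uganda"]})

# Rename each source-specific column to 'Country' first, then stack the frames
frames = [
    rwanda.rename(columns={"NAME_0": "Country"}),
    burundi.rename(columns={"shapeName": "Country"}),
    uganda.rename(columns={"ADM0_EN": "Country"}),
]
combined = pd.concat(frames, ignore_index=True)
country_names = combined["Country"].tolist()
```

Because every frame already has a 'Country' column when it is stacked, no missing values are introduced and nothing needs to be coalesced afterwards.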
# Function to find the country
def find_location(lat, lon, targetMap, targetColumn):
    # shapely expects (x, y), i.e. (longitude, latitude)
    point = Point(lon,lat)
    for idx, row in targetMap.iterrows():
        if row['geometry'].contains(point):
            return row[targetColumn]
    return "Unknown"
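To make the containment test less of a black box, here is a toy ray-casting point-in-polygon check in plain Python. It only illustrates the general idea. The function above relies on shapely's `contains`, which also handles holes, multi-part polygons, and boundary cases that this sketch ignores. The name `point_in_polygon` and the square polygon are made up for illustration.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray casting: count how many polygon edges a ray going east from the point crosses.

    `polygon` is a list of (lon, lat) vertices; the last vertex connects back to the first.
    An odd number of crossings means the point is inside.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that horizontal line
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# Hypothetical square in (lon, lat) degrees; not a real border
toy_square = [(29.0, -3.0), (31.0, -3.0), (31.0, -1.0), (29.0, -1.0)]

inside_point = point_in_polygon(30.0, -2.0, toy_square)
outside_point = point_in_polygon(32.0, -2.0, toy_square)
```

Note that the arguments are ordered (lon, lat), matching the `Point(lon, lat)` convention used with shapely.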
# Apply the function to the original data
trainData['Countries'] = trainData.apply(lambda x: find_location(x['latitude'],x['longitude'],mapData,'Country'), axis =1)
After this step, look at the unique countries present in the dataset:
trainData['Countries'].unique()

# Count rows per country ('Countries' is the column created above)
countryCount = trainData.groupby('Countries')['ID_LAT_LON_YEAR_WEEK'].count().reset_index()
fig = px.bar(
    countryCount,
    x=countryCount['Countries'],
    y=countryCount['ID_LAT_LON_YEAR_WEEK']
)
# Annotate each bar with its share of all rows
total = countryCount['ID_LAT_LON_YEAR_WEEK'].sum()
for ind,row in countryCount.iterrows():
    fig.add_annotation(
        x=row['Countries'],
        y=row['ID_LAT_LON_YEAR_WEEK'],
        text=f"{round(row['ID_LAT_LON_YEAR_WEEK']/total,3)}"
    )
fig.show()
a) The data covers 5 countries including Rwanda. The other four are Rwanda’s neighbors, and their points all come from the border areas.
b) Another thing to notice from the video is that border areas have higher CO2 emissions.
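The per-country shares shown as bar annotations can also be computed directly with pandas. This is a minimal sketch on made-up labels standing in for the 'Countries' column; the counts are not taken from the real dataset.

```python
import pandas as pd

# Made-up labels standing in for trainData['Countries']
countries = pd.Series(
    ["Rwanda", "Rwanda", "Rwanda", "Uganda", "Unknown"], name="Countries"
)

# normalize=True returns each label's fraction of all rows instead of raw counts
shares = countries.value_counts(normalize=True).round(3)
```

Keeping an eye on the "Unknown" share is a quick way to see how many points matched none of the loaded country polygons.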
2. Creating clusters based on location proximity
Before clustering, let’s look at the unique locations in the dataset:
uniqueLocations = mockTrain.drop_duplicates(subset=['latitude','longitude'])
uniqueLocations.shape
# Result: (497, 72), i.e. the data comes from 497 locations
Another step before clustering is to identify which province each point lies in and, for points outside the country’s border, which province they are closest to. Rwanda is divided into 5 provinces: 1) Northern Province, 2) Southern Province, 3) Eastern Province, 4) Western Province, and 5) Kigali City (which is also the capital).
We are going to repeat the same steps we used to identify the countries:
rwanda_map = gpd.read_file('/Kaggle Datasets/RWA_adm/RWA_adm1.shp')
# Map the Kinyarwanda province names to their English names
rwanda_map['Province_Name'] = np.where(rwanda_map['NAME_1']=='Amajyaruguru','Northern Province',
                              np.where(rwanda_map['NAME_1']=='Amajyepfo', 'Southern Province',
                              np.where(rwanda_map['NAME_1']=='Iburasirazuba','Eastern Province',
                              np.where(rwanda_map['NAME_1']=='Iburengerazuba','Western Province',
                              np.where(rwanda_map['NAME_1']=='Umujyi wa Kigali','Kigali City','Rwanda')))))
# We reuse the same function to identify the province
trainData['Province_Name'] = trainData.apply(lambda x: find_location(x['latitude'],x['longitude'],rwanda_map,'Province_Name'), axis=1)
# The step above gives the province for points inside Rwanda; points outside the border still need their nearest province
# .copy() so that the assignment below does not modify a view of trainData
border = trainData[trainData['Province_Name']=='Unknown'].copy()
def nearest_province_by_boundary(lat, lon):
    point = Point(lon, lat)
    # Find the province whose boundary is closest to the point
    distances = rwanda_map['geometry'].boundary.distance(point)
    return rwanda_map.loc[distances.idxmin(), 'Province_Name']
import warnings
# Suppress all warnings from here on
warnings.filterwarnings('ignore')
border['Province_Name'] = border.apply(lambda x: nearest_province_by_boundary(x['latitude'],x['longitude']),axis=1)
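The nearest-boundary lookup can be pictured as "take the smallest distance from the point to any edge of each province, then pick the province with the smallest value". The plain-Python sketch below shows that idea on two hypothetical rectangular regions; the helper names and coordinates are invented for illustration and are not how geopandas implements `distance`. One caveat: with latitude/longitude coordinates these distances are in degrees, not meters. geopandas usually warns about distances computed in a geographic CRS, and `warnings.filterwarnings('ignore')` would hide that warning. Near the equator the distortion is small, but reprojecting with `to_crs` to a projected CRS is the usual remedy when metric distances matter.

```python
import math


def point_segment_distance(px, py, ax, ay, bx, by):
    """Planar distance from point P to the segment AB."""
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        # Degenerate segment: A and B coincide
        return math.hypot(px - ax, py - ay)
    # Projection of P onto the line AB, clamped to stay on the segment
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def boundary_distance(px, py, polygon):
    """Smallest distance from P to any edge of a closed polygon ring."""
    n = len(polygon)
    return min(
        point_segment_distance(px, py, *polygon[i], *polygon[(i + 1) % n])
        for i in range(n)
    )


def nearest_region(px, py, regions):
    """Name of the region whose boundary is closest to P."""
    return min(regions, key=lambda name: boundary_distance(px, py, regions[name]))


# Two hypothetical rectangular "provinces" in (lon, lat) degrees; not real borders
toy_regions = {
    "West": [(29.0, -3.0), (30.0, -3.0), (30.0, -1.0), (29.0, -1.0)],
    "East": [(30.0, -3.0), (31.0, -3.0), (31.0, -1.0), (30.0, -1.0)],
}

# A point just outside the eastern edge is assigned to the eastern region
nearest = nearest_region(31.2, -2.0, toy_regions)
```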
# merge all the data together
trainDataCompleteProvince = pd.concat([trainData[trainData['Province_Name']!='Unknown'].reset_index(drop = True),
border.reset_index(drop = True)],axis=0)
# Find which province's border has the most outside points
border.groupby('Province_Name')['ID_LAT_LON_YEAR_WEEK'].count().reset_index().sort_values('ID_LAT_LON_YEAR_WEEK',ascending=False).rename(columns={'ID_LAT_LON_YEAR_WEEK':'DataCount'})
The east and west sides have many data points that lie outside Rwanda’s administrative border, and we also see the effect of this on emissions.
The year-on-year change is also higher on the east and west sides, whereas the change in average emission for the capital, Kigali, swings sharply (-30% from 2019 to 2020, then +30% from 2020 to 2021).
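For reference, a year-on-year percentage change like the one quoted above can be computed with pandas `pct_change`. The numbers below are invented to mirror a -30% then +30% pattern; they are not the real Kigali averages, and the series name `emission` is an assumption about the dataset's column.

```python
import pandas as pd

# Hypothetical yearly average emissions for one province (illustrative numbers only)
yearly_mean = pd.Series([100.0, 70.0, 91.0], index=[2019, 2020, 2021], name="emission")

# Percentage change relative to the previous year; the first year has no predecessor
yoy_change_pct = (yearly_mean.pct_change() * 100).round(1)
```

In this toy series, a 30% drop followed by a 30% rise does not bring the value back to its starting level (100 to 70 to 91), because the second change is measured against the lower 2020 base.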