
Python Program to Implement Big Data Processing Assignment Solution.


Instructions

Objective
Write a program to implement big data processing in Python.

Requirements and Specifications

In this question, we will use all we have learned about Python to do a project about big data processing. We observe earthquakes and topography on the Earth’s surface. The goal of this project is to determine the correlation between these two, or to check if large earthquakes preferentially occur in regions with high or low topography.
There are 2 data files in this folder:
“Large-Eq.csv”: an earthquake catalog containing 25,000 events with a magnitude greater than 5.0 since 2007. The data structure is the same as the other earthquake catalog files we used in this class.
“topo.dat”: topography data. The 3 columns are longitude, latitude, and elevation, respectively.
  1. You will eventually make a plot showing the relationship between the number of earthquakes and the elevation, e.g., a plot in which the horizontal axis is the elevation and the vertical axis is the number of earthquakes. I provide an example figure in the example.pdf file. However, you can make any plot you want, and the details of the figure are your choice. The key is that readers should easily see from your plot at what elevation earthquakes happen the most. If you want to make histograms, you will need to study them by yourself using online materials.
  2. Your code should run without problems.
  3. Your code should be readable to me. Please include as many comments as you think are necessary.
  4. Group discussion is encouraged. However, do not copy others’ code.
  5. Make your code run as fast as possible. When you turn in your code, provide information on how long it takes to run on your computer.
  6. Grades will be based on both the correctness and the readability of the code.
  7. Please try to make your code as fast and precise as possible. Ideally, the code takes a few seconds. It should not take more than 10 minutes.
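Item 5 asks for the run time to be reported. One way to measure it, shown here as a minimal sketch, is the standard library's `time.perf_counter`. The summation is only a stand-in for the real analysis, and the variable names are illustrative.

```python
import time

start = time.perf_counter()                    # high-resolution timer

# Stand-in workload; replace with the data loading, matching, and plotting steps.
total = sum(i * i for i in range(100_000))

elapsed = time.perf_counter() - start          # elapsed seconds as a float
print(f"Run time: {elapsed:.3f} s")
```

Wrapping the whole script this way gives a single number to report with the submission.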
Important hints:
  1. In the earthquake catalog, longitude ranges from -180 to 180 degrees: negative values are west of the Prime Meridian and positive values are east. In the topography data, longitude ranges from 0 to 360 degrees, so 0-180 degrees is the eastern hemisphere and 180-360 degrees is the western hemisphere. In both cases, 0 degrees is the Prime Meridian, and longitude increases as you move eastward.
  2. The (longitude, latitude) points in the earthquake catalog often do not match the grid points in the topography data. Therefore, you need to be creative when finding the elevation for an earthquake from the topography file, e.g., by choosing the closest grid point in topo.dat or by interpolating.
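The two hints can be combined into a nearest-grid-point lookup. The sketch below is not the solution's method. It assumes a regular 1-degree grid (the real spacing of topo.dat should be checked) and uses a synthetic elevation array so that it runs without the data files. The names `grid_lon`, `grid_lat`, `elev`, `STEP`, and `nearest_elevation` are hypothetical.

```python
import numpy as np

STEP = 1.0                                    # assumed grid spacing in degrees
grid_lon = np.arange(0.0, 360.0, STEP)        # 0, 1, ..., 359 (degrees east)
grid_lat = np.arange(-90.0, 90.0 + STEP, STEP)  # -90, -89, ..., 90

# Synthetic elevation: elev[i, j] belongs to (grid_lat[i], grid_lon[j]).
elev = np.add.outer(grid_lat, grid_lon)       # shape (181, 360)

def nearest_elevation(lon, lat):
    """Return the elevation at the grid point closest to each (lon, lat)."""
    # Hint 1: convert catalog longitudes (-180..180) to the 0..360 convention.
    lon = np.mod(np.asarray(lon, dtype=float), 360.0)
    lat = np.asarray(lat, dtype=float)
    # Hint 2: round to the nearest grid index; longitude wraps around (359.6 -> 0).
    j = np.rint((lon - grid_lon[0]) / STEP).astype(int) % grid_lon.size
    i = np.clip(np.rint((lat - grid_lat[0]) / STEP).astype(int),
                0, grid_lat.size - 1)
    return elev[i, j]

print(nearest_elevation([-179.6, -0.4], [10.2, 0.0]))
```

Because the lookup is pure array indexing, it handles the whole catalog in one call, which also helps with the speed requirement.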
Source Code
# Assignment 8 (A small project)
Due date: April 29, 2022, 11:59pm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Read CSV file

eq_data = pd.read_csv('Large_Eq.csv')

eq_data.head()

### Read .dat file

topo_data = pd.read_csv('topo.dat', delimiter='\t', names = ['Lon', 'Lat', 'Elev'])

topo_data.head()

### Since the values between 180 and 360 are for the western hemisphere, we convert them to be between -180 and 0

topo_data.loc[topo_data['Lon'] >= 180, 'Lon'] -= 360

topo_data.head()
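The same conversion can also be written as a single modular-arithmetic expression. The sketch below runs on a small synthetic DataFrame; `wrap_to_180` and `demo` are illustrative names, not part of the original solution.

```python
import pandas as pd

def wrap_to_180(lon):
    """Map longitudes in degrees east onto the half-open interval [-180, 180)."""
    return (lon + 180.0) % 360.0 - 180.0

demo = pd.DataFrame({'Lon': [0.0, 90.0, 180.0, 270.0, 359.0]})
demo['Lon'] = wrap_to_180(demo['Lon'])
print(demo['Lon'].tolist())   # [0.0, 90.0, -180.0, -90.0, -1.0]
```

One point worth checking: after this conversion the topography longitudes contain -180 but not 180, so an earthquake longitude that rounds to exactly 180 would find no match. Applying the same wrap to the rounded earthquake longitudes is one possible way to keep the two conventions consistent.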

### We take the Lat and Lon columns from eq_data and round to integers

eq_data[['Lat', 'Lon']] = eq_data[['Lat', 'Lon']].round()

eq_data.head()

### Now, in the topo_data, initialize a new column named 'Count' with zeros

topo_data['Count'] = 0

topo_data.head()

### Now, group eq_data by (Lon, Lat) and count the number of earthquakes for each pair

eq_data_grouped = eq_data.groupby(['Lon', 'Lat']).size().reset_index(name='count')

eq_data_grouped.head()

### Now, for each pair (lon, lat), get the number of earthquakes recorded in eq_data and put that value in the topo_data for the corresponding (lon, lat) pair

for i in range(len(eq_data_grouped)):
    lon = eq_data_grouped.loc[i, 'Lon']
    lat = eq_data_grouped.loc[i, 'Lat']
    count = eq_data_grouped.loc[i, 'count']

    # Check whether the given lon and lat are in the topo_data DataFrame
    mask = (topo_data['Lon'] == lon) & (topo_data['Lat'] == lat)

    # If so, add the earthquake count to the matching row(s)
    if mask.any():
        topo_data.loc[mask, 'Count'] += count

topo_data.head()
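The loop above scans all of topo_data once per (lon, lat) pair. A vectorized alternative, sketched here on small synthetic frames, is a left merge on the two key columns. `topo_demo` and `grouped_demo` are hypothetical stand-ins for topo_data and eq_data_grouped, and this is not the approach used in the solution.

```python
import pandas as pd

# Synthetic stand-ins for topo_data and eq_data_grouped
topo_demo = pd.DataFrame({'Lon': [10.0, 11.0, 12.0],
                          'Lat': [5.0, 5.0, 5.0],
                          'Elev': [-3.0, 0.5, 2.0]})
grouped_demo = pd.DataFrame({'Lon': [10.0, 12.0, 50.0],
                             'Lat': [5.0, 5.0, 5.0],
                             'count': [4, 1, 7]})

# A left merge keeps every topography row; unmatched rows get NaN in 'count'.
merged = topo_demo.merge(grouped_demo, on=['Lon', 'Lat'], how='left')
merged['Count'] = merged['count'].fillna(0).astype(int)
merged = merged.drop(columns='count')
print(merged)
```

As in the loop, earthquake pairs with no matching grid point (here Lon 50.0) are dropped, and grid points without earthquakes end up with a count of zero.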

### Copy the topo_data DataFrame with only the Elevation and Count columns

topo_data_mini = topo_data[['Elev', 'Count']].copy()

topo_data_mini.head()

### Plot

x = topo_data_mini['Elev'].to_numpy()

y = topo_data_mini['Count'].to_numpy()

plt.figure()

plt.hist(x, weights=y, bins=10)

plt.xlabel('Elevation (km)')

plt.ylabel('Number of earthquakes')

plt.show()
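To state numerically which elevation range has the most earthquakes, the same weighted binning can be done with `np.histogram`. This is a sketch on synthetic values; the arrays `elev` and `count` stand in for the `x` and `y` arrays above, and the bin range is chosen only for the example.

```python
import numpy as np

elev = np.array([-4.0, -3.5, -0.2, 0.1, 0.3, 2.5])   # synthetic elevations
count = np.array([3, 2, 10, 5, 0, 1])                # synthetic earthquake counts

# Weighted histogram: each bin holds the summed earthquake counts.
totals, edges = np.histogram(elev, bins=4, range=(-4.0, 4.0), weights=count)

k = int(np.argmax(totals))                           # index of the fullest bin
print(f"Most earthquakes between {edges[k]} and {edges[k + 1]}: {totals[k]}")
```

With the real data, passing the same `bins` value to `plt.hist` and `np.histogram` keeps the printed range consistent with the plotted bars.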