K Means Algorithm Demonstrated by our Data Mining Homework Helpers
K-means is an algorithm that tries to split the dataset into K predefined, non-overlapping clusters, where each data point belongs to only one cluster.
To run the k-means algorithm we need to provide:
- The number of clusters K.
- An initialization method.
- A similarity measure (distance function).
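To make these three inputs concrete, here is a minimal sketch of the k-means loop in plain Python. It is not Weka's implementation: the function names (`kmeans`, `random_init`, `euclidean`, `manhattan`) and the toy points are assumptions made for illustration, and the centroid update always uses the mean, which is a simplification when the Manhattan distance is selected.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def random_init(points, k, rng):
    # Pick k distinct data points as the starting centroids.
    return [list(p) for p in rng.sample(points, k)]

def kmeans(points, k, distance=euclidean, init=random_init, max_iter=100, seed=0):
    """Return (assignment, centroids) for the given K, init method and distance."""
    rng = random.Random(seed)
    centroids = init(points, k, rng)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: distance(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the loop has converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members
        # (simplification: the mean is used for every distance function).
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Hypothetical toy data: two well-separated groups of 2-D points.
toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(toy, k=2, distance=manhattan)
```

Swapping `distance` or `init` changes the configuration without touching the loop itself, which mirrors how the parameters were varied in the experiments.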
To prepare the dataset for the k-means clustering algorithm, I removed mortality as a parameter since it is the label for each data point.
Our data mining homework helpers tried several configurations for k-means clustering, changing the number of clusters from 2 to 5. Our tutors also changed the initialization method and the similarity measure in this data mining homework.
Data mining homework helpers at programminghomeworkhelp.com used the Weka feature "Classes to clusters evaluation". This feature lets you choose the label of the data; all other features are used in the model to find clusters in the dataset. Afterward, the model treats each cluster as a label and calculates precision, that is, how well the clustering algorithm corresponds to the actual labels that were not used for calculations.
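The idea behind this kind of evaluation can be sketched in a few lines of Python. This is a simplified version that maps each cluster to its majority label; Weka's exact assignment procedure may differ, and the function name `classes_to_clusters` and the toy inputs are assumptions made for illustration.

```python
from collections import Counter

def classes_to_clusters(cluster_ids, labels):
    """Map each cluster to its majority label and score the agreement."""
    by_cluster = {}
    for c, y in zip(cluster_ids, labels):
        by_cluster.setdefault(c, Counter())[y] += 1
    # Simplification: majority vote per cluster (not necessarily Weka's rule).
    mapping = {c: counts.most_common(1)[0][0] for c, counts in by_cluster.items()}
    correct = sum(1 for c, y in zip(cluster_ids, labels) if mapping[c] == y)
    return mapping, correct / len(labels)

# Hypothetical toy input: the labels were hidden from the clustering step.
mapping, accuracy = classes_to_clusters(
    [0, 0, 0, 1, 1, 1],
    ["Female", "Female", "Male", "Male", "Male", "Male"],
)
```

The labels are only used after clustering, so the score measures how well the unsupervised clusters happen to line up with the held-out class.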
Changing the distance function between Euclidean and Manhattan did not change the clusters very much. The initialization method also did not have a very big effect on cluster formation. In all cases (farthest first, random, k-means++, or canopy), the resulting 2 clusters correspond to the Male and Female classes with very high accuracy. Changing parameters only changed the percentages by 1-2% for each class.
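For readers unfamiliar with the initialization methods, two of them can be sketched as follows. This is an illustrative sketch, not Weka's code: the function names, the `(points, k, rng)` signature, and the toy points are assumptions, and both functions expect at least `k` distinct points.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_first(points, k, rng):
    # Start from a random point, then repeatedly add the point whose
    # nearest already-chosen center is farthest away.
    centers = [rng.choice(points)]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers

def kmeans_plus_plus(points, k, rng):
    # Start from a random point, then sample each further center with
    # probability proportional to its squared distance to the nearest center.
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers

# Hypothetical toy data: three nearby points and one distant point.
toy = [(0, 0), (1, 0), (0, 1), (10, 10)]
ff_centers = farthest_first(toy, 2, random.Random(0))
pp_centers = kmeans_plus_plus(toy, 2, random.Random(0))
```

Both methods tend to spread the starting centers apart, which is one plausible reason why, on well-separated data, the choice of method has little effect on the final clusters.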
The k-means clustering results for 2 clusters and the parameter deceased sex are given in the table.
| Deceased sex | Cluster 0 | Cluster 1 |
|---|---|---|
| Female | 98% (300) | 2% (7) |
| Male | 1% (4) | 99% (354) |
The first column gives the deceased sex, the second column corresponds to cluster 0, and the last column to cluster 1.
From the table, we see that cluster 0 contains 98% of all females, and cluster 1 contains 99% of all males.
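The percentages in the table can be re-derived from the raw counts given in parentheses. The short sketch below does that arithmetic; the helper names `row_percentages` and `agreement` are made up for illustration, while the counts are the ones reported in the table.

```python
# Counts taken from the table above (rows: deceased sex, columns: clusters).
counts = {
    "Female": {"Cluster 0": 300, "Cluster 1": 7},
    "Male": {"Cluster 0": 4, "Cluster 1": 354},
}

def row_percentages(table):
    """Share of each class that falls into each cluster, in percent."""
    result = {}
    for label, row in table.items():
        total = sum(row.values())
        result[label] = {cluster: 100 * n / total for cluster, n in row.items()}
    return result

def agreement(table, mapping):
    """Fraction of all records whose cluster matches the mapped class."""
    total = sum(sum(row.values()) for row in table.values())
    matched = sum(table[label][cluster] for label, cluster in mapping.items())
    return matched / total

percentages = row_percentages(counts)
overall = agreement(counts, {"Female": "Cluster 0", "Male": "Cluster 1"})
```

With these counts, 300 of 307 females (about 98%) fall into cluster 0 and 354 of 358 males (about 99%) fall into cluster 1, so 654 of the 665 records, roughly 98.3%, agree with the sex-to-cluster mapping.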
To make the model outcome visually more appealing, I defined sex as a label for the data and assigned a class to sex in Weka. At first, I plotted all attributes to see the distribution of each parameter. In the plot, red represents the Male class and blue represents the Female class.
Graphical Representation as a part of Data Homework Help Service
Visualization of all parameters shows that sex distribution is almost equal for every attribute.
After running the clustering algorithm and finding 2 clusters, our tutors exported graphs from the Weka visualization to gain more insight into the resulting clusters.
We plotted the clusters against deceased_sex. The red and blue colors in this graph represent clusters, not the male and female classes: red is for cluster 1 and blue is for cluster 0. From the plot, it is visible that k-means clustering produced clusters that correspond to the classes. Cluster 0 has mostly females and only 4 males (lower right edge of the plot). Cluster 1 has mostly males and only 7 females (upper left edge of the plot).
For K=2, the resulting clusters correspond to sex.
To visualize the clusters, I plotted more graphs: district against deceased sex, rural against deceased sex, and district against age.
As we have seen, females are mostly in cluster 0, and only 7 females are in cluster 1. From the first graph, we see that in the Bilaspur region females appear in both clusters, and that all 7 females from cluster 1 are from the Bilaspur region. For the Kanke region, we can say the same for males: all 4 males from cluster 0 are from Kanke. Additionally, from the second graph, we see that those 4 males are from rural places. For the females in cluster 1, 6 are from rural places and one is from an urban place.
The third graph is very interesting, as there are two regions where the distribution is not random. We see that for Kanke, people older than 55 are from cluster 0, and for Bilaspur, people younger than 50 are from cluster 1. This was probably one of the main reasons why 4 males and 7 females ended up outside the cluster that matches their sex.
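The same kind of check can be done without plots by counting, per district, the records whose cluster does not match their sex. The sketch below shows the idea only: the record layout, the field names (`district`, `sex`, `cluster`), and the four toy records are hypothetical and are not the actual dataset.

```python
from collections import Counter

# Hypothetical records: field names and values are made up for illustration.
records = [
    {"district": "Bilaspur", "sex": "Female", "cluster": 0},
    {"district": "Bilaspur", "sex": "Female", "cluster": 1},
    {"district": "Kanke", "sex": "Male", "cluster": 1},
    {"district": "Kanke", "sex": "Male", "cluster": 0},
]

def misassigned_by(records, cluster_label, group_key):
    """Count records whose cluster's label differs from their sex, per group."""
    counts = Counter()
    for r in records:
        if cluster_label[r["cluster"]] != r["sex"]:
            counts[r[group_key]] += 1
    return counts

# Cluster 0 is treated as the female cluster and cluster 1 as the male cluster.
outliers = misassigned_by(records, {0: "Female", 1: "Male"}, "district")
```

Grouping by another field, such as a rural/urban flag or an age band, would reproduce the other two comparisons in the same way.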
We changed K to 3 to see what the resulting clusters were and why they did not represent mortality.
After running the clustering algorithm, I saw that changing the number of clusters to 3 split the female cluster into 2. Cluster 0 (blue in the plot) now represents the male class and has only 6 females. Cluster 1 (red) and cluster 2 (green) are both formed by females and have only 6 males and 2 males, respectively.
Death year was the parameter that caused the split of the female cluster. As seen from the plot below, the death year split the female cluster into 2007-2009 and 2009-2011 sub-clusters. Cluster 2 (green) is mostly on the right side of the plot and cluster 1 (red) is mostly on the left side.
Increasing the number of clusters to 5 split the deceased_sex clusters even further. The female cluster was split into 3 (clusters 1 red, 2 green, and 4 pink) and the male cluster into 2 (clusters 0 blue and 3 cyan). The clusters created by the k-means model were too complicated to describe as in the previous cases.
From the k-means clustering results, we can conclude that unsupervised learning algorithms provide more insights than the purely statistical tools used in Tableau. The model was unable to predict mortality from the data points. For our dataset, clustering was driven by sex as the most important variable; the death year was also very important and caused the male and female clusters to split into sub-clusters. The quality of our data mining homework help service can be judged on the basis of the overall clustering, which was very helpful for finding more insights and visualizing results.
Data Mining Homework Help
For this data mining homework, we used data visualization tools from Weka software to find patterns in the dataset.
Our data mining homework helpers can help you with such data mining homework where you need to provide a detailed report on some conclusions based on data visualization.
The treatment source is a very important parameter. We can suppose that the government and NGO sectors have better equipment and care systems than the private sector. Most deaths happened when there was no medical attention, and private hospitals and treatment at home were not as effective as government hospitals and NGOs.
People are most likely to die at age 70 and least likely to die at age 20. Males and females have the same average age despite their different age distributions. One interesting finding is that people who live for more than 95 years are mostly females.
The average age in urban areas is higher than in rural areas.
There is more medical attention from the government and NGO sector in urban places, and rural places have much higher death records. Data visualization showed that this is not caused by a larger population living in rural places than in urban ones. There are fundamental problems in rural places, such as low awareness of the necessity of medical treatment, not enough infrastructure for sick people, and not enough skilled staff for treatment at home.
In this data mining homework solution, we used two methods, classification and clustering, to predict mortality from parameters chosen from more than 100 initial attributes. The dataset included 7 attributes: 4 nominal and 3 numeric. Before testing models, some pre-processing steps were done to make the dataset suitable for the J48 model. The J48 model fit the data quite well and achieved its best result (99% accuracy) for predicting mortality, and 62% accuracy for predicting the year.
In the case of clustering, the algorithms gave clusters that did not represent mortality. Visualization of the clusters showed that the main splitting parameter was sex (Male/Female) in the case of two clusters. Increasing the number of clusters split the sex clusters into smaller sub-clusters and did not mix the two types.
To conclude, the supervised learning algorithm J48 was able to predict mortality with very high accuracy. The clustering algorithms also fit the data very well, but the resulting clusters did not represent mortality. Visualizations from Tableau and Weka gave very valuable insights and hints about the dataset's characteristics. We were able to draw interesting conclusions based on the several parameters that were chosen in part A. Data visualization and model fitting showed that the chosen parameters were correlated and produced interesting results.
The findings discussed in this project are very important and give a deep understanding of some of the key factors behind the causes of death. Several parties should be interested in using this dataset and these findings. First of all, there is the Government of India. The findings can be used in many ways, and one of the best approaches would be to increase awareness of the need for treatment in rural places. Another would be helping private companies and hospitals build the necessary infrastructure and train medical staff in the regions where there is a high death rate and low life expectancy. Private companies should also be interested in opening new hospitals and health care centers in rural places, where almost half of the population does not get treatment and there is high demand for professional, high-quality practitioners. NGOs working in this field can contribute by identifying places where it is necessary to increase the population's knowledge, and by monitoring private hospitals that have very high death records. One direction that needs very close attention is reducing the child death rate. One possible step towards solving the problem is giving seminars to parents and raising funds for free or affordable services, at least in the very high-risk areas.