The objective of my project is to extract information from the Naval Postgraduate School's website, which hosts thousands of research projects. The research material is stored in a centralized archive named Calhoun, the Naval Postgraduate School's digital repository for research materials and institutional publications created by the NPS community. Materials in Calhoun are openly accessible to anyone on the web and will be preserved for future generations.
This study focuses on web mining: extracting data from the web and then analysing it for insights, such as which areas and topics were prominent over time, how strongly different authors are associated with one another, and which authors have been most active.
SKILLS: Python (NumPy, pandas, BeautifulSoup, NLTK, seaborn), HTML inspection
My primary tool for this study was Python. I used many of the libraries Python offers, which make it one of the most powerful tools for data exploration and analysis. These included requests, beautifulsoup, matplotlib, wordcloud, pandas, numpy, seaborn, nltk, mlxtend and a few other supporting libraries. The URL used for this study is: https://calhoun.nps.edu/handle/10945/16/discover?rpp=20&etal=0&group_by=none&page=1
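The URL above carries the paging state in its query string (rpp=20 results per page, page=1), so one plausible way to walk the listing is to rewrite the page parameter. The following is a minimal sketch, not the project's actual code: the helper names page_urls and fetch_soup are hypothetical, and the figure of 50 pages assumes the 1000 documents were collected at 20 results per page.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qs, urlencode

# Listing URL taken from the write-up; the page parameter is rewritten below.
BASE_URL = ("https://calhoun.nps.edu/handle/10945/16/discover"
            "?rpp=20&etal=0&group_by=none&page=1")


def page_urls(base_url, n_pages):
    """Return listing URLs for pages 1..n_pages (hypothetical helper)."""
    parts = urlsplit(base_url)
    query = parse_qs(parts.query)  # e.g. {"rpp": ["20"], ..., "page": ["1"]}
    urls = []
    for page in range(1, n_pages + 1):
        query["page"] = [str(page)]
        new_query = urlencode(query, doseq=True)
        urls.append(urlunsplit(parts._replace(query=new_query)))
    return urls


def fetch_soup(url):
    """Download one listing page and parse it (hypothetical helper).

    requests and bs4 are imported lazily so that page_urls works
    without third-party packages installed.
    """
    import requests
    from bs4 import BeautifulSoup

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")


# Assumption: 1000 documents / 20 results per page = 50 listing pages.
urls = page_urls(BASE_URL, 50)
```

Each URL in urls could then be passed to fetch_soup, with the tags and classes that hold titles, authors and descriptions identified by inspecting the page's HTML.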
From the analysis of the extracted word clouds, bigrams and commonly appearing words in the descriptions, it can be concluded that most of the research projects involved designing a purposeful system using computers and complex modelling techniques that could help defence and the military in the US. The projects involved investigating a problem and developing a potential solution while keeping in mind the risks involved and the associated cost. Most projects belonged to the mechanical and systems engineering departments.
Because my dataset comprised only 1000 of the 33000 available documents, the heatmap shows little collaboration between authors: no two authors worked together more than three times, with 'Thomas and Michael' and 'Robert and Joseph' among the pairs that did so, whereas authors like 'Joseph Patrick' and 'Amie Wiborg' collaborated the most. The overall support level for each author was very low, indicating very low collaboration overall.
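To make the support figures concrete, here is a small sketch of how co-authorship counts and support can be computed. It uses only the standard library rather than mlxtend, the author names are placeholder sample data, and the function names are hypothetical. Support is taken in the usual association-rule sense: the fraction of documents in which both authors appear.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample data: one list of authors per document.
documents = [
    ["Author A", "Author B"],
    ["Author A", "Author B", "Author C"],
    ["Author C"],
    ["Author A", "Author D"],
]


def pair_counts(author_lists):
    """Count how many documents each pair of authors shares."""
    counts = Counter()
    for authors in author_lists:
        # Sort and de-duplicate so each pair has one canonical key.
        unique = sorted(set(authors))
        counts.update(combinations(unique, 2))
    return counts


def pair_support(author_lists):
    """Support of a pair = shared documents / total documents."""
    total = len(author_lists)
    return {pair: n / total
            for pair, n in pair_counts(author_lists).items()}


counts = pair_counts(documents)
support = pair_support(documents)
```

Under this definition, a pair that co-authored three times in a sample of 1000 documents has a support of 3 / 1000 = 0.003, which is why the support values come out so low. The pair counts are also the kind of values a co-authorship heatmap would display.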
Link to the project code: https://github.com/Viraj015/Web-Scraping-Articles