Implementation of the K-Means and PageRank algorithms (an extension of the "IR-CosineSimilarity-vs-Freq" repository).

AlexandrosPlessias/IR-Kmeans-PageRank

IR-Kmeans-PageRank

This repository implements the K-Means and PageRank algorithms and extends the "IR-CosineSimilarity-vs-Freq" repository. As a continuation of that repository, the assignment asks you to apply clustering and (hyper)link analysis methods to the research interests of the department's faculty members. Based on these, you group the faculty members and calculate their "importance" in the department, as reflected by their collaborations with colleagues in co-authoring research articles. Finally, your goal is to combine each faculty member's "importance" with their relevance to a query (as implemented in the "IR-CosineSimilarity-vs-Freq" repository) to produce the final ranking for queries, just as search engines do.

  • Question 1. Grouping of faculty members into two groups. In this question you are asked to cluster the faculty members of the department into two areas / subjects / groups that (should) reflect the subject areas of the two older departments (computers and telecommunications) from which our department emerged. To do this, first use the results of the "IR-CosineSimilarity-vs-Freq" repository to represent each faculty member in the Vector Space Model as a collection of weighted terms (with tf-idf weights, as in Question 1 of the "IR-CosineSimilarity-vs-Freq" repository). Then feed the 26 vectors (one per faculty member) into the K-Means algorithm and obtain two groups of faculty members as its output. Your clustering should take as a parameter L, the size of the dictionary (in practice, the number of dimensions) in which the clustering takes place. Note that, for all faculty members to have the same dimensions (in number and in terms), your dictionary should be sorted alphabetically so that the clustering uses the L lexically first terms. After doing the above, your implementation should create the tety.txt and tett.txt files, each of which has as many lines as there are clustering steps. Each line contains the names of the faculty members who belong to the group at that step, each followed by its Euclidean distance from the centre of the group, in the form (faculty name, distance). Experiment with different values of L (including, of course, the full dictionary).

  • Question 2. "Importance" of a faculty member. In this question you rank the faculty members by their importance within the department, as expressed by their collaborations with colleagues in co-authoring research papers. To do this, calculate the PageRank of each faculty member in the co-authorship graph. You will find the relevant information in the author field of the articles. Note that co-authors outside the department are of no interest, and that an article co-authored by two faculty members probably appears twice (once under each faculty member)! Using the information from the files, create the normalized adjacency (connection) matrix of the graph (its size will be 26x26) and calculate the PageRank with damping (you choose the value of the damping factor and justify it in your report). After doing the above, your implementation should create the pagerank-matrix.txt file, which contains the connection matrix, and the pagerank-calc.txt file, which has S lines (one per PageRank computation step; S is a parameter of your implementation). Each line contains the surname of each faculty member followed by their importance, in the form (surname, importance).

  • Question 3. Combining relevance to a query with the "importance" of a faculty member. In this question you are asked to rank the faculty members based on both their relevance to a query posed by the user and their importance in the department. To do this, use your application from Question 2 of the "IR-CosineSimilarity-vs-Freq" repository to find the similarity of each faculty member to the query, and combine it, in any way you deem appropriate, with the importance you calculated in the previous question. The combination should be justified and should result from appropriate testing of different approaches, which are presented and analyzed (whether selected or not) in your report. After doing the above, your implementation should create the resultsPR-"query".txt file (e.g., resultsPR-distributed-systems.txt), which has 26 lines (one per faculty member). Each line contains the surname of the faculty member followed by the score resulting from the similarity to the query and the "importance" of the faculty member, in the form (surname, score = style_of_combination_similarity_and_significance). The file must be sorted in descending order of similarity to the query. Find interesting cases (queries) in which the ranking changes relative to the "IR-CosineSimilarity-vs-Freq" repository.
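The clustering described in Question 1 can be sketched roughly as follows. This is a minimal illustration and not the repository's actual code: the function names, the toy tf-idf data, the member names, and the choice of two members as initial centres are all assumptions made for the example. It truncates the alphabetically sorted dictionary to its first L terms, runs K-Means with two groups, and records the (name, distance) pairs at every step.

```python
import math

def truncate_vectors(tfidf, L):
    """Project each member's sparse tf-idf dict onto the L lexically first terms."""
    vocab = sorted({term for vec in tfidf.values() for term in vec})[:L]
    return {name: [vec.get(t, 0.0) for t in vocab] for name, vec in tfidf.items()}

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans_two_groups(vectors, seeds, max_steps=100):
    """K-Means with k=2. Returns one entry per step: a pair of lists
    [(name, distance_to_centre), ...], one list per group."""
    # Assumption: the initial centres are the vectors of two chosen members.
    centres = [list(vectors[s]) for s in seeds]
    history, previous = [], None
    for _ in range(max_steps):
        groups, assignment = ([], []), {}
        for name in sorted(vectors):
            d = [euclidean(vectors[name], c) for c in centres]
            g = 0 if d[0] <= d[1] else 1
            groups[g].append((name, d[g]))
            assignment[name] = g
        history.append(groups)
        if assignment == previous:  # no member changed group: converged
            break
        previous = assignment
        for g in (0, 1):  # move each centre to the mean of its members
            members = [vectors[n] for n, _ in groups[g]]
            if members:
                centres[g] = [sum(col) / len(members) for col in zip(*members)]
    return history

def format_step(group):
    """One output line in the form (faculty name, distance) ..."""
    return " ".join(f"({name}, {dist:.4f})" for name, dist in group)

# Hypothetical toy data standing in for the 26 faculty vectors.
tfidf = {
    "Alpha": {"compiler": 0.9, "database": 0.4},
    "Beta": {"compiler": 0.8, "database": 0.5},
    "Gamma": {"antenna": 0.9, "wireless": 0.7},
    "Delta": {"antenna": 0.7, "wireless": 0.9},
}
history = kmeans_two_groups(truncate_vectors(tfidf, L=4), seeds=("Alpha", "Gamma"))
for first_group, second_group in history:
    print(format_step(first_group), "|", format_step(second_group))
```

Each element of `history` corresponds to one line of each output file; writing the first group's lines to one file and the second group's lines to the other is left out of the sketch.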
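For Question 2, one possible way to build the normalized connection matrix and iterate PageRank is sketched below. The input format (a list of title and author-list pairs), the function names, the damping factor of 0.85, and the uniform treatment of members without any internal co-author are assumptions for illustration only; the assignment leaves the damping value for you to choose and justify.

```python
def coauthor_matrix(articles, members):
    """Row-normalised co-authorship matrix restricted to department members."""
    index = {m: i for i, m in enumerate(members)}
    n = len(members)
    counts = [[0.0] * n for _ in range(n)]
    seen = set()
    for title, authors in articles:
        key = title.strip().lower()
        if key in seen:  # the same article listed under a second member
            continue
        seen.add(key)
        inside = sorted({a for a in authors if a in index})  # drop outsiders
        for a in inside:
            for b in inside:
                if a != b:
                    counts[index[a]][index[b]] += 1.0
    matrix = []
    for row in counts:
        total = sum(row)
        # Assumption: a member with no internal co-authors links uniformly to all.
        matrix.append([v / total for v in row] if total else [1.0 / n] * n)
    return matrix

def pagerank_steps(matrix, damping=0.85, steps=20):
    """Return the rank vector after each of `steps` damped iterations."""
    n = len(matrix)
    rank = [1.0 / n] * n
    history = []
    for _ in range(steps):
        rank = [
            (1.0 - damping) / n
            + damping * sum(rank[j] * matrix[j][i] for j in range(n))
            for i in range(n)
        ]
        history.append(rank)
    return history

# Hypothetical toy data: four members, one duplicated article, outside co-authors.
members = ["A", "B", "C", "D"]
articles = [
    ("Paper 1", ["A", "B", "External"]),
    ("Paper 1", ["A", "B", "External"]),
    ("Paper 2", ["A", "C"]),
    ("Paper 3", ["D", "External"]),
]
matrix = coauthor_matrix(articles, members)
ranks = pagerank_steps(matrix, damping=0.85, steps=20)
for step in ranks:
    print(" ".join(f"({m}, {r:.4f})" for m, r in zip(members, step)))
```

Each printed line corresponds to one of the S lines of pagerank-calc.txt. Because every row of the matrix sums to 1, the ranks keep summing to 1 at every step.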
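For Question 3, a weighted linear mix is one candidate combination among the several you are expected to test. The sketch below is only an illustration under stated assumptions: the function names, the min-max normalisation of the importance values, the weight `alpha`, and the toy scores are all hypothetical and are not prescribed by the assignment.

```python
def normalise(values):
    """Min-max scale a {name: value} dict into the range [0, 1]."""
    lo, hi = min(values.values()), max(values.values())
    span = hi - lo
    return {k: (v - lo) / span if span else 0.0 for k, v in values.items()}

def combine(similarity, importance, alpha=0.7):
    """score = alpha * similarity + (1 - alpha) * normalised importance.
    Returns (surname, score) pairs sorted by descending score."""
    pr = normalise(importance)
    scores = {
        name: alpha * similarity.get(name, 0.0) + (1.0 - alpha) * pr[name]
        for name in importance
    }
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical cosine similarities to a query and PageRank values.
similarity = {"A": 0.20, "B": 0.50, "C": 0.45}
importance = {"A": 0.46, "B": 0.24, "C": 0.30}

for surname, score in combine(similarity, importance, alpha=0.7):
    print(f"({surname}, {score:.4f})")
```

With `alpha=1.0` the ranking reduces to pure similarity, so comparing the two orderings is a quick way to spot queries where importance changes the result.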
