How Many Topics?

NamyaLG
15 min read · Jan 17, 2021

How Many Topics was one of the open-source projects under the GirlScript Summer of Code Extended Programme.

Topic modeling refers to the task of discovering the underlying thematic structure in a text corpus, where the output is commonly presented as a report of the top terms appearing in each topic. Despite the diversity of topic modeling algorithms that have been proposed, a common challenge in successfully applying these techniques is the selection of an appropriate number of topics for a given corpus.

We intend to work on a research paper where we propose a term-centric stability analysis strategy to address this issue, the idea being that a model with an appropriate number of topics will be more robust to perturbations in the data. The goal is to determine k, the number of topics that are relevant to a given piece of text.

This project follows this pipeline: dataset collection, topic modeling, fine-tuning the parameters for topic modeling, and classification.

My teammates are Khushboo Gupta, Rashi Singh, and Nagasuruthika.

Introduction:

Topic modeling is an unsupervised machine learning technique that scans a set of documents, detects word and phrase patterns within them, and automatically clusters the word groups and similar expressions that best characterize those documents. To explore this in depth, we experimented with topic modeling on the COVID-19 dataset, which can be found here.

This is a word cloud showing the most prominent words in the COVID-19 title dataset.

Description of the dataset

In response to the COVID-19 pandemic, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a resource of over 200,000 scholarly articles, including over 100,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses.

The research was first carried out on the titles of the COVID-19 research papers, and the best-performing configuration was then applied to the abstracts of the same dataset.

Data Pre-Processing:

From the full COVID-19 research paper dataset, topic modeling was done on two parts: the titles and the abstracts of the papers. The following steps were performed to clean the data and prepare it for the topic modeling algorithms; a code sketch follows the list:

  1. Lowercasing, Punctuation, and Stopwords Removal: All text was lowercased, and punctuation was removed with the help of the regex library in Python. Stopword removal was then done using the English stopword list available in Python's gensim package.
  2. Tokenization: Word tokenization was done so that the words could then be stemmed and lemmatized. The RegexpTokenizer available in the nltk package was used in this step.
  3. Stemming: Stemming was done for all the words in the dataset using the Porter stemmer in order to reduce words to their root/base form. The tokenized words from the previous step were the input to the stemmer.
  4. Lemmatization: Lemmatization was done so that, for example, third-person forms are changed to the first person and verbs in the past or future tense are changed to the present tense. The WordNetLemmatizer available in the nltk package was used for this.
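
Putting the four steps together, a minimal sketch of such a preprocessing pipeline might look as follows (the function and variable names here are illustrative, not taken from the project's code):

```python
# Minimal preprocessing sketch; `titles` (a list of raw strings) is assumed.
from gensim.parsing.preprocessing import STOPWORDS
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer, WordNetLemmatizer

tokenizer = RegexpTokenizer(r'\w+')   # keeps word characters, drops punctuation
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()      # requires nltk.download('wordnet') once

def preprocess(text):
    tokens = tokenizer.tokenize(text.lower())            # lowercase + tokenize
    tokens = [t for t in tokens if t not in STOPWORDS]   # stopword removal
    tokens = [stemmer.stem(t) for t in tokens]           # stemming
    return [lemmatizer.lemmatize(t) for t in tokens]     # lemmatization

docs = [preprocess(t) for t in titles]
```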

Observations:

Hyperparameter Tuning:

For the COVID-19 Title dataset

  • Whether to use Bag of Words (BoW) or Term Frequency-Inverse Document Frequency (TF-IDF)?

The BoW model captures the frequencies of word occurrences in a text corpus. It is not concerned with the order in which words appear in the text; it only cares about which words appear. TF-IDF, on the other hand, measures how important a particular word is with respect to a document and to the entire corpus.

In practice, LDA is often implemented with TF-IDF because it has been shown to give somewhat better results than a plain bag of words. In our experiments, however, the difference was not significant: as far as LDA is concerned, both Bag of Words and TF-IDF seem to work fine for our dataset.
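
For reference, a sketch of how both representations can be built with gensim, assuming `docs` holds the preprocessed token lists from the steps above:

```python
from gensim.corpora import Dictionary
from gensim.models import TfidfModel

dictionary = Dictionary(docs)                        # term <-> id mapping
bow_corpus = [dictionary.doc2bow(d) for d in docs]   # raw term counts per document
tfidf_corpus = TfidfModel(bow_corpus)[bow_corpus]    # counts reweighted by TF-IDF
```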

The following are the observations made on the COVID-19 Title dataset using both methods for different values of k:

Method: Using Bag of Words, k=8, passes=2:

Even though some of the topics were repetitive, overall the model gave results satisfactory enough for us to continue.

Method: Using TF-IDF, k=8, passes=2 (keeping the other parameters the same):

In our case, however, even more repetition across topics was observed when the LDA model used TF-IDF. In our view, TF-IDF produced very general topics about COVID-19 that could be common to all the research papers, whereas the Bag of Words model gave a clearer picture of what was going on in each paper, since more specific words surfaced. To check whether we were proceeding in the right direction, we evaluated the performance of both models. Performance evaluation checks whether a model can distinguish different topics using the words in each topic and their corresponding weights.

Performance Evaluation according to the Bag of Words model:

Performance Evaluation according to the TF-IDF model:

Both models performed more or less the same: in one case Bag of Words performed better, whereas in another the TF-IDF model performed much better. So we went with our earlier observation and finally chose the Bag of Words model, as it gave us more detailed and specific topics.

  • Whether to use the Latent Dirichlet Allocation (LDA) algorithm or the Latent Semantic Analysis (LSA) algorithm?

Latent Dirichlet Allocation (LDA) is the most popular topic modeling algorithm and is used to classify the text in a document into particular topics. It builds a topic-per-document model and a words-per-topic model, both modeled as Dirichlet distributions.

Latent Semantic Analysis (LSA) is another topic modeling algorithm; it helps capture the hidden meanings/concepts (topics) behind the words in documents and is mainly a technique from the field of distributional semantics.

While implementing the LSA algorithm on the COVID-19 title dataset, because of limits on computational time and resources, we restricted the number of features to 1000 when creating the document-term matrix. This hyperparameter was also tuned to check whether it affected the topic model in any way; however, changing the maximum number of features produced almost no change in our results. If enough computational power is available, it is advisable to use all the terms when running the algorithm.
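
A minimal sketch of such an LSA pipeline is below; we assume a scikit-learn implementation (TfidfVectorizer plus TruncatedSVD), since the description of capping the document-term matrix at 1000 features matches that API:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

vectorizer = TfidfVectorizer(max_features=1000)   # the 1000-feature cap discussed above
dtm = vectorizer.fit_transform(raw_titles)        # `raw_titles`: list of title strings (assumed)
lsa = TruncatedSVD(n_components=8).fit(dtm)       # k = 8 topics

terms = vectorizer.get_feature_names_out()        # scikit-learn >= 1.0
for i, comp in enumerate(lsa.components_):
    top = comp.argsort()[-10:][::-1]              # ten strongest terms per topic
    print(f"Topic {i}:", [terms[t] for t in top])
```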

The following are the results observed when the LSA algorithm was run for different values of k on the COVID-19 title dataset:

Number of Topics=8

This has been repeated for varying values of k (number of topics)

On implementing the same using LDA (as seen later in this document), our dataset gave better results with LDA than with LSA. LDA produced a greater variety of topics, whereas LSA gave very redundant results across topics. Another reason LSA may not have given satisfactory results for our dataset is that, as its name suggests, Latent Semantic Analysis tries to find the hidden concepts and meanings behind words that are spelled the same but mean different things in different contexts. Most of the words in our dataset are technical terms related to COVID-19 and are rarely used with other meanings (unlike a word such as 'novel', which can mean a book or something original depending on the situation). This may be why LDA performed better than LSA on the COVID-19 dataset. Hence, LDA was finally used for training.

  • Whether to use the implementation of the LDA algorithm available in Python's Gensim package or its Scikit-learn package?

The Latent Dirichlet Allocation algorithm can be implemented using two libraries available in Python: the Gensim package and the Scikit-learn package. Gensim is the more popular choice for LDA; however, to check whether the results varied with the package, we tried implementing LDA using both. A major drawback was that the Scikit-learn implementation was very slow: with our limited computational power, we could not complete even one run of the algorithm, even on a divided dataset. Hence, we decided to continue with the Gensim package, as it was much faster. If computational resources permit, it is worth comparing the results of both implementations.
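
For completeness, a sketch of the scikit-learn counterpart we compared against (the parameter values here are illustrative assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

counts = CountVectorizer().fit_transform(raw_titles)   # document-term count matrix
sk_lda = LatentDirichletAllocation(n_components=8, random_state=42)
sk_lda.fit(counts)   # this fit was prohibitively slow on our hardware
```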

Number of Topics

For the COVID-19 Title dataset

LDA was implemented both on the entire dataset and on the dataset divided randomly into 2 parts (this was done for ease of implementation and to reduce the running time).

A brief analysis was done with regard to coherence, with the number of passes fixed at one.

X-axis: Number of topics, Y-axis: coherence score

In the output generated, the maximum coherence keeps increasing with the number of topics, but the resulting topics are not relevant at all. Considering the local maxima instead, the coherence score for 11 topics works best.

X-axis: passes, Y-axis: Coherence score

The graph above plots coherence against the number of passes with the number of topics fixed at 8; it can be seen that 30–40 passes are most favorable for 8 topics.
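
The k-sweep behind the first graph can be sketched as follows (an assumed setup; `bow_corpus`, `dictionary`, and `docs` come from the earlier preprocessing steps). A similar loop over passes produces the second graph.

```python
from gensim.models import LdaModel, CoherenceModel

scores = []
for k in range(2, 40):
    lda = LdaModel(bow_corpus, num_topics=k, id2word=dictionary, passes=1)
    cm = CoherenceModel(model=lda, texts=docs,
                        dictionary=dictionary, coherence='c_v')
    scores.append((k, cm.get_coherence()))   # plot k vs. score to find the peak
```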

Earlier, the only parameters taken into account were:

```python
lda_model = gensim.models.ldamodel.LdaModel(
    corpus, num_topics=8, id2word=dictionary, alpha=0.16,
    passes=i, chunksize=10000, per_word_topics=True)
```

To delve deeper, the following parameters were considered:

```python
lda_model = gensim.models.ldamodel.LdaModel(
    corpus=corpus, id2word=dictionary, num_topics=8, random_state=100,
    update_every=1, chunksize=1000, passes=10, alpha='auto',
    per_word_topics=True)
```

To date, the best results have been obtained for 8 topics and 40 passes, with a coherence score of 0.3253107958923857.

Coherence score

The C_v coherence measure is based on a sliding window, a one-set segmentation of the top words, and an indirect confirmation measure that uses normalized pointwise mutual information (NPMI) and cosine similarity.

The highest coherence value was obtained at K=8 and K=38.

Since the coherence score seems to keep increasing with the number of topics, it makes better sense to pick the model that gave the highest C_v before the curve flattens out or drops sharply. In this case, we picked K=8.

The coherence value for LSA was 0.34 at k=9.

The title dataset was then divided randomly into two equal halves, both to reduce the runtime and to check whether this would give better results than feeding the entire dataset into the algorithm. As the best results were obtained for 7<=k<=9, the two halves were also checked only in that range, for easier comparison.
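
A random 50/50 split like this is a one-liner if the titles sit in a pandas DataFrame (an assumption on our part):

```python
half1 = df.sample(frac=0.5, random_state=42)   # `df`: DataFrame holding the titles (assumed)
half2 = df.drop(half1.index)                   # the remaining rows
```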

First Half of the divided dataset

Number of Topics = 7, passes = 40

The major observation was that dividing the dataset did help and gave much better results. From the picture above, we can see that new words have come up (Topic 4 and Topic 5) that were not seen earlier.

Visualization for k = 7

This dataset shows excellent results in terms of visualization, as the topics in the Intertopic Distance Map are quite far apart. The best possible improvement to our model would be to change the parameters so that (Topic 1 and Topic 5) and (Topic 2 and Topic 3) no longer overlap. Observing this and the next visualization, we thought that a suitable number of topics might be 5, since the Intertopic Distance Map always divides into roughly 5 major regions.

Number of Topics = 8, passes = 2

Almost the same results can be seen as with k = 7 (and these are also much better than the results obtained using the entire dataset). However, the run with more passes gave deeper insight into the topics. This also shows that a model run for a greater number of passes gives better results (though this cannot always be tested because of computational limits).

Visualization for k = 8

Topic 3 gave really good, unique results that had not come up before. This model is also excellent in terms of visualization, as the topics are quite spread out. As the visualization again shows, some topics cluster together, and there are 5 major categories/areas into which the topics fall. So we decided to check whether k = 5 might be a good value for the number of topics.

Visualization for Number of Topics = 5

From the previous observations, one idea was to try k = 5 in the hope of getting fully separated topics. However, Topic 3 and Topic 5 still overlapped, so this approach was not the right one. The quality of the topics remained the same as with k = 8, and since a higher coherence value was obtained for k = 8, we decided to go with that model.

Visualization for k = 8

This Intertopic Distance distribution is the best one we have seen so far. All the topics are quite far from each other, and again the dataset divides into 5 major sets. Some overlaps can clearly be seen in (Topic 6 and Topic 3), (Topic 1 and Topic 2), and (Topic 4 and Topic 5). One thing we observed is that even with a lot of hyperparameter tuning, at least 2 topics were always overlapping.

Second Half of the divided dataset

The same procedure was then applied to the second half of the dataset. As k = 8 gave the best results for the first half, we checked the second half with the same value, but increased the number of passes to 40 in order to achieve even better results.

Number of Topics = 8, passes = 40, random_state = 42

Very satisfactory results were obtained (notably better than those without splitting the dataset). New words came up in this topic model too, with comparatively less redundancy. One reason new words appear may be that once the dataset is divided, the dominant and most prominent words shift to ones that were less dominant in the full dataset.

The COVID-19 abstract dataset was likewise divided randomly into two halves so that the results of the operations performed would be easier to interpret.

First Half of the divided dataset

For the first half of the dataset, a procedure similar to the one used for the title dataset was followed. As the k value for our dataset (as found from the title dataset) lies in the range 8<=k<=10, these values were used to test whether they worked equally well for the abstracts.

One observation was that values of k<7 gave very unsatisfactory results, as most of the words were redundant and not very relevant to our dataset. As k approached 7, fairly good results started coming up, so we knew the optimum number of topics would be reached soon.

Number of topics=7 and passes=2:
  • For k=7: On increasing the number of passes, more general nouns came up compared to the more specific proper nouns seen at passes = 2.
  • For k=8: Eight topics gave a very good general idea of our dataset, covering almost all aspects of COVID-19.

The topics were not redundant, which shows that k=7 and/or k=8 is a good number of topics for our dataset. While k=7 gave more specific topics with proper nouns, k=8 gave a more general picture and hence should be preferred over k=7.

As we decided to continue with k=8 for the first half of the abstract dataset, the following is the visualization of that model using the pyLDAvis package available in Python.
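
Producing such a visualization takes only a few lines; a sketch with pyLDAvis (in recent releases the gensim bridge lives in pyLDAvis.gensim_models):

```python
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

vis = gensimvis.prepare(lda_model, bow_corpus, dictionary)
pyLDAvis.save_html(vis, 'lda_k8.html')   # open in a browser for the interactive map
```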

Striking Observation

It is usually said that a topic model is better when the intertopic distances between its topics are large. However, one striking observation across all the datasets was that when the coherence value for a particular number of topics was very high, the intertopic distances for the same k were quite low. This created a dilemma: move forward with the better-coherence model or the better-visualization model? In the end, we chose the coherence model, reasoning that because our dataset is focused on COVID-19 alone, the topics are likely to be far more interrelated than in datasets covering a very diverse range of topics.

An overall visualization of how the coherence score changes with the number of topics (k) is shown below. From this graph, we can clearly see that the optimum range of k for our dataset is 8<=k<=10, which is consistent with our earlier results.

Coherence value for k=8: 0.559898006654606 (a very good coherence score)

Second Half of the divided dataset

For the second half of the abstract dataset, we checked which number of topics would suit the dataset best by plotting the coherence value graph and printing the topic terms.

We used the coherence value graph method because, in our previous results, the coherence value peaked at the number of topics that suited our dataset best.

No. of topics = 4, Input size = complete dataset (second half of abstract dataset), passes=100

The words present in the topic model were not very repetitive or redundant.

No. of topics = 6

The words present in the topics were not very repetitive or redundant, except for topic 1, which consisted only of two-letter words that made no sense, e.g. de, la, en (most likely stopwords from non-English abstracts, which the English stopword list does not remove).

No. of topics = 8

The words present in the topics were not very repetitive or redundant, except for topic 6, which consisted only of one- or two-letter words that made no sense.

No. of topics = 4, Input size = 800, passes=100

Words like covid19 and disease were repeated in all the topics. The presence of irrelevant data was negligible.

No. of topics = 11

Only one or two words were repeated, but many irrelevant two- or three-letter words were found.

Coherence value graph for k = 9 (Best result)

Visualization

For the COVID-19 Title dataset

We tried visualizing the results using word clouds, graphs, etc.
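
A sketch of how per-topic word clouds can be generated with the wordcloud and matplotlib packages (an assumed approach, shown here for a four-topic model):

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

for i in range(4):                                    # adjust to the chosen num_topics
    weights = dict(lda_model.show_topic(i, topn=30))  # term -> probability
    wc = WordCloud(background_color='white').generate_from_frequencies(weights)
    plt.subplot(2, 2, i + 1)
    plt.imshow(wc)
    plt.axis('off')
    plt.title(f'Topic {i}')
plt.show()
```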

Values =100, No. of topics = 4, Passes = 100

The words that appeared were distinct and did not repeat.

  • Dominant topic and its percentage contribution

Repetition of words across the various documents can be seen.
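
A sketch of how the dominant topic and its percentage contribution can be read off per document, assuming gensim's get_document_topics:

```python
for i, bow in enumerate(bow_corpus[:5]):          # first few documents
    topics = lda_model.get_document_topics(bow)   # (topic_id, probability) pairs
    dom, prob = max(topics, key=lambda t: t[1])
    print(f'Doc {i}: dominant topic {dom} ({prob:.1%})')
```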

  • Frequency Distribution of Word Counts in Documents

The following graphs show the distribution of word counts across the documents for each topic.

  • Word cloud with respect to the topics:

Some irrelevancy can be seen.

More distinct and better results, though some irrelevancy can still be seen.

Values =100, No. of topics = 6, Passes = 100

  • Change in input size:

With 800 rows taken into consideration, better results were found, but there was still some repetition.

No. of topics = 4, passes= 100

With 800 rows taken into consideration, better results were found. New words were seen.

Repetition was seen, but far less than with the whole dataset.

No. of topics = 11

Conclusion

On the basis of our findings, the following conclusions were made:

  • The best-suited value for the number of topics, k, comes out to be in the range 8<=k<=10 for the COVID-19 Title and Abstract datasets.
  • The Latent Dirichlet Allocation (LDA) algorithm worked better for our dataset than the Latent Semantic Analysis (LSA) algorithm.
  • The coherence value was the main metric used to analyze our topic models (i.e. it was preferred over other factors such as the perplexity score, log-likelihood, the Intertopic Distance Map, etc.).
