List of text mining methods

Text mining is the process of extracting information from unstructured text and finding patterns or relations in it. Which method is used depends on its suitability for the data set at hand. Below is a list of text mining methods.

Centroid-based Clustering: Unsupervised learning method. Each cluster is represented by a central point (centroid), and data points are assigned to the cluster with the nearest centroid.^[1]
- Fast Global K-Means: A variant designed to accelerate Global K-Means.^[2]
- Global K-Means: An incremental algorithm that begins with one cluster and adds clusters one at a time until the required number is reached.^[2]
- K-Means: An algorithm that requires two parameters: K (the number of clusters) and the data set to be clustered.^[2]
- FW-KMeans: Used with the vector space model. Weights features to reduce the influence of noise.^[2]
- Two-Level-KMeans: The regular K-Means algorithm runs first. Clusters that do not reach the threshold are then selected for subdivision into subclusters.^[2]
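The basic K-Means entry above can be illustrated with a minimal sketch of the standard assign-and-update loop. The function name `kmeans`, the `iterations` cap, and the choice to seed centroids with the first K points are assumptions made for illustration; practical implementations usually seed centroids more carefully.

```python
from math import dist

def kmeans(points, k, iterations=100):
    """Plain k-means: takes K and a data set, returns (centroids, labels)."""
    # Assumption: the first k points seed the centroids (deterministic, not optimal).
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = new_centroids
    return centroids, labels

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.2, 0.8), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

In a text mining setting the points would typically be document vectors (for example TF-IDF weights) rather than 2-D coordinates.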
Cluster Algorithm
- Hierarchical Clustering
  - Agglomerative Clustering: Bottom-up approach. Each data point starts as its own cluster, and the closest clusters are merged step by step into larger clusters.^[3]
  - Divisive Clustering: Top-down approach. All data points start in one large cluster, which is split recursively into smaller clusters.^[3]
- Density-based Clustering: Clusters are regions where data points are densely packed, separated by sparser regions.^[4]
  - DBSCAN
- Distribution-based Clustering: Clusters are formed by fitting probability distributions to the data; each point belongs to the distribution most likely to have generated it.^[1]
  - Expectation-maximization algorithm
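As an illustration of the bottom-up (agglomerative) approach listed above, the following is a minimal sketch that repeatedly merges the two closest clusters. Single linkage (distance between the closest pair of members) is an assumption chosen for brevity; the function name `agglomerative` and its `n_clusters` stopping parameter are hypothetical.

```python
from math import dist

def agglomerative(points, n_clusters):
    """Bottom-up clustering with single linkage; returns clusters as lists of point indices."""
    # Start with every point in its own cluster.
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > max(n_clusters, 1):
        # Find the closest pair of clusters and merge them.
        pairs = [(x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))]
        x, y = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
result = agglomerative(points, n_clusters=3)
print(result)
```

This brute-force version recomputes all pairwise linkages on every merge, which is fine for a handful of points but too slow for large collections.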
Collocation
Stemming Algorithm
- Truncating Methods: Remove the suffix or prefix of a word.
  - Lovins Stemmer: Removes the longest suffix from a word.
  - Porter Stemmer: Strips suffixes in a sequence of rule-based steps. The associated Snowball framework allows programmers to define stemmers based on their own criteria.
- Statistical Methods: Rely on a statistical procedure, which typically results in affixes being removed.
  - N-Gram Stemmer: Compares words by the n-grams (sequences of 'n' consecutive characters) they share and groups words that are similar.
  - Hidden Markov Model (HMM) Stemmer: Based on finite-state automata in which transitions between states are governed by probability functions.
  - Yet Another Suffix Stripper (YASS) Stemmer: Hierarchical approach to creating clusters. The resulting clusters are treated as equivalence classes, and their centroids are the stems.
- Inflectional & Derivational Methods
  - Krovetz Stemmer: Changes words to word stems that are valid English words.
  - Xerox Stemmer: Removes prefixes.^[5]
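The n-gram idea from the statistical methods above can be sketched as a similarity measure between two words based on the character n-grams they share. The Dice coefficient over unique digrams is used here as one common choice; the function names `char_ngrams` and `dice_similarity` are hypothetical, and a full stemmer would go on to cluster words by this similarity.

```python
def char_ngrams(word, n=2):
    """Return the set of unique n-grams (n consecutive characters) in a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def dice_similarity(a, b, n=2):
    """Dice coefficient 2C / (A + B) over the unique n-grams of two words."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 0.0  # neither word is long enough to contain an n-gram
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))

# "statistics" has 7 unique digrams, "statistical" has 8, and they share 6.
score = dice_similarity("statistics", "statistical")
print(round(score, 2))  # 0.8
```

Words whose pairwise similarity exceeds a chosen cutoff can then be grouped and treated as sharing a stem.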
Term Frequency
- Term Frequency Inverse Document Frequency
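A minimal sketch of one common TF-IDF formulation follows: term frequency is the term's share of the document's tokens, and inverse document frequency is the logarithm of the number of documents divided by the number of documents containing the term. Several weighting variants exist, so this particular formula, the function name `tf_idf`, and the whitespace tokenization are assumptions for illustration.

```python
from collections import Counter
from math import log

def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]
scores = tf_idf(docs)
# "cat" occurs in only one document, so it outweighs "sat", which occurs in two.
print(scores[0]["cat"] > scores[0]["sat"])
```

A term that occurs in every document receives a weight of zero under this formulation, which is why common function words tend to be suppressed.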
Topic Modeling
- Latent Semantic Analysis (LSA)
- Latent Dirichlet Allocation (LDA)
- Non-Negative Matrix Factorization (NMF)
- Bidirectional Encoder Representations from Transformers (BERT)
Wordscores: First estimates scores for word types based on reference texts. Then applies these word scores to a text that is not a reference text to get a document score. Lastly, the scores of the non-reference documents are rescaled so that they can be compared to the reference texts.^[6]
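The first two Wordscores steps can be sketched as follows, assuming each reference text comes with a known numeric position. Each word is scored as the position-weighted probability of the reference text it came from, and a non-reference document is scored as the mean score of its scored words. The function names, the toy texts, and the positions of -1.0 and 1.0 are hypothetical, and the final rescaling step is deliberately omitted.

```python
from collections import Counter

def word_scores(reference_texts, positions):
    """Score each word by the position-weighted probability of its source reference text."""
    # Relative frequency of each word within each reference text.
    rel_freqs = []
    for tokens in reference_texts:
        counts = Counter(tokens)
        rel_freqs.append({w: c / len(tokens) for w, c in counts.items()})
    scores = {}
    for word in set().union(*rel_freqs):
        total = sum(f.get(word, 0.0) for f in rel_freqs)
        # P(reference text | word), weighted by that text's known position.
        scores[word] = sum(f.get(word, 0.0) / total * a for f, a in zip(rel_freqs, positions))
    return scores

def document_score(tokens, scores):
    """Mean word score over the tokens that also occur in the reference texts."""
    scored = [scores[t] for t in tokens if t in scores]
    return sum(scored) / len(scored) if scored else None

references = ["welfare welfare tax".split(), "market market tax".split()]
scores = word_scores(references, positions=[-1.0, 1.0])
doc = document_score("market tax market welfare unknown".split(), scores)
print(doc)
```

Words that never occur in a reference text ("unknown" here) carry no score and are ignored when scoring a document.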

References

[1] "Different Types of Clustering Algorithm". GeeksforGeeks. 2018-01-15. Retrieved 2024-04-04.
[2] Jalil, Abdennour Mohamed; Hafidi, Imad; Alami, Lamiae; Khouribga, Ensa (2016). "Comparative Study of Clustering Algorithms in Text Mining Context". International Journal of Interactive Multimedia and Artificial Intelligence. 3 (7): 42. doi:10.9781/ijimai.2016.376. ISSN 1989-1660.
[3] "Agglomerative Methods in Machine Learning". GeeksforGeeks. 2021-02-01. Retrieved 2024-04-04.
[4] Hahsler, Michael; et al. "dbscan: Fast Density-based Clustering with R" (PDF). cran.r-project.org. Retrieved 2024-03-04.
[5] Ganesh Jivani, Anjali. "A Comparative Study of Stemming Algorithms" (PDF).
[6] Lowe, Will (2008). "Understanding Wordscores" (PDF). Methods and Data Institute, School of Politics and International Relations, University of Nottingham, Nottingham. doi:10.2139/ssrn.1095280. ISSN 1556-5068.
