Multiple document comparison for textual overlap
multi_doc_compare(texts, n_grams, sd_criterion)
| Argument | Description |
|---|---|
| texts | Character vector of texts; each element of the vector is one text |
| n_grams | Integer specifying the size of the n-gram units |
| sd_criterion | Numeric standard deviation criterion for returning documents that are unusually similar; values of 2 to 3 usually work well |
A list with the following elements:

| Element | Type | Description |
|---|---|---|
| dtm | matrix | Document-term matrix for all texts |
| histogram | | Histogram of the cosine similarity values between every pair of texts |
| similarities | matrix | Cosine similarities between every pair of texts |
| mean_similarity | numeric | Mean similarity across all pairs of texts |
| sd_similarity | numeric | Standard deviation of the similarities |
| check_these | dataframe | Document pairs whose similarity was above the criterion; these are the pairs worth checking by hand |
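
To make the role of `n_grams` and `sd_criterion` concrete, here is a minimal base-R sketch of how such a comparison could work. It is not the package's implementation. The helper names `word_ngrams` and `sketch_compare` are hypothetical, and the sketch assumes word-level n-grams and a cutoff of `mean + sd_criterion * sd`; the real function may tokenize and flag differently. The histogram element is omitted.

```r
# Hypothetical helper (not part of the package): word-level n-grams of one text.
word_ngrams <- function(text, n) {
  words <- strsplit(tolower(text), "[^a-z0-9]+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical sketch of the comparison; assumes every text yields at least
# one n-gram (a text with none would produce NaN similarities).
sketch_compare <- function(texts, n_grams = 2, sd_criterion = 2) {
  grams <- lapply(unname(texts), word_ngrams, n = n_grams)
  vocab <- sort(unique(unlist(grams)))

  # Document-term matrix: one row per text, one column per n-gram.
  dtm <- do.call(rbind, lapply(grams, function(g) {
    as.numeric(table(factor(g, levels = vocab)))
  }))
  colnames(dtm) <- vocab

  # Cosine similarity between every pair of rows.
  norms <- sqrt(rowSums(dtm^2))
  similarities <- tcrossprod(dtm) / outer(norms, norms)

  # Summarise each unordered pair once (upper triangle only).
  pairs <- which(upper.tri(similarities), arr.ind = TRUE)
  values <- similarities[upper.tri(similarities)]
  mean_similarity <- mean(values)
  sd_similarity <- sd(values)

  # Assumed rule: flag pairs more than sd_criterion SDs above the mean.
  flagged <- values > mean_similarity + sd_criterion * sd_similarity
  check_these <- data.frame(
    doc_a = unname(pairs[flagged, "row"]),
    doc_b = unname(pairs[flagged, "col"]),
    similarity = values[flagged],
    row.names = NULL
  )

  list(dtm = dtm, similarities = similarities,
       mean_similarity = mean_similarity, sd_similarity = sd_similarity,
       check_these = check_these)
}

texts <- c(
  "the quick brown fox jumps over the lazy dog",
  "the quick brown fox jumps over the sleepy dog",
  "a stitch in time saves nine",
  "all that glitters is not gold",
  "fortune favors the bold and the brave"
)
result <- sketch_compare(texts, n_grams = 2, sd_criterion = 2)
print(result$check_these)
```

With only a handful of texts the largest possible z-score is small, so a criterion of 2 to 3 only becomes meaningful once there are enough document pairs for the mean and standard deviation to be stable.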