The Excel Ngrams Project
License
MIT License
Copyright (c) 2021 Matt
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Reference
excel_ngrams.console
Command-line interface.
excel_ngrams.grammer
Return dataframe of ngrams from list of words.
class excel_ngrams.grammer.Grammer(terms_list)
Class that returns n-grams from text as a list of strings.
Words are delineated by white space and punctuation. Uses Spacy’s NLP pipe and NLTK’s ngrams function to generate ngrams within a given word length range and outputs them to a Pandas DataFrame for writing to an output file.
term_list
List of text as strings (one or more).
_nlp and _stopwords are shared across all instances, but are loaded by the constructor to avoid loading them in cases where they aren’t needed.
combine_dataframes(dataframes)
Creates a single multi-column dataframe.
Takes the terms and frequency values for dataframes constructed from ngrams of various lengths and combines them into a single dataframe, e.g. single terms and values, bigrams and values, trigrams and values, etc.
- Parameters
dataframes (list) – List of pd.DataFrames containing the dataframes to be merged, side by side.
- Returns
Single combined dataframe from list of dataframes.
- Return type
pd.DataFrame
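As a rough illustration of the side-by-side merge described for combine_dataframes, the sketch below uses pandas' concat along the column axis. The helper name combine_side_by_side and the column labels are hypothetical and chosen for the example only; the project's own implementation may differ.

```python
import pandas as pd


def combine_side_by_side(dataframes):
    """Hypothetical helper: place each DataFrame's columns next to the others."""
    # axis=1 aligns rows on the index, so shorter frames are padded with NaN.
    return pd.concat(dataframes, axis=1)


# Assumed column labels, for illustration only.
unigrams = pd.DataFrame(
    {"1 gram": ["shoes", "red", "cheap"], "1 gram frequency": [5, 3, 2]}
)
bigrams = pd.DataFrame(
    {"2 gram": ["red shoes", "cheap shoes"], "2 gram frequency": [3, 2]}
)

combined = combine_side_by_side([unigrams, bigrams])
print(combined)
```

Because the frames can have different numbers of rows, the shorter columns end in missing values, which is usually acceptable when the result is written straight to a CSV file.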
df_from_terms(ngram_tuples)
Creates a DataFrame from lists of terms and values as a tuple.
Calls terms_to_columns on ngram_tuples to unpack them.
- Parameters
ngram_tuples (list) – list of tuple[tuple[str], int]. Results from get_ngrams.
- Returns
Pandas DataFrame comprising a column of terms and a column of frequency values for those terms.
- Return type
df (pd.DataFrame)
get_ngrams(n, top_n_results=250, stopwords=True)
Creates a list of tuples with terms and frequencies from the terms list.
The list of terms is tokenised using Spacy’s NLP pipe and set to lowercase, and ngrams are calculated with NLTK’s ngrams function.
- Parameters
n (int) – The length of phrases to analyse.
top_n_results (int) – The number of results to return. Default is 250.
stopwords (bool) – flag to indicate removal of stopwords. Default is True.
- Returns
List of tuples containing term(s) and values.
- Return type
list of tuple[tuple[str], int]
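The counting step described for get_ngrams can be sketched with the standard library alone. This is not the project's code: a regular expression stands in for Spacy's tokeniser, a zip-based sliding window stands in for NLTK's ngrams function, and the function name count_ngrams and the small STOPWORDS set are hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword set for the example.
STOPWORDS = {"the", "a", "for", "in", "of"}


def count_ngrams(terms, n, top_n_results=250, stopwords=True):
    """Return the most common n-grams as (tuple_of_words, frequency) pairs."""
    counts = Counter()
    for term in terms:
        # Stand-in tokeniser: lowercase, then keep runs of letters, digits and apostrophes.
        tokens = re.findall(r"[a-z0-9']+", term.lower())
        if stopwords:
            tokens = [t for t in tokens if t not in STOPWORDS]
        # Slide a window of length n over the tokens of this term only.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(top_n_results)


terms = ["Red shoes for women", "cheap red shoes", "red shoes in the sale"]
result = count_ngrams(terms, 2, top_n_results=3)
print(result)
```

Each result pairs a tuple of words with its count, which matches the list of tuple[tuple[str], int] shape that df_from_terms and terms_to_columns expect. Whether the real method filters stopwords before or after building the n-grams is not stated here, so the order used in this sketch is an assumption.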
in_stop_words(spacy_token_text)
Check if word appears in stopword set.
- Parameters
spacy_token_text (str) – The text attribute of the Spacy token being passed to the method.
- Returns
Whether text is present in stopwords.
- Return type
bool
ngram_range(max_n, n=1, top_n_results=250, stopwords=True)
Gets ngram terms and outputs for a range of phrase lengths.
Gets ngrams from single terms as default up to the desired maximum phrase length and creates a Pandas DataFrame from the results.
- Parameters
max_n (int) – The longest phrase length desired in output.
n (int) – The minimum term length. Default is 1 (single term).
top_n_results (int) – The number of rows of results to return. Default is 250.
stopwords (bool) – flag to indicate removal of stopwords. Default is True.
- Returns
Combined dataframe of all results from the various term lengths up to the desired maximum.
- Return type
pd.DataFrame
remove_escaped_chars(text)
Remove newline and tab chars from string list.
- Parameters
text (List of str) – Terms list to be cleaned of specific chars.
- Returns
Terms list without specific chars.
- Return type
without_newlines (List of str)
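A minimal sketch of this kind of clean-up is shown below, assuming the characters are deleted outright rather than replaced with spaces; the documentation does not say which the project does. The function name strip_newlines_and_tabs is hypothetical.

```python
def strip_newlines_and_tabs(text):
    """Return a new list with newline and tab characters removed from each string."""
    # With three arguments, str.maketrans maps every character in the
    # third string to None, so translate() deletes those characters.
    table = str.maketrans("", "", "\n\t")
    return [item.translate(table) for item in text]


cleaned = strip_newlines_and_tabs(["red shoes\n", "\tcheap boots"])
print(cleaned)
```

Cells copied out of spreadsheets often carry stray line breaks or tabs, so a pass like this keeps them from reaching the tokeniser.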
terms_to_columns(ngram_tuples)
Returns term/value tuples as two lists.
- Parameters
ngram_tuples (list) – list of tuple[tuple[str], int]. Results from get_ngrams.
- Returns
term_col (list of str): Terms, concatenated into a single string for multi-word terms, returned as a list.
value_col (list of int): Term frequencies as a list.
The two lists are returned together as a tuple.
- Return type
tuple
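The unpacking described for terms_to_columns can be illustrated as follows. This is a sketch rather than the project's code: the function name split_terms_and_values is hypothetical, and joining multi-word terms with a single space is an assumption.

```python
def split_terms_and_values(ngram_tuples):
    """Split (words, frequency) pairs into a list of terms and a list of frequencies."""
    # Multi-word terms are joined with a space (assumed separator).
    term_col = [" ".join(words) for words, _ in ngram_tuples]
    value_col = [count for _, count in ngram_tuples]
    return term_col, value_col


pairs = [(("red", "shoes"), 3), (("cheap", "shoes"), 2)]
term_col, value_col = split_terms_and_values(pairs)
print(term_col)
print(value_col)
```

The two lists can then be used directly as the term and frequency columns of a DataFrame.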
A project to analyse a column of text in an Excel document and return a CSV file with the most common ngrams from that text. The output file is written to the same directory as the input file. You can choose the maximum n-gram length and the maximum number of results (rows) returned.
Words are tokenised with Spacy and ngrams are generated with NLTK.
Installation
To install the Excel Ngrams Project, run this command in your terminal:
$ pip install excel-ngrams
Usage
Excel Ngrams is invoked as follows:
$ excel-ngrams [OPTIONS]
-f <file-path>, --file-path <file-path>
The path to the input Excel file to be parsed for words to generate ngrams.
-s <sheet-name>, --sheet-name <sheet-name>
The name of the Excel sheet that contains the column of text to be analysed. By default, this is the first sheet in a document where none of the sheets have names. If any sheets are named, you must specify the one that contains the column to be analysed.
-c <column-name>, --column-name <column-name>
The name of the column containing the text to be analysed for ngrams. By default, this is set to ‘Keyword’ (case sensitive).
-m <maximum-ngram-length>, --max-n <maximum-ngram-length>
The maximum length of ngram phrase required. Each length of phrase below this number will also be returned in increments of one. For example, selecting 3 will return single word frequencies, bigrams, and trigrams. By default, this is set to 5.
-t <number-of-results>, --top-results <number-of-results>
The number of rows of results to return. By default, this is 250, or all of the results if there are fewer than 250.
-w <boolean>, --stopwords <boolean>
Remove stopwords from ngram analysis: true or false. By default, this is set to true.
--version
Display the version and exit.
--help
Display a short message and exit.