Problem Statement: Given an arbitrary text document written in English, write a program that generates a concordance, i.e. an alphabetical list of all word occurrences, labeled with word frequencies. As a bonus: label each word with the sentence numbers in which each occurrence appears.
See the problem statement in the original formatted PDF, problem.pdf. It includes sample formatted output that implies additional problem requirements and behavior.
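The core task can be sketched in a few lines. This is a minimal illustration, not either project solution: it splits sentences naively on terminal punctuation, whereas the actual solutions use an NLP library for segmentation.

```python
import re
from collections import defaultdict

def build_concordance(text):
    """Build an alphabetical concordance: word -> (count, sentence numbers).

    Sketch only: sentences are split naively on ., !, and ? here;
    the real solutions use NLP sentence segmentation.
    """
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    entries = defaultdict(list)  # word -> list of 1-based sentence numbers
    for num, sentence in enumerate(sentences, start=1):
        for word in re.findall(r"[A-Za-z']+", sentence):
            entries[word.lower()].append(num)
    # Sort alphabetically and pair each word with its frequency.
    return {w: (len(nums), nums) for w, nums in sorted(entries.items())}

text = "All is well. All will be well."
for word, (count, nums) in build_concordance(text).items():
    print(f"{word} {{{count}:{','.join(map(str, nums))}}}")
# prints:
# all {2:1,2}
# be {1:2}
# is {1:1}
# well {2:1,2}
# will {1:2}
```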
For details about the brute force design approach and solution, see brute/README.md.
For details about the MapReduce design approach and solution, see mr/README.md.
The concordances produced by the two solutions contained extremely similar content but drastically different sentence indexes. This indicated that data normalization was needed and suggested a bug in how the lines and/or sentences were being partitioned.
In both solutions, whitespace is stripped from the original input.
After NLP sentence segmentation, a resulting sentence may contain whitespace anywhere within it, since the NLP library processes lines of data and a sentence may span multiple lines of the original input. Therefore, whitespace is both stripped (from the front and end of each sentence) and replaced with single spaces wherever it appears in the middle of the sentence.
This normalizes the sentence data being processed and produces (nearly) identical concordances for the various inputs.
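The normalization described above amounts to collapsing all whitespace runs. A minimal sketch (the function name is illustrative, not taken from either solution):

```python
import re

def normalize_sentence(sentence):
    """Strip leading/trailing whitespace and collapse internal whitespace
    runs (including newlines from sentences spanning multiple input lines)
    into single spaces."""
    return re.sub(r"\s+", " ", sentence).strip()

# A sentence reassembled across two original input lines:
raw = "  Lorem ipsum dolor\n   sit amet.  "
print(normalize_sentence(raw))  # -> "Lorem ipsum dolor sit amet."
```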
- The brute force solution classified "Quisque" from the 150th sentence as a proper noun, while the MapReduce solution did not.
- The brute force solution classified "INTELLICENCE" from the 1st sentence (i.e. the title line) as a proper noun, while the MapReduce solution did not.
- The MapReduce solution classified "Thinking" from the 220th sentence as a proper noun, while the brute force solution did not.
- The solutions classified several occurrences of "objection" differently.
The concordances generated by the brute force and MapReduce solutions are nearly identical. The differences stem from whether the NLP library classifies a given word occurrence as a proper noun. When classification differs, a word produces separate concordance entries for its lowercase form (treated as a common noun) and its capitalized form (treated as a proper noun). This also shifts the sentence indexes of appearances and, consequently, the word counts.
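The effect of a single tagging difference on the concordance can be sketched as follows. The keying rule below is an assumption about how both solutions choose entry keys, and the function name is hypothetical:

```python
def concordance_key(word, is_proper_noun):
    """Return the concordance entry key for one word occurrence.

    Assumed rule: proper nouns keep their capitalization; all other
    words are lowercased before being counted.
    """
    return word if is_proper_noun else word.lower()

# The same occurrence of "Quisque" lands in a different entry
# depending solely on the tagger's decision:
print(concordance_key("Quisque", is_proper_noun=True))   # -> "Quisque"
print(concordance_key("Quisque", is_proper_noun=False))  # -> "quisque"
```

So one disagreement between taggers splits a word's occurrences across two entries, changing both entries' counts and sentence-index lists.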
I could not determine a definitive cause for the NLP tagging differences. They are likely caused by rounding differences in the underlying model.
Create a virtual environment.

```shell
> python -m venv concordance
> concordance\Scripts\activate
```

(The activation command above is for Windows; on macOS/Linux, use `source concordance/bin/activate` instead.)
Install the project requirements.

```shell
(concordance) > cd <this_project>
(concordance) > pip install -r requirements.txt
```