Problem Statement: Given an arbitrary text document written in English, write a program that generates a concordance, i.e. an alphabetical list of all word occurrences, labeled with word frequencies. As a bonus: label each word with the sentence numbers in which each occurrence appears.
See the problem statement in the original formatted PDF, problem.pdf. It includes sample formatted output that implies additional problem requirements and behavior.
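The core task can be sketched in a few lines. This is a minimal illustration, not either project solution: it splits sentences naively on terminal punctuation, whereas the actual solutions use an NLP library for segmentation.

```python
import re
from collections import defaultdict

def build_concordance(text):
    """Build an alphabetical concordance: word -> (count, sentence numbers).

    Sketch only: sentences are split naively on ., !, and ? here;
    the real solutions use NLP sentence segmentation.
    """
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    entries = defaultdict(list)  # word -> list of 1-based sentence numbers
    for num, sentence in enumerate(sentences, start=1):
        for word in re.findall(r"[A-Za-z']+", sentence):
            entries[word.lower()].append(num)
    # Sort alphabetically and pair each word with its frequency.
    return {w: (len(nums), nums) for w, nums in sorted(entries.items())}

text = "All is well. All will be well."
for word, (count, nums) in build_concordance(text).items():
    print(f"{word} {{{count}:{','.join(map(str, nums))}}}")
# prints:
# all {2:1,2}
# be {1:2}
# is {1:1}
# well {2:1,2}
# will {1:2}
```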
For details about the brute force design approach and solution, see brute/README.md.
For details about the MapReduce design approach and solution, see mr/README.md.
The concordances produced by the two solutions contained extremely similar content but drastically different sentence indexes. This indicated that data normalization was needed and suggested a bug in how the lines and/or sentences were being partitioned.
In both solutions, whitespace is stripped from the original input.
After NLP sentence segmentation, a resulting sentence may contain whitespace anywhere within it, since the NLP library processes lines of data and a sentence may span multiple lines of the original input. Therefore, whitespace is both stripped (from the front and end of each sentence) and replaced with single spaces wherever it appears in the middle of the sentence.
This normalizes the sentence data being processed and produces (nearly) identical concordances for the various inputs.
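The normalization described above amounts to collapsing all whitespace runs. A minimal sketch (the function name is illustrative, not taken from either solution):

```python
import re

def normalize_sentence(sentence):
    """Strip leading/trailing whitespace and collapse internal whitespace
    runs (including newlines from sentences spanning multiple input lines)
    into single spaces."""
    return re.sub(r"\s+", " ", sentence).strip()

# A sentence reassembled across two original input lines:
raw = "  Lorem ipsum dolor\n   sit amet.  "
print(normalize_sentence(raw))  # -> "Lorem ipsum dolor sit amet."
```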
- The brute force solution classified "Quisque" from the 150th sentence as a proper noun, while the MapReduce solution did not.
- The brute force solution classified "INTELLICENCE" from the 1st sentence (i.e. the title line) as a proper noun, while the MapReduce solution did not.
- The MapReduce solution classified "Thinking" from the 220th sentence as a proper noun, while the brute force solution did not.
- The solutions classified several occurrences of "objection" differently.
The concordances generated by the brute force and MapReduce solutions are nearly identical. The differences stem from whether the NLP library classifies a given word occurrence as a proper noun. When classification differs, a word produces separate concordance entries for its lowercase form (treated as a common noun) and its capitalized form (treated as a proper noun). This also shifts the sentence indexes of appearances and, consequently, the word counts.
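The effect of a single tagging difference on the concordance can be sketched as follows. The keying rule below is an assumption about how both solutions choose entry keys, and the function name is hypothetical:

```python
def concordance_key(word, is_proper_noun):
    """Return the concordance entry key for one word occurrence.

    Assumed rule: proper nouns keep their capitalization; all other
    words are lowercased before being counted.
    """
    return word if is_proper_noun else word.lower()

# The same occurrence of "Quisque" lands in a different entry
# depending solely on the tagger's decision:
print(concordance_key("Quisque", is_proper_noun=True))   # -> "Quisque"
print(concordance_key("Quisque", is_proper_noun=False))  # -> "quisque"
```

So one disagreement between taggers splits a word's occurrences across two entries, changing both entries' counts and sentence-index lists.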
I could not determine a definitive cause for the NLP tagging differences. They are likely caused by rounding differences in the underlying model.
Create a virtual environment.

```shell
> python -m venv concordance
> concordance\Scripts\activate
```

(The activation command above is for Windows; on macOS/Linux, use `source concordance/bin/activate` instead.)
Install the project requirements.

```shell
(concordance) > cd <this_project>
(concordance) > pip install -r requirements.txt
```