timheeg/concordance

Concordance

Problem Statement: Given an arbitrary text document written in English, write a program that will generate a concordance, i.e. an alphabetical list of all word occurrences, labeled with word frequencies. As a bonus: label each word with the sentence numbers in which each word occurrence appeared.

See the original formatted problem statement in problem.pdf. It includes sample formatted output, which implies additional requirements and behavior.
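
As a rough illustration of the problem statement, here is a minimal sketch of a concordance builder. It is not either of the solutions in this repository: it assumes a naive regex sentence splitter instead of NLP, lowercases every word, and uses a hypothetical function name, `build_concordance`.

```python
import re
from collections import defaultdict


def build_concordance(text):
    """Map each lowercased word to (frequency, sentence numbers), sorted alphabetically."""
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    occurrences = defaultdict(list)
    for number, sentence in enumerate(sentences, start=1):
        # Assumption: a word is a run of ASCII letters and apostrophes.
        for word in re.findall(r"[A-Za-z']+", sentence):
            occurrences[word.lower()].append(number)
    # One sentence number is recorded per occurrence, so the list length is the frequency.
    return {word: (len(nums), nums) for word, nums in sorted(occurrences.items())}


concordance = build_concordance("The cat sat. The dog sat on the cat!")
for word, (count, numbers) in concordance.items():
    print(word, count, numbers)
```

The print format here is only illustrative; the sample output in problem.pdf defines the expected layout.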

Solutions

Brute Force

For details about that design approach and solution, see brute/README.md.

MapReduce

For details about that design approach and solution, see mr/README.md.

Differences

In submittal Tag

The concordances of the two solutions contained extremely similar content but drastically different sentence indexes. This indicates that data normalization is needed and that there is likely a bug in how the lines and/or sentences are being partitioned.

In both solutions, whitespace is stripped from the original input.

Because NLP sentence identification processes lines of data, a resulting sentence may span multiple lines of the original input and may contain whitespace anywhere within it. Therefore, whitespace is stripped from the front and end of each sentence and replaced with spaces when present in the middle of the sentence.

This normalizes the sentence data being processed and produces (nearly) identical concordances for the various inputs.
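
The normalization described above could be sketched as follows. The function name `normalize_sentence` is hypothetical, and this version also collapses each run of interior whitespace into a single space, which is an assumption rather than something stated for either solution.

```python
def normalize_sentence(sentence: str) -> str:
    """Strip leading/trailing whitespace and replace interior whitespace with spaces."""
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and discards leading and trailing whitespace.
    return " ".join(sentence.split())


print(normalize_sentence("  A sentence that spans\ntwo   lines.\t"))
```

Applying the same function to the sentences in both solutions means they index identical sentence text, regardless of how the original lines were wrapped.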

20k Concordance Differences

The brute force solution classified "Quisque" from the 150th sentence as a Proper Noun, while the MapReduce solution did not.

Turing Concordance Differences

The brute force solution classified "INTELLICENCE" from the 1st sentence (i.e. the title line) as a Proper Noun, while the MapReduce solution did not.

The MapReduce solution classified "Thinking" from the 220th sentence as a Proper Noun, while the brute force solution did not.

The solutions classified several occurrences of "objection" differently.
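
Differences like these can be located by comparing the two concordances entry by entry. The sketch below assumes each concordance is a dict mapping a word to its list of sentence numbers. The name `diff_concordances` and the toy data are hypothetical and are not output from either solution.

```python
def diff_concordances(left, right):
    """Return {word: (left_entry, right_entry)} for every word whose entries differ."""
    differences = {}
    for word in sorted(left.keys() | right.keys()):
        # dict.get returns None when a word is missing from one concordance.
        if left.get(word) != right.get(word):
            differences[word] = (left.get(word), right.get(word))
    return differences


# Toy data (made up for illustration): one solution splits "Word" from "word".
toy_left = {"Word": [5], "word": [2], "same": [1]}
toy_right = {"word": [2, 5], "same": [1]}
toy_diff = diff_concordances(toy_left, toy_right)
print(toy_diff)
```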

Observations

The concordances generated by the brute force and MapReduce solutions are nearly identical. The differences are caused by whether NLP classifies a word as a proper noun or not. When the same word is tagged as a proper noun in one occurrence and as a common noun in another, the concordance contains separate entries for the capitalized (proper noun) and lowercase (common noun) forms. This also changes the sentence indexes recorded for each entry, and therefore the word counts as well.
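
To make the effect concrete, the sketch below shows how a single differing proper-noun tag splits one entry into two. It assumes proper nouns keep their capitalization while all other words are lowercased. The tagged input is hand-written toy data rather than real tagger output, and the names `concordance_key` and `build_entries` are hypothetical.

```python
from collections import defaultdict


def concordance_key(word, is_proper_noun):
    # Assumption: proper nouns keep their case; everything else is lowercased.
    return word if is_proper_noun else word.lower()


def build_entries(tagged_sentences):
    """tagged_sentences: list of sentences, each a list of (word, is_proper_noun) pairs."""
    entries = defaultdict(list)
    for number, sentence in enumerate(tagged_sentences, start=1):
        for word, is_proper in sentence:
            entries[concordance_key(word, is_proper)].append(number)
    return dict(entries)


# Same text, but the second run tags "Thinking" as a proper noun.
run_a = build_entries([[("Thinking", False), ("machines", False)], [("thinking", False)]])
run_b = build_entries([[("Thinking", True), ("machines", False)], [("thinking", False)]])
print(run_a)
print(run_b)
```

In the first run there is one "thinking" entry with two occurrences. In the second run the occurrences are split across "Thinking" and "thinking", so both the sentence indexes and the counts differ.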

I cannot determine a cause for the NLP tagging differences. They may be caused by rounding errors in the model used.

Virtual Environment

Create a virtual environment.

> python -m venv concordance
> concordance\Scripts\activate

Install the project requirements.

(concordance) > cd <this_project>
(concordance) > pip install -r requirements.txt
