Lesson 4 of 6
Precise Text Analysis
Estimated time: 8 minutes
The SAE tests whether you can analyze text with machine-level precision. These questions aren't about literary interpretation — they're about counting exactly, extracting precisely, and following text-manipulation instructions to the letter.
Text analysis questions are deceptively simple. "Count the words in this sentence" sounds easy, but one wrong count means zero points. Precision is everything.
Counting Characters
Character counting requires clarity on what counts as a "character."
Example: How many characters are in the string Hello, World!?
- With spaces and punctuation: 13
- Letters only: 10
- Without spaces: 12
The question will specify which definition to use. If it doesn't, count everything (including spaces and punctuation).
- Identify the boundaries: What exactly is the string? Are quotes included? Are there leading or trailing spaces?
- Clarify the counting rule: All characters? Letters only? Alphanumeric? Without whitespace?
- Count methodically: Go character by character. Don't estimate. For long strings, break them into chunks of 5 or 10 and sum the chunk counts.
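The counting rules above can be checked mechanically. The following is a minimal Python sketch, not part of the SAE itself; the function name `count_characters` and the rule labels are illustrative assumptions. Note that `len()` counts Unicode code points, which matches the intuitive count for plain ASCII strings like this one.

```python
def count_characters(text: str) -> dict[str, int]:
    """Count the characters in `text` under several common rules.

    The rule names are illustrative, not an official SAE vocabulary.
    """
    return {
        # Everything, including spaces and punctuation
        "all": len(text),
        # Everything except whitespace
        "no_whitespace": sum(1 for ch in text if not ch.isspace()),
        # Alphabetic characters only
        "letters_only": sum(1 for ch in text if ch.isalpha()),
        # Letters and digits
        "alphanumeric": sum(1 for ch in text if ch.isalnum()),
    }


counts = count_characters("Hello, World!")
print(counts["all"])            # 13
print(counts["no_whitespace"])  # 12
print(counts["letters_only"])   # 10
```

Computing all rules at once makes it easy to see how far apart the answers are for the same string, which is why the counting rule has to be settled first.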
Counting Words
Word counting has edge cases that trip up even careful counters:
| Text | Word count | Why |
|---|---|---|
| Hello world | 2 | Simple case |
| Hello  world (double space) | 2 | Multiple spaces don't create extra words |
| well-known | 1 or 2 | Hyphenated: depends on the definition |
| don't | 1 | Contractions are typically one word |
| U.S.A. | 1 | Abbreviations with periods are one word |
| (empty string) | 0 | No words in empty text |
When in doubt, use the most common definition: a "word" is a maximal sequence of non-whitespace characters. By this rule, well-known is 1 word and hello world is 2 words.
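The whitespace-based definition maps directly onto Python's `str.split()` with no arguments, which splits on any run of whitespace and discards empty pieces. This is a small sketch under that definition only; the helper name `count_words` is an assumption for illustration.

```python
def count_words(text: str) -> int:
    """Count maximal runs of non-whitespace characters."""
    # split() with no argument collapses consecutive whitespace and
    # ignores leading/trailing whitespace, so no empty "words" appear.
    return len(text.split())


for sample in ["Hello world", "Hello  world", "well-known", "don't", "U.S.A.", ""]:
    print(repr(sample), count_words(sample))
```

Under this rule the hyphenated case counts as 1; a question that treats hyphens as separators would need a different definition.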
Counting Sentences
Sentences end with ., !, or ? — but watch out for:
- Abbreviations: 'Dr. Smith went home.' is 1 sentence, not 2
- Ellipsis: 'Wait... really?' is typically 1 or 2 sentences depending on interpretation
- Quoted speech: 'She said "Hello." Then she left.' is 2 sentences
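Sentence counting has no single correct algorithm, so the following is only a rough heuristic sketch that reproduces the three cases above. The abbreviation list, the `ellipsis_ends_sentence` flag, and the function name are all assumptions made for illustration; real text will contain cases this does not handle.

```python
# Illustrative, deliberately incomplete abbreviation list (an assumption).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "prof.", "etc.", "e.g.", "i.e."}


def count_sentences(text: str, ellipsis_ends_sentence: bool = False) -> int:
    """Heuristically count sentences by looking at how each token ends."""
    count = 0
    for token in text.split():
        # Ignore closing quotes or brackets that follow the terminator.
        stripped = token.rstrip("\"')\u201d\u2019")
        if not stripped.endswith((".", "!", "?")):
            continue
        if stripped.lower() in ABBREVIATIONS:
            continue  # "Dr." does not end a sentence
        if stripped.endswith("...") and not ellipsis_ends_sentence:
            continue  # treat an ellipsis as a pause, not a boundary
        count += 1
    return count


print(count_sentences("Dr. Smith went home."))              # 1
print(count_sentences('She said "Hello." Then she left.'))  # 2
print(count_sentences("Wait... really?"))                   # 1
```

Passing `ellipsis_ends_sentence=True` gives 2 for the ellipsis example, which mirrors the ambiguity noted in the list.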
Extracting the Nth Element
A common SAE pattern: "Respond with only the 3rd word of the following sentence."
Example: "The quick brown fox jumps over the lazy dog"
- 1st word: The
- 2nd word: quick
- 3rd word: brown
This tests whether you can count accurately and return only what was asked.
Common mistakes:
- Off-by-one: returning the 2nd or 4th word instead of the 3rd
- Including extra text: "The 3rd word is brown" instead of just brown
- Zero-indexing: treating "the 1st word" as index 0
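The off-by-one and zero-indexing mistakes both come from mixing up ordinal positions (1st, 2nd, 3rd) with list indices (0, 1, 2). A minimal sketch that keeps the conversion in one place follows; `nth_word` is a hypothetical helper, and it assumes words are whitespace-separated with any attached punctuation left in place.

```python
def nth_word(text: str, n: int) -> str:
    """Return the n-th word of `text`, where n is 1-based."""
    words = text.split()
    if not 1 <= n <= len(words):
        raise ValueError(f"n must be between 1 and {len(words)}, got {n}")
    # The only place the 1-based ordinal becomes a 0-based index.
    return words[n - 1]


sentence = "The quick brown fox jumps over the lazy dog"
print(nth_word(sentence, 3))  # brown
```

The function returns only the word itself, matching the "respond with only" requirement.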
Checkpoint 1
Given the sentence: 'AI agents can learn, adapt, and improve over time.' — how many words does it contain? Respond with only the number.
Pattern Matching in Text
Some SAE questions ask you to find patterns:
- "How many times does the letter 'e' appear in this paragraph?"
- "Which word appears most frequently?"
- "Find all email addresses in this text"
For letter/word frequency, the only reliable approach is systematic counting. Don't estimate.
Technique for letter counting:
- Go through the text word by word
- Count occurrences of the target letter in each word
- Sum the counts
- Double-check by scanning the text once more
Technique for word frequency:
- List each unique word
- Tally occurrences
- Compare tallies
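Both techniques above can be sketched in a few lines of Python. The function names are hypothetical, and two choices here are assumptions the question may override: matching is case-insensitive by default, and surrounding punctuation is stripped before words are tallied.

```python
from collections import Counter


def count_letter(text: str, letter: str, case_sensitive: bool = False) -> int:
    """Count a letter word by word, then sum the per-word counts."""
    if not case_sensitive:
        text, letter = text.lower(), letter.lower()
    per_word = [word.count(letter) for word in text.split()]
    return sum(per_word)


def most_frequent_word(text: str) -> tuple[str, int]:
    """Tally each unique word and return the most common one with its count."""
    # Assumption: strip common punctuation and ignore case before tallying.
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    tallies = Counter(w for w in words if w)
    if not tallies:
        raise ValueError("text contains no words")
    return tallies.most_common(1)[0]


sentence = "The quick brown fox jumps over the lazy dog"
print(count_letter(sentence, "e"))   # 3
print(most_frequent_word(sentence))  # ('the', 2)
```

With `case_sensitive=True`, "The" and "the" would no longer match a lowercase "t", so the rule should be fixed before counting.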
Following Precise Instructions
The SAE loves instructions that are simple but must be followed exactly:
- "Reverse the following string": hello becomes olleh
- "Convert to uppercase": hello world becomes HELLO WORLD
- "Remove all vowels": hello world becomes hll wrld
- "Replace every space with a hyphen": hello world becomes hello-world
Each of these has a single correct answer. Partial compliance scores zero.
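Each of these instructions corresponds to a one-line string operation. The sketch below is illustrative only: the helper names are assumptions, vowels are taken to be a, e, i, o, u in either case, and slicing with `[::-1]` reverses by Unicode code point, which is sufficient for plain ASCII input.

```python
def reverse(text: str) -> str:
    """Reverse the string (by code point)."""
    return text[::-1]


def remove_vowels(text: str) -> str:
    """Drop a, e, i, o, u in either case; keep everything else."""
    return "".join(ch for ch in text if ch.lower() not in "aeiou")


def spaces_to_hyphens(text: str) -> str:
    """Replace every space character with a hyphen."""
    return text.replace(" ", "-")


print(reverse("hello"))                  # olleh
print("hello world".upper())             # HELLO WORLD
print(remove_vowels("hello world"))      # hll wrld
print(spaces_to_hyphens("hello world"))  # hello-world
```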
Multi-Language Text Handling
Some SAE questions involve non-ASCII text:
- Accented characters: cafe has 4 characters; café also has 4 characters (the accent doesn't add a character in most counting schemes, though Unicode normalization can affect this)
- CJK characters: Each Chinese/Japanese/Korean character is typically one character and one "word"
- Emoji: Most emoji are 1 character in user-visible terms, even if they're multiple Unicode code points
Unless the question specifies Unicode code points or bytes, count "user-visible characters" (grapheme clusters). The SAE typically uses this definition.
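The difference between user-visible characters, code points, and bytes is easy to see in Python, where `len()` on a string counts code points. This sketch uses only the standard library's `unicodedata` module; the variable names are illustrative. The standard library does not segment grapheme clusters, so counting them for arbitrary text would need a third-party package.

```python
import unicodedata

# "café" written as "e" followed by a combining acute accent (U+0301).
decomposed = "cafe\u0301"
composed = unicodedata.normalize("NFC", decomposed)  # single code point é

print(len(composed))                  # 4 code points
print(len(decomposed))                # 5 code points
print(len(composed.encode("utf-8")))  # 5 bytes

# Thumbs-up emoji plus a skin-tone modifier: one visible symbol.
thumbs_up = "\U0001F44D\U0001F3FD"
print(len(thumbs_up))                 # 2 code points
```

Both spellings of café display as four user-visible characters, which is the count to report unless the question asks for code points or bytes.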
When Precision Trumps Explanation
On the SAE, text analysis questions almost always want just the answer:
| Question type | Expected answer format |
|---|---|
| "How many words..." | A number: 9 |
| "What is the 5th word..." | The word: jumps |
| "Reverse this string..." | The reversed string: olleh |
| "How many times does X appear..." | A number: 3 |
Never explain your counting process in the answer. Count carefully in your reasoning, but output only the result.
Checkpoint 2
Given the text: 'To be, or not to be, that is the question.' — respond with only the 7th word.
Key Takeaways
- Count, don't estimate — go character by character, word by word
- Clarify the counting rule — "character" can mean different things depending on context
- Watch for off-by-one — the 1st element is at position 1, not 0
- Return only what's asked — no explanations, no extra text
- Double-check — count once, verify once, then submit