CPP18 - String Parsing

String Parsing Michael Heron


This is an introductory lecture on C++, suitable for first-year computing students or those doing a conversion master's degree at postgraduate level.

Transcript of CPP18 - String Parsing

Page 1: CPP18 - String Parsing

String Parsing
Michael Heron

Page 2: CPP18 - String Parsing

Introduction
• Having got a string in your system, how do you manipulate it?
  • Strings are fundamental forms of data representation.
  • Often obtained from text files and user input.
• Most strings are not in an easily managed form.
  • The process of parsing is used to render raw data into more refined forms.

Page 3: CPP18 - String Parsing

Parsing
• There are many reasons why we may wish to parse data.
  • Information comes in as a string; we want it in an array.
  • Information comes in as lists of numbers stored as strings; we want them in objects.
• We are rarely lucky enough to be able to manipulate data the moment it comes into the system.

Page 4: CPP18 - String Parsing

Data Representation
• The single most important thing in designing a program is to represent your data correctly.
  • If you get this right, everything is easier as a result.
  • If you get it wrong, everything is more difficult.
• Before you ever write a line of code, consider how data must be represented in the system.
  • What variables, objects and arrays are you going to use?

Page 5: CPP18 - String Parsing

Data Representation
• Consider how you are going to need to manipulate the data in the system.
  • Are you going to need to be able to search through things?
  • Are you going to need to process each value in turn?
  • Are you going to need to represent relationships between things?
• An easily manipulated data structure is worth its weight in gold.

Page 6: CPP18 - String Parsing

Parsing
• Parsing is the process of turning difficult-to-manipulate data into a more useful format.
  • Break strings up into all their constituent parts.
  • Convert from multiple arrays into an array of objects.
• Important first step before more complex processing.
  • Various standard techniques exist to facilitate this.

Page 7: CPP18 - String Parsing

Common Parsing Tasks
• Tokenization
  • Turn a string into several smaller strings (tokens) by splitting it on a delimiter
• Object processing
  • Breaking multiple data fields out of a single string and configuring an object
• Data conversion
  • Bringing data elements into some common format
• Often necessary to combine different processes.

Page 8: CPP18 - String Parsing

Tokenization
• Tokenization is the process of splitting up strings.
  • Based on the idea of a delimiter.
• Strings that have a common, delimited structure are amenable to tokenization.
  • 10,20,30,40
  • Jim,Jake,Jane,Johana
• Strings are broken up based on the delimiter and the result is an array of strings.

Page 9: CPP18 - String Parsing

Object Processing
• Object processing involves the creation of a ‘blank’ object and setting its attributes as a result of input.
  • Often done after tokenization of input.
• The end result is an object configured as desired.
  • One way to handle persisting objects in files.
• May be repeated.
  • Create an array of appropriately configured objects.

Page 10: CPP18 - String Parsing

Data Conversion
• As part of parsing, we can take the time to convert data into more appropriate representations.
  • After pulling numbers in from a file, they’re usually stored as strings.
  • Can use various conversion functions to clean up representation.
    • atoi, as an example
• Can convert from rough representations to more precise representations.
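
A minimal sketch of the conversion step, assuming each number arrives as a std::string token. It uses std::stoi (C++11) rather than the atoi named on the slide, because atoi returns 0 for unparseable text, which cannot be told apart from a genuine "0". The helper name toIntOr is hypothetical and not part of the lecture.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: convert text to an int, falling back to a default
// when the text does not start with a number or the value is out of range.
int toIntOr(const std::string& text, int fallback) {
    try {
        return std::stoi(text);      // throws on bad or out-of-range input
    } catch (const std::invalid_argument&) {
        return fallback;             // e.g. "abc"
    } catch (const std::out_of_range&) {
        return fallback;             // e.g. "99999999999999999999"
    }
}
```

With this sketch, toIntOr("42", -1) returns 42 and toIntOr("abc", -1) returns -1.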

Page 11: CPP18 - String Parsing

Example
• Consider the following example scenario: calculate the Flesch Readability index of a document.
• Need to determine:
  • Number of sentences
  • Number of words
  • Number of syllables in words
• Read in as a string from a text file.
  • Must be parsed.
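
The slide does not state the formula itself. As a sketch, assuming the standard Flesch Reading Ease formula is the one intended, the three counts combine as shown here. The function name and the choice to return 0.0 for empty input are illustrative assumptions.

```cpp
// Flesch Reading Ease from the three counts listed on the slide.
// Assumes the standard published formula; the slide does not give it.
double fleschReadingEase(int sentences, int words, int syllables) {
    if (sentences <= 0 || words <= 0) {
        return 0.0;   // arbitrary choice for empty input; avoids dividing by zero
    }
    double wordsPerSentence = static_cast<double>(words) / sentences;
    double syllablesPerWord = static_cast<double>(syllables) / words;
    return 206.835 - 1.015 * wordsPerSentence - 84.6 * syllablesPerWord;
}
```

Higher scores indicate easier text. The arithmetic is simple; obtaining the three counts from raw text is where the parsing work lies.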

Page 12: CPP18 - String Parsing

The Hard Way
• Can manipulate a string directly.
  • Count spaces in a string.
    • That gives word count, roughly
  • Count full stops in a string
    • That gives the number of sentences
  • Syllable count?
    • Uh…
    • Horrors upon horrors
• Must parse to get a structure amenable to processing.
  • An array of strings.
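
The direct approach can be sketched as follows, mainly to show where it breaks down. The function names are hypothetical, and the comments note the cases each count gets wrong.

```cpp
#include <string>

// Count how many times one character appears in a string.
int countChar(const std::string& text, char target) {
    int count = 0;
    for (std::string::size_type i = 0; i < text.length(); i++) {
        if (text[i] == target) {
            count += 1;
        }
    }
    return count;
}

// Rough word count: one more than the number of spaces.
// Over-counts when there are double, leading or trailing spaces.
int roughWordCount(const std::string& text) {
    if (text.empty()) {
        return 0;
    }
    return countChar(text, ' ') + 1;
}

// Rough sentence count: the number of full stops.
// Misses '?' and '!', and is fooled by text such as "3.14" or "e.g.".
int roughSentenceCount(const std::string& text) {
    return countChar(text, '.');
}
```

For "The cat sat. The dog ran." this gives 6 words and 2 sentences, but "a  b" (two spaces) is reported as 3 words. That weakness is one reason to parse into an array of strings first.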

Page 13: CPP18 - String Parsing

String Processing
• Strings contain many useful functions for handling such parsing.
  • The find function gives the position of the first occurrence of a character or substring.

#include <iostream>
#include <string>
using namespace std;

int main() {
    string str = "Hello World";
    // find returns string::npos when there is no match
    string::size_type index = str.find("e", 0);
    if (index != string::npos) {
        cout << "Found at: " << index << endl;
    }
}

Page 14: CPP18 - String Parsing

String Processing
• Can use the substr function of a string to extract a substring from a full string:

#include <iostream>
#include <string>
using namespace std;

int main() {
    string str = "Hello World";
    // substr(start, count): 5 characters starting at index 0
    string sub = str.substr(0, 5);

    cout << "Substring is: " << sub << endl;
}
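
find and substr are most useful together: find locates a delimiter and substr cuts out the text on either side of it. A small sketch, with hypothetical function names:

```cpp
#include <string>

// Return the text before the first space, or the whole string if
// there is no space at all.
std::string firstWord(const std::string& text) {
    std::string::size_type pos = text.find(' ');
    if (pos == std::string::npos) {
        return text;              // find signals "not found" with npos
    }
    return text.substr(0, pos);   // pos characters starting at index 0
}

// Return the text after the first space, or an empty string if
// there is no space.
std::string restAfterFirstWord(const std::string& text) {
    std::string::size_type pos = text.find(' ');
    if (pos == std::string::npos) {
        return "";
    }
    return text.substr(pos + 1);  // from pos + 1 to the end of the string
}
```

For "Hello World", firstWord returns "Hello" and restAfterFirstWord returns "World".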

Page 15: CPP18 - String Parsing

Working With Strings
• Strings also contain a very useful length function.
  • This tells you how many characters they contain.
• Also possible to index a string just like an array.
  • This lets you get individual characters out of a string.
• Can combine these into powerful functions.
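
As one illustration of combining length and indexing, the sketch below estimates syllables by counting groups of consecutive vowels in a word. This is a crude heuristic chosen for illustration, not a real syllable counter and not something the lecture prescribes. The function names are hypothetical.

```cpp
#include <string>

// True for the letters this sketch treats as vowels (including 'y').
bool isVowel(char c) {
    switch (c) {
        case 'a': case 'e': case 'i': case 'o': case 'u': case 'y':
        case 'A': case 'E': case 'I': case 'O': case 'U': case 'Y':
            return true;
        default:
            return false;
    }
}

// Very rough syllable estimate: count groups of consecutive vowels.
// Known to be wrong for silent letters, e.g. "make" scores 2.
int roughSyllables(const std::string& word) {
    int groups = 0;
    bool inGroup = false;   // are we currently inside a run of vowels?
    for (std::string::size_type i = 0; i < word.length(); i++) {
        if (isVowel(word[i])) {
            if (!inGroup) {
                groups += 1;   // a new run of vowels starts here
            }
            inGroup = true;
        } else {
            inGroup = false;
        }
    }
    return groups;
}
```

roughSyllables("parsing") returns 2 and roughSyllables("string") returns 1.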

Page 16: CPP18 - String Parsing

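Tokenization

#include <iostream>
#include <string>
using namespace std;

int main() {
    string arr[100];             // the tokens found so far
    int size = 0;                // how many tokens are in arr
    string str = "Snausages are snausages for snausages";
    string::size_type start = 0; // index where the current token begins

    // Let i run one past the last character so that the end of the
    // string closes the final token in the same way a space does.
    for (string::size_type i = 0; i <= str.length() && size < 100; i++) {
        if (i < str.length() && str[i] != ' ') {
            continue;
        }
        arr[size] = str.substr(start, i - start);
        start = i + 1;
        size += 1;
    }
}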

Page 17: CPP18 - String Parsing

Tokenization
• There are other ways to tokenize.
  • This is just one way to show the power of string manipulation.
• Serves as a basis for more complex data parsing.
  • Important to be able to do this: all data representation in a program comes down to parsing at some point or another.
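
One of the other ways, sketched here as an assumption about what a student might reach for next, is to let the standard library do the scanning: std::getline can read from a std::istringstream up to a chosen delimiter. The function name split is hypothetical, and std::vector replaces the fixed-size array used on the earlier slide.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split text on a single delimiter character.
std::vector<std::string> split(const std::string& text, char delimiter) {
    std::vector<std::string> tokens;
    std::istringstream stream(text);
    std::string token;
    // getline reads up to (and discards) the next delimiter each time.
    while (std::getline(stream, token, delimiter)) {
        tokens.push_back(token);
    }
    return tokens;
}
```

split("10,20,30,40", ',') yields the four strings "10", "20", "30" and "40". Note that a trailing delimiter does not produce a final empty token with this approach.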

Page 18: CPP18 - String Parsing

Object Representation
• Can combine tokenization with object representation.
  • Tokenize individual elements.
  • Convert them to appropriate data format.
  • Use accessor (setter) methods on an object to configure it.
• Can easily set up large numbers of objects with this kind of system.
  • Combine the objects in an array for the best of both worlds.
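
The three steps can be sketched together as follows. The Student class, the "name,age" record format, and the function names are all hypothetical, made up for illustration rather than taken from the lecture.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical class for illustration only.
class Student {
public:
    void setName(const std::string& n) { name = n; }
    void setAge(int a) { age = a; }
    const std::string& getName() const { return name; }
    int getAge() const { return age; }
private:
    std::string name;
    int age = 0;
};

// Turn one "name,age" record into a configured Student.
// Returns false if the comma is missing or the age is not a number.
bool parseStudent(const std::string& line, Student& out) {
    std::istringstream stream(line);
    std::string name, ageText;
    // Tokenize: everything up to the comma, then the rest of the line.
    if (!std::getline(stream, name, ',') || !std::getline(stream, ageText)) {
        return false;
    }
    try {
        int age = std::stoi(ageText);   // data conversion step
        out.setName(name);              // configure through the setters
        out.setAge(age);
    } catch (const std::exception&) {
        return false;                   // age text was not a usable number
    }
    return true;
}

// Build a collection of objects from many records, skipping bad ones.
std::vector<Student> parseStudents(const std::vector<std::string>& lines) {
    std::vector<Student> students;
    for (const std::string& line : lines) {
        Student s;                      // the 'blank' object
        if (parseStudent(line, s)) {
            students.push_back(s);
        }
    }
    return students;
}
```

Given the made-up records "Jim,20", "Jake,abc", "Jane,22" and "Johana", parseStudents returns two objects, for Jim and Jane. The other two records are rejected because the age is not numeric or the comma is missing.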

Page 19: CPP18 - String Parsing

This Is The End…
• With that, we come to the end of the scheduled content for C++.
  • Cheer / Cry as you feel is appropriate.
• Next week, we’ll use the time as consolidation time.
  • Thursday lecture will be a formal revision lecture covering all the topics we have previously met.
  • Wednesday lecture/tutorial will be a drop-in revision session. No planned content; come along with whatever questions you have.

Page 20: CPP18 - String Parsing

Some Final Thoughts
• Programming is hard.
  • I did warn you at the start!
• It’s also a very rare and valuable skill
  • Which you are moving towards properly building.
• It is a skill that requires training.
  • Like playing a musical instrument or fighting off ninjas.
• Important not to let it slide.

Page 21: CPP18 - String Parsing

Some Final Thoughts
• It’s worthwhile keeping a notebook of ‘things I wish I had software to do’.
  • It can serve as a basis for further exploration of programming.
• Don’t worry if you don’t know how to do the things.
  • Research is a constant part of programming. Nobody knows how to do everything.
• Stretching yourself by setting tasks you don’t know how to do is a great way to learn.
  • Even if you never complete it, the process is valuable.

Page 22: CPP18 - String Parsing

Summary
• Parsing is an important part of software development.
  • It helps you turn unstructured data into structured data.
• Comes in many forms.
  • String parsing is the most immediately useful of these.
• Tokenization is a key parsing technique.
  • Worth playing about with.