Week 10 - File Handling & Parsing
This week is about file handling and parsing, which is covered in Chapter 8 of the textbook (Here's a link to chapter 8 for easier access).
The core components of the Von Neumann architecture are the CPU and memory, but in practice we also need a component that acts as a storage unit for our data. Memory is limited in size and, more importantly, it is not persistent: anything held only in memory is lost during a power outage.
Storage units (such as HDDs, SSDs, even small USB memory sticks) are persistent, in that they will keep your data without power (Not forever -- they may be overwritten using magnets, or they may just have data failures with consistent use, for example). They are also relatively cheap in terms of storage space. As a tradeoff, they are slower to read and write data.
Such storage systems (disks) have some kind of structured system for storing your data, called filesystems. Your data is stored in files, which are unstructured sequential bytes in the disk, and every language lets you deal with files.
Python lets you manage files (for reading and writing) using the open function, which gives you a file object.
The basic syntax for handling a new file is
var = open(filename, flag)
where var is the assigned file object, filename is the name of the file, and flag is either r for read, w for write, or a for append (there are other options too, but these are generally enough).
Let's write some data to a new file that we'll call new_file.txt. To do this, we'll have to open a file in write mode (with w):
f = open('new_file.txt', 'w')
To write a string to a file object, just call .write():
f.write('hello, world')
12
When you're finished writing, you should close the file to make sure that your data is actually written to it:
f.close()
This will also free resources allocated for that file.
Even if you forget to close a file, the operating system will probably reclaim the resources allocated for that file when your program terminates. Nevertheless, it's a good practice to remember to close files when you're done.
Now, let's read the data from the file we just wrote. We'll need to use the read flag (r) with open this time:
f = open('new_file.txt', 'r')
f.read()
'hello, world'
f.close()
Aside: it's now common Python practice to use the special with keyword to create a scoped (remember function scopes?) file object that is automatically closed when we leave its scope. That way, you don't need to call close after you're done with a file.
# You don't have to know this way of dealing with files, but it's useful
with open('hi.txt', 'w') as f2: # kinda like f2 = open('hi.txt', 'w')
    f2.write('bye!')
# it's like f2.close() is called here
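If you want to convince yourself that with really closes the file, here is a small sketch that checks the file object's closed attribute inside and outside the block. The temporary folder and the names inside and outside are just assumptions for this illustration, so that the example doesn't touch your own files.

```python
import os
import tempfile

# Hypothetical scratch location, only used for this illustration
path = os.path.join(tempfile.mkdtemp(), 'hi.txt')

with open(path, 'w') as f2:
    f2.write('bye!')
    inside = f2.closed    # False: the file is still open inside the block

outside = f2.closed       # True: leaving the block closed the file

print(inside, outside)
```

The name f2 still exists after the block, but the file behind it is closed, so calling f2.write() at that point would raise an error.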
When you're dealing with huge files, sometimes you need to read the file in parts: you would read some amount of bytes from the file, process it, and read the next part from the file until you're finished.
f = open('new_file.txt', 'r')
We can simulate reading a "part" of a file by passing a number to read, which reads up to that many characters from a file opened in text mode (or that many bytes in binary mode):
f.read(5) # Read 5 characters
'hello'
f.read(5) # Read (up to) 5 characters, starting from where we left off
', wor'
f.close()
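Putting the pieces together, a full "read in parts" loop might look like the sketch below. It relies on read returning an empty string once nothing is left. The chunk size of 5 and the chunks list are arbitrary choices for the example, and the file is recreated in a temporary folder so the snippet runs on its own.

```python
import os
import tempfile

# Recreate the small file in a scratch folder (assumption for this sketch)
path = os.path.join(tempfile.mkdtemp(), 'new_file.txt')
with open(path, 'w') as f:
    f.write('hello, world')

chunks = []
with open(path, 'r') as f:
    while True:
        chunk = f.read(5)       # up to 5 characters per call
        if chunk == '':         # empty string: nothing left to read
            break
        chunks.append(chunk)    # "process" the chunk; here we just collect it

print(chunks)  # ['hello', ', wor', 'ld']
```

Notice that the last chunk is shorter than 5 characters: read gives you whatever is left.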
Let's look at what happens when we write multiple lines into a file:
f3 = open('spaces.txt', 'w')
f3.write('''hello
world
123''')
f3.close()
f3 = open('spaces.txt', 'r')
data = f3.read()
data
'hello\nworld\n123'
As we see, the lines we see visually in the cell above are separated by a special newline character (\n) in strings and files.
print knows that it should go to a new line when it sees \n, so we don't see it here:
print(data)
hello
world
123
f3.close()
We occasionally need to add some new data to a file, such as a new line.
When we open a file for writing with w, any data the file already contained is deleted.
That's why there is another flag, a (short for append), for writing new info to the end of an existing file.
# Read an existing file
f3 = open('spaces.txt', 'r')
data = f3.read()
print(data)
f3.close()
hello
world
123
# Add some stuff to the end
f3 = open('spaces.txt', 'a')
f3.write('---some new data---')
f3.close()
# How does the file look now?
f3 = open('spaces.txt', 'r')
data = f3.read()
print(data)
f3.close()
hello
world
123---some new data---
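To see the difference between w and a side by side, here is a small sketch that writes, appends, and then writes again to the same file. The file name modes.txt and the variable names are made up for this example, and the file lives in a temporary folder.

```python
import os
import tempfile

# Hypothetical scratch file for this comparison
path = os.path.join(tempfile.mkdtemp(), 'modes.txt')

with open(path, 'w') as f:    # creates the file
    f.write('first')

with open(path, 'a') as f:    # keeps the old content, adds to the end
    f.write(' second')

with open(path, 'r') as f:
    after_append = f.read()

with open(path, 'w') as f:    # throws the old content away
    f.write('third')

with open(path, 'r') as f:
    after_write = f.read()

print(after_append)  # first second
print(after_write)   # third
```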
Basic parsing
Parsing is something you do when your data is in a structured format, and you need to recover individual elements from the data using a pattern.
The string below contains data, each separated by a single space (' ') character:
s = '1 20 0.4 10 hello'
We can separate these elements inside the string using the .split() method of the string class:
s.split()
['1', '20', '0.4', '10', 'hello']
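In general, str.split(delimiter) will create a list of items that are the result of splitting the input data whenever we see the delimiter string.
With no argument, str.split() splits on any run of whitespace (spaces, tabs, newlines) and ignores leading and trailing whitespace. That is not quite the same as str.split(' '), which splits at every single space, but for our string, where the items are separated by exactly one space, both give the same result: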
s.split(' ')
['1', '20', '0.4', '10', 'hello']
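One detail worth trying for yourself: split() with no argument and split(' ') only agree when the items are separated by exactly one space. The sketch below uses a made-up string (messy) with a double space, a tab, and a trailing newline to show where they differ.

```python
messy = '1  20\t0.4\n'   # double space, a tab, and a trailing newline

default_split = messy.split()     # splits on runs of any whitespace
space_split = messy.split(' ')    # splits at every single space character

print(default_split)  # ['1', '20', '0.4']
print(space_split)    # ['1', '', '20\t0.4\n']
```

With split(' '), the double space produces an empty string, and the tab and newline are left inside the last item. When you are parsing lines read from a file, the no-argument version is often the more forgiving choice.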
We can also split our string using other substrings, if we need to:
s.split('0')
['1 2', ' ', '.4 1', ' hello']
In the example below, our string is encoded with a special string @@, which we can use for the delimiter in our input to split():
s2 = 'hi@@bye@@hello'
s2.split('@@')
['hi', 'bye', 'hello']
After we parse the tokens from the string using split(), we can convert them to their correct data type using int(), float(), eval(), and other functions.
s = '1  2  10.0' # Separated by double spaces
elems = s.split('  ')
elems
['1', '2', '10.0']
v1 = int(elems[0])
v2 = int(elems[1])
v3 = float(elems[2])
v1 + v2 + v3
13.0
You can also be slightly fancy by writing a single-line list comprehension ;-)
[float(x) for x in s.split('  ')]
[1.0, 2.0, 10.0]
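When you don't know the type of each token in advance, one possible approach is to try int() first, then float(), and fall back to keeping the string. The convert function below is a hypothetical helper written for this illustration, not something built into Python. It also avoids eval(), which runs whatever code is in the string and so should only be used on input you trust.

```python
def convert(token):
    """Hypothetical helper: try int, then float, otherwise keep the string."""
    for cast in (int, float):
        try:
            return cast(token)
        except ValueError:
            pass    # this type didn't fit, try the next one
    return token

s = '1 20 0.4 10 hello'
values = [convert(tok) for tok in s.split()]
print(values)  # [1, 20, 0.4, 10, 'hello']
```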
The same ideas apply for files as well.
data = [(10, 20), (1, 5), (100, 1), (5, 5)]
Suppose we want to write the data above in a structured way. For example, this could be the way we encode our data above into a file:
data.txt
4
10 -- 20
1 -- 5
100 -- 1
5 -- 5
The file begins with the number of elements inside the data (4), and then each element inside the data is written on a new line, with -- separating the two numbers.
Here's how we could do this:
f = open('data.txt', 'w') # Open the file for writing
f.write(str(len(data)) + '\n') # Write the number of elements + newline character
for pair in data:
    f.write(str(pair[0]) + ' -- ' + str(pair[1]) + '\n')
f.close()
Let's see the file content:
f = open('data.txt', 'r')
print(f.read())
f.close()
4
10 -- 20
1 -- 5
100 -- 1
5 -- 5
That seems to work. What about the reverse case, though? Suppose we have this file and we want to retrieve our data.
We can do that as well, using the same ideas in reverse:
f2 = open('data.txt', 'r')
first_line = f2.readline()
count = int(first_line) # Get the number of lines to read
new_data = []
for i in range(count):
    next_line = f2.readline().strip('\n') # Read a line, discard \n at the end
    tokens = next_line.split(' -- ') # Split the line using our special delimiter
    first = int(tokens[0])
    second = int(tokens[1])
    new_data.append((first, second))
f2.close()
print(new_data)
[(10, 20), (1, 5), (100, 1), (5, 5)]
That also seems to work just fine. Notice that file.readline() gives us the next line in the file, but it doesn't discard the newline character at the end of the line. That's where str.strip() is useful: strip('\n') removes it.
In the example above, we know how many lines we have to read because it is written on the first line of the file.
There is another approach that we can take, using the fact that file.readline() returns an empty string when the file has ended:
f2 = open('data.txt', 'r')
new_data = []
first_line = f2.readline().strip('\n')
# We don't need the number of elements, so just ignore the first line
while True:
    line = f2.readline().strip('\n')
    print(line)
    if line != '': # Check if we're at the end of the file
        tokens = line.split(' -- ') # Split the line using our special delimiter
        first = int(tokens[0])
        second = int(tokens[1])
        new_data.append((first, second))
    else:
        break
f2.close()
print(new_data)
10 -- 20
1 -- 5
100 -- 1
5 -- 5

[(10, 20), (1, 5), (100, 1), (5, 5)]
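A third option, shown here only as a sketch, is to let Python do the end-of-file check for us: a file object can be used directly in a for loop, which gives one line per iteration and stops when the file ends. The snippet recreates the same file content in a temporary folder (an assumption, so that it runs on its own) and uses next(f) to skip the first line.

```python
import os
import tempfile

# Recreate the data file in a scratch folder (assumption for this sketch)
path = os.path.join(tempfile.mkdtemp(), 'data.txt')
with open(path, 'w') as f:
    f.write('4\n10 -- 20\n1 -- 5\n100 -- 1\n5 -- 5\n')

new_data = []
with open(path, 'r') as f:
    next(f)                  # skip the first line (the element count)
    for line in f:           # one line per iteration, stops at end of file
        first, second = line.strip('\n').split(' -- ')
        new_data.append((int(first), int(second)))

print(new_data)  # [(10, 20), (1, 5), (100, 1), (5, 5)]
```

This version has no while True and no explicit check for an empty string, which leaves fewer places for mistakes.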
We can also create a special token to signal that we've reached the end of a file, like this:
data = [1, 2, 3]
end_token = 'END-OF-FILE'
with open('my_data.txt', 'w') as f:
    for el in data:
        f.write(f'{el}\n')
    # After we write all of the elements
    f.write(end_token)
new_data = []
with open('my_data.txt', 'r') as f:
    while True:
        line = f.readline()
        if line == end_token:
            break # We've reached the end of the file
        else:
            new_data.append(int(line))
print(new_data)
[1, 2, 3]
As we learned before, we can format strings to make them appear nicer, display them up to a certain number of significant digits, add horizontal padding, and so on. It might be a nice idea to format your data before putting them into a file.
The example below makes sure that each element takes 10 characters of space (width) in the resulting output, which could be very useful for displaying matrix-like or table-like data:
data = [11203, 2231, 323]
print(data[0], data[1], data[2]) # Normal printing
print('{:10} {:10} {:10}'.format(data[0], data[1], data[2])) # Formatted
11203 2231 323
     11203       2231        323
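The same formatting works when writing to a file. Below is a small sketch that writes pairs as right-aligned columns and then parses them back. The width of 5, the file name table.txt, and the variable names are arbitrary choices for this example.

```python
import os
import tempfile

rows = [(10, 20), (1, 5), (100, 1), (5, 5)]

# Hypothetical scratch file for this illustration
path = os.path.join(tempfile.mkdtemp(), 'table.txt')

with open(path, 'w') as f:
    for a, b in rows:
        # {:>5} right-aligns each value in a column 5 characters wide
        f.write('{:>5} {:>5}\n'.format(a, b))

with open(path, 'r') as f:
    text = f.read()

print(text)

# split() with no argument copes with the extra padding spaces
parsed = [tuple(int(tok) for tok in line.split()) for line in text.splitlines()]
print(parsed)  # [(10, 20), (1, 5), (100, 1), (5, 5)]
```

The padding makes the file easier on the eyes, and because split() with no argument treats any run of spaces as one separator, reading the data back is no harder than before.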
As always, you should check out the textbook for more information regarding file formatting.
Binary files
Finally, we can talk about binary files.
Up until this point, we've been writing data into files in a way that we can read them as a human (strings, decimal strings for numbers). This is also useful when we need to edit a file.
However, a computer works in binary, meaning that it has to translate human-readable data back to binary before it can use it. A human-readable representation also takes more storage space, so binary files have an advantage there. As a trade-off, we lose easy readability.
The example below shows a minimal example of working with binary files:
data = [99, 120, 244, 42, 30, 10, 1, 0]
# You don't need to know how to do this
with open('binary.txt', 'wb') as binfile: # wb for [w]rite [b]inary
    binfile.write(bytes(data))
with open('binary.txt', 'rb') as binfile: # rb for [r]ead [b]inary
    content = binfile.read()
print(content)
b'cx\xf4*\x1e\n\x01\x00'
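To get a feeling for the space difference, the sketch below compares the same numbers written as human-readable text and as raw bytes, and shows how to get the numbers back. It works in memory only. The space-separated text format and the variable names are assumptions for this illustration.

```python
data = [99, 120, 244, 42, 30, 10, 1, 0]

as_text = ' '.join(str(x) for x in data)   # '99 120 244 42 30 10 1 0'
as_bytes = bytes(data)                     # one byte per number (each is 0-255)

print(len(as_text))    # 23 characters as text
print(len(as_bytes))   # 8 bytes in binary

restored = list(as_bytes)   # each byte becomes an integer again
print(restored)             # [99, 120, 244, 42, 30, 10, 1, 0]
```

Note that bytes(data) only works because every number here fits in a single byte (0 to 255). Larger numbers would need a different encoding.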