Refactoring User Input Importer

Requirements

The Habit Tracker App User defines their activities and journal entries in a user input file which then needs to be parsed and imported.

Such a user_input file has the following format:

day of month month name (short version)
activity name 1; life aspect [ | more information]
activity name 2; life aspect [ | more information]
...
activity name n; life aspect [ | more information]
journal: (optional)
Journal Entry for that day

day of month month name (short version)
...

An example of a user_input file is:

10 Aug
developing my habit tracker; Career | just built a new feature
walk; Health & Fitness | the weather was exceptionally good today
talk to a friend; Friendship
journal:
Had a great day today.
Glad that I'm recording my habits.

11 Aug
cooking; Household
talk to a family member; Family

1 Sep
journal:
Today I don't feel like doing any of my activities, but I wrote something in the journal. And that's OK.

The format of this file makes the input user-friendly and human-readable. It is somewhat similar to a journal that you might write in your notebook. However, the computer has to have some kind of clever representation of this input so that it can process it, insert it into databases, make queries on it, and much more.

There are two data classes that we can define by looking at the user_input file: Activity and JournalEntry. Activity contains the details of an activity that happened throughout the day; note that more than one activity can be done in a day. JournalEntry contains just the record of that entry. Each day can have at most one journal entry, and it is not mandatory.

The properties of these data classes are as follows:

Activity
- activity_name: str
- activity_date: date
- life_aspect: str
- more_info: str = None
- id: UUID = uuid4()

JournalEntry
- record: str
- record_date: date
- id: UUID = uuid4()

We need an algorithm that takes the contents of a user_input file and returns a List[Activity] and a List[JournalEntry].

Current Solution

The current solution, which you can check out here, consists of two main parser functions, read_activities_from_user_input and read_journal_entries_from_user_input. They both take a user_input filename as a string, read the lines from the file one by one, and process the content in a primitive way. Lists of each data class (Activity and JournalEntry) are constructed imperatively and then returned. In case of a crucial error, an empty list is returned and a log message says what went wrong.

Let's see read_activities_from_user_input that retrieves the List of Activities from a user_input file:

# read_activities.py

import logging
from typing import List

import utils
from activities.activity import Activity
from activities.guess_life_aspect import guess_life_aspect

logger = logging.getLogger(__name__)

def read_activities_from_user_input(file_name: str) -> List[Activity]:
    try:
        f = open(file_name, "r", encoding="utf-8")
    except FileNotFoundError:
        logger.exception(f"File {file_name} does not exist")
        return []

    logger.info(f"Reading user input activities from {file_name}")
    activities = []

    while True:
        line = f.readline().strip()
        if not line:
            break

        try:
            activity_date = utils.extend_date(line) # extends date with the current year
        except ValueError:
            logger.exception(f"'{line}' is not a correct date format, no activities will be read from '{file_name}'")
            f.close()
            return []

        while True:
            line = f.readline().strip()
            if not line or line == "":
                break

            if line == "journal:":
                while line and line != "":
                    line = f.readline().strip()
                break

            raw_activity_props, more_info = utils.parse_activity_line(line)

            activity_props = (prop.strip() for prop in raw_activity_props.split(";"))
            activity_name = next(activity_props)
            try:
                life_aspect = next(activity_props)
            except StopIteration:
                life_aspect = guess_life_aspect(activity_name)
                if not life_aspect:
                    logger.exception(f"'{line}' is not a correct activity description, no activities will be read from '{file_name}'")
                    f.close()
                    return []

            activities.append(Activity(activity_name, activity_date, life_aspect, more_info))
    f.close()
    logger.info(f"Read {len(activities)} activities from user input file {file_name}")
    return activities

And read_journal_entries_from_user_input retrieves the List of Journal Entries from a user_input file:

# read_journal_entries.py

import logging
from typing import List

import utils
from journal.journal_entry import JournalEntry

logger = logging.getLogger(__name__)

def read_journal_entries_from_user_input(file_name: str) -> List[JournalEntry]:
    try:
        f = open(file_name, "r", encoding="utf-8")
    except FileNotFoundError:
        logger.exception(f"File {file_name} does not exist")
        return []

    logger.info(f"Reading user input journal entries from {file_name}")

    journal_entries = []

    while True:
        line = f.readline().strip()
        if not line:
            break

        try:
            record_date = utils.extend_date(line) # extends date with the current year
        except ValueError:
            logger.exception(f"'{line}' is not a correct date format, no journal entries will be read from '{file_name}'")
            f.close()
            return []

        while True:
            line = f.readline().strip()
            if not line or line == "":
                break

            if line == "journal:":
                record = ""
                while True:
                    line = f.readline().strip()
                    if not line or line == "":
                        break

                    if record != "":
                        raw_new_line = r"\n"
                        record = f"{record}{raw_new_line}"
                    record = f"{record}{line}"

                journal_entries.append(JournalEntry(record, record_date))
                break

    f.close()
    logger.info(f"Read {len(journal_entries)} journal entries from user input file {file_name}")
    return journal_entries

Problems with the Current Solution

First, I think the current solution is not self-explanatory; it could use some comments and additional documentation to help a new developer understand what's going on. That makes maintenance difficult, which is the reason there are two functions: I simply didn't wanna touch the read_activities_from_user_input function when building the journal entry parser functionality. We've all been in a situation where we said, or heard somebody say: "I'm not changing that! Who knows what's gonna happen??!".

The 3 nested while loops are pretty much unavoidable with this approach. However, as the format of the user_input files gets more complex, we will need to introduce even more while loops, so the code does not scale.

Both functions have to change all the time because they take on more than one responsibility: they read the file, and they must also be aware of the user input file format. Even though there is a distinct function for each of the data classes, there's still a high chance that both functions need to change if the format of Activity or the format of JournalEntry changes.

The dependency on the file opening function and the imperative nature of these code blocks make them difficult to test.

And finally, there's quite a lot of repetition going on: both functions share a bunch of common elements, so the code is not DRY.

At first glance

Perhaps the main issue is that we are processing the file line by line. A better approach would be to load the entire contents of the file into memory. A List[str] representing the lines of the file would be ideal because then we can apply any data transformation function that we want. I would like to mention that the largest file so far is 6.5KB; if and when the files grow so large that loading them entirely into memory is no longer efficient, I'll consider using a generator function instead.

A function that solves this issue might look something like this:

import logging
from pathlib import Path
from typing import List

logger = logging.getLogger(__name__)

def read_user_input(filename: str) -> List[str]:
    user_input_dir = "user_input"
    file = Path.cwd().joinpath(user_input_dir, filename)

    try:
        with file.open("r", encoding="utf-8") as f:
            lines = f.read().splitlines()
    except FileNotFoundError:
        logger.exception(f"File '{file}' does not exist")
        return []
    return lines
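For reference, the generator alternative mentioned above could be sketched roughly as follows. This is only an illustration: the name iter_user_input is hypothetical and not part of the project, and it takes a ready-made Path instead of resolving the user_input directory.

```python
from pathlib import Path
from typing import Iterator


def iter_user_input(path: Path) -> Iterator[str]:
    """Yield the lines of a user_input file one at a time.

    Hypothetical helper: only one line is held in memory at any moment.
    """
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            # Drop the trailing newline so the result matches splitlines()
            yield line.rstrip("\n")
```

One trade-off with this shape is that a missing file is only reported when the first line is requested, not when iter_user_input is called, so the FileNotFoundError handling would have to move to the consumer.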

The next step is to identify well-defined units in the existing codebase that could be abstracted away into functions. At first glance I can see 3 of these units:

Extract a journal_entry/activity date

...
    try:
        habits_date = utils.extend_date(line) # extends date with the current year
    except ValueError:
        logger.exception(f"'{line}' is not a correct date format, no data will be read from '{file_name}'")
        f.close()
        return []
...

Parse an activity line

...
    raw_activity_props, more_info = utils.parse_activity_line(line)

    activity_props = (prop.strip() for prop in raw_activity_props.split(";"))
    activity_name = next(activity_props)
    try:
        life_aspect = next(activity_props)
    except StopIteration:
        life_aspect = guess_life_aspect(activity_name)
        if not life_aspect:
            logger.exception(f"'{line}' is not a correct activity description, no activities will be read from '{file_name}'")
            f.close()
            return []
...

Parse multiple lines of journal_entry

...
    record = ""
    while True:
        line = f.readline().strip()
        if not line or line == "":
            break

        if record != "":
            raw_new_line = r"\n"
            record = f"{record}{raw_new_line}"
        record = f"{record}{line}"
...

At this point, I went on and started to fiddle with the code, built these code blocks into functions, and reduced unnecessary abstractions. I had to decide what these functions would return: the domain models that we defined earlier as data classes, or NamedTuples? What would the core building blocks be, and how would they be put together to construct the User Input Importer? And how should I organise the code into modules, packages, and so on?

In my opinion, the best way to find a great solution for something is to try out a bunch of things and see what works best, document it, refine it, create diagrams, etc. I'll let you read about the intermediate steps here (link not yet set), but this article will continue with the final, refined solution.

Proposed Solution

After the refactoring process I came to a solution that is represented by the following code architecture diagram:

As you can see, the code is structured in four packages. One of the packages, ht_importer, represents the entry point to the solution and deals with reading the user_input file, delegating the parsing and build process, and writing the data to the database. The ht_models package contains the models that the application uses. Another package, ht_parser, contains the algorithm that parses the user input and constructs raw_models. Finally, ht_builder takes raw_models and, with the help of some side effects (e.g. guess_life_aspect), creates the domain_models.

Let's look at each package and see their implementation:

`ht_models` package

It is sensible to start with ht_models as this package contains the data structures that other packages pass around to turn the contents of user_input files into data that the computer understands. All other packages are dependent on this one and it exists because we wanted to reduce the level of dependency between the packages.

ht_models currently has two modules: raw_models, which represent the parsed user input as it is, and domain_models, which are the actual high-level models that we build from the raw_models and that are supposed to be used in the rest of the application.

raw_models are just simple data classes that are designed to contain the raw information given by the user.

from dataclasses import dataclass
from datetime import date
from typing import List

@dataclass
class RawActivity:
    activity_name: str
    life_aspect: str
    more_info: str

RawJournalEntry = str

@dataclass
class RawDay:
    habits_date: date
    activities: List[RawActivity]
    journal_entry: RawJournalEntry

domain_models are also data classes, but they are more complex: they deal with data generated by the system, such as id, and include to_string behaviours.

from dataclasses import dataclass, field
from datetime import date
from typing import List
import uuid

@dataclass
class Activity:
    activity_name: str
    activity_date: date
    life_aspect: str
    more_info: str = None
    id: uuid.UUID = field(default_factory=uuid.uuid4)

    def __str__(self):
        more_info_str = "" if self.more_info is None else f" | {self.more_info}"
        return f"{self.activity_date.strftime('%a %d %b')}: {self.activity_name}; {self.life_aspect}{more_info_str}"

    def print(self):
        more_info_str = "null" if self.more_info is None else f"{self.more_info}"
        return f"{self.id};{self.activity_date.strftime('%Y-%m-%d')};{self.activity_name};{self.life_aspect};{more_info_str}"


@dataclass
class JournalEntry:
    record: str
    record_date: date
    id: uuid.UUID = field(default_factory=uuid.uuid4)

    def __str__(self):
        return f"{self.record_date}\n{self.record}"

    def print(self):
        return f"{self.id};{self.record_date};{self.record}"


@dataclass
class UserInput:
    activities: List[Activity]
    journal_entries: List[JournalEntry]
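The id fields above use field(default_factory=uuid.uuid4) rather than a plain uuid4() default. A small illustration of why that matters, using a trimmed copy of JournalEntry (the dates and their year are made up for the example):

```python
import uuid
from dataclasses import dataclass, field
from datetime import date


@dataclass
class JournalEntry:
    record: str
    record_date: date
    # default_factory calls uuid.uuid4() once per instance. A plain
    # `id: uuid.UUID = uuid.uuid4()` default would be evaluated once, when the
    # class is defined, and then shared by every instance.
    id: uuid.UUID = field(default_factory=uuid.uuid4)


first = JournalEntry("Had a great day today.", date(2020, 8, 10))
second = JournalEntry("And that's OK.", date(2020, 9, 1))
print(first.id != second.id)  # True
```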

`ht_importer` package

This package acts as an orchestrator: it depends on every other package and uses them, together with some internal modules, to get the job done. It is responsible for reading the data from user_input, converting the data (parsing then building), and writing the data to the database.

import sys
import logging
from collections import namedtuple

from ht_builder.exceptions import ActivityValueError, JournalEntryValueError
from ht_parser.parse_user_input import parse_user_input
from ht_builder.build_user_input import build_user_input

from ht_importer.read_user_input import read_user_input
from ht_importer.write_db import append_activities, append_journal_entries

logger = logging.getLogger(__name__)

ImportedEntityCounts = namedtuple("ImportedEntityCounts", ["activity_count", "journal_entry_count"])

def import_user_input(filename: str):
    user_input_lines = read_user_input(filename)

    raw_days = parse_user_input(user_input_lines)

    try:
        user_input = build_user_input(raw_days)
    except ActivityValueError:
        logger.error(f"There has been an issue parsing activities in '{filename}'")
        sys.exit(1)  # non-zero exit status signals the failure
    except JournalEntryValueError:
        logger.error(f"There has been an issue parsing journal entries in '{filename}'")
        sys.exit(1)  # non-zero exit status signals the failure

    append_activities(user_input.activities)
    append_journal_entries(user_input.journal_entries)

    return ImportedEntityCounts(len(user_input.activities), len(user_input.journal_entries))

We've seen the read_user_input function earlier in the article; it prepares the contents of the user_input file as a list of strings. That list of strings is exactly what the parse_user_input function needs to generate the raw_days (remember the raw_models). The build_user_input function takes those and returns a list of activities and a list of journal entries wrapped in a UserInput data class. Finally, if the build process didn't raise any exceptions, the activities and journal entries are appended to the database using the append_activities and append_journal_entries functions.

For auditing purposes, we want to return how many of each entity we inserted into the database hence the return of the NamedTuple ImportedEntityCounts.

The following diagram helps visualise what's happening during the execution of import_user_input function:

`ht_parser` package

This package is responsible for taking the list of strings and parsing them into raw_models. It does so by splitting the lines into days, then parsing each day into different sections: the date (str), the list of activities (List[str]), and the journal entry lines (List[str]). Once we have these sections we pass them on to their respective parser functions to turn them into raw activities (List[RawActivity]) and a journal entry (RawJournalEntry).

import sys
import logging
from typing import List

from ht_models.raw_models import RawDay

from ht_parser.split_list import split_list
from ht_parser.parse_day import parse_day
from ht_parser.parse_activity import parse_activity
from ht_parser.parse_journal_entry import parse_journal_entry

def parse_user_input(lines: List[str]) -> List[RawDay]:
    if len(lines) == 0:
        return []

    days = split_list(lines)

    raw_days = []
    for day in days:
        parsed_day = parse_day(day)

        habits_date = parsed_day.date
        activities = [parse_activity(activity) for activity in parsed_day.activities]
        journal_entry = parse_journal_entry(parsed_day.journal_entry)

        raw_days.append(RawDay(habits_date, activities, journal_entry))

    return raw_days
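The split_list and parse_day helpers are not listed in this article. A minimal sketch of what they might look like is shown below, assuming that days are separated by blank lines and that everything after the journal: marker belongs to the journal entry. The ParsedDay name and both function bodies are assumptions for illustration, not the project's actual code.

```python
from typing import List, NamedTuple


class ParsedDay(NamedTuple):
    # Hypothetical container; field names follow how parse_user_input uses them
    date: str
    activities: List[str]
    journal_entry: List[str]


def split_list(lines: List[str]) -> List[List[str]]:
    """Group the lines into days, using blank lines as separators."""
    days: List[List[str]] = []
    current: List[str] = []
    for line in lines:
        stripped = line.strip()
        if stripped:
            current.append(stripped)
        elif current:
            # A blank line closes the day that was being collected
            days.append(current)
            current = []
    if current:
        days.append(current)
    return days


def parse_day(day: List[str]) -> ParsedDay:
    """The first line is the date; lines after 'journal:' form the journal entry."""
    day_date, rest = day[0], day[1:]
    if "journal:" in rest:
        marker = rest.index("journal:")
        return ParsedDay(day_date, rest[:marker], rest[marker + 1:])
    return ParsedDay(day_date, rest, [])
```

With this shape, a day without a journal: marker yields an empty journal_entry list, which parse_journal_entry then turns into None.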

I think it's important to mention here how we parse a string that contains an activity with the parse_activity function. Following the format of the user_input file, we split the string by the | character that separates the more_info property from the rest. If there's no more_info defined we just leave it as None. The other activity properties are separated by the ; character: the first item is the activity_name and the second is the life_aspect. At this point, the life_aspect can be None because we might guess it later in the ht_builder package.

from ht_models.raw_models import RawActivity

def parse_activity(line: str) -> RawActivity:
    activity_parts = line.split("|")
    raw_activity_props = activity_parts[0]
    try:
        more_info = activity_parts[1].strip()
    except IndexError:
        more_info = None

    activity_props = raw_activity_props.split(";")
    activity_name = activity_props[0].strip()
    try:
        life_aspect = activity_props[1].strip()
    except IndexError:
        life_aspect = None

    return RawActivity(activity_name, life_aspect, more_info)
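To make the behaviour concrete, here is the same function applied to a few lines. RawActivity and parse_activity are repeated so the example runs on its own; the Optional type hints are an addition for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RawActivity:
    activity_name: str
    life_aspect: Optional[str]
    more_info: Optional[str]


def parse_activity(line: str) -> RawActivity:
    # Same logic as the listing above
    activity_parts = line.split("|")
    raw_activity_props = activity_parts[0]
    try:
        more_info = activity_parts[1].strip()
    except IndexError:
        more_info = None

    activity_props = raw_activity_props.split(";")
    activity_name = activity_props[0].strip()
    try:
        life_aspect = activity_props[1].strip()
    except IndexError:
        life_aspect = None

    return RawActivity(activity_name, life_aspect, more_info)


print(parse_activity("talk to a friend; Friendship"))
# RawActivity(activity_name='talk to a friend', life_aspect='Friendship', more_info=None)
print(parse_activity("cooking"))
# RawActivity(activity_name='cooking', life_aspect=None, more_info=None)
print(parse_activity("walk; Health | sunny | windy").more_info)
# sunny
```

The last call shows an edge case: a second | inside the extra information is dropped. If that ever matters, line.split("|", 1) would keep everything after the first separator.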

The parse_journal_entry function is also worth discussing, as it is core business logic. It is much simpler than parse_activity: we just join all the lines with a newline marker. The marker is written as a raw string (r"\n"), i.e. the two characters backslash and n, so that a saved record stays on a single line and keeps an escaped newline instead of containing an actual line break.

from typing import List

from ht_models.raw_models import RawJournalEntry

def parse_journal_entry(lines: List[str]) -> RawJournalEntry:
    if not lines:
        return None
    return r"\n".join(lines)
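A short demonstration of what the raw string does in practice. The function is repeated so the example is self-contained, and the restore step at the end is an assumption about how a reader of the stored data might turn the marker back into real line breaks.

```python
from typing import List, Optional


def parse_journal_entry(lines: List[str]) -> Optional[str]:
    if not lines:
        return None
    return r"\n".join(lines)


record = parse_journal_entry(["Had a great day today.", "Glad that I'm recording my habits."])
print(record)
# Had a great day today.\nGlad that I'm recording my habits.

print(len(r"\n"))  # 2 (a backslash followed by the letter n)
print("\n" in record)  # False (there is no real line break in the record)

# Hypothetical restore step: turn the marker back into real line breaks
restored = record.replace(r"\n", "\n")
print(restored.splitlines())
# ['Had a great day today.', "Glad that I'm recording my habits."]
```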

`ht_builder` package

This package is the final step in our conversion and that's where the raw_models turn into domain_models. We loop through each raw_day and build the date (build_habits_date), list of activities (build_activities) and if there is any, the journal entry (build_journal_entry).

from typing import List

from ht_models.raw_models import RawDay
from ht_models.domain_models import UserInput

from ht_builder.build_habits_date import build_habits_date
from ht_builder.build_activity import build_activity
from ht_builder.build_journal_entry import build_journal_entry

def build_user_input(raw_days: List[RawDay]) -> UserInput:
    activities = []
    journal_entries = []

    for raw_day in raw_days:
        habits_date = build_habits_date(raw_day.habits_date)
        for raw_activity in raw_day.activities:
            activities.append(build_activity(raw_activity, habits_date))
        if raw_day.journal_entry:
            journal_entries.append(build_journal_entry(raw_day.journal_entry, habits_date))

    return UserInput(activities, journal_entries)

Let's unpack what those individual builder functions do and how they are implemented. We can start with build_habits_date, a function that assumes the current year, appends it to the parsed user_input date, and converts the result to a date object. It returns the date if the conversion was successful, otherwise None.

import logging
from datetime import date, datetime

logger = logging.getLogger(__name__)

def build_habits_date(partial_date: str) -> date:
    full_date = f"{partial_date} {date.today().year}"
    try:
        habits_date = datetime.strptime(full_date, '%d %b %Y').date()
    except ValueError:
        logger.exception(f"Partial date '{partial_date}' has incorrect format, use '%d %b', i.e. 10 Aug")
        return None
    return habits_date
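Because build_habits_date reads date.today(), its result depends on the day it runs. For experimenting with it, or for unit tests, one option is a variant that accepts the year explicitly. The sketch below is only an illustration: the function name and the year parameter are hypothetical, and it assumes the default (English) locale for the %b month names.

```python
from datetime import date, datetime
from typing import Optional


def build_habits_date_for_year(partial_date: str, year: Optional[int] = None) -> Optional[date]:
    """Hypothetical variant of build_habits_date with an injectable year."""
    year = date.today().year if year is None else year
    try:
        return datetime.strptime(f"{partial_date} {year}", "%d %b %Y").date()
    except ValueError:
        # Wrong format, or a day that does not exist in that year
        return None


print(build_habits_date_for_year("10 Aug", 2020))  # 2020-08-10
print(build_habits_date_for_year("Aug 10", 2020))  # None
print(build_habits_date_for_year("29 Feb", 2021))  # None
```

The last call hints at a subtle consequence of assuming the current year: "29 Feb" only converts successfully when the import runs in a leap year.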

build_activity maps the fields of a RawActivity to an Activity model while trying to fill in the blanks. The activity_date comes from the function argument, provided that build_habits_date successfully converted the date. The life_aspect has to be guessed if it was not provided in the user_input. If any of the required arguments of the Activity data class is missing, an exception is raised.

import logging
from datetime import date

from ht_models.raw_models import RawActivity
from ht_models.domain_models import Activity

from ht_builder.exceptions import ActivityValueError
from ht_builder.guess_life_aspect import guess_life_aspect

logger = logging.getLogger(__name__)

def build_activity(raw_activity: RawActivity, activity_date: date) -> Activity:
    if not activity_date:
        logger.error("activity_date is not defined")
        raise ActivityValueError("Activity date is missing")

    life_aspect = raw_activity.life_aspect or guess_life_aspect(raw_activity.activity_name)
    if not life_aspect:
        logger.error(f"Raw activity '{raw_activity}' is missing life aspect and can't be guessed")
        raise ActivityValueError("Life aspect is missing")

    return Activity(raw_activity.activity_name, activity_date, life_aspect, raw_activity.more_info)
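Neither ht_builder.exceptions nor guess_life_aspect is listed in this article. To show how the fallback in build_activity behaves, here is a rough sketch under stated assumptions: the exceptions are assumed to subclass ValueError, and the lookup table, its contents, and the resolve_life_aspect helper are all hypothetical stand-ins rather than the real implementation.

```python
from typing import Optional


class ActivityValueError(ValueError):
    """Assumed shape: raised when an Activity cannot be built."""


class JournalEntryValueError(ValueError):
    """Assumed shape: raised when a JournalEntry cannot be built."""


# Hypothetical lookup table; the real guess_life_aspect may work differently
KNOWN_LIFE_ASPECTS = {
    "walk": "Health & Fitness",
    "cooking": "Household",
}


def guess_life_aspect(activity_name: str) -> Optional[str]:
    """Return a life aspect for a known activity name, otherwise None."""
    return KNOWN_LIFE_ASPECTS.get(activity_name.strip().lower())


def resolve_life_aspect(life_aspect: Optional[str], activity_name: str) -> str:
    """Mirror the fallback in build_activity: explicit value first, then a guess."""
    resolved = life_aspect or guess_life_aspect(activity_name)
    if not resolved:
        raise ActivityValueError("Life aspect is missing")
    return resolved


print(resolve_life_aspect("Career", "walk"))  # Career
print(resolve_life_aspect(None, "Walk"))      # Health & Fitness
```

An activity that is neither annotated by the user nor present in the table, for example "juggling", raises ActivityValueError, which import_user_input catches and reports.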

build_journal_entry is essentially just a constructor that brings together the raw_journal_entry and the habits_date to create the JournalEntry domain model.

from datetime import date

from ht_models.raw_models import RawJournalEntry
from ht_models.domain_models import JournalEntry

from ht_builder.exceptions import JournalEntryValueError

def build_journal_entry(raw_journal_entry: RawJournalEntry, record_date: date) -> JournalEntry:
    if not raw_journal_entry:
        return None
    if not record_date:
        raise JournalEntryValueError
    return JournalEntry(raw_journal_entry, record_date)

Conclusion

Yes, it is crazy how much you can achieve with just these nested while loops while the code stays relatively short. However, there are significant problems with that approach when we are building code that could be used in real life and has to be reliable. Let's take a look at how we solved the problems outlined earlier in this article.

Now that we separated the components and defined each part of the algorithm in functions we created a lot of code, but this is a much more maintainable code base than before. As a maintainer, you don't have to understand every building block to make a change, but if your task requires you to expand your knowledge of the system, you can because the blocks are small, concise, and easy to reason about.

The code becomes scalable because if there are new sections in a day that you wanna track, you just need to build the new parser, and builder functions and handle them in the parse_user_input function.

All functions deal with a single responsibility and they have only one reason to change.

Reading the user_input file has been taken out of the parser, so there is no dependency on the mechanism that acquires user input. And the parser functions defined in this refactoring are pure, so they always return the same output for a specific input, which makes the code easy to test.

Last but not least, the code became DRY.

Change Log

v1.0.0 - Activities and Journal Entries

Motivation

I want to introduce a journaling habit through this project and considered writing my daily journal entries in a separate file called journal.txt and parsing that similarly to a user_input file. However, I think it might just get a little too messy with multiple user_input files, so the solution is to bake the daily journal entries into the already existing user_input file format.

This is starting to get closer to the idea of a habit journal, whereby I have my habits and a journal for each day so that I can introduce a more personal way of recording my activities. It would be nice to see my journal entries in the context of my habits, and see if there's any correlation between the two.

And to be honest, editing longer text with the wrap flag on here in Vim doesn't seem that bad. The main reason for this feature is to allow the user to get into the habit of journaling in a very simple way. There's no need for fancy text editors or cool apps; I think the important bit is to get your thoughts down and be able to dig them up later, using a date filter for example.

Theoretical Solution

Currently we have the data.txt file that contains all the data that is entered into the Habits database (activities). Now that we're doing journaling in the same application, we need a more complex storage file structure:

activity.txt contains the activities, just as data.txt did
journal_entry.txt contains the journal entries, with an id, record_date and record
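
As a rough sketch of how such a storage file could be written: the helper below appends one row per line to a text file. The name append_rows is hypothetical (the project's write_db module is not shown here), and the rows are assumed to be the semicolon-separated strings produced by the print() methods of the domain models.

```python
from pathlib import Path
from typing import Iterable


def append_rows(db_file: Path, rows: Iterable[str]) -> int:
    """Append each row as one line to db_file and return how many were written.

    Hypothetical helper: relies on every row being a single line, which is why
    journal records keep their line breaks as an escaped marker.
    """
    count = 0
    with db_file.open("a", encoding="utf-8") as f:
        for row in rows:
            f.write(f"{row}\n")
            count += 1
    return count
```

A call could then look like append_rows(Path("journal_entry.txt"), (entry.print() for entry in journal_entries)), and the returned count could feed the audit numbers reported after an import.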

With this I'm creating a dependency on the user_input files and their parsing. But that's alright, because a CLI needs to be implemented anyway, and it can have features that allow users to enter their activities and journal entries outside of these user files. This works as long as we agree that the source of truth for the data is the database, or perhaps, at the early stages of the project, the database files: activity.txt and journal_entry.txt.

Mixing the 2 concepts (activity and journal_entry) into a single user input file is a good first solution, but later on I can create another file that supports Markdown and maybe has a journal entry for only one day, or for the entire week. I'll see, but for now the best solution is to couple the 2 concepts into the same user_input file and have the Python script separate them nicely.

Implementation

Follow the link to see the Pull Request on GitHub

PR: Save journal entries from user inputs