Solving for bank transfers using Pandas

Previously in Part 1, I talked about how to clean-up the transaction data from Mint to remove duplicates and add any missing transactions.

Solving for transfers

The next phase is to solve for the transfer pairs. A transfer pair is defined with a matching credit and debit transaction on two different accounts. In Firefly, a transfer is treated separately than a credit/debit because it’s excluded from the expense and income reports.

This ended up being trickier than I expected and I went through multiple phases. First attempt, I used Pandas merge_asof, but this did not handle multiple transactions around the same time. Then I developed a new method that ignored previously used transactions. This worked better, but it still struggled with complex transaction chains. For example, if I transferred $1k from account A to B, then from C to D, it might accidentally book a transfer from A to D and C to B which is incorrect.

The basic algorithm

Here we have the first iteration. How it works is for every single transaction, it looks forward 5 days, finds all transactions that are the opposite type (credit -> debit or vice versa) and the same amount, then tags the first one as the opposite. For example, in the below table, it’d identify the two $300 transactions as a pair.

Date	Description	Amount	Type	Account
1/1/2024	Transfer to XYZ	$300.00	debit	ABC
1/1/2024	Bought Avocado Toast	$4.99	debit	Credit Card
1/2/2024	Transfer from ABC	$300.00	credit	XYZ

The code looked like this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
def find_transfer(row, df, time_window_days=5):
    start_date = row['Date']
    end_date = start_date + timedelta(days=time_window_days)

    opposing_filter = (
        (df['Account Name'] != row['Account Name']) &
        (df['Date'] >= start_date) &
        (df['Date'] <= end_date) &
        (df['Amount'] == row['Amount']) &
        (df['Transaction Type'] != row['Transaction Type']) &
        (~df['Considered'])
    )
    
    transfer_pair = df[opposing_filter]    
    if not transfer_pair.empty:
        df.at[row.name, 'Considered'] = True
        opposite = transfer_pair.iloc[0]
        return opposite
    else:
        return None


def process_record(row, df):
    if df.at[row.name, 'Considered']:
        return None
    
    pair = find_transfer(row, df)
    if pair is not None:
        df.at[pair.name, 'Considered'] = True
        notes = [
            row['Notes'],
            pair['Notes'],
        ]
        labels = list(filter(lambda x: isinstance(x, str) and x != '', [
            row['Labels'],
            pair['Labels']
        ]))
        if row['Transaction Type'] == 'debit':
            # Bank is on the left
            date = row['Date']
            process_date = pair['Date']
            notes.append(pair['Original Description'])
            return {
                'type': 'transfer',
                'date': row['Date'],
                'process_date': pair['Date'],
                'amount': row['Amount'],
                'category': pair['Category'],
                'description': row['Original Description'],
                'source_id': row['AccountId'],
                'destination_id': pair['AccountId'],
                'tags': labels,
                'format': 'transfer_debit',
                'notes': '\n'.join(filter(lambda x: isinstance(x, str) and x != '', notes))
            }
        else:
            date = pair['Date']
            process_date = row['Date']
            notes.append(row['Original Description'])
            return {
                'type': 'transfer',
                'date': date,
                'process_date': process_date,
                'amount': row['Amount'],
                'category': pair['Category'],
                'description': pair['Original Description'],
                'source_id': pair['AccountId'],
                'destination_id': row['AccountId'],
                'tags': labels,
                'format': 'transfer_credit',
                'notes': '\n'.join(filter(lambda x: isinstance(x, str) and x != '', notes))
            }
    
    result = {
        'date': row['Date'],
        'description': row['Original Description'],
        'amount': row['Amount'],
        'category': row['Category'],
        'tags': row['Labels'],
        'notes': row['Notes']
    }
    
    if row['Transaction Type'] == 'credit':
        result['type'] = 'deposit'
        result['destination_id'] = row['AccountId']
        result['source_name'] = row['Description']
	else:
        result['source_id'] = row['AccountId']
        result['destination_name'] = row['Description']
        result['type'] = 'withdrawal'
            
    return result

transactions['Considered'] = False
transactions.apply(func=process_record, axis=1, result_type='expand')

However, it made a lot of mistakes. For example, it would classify the following as a transfer from Venmo to the Credit Card. Which maybe one could say it’s a “transfer”, but I didn’t want that because the credit card would get paid by my checking account.

Date	Description	Amount	Type	Account
1/1/2024	Bought Dinner for my friends	$50	debit	Credit Card
1/1/2024	Friend paid me back	$50	credit	Venmo

Eventually, I extended that algorithm with rules based on how my accounts behaved. Let’s look at some of those options.

Not all transactions are transfers

Not all transactions are transfers, but sometimes it lines up that it looks like it’s a transfer because there’s a matching side. Just like the previous example. We can define some rules that match transactions that we know are not transfers.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
def account(num: Union[int, List[int]]):
    if type(num) == int:
        return transactions['AccountId'] == num
    else:
        return transactions['AccountId'].isin(num)
    
def orig_descr_contains(msg: str):
    return transactions['Original Description'].str.contains(msg, case=False)

# List of all Firefly account ids that are credit cards
credit_cards = [5, 10, 15, 20]

brokerages = []

account_id_401k = 25

not_a_transfer = [
	# Maybe exclude an entire account. For example, your 401k may never have
	# a transfer (unless you take a loan or do a Mega Backdoor Roth)
    account(account_id_401k),

	# Or you want to exclude reimbursements via Venmo
    account(23) & orig_descr_contains('John Doe Paid '), # Venmo
    account(23) & ~transactions['Category'].isna() & (transactions['Category'] != 'Transfer'), # Venmos frequently appear like transfers because a friend reimburses me

	# Credit Card Payments (to the card) can be transfers, but
	# maybe they're never transfers going out (unless you do a Balance Transfer)
    account(credit_cards) & (transactions['Transaction Type'] == 'debit'),

	# A paycheck
	# While it is a transfer from your employer to you, without the
	# debit side, the algorithm could incorrectly classify it as a transfer
    orig_descr_contains('PAYROLL') | (transactions['Category'] == 'Paycheck'),

	# Interest paid is not a transfer
    orig_descr_contains('INTEREST PAYMENT'),
    transactions['Category'] == 'Interest Income',

	# Dividend reinvestment are not transfers
	account(brokerages) & (orig_descr_contains('DIVDEND') | orig_descr_contains('REINVESTMENT')),

	# Some accounts have $0 transactions as part of bookkeeping
	# They are not transfers
    (transactions['Amount'] == 0)
]

Then we can integrate it into the algorithm. While searching for new transactions, it excludes anything that is not considered relevant.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
# Compute the exclusion set
df_filter_relevant = ~reduce(lambda x, y: x | y, not_a_transfer)

def find_transfer(row, df, time_window_days=5):
    # ...
	opposing_filter = (
        (df['Account Name'] != row['Account Name'])
        & (df['Date'] >= start_date)
        & (df['Date'] <= end_date)
        & (df['Amount'] == row['Amount'])
        & (df['Transaction Type'] != row['Transaction Type'])
        & (~df['Considered'])
        & df_filter_relevant
    )
	# ...

def process_record(row, df):
	if df.at[row.name, 'Considered']:
        return None

    if df_relevant.loc[row.name]:
        transfer_pair = find_transfer(row, df)
        # ...

# ...

output = transactions.apply(func=process_record, axis=1, result_type='expand')

Figuring out the rules is tricky. What I do is to actually look at both sides of the transfers in the notes:

1
output[output['type'] == 'transfer']

And see what makes sense and what doesn’t. If you can’t figure out how to define a rule that ignores that transaction, then go back to your CSV file, add a new column called IsTransfer and put true or false into rows that the algorithm struggles with (you don’t have to annotate every row.) Then reload the file and add a rule:

1
2
3
4
not_a_transfer = [
	# ...
	~transactions['IsTransfer']
]

Heuristic: Account Id in Description

The next problem is that if I make transfers with the same amount during the same time period, it can get confused and depending on the ordering in the transactions CSV, mis-attribute the transfer between the wrong accounts.

This led to a new idea to use the description as a signal. For example, in the transaction description: ONLINE TRANSFER REF #XYZ1234 TO SAVINGS XXXXXX1234 ON 11/2/11 we can see that I transferred to account X1234. I know that this account corresponds to Firefly account XYZ and can seed the algorithm to search for credits in that account.

Not all transfers include the exact account id though, especially ones that cross banks will usually show something vague. Like a transfer from one bank to another bank might just show FOOBANK ONLINE TRANSFER and that bank has multiple candidate accounts.

1
2
3
4
5
6
7
known_account_ids = {
    'X1234': 1,
    'X4561': 3,

    'FOO BANK $TRANSFER': [1, 2, 3, ...],
    # ...
}

This says if we find a transaction that says X1234, then we should look for matching transactions in account id 1. It also says that if we have transaction in account 1, we should try to find transactions that contain either X1234 or FOO BANK $TRANSFER, and if neither can be find, expand the scope to any matching transactions. So in the following example we have two transfers that all have the same amount, but must be disambiguated.

#	Description	Amount	Type	Account	Pair
1	Transfer to X1234	$100	debit	4561	4
2	Transfer	$100	credit	3	3
3	Transfer to FOO BANK $TRANSFER	$100	debit	5678	2
4	Deposit	$100	credit	1	1

In transaction #1, we’d find X1234 and look for transactions in account 1, thus finding #4 as the pair. In transaction #2, we have no account ids, but we if we look for transactions that contain either FOO BANK $TRANSFER or X4561, then we find transaction #3.

Wiring this into the algorithm is hairy, but here goes nothing. First, we identify any matching account ids defined above, along with the inverse.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
def attempt(df, base, attempt):
    if len(df[base & attempt]) > 0:
        return base & attempt
    else:
        return base

def find_transfer(row, df, time_window_days=5):
	# ...
	match_by_description = set()
    inverted_by_account_id_in_description = []
    for key, value in known_account_ids.items():
        # If we find an account id in this record's description
        # then refine the other side by the found account id(s)
        if key in row['Original Description'].upper():
            if isinstance(value, list):
                for id in value:
                    match_by_description.add(id)
            else:
                match_by_description.add(value)
        
        # Find possible descriptions based on this row's
        # account id
        if (isinstance(value, list) and row['AccountId'] in value) or value == row['AccountId']:
            foo.append(key)
            inverted_by_account_id_in_description.append(df['Original Description'].str.contains(key, case=False))

Then we use this information to attempt different variations of the pair finder algorithm and if any of them return a result, we use that transactions as our pair.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
	# ...
    if len(match_by_description) > 0:
        opposing_filter = attempt(df, opposing_filter, df['AccountId'].isin(match_by_description))
    if len(inverted_by_account_id_in_description) > 0:
		filter_by_invert = reduce(lambda x, y: x | y, inverted_by_account_id_in_description)
        if len(not_inverted) > 0:
            filter_by_invert = filter_by_invert & reduce(lambda x, y: x & y, not_inverted)

        opposing_filter = attempt(df, opposing_filter, filter_by_invert)
    
	transfer_pair = df[opposing_filter]    
    if not transfer_pair.empty:
	    # ...

Ambiguous Matches

Even with all that, we can still end up with incorrect transfer matches. For example, I frequently transferred $1k blocks between different accounts and if that happened around the same time, it could still misclassify the transfer. The next step is to identify ambiguous matches– i.e. any time the solver finds two or more candidates and just picks one to proceed. These are likely candidates to investigate further. Let’s filter for those:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
def find_transfer(row, df, time_window_days=5):
	# ...
	transfer_pair = df[opposing_filter]
    if not transfer_pair.empty:
        df.at[row.name, 'Considered'] = True
        return opposite
    else:
        return None

def find_ambiguous(row, df):
    if df.at[row.name, 'Considered']:
        return None

    transfer_pair = find_transfer(row, df)
    if transfer_pair is None or len(transfer_pair) == 1:
        return None

    print(f"- {row['Date']} {row['Account Name']} \"{row['Description']}\"")
    for index, option in transfer_pair.iterrows():
        delta = option['Date'] - row['Date']
        print(f"  {delta.days} day(s): {option['Account Name']} {option['Original Description']}")
    print("")

copy['Considered'] = False
copy[df_relevant].apply(func=find_ambiguous, axis=1, df=copy)

This gives me something like:

1
2
3
- 2012-03-01 Savings Account A: "DepositInternet transfer from Interest Checking account XXXXXXX6789"
  0 day(s) Spending Account: "WithdrawalInternet transfer to Online Savings account XXXXXXX1234"
  0 day(s) Ally Spending Account: "WithdrawalInternet transfer to Online Savings account XXXXXXX5678"

From here, we can see that I need to create an alias for 6789, X1234, X5678 in known_account_ids.

1
2
3
4
5
account_info_map = {
	'X6789': 5, # Firefly's account id
	'X1234': 6,
	'X5678': 7
}

Then rerun the transfer solver and it’ll fix those issues

Testing

How do you know if the transfers are working correctly? There’s no automatic solution for this. I built my rules based on trial and error. I looked through the data, found issues, added rules, and re-ran.

With this code, you can view the solved transfers:

1
2
transfers = transactions[df_filter_relevant].apply(func=process_record, axis=1, result_type='expand')
transfers[transfers['type'] == 'transfer']

You can also write it to a CSV and review it in a spreadsheet editor:

1
transfers[transfers['type'] == 'transfer'].to_csv('transfers.csv')

Once you’re satisfied, you can move on to the next step, which is loading into Firefly-iii in the next part of this series.

Conclusion & Next Steps

In the post, I walked through the Python code to identify and solve for transfers. The final code for this can be found on the GitHub repo here.

In the next post, I’ll show how to load all this into Firefly-iii.

Comments

To give feedback, send an email to adam [at] this website url.

Donate

If you've found these posts helpful and would like to support this work directly, your contribution would be appreciated and enable me to dedicate more time to creating future posts. Thank you for joining me!