Solving for bank transfers using Pandas

This article is part of the Self-hosted Finances series.

    Previously in Part 1, I talked about how to clean-up the transaction data from Mint to remove duplicates and add any missing transactions.

    Solving for transfers

    The next phase is to solve for the transfer pairs. A transfer pair is defined with a matching credit and debit transaction on two different accounts. In Firefly, a transfer is treated separately than a credit/debit because it’s excluded from the expense and income reports.

    This ended up being trickier than I expected and I went through multiple phases. First attempt, I used Pandas merge_asof, but this did not handle multiple transactions around the same time. Then I developed a new method that ignored previously used transactions. This worked better, but it still struggled with complex transaction chains. For example, if I transferred $1k from account A to B, then from C to D, it might accidentally book a transfer from A to D and C to B which is incorrect.

    The basic algorithm

    Here we have the first iteration. How it works is for every single transaction, it looks forward 5 days, finds all transactions that are the opposite type (credit -> debit or vice versa) and the same amount, then tags the first one as the opposite. For example, in the below table, it’d identify the two $300 transactions as a pair.

    DateDescriptionAmountTypeAccount
    1/1/2024Transfer to XYZ$300.00debitABC
    1/1/2024Bought Avocado Toast$4.99debitCredit Card
    1/2/2024Transfer from ABC$300.00creditXYZ

    The code looked like this:

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    50
    51
    52
    53
    54
    55
    56
    57
    58
    59
    60
    61
    62
    63
    64
    65
    66
    67
    68
    69
    70
    71
    72
    73
    74
    75
    76
    77
    78
    79
    80
    81
    82
    83
    84
    85
    86
    87
    88
    89
    90
    91
    92
    93
    94
    95
    
    def find_transfer(row, df, time_window_days=5):
        start_date = row['Date']
        end_date = start_date + timedelta(days=time_window_days)
    
        opposing_filter = (
            (df['Account Name'] != row['Account Name']) &
            (df['Date'] >= start_date) &
            (df['Date'] <= end_date) &
            (df['Amount'] == row['Amount']) &
            (df['Transaction Type'] != row['Transaction Type']) &
            (~df['Considered'])
        )
        
        transfer_pair = df[opposing_filter]    
        if not transfer_pair.empty:
            df.at[row.name, 'Considered'] = True
            opposite = transfer_pair.iloc[0]
            return opposite
        else:
            return None
    
    
    def process_record(row, df):
        if df.at[row.name, 'Considered']:
            return None
        
        pair = find_transfer(row, df)
        if pair is not None:
            df.at[pair.name, 'Considered'] = True
            notes = [
                row['Notes'],
                pair['Notes'],
            ]
            labels = list(filter(lambda x: isinstance(x, str) and x != '', [
                row['Labels'],
                pair['Labels']
            ]))
            if row['Transaction Type'] == 'debit':
                # Bank is on the left
                date = row['Date']
                process_date = pair['Date']
                notes.append(pair['Original Description'])
                return {
                    'type': 'transfer',
                    'date': row['Date'],
                    'process_date': pair['Date'],
                    'amount': row['Amount'],
                    'category': pair['Category'],
                    'description': row['Original Description'],
                    'source_id': row['AccountId'],
                    'destination_id': pair['AccountId'],
                    'tags': labels,
                    'format': 'transfer_debit',
                    'notes': '\n'.join(filter(lambda x: isinstance(x, str) and x != '', notes))
                }
            else:
                date = pair['Date']
                process_date = row['Date']
                notes.append(row['Original Description'])
                return {
                    'type': 'transfer',
                    'date': date,
                    'process_date': process_date,
                    'amount': row['Amount'],
                    'category': pair['Category'],
                    'description': pair['Original Description'],
                    'source_id': pair['AccountId'],
                    'destination_id': row['AccountId'],
                    'tags': labels,
                    'format': 'transfer_credit',
                    'notes': '\n'.join(filter(lambda x: isinstance(x, str) and x != '', notes))
                }
        
        result = {
            'date': row['Date'],
            'description': row['Original Description'],
            'amount': row['Amount'],
            'category': row['Category'],
            'tags': row['Labels'],
            'notes': row['Notes']
        }
        
        if row['Transaction Type'] == 'credit':
            result['type'] = 'deposit'
            result['destination_id'] = row['AccountId']
            result['source_name'] = row['Description']
    	else:
            result['source_id'] = row['AccountId']
            result['destination_name'] = row['Description']
            result['type'] = 'withdrawal'
                
        return result
    
    transactions['Considered'] = False
    transactions.apply(func=process_record, axis=1, result_type='expand')
    

    However, it made a lot of mistakes. For example, it would classify the following as a transfer from Venmo to the Credit Card. Which maybe one could say it’s a “transfer”, but I didn’t want that because the credit card would get paid by my checking account.

    DateDescriptionAmountTypeAccount
    1/1/2024Bought Dinner for my friends$50debitCredit Card
    1/1/2024Friend paid me back$50creditVenmo

    Eventually, I extended that algorithm with rules based on how my accounts behaved. Let’s look at some of those options.

    Not all transactions are transfers

    Not all transactions are transfers, but sometimes it lines up that it looks like it’s a transfer because there’s a matching side. Just like the previous example. We can define some rules that match transactions that we know are not transfers.

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    
    def account(num: Union[int, List[int]]):
        if type(num) == int:
            return transactions['AccountId'] == num
        else:
            return transactions['AccountId'].isin(num)
        
    def orig_descr_contains(msg: str):
        return transactions['Original Description'].str.contains(msg, case=False)
    
    # List of all Firefly account ids that are credit cards
    credit_cards = [5, 10, 15, 20]
    
    brokerages = []
    
    account_id_401k = 25
    
    not_a_transfer = [
    	# Maybe exclude an entire account. For example, your 401k may never have
    	# a transfer (unless you take a loan or do a Mega Backdoor Roth)
        account(account_id_401k),
    
    	# Or you want to exclude reimbursements via Venmo
        account(23) & orig_descr_contains('John Doe Paid '), # Venmo
        account(23) & ~transactions['Category'].isna() & (transactions['Category'] != 'Transfer'), # Venmos frequently appear like transfers because a friend reimburses me
    
    	# Credit Card Payments (to the card) can be transfers, but
    	# maybe they're never transfers going out (unless you do a Balance Transfer)
        account(credit_cards) & (transactions['Transaction Type'] == 'debit'),
    
    	# A paycheck
    	# While it is a transfer from your employer to you, without the
    	# debit side, the algorithm could incorrectly classify it as a transfer
        orig_descr_contains('PAYROLL') | (transactions['Category'] == 'Paycheck'),
    
    	# Interest paid is not a transfer
        orig_descr_contains('INTEREST PAYMENT'),
        transactions['Category'] == 'Interest Income',
    
    	# Dividend reinvestment are not transfers
    	account(brokerages) & (orig_descr_contains('DIVDEND') | orig_descr_contains('REINVESTMENT')),
    
    	# Some accounts have $0 transactions as part of bookkeeping
    	# They are not transfers
        (transactions['Amount'] == 0)
    ]
    

    Then we can integrate it into the algorithm. While searching for new transactions, it excludes anything that is not considered relevant.

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    
    # Compute the exclusion set
    df_filter_relevant = ~reduce(lambda x, y: x | y, not_a_transfer)
    
    def find_transfer(row, df, time_window_days=5):
        # ...
    	opposing_filter = (
            (df['Account Name'] != row['Account Name'])
            & (df['Date'] >= start_date)
            & (df['Date'] <= end_date)
            & (df['Amount'] == row['Amount'])
            & (df['Transaction Type'] != row['Transaction Type'])
            & (~df['Considered'])
            & df_filter_relevant
        )
    	# ...
    
    def process_record(row, df):
    	if df.at[row.name, 'Considered']:
            return None
    
        if df_relevant.loc[row.name]:
            transfer_pair = find_transfer(row, df)
            # ...
    
    # ...
    
    output = transactions.apply(func=process_record, axis=1, result_type='expand')
    

    Figuring out the rules is tricky. What I do is to actually look at both sides of the transfers in the notes:

    1
    
    output[output['type'] == 'transfer']
    

    And see what makes sense and what doesn’t. If you can’t figure out how to define a rule that ignores that transaction, then go back to your CSV file, add a new column called IsTransfer and put true or false into rows that the algorithm struggles with (you don’t have to annotate every row.) Then reload the file and add a rule:

    1
    2
    3
    4
    
    not_a_transfer = [
    	# ...
    	~transactions['IsTransfer']
    ]
    

    Heuristic: Account Id in Description

    The next problem is that if I make transfers with the same amount during the same time period, it can get confused and depending on the ordering in the transactions CSV, mis-attribute the transfer between the wrong accounts.

    This led to a new idea to use the description as a signal. For example, in the transaction description: ONLINE TRANSFER REF #XYZ1234 TO SAVINGS XXXXXX1234 ON 11/2/11 we can see that I transferred to account X1234. I know that this account corresponds to Firefly account XYZ and can seed the algorithm to search for credits in that account.

    Not all transfers include the exact account id though, especially ones that cross banks will usually show something vague. Like a transfer from one bank to another bank might just show FOOBANK ONLINE TRANSFER and that bank has multiple candidate accounts.

    1
    2
    3
    4
    5
    6
    7
    
    known_account_ids = {
        'X1234': 1,
        'X4561': 3,
    
        'FOO BANK $TRANSFER': [1, 2, 3, ...],
        # ...
    }
    

    This says if we find a transaction that says X1234, then we should look for matching transactions in account id 1. It also says that if we have transaction in account 1, we should try to find transactions that contain either X1234 or FOO BANK $TRANSFER, and if neither can be find, expand the scope to any matching transactions. So in the following example we have two transfers that all have the same amount, but must be disambiguated.

    #DescriptionAmountTypeAccountPair
    1Transfer to X1234$100debit45614
    2Transfer$100credit33
    3Transfer to FOO BANK $TRANSFER$100debit56782
    4Deposit$100credit11

    In transaction #1, we’d find X1234 and look for transactions in account 1, thus finding #4 as the pair. In transaction #2, we have no account ids, but we if we look for transactions that contain either FOO BANK $TRANSFER or X4561, then we find transaction #3.

    Wiring this into the algorithm is hairy, but here goes nothing. First, we identify any matching account ids defined above, along with the inverse.

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    
    def attempt(df, base, attempt):
        if len(df[base & attempt]) > 0:
            return base & attempt
        else:
            return base
    
    def find_transfer(row, df, time_window_days=5):
    	# ...
    	match_by_description = set()
        inverted_by_account_id_in_description = []
        for key, value in known_account_ids.items():
            # If we find an account id in this record's description
            # then refine the other side by the found account id(s)
            if key in row['Original Description'].upper():
                if isinstance(value, list):
                    for id in value:
                        match_by_description.add(id)
                else:
                    match_by_description.add(value)
            
            # Find possible descriptions based on this row's
            # account id
            if (isinstance(value, list) and row['AccountId'] in value) or value == row['AccountId']:
                foo.append(key)
                inverted_by_account_id_in_description.append(df['Original Description'].str.contains(key, case=False))
    

    Then we use this information to attempt different variations of the pair finder algorithm and if any of them return a result, we use that transactions as our pair.

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    
    	# ...
        if len(match_by_description) > 0:
            opposing_filter = attempt(df, opposing_filter, df['AccountId'].isin(match_by_description))
        if len(inverted_by_account_id_in_description) > 0:
    		filter_by_invert = reduce(lambda x, y: x | y, inverted_by_account_id_in_description)
            if len(not_inverted) > 0:
                filter_by_invert = filter_by_invert & reduce(lambda x, y: x & y, not_inverted)
    
            opposing_filter = attempt(df, opposing_filter, filter_by_invert)
        
    	transfer_pair = df[opposing_filter]    
        if not transfer_pair.empty:
    	    # ...
    

    Ambiguous Matches

    Even with all that, we can still end up with incorrect transfer matches. For example, I frequently transferred $1k blocks between different accounts and if that happened around the same time, it could still misclassify the transfer. The next step is to identify ambiguous matches– i.e. any time the solver finds two or more candidates and just picks one to proceed. These are likely candidates to investigate further. Let’s filter for those:

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    
    def find_transfer(row, df, time_window_days=5):
    	# ...
    	transfer_pair = df[opposing_filter]
        if not transfer_pair.empty:
            df.at[row.name, 'Considered'] = True
            return opposite
        else:
            return None
    
    def find_ambiguous(row, df):
        if df.at[row.name, 'Considered']:
            return None
    
        transfer_pair = find_transfer(row, df)
        if transfer_pair is None or len(transfer_pair) == 1:
            return None
    
        print(f"- {row['Date']} {row['Account Name']} \"{row['Description']}\"")
        for index, option in transfer_pair.iterrows():
            delta = option['Date'] - row['Date']
            print(f"  {delta.days} day(s): {option['Account Name']} {option['Original Description']}")
        print("")
    
    copy['Considered'] = False
    copy[df_relevant].apply(func=find_ambiguous, axis=1, df=copy)
    

    This gives me something like:

    1
    2
    3
    
    - 2012-03-01 Savings Account A: "DepositInternet transfer from Interest Checking account XXXXXXX6789"
      0 day(s) Spending Account: "WithdrawalInternet transfer to Online Savings account XXXXXXX1234"
      0 day(s) Ally Spending Account: "WithdrawalInternet transfer to Online Savings account XXXXXXX5678"
    

    From here, we can see that I need to create an alias for 6789, X1234, X5678 in known_account_ids.

    1
    2
    3
    4
    5
    
    account_info_map = {
    	'X6789': 5, # Firefly's account id
    	'X1234': 6,
    	'X5678': 7
    }
    

    Then rerun the transfer solver and it’ll fix those issues

    Testing

    How do you know if the transfers are working correctly? There’s no automatic solution for this. I built my rules based on trial and error. I looked through the data, found issues, added rules, and re-ran.

    With this code, you can view the solved transfers:

    1
    2
    
    transfers = transactions[df_filter_relevant].apply(func=process_record, axis=1, result_type='expand')
    transfers[transfers['type'] == 'transfer']
    

    You can also write it to a CSV and review it in a spreadsheet editor:

    1
    
    transfers[transfers['type'] == 'transfer'].to_csv('transfers.csv')
    

    Once you’re satisfied, you can move on to the next step, which is loading into Firefly-iii in the next part of this series.

    Conclusion & Next Steps

    In the post, I walked through the Python code to identify and solve for transfers. The final code for this can be found on the GitHub repo here.

    In the next post, I’ll show how to load all this into Firefly-iii.

    Copyright - All Rights Reserved
    Last updated on May 21, 2024 00:00 UTC

    Comments

    Comments are currently unavailable while I move to this new blog platform. To give feedback, send an email to adam [at] this website url.

    Other Posts in Series

    This post is part of the Self-hosted Finances series. You can check out the other posts for more information: