March 2026
From R to Python for Keyword Research
My journey transitioning from R and RStudio to Python for SEO Keyword Research.
Moving from R to Python
For a long time, R has been my strongest language for SEO and data work.
It is the tool I naturally reach for when I want to clean data, reshape messy exports, analyse patterns, and turn a spreadsheet problem into something much more structured. For SEO in particular, that kind of workflow has always felt natural in R.
But the market has shifted.
More and more job descriptions now ask for Python, even when the work itself is still fundamentally about data cleaning, modelling, reporting, and automation. So part of what I am doing at the moment is deliberately moving more of my workflow into Python, not because R stopped being useful, but because Python is increasingly the language people expect.
That has meant taking real analysis problems and rebuilding the workflow in Python instead of defaulting to R.
Starting with the usual SEO mess
This project began the way a lot of SEO analysis begins: with a folder full of CSV exports.
In this case, I had ranking data from Semrush for one client and a set of competitors. The data was useful in theory, but awkward in practice. Each domain lived in its own file, the same keywords appeared across multiple exports, and it was hard to get from raw rankings to a clear answer about where the real opportunity sat.
That meant the actual job was not pulling data. It was giving the data structure.
Rebuilding the workflow in Python
The first step was building a pipeline in Python that could ingest all of the Semrush files and standardise them into one master dataset.
Most of the heavy lifting here came from pandas, especially for:
- reading the CSVs
- cleaning and standardising columns
- deduplicating keywords
- reshaping the data into a domain-by-keyword model
The core idea was simple:
one row per keyword, with separate columns showing each domain's rank and ranking URL
Once that existed, the data stopped being a pile of exports and started behaving more like a model.
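As a sketch of that reshape step, here is roughly how it can look in pandas. The column names (`keyword`, `position`, `url`) and the assumption that each export file is named after its domain are placeholders, not the actual Semrush schema, which needs its own normalisation:

```python
from pathlib import Path

import pandas as pd


def load_exports(export_dir: str) -> pd.DataFrame:
    """Read every per-domain CSV in a folder into one long DataFrame.
    Assumes each file is named after its domain (e.g. client.com.csv)."""
    frames = []
    for path in Path(export_dir).glob("*.csv"):
        df = pd.read_csv(path)
        # Normalise column names, since export headers vary.
        df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
        df["domain"] = path.stem
        frames.append(df[["domain", "keyword", "position", "url"]])
    return pd.concat(frames, ignore_index=True)


def to_keyword_model(long_df: pd.DataFrame) -> pd.DataFrame:
    """Reshape long (domain, keyword, position, url) rows into one row
    per keyword, with a rank and URL column for each domain."""
    # Keep each domain's best (lowest) position per keyword.
    deduped = (long_df.sort_values("position")
                      .drop_duplicates(["domain", "keyword"]))
    wide = deduped.pivot(index="keyword", columns="domain",
                         values=["position", "url"])
    # Flatten the MultiIndex columns to e.g. "client.com_position".
    wide.columns = [f"{dom}_{metric}" for metric, dom in wide.columns]
    return wide.reset_index()
```

The pivot is what turns a pile of exports into a model: every domain's rank for a keyword sits on the same row, so gaps are visible at a glance.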
That was one of the more useful parts of the exercise for me. It was not just about getting to the SEO result. It was about proving that a workflow I would once have built quite naturally in R could also be built cleanly in Python.
Ranking data becomes more useful when you estimate traffic
A ranking table tells you where a domain appears, but it does not tell you much about the size of the opportunity unless you layer something else on top of it.
So I added a CTR curve to estimate traffic from ranking position and search volume.
That changed the model in an important way.
Instead of only tracking who ranked where, it became possible to estimate:
- current traffic
- potential traffic at position one
- the gap between the two
That gap is where the analysis starts becoming more strategic.
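A minimal version of that traffic layer might look like this. The CTR values are illustrative placeholders, not the curve I actually used, and real published curves vary considerably:

```python
import pandas as pd

# Illustrative click-through rates by organic position. These numbers are
# assumptions for the sketch; swap in whichever CTR study you trust.
CTR_BY_POSITION = {1: 0.28, 2: 0.15, 3: 0.10, 4: 0.07, 5: 0.05,
                   6: 0.04, 7: 0.03, 8: 0.025, 9: 0.02, 10: 0.018}


def add_traffic_estimates(df: pd.DataFrame) -> pd.DataFrame:
    """Estimate current traffic, potential traffic at position one,
    and the gap between them. Expects 'position' and 'volume' columns."""
    out = df.copy()
    # Positions beyond the curve get a small fallback CTR (~1%).
    ctr = out["position"].map(CTR_BY_POSITION).fillna(0.01)
    out["est_traffic"] = out["volume"] * ctr
    out["potential_traffic"] = out["volume"] * CTR_BY_POSITION[1]
    out["opportunity_gap"] = out["potential_traffic"] - out["est_traffic"]
    return out
```

The `opportunity_gap` column is the interesting one: sorted descending, it ranks keywords by how much traffic a move to position one could plausibly recover.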
The opportunity is in the gap
Once the pipeline could estimate current traffic and compare it with top-position potential, it became easier to stop asking simple ranking questions and start asking better ones.
Not just:
- who ranks well
- who ranks badly
but:
- which keywords are worth caring about
- which categories have the biggest missed opportunity
- where effort is most likely to produce meaningful returns
That is the point where reporting becomes useful for planning instead of just documentation.
Categorisation needed more than one signal
I also needed a way to group keywords into meaningful service categories.
The first pass was keyword-based, using regex rules to assign terms to areas like:
- Family and Divorce
- Immigration
- Employment
- Private Client
- Dispute Resolution
That worked reasonably well, but only up to a point.
In practice, the ranking URL often tells you more about the actual topic than the keyword alone, especially with competitor data. So I added a second layer of classification based on the ranking URL.
That hybrid approach worked much better.
It reduced uncategorised terms, improved the interpretation of competitor data, and made the final model more trustworthy.
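The two-layer idea can be sketched as follows. The category patterns here are hypothetical stand-ins; a real rule set is built per client and iterated on:

```python
import re

# Hypothetical rule sets -- the real patterns would be client-specific
# and considerably longer.
KEYWORD_RULES = {
    "Family and Divorce": r"divorce|custody|family law",
    "Immigration": r"visa|immigration|citizenship",
    "Employment": r"employment|redundancy|unfair dismissal",
}
URL_RULES = {
    "Family and Divorce": r"/family|/divorce",
    "Immigration": r"/immigration|/visa",
    "Employment": r"/employment",
}


def categorise(keyword: str, url: str = "") -> str:
    """Two-layer classification: try keyword rules first, then fall back
    to the ranking URL, which often carries the clearer topical signal."""
    for category, pattern in KEYWORD_RULES.items():
        if re.search(pattern, keyword, re.IGNORECASE):
            return category
    for category, pattern in URL_RULES.items():
        if re.search(pattern, url, re.IGNORECASE):
            return category
    return "Uncategorised"
```

The fallback ordering matters: a generic keyword like "best solicitor near me" says little on its own, but a competitor ranking it with an /immigration/ URL tells you exactly which service page is doing the work.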
Cleaning the data mattered as much as modelling it
A lot of the value came from deciding what to remove.
Not every keyword in a Semrush export belongs in a strategic opportunity model. Some terms are obvious noise. Some are branded. Some are navigational. Some simply do not help with the kind of decisions this pipeline was meant to support.
So another layer of the workflow focused on:
- removing noise
- reducing branded clutter
- trimming out low-value terms
- producing a cleaner reporting set
That part is easy to underestimate, but it makes a big difference. A smaller, cleaner dataset is usually much more useful than a larger one full of distractions.
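That filtering layer is mostly boolean masks in pandas. The brand and noise patterns and the volume threshold below are hypothetical examples, not the actual exclusion list:

```python
import pandas as pd

# Hypothetical exclusion patterns; real lists come from the client
# and from eyeballing the export.
BRAND_TERMS = r"acme"               # branded queries to strip
NOISE_TERMS = r"\bfree\b|\bjobs?\b"  # navigational / irrelevant intent
MIN_VOLUME = 10                      # drop terms below this search volume


def clean_keywords(df: pd.DataFrame) -> pd.DataFrame:
    """Drop branded, noisy, and low-volume rows from the reporting set.
    Expects 'keyword' and 'volume' columns."""
    kw = df["keyword"].str.lower()
    mask = (~kw.str.contains(BRAND_TERMS, regex=True)
            & ~kw.str.contains(NOISE_TERMS, regex=True)
            & (df["volume"] >= MIN_VOLUME))
    return df[mask].reset_index(drop=True)
```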
Python also handled the reporting layer
Once the data was cleaned and categorised, Python handled the reporting outputs as well.
The pipeline generated:
- a master keyword dataset
- category summaries by domain
- a long-form export for pivot tables
- filtered files with uncategorised terms removed
- chart-ready datasets
- HTML visualisations by category
That made it easier to move from analysis into presentation without rebuilding the same logic in multiple tools.
What this project was really useful for
The SEO outcome mattered, but so did the language shift.
This was one of those projects where the practical work doubled as a transition exercise. It let me take a workflow I understand well in R and prove that I can rebuild it in Python in a way that still feels structured, efficient, and reusable.
That matters because the move from one language to another is rarely about syntax alone. It is about re-establishing fluency in the kind of work you actually want to do.
What I like about this kind of setup
What I like most about this kind of pipeline is that it turns SEO analysis into a repeatable system.
Once the structure exists, updating it becomes much less manual. Instead of copy-pasting spreadsheets together and rebuilding the same logic from scratch every time, the workflow becomes:
- Drop in the latest exports
- Run the pipeline
- Review the structured output
- Focus on decisions rather than cleanup
That is the real value for me.
It is not just about automating data cleaning. It is about making better SEO decisions easier to make, while also pushing my workflow further into Python.