DEV Community

Amara Graham

Are you paying attention to your token use?

Can I get some folks in the comments talking about how closely they monitor their token usage?

Or if you don't, do you work at a company that provides you unlimited tokens? To specific tools?

I'm curious to see where people fall on this spectrum.

Photo by Dan Dennis on Unsplash

Top comments (39)

FrancisTRᴅᴇᴠ (っ◔◡◔)っ

I use the free tier whenever I can. For example, I use Google Gemini without needing an account, so I don't worry about cost. Another option is running models locally with something like Ollama. If I run locally, I don't need to worry either.

In other words, no credit card no problem lmao xd

leob

With local LLMs you move the burden to your hardware (which needs to be more powerful) - pay you will in the end, if you do any kind of serious work ...

Amara Graham

Moving the burden to your hardware is such a great point and often a topic that comes up with downloadable software too. Running anything local to your machine only works if your hardware can handle it.

leob

Exactly - spend $10 per month for cloud LLM or $500 or $1000 once for a more powerful machine? Tradeoffs ...

FrancisTRᴅᴇᴠ (っ◔◡◔)っ

That makes sense. The worst-case scenario for me is still the free tier, since it does enough for what I need. If you're doing serious work like you mentioned, like using Cursor to navigate a big codebase, that's fair. Since I'm working on small projects and nothing ambitious, I only use the free tier. Thanks leob!

leob • Edited

Free tier of which product, if I may ask? I tried the free tier of Cursor and it ran out VERY quickly ...

FrancisTRᴅᴇᴠ (っ◔◡◔)っ

By "free tier" I mean more "not needing to sign up" to use the service. For example, I can use Google Gemini without signing up. The only things I use are Gemini, ChatGPT, Copilot, and Ollama. I tend to avoid free versions that come with usage limits, like GitHub Copilot. Hope this makes sense!

leob

Yeah that makes sense, useful, thanks!

EmberNoGlow

Local LLM is powerful, but my ollama crashes after messages more complex than "Hello" because it's limited by my hardware. It's a good solution, but not everyone can use it to its full potential.

FrancisTRᴅᴇᴠ (っ◔◡◔)っ

That's fair! I only use models that take up less space and avoid the bigger ones. I hope in the future it will be more accessible. I only use it in VS Code as a tool when I code; knowing it will be slow forces me to think before I ask, if that makes sense. Thanks!

Benjamin Nguyen • Edited

That's the best approach to the free tier, Francis.

Ben Halpern

I’m on Gemini Ultra for my day-to-day and it’s been a breath of fresh air to tap into as much token use as I need.

Amara Graham

Interesting! Do you feel like you are getting your money's worth? Or is the subscription worth it for not having to think about it?

Ben Halpern

We do a company budget per engineer, and I have to say: Absolutely.

I can't say I'm a fan of the concept of this development "tax" in general these days, but moving from concerned-about-tokens to feeling effectively unlimited has been a real shift (not technically unlimited, but I'm operating completely unconstrained).

I think most companies should do this.

Most of my effective token spend is company stuff. I just use the one account for personal stuff too, but that's kind of a rounding error. Maybe different if you do high-volume personal agent stuff.

jidonglab

token costs hit different when you're running multi-agent setups. single model calls are manageable but once you have 3-4 agents passing context back and forth the bill compounds fast. biggest lever we found wasn't model choice — it was compressing context before it enters the pipeline. a lot of what gets stuffed into prompts is redundant or low-signal, and stripping that out before inference saved us way more than switching to cheaper models. been open-sourcing some of our compression tooling at github.com/jidonglab/contextzip if anyone's dealing with similar token budget headaches

leob

Your github link doesn't work ...

Amara Graham • Edited

Did you mean this link? github.com/jee599/contextzip

jidonglab • Edited

ah sorry about the broken link, typo on my end. correct one is github.com/jee599/contextzip — thanks for catching that

leob

No that still doesn't work - github.com/jee599/contextzip does ...

Comment deleted
leob

Eh no, this one: github.com/jee599/contextzip

jidonglab

you're right, github.com/jee599/contextzip is the correct link

leob

Finally ;-)

Brian Kirkpatrick

I use OpenRouter so I don't care much about token use or rates from the tools (and am even fairly agnostic about the models themselves, which are mostly selected via orchestration anyway). But OpenRouter does give me specific tools to monitor and budget token consumption, and I find myself (for side projects) using those budgets to constrain the scope and volume of what I'm working on per day. I could probably make the dollars stretch (and should), but in the meantime I'm focusing on routing more codegen tasks to locally-hosted models (primarily via OpenCode/OMO configurations), which reduces my remote inference volume by 30-40%.
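The routing idea is roughly this (a hypothetical sketch, not OpenCode/OMO's actual configuration; the endpoint URLs and model names are placeholders):

```python
# Hypothetical routing sketch: routine codegen goes to a local
# OpenAI-compatible endpoint, everything else to a remote provider.
# URLs and model names are placeholders, not real config.

LOCAL_BASE = "http://localhost:11434/v1"      # e.g. a locally hosted model
REMOTE_BASE = "https://openrouter.ai/api/v1"  # remote inference

def pick_endpoint(task_kind):
    """Return (base_url, model) for a given kind of task."""
    if task_kind in {"codegen", "refactor", "tests"}:
        return LOCAL_BASE, "local-coder"      # placeholder model name
    return REMOTE_BASE, "auto"                # let the remote router choose
```

The win is that the orchestrator, not the human, decides which calls ever leave the machine, which is where that 30-40% reduction in remote inference volume comes from.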

Amara Graham

"I find myself (for side projects) using those budgets to constrain the scope and volume of what I'm working on per day."

Oh I like that! Managing scope and volume this way. Thanks for sharing!

JT Perry

I have a pro subscription to Claude but find myself hitting the barrier and in startup mode can't afford to upgrade.

I decided to go off and solve the problem the geek way. Re-PC here in the Seattle area has cast-off enterprise servers. I found an old Dell R630 with a solid amount of memory for a reasonable price, got two older-generation Tesla P4s, and threw them in there. Hung it in the barn. Roughly $1,500 up and running. ROI is about 7 months with just me hitting it, shorter as I add a second dev. Still working out the bugs.

There is definitely a trade-off. Opus 4.6 is much better than any model out there IMO. However, we are getting good results with Qwen3. We are doing embedded-device dev and React mobile dev, and our results are good. We save Claude tokens for stickler problems or for adversarial QA.

LightLLM is amazing as a proxy to make switching and tracking easier. The whole setup was 2 days and now just runs in the background.

We're also getting good value from it for the farm businesses, plus backups of GitHub and our other services.

If we had regular revenue (we start selling in April), Claude may become more enticing. It's also nice having a big blade for other things, to remove dependence on outside services. All tradeoffs, like anything else.

Harsh

This question hits differently after you've watched an agentic workflow silently burn through tokens in a retry loop.

I used to not think about token usage at all until I started building with agents. A single misconfigured workflow can trigger cascading retries where each step costs multiple LLM calls. What looked like a $0.10 task becomes a $5 surprise by the time you check your dashboard.

Now I treat token budgeting the same way I treat error handling: you don't think about it until something breaks, and then you think about nothing else.

The most underrated optimization isn't model choice; it's context hygiene. Keeping prompts lean and not stuffing unnecessary history into every call saves more than switching to a cheaper model ever will.
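The guard I ended up with looks something like this (a minimal sketch with hypothetical names, assuming each call reports the tokens it used):

```python
# Sketch: cap both retries and total token spend per task, so a retry
# loop fails fast instead of silently turning $0.10 into $5.

class TokenBudgetExceeded(Exception):
    pass

def call_with_budget(call_fn, max_retries=3, max_tokens=10_000):
    """call_fn returns (text, tokens_used); text is None on failure."""
    spent = 0
    for attempt in range(max_retries + 1):
        text, tokens_used = call_fn()
        spent += tokens_used
        if spent > max_tokens:
            # Stop the loop the moment cumulative spend crosses the cap.
            raise TokenBudgetExceeded(f"spent {spent} tokens")
        if text is not None:
            return text, spent
    raise RuntimeError("retries exhausted")
```

Wrapping every agent step in something like this means the dashboard surprise becomes an exception in your logs instead.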

ElementalSilk

I do not monitor token usage. Yes, my company provides unlimited token usage; we are in the process of aligning our tools to use those tokens.

Amara Graham

I expect this is the case for many folks at work. Even some being told "use AI as much as possible" and using that as an indicator instead of token usage.

EmberNoGlow

I have several accounts, so tokens... Don't tell anyone!

jidonglab

token costs sneak up on you fast once you start chaining agents. single LLM call? manageable. but when you have 3-4 agents passing context back and forth, the token bill multiplies in ways you don't expect until you check the dashboard.

biggest thing that helped us was compressing context between agent handoffs — you don't need the full conversation history for every downstream agent, just the relevant pieces. we ended up building a tool specifically for this: github.com/jidonglab/contextzip

but honestly even just being intentional about what goes into each prompt makes a huge difference. most people stuff everything into the context window "just in case" and that's where the waste happens.
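the handoff compression is conceptually simple. rough sketch (not contextzip's actual API, just the shape of the idea): keep the most recent turns plus any older turn that matches the downstream agent's task keywords, drop the rest.

```python
# Sketch: compress a conversation history before handing it to a
# downstream agent. Keeps the last few turns plus older turns that
# mention the downstream task's keywords. Names are illustrative only.

def compress_history(history, keywords, keep_recent=4):
    """history: list of {'role': ..., 'content': ...} dicts."""
    recent = history[-keep_recent:]
    older = history[:-keep_recent]
    relevant = [
        turn for turn in older
        if any(kw.lower() in turn["content"].lower() for kw in keywords)
    ]
    return relevant + recent
```

even a crude filter like this beats forwarding the full transcript to every agent, since downstream token cost scales with everything you pass along.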

klement Gunndu

We track token spend per agent task — the surprise was that retries on failed tool calls burn 3-4x more tokens than the actual generation. Monitoring per-request broke our assumptions about where the cost sits.
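The accounting that surfaced this is trivial; something like the following (hypothetical names, assuming your client reports token counts per request) is enough to split first-attempt spend from retry spend:

```python
# Sketch: per-task token ledger that tags retries separately, so the
# dashboard can show "generation" vs "retry" spend instead of one number.
from collections import defaultdict

ledger = defaultdict(lambda: {"first_attempt": 0, "retries": 0})

def record(task_id, tokens, attempt):
    """attempt 0 is the original call; anything later is a retry."""
    bucket = "first_attempt" if attempt == 0 else "retries"
    ledger[task_id][bucket] += tokens
```

Once the two buckets are separated, a 3-4x retry multiplier shows up immediately instead of hiding inside an aggregate.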

Pavel Ishchin

"Per-request broke our assumptions" is what stuck with me. Had this happen once: you think the money is in generation, then some dumb retry drags everything out, and somehow the answer itself was the cheap part. It's annoying when the dashboard shows one number and not what actually blew up.
