 Ollama - Offline Generative AI, Similar to ChatGPT

c2tony
post Aug 21 2024, 07:57 PM


Phi3 has been updated to Phi3.5, and it gets it right this time.

QUOTE
Since all three of Sally's brothers share the same two sisters, it implies that these are also her siblings because in a family unit with multiple children like this one (including both male and female), there is only one set of sisters for each brother. Therefore, despite having three brothers, Sally has just one sister—the fact they all have "two" sisters at common refers to the same individual who counts once per sibling relationship in a family with multiple children sharing identical pairs among themselves. So, Sally indeed only has one biological sister.
c2tony
post Apr 30 2025, 05:24 PM


Does anyone know how to turn off that thinking stuff on DeepSeek or Qwen3? On the Ollama web UI, of course.

P/S: https://huggingface.co/jedisct1/MiMo-7B-RL-...f?download=true

You can play with Xiaomi's AI.

c2tony
post May 1 2025, 01:14 PM


QUOTE(xxboxx @ May 1 2025, 07:59 AM)
I remember when using DeepSeek, the thinking output isn't shown unless you press the arrow beside the model name.

How's the Xiaomi AI compared to DeepSeek? Better answers?
*
Yes, but I don't want it to show that arrow at all! It takes more time to generate those steps, whether you click to expand them or not.

I haven't managed to try MiMo yet; I don't know how to load a GGUF.
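One thing I want to try for this (I haven't verified either option myself, so treat both as assumptions): Qwen3 is documented to support a /no_think soft switch in the prompt, and newer Ollama builds reportedly have a think toggle. Note this is a Qwen3 feature; the DeepSeek R1 distills will most likely keep thinking regardless.
CODE
# Qwen3's documented soft switch: prepend /no_think to the prompt
ollama run qwen3:14b "/no_think your question here"

# newer Ollama builds reportedly accept a think flag as well
ollama run --think=false qwen3:14b "your question here"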
c2tony
post May 1 2025, 10:11 PM


QUOTE(xxboxx @ May 1 2025, 04:29 PM)
Oh, you mean you don't want it to do the thinking at all? I don't think you can; those models are designed for thinking. For questions that need deep thought, this kind of model is better than models that don't think. But for a straightforward question, such as a calculation, these models waste a lot of time getting to the obvious answer.

Using the terminal/CMD, type "ollama pull hf.co/jedisct1/MiMo-7B-RL-GGUF:Q8_0"
This will pull the Q8_0 8.1GB model.

If you want the smaller 4.7GB model, type "ollama pull hf.co/jedisct1/MiMo-7B-RL-GGUF:Q4_K_M"

I tried it and the answers it gives feel as good as DeepSeek's. When fed data to analyze, it does take some time to process before giving the answer.
*
Thanks for the commands.
I tried it too.
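For reference, this is roughly the sequence I ran (the pull tag is straight from the quote above; running it under the same name is standard Ollama usage):
CODE
# pull the smaller Q4_K_M quantization from Hugging Face
ollama pull hf.co/jedisct1/MiMo-7B-RL-GGUF:Q4_K_M

# then chat with it under the same name
ollama run hf.co/jedisct1/MiMo-7B-RL-GGUF:Q4_K_M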

Are you familiar with the thought experiment of the Ship of Theseus,
in the field of identity metaphysics?
If those removed planks are restored and reassembled, free of the rot, is that the Ship of Theseus?

It "thought" about the third question for around 7 minutes.

Is neither the true ship, or are both the true ship?
- it's still thinking...
c2tony
post May 2 2025, 02:05 PM


QUOTE(xxboxx @ May 2 2025, 08:55 AM)
llama3.2:3b-instruct-fp16, after 2+ minutes, answered: In the word "benzodiazepines", the letter "e" appears three times.
Meanwhile smollm2:1.7b-instruct-fp16 gave me "TypeError: NetworkError when attempting to fetch".

XiaoMi's MiMo LLM is relatively new.
After all, they're all LLMs on the same "highway": pattern recognition. If AI starts to actually understand, then we might need to worry about their consciousness awakening.


c2tony
post May 26 2025, 08:17 PM


Lately the gemma3 12b update has been annoying: it gets split across my CPU & GPU and just won't run at 100% GPU anymore.
CODE

ollama ps
NAME                 ID              SIZE     PROCESSOR          UNTIL
gemma3:12b-it-qat    5d4fa005e7bb    12 GB    31%/69% CPU/GPU    4 minutes from now
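
To double-check where the memory actually goes, I compare what ollama reports with what the card itself reports (assuming an NVIDIA card here):
CODE
# Ollama's view: loaded model and the CPU/GPU split
ollama ps

# the GPU's view: how much VRAM is really in use, and by which processes
nvidia-smi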

c2tony
post May 26 2025, 09:48 PM


QUOTE(xxboxx @ May 26 2025, 08:41 PM)
How many GB of VRAM do you have? Windows also uses some VRAM, so if it's 12GB that's not enough. When there isn't enough VRAM it offloads part of the model to the CPU, which causes the slowdown.
*
12GB of VRAM, I know...
Qwen3 14b at 9.3GB uses 100% of my GPU, for comparison.

CODE
ollama ps
NAME         ID              SIZE     PROCESSOR    UNTIL
qwen3:14b    7d7da67570e2    10 GB    100% GPU     4 minutes from now


CODE
NAME                          ID              SIZE      MODIFIED    
gemma3:12b-it-qat             5d4fa005e7bb    8.9 GB    2 weeks ago
qwen3:14b                     7d7da67570e2    9.3 GB    3 weeks ago


c2tony
post May 26 2025, 10:08 PM


QUOTE(ipohps3 @ May 26 2025, 09:55 PM)
dunno about you guys.

I was enthusiastic about open models earlier this year, with DeepSeek in January and other open models released in the following months.

However, since Google released Gemini 2.5 in the last month or two, I don't think I want to go back to using open models: Gemini is getting extremely good at almost everything, and none of the open models that can run on an RTX 3090 come close to it.

After a while, paying the 20 USD per month is more productive for me to get things done than using open models.
*
Yeah, you paid. That's the whole point! "One-off" or "batch" processing is best when you pay; I wouldn't pay $20 for my use case.

Gemini is a closed system: you don't get to tweak it, audit its training data, or run it locally.
For some users, this trade-off is worth it.
For others, it's not. Not to mention the API limits, and it can't be used offline.
c2tony
post May 27 2025, 11:21 PM


QUOTE(xxboxx @ May 26 2025, 11:39 PM)
There you go, not enough VRAM.
Why is your gemma3:12b-it-qat 12GB? On the ollama page I see it is only 8.9GB.


That's why I said: after the update, a 12b model that is 8.9GB on disk becomes 12GB when you actually run it.

Gemini's response: it gets de-quantized or processed at a higher precision during runtime. The architecture, the runtime precision of the activations and KV cache, and the optimizations applied by the inference framework all contribute.
Gemma 3's multimodal nature and potentially different KV cache handling seem to be key contributors to its higher observed runtime memory usage compared to Qwen 2 14B models of similar quantization.

This means it's time for a higher-VRAM GPU, or an upgrade to an NPU.
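Before buying anything there are a few knobs I want to try first; they only shrink the context/KV-cache side, not the weights themselves, and I'm assuming a reasonably recent Ollama build:
CODE
# quantize the KV cache to q8_0 (needs flash attention), then restart the server
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
ollama serve

# or load the model with a smaller context window from the interactive prompt
ollama run gemma3:12b-it-qat
>>> /set parameter num_ctx 4096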
c2tony
post May 30 2025, 10:12 PM


QUOTE(xxboxx @ May 28 2025, 12:39 PM)
Ollama's vision, after the update, is now a lot better than it was a few months ago.

Using gemma3:12b there is still some wrong data, but it's a lot better than before.
*
Geez... you must be doing a lot of OCR?

They get better and better, with a lot of added rules and regulations; the uncensored wild west is disappearing.
c2tony
post May 30 2025, 10:29 PM


QUOTE(ipohps3 @ May 30 2025, 10:19 PM)
anyone tried Gemma 3n 4B ?
*
What platform did you use, on the mobile or edge side? I've only installed PocketPal with llama-3.2-1b-instruct; mobile has lots of distractions.
c2tony
post May 30 2025, 10:39 PM


QUOTE(xxboxx @ May 30 2025, 10:31 PM)
Gemma3:4B?

I tried it; it's less capable than the 12B model.
*
I think he means this:

user posted image
c2tony
post Jun 4 2025, 11:08 PM


QUOTE(ipohps3 @ May 30 2025, 10:19 PM)
anyone tried Gemma 3n 4B ?
*
QUOTE(xxboxx @ May 31 2025, 08:35 AM)
Oh, no wonder I didn't see it; I only checked the ollama website and it isn't there yet.
*
I downloaded https://github.com/google-ai-edge/gallery/releases/tag/1.0.3 and tried Gemma-3n-E4B-it-int4 on my phone today.
My Honor Magic 6 Pro turned into a hand warmer, at 3.51 tokens/s.
It runs even slower when multitasking, and I didn't have the patience, so I just closed it.

There's a YouTuber talking about it:
https://youtu.be/Vb8L5mtjLDo?si=fxp9nddnJ8zsuO08


c2tony
post Jun 4 2025, 11:58 PM


QUOTE(ipohps3 @ Jun 4 2025, 11:15 PM)
anyone tried the DeepSeek R1 0528 Qwen distilled version?

how is it?
*
It couldn't answer:
CODE
how many e in “defenselessness”
It took more than 5 minutes and was still thinking, so I stopped it.
c2tony
post Jun 7 2025, 08:25 AM


QUOTE(ipohps3 @ Jun 5 2025, 03:37 PM)
what is this it-qat quantization?
*
it = instruction tuned, not that the model is fluent in the Italian language 😁

Quantization:
Conversion of the finished painting to a desired JPEG compression level.

QAT (Quantization-Aware Training):
Instead of compressing the photo into a smaller JPEG afterwards, we tell the artist to paint with fewer colors in the first place.

Hmm...... is that why Gemma3 occupies so much more memory, yet isn't that slow?

btw

IT-QAT refers to instruction-tuned Quantization-Aware Training (QAT) models, specifically in the Gemma 3 series. These models are optimized using QAT to maintain high quality while significantly reducing memory requirements, making them more efficient for deployment on consumer-grade GPUs.

For example:
- Gemma 3 27B IT-QAT → Reduced from 54GB to 14.1GB
- Gemma 3 12B IT-QAT → Reduced from 24GB to 6.6GB
- Gemma 3 4B IT-QAT → Reduced from 8GB to 2.6GB
- Gemma 3 1B IT-QAT → Reduced from 2GB to 0.5GB

These models are designed to preserve similar quality as half-precision models (BF16) while using less memory, making them ideal for running locally on devices with limited resources.
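The rough math behind those numbers (back-of-envelope only: BF16 is 2 bytes per parameter, int4 is about half a byte, plus some runtime overhead):
CODE
# bytes per parameter x parameter count
# BF16 : 27e9 params x 2 bytes   ≈ 54 GB
# int4 : 27e9 params x 0.5 byte  ≈ 13.5 GB  (+ overhead ≈ the 14.1 GB above)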

c2tony
post Jun 8 2025, 12:17 PM


QUOTE(xxboxx @ Jun 7 2025, 11:19 AM)
Can only salivate over such LLM performance. Waiting for the day Intel releases their B60 GPU with 24GB, hopefully at around a 2k price lol

This is too extreme! I don't do much with AI nowadays other than satisfying my curiosity, so perplexity.ai, Gemini and Copilot are more than enough on the phone.

P/S: scanning every receipt and letting AI do the accounting looks like a great use of AI
c2tony
post Jun 8 2025, 10:56 PM


QUOTE(xxboxx @ Jun 8 2025, 05:44 PM)
Even if it's only a hobby, being able to run bigger-parameter models gets us a more intelligent AI. Like the comparison above: gemma3:12b is a lot more capable than gemma3:4b and similar to deepseek-r1:14b. With access to more VRAM we could run gemma3:27b or even deepseek-r1:70b, which should be even more capable.

I've been feeding gemma3:12b a few photos of handwriting, and each time it gets some part wrong I correct it. After a few rounds its recognition of the handwriting has improved a lot compared to the first time, though there are still some mistakes. With gemma3:27b and its higher intelligence there would be even fewer mistakes.
*
ikr
Intel has been complacent with their processors;
hopefully they won't make the same mistake with GPUs this time.

There's no easy route to running AI locally, so let's hope for the Intel Arc GPUs.

Sometimes I feel the rush to grab one of those old 2080s modified to 22GB from China, but I chicken out.
c2tony
post Jun 11 2025, 09:42 AM


QUOTE(xxboxx @ Jun 8 2025, 05:44 PM)
Even if it's only a hobby, being able to run bigger-parameter models gets us a more intelligent AI. Like the comparison above: gemma3:12b is a lot more capable than gemma3:4b and similar to deepseek-r1:14b. With access to more VRAM we could run gemma3:27b or even deepseek-r1:70b, which should be even more capable.

I've been feeding gemma3:12b a few photos of handwriting, and each time it gets some part wrong I correct it. After a few rounds its recognition of the handwriting has improved a lot compared to the first time, though there are still some mistakes. With gemma3:27b and its higher intelligence there would be even fewer mistakes.
*
Here's something interesting I found: AI processors with loads of RAM, used for larger models.

https://youtu.be/B7GDr-VFuEo?si=mK-jvQuXkHwmptel
c2tony
post Jun 12 2025, 09:34 PM


QUOTE(xxboxx @ Jun 12 2025, 08:29 PM)
I watched the video; the Ryzen AI MAX+ 395 is indeed a powerful CPU for AI, it even beats the M4. It's just that this CPU's price is still very high.

Maybe in 1 or 2 years' time we'll get such a powerful CPU at a mid-range price.
*
For the price it's better value: you only change the processor, motherboard and RAM, and it's still better than a single GPU card at the same price.
It's a relatively new processor; so far I've only seen the Intel Core Ultra on sale.
I haven't seen anyone selling the AMD AI processors yet, but you can get the AM5 8600G or 8700G for the same function.
c2tony
post Jun 13 2025, 10:36 PM


QUOTE(xxboxx @ Jun 13 2025, 12:51 AM)
But the 8600G and 8700G have a different iGPU from the Ryzen AI MAX+ 395; do they have the same performance?
*
8700G = 16 TOPS
Ryzen AI MAX+ 395 = 55 TOPS
RTX 3060 12GB = 100 TOPS
Apple Mac Studio M4 Max = 38 TOPS

They can all run it.

BTW, 55 TOPS may sound like more AI power than 38 TOPS, but the way Apple handles data and optimizes memory usage can deliver equivalent or faster AI execution.
Even if your PC has 128GB of RAM, your GPU might be capped by its 24GB of VRAM when loading a large AI model.
With Apple's unified memory, you might comfortably run llama4:16x17b entirely in GPU-addressable space if you have 96GB of RAM.
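Rough sizing for that example, assuming the 16x17b variant is around 109B total parameters (my assumption; I haven't checked the exact on-disk size):
CODE
# total params x bytes per parameter at ~4-bit
# 109e9 params x ~0.5 byte ≈ 55 GB  -> well over a 24GB VRAM card,
#                                      but fits in 96GB of unified memory with room left for the KV cache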

