Sure, this can be done, and it requires very little knowledge of or experience with digital audio. To sample the audio, you will need to amplify the microphone signal so that its maximum peak-to-peak voltage is 5 volts - assuming you are using a 5 volt microcontroller. You will also need to add 2.5 volts of bias with a summing amplifier so that the audio signal is always positive and lies within the 0-5 volt range of the A/D converter. You will need an anti-alias filter before sampling. Since the audio will be voice, you can filter out everything above ~3 kHz and still have readily intelligible speech. You could go higher in frequency for better sound quality, but then you have more data to transmit and your baud rate will have to increase. In an intercom system you will have long runs of wire, and the potential for transmission errors increases with both the length of the wire and the baud rate. This is something you can experiment with, but you might want to resist the temptation to go really high. (On the other hand, human speech can tolerate lots of errors and still be intelligible.)
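To make the gain and bias arithmetic concrete, here is a rough sketch in Python (just for the numbers, not code for the microcontroller). The 20 mV peak-to-peak microphone level and the 10-bit converter are assumptions for illustration, and the function names are made up; measure your own microphone before choosing a gain.

```python
VREF = 5.0      # A/D reference voltage, from the 5 volt microcontroller above
ADC_BITS = 10   # assumed converter resolution


def required_gain(mic_peak_to_peak_v, vref=VREF):
    """Gain that stretches the microphone swing to fill the full 0..vref range."""
    return vref / mic_peak_to_peak_v


def adc_code(mic_v, gain, vref=VREF, bits=ADC_BITS):
    """A/D code for an instantaneous microphone voltage after gain and bias."""
    biased = mic_v * gain + vref / 2.0       # amplify, then add the mid-rail bias
    biased = min(max(biased, 0.0), vref)     # the converter clips outside 0..vref
    full_scale = (1 << bits) - 1
    return round(biased / vref * full_scale)


if __name__ == "__main__":
    gain = required_gain(0.02)               # assumed 20 mV peak-to-peak microphone
    for mic_v in (-0.01, 0.0, 0.01):
        print(mic_v, adc_code(mic_v, gain))
```

Silence sits at mid-scale and the loudest peaks land at the two ends of the range, which is exactly what the 2.5 volt bias is for.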
Anyhow, if you use an anti-alias filter at 3 kHz, then you should sample at a rate of around 8 kHz to 10 kHz. The rule is to sample at no less than twice the highest frequency in the signal, and since no low-pass filter has an abrupt pass/no-pass transition, you need to go a bit higher. If you use a 6-pole Butterworth filter, the roll-off is steep enough that you can afford to reduce the sampling rate closer to the theoretical limit.
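A quick way to see the trade-off between filter order and sampling rate is to compute how much an ideal Butterworth filter attenuates the lowest frequency that can alias back into the voice band. Anything above (sample rate - 3 kHz) folds back below 3 kHz. The sketch below uses the textbook Butterworth magnitude formula; a real filter built from real components will deviate from it, and the function names are my own.

```python
import math


def butterworth_attenuation_db(freq_hz, cutoff_hz, poles):
    """Attenuation (positive dB) of an ideal Butterworth low-pass at freq_hz."""
    ratio = freq_hz / cutoff_hz
    return 10.0 * math.log10(1.0 + ratio ** (2 * poles))


def first_aliasing_frequency(sample_rate_hz, passband_hz):
    """Lowest input frequency that folds back into the 0..passband_hz band."""
    return sample_rate_hz - passband_hz


if __name__ == "__main__":
    cutoff = 3000.0
    for sample_rate in (8000.0, 10000.0):
        f_alias = first_aliasing_frequency(sample_rate, cutoff)
        for poles in (2, 6):
            att = butterworth_attenuation_db(f_alias, cutoff, poles)
            print(f"fs={sample_rate:.0f} Hz, {poles} poles: "
                  f"{att:.1f} dB at {f_alias:.0f} Hz")
```

Running it for a few combinations shows why a steeper filter lets you sample slower: more poles means far more attenuation at the first frequency that can alias.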
Another thing about human speech is that most of the time we speak softly, and only occasionally is there a large transient in amplitude. If we sample with 8 bits of resolution and most of the time our speech is less than 1/4 of full amplitude, then effectively we are only using 6 of the bits. Here's where audio compression comes in. This is a huge topic and can get very complex, but you can implement a simple form on a microcontroller. I've done a project similar to what you're describing, and I sampled at 10 bits of resolution and then used a logarithmic conversion (look up mu-law or A-law) to put most of the bit resolution into the quiet range, truncating to 8 bits at the same time. I used a 1024-element array as a lookup table so that it would be very fast on a microcontroller - although it did use 1 KB of flash. Anyway, if you scale it right, you don't even need to decompress on the other end and it will actually sound BETTER than with no compression at all. Radio stations and recording studios, by the way, use lots of compression on voice because it sounds so much better that way. (I am using the word compression here for both audio amplitude compression and data compression - two very different things, but I don't know how else to say it.)
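Here is a minimal sketch of how a lookup table like that could be generated on a PC. It is not the exact table from my project: it assumes the continuous mu-law curve with mu = 255, a 10-bit input centered on 512, and an 8-bit output centered on 128, so the result stays a biased audio signal that can go straight to the output stage. The mu value controls how aggressive the compression is, so treat it as something to tune by ear.

```python
import math

MU = 255.0  # assumed mu-law constant; smaller values compress less


def build_companding_table(mu=MU):
    """Map every 10-bit sample (0..1023) to an 8-bit companded sample (0..255)."""
    table = []
    for code in range(1024):
        x = (code - 512) / 512.0                      # normalize to -1.0 .. +1.0
        y = math.log1p(mu * abs(x)) / math.log1p(mu)  # mu-law magnitude, 0..1
        y = math.copysign(y, x)                       # restore the sign
        out = int(round(128 + y * 127))               # re-center on 128
        table.append(max(0, min(255, out)))           # clamp to 8 bits
    return table


if __name__ == "__main__":
    table = build_companding_table()
    # Print a few entries: silence, a quiet sample, and the positive peak.
    for code in (512, 520, 640, 1023):
        print(code, table[code])
```

On the microcontroller the conversion is then a single array read per sample, with the 1024 values stored as a constant array in flash.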
After you sample the data, you can use the conversion complete flag (or interrupt) as a signal to put the data into the serial transmitter. Make sure the baud rate is somewhat higher than the data rate coming from the A/D converter, and remember to count the framing bits: with the usual 1 start bit, 8 data bits, and 1 stop bit, every audio byte takes 10 bit times on the wire, so 8 kHz sampling needs more than 80,000 baud. I've never used RS485 before, so I won't say much about it. Since it's a shared bus, you will need some protocol for identifying the proper recipient for the data; the RS485 standard itself only covers the electrical layer. On the receiving end, you can bring the digital signal back to the analog domain using an 8-bit PWM - or you can buy a D/A converter. For low quality audio, the PWM should be fine. Set the PWM frequency as high as possible and update the duty cycle register every time you receive an audio byte. Then you can make a low-pass filter with the cutoff somewhere between the highest audio frequency (~3 kHz) and the PWM frequency (~30 kHz) and you will have your analog audio signal back. You will have to AC couple that signal to your audio amplifier to get rid of the DC bias.
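The numbers in this step are easy to check with a few lines of arithmetic. The sketch below assumes 10 bits on the wire per audio byte (1 start, 8 data, 1 stop), an 8 MHz timer clock with a PWM that counts 256 steps per period, and a single-stage RC filter with example values of 1.6 kilohms and 10 nF. All of those are illustrative assumptions; substitute the values for your own hardware, and check which baud rates your clock can actually produce.

```python
import math

# Common serial rates, listed for convenience; not every microcontroller clock
# can generate all of them accurately.
STANDARD_BAUD_RATES = (9600, 19200, 38400, 57600, 115200,
                       230400, 250000, 500000, 1000000)


def min_baud(sample_rate_hz, bits_per_frame=10):
    """Bit rate needed to send one byte per sample, including framing bits."""
    return sample_rate_hz * bits_per_frame


def pick_baud(sample_rate_hz, bits_per_frame=10):
    """Smallest listed baud rate strictly above the required bit rate."""
    needed = min_baud(sample_rate_hz, bits_per_frame)
    for rate in STANDARD_BAUD_RATES:
        if rate > needed:
            return rate
    raise ValueError("sample rate too high for the listed baud rates")


def pwm_frequency(timer_clock_hz, resolution_bits=8):
    """PWM carrier frequency when the timer counts 2**resolution_bits steps."""
    return timer_clock_hz / (1 << resolution_bits)


def rc_cutoff_hz(r_ohms, c_farads):
    """-3 dB frequency of a single-stage RC low-pass filter."""
    return 1.0 / (2.0 * math.pi * r_ohms * c_farads)


if __name__ == "__main__":
    print("baud for 8 kHz sampling:", pick_baud(8000))
    print("PWM carrier at 8 MHz:", pwm_frequency(8_000_000), "Hz")
    print("RC cutoff:", round(rc_cutoff_hz(1600, 10e-9)), "Hz")
```

A single RC stage only rolls off gently, so if the PWM carrier is still audible as a whine, add a second stage or move the cutoff closer to 3 kHz.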
Let me know if I'm not making sense. I can assure you that this works though.